Ethics and Context in Second Language Testing: Rethinking Validity in Theory and Practice [1st ed.]
ISBN 1032471778, 9781032471778




ETHICS AND CONTEXT IN SECOND LANGUAGE TESTING

This innovative, timely text introduces the theory and research of critical approaches to language assessment, foregrounding ethical and socially contextualized concerns in language testing and language test validation in today's globalized world. The editors bring together diverse perspectives, qualitative and quantitative methodologies, and empirical work on this subject that speak to concerns about social justice and equity in language education, from languages and contexts around the world – offering an overview of key concepts and theoretical issues and field-advancing suggestions for research projects. This book offers a fresh perspective on language testing that will be an invaluable resource for advanced students and researchers of applied linguistics, sociolinguistics, language policy, education, and related fields – as well as language program administrators.

M. Rafael Salaberry is Mary Gibbs Jones Professor of Humanities in the Department of Modern and Classical Literatures and Cultures at Rice University (USA).

Albert Weideman is Professor of Applied Language Studies and Research Fellow at the University of the Free State (South Africa).

Wei-Li Hsu is Lecturer at the Center for Languages and Intercultural Communication at Rice University (USA).

ETHICS AND CONTEXT IN SECOND LANGUAGE TESTING Rethinking Validity in Theory and Practice

Edited by M. Rafael Salaberry, Albert Weideman and Wei-Li Hsu

Designed cover image: © Getty Images | Qweek

First published 2024
by Routledge
605 Third Avenue, New York, NY 10158

and by Routledge
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2024 selection and editorial matter, M. Rafael Salaberry, Albert Weideman, and Wei-Li Hsu; individual chapters, the contributors

The right of M. Rafael Salaberry, Albert Weideman, and Wei-Li Hsu to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

ISBN: 978-1-032-47177-8 (hbk)
ISBN: 978-1-032-47175-4 (pbk)
ISBN: 978-1-003-38492-2 (ebk)

DOI: 10.4324/9781003384922

Typeset in Galliard by Apex CoVantage, LLC

CONTENTS

List of contributors

PART I  The ethical contextualization of validity
1  Context, construct, and ethics (M. Rafael Salaberry and Albert Weideman)
2  Validity and validation: an alternative perspective (Albert Weideman and Bart Deygers)

PART II  Agency and empowerment prompted by test adequacy
3  The racializing power of language assessments (Casey Richardson)
4  "It's not their English": narratives contesting the validity of a high-stakes test (Gordon Blaine West and Bala Thiruchelvam)
5  The ethical potential of L2 portfolio assessment: a critical perspective review (Mitsuko Suzuki)

PART III  Sociointeractional perspectives on assessment
6  Portfolio assessment: facilitating language learning in the wild (Elisa Räsänen and Piibi-Kai Kivik)
7  The role of an inscribed object in a German classroom-based paired speaking assessment: does the topic card help elicit the targeted speaking and interactional competencies? (Katharina Kley)
8  L1–L2 speaker interaction: affordances for assessing repair practices (Katharina Kley, Silvia Kunitz, and Meng Yeh)
9  Yardsticks for the future of language assessment: disclosing the meaning of measurement (Albert Weideman)

Index

CONTRIBUTORS

Bart Deygers is Assistant Professor at Ghent University (Belgium). His research focuses on fairness and justice in language testing, on the impact of high-stakes language requirements, and on L2 gains in an instructed SLA setting.

Piibi-Kai Kivik, Ph.D., is Senior Lecturer of Estonian language and culture at the Department of Central Eurasian Studies at Hamilton Lugar School of Global and International Studies at Indiana University Bloomington (USA). She is an applied linguist who studies L2 interaction and learning in a variety of instructional contexts.

Katharina Kley is Lecturer in German at Rice University's Center for Languages and Intercultural Communication (USA). Her research focuses on the teaching and testing of interactional competence, classroom-based assessment, assessing speaking, and online learning.

Silvia Kunitz is Associate Professor at Linköping University (Sweden). In her research, she adopts a conversation analytic approach to study language learning environments and L2 interactions.

Elisa Räsänen, MA, is Senior Lecturer of Finnish language and culture at the Department of Central Eurasian Studies at Hamilton Lugar School of Global and International Studies at Indiana University Bloomington (USA). She is also a doctoral researcher at the Centre for Applied Language Studies at the University of Jyväskylä, Finland.

Casey Richardson is Assistant Professor of Culturally and Linguistically Diverse Education at Western Colorado University (USA). She earned her Ph.D. from the University of Arizona in Tucson, where she taught educational leadership, ESL, and second language acquisition courses. Her research interests include policy, bilingual education, and critical and assets-based pedagogies.

M. Rafael Salaberry is Professor in the Department of Modern and Classical Literatures and Cultures at Rice University (USA). He carries out research in several areas of second language acquisition: L2 instructional and assessment practices, the development of L2/L3 syntactic-semantic morphology, multilingualism, and bilingual education.

Mitsuko Suzuki is a Ph.D. candidate in the Second Language Studies program at the University of Hawaiʻi at Mānoa (USA). Prior to arriving in Honolulu as a Fulbright award recipient, she taught English at Japanese high schools and universities. Her current research interests include critical language pedagogy and classroom-based assessment.

Bala Thiruchelvam is currently undertaking a master of teaching (secondary) degree at the University of Newcastle (Australia). He specializes in technology and English as an additional language or dialect. His research interests are in pedagogical approaches to teaching with technology and in personalized and adaptive learning.

Albert Weideman is Professor of Applied Language Studies and Research Fellow at the University of the Free State (South Africa). He recently published Assessing Academic Literacy in a Multilingual Context: Transition and Transformation (2021, Multilingual Matters). He focuses on language assessment design and developing a theory of applied linguistics.

Gordon Blaine West is a PhD candidate in Second Language Acquisition at the University of Wisconsin-Madison (USA). His research interests are in transmodalities, narrative analysis, and critical language pedagogy. His work has been published in Applied Linguistics Review, Critical Inquiry in Language Studies, and Linguistics and Education.

Meng Yeh is Teaching Professor at the Center for Languages and Intercultural Communication at Rice University (USA). Her current studies focus on using naturally occurring conversations in general language courses and languages for specific purposes, as well as evaluating the effectiveness of its pedagogy.

PART I

The ethical contextualization of validity

1  CONTEXT, CONSTRUCT, AND ETHICS

M. Rafael Salaberry and Albert Weideman

Why this book? Redefinition, contested measures, ethicality

Why would one wish to examine afresh the way that language assessment impacts on second language learning and second language teaching? Several motivations underlie the production of this book, the first of which is almost certainly the change in the way that we define language, both for assessment and for individual development. That requires a re-examination of the construct we assess when we test language ability, or design courses to foster its development. Once we have reconsidered the definition of language ability, there is a second reason to be found in the continuing contestation in validity theory, not only between the earlier undisclosed concept of validity and its subsequent disclosure by notions of construct, appropriateness, and social impact (Weideman, 2019a, 2019b), but in a broadening of our view of both validity and validation. The third, and equally connected, reason is suggested in the title of this book: we have developed greater sensitivity to ethical considerations, not the least of which is that this has stimulated an enhanced awareness of the institutional contexts in which the ability to handle an additional language is tested. Since the institutional environments in which tests are used are, more often than not, multilingual in nature, these contexts call up a series of further questions that challenge us to design assessments responsibly. Specifically, language testing has been challenged to expand the ideas of justice and fairness beyond their initial interpretations and use in our field. As will be noted in what follows, the three reasons are related, and the problems they highlight are often similarly intertwined. This chapter will examine these three interconnected motivations in turn, before setting out how the various chapters give us a diversity of views on the ways that test designers may meet the design challenges to assess the ability to use additional languages in more responsible ways.

A re-conceptualization of (the construct of) language

How we define language and language ability is crucial for the theoretical defensibility of the designs we create to develop language and to assess the ability to use it. Over the last two decades, the field of second language acquisition (SLA) has witnessed three major turning points that have influenced the way we define a second language: a social turn (e.g., Block, 2003; Huth, 2020), an embodied turn (e.g., Eskildsen & Wagner, 2015; Mondada, 2011), and a multilingual turn (e.g., Conteh & Meier, 2014; May, 2013; Ortega, 2013).

• A social/interactional turn within SLA has increased attention to language as a social phenomenon and to language learning as a social process, not just a cognitive one.
• An embodied turn in the development of an additional language has alerted us to the fact that we employ multimodal means of communication.
• A multilingual turn has promoted, in its critique of the traditional view of language as discrete national standard languages, a more fluid view of language as languaging (and translanguaging) – to the fact, in other words, that in using language repertoires in situ we both address local needs and involve local interlocutors.

A common concern of these re-conceptualizations of the construct of language has been its expansion to incorporate, for instance, the notions of interactional competence and the concept of the multilingual learner and speaker. This critical reconceptualization of language also influences the design of testing instruments to assess second language proficiency. Accordingly, the fluid, semiotic, multimodal, and contextualized nature of (the construct of) language should be reflected in language assessments. Notwithstanding these new characterizations of language, many language tests remain monolingual, static, formulaic, and closed. An illustration of this is given later when we note some of the critiques that have been made of commercially developed tests and, to address that problem, we provide examples of potential alternatives.

Turning first to the conceptualization of language ability as a socially grounded, interactional competence, we note that more than two decades ago He and Young (1998, p. 7) stated that the interactional competence (IC) to use language is "co-constructed by all participants in an interactive practice and is specific to that practice." The phrase "that practice" already indicates the contextuality of such competence, a point that we shall return to several times. Apart from the general language resources identified in previous models of language competence – notably grammatical structure, lexical resources, and expressive meaning – the construct of interactional competence highlights the role played by contextually specific identity resources (i.e., participation frameworks) and interactional resources (i.e., a variety of modes of turn-taking, sequence and preference organization, and repair, for different types of discourse). For the purpose of assessment, therefore, the definition of interactional competence brings up two problems: it is determined by the local context (thus, not necessarily generalizable), and it is based on the co-construction of meaning (thus, language ability cannot be attributed to one individual alone).

It follows that the earlier perspective on language, which focuses mainly on the general structural and semantic means of expression, is much less concerned with context than an interactional view that sees language as communication (Weideman, 2009), and how lingual interaction is accomplished within a specific social relationship. The earlier, less contextualized one focuses on skills-based assessment, defining skills conventionally in essentially 19th-century terms: listening, speaking, reading, and writing. The one centered on interaction turns its attention to skills-neutral, functional language use (Weideman, 2021). A predominantly structural view of language will emphasize the command of general means of expression, whereas a socially informed perspective will promote the mastery of communication in various kinds of typically different discourse. The restrictive perspective will attempt to isolate a 'skill' to be tested, while the open, disclosed view of language will respond to the reality that skills cannot be separated, and that lingual interaction is characteristically multimodal.

So, the challenge, given this change in perspective, is: Can we move beyond a decontextualized view of language knowledge and language proficiency? Can it be replaced or augmented by a functional and interactional perspective on language competence? When one considers the critique of, for instance, international high-stakes English language proficiency tests, including IELTS (e.g., Hamid et al., 2019; Hamid & Hoang, 2018; Noori & Mirhosseini, 2021; Pearson, 2019), one begins to appreciate the size of the challenge. Moreover, have large-scale institutional tests been able to uphold traditional skills-based divisions in assessing language ability? Most appear to favor some accommodation of both skills-based and functional perspectives in how they define what they test.

To take a further example of the issues involved, consider the ACTFL OPI (American Council on the Teaching of Foreign Languages Oral Proficiency Interview), which has become an unavoidable point of reference for the US context. Since the time of its publication (1982), it has become the prevalent testing instrument to measure speaking interactional competence in the US. Crucially, despite early calls for modifications (e.g., Bachman, 1990; Chalhoub-Deville, 2003; Kramsch, 1986; Raffaldini, 1988; Salaberry, 2000; Shohamy, 1990), the ACTFL proficiency model remains mostly unchanged since the time of its original publication. The success of the ACTFL-OPI test in the US, however, showcases the benefit of implementing some type of institutionalized, systematic procedure to assess language competence. Liskin-Gasparro (2003, p. 489) concludes in this respect that "the notion of proficiency as construed by the ACTFL Guidelines . . . seems to have found its legitimacy in the arenas of policy, program development, and classroom instruction." A similar institutionally sanctioned model is to be found in the Common European Framework of Reference (CEFR) for languages (Council of Europe, 2011). Though it is widely used as the theoretical basis for language assessments, also outside of Europe, and though it has been augmented and modified over time, it, too, has remained relatively unaffected by substantial criticism from language testers (e.g., Alderson, 2007; Byrnes, 2007; Figueras, 2012; Fulcher, 2004; Weir, 2005). The success of these models, despite the obvious lacunae in our understanding of the process of language development or the nature of language interaction, rests primarily on their pragmatic institutional nature. For instance, van Lier (1989, p. 501) argues that it is "possible to sidestep the issue of construct validity altogether and be satisfied with measuring whatever oral language use happens to be elicited by the OPI, since it is in any case the best instrument available." A more philosophical perspective is offered by Spolsky (1997, p. 246), who vindicates the effort launched by such institutionalized tests to move forward with a less than ideal assessment scheme because "(m)any if not most of the important decisions we have to make in life are made in a state of insufficient empirical evidence." It should be noted, however, that Spolsky added a qualification to his position: "(o)nce we accept the need for a gatekeeping function, we are ethically bound to seek the most complete information available." We return to this point later.

Arguably, there have been some modifications of the ACTFL-OPI over the last 30-plus years. It is clear, however, that these modifications have not introduced the type of significant changes needed to address some of the critiques advanced in the 1980s and 1990s, especially those related to the three 'turns' mentioned earlier (i.e., interactional, embodied, multilingual). Inadequate design responses to these critiques mean that challenges remain (see Salaberry & Burch, 2021 for an overview of those challenges faced by the testing of speaking in context). Perhaps the construct of interactional competence may be difficult to operationalize in conventional assessment tasks. Indeed, the assessment of a broader definition of proficiency, as an interactional, communicative ability, represents a substantial test design challenge (see Sandlund & Sundqvist, 2019 for an analysis of the difficulties faced by raters, intent on assessing interactional competence, to identify instances of IC-related behavior). The assessment of such a complex construct, as implied by Spolsky, inevitably means that there are compromises to be made on the path toward developing adequate testing instruments. And, in fact, several ingenious compromises have been described in detail for the incorporation of IC-oriented testing practices, as is the case of the integration of both emic (i.e., participant-based) and etic (i.e., observer-based) perspectives of assessment (Kley, 2019), or the testing of what can be described as "generic practices" of IC (Huth & Betz, 2019).

Given the sizable number of publications shedding light on the interactional nature of discourse and the multilingual identity of test takers, the time appears to be right for the expansion of successful institutionalized models of proficiency to make them better assessment tools. Such large-scale alternatives are already being proposed in discussions of dynamic assessment (Poehner & Lantolf, 2005) and learning-oriented assessment (Turner & Purpura, 2016), and more, and at the same time more imaginative, designs are possible. As stated earlier, the concept of IC described by He and Young (1998), as part of the dynamic co-construction of interactional events, prompted the profession to face the theoretical difficulties of assigning individual grades to a co-constructed performance (i.e., language ability that cannot be attributed to one individual alone) determined by the local context (thus, not generalizable). These issues continue to represent a challenge for the adoption of an IC-oriented syllabus (e.g., Chalhoub-Deville, 2003; Roever & Ikeda, 2022; Salaberry & Kunitz, 2019; Waring, 2018). Potential practical answers to the challenge of adopting a more IC-oriented approach to assessment are offered by the authors of Chapters 7 and 8 of this volume (on the multimodal and interactive nature of assessment respectively). In these chapters we find highly detailed descriptions and analyses of commonly used classroom assessment tools and practices along with their (unforeseen) effects on learner participation and achievement. The authors analyze their data and evaluate potential adaptations of their model from the perspective of a conversation analytic framework.

It should be noted that a reconsideration of the construct to be measured can indeed be shown to have an impact on test design. The skills-neutral approach used in academic literacy tests in South Africa provides an example of a productive operationalization of this complex construct (Weideman, 2021; see also NExLA, 2023 for analyses of the empirical qualities of such undertakings). The redefinition that lies at the basis of this essentially functional perspective on what should be assessed (Patterson & Weideman, 2013a, 2013b) needs a measure of technical imagination to be converted into an appropriate blueprint, further specifications, subtests, and items; but it is an example of how academic interaction with others is anticipated in its development. Interactional competence, we should remind ourselves, encompasses many modes of communication, through a variety of media, and is not always (like oral interaction) immediate and face to face; in fact, asynchronous communication, and anticipation of a verbal response in a mode other than speech, are almost a hallmark of the particular kind of ability we are assessing when we measure how competently someone handles the demands of academic discourse. Before we return later to a description of the other chapters in this book, we first consider the second issue highlighted previously: validity and test validation, and how that relates to the complex of problems addressed in this book.

Reconsidering validity as feature, and validation as process

Since Messick’s (1989) emphasis on a ‘unitary’ perspective on validity, his claims in this respect have been at the basis of many and varying justifications for the validation of tests. The current orthodoxy, of an argument-based process to validate test results (Kane, 2010; Kane et al., 2017; Stansfield, 1993), likewise finds its grounding in Messick’s views. In Messick’s influential model of validation (1989), construct validity is one of four components; the other three are: social values, social consequences, and test uses. Construct validity relates to a clear and theoretically defensible definition of the language ability being measured, i.e., to test adequacy. Values are the implicit social assumptions underlying test use, while consequences refer to the social impact of a test. Test use encompasses issues of appropriateness and interpretation. We have discussed earlier the many challenges brought to light by the assessment of the ability to use an additional language. In this section, we start by re-emphasizing the importance of construct validity before highlighting the social issues that emerge when we test that ability. In this regard, we note that Messick (1994, p.  13) claims that “test validity and social values are intertwined and that evaluation of intended and unintended consequences of any testing is integral to the validation of test interpretation and use.” The fact that Messick’s views have not gone uncontested (Popham, 1997; Borsboom et al., 2004; Weideman, 2012) will be examined in more detail in Chapter 2. Fulcher (2015, pp. 108–124) argues that the initial acceptance of Messick’s argument about validity eventually led to four broad paradigms on validity theories: instrumentalism, new realism, constructivism, and technicalism. Each paradigm critiques Messick either for not providing a method for doing validity, for ignoring the social in validation, or for being too concerned with scores rather than tests.


The latter contestation, about whether validity can be conceptualized as a characteristic of a test or whether it depends instead on the valid interpretation of test results, remains unresolved. The observation can be made that many who today deny that validity can feature as a characteristic of a test simply solve the contradiction that this claim implies by finding conceptual equivalents for validity, most often 'quality,' 'adequacy,' or 'appropriateness,' the latter two featuring prominently as early as Messick's original statements. The current orthodoxy proposes to determine the quality of a test (and the meaningful interpretations of its results) through a process of argument. In this, the so-called interpretive turn in the humanities is evident. One way of resolving the debate is to acknowledge both a 'narrow' and a 'wide' view of validity (Schildt et al., 2023), with the first referring to the very early view of validity as the characteristic of a test that it measures what it has set out to measure, i.e., that it fulfils its purpose, and the second to a position that includes the social and ethical impact of a test. Another productive way to regain conceptual clarity may be to acknowledge that a test can, as a designed object or instrument, have any number of attributable features: reliability, theoretical defensibility (or what is conventionally called 'construct validity'), appropriateness, usefulness, and also validity. It is true that these objective features of a language test cannot be recognized other than in how they are revealed in the interaction of the test with subjective agents, including those who (subjectively) have to interpret its results. Phrased differently: the objective validity of the measuring instrument is dependent on its subjective validation. This instrumental validity, a feature of a language test, is indeed to be confirmed in a process of validation.

Contestations aside, we need to acknowledge that construct validity, defined as the defensibility of a language test in terms of a widely accepted or credible theoretical definition, needs to be revisited by those who wish to design tests of language ability responsibly. The socially inspired definition of language as expression-in-interaction – as communication – that was discussed in the previous section requires that we also revisit the design formats in which we assess language ability. A new construct, with relevant and credible support from theory, should inspire us to leave behind, if required, our cherished notions of how language tests must be designed. Perhaps it is not enough to say: "Yes, we know that listening, speaking, reading and writing cannot actually be disentangled, but we integrate them as best we can." As we observed earlier, imaginative language test design must be sensitive to the challenge of shifting constructs and be prepared to change in tune with them.

More importantly, not paying enough heed to shifting perspectives on what the ability is that we are testing has serious implications downstream in the design and development process of our measurement instruments. Overlooking essential aspects of construct validity leads to systemic consequential effects across a broad range of social, economic, and political dimensions of the educational system. Thus, the articulation of a theoretically defensible construct is no longer the only problem. Besides that, the three other components (values, consequences, and uses) are often overlooked in the validation process, especially in the case of classroom- and program-level assessment (Fulcher, 2015; McNamara et al., 2019). Furthermore, many studies (e.g., Alderson et al., 2014; Biber et al., 2016; Carey, 1996; Cumming, 2002; Galaczi & Taylor, 2018; Roever & Kasper, 2018) have analyzed how contextual and social factors have a measurable impact on test validity. For instance, even though tests are neither value-free nor context-free (Carey, 1996), construct-irrelevant 'noise' is often excluded from validation (Cumming, 2002). It is therefore necessary to investigate how test results are used and how test-based decisions influence test users. Accordingly, some scholars (e.g., Bachman & Purpura, 2008; Davies, 1997, 2008; Kunnan, 2018) have argued that valid tests should act ethically for the public good, and that they should also balance the individual test taker's rights and societal demands. We turn now to this, the third issue flagged at the outset.

Ethical assessment design: justice and fairness in the spotlight

What is discussed as ethical issues in language assessment is frequently a conglomerate of ideas that refer both to justice (often in the sense of accountability) and fairness, as an ethical requirement for the design and administration of tests. At times, the definitions given to both ideas may be contradictory or simply unclear. For many, these issues remain linked to validity theory, which occupied our thoughts in the previous section. So, for example, Fulcher (2015) notes that power may determine which people or institutions can decide what is tested, how it is tested, and for what purposes. In that case validity theory must, in this view, support and justify social practice until such a time as some other currency is discovered that is accepted as fairer. Fulcher (2015) terms this alternative Pragmatic Realism, which focuses on the notion of effect-driven testing and extends Messick’s notion of consequential validity to include practices that undermine meritocracy. That is in line with Bachman’s remark (2000, p. 23) that “there can be no consideration of ethics without validity,” as well as with McNamara’s observation (2006) that Messick focused the attention of test designers on two important points: the need to discuss test constructs and the interpretation of their results as questions of values, along with the need to explicitly articulate the notion of consequential validity.


It will serve us well to attempt to disentangle justice and fairness, however. Taking justice first, we may analyze the idea in relation to language test design by noting that a distinction can be made between internal justice and external justice. The latter is the one we probably focus on more easily: for instance, we are alert to injustices due to decision-making based on high-stakes tests (Purpura, 2016; Schildt et al., 2023). We concern ourselves with the possible and actual legal consequences of improper test use, the abuse of language tests by inappropriately employing them for purposes for which they were not intended. We ask whether the use of a test is legitimate, and we may challenge the interpretation of its results in a court of law. This is so because language tests can have gate-keeping functions (Spolsky, 1997). It is clear that this kind of justice concerns the complex relation between (a) the test as a designed object or technical instrument, (b) the subjective interpretation of the test results (the set of further technical objects generated by the implementation of the test), and (c) the legal consequences of such action, including (d) the subjective decisions taken on the basis of the test results. The former kind of justice, which has been termed 'internal justice' earlier, relates in turn to the internal, analogical link between the technical design of the language test and the juridical mode of experience. Such an internal conceptual link examines the relationship not between human subjects (those taking the test, or those using its results) and the instrumental object, the test, but rather between two modes or aspects of experience, the technical and the juridical (see Chapter 2 for a further explanation). Internal justice is a technical justice, inherent in the test design, referring to the question of whether the language test does justice to the ability being measured, whether its design is open to correction and repair, as well as to the rectification or elimination of what is known as construct-irrelevant noise. Such internal technical justice of course forms the basis of eventual legal challenges to the administration and use of language tests but is not to be equated with them: the important systematic point that its identification makes is that language test designers and developers are accountable for their designs, in the first instance professionally and immediately. External legal consequences may eventually follow where technical accountability has not been responsibly exercised.

Why justice and fairness are often not adequately distinguished conceptually is clear: in becoming accountable for designs, the juridical dimension of technical design often anticipates the ethical side of language testing: if a test does justice to the ability that is being tested, it foreshadows the possibility that its results will have a compassionate or at least a beneficial impact. That impact has, following Messick, been termed "consequential validity". It may of course not always be benign, but a test that is publicly justifiable is expected to have positive consequences.


Over the last quarter of a century, a heightened awareness of these issues has led to a number of proposals of how to regulate this kind of professional accountability among language test designers and producers. Unsurprisingly, as we have noted earlier, most of this discussion has taken place under the banner of ethical considerations. Starting with a special volume of Language Testing focused on the analysis of the role (and limits) of ethics in language testing (1997), we note, for example, how Davies justifies the need to focus on the ethical use of language tests given the increasing impact of "commercial and market forces", as well as the effect of government policies. Shohamy (1997) points out in the same issue how language tests may serve ethically questionable and implicit political goals that are often quite distinct from their stated purposes.

Any consideration of ethical questions in language testing must distinguish between justice and fairness, however. So, for example, when McNamara and Ryan (2011) define fairness as the evidential basis that investigates the meaningfulness of test results and justice as the questioning of the use of the test and the social values it embodies, it is clear their definition attempts to acknowledge their conceptual indebtedness to Messick's claims. But when McNamara and Ryan (2011) define fairness as "the extent to which the test quality, especially its psychometric quality, ensures procedural equality for individual and subgroups of test-takers and the adequacy of the representation of the construct in test materials and procedures," it is equally evident that they include, under 'fairness,' both the internal technical justice discussed earlier ("the adequacy of the representation of the construct") and external justice (that ensures "procedural equality . . . of test-takers"). Their intention is nonetheless clear: for a test to be fair, it cannot be instrumentally biased against individuals or groups. In a word: a language test should conform to the well-known principle in international codes of ethics and practice for language testing: it must possess technical beneficence. It must respect test takers, may not place them at an unanticipated disadvantage, or do harm that could have been foreseen; it must strive to be administered with care and compassion, keeping in mind the interests of the test users, including those of the test takers.

Though related to justice of both the internal and external kind, fairness is thus distinguishable from that. Moreover, widening the conception of fairness beyond the rudimentary notions of reliability and other conventional concepts that tests must possess means that it now encompasses much more. Language testing has progressed beyond claiming that a test is fair once its potential misclassification of candidates has shrunk to acceptable levels (see Chapter 2). We are now sensitive to whether a test has bias, as shown, for example, in differential item functioning (DIF) analyses for various genders, ethnic or language groups, or convictions and persuasions. As both Chapters 7 and 8 show, assessment task design and construction may contribute to unfair outcomes. We are similarly alert to how test takers experience the administration of a test. More studies are emerging about post-test surveys of test reception (Van der Walt & Steyn, 2007, for instance, link this to the validation argument for a language test; in Chapter 4, West & Thiruchelvam offer a detailed discussion of a wide-ranging survey). Thus, test takers are given a voice, and designers are bound to listen to the opinions of those at the receiving end of their tests. As proposed by Richardson (Chapter 3):

    To enact more democratic practices, testing companies . . . would need to invite representation by and sustain active collaborations with local communities to protect those being tested. Such participation could hold powerful institutions more accountable in test construction and administration.

Shohamy (1998, 2001, 2022) argues forcefully for the role of critical language testing to consider the views of those being tested, while also accounting for the consequences of testing. These kinds of considerations illustrate that language testing rests on social constructs that are embedded within value systems which serve social, cultural, and political goals (McNamara, 2001, 2006; McNamara & Roever, 2006). Especially relevant is how we take into account the undemocratic or democratic processes shaping how power is shared amongst stakeholders, from elites and elite institutions to local community members. The goal of social justice (Rawls, 1999, p. 16) means that, as a first point of departure, language test developers need to make the public aware of the intended (and unintended) consequences of tests (Shohamy, 1993, 2001).

Particularly prominent in this regard are the perspectives deriving from what is called critical language testing. This approach is characterized by pleas for more thoroughly democratic testing. Language test designers are ideally striving to represent better the diversity of test takers' backgrounds (e.g., immigrants, heritage speakers, indigenous populations, and others). They are aware that we need multi-language, multimodal tests. In addition, critical language testing refers to the continuous need to raise questions about the consequences of designing and using tests in schools and society in general by evaluating the fairness and justice of language tests in various domains. It is concerned specifically with research on the political power of tests and test providers, about the uses and misuses of tests, about injustices and ethicality. This research has shown how central agencies – ministries of education, testing boards, principals, and teachers – may misuse tests, employing them for their own benefit only, or for inappropriate purposes. The enormous power of tests based on their unique feature of determining the future of test takers and educational systems needs to be critically examined. In giving voice to those who are at the receiving end of test designs, special attention needs to be given to how multilingual speakers and minority language users are treated (e.g., Limerick, 2019; López-Gopar et al., 2021; Schissel, 2020; Schissel & Khan, 2021; Schissel et al., 2021). This features most prominently in Part II of this volume. We therefore turn now to how the content of what follows relates to the motivations for the publication of this book described in the three sections earlier.

A range of pointers towards responsible language test design

A diversity of views and approaches to address the challenges outlined in this chapter is brought together in this book. Its chapters offer qualitative and quantitative examinations of socially and ethically defensible assessments across a range of localized settings. It considers how socially contextualized and decontextualized assessments might impact academic achievement, access, and learning, as well as justice and equity more broadly. The range of testing contexts analyzed provides a glimpse of the possibilities afforded by critical and alternative perspectives on the responsible use of language assessments in different settings. Even though the critiques of the conceptualization of construct validity (especially after Messick's seminal contributions) and the ensuing various proposals to consider a wider range of variables to define validity within a social context are not entirely new, the chapters in this book bring to bear fresh empirical evidence from a range of testing settings on these issues. Some chapters focus primarily on the analysis of social justice in assessment practices at the societal level (large-scale tests), whereas others – especially micro-analytical studies at the classroom level – deal primarily with fairness at the level of small groups. Finally, the authors of some chapters frame their studies more in terms of empowerment of individuals and promoting individual agency. The chapters draw on different research methodologies too, ranging from conversation analytical approaches to test data to interview data of stakeholders through to quantitative comparisons of different groups of test takers.

The volume is divided into three parts representing three thematic approaches to what we believe are some of the most important issues in responsibly designing tests to measure ability in an additional language:

Part I introduces the volume, dealing mainly with the ethical challenges and contextualization and theoretical grounds of language assessment (Chapters 1 and 2).
Part II analyzes agency and empowerment prompted by test adequacy (Chapters 3, 4, and 5).
Part III tackles socio-interactional perspectives on classroom assessment (Chapters 6, 7, and 8).

At the end of each chapter, readers will find a section with additional relevant readings related to the topic of the chapter, a few discussion questions, and some suggested applied research projects. A final chapter (Chapter 9) gives a glimpse of criteria for language test design that are likely to become ever more relevant in the future.

In Chapter 2, Weideman and Deygers argue that validity theory has become contested terrain in language assessment, with several competing paradigms at play. Weideman and Deygers claim that the relativism of the interpretivist ('linguistic') turn in Messick has raised objections among 'realists.' To address this problem, the authors argue that we may need to distinguish, as has been argued in this chapter, between subjective validation and objective validity. We may also need to consider the notion that there are several degrees of adequacy, a concept which is synonymous with validity. Starting with the primitive concept of a test measuring what it is intended to measure, and producing a result, they then proceed to the more modern one that validity depends on construct. Thereafter they analyze the interpretivist notion of meaningfulness, before examining how it culminates in linking validity to transparency, accessibility, utility, accountability, justice, fairness, and reputability. Weideman and Deygers note that, at that point, we may need to acknowledge that we are no longer discussing the conceptualization of validity or validation, but of responsible assessment design. Responsible language test design takes its cue from a comprehensive framework of broad design principles, and we do the required justification of assessment designs a conceptual disservice by lumping everything together under the umbrella of 'validity,' or even calling the argument for test quality a 'validation.'

The second section of this volume focuses on the ethical consequences of tests from the perspectives of agency and empowerment. The starting point of Chapter 3 is critical race theory and raciolinguistic ideologies in language testing in Arizona, USA. Chapter 4 has as its main focus a bottom-up approach to assess the needs of those impacted by testing, and who gets to determine the validity of a test in Korea, while Chapter 5 deals with power relations in the use of portfolio assessment, making the point that portfolios do not automatically guarantee ethical or valid assessment to L2 learners.

In Chapter 3, Richardson echoes Shohamy's (1993, 2001) call for multilingual and multimodal tests reflecting immigrants' background by analyzing how the monolingual ideology embedded in Arizona's English Learner Assessments jeopardizes English learners' access to educational resources.


Concerns surrounding the achievement of Arizona’s English learners (ELs) focus on the ineffectiveness of the Structured English Immersion (SEI) model (Arias  & Faltis, 2012; Moore, 2014), and to a lesser extent, ELidentification methods (Abedi, 2008): the Primary Home Language Other Than English (PHLOTE) home language survey (Florez, 2012) and the Arizona English Language Learner Assessment (AZELLA; Florez, 2010; Gándara  & Orfield, 2012). It points out that the social and ethical consequences for Arizona ELs are severe. SEI is traumatizing (Combs et  al., 2005), which produces increasingly segregated classes (Gándara & Orfield, 2012; Martinez-Wenzl et al., 2010) and, as compared to other English language development models, results in lower graduation rates (Rios-Aguilar et  al., 2012) and higher dropout rates (Gándara  & Orfield, 2012), and does not yield increased academic achievement (ibid; Martinez-Wenzl et al., 2010). As pointed out by Richardson, despite the negative effects on ELs, Arizona has been slow to change. Given the existing literature on how language is used to racialize language users (Hill 1998; Leeman, 2004; Urciuoli, 1996), Richardson’s theoretical analysis builds on scholarship in Critical Race Theory (CRT) as a framework to contextualize and to challenge ‘objective’ truths (Delgado, 1995; Matsuda et  al., 1993) that persist despite the deleterious consequences faced daily by Arizona ELs. In line with CRT’s commitment to center marginalized experiences and knowledge, Richardson urges test designers and administrators to adopt practices of democratic assessment (Shohamy, 2001) so that (a) the EL community becomes empowered collaborators in assessment design, implementation, interpretation, and monitoring, (b) their rights are protected, and (c) assessments are understood to reify inequalities and are therefore continually challenged (see Shohamy, 1998). In Chapter  4, West and Thiruchelvam examine the impact and consequences of an English exit exam on both instructors and students at a university in South Korea. A passing score on the locally developed, highstakes assessment, determined by major, was required for graduation from university for all students. Given the potential and actual consequences of failing the assessment, from delaying graduation, and having offers for further education and employment rescinded because of failing scores, this study follows calls from scholars to look beyond positivist and psychometric models, and instead look at the consequential validity of the interpretation and use of scores had on students and instructors (Messick, 1989; McNamara & Roever, 2006; Shohamy, 2001). Interviews were conducted with regular English language instructors (n = 7) and with students (n = 4) and instructors (n = 2) in a special, remedial test preparation course designed by the university for students who had repeatedly failed the English language exam. A  narrative analysis (De Fina  & Georgakopoulou, 2011) of

Context, construct, and ethics  17

the accounts (De Fina, 2009) given in interviews, drawing on positioning theory (Bamberg, 1997), was done to investigate how students and instructors positioned themselves in relation to the assessment policy not only as passive subjects, but as resistant agents working against the policy in various ways. Regular course instructors reported working against the policy by explicitly working on test preparation. Remedial course instructors also positioned themselves as aligned with the students in working to subvert the test policy through more lenient scoring. Students who failed the exam positioned themselves as victims of an assessment policy that had steep consequences; however, they also positioned themselves as capable English users, refusing to recognize the validity of the assessment. Results from this exploratory study add complexity to our understandings of validity from a stakeholder perspective, including the role of washback (Messick, 1996). In the final chapter of Part II, Suzuki examines learner agency in portfolio assessment. Portfolios require learners to become participants of the assessment process who actively select, reflect, and evaluate learning. That is, they support values such as respect for authority. In theory, this practice can enhance learners’ sense of control and ownership in their learning and assessment, empowering them as active agents and producers of language in civil society. This student-involved decision-making approach is in direct contrast to the hierarchical power relations in traditional testing (Shohamy, 2001). This study, therefore, critically reviewed empirical studies related to L2 learners’ perspectives on empowerment during their portfolio assessments. The analysis revealed that, despite the positive ethical and social consequences claimed in theory, portfolio assessments do not necessarily develop L2 learners’ agency. For some L2 learners, the portfolio was another imposed test to provide the teacher with their evidence of learning (e.g., Kristmanson et al., 2013; Hirvela & Sweetland, 2005; Pollari, 2000). Other students increased their sense of ownership in L2 learning but admitted their vulnerability to the institution’s emphasis on high-stakes assessments (e.g., Pearson, 2017; Zhang, 2009). While portfolio activities allowed these learners to explore and negotiate their identities as L2 language users (Canagarajah, 1999), standardized tests were still their academic gatekeepers. Various social factors, such as teachers’ portfolio approaches, learner characteristics, and educational-cultural standard practices, seemed to influence these L2 learners’ sense of empowerment in portfolios. The chapter ends with how L2 portfolio assessment can be more ethically desirable and socially valid. The third section of the present volume focuses on socio-interactional perspectives in classroom assessment. In Chapter  6 – a portfolio in the wild – the focus is on seeking positive washback effect. Chapter  7 examines paired and group speaking assessments tasks in a classroom-based test

18  M. Rafael Salaberry and Albert Weideman

context, while Chapter  8 adopts an approach informed by multimodal conversation analysis, analyzing the impact of task characteristics and test prompts. It deserves mention that, while there is a wide range of language abilities and competences that can be part of language testing that could be examined, the emphasis of the present volume is on the assessment of speaking skills/oral production/oral language use. Though, as we observed earlier, interactional competence is decidedly and demonstrably not limited to this one ‘skill,’ there is a significant amount of language testing dedicated to the evaluation of speaking/oral skills (see discussion of the impact of the ACTFL-OPI model earlier) that makes this relevant. In addition, the topics discussed in this volume span a range of critical social and ethical perspectives on current (traditional) testing and assessment practices, with speaking assessment central in this discussion. This critical perspective unites the various cases analyzed in detail in the various chapters. Accordingly, in Chapter 6, Räsänen and Kivik’s study contributes to the ongoing discussion of assessment and authentic, situated language use in L2 teaching (McNamara & Roever, 2006; Ross & Kasper, 2013; Youn 2015; Cohen 2019). Assessment, as an etic judgment of individual performance against standards, is incompatible with the emic perspective of language use to realize social action (Kley, 2019). As teacher-researchers, Räsänen and Kivik introduced a portfolio task in the Finnish and Estonian language courses at a North American university to collect learner data on independent use and to expand the opportunities for target language practice beyond the classroom. The portfolio as assessment was intended to create the washback effect of emphasizing the importance of independent use. Students in introductory through advanced language classes collected samples of and self-reflections on their language use ‘in the wild’ over a semester and entered these in an electronic portfolio. The activities included their interactions with native and non-native speakers of Finnish/Estonian in a variety of situations in different modalities and engagement with target language content. The chapter demonstrates how the use of the portfolio prompted students to use the target language in the wild, and raised the learners’ awareness of themselves as L2 users, including socially situated interactions, and, more importantly, it increased their agency in the learning process by pushing them to identify learnables (Eskildsen & Majlesi, 2018). Besides vocabulary items, these learnables included social expectations of the language use in a situation as represented by the management of registers and code switching (cf. Compernolle, 2018), managing interaction as non-native speakers, recognizing the learning benefit of L2 use beyond the classroom, and cultural and social phenomena associated with the target. In Chapter 7, Kley focuses her attention on the assessment of language competence framed within a multimodal approach to language use. To

Context, construct, and ethics  19

achieve that goal, she analyzed the effects of two speaking test tasks on topic initiations and their corresponding scores. One of the two tasks was an open-topic task where the peers are supposed to chat for five to ten minutes about several topics of their choice. The other task was a discussion task in which participants were provided with several topics on a card, which was drawn from a set of such cards. The two tasks were used in classroom-based paired speaking tests designed for first-year learners of German. In the firstyear German classes selected for the study, students are explicitly taught interactional competence (e.g., initiate topics, to expand on topics, initiate repair in case of non-understanding, etc.). The data were evaluated following a conversation analysis methodology. The analysis of the data revealed that the initiation of topics between test takers who engaged in the opentopic task was balanced. That is, both test takers initiated topics at about the same rate. In comparison, the test taker pairs who engaged in the discussion task, where the topics to be discussed were provided, initiated topics in a rather unbalanced fashion, in that one of the participants took on a dominant role and initiated all topics listed on the topic card. The findings suggest that the discussion task may be misleading in terms of test takers’ ability to initiate topics. Even though a test taker does not introduce any topics for that task, it does not mean that s/he is unable to do so. The discussion task may impact our understanding of what a student can do and test takers’ scores. In our classroom-based test setting, such a misconception would also affect our overall classroom teaching and testing. Consequently, Kley concludes that one option to make this type of assessment more reliable and valid is to modify the rubric for the discussion task or the task design. In Chapter 8, Kley, Kunitz, and Yeh specifically assess the role of the interlocutor in two speaking tests of Chinese as a foreign language (CFL) conducted at a US university. As is well known from previous research in language testing (Bachman, 1990; Bachman & Palmer, 1996; McNamara, 1996), test discourse and assigned scores may be influenced by various contextual factors (e.g., test task, interlocutor, etc.). This study focused on the effect of the interlocutor’s native/non-native-speakerness on the test taker’s production of repair practices. Accordingly, 28 second semester CFL learners participated in two classroom-based speaking tests: one with a fellow CFL student (“peer interaction”) and one with a native-speaking student at the same university (“NS interaction”). For both tests, the test taker pairs engaged in an open-topic task. Overall, the test takers’ lower engagement in other-initiated repair and other-directed word searches during peer interaction represented a safe strategy to avoid potentially face-threatening situations (such as displaying non-understanding or asking for help that the peer might not be able to give). On the other hand, the increased number of other-directed word searches in NS interaction was related to the linguistic
epistemic asymmetry between the two interlocutors, with the test taker orienting to the NS as more knowledgeable. At the same time, the test taker seemed to orient to the importance of establishing intersubjectivity with the NS, which would explain the higher number of repair initiations. The authors of the chapter conclude that the NS interaction appears to provide more affordances for initiating repairs. In general, although not all test takers and NSs react in the same way, the findings of the study indicate that a linguistically asymmetric test setting may be most fruitful if the testing objective is to elicit repair practices.

In the final section, a concluding chapter by Weideman provides a reflection, from the point of view of a test designer, on what we have accomplished in fronting juridical and ethical issues in language testing, asking whether we have yardsticks to gauge both current trends and potential future advances. He starts by noting that we need a criterion, first, for knowing what constitutes an advance in language test design, and what would signal regression. Weideman's argument implies that we should not succumb to the cynical view that there can be no development, and that everything is but a power play between less and more influential agents – the argument of shifting the deckchairs on the Titanic, only to find that nothing has changed or will change. He argues that an evaluation of where we have progressed or failed in the last two decades will show that, despite throwbacks and unmet challenges, there have been substantial gains. One illustration of such a gain is the rising level of language assessment literacy. But there are many more. What we need in order to identify gains or losses he calls a "theory of disclosure." Such a theory asks: how is the meaning of language test design opened up? With the tenets of a theory that articulates what makes designs (more) meaningful, we are readying ourselves to generate yardsticks that we can use to evaluate future developments. We need to articulate such principles, stipulating the requirements of how the many remaining challenges for responsible test design can be met. This book is an attempt to do just that.

References

Abedi, J. (2008). Classification system for English language learners: Issues and recommendations. Educational Measurement: Issues and Practice, 27(3), 17–31. Alderson, J. C. (2007). The CEFR and the need for more research. The Modern Language Journal, 91(4), 659–663. www.jstor.org/stable/4626093 Alderson, J. C., Haapakangas, E. L., Huhta, A., Nieminen, L., & Ullakonoja, R. (2014). The diagnosis of reading in a second or foreign language. Routledge. Arias, M. B., & Faltis, C. (Eds.) (2012). Implementing educational language policy in Arizona: Legal, historical, and current practices in SEI (Vol. 86). Multilingual Matters.


Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press. Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1–42. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests (Vol. 1). Oxford University Press. Bachman, L. F.,  & Purpura, J. E. (2008). Language assessments: Gate-keepers or door openers? In B. Spolsky  & F. M. Hult (Eds.), Handbook of educational linguistics (pp.  456–468). Blackwell Publishers. https://doi. org/10.1002/9780470694138.ch32 Bamberg, M. G. (1997). Positioning between structure and performance. Journal of Narrative and Life History, 7(1–4), 335–342. https://doi.org/10.1075/ jnlh.7.42pos Biber, D., Gray, B., & Staples, S. (2016). Predicting patterns of grammatical complexity across language exam task types and proficiency levels. Applied Linguistics, 37(5), 639–668. https://doi.org/10.1093/applin/amu059 Block, D. (2003). The social turn in second language acquisition. Georgetown University Press. Borsboom, D., Mellenbergh, G. J.,  & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. https://doi.org/10.1037/ 0033-295X.111.4.1061 Byrnes, H. (2007). Developing national language education policies: Reflections on the CEFR. The Modern Language Journal, 91(4), 679–685. www.jstor.org/ stable/4626099 Canagarajah, S. (1999). Resisting linguistic imperialism in English teaching. Oxford University Press. Carey, P. (1996). A review of psychometric and consequential issues related to performance assessment (TOEFL Monograph 3). Educational Testing Service. Chalhoub-Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20(4), 369–383. https://doi.org/10.119 1/0265532203lt264oa Cohen, A. (2019). Considerations in assessing pragmatic appropriateness in spoken language. Language Teaching, 1–20. https://doi.org/10.1017/S02614448190 00156 Combs, M. C., Evans, C., Fletcher, T., Parra, E., & Jiménez, A. (2005). Bilingualism for the children: Implementing a dual-language program in an English-only state. Educational Policy, 19(5), 701–728. Compernolle, R. A. (2018). Dynamic strategic interaction scenarios: A  vygotskian approach to focusing on meaning and form. In M. Ahmadian & M. García Mayo (Eds.), Recent perspectives on task-based language learning and teaching (pp. 79–98). De Gruyter Mouton. https://doi.org/10.1515/9781501503399-005 Conteh, J., & Meier, G. (Eds.) (2014). The multilingual turn in languages education: Opportunities and challenges. Multilingual Matters. Council of Europe. (2011). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge University Press. Cumming, A. (2002). Assessing L2 writing: Alternative constructs and ethical dilemmas. Assessing Writing, 8(2), 73–83. https://doi.org/10.1016/S10752935(02)00047-8


Davies, A. (1997). Demands of being professional in language testing. Language Testing, 14(3), 328–339. https://doi.org/10.1177/026553229701400309 Davies, A. (2008). Ethics, professionalism, rights and codes. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education (2nd ed., Vol. 7; Language testing and assessment; pp.  429–443). Springer. https://doi. org/10.1007/978-3-319-02261-1_27 De Fina, A. (2009). Narratives in interview – The case of accounts: For an interactional approach to narrative genres. Narrative Inquiry, 19(2), 233–258. https://doi.org/10.1075/ni.19.2.03def De Fina, A.,  & Georgakopoulou, A. (2011). Analyzing narrative: Discourse and sociolinguistic perspectives. Cambridge University. Delgado, R. (Ed.) (1995). Critical race theory: The cutting edge. Temple University Press. Eskildsen, S. W., & Majlesi, A. R. (2018). Learnables and teachables in second language talk: Advancing a social reconceptualization of central SLA tenets. Introduction to the special issue. The Modern Language Journal, 102, 3–10. Eskildsen, S. W., & Wagner, J. (2015). Embodied L2 construction learning. Language Learning, 65, 419–448. https://doi.org/10.1111/lang.12106 Figueras, N. (2012). The impact of the CEFR. ELT Journal, 66(4), 477–485. https://doi.org/10.1093/elt/ccs037 Florez, I. R. (2010). Do the AZELLA cut scores meet the standards? A  validation review of the Arizona English language learner assessment (Civil Rights Project). University of California. Florez, I. R. (2012). Examining the validity of the Arizona English language learners assessment cut scores. Language Policy, 11(1), 33–45. Fulcher, G. (2004). Deluded by artifices? The common European framework and harmonization. Language Assessment Quarterly: An International Journal, 1(4), 253–266. https://doi.org/10.1207/s15434311laq0104_4 Fulcher, G. (2015). Re-examining language testing: A  philosophical and social inquiry. Routledge. Galaczi, E.,  & Taylor, L. (2018). Interactional competence: Conceptualizations, operationalizations, and outstanding questions. Language Assessment Quarterly, 15(3), 219–236. https://doi.org/10.1080/15434303.2018.1453816 Gándara, P., & Orfield, G. (2012). Why Arizona matters: The historical, legal, and political contexts of Arizona’s instructional policies and US linguistic hegemony. Language Policy, 11(1), 7–19. https://doi.org/10.1007/s10993-011-9227-2 Hamid, M. O., Hardy, I., & Reyes, V. (2019). Test-takers’ perspectives on a global test of English: Questions of fairness, justice and validity. Language Testing in Asia, 9(1), 16. https://doi.org/10.1186/s40468019-0092-9 Hamid, M. O., & Hoang, N. T. (2018). Humanising language testing. TESL-EJ, 22(1), 1–20. He, A. W., & Young, R. (1998). Language proficiency interviews: A discourse approach. Talking and Testing: Discourse Approaches to the Assessment of oral Proficiency, 14, 1–24. https://doi.org/10.1075/sibil.14.02he Hill, J. H. (1998). Language, race, and white public space. American Anthropologist, 100(3), 680–689. https://doi.org/10.1525/aa.1998.100.3.680 Hirvela, A., & Sweetland, Y. L. (2005). Two case studies of L2 writers’ experiences across learning-directed portfolio contexts. Assessing Writing, 10(3), 192–213. https://doi.org/10.1016/j.asw.2005.07.001


Huth, T. (2020). Interaction, language use, and second language teaching. Routledge. Huth, T., & Betz, E. (2019). Testing interactional competence in second language classrooms: Goals, formats and caveats. In R. Salaberry  & S. Kunitz (Eds.), Teaching and Testing L2 Interactional Competence (pp.  322–356). Routledge. https://doi.org/10.4324/9781315177021 Kane, M. T. (2010). Validity and fairness. Language Testing, 27(2), 177–182. https://doi.org/10.1177/0265532209349467 Kane, M. T., Kane, J., & Clauser, B. E. (2017). A validation framework for credentialing tests. In C. W. Buckendahl & S. Davis-Becker (Eds.), Testing in the professions: Credentialing polices and practice (pp. 20–41). Routledge. Kley, K. (2019). What counts as evidence for interactional competence? Developing criteria for a German classroom-based paired speaking project. In Salaberry, R.,  & Kunitz, S. (Eds.), Teaching and testing L2 interactional competence: Bridging theory and practice (pp.  291–381). Routledge. https://doi. org/10.4324/9781315177021 Kramsch, C. (1986). From language proficiency to interactional competence. Modern Language Journal, 70, 366–372. https://doi.org/10.2307/326815 Kristmanson, P., Lafargue, C., & Culligan, K. (2013). Experiences with autonomy: Learners’ voices on language learning. Canadian Modern Language Review, 69(4), 462–486. https://doi.org/10.3138/cmlr.1723.462 Kunnan, A. J. (2018). Evaluating language assessment. Routledge. Leeman, J. (2004). Racializing language: A  history of linguistic ideologies in the US Census. Journal of Language and Politics, 3(3), 507–534. https://doi. org/10.1093/sf/soad060 Lilja, N., & Piirainen-Marsh, A. (2018). Connecting the language classroom and the wild: Re-enactments of language use experiences. Applied Linguistics, 40(4). https://doi.org/10.1093/applin/amx045 Lilja, N.,  & Piirainen-Marsh, A. (2019). Making sense of interactional trouble through mobile-supported sharing activities. In R. Salaberry & S. Kunitz (Eds.), Teaching and testing L2 interactional competence: Bridging theory and practice (pp. 260–288). Routledge. https://doi.org/10.4324/9781315177021 Limerick, N. (2019). Three multilingual dynamics of Indigenous language use that challenge standardized linguistic assessment. Language Assessment Quarterly, 16(4–5), 379–392. https://doi.org/10.1080/15434303.2019.1674313. Liskin-Gasparro, J. E. (2003). The ACTFL proficiency guidelines and the oral proficiency interview: A brief history and analysis of their survival. Foreign Language Annals, 36(4), 483–490. López-Gopar, M. E., Schissel, J. L., Leung, C., & Morales, J. (2021). Co-constructing social justice: Language educators challenging colonial practices in Mexico. Applied Linguistics, 42(6), 1097–1109. https://doi.org/10.1093/applin/ amab047 Martinez-Wenzl, M., Pérez, K., & Gändara, P. (2010). Is Arizona’s approach to educating its English learners superior to other forms of instruction? (The Civil Rights Project). Proyecto Derechos Civiles. Matsuda, M., Lawrence, C., Delgado, R., & Crenshaw, K. (Eds) (1993). Words that wound: Critical race theory, assaultive speech and the first amendment. Westview Press. May, S. (Ed.) (2013). The multilingual turn: Implications for SLA, TESOL, and bilingual education. Routledge.


McNamara, T. (1996). Measuring second language performance. Longman. McNamara, T. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18(4), 333–349. https://doi.org/10.1177/02655 3220101800402 McNamara, T. (2006). Validity in language testing: The challenge of Sam Messick’s legacy. Language Assessment Quarterly: An International Journal, 3(1), 31–51. https://doi.org/10.1207/s15434311laq0301_3 McNamara, T., Knoch, T., & Fan, J. (2019). Fairness, justice, and language assessment. Oxford University. McNamara, T.,  & Roever, C. (2006). Language testing: The social dimension. Blackwell. McNamara, T., & Ryan, K. (2011). Fairness versus justice in language testing: the place of English literacy in the Australian Citizenship Test. Language Assessment Quarterly, 8(2), 161–178. https://doi.org/10.1080/15434303.2011.565438 Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–104). American Council on Education and Macmillan. Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23. https://doi. org/10.2307/1176219 Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256. https://doi.org/10.1177/026553229601300302 Mondada, L. (2011). Understanding as an embodied, situated and sequential achievement in interaction. Journal of Pragmatics, 43(2), 542–552. https://doi. org/10.1016/j.pragma.2010.08.019 Moore, S. C. K. (Ed.) (2014). Language policy processes and consequences: Arizona case studies (Vol. 98). Multilingual Matters. NExLA (Network of Expertise in Language Assessment). (2023). Bibliography. https://nexla.org.za/research-on-language-assessment/. Accessed 23 March 2023. Noori, M., & Mirhosseini, S. A. (2021). Testing language, but what?: Examining the carrier content of IELTS preparation materials from a critical perspective. Language Assessment Quarterly, 18(4), 382–397. https://doi.org/10.1080/15 434303.2021.1883618 Ortega, L. (2013). SLA for the 21st century: Disciplinary progress, transdisciplinary relevance, and the bi/multilingual turn. Language Learning, 63, 1–24. https:// doi.org/10.1111/j.1467-9922.2012.00735.x Patterson, R., & Weideman, A. (2013a). The typicality of academic discourse and its relevance for constructs of academic literacy. Journal for Language Teaching, 47(1), 107–123. https://doi.org/10.4314/jlt.v47i1.5 Patterson, R., & Weideman, A. (2013b). The refinement of a construct for tests of academic literacy. Journal for Language Teaching, 47(1), 125–151 https://doi. org/10.4314/jlt.v47i1.6 Pearson, J. (2017). Processfolio: Uniting academic literacies and critical emancipatory action research for practitioner-led inquiry into EAP writing assessment. Critical Inquiry in Language Studies, 14(2–3), 158–181. https://doi.org/10.1 080/15427587.2017.1279544 Pearson, W. S. (2019). Critical perspectives on the IELTS test. ELT Journal, 73(2), 197–206. https://doi.org/10.1093/elt/ccz006


Poehner, M. E., & Lantolf, J. P. (2005). Dynamic assessment in the language classroom. Language Teaching Research, 9(3), 233–265. Pollari, P. (2000). “This is my portfolio”: Portfolios in upper secondary school English studies (ERIC Document Reproduction Service No. ED450415). Institute for Educational Research. Popham, W. J. (1997). Consequential validity: Right concern – Wrong concept. Educational Measurement: Issues and Practice, 913. https://doi.org/10.1111/ j.1745-3992.1997.tb00586.x Purpura, J. E. (2016). Second and foreign language assessment. The Modern Language Journal, 100(S1), 190–208. www.jstor.org/stable/44135003 Raffaldini, T. (1988). The use of situation tests as measures of communicative ability. Studies in Second Language Acquisition, 10, 197–216. www.jstor.org/ stable/44488173 Rawls, J. (1999). A Theory of Justice (Revised ed.). The Belknap Press of Harvard University Press. Rios-Aguilar, C., González-Canché, M., & Moll, L. (2012). Implementing structured English immersion in Arizona: Benefits, challenges, and opportunities. Teachers College Record, 114(9), 1–18. https://doi.org/10.1177/016146811211400903 Roever, C., & Ikeda, N. (2022). What scores from monologic speaking tests can (not) tell us about interactional competence. Language Testing, 39(1), 7–29. https://doi.org/10.1177/02655322211003332 Roever, C.,  & Kasper, G. (2018). Speaking in turns and sequences: Interactional competence as a target construct in testing speaking. Language Testing, 35(3), 331–355. https://doi.org/10.1177/0265532218758128. Ross, S. J., & Kasper, G. (Eds.) (2013). Assessing second language pragmatics. Palgrave Macmillan. Salaberry, M. R. (2000). Revising the revised format of the ACTFL Oral Proficiency Interview. Language Testing, 17(3), 289–310. https://doi.org/10.1177/ 026553220001700301 Salaberry, M. R.,  & Burch, A. R. (Eds.) (2021). Assessing speaking in context: Expanding the construct and its applications (Vol. 149). Multilingual Matters. https://doi.org/10.1080/15434303.2022.2095913 Salaberry, M. R.,  & Kunitz, S. (Eds.) (2019). Teaching and testing L2 interactional competence: Bridging theory and practice. Routledge. https://doi.org/10. 4324/9781315177021 Sandlund, E., & Sundqvist, P. (2019). Doing versus assessing interactional competence. In R. Salaberry & S. Kunitz (Eds.), Teaching and testing l2 interactional competence (pp. 357–396). Routledge. https://doi.org/10.4324/9781315177021 Schildt, L., Deygers, B. & Weideman, A. (2023). Language testers and their place in the policy web. Language Testing (Online first: article first published online on 17 August). https://doi.org/10.1177/02655322231191133 Schissel, J. L. (2020). Moving beyond deficit positioning of linguistically diverse test takers: Bi/multilingualism and the essence of validity. The Sociopolitics of English Language Testing, 91, 89–108. Schissel, J. L., De Korne, H.,  & López-Gopar, M. (2021). Grappling with translanguaging for teaching and assessment in culturally and linguistically diverse contexts: Teacher perspectives from Oaxaca, Mexico. International Journal of
Bilingual Education and Bilingualism, 24(3), 340–356. https://doi.org/10.10 80/13670050.2018.1463965 Schissel, J. L., & Khan, K. (2021). Responsibilities and opportunities in language testing with respect to historicized forms of socio-political discrimination: A matter of academic citizenship. Language Testing, 38(4), 640–648. https://doi. org/10.1177/02655322211028590 Shohamy, E. (1990). Language testing priorities: A  different perspective. Foreign Language Annals, 23, 385–394. https://doi.org/10.1111/j.1944-9720.1990. tb00392.x Shohamy, E. (1993). The power of tests: The impact of language tests on teaching and learning. NFLC Occasional Papers. National Foreign Language Center, Washington, DC. Shohamy, E. (1997). Testing methods, testing consequences: Are they ethical? Are they fair? Language Testing, 14(3), 340–349. https://doi.org/10.1177/ 026553229701400310 Shohamy, E. (1998). Critical language testing and beyond. Studies in Educational Evaluation, 24(4), 331–45. https://doi.org/10.1016/S0191-491X(98)00020-0 Shohamy, E. (2001). Democratic assessment as an alternative. Language Testing, 18(4), 373–391. https://doi.org/10.1191/026553201682430094 Shohamy, E. (2022). Critical language testing, multilingualism and social justice. TESOL Quarterly, 56(4), 1445–1457. https://doi.org/10.1002/tesq.3185 Spolsky, B. (1997). The ethics of gatekeeping tests: What have we learned in a hundred years? Language Testing, 14(3), 242–247. https://doi.org/10.1177/ 026553229701400302 Stansfield, C. W. (1993). Ethics, standards, and professionalism in language testing. Issues in Applied Linguistics, 4(2), 189–206. https://doi.org/10.5070/ L442030811 Turner, C. E.,  & Purpura, J. E. (2016). Learning-oriented assessment in second and foreign language classrooms. Handbook of Second Language Assessment, 12, 255–274. Urciuoli, B. (1996). Exposing prejudice: Puerto Rican experiences of language, race, and class. Westview. Van der Walt, J., & Steyn, H. Jr. 2007. Pragmatic validation of a test of academic literacy at tertiary level. Ensovoort, 11(2), 138–153. Van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency interviews as conversation. TESOL Quarterly, 23(3), 489–508. https://doi.org/10.2307/3586922 Waring, H. Z. (2018). Teaching L2 interactional competence: Problems and possibilities. Classroom Discourse, 9(1), 57–67. https://doi.org/10.1080/1946301 4.2018.1434082 Weideman, A. (2009). Beyond expression: A systematic study of the foundations of linguistics. Paideia Press. Weideman, A. (2012). Validation and validity beyond Messick. Per Linguam, 28(2), 1–14. Weideman, A. (2019a). Degrees of adequacy: The disclosure of levels of validity in language assessments. Koers, 84(1). https://doi.org/10.19108/ KOERS.84.1.2451.


Weideman, A. (2019b). Validation and the further disclosures of language test design. Koers, 84(1). https://doi.org/10.19108/KOERS.84.1.2452. Weideman, A. (2021). A  skills-neutral approach to academic literacy assessment. In A. Weideman, J. Read & T. du Plessis (Eds.), Assessing academic literacy in a multilingual society: Transition and transformation (New Perspectives on Language and Education: 84; pp.  22–51). Multilingual Matters. https://doi. org/10.21832/WEIDEM6201. Weir, C. J. (2005). Limitations of the common European framework for developing comparable examinations and tests. Language Testing, 22(3), 281–300. https:// doi.org/10.1191/0265532205lt309oa Youn, S. J. (2015). Validity argument for assessing L2 pragmatics in interaction using mixed methods. Language Testing, 32(2), 199–225. https://doi. org/10.1177/0265532214557113 Zhang, S. (2009). Has portfolio assessment become common practice in EFL classrooms? Empirical studies from China. English Language Teaching, 2(2), 98–118.

2
VALIDITY AND VALIDATION
An alternative perspective

Albert Weideman and Bart Deygers

Contested views and conceptual uncertainty

In informal discussion, one still encounters the claim that tests of language ability must be both reliable and valid. Once one consults the academic literature on this, however, one quickly learns that it may be problematic to claim that a language test is valid; rather, one is encouraged to look at how valid the interpretation of the results of such a test is. Kane (1992, p. 527) states unequivocally that validity "is associated with the interpretation assigned to test scores rather than with the scores or the test." Proponents of this view urge us to build an argument for the quality of the test through a process of validation (Kane, 1992, 2010, 2011, 2016). Though this is not the whole story, for many this approach was initiated by the work of one of the founders of the orthodoxy, Messick (1980, 1981, 1988, 1989). This orthodoxy has succeeded the earlier, traditional views of validity. It constitutes an interpretive turn, claiming that "Test validity is . . . an overall evaluative judgment of the adequacy and appropriateness of inferences drawn from test scores" (Messick, 1980, p. 1023; cf. too 1981, p. 18).

Where Messick (1980, 1981) speaks of an "evaluative judgement", one notices that Kane (1992, p. 527) has replaced that with "the interpretation assigned to test scores". This subtle modification has gone unnoticed and undiscussed in debates on validity theory, and it need not concern us now. What is noteworthy is that this stance, which now is represented more generally by argument-based validity theories, raises several immediate questions, and leads to at least one complication. The first question would be: if validity is not considered to be a quality or characteristic of a test proper,
does that mean that the test then has no other features either, such as reliability, intuitive appeal, defensibility, or usefulness? To overcome this ultimately contradictory assumption, the argument made would view the process of validation as containing a wide range of inferences that are preconditions for valid score use. In that case, reliability, generalizability, and scoring protocols may be part of the inferences to be justified during validation. But if one takes the claim that validity is not a feature of a test at face value (as do Davies & Elder, 2005, and Fulcher & Davidson, 2007; see later), there is another question. If an assessment lacks only the feature of validity, the second question has to be: what arguments can be produced to indicate that it possesses only other qualities, but not validity? Or, thirdly, would the presence of other characteristics in a test also be dependent entirely on interpretation? It is evident that such a position opens up to either a regression, or an inescapable relativism, from which, as Fulcher (2015, p. 111) correctly points out, the only protection is “due process”, in this case, a process that produces a validation argument (Kane, 2016). It is not surprising that in the work of Borsboom and colleagues (Borsboom et  al., 2004), as well as in the earlier objections raised by Popham (1997), the interpretive turn that the concept of validity underwent with Messick has been contested. Nor are they alone in their challenge to the current orthodoxy. Fulcher and Davidson (2007, p. 279), for instance, ask the designers of language tests to consider whether the claim that validity is not a feature of a language test is perhaps a mistake, observing: If a test is typically used for the same inferential decisions, over and over again, and if there is no evidence that it is being used for the wrong decisions, could we not simply speak of the validity of that particular test – as a characteristic of it? To summarize: if the validation process repeatedly arrives at the same set of positive inferences from the interpretations of the test results, how can one avoid ascribing validity to that test? A similar point is made by Davies and Elder (2005, p. 797) in relation to tests building a reputation over time, before concluding that “in some sense validity does reside in test instruments”, and it is therefore “not just a trick of semantics . . . to say that one test is more valid than the other for a particular purpose” (Davies & Elder, 2005, p. 798). Furthermore, one has the impression that those who subscribe to the current orthodoxy do not bother to do the close reading and subsequent analytical treatment that, for example, Fulcher (2015) extends to the conceptual quandary (see also Deygers, 2018).
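
To make the idea of validation as a chain of inferences slightly more tangible, the sketch below renders a minimal interpretive argument as a small Python data structure: each inference that is a precondition for valid score use is recorded together with whatever backing has been gathered for it, and any unbacked link is flagged. The labels scoring, generalization, extrapolation, and decision are one common way of carving up such an argument; the example itself is ours and purely illustrative, not a formalization proposed by Kane or by the other authors cited here, and the evidence strings are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Inference:
    """One link in an interpretive (validity) argument."""
    name: str                                    # e.g. "scoring", "generalization"
    claim: str                                   # what this link asserts about score use
    backing: list = field(default_factory=list)  # evidence gathered in support

def unbacked(argument):
    """Return the inferences that currently lack any supporting evidence."""
    return [link for link in argument if not link.backing]

# A hypothetical argument for using scores from a paired speaking test
argument = [
    Inference("scoring", "ratings reflect the rubric consistently",
              ["rater training records", "inter-rater reliability estimate"]),
    Inference("generalization", "scores generalize across tasks and occasions",
              ["pilot data from two task types"]),
    Inference("extrapolation", "scores reflect interactional ability beyond the test"),
    Inference("decision", "scores support placement decisions without adverse impact"),
]

for link in unbacked(argument):
    print(f"Still to be backed: {link.name} -> {link.claim}")

Run as is, the sketch flags the extrapolation and decision inferences as the links still awaiting evidence, which is, in miniature, what a validation exercise is meant to reveal.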


One effect of denying that a test can characteristically be valid is that all manner of circumlocutions then emerge, to be adopted as synonymous concepts. ‘Validity’ may well resurface as ‘adequacy’, ‘effectiveness’, or ‘quality’. This is evident already in Messick’s own claim (1980, p. 1023; cf. too 1981, p. 18) that “test validity is . . . an overall evaluative judgment of the adequacy and appropriateness of inferences drawn from test scores.” One may thus at times find references to the ‘effectiveness’ of a test in ‘measuring the construct’, or to the valid data it may yield, without any consideration of why the data or the measurement would be valid, but not the instrument that yielded the data. It is not surprising, therefore, that despite McNamara and Roever’s (2006, p. 250f.) subscription to the current orthodoxy, and their critique of the position of Borsboom et  al. (2004), they themselves continue to speak about the validity of a test (McNamara & Roever, 2006, p. 17), assuming that a “test is . . . a valid measure of the construct” (2006, p. 109), and discussing the necessity of “items measuring only the skill or the ability under investigation” (2006, p. 81). There is no reference in these conceptual concessions to the claim that validity is solely dependent on the interpretation of the scores derived from such tests or such items. In every case, what is lacking is a closer reading of the texts in which claims for and against an argument-based logic are made. The several questions that it throws up were discussed in the previous paragraphs. The answer to those questions should probably result from an examination of these original statements. Without that, clarity on ‘validity’ and ‘validation’ will remain a conceptual morass. Because of the concepts and interpretations that have been added to the development of validity theory, the concept ‘validity’ has gradually become burdened with other notions. This exposes a complication. Perhaps as a result of the conceptual dilemma referred to in the previous paragraphs, numbers of commentators on language testing have added their own explanations, in what may be a well-intentioned attempt at conceptual clarity or practical application. Such further explication of the ‘prime consideration’ in language testing often confusingly finds itself interpreted as an explanation of validity. Bachman (2001, p. 110) and Bachman and Palmer (1996, p. 17), for example, have introduced the notion of ‘usefulness’ as the prime feature of language testing. In their explanation of this, they incorporate construct validity and test impact (Bachman  & Palmer 1996, p. 18), or what, since Messick, has been called ‘consequential validity’ (Messick, 1989), an idea referring to the social effects of the application of a measurement of language ability (Weideman, 2017b). Adding to the confusion, the interpretation by Fulcher and Davidson (2007, p. 15) of Bachman and Palmer (1996) is that their “notion of test ‘usefulness’ provides an alternative way of looking at validity.” Incidentally, though
Messick himself is often thought of as having merely construct validity in mind as the ‘unifying’ or prime goal of language testing, he himself introduced the additional idea of ‘appropriateness’, or taking into account the social impact of language testing (McNamara & Roever, 2006; McNamara et al., 2019). ‘Consequential validity’ relates validity to the social impact of language tests. The complication is compounded still further when one observes, for example, that Kunnan (2000, 2004) states that the prime goal of language testing is fairness, while Kane finds the primary consideration in meaningfulness (Kane, 2011, p. 3). Taking this one step further in his contribution to a symposium on the principles of test validation at the annual Language Testing Research Colloquium (LTRC), Tannenbaum (2018), following the lead of Kane (2011) in the ‘interpretive’ paradigm, claimed: “I equate meaningfulness with validity.” Xi’s proposal (2010, p.  167), in turn, calls for the “integration of fairness into validity”. The claims made about the “prime consideration in language testing” appear to slip easily into merely extending validity to include whatever is considered to be a further pivotal feature. In discussing the Code of Ethics of the International Language Testing Association (ILTA), which he was instrumental in drafting, Davies (2008, p. 491) observes that “what we are seeing in the professionalizing and ethicalizing of language testing is a wider and wider understanding of validity.” Here the ethical dimensions of language testing, such as ‘beneficence’, treating those taking a test with care and compassion, and with regard to the impact of the results of a language test, become nothing more than extensions of the concept of validity. Validity is thus strongly associated or equated with other qualities of a test, such as usefulness, meaningfulness, impact, fairness, or beneficence, which cannot logically be its equivalents. Probably the most insightful treatment of the philosophical undertow of the various contesting conceptions of validity is Fulcher’s (2015, pp. 108–125) analysis of its four different variations, as well as his own proposal for a pragmatic realism that is based in experience and a critical evaluation of theory. It is informative not only for the acuity of the analysis, but for the trouble it takes to pay close attention to the texts in which the claims are made, to challenge them with reference to their own philosophical starting points, and to correct them with reference to opposing philosophies. There is no doubt that in language assessment there are contested and conflicting views on validity. West and Thiruchelvam (in this volume) observe that there is as yet no “deeply theorized” conception of validity. Since the 1990s, the division into construct, concurrent, and other kinds of validity has been overtaken by argument-based validity thinking that makes validity dependent on interpretation. The current orthodoxy has itself been
reinterpreted and augmented in various ways, as we have seen in the previous discussion. What sense can one make of all this? Seeking to pursue a productive route, we discuss in this chapter several avenues that are helpful in showing a way out of the conceptual quagmire. We aim to make good the promise in the title of an alternative perspective on validity, and at the same time to gain in analytical clarity. First, the analysis will demonstrate that the various understandings of validity can perhaps be interpreted as a progressive unfolding or deepening of the concept (Weideman, 2019a, 2019b). This is a complex analysis, but understanding it has resounding rewards. The discussion will finally attempt to clarify validity conceptually by recognizing the difference between the subjective process of validation and the objective features of a language test. In order to gain conceptual clarity, we proceed from the premise that it will be productive to employ a theory of applied linguistics, to which we turn first before engaging in the conceptual clarification.

What a theory of applied linguistics can accomplish

Language testing is generally viewed as a sub-discipline of applied linguistics (Davies, 2008; McNamara, 2003; McNamara & Roever, 2006, p. 255; McNamara et al., 2019, p. 1). If one defines applied linguistics as a discipline concerned with design (Weideman, 2017a), our theoretical perspective engages with the designed, planned, or intentionally shaped language interventions that are the primary artefacts of the field. Already in this definition one notices the idea of ‘design’, of planning, forming, and shaping. It is a pivotal aspect of language test designs that constitute one of the primary applied linguistic artefacts. Later, we shall note that ‘design’ is the key to understanding the idea of what the ‘technical’ mode of our experience is. We employ the term ‘mode’ or ‘modality’ here to mean “a way (or mode) of being” (Strauss, 2009), which refers to an aspect, function, or dimension of a concrete entity, instead of to the concrete entities, events, states, or processes themselves. As will become clear in the following discussion, each such modality has a nuclear moment; we shall employ the term ‘design’ as the core idea that defines the technical mode, and sometimes use the terms interchangeably. What are called “primary applied linguistic artefacts” in the previous paragraph are designed solutions to large-scale, pervasive, or persistently vexing language problems. They are qualified by their leading technical (‘design’) function. These planned applied linguistic solutions that we encounter come in three main shapes: language policies and plans; language assessments and tests; and language curricula and courses. Davies (2008, p. 298) identifies these three sets of solutions as the main designed language interventions,
viewing the task of those working in the discipline as one of ensuring that “applied linguistics is prepared in its curricula and its assessments and in its planning . . . to be accountable.” Significantly, he adds that this can be achieved by “theorising practice”. To identify the designed nature of applied linguistic artefacts such as language assessments is one thing; to theorize that characteristic, we need to take a second conceptual step. That step is to abstract the technical modality, the characterizing design mode or function of such instruments, in an attempt to isolate it for further theoretical scrutiny. The systematic analyses supporting this kind of conceptualization have been adequately described elsewhere (Weideman, 2009, 2017a, 2017b, 2019a, 2019b, 2021), and will for the sake of brevity not be repeated here, though, for the sake of understanding and clarity, reference to them is encouraged. What is important to note is that, in building the theory, this second step – of conceptually abstracting the technical modality – may begin with the concrete artefact, in this case the language test. But when a theory is developed for applied linguistics, it is the mode of being, the technical, that is lifted out, compared and contrasted to other modalities, and held up for scrutiny, and not the concrete artefact, the test. When we look at the test itself, the concrete technically qualified applied linguistic artefact, we may thus observe that it carries the stamp of design, but that is not the only modality in which it functions. While the technical mode has a nuclear moment of design (Schuurman, 2009, p. 417; Strauss, 2009, p. 127), other distinctly different modalities are involved as well: The literature on language assessment . . . is replete with discussion of its social, ethical, economic, juridical and lingual dimensions, as we note in analyses of the technical appropriateness of a test for a certain population (its social side), of the impact of a test and its benefits or disadvantages for those who take it (its ethical concerns), its usefulness (relating to its analogical economic aspects), whether it does justice to the ability measured, also legally (a juridical consideration), or whether its results are interpretable and meaningful (its analogically lingual feature). These other modalities are referred to, or reflected, in the leading technical function of the test design. (Weideman, 2020, pp. 8–9) Through these other dimensions in which the language test (or plan, or course) functions, the relationship among them and the leading technical function of the applied linguistic artefact is demonstrated. The theory is built on the premise that there is a conceptual reflection of each of these within the technical, and that each ensuing echo of another sphere within


FIGURE 2.1  Coherence of the technical dimension with others (and their traces)

the technical yields a conceptual primitive. The further hypothesis is that each of the reflections of other modalities within the technical may yield an applied linguistic concept or idea that generates a broad design principle for the planning, development, refinement, and evaluation of applied linguistic solutions. Again, this chapter will not pursue a detailed explanation of this, since the references to such explanations given previously should be sufficient, but it will make explicit the relation between the realization of these principles in language test design in the discussion that follows. In Figure 2.1, we have a summary of the technical function of design, and the traces or analogies within its modal structure to other dimensions or facets of experience (adapted from Weideman, 2017a, p. 224). From this perspective, one may derive, successively, the design principles of unity in multiplicity (technical homogeneity), generated by the connection that the technical sphere has with the numerical facet of experience; of range and scope (echoes of the spatial within the technical); of reliability or consistency (relating the technical measurement to the kinematic aspect); of validity (a physical analogy); of differentiation (a trace of the organic function); of feeling and perception (echoing the sensitive dimension); of rational defensibility (relating to the analytical); and so on, to the principles of technical interpretability (a lingual analogy); of technical appropriateness, implementability, or fit (where the social comes into play in the design); of utility (a trace of the economic); of alignment (echoing the aesthetic); of accountability (in which the technical anticipates the meaning of the juridical); of fairness and care (ethical analogies); and of credibility (anticipating the certitudinal).
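
Since the preceding derivation compresses fourteen analogical moments into a single sentence, it may help to see the same mapping laid out as a simple data structure. The sketch below is our own illustrative rendering in Python, not part of the framework itself: it lists each modal aspect echoed within the technical together with the design principle the text derives from it, and phrases the list as a rough checklist a test designer might put to a design.

# The modal aspects echoed within the technical (design) function, paired with
# the design principle each echo yields, as enumerated in the text above.
DESIGN_PRINCIPLES = [
    ("numerical",    "unity in multiplicity (technical homogeneity)"),
    ("spatial",      "range and scope"),
    ("kinematic",    "reliability or consistency"),
    ("physical",     "validity"),
    ("organic",      "differentiation"),
    ("sensitive",    "feeling and perception"),
    ("analytical",   "rational defensibility"),
    ("lingual",      "technical interpretability"),
    ("social",       "technical appropriateness, implementability, or fit"),
    ("economic",     "utility"),
    ("aesthetic",    "alignment"),
    ("juridical",    "accountability"),
    ("ethical",      "fairness and care"),
    ("certitudinal", "credibility"),
]

def design_checklist(principles=DESIGN_PRINCIPLES):
    """Phrase the mapping as questions a test designer might ask of a design."""
    for aspect, principle in principles:
        yield f"Does the design realize {principle} (its {aspect} analogy)?"

for question in design_checklist():
    print(question)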


Nothing illustrates the general and broad sweep of each of these principles better than the various ways in which they can be implemented and applied. No principle is realized in a single, prescribed, or immutable manner. For every test design, the shape in which the test meets the general requirements may be different. The design principles exercise an appeal to language test designers, as technical agents or subjects, to give positive form and shape to each in their technically conceived measurements, the technical objects that they are crafting. There is an interplay here between technical subjects and technical objects. We shall return later to this important, complex applied linguistic idea of technical subject and object, but turn next to an illustration of how these principles function in ensuring the quality of the test design.

How principles of design are realized

In several analyses of existing language tests and courses, we find illustrations of how the design principles referred to in the previous section have been actualized. To illustrate adherence to the principle of technical homogeneity, deriving from the numerical analogies within the technical (see Figure 2.1), we could use the kind of example given by Van Dyk (2010, p. 152) and Weideman (2020, p. 66), of the factor analysis of the Test of Academic Literacy Levels for Postgraduate Students (TALPS) (Figure 2.2). The representation of the level of homogeneity in this pilot administration of TALPS may encourage test designers to look for reasons why

FIGURE 2.2  Factor analysis of TALPS: 2011 pilot


items 39–43, all belonging to a particular subtest, are outliers that do not group together with all the other items, across various subtests. An argument will have to be made about whether they form a technical unity within a multiplicity of measurement components within the parameters set by the designers. Of course, a factor analysis, generated by statistical methods associated with Classical Test Theory (CTT), is not the only measure of technical unity of a test: Weideman (2020, p. 67), for example, presents Rasch analyses (Linacre, 2018) deriving from the infit mean square readings of items in an Assessment of Language for Economics and Finance (ALEF) to indicate that “ALEF exhibits the expected technical integrity: its items show both homogeneity and overall fit.” As regards a second principle, relating to the technical scope or range of a designed applied linguistic intervention, Pretorius (2015, p. 219) relates this to two other design principles, consistency and validity, in arguing why the course for nurse communication she designed will have specified limits; we have italicized some of the spatial interconnections with the design to illustrate the coherence between the technical and the spatial: In order for the activities to be consistent and valid, the scope of the course needs to be well defined .  .  . . [T]his course is to be designed specifically for nursing staff to improve their communicative competence within the setting of nursing practice. This means that it is limited to communication that occurs between specific role players within the hospital/clinic setting. . . . For the purposes of this course, which needs to be designed within certain practical limitations, it will have to be designed specifically to aid interaction between the following role players: nursing staff, doctors and patients. In the case of language tests, a corresponding specification of their technical range or scope will be essential. These first two design principles, of technical homogeneity and scope, emanate, as we have indicated, from the analogical reflections of, respectively, the numerical and spatial aspects of experience. Similarly, Keyser (2017, p. 150) finds a realization of the principle of consistency or technical reliability, that derives from the echo of the kinematic dimension within the technical, in the various indices (in this case coefficient alpha, sometimes colloquially referred to as Cronbach’s alpha) that a statistical analysis of her second pilot for part of an Afrikaans test of academic literacy for postgraduate students (TAGNaS) has revealed (Table 2.1), before discussing the merits of each separate subtest. When we turn to the most primitive conception of validity, we observe how that centers on the productivity of each item, on evaluating, in other words, how every item will contribute to what the test intends to measure in

TABLE 2.1  Betroubaarheidskoëffisiënte van die subtoetse in Toetsfase 2 soos bepaal deur Iteman 4.3 [Reliability coefficients for the subtests in Test phase 2, as determined by Iteman 4.3]

    Subtest                                                               Alpha   SEM
    Alle items [All items]                                                 0.92   4.25
    Skommelteks [Scrambled text]                                           0.86   0.74
    Grafiese & visuele inligting [Graphic & visual information]            0.62   1.48
    Woordeskat (enkel) [Vocabulary (single word)]                          0.70   1.23
    Woordeskat (dubbel) [Vocabulary (two words)]                           0.40   0.56
    Tekstipe [Text type]                                                   0.66   0.82
    Teksbegrip [Text comprehension]                                        0.78   2.49
    Teksbegrip: Vergelyking tekste [Text comprehension: comparing texts]   0.53   0.96
    Grammatika & teksverband [Grammar & text relations]                    0.82   1.50
    Opsomming [Summarizing]                                                0.40   1.03
    Verwysings [Referencing]                                               0.84   0.75

an adequate way, all of which is clearly linked conceptually to the analogical physical moments within the technical sphere. The original physical concept of cause and effect is analogically reflected in the technical in the concept of technical cause that yields a technical effect. The technical cause in this case is the administration of the test, and the technical effect is the score, the result of its application.

To gauge item productivity at this most basic level, test designers will first of all set certain parameters for item difficulty or facility, usually aiming for a percentage correct or P-value of not less than 20, since a lower value would indicate that the item is too difficult, and of not more than 80, since a higher value would mean the item is too easy. Then they will apply a further parameter for item discrimination, usually in respect of the total point-biserial correlation (Rpbis), i.e., a correlation of the score for each item with the total score, or of its Rit (for item-rest correlation). These are customarily Pearson coefficient correlations, providing a measure of the discriminating power of items (CITO, 2005, p. 29), and conventionally aiming for discriminating power of 0.3 or more. To these first two parameters for item productivity, Van Dyk and Weideman (2004, p. 18), mindful of a broadening of the rudimentary concept to reach out to embrace construct validity, add that technical concept (of construct validity) as a further dimension, so that test designers may use a matrix, as in Table 2.2, to determine the level of acceptability of an item. Every item may, in that case, be adjudged to have either low or high productivity (i.e., discriminatory and facility value), and either low or high alignment with what the test has set out to test. In fact, the disclosure of the initial basic level of adequacy of a language test is being aligned here with a theoretically justifiable idea of what is measured.

TABLE 2.2  Matrix for determining whether an item contributes to the test

                                   productivity
                                   low              high
    alignment      high            acceptable       desirable
                   low             unacceptable     not ideal

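As an indication of how these parameters might be operationalized, the short sketch below computes, for a dichotomously scored set of pilot responses, each item's facility value (percentage correct) and its item–rest (point-biserial) correlation, and then classifies the item according to the matrix in Table 2.2. The thresholds are the ones just mentioned (facility between 20 and 80, discriminating power of at least 0.3); the alignment judgement has to be supplied by the analyst, since construct alignment cannot be read off the response data. The code and the toy score matrix are purely illustrative and do not reproduce the Iteman or Rasch analyses reported for TALPS, TAGNaS, or ALEF.

import numpy as np

def item_analysis(responses, aligned):
    """Classical item analysis for dichotomously scored items.

    responses: rows of 0/1 scores (test takers x items)
    aligned:   one boolean per item, the analyst's judgement of whether the
               item is aligned with what the test has set out to measure
    """
    responses = np.asarray(responses, dtype=float)
    report = []
    for i in range(responses.shape[1]):
        item = responses[:, i]
        rest = responses.sum(axis=1) - item            # total score minus the item itself
        facility = 100 * item.mean()                   # P-value: percentage correct
        rit = float(np.corrcoef(item, rest)[0, 1])     # item-rest (point-biserial) correlation
        productive = 20 <= facility <= 80 and rit >= 0.3
        # Cross productivity with the analyst's alignment judgement (Table 2.2)
        if aligned[i]:
            verdict = "desirable" if productive else "acceptable"
        else:
            verdict = "not ideal" if productive else "unacceptable"
        report.append((i + 1, round(facility, 1), round(rit, 2), verdict))
    return report

# A tiny made-up score matrix (8 test takers, 3 items), purely to show the output
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]
for item, facility, rit, verdict in item_analysis(scores, aligned=[True, True, False]):
    print(f"Item {item}: facility {facility}%, item-rest r {rit} -> {verdict}")

In a real analysis the same indices would of course be taken from the psychometric software used; the classification logic of Table 2.2 remains as simple as the last few lines suggest.
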
Such a theoretical defense of what it measured is referred to as the construct of language ability. It has been closely investigated in South African language assessment research, and is, according to some, the leitmotif of this work (Read, 2015, 2016; Weideman et al., 2020; NExLA, 2023). The detailed attention to theoretically justifying the construct in this work (Patterson & Weideman, 2013a, 2013b), relying conceptually on the links between the technical and analytical dimensions, is a deepening of the meaning not only of validity, but of responsible, deliberately intentional and rational language test design. While traditionally the relations among various subtests in a test of language ability have been viewed as indications of validity, Weideman (2021, p. 11) sees those relations as organic rather than physical analogical concepts, since they echo the functional contribution of each to a viable technical whole. The subtest inter-correlations and subtest–test correlations that he takes as examples again have to fit into certain parameters to ensure such technical viability: the inter-correlations of subtests preferably need to be lower, say between 0.2 and 0.5, while the test–subtest correlations need to be above 0.6 or even 0.7. In this way, each component of the technical whole contributes viably to its overall functioning, demonstrating one aspect of how the test fulfils the requirement of technical differentiation across functionally different tasks.

Let us take a few final examples of how the meaning of the technical is unfolded and further disclosed in the links between the pace-setting technical function of a language test and other dimensions of experience by referring again to four of the studies used as illustration here. In the first of these, Van Dyk (2010) addresses a range not only of constitutive applied linguistic concepts, such as those already referred to, but also shows how the lingual echoes in the technical allow test designers to probe the interpretation of test results, their meaningfulness, and their utility – the latter an economic echo. In considering the conceptual ways in which the technical anticipates the social aspect of human interaction, Keyser (2017) examined the condition of technical appropriateness. Her research into developing an Afrikaans postgraduate academic literacy test, TAGNaS ("Toets van Akademiese Geletterdheid vir Nagraadse Studente"), involved among other things finding an answer to the question: "Is there an appropriate fit between the test takers'


FIGURE 2.3  Wright map: person–item distribution map for ALEF

ability and the difficulty of the items?” To answer such questions, we can use a Rasch analysis, showing whether there is a desirably high degree of fit between the ability of test takers and the items used. Weideman (2020) uses this kind of analysis to show, on a Wright map (Figure 2.3) that in the refined, post-pilot version of ALEF, persons and items fit into the desired
parameters of between −3 and 3 logits (Van der Walt, 2012; Van der Walt & Steyn, 2007). As Weideman (2020, p. 71) remarks, the distribution of candidates in Figure 2.3 (the 'persons', on the left) indicates a fairly normal curve, a first, conventional indication of fit; and, although "item 42 again becomes noticeable as an outlier (it is the most difficult item), [and] two other items, 45 and 16, also show up at the other extreme (they are too easy, as is confirmed by the CTT analysis)," all still fall within the parameters set.

As Van Dyk (2010) does, Rambiritch (2012), in turn, focuses on the analogical lingual technical moment of conceptualizing the interpretation of test results, but has her sights on other lingual analogies as well, such as technical transparency – enhanced by the degree of adequacy with which information about the test has been disseminated – as well as other social reverberations within the technical that yield the idea of technical accessibility. These regulative ideas, she argues (Rambiritch, 2012, p. 213), "of transparency, accessibility and accountability in the design and development of TALPS [Test of Academic Literacy for Postgraduate Students]" eventually support and "tie the technical, qualifying dimension of the designed test to its juridical and ethical aspects." "The juridical analogies within the technical," she concludes,

    give rise to a condition or test design requirement that dictates that such a test must be . . . fair and publicly defensible, while the ethical analogies generate a design condition that calls for a sense of care and concern for others.
    (Rambiritch, 2012, p. 213; emphases added)

This merits a separate discussion, to which we now turn.

A new focal point for language test evaluation: justice and fairness

This chapter has argued that the technical design qualities of a measurement instrument are a necessary but insufficient condition for validity. Argumentbased validity theory does not negate that argument. Quite the contrary: it stresses the importance of design-based inferences and requires test developers to specify score interpretations and support those interpretations (claims) with evidence (data). While we recognize that argument-based validity theory considers robust measurement a precondition to valid score use (Kane, 2012; Kane et al., 2017), we argue that argument-based validity theory may have opened a pathway to disregarding some fundamental design principles of robust measurement.


To be clear, we do not argue that the field as such has relegated these responsibilities. In fact, one could argue that real-world test validation, as demonstrated by publicly available technical reports, to a large extent still relies on what Messick called the evidential basis of validity. Even ETS, the institution that has employed some of the foremost validity theorists, uses rather traditional conceptions of validity (construct, content, predictive, consequential) to demonstrate the soundness of its tests (ETS, 2021). This leaves us with an interesting discrepancy. Argument-based validity theory requires that test developers take responsibility for validating intended or foreseeable score use. As we have just noted, this type of practical validity research conducted by large scale corporations may still rely on rather classical conceptions of validity. If test users intend to use tests or scores for unintended purposes, argument-based validity theory would task them with providing a robust validity argument. We are not aware of any case, however, where this has actually happened. To the best of our knowledge, there is no evidence to suggest that score users such as policy makers or university administrators have been inclined to validate unintended score use, or even realize that they have been awarded this responsibility (Deygers et al., 2021; Deygers & Malone, 2019; O’Loughlin, 2011). Perhaps that is the biggest weakness when argument-based validity is applied to contentious real-world test usage; that in the case of unintended score use, the responsibility for validation lies with organizations that are disinclined to take it. As a case in point: there are no indications that the use of tests for ethically questionable purposes has decreased since the dawn of argument-based validity theory – quite the contrary (Elder et al., 2019; Rocca et al., 2019). If it is true that argument-based validity theory does not offer a credible or reliable solution to purposeful test misuse by parties who are disinclined to engage in validation processes, we must conclude that the theory may not have provided a practical application of Messick’s concern for test consequences. The problem of unethical score use remains unanswered. In fact, most ethical discussions in language testing remain unresolved, because of various reasons – one of which is the nature of the relationship between validity, fairness, and justice. A  detailed discussion exceeds the scope and purpose of this chapter, so we paint with a broad brush (but see Deygers, 2019). Even so, it would be accurate to state that fairness has been conceptualized narrowly and broadly. The former view is typically limited to psychometrically investigating item bias, while the latter also focuses on equal access, rater severity, sensitivity reviews, and the like (McNamara & Roever, 2006). Some exceptions notwithstanding, most of the language testing literature (e.g., Kunnan, 2000) considers fairness to be connected to construct relevance and to the evidential basis of validity (AERA, 2014; McNamara & Ryan, 2011; Messick, 1989), though the exact relationship between the two


The less clearly defined concept of justice focuses on the function a test serves in society, on the values it represents, and on the impact it has on a test-taker population (Deygers, 2017; McNamara & Ryan, 2011). In the testing literature, justice has been conceptualized as a component of the consequential basis of validity (McNamara & Ryan, 2011; Messick, 1989) or as near-synonymous with validity (Davies, 2008, p. 491; 2010). In discussing issues of test use and justice, Davies (2013) even refers to ethics as the doppelgänger of validity. Almost a decade later, we take Davies's point, but state it as a problem. Though another perspective on the relation is possible, if it is true that ethical test use – justice – is to be seen as part and parcel of validity, it will be to the detriment of both. Indeed, if ethical test use is seen as synonymous with validity, but if validating unintended test use is seen as a test-external matter – the responsibility for which is relegated to a party that is disinclined to take it – then argument-based validity theory has effectively ensured that ethical score use for unintended purposes is nobody's responsibility.

Based on this reasoning, we would argue that argument-based validity theory may be unable to address test misuse by external parties. If the aim is to discourage ethically indefensible test use, what is required is not validity theory, but an acute understanding of the actual distribution of power in society. What is needed is the realization that language testers play a comparatively minor role in the policy context (Deygers et al., 2021; Lo Bianco, 2001, 2014). What is essential, in short, is for language testing as a field to develop policy literacy and the will to engage with policy makers on the basis of empirically founded measurement-based arguments that speak to them. As a field, it is important to recognize the effect of alignment or misalignment among applied linguistic interventions (Weideman, 2019c), and the problematic disconnect between language policy and language assessment (Deygers, 2017, p. 149; Deygers et al., 2017). The contribution by West and Thiruchelvam (2024) in this volume highlights a serious disconnect among language policy, language assessment, and language instruction within a tertiary institution.

Building on the renewed focus in language assessment on justice and fairness (e.g., Kunnan, 2000, 2004), Deygers (2018, 2019) and McNamara et al. (2019) have augmented the initial work on this in a closer examination of these applied linguistic ideas. The question posed at the beginning of this section is once again relevant: What might be sufficient conditions for test quality in light of considerations of justice and fairness? Since prominence has been given in this chapter to the realization of design principles in test development, we may take the several principles that Deygers (2017, p. 159) mentions: a just testing policy must treat test takers with dignity and have respect for their privacy; must be backed by empirical evidence; will have consulted all primary stakeholders, thus ensuring accountability; and even high-stakes testing policies should be just.


The application of such principles will be a step towards sufficient attention in test development to responsible test design. Such is the level of interest in the juridical and ethical connections between the technical instruments that measure language ability, often on a national scale, and the political and moral dimensions of society, that pleas have now arisen for test results obtained at the national–global interface to be democratically negotiated (Addey et al., 2020). With new contributions and insights gaining in prominence, these focal points for the evaluation of language assessment are likely to increase in significance in the future. As we have pointed out, they take us beyond validity, in both the orthodox and the classical sense. Before we return to that point in the Conclusion, we provide a final note on where, in our opinion, an argument-based approach to validity and validation has made a positive contribution.

A note on subjective validation and objective technical validity

All of the principles of test design previously mentioned have more than one kind of realization, as well as a number of further dimensions that have been dealt with only briefly here. What do these design principles have to say about the validity of language tests, and their validation? The contribution that the notion of argument-based demonstrations of test adequacy has made is to view that integration of evidence as a subjective process of validation. One may view this kind of process as a technically stamped one, since the argument is neither a legal nor a political one (though some would contest that), nor a mere academic argument: instead, it is a justificatory argument relying on what Van Dyk (2010) has called a technical-analytical procedure. We take all this to mean that the technical process of validation will be guided by normative principles of technical procedure.

Contrary to the current orthodoxy, however, we would claim that language tests are objective applied linguistic artefacts with any number of characteristics. 'Objective' here does not have the popular meaning of 'unbiased' or 'scientific': it simply indicates that the language test we design, develop, then examine and evaluate, is a technically qualified object. That designed technical object stands in relation to technical subjects who, either as design teams or affected others, need to consider its quality or technical adequacy. Apart from its adequacy or validity, those technical subjects will, usually through a process that is a subjective validation guided by agreed normative procedures, also examine a multiplicity of other dimensions in which the technical object functions. As technical subjects, they will specifically need to determine whether the test has been responsibly designed, but also whether it is functioning, and has been employed responsibly.


It has been the argument of this chapter that the notion of policy literacy extends the range of technically involved subjects or agents who need to evaluate language test quality and use. The relation between subjective validation and technical validity is therefore one of a multiplicity of technical subject-object relations that are inescapable in respect of a theory of applied linguistics.

Conclusion

It should be clear, in light of the previous arguments, that conceptually we have moved well beyond technical validity. The central idea of test validity has been gradually disclosed: from its rudimentary basis in the analogical physical concept of technical cause and effect – the working of the measurement instrument – to a consideration of organic overtones (in the principle of technical differentiation), even echoing the sensitive dimension of experience in the notion of face validity – the intuitive technical appeal that an instrument might have – to its being fulfilled in the theoretical defensibility of the ability it seeks to measure (its construct). These unfolding concepts have taken us much further. We would do well to begin to see that it is not so much 'validation' that we are undertaking, but what Deygers (2018) calls "test evaluation". In responsibly adhering to principles of technical meaningfulness, appropriateness, utility, alignment, accountability, fairness, and compassion, we are beyond designing language tests that are valid. We have arrived at the point of applying a set of responsible design principles, to which test designers and test users must give positive form in the language tests they develop or employ.

Recommended further reading, discussion questions and suggested research projects

Further reading

For those who have only just started to engage with language testing and assessment, a good introduction will be Read (2018). For a book with more detail, Fulcher (2010) will be highly suitable; Green (2014) provides another informative introduction. Inevitably, you will have to gain some understanding of statistics, and how such analyses are employed in language assessment. For this, Green (2013) and Lowie and Seton (2013) are good starting points. Once the basic concepts of language assessment are understood, you may want to engage with the philosophical concepts and ideas behind developments in the field. Use Weideman (2018) for that.


Fulcher, G. (2010). Practical language testing. Hodder Education.
Green, A. (2014). Exploring language assessment and testing: Language in action. Routledge.
Green, R. (2013). Statistical analyses for language testers. Palgrave Macmillan.
Lowie, W., & Seton, B. (2013). Essential statistics for applied linguistics. Palgrave Macmillan.
Read, J. (2018). Researching language testing and assessment. In B. Paltridge & A. Phakiti (Eds.), Research methods in applied linguistics: A practical resource (pp. 286–300). Bloomsbury Academic.
Weideman, A. (2018). Positivism and postpositivism. In C. A. Chapelle (Ed.), The concise encyclopedia of applied linguistics. Wiley & Sons. https://doi.org/10.1002/9781405198431.wbeal0920.pub2

Discussion questions

1. In the debate between 'realists' and 'interpretivists' in validity theory, which side, in your opinion, has the most persuasive arguments?
2. In your own reading, how often do you note that synonymous concepts for 'validity' are employed, in an effort to avoid saying that 'the test (being discussed) is valid'? Read Messick (1989) for example, and look out for 'adequacy' or, in other discussions, concepts like 'quality', 'effect(ive)', 'working', 'operation', or their opposites, e.g. 'construct irrelevant'.
3. This chapter proposes a process of evaluation of test quality through the employment of a set of principles for responsible test design. In your opinion, would this be a less or more practical way of doing so than the argument-based approach suggested by Kane?
4. Can test validation be equated with responsible test design? Is responsible test design, as conceptualized in this chapter, less or more comprehensive than what is conventionally understood as 'validation'?
5. If justice and fairness are indeed becoming more prominent in language test development and use, where, in your opinion, do ideas like accessibility, transparency, accountability and the alignment of policy with tests fit in?

Suggested research projects

1. Weideman (2020) suggests a responsible approach to test design that utilizes a number of claims which can be made in respect of the early evaluation of test quality. The analysis (a) examines these claims of test quality by using Classical Test Theory and Rasch analyses, and (b) brings together warrants for the claims in the form of an argument backed by evidence. Apply that methodology to any other language test that you have encountered.


2. Apply the 14 principles of responsible design mentioned in Figure 1 (and explained further here as well as in the literature referred to) not to a language test, but either to (a) a language policy within an institution (such as a state or a university) or (b) a language course or curriculum (as in Pretorius, 2015). Your aim is to demonstrate that the principles apply across applied linguistic interventions.
3. Refer to Deygers (2017, p. 159) and apply the six principles for a just test to a test with which you are familiar. Develop and justify your own methodologies to investigate whether the test you have chosen to evaluate conforms sufficiently to these principles. Point out both where the requirements are fulfilled, and where they may fall short, by gathering and presenting evidence to support your conclusions.
4. This chapter encourages "language testing as a field to develop policy literacy and the will to engage with policy makers on the basis of empirically founded measurement-based arguments". Design a set of questionnaires to investigate the language and language testing policy literacy of test administrators and other users of language tests that you are familiar with, pilot and administer them, and write up the results.

References

Addey, C., Maddox, B., & Zumbo, B. D. (2020). Assembled validity: Rethinking Kane's argument-based approach. Assessment in Education: Principles, Policy & Practice, 27(6), 588–606. https://doi.org/10.1080/0969594X.2020.1843136
AERA. (2014). The standards for educational and psychological testing. American Educational Research Association.
Bachman, L. F. (2001). Designing and developing useful language tests. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, T. McNamara & K. O'Loughlin (Eds.), Experimenting with uncertainty: Essays in honour of Alan Davies (pp. 109–116). Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. https://doi.org/10.1037/0033-295X.111.4.1061
CITO. (2005). TiaPlus user's manual. M & R Department.
Davies, A. (2008). Accountability and standards. In B. Spolsky & F. M. Hult (Eds.), The handbook of educational linguistics (pp. 483–494). Blackwell. https://doi.org/10.1002/9780470694138
Davies, A. (2010). Test fairness: A response. Language Testing, 27(2), 171–176. https://doi.org/10.1177/0265532209349466
Davies, A. (2013). Fifty years of language assessment. In A. J. Kunnan (Ed.), The companion to language assessment. John Wiley & Sons. https://doi.org/10.1002/9781118411360


Davies, A., & Elder, C. (2005). Validity and validation in language testing. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 795–813). Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410612700
Deygers, B. (2017). Just testing: Applying theories of justice to high-stakes language tests. International Journal of Applied Linguistics, 168(2), 143–163. https://doi.org/10.1075/itl.00001.dey
Deygers, B. (2018). Book review: Antony John Kunnan: Evaluating language assessments. Language Testing, 36(1), 154–157. https://doi.org/10.1177/0265532218778211
Deygers, B. (2019). Fairness and justice in English language assessment. In X. Gao (Ed.), Second handbook of English language teaching. Springer (Handbooks of Education). https://doi.org/10.1007/978-3-319-58542-0_30-1
Deygers, B., Bigelow, M., Lo Bianco, J., Nadarajan, D., & Tani, M. (2021). Low print literacy and its representation in research and policy. Language Assessment Quarterly (published online ahead of print). https://doi.org/10.1080/15434303.2021.1903471
Deygers, B., & Malone, M. E. (2019). Language assessment literacy in university admission policies, or the dialogue that isn't. Language Testing, 36(3), 347–368. https://doi.org/10.1177/0265532219826390
Deygers, B., Van den Branden, K., & Van Gorp, K. (2017). University entrance language tests: A matter of justice. Language Testing, 35(4), 449–476. https://doi.org/10.1177/0265532217706196
Elder, C., Knoch, U., & Harradine, O. (2019). Language requirements for Australian citizenship: Insights from a Senate enquiry. In C. Roever & G. Wigglesworth (Eds.), Social perspectives on language testing (pp. 73–88). Peter Lang.
ETS. (2021). GRE general test fairness and validity (for test takers). www.ets.org/gre/revised_general/about/fairness/
Fulcher, G. (2015). Re-examining language testing: A philosophical and social inquiry. Routledge.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. Routledge.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527
Kane, M. T. (2010). Validity and fairness. Language Testing, 27(2), 177–182. https://doi.org/10.1177/0265532209349467
Kane, M. T. (2011). Validating score interpretations and uses: Messick lecture, Language Testing Research Colloquium, Cambridge, April 2010. Language Testing, 29(1), 3–17. https://doi.org/10.1177/0265532211417210
Kane, M. T. (2012). Articulating a validity argument. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 34–48). Routledge. https://doi.org/10.4324/9781003220756
Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198–211. https://doi.org/10.1080/0969594X.2015.1060192
Kane, M. T., Kane, J., & Clauser, B. E. (2017). A validation framework for credentialing tests. In C. W. Buckendahl & S. Davis-Becker (Eds.), Testing in the professions: Credentialing policies and practice (pp. 20–41). Routledge. https://doi.org/10.4324/9781315751672-2


Keyser, G. (2017). Die teoretiese begronding vir die ontwerp van 'n nagraadse toets van akademiese geletterdheid in Afrikaans [The theoretical justification for the design of a postgraduate test of academic literacy in Afrikaans]. MA dissertation, University of the Free State, Bloemfontein. http://hdl.handle.net/11660/7704
Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 1–14). University of Cambridge Local Examinations Syndicate. https://doi.org/10.1002/9781118411360.wbcla144
Kunnan, A. J. (Ed.) (2000). Studies in language testing 9: Fairness and validation in language assessment. Cambridge University Press.
Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context (Studies in Language Testing 18, pp. 27–45). Cambridge University Press. https://doi.org/10.1177/02655322211057040
Linacre, J. M. (2018). A user's guide to WINSTEPS Ministep: Rasch-model computer programs. Winsteps.
Lo Bianco, J. (2001). Policy literacy. Language and Education, 15(2–3), 212–227. https://doi.org/10.1080/09500780108666811
Lo Bianco, J. (2014). Dialogue between ELF and the field of language policy and planning. Journal of English as a Lingua Franca, 3(1), 197–213. https://doi.org/10.1515/jelf-2014-0008
McNamara, T. (2003). Looking back, looking forward: Rethinking Bachman. Language Testing, 20(4), 466–473. https://doi.org/10.1093/elt/ccs041
McNamara, T., Knoch, U., & Fang, J. (2019). Fairness, justice and language assessment: The role of measurement. Oxford University Press.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Blackwell.
McNamara, T., & Ryan, K. (2011). Fairness versus justice in language testing: The place of English literacy in the Australian Citizenship Test. Language Assessment Quarterly, 8(2), 161–178. https://doi.org/10.1080/15434303.2011.565438
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35(11), 1012–1027. https://doi.org/10.1037/0003-066X.35.11.1012
Messick, S. (1981). Evidence and ethics in the evaluation of tests. Educational Researcher, 10(9), 9–20. https://doi.org/10.3102/0013189X010009009
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & I. H. Braun (Eds.), Test validity (pp. 33–45). Lawrence Erlbaum Associates. https://doi.org/10.1002/j.2330-8516.1986.tb00185.x
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education/Collier Macmillan.
Network of Expertise in Language Assessment [NExLA]. (2023). Bibliography. https://nexla.org.za/research-on-language-assessment/. Accessed 27 April 2023.
O'Loughlin, K. (2011). The interpretation and use of proficiency test scores in university selection: How valid and ethical are they? Language Assessment Quarterly, 8(2), 146–160. https://doi.org/10.1080/15434303.2011.564698
Patterson, R., & Weideman, A. (2013a). The typicality of academic discourse and its relevance for constructs of academic literacy. Journal for Language Teaching, 47(1), 107–123. https://doi.org/10.4314/jlt.v47i1.5


Patterson, R., & Weideman, A. (2013b). The refinement of a construct for tests of academic literacy. Journal for Language Teaching, 47(1), 125–151. https://doi.org/10.4314/jlt.v47i1.6
Popham, W. J. (1997). Consequential validity: Right concern – wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13. https://doi.org/10.1111/j.1745-3992.1997.tb00586.x
Pretorius, M. (2015). The theoretical justification of a communicative course for nurses: Nurses on the move. MA dissertation, University of the Free State, Bloemfontein.
Rambiritch, A. (2012). Accessibility, transparency and accountability as regulative conditions for a post-graduate test of academic literacy. PhD thesis, University of the Free State, Bloemfontein. http://hdl.handle.net/11660/1571
Read, J. (2015). Assessing English proficiency for university study. Palgrave Macmillan.
Read, J. (Ed.) (2016). Post-admission language assessment of university students. Springer.
Rocca, L., Carlsen, C. H., & Deygers, B. (2019). Linguistic integration of adult migrants: Requirements and learning opportunities. Report on the 2018 Council of Europe and ALTE survey on language and knowledge of society policies for migrants. Council of Europe.
Schuurman, E. (2009). Technology and the future: A philosophical challenge (Trans. H. D. Morton). Paideia Press. [Originally published in 1972 as Techniek en toekomst: Confrontatie met wijsgerige beschouwingen, Van Gorcum]
Strauss, D. F. M. (2009). Philosophy: Discipline of the disciplines. Paideia Press.
Tannenbaum, R. J. (2018). Validity aspects of score reporting. Contribution to Symposium 4, Language Testing Research Colloquium 2018 (Auckland): Reconceptualizing, challenging, and expanding principles of test validation. LTRC.
Van der Walt, J. (2012). The meaning and uses of test scores: An argument-based approach to validation. Journal for Language Teaching, 46(2), 141–155. https://doi.org/10.4314/jlt.v46i2.9
Van der Walt, J., & Steyn, H. Jr. (2007). Pragmatic validation of a test of academic literacy at tertiary level. Ensovoort, 11(2), 138–153.
Van Dyk, T. (2010). Konstitutiewe voorwaardes vir die ontwerp en ontwikkeling van 'n toets vir akademiese geletterdheid [Constitutive conditions for the design and development of a test of academic literacy]. PhD thesis, University of the Free State, Bloemfontein. http://hdl.handle.net/11660/1918
Van Dyk, T., & Weideman, A. (2004). Finding the right measure: From blueprint to specification to item type. Journal for Language Teaching, 38(1), 15–24. https://doi.org/10.4314/jlt.v38i1.6025
Weideman, A. (2009). Constitutive and regulative conditions for the assessment of academic literacy. Southern African Linguistics and Applied Language Studies, Special Issue: Assessing and Developing Academic Literacy, 27(3), 235–251.
Weideman, A. (2017a). Responsible design in applied linguistics: Theory and practice. Springer International Publishing. https://doi.org/10.1007/978-3-319-41731-8
Weideman, A. (2017b). Chapter 12: The refinement of the idea of consequential validity within an alternative framework for responsible test design. In J. Allan & A. J. Artiles (Eds.), Assessment inequalities: Routledge world yearbook of education (pp. 218–236). Routledge. https://doi.org/10.4324/9781315517377


Weideman, A. (2019a). Degrees of adequacy: The disclosure of levels of validity in language assessments. Koers, 84(1). https://doi.org/10.19108/KOERS.84.1.2451
Weideman, A. (2019b). Validation and the further disclosures of language test design. Koers, 84(1). https://doi.org/10.19108/KOERS.84.1.2452
Weideman, A. (2019c). Definition and design: Aligning language interventions in education. Stellenbosch Papers in Linguistics Plus, 56, 33–48. https://doi.org/10.5842/56-0-782
Weideman, A. (2020). Complementary evidence in the early stage validation of language tests: Classical test theory and Rasch analyses. Per Linguam, 36(2), 57–75. https://doi.org/10.5785/36-2-970
Weideman, A. (2021). Context, construct, and validation: A perspective from South Africa. Language Assessment Quarterly. https://doi.org/10.1080/15434303.2020.1860991
Weideman, A., Read, J., & Du Plessis, T. (Eds.) (2020). Assessing academic literacy in a multilingual society: Transition and transformation. Multilingual Matters. https://doi.org/10.21832/9781788926218
West, G. B., & Thiruchelvam, B. (2024). "It's not their English": Narratives contesting the validity of a high-stakes test. In this volume.
Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(2), 147–170. https://doi.org/10.1177/0265532209349465

PART II

Agency and empowerment prompted by test adequacy

3
THE RACIALIZING POWER OF LANGUAGE ASSESSMENTS

Casey Richardson

Introduction

Validity and critical language testing

The social and political role of tests has consistently been overlooked despite calls from scholars like Messick (1981, 1989) to validate tests not just in terms of psychometric measurement but in terms of their influence in society (see also Shohamy, 1998). Messick's work extended traditional psychometric conceptualizations of validity to include consideration of the ways in which social, psychological, cultural, and political factors influence the value implications and consequences of test interpretation and use (1981, 1988, 1989, 1996).

The key questions are whether the potential and actual social consequences of test interpretation and use are not only supportive of the intended testing purposes, but at the same time are consistent with other social values. Because the values served in the intended and unintended outcomes of test interpretation and test use both derive from and contribute to the meaning of the test scores, the appraisal of social consequences of testing is also seen to be subsumed as an aspect of construct validity.
(Messick, 1989, p. 18)

Though Messick, Shohamy, and other scholars have long argued that the social consequences of tests be considered in construct validity, test scores continue to have deleterious effects on students.


Shohamy later expanded upon Messick's validity work within the context of language testing and conceptualized critical language testing (CLT; Shohamy, 1998). CLT examines the use and consequences of tests in schools and society. The practical applications of CLT include minimizing test misuse by publicizing the un/intended consequences of tests (Shohamy, 1993, 2001a), and by promoting more democratic means of assessment¹ (2001). CLT's critical approach is evident from Shohamy's description that

testing is not neutral. Rather, it is both a product and an agent of cultural, social, political, educational and ideological agendas that shape the lives of individual[s]. . . . [CLT] views test takers as political subjects in a political context.
(1998, p. 332)

Though some language testers have shifted their focus to understanding the mis/use of tests as gatekeeping tools (Spolsky, 1997), e.g., in determining migration or citizenship eligibility (Extra et al., 2011; Hoang & Hamid, 2017; Hogan-Brun et al., 2009; McNamara & Shohamy, 2008²), test and policymakers largely have not taken accountability for the consequences of test mis/use (Shohamy, 2001a). While scholars like Shohamy explicitly attend to the political context of test mis/use, most practitioners fail to investigate how race and racism are embedded within language testing policies and practices.

To understand modern co-constructions of race and language and how they serve to maintain white hegemony, this chapter outlines nation-state governmentality and raciolinguistic ideologies that uphold perceptions of linguistic deficiency with reference to English learners (ELs). The chapter also explores two tenets of critical race theory to better understand the extent to which tests yield intentionally harmful effects on ELs in particular. Two US-centric standardized tests are offered as examples through which to view the disparate, high-stakes consequences faced by ELs. This chapter concludes with a call to challenge current testing policies, practices, and testers, and alternatively puts forward more democratic and equitable assessments that support ELs instead of penalizing them.

Nation-state governmentality and raciolinguistic ideologies

Notwithstanding test misuse and validity issues, standardized tests indicate what is socially and culturally considered "legitimate" knowledge. The linguistic varieties utilized in standardized tests reflect what is employed by language users who are themselves considered socially and culturally "legitimate". Language has historically operated as a tool to control and categorize people, as examined by critical, socio-, and raciolinguistics scholars alike.


In her exploration into c/overt language policies, Shohamy (2006) described the establishment of a hegemonic "national" language that served as the "legitimate", prestigious standard against which all other languages or varieties were considered deficient and their users of lower status:

In the same way that there were "correct" people who were associated with the "right" groups, there were also "correct" languages and those born into those languages, entitled "native speakers", who were born into the group and therefore belonged. . . . The notion of a language as a closed and finite system was in direct parallel to the idea of a nation-state as a closed and finite society to which only certain people had the legitimacy to belong.
(Shohamy, 2006, p. 31)

The governmentality of nation-states depended on language-based stratifications to maintain social order, upholding the notion that a dominant language was necessary for national unity. Rosa and Flores contended that:

Nation-state/colonial governmentality relied on raciolinguistic ideologies that positioned colonized populations as inferior to idealized European populations. . . . The raciolinguistic ideologies that organized these colonial relations continue to shape the world order in the postcolonial era by framing racialized subjects' language practices as inadequate for the complex thinking processes needed to navigate the global economy, as well as the targets of anxieties about authenticity and purity.
(2017, p. 627)

Historical distinctions of Europeanness (whiteness) versus non-Europeanness (non-whiteness) are still being re/constructed to define what is "appropriate," e.g., standardized academic English. Yet "standard" and "academic" language are less objective categories than labels used to stigmatize certain communities' linguistic practices as deviant and in opposition to the imagined, idealized English and whiteness of "legitimate" subjects of nation-states (Flores & Rosa, 2015). Indeed, raciolinguistic ideologies portray racialized people as linguistically different and inferior to the unmarked white norm even when they model standardized norms (Rosa & Flores, 2017). Such ideologies uphold white hegemony by construing monolingualism as the norm to which all racialized people should aspire (Flores, 2013). Monolingual standards pervade what knowledge is tested.


Authorities, i.e., speakers of the dominant language in decision-making roles, employ language tests to create national and collective identities (Ricento, 2006; Shohamy, 2006). Shohamy suggested that "it became clear to those in authority in the nation-state that controlling language was an important symbol and indication of status and power and consequently of control over major resources and people in society" (2006, p. 30). In this sense, language tests reproduce institutionalized hierarchies of racial and linguistic legitimacy. Raciolinguistic ideologies promote the belief that changing deficient linguistic practices leads to the elimination of racial hierarchies, which neglects the real though often covert ways in which the racial order, e.g., white hegemony, delimits what is construed as "appropriate" (Flores & Rosa, 2015; Rosa & Flores, 2017). The onus on English learners (ELs) to change their linguistic habits is central to Flores and Rosa's raciolinguistic work, which emphasizes the numerous ways racialized communities are perceived to have inherently inferior language practices that must be corrected and policed, regardless of how well they approximate standardized academic English.

In the US, the students most at risk of bearing this burden are the roughly five million K–12 students designated as English language learners – three-quarters of whom use Spanish as their first language (National Center for Education Statistics, 2023). Though English (language) learners is not the label oft-adopted by raciolinguistic scholars, English learners (ELs) is used here to reflect the binary presently re/constructed between already English-proficient children, on whom standardized tests are normed, and those still developing English proficiency. Herein ELs is employed to encompass the students identified by the US Department of Education's Office for Civil Rights (OCR) and the Arizona Department of Education as entitled to language assistance programs. The standardized testing experiences of ELs, and the consequences deriving from such practices, are problematized in US K–12 contexts in this chapter. The two cases of federal and state-level legislation illuminate the ways in which common testing practices that are framed as objective means to remedy English learners' linguistic deficiencies uphold racial and socioeconomic hierarchies.

English learner testing validity issues

As student populations in countries like the United States have become increasingly multicultural and multilingual, there is a need to understand the use of tests and their impact/s on diverse populations. This section reviews a primary issue in the language testing of English learners (ELs) in English: the linguistic complexity of the test versus the linguistic skills of the EL. Researchers have long questioned the validity of testing ELs for content knowledge (e.g., math or science) when they are not yet English proficient, given that linguistic factors have an impact on EL test outcomes (Solano-Flores, 2008; Solano-Flores & Li, 2006; Solano-Flores & Trumbull, 2003).


Proficiency in the language prevalent on standardized achievement tests, often referred to as "academic" language,³ takes at minimum several years to develop, according to second language acquisition scholarship. The question of when ELs have enough proficiency for the results of their tests to be valid is not the focus of this chapter and has been discussed elsewhere (e.g., Valdés & Figueroa, 1994). Yet testing ELs with such instruments before they are proficient in this linguistic variety is at best not appropriate, owing to issues of construct validity. It also raises questions about why ELs are not assessed in ways through which they could more fully display their linguistic repertoires of multiple, dynamic language proficiencies (see e.g., García & Wei, 2014; Otheguy et al., 2015) for more accurate indications of their content knowledge.

For test scores to more meaningfully reflect the content knowledge of ELs, language accommodations have been proposed – the most effective of which minimize sources of construct irrelevance (Abedi, 2012, 2016; Pennock-Roman & Rivera, 2011). Research has shown that ELs benefit from certain relevant accommodations, such as customized dictionaries of non-content terms, bilingual glossaries (Abedi, 2012; Abedi et al., 2004; Pennock-Roman & Rivera, 2011), and language modifications to reduce the linguistic complexity of test items (Abedi et al., 2004; Abedi & Lord, 2001). Studies of the linguistic complexity of test items have highlighted how achievement tests are to some degree language proficiency tests (Abedi, 2004; Abedi & Dietel, 2004; Solano-Flores & Trumbull, 2003). Consequently, EL test scores continue to be influenced by their language proficiency (Gándara & Baca, 2008; Menken, 2008; Solórzano, 2008; Wolf & Leon, 2009; Wright & Li, 2008). ELs who have recently arrived or those who have limited proficiency may not yet be equipped with the language skills necessary to perform at the same levels as their English-proficient peers. Their scores lack accuracy and may in fact be an underestimation of ELs' content knowledge. The validity issues summarized here are a mere starting point for the complexities in testing ELs. The ensuing section introduces critical race theory (CRT) to provide another means of examining (language) tests and their impacts.
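The measurement concern running through this section can be stated schematically. The decomposition below is an illustrative sketch, not a formula taken from this chapter or from the studies cited; it assumes, for simplicity, that the components are uncorrelated, and the symbols are purely notational.

```latex
% Illustrative decomposition of observed score variance on a content test
% administered in English (schematic only; symbols are not from the chapter).
% Requires \usepackage{amsmath} for \underbrace.
\[
\sigma^2_{X}
  \;=\; \underbrace{\sigma^2_{\text{content}}}_{\text{construct-relevant}}
  \;+\; \underbrace{\sigma^2_{\text{English proficiency}}}_{\text{construct-irrelevant for a math or science test}}
  \;+\; \sigma^2_{e}
\]
```

On such a view, accommodations like bilingual glossaries or linguistically simplified items are attempts to shrink the middle, construct-irrelevant term without altering the construct-relevant one.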

Critical race theory

Critical race theory (CRT) emerged out of critical legal studies (CLS) in the late 1970s from a group of scholars of color who questioned the lack and slow pace of racial reform in the post–Civil Rights era (Delgado, 1995).


Foundational tenets of CRT are that race is central to understanding inequity (Delgado & Stefancic, 2001), and that racism is normal (Delgado, 1995) and a permanent, endemic fixture of society (Bell, 1992). Ladson-Billings summarized the paradigmatic shift of critical race theorists away from traditional CLS after "CLS scholars critique[d] mainstream legal ideology for its portrayal of US society as a meritocracy but failed to include racism in its critique" (1998, p. 11). Critical race theorists challenged ahistoricism (Lawrence et al., 1993) and white supremacy, and committed to highlighting experiential knowledge and voices of color through counter-storytelling and interdisciplinarity. CRT's commitment to social justice and political activism sought to transform the relationships among race, racism, and power (Delgado & Stefancic, 2001). Next, the chapter expands upon two additional tenets of CRT: interest convergence (Bell, 1980) and critique of liberalism (Crenshaw, 1988).

Interest convergence

Derrick Bell theorized the interest convergence principle (1980), which asserts that society at large supports minority interests (e.g., benefits, rights) only when they align with those of the majority. Put another way, steps toward racial equality are taken only when they align with white interests. Bell famously explained interest convergence within the judicial context of Brown v. Board of Education (1954). The Brown court case ruled that race-based school segregation was unconstitutional. According to Bell (1980), the court ruling was motivated not by altruism or morality, but by political factors. Brown only required that Black children had the opportunity to attend the same schools as white children. The advances from the Supreme Court decision outlawing de jure segregation partly extended to the students of color whose schools followed the desegregation orders, but ultimately extended to white majority interests in the aftermath of World War II.⁴ Hence, by ruling against the separate but equal doctrine (Plessy v. Ferguson, 1896), whites upheld their material interests (Bell, 1980). For instance, the Brown ruling served as a face-saving measure amid Cold War anxieties in which white political elites needed to minimize the threatening spread of communism⁵ and to increase international credibility. To do so, southern whites accepted a work force of non-white laborers as part of the industrialization of the South (from an agrarian society) after the war. Bell's interest convergence principle illuminated how the dismantling of segregation beyond just schools boosted the economic interests of southern whites by increasing their number of prospective investors. While Black communities ostensibly benefitted from the landmark Brown decision, racial progress lagged behind the gains of white political elites.

The moderate reforms stemming from Civil Rights movements such as Brown did not challenge underlying structural racism or yield real change for racialized communities.


Omi and Winant (2015) described the Civil Rights era as a time in which race as an organizing principle for the distribution of resources, rights, and privileges became destabilized when racial segregation was no longer legally permitted; consequently, the US shifted to a system built on perpetuating hegemony rather than domination through racial projects. As a result of Brown outlawing de jure segregation, Aggarwal (2016) contended that a new racial formation emerged in which racialized students and their families came to be viewed as deficient and in need of interventions. Ideologically, white society believed that desegregation was sufficient to give minoritized students an education equal to that of white students, so any remaining gaps in educational achievement were attributed not to material disparities but to deviant individual behaviors (Aggarwal, 2016). As such, judicial and legislative activity affecting English learners focused on their development of academic English, which was construed to be the primary barrier to EL success in US schools. But Flores has argued that "to bracket the broader political and economic issues confronting racialized communities and to focus solely on linguistic solutions places the onus on racialized communities to undo their own oppression through the modification of their language practices" (Flores, 2020, p. 29). Placing the responsibility on ELs to alter their position in society by changing their linguistic behaviors is sustained through the interest convergence principle (as white interests remain unthreatened) and through myths of meritocracy and colorblindness (see Critique of Liberalism).⁶ True structural change to promote EL success rarely advances because it does not align with white interests.

Interest convergence has been used as an explicatory tool (Alemán & Alemán, 2010) to explain perceived remedies for racial equality and the ways in which majority interests are continually upheld throughout seeming steps toward racial progress. It is as an explicatory tool that interest convergence is used in this chapter to address two cases of EL language testing and the consequences of hyperfocusing on achievement gaps⁷ and perceived language deficiencies as opposed to focusing on structural change that could disrupt white hegemony.

Critique of liberalism

Founding critical race theorists, sometimes referred to as Crits (see e.g., Crenshaw et al., 1995), insisted on a critique of liberal antiracism, hereafter liberalism, defined by cautious and incremental Civil Rights litigation and faith in the legal system and equality. Crits became impatient with the lack of progress under liberalism and argued that real racial progress and social transformation must be radical and sweeping (Crenshaw, 1988).


Given that race has been a determining factor in who was included in certain rights and opportunities, critical race theorists rejected the notion that the law was neutral and that the law could be a site of radical reform in the interest of racialized people. They showed how liberalism impedes radical racial reform through colorblindness, which ignores racist policies and practices (DeCuir & Dixson, 2004) because of the refusal to acknowledge racial reality – exacerbated now by the perception that the US is a postracial society. The CRT critique of liberalism challenged and deconstructed traditional claims of colorblindness, equal opportunity, meritocracy, neutrality, and objectivity of the law (Delgado & Stefancic, 2001; Delgado Bernal & Villalpando, 2002; Ladson-Billings, 1998; Ladson-Billings & Tate, 1995). For example, Tate (1997) asserted that these claims in legal contexts served as "camouflages for the self-interest of powerful entities of society" (p. 235). Ubiquitous are the claims of colorblindness, equality, meritocracy, neutrality, and objectivity that help to promote white hegemony and perpetuate a racist social order.

Critical race theorists also exposed institutions for their complicity in reproducing social inequities (Bell, 1980; Delgado & Stefancic, 2001), in a similar way that Shohamy insisted on the accountability of language testers:

Language testers cannot remove themselves from the consequences and uses of tests and therefore must also reject the notion of neutral language testing. Pretending it is neutral only allows those in power to misuse language testing with the very instrument that language testers have provided them. Language testers must realize that much of the strength of tests is not their technical quality but the way they are put to use in social and political dimensions.
(Shohamy, 1998, p. 343)

In schools, the myth of meritocracy is rampant. The notion of merit is used to justify the use of standardized tests that are perceived to be fair instruments to measure a construct, but this ignores the realities of diverse student populations, where poor students are most likely to attend schools that lack funding and qualified teachers. Minoritized students are not likely to perform as well on standardized tests due to these educational disparities, but colorblind and meritocratic tests and test practices ignore such complexities and instead blame the individual for not working hard enough to improve their status in society. The dominant claims critiqued by CRT fail to acknowledge the permanence of racism and how deeply entrenched it is within society – particularly in institutions like schools. Answering why tests have so much power in maintaining social hierarchies, Shohamy wrote, "Tests use the language of numbers . . . Numbers are symbols of objectivity, scientificity and rationality, all features which those feed into illusions of truth, trust, legitimacy, status and authority" (1998, p. 338).


Because of this, tests are tools that go unchallenged and reaffirm the dominant group's power, making it easier for them to enact policy – in some cases hidden language policy (Shohamy, 2006). In the US, language tests in English legitimize standardized academic English and devalue other languages, establishing monolingualism and the assimilation of racialized communities such as English learners as the underlying expectation. The following sections consider two cases of EL language testing in the US and the ways in which the tests further racialize the ELs they arguably intend to support.

English learner language test cases (2) in the US

National English learner legislative policies and testing practices

One of the most obvious uses of tests as powerful tools is the high-stakes testing mandated by No Child Left Behind (NCLB), which sought to hold schools, districts, and states accountable for student performance, i.e., minoritized student performance, in return for federal funding. NCLB became law in 2002 and has greatly influenced educational policies and (language) testing practices ever since. NCLB was a reauthorization of the 1965 Elementary and Secondary Education Act (ESEA)⁸ – a Civil Rights law that provided funds to school districts that served poor students. Moving from the ESEA's goal of every student having access to an education and guaranteeing educational services to poor children, NCLB sought to raise the achievement of all students and close the achievement gaps of poor and minority students, including English learners, through a newly devised system of testing and accountability. In 2015, the Every Student Succeeds Act (ESSA) altered NCLB to give more flexibility⁹ to states in determining how and when to test, and when to intervene in schools/districts considered failing. Both pieces of federal legislation aimed to improve the quality of instruction and increase outcomes for students. NCLB and ESSA shared parallel goals of policy-driven education reform, focusing heavily on high standards to be enforced through high-stakes testing. Under NCLB/ESSA, English learners were expected to demonstrate gains in learning core content (i.e., English language arts, math, and science) and in increasing their English proficiency annually, to demonstrate improvement in learning English (Menken et al., 2014).

Standardized tests have been framed as a colorblind means of measuring all students without bias, but there has been a blatant disregard for the fact that tests like those required by NCLB have been normed on white middle-class children. The following sections review the ways in which standardized tests lack construct and consequential validity, and how test scores are used in ways that further marginalize ELs rather than advancing educational equity for them.


Increasing evidence reveals that academic content tests are to some degree language proficiency tests. Abedi (2004) argued that the scores from EL assessments are not accurate indicators of ELs' content knowledge due to language factors. For example, tests administered in English like those mandated by NCLB/ESSA that are intended to assess academic subject knowledge are partly language tests, since language proficiency mediates test performance for ELs (Menken, 2008, 2010). Indeed, English proficiency may be a source of construct-irrelevant variance in achievement tests in English (Abedi et al., 2003), where variance in scores is not due to the assessment of the intended construct (Llosa, 2016). Though achievement tests often confound language and academic skills, language proficiency is a distinct construct that must be measured separately, though the content–language link makes it difficult to do so (Llosa, 2016). Solórzano (2008) has provided a comprehensive review of the various (validity) issues that exist when using results from high-stakes testing for decisions affecting ELs.

The increased focus on English acquisition through such legislation has not positively impacted ELs. Because language proficiency impacts performance, ELs are always considered low performing on tests administered through English (Menken, 2010). For example, Abedi and Dietel (2004) found that ELs tend to score an average of 20–50 percentage points lower than their English-proficient counterparts on state tests. According to Menken, "due to the accountability mandates of NCLB, schools serving large numbers of ELLs are also disproportionately likely to be labeled 'failing' and at risk of sanctions" (2009, p. 112; 2010, 2013). In this era of accountability, EL scores have profound consequences for students, teachers, and schools, ranging from individual high school graduation or grade promotion ramifications to school closures. Even though ELs already disproportionately attend high-poverty schools with limited resources (Fry, 2008; Menken, 2010), their schools are further penalized with diminished funding by the accountability mandates of NCLB/ESSA. Jackson (2011) employed interest convergence to explain how the failure of minoritized students to perform well on tests perceived to be objective is constructed as reflecting the failure of the students, thus upholding white interests. She further observed that:


Schools with higher concentrations of economically disadvantaged and minority students are punitively sanctioned based on lower collective scores. Consequently, schools that were under-funded and ill resourced to begin with are asked to do more with even less. Parents with adequate cultural capital, transportation, and time can transfer their children to schools with higher AYP [adequate yearly progress] rankings while low income and minority students, whose parents cannot transfer them, are left to perish in severely under-resourced schools. This strategic shepherding of white children to better performing schools, and rewarding schools with more white children who test higher, serves whites' material interests by helping to ensure that white students are better educated than students of color.
(Jackson, 2011, p. 443)

Given the sanctions that schools could face based on low test scores, some students, e.g., minoritized students such as ELs, are encouraged to drop out. Using standardized testing scores for high-stakes decisions like federal funding allotments and interventions is further complicated by EL reclassification practices. When ELs are redesignated as fluent English proficient, they move out of the EL subgroup in a process that has been referred to as a "revolving door" (Abedi, 2004; Ramsey & O'Day, 2010), since their progress is no longer counted toward the EL group once their period of monitoring is complete (two years under NCLB; four years under ESSA). As a result, the EL subgroup will never close the gap, which is likely to be overestimated, between itself and the group of English-proficient peers (Saunders & Marcelletti, 2013). The resulting EL achievement gap provides political fodder to continue the pursuit of alleged interventions to rectify disparities in test performance.

While NCLB/ESSA is framed as supporting equitable education for all students, ELs continue to be blamed and targeted as in need of remediation. Indeed, ELs are more likely than others to be punished for not passing high-stakes standardized tests (Menken, 2009). Not only has testing policy led to de facto language policy (Menken, 2008, 2009; Shohamy, 2001b, 2008), such as the validation of monolingual standardized academic English practices, but the test impact has extended to schools that serve large populations of ELs. Scholars have found a causal link between the passing of NCLB and the closure of bilingual education programs due to accountability pressures (Menken, 2008, 2013; Menken & Solorza, 2013, 2014). Succumbing to such pressures is unsurprising given the increased public exposure to perceived achievement gaps as a result of such testing (Shohamy, 1993). The dismantling of bilingual education programs shows the covert ways in which racialized children's multilingualism is considered an impediment to their English acquisition and to schools' performance on accountability measures. Instead of legislative remedies purportedly aimed at increasing ELs' academic achievement elevating their language practices, NCLB/ESSA testing practices have resulted in greater negative media attention on perceived deficiencies and in inferior language development models, e.g., English-only, becoming the norm.


State legislative English learner policies and testing practices

The decline in bilingual education programs is evidenced in states like Arizona. In 2000, Arizona followed California's lead in voting in favor of an English-only initiative titled "English for the Children." The ballot initiative, Proposition 203, was passed by a majority of voters who were persuaded to believe that English learners were trapped in Spanish-dominant bilingual education classes, which led to their academic failure, e.g., low test scores. The proposed solution, structured English immersion (SEI), effectively eliminated bilingual education programs within the state. Uncertainty surrounded how to implement the new language program model and how to fund the education of Arizona ELs.

Beginning in 1992 with Flores v. Arizona, years of litigation sought to establish how much funding was required to resolve English language development program deficiencies. The Flores plaintiffs, the parents of children in the Nogales Unified School District, alleged that Arizona was not fulfilling the Equal Educational Opportunities Act (1974) requirement to provide language support so that ELs could meaningfully participate in their classes. Arizona appeared largely unbothered by the arbitrary funding allotted to EL programming and, by extension, by the academic success of ELs schooled in the state. Only in the face of increasing funding sanctions did the state make efforts to resolve the matter. As a result, in 2006 the Arizona legislature passed House Bill (HB) 2064, which created an ELL Task Force to address concerns with EL schooling. These issues included monitoring ELs, developing a sound English language development (ELD) educational model for language services, and establishing the amount of funding necessary for the model. Indeed, the Task Force was charged with overseeing EL educational policies. This shift in oversight was denounced years later in a report published by the Morrison Institute for Public Policy:

The issue of adequately educating ELLs needs to be removed from the scope of political ideologies at the Legislature. For instance, policymakers' political beliefs need to be moderated to allow rigorous empirical evidence to support the funding and implementation of appropriate K–12 ELL instructional programs. . . . The ELL Task Force should fall within the oversight of the State Board of Education instead of as a separate entity.
(Jiménez-Castellanos et al., 2013, p. 15)

In 2013, the Task Force disbanded, and the Board of Education again became responsible for implementing SEI. However, change has been slow.

The remainder of this section reviews issues with the validity of EL language assessments and test misuse within the context of Arizona post-2000. Arizona ELs fill out a home language survey to determine their eligibility for language services. Issues with home language surveys have been documented elsewhere (e.g., Abedi, 2008) and are outside the scope of this chapter. What is noteworthy about the use of most home language surveys is that the response of any language other than English, even if it is in addition to English, triggers the next step of the EL identification process, evidencing the pervasive belief that languages other than English are barriers to student success and threats to national unity (Zentella, 1997). While EL assessments continue to be seen as neutral in determining the need for EL interventions, this chapter illuminates how these tools are part of a pervasive gatekeeping system in which language assessments reinforce language policies and ideologies (Messick, 1981; Shohamy, 2007). Arizona’s instrument to measure English language proficiency is the Arizona English Language Learner Assessment (AZELLA), which is used to both place and re/assess ELs’ proficiencies at pre-emergent, emergent, basic, intermediate, and proficient levels. Any ELs who score below “proficient” are slated for SEI instruction. The Arizona Department of Education (ADE) claimed in the AZELLA Technical Manual (ADE & Harcourt, 2007) to have followed the Standards for Educational and Psychological Testing (AERA et al., 1999) in judging the quality and fairness of AZELLA in its design, yet the cut scores are inconsistent (Martinez-Wenzl et  al., 2012) with the process of setting AZELLA cut scores lacking clarity. Florez reported that: The AZELLA Manual gives no rationale for the substantial differences between the recommended scores and the final cut scores. It also fails to document how final cut scores were determined and the expertise of the person or persons who set the final scores. The substantial differences between the final cut scores and the recommended scores call into question the validity of the entire standard-setting process. If the test developers were confident in the standard-setting process, why were the recommended cut scores rejected? (2010, p. 13) Moreover, Florez explained that “the procedure used to set the cut scores is criticized by national measurement experts as ineffective and obsolete” (Florez, 2010, p.  2). Using AZELLA test scores of questionable validity for high-stakes decisions (e.g., as the sole re/classification criterion for ELs) garnered the attention of local and federal government agencies. An Auditor General report found an increase in ELs being reclassified as “proficient” between fiscal years 2008 versus 2009 and 2010. Providing
a possible explanation, Davenport reasoned that “as one would expect, greater percentages of students at the intermediate level are associated with higher reclassification rates, as this is the proficiency level immediately preceding proficient” (Davenport, 2011, p. 15). Yet the changes documented by Davenport in ELs’ proficiency levels with many more starting at “intermediate” gave credence to Florez’s claim that “artificially elevated reclassification rates would obfuscate efforts to accurately evaluate the efficacy of Arizona’s SEI program” (2010, p.  16). In other words, it might have been in the state’s interest to unintentionally ignore inconsistencies or intentionally set cut scores that led ELs to be reclassified even before they had acquired enough English to be successful and meaningfully participate in mainstream classes without language services. These potentially overlooked inaccuracies that resulted in ELs exiting from language services before they were ready helped to maintain white interests by giving the government the illusion of caring about and remedying EL educational disparities, contributing to the maintenance of white hegemony. The AZELLA came under federal review by the US Department of Education’s Office for Civil Rights (OCR) and the Department of Justice (DOJ) Civil Rights Division. An investigation between 2006 and 2012 found that the AZELLA under-identified tens of thousands of potential ELs, prematurely exited identified ELs, and reclassified ELs as “proficient” who were “unable to participate meaningfully in the regular education environment and need[ed] further English language supports” (Resolution Agreement, 2012). The OCR and DOJ monitored the implementation of conditions set by the 2012 Agreement and notified ADE that it was still out of compliance with its obligations in January 2016. Again ADE had failed to provide evidence of cut score validity and that ELs who scored “proficient” had attained sufficient English proficiency to equally participate in classes without English language services (Resolution Agreement, 2016). Upon review, the AZELLA did not accurately indicate the English proficiency level required for ELs to be successful in mainstream classes. While test results and impact have historically been disguised behind the illusion of neutrality, investigations have found that Arizona failed to comply with multiple judicial remedies that targeted the needs of ELs. A similar interrogation into credibility could be made into the members of the Task Force and the one-size-fits-all SEI classes they adopted for Arizona ELs. Against sound second language acquisition and educational evidence (see e.g., Faltis & Arias, 2012), the Task Force adopted a four-hour SEI block where ELs were to learn discrete language skills as opposed to academic content. This decontextualized, language-specific instruction has been criticized by various scholars throughout its implementation. Three major concerns with this SEI model are that for the majority of the school day 1) ELs are segregated from their English-proficient peers, 2) ELs are
deprived of academic content, and 3) ELs are missing out on core credits, resulting in lower graduation rates (Gándara & Orfield, 2012; Lillie et al., 2012; Martinez-Wenzl et al., 2012; Rios-Aguilar et al., 2012). These concerns, among several others, were echoed at the ELL Summit in 2016 hosted by the Arizona Latino Advisory Committee (Cruze et al., 2019). The participants raised concerns about the AZELLA that by assessing English proficiency only once a year, EBs [emergent bilinguals] are likely to score lower (given their lack of language exposure over the summer), which prevents EBs from accessing appropriate language and content instruction for an entire school year. (Cruze et al., 2019, p. 448) After all, ELs who do not in effect pass by testing “proficient” on the AZELLA must continue to partake in SEI classes that academically and socially isolate them – a reality that has worried teachers (e.g., Rios-Aguilar et al., 2012). Valdés (2001) referred to these hyper-segregated conditions as ESL ghettos, where Latinx ELs fall farther and farther behind due to a lack of opportunity and resources. The consequences of being physically and linguistically segregated in SEI classes have raised major concerns as to the ways ELs are harmed (Rios-Aguilar et al., 2012), including a documented higher risk of school failure and dropout (Gándara & Orfield, 2012), especially as the decision for ELs to be placed in SEI classes is based solely on AZELLA scores that fall below the “proficient” threshold. Not only have the achievement gaps not improved between ELs and English-proficient students in Arizona, but they are greater (see e.g., Garcia et al., 2012), and there are no demonstrated improvements in academic achievement or English proficiency following the implementation of SEI (Martinez-Wenzl et al., 2012; Rios-Aguilar et al., 2012). As this review of the consequences of SEI instruction makes clear, Arizona’s EL testing practices are a prime example of how using a single test score to make high-stakes decisions is questionable in terms of validity and fairness (Kopriva, 2008; Solórzano, 2008). In a more equitable world, the AZELLA that perpetuates monolingual hegemonic whiteness (Flores, 2016) would no longer exist; instead, it would be replaced by multiple, more inclusive assessments that more appropriately inform educators about what ELs have learned and their potential need for additional (linguistic) support. Also in this world, Arizona EL students would never be subjected to a decontextualized SEI model of ELD instruction as EL community members and other qualified individuals with access to sound educational and linguistic research as well as relevant personal and professional experiences would be consulted in the development of EL-responsive pedagogical methods and models.

English learner language tests’ intended consequences

As the previous sections illustrate, test misuse continues today despite Civil Rights efforts to, arguably, better serve English learners in schools. It is clear that “the use of monolingual tests needs to be contextualized within this political and social reality in which they operate” (Shohamy, 2011, p. 420). As the consequences of high-stakes tests make abundantly clear, the reality in the US (and elsewhere) is that race and racism are endemic to society. Seen through the tenets of CRT, liberal claims of objectivity and neutrality sustain authorities’ use of questionably valid tests such that tests continue to flourish. At the same time, raciolinguistic ideologies guide test use and interpretation in re/constructing racial hierarchies and perpetuating gap discourses that blame ELs for a lack of English proficiency that they did not work hard enough to overcome, according to dominant claims of colorblindness, equality, and meritocracy. The principle of interest convergence shows how tests function in predictable ways to maintain white interests. The legislative remedies for EL achievement at both national (NCLB/ESSA) and state levels (Proposition 203 & HB 2064) contribute to the racialization of ELs rather than yield structural change from which ELs would be served more equitably in schools. Like other EL language tests that exist, the examples provided here prioritize native monolingual, standardized academic English even though US ELs and people worldwide communicate in multilingual contexts. These cases demonstrate the overlap between test policies and de facto language policies that devalue linguistic practices of racialized individuals who do not conform to what is constructed as “legitimate” English. It is not the suggestion of this chapter to use interest convergence as a political strategy to persuade testers and policymakers to create better tests, as this would avoid the racial reckoning necessary to enact real reform. What is suggested is that CRT is a helpful tool to understand (language) tests and their consequences not as naïve or unintended but as intended to effectuate white hegemony.

Moving forward

The previous sections have shown how standardized language tests in the US reproduce social hierarchies and raciolinguistic ideologies. Tests like the AZELLA and those required under NCLB/ESSA continue to go unchallenged so long as they are viewed as objective and neutral, making them (and their impacts) all the more insidious. While these tests are unique to the US context, the examination of such tools and the consequences that manifest from their use can inform testers and educators elsewhere about the need for more responsible forms of assessment.

Until there is an increased critical focus on how race and racism are deeply embedded in testing practices, the misuse of language tests will continue. In the meantime, the seemingly radical proposal by Valdés and Figueroa (1994), to have a national moratorium and suspension of English learner testing while high-stakes decisions are being made on the test scores, is all the more justifiable. Shohamy (2001b) has suggested that tests violating the principles of democratic practices, i.e., those violating the rights of test takers, be forbidden, that testers be held responsible, and that the public be made aware10 of test misuse. However, as long as testers are white psychometric experts who lack experience as racialized individuals, no significant changes are to be expected in the efforts to dismantle white hegemony. Flores too has declared that standardized literacy assessments rooted in colonial relations of power and created by white test designers “can never be used as a tool for combatting [racialized bi/multicultural students’] marginalization” (2021a, p. 2). Under these conditions, ELs and their literacy practices will continue to be devalued. To truly challenge raciolinguistic ideologies and the racial hierarchies they sustain, the following list is proposed as a starting point for language (test) practitioners.

• Similar to grow-your-own programs, testers and policymakers must reflect the communities being affected, e.g., tests must be designed by/with the impacted communities, recognizing their experiential knowledge and expertise. This requires training and compensating a new generation of multicultural, multilingual testers and policymakers whose experiences more closely represent those of the increasingly diverse student population in the US. This also requires that racialized future testers or policymakers have access to and succeed in higher education and/or that the required credentials for these professions shift to value personal experience in addition to academic skills.
• Multiple assessment tools, e.g., interviews, observations, portfolios, simulations, must be employed and their results considered in decision-making processes.
• ELs must be permitted access to their full linguistic repertoire of dynamic language skills to more accurately identify what students have learned as well as to validate and elevate the status of multilinguals within society.
• Assessment agencies must provide detailed yet precise, practical feedback to local communities on areas of strengths and weaknesses, with trainings provided by experts who identify with the constituents of that community.
• Sanctions must cease.

Interrogating and reforming assessments in contexts of misuse and mistreatment will not do enough to rectify the long-standing resistance against equitably serving ELs and other racialized communities. This list was crafted knowing that (language) tests are likely to persist, yet radical
change hinges on the collective recognition that tests and the institutions that employ them are extensions of a racist society. In sincere efforts to dismantle white hegemony, tests grounded in colonial logics have no place. A continued racial reckoning is required of testers, teachers, administrators, and policymakers to set the stage for real structural change that has the ability to positively impact millions of students in the US alone.

Recommended further reading, discussion questions and suggested research projects

Further reading

For those interested in reading more about assessment practices that can be applied to advance maintenance bilingualism, Knoester and Meshulam’s (2022) journal article elucidates two alternative approaches to high-stakes standardized tests (descriptive reviews and recollections) that support the growth of dual-language students’ academic, cultural, and linguistic identities. García and Solorza’s (2021) article also details the standards movement coupled with the ideological invention of academic language to show how translanguaging skills are constructed as “non-academic” and the resulting negative and disproportionate impact on Latinx ELs. Awarded the best book in language testing in 2016, the Handbook edited by Fulcher and Harding (2021) is indispensable to a wide audience of stakeholders including researchers, students, teachers, policymakers, and practitioners such as test developers. The comprehensive second edition embraces timely and practical revisions, e.g., a completely new section on assessing language skills, and thirty-six chapters written by fifty-one specialists of diverse backgrounds.

Fulcher, G., & Harding, L. (Eds.) (2021). The Routledge Handbook of Language Testing (2nd ed.). Routledge. https://doi.org/10.4324/9781003220756
García, O., & Solorza, C. R. (2021). Academic language and the minoritization of U.S. bilingual Latinx students. Language and Education, 35(6), 505–521. https://doi.org/10.1080/09500782.2020.1825476
Knoester, M., & Meshulam, A. (2022). Beyond deficit assessment in bilingual primary schools. International Journal of Bilingual Education and Bilingualism, 25(3), 1151–1164. https://doi.org/10.1080/13670050.2020.1742652

Discussion questions

1. Can test validation be equated with responsible test design? Is responsible test design, as conceptualized in this chapter, less or more comprehensive than what is conventionally understood as “validation”? Justify your answer.

2. Consider your local K–12 school/district. What are its EL assessment practices? Does the school/district use multiple assessments to determine an EL’s English proficiency or does it just use standardized test scores? If ELs do not pass, what are the consequences?
3. Review your state’s K–12 standardized assessment policies and practices. Are ELs permitted a grace period before being tested on core content (e.g., English language arts, math, science)?
• Are they allowed (language) accommodations? If so, what kind/s?
• After consulting Abedi (2012) and Abedi et al. (2004), justify whether your state should approve (language) accommodations and which type/s.
• Do you believe that all students, not just ELs, should have access to (language) accommodations? Explain why or why not.
4. This chapter argues that as long as (EL) assessment practices go unchallenged, they maintain white interests through the interest convergence principle (Bell, 1980). Taking into account your responses to questions 2–3, is the interest convergence principle in operation in your educational context? Provide evidence to support your claim.

Suggested research projects

Researcher

A) Locate where you can find EL-specific test scores and high school graduation rates at the school, district, and/or state level. Record which EL subgroups, if any, are identified on the basis of race, ethnicity, socioeconomic status, gender, sex, national origin, or first language/s, for example. How do these EL test scores and graduation rates compare to those of non-ELs? What stands out to you from the data, and why? Draft a research question based on the data, and find sources to support this research question. Write an annotated bibliography of four to five sources. B) This chapter encourages “an increased critical focus on how race and racism are deeply embedded in testing practices.” Find a school, district, or state’s requirements for high school graduation. Are there any tests or exit exams required to graduate? Are there any required of ELs specifically? What do these tests purportedly measure? What alternatives or substitute tests are offered, if any? What stands out to you, and why? Accordingly, draft a research question, and find sources to support this research question. Write an annotated bibliography of four to five sources.

Educator

Design a set of questions to better comprehend the beliefs of EL stakeholders at your site with respect to EL tests and their consequences, including test impact and washback (see e.g., Tsagari & Cheng, 2017). Pay particular attention to any reference/s to equality, meritocracy, neutrality, and/or objectivity. You might interview or survey just one of the following groups of stakeholders or you might choose to interview or survey stakeholders across multiple groups to compare. Write up the results.

• English language development teachers
• Content teachers
• Test administrators
• School/district administrators

Advocate

Search your state’s legislation website for bills that are currently under consideration that will affect ELs. Which terminology do they use to refer to ELs and what might this reveal about the state’s stance on multilinguals? Draft an infographic or other text to communicate to a lay audience an overview of the bills that you’re most concerned about, including how they would influence ELs if passed. In designing the text, consider which genre will have the greatest ability to impact the greatest number of people who normally vote or who could be persuaded to vote on such topics. Pilot the text on your family and colleagues and revise as necessary. Post publicly.

Notes

1 Shohamy (2001b) maintained that democratic processes are those in which power is shared amongst stakeholders, from elites and elite institutions to local community members. To enact more democratic practices, testing companies such as Educational Testing Service, McGraw-Hill, and Pearson Education, to name but a few, would need to invite representation by and sustain active collaborations with local communities to protect those being tested. Such participation could hold powerful institutions more accountable in test construction and administration.
2 See also Language Assessment Quarterly volume 6, issue 1 (2009).
3 Flores has challenged Cummins’ generally accepted cognitive academic language proficiency (CALP) and has offered the alternate definition: “the idealized context-embedded language practices of the American bourgeoisie” (2013, p. 276).
4 Hernandez v. Texas (1954) was a landmark case that preceded Brown. It too addressed the 14th Amendment, though for Mexican-Americans. Again majority interests were upheld when the court ruling was examined historically and through the interest convergence principle. For example, Delgado (2006) found much overlap between Brown and Hernandez such as interests converging to 1)
mitigate Latinx veterans’ discontent over their lack of rights in the US after fighting in WWII, 2) improve the global reputation of the US, and 3) try to minimize communism in Latin America.
5 Soviet propaganda illuminated the inequities and segregation apparent in the US, emphasizing US hypocrisy in pursuing democracy abroad, and increasing the need for Black soldiers who fought in WWII to be reassured they would be equal citizens under the 14th Amendment after returning to the US.
6 See also Nelson Flores’s (2021b) presentation for an overview of Cheryl Harris’s whiteness as property, and its links to raciolinguistic ideologies and the dehumanization of people of color and English learners.
7 “Achievement gap” is used here to model the deficit language popularized by standardized tests; however, “opportunity gap” is a more appropriate term that reframes the focus to be on the input children/students receive (based on class and ethnoracial stratifications) as opposed to any perceived or real shortcomings in terms of output.
8 NCLB’s Title III “Language Instruction for Limited English Proficient and Immigrant Students,” Part A “English Language Acquisition, Language Enhancement, and Academic Achievement Act” effectively eliminated Title VII of ESEA (the Bilingual Education Act of 1968), devaluing the maintenance of first/heritage languages/literacies and removing developmental bilingual education language program funding (Gándara & Baca, 2008; Menken, 2008; Wiley & Wright, 2004).
9 Under ESSA, states could exclude recently arrived EL test results from accountability measures the first year, with EL test results being phased in for accountability purposes with a growth measure the second year and included in traditional accountability measures by the third year.
10 See e.g., “Language test activism” (LTA) and its European examples for a model of LTA that informs public opinion and encourages teachers to adopt active roles to prevent test misuse and the harmful consequences of tests (Carlsen & Rocca, 2022).

References

Abedi, J. (2004). The No Child Left Behind Act and English language learners: Assessment and accountability issues. Educational Researcher, 33(1), 4–14. https://doi.org/10.3102/0013189X033001004 Abedi, J. (2008). Classification system for English language learners: Issues and recommendations. Educational Measurement: Issues and Practices, 27(3), 17–31. https://doi.org/10.1111/j.1745-3992.2008.00125.x Abedi, J. (2012). Validity issues in designing accommodations. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 48–61). Taylor & Francis Group. https://doi.org/10.4324/9780203181287 Abedi, J. (2016). Utilizing accommodations in the assessment of English language learners. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education: Language testing and assessment (2nd ed., pp. 303–322). Springer Science + Business Media. https://doi.org/10.1007/978-3-319-02261-1_21 Abedi, J., & Dietel, R. (2004). Challenges in the No Child Left Behind Act for English-language learners. Phi Delta Kappan, 85(10), 782–785. https://doi.org/10.1177/003172170408501015 Abedi, J., Hofstetter, C. H., & Lord, C. (2004). Assessment accommodations for English language learners: Implications for policy-based empirical research.
Review of Educational Research, 74(1), 1–28. https://journals.sagepub.com/ doi/pdf/10.3102/00346543074001001 Abedi, J., Leon, S., & Mirocha, J. (2003). Impact of student language background on content-based performance: Analyses of extant data. CSE Report No. 603. University of California, Center for the Study of Evaluation/National Center for Research on Evaluation, Standards, and Student Testing, Los Angeles. https:// cresst.org/wp-content/uploads/R603.pdf Abedi, J.,  & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14(3), 219–234. https://doi.org/10.1207/ S15324818AME1403_2 Aggarwal, U. (2016). The ideological architecture of whiteness as property in educational policy. Educational Policy, 30(1), 128–152. http://dx.doi.org/10.1177/ 0895904815616486 Alemán, E., Jr., & Alemán, S. M. (2010). “Do Latin@ interests always have to ‘converge’ with White interests?”: (Re) claiming racial realism and interest-convergence in critical race theory praxis. Race, Ethnicity and Education, 13(1), 1–21. https://doi.org/10.1080/13613320903549644 American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. American Educational Research Association. www.aera.net/Portals/38/1999%20Standards_revised.pdf Arizona Department of Education & Harcourt Assessment, Inc. (2007). Arizona English language learner assessment: Technical manual. Arizona Department of Education; Harcourt Assessment, Inc. Bell, D. A., Jr. (1980). Brown v. Board of Education and the interest-convergence dilemma. Harvard Law Review, 93(3), 518–533. https://doi.org/10.2307/ 1340546 Bell, D. A. (1992). Faces at the bottom of the well: The permanence of racism. BasicBooks. Carlsen, C. H., & Rocca, L. (2022). Language test activism. Language Policy, 21, 597–616. https://doi.org/10.1007/s10993-022-09614-7 Crenshaw, K. W. (1988). Race, reform, and retrenchment: Transformation and legitimation in antidiscrimination law. Harvard Law Review, 101(7), 247–284. https://doi.org/10.2307/1341398 Crenshaw, K. W., Gotanda, N., Peller, G., & Thomas, K. (1995). Introduction. In K. W. Crenshaw, N. Gotanda, G. Peller & K. Thomas (Eds.), Critical race theory: The key writings that formed the movement (pp. xiii–xxxii). The New Press. Cruze, A., Cota, M., & López, F. (2019). A decade after institutionalization: Educators’ perspectives of structured English immersion. Language Policy, 18(3), 431–453. https://doi.org/10.1007/s10993-018-9495-1 Davenport, D. K. (2011). Arizona English language learner program: Fiscal year 2010. Office of the Auditor General. www.azauditor.gov/sites/default/files/ ELL_Report.pdf DeCuir, J. T., & Dixson, A. D. (2004). “So when it comes out, they aren’t surprised that it is there”: Using critical race theory as a tool of analysis of race and racism in education. Educational Researcher, 33(5), 26–31. https://doi.org/10.3102/ 0013189X033005026

Delgado, R. (Ed.). (1995). Critical race theory: The cutting edge. Temple University Press. Delgado, R. (2006). Rodrigo’s roundelay: Hernandez v. Texas and the interestconvergence dilemma. Harvard Civil Rights-Civil Liberties Law Review, 41(1), 23–65. https://scholarship.law.ua.edu/fac_working_papers/653 Delgado, R., & Stefancic, J. (2001). Critical race theory: An introduction. New York University Press. Delgado Bernal, D.,  & Villalpando, O. (2002). An apartheid of knowledge in academia: The struggle over the “legitimate” knowledge of faculty of color. Equity & Excellence in Education, 35(2), 169–180. https://doi.org/10.1080/ 713845282 Extra, G., Spotti, M., & Avermaet, P. V. (2011). Language testing, migration and citizenship cross-national perspectives on integration regimes. Continuum. Faltis, C. J., & Arias, M. B. (2012). Research-based reform in Arizona: Whose evidence counts for applying the Castañeda Test to structured English immersion models?. In M. B. Arias & C. J. Faltis (Eds.), Implementing educational language policy in Arizona: Legal, historical and current practices in SEI (pp. 21–38). Multilingual Matters. https://doi.org/10.21832/978184769746 Flores, N. (2013). Silencing the subaltern: Nation-state/colonial governmentality and bilingual education in the United States. Critical Inquiry in Language Studies, 10(4), 263–287. https://doi.org/10.1080/15427587.2013.846210 Flores, N. (2016). A  tale of two visions: Hegemonic Whiteness and bilingual education. Educational Policy, 30(1), 13–38. http://dx.doi.org/10.1177/ 0895904815616482 Flores, N. (2020). From academic language to language architecture: Challenging raciolinguistic ideologies in research and practice. Theory into Practice, 59(1), 22–31. https://doi.org/10.1080/00405841.2019.1665411 Flores, N. (2021a). A  raciolinguistic perspective on standardized literacy assessments. Linguistics and Education, 64(2), 1–3. https://doi.org/10.1016/j. linged.2020.100868 Flores, N. [NYU Urban Initiative]. (2021b, January 20). Tracing the specter of semilingualism in bilingual education policy [Video]. YouTube. www.youtube.com/ watch?v=vC6V62OCnjo Flores, N.,  & Rosa, J. (2015). Undoing appropriateness: Raciolinguistic ideologies and language diversity in education. Harvard Educational Review, 85(2), 149–171. https://doi.org/10.17763/0017-8055.85.2.149 Florez, I. R. (2010). Do the AZELLA cut scores meet the standards?: A  validation review of the Arizona English Language Learner Assessment. Civil Rights Project/Proyecto Derechos Civiles. https://escholarship.org/uc/item/ 8zk7x839 Fry, R. (2008). The role of schools in the English language learner achievement gap. Pew Hispanic Center. https://files.eric.ed.gov/fulltext/ED502050.pdf Gándara, P., & Baca, G. (2008). NCLB and California’s English language learners: The perfect storm. Language Policy, 7(3), 201–216. https://doi.org/10.1007/ s10993-008-9097-4 Gándara, P., & Orfield, G. (2012). Segregating Arizona’s English learners: A return to the “Mexican room”?. Teachers College Record, 114(9), 1–27. https://doi. org/10.1177/016146811211400905

Garcia, E. E., Lawton, K., & Diniz de Figueriedo, E. (2012). The education of English language learners in Arizona: A history of underachievement. Teachers College Record, 114(9), 1–18. https://doi.org/10.1177/016146811211400907 García, O., & Wei, L. (2014). Translanguaging: Language, bilingualism, and education. Palgrave Macmillan. https://doi.org/10.1057/9781137385765 Hoang, N. T. H., & Hamid, M. O. (2017). ‘A fair go for all?’: Australia’s languagein-migration policy. Discourse: Studies in the Cultural Politics of Education, 38(6), 836–850. http://dx.doi.org/10.1080/01596306.2016.1199527 Hogan-Brun, G., Mar-Molinero, C.,  & Stevenson, P. (Eds.). (2009). Discourses on language and integration. John Benjamins Publishing Company. https://doi.org/ 10.1075/dapsac.33 Jackson, T. A. (2011). Which interests are served by the principle of interest convergence?: Whiteness, collective trauma, and the case for anti-racism. Race, Ethnicity and Education, 14(4), 435–459. https://doi.org/10.1080/13613324.2010.5 48375 Jiménez-Castellanos, O., Combs, M. C., Martínez, D., & Gómez, L. (2013). English language learners: What’s at stake for Arizona?. Morrison Institute for Public Policy. https://morrisoninstitute.asu.edu/sites/default/files/english_language_ learners.pdf Kopriva, R. (2008). Improving testing for English language learners. Taylor & Francis Group. https://doi.org/10.4324/9780203932834 Ladson-Billings, G. (1998). Just what is critical race theory and what’s it doing in a nice field like education?. International Journal of Qualitative Studies in Education, 11(1), 7–24. https://doi.org/10.1080/095183998236863 Ladson-Billings, G., & Tate, W. (1995). Toward a critical race theory of education. Teachers College Record, 97(1), 47–68. https://doi.org/10.1177/01614681 950970010 Lawrence, C. R., Matsuda, M. J., Delgado, R., & Crenshaw, K. W. (1993). Introduction. In M. J. Matsuda, C. R. Lawrence, R. Delgado & K. W. Crenshaw (Eds.), Words that wound: Critical race theory, assaultive speech, and the first amendment (1st ed., pp. 1–15). Routledge. https://doi.org/10.4324/9780429502941 Lillie, K. E., Markos, A., Arias, M. B.,  & Wiley, T. G. (2012). Separate and not equal: The implementation of structured English immersion in Arizona’s classrooms. Teachers College Record, 114(9), 1–33. https://doi.org/10.1177/01614 6811211400906 Llosa, L. (2016). Assessing students’ content knowledge and language proficiency. In E. Shohamy, I. G. Or  & S. May (Eds.), Language testing and assessment (3rd ed., pp.  3–14). Springer International Publishing. https://doi. org/10.1007/978-3-319-02261-1 Martinez-Wenzl, M., Pérez, K. C., & Gandara, P. (2012). Is Arizona’s approach to educating its ELs superior to other forms of instruction?. Teachers College Record, 114(9), 1–32. https://doi.org/10.1177/016146811211400904 McNamara, T., & Shohamy, E. (2008). Language tests and human rights. International Journal of Applied Linguistics, 18(1), 89–95. https://doi.org/10.1111/ j.1473-4192.2008.00191.x Menken, K. (2008). English learners left behind: Standardized testing as language policy. Multilingual Matters. https://doi.org/10.21832/9781853599996

Menken, K. (2009). No Child Left Behind and its effects on language policy. Annual Review of Applied Linguistics, 29, 103–117. https://doi.org/10.1017/ S0267190509090096 Menken, K. (2010). NCLB and English language learners: Challenges and consequences. Theory Into Practice, 49(2), 121–128. https://doi.org/10.1080/ 00405841003626619 Menken, K. (2013). Restrictive language education policies and emergent bilingual youth: A  perfect storm with imperfect outcomes. Theory Into Practice, 52(3), 160–168. https://doi.org/10.1080/00405841.2013.804307 Menken, K., Hudson, T., & Leung, C. (2014). Symposium: Language assessment in standards-based education reform. TESOL Quarterly, 48(3), 586–614. https:// doi.org/10.1002/tesq.180 Menken, K., & Solorza, C. (2013). Where have all the bilingual programs gone?!: Why prepared school leaders are essential for bilingual education. Journal of Multilingual Education Research, 4, 9–39. http://fordham.bepress.com/jmer/ vol4/iss1/3 Menken, K., & Solorza, C. (2014). No child left bilingual: Accountability and the elimination of bilingual education programs in New York City schools. Educational Policy, 28(1), 96–125. http://dx.doi.org/10.1177/0895904812468228 Messick, S. (1981). Evidence and ethics in the evaluation of tests. Educational Researcher, 10(9), 9–20. https://doi.org/10.1002/j.2333-8504.1981.tb01244.x Messick, S. (1988). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. https://doi.org/10.1002/ j.2330-8516.1988.tb00303.x Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education; Macmillan. Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256. https://doi.org/10.1177/026553229601300302. National Center for Education Statistics. (2023). English learners in public schools. Condition of education. U.S. Department of Education, Institute of Education Sciences. https://nces.ed.gov/programs/coe/indicator/cgf. Accessed 19 July 2023, Omi, M.,  & Winant, H. (2015). Racial formation in the United States: From the 1960s to the 1980s (3rd ed.). Routledge. Otheguy, R., García, O., & Reid, W. (2015). Clarifying translanguaging and deconstructing named languages: A  perspective from linguistics. Applied Linguistics Review, 6(3), 281–307. https://doi.org/10.1515/applirev-2015-0014 Pennock-Roman, M., & Rivera, C. (2011). Mean effects of test accommodations for ELLs and non-ELLs: A  meta-analysis of experimental studies. Educational Measurement, Issues and Practice, 30(3), 10–28. https://doi.org/10.1111/ j.1745-3992.2011.00207.x Ramsey, A., & O’Day, J. (2010). Title III policy: State of the states. American Institutes for Research. https://www2.ed.gov/rschstat/eval/title-iii/state-of-states.pdf Resolution Agreement, No. 08-06-4006 (OCR)  & 169-8-81 (DOJ). (2012). August. www.justice.gov/crt/file/850406/download Resolution Agreement, No. 08-06-4006 (OCR) & 169-8-81 (DOJ). (2016). May. https://www2.ed.gov/documents/press-releases/ade-grades-3-12-settlementagreement.pdf

Ricento, T. (Ed.). (2006). An introduction to language policy: Theory and method Wiley-Blackwell. Rios-Aguilar, C., González Canché, M. S.,  & Moll, L. C. (2012). Implementing structured English immersion in Arizona: Benefits, challenges, and opportunities. Teachers College Record, 114(9), 47–80. https://doi.org/10.1177/ 016146811211400903 Rios-Aguilar, C., González Canché, M. S., & Sabetghadam, S. (2012). Evaluating the impact of restrictive language policies: The Arizona 4-hour English language development block. Language Policy, 11(1), 47–80. https://doi.org/10.1007/ s10993-011-9226-3 Rosa, J., & Flores, N. (2017). Unsettling race and language: Toward a raciolinguistic perspective. Language in Society, 46(5), 621–647. https://doi.org/10.1017/ S0047404517000562 Saunders, W. M.,  & Marcelletti, D. J. (2013). The gap that can’t go away: The catch-22 of reclassification in monitoring the progress of English learners. Educational Evaluation and Policy Analysis, 35(2), 139–156. https://doi. org/10.3102/0162373712461849 Shohamy, E. (1993). The power of tests: The impact of language tests on teaching and learning. NFLC occasional papers. The National Foreign Language Center at Johns Hopkins University. https://files.eric.ed.gov/fulltext/ED362040.pdf Shohamy, E. (1998). Critical language testing and beyond. Studies in Educational Evaluation, 24(4), 331–345. https://doi.org/10.1016/S0191-491X(98)00020-0 Shohamy, E. (2001a). The power of tests: A critical perspective on the uses of language tests (1st ed.). Routledge. https://doi.org/10.4324/9781003062318 Shohamy, E. (2001b). Democratic assessment as an alternative. Language Testing, 18(4), 373–391. https://doi.org/10.1191/026553201682430094 Shohamy, E. (2006). Language policy: Hidden agendas and new approaches. Routledge. https://doi.org/10.4324/9780203387962 Shohamy, E. (2007). Language tests as language policy tools. Assessment in Education: Principles, Policy  & Practice, 14(1), 117–130. https://doi. org/10.1080/09695940701272948 Shohamy, E. (2008). Language policy and language assessment: The relationship. Current Issues in Language Planning, 9(3), 363–373. https://doi.org/10. 1080/14664200802139604 Shohamy, E. (2011). Assessing multilingual competencies: Adopting construct valid assessment policies. The Modern Language Journal, 95(3), 418–429. https:// doi.org/10.1111/j.l540-4781.2011.01210.x Solano-Flores, G. (2008). Who is given tests in what language by whom, when, and where?: The need for probabilistic views of language in the testing of English language learners. Educational Researcher, 37(4), 189–199. https://doi. org/10.3102/0013189X08319569 Solano-Flores, G., & Li, M. (2006). The use of generalizability (G) theory in the testing of linguistic minorities. Educational Measurement: Issues  & Practice, 25(1), 13–22. https://doi.org/10.1111/j.1745-3992.2006.00048.x Solano-Flores, G.,  & Trumbull, E. (2003). Examining language in context: The need for new research and practice paradigms in the testing of English-language learners. Educational Researcher, 32(2), 3–13. https://doi.org/10.3102/ 0013189X032002003

Solórzano, R. W. (2008). High stakes testing: Issues, implications, and remedies for English language learners. Review of Educational Research, 78(2), 260–329. https://doi.org/10.3102/0034654308317845 Spolsky, B. (1997). The ethics of gatekeeping tests: What have we learned in a hundred years? Language Testing, 14(3), 242–247. https://doi.org/10.1177/ 026553229701400302 Tate, W. F. (1997). Critical race theory and education: History, theory, and implications. Review of Research in Education, 22(1), 195–247. https://doi.org/10.31 02/0091732X022001195 Tsagari, D.,  & Cheng, L. (2017). Washback, impact, and consequences revisited. In E. Shohamy, I. G. Or  & S. May (Eds.), Language testing and assessment (3rd ed., pp.  359–372). Springer International Publishing. https://doi. org/10.1007/978-3-319-02261-1_24. Valdés, G. (2001). Learning and not learning English: Latino students in American schools. Teachers College Press. Valdés, G., & Figueroa, R. A. (1994). Bilingualism and testing: A special case of test bias. Ablex. Wiley, T. G., & Wright, W. E. (2004). Against the undertow: Language-minority education policy and politics in the “age of accountability.” Educational Policy, 18(1), 142–168. https://doi.org/10.1177/0895904803260030 Wolf, M. K., & Leon, S. (2009). An investigation of the language demands in content assessments for English language learners. Educational Assessment, 14(3–4), 139–159. https://doi.org/10.1080/10627190903425883 Wright, W. E. & Li, X. (2008). High-stakes math tests: How No Child Left Behind leaves  newcomer English language learners behind. Language Policy, 7, 237– 266. https://doi.org/10.1007/s10993-008-9099-2 Zentella, A. (1997). Growing up bilingual: Puerto Rican children in New York (1st ed.). Wiley-Blackwell.

4 “IT’S NOT THEIR ENGLISH”
Narratives contesting the validity of a high-stakes test

Gordon Blaine West and Bala Thiruchelvam

Introduction

The validity of assessments is often determined in a top-down manner, either by policy makers or by test developers. In this chapter, we examine validity from the perspective of test-takers and instructors impacted by a high-stakes English language assessment at a university in South Korea (hereafter Korea). Informed by a critical language testing perspective (Shohamy, 2001b), we sought to hear from stakeholders in three different groups: students who had failed the test, instructors of English Language Program (ELP) courses who taught the students and worked as raters of the assessment, and instructors of an alternative test course, which students could take if they failed the test. This small-scale qualitative study aims to help build an understanding of validity from the bottom-up, accounting for the impact and consequences of the assessment from the perspective of test-takers and instructors. Many studies examining the consequences of language testing and validity from the perspectives of those impacted by assessments have relied on quantitative or quantitatively driven mixed-methods studies (e.g., Xerri & Briffa, 2018). We conducted a qualitative study using narrative analysis (De Fina & Georgakopoulou, 2011) to center the voices of those impacted by the assessment. Using narrative positioning theory (Bamberg, 1997), we are able to see in the interview data how students and instructors positioned themselves in relation to the assessment and related policies not simply as passive subjects, but rather as resistant agents working against the policies in various ways and drawing on moral evaluations
of the assessment and its consequences to construct judgements of validity that drove their actions. Language testing is not a neutral, objective activity of measurement, but instead a “product and agent of cultural, social, political, education, and ideological agendas that shape the lives of individual participants, teachers and learners” (Shohamy, 2001b, p. 131). In our study, English language testing is where power is exerted to force those going through the university system to conform to specific values and standards created by the university. It is also, as we show, where that power, and particularly the claims to validity that underpin it, are contested by those impacted by the assessment. In this chapter, we first describe the context of our study, including some background on the assessment we are examining. Then we will briefly review studies that look at test-taker perspectives before laying out the theoretical framework for validity that informs our study. We then describe the methodology and report on our findings. Finally, we discuss implications.

Siheom University and the computer-based English test

Testing in Korea has an enormous impact on nearly every level of English language education, with tests acting as gatekeeping mechanisms for educational and professional advancement (Choi, 2008). In post-secondary education, many universities in Korea require students to obtain a certain range of scores on a standardized English language assessment, either an in-house test, or others such as the TOEIC, TOEFL, IELTS, or TEPS (Choi, 2008). At the time of our study, the same was true of Siheom University.1 Students at Siheom were required to obtain a certain minimum score on either the TOEIC, or the in-house test, the Computer-based English Test (CBET). In this study, we focus on the CBET since that was the preferred test of the university, and all students were required to take the CBET at the conclusion of their regular, mandatory English language classes. The CBET was developed by the university in the late 1990s and revised several times. It was developed to focus on communicative skills, with the intention of avoiding negative washback associated with discrete point grammar- and vocabulary-focused exams that were and remain common in Korea (Choi, 2008). The CBET contains two separate sections, a writing test, and a speaking test. At the time of the study, the speaking test consisted of eight timed items including tasks like an autobiographical response, information transfer, and giving an opinion, among others. Test-takers used a microphone to record their response on a computer. The writing test consisted of three timed writing tasks that were computer-delivered: responding to an email, describing a graph, and an argumentative or compare and contrast essay.

Scoring for both speaking and writing was based on analytical rubrics with four categories. Only one category, accuracy, focused on grammatical features of the language produced. Other categories were more holistic, assessing appropriateness of content for the context of communication, expansiveness of information conveyed, and overall success in communicating information in English. This was part of the design to avoid washback by rating language production more holistically. Students could achieve one of four levels (rudimentary, moderate, commanding, and expert) and one of three sub-scores within that level (low, mid, or high), for 12 possible sub-scores in all. Academic departments set minimum score requirements as a prerequisite for their students to graduate. For example, arts departments required a score of moderate-low to graduate, while the English department required a score of moderate-high. Most departments required a score of moderate-mid on both speaking and writing to graduate. If students failed to achieve the required minimum score on either the CBET or TOEIC, they could take an alternative CBET course. The course was a one-month intensive course that prepared students for either the writing or speaking section of the CBET, which would be administered at the end of the course, although it could be modified by the instructor to be delivered face-to-face, rather than on a computer. To take the course, they had to pay the course fee. Despite its purported aims, however, the test was critiqued by students and instructors in our study. Elsewhere, we have shown how instructors critique the negative washback effect of the assessment (West & Thiruchelvam, 2018). In this study, participants questioned the validity of the constructs that were tested by the CBET, and pointed out other issues with the assessment, including poorly written test items.

Toward a critical concept of validity

We define validity in this study as a social construct, following other scholars who have acknowledged the social nature of validity. Messick (1989, 1998) wrote about the social values and consequences that are embedded in judgements of validity. More expansively, McNamara (McNamara, 2001, 2006; McNamara & Roever, 2006) showed ways in which language testing rests on social constructs that are embedded within value systems that serve social, cultural, and political goals. Despite the move toward a more expansive definition of validity that grapples with the socially constructed nature of validity, there is not yet a deeply theorized critical conception of validity. Shohamy’s (2001b) work on critical language testing makes clear that such a theory would minimally need to consider the views of those being tested, while also accounting for
the consequences of testing. A critical theory of validity would need to be built from the bottom-up, foregrounding the views of those impacted by testing, rather than imposed from the top-down as an argument made by test designers and policy makers only. A critical theory of validity raises the question: who gets to determine the validity of a test? To this point, the focus on critical language testing has mainly been on consequence validity (Shohamy, 2001a, 2001b). Earlier work on consequences focused more narrowly on washback, looking specifically at how testing impacted teaching and learning (Cheng, 2008), while avoiding discussions of personal or political consequences of language tests or testing policies. Bachman’s (2005) work pointedly relegates questions about the political nature of testing as issues for a code of ethics, rather than as considerations for judging test validity, and Kane (2013) similarly narrowly defines consequences as impacting validity only to the extent that unintended consequences have adverse impacts on how the test meets the goals of the policy or program in which it is applied. McNamara (2006) critiqued those approaches as too narrowly defining consequences, and as Weideman and Deygers (this volume) argue, a key weakness of argument-based validity is that responsibility for validation under this conceptualization lies with the entities that use the tests. In the case that they are using tests for unintended purposes, they have very little incentive to ensure that those uses are valid, effectively making it “nobody’s responsibility” (Weideman & Deygers, this volume, p. 42). A critical conception of validity requires a broader investigation of consequences, which in turn requires understanding of the ways in which they are politically shaped and used. To understand what the consequences of language assessment are, or even to have a full understanding of how power is exercised in testing, we must hear from the test-takers themselves and other stakeholders who are tasked with implementing and carrying out assessment policies (Shohamy, 2001b). Judgements of validity must include the voices of those subject to assessment and assessment policies, including both test-takers and those instructors and raters who ultimately implement the policies. Those stakeholders, whether their voices are heard or not, are actively involved in the social construction (and contestation) of language test validity.

Test-taker perspectives

To fully understand the consequences of testing, we must hear from those impacted by tests and testing policies. Shohamy pointed out that their voices are often silenced, but that “by listening to test-takers, it is possible to learn about the centrality of tests in the test-takers’ lives and about the strategies
test-takers develop in order to overcome testing difficulties” (2001b, p. 14). Others have also made the call for test-taker perspectives to be understood and considered as those best positioned to detail the consequences of testing (Kane, 2013; Moss, 1998). There have been a few recent studies that have taken up this call to investigate the consequences of testing from test-taker perspectives (Hamid et al., 2019; Karataş & Okan, 2019). Karataş and Okan (2019), taking a critical language testing perspective, surveyed test-takers of a national high-stakes assessment for English teaching candidates in Turkey. They found that test-takers reported severe professional consequences for failure of the exam, and the vast majority of those surveyed openly contested the validity of the assessment based on the perceived use, severe consequences, and mismatch with the intended purpose of the test. In Korea, Kim et al. (2019) surveyed adult jobseekers for their views of standardized English assessments. They found that jobseekers viewed English language assessments as more about employment than English language proficiency, and they concluded that neoliberal pressures to learn English for economic survival and advancement were demotivating to test-takers. This finding highlights the impact that testing policies can have on test-takers. While some studies have sought to understand validity from the perspective of test-takers, nearly all studies have used surveys to collect data. To more fully understand how validity is constructed from a critical perspective, qualitative studies that provide greater context to inform analysis are needed. Studies examining teachers’ resistance to policies that they deemed nonvalid can be informative. One recent study used case study methods to examine how Korean teachers resisted policies intended to make them teach English in English (Choi, 2017). Since the teachers’ resistance was hidden and subversive, rather than open, a qualitative approach was better able to capture the multitude of ways in which teachers resisted those policies. Using narrative analysis in this study, we are better able to understand how validity is constructed by test-takers and instructors.

Methodology

Participants

Participants in this study included four Siheom University students who had repeatedly failed the CBET and who had just finished an alternative CBET course to graduate. Students who had failed the CBET were explicitly sought as they would be most able to speak to the consequences of failing the test. Many students who failed the assessment were reluctant to
participate, understandably skeptical of assessment researchers who sought to hear their stories. We also interviewed two instructors who taught the alternative CBET course, and seven instructors from the English Language Program (ELP), who taught the mandatory, regular English language courses all students took, culminating in the CBET. The ELP instructors also worked as raters for the CBET as part of their regular duties, although they received additional pay for rating work. In this chapter, we focus on two students, Seomin and Autumn, and two instructors, Doug and Jamie.2 Seomin majored in international business and had taken the CBET twice, in addition to taking the TOEIC twice, failing each time to get her required score. Autumn majored in English and had taken the CBET seven times and the TOEIC four times, failing each time to achieve her required minimum score. At the time of the interviews, Doug was in his second year of teaching the alternative CBET course. He had been teaching at Siheom University in various departments for several years. Jamie had taught at Siheom for over 10 years in various programs and departments. At the time of the interview, she was an ELP instructor and worked as a CBET rater. Both Doug and Jamie are North American and hold advanced degrees in language education-related fields.

Interviews

Both authors conducted interviews with participants for this study. All interviews lasted between 30 and 90 minutes, with follow-up interviews conducted in some cases to clarify or confirm information. All interviews were semi-structured, active interviews, an approach that accounts for how researchers are themselves part of data collection and influence the way narratives are co-constructed during interviews (Holstein & Gubrium, 2004). All the interviews were conducted in English. Following local protocols, informed consent was obtained before all interviews, data were stored securely, and all identifying information was removed from transcriptions.

Data analysis

For data analysis, we first transcribed the interviews and then thematically coded the transcripts. After this initial analysis, we went back to the transcripts, focusing on narratives elicited from the participants to conduct a narrative analysis of the ways in which students and instructors took moral stances and positions in relation to the CBET and its policies. We define narratives as interactionally (co)constructed stories that are socio-materially


situated. Stories in this sense may take a canonical structure with a beginning, complicating action, and resolution (see Labov & Waletzky, 1967), but they can also be small stories, which are “tellings of ongoing events, future or hypothetical events, shared (known) events, but also allusions to tellings, deferrals of tellings, and refusals to tell” (Georgakopoulou, 2007, p. vii).

Narrative analysis uses an interactional approach that takes narratives as socially constructed and shaped by context, including who the tellers are and under what circumstances the tellings are happening (De Fina, 2009; De Fina & Georgakopoulou, 2011). Narratives work well for investigating moral beliefs, as narratives are imbued with moral judgements (De Fina, 2009). We often negotiate our levels of commitment to our values and beliefs through evaluations of our past actions as we tell stories of past experiences (Bruner, 1997). This approach allows us to view morality as it is constructed in specific contexts, with the understanding that morality and moral beliefs change and shift over time and depending on context.

The types of narratives we analyze in this study are mainly accounts, which are elicited in interview settings where participants are asked to explain or justify why something happened or why they did something (De Fina, 2009). In our study, students account for why they failed the CBET while instructors account for why they take actions against the test to support their students. These accounts come with more canonical structures narrated in the past tense, but they also employ different verb tenses for evaluation (Schiffrin, 1981) or in attempts to distance speakers from past actions (Konopasky & Sheridan, 2016). It is important to remember that the interview context itself is one of power imbalance, where the interviewer is asking the interviewee to provide an account with the implicit understanding that the interviewer will also be judging that account. As such, shifts to evaluate actions and events, and to distance tellers from past events, can be expected.

We use Bamberg’s (1997) positioning analysis to help us better understand this context. Bamberg’s positioning analysis guides us to look at how positioning is accomplished in narratives at three levels. At the first level, we look at how participants position themselves in the story world, which includes the setting, characters, and events in the story being told. At the second level, we can see how participants position themselves interactively with us (the authors) as interlocutors. This helps us examine how the story is told in ways that are specific to the immediate context for the participants to position themselves as moral actors. At Bamberg’s (1997) third level, tellers position themselves in broader discourses. For the purpose of this chapter, we will focus on analyzing how ideas of validity were referenced at the third level.

Many of the narratives that were shared in the interviews followed Labov’s (1972) canonical structure of an abstract introducing the narrative,


an orientation where the teller introduces the setting and characters, complicating action, evaluations, a resolution, and a coda where the teller ends the narrative. While not all narratives had every part of this structure, nor did they always follow that order, Labov’s structure is helpful in examining the narratives in this report. As part of the narrative analysis, we examine evaluative language and quoted speech (Wortham, 2000) to understand moral positioning. We also look at stance taking (Baynham, 2011) in how participants and we, as interviewers, align or disalign to each other in the interactive construction of narratives, and importantly to see how we align or disalign to ideas of validity at level three of the analysis.

Findings

We previously reported on the impact of the CBET on students and instructors (see West & Thiruchelvam, 2018 for more detail on our thematic analysis). Here, we present a narrative analysis that gives further insights into how participants contest the validity of the CBET. We look first at student narratives in which Seomin and Autumn position themselves as victims of a nonvalid testing process. We then examine narratives from Jamie and Doug in which moral evaluations of the test are used to justify actions taken to mitigate perceived harm to students from its consequences.

Students as victims of the test

Seomin: victim of the CBET

The first narrative excerpts come from an interview with Seomin. As a consequence of failing to pass the test, her graduation was delayed, and she incurred financial costs by having to repeatedly take the test and enroll in an alternative CBET course in order to graduate. The first excerpt shares an account in response to questions from Bala about what she did to prepare for the CBET, relaying a narrative of personal past experience. The narrative she shared in this excerpt built on her previous stories about her ELP classes and not feeling prepared for the CBET despite assurances that she would be.

Unprepared victim:

 1 B: and so, uh, what did you do to prepare for the CBET, so you had some lessons that
 2    had preparing, and then what did you do to prepare?
 3 S: ((laughing)) umm actually teachers always said that this class is enough to prepare CBET
 4    and just take this class and you can get MH or whatever you want but so, so I just relaxed
 5    myself and ((laughing)) and just focused on classes but it was total different
 6 B: totally different? Ah, ok so you felt that you were underprepared then for=
 7 S: =no not prepared
 8 B: =so not prepared, yeah, ok ok

In this short narrative, we can see how Seomin constructs the story world, with its characters and events, in the story she tells. The characters are Seomin (her past self) and the teachers of her ELP classes. In the story world, we can see Seomin as the protagonist, a focused student, but relaxed about the CBET because she trusted her teachers’ guidance that doing well in class alone would prepare her for the test. The teachers then take an antagonistic role as having deceived Seomin into complacency about the test by telling her “this class is enough to prepare CBET and just take this class and you can get MH” (lines 3–4). In the story, Seomin is a moral character, listening carefully to her teachers and working as a focused student, and it is her teachers in relation to the test who act immorally by deceiving her about the nature of the test being “total different” (line 5) from the classes, resulting in her being “not prepared” (line 7) for the test. The protagonist of the story is a victim of this deception.

At Bamberg’s second level, the interactional telling of the narrative, we can see how she reinforces her positioning in the story world as a victim. Her laughing (lines 3, 5) comes as these are points that she has already mentioned previously in the interview. After being asked about her experience in ELP classes leading up to the CBET, she had made clear that her teachers had not prepared her, and that she felt deceived by them about the nature of the test, leading to Bala’s follow-up question in lines 1 and 2 about what she did to prepare for the CBET. The laughing seems to reference that previous telling, building now on a shared story that Bala knows, and allowing her to reiterate the depth of her innocence as a student who listened carefully to and trusted her teachers. Later, when Bala asks a clarification and follow-up question, referring to the story world Seomin as “underprepared” (line 6), she cuts him off to reframe it more strongly as “no not prepared [at all]” (line 7). We can see how this move to cut Bala off is possible given his general alignment with her telling. Narratives are co-constructed through interaction, and his general alignment, along with the moral judgement implied in line 6, aids Seomin in constructing a narrative of victimhood in ways that a less sympathetic interviewer might not have facilitated. In line 8 we see he takes up her reframing as not simply underprepared, but rather as not prepared at all, thereby strongly aligning with her telling and moral positioning in the story as an innocent victim. This type of alignment had occurred before the telling of this narrative as well, with Bala often replying simply


“right,” “yeah,” or “ummhmm,” giving affirmative responses to her telling of her experiences throughout the interview rather than pushing back on or contesting any of her tellings, which might have made her more reluctant to share her stories. At Bamberg’s third level, we gain some perspective on Seomin’s positioning relative to classes and assessment. Her ideal class in this case would be one focused on preparing students directly for the high-stakes assessment they need to take at the end of the course. Teachers who fail to do so are, in her telling, neglectful of students’ needs and, in her case, overtly deceitful in their assurances. While the validity of the test itself is not called into question in this narrative, the validity of the courses and the instruction is clearly compromised in her telling by their failure to align with and prepare students for the test. Had her graduation requirement only been to finish coursework with a certain grade, she might have had a different story to tell about the validity of the courses and instruction; but as they were linked to the CBET, they are part of the larger testing process, as a preparation phase, and questioning their validity calls into question the validity of the larger testing process.

In the next excerpt, Seomin directly questions the validity of the CBET itself. Later in the interview, Bala asked her what advice she would give to instructors teaching ELP given her experience, positioning her as someone whose personal experience can provide valuable insights to those who had been in positions of power in her coursework. Rather than responding with advice, she rejects the premise of the question and instead responds with an account of why she herself, as a competent communicator, failed the test. In this excerpt, the account mainly focuses on evaluation (Labov & Waletzky, 1967) of the CBET and the university.

Why do we have this policy?:

 1 S: Actually I don’t know object, the goal of ELP classes. They, this university, told that these
 2    classes are for improving English speaking and writing level of students, but, but these are,
 3    but there is CBET test also. I think English speaking and writing and test is a little bit
 4    different because test is, test needs some technical things to get better score, but English
 5    speaking and writing is just different because it is just communication skill. So if ELP
 6    classes want to improve English speaking or writing skill, they have to change their policies
 7    like except [inaudible] CBET test. But if they want to take, if they make students take the
 8    CBET test, they have to focus on the CBET test, so I don’t know what they want actually ((voice shaking))
 9    because I can speak with native speakers and I can
10    communicate. Yeah, surely I have some problems in grammar and I have some problems
11    in vocabulary, yeah something ((laughing)). But I don’t know why I have to got “Advanced,” I
12    have to get Advanced, and I have to pay for it
13    actually.
14 B: Ok
15 S: That’s all ((laughing))


In the story world, Seomin is again the protagonist and a victim of the testing process that is perpetuated by the university as the antagonist. In her telling, the university has two contradictory goals: “improving English speaking and writing level of students” (line 2), and having students pass the CBET. These two things are in contradiction because, as she states, the “test needs some technical things to get a better score, but English speaking and writing is just different because it is just communication” (lines 4–5). The purposes of English for the CBET and for communication are not the same, nor are they compatible in her telling. She goes on to further elaborate on how the university cannot pursue both policies at once because of this incompatibility. Seomin’s positioning in the story world shifts in this narrative beyond that of simply a diligent, trusting student who was not prepared for the test. She is also a competent English user who is able to communicate well, although imperfectly, with “native speakers” (lines 9–10).

Interactionally, the most notable moment in the telling of the narrative is in line 8 when Seomin’s voice begins to shake with emotion. At the end of the complicating action of her being confused by the university’s goals and offering a strong negative evaluation, the emotionality serves to reinforce to her interlocutor how frustrated she felt. The emotionality of that telling then emphasizes the turn in action when her positioning shifts to that of a competent English speaker. Her positioning in the story world as someone who has successfully communicated in English in the past is reinforced by the current telling where she is successfully communicating an emotional narrative in English. The silence of her interlocutor as she is telling this extended narrative further supports her positioning toward her audience as a competent communicator, as someone who is able to hold her audience’s attention through an extended storytelling. The laughter that follows in line 11 is a nervous kind of laughter at the end of her telling about her English language abilities. Laughter like this comes again at the end of the coda “that’s all” (line 15) to signal that she has finished her story and is ready to transition to the next question. In both of these cases, laughter is important in managing the interaction, used in uncomfortable moments to shift to a less serious frame and to facilitate transition to a new topic (Warner-Garcia, 2014).

Validity of the CBET is called into question in two ways in this narrative. First, Seomin directly questions the construct of communicative ability that it is designed to measure. For her, communicative ability is distinct from the technical skills required to pass the CBET. This judgement of the test leads to the conclusion that the policies surrounding the test, encompassing the entirety of the testing process in this case, are


nonvalid. More than nonvalid, she gives a clear evaluation of the CBET and the testing process as harmful. Seomin also questions the validity of the CBET by positioning herself as a competent English language communicator. She can use English effectively to communicate despite what are characterized as minor issues of grammar and vocabulary. That the CBET does not recognize this competence renders it nonvalid. This is tied to her attack on the construct validity of the CBET but makes a distinct point as well by personalizing the attack and presenting herself as evidence of its nonvalidity.

Autumn: victim of an opaque policy and test

Bala also interviewed Autumn, whose repeated failure to achieve the necessary score for graduation resulted in her graduation being delayed and her acceptance to a graduate program rescinded. After talking about the CBET and expressing her concerns about the speaking portion of the test, Bala asked her about her worries about the writing section, which prompted an account sharing her past experiences with the test.

Opaque test:

 1 B: And ah were you worried at all about the writing test on the CBET? Ah why or why not?
 2 A: Ah I I had lots of very things to answer in this question
 3 B: Ah ok
 4 A: Um umm although I took the ELP class ah but that was my freshman years so I
 5    didn’t know anything I took the class and took the ELP classes of CBET but I
 6    didn’t know what it was so umm I’m [I was]
 7    very confused and after the times I preparing my graduation ah I definitely
 8    needs MH of the CBET so I studied but I didn’t know of uh criteria of
 9    the grading
10 B: Right
11 A: Yeah so every time I took CBET I think I thought I did it better than last time but
12    I always failed I always got MM
13 B: MM? Ah, ok. So and what was the most difficult about CBET when you were doing your
14    test?
15 A: The most difficult aspect of the CBET is the criteria. I didn’t know the criteria. I didn’t
16    know how to write, and um why I got MM again. I, I didn’t acknowledge the
17    reason so I had to take my alternative course.
18 B: Right, ok so after taking CBET there was no feedback, or?
19 A: Yeah
20 B: Ah ok, so now you’ve failed CBET a couple of times ah so how did you feel when you
21    didn’t pass the CBET and what were the consequences?
22 A: Yeah ((laughing))
23 B: ((laughing))
24 A: I failed several times and every time I failed I feel like why? Why? I didn’t know. I
25    want to know why but um and I read in the CBET’s homepage that you know, there
26    are some feedbacks of the my, my writing and my speaking I read it many times but
27    um it’s not very helpful. It was very not helpful.
28 B: So emotionally it was kind of like frustrating?
29 A: Yeah
30 B: Ok
31 A: I definitely want ah um what was it? Concrete feedback, personal feedback.
32 B: Ah I see. Yeah. Alright then, any other consequences? Like did it postpone your
33    graduation or financial aspects?
34 A: Ah very. Very. Because ah last year ah last year, I thought that I can graduation, I can
35    graduate and I already passed my, I already passed my graduation [was accepted to a
36    graduate] school but I failed in CBET and my entrance was cancelled. Yeah
37 B: Cancelled or delayed?
38 A: Cancelled
39 B: Cancelled, ah
40 A: Yeah cancelled because I’m not graduate so I had to do my entrance again
41 B: Ah so apply again?
42 A: Yeah, apply again and it was very expensive experience.
43 B: Ah so sets you back one year yeah?
44 A: Yeah

The protagonist, and the only real human character in the story world of this account, is Autumn (her past self). Similar to Seomin, story world Autumn is positioned as a diligent student who “studied but I didn’t know of uh criteria of the grading” (lines 8–9), tried hard to retake the test many times (lines 11, 24), and, most of all, sought feedback to improve her score (lines 15, 24, 25–27, 31), eventually even taking the alternative CBET course to better understand what was required from the test (lines 16–17). The antagonist in the narrative is the CBET testing process, including things like the testing website, criteria, feedback, scoring, policies, and alternative course. The testing process acts against Autumn in the narrative, causing her confusion (lines 7, 8–9, 11), concealing criteria (lines 15–17), giving opaque feedback (lines 24–27), causing financial pressure (line 42), and most severely, delaying her graduation and causing her to lose her acceptance to a graduate program (lines 34–40). Autumn is positioned as a victim of the CBET. Although the specifics vary, this positioning was consistent with all the students we interviewed.

At the second level of positioning, we see that the narrative is co-constructed. Autumn opens on line 2 with a statement of how many things she has to say. This is an abstract to indicate that she is beginning a detailed narrative. Bala aligns with this and affirms her telling when he asks his first clarifying and follow-up questions (lines 13–14). After this, rather than simply affirming with minimal responses, Bala continues to press for greater detail and move the narrative in different directions with his questioning (lines 18, 20–21, 28, 32–33, 37, 41, and 43). In doing so, he


effectively co-constructs the narrative, both by clarifying different details, as in lines 28 and 37, and by shifting the telling of the complicating action. This happens in lines 20–21, when instead of asking more about her ordeal with trying to identify the test criteria, he asks about her feelings and the consequences of failure. In his asking, Bala is blunt in referring to her repeated failure, but the shared laughter in lines 22–23 shows continued alignment rather than dis-alignment after this, serving as a sign of rapport (Adelswärd, 1989). Autumn continues in lines 24–27 to talk about the difficulty in understanding the test, rather than answering Bala’s question. Bala restates the question about her feelings on line 28, finally getting a direct answer on line 29. On line 31 she returns to her story about the opacity of the testing process before finally shifting the narrative to focus on the consequences of her failure (lines 34–44) in response to Bala’s question (lines 32–33). We see in this excerpt more clearly how the narrative was co-constructed, but we also see how determined Autumn is in her telling, especially in the point she wants to make about the test criteria and feedback being unclear.

Autumn is clearly positioned in this narrative as a victim who suffers severe consequences because of failing to obtain her required score on the CBET. This happens despite her being a diligent student who repeatedly seeks to understand both the criteria and the feedback on her performance in order to improve. Her inability to find either, hidden as they were by an opaque testing process, is what calls into question the validity of the CBET. If the criteria are not clear, and if students cannot understand their own performance and are therefore destined to repeatedly achieve the same score, then the test by implication is not valid. It is not valid because, in her telling, the opacity of the scoring criteria and feedback makes it unfair. It is unfair in that hardworking students have no hope of preparing effectively for the test, nor of being able to improve their score through retesting.

Instructor justifications for mitigating harm from the test

We move on now to examine how instructors positioned themselves in narratives as moral actors, working to mitigate the harm they perceived the CBET posed to students.

Jamie: moral imperative to mitigate harm

We first hear from Jamie. She spoke about her experience teaching the ELP courses and preparing students for the CBET, but also shared stories about


scoring the CBET. Instructors had to go through rater training, and then scored students from classes other than their own in a blind scoring system, with each response being scored by two raters. Earlier in her interview, after she had critiqued the CBET and distanced herself from the test, Gordon pushed back by asking if she considered herself to be part of the testing process as a rater for the exam. She responded with an extended narrative of her past experiences of finding poorly written questions and the steps she took to mitigate the impact of these questions on students by rating their responses more leniently than the rubric prescribed in those cases. Rather than narrate in the past tense, however, she uses the modal verb “will” in the first section and the present tense in the second half. These tense shifts enable her telling of the narrative while not directly implicating herself in concrete past actions, acknowledging that her agency is limited in this context and that the actions she is sharing might have negative consequences for her if her employer found out (Konopasky & Sheridan, 2016). In this excerpt, she continues the narrative, but the focus changes from that of individual action to one of collective action.

Moral collective action to mitigate harm:

 1 J: It is also interesting because sometimes when we have these CBET things happen,
 2    somebody will start a thread, and they will send an email to the whole department and
 3    basically say, “I just want to tell you an example of a question that I’ve had, that I
 4    came across today. Please can we all just be sure that we are being aware that this is
 5    a problem”. And people will have a discussion and basically say, “yes.”
 6 G: Yeah
 7 J: So it’s just kind of making sure everyone’s on the same page, that we need to be
 8    checking if the question is not a clear question, if it’s open to interpretation, a gentle
 9    reminder, because as we get new staff come and go, etc. um, a gentle reminder, “If the
10    question isn’t clear to us, this is an exam situation for somebody whose native language
11    is not English, and so imagine what extra stress that has put them under, and
12    we need to be sympathetic about that.”
13 G: Yeah
14 J: So that’s what happens.

The characters in this story world are Jamie (her past self), the other ELP instructors, and the students. While in narratives before this Jamie had been an individual actor, in line 1, we see the shift to a collective protagonist, “we,” where Jamie and the other instructors work together in common cause as moral actors to mitigate the impact of poorly designed test items on students. The action in line 1, “when we have these CBET things happen,” refers to instances where instructors find questions on the test that are judged to be poorly written. Quoted speech is used to


illustrate the actions taken in emailing the department to alert them to the issue (lines 3–5), with quoted speech used again on line 5 to indicate how they all come to a collective agreement on adjusting their scoring. The action is meant to undermine the double-blind scoring system by ensuring that a single rater is not acting alone in futility, but rather in concert with other raters to adjust scores. The action is justified by a moral obligation to protect students from harm, based on a judgement that the items are not valid. The students are invoked in the third use of quoted speech in lines 9–12 as an imagined dialogue with a new instructor to the program. The imagined dialogue is meant to persuade the new instructor of the nonvalidity of some test items on the CBET, of the instructors’ ability to make that judgement, and of the moral obligation then to act to protect students, who do not have the linguistic proficiency to judge the validity of the test themselves, and who are put under immense stress in the testing situation.

By using the collective “we” and having the action driven by the collective protagonist, Jamie reinforces her positioning as a moral actor in the story world interactionally as she tells the story to Gordon. It is important here to remember that in interview settings, accounts are given with the expectation of being judged by the interviewer, and interviewees in these situations often seek alignment with their portrayal of themselves as moral actors (De Fina, 2009). Having the agreement of her colleagues works to counter any negative moral judgement she may have anticipated from Gordon in response to her earlier story of acting individually to disregard the scoring rubric in instances she deemed nonvalid. The quoted speech delivering the imagined argument to a new instructor about this process also serves as an attempt to persuade Gordon of the morality of her and her fellow instructors’ actions in mitigating harm.

In Jamie’s telling, the test designers and writers are responsible for creating poorly written questions that are deemed nonvalid by her and other instructors. Policy makers at the university then persist in using the CBET despite these poor questions, requiring the moral intervention of the instructors to protect the students. The validity of the CBET is not called into question as a whole, directly; rather, specific questions are deemed nonvalid, which implicitly calls into question the broader validity of the test. The instructors, including Jamie, position themselves as the final and most appropriate judges of validity, based on their distrust of the test designers and policy makers who let these poorly written questions end up on the final version of a test that has high-stakes consequences for students.


Doug: moral obligation to rebuild student confidence

The next excerpt comes from Doug. Early in the interview, Gordon asked him about his biggest challenges in teaching the alternative CBET course. He responds with an account of his encounter with a dance major as a way of justifying his role as that of having to rebuild students’ confidence after suffering from the CBET.

It’s not their English:

 1 D: So the challenges are for me, just the fact that these, a lot of these students are coming
 2    there feeling very disheartened. And I can feel that for, for example somebody whose
 3    major is in dance. They don’t, they’ve never felt a need, that’s what they’ve been focusing
 4    on from an early age, and English never entered the picture until it was time to graduate.
 5    At the same time, I’m like well you knew when you registered as a student that this was
 6    a thing, like in theory they knew but
 7 G: Yeah
 8 D: Umm, but a lot of the students in the ah alternative class have been, have completed their
 9    course requirements, all of their other coursework one to five semesters ago
10 G: Yeah
11 D: So that’s the biggest challenge is just, just getting them to feel better about
12 G: Ah because they’ve finished so long ago
13 D: And in most of their cases, their English is not ah, when I meet them, during an initial
14    assessment, like “what level do you need?” They’re like, “MM, moderate-mid.” I’m like,
15    “What?! You’re not? You didn’t get moderate-mid? Cause you’re talking to me like a
16    moderate-mid person.” And so it’s not their English, it’s whatever it is. It’s their test
17    taking ability, or the questions on that day
18 G: Yeah
19 D: Or whatever it was. So just, just helping to build some ego, you know?
20 G: Yeah
21 D: That’s, that’s probably the biggest challenge.

The story world is populated by Doug (his past self) and his students in the alternative CBET course. In the story world, Doug is positioned as both a teacher and an assessor of their English language proficiency. The students are positioned as victims who have suffered from a test that is not valid. He makes clear how the students have suffered, describing them as “very disheartened” (line 2), having successfully completed all their other graduation requirements long ago (lines 8–9), and in need of having their confidence rebuilt (line 19). To clarify their positioning as victims, he makes two moves to show that the suffering they endured because of their failure of the CBET was not their fault. First, he places blame on the testing policy that would require an English proficiency score for dance majors (lines 3–4), an example chosen because they would seemingly have no professional need for English language


proficiency in their work or studies. He does hedge on this evaluation, saying, “I’m like well you knew when you registered as a student that this was a thing, like in theory they knew but” (lines 5–6), to indicate that ostensibly they were aware of the requirement, but at the same time this is only “in theory,” thereby keeping most of the blame on the policy itself. He later uses quoted speech (lines 13–17) to place blame for their suffering more directly on the CBET. In a conversation with an imagined student, he voices surprise, “What?! You’re not? You didn’t get moderate-mid?” (line 15), to show his disbelief at the difference between the assessment of their proficiency by the CBET and his own assessment in their interaction. Second, he goes on to directly question the validity of the CBET, saying, “And so it’s not their English, it’s whatever it is. It’s their test taking ability, or the questions on that day” (lines 16–17). This evaluation conveys his confidence as an assessor in stating that their English proficiency is sufficient, while simultaneously calling into question the validity of how the CBET measured their proficiency. As the CBET was not valid, and the students suffered harm from it, they are victims, and his moral obligation as an instructor is to help them rebuild their confidence, rather than focusing on increasing proficiency, which he deems to be sufficient based on his own assessment.

Throughout the narrative, Gordon aligns with Doug’s telling, replying mostly with a simple affirmation, “yeah” (lines 7, 10, 18, 20), and once with a simple restatement of what Doug had just said (line 12). This alignment may have served to allow for greater sharing of details, stronger evaluations, and a more dramatic telling with the quoted speech.

In this narrative, Doug clearly evaluates the CBET as not being a valid assessment. His own ability to assess proficiency, based on his professional training and years of experience teaching, is, in his telling, superior to that of the CBET. This evaluation of the CBET as not a valid measure of English proficiency, combined with the harm he sees it causing students, is used to morally justify his subsequent actions to mitigate harm from the CBET. In the interview, he went on to describe how he structures the alternative CBET course in such a way that it becomes very difficult to fail the course. Since a passing grade in the course would be treated as equivalent to the required score on the CBET for the purposes of graduation requirements, Doug structures the grading so that 40% of the weight is on attendance and participation. This shows his belief in the validity of human interaction as a way of assessing English and of achieving the CBET’s stated goal of improving students’ communication abilities. He then does the testing in person, with himself as the interlocutor, further


reinforcing this belief. As with other instructors, Doug’s evaluation of the CBET as both having severe consequences and not being valid (in part or in whole) shapes morally justified interventions to mitigate perceived harm, interventions that test designers and policy makers may not have anticipated or condoned.

Discussion

Both Seomin and Autumn position themselves as victims of the CBET and Siheom’s testing policies, denying the validity of the assessment based on moral evaluations of the fairness of the process and the outsize consequences of failing. This was a positioning shared by all four students in our study. Similar to participants in the study by Karataş and Okan (2019), Seomin found the assessment to be nonvalid in part due to the contradiction she saw between the CBET’s purported goal of improving students’ English and the reality that students simply needed a certain score on a test that she did not see as connected to that policy goal. Autumn’s evaluation was based on the opacity of the testing process. She found the test to be nonvalid based on her experience of being unable to get feedback or a clear sense of how to improve her score. For both, their judgements of validity are entangled in moral evaluations of the fairness of the testing process and policies, similar to test-takers surveyed in Kim et al. (2019). Neither of them felt as though they had any agency to change or even voice their opinion on the CBET. As Shohamy asserted, “test-takers are the true victims of tests in this unequal power relationship between the test as an organization and the demands put on test-takers; they do not have the right to actively pursue or understand the inside secrets of tests” (2001a, p. 385). This unequal power relationship was clear in interviews with test-takers in our study. Understanding test-taker perspectives is “imperative for social justice” (Hamid et al., 2019) in language testing.

Instructors positioned themselves as moral actors with an imperative to mitigate harm to students from the CBET and testing policies. This was true of both alternative CBET course instructors and five of the seven ELP instructors interviewed (although all seven reported negative washback effects of the assessment on their instruction). They justified their actions in the accounts they gave by positioning themselves as champions of their students in opposition to an unfair assessment process. Jamie took steps to mitigate harm both as an instructor and rater. Poorly written or unclear test items on in-house, local English language exams have been widely reported in Korea (Jeon, 2010), and Jamie’s


account of mitigating harm from poorly written items on the CBET adds to our understanding of what teachers may do when they deem items nonvalid. Doug’s account also questions the validity of the CBET. His critique that the test does not actually measure speaking fits with critiques by scholars who have looked at testing from social perspectives, questioning whether language tests are defining what the constructs of speaking or writing are, rather than being informed by language use (McNamara, 2001). Although both instructors question the validity of the CBET and its surrounding policies in different ways, the result is that they take actions to mitigate potential harm from the test and testing policies. This finding fits with Choi’s (2017) study showing teachers who contested, subverted, or otherwise resisted policies which they deemed to be nonvalid and harmful. At a microlevel within the classroom, or within the rater context, these instructors are policy makers. This bottom-up view of policy making is important to take into consideration, as it happens whether it is officially condoned or not.

The narratives analyzed in this study help show how we may begin to build a critical understanding of validity as something constructed not just from the top-down, but as something that needs to be understood also from the perspective of those impacted by testing and testing policies. To return to a point made earlier in this volume by Weideman and Deygers, in order to create tests that are just, language testers must be aware of how power works in society and how policies are crafted around testing, and must be willing to engage with policy makers. It is also important to note that even when stakeholders like students and instructors are not officially involved in test design or policy making, they are actively involved in constructing and maintaining the construct of validity. Validity arguments, from a critical perspective, should foreground the voices of those most impacted by testing and testing policies and work to erase, to the extent possible, power imbalances in the process.

Implications

There is a clear need for more democratic assessment practices that consider test-taker rights and account for the voices and experiences of test-takers and educators impacted by language testing and testing policies. Shohamy (2001a) laid out several principles of democratic assessment, and practices that could be implemented, including investigating consequences of testing, involving different voices in assessment processes, and protecting the rights of test-takers. She also developed a detailed list of test-taker rights, starting with a right to question the uses of tests. Fully embracing a democratic


approach would radically alter the status quo of language testing. Broadfoot once imagined what this could be, writing:

If, as is happening in a few isolated experiments already, we allowed pupils and students themselves to define and assess achievement according to their own individual and group cultural values, we would be allowing them to legitimate in many cases a different and potentially conflictual system of school values, behavioral norms and curriculum content with all that implies for the longer-term processes of social reproduction and social control.
(1979, p. 122)

The imagining of democratic assessment practices is not new and has been attempted, as Broadfoot (1979) found, in various small-scale experiments.

The other implication of this study is the need for more qualitative research that shows the human consequences of testing, foregrounding the voices of those negatively impacted by outcomes and often silenced or overlooked. We need studies that go beyond surveys and better capture the context of testing if we are to fully appreciate the consequences of and responses to testing. By hearing the emic perspectives of those impacted by assessments, we begin to understand how validity is constructed or contested from the bottom-up. We also get a sense of how moral judgements of assessments are made and how those judgements influence the responses of those impacted by language tests. Examining positioning through narrative analysis provides one way in which those views can be foregrounded and is a useful tool for investigating the consequences of testing.

Coda

The story of this study was more contentious than the reporting of it may suggest. Underscoring the highly political nature of language testing, several instructors at Siheom University became angry, while we were conducting interviews, that we were doing research they perceived to be critical of the CBET. One of our key institutional gatekeepers requested a follow-up meeting with Gordon to report that some instructors were upset and to confirm that all identities of participants would be protected. The CBET and the policies surrounding it were contentious before our research, with instructors and students advocating behind the scenes for its abolition. Between the time of our study and the publication of this chapter, the CBET or TOEIC minimum score requirement for graduation was dropped. Students at Siheom University now need only to pass mandatory ELP courses to fulfill graduation requirements.


The takeaways from this for us are twofold. First, doing this type of research and taking a critical language testing perspective is not an easy route. There may be many people with strong feelings about the value of the test, and others who have a financial incentive to protect the status quo. Research itself is a political act, working ultimately either to support, reform, or change the status quo, even by seeking to understand it. In our case, simply interviewing students who had failed the assessment was seen by some instructors as showing bias against the assessment. The second takeaway, however, is that despite the fraught nature of this type of research, tests and testing policy can change. This may be especially true at the local or institutional level, where those designing and maintaining policies are in closer proximity to the stakeholders impacted by those policies. We are not suggesting that our research played any role in those changes. More likely, it was simply the advocacy work done by instructors and students.

Recommended further reading, discussion questions and suggested research projects

Further reading

For those interested in reading more about critical language testing, Shohamy’s (2001a, 2001b) work is indispensable. Lynch’s (2001) work is also a useful introduction. For those interested in reading more about narrative analysis, De Fina and Georgakopoulou’s (2011) book is a good starting point. De Fina’s (2009) article gives a good example of how narratives are elicited and co-constructed in interviews. Bamberg’s (1997) article introduces his theory of positioning, and De Fina (2013) offers a useful expansion on his third level of positioning.

Discussion questions

• Think of a local assessment that you are familiar with. Who are the stakeholders impacted by that assessment and the assessment policies around it? Who is involved in creating it and crafting the policies? Which stakeholders have the most voice in that process, and which have the least?
• Think about a local assessment you are familiar with. What steps could be taken to re-develop that assessment democratically? What might the process look like? Who would need to be involved?


• Think of a language assessment you are familiar with. On what basis is that assessment considered valid? Who determines the validity of the assessment?
• Think of a high-stakes assessment you have taken. What were the consequences of that assessment (either achieving or failing to achieve a certain score)? How did you judge the validity of that assessment?
• As a student, have you ever had an instructor take steps to mitigate the impact of an assessment (either an in-class test, a locally created exam, or a larger, standardized test)? What happened? How did you judge the actions of the instructor?

Suggested research projects

• Interview students who failed a local language assessment, either an in-class or a program-level assessment. Interviews could be conducted with the students to elicit narratives to better understand the consequences that failure of the assessment brought.
• Interview instructors who are tasked with either preparing students for an assessment, or whose classes terminate with a language exam. Questions could be crafted to help understand the impact of testing on their courses and how they understand the tests and construct or contest validity.
• Observe language classrooms to gain a better understanding of the context in which instructors are working, and students are studying, and see ways they may be working to respond to impacts of the test. Interviews with students and instructors can be done in conjunction with observations.
• Trace an assessment (either locally developed or a larger, standardized test) through several different contexts to gain an idea of how it is impacting different stakeholders. For instance, to study a national English language exam, interview teachers and students at several different schools in different locations repeatedly throughout the year, both before and after the assessment is given, to see the different ways in which the assessment impacts stakeholders, and how they respond differently to the assessment across their diverse contexts.

Dedication

This chapter is dedicated to the memory of Andrew Langendorfer (1981– 2020), a dedicated teacher and friend, whose help was instrumental in the early stages of this research.


Notes

1 All names are pseudonyms.
2 Some background on these four key participants clarifies the following analysis, although we leave some details intentionally vague to avoid identifying information.

References

Adelswärd, V. (1989). Laughter and dialogue: The social significance of laughter in institutional discourse. Nordic Journal of Linguistics, 12(2), 107–136. https://doi.org/10.1017/S0332586500002018
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly: An International Journal, 2(1), 1–34. https://doi.org/10.1207/s15434311laq0201_1
Bamberg, M. (1997). Positioning between structure and performance. Journal of Narrative and Life History, 7, 335–342. https://doi.org/10.1075/jnlh.7.42pos
Baynham, M. (2011). Stance, positioning, and alignment in narratives of professional experience. Language in Society, 40, 63–74. https://doi.org/10.1017/S0047404510000898
Broadfoot, P. (1979). Assessment, schools, and society. Methuen & Co.
Bruner, J. (1997). A narrative model of self-construction. Annals of the New York Academy of Sciences, 1, 145–161. https://doi.org/10.1111/j.1749-6632.1997.tb48253.x
Cheng, L. (2008). Washback, impact and consequences. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education. Volume 7: Language testing and assessment (2nd ed., pp. 349–364). Springer.
Choi, I. C. (2008). The impact of EFL testing on EFL education in Korea. Language Testing, 25(1), 39–62. https://doi.org/10.1177/0265532207083744
Choi, T. H. (2017). Hidden transcripts of teacher resistance: A case from South Korea. Journal of Education Policy, 32(4), 480–502. https://doi.org/10.1080/02680939.2017.1290280
De Fina, A. (2009). Narratives in interview – The case of accounts: For an interactional approach to narrative genres. Narrative Inquiry, 19(2), 233–258. https://doi.org/10.1075/ni.19.2.03def
De Fina, A. (2013). Positioning level 3: Connecting local identity displays to macro social processes. Narrative Inquiry, 23(1), 40–61. https://doi.org/10.1075/ni.23.1.03de
De Fina, A., & Georgakopoulou, A. (2011). Analyzing narrative: Discourse and sociolinguistic perspectives. Cambridge University Press.
Georgakopoulou, A. (2007). Small stories, interaction and identities (Vol. 8). John Benjamins Publishing.
Hamid, M. O., Hardy, I., & Reyes, V. (2019). Test-takers’ perspectives on a global test of English: Questions of fairness, justice and validity. Language Testing in Asia, 9(1), 1–20. https://doi.org/10.1186/s40468-019-0092-9
Holstein, J., & Gubrium, J. (2004). The active interview. In D. Silverman (Ed.), Qualitative research: Theory, method and practice (pp. 140–161). Sage.


Jeon, J. (2010). Issues for English tests and assessments: A view from Korea. In Y. Moon & B. Spolsky (Eds.), Language assessment in Asia: Local, regional or global? (pp. 53–76). Asia TEFL.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Karataş, T. Ö., & Okan, Z. (2019). The power of language tests in Turkish context: A critical study. Journal of Language and Linguistic Studies, 15(1), 210–230. https://doi.org/10.17263/jlls.547715
Kim, M., Choi, D. I., & Kim, T. Y. (2019). South Korean jobseekers’ perceptions and (de)motivation to study for standardized English tests in neoliberal corporate labor markets. The Asian EFL Journal, 21(1), 84–109.
Konopasky, A. W., & Sheridan, K. M. (2016). Towards a diagnostic toolkit for the language of agency. Mind, Culture, and Activity, 23(2), 108–123. https://doi.org/10.1080/10749039.2015.1128952
Labov, W. (1972). Sociolinguistic patterns. University of Pennsylvania Press.
Labov, W., & Waletzky, J. (1967). Narrative analysis: Oral versions of personal experience. In J. Helm (Ed.), Essays on the verbal and visual arts (pp. 12–44). University of Washington Press. https://doi.org/10.1075/jnlh.7.02nar
Lynch, B. K. (2001). Rethinking assessment from a critical perspective. Language Testing, 18(4), 351–372. https://doi.org/10.1177/026553220101800403
McNamara, T. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18(4), 333–349. https://doi.org/10.1177/026553220101800402
McNamara, T. (2006). Validity in language testing: The challenge of Sam Messick’s legacy. Language Assessment Quarterly: An International Journal, 3(1), 31–51. https://doi.org/10.1207/s15434311laq0301_3
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Blackwell Publishing, Inc.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education and Macmillan.
Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45(1), 35–44. www.jstor.org/stable/27522333
Moss, P. A. (1998). The role of consequences in validity theory. Educational Measurement: Issues and Practice, 17(2), 6–12. https://doi.org/10.1111/j.1745-3992.1998.tb00826.x
Schiffrin, D. (1981). Tense variation in narrative. Language, 57(1), 45–62. https://doi.org/10.2307/414286
Shohamy, E. (2001a). Democratic assessment as an alternative. Language Testing, 18(4), 373–391. https://doi.org/10.1191/026553201682430094
Shohamy, E. (2001b). The power of tests: A critical perspective on the uses of language tests. Longman.
Warner-Garcia, S. (2014). Laughing when nothing’s funny: The pragmatic use of coping laughter in the negotiation of conversational disagreement. Pragmatics, 24(1), 157–180. https://doi.org/10.1075/prag.24.1.07war
Weideman, A., & Deygers, B. (2024). Validity and validation: An alternative perspective. In this volume.


West, G. B., & Thiruchelvam, B. (2018). Impacts of and responses to a university high stakes writing test. In D. Crusan & T. Reucker (Eds.), International political contexts of second language writing assessment (pp. 138–150). Routledge. https://doi.org/10.4324/9781315106069
Wortham, S. (2000). Interactional positioning and narrative self-construction. Narrative Inquiry, 10, 157–184. https://doi.org/10.1075/ni.10.1.11wor
Xerri, D., & Briffa, P. V. (Eds.) (2018). Teacher involvement in high-stakes language testing. Springer.

5
THE ETHICAL POTENTIAL OF L2 PORTFOLIO ASSESSMENT
A critical perspective review
Mitsuko Suzuki

Introduction

As one of the most commonly used alternative assessments, portfolios have been thought to promote a more ethical learning environment than traditional assessment tools. While the validity of a portfolio has been examined from the perspective of traditional testing approaches (e.g., Hamp-Lyons & Condon, 1993; Song & August, 2002; Weigle, 2002), less discussed has been its potential impact on power relations highlighted by critical perspectives of language assessment (e.g., Lynch, 2001; Lynch & Shaw, 2005). This exploratory study, therefore, aimed to critically review empirical studies related to power dynamics in second language (L2) portfolio assessment, as well as different variables that might have influenced their results. After a brief introduction to past L2 portfolio research, the methods section will explain the process the present study undertook to organize and synthesize data from the past literature. In the results section, the methodology used in the primary data is displayed in descriptive statistics, and student participants’ perceptions of power relations during the portfolio assessment are summarized in relation to its major activities (i.e., collection, selection, reflection, feedback, and evaluation). The chapter concludes with a call for L2 portfolio research and instruction from a more critical language point of view, in order to explore and maximize its potential for empowerment, as an ethical aspect of validation.



Portfolio assessment in L2 studies

The portfolio as an alternative assessment

Over the past two decades, portfolios have been increasingly utilized in various L2 and foreign language (FL) learning and teaching contexts to document learners’ language development (e.g., Castañeda & Rodríguez-González, 2011; Lam, 2013; Lo, 2010; Romova & Andrew, 2011), their intercultural awareness raising (e.g., Chui & Dias, 2017; Su, 2011), and teachers’ professional development (e.g., Moore & Bond, 2002). While its definition varies among educators and researchers (Smith & Tillema, 2003), portfolios could be understood as “repositories of artifacts . . . assembled over time as evidence of development, learning, or capability” (Fox, 2017, p. 138).

Regardless of its purpose and format, advocates of portfolio assessment consider three components as key features of effective portfolios: collection, selection, and reflection (Hamp-Lyons, 2003; Hamp-Lyons & Condon, 2000). In a portfolio, learners compile multiple samples of their work from different tasks (i.e., collection). As they keep records of their language use, learners self-assess and monitor their language development (i.e., reflection). Feedback from peers and teachers can provide further insights, especially for learners at the intermediate level of L2 proficiency, as these learners are in the process of developing their metacognitive and metalinguistic skills in their target language (Lam, 2017). Based on these reflective activities, learners can choose samples of work that best represent their language development and achievement, as well as their next language goal (i.e., selection). These three processes are iterative and inter-related, and not necessarily ordered linearly (Lee, 2017).

According to Fox (2017), among different language assessment practices, portfolios are “arguably the most pervasive and influential example of alternative assessment approach” (p. 136). In response to criticism of one-shot, standardized assessments that solely emphasize outcomes, alternative assessment was introduced to L2 and FL learning and teaching, following its implementation in first language (L1) educational settings (e.g., Elbow & Belanoff, 1986). While the purpose and use of alternative assessments have been discussed from various theoretical perspectives, from a critical language testing point of view, this approach is seen as a means to change the traditional roles of students and teachers (Lynch, 2001; O’Malley & Pierce, 1996), and to protect test-takers’ rights:

a test taker should be granted the possibility of being assessed via alternative forms of assessment, different from the traditional “test-only” ones. Such information can be used to counter evidence against decisions based on tests only, especially in testing immigrants who are not familiar with the formats used in the new country or in situations where the format of the item is more appropriate for one culture than another, gender included.
(Shohamy, 2004, p. 87)

Note that here, an alternative assessment is considered a complementary assessment (Shohamy, 1996) that offers a viewpoint different from traditional assessment to gauge student learning. In the case of portfolios, learners assume more responsibilities to monitor, evaluate, and demonstrate their own progress and achievements. Such an approach is expected to balance the power relations between assessment practitioners and test-takers (Lynch, 2001), offering multilingual learners a way to guard themselves from the misuses of tests, which often penalize minority students who deviate from dominant language users (Shohamy, 1996, 2001, 2004).

The validity of portfolio assessments

Despite their popularity over the past decade (Fox, 2017), the validity of portfolio assessments has long been questioned from positivist viewpoints. Portfolios have been recognized as a trustworthy (Huerta-Macías, 1995; Smith & Tillema, 2003) tool that can assess students' learning progress. Unlike traditional psychometrically oriented approaches such as multiple-choice testing, portfolios allow learners to display an array of learning evidence closely linked to their actual classroom experiences. Although this feature brings authenticity to the assessment, researchers have warned that it is misleading to assume that portfolios are inherently more valid (e.g., Brown & Hudson, 1998; Hamp-Lyons & Condon, 1993; McNamara, 1998). As a number of scholars (e.g., Gearhart & Herman, 1998; Hamp-Lyons & Condon, 1993; Weigle, 2002) have pointed out, assigning a single score to a wide range of tasks and assignments is not easy. In an English composition course, for example, the collection could include multiple genres of writing, with drafts that students revised based on their teacher and peer feedback. From a post-positivist point of view, these features make portfolios a less valid tool for assessing each individual's competence.

In the field of L2 assessment, a number of empirical studies have attempted to respond to these criticisms (e.g., Hamp-Lyons & Condon, 2000; Lallmamode et al., 2016; Renfrow, 2004; Song & August, 2002). At the University of Michigan, for instance, portfolios predicted students' proficiency level as well as or better than timed writing tests (Hamp-Lyons & Condon, 2000). Similar findings have been reported from other higher L2 education institutions (e.g., Renfrow, 2004; Song & August, 2002). Scholars have also examined the construct validity of portfolio assessments. Lallmamode et al. (2016), for instance, developed and validated a writing e-portfolio scoring rubric using Bachman's (2005) Assessment Use Argument (AUA) framework.

However, another issue that is less questioned, yet crucial to examine, is the degree to which portfolios can serve as a valid tool to empower learners. From a critical language testing point of view, power relations are one of the key ethical and validity issues to consider in portfolio assessments (e.g., Lynch, 2001; Lynch & Shaw, 2005). Portfolios require learners to become participants in the assessment process who actively monitor and evaluate their own learning; that is, they support values such as learner autonomy and shared authority over assessment. In theory, this practice can potentially help learners enhance their sense of control and ownership in their learning and assessment, empowering them as active agents and producers of language in civil society. These dynamic and mobile power relations (Foucault, 1982) stand in direct contrast to the hierarchical power relations that traditional testing culture entails (Lynch, 2001; Lynch & Shaw, 2005; Shohamy, 2001). While such an approach could be regarded as a loss of power for teachers, one can also interpret it as an opportunity for learner empowerment (Lynch, 2001; O'Malley & Pierce, 1996).

Despite this potential opportunity for student empowerment, empirical studies that have investigated this aspect of portfolio validity are limited. Several L2 portfolio research syntheses have, however, included L2 learners' views of portfolio assessment (e.g., Burner, 2014; Lam, 2017). Burner (2014) concluded that, overall, L2 portfolios facilitate learners' independent learning and self-regulation and, in turn, increase their motivation in writing. On the other hand, Lam's (2017) analysis of the literature suggested that these "Anglophone portfolio models" (p. 92) can promote learner agency in L1 contexts but not necessarily among learners in L2 contexts. Research from Asian English as a Foreign Language (EFL) contexts, for instance, has observed learners becoming confused by the practice of critical self-reflection (e.g., Kuo, 2003; Lo, 2007). Such a negative reaction was also observed in Fox and Hartwick's (2011) study in Canada, where English as a Second Language (ESL) learners did not attach any value to portfolios until the teachers tailored their approach to individual needs.

Perhaps the contradictory conclusions of these two syntheses could be attributed to the range of relevant studies the reviews included. While both syntheses were based on careful literature reviews, neither fully reported where the primary studies were conducted or what research designs (e.g., measurement tools) they implemented. To further understand the power dynamics in different contexts of portfolio assessment, these methodological procedures need to be examined together with their results. In addition, it is also important to consider how past researchers have operationalized the concept of "portfolios" in their studies. In practice, portfolio assessment needs to be tailored to the learners in a specific context (Lam, 2017). Teachers may find a less student-centered style of portfolio assessment appropriate for L2 learners, especially those from non-Western cultures and students of lower proficiency, who do not necessarily have previous experience or the meta-language to reflect on and make decisions about their learning (Fox, 2017; Lam, 2017). In such a context, the teacher might decide to restrict the degree of choice learners can exercise. While this modification is an appropriate and ethical decision to make, from a research perspective it becomes all the more important to carefully assess the comparability, reliability, and validity of these empirical portfolio studies before one can claim the benefits of this alternative assessment.

The aim of this study, therefore, is to investigate the extent to which L2 portfolio research has informed its validity from a critical language testing perspective. Here, validity is defined in terms of the evolved power relations, or "the extent to which the participants are empowered to carry out the changes that are made possible through the assessment process" (Lynch & Shaw, 2005, p. 282). Past empirical studies related to L2 learners' perspectives on portfolio assessment are reviewed, with a focus on the following research questions:

1. To date, what methods have been used to observe L2 learners' perceptions in portfolio studies?
2. From the student participants' viewpoint, to what extent were L2 portfolio activities empowering?

Method

Data collection: identifying the source items

To collect the articles, this study used two databases, the Education Resources Information Center (ERIC) and Linguistics and Language Behavior Abstracts (LLBA). These are the two most frequently used electronic databases in the field of applied linguistics, with the most comprehensive coverage of journals (In'nami & Koizumi, 2010). The following keywords were combined or truncated as search terms: portfolio, e-portfolio, alternative assessment, L2 writing, L2 speaking, assessment for learning, self-assessment, dynamic assessment. The journals in these databases (e.g., Annual Review of Applied Linguistics, Applied Language Learning, TESOL Quarterly) were also searched manually in order to ensure coverage of the relevant articles. In addition, the reference sections of these articles were consulted to identify any other relevant journals and articles. Finally, state-of-the-art articles, edited books, and book chapters, as well as their reference sections, were included for further data collection. To ensure the quality of the source studies, only peer-reviewed articles were included.
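Purely as an illustration of how such keywords might be combined and truncated into a single database query, a minimal sketch follows. The exact field codes, quotation conventions, and truncation symbols differ between ERIC and LLBA, so the string shown here is an assumption for illustration, not the query actually run in the review.

# Hypothetical reconstruction of the kind of combined, truncated search string
# described above; syntax varies by database, so this is illustrative only.
SEARCH_TERMS = [
    '"portfolio*"',
    '"e-portfolio*"',
    '"alternative assessment"',
    '"L2 writing"',
    '"L2 speaking"',
    '"assessment for learning"',
    '"self-assessment"',
    '"dynamic assessment"',
]

def build_query(terms):
    """Join the (truncated) keyword phrases into a single OR-ed Boolean query."""
    return " OR ".join(terms)

if __name__ == "__main__":
    print(build_query(SEARCH_TERMS))
    # "portfolio*" OR "e-portfolio*" OR "alternative assessment" OR ...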


Inclusion and exclusion criteria

After collecting the relevant studies, each article was examined to determine whether it matched the present study's research questions. As Field and Gillett (2010) pointed out, "[O]ne red sock (bad study) amongst the white clothes (good studies) can ruin the laundry" (p. 668); applying explicit criteria can control the quality of the studies included in a review. By referring to past reviews of L2 portfolios (e.g., Burner, 2014; Lam, 2017) and examining the literature iteratively, the following inclusion criteria emerged:

1. The study was published between 1998 and December 2018. The year 1998 was when Black and Wiliam (1998) published a thorough review article highlighting the effectiveness of formative assessment in improving students' learning. That article also pointed out the necessity of empirical portfolio studies that go beyond teacher reporting. Thus, this time frame seems to cover the relevant research on L2 portfolio assessment.
2. The study empirically examined learners' perceptions of L2 portfolios using qualitative and/or quantitative methods and analysis.
3. The study described the characteristics of the L2 portfolio (e.g., artifacts collected in the portfolio, the degree of freedom students had in their selection of items) in enough detail for them to be coded.
4. The purpose of the portfolio was to develop L2 learners' writing and/or speaking skills.

At the same time, reports were excluded from the analysis for the following reasons:

1. The paper was a review of articles and did not include new empirical data (e.g., Burner, 2014; Lam, 2017).
2. The portfolio did not include any reflective activities. As Hamp-Lyons and Condon (2000) argued, without reflection about one's learning, a portfolio is merely a file of items. Hence, studies that did not report any use of reflective activity were excluded.
3. The study conducted a portfolio assessment for L2 teachers' professional development (e.g., Hung, 2012) or assessment literacy training (e.g., Moore & Bond, 2002). Although this is another interesting aspect of L2 portfolio assessment, it goes beyond the scope of the current study.
4. The study aimed to observe teachers' perceptions of implementing portfolios (e.g., Dooly et al., 2017).
5. The study focused on reporting the validity of L2 portfolio assessment by comparing students' L2 language performances, not their attitudes (e.g., Song & August, 2002).
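To make the screening step concrete, the following minimal sketch shows how the criteria above could be applied programmatically to candidate records. The field names and the example record are hypothetical, introduced purely for illustration; in the review itself, screening was done by reading the articles.

# Minimal sketch of the screening step, assuming each candidate study has been
# summarized as a dictionary. Field names and the example record are hypothetical.
def meets_criteria(study):
    included = (
        1998 <= study["year"] <= 2018                      # inclusion criterion 1
        and study["examines_learner_perceptions"]          # inclusion criterion 2
        and study["portfolio_described_in_detail"]         # inclusion criterion 3
        and study["targets_l2_writing_or_speaking"]        # inclusion criterion 4
    )
    excluded = (
        study["review_only"]                               # exclusion criterion 1
        or not study["includes_reflection"]                # exclusion criterion 2
        or study["focus"] in {"teacher_development",       # exclusion criteria 3-5
                              "teacher_perceptions",
                              "performance_comparison_only"}
    )
    return included and not excluded

candidates = [
    {"year": 2010, "examines_learner_perceptions": True,
     "portfolio_described_in_detail": True, "targets_l2_writing_or_speaking": True,
     "review_only": False, "includes_reflection": True, "focus": "learner_perceptions"},
]
print(sum(meets_criteria(s) for s in candidates))  # the chapter's screening kept 23 of 329 records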

Coding and analyzing data

Table 5.1 summarizes the items that were profiled from the selected portfolio studies. Methodological features recorded the statistical information, while substantive features kept track of independent variables. Following Orwin and Vevea's (2009) suggestion, information missing in the original studies was coded as "unreported." As the current study was rather exploratory, these coding schemes were refined through re-reading the articles, piloting the codes, and re-coding the data. Adopting such an iterative open coding approach enables researchers to derive new, but reliable, results from the data (Sutcliffe et al., 2017).

Methodological features

To answer the first research question (methods used in past studies), methodological features including research design, learner characteristics, measurement, construct of measurement, and analysis and transparency were coded. Regarding research design, the studies' methods were categorized into five approaches: survey-based, qualitative, action research, multimethod, and mixed method. While both multimethod and mixed method studies implemented quantitative and qualitative tools, the former simply used and reported on the two methods separately. When a study integrated the two data sets and combined the results to provide a holistic insight, it was categorized as mixed methods (for a more detailed explanation of mixed and multimethod designs, see Brown, 2014).

The features related to learner characteristics included the following: age, institution (e.g., high school, university), first language, and learning context (i.e., ESL, EFL, Language Other Than English [LOTE]).

The analysis undertaken in these past studies was also coded. The chosen quantitative (e.g., t test, ANOVA) and qualitative (e.g., content analysis) procedures were recorded. Given that addressing reliability and validity is an important way to increase the comparability and applicability of studies, the reliability of measurement and analysis (e.g., inter-coder reliability; Cronbach's α), assumption checking (i.e., whether the assumptions of the statistical tests were checked), and data cleaning (i.e., approaches to dealing with missing data) were also examined. These transparency variables were coded dichotomously (i.e., present = 1 or absent = 0).

TABLE 5.1  Coding of methodological features

Variable                        Coding

Research design
  General design                Within; Between; Other
  Sample size                   Number of participants
  Method                        Survey-based; Qualitative; Action research; Multimethod; Mixed method

Learner characteristics
  L1                            Name of L1 language; "Multiple"; "Unreported"
  Proficiency level             Impressionistic judgment; Institutional status; In-house assessment; Standardized test (Thomas, 1994); "Unreported"
  Age                           Mean and/or range; "Unreported"
  Target language               Name of the target language; "Unreported"
  Institution                   High school; 4-year university; Private language program; "Unreported"
  Learning context (a)          ESL; EFL; LOTE; "Unreported"

Measurement
  Data collection tool          Tool used to examine learners' emotions (e.g., survey, interview)
  Construct of measurement      Theoretical framework or concept the tool is based on; researcher-made

Analysis and transparency
  Quantitative procedure        Statistical test (e.g., t test, ANOVA)
  Qualitative procedure         Qualitative analyses (e.g., content analysis)
  Reliability of measurement*   Report on the reliability of the analysis (e.g., inter-coder reliability, Cronbach's α)
  Assumption*                   Report on whether the researchers checked the assumptions of the statistical tests
  Data cleaning*                Report on what researchers did with missing data and/or outliers

Note: Variables marked with * were coded dichotomously (i.e., present = 1 or absent = 0). (a) ESL = English as a Second Language; EFL = English as a Foreign Language; LOTE = Language Other Than English
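As an illustration of this coding scheme, a single coded study might be represented as the record sketched below. The field names and example values are assumptions made for the sketch, not entries from the review's actual coding sheet.

# Illustrative record for one coded study, following the variables in Table 5.1.
# Values are hypothetical; missing information is coded as "unreported", and the
# transparency variables are dichotomous (1 = reported, 0 = not reported).
coded_study = {
    # Research design
    "general_design": "within",              # within / between / other
    "sample_size": 35,
    "method": "survey-based",
    # Learner characteristics
    "L1": "unreported",
    "proficiency_level": "institutional status",
    "age": "unreported",
    "target_language": "English",
    "institution": "4-year university",
    "learning_context": "EFL",
    # Measurement
    "data_collection_tool": "survey",
    "construct_of_measurement": "researcher-made",
    # Analysis and transparency
    "quantitative_procedure": "t test",
    "qualitative_procedure": None,
    "reliability_of_measurement": 1,
    "assumption": 0,
    "data_cleaning": 0,
}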

For RQ1 (methods of portfolio studies), the coded data were tallied to display how researchers have, overall, examined emotions in their portfolio studies. For RQ2 (L2 learners' perceptions), statistical results (e.g., t-values, F-values, and effect sizes) were recorded, and qualitative data were read and re-read in order to identify the major themes. Initially, these data were classified into the three key portfolio activities, namely collection, selection, and reflection (Hamp-Lyons, 2003; Hamp-Lyons & Condon, 2000), in order to give structure to the data analysis and clearer relevance to the past portfolio literature. However, collection and selection were later combined, since it became hard to tease these two categories apart during the coding process. Furthermore, as some of the findings were related to feedback and grading, "feedback" and "evaluation" were added to these a priori categories.
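The descriptive statistics reported for RQ1 (and summarized in Table 5.2 below) amount to tallying the levels of each coded variable across the 23 selected studies. A minimal sketch of that tallying step follows, assuming a list of coded records like the one sketched after Table 5.1.

# Sketch of the tallying behind the RQ1 descriptive statistics, assuming a list
# of coded study records such as the one sketched earlier.
from collections import Counter

def tally(coded_studies, variable):
    """Return the frequency and percentage of each level of one coded variable."""
    counts = Counter(study[variable] for study in coded_studies)
    total = sum(counts.values())
    return {level: (n, round(100 * n / total, 1)) for level, n in counts.items()}

if __name__ == "__main__":
    demo = [{"method": "survey-based"}, {"method": "qualitative"}, {"method": "survey-based"}]
    print(tally(demo, "method"))
    # {'survey-based': (2, 66.7), 'qualitative': (1, 33.3)}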

Findings

RQ1: methods

Even though there were 329 articles collected initially, only 67 studies had included the assessment of learners' perceptions. The inclusion and exclusion criteria were used to screen these data, leaving 23 items to be synthesized. Table 5.2 summarizes the general research design and learner characteristics. As can be seen in Table 5.2, the selected portfolio scholarship utilized different approaches to investigate L2 learners' perceptions.

TABLE 5.2  Summary of research design and learner characteristics

Design and context                     N     %

General design
  within                               4     17.4%
  between                              7     30.4%
  other                                12    52.2%

Method
  Survey-based                         9     39.1%
  Qualitative                          7     30.4%
  Action research                      2     8.7%
  Mixed-method                         3     13.0%
  Multimethod                          2     8.7%

Institution
  Higher education                     18    78.3%
  Secondary education                  4     17.4%
  Primary education                    1     4.3%

Learning context
  EFL                                  17    73.9%
  ESL                                  3     13.0%
  LOTE (French and/or Spanish)         3     13.0%

Note: L1 = First language; ESL = English as a Second Language; EFL = English as a Foreign Language; LOTE = Language Other Than English


Half of the studies adopted a quasi-experimental design, in which students' attitudes before and after the portfolio implementation (n = 4), or differences between portfolio and non-portfolio groups (n = 7), were examined. Other studies (n = 12) collected their data during the implementation (e.g., reflective journals, mid-term interviews) and/or after it (e.g., post-interview sessions). Out of the 23 studies, the most common research method was survey-based (n = 9), followed by qualitative (n = 7), mixed-method (n = 3), multimethod (n = 2), and action research (n = 2).

As for the participants, most of the research involved L2 learners of English. The majority of the learners were in higher education, including undergraduate and graduate students in Taiwan (n = 4), Iran (n = 3), Hong Kong (n = 2), Lebanon (n = 1), Lithuania (n = 1), China (n = 1), Turkey (n = 2), the Philippines (n = 1), New Zealand (n = 1), and the UK (n = 1). The studies also included learners at the secondary level, learning English in Tehran, Taiwan, Canada, and Finland. One study (Zhang, 2009) was conducted in primary schools located in urban and rural areas of China. As for LOTE contexts, three studies targeted L2 learners of French and/or Spanish: one at the secondary level and two at the higher education level.

The surveys utilized in this research differed among the studies. Some of these surveys were adopted and/or modified from previous L2 studies (n = 5). Interestingly, with the exception of two studies (Nosratinia & Abdi, 2017; Ziegler & Moeller, 2012), these surveys did not seem to be based on any theoretical foundation. Two studies (Aydin, 2010; Yang, 2003) created their own questionnaires based on their own students' feedback prior to the study. Other studies (n = 6) did not offer details on how, or on what theoretical framework, their surveys were developed. Most of these studies (70.2%) reported on the reliability of their surveys (i.e., internal consistency, or Cronbach's α). Descriptive analyses were commonly presented to show the results, but a few studies used t-tests, chi-square tests, ANOVA, and ANCOVA to examine L2 learners' perceptions of portfolios.

As for the qualitative data, various tools were utilized. Common tools were learners' learning journals, portfolio reflections, and group and individual interviews. In the case of qualitative data, however, researchers did not always report on how they analyzed their data. While some studies reported having conducted content analysis, only one study reported inter- and intra-coder reliability.

Prior to discussing the results of the selected portfolio scholarship, it is important to note how the theoretical frameworks that these studies implemented might have influenced their results. For example, some of the studies observed students' experience by focusing on their motivation.


Moeller’s (2012) study, few variables were used to define motivation: goalorientation, task value, test anxiety, control belief, and academic self-efficacy. Out of these variables, task value was the only variable that differed between portfolio and non-portfolio groups (F = 3.39, p = 0.026), with a small effect size (Ziegler & Moeller, 2012). Barrot (2016), on the other hand, had one question in the survey asking whether the e-portfolio motivated the learners to improve their L2 writing. While students agreed to this statement, they also admitted that portfolios played a minimal role in reducing their anxiety in writing. The latter finding contradicted Nosratinia & Abdi’s (2017) claim, although the anxiety framework they employed (Foreign Language Classroom Anxiety Scale; Horwitz, et al., 1986) seemed to be covering a different aspect of L2 anxiety. Furthermore, Lam (2013) also conducted a motivational survey, but their definition of motivation was not explained. Their survey reflected on various aspects of the portfolio assessment process, including the individual teacher-student conferences, self-editing, and interaction with peers. As illustrated in these examples, even if their focus is similar, each study has a different theoretical lens, and therefore, the results may not always be directly comparable. It is also noteworthy to point out how these different theoretical frameworks could have impacted the collected data. Out of 23 articles reviewed, Pearson (2017) and Pollari (2000) explicitly reported to focus on power dynamics in their assessment of portfolios. Problematizing the power structure that standardized tests create in and out of the language classroom, Pearson (2017) investigated how her processfolio could assist in developing learners’ sense of agency and identity as English language writers. Pollari (2000), on the other hand, explored the extent to which portfolios can promote her Finnish secondary students’ sense of control over their learning and ownership in English. While participants in the other studies could have referred to the power dynamics in their portfolio assessment more directly, their insights might have not been totally reflected in the collected data, due to the researchers’ focus and what was determined as highlights of their findings. This limitation will be described again in the discussion section. RQ2: learners’ perceptions in L2 portfolio studies

Many learners in the studies perceived portfolios as a useful tool to identify their strengths and weaknesses (Baturay & Daloǧlu, 2010; Li, 2016; Yang, 2003), improve their L2 writing skills (Burkšaitiene & Teresevičiene, 2008; Chen, 2006; Lam, 2010; Li, 2016), and observe their progress (Barootchi & Keshavarz, 2002; Chen, 2006; Huang & Hung, 2013; Yang, 2003). Despite the benefits that the learners recognized, this alternative way of assessment was not always empowering for them.


While some learners showed an overall positive reaction, others described their experiences as "boring" (Aydin, 2010; Hirvela & Sweetland, 2005), overwhelming (Aydin, 2010; Chen, 2006; Yang, 2003; Zhang, 2009), and imposed (Hirvela & Sweetland, 2005; Kristmanson et al., 2013; Pollari, 2000; Zhang, 2009). As discussed in the following sections, these mixed feelings might be attributed to the various experiences students underwent during the different phases of the portfolio.

Collection and selection of portfolio materials

In the selected portfolio scholarship, students faced different requirements for compiling a portfolio. A few studies let students decide what to collect (n = 4), while most studies had students choose a few representative works (n = 8) or required them to include only specific items (n = 6). While the opportunity for selection did give a sense of ownership to some students (e.g., Chen, 2005), this was not the case for all participants.

In Hirvela and Sweetland's (2005) study, Moto (a Japanese 1.5-generation ESL student) and Shim (a Korean international student) experienced both a collective portfolio (i.e., a comprehensive model that includes every piece of writing in the course) and a showcase portfolio (i.e., a selective model that includes three student-chosen writing samples). Regardless of the portfolio type, both learners showed skepticism toward portfolios. Moto first complained about being forced to choose three items for the showcase portfolio: "Why not four, why not five?" (p. 204). For him, a meaningful portfolio included all of his work, which he could look back on and reflect upon. Moto's opinion matches that of a student in Lam's (2013) study, who doubted whether a few pieces of work could accurately represent his learning process. Interestingly, however, Moto also showed no enthusiasm for the collective portfolio, where he had to collect and reflect on items that he himself did not value. Shim, on the other hand, claimed that the showcase portfolio was unsuitable for showing his language development, while describing the collective portfolio as an "unnecessarily painful" (p. 203) activity. All in all, neither of the learners in Hirvela and Sweetland's (2005) study felt any agency, as portfolios were merely another tool imposed by the teacher to prove their efforts and ability.

Feedback

One of the advantages of portfolio assessment is its feedback-rich environment, in which learners can gain multiple insights from which to revise their work (Hamp-Lyons, 2006; Hamp-Lyons & Condon, 2000). Feedback can potentially facilitate self-assessment, which in turn can promote one's self-efficacy (Lam, 2015). The process can also stimulate positive emotions by creating closer student-student and student-teacher rapport (Hamp-Lyons, 2006).

Among the studies reviewed in this chapter, teacher feedback seemed to generate such positive emotions (Baturay & Daloǧlu, 2010; Lam, 2013; Lam & Lee, 2010; Li, 2016; Romova & Andrew, 2011). In their surveys, Baturay and Daloǧlu (2010) and Lam and Lee (2010) found the "teacher conference" component to be one of the most appreciated elements of portfolio assessment. One of the students in Lam and Lee's (2010) study reflected on her portfolio experience as follows:

I like the teacher consulting section most. It is because it motivates students to seek teachers' feedback about their performances. Students and teachers can have a chance to exchange their ideas. It helps to improve teaching and learning.
(p. 59)

It is interesting that this EFL student perceived a teacher conference as a two-way, dialogic learning opportunity, rather than a unidirectional teacher-to-student delivery of information. Furthermore, an ESL college student from Germany in Romova and Andrew's (2011) study shared his experience as follows:

My writing is not great yet, but I can say one thing: I am actually enjoying it. I struggle to explain why, but I always look forward to getting the teacher's feedback to see if it is the same with what I thought about my text. It gives me confidence when I am able to find flaws myself and I quite like redrafting.
(p. 119)

Although self-assessment is not an easy task for L2 learners (e.g., Kristmanson et al., 2013; Li, 2016), in the case of this ESL student, the teacher's feedback gave him not only confirmation but also confidence in his judgment. Here, the teacher's feedback is perceived as a response to one's writing, to which the learner can respond in turn by reflecting on and revising his text. Note that these student-writers did not always regard their instructors' suggestions as prescriptions with which one has to agree. Lam (2013), for instance, reported how a student, after long consideration, decided not to make any revisions to her essay, as following her instructor's feedback did not seem the best way to reflect her unique voice in the essay. What seems pertinent in these students' reports is that teacher feedback, when it is seen as an opportunity for student-teacher interaction, can play a role in developing students' sense of agency in their L2 writing.


Peer feedback, however, did not always have such a positive influence. In Lam's (2013) study, for instance, peer feedback was perceived differently by learners who experienced a working portfolio and those who experienced a showcase portfolio. While the former group valued their peers' reviews as much as their instructors' comments, the latter group did not show such appreciation. To some extent, this finding matched observations from other studies. For instance, in Barrot's (2016) e-portfolio study on Facebook, students showed discomfort in sharing their final works with their peers. For many students, making their L2 work public on Facebook was an "embarrassing" (Barrot, 2016, p. 295) experience. Some students were also skeptical about their peers' comments, doubting whether their peers had actually read their e-portfolio (Barrot, 2016).

On the other hand, learners who underwent formative peer feedback practices reacted positively toward peer reviews. Although learners tended not to enjoy their peer feedback sessions as much as their teacher feedback (Baturay & Daloǧlu, 2010; Lam, 2013; Lam & Lee, 2010; Pollari, 2000), they found the practice valuable by the end of the course (Chen, 2006; Farahian & Avarzamani, 2018; Lam, 2013; Lam & Lee, 2010). A Taiwanese EFL student in Chen's (2006, p. 6) e-portfolio study posted her view as follows:

In the past, I post the article in this composition site because the professor asks us to do it. . . . At first, I did not feel comfortable to write about my feelings. I thought they're personal and private, should not be read by all in the class. But little by little, when others reply to my article, I always felt warm ad [sic] happy. Sometimes the suggestions which classmates offer are really helpful for me. When I know that certain classmate understand what I really want to express, I am moved. . . . Now, I like this place very much. I can post my feelings or experiences to share with others.

As in Barrot's (2016) study, this Taiwanese EFL learner initially felt forced to share her L2 writing. However, through receiving ongoing feedback and revising her texts, the portfolio became a platform in which she could develop respect for her own L2 writing with the help of others.

It is important to note, though, that providing feedback can be a different experience from receiving feedback. In Aydin's (2010) study, for instance, the majority of learners did not feel any anxiety about receiving negative reviews, but around one-fourth of them found giving peer feedback demanding. Similarly, in Li's (2016) EFL portfolio study, learners with lower English proficiency expressed discomfort in assessing their peers' work, due to their lack of confidence in their L2 skills and knowledge. Such a tense feeling was not reported by the more proficient learners (Li, 2016).


Reflection

In many studies (Burkšaitiene & Teresevičiene, 2008; Chen, 2005; Kristmanson et al., 2013; Li, 2016; Paesani, 2006; Pearson, 2017; Romova & Andrew, 2011; Yang, 2003), learners reported enjoyment in reflecting on their learning. As these students reflected on the accumulating evidence in their portfolios, they were able to visualize and witness their growth, and this practice fostered confidence in their learning. An Ethiopian ESL student in Romova and Andrew's (2011, p. 119) study shared his experience as follows:

the process shows about me, not only about typical students' errors and mistakes.

This excerpt illustrates how self-reflection facilitated the learner's metacognition in L2 writing. Overall, these voices support Burkšaitiene and Teresevičiene's (2008) portfolio study, which found a moderate correlation between learners' satisfaction and the perceived usefulness of the portfolio as a tool to document one's progress (r = 0.525, p = 0.00). At the same time, for some students the portfolio was also perceived as a space to express one's emotions and identity as an L2 writer (Pearson, 2017; Romova & Andrew, 2011). In Pearson's (2017) study, learners used words such as frustration and painful as they negotiated their identity as L2 writers (Canagarajah, 1999). According to Pearson (2017), the integration of a portfolio into the curriculum removed the fear of evaluation and created a safe place in which learners could "feel free" (p. 17).

Such engagement, however, was not reported by students who perceived self-reflection as part of their grade. This tendency was observed in secondary education (Kristmanson et al., 2013; Pollari, 2000) and ESL preparatory courses (Hirvela & Sweetland, 2005), where grades played an important role in students' immediate academic success. Daniel, a high school student in Kristmanson et al.'s (2013, p. 478) study, expressed the anxiety he kept feeling during the self-assessment:

The [language portfolio] kind of puts a lot of pressure on you. . . . If you don't feel like you're doing so well, if you should be at this stage at this time.

Note that this statement is in direct opposition to the student agency that advocates of portfolios claim. In Kristmanson et al.'s (2013) study, Daniel and other students were instructed to assess their skills using the CEFR scale, a criterion-referenced performance scale published by the Council of Europe. As one of the students pointed out, such a restricted, hierarchical learning goal "takes the fun out" of the activity (Kristmanson et al., 2013, p. 477) for some students. Discomfort in self-assessment was also reported by some Finnish EFL secondary school students, who also compiled a journal as part of their portfolio:

Keeping this log really gets to me. Namely, I'd rather not bother with writing anything here and, moreover, I hate the fact that this will be read. Feels somehow like my privacy would be violated.
(Pollari, 2000, p. 115)

Unlike the students in Romova and Andrew's (2011) and Pearson's (2017) studies, for this student self-reflection was not a comfortable, safe space in which to express one's emotions. It seems that this student was resisting the practice of presenting one's reflective writing to the teacher, not necessarily the practice of reflection itself.

Evaluation

Despite the theorists’ belief that delayed evaluation can shift students’ focus on learning, students felt anxious and awkward to wait for their grades until the end of the semester (Bahous, 2008; Lam & Lee, 2010). In Lam & Lee’s (2010) study, the learners explained their reasons as follows: And I think grade can serve as a performance indicator. If I usually get grade B and suddenly get a D, then I know I have to work harder. If I get a D and my classmates get a B, then I will have strong motivation to work harder. (p. 61) For these learners, having an interim grade was preferable, as they can have a better sense of their progress. In Bahous’s (2008) study, there were actually students who got disappointed with their final grades, as the score was different from what they expected. While portfolio assessment is in theory more motivating and empowering than traditional assessments, some studies did not find a clear preference among the learners (Bahous, 2008; Baturay & Daloǧlu, 2010; Chen, 2006; Li, 2016). In Chen’s (2006) survey, for instance, a third of her Taiwanese students felt that a portfolio is a better assessment tool than traditional tests, while others believed traditional assessments were better (36%) or showed uncertainty (29%). For these learners, preparing a portfolio did not necessarily lower their anxiety level in their L2 classroom (Chen, 2006). Such

122  Mitsuko Suzuki

a tendency was more prevalent among lower proficiency learners (Chen, 2006; Li, 2016; Pollari, 2000). On the other hand, for more-proficient learners, the portfolio was not necessarily motivating than traditional test for a different reason: I am already the best one in my class. The portfolio is useless for me. But it may help those weak ones. (Zhang, 2009, p. 106) For these learners, a portfolio was neither an empowering or dis-empowering task, as they had already acquired the language proficiency and skills to display in the portfolio (Chen, 2006; Pollari, 2000). Learners also reported how portfolios increased their confidence in L2 writing (Chen, 2006; Pearson, 2017), even though it did not mitigate their focus on grades and high-stakes assessments (Hirvela & Sweetland, 2005; Kristmanson et al., 2013, Pearson, 2017; Zhang, 2009). An international postgraduate student in Pearson’s (2017, p.  16) study admitted how, at the end of the day, high-stakes standardized assessment mattered than a portfolio: I cannot feel part of this city because I might be excluded any time I fail an exam. While this learner saw the value of keeping a portfolio, he could not ignore the fact that without passing the government-approved benchmark test, he would lose his visa status as an international student. A  stronger test-oriented comment was made by a Chinese EFL primary school student in Zhang’s (2009, p. 106) empirical study: All the things we do with the portfolio are to earn good grades for the middle-term tests and final tests. Clearly, the focus here is to comply with the institution’s requirement. Zhang (2009) also observed how parents’ lack of understanding, as well as focus on grades, influenced the learners’ experience: Li Jiaqi (one of her group members) is not willing to show me his portfolio. Each time I ask him to hand it in, he makes an excuse by saying that he has left it at home. . . . [he does not want to share because] his parents often wrote negative feedback in his portfolio and it hurt him. (p. 106)


For Li’s parents, the portfolio might have been merely a surveillance tool to monitor and control their child’s behavior. Such a lack of support from the stakeholders may have created power relations not different from highstakes assessment. Discussion

The present study explored the power relations evolving in portfolio assessment. The methods implemented in the literature (RQ1) were mostly survey-based or qualitative, often conducted in higher education EFL settings. The surveys used in the studies were not always clear about their underlying theories, and their results were often presented only as descriptive statistics. The qualitative studies, on the other hand, relied on various sources of information such as reflective journals, portfolio artifacts, and interviews; their methods of analysis, however, were not always clearly reported. Despite the recent trend of applying mixed methods in portfolio studies (Fox, 2017), only three mixed-method studies were included in this review. While combining insights from this array of studies helped reveal L2 learners' views on portfolios in a more holistic way, it has to be acknowledged that this inconsistency of theoretical and methodological approaches made it difficult to draw definite conclusions.

Although other eligible studies might have explored power relations in L2 portfolios more directly, the 23 studies included in this review were chosen to ensure the quality of the synthesis. While it would have been preferable to reanalyze their raw data, due to time and resource constraints the current study was based on the original articles' self-reported research designs and findings. This limitation might be a fundamental issue common to synthesis studies, not specific to this review alone.

Nevertheless, the current study was able to explore the validity of L2 portfolios from a critical language testing viewpoint. As researchers have warned (e.g., Fox, 2017; Lam, 2017), the selected scholarship suggested that portfolios do not automatically guarantee ethical or valid assessment for L2 learners. A careful review of the past literature further revealed how power relations changed across the different phases of the assessment (RQ2). Within each phase, learners' reactions differed, depending on their characteristics and learning contexts. While some found practices such as self-reflection and feedback sessions intriguing, others did not enjoy these activities; for the latter students, portfolio assessment was not empowering but rather as imposing as traditional testing. Hence, the studies reviewed in this chapter provided different interpretations of the potential of portfolios as an empowering tool.


Pedagogical implication

While research on how to implement portfolio assessment from a critical point of view is still limited, the results of this study offer several pedagogical implications. Overall, the present study seems to indicate that portfolios are not an inherently valid tool that can automatically change teacher-student relationships and empower L2 learners. The present review showed that various features of portfolios, as well as learner characteristics, influence learners' engagement in L2 portfolios. While dealing with such complexity is challenging, one pedagogical practice teachers can undertake is the provision of a space for negotiation with students. The degree of negotiation will depend on the institutional constraints on the classroom. However, teachers can ask for students' preferences regarding different aspects of portfolio assessment. For instance, in Hirvela and Sweetland's (2005) study, student participants were upset about the limited control they had over the content of their portfolio. To avoid such a situation, teachers may ask for students' preferences and discuss what format would best display their L2 learning process. Such an accommodation is consistent with the practice of critical language pedagogy, which aims to create a democratic L2 classroom through participatory practices such as a negotiated syllabus (Crookes, 2013).

Negotiation may also relieve students' concerns about privacy in portfolio assessment. As some of the studies (Barrot, 2016; Chen, 2006) reported, technology such as e-portfolios is convenient but can also be threatening, as it increases the level of teacher surveillance (Carlson & Albright, 2012). An opportunity to negotiate the degree to which students are comfortable sharing their work with others could potentially reduce teachers' control and encourage students to take more initiative in the assessment process.

At the same time, it is important to recognize that not only the learners but also the instructors themselves are under institutional pressure. Studies have shown how parents (Zhang, 2009) and institutional requirements (Hirvela & Sweetland, 2005; Kristmanson et al., 2013; Pearson, 2017; Zhang, 2009) can limit students' sense of control over their learning and assessment. These findings imply that negotiations are never just between the student and the instructor: many stakeholders' expectations structure those negotiations, with the students having the least power in the system. These studies highlight the importance of expanding the negotiation practice outside the classroom. Such negotiation can include key stakeholders such as other teachers and, in some cases, administrators, parents, and politicians. While these stakeholders may refuse to make any changes, the discussion can at least initiate a dialogue to create an assessment culture that takes students into account.


Apart from these negotiation practices, it seems crucial to offer peer review training, in which students can develop effective language skills and strategies for giving feedback to each other. In the selected portfolio scholarship, the majority of learners appreciated the dialogic interaction afforded by teacher feedback (e.g., Lam, 2013; Lam & Lee, 2010; Romova & Andrew, 2011). A student in Lam's (2013) study even reported how teacher feedback encouraged her to negotiate her voice and reject the instructor's suggestion. In contrast, students tended to feel less effective in providing and receiving peer feedback, especially when the feedback was for summative purposes (Barrot, 2016; Lam, 2013). Such a reluctant attitude may partially stem from their limited metalinguistic knowledge and skills (e.g., Li, 2016). While a number of training procedures and guidelines have been provided by L2 scholars (e.g., Lam, 2010; Min, 2005), perhaps the key is to practice these skills repeatedly, as studies indicated that learners gradually developed trust in formative peer feedback over time (e.g., Chen, 2006). Equipping students with the necessary language skills can facilitate the feedback activity and promote further student interaction. Such scaffolding in the assessment process, in turn, may transform the hierarchical power relations that teacher-directed portfolios can impose on students into more balanced ones. While such an approach could be regarded as a loss of power for teachers, one can also interpret it as an opportunity for student empowerment.

Implication for future studies

While empirical L2 portfolio studies that aim to examine the social dimension of validity are still rare, the results of this study point to the role of the micro- and macro-environment when conducting such studies. The results revealed how learner characteristics (e.g., proficiency level) and learning contexts (e.g., an emphasis on traditional testing) can influence learners' attitudes toward portfolios. However, most of the past literature did not include variables beyond the classroom environment in its analysis. Zhang's (2009) empirical study was an exception, in which the researcher compared portfolio practices in Chinese schools located in urban and rural areas. The study depicted how Chinese primary learners' attitudes toward portfolios were affected by their parents' reactions and teachers' practices, which were influenced not only by the resources available but also by the strong assessment discourse prevailing in the nation. Such an approach will offer a more critical understanding of how power relations evolve in L2 portfolio assessment.

Lastly, the present study examined how the past literature has examined L2 learners' perceptions. To our surprise, many of the studies had to be excluded because they did not report on portfolio features. Within the selected scholarship, survey-based studies often did not report on how their tools were developed or what underlying theoretical framework the survey was based on. Qualitative studies, on the other hand, tended not to report on their analytical procedures. Reporting these features will help increase the quality and comparability of the studies in the field. In addition, conducting mixed-method studies can offer a more holistic and multidimensional understanding of students' experience.

Recommended further reading, discussion questions and suggested research projects

Further reading

For those who are interested in reading more about the validity of alternative assessment from a critical perspective, Shohamy's (2001) book is essential. Lynch and Shaw's (2005) article further develops her discussion with a focus on portfolios.

Lynch, B., & Shaw, P. (2005). Portfolios, power, and ethics. TESOL Quarterly, 39(2), 263–297. https://doi.org/10.2307/3588311
Shohamy, E. (2001). The power of tests: A critical perspective on the uses of language tests. Longman.

Discussion questions

1. Reflect on your past portfolio experience, if any. In your opinion, to what extent was the portfolio valid (i.e., empowering for students)? Are there any changes you would like to make? If you do not have any experience, to what extent do you think portfolios would be a valid assessment tool in your L2 teaching and learning context?
2. How can classroom teachers assess the validity of their testing practice from a critical language testing perspective? What types of evidence can be used to make judgments about learner empowerment?

Suggested research projects

Design tools (e.g., a survey, interview questions) that could examine L2 learners' viewpoints on an alternative assessment, with a special focus on the extent to which that assessment tool is empowering for them. As this chapter has suggested, try to encompass variables beyond the classroom environment (e.g., teachers, parents, educational and governmental policy) in your study.

References

List of studies included in the review

Aydin, S. (2010). EFL writers' perceptions of portfolio keeping. Assessing Writing, 15(3), 194–203. https://doi.org/10.1016/j.asw.2010.08.001


Bahous, R. (2008). The self-assessed portfolio: A  case study. Assessment and Evaluation in Higher Education, 33(4), 381–393. https://doi.org/10.1080/ 02602930701562866 Barrot, J. S. (2016). Using Facebook-based e-portfolio in ESL writing classrooms: Impact and challenges. Language, Culture and Curriculum, 29(3), 286–301. https://doi.org/10.1080/07908318.2016.1143481 Barootchi, N.,  & Keshavarz, M. H. (2002). Assessment of achievement through portfolios and teacher-made tests. Educational Research, 44(3), 279–288. https://doi.org/10.1080/00131880210135313 Baturay, M. H.,  & Daloǧlu, A. (2010). E-portfolio assessment in an online English language course. Computer Assisted Language Learning, 23(5), 413–428. https://doi.org/10.1080/09588221.2010.520671 Burkšaitiene, N., & Teresevičiene, M. (2008). Integrating alternative learning and assessment in a course of English for law students. Assessment and Evaluation in Higher Education, 33(2), 155–166. https://doi.org/10.1080/02602930601125699 Chen, Y. M. (2005). Electronic portfolios and literacy development: A course design for EFL university students. Teaching English with Technology, 5(3). Chen, Y. M. (2006). EFL instruction and assessment with portfolios: A case study in Taiwan. Asian EFL Journal, 8(1), 69–96. Farahian, M., & Avarzamani, F. (2018). The impact of portfolio on EFL learners’ metacognition and writing performance. Cogent Education, 5(1), 2–21. https:// doi.org/10.1080/2331186X.2018.1450918 Hirvela, A., & Sweetland, Y. L. (2005). Two case studies of L2 writers’ experiences across learning-directed portfolio contexts. Assessing Writing, 10(3), 192–213. https://doi.org/10.1016/j.asw.2005.07.001 Huang, H. T. D., & Hung, S. T. A. (2013). Effects of electronic portfolios on EFL oral performance. Asian EFL Journal, 15(3), 192–212. Kristmanson, P., Lafargue, C., & Culligan, K. (2013). Experiences with autonomy: learners’ voices on language learning. Canadian Modern Language Review, 69(4), 462–486. https://doi.org/10.3138/cmlr.1723.462 Lam, R. (2013). Two portfolio systems: EFL students’ perceptions of writing ability, text improvement, and feedback. Assessing Writing, 18(2), 132–153. https:// doi.org/10.1016/j.asw.2012.10.003 Lam, R., & Lee, I. (2010). Balancing dual functions of portfolio assessment. ELT Journal, 64(1), 54–64. https://doi.org/10.1093/elt/ccp024 Li, Q. (2016). Chinese EFL learners’ perceptions of their experience with portfoliobased writing assessment: a case study. Chinese Journal of Applied Linguistics, 39(2), 215–247. https://doi.org/10.1515/cjal-2016-0014 Nosratinia, M., & Abdi, F. (2017). The comparative effect of portfolio and summative assessments on EFL learners’ writing ability, anxiety, and autonomy. Journal of Language Teaching and Research, 8(4), 823–834. http://dx.doi. org/10.17507/jltr.0804.24 Paesani, K. (2006). “Exercices de style”: Developing multiple competencies through a writing portfolio. Foreign Language Annals, 39(4), 618–639. https://doi. org/10.1111/j.1944-9720.2006.tb02280.x Pearson, J. (2017). Processfolio: Uniting academic literacies and critical emancipatory action research for practitioner-led inquiry into EAP writing assessment. Critical Inquiry in Language Studies, 14(2–3), 158–181. https://doi.org/10.1 080/15427587.2017.1279544


Pollari, P. (2000). “This is my portfolio”: Portfolios in upper secondary school English studies (ED450415). Institute for Educational Research. Romova, Z., & Andrew, M. (2011). Teaching and assessing academic writing via the portfolio: Benefits for learners of English as an additional language. Assessing Writing, 16(2), 111–122. https://doi.org/10.1016/j.asw.2011.02.005 Yang, N. (2003). Integrating portfolios into learning strategy-based instruction for EFL college students. International Review of Applied Linguistics in Language Teaching, 41(4), 293–317. https://doi.org/10.1515/iral.2003.014 Zhang, S. (2009). Has portfolio assessment become common practice in EFL classrooms? Empirical studies from China. English Language Teaching, 2(2), 98–118. https://doi.org/10.5539/elt.v2n2p98 Ziegler, N. A., & Moeller, A. J. (2012). Increasing self-regulated learning through the LinguaFolio. Foreign Language Annals, 45(3), 330–348. https://doi. org/10.1111/j.1944-9720.2012.01205.x

General references

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34. https://doi.org/10.1207/s15434311laq0201_1
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74. http://dx.doi.org/10.1080/0969595980050102
Brown, J. D. (2014). Mixed methods research for TESOL. Edinburgh University Press.
Brown, J. D., & Hudson, T. (1998). The alternatives in language assessment. TESOL Quarterly, 32(4), 653–675. https://doi.org/10.2307/3587999
Burner, T. (2014). The potential formative benefits of portfolio assessment in second and foreign language writing contexts: A review of the literature. Studies in Educational Evaluation, 43, 139–149. https://doi.org/10.1016/j.stueduc.2014.03.002
Canagarajah, S. (1999). Resisting linguistic imperialism in English teaching. Oxford University Press.
Carlson, D. L., & Albright, J. (2012). Composing a care of the self: A critical history of writing assessment in secondary English education. Springer Science & Business Media.
Castañeda, M., & Rodríguez-González, E. (2011). L2 speaking self-ability perceptions through multiple video speech drafts. Hispania, 94(3), 483–501. https://doi.org/10.1353/hpn.2011.0066
Chui, C. S., & Dias, C. (2017). The integration of e-portfolios in the foreign language classroom: Towards intercultural and reflective competences. In T. Chaudhuri & B. Cabau (Eds.), E-portfolios in higher education (pp. 53–74). Springer Nature.
Crookes, G. V. (2013). Critical ELT in action: Foundations, promises, praxis. Routledge.
Dooly, M., Calatrava, J. B., Gonzales, A. B., Pedrol, M. C., & Andreu, C. G. (2017). "I don't know": Results of a small-scale survey on teachers' perspectives of the European Language Portfolio. Porta Linguarum, 27, 63–77. https://doi.org/10.30827/Digibug.53952
Elbow, P., & Belanoff, P. (1986). Portfolios as a substitute for proficiency examinations. College Composition and Communication, 37(3), 336–339. https://doi.org/10.2307/358050


Field, A. P., & Gillett, R. (2010). How to do a meta-analysis. British Journal of Mathematical  & Statistical Psychology, 63(3), 665–694. http://doi.org/10.1348/ 000711010x502733 Foucault, M. (1982). The subject and power. Critical Inquiry, 8(4), 777–795. https://doi.org/10.1086/448181 Fox. J. (2017). Using portfolios for assessment/alternative assessment. In E. Shohamy et al. (Eds.), Language testing and assessment: Encyclopedia of language and education (pp. 135–147). Springer. Fox, J., & Hartwick, P. (2011). Taking a diagnostic turn: Reinventing the portfolio in EAP classrooms. In D. Tsagari & I. Csépes (Eds.), Classroom-based language assessment (pp. 47–62). Peter Lang. Gearhart, M.,  & Herman, J. L. (1998). Portfolio assessment: Whose work is it? Issues in the use of classroom assignments for accountability. Educational Assessment, 5(1), 41–55. https://doi.org/10.1207/s15326977ea0501_2 Hamp-Lyons, L. (2003). Writing teachers as assessors of writing. In B. Kroll (Ed.), Exploring the dynamics of second language writing (pp.  162–189). Cambridge University Press. Hamp-Lyons, L. (2006). Feedback in portfolio-based writing courses. In K. Hyland & F. Hyland (Eds.), Feedback in second language writing contexts and issues (pp. 140– 161). Cambridge University Press. Hamp-Lyons, L., & Condon, W. (1993). Questioning assumptions about portfolio-based assessment. College Composition and Communication, 44(2), 176–190. https://doi.org/10.2307/358837 Hamp-Lyons, L., & Condon, W. (2000). Assessing the portfolio: Principles for practice, theory, and research. Hampton Press. Horwitz, E. K., Horwitz, M. B., & Cope, J. (1986). Foreign language classroom anxiety. The Modern Language Journal, 70(2), 125–132. https://doi.org/10.2307/327317 Huerta-Macías, A. (1995). Alternative assessment: Responses to commonly asked questions. TESOL Journal, 5(1), 8–11. Hung, S. T. A. (2012). A washback study on e-portfolio assessment in an English as a Foreign Language teacher preparation program. Computer Assisted Language Learning, 25(1), 21–36. https://doi.org/10.1080/09588221.2010.551756 In’nami, Y., & Koizumi, R. (2010). Database selection guidelines for meta-analysis in applied linguistics. TESOL Quarterly, 44, 169–184. https://doi.org/10.5054/ tq.2010.215253 Kuo, C. (2003). Portfolio: design, implementation and evaluation. In Proceedings of 2003 International Conference and Workshop on TEFL and Applied Linguistics (pp. 198–203). Ming Chuan University. Lallmamode, S. P., Daud, N. M.,  & Kassim, N. L. A. (2016). Development and initial argument-based validation of a scoring rubric used in the assessment of L2 writing electronic portfolios. Assessing Writing, 30, 44–62. https://doi. org/10.1016/j.asw.2016.06.001 Lam, R. (2010). A peer review training workshop: Coaching students to give and evaluate peer feedback. TESL Canada Journal, 27(2), 114–127. https://doi. org/10.18806/tesl.v27i2.1052 Lam, R. (2015). Feedback about self-regulation: Does it remain an “unfinished business” in portfolio assessment of writing? TESOL Quarterly, 49(2), 402–413. https://doi.org/10.1002/tesq.226


Lam, R. (2017). Taking stock of portfolio assessment scholarship: From research to practice. Assessing Writing, 31, 84–97. https://doi.org/10.1016/j.asw.2016.08.003
Lee, I. (2017). Classroom writing assessment and feedback in L2 school contexts. Springer Nature.
Lo, Y. (2007). Learning how to learn: An action research study on facilitating self-directed learning through teaching English business news. Studies in English Language and Literature, 20, 49–60.
Lo, Y. F. (2010). Implementing reflective portfolios for promoting autonomous learning among EFL college students in Taiwan. Language Teaching Research, 14(1), 77–95. https://doi.org/10.1177/1362168809346509
Lynch, B. K. (2001). Rethinking assessment from a critical perspective. Language Testing, 18(4), 351–372. http://doi.org/10.1177/026553220101800403
Lynch, B., & Shaw, P. (2005). Portfolios, power, and ethics. TESOL Quarterly, 39(2), 263–297. https://doi.org/10.2307/3588311
McNamara, T. (1998). Policy and social considerations in language assessment. Annual Review of Applied Linguistics, 18, 304–319. https://doi.org/10.1017/S0267190500003603
Min, H. T. (2005). Training students to become successful peer reviewers. System, 33, 293–308. https://doi.org/10.1016/j.system.2004.11.003
Moore, Z., & Bond, N. (2002). The use of portfolios for in-service teacher assessment: A case study of foreign language middle-school teachers in Texas. Foreign Language Annals, 35(1), 85–92. https://doi.org/10.1111/j.1944-9720.2002.tb01834.x
O'Malley, J. M., & Pierce, L. V. (1996). Authentic assessment for English language learners: Practical approaches for teachers. Addison-Wesley Publishing.
Orwin, R. G., & Vevea, J. L. (2009). Evaluating coding decisions. In H. Cooper, L. V. Hedges & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (pp. 177–203). Russell Sage Foundation. https://doi.org/10.1177/016327879601900108
Renfrow, M. (2004). Using portfolios to predict proficiency exam scores, timed essay scores and the university grade point averages of second language learners (Publication No. 3185216) [Doctoral dissertation, University of Kansas]. ProQuest Dissertations and Theses Global.
Shohamy, E. (1996). Language testing: Matching assessment procedures with language knowledge. In M. Birenbaum & F. Dochy (Eds.), Alternatives in assessment of achievements, learning processes and prior knowledge (pp. 143–159). Kluwer. https://doi.org/10.1007/978-94-011-0657-3_6
Shohamy, E. (2001). The power of tests: A critical perspective on the uses of language tests (1st ed.). Longman. https://doi.org/10.4324/9781003062318
Shohamy, E. (2004). Assessment in multicultural societies: Applying democratic principles and practices to language testing. In B. Norton & K. Toohey (Eds.), Critical pedagogies and language learning (Cambridge Applied Linguistics, pp. 72–92). Cambridge University Press. https://doi.org/10.1017/CBO9781139524834.005
Smith, K., & Tillema, H. (2003). Clarifying different types of portfolio use. Assessment & Evaluation in Higher Education, 28(6), 625–648. https://doi.org/10.1080/0260293032000130252


Song, B., & August, B. (2002). Using portfolios to assess the writing of ESL students: A powerful alternative? Journal of Second Language Writing, 11(1), 49–72. https://doi.org/10.1016/S1060-3743(02)00053-X
Su, Y. (2011). The effects of the cultural portfolio on cultural and EFL learning in Taiwan's EFL college classes. Language Teaching Research, 15(2), 230–252. https://doi.org/10.1177/1362168810388721
Sutcliffe, K., Oliver, S., & Richardson, M. (2017). Describing and analyzing studies. In D. Gough, S. Oliver & J. Thomas (Eds.), An introduction to systematic reviews (pp. 123–143). Sage.
Thomas, M. (1994). Assessment of L2 proficiency in second language acquisition research. Language Learning, 44, 307–336. https://doi.org/10.1111/j.1467-1770.1994.tb01104.x
Weigle, S. (2002). Assessing writing. Cambridge University Press.

PART III

Sociointeractional perspectives on assessment

6 PORTFOLIO ASSESSMENT
Facilitating language learning in the wild
Elisa Räsänen and Piibi-Kai Kivik

Introduction

College foreign language programs in the United States are increasingly oriented to language proficiency, and the ACTFL OPI is often considered the gold standard of assessment. Concurrently, the integration of language use in the wild into classroom teaching and the education of learners as independent users are considered central to state-of-the-art language instruction aiming at interactional competence, defined as the ability of L2 learners "to engage in the dynamic and context-sensitive coordination of social interaction" (Eskildsen et al., 2019, p. 8) rather than as an individual performance ability. A narrowly defined proficiency orientation and its associated assessment methods have clashed with the goal of achieving L2 interactional competence, a tension that goes back to Kramsch (1986) (see Salaberry & Kunitz, 2019, p. 6). Approaches to the study of interactional competence that take the emic (participant-based) perspective of language use to realize social action (conversation analysis and discourse analysis) have been found incompatible with assessment as an etic judgment of individual performance against standards (Kley, 2019, p. 292). Our chapter explores an alternative approach to classroom assessment: a portfolio that considers students' activities beyond the classroom, in the wild. We study how a portfolio assessment procedure creates a positive washback effect for learning, as it pushes students to use the target language in the wild. The task encourages students to seek out real-life interactions as resources for learning and to focus their attention on specific learning moments within those interactions.

The data we used for this study was collected using a portfolio assignment titled Independent Use Portfolio (see Appendix 1). The portfolio includes students' self-reports of, and documentation of, their independent language use outside of class, as well as their reflections on these events (see also Räsänen, 2021). Independent language use, as employed here, denotes the students' self-directed language use (see e.g., Abar & Loken, 2010) beyond the classroom context, independent of the instructional setting of the language course. The Independent Use Portfolio was implemented in two university programs of less commonly taught languages (Finnish and Estonian at a US university), with a focus on language use in its social context. The portfolio assessment procedure was prompted by a pronounced need to introduce language use in the wild into the curriculum. We, as instructors, observed a mismatch between the curricular goals of our language programs and current assessment methods: while our classes aimed to train students to be engaged, independent learners, we were testing them solely for language proficiency. The pedagogical objective of the portfolio was to enhance language use in authentic contexts and to promote incidental learning (see Lech & Harris, 2019) as well as learner autonomy, which is crucial for successful learning (e.g., Benson & Reinders, 2011). The portfolio was designed to support lifelong learning (Lech & Harris, 2019) and to develop classroom instruction that would contribute to this goal, following the principles of dynamic assessment (Poehner & Lantolf, 2005) and learning-oriented assessment (Turner & Purpura, 2015).

As Purpura (2016, p. 202) notes, most of the research on the social consequences of assessment has focused on test misuse, such as injustices due to decision-making based on high-stakes tests (see this volume). We present a different context and explore the social relevance of language practices in the context of an assessment task. We believe that interactions between novice and expert language users and membership in "communities of practice" (Lave & Wenger, 1991) are central to the development of L2 skills (e.g., L2 socialization approaches in Duff & Talmy, 2011, i.e., social constructivist perspectives on L2 learning). This chapter contributes to the ongoing discussion of assessment and authentic, situated language use in second and foreign language teaching (e.g., McNamara & Roever, 2006; Ross & Kasper, 2013; the chapters in Salaberry & Kunitz, 2019). The portfolio assessment task is designed for students to interact in the target language in the wild and thereby create a washback effect, emphasizing the role of language use in the wild in students' learning.

As teacher-researchers seeking curricular change, we utilized a nexus-analytical approach (Scollon & Scollon, 2004) that enabled us to study our students' interactions and their reports of such interactions in the portfolio, and to 'zoom in' to the analysis of the interaction level. Nexus analysis is an approach that combines ethnography and discourse analysis (see Data and Methods in this chapter) and enables incorporating change into the studied practices. As a result of implementing the Independent Use Portfolio, students added their target language to interactions with people with whom they already had existing relationships or reached out to new contacts in the target language. We ask how the interactions (prompted by the portfolio) impacted their practice and, consequently, created opportunities for learning the target language. We provide detailed analyses of the portfolio reflections to identify learning orientations emerging from these social activities. Our research questions are:

• How did the portfolio assessment task impact students’ target language interactions?

• What kind of learning did the students report happening in those situations?

In the next section, we will outline relevant previous research on portfolios as language class assessment. We will then explain the core principles behind the Independent Use Portfolio and the present study.

Background

Benefits of portfolio assessment and the features of independent use portfolio

Portfolios have been regarded as an alternative, more empowering assessment task compared to traditional testing procedures (Lynch & Shaw, 2005). A portfolio enables students to showcase integrated skills and a wider scope of activities than decontextualized tests (Abrar-ul-Hassan et al., 2021). Portfolios contribute to learner language development, motivation, self-reflection, autonomy, cognition, and metacognition, and to developing a sense of community. They also promote authenticity. (For an overview of e-portfolio research, see Chostelidou & Manoli, 2020, pp. 509–510.) In general, research on portfolio assessment in language instruction in the past ten years has mostly focused on writing instruction (see, e.g., Mak & Wong, 2018) and students' self-regulation and active learning (Mak & Wong, 2018; Yastibas & Yastibas, 2015).

Our study stands out from previous portfolio research studies for the following reasons. First, it uses data from less commonly taught language programs at the university level. Second, it emphasizes the role of language use for interaction. We focus on student-initiated instances of language use, instead of reporting on formal classroom assignments. The Independent Use Portfolio is a showcase portfolio, introducing "examples of a learner's best work" (Abrar-ul-Hassan et al., 2021, p. 3), which means that it does not measure linguistic development, but rather highlights learners' successful language learning experiences. Here, this best work is done in the wild instead of in the classroom, and success is understood in terms of using the target language to achieve interactional objectives. This is different from most of the otherwise comparable portfolio projects in higher education contexts, where the learning goals were defined by the instructor and based on standard formulations of proficiency milestones, with limited opportunities for student-defined artifacts and practices (e.g., Cadd, 2012).

Our Independent Use Portfolio initially drew on the European Language Portfolio (ELP) in emphasizing lifelong learning and including language use that is not connected to students' other homework tasks. A distinctive feature of the ELP is that it does not differentiate between skills obtained from formal classroom instruction and those obtained in more informal settings. The ELP, like our portfolio, offers language learners an opportunity to document and reflect on "language learning and intercultural experiences" (Council of Europe, 2021). It includes three parts: a language passport, a language biography, and a dossier (Cummins & Davesne, 2009). Our portfolio only included the dossier aspect, although some of our classes also wrote language biographies, and it focused on student engagement and effort rather than demonstrated proficiency.

Language learning in the wild as the resource of independent use portfolio

The portfolio was introduced in our classrooms (and subsequently used for this study) as a response to the recent scholarship of language learning in the wild (see Eskildsen et al., 2019 for an overview). The study of second language learning in the wild focuses on language learners’ interactions outside of the classroom (Clark et al., 2011; Hutchins, 1995). As Eskildsen et al. (2019) have put it, researchers “scrutinize learning in everyday mundane situations by means of micro-analyses of how L2 speakers/learners act in the world in concord with others while they accomplish social tasks and move through time and space” and “explore ways in which such L2 speaker experiences can be utilized for classroom purposes” (2019, p.  3). While the present study employs analysis of written exchanges and reflection data instead of oral interaction analysis, it is guided by the research agenda of learning in the wild. Eskildsen et al. (2019) noted that although the term in the wild is often seen as “the antithesis of classroom” (p. 4), the concept is actually more nuanced and there is more of a “gradient” relationship with classroom learning.


In our effort to bring language learning in the wild to the classroom via the Independent Use Portfolio, and subsequently to develop teaching and assessment, we drew on recent advances in integrating research on L2 use in the wild with classroom pedagogy (e.g., the L2 interactional competence studies in Salaberry & Kunitz, 2019). Specifically, the portfolio model was inspired by a pedagogical task that directed students to use the target language in the wild, take their experience back to the classroom in the form of a recording, and reflect on it in a group setting (Lilja & Piirainen-Marsh, 2018, 2019). Lech and Harris (2019) observed that L2 learning in the wild assumes access to an in-person speaker community, which is not always viable, and argued for studying "incidental foreign language contact in unstructured, virtual environments, the virtual wild" (p. 39). One of the areas identified as in need of further study was "measuring students' interest, or lack of, in engaging in – – OILL (Online Informal Learning of Languages) activities" (Lech & Harris, 2019, p. 52) and how effective they are for learning. Also, Cole and Vanderplank (2016, p. 41) suggested investigating how online learning could be combined with and enrich the formal classroom through assignments that take into account learners' individuality in language use that goes beyond the class. Our study contributes to this line of investigation. As our students have limited access to target language communities in person, they often rely on online interactions.

The analyses of naturally occurring interaction can pinpoint actual learning only indirectly, via observable learning: changes in practice. We apply the same principle in the analysis, adopting the emic perspective of the participants as they orient to learnables, defined by Majlesi and Broth (2012, p. 193) as "whatever is interactively established as relevant and developed into a shared pedagogical focus." In the portfolio task, we locate the students' orientation to learnables in the reported interactions, as the shared focus between participants in the samples of interactions and, also, as identified by learners in their subsequent written reflections.

Background principles: dynamic assessment and learning oriented assessment

We utilized principles of dynamic assessment (DA) in the Independent Use Portfolio. Learning in interaction and lifelong learning are at the core of DA. DA approaches language learning from a sociocultural perspective, following the Vygotskian concept of the zone of proximal development: learners should not only perform tasks at their level, but with assistance, they can reach higher-level performances (Poehner, 2008, pp. 5, 12; Poehner & Lantolf, 2005, pp. 233–234). Whereas "assessing without mediation is problematic because it leaves out part of the picture – the future" (Poehner & Lantolf, 2005, p. 251), dynamic assessment includes the concept of lifelong learning. The approach of DA brings about a paradigm shift by enhancing students' development through assessment, whereas more traditional tests and assessments are by nature static (Poehner, 2008, p. 13). Our Independent Use Portfolio follows a similar principle of developing students' skills through an assessment intervention, with the aim of a positive washback effect of assessment on teaching, encouraging students to do more with their target language outside of class. The instructors assess the students' performance based on the level of their reported active engagement and effort.

Learning Oriented Assessment (LOA) (Purpura, 2016; Turner & Purpura, 2015) also puts assessment in the service of learning and relies on evidence elicited in a variety of L2 contexts. LOA employs classroom elicitations in diverse planned and unplanned contexts. It takes the elicitations out of the classroom context and then brings them back to the pedagogical realm. Our portfolio process starts in the classroom, takes learning to the wild, and then brings it back to the classroom as reflection and, potentially, repeated or new encounters. The students have authority and autonomy in choosing the material. Becoming functional in the target language is only possible for autonomous learners (Lech & Harris, 2019), and therefore it is "crucial that teachers of languages are also teachers of skills for continuing one's education in the wild" (p. 40). Thus, the core principle behind the Independent Use Portfolio was to highlight for the students the importance of the activities they do in the target language outside of class, and the potential of those activities for learning. The practices promoting learner autonomy are assigned weight by elevating them to the level of course assessment.

Data and methods

The portfolio process

The context of the study is that of two small, less commonly taught foreign language programs, Finnish and Estonian, at a public university in the United States.1 There are no substantial local target-language speaker communities, and students have limited opportunities for in-person language use in the wild in communicative situations. The two target languages are typologically distant from the students' first language (English) and structurally complex. Estonian and Finnish are typologically closely related (Finnic branch of the Uralic languages). The language programs are comparable in instructional setting and methods, focusing on task-based instruction and target-language interaction. The participants were students in first-, second-, third-, and fourth-year Finnish and first- and second-year Estonian
language classes. As the present study focuses on students’ social activities and their learning orientations instead of development, the students’ proficiency levels were not assessed for the purposes of our study. The present study is part of a larger research study, primarily conducted in Finnish classes with a smaller additional Estonian portfolio corpus. The larger research project includes 99 portfolio entries from 19 students of Finnish, and 21 portfolio entries from 8 students of Estonian. The participants in the study were mostly undergraduate-level students who studied Finnish or Estonian as part of a language requirement in their degree. Some students participated in the research study through several (fall, spring, summer) semesters. We implemented the Independent Use Portfolio in the Finnish/Estonian courses as a continuous homework assessment. An Estonian intensive course also had a final submission. Depending on the class, students were required to submit four to six portfolio entries per semester and the portfolio counted for 5–10% of their final course grade. The instructors assessed the students’ portfolio entries based on engagement and effort (see the full rubric in Appendix 1). It has been found important for foreign language learners’ reflections to be in their L1 to express deeper thoughts (Cadd, 2012, p. 101 on ePortfolio). In the Independent Use Portfolio, the entries were written in both the target language and English, depending on the students’ language skills. Figure 6.1 summarizes the portfolio process (see also Räsänen, 2021):

FIGURE 6.1 The Portfolio Process


The students collected samples of language and self-reflections on their language use in the wild and recorded them in an electronic portfolio (cf. Lilja & Piirainen-Marsh, 2018, 2019, on students recording their authentic interactions for classroom discussion). Students were tasked to collect samples of their language use situations in the form of photos, recordings, screenshots, and text, and they also included drawings and web links. The samples were accompanied by written reflections on the instances of target language use (see Appendix for portfolio instructions). Students received prompt questions that directed them, among other things, to identify what they had found enjoyable or challenging and what they had learned while engaging in activities in the target language. For research purposes, the portfolio data was de-identified before it was shared between the researchers and analyzed to identify common themes.

A nexus-analytical approach to discourse analysis

As teacher-researchers, we utilized nexus analysis (Scollon & Scollon, 2004) both in the data collection and in the analysis, along with discourse analysis (Gee, 2014) to facilitate a closer analysis of the student entries. In nexus analysis, discourse is examined from an ethnographic perspective (also Pietikäinen, 2012). Nexus analysis enabled us to study practices through a three-stage process: 1) engaging, 2) navigating, and 3) changing the nexus of practice. In the first phase, we engaged with the actors (the students) and relevant social actions (instances of their language use in the wild), reported in the portfolio project (Scollon & Scollon, 2004, pp. 153−154). In the navigating phase, we analyzed and categorized the data, looking for trends. We conducted close analyses of the portfolios, and then again 'zoomed out' to see the 'big picture,' in a dynamic process (Scollon & Scollon, 2004, pp. 159−160). Nexus analysis includes change (Scollon & Scollon, 2004, pp. 177−178), which in our study was introduced in two ways: the portfolio prompts students to change their engagement with the target language, and it holds potential for curricular changes.

Nexus analysis involves three central concepts that we used to analyze discourses in the reflection data: discourses in place, interaction order, and historical body. Discourses in place means the nexus of discourses where social action happens. Interaction order refers to social order, hierarchies, and arrangements (Scollon & Scollon, 2004, p. 19). Historical bodies, students' prior expectations and experiences, become salient as students bring their prior language learning experiences into their interactions and reflections. They also address their histories of existing social relationships.

In the analysis, we do not focus on language learning per se, but on how students orient themselves to learning activities and construct meaning retrospectively (Jakonen, 2018) in their written reflections. Locating and naming the discourses and their connections is an important part of the analysis (see also Pietikäinen, 2012, p. 423). We use the emic categories and categorizations that our participants (the student authors of the portfolio entries) themselves applied or that emerged from their text, for instance, presenting themselves as novice learners or peripheral members of the speaking community. We also supplemented the analysis with the ethnographic information available to us (e.g., students' learning histories: beginner or more advanced). In the following section, we introduce the analysis and results of our study.

Analysis and results

Portfolio assessment pushing students to interact in the wild

The portfolio assessment task, focusing on student engagement in the target language in the wild, created a positive washback effect as it led students to seek out interaction partners and new contexts for using the target language. In this section we will focus on how these interactions, prompted by the portfolio, impacted language practice and learning. As described before, the research questions we used to guide our analysis were the following:

• How did the portfolio assessment task impact students’ target language interactions?

• What kind of learning did the students report happening in those situations?

As a result of our analysis, we identified three general types of positive washback:

1) Students pursued existing relationships in the target language (either elicited by the portfolio assessment itself or reported and reflected on by the learner).
2) Students established new connections through the target language.
3) The portfolio became a means of doing learning through reflection.

Pursuing existing relationships in the target language

Eva reports in her portfolio entry how she has started using a new language, Finnish, in communication with a Finnish-speaking friend, and how the relationship further motivates her language practice. She writes that she has added Finnish to her interaction with the friend because of the portfolio. She starts her reflection by describing their relationship:


Excerpt 1. Reflection (Eva/Finnish)2

Minä näytän häntä koko kesän, joka kesä. Kahdeksan vuotta, meidän perheet menemme X:n kesämökki juhannukseen. – – Me tekstimme yksi kertoa koska sanoin että tarvitsen teksti viestiä kurssilta, mutta sitten me haluamme tekstata lisää suomeksi! Minä kaipaan häntä todella paljon.
"I saw her all summer, every summer. For eight years, our families went to X's summer cottage for midsummer. – – We texted one time because I said that I need text messages for class, but then we want to text more in Finnish! I miss her very much."

Eva's reflection suggests an existing, long-term social connection between the two as a reason to continue the chat. In the reflection, she uses the words I miss and a great deal, and says how they have visited their cottage, which in Finnish culture is a place to which close friends are invited, for eight years. This historical body of their mutual history shapes how the interaction proceeds. Eva reports that they texted in Finnish as she needed a sample (for the portfolio), but they want to continue the practice. The use of the inclusive pronoun in we want signals that according to Eva this is a mutually shared commitment. The portfolio has prompted her to add the target language interactions to the relationship. Eva's reported willingness to continue and expand a class activity in her life outside the classroom contrasts with typical course assessment practices where the student's engagement ends with the submission of the assignment. We suggest that being friends with the practice partner has a motivating effect for the student to want to continue the practice. Orientation to the future and lifelong learning are at the core of dynamic assessment (Poehner & Lantolf, 2005, p. 251). An implication of the portfolio assessment is that in addition to students initiating communication in the target language, they create practices that might continue after the portfolio has been completed and the language course is finished.

Further excerpts of Eva's chat reveal that the practice is indeed continuing. The chat chain indicates that the conversation has ended and it has been restarted at least once. Eva wishes the friend good night and after a couple of lines Eva writes:

Excerpt 2. Chat (Eva/Finnish)

haluan textin suomeksi enemmän😂😂 ja hyvää ystävänpäivää💕 “I want to text in Finnish more and happy Valentine’s Day”


With the message, she expresses her interest in continuing to interact in Finnish. The laughing emojis soften Eva's request, to which the friend responds, continuing the chat with several messages. The conversation has restarted, and the friends discuss different day activities, suggesting that the second thread happened on another day. Chatting with a more advanced speaker enables functioning within the zone of proximal development (Poehner, 2008; Poehner & Lantolf, 2005): Eva writes at her own level and the friend responds with more advanced-level language, which is still comprehensible to Eva. The chat exchange does not suggest any communication trouble. Thus, the portfolio task prompts Eva to use Finnish with an expert speaker, who is also personally close to her, pushing her to maximize the opportunities to learn from interaction in the wild.

Another student, Violet, also texted a Finnish friend in Finnish, which resulted in an extensive exchange characterized by sharing. This interaction was prompted by the portfolio task as well. Violet is a third-year Finnish language student who has lived in Finland. Excerpt 3 is the beginning of their chat exchange, transcribed here from the phone screenshot.

Excerpt 3. Chat (Violet/Finnish)

Violet: Moi! Miten menee? 7:46 AM
"Hi! What's up?"
(Five posts in response)
Violet: Voi ei, millaiset kirjoitukset sulla on? 1:18 PM
"Oh no, what kind of writing do you have?"
Violet: Mä oon tosi hyvä, kevätloma alkaa huomenna! Mä aion mennä New Orleansiin kaverieden mukana 1:20 PM
"I am really good, spring break starts tomorrow! I am going to New Orleans with friends"

The chat continues with several messages. In response to Violet's short Hi! What's up? the friend writes five lines of response, including the use of a hyperlink.3 In excerpt 4, Violet reflects on the chat:

Excerpt 4. Reflection (Violet/Finnish)

Mä aloitin keskustelu ja vain kysyin “Miten menee?”. Sitten me keskustelimme meidän elämästä. Esimerkiksi, mä puhuin kesäsuunnitelmasta ja hän puhui penkkareista ja hänen englannin opinnoista. “I  started the conversation and just asked ‘What’s up?’ Then we talked about our lives. For example, I talked about a summer plan and she talked about penkkarit4 and her English studies.”


Violet emphasizes how her short, casual inquiry quickly leads to the friends sharing information about their lives. In the beginning, their interaction order is characterized by Violet asking and the friend responding. The friend's response turns the chat into a dialogical exchange. Not responding and not continuing the conversation would have had real-life social consequences. Although the portfolio instructions value even minimal target language use (greetings), the social concerns of real-life interactions prompted expanded use, as is seen in this case. Chat enables sharing about one's life in a personal and authentic manner (see also Räsänen & Muhonen, 2020). When Violet just asks how the friend is doing, the friend responds by writing about her stress, moving the exchange to a personal level about topics that are currently relevant in the participants' lives. Chatting as a task for the portfolio has initiated a Finnish-language exchange that can then continue outside the assessment submission. The interaction stays in Violet's target language, Finnish, perhaps because of the portfolio. During the long chat exchange (73 turns) that Violet has included in her portfolio, the friend writes one turn in English. In response to this turn, Violet responds in Finnish. Apart from that one turn, the chat continues in Finnish. Violet thus opts out of using English in the chat, possibly because the chat will be part of her portfolio.

In addition to expert speaker partners, some students reached out to peer language learners in their target language. These beginner-level learners, however, seemed to do this specifically to practice the language, and they reported that the use of Finnish restricted their social sharing. For example, Ella and her friend, who are both learners of Finnish, 'kept in touch' in English while practicing Finnish in the same correspondence. Ella, a beginning Finnish language student, has corresponded with an American friend, Kate, who currently lives in Finland. The email correspondence is mostly in English, but there are also Finnish sections. The two languages, Finnish and English, have different functions in the emails. The following is a short excerpt of one of the emails Ella has included in her portfolio:

Excerpt 5. Email (Ella/Finnish)

On + 30 astetta X:ssa tänään (how important is word order in Finnish?) ja on aurinkoista. En pidä koska on syyskuu ja syyskuu on syksyllä ja sysky on usein lämmin tai vileä. Mä olen surullinen takia sää (presumably a case ending goes there, but we haven’t actually explicitly talked about case endings yet). Kuka olen? Olen kaunis, vähän vanha, usein iloinen, nopea ja terve. “It is +30 degrees in X today () and it is sunny. I don’t like because it is September and September is in the Fall and Fall is often warm or cool. I am sad because of the weather (). Who am I? I am pretty, a little old, often happy, fast and healthy.”


The portfolio task encourages Ella to write in Finnish to her friend. However, she is making it clear that the Finnish part of her communication is for practice. In the email, Finnish and English have clearly delineated functions. Finnish language seems to be used in this exchange as language practice, and the use of English serves to maintain Ella’s relationship with Kate as her fully competent self. Ella writes a part of her email in Finnish (see excerpt 5). Ella’s self-description orients to her learner role: it reads like an excerpt from a classroom task and does not serve to relay contextually relevant information about herself. From the reflection and correspondence, it is obvious that Kate knows who she is. The self-description presents Ella as a learner of Finnish practicing her writing skills. In English, she includes another, more abstract and critical voice. She adds meta-comments in English (bracketed off to further background these visually), positioning herself as a (classroom) learner of Finnish. Ella’s Finnish language learner identity is more limited than the English-language meta voice and including the English-language voice helps to maintain the social relationship. Practicing Finnish and sharing Finland-related information seems to connect these two friends. In her response email, Kate responds to Ella’s language choice, replying in Finnish: Oh, you are speaking Finnish!! Awesome.5 Kate positions herself as a fellow learner, comparing their learning experiences and the different instructional contexts: We all learned starting from written Finnish, but it looks like maybe y’all are starting from spoken? Instead of commenting on the content of Ella’s Finnish-language practice, she comments on her language, as a response to the language practice. As Ella is writing as a language learner, Kate also responds to her as a language learner. Writing in Finnish and asking Finnish-related questions is a topic that both participants in the exchange relate to and it serves as point of connection. Ella’s action of discussing the Finnish language and her Finnish course with Kate expands the class to her time outside of the classroom. The distinction between in and outside of class learning becomes blurred (Benson & Reinders, 2011, p. 2). In this example, the primary goal of the interaction is to keep in touch and exchange information among friends, but since they are both learners, they also include Finnish practice. When language practice intersects with social sharing with a friend, there are risks to the authenticity and depth of the exchange, caused by the lack of linguistic resources. The following example also illustrates the social riskiness of language practice with a friend. A beginning Finnish language student, Kim, has emailed her American friend, Tim, in Finnish. Kim’s email follows the typical conventions of a Finnish language email. It includes greetings, and it is a rather personal note sharing information about her own life. We know from Kim’s entry that her friend is also an American learning Finnish, and the email correspondence is part of their language practice. In Excerpt 6, Kim reflects on her email exchange:


Excerpt 6. Reflection (Kim/Finnish)

Since we're friends, I tried to make it a casual email and write about basic things that have been going on in my life lately to keep him updated. I was surprised by how quickly and easily I was able to write this email. – – It was a bit challenging to limit myself to only things we have learned, since I talk to my friends about many other things, but I still found that I had lots to say.

Kim is concerned about maintaining the social relationship with her friend during her language practice. Two discourse elements of effortlessness and challenge emerge: one relating to the language use and the other to the social connection. She highlights how casual, basic, quickly, and easy the writing has been. She, however, finds it challenging to express the thoughts she needs to maintain the already established relationship in the Finnish language, suggesting that in English the friends usually share information at a more abstract level. Kim is making an evaluation of her own language skills and reflecting on her effort to stay in what we would call the zone of proximal development in Vygotskian terms. She notices her limits: by keeping the conversation at a certain linguistic level determined by her current resources in the target language, she also limits the level of social sharing that she is able to do. The challenge is that this interaction is consequential, as all 'real-life' interactions are. Even if the email is for the portfolio, it is still with an actual person in a real-life relationship.

These excerpts illustrated what kind of interactions followed when students, prompted by the portfolio assessment, added a new language to a previously existing relationship. In Eva and Violet's cases, the Finnish language served merely as a medium of interaction, while two friends shared about their lives. Ella and Kim, however, were concerned about interacting in the target language 'as themselves.' In the case of Eva, adding Finnish to her close relationship was motivating, and encouraged her to further the Finnish language interaction. The already existing relationship encouraged her to continue writing in Finnish. Violet engaged in 'real' sharing in Finnish, and the need to maintain the flow of the exchange, and possibly the fact that the interaction was for the portfolio, ensured that it also continued in Finnish. When the peer was also a language learner, as in the cases of Ella and Kim, students were concerned about representing their 'true selves' and maintaining their relationships in the target language. While engaged in language practice, they felt like they were missing out on some aspects of the social relationship. These students made explicit self-assessments of their language level. In Ella's case, being language learners functioned as a mutual interest point, establishing further connections between the two friends. In this section we have focused on instances in which students reached out to their existing contacts in the target language. The following section will feature instances where students used the target language to reach out to new people.

Establishing new connections through the target language

Students also established new contacts in the target language, with people that they had not interacted with previously. The portfolio prompted students to engage in target language conversations. Target language use functioned as a bridge to open interaction with another target language speaker. The interaction would then continue either in English or in the target language. The first example is from a business transaction. Lisa, an experienced language learner but a novice learner of Estonian, included screenshots of email inquiries to the Estonian online bookstore where she had used a greeting. The email that Lisa submitted with her portfolio had a greeting in Estonian, and the rest of the email was in English.

Excerpt 7. Reflection (Lisa/Estonian)

In having to coordinate either my move or textbook order with Estonian, I made sure to at least use 'Tere'. In using even that small bit of Estonian, I found that I received emails back relatively quickly. Even using a few phrases of greeting in Estonian seems to lead to those I'm corresponding with to be even friendly.

The portfolio assessment prompted Lisa to pay attention to her own target language use in a business transaction. While reflecting on one of her first opportunities to connect to the native speaker community, Lisa oriented to the use of the word tere ('hi') as a resource that helped her establish contact with native speakers and create a friendly communication exchange. This required minimal effort but it was a rewarding instance of authentic language use in an institutional (and therefore potentially consequential) exchange. Lisa brings her historical body of intercultural encounters into the virtual transactional exchange where she participates at the very edge of the 'community of practice' of Estonian learners. She ascribes the friendliness of the exchange to her use of the greeting, which serves to further motivate her in the language learning endeavor.

While Lisa's example featured a business transaction, students also reached out to new people in informal settings. By engaging with authentic target-language content, they were not always immediately involved in a dialogic interaction, but they were able to observe the interaction and became exposed to authentic language use (cf. Lech & Harris, 2019 on OILL). The observation would then lead to the student joining the interaction. Maya reported that as she followed comments at an Estonian blog in real time, she came across a thread where participants were taking turns posting lines of a nursery-rhyme. Maya posted a screenshot of the forum interaction that she observed and reflected on her experience (excerpt 8).

Excerpt 8. Reflection (Maya/Estonian)

And it was so bizarre because I could understand every single word, but I had no idea what was going on! I added my own comment to the posts saying just that, and a native Estonian speaker messaged me telling me that it's a well-known song about an elf who lives in a forest and bakes bread?! We ended up having a nice conversation in Estonian, as well, because they were very confused about there being a random American student with zero connection to Estonia learning the language.

Maya's observations of her experience are made salient in her reflection. The song lines consist of a series of questions and answers, which, posted on this online forum by different participants, 'masqueraded' as a regular chat exchange for someone not familiar with the song (here a familiar media format, which normally promotes target language comprehension, is actually misleading). Maya's engagement, which had first started as receptive, reading only, turned spontaneously interactive as she inquired about the situation. During the encounter she forged connections with a community of Estonian speakers (even though these contacts were momentary and fleeting) while positioning herself as an outsider: random American, zero connection. Both Maya and the other posters had a moment of being very confused about each other and resolving the confusion became a memorable experience for Maya.

The two examples in this section featured learners reaching out to new people in the target language in the virtual wild. The portfolio encouraged students to engage with the target language in any way, including minimal conversations (see Appendix 1). Lisa reported one of her early real-life uses of the target language, as she used an Estonian greeting in online business communication. Maya reached out to speakers of Estonian for clarification after observing a thread of blog comments, reported using the target language in a meaningful way and obtained new cultural knowledge.

In this section, we have shown how the Independent Use Portfolio task created a positive washback effect in the form of the students pursuing interactions in the wild. In the case of the students with an existing relationship with a target language speaker, the task prompted them to change the language of communication in that relationship, with variation in the perceived social consequences. In the virtual wild, students also engaged with target language users that they did not know, making initial contacts with the community as peripheral members. The task to use the target language in the wild provided an incentive to make these connections and led them to reflect on the social actions accomplished.

Portfolio as means of doing learning through reflection

The enhanced opportunities for learning from interaction in the wild were provided by the reflection component, which eventually led the students to 'do learning' within the task: students returned to possible learning objects after the interaction had already happened, so the noticing aspect was enhanced. Previous research suggests a connection between portfolios and students' self-regulation and active learning (see e.g., Mak & Wong, 2018; Yastibas & Yastibas, 2015). In this section we will show how the students reported learning in their portfolios.

In their reflections, students brought up language elements, such as vocabulary or structures that they paid attention to (learnables, see Background). Eskildsen et al. (2019, p. 7) argue that although the learnables that are most easily observed in interaction are indeed lexical items, the actual learning targets are not linguistic structures themselves but "appropriating and developing these as resources for action." Students oriented to elements produced by other target language speakers that 'stood out' by being new or interesting. This was, for example, vocabulary or structures that were previously unknown to the student. However, they also reflected on more holistic and discourse-level linguistic phenomena, such as register, if it had become salient for them in their interactions. The following example of a chat excerpt (9) demonstrates Violet's (introduced in the previous section) orientation to a formulaic expression during the chat with her friend.

Excerpt 9. Chat (Violet/Finnish)

Violet: Onks sulla yliopisto suunnitelma? 1:40 PM
"Do you have plans for college?"
Friend: Aion hakea lukemaan biologiaa yliopistoon. 1:49 PM
"I plan to study biology (literally: 'read biology') at college."

Violet writes in her reflection:

Excerpt 10. Reflection (Violet/Finnish)

Keskustelu muistutti mua 'opiskella aihe' suomessa on normaalisti sanottu 'lukea aihe'.
"The conversation reminded me that 'to study a subject' in Finnish is normally said 'to read a subject'."

This excerpt demonstrates the significance of the reflection part in orienting to learnables. By writing about the phrase, Violet indicates that she had previously encountered the formulaic expression and was now reminded of it, contrasting it with English, while seeing it again in the chat. In the portfolio, the student receives a platform to reflect on her remembering a particular vocabulary item as she observed it used by the expert speaker during the chat. The 'recycling' of the language item by re-using it in the reflection passage can be assumed to strengthen learning.

In other instances, the learning orientation emerged from a reported misunderstanding. Certain communication trouble occurred in the original interaction, which made the student pay attention to the specific element where the problem occurred. For example, Jenna reports misusing a word in her phone conversation based on her L1 and finding out that it was confusing to her friend.

Excerpt 11. Reflection (Jenna/Finnish)

Haulaisi sanoi hänestä jotka minä aikoi mene 'kuntosaliin' mutta minä oli sekava ja sanoin 'terveasema' koska minä sanoin 'health center' kun mun puhun englanti, ja puhun kuntosalista.
"I wanted to say to her that I was going to go 'to gym' but I was confusing and said 'health center' because I said 'health center' when I speak English, and I talk about the gym."

The retelling of the conversation misunderstanding, with the trouble-source item spelled out, was likely to contribute to Jenna remembering the word. The reflection gave Jenna the opportunity to process the conversation in retrospect and orient to this element. The example resembles the learning situation described by Lilja and Piirainen-Marsh (2018) where students recounted a communication trouble that had happened in the wild, in order to clarify and learn a new phrase the expert speaker had used.

In addition to vocabulary, students paid attention to more general features of authentic situated language use, such as register. They also wrote about various social expectations of language use in a situation (cf. Compernolle, 2018), in which they made salient their historical bodies as language users: their expectations of how interaction functions. For example, Lucas oriented to the register of language in his reflection on a chat exchange. He is a second-year learner of Finnish and chatted with his friend, also a learner of Finnish. Excerpt 12 is from their chat exchange (the friend's turns are not included):


Excerpt 12. Chat (Lucas/Finnish)

Moi! Mitä kuuluu?
"Hi! How are you?"
(Response)
Olen hyvä, mutta olen väsynyt. Koulu on vaikea tämä vuosi:/
"I am good, but I'm tired. School is hard this year"
(Response)
Minulla oli kevätloma viime viikko! Mä menin kotiin, mutta se oli tosi rentouttava. Milloin on sinun kevätloma?
"I had a spring break last week! I went home, but it was really relaxing. When is your spring break?"
(Response)

In this chat, the two interlocutors mostly use the standard, written variation of Finnish. This is especially noticeable in the use of personal pronouns minulla (form of 'I') and sinun, 'your', which in a chat would normally be their shorter, colloquial counterparts, such as mulla and sun. Apart from a couple of uses of mä (colloquial 'I'), the chat is written in standard Finnish. Lucas addresses this in the reflection:

Excerpt 13. Reflection (Lucas/Finnish)

Kun mä puhuin Jessica, en tiedä, jos mä mun pitäisi käyttää puhekieli tai kirjakieli. Koska me olemme molempia aloittelevat suomalaiset puhujat, me käytimme kirjakieli. Me molemmat kirjoitimme yksinkertaisia lauseita ja kysyimme yksinkertaisia kysymyksiä. Mutta, mä ymmärsin lähes kaikki, mitä hän sanoi.
"When I talked to Jessica, I don't know if I should use spoken or written language. Because we are both beginning speakers of Finnish, we used written language. We both wrote simple sentences and asked simple questions. But I understood almost everything she said."

In his reflection, Lucas oriented to the choice of register in the exchange. Lucas assessed his and his friend's language skills, assuming them to be at a similar level (we – both – we used), and refers to their novice status as a motivation for the written (i.e., the standard) language used in the chat.6 Lucas brought his previous experiences, his historical body, to his judgment about register: he seems to suggest that novice language learners typically are more accustomed to the standard version of the language – colloquial expressions belong to the repertoire of a more advanced learner. The portfolio provided Lucas with a forum to reflect on these choices, and to display his pragmatic and sociocultural competence, making them salient also to his instructor. Traditionally, in proficiency-focused assessment, this knowledge would not be part of language classroom assessment at lower levels of instruction (more in the Discussion section).

The previous examples show that in the context of the portfolio task, the students needed to pay attention to their own and others' language use in the wild, and consequently they were led to 'do' learning. The analysis demonstrated the students' orientation to learnables while they were reflecting on language use in the wild. These learnables were often vocabulary elements. In their reflections, students oriented either to specific vocabulary items that they had encountered before, or to items that had caused a misunderstanding in the original interaction event. Students also oriented to elements such as the register of the language in interaction and explained register choices in their reflections. The written portfolio reflection made salient their learning orientations in the portfolio task. The reflection made students notice and reproduce these elements, leading to an enhanced learning experience.

Discussion

In the preceding analysis, we have shown how the portfolio assessment task (i.e., students' interactions in the wild and how they reported learning in those situations) created a positive washback effect: students sought out more opportunities to use the language and paid attention to the elements of language in these episodes. We will now assess the overall impact of the task and describe the curricular implications of our findings.

The portfolio assessment task prompted students to use the target language in the wild, often in socially relevant interactions. The portfolio thus led students to change the language used in their interactions with target language users with whom they had established social relationships. In some of these cases, the target language served as a (new) medium of interaction, as students continued their regular interactions as friends in the target language instead of English. Some students, however, used the L2 'just' for language practice and made a distinction between this practice component and their social sharing. While some students already had previous target language contacts, students also reached out to new contacts through the target language. Some of these exchanges were purely transactional, while some of them resulted in more profound intercultural exchange. The portfolio prompted them to reflect on their target language engagement.

In their portfolios, students were also tasked to reflect on their learning in the interactions. Students were consequently pushed to engage with the language elements of their exchanges, potentially leading to learning. While reflecting on their language use in the wild, students oriented to certain learnables, such as specific vocabulary items that they perceived as familiar or which had caused misunderstandings. The written portfolio reflection functioned as a platform to return to these learnables. The portfolio assessment
task required the students to reflect on their experience and thus enhance the learning potential of their target language interactions by re-engaging with their experiences at a cognitively higher level and by contextualizing the experiences within their learning process. The portfolio as assessment followed the principles of Dynamic Assessment, through facilitating a positive washback effect of assessment on teaching and learning and emphasizing sociocultural and lifelong learning that continued beyond the classroom (see Poehner, 2008; Poehner & Lantolf, 2005). Reflecting the principles of Learning Oriented Assessment (LOA), the assessment task was employed to serve students’ learning that included their use of the target language in a variety of different learning contexts (Purpura, 2016; Turner & Purpura, 2015). Scollon and Scollon’s (2004) nexus analysis enabled us as teacher-researchers to engage with and navigate our students’ interaction practices in the wild. The portfolio provided us as instructors insight into the students’ target language use outside of the institutional learning situations. Typically for nexus analysis, our study included the aspect of change, as the portfolio was designed to have an immediate impact on the students’ learning practices. The portfolios will also enable creating future data-driven assessment methods (cf. Kley, 2019), such as real-life interactional scenarios and speaking prompts. Another aspect of the change will be future enhancements of the course curricula in response to the results of the study. Nexus analysis uses discourse analysis as a micro-level analysis tool, which enabled us to ‘zoom in’ to the interaction level of the data. It also allowed us to include ethnographic perspectives into the discussion of results, such as the evaluation of how the students’ historical bodies impacted their reflections. As assumed by nexus analysis, we as teacher-researchers needed to constantly reflect on our positionality, as we used the data as assessment in our own classes, and as research data. The study was designed to enable making curricular developments in our language classes and to enhance our pedagogy, which also impacted our research focus and practices. Language learning is a social activity, and the portfolio assignment pushed the learners to expand the range of situations and contexts for their target language use. Those students, who did not have any previous target language contacts, reached out and established new connections in the language. The assessment connected students’ learning inside and outside of class and encouraged them to increase the amount of practice they received during the course and after it. The results of this study indicated that the students found reaching out to previously existing contacts in the target language motivating, because of the social sharing aspect. Arguably connecting socially in order to learn the language seems to be important for our learners’ needs. The portfolio enhanced community building, addressing the social function of foreign language learning for American students: creating ties with other


speakers of the language. The potential of the portfolio to prepare students for the reality of a globalized world was reflected in the multitude of social situations reported on, as well as in the students' observations. The portfolios reflected the range of target language use that the learners would typically engage in (peer conversations, technology-mediated applications), thus aligning the course content better with actual learner needs. Most teaching materials available for learning Finnish and Estonian do not target interactional practices in a systematic manner (especially in online peer interactions). A research-based understanding of these interactions benefits the design of enhanced pedagogical materials, especially with a view to online and hybrid language instruction. Online language teaching requires specific attention to tasks that can be accomplished in a technology-mediated environment. As the opportunities for face-to-face interaction are limited in online instruction, both instruction and assessment regarding L2 interactional skills must be designed for maximum efficiency. Independent use, if well integrated into assessment and teaching, holds promise for this endeavor. In sum, the portfolio described in this chapter raised the learners' awareness of themselves as language users, including in socially situated interactions, and increased their agency in the learning process by pushing them to identify learnables (Eskildsen & Majlesi, 2018). Besides vocabulary items, these included social expectations of language use in a situation (cf. Compernolle, 2018), such as recognizing registers and codeswitching, managing interactions as non-native speakers, recognizing the learning benefit of target language use beyond the classroom, and learning cultural and social phenomena associated with the target language. Furthermore, the portfolio provided opportunities for individualized learning and increased learner autonomy, thus enhancing learner agency. As the students were in charge of selecting the portfolio material, the power dynamic of the language class became increasingly learner-centered (see also Chapter 9 of this volume). Crucially, elevating independent use tasks and reflection to the status of course assessment helped learners realize the importance of these practices for their learning.

Conclusion

Language class assessment focusing solely on proficiency does not sufficiently address the objectives of cultural awareness and interactional competence. In our language programs, however, proficiency and achievement still form the major component of students’ final course grade, but metalinguistic knowledge, engagement, agency, intent, and effort also play a part in assessment via the portfolio task. Furthermore, it is limiting for students (especially in the case of adult learners, such as our university students) to only express what they can produce in the target language.


The portfolio assessment task we described in this chapter makes students' metalinguistic processes part of a pedagogical task and gives them a platform to express complex thoughts about language learning in their native language, something that is not always allowed for in the context of communicative language teaching. The portfolio also addressed the general education goal of foreign language learning: in the context of US university-level language programs, language courses fulfill important educational goals of multicultural communication. Even introductory-level learners made observations about the target language and culture that went beyond what they could produce at their level in the language. This study enhanced our understanding of language students' interactional needs. Based on the student-gathered data, we will put forward criteria and rubrics for future portfolios that assess language use in the wild and reflection on that use. We suggest curricular changes addressing the demonstrated needs, including the development of metacommunicative skills and strategies to manage interaction in the wild (cf. interactional competence). There is also a need to develop summative course assessment that is in line with the principles of Dynamic Assessment and Learning Oriented Assessment, as well as teaching materials, including scenarios, that reflect actual real-life communicative situations for our learner populations. Integrating independent use in the wild with classroom instruction will enable instructors to develop explicit instruction and assessment geared to the needs of real-life use.

Recommended further reading, discussion questions and suggested research projects

Further reading

For those interested in reading more about second language portfolio assessment, the article by Abrar-ul-Hassan et al. (2021) offers a useful overview. For those interested in reading more about dynamic assessment, Poehner's (2008) book is a good starting point. Purpura's (2016) article gives a good overview of second and foreign language assessment. Salaberry and Kunitz's (2019) edited volume incorporates a section on testing, which explores assessment in the context of innovative research-based pedagogy, with a focus on interactional practices.

Abrar-ul-Hassan, S., Douglas, D., & Turner, J. (2021). Revisiting second language portfolio assessment in a new age. System, 103, 102652. https://doi.org/10.1016/j.system.2021.102652
Poehner, M. E. (2008). Dynamic Assessment: A Vygotskian approach to understanding and promoting second language development. Springer. https://doi.org/10.1007/978-0-387-75775-9


Purpura, J. E. (2016). Second and Foreign language assessment. The Modern Language Journal, 100, 190–208. https://doi.org/10.1111/modl.12308 Salaberry, R.,  & Kunitz, S. (Eds.) (2019). Teaching and testing L2 interactional competence: Bridging theory and practice. Routledge.

Discussion questions

1. What is the Independent Use Portfolio, and what kind of alternative approaches does it introduce to language assessment?
2. What kind of washback effect does the Independent Use Portfolio have for a) students' language use in the wild and b) how students learn the language?

Suggested research projects

This chapter encourages educators to bridge the students' language use in the wild with their classroom learning. Design a project in which you:
1) Investigate what your students do in their target language outside of class.
2) Use that information to reform your classroom practices.

Notes

1 The data collection took place before the Covid-19 pandemic.
2 The bolded parts have been added to emphasize what we pay special attention to in the analysis.
3 The friend's responses have been removed.
4 A festivity in Finnish upper secondary schools.
5 Kate's response email is not included in this chapter.
6 Here, written and spoken refer to different registers, not the format.

References Abar, B., & Loken, E. (2010). Self-regulated learning and self-directed study in a pre-college sample. Learning and Individual Differences, 20, 25–29. https:// doi.org/10.1016/j.lindif.2009.09.002 Abrar-ul-Hassan, S., Douglas, D.,  & Turner, J. (2021). Revisiting second language portfolio assessment in a new age. System, 103, 102652. https://doi.org/10.1016/ j.system.2021.102652 Benson, P.,  & Reinders, H. (2011). Introduction. In P. Benson  & H. Reinders (Eds.), Beyond the Language Classroom (pp. 1–6). Palgrave Macmillan. https:// doi.org/10.1057/9780230306790_1 Cadd, M. (2012). The electronic portfolio as assessment tool and more: The Drake University model. IALLT Journal of Language Learning Technologies, 42(1), 96–126. https://doi.org/10.17161/iallt.v42i1.8504


Chostelidou, D., & Manoli, E. (2020). E portfolio as an alternative assessment tool for students with learning differences: A  case study. International Journal for Innovation Education and Research, 8, 507–524. https://doi.org/10.31686/ ijier.vol8.iss5.2369 Clark, B., Wagner, J., Lindemalm, K., & Bendt, O. (2011). Språkskap: Supporting second language learning “in the wild”. INCLUDE. Cole, J., & Vanderplank, R. (2016). Comparing autonomous and class-based learners in Brazil: Evidence for the present-day advantages of informal, out-of-class learning. System, 61, 31–42. https://doi.org/10.1016/j.system.2016.07.007 Compernolle, R. A. (2018). Dynamic strategic interaction scenarios: A Vygotskian approach to focusing on meaning and form. In M. Ahmadian & M. García Mayo (Eds.), Recent perspectives on task-based language learning and teaching (pp. 79–98). De Gruyter Mouton. https://doi.org/10.1515/9781501503399-005 Council of Europe. (2021). European Language Portfolio (ELP), 1 February. www. coe.int/en/web/portfolio. Accessed 02 January 2021. Cummins, P. W.,  & Davesne, C. (2009). Using electronic portfolios for second language assessment. The Modern Language Journal, 93, 848–867. https://doi. org/10.1111/j.1540-4781.2009.00977.x Duff, P. A.,  & Talmy, S. (2011). Language socialization approaches to second language acquisition. Social, cultural, and linguistic development in additional languages. In D. Atkinson (Ed.), Alternative approaches to second language acquisition (pp. 95–116). Taylor & Francis. Eskildsen, S. W., & Majlesi, A. R. (2018). Learnables and teachables in second language talk: Advancing a social reconceptualization of central SLA tenets. Introduction to the Special Issue. Modern Language Journal, 102 (Suppl. 2018), 3–10. https://doi.org/10.1111/modl.12462 Eskildsen, S. W., Pekarek Doehler, S., Piirainen-Marsh, A., & Hellermann, J. (2019). Introduction: On the complex ecology of language learning ‘in the Wild’. In J. Hellermann, S. Eskildsen, S. Pekarek Doehler & A. Piirainen-Marsh (Eds.), Conversation analytic research on learning-in-action (Vol. 38, pp. 1–21). Educational Linguistics, Springer. https://doi.org/10.1007/978-3-030-22165-2_1 Gee, J. P. (2014). An introduction to discourse analysis. Theory and method. Routledge. https://doi.org/10.4324/9781315819679 Hutchins, E. (1995). Cognition in the wild. MIT Press. https://doi.org/10.7551/ mitpress/1881.001.0001 Jakonen, T. (2018). Retrospective orientation to learning activities and achievements as a resource in classroom interaction. Modern Language Journal, 4, 758–774. https://doi.org/10.1111/modl.12513 Kley, K. (2019). What counts as evidence for interactional competence? Developing criteria for a German classroom-based paired speaking project. In R. Salaberry  & S. Kunitz (Eds.), Teaching and testing L2 interactional competence: Bridging theory and practice (pp.  291–381). Routledge. https://doi. org/10.4324/9781315177021-12 Kramsch, C. (1986). From language proficiency to interactional competence. The Modern Language Journal, 70(4), 366–372. https://doi.org/10.1111/j.15404781.1986.tb05291.x Lave, J.,  & Wenger, E. (1991). Situated learning: Legitimate peripheral participation. Cambridge University Press. https://doi.org/10.1017/CBO978051 1815355


Lech, I. B., & Harris, L. N. (2019). Language learning in the virtual wild. In M. Carrió-Pastor (Ed.), Teaching language and teaching literature in virtual environments. (pp. 39–54). Springer. https://doi.org/10.1007/978-981-13-1358-5_3 Lilja, N., & Piirainen-Marsh, A. (2018). Connecting the language classroom and the wild: Re-enactments of language use experiences. Applied Linguistics, 40(4), 594–623. https://doi.org/10.1093/applin/amx045 Lilja, N.,  & Piirainen-Marsh, A. (2019). Making sense of interactional trouble through mobile-supported sharing activities. In R. Salaberry & S. Kunitz (Eds.), Teaching and testing L2 interactional competence: Bridging theory and practice (pp. 260–288). Routledge. https://doi.org/10.4324/9781315177021-11 Lynch, B., & Shaw, P. (2005). Portfolios, power, and ethics. TESOL Quarterly, 39, 263–297. https://doi.org/10.2307/3588311 Majlesi, A. R.,  & Broth, M. (2012). Emergent learnables in second language classroom interaction. Learning, Culture and Social Interaction, 1, 193–207. https://doi.org/10.1016/j.lcsi.2012.08.004 Mak, P., & Wong, K. (2018). Self-regulation through portfolio assessment in writing classrooms. ELT Journal, 72(1). https://doi.org/10.1093/elt/ccx012 McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Blackwell Publishing. Pietikäinen, S. (2012). Kieli-ideologiat arjessa. Neksusanalyysi monikielisen inarinsaamenpuhujan kielielämäkerrassa. Virittäjä, 116(3), 410–442. Poehner, M. E. (2008). Dynamic Assessment: A  Vygotskian approach to understanding and promoting second language development. Springer. https://doi. org/10.1007/978-0-387-75775-9 Poehner, M. E., & Lantolf, J. P. (2005). Dynamic assessment in the language classroom. Language Teaching Research, 9(3), 233–265. https://doi.org/10.1191/ 1362168805lr166oa Purpura, J. E. (2016). Second and Foreign language assessment. The Modern Language Journal, 100, 190–208. https://doi.org/10.1111/modl.12308 Räsänen, E. (2021). Toimijuus ja vuorovaikutusjärjestys amerikkalaisten suomenoppijoiden itsenäisessä kielenkäytössä. Puhe Ja Kieli, 41(3), 225–245. https://doi. org/10.23997/pk.112565 Räsänen, E., & Muhonen, A. (2020). “Moi moi! Te olette siistejä!”: Chattailyä, itsestä kertomista ja yhteisöllisyyttä pohjoisamerikkalaisissa suomen ohjelmissa. In S. Latomaa & Y. Lauranto (Eds.), 2020. Päättymätön projekti III. Kirjoitettua vuorovaikutusta eri S2-foorumeilla (Kakkoskieli 9, pp. 82–96). Helsingin yliopiston suomalais-ugrilainen ja pohjoismainen osasto. Ross, S. J., & Kasper, G. (Eds.) (2013). Assessing second language pragmatics. Palgrave Macmillan. https://doi.org/10.1057/9781137003522 Salaberry, M. R., & Kunitz, S. (2019). Introduction. In R. Salaberry & S. Kunitz (Eds.), Teaching and testing L2 interactional competence: Bridging theory and practice (pp. 1–22). Routledge. https://doi.org/10.4324/9781315177021-1 Scollon, R., & Scollon, S. B. K. (2004). Nexus analysis. Discourse and the emerging internet. Routledge. https://doi.org/10.4324/9780203694343 Turner, C. E.,  & Purpura, J. E. (2015). Learning-oriented assessment in the classroom. In D. Tsagari  & J. Banerjee (Eds.), Handbook of second language assessment (pp.  255–74). Mouton de Gruyter. https://doi.org/10.1515/ 9781614513827-018


Yastibas, A. E., & Yastibas, G. C. (2015). The use of e-portfolio-based assessment to develop students’ self-regulated learning in English language teaching. Procedia – Social and Behavioral Sciences, 176, 3–13. https://doi.org/10.1016/j. sbspro.2015.01.437

Appendix

7 THE ROLE OF AN INSCRIBED OBJECT IN A GERMAN CLASSROOM-BASED PAIRED SPEAKING ASSESSMENT

Does the topic card help elicit the targeted speaking and interactional competencies?

Katharina Kley

Introduction

About a decade ago, language testers adopted an argument-based validation model that consists of a network of inferences, which have to be verified to support test score interpretation and use (Chapelle et al., 2008; Fan & Yan, 2020). One assumption underlying the Evaluation inference, one component of validating a test, is that task administration conditions have to be appropriate for providing evidence of targeted language abilities (Chapelle et al., 2008). Language testers have argued repeatedly that, apart from the test task, task characteristics, and administration conditions, a variety of other factors in the testing context, such as candidate and interlocutor characteristics, the rater's personal characteristics, and the rating scale used, also affect the test discourse, and all of these factors combined impact scoring and the validity of a performance test like a speaking assessment (e.g., Bachman & Palmer, 1996; Csépes, 2009; McNamara, 1996). This line of research on factors affecting spoken performance has been conducted predominantly in large-scale, high-stakes speaking assessment contexts, and only rarely in classroom settings (Fan & Yan, 2020). While full validation of teacher-made speaking assessments is hardly possible, teachers should still strive for some reliability and validity evidence (Hughes, 2003). Hence, this chapter reports on a small prototype speaking assessment task (see Enright et al., 2008) in a classroom-based test context. It addresses the impact of the use of a topic card, which lists the topics to be discussed by the test takers, on the paired test interactions. The overall goal is to identify what task and administration conditions would be optimal for observing evidence of the speaking abilities, including interactional competencies, to be assessed in the described paired speaking assessment.


In the context of the present study, I will analyze and discuss how a topic card affects the opportunity to observe evidence of the targeted speaking and interactional competence abilities of the candidates.

Task effects in paired and group oral assessments

Paired and group speaking assessments have gained in popularity over the last two decades, mostly because they elicit more symmetrical interaction patterns (Galaczi, 2008; Lazaraton, 2002) and a broader range of speech functions (Galaczi & ffrench, 2011) in comparison to interview-formatted assessments. The underlying construct of paired and group oral assessments is broader than that of the traditional interview test format and aligns more closely with the concept of interactional competence (Galaczi & Taylor, 2018). A few language testers have researched the impact of test tasks and task characteristics on discourse and scores in paired and group speaking tests. For example, van Moere (2007), who used qualitative and quantitative measures to compare a discussion, a consensus, and a picture difference task in an English as a foreign language group test setting, found that the consensus task is the most suitable task for assessing oral proficiency because it elicits a greater number of words and the widest range of interactional functions; it is also most successful in involving the group members in the conversation. In addition, Nakatsuhara (2013) showed that the task interrelates with test-taker characteristics such as proficiency and extraversion. Her data revealed that the information-gap task is more suitable for lower-proficiency-level learners because the task requires them to exchange information, and thus they contribute more to the interaction than they do in other tasks. The ranking and free discussion tasks, by contrast, are more appropriate for extroverted and proficient test takers, as these tasks require the test takers to take the initiative to complete them.

The impact of different prompts and inscribed objects on peer–peer test interaction

Instead of focusing on different test tasks, Leaper and Riazi (2014) argue that different prompts of the same task also have an impact on test discourse. For an English as a foreign language group oral test, the authors demonstrated that the four different prompts, which were all supposed to be equally difficult, elicited interactions that differed in the number of turns taken, syntactic complexity, and fluency.


Greer and Nanbu (2020) used multimodal Conversation Analysis (CA) and analyzed over 200 English as a foreign language paired discussion tests to investigate how the test-taker pairs oriented to different prompts. For the four-minute discussion task, the test takers, who were first- and second-year learners of English, were presented with either a short or a long prompt: The short prompt comprised a card with a single topic written on it, while for the long prompt students were provided with four different statements, each written on a separate sheet of paper, and they were asked to discuss whether or not they agreed with the different positions. The analysis showed that the test-taker pairs who were presented with the short prompt oriented to the topic card to shift from an opening greeting sequence to the actual test topic. Gaze and gestures such as pointing at the card were also used by these students to return to the topic at hand when they had diverged from it in their interaction. However, when the student pairs reached a point in their conversation where generation of further talk had come to a halt, they were forced to transition in a stepwise fashion to a subtopic on their own, without any outside help. In contrast, the test-taker pairs who participated in the long-prompt test could simply shift to a new topic when they were unable to further develop a topic. That also means that topics were not expanded on much, but rather abandoned often, and topic change was conducted more in a disjunctive instead of a stepwise manner. In addition, because the prompt was longer, more inscribed language was available for students to use as a resource in their interactions, namely in that they included elements from the topic cards into their own utterances. Greer and Nanbu (2020) also note that the test-taker pairs taking the long-prompt test spent more time gazing at the topic cards (which is also due to the fact that the test takers were faced with a longer text to read). However, the students were also gazing at the cards while their partner was speaking, which might have given the speaker the impression that the other was not listening. The authors concluded that a minimal prompt seems to be most effective if the goal of the assessment is to elicit a natural interaction pattern where topic development can occur in a stepwise fashion. Finally, applying CA for their analysis, Sandlund and Sundqvist (2013) uncovered diverging understandings of test instructions and orientations to an ongoing paired oral test task among the testers and the members of the test-taker pairs. For the task, the test takers drew discussion cards (topic cards), with a question or proposal written on them. Sandlund and Sundqvist’s (2013) data analysis showed that the testers intervened in the student interactions with questions because they understood the prompt literally and thus took a narrower perspective on how to deal with the task. The teachers also attempted to direct the conversation to increase the students’ dialogue and to steer the discussion toward their own understanding


of the task and the topic card. Based on their orientation to the test task, the testers' interactional conduct affected topic management during the test.

Orienting to inscribed objects in social interaction

The literature review thus far suggests that task characteristics and prompts have an effect on peer–peer test interaction with regard to, for example, the number of words elicited, the range of interactional functions produced, the number of turns taken, syntactic complexity, fluency, as well as topic development and management. Less research has been conducted on the impact of inscribed objects on the test discourse although written documents such as handouts or cards that include prompts and/or instructions are commonly used in speaking assessments. Greer and Nanbu’s (2020) study is an exception. In their study from 2020, the two scholars found that the test takers gaze and point at the topic card when they return to the initial topic after they have been off topic; they also incorporate words from the cards into their own contributions. Thus, the sheer existence of a topic card or an inscribed object when administering a test task seems to trigger reactions and behaviors from the test takers, which may impact the test interaction and the students’ performance. While language testers have neglected the impact of test takers’ orientation to inscribed objects and its impact on spoken and interactional behavior, some multimodal CA research, set in non-test contexts, has investigated the interrelationship between written documents and talk-in-interaction (Nevile et al., 2014). As an approach to the study of talk, CA understands talk as systematically organized and socially ordered (Psathas, 1995). Its focus is on how individuals understand one another and how they generate and negotiate social activities in talk-in-interaction (Hutchby & Wooffitt, 2008). In CA, social interaction is predominantly understood as a multimodal phenomenon (Deppermann, 2013; Nevile, 2015; Stivers & Sidnell, 2005), including both vocal and visuospatial modalities (Stivers & Sidnell, 2005). In recent years, however, the notion of multimodality has been expanded; CA multimodality studies now go beyond the traditional embodied resources such as gestures, nods, and gaze. Today, embodiment includes the entire body and thus a wide range of body aspects, including the reaching and handling or touching of material objects (Mondada, 2019; Nevile, 2015). For example, adopting multimodal CA, Svennevig (2012) analyzed business meetings and looked specifically at how the presence of a written agenda affected topic initiations during the meetings. He found that due to the agenda, the “known-in-advance status” (p. 63) of the topics made the topic initiations rather short and elliptical, and formulations contained in the agenda were also used to initiate topics. In addition, Svennevig (2012)


demonstrated that gaze and gesture invoked the agenda. For example, holding and gazing down at the agenda are indications that a participant is returning to the agenda. Mikkola and Lehtinen (2014), who analyzed the role of an appraisal form as a material object in appraisal interviews between superior and subordinate, came to similar findings: they found that the participants negotiated activity shifts by embodied means such as touching, moving, and gazing at the appraisal form before the activity shift was verbally implemented. They also showed that the verbal turn-taking did not come to a halt, but was ongoing, while participants were initiating these activity shifts by orienting to the appraisal form (“double orientation”, p. 62; see also Deppermann et al., 2010).

Goal of the study

As the research reviewed in this chapter has shown, test tasks and prompts affect test-taker performance. The impact of inscribed objects on test discourse has not been investigated much although, as we have seen from multimodal CA studies in test and non-test settings, written documents seem to affect the interaction between participants. If inscribed objects are included in speaking test settings, it is possible that they affect student interactions and overall speaking performance, which may in turn have an impact on the validity of a test. Adopting a multimodal CA-informed approach, this study intends to explore the role of a topic card in a classroom-based paired speaking assessment. The research question for this study is as follows: How does a topic card affect the opportunity to observe evidence of the targeted speaking and interactional competence abilities of the candidates of the present paired speaking assessment?

Data and methods

Participants

Forty-three students (22 student pairs) from three different sections of a first-semester German language class offered at a US university participated in this study. Because of the odd number of students, one student completed the task twice with two different conversation partners. The test takers, who were peers and knew each other from class, were randomly assigned to pairs by their instructor and given a conversation task. The author of this chapter is one of the instructors involved in this assessment.

Test task, topic card, and test procedure

The task was part of an end-of-the-semester low-stakes classroom-based assessment instrument, which was to be completed outside of class. The


speaking assessment comprised three parts: (1) a paired discussion, (2) follow-up questions, and (3) a role-play. The assessment was an achievement test that tested the students’ speaking and interactional abilities after the first semester of German instruction. The students were familiar with these tasks as they were also used for classroom practice. For the purpose of this chapter, the focus is on one of the three tasks: the paired discussion. An instructor or another student was present for this task and served as proctor. The student was a more advanced learner of the language or a German L1 speaker who was the test takers’ tutor and helped with homework assignments and served as conversation partner for speaking practice and assessments. For the paired discussion task, the students were asked to exchange and expand on information with their speaking partner. On the day of the assessment, one of the two participants was asked to draw one of three topic cards, read the topics, and share the card with his/her partner. There were no instructions on where to place or what to do with the card afterward; the students were also not instructed on how to sit, that is, next to or across from one another. All cards included three of the following topics, written in German, that had been covered in class: personal information (birthday, languages, major, hobbies, personality traits); family; daily routine; and living arrangements. The test takers were expected to talk about all three topics listed on their topic card for a total of five minutes. The proctor kept the time. The order in which the students talked about the different topics was not predetermined. An example of a topic card can be found in Figure 7.1. The students video recorded their interactions with their cell phones or laptops and then shared the video recordings with their instructor. The

FIGURE 7.1 Example of a topic card. It includes the following topics: daily routine, living arrangement, and family


paired discussion task was used to assess the students' vocabulary use, grammatical accuracy, and pronunciation as well as interactional competence (Galaczi & Taylor, 2018) such as maintaining mutual understanding, initiating and shifting topics, and expanding topics.

Procedure for data analysis

The video recordings were first transcribed verbatim. Then, the sections in which the students oriented to the topic card were transcribed using the notational system of CA taken from Jefferson (2004). The analysis was informed by multimodal CA (Mondada, 2018), that is, students' gaze, gestures, and orientation to the topic card were included in the transcripts. The transcription conventions can be found in the Appendix. The analysis focused on the test takers' topic shifts as it was found that the students oriented to the topic card when implementing new topics.

Topic and transitioning between topics

From a CA perspective, it is difficult to determine what a topic is in a given unit of talk (Sidnell, 2010). Based on Wong and Waring (2010), a new topic is a topic that is unrelated to the previous one. New topics are initiated after a topic has been closed, after a series of silences, or at the beginning or closing of a conversation (Wong & Waring, 2010). Previous CA research has identified several topic initiation methods such as initial elicitations (e.g., what's new?) (Button & Casey, 1984), itemized news inquiries and news announcements (Button & Casey, 1985) as well as pre-topical questions (e.g., Are you a freshman?) and pointing to the immediate surrounding of the interaction (e.g., Nice day, isn't it?) (Maynard & Zimmerman, 1984). When shifting to a new topic, participants may do so in a disjunctive manner (Wong & Waring, 2010). To mark abrupt and unexpected topic shifts, CA researchers found that conversationalists use disjunctive markers like actually or by the way. Participants may also terminate a topic before they initiate a new topic by using pre-shift tokens such as acknowledgement tokens and assessments (Jefferson, 1993). It should also be noted, however, that stepwise transitioning between topics, that is, moving from topic to topic in a gradual fashion, is considered more natural (Sacks, 1992).

Teaching and assessing topic initiations and shifts in the present context

The students in the present study were taught acknowledgement tokens (e.g., okay, achso, ja) and assessments (e.g., schön, toll, cool) in German which can be used as pre-shift tokens to close a topic before initiating a new topic.


In addition, in first-semester German, the focus is on asking and answering questions; hence, it was observed in the classroom that the students mostly ask information-seeking questions to initiate new topics. Thus, at the end of their first semester of German, (1) the students should be able to use pre-shift tokens to close a topic, and (2) it can be expected that they mostly ask information-seeking questions to initiate a new topic. In the present speaking test, topic initiations and shifts are assessed under the Engagement criterion, which combines two skills, asking questions and initiating topics. For full credit, students have to be actively engaged in the conversation, that is, they have to ask questions often and initiate topics. Partial credit is given to students who ask only a few questions and initiate some topics. Students who behave rather passively in the interaction, ask no questions, and do not initiate any topics are not awarded any points on the Engagement criterion.

Results

The analysis revealed that most students included in this study oriented to the topic card during their interaction with their conversation partner. Particularly prior to topic shifts, the majority of test takers gazed at the card, and some students also pointed at it, held it in their hands, or talked the card into relevance. Similar to the findings from Svennevig (2012) and Mikkola and Lehtinen (2014), the test-taker pairs in this study also oriented to the material object, the topic card, before they verbally implemented the topic shift. Topics were mostly initiated by asking an information-seeking question. As can be seen in Excerpt 1, which is from Steven and Charlie's conversation, Steven orients to the topic card prior to initiating the new topic, daily routine (lines 2, 4). Until this point, Steven and Charlie have exchanged some personal information such as their age, hobbies, and major. After pointing out what his major is, namely electrical engineering (line 1), Steven gazes at the topic card (lines 2, 4–5) and then asks the information-seeking question what what do you do in the morning? (lines 6–7), which initiates the next topic.

Excerpt 1

01  S:  uh ich studier elektrontechnik.
        uh I study electrical engineering.
        ((C looks at S; S looks to the side, then at C))
02  C:  schön. [schön.
        nice. [nice.
        ((C looks at the back of the card; S looks at the card))
03  S:         [°°elektrontechnik°°
                °°electrical engineering°°
04      #(0.2) ((fig. 2))
        ((C looks at the proctor; S keeps looking at the card))
05  S:  uhm (0.5)
        uhm
        ((C looks at S; S looks at C, then to the side))
06  S:  was was machst
07  S:  du in die morgen?
        what what do you do in the morning?
        ((S looks at C, looks up, then back at C; C looks to the side, then at the proctor))
08  C:  uh: (0.2)
        uh
        ((C looks at S, then to the side))
09  C:  ich stehe auf uh
10  C:  viertel bis (.) neun?
        uh: (0.2) I get up uh quarter to (.) nine?
        ((C looks at S, then up; S looks up, then at C and nods))
11  S:  hm hm
        ((C looks to the side; S looks at C))
12  C:  uh

In addition, the students predominantly used pre-shift tokens, which are acknowledgement tokens like okay or assessments such as schön (nice) (Jefferson, 1993) to terminate a topic before initiating a new one. Returning to Steven and Charlie’s interaction in Excerpt 1, it becomes apparent that the pair’s first topic about personal information (partially shown in lines 1–3), is terminated by Charlie with the assessments nice. nice. (line 2). At the same time, Steven looks at the topic card (lines 2, 4–5) before initiating the new topic. It has to be noted also that some student pairs abandoned the current sequence without using any pre-shift tokens before shifting to the new topic. What has been observed in terms of students’ abilities to transition to new topics in this assessment is what we would expect learners to be able to do after one semester of German instruction, that is, using pre-shift tokens to close the current topic and indicate that a topic transition is underway as well as asking an information-seeking question to initiate a new topic. The skills the students show in the assessment coincide with what was practiced in the classroom and what they should be able to do given their current German proficiency level. Overall, it appears that the topic card provides us with the opportunity to observe the learners’ abilities to shift topics. However, it seems that due to insufficient instructions for both test takers and proctors as well as unstandardized administration conditions of what the students are allowed to do and not to do with the topic card, the topic card was used inconsistently across the different test-taker pair conversations, which impacted the students’ interaction as well as topic shift implementation. For example, one test-taker pair’s orientation to the topic card triggered an unbalanced number of topic shifts between the members of the pair (Excerpt 1), and other


test-taker pairs produced rather long, constructed, and unnatural topic shifts when orienting to the card (Excerpts 2, 3). Steven and Charlie, whose conversation was discussed earlier (Excerpt 1), produce an unbalanced number of topic shifts because Steven holds the card in his hands such that his conversation partner Charlie only sees the back of the card (fig. 1). As Charlie is denied access to the card, he does not shift to any new topics, while Steven, who is in control of the card, also controls the topic shifts. Their interaction continues in this fashion, with Steven holding the card and initiating all topic transitions in their conversation, while Charlie does not initiate a single new topic. Thus, whether both participants have equal access to the topic card clearly matters for a balanced topic shift implementation. Charlie, who has been denied access to the card throughout their entire conversation, is put in a passive role with respect to the topic shifts, which has an impact on scoring. As the initiation of new topics is one criterion on the scoring rubric for this speaking assessment, points were deducted from Charlie's score for not initiating new topics. Charlie, however, might have been more engaged and might have shifted to new topics if he had had access to the card or if Steven had not oriented to the card at all; the existence of the topic card and his conversation partner holding the card and denying him access to it might have prevented Charlie from showing his abilities to shift topics. Unlike Steven and Charlie, who sit across from one another, Kim and Catalina sit next to each other during the test (fig. 3) with the topic card in front of them on the table, which gives them equal access to the card. However, pointing and gazing at the card, they engage in a negotiation sequence prior to shifting topics. That leads to a rather unnatural and constructed implementation of topic shifts, as can be seen in Excerpt 2.

Excerpt 2

01  K:  hm ich habe::. h ah vier (.) mitbewohnerin=
        hm I have::. h ah four (.) roommate=
        ((K and C look at each other))
02  K:  =ahm meine freunde: ( ) ( ) ( )
03  K:  un:d #ah# ↑( )¿
        =uhm my friends: ( ) ( ) ( ) an:d #uh# ↑( )¿
04  C:  °°ja°°
        °°yes°°
05      (0.4)
06  K:  ah:[m
07  C:     [°°cool°°
        ((K looks at the topic card))
08  K:  .h okay, (.) ah:m# ((fig. 3))
        .h okay, (.) uh:m
09      (0.3)
        ((C looks at K, then at the topic card))
10  K:  pt. hhhh [äh: H. hh. (0.2)
        ((C moves her left hand to the card and points at it with her index finger))
11  K:  in [ins zu[: ]
        in to the to
12  C:     [tage]sroutine¿
            [dai]ly routine¿
        ((C looks at K))
13  K:  JA.
        yes.
14      (0.2)
        ((K looks up))
15  C:  [ja.]
        [yes.]
16  K:  [ok]ay.
17      (.)
        ((K looks at C))
18  K:  in an (0.3) deine (0.3) tag¿ (0.3)
        in (0.3) your (0.3) day¿ (0.3)
        ((K looks to the side, then at C))
19  K:  .h uhm wie uh: was möchtest du¿
        .h uhm how uh: what would you like¿


Excerpt 2 is from the end of Kim and Catalina's first topic (living situation). As can be seen in lines 1–3, Kim shares with Catalina how many roommates she has and who they are. With the assessment °°cool°° in line 7, Catalina closes the topic. Kim then looks at the topic card to transition to a new topic. The okay, (.) ah:m# in line 8 can be seen as projecting a transition. However, the in- and out-breaths, the pauses, and the perturbed expressions that follow suggest that Kim has encountered some sort of trouble (lines 9–10). In line 11, she initiates the transition to the new topic. After looking at the topic card (line 10), Catalina points at the card with her left index finger (line 11) and completes Kim's turn by saying Tagesroutine (daily routine) (line 12), suggesting that this should be their next topic to talk about. Kim confirms (lines 13, 16) and begins formulating the first question for this new topic (lines 18–19). In Excerpt 2, the test takers gaze at the topic card, point at it, and negotiate what talking point to turn to next before they shift to a new topic. Although the students are able to close a topic and initiate a new one, Kim's trouble and the negotiation sequence itself make the topic shift implementation appear inauthentic and constructed, which may have an effect on the rater's perception of the students' abilities to shift topics and thus impact scoring. Lastly, Charlie and Andrea also orient to the topic card when transitioning to a new topic. Charlie talks the card into relevance when he tries to read aloud one of the topics from the card, which he wants to incorporate into an information-seeking question to initiate a new topic. However, it turns out that he has trouble properly pronouncing the word, which elongates the topic shift and makes it rather unnatural; this may also impact scoring with respect to the Engagement and pronunciation criteria of this assessment.

Excerpt 3

01  C:  uhm (.) ich habe keinen bruder u-. h
02  C:  und keinen uh (0.6) uh schwester?
        uhm (.) I don't have a brother. h and uh (0.6) uh sister?
        ((A looks down, then at C; C looks off to the side))
03  A:  hm hm
        ((A nods))
04  C:  uh
        ((C looks at the topic card))
05      (0.7)
06  C:  ↑ja¿
        ↑yes¿
07      (0.8)
        ((A looks at the topic card))
08  C:  uh
        ((A looks at C, then at the topic card; C moves closer to the topic card))
09      (2.0) (0.5) (0.2) (0.2)
10  C:  uh was ist deine [uh]
        uh what is you [uh]
11  A:                   [hh] hu hu
        ((A moves closer to the topic card))
12  A:  h.. h
        ((A leans back and looks at the topic card))
13  C:  (tschas) (0.5) uh:: (2.6)
        ((A looks at C; C lifts his right arm in the air and drops it))
14  C:  uh was ist
        ((C looks at A))
15  C:  ↑deine
        ((C looks off to the side))
16  C:  (possen: leck lischt).
        uh what is ↑your (possen: leck lischt).
        ((A looks away, then at C; C looks at A))
17  A:  .h ↑uh::: ich bin::: fleißig
18  A:  un:d kreativ¿
        .h ↑uh::: I am::: hard-working an:d creative¿
        ((A looks away; C looks away, then at A))
19      (0.6)
        ((A looks at C))
20  C:  gut [gut]
        good [good]
21  A:      [und] (0.3)
            [and] (0.3)
22  A:  [und du? hh] ha
        [and you? hh] ha
        ((C looks at A))
23  C:  [das ist gut,]
        [that is good,]

As can be seen in this excerpt, Charlie and Andrea have been talking about family up to this point. In line 3, Andrea produces an acknowledgement token hm and nods, which closes the current topic. To transition to the next topic, Charlie and then Andrea gaze at the topic card (lines 5–8). Longer pauses occur (lines 7, 9) as both test takers orient to the topic card. It was also observed that Charlie moves his head closer to the topic card as if he cannot see what is written on the card (line 9), while Andrea looks back and forth between the topic card and Charlie. In line 10, Charlie begins to verbally transition to the new topic by asking uh what is you [uh], but he stops, and Andrea starts laughing (lines 11–12). She now also leans forward to move her eyes closer to the topic card like Charlie (line 12). Charlie, who is still looking at the topic card, continues with his question by uttering (tschas) (line 13). Then, Charlie abandons the turn, as the pauses in line 13 indicate, suggesting that he may have encountered a problem most likely due to what is written on the topic card. As becomes apparent in lines 14–16, Charlie attempts to verbally initiate the new topic a second time, asking the following question: uh what is ↑your (possen: leck lischt). In both of Charlie's attempts to initiate the topic shift, he produces incorrect German words; he most likely has trouble pronouncing Persönlichkeit (personality), one of the talking points listed on the topic card. Andrea still understands Charlie's question because she shares a couple of her traits with him, namely that she is hard-working and creative (lines 17–18), which Charlie accepts as an answer to his question, as his assessments in lines 20 (good good) and 23 (that is good,) show. Thus, the trouble Charlie encountered initiating the new topic could be resolved, but, overall, this repair sequence triggered by the presence of the topic card and Charlie's orientation to the card led to a long and inauthentic topic shift, which may impact the score Charlie is awarded for his ability to pronounce German words and to shift to new topics.

Discussion and conclusion

In line with previous research (Mikkola  & Lehtinen, 2014; Svennevig, 2012), the test takers in this study oriented to the topic card before they verbally implemented the topic shift, mostly by asking an information-seeking question. The analysis of the present study – informed by multimodal


CA – showed that prior to topic shifts most test-taker pairs oriented to the topic card in various ways such as gazing at the card, pointing at it, holding it in their hands, and talking the card into relevance by incorporating words from the card into an information-seeking question to initiate a new topic. It was also found that the students predominantly used pre-shift tokens such as acknowledgement tokens (okay) or assessments (schön; nice) (Jefferson, 1993) to conclude the ongoing topic before initiating a new topic. Some student pairs, however, abandoned the current topic without using any pre-shift tokens. Given these findings, we can say that the students' ability to shift topics, as elicited in the assessment, is what we would expect. Applying pre-shift tokens before initiating a new topic and using information-seeking questions as topic initiators is what these learners are able to do in terms of topic shift implementation after one semester of German instruction. These skills also coincide with what the students were taught and practiced in the classroom. In that respect, we can conclude that incorporating the topic card in the testing procedure provides us with the opportunity to observe the students' abilities to shift topics. However, as was mentioned earlier, the analysis also showed that the topic card was not used consistently across all student pairs, which was due to poor instructions, unstandardized administration conditions, and insufficient training of the proctors who administered the assessment. As the test takers could use the card during the assessment as they liked, it was uncovered that some test-taker pairs' orientation to the topic card led to an unbalanced number of topic shifts between the members of the pair (Excerpt 1) and rather long, constructed, and unnatural topic shifts (Excerpts 2, 3). To be precise, the analysis revealed that equal access to the topic card is important because the participant who is denied access to the card may not be able to shift any topics while their partner who has access to the card implements all topic shifts. This situation may cause an imbalance in topic shifts between the two members of the test-taker pair (Excerpt 1) and potentially has an effect on scoring. In Excerpt 2, we have seen the test-taker pair gazing and pointing at the topic card when negotiating which topic to continue with. In Excerpt 3, one of the members of the test-taker pair wanted to incorporate a word that was written on the topic card into his information-seeking question when shifting to a new topic but encountered trouble pronouncing the word. Due to the students' orientation to the topic card, the topic shifts observed in Excerpts 2 and 3 were rather long and constructed, obscuring the students' ability to properly pronounce German words and to transition to new topics, both of which may have an effect on students' scores.


Most of the test takers oriented to the topic card during the task, but the test developers did not intend for the students to do so. Thus, there is a discrepancy between what the instructors intended the students to do with the topic card and the assessment task and what the students turned out to be doing. With reference to Breen (1989), Seedhouse (2005) points to the distinction between task-as-workplan (the intended pedagogy behind a certain task) and task-as-process (what learners do with the task when they engage in it). According to Seedhouse (2005), this distinction may lead to validity issues as tasks are conceptualized at the workplan level, but the data obtained from a test are process-oriented. In the current context, at the workplan level, one member of the student pair was supposed to draw one of the three topic cards presented to them, read the topics listed on the card, and share the card with their conversation partner. The instructors anticipated that the students would memorize the three topics and put the card away, especially because there was little writing on the card, and the topics were familiar to the students, as they had practiced these topics in class prior to the test. The instructors did not expect that the test-taker pairs would use the card as a memory aid during the test. As the test developers did not anticipate that the students would orient to the topic card, they did not instruct the students on what to do with the topic card during the assessment. As a matter of fact, not only did the test takers orient to the card, but each student also did so differently; in addition, the students accessed the card from different positions and angles because there were no constraints on seating arrangements. Hence, the test task with the topic card as it was applied was not a consistent or reliable measurement (Bachman & Palmer, 1996) of the students' speaking or interactional abilities. After all, we are faced with a reliability issue in this assessment, which also affects the test's validity. How can the testing procedure be made more consistent and the test more reliable and valid? If the topic card is kept in the assessment (which would be legitimate because, as stated earlier, in the majority of cases the topic card allows us to observe the test-taker pairs' ability to shift topics), test takers and proctors would have to be told explicitly what to do with the card, for example, how the students can orient to the card, where the card should be placed, and how the students are to be seated. For example, the students have to be seated such that they both can view the card. They may have to be instructed only to gaze and point at the card but not to hold it in their hands. Lastly, the student pairs may have to be told not to use any words from the topic cards for topic shift implementation. Any procedures like these would have to be followed by both test takers and proctors in all tests for the assessment to be consistent and reliable.


Another option would be to instruct the proctor to take the card away from the students after they have looked at it and before they begin the task, to avoid distractions and the students' orientation to the card prior to topic shifts during the assessment. The topic card could also be eliminated from the task altogether. Instead, the students could be tasked to talk about a number of topics of their choice. This, however, may change the test-taker pairs' interactional behavior, including how they shift topics, which would have to be investigated. Lastly, to increase the validity of the assessment it would be beneficial to also revise the scoring rubric. In the original rubric, the skill of initiating or shifting topics has been combined with asking questions into one criterion, the Engagement criterion. It may be better to separate them and create one criterion solely for topic initiation and shift implementation. The descriptor should specify what exactly the students are expected to do with respect to topic shifts after one semester of German instruction; thus, the descriptor may detail the use of pre-shift tokens and information-seeking questions, as the students were taught these interactional features in the classroom. All in all, inscribed objects such as topic cards or instruction sheets are commonly used in oral language tests. While language testers have investigated the impact of task characteristics and test prompts on the test discourse and ratings, they have neglected to examine the effects of inscribed objects, which are part of test tasks. By using an approach informed by multimodal CA, the present study has attempted to fill this gap. It was demonstrated that students orient to an inscribed object (a topic card in this test setting), and that the topic card elicits the anticipated speaking and interactional abilities with respect to the students' topic shift implementation. However, when using material objects like topic cards in assessments, teachers have to standardize the administration conditions to allow for the test to be consistent and reliable.

Recommended further reading, discussion questions and suggested research projects

Further reading

The book Assessment in the Language Classroom by Cheng and Fox (2017) is recommended for anyone interested in an overview of the theoretical background and practical strategies teachers need when assessing their learners.


For those interested in learning more about validity in classroom-based test settings, Phakiti and Isaacs' (2021) article is an interesting read. The authors argue that effective classroom assessment should be approached from the perspective of the general principle of quality.

Finally, McNamara et al. (2002) report on various kinds of discourse analysis to validate oral assessments. The authors also reference some important studies on the impact of various facets of the assessment setting, such as the interlocutor, task, test condition, test taker, and rater.

Discussion questions

1. Given the data presented, how would you use the topic card in the present test setting? Would you use it at all? Give reasons for your approach.
2. How do you think the validity and reliability of the present assessment can be further improved? Devise concrete suggestions.
3. In this study, an approach informed by multimodal Conversation Analysis was used to investigate the validity and reliability of the test. What other methods can teachers use to determine the validity and reliability of classroom-based assessments?

Suggested research project

Design a speaking assessment that uses a material object and pilot test it with a few students. Record the students' assessments and conduct a brief discourse or conversation analysis to determine if the task elicits the targeted speaking and/or interactional abilities. Is the assessment valid and reliable? How so? How can you improve the assessment's validity and reliability?

References

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
Breen, M. (1989). The evaluation cycle for language learning tasks. In R. K. Johnson (Ed.), The second language curriculum (pp. 187–206). Cambridge University Press. https://doi.org/10.1017/cbo9781139524520.014
Button, G., & Casey, N. (1984). Generating topic: The use of topic initial elicitors. In J. M. Atkinson & J. Heritage (Eds.), Structures of social action: Studies in conversation analysis (pp. 167–190). Cambridge University Press. https://doi.org/10.1017/cbo9780511665868.013
Button, G., & Casey, N. (1985). Topic nomination and topic pursuit. Human Studies, 8, 3–55. https://doi.org/10.1007/bf00143022
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Test score interpretation and use. In C. A. Chapelle, M. K. Enright & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 1–26). Routledge. http://dx.doi.org/10.4324/9780203937891-7
Cheng, L., & Fox, J. (2017). Assessment in the language classroom: Teachers supporting student learning. Springer.
Csépes, I. (2009). Measuring oral proficiency through paired-task performance. Peter Lang. https://doi.org/10.3726/978-3-653-01227-9
Deppermann, A. (2013). Introduction: Multimodal interaction from a conversation analytic perspective. Journal of Pragmatics, 46(1), 1–7. https://doi.org/10.1016/j.pragma.2012.11.014
Deppermann, A., Schmitt, R., & Mondada, L. (2010). Agenda and emergence: Contingent and planned activities in a meeting. Journal of Pragmatics, 42(6), 1700–1718. https://doi.org/10.1016/j.pragma.2009.10.006
Enright, M. K., Bridgeman, B., Eignor, D., Kantor, R. N., Mollaun, P., Nissan, S., Powers, D. E., & Schedl, M. (2008). Prototyping new assessment tasks. In C. A. Chapelle, M. K. Enright & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 97–144). Routledge.
Fan, J., & Yan, X. (2020). Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Frontiers in Psychology, 11, 330.
Galaczi, E. (2008). Peer–peer interaction in a speaking test: The case of the First Certificate in English examination. Language Assessment Quarterly, 5(2), 89–119. https://doi.org/10.1080/15434300801934702
Galaczi, E., & French, A. (2011). Context validity. In L. Taylor (Ed.), Examining speaking: Research and practice in assessing second language speaking (pp. 112–170). Cambridge University Press.
Galaczi, E., & Taylor, L. (2018). Interactional competence: Conceptualisations, operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3), 219–236. https://doi.org/10.1080/15434303.2018.1453816
Greer, T., & Nanbu, Z. (2020). General and explicit test prompts: Some consequences of management in paired EFL discussion tests. In C. Lee (Ed.), Second language pragmatics and English language education in East Asia (pp. 194–234). Routledge. https://doi.org/10.4324/9781003008903-10
Hughes, A. (2003). Testing for language teachers. Cambridge University Press. http://dx.doi.org/10.1017/CBO9780511732980
Hutchby, I., & Wooffitt, R. (2008). Conversation analysis: Principles, practices and applications (2nd ed.). Polity Press.
Jefferson, G. (1993). Caveat speaker: Preliminary notes on recipient topic-shift implicature. Research on Language and Social Interaction, 26, 1–30. http://dx.doi.org/10.1207/s15327973rlsi2601_1
Jefferson, G. (2004). Glossary of transcription symbols with an introduction. In G. Lerner (Ed.), Conversation analysis: Studies from the first generation (pp. 13–31). John Benjamins. https://doi.org/10.1075/pbns.125.02jef
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge University Press.
Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing, 31(2), 177–204. https://doi.org/10.1177/0265532213498237
Maynard, D. W., & Zimmerman, D. H. (1984). Topical talk, ritual and the social organization of relationships. Social Psychology Quarterly, 47(4), 301–316. https://doi.org/10.2307/3033633
McNamara, T. (1996). Measuring second language performance. Longman.
McNamara, T., Hill, K., & May, L. (2002). Discourse and assessment. Annual Review of Applied Linguistics, 22, 221–242. https://doi.org/10.1017/s0267190502000120
Mikkola, P., & Lehtinen, E. (2014). Initiating activity shifts through use of appraisal forms as material objects during performance appraisal interviews. In M. Nevile, P. Haddington, T. Heinemann & M. Rauniomaa (Eds.), Interacting with objects: Language, materiality, and social activity (pp. 57–78). John Benjamins. https://doi.org/10.1075/z.186.03mik
Mondada, L. (2018). Multiple temporalities of language and body in interaction: Challenges for transcribing multimodality. Research on Language and Social Interaction, 51(1), 85–106. https://doi.org/10.1080/08351813.2018.1413878
Mondada, L. (2019). Contemporary issues in conversation analysis: Embodiment and materiality, multimodality and multisensoriality in social interaction. Journal of Pragmatics, 145, 47–62. https://doi.org/10.1016/j.pragma.2019.01.016
Nakatsuhara, F. (2013). The co-construction of conversation in group oral tests. Peter Lang. https://doi.org/10.3726/978-3-653-03584-1
Nevile, M. (2015). The embodied turn in research on language and social interaction. Research on Language and Social Interaction, 48(2), 121–151. https://doi.org/10.1080/08351813.2015.1025499
Nevile, M., Haddington, P., Heinemann, T., & Rauniomaa, M. (Eds.) (2014). Interacting with objects: Language, materiality, and social activity. John Benjamins. https://doi.org/10.1075/z.186
Phakiti, A., & Isaacs, T. (2021). Classroom assessment and validity: Psychometric and edumetric approaches. European Journal of Applied Linguistics and TEFL, 10(1), 3–24.
Psathas, G. (1995). Conversation analysis: The study of talk-in-interaction. Sage. https://doi.org/10.4135/9781412983792
Sacks, H. (1992). Lectures on conversation (Vol. 2, Ed. G. Jefferson). Blackwell.
Sandlund, E., & Sundqvist, P. (2013). Diverging task orientations in L2 oral proficiency tests – a conversation analytic approach to participant understandings of pre-set discussion tasks. Nordic Journal of Modern Language Methodology, 2(1), 1–21. https://doi.org/10.46364/njmlm.v2i1.71
Seedhouse, P. (2005). "Task" as research construct. Language Learning, 55(3), 533–570. https://doi.org/10.1111/j.0023-8333.2005.00314.x
Sidnell, J. (2010). Conversation analysis: An introduction. Wiley-Blackwell.
Svennevig, J. (2012). The agenda as resource for topic introduction in workplace meetings. Discourse Studies, 14(1), 53–66. https://doi.org/10.1177/1461445611427204
Stivers, T., & Sidnell, J. (2005). Introduction: Multimodal interaction. Semiotica, 156, 1–20. https://doi.org/10.1515/semi.2005.2005.156.1
van Moere, A. (2007). Group oral tests: How does task affect candidate performance and test scores? (Publication No. 898725120) [Doctoral dissertation, Lancaster University]. ProQuest Dissertations and Theses International.
Wong, J., & Waring, H. Z. (2010). Conversation analysis and second language pedagogy. Routledge. https://doi.org/10.4324/9780203852347


Appendix

Transcription conventions

.            falling (final) intonation
?            rising intonation
¿            rise, weaker than a question mark, stronger than a comma
,            low-rising intonation
↑            marked rising shift in intonation
↓            marked falling shift in intonation
[            start of overlap
]            end of overlap
=            latching
(0.3)        length of silence
(.)          micro pause (less than 1/10 of a second)
:            lengthening of the preceding sound
wor-         an abrupt cut-off
word         marked stress/emphasis
WORD         loud volume
°word°       decreased volume
.h .hh       in-drawn breaths
h. hh.       out-breaths
haha         laughter tokens (different vowels)
>word<       speeded up delivery
<word>       slowed down delivery
((coughs))   verbal description of actions
( )          stretch of talk that is unintelligible to the analyst
(word)       unclear or probable item
+ +          marks where embodied action occurs in relation to talk
+--->        action described continues across subsequent lines
--->+        action described ends when the plus symbol is reached
>>           action described begins before the extract begins
--->>        action described continues after extract ends
fig          moment at which a screenshot has been taken
#            shows the temporal position of screenshot within turn at talk

8 L1–L2 SPEAKER INTERACTION
Affordances for assessing repair practices

Katharina Kley, Silvia Kunitz, and Meng Yeh

Introduction

The present study focuses on repair practices in two classroom-based speaking tests designed to assess the oral proficiency of college-level students of Chinese as a foreign language. In one testing format, each student interacted with a peer, while in the other each student interacted with a speaker of Chinese as a first language. All participating students had received instruction in interactional competence (IC; see Kley et al., 2021; Kunitz & Yeh, 2019). From a conversation analytic perspective (see below), IC refers to the ability to produce recognizable social actions through turns-at-talk that are delivered in a timely manner and that are fitted to prior talk (Pekarek Doehler, 2019). More specifically, IC rests on the participants' ability to use basic interactional mechanisms such as turn-taking, sequence organization, and repair.

In this paper, we focus on repair as one of the constitutive aspects of IC. Repair is the interactional mechanism that allows participants in co-constructed interaction to raise and address problems of hearing, speaking, and understanding (Schegloff et al., 1977). As such, repair was the target of instruction and assessment in the course taken by the participants in the present study.

Traditionally, speaking ability has been assessed by means of language proficiency interviews such as the ACTFL OPI, involving a test taker and an examiner. With the communicative turn in language teaching, paired and group work among second language (L2) learners became more common, including in assessments (van Moere, 2012). The popularity of paired and group speaking tests can be attributed to the fact that, in comparison to
interview-formatted tests, they elicit more symmetric, everyday-like, and balanced interaction patterns with a wider range of interactive features (Brooks, 2009; Csépes, 2009). Paired and group assessments have also allowed language testers to widen the speaking construct to include the social-interactional component of IC (Galaczi & Taylor, 2018; Roever & Kasper, 2018), with its emphasis on the features of co-constructed discourse between test takers. In recent years, the conceptualization of IC has been strongly influenced by Conversation Analysis (CA). Following a CA perspective, Kasper (2006, p. 86) defines IC in terms of the following interactional competencies: understanding and producing social actions; taking turns at talk; formatting actions and turns; constructing epistemic and affective stance; co-constructing social and discursive identities; recognizing and producing boundaries between activities; and repairing problems in speaking, hearing, and understanding. Language testers (Galaczi & Taylor, 2018) currently operationalize the construct of IC as articulated in different components that concern topic management, turn-taking, embodied behavior, interactive listening, and repair, which is the focus of the present study.

Repair is an interactional mechanism through which trouble in speaking, hearing, and understanding becomes manifest and can be solved (Schegloff et al., 1977). From a social-interactional perspective, repair is nowadays considered an important component of the speaking construct in L2 test settings. However, that was not always the case. For example, Lazaraton (1996), who analyzed interviews from the Cambridge Assessment of Spoken English (CASE), found that the English native-speaker interviewer sometimes repaired problems of hearing or understanding by repeating or rephrasing questions after the non-native speaker candidate had initiated repair. Both the repair initiations and the repairs themselves negatively affected the reliability and validity of the test. In addition, some repair practices such as word searches are usually accompanied by hesitations and interruptions, which language testers traditionally associate with disfluency and thus treat as a sign of deficient language ability (De Jong, 2018). Similarly, since vocabulary is often considered an indication of language competence in L2 testing, a word search may reflect negatively on the student's L2 linguistic ability; and circumlocutions, which serve as a strategy to resolve a word search, can be interpreted as a learner's lack of lexical precision (Rydell, 2019). Thus, traditionally, repair has been considered construct-irrelevant from an L2 testing perspective, and language testers have tried to eliminate repair from speaking assessments, for example by training interviewers or testers not to initiate or respond to any repair.

In the present L2 test setting, however, the construct of a classroom-based paired speaking test has been operationalized from a social-interactional perspective and thus includes repair. That is, repair is one of the grading criteria of this test. To be precise,
repair has been operationalized with respect to other-initiated repair and other-directed word searches.

In addition, research in language testing has repeatedly shown that test discourse and assigned scores are influenced by various contextual factors, such as the test task, the test setting, the rater, and the interlocutor (Bachman, 1990; Bachman & Palmer, 1996; McNamara, 1996). In this study, we focus on one of these contextual factors: the interlocutor. We can assume that interlocutor characteristics may have an effect on the repair practices produced in the Chinese speaking test under investigation. Specifically, we focus on one interlocutor characteristic, namely native versus non-native speaker status, which, to our knowledge, has not been researched in classroom-based oral tests thus far. We investigate whether the types and number of other-initiations of repair (repair initiated by a learner on prior talk produced by the coparticipant) and other-directed word searches (practices used by a learner to recruit the coparticipant's help in finding a relevant word to be used in the turn under construction) change when the interlocutor is a second language speaker (L2S) or a native speaker (L1S). Determining which instances of repair initiations and other-directed word searches are produced in each of the two test settings will help to adjust the operationalization of repair in the present speaking test and fine-tune the corresponding scoring rubric.

Literature review

In what follows, we discuss two research strands that are relevant for present purposes. First, we briefly introduce studies focusing on variability in test discourse and, more specifically, research exploring the interlocutor effect on test performance. Second, we provide a short overview of CA studies on the focal repair practices explored here.

Variability in test discourse

An extensive body of research has investigated the variability in interview-formatted (e.g., the OPI) test discourse and ratings due to interviewer conduct (e.g., Brown, 2003; Lazaraton, 1996) and interviewer characteristics such as gender (O'Loughlin, 2002), cultural background (Berwick & Ross, 1996), native speaker status (Richards & Malvern, 2000), and familiarity with the interviewee (Katona, 1998). It was found that the variability caused by the interviewer in traditional examiner-led language proficiency interviews may have an undesirable impact on the reliability and thus the validity of the test (e.g., Ross, 1996; Kasper & Ross, 2007).


Language testers have also examined the impact of the interlocutor's personal characteristics on the test discourse in peer–peer speaking tests. For example, Galaczi (2014), who analyzed the effect of the interlocutor's proficiency, showed that the peer–peer discourse of test takers at different proficiency levels (B1 through C2 following the CEFR; for the most recent version of the proficiency descriptors, see Council of Europe, 2020) differs in the occurrence and range of interactional features such as topic organization, listener support, and turn-taking management. In addition, Davis (2009) found that high-proficiency test-taker pairs produced primarily collaborative and asymmetric dominant interaction patterns, while the interaction patterns of low-proficiency pairs tended to be more varied; parallel or asymmetric passive interactions were common between the lowest-scoring pairs. Investigating the effect of introversion/extraversion levels on oral test discourse, Berry (1993) found that extraverts performed better than introverts in homogeneous pairs. Both extraverts and introverts did least well in an individual format, that is, an interview-formatted test with an examiner. Interestingly, while Chambers et al. (2012) did not find a test-taker familiarity effect, O'Sullivan (2002) reported an acquaintance effect; that is, the female participants in his study scored higher when interacting with an acquaintance. However, gender did not have a large impact on the test discourse in O'Sullivan's (2002) research. Thus, recent research has examined a number of interlocutor factors influencing peer–peer interaction, but the impact of an L1S versus an L2S interlocutor on test discourse elicited by test formats other than interviews has not been investigated thus far.

As the research briefly reviewed above suggests, the interlocutor's personal characteristics clearly impact the co-constructed interaction in peer–peer oral tests. However, Galaczi and Taylor (2018) point out that "the magnitude or direction of that influence" (p. 224) remains unclear. The findings of previous research have at times been contradictory, so it cannot be directly predicted which personal characteristics have which effects on the test interaction (Brown & McNamara, 2004; Chambers et al., 2012; Davis, 2009; Galaczi & Taylor, 2018). According to some scholars, it may actually be the case that a combination of different personal characteristics, together with contextual factors and the test task, has an impact on the interaction and test performance (Davis, 2009; Galaczi & Taylor, 2018). Therefore, it is crucial to further investigate the role of personal characteristics in oral test discourse (Chambers et al., 2012).

The present study aims to contribute to this field of research by focusing on the effect of L1 versus L2 speakership on oral test performance. Using a mixed-method approach, the study draws on qualitative data derived from the students' test interactions and quantitative data based on the number of repair practices produced.


Repair practices

In this section, we focus on two types of repair: (1) other-initiations of repair (i.e., ways to indicate trouble with prior talk produced by the coparticipant) and (2) other-directed word searches (i.e., ways of recruiting the coparticipant's help when one encounters trouble with the continuation of the ongoing turn). Indeed, these two types of repair were first taught in the classroom and then assessed in the two classroom-based oral assessments that constitute our datasets, as the teacher deemed it important (1) that students knew how to ask for the meaning of words they could not recognize in another person's talk and (2) that they knew how to ask for help in case they could not produce a relevant next word during the formulation of their own turn.

Other-initiated repair

Other-initiations of repair are typically produced when a recipient mishears or misunderstands (parts of) the previous turn; however, a repair initiation may also be treated as a harbinger of disagreement with or challenge to what the coparticipant has just said. Schegloff et al. (1977) found that there are a number of different techniques available for recipients to initiate repair. These procedures differ in their strength to locate the trouble source. Open-class repair initiators (Drew, 1997) such as huh? or what? are located at one end of the scale. As shown in Example 1 next (from Schegloff et al., 1977, p. 367; the transcription has been slightly simplified), recipients use this kind of initiation (see line 02 in Example 1) to indicate that they have encountered a problem in the previous turn, but they do not specifically locate the repairable.

Example 1. Other-initiation of repair with open-class repair initiator

01 D:  well did he ever get married or anything?
02 C:  hu:h?
03 D:  did he ever get married?
04 C:  i have no idea.

More direct procedures that specifically direct the coparticipant to the trouble source are partial repetitions with or without a question word. The strongest type of other-initiated repair is the candidate understanding, with which the recipient provides their interpretation of the previous turn and offers it for confirmation to the speaker of the trouble source. This is shown
in Example 2 next (from Schegloff et al., 1977, p. 369; the transcription has been slightly simplified), where B in line 03 provides a candidate interpretation of the time reference produced by A in line 02.

Example 2. Other-initiation of repair with candidate understanding

01 B:  how long you gonna be here?
02 A:  uh- not too long. Uh just till uh Monday.
03 B:  til- oh you mean like a week from tomorrow.
04 A:  yeah.

Previous research (Brouwer & Wagner, 2004; Hellermann, 2011; Wong, 2000) found differences in the use of repair initiations between less proficient and advanced learners. While beginning learners tend to take longer to identify problems in prior talk, use open-class repair initiators more frequently than more specific repair initiators, and target repairables in final position, more proficient learners were found to make use of more specific repair initiators and to identify trouble sources that occur inside the prior turn. At the same time, a number of studies (e.g., Brouwer & Wagner, 2004; Pekarek Doehler & Pochon-Berger, 2015) suggest that L2 speakers diversify their repair practices over time as their proficiency increases. For example, Pekarek Doehler and Berger (2019) investigated the development of word search practices by an L2-speaking au pair (Julie) over time. Specifically, in the first month of data collection, when she had trouble continuing her turn, Julie would stop in medias res or she would explicitly ask for help in a "how do you say X" format, with the X item being produced in her L1. Over time, these practices ceased to occur, and Julie started to ask for help in more subtle ways, trying to offer candidate formulations. These findings are relevant for the present study, with its focus on various practices for recruiting help.

This diversification of repair practices was also observed in a classroom-based spoken assessment setting. Investigating two oral test interactions between an L2S and an L1S, Kley et al. (2021) found that the higher-scoring test taker of the two L2Ss deployed different repair initiation techniques such as open-class repair initiators, clarification requests, partial and full repetitions of the trouble source, and candidate understandings, while the low-scoring test taker produced only translation requests and one instance of an open-class repair initiation in interaction with an L1S. Their study also showed that CA-informed instruction on repair practices prior to an oral assessment can be effective; repair practices are teachable. However, the findings also indicate that students may not necessarily use repair practices in an oral assessment task that was designed to elicit a spontaneous conversation with an L1S.


Word searches

Word searches occur in both L1 (Goodwin & Goodwin, 1986; Hayashi, 2003) and L2 interaction (Brouwer, 2003; Gullberg, 2011; Greer, 2013; Kurhila, 2006). While word search behavior is similar in the L1 and the L2 (Kurhila, 2006), word searches are more frequent and more elaborate in L2 talk (Gullberg, 2011). From a CA perspective, word searches are considered forward-oriented self-repair (Greer, 2013), in that the trouble source is not in prior talk (as in backward-oriented repair), but rather in the talk that is yet to be produced. That is, in a word search a speaker encounters trouble in the production of the next due item in the turn currently under construction. Word searches are accompanied by perturbations such as pauses, hesitation marks, cut-offs, and sound stretches (Goodwin & Goodwin, 1986; Schegloff et al., 1977). Self-addressed questions like "What is . . ." (Hayashi, 2003), gestures, and bodily expressions such as a "thinking face" are also indications that a word search is underway (Goodwin & Goodwin, 1986).

Gaze plays a major role in word searches as well. More specifically, gaze direction might determine whether a word search is self-directed or other-directed. When a speaker tries to solve the trouble on their own, they typically withdraw their gaze from the coparticipants, while hesitation tokens and other perturbations indicate that the speaker still holds the floor. On the other hand, by looking at the interlocutor, the speaker invites the recipient to join the word search, thereby making it an other-directed search (Goodwin & Goodwin, 1986; Gullberg, 2011; Hayashi, 2003; Kurhila, 2006). The recipient can also be invited to join the search by other interactional resources such as gestures (Gullberg, 2011), explicit word search markers such as 'how does one say?' (Brouwer, 2003; Hosoda, 2000), and hedges (e.g., 'I'm not sure') (Rydell, 2019).

Kurhila (2006) points out that it is common for a word search sequence to contain both phases; that is, a self-directed phase followed by an other-directed word search when the self-directed search has turned out to be unsuccessful. Lin (2014), however, also observed that a self-directed search can be short or even absent, with the L2S inviting the interlocutor to participate in the word search right away. She interprets this phenomenon as the L2S focusing on the progressivity of the talk and thereby attempting to resolve the trouble quickly.

Word searches can be completed by self (the speaker) or other (the interlocutor) or can be achieved jointly (Rydell, 2019). However, Rydell (2019) stresses that not all word searches are completed; the participants can also abandon them. Word searches can be resolved in various ways. For example, it was observed in non-native/native speaker
interaction that the non-native speaker who searches for a word which they have not yet fully acquired may orient to the native speaker as a language expert and present a candidate solution with rising intonation for the native speaker to confirm or correct (Koshik & Seo, 2012). Hosoda (2006) calls this repair sequence a "vocabulary check" (p. 33). However, native speakers or other speakers can also provide candidate solutions to language learners' word searches; the learner then accepts or rejects these solutions (Koshik & Seo, 2012).

While word search studies have been conducted in various settings (in casual conversations, in the classroom, in the wild) and with different participant combinations (L1S–L1S, L2S–L2S, L1S–L2S), research on the role of learners' word searches in institutional settings such as peer–peer oral tests is rare. Coban and Sert's (2020) and Rydell's (2019) research are clear exceptions. In a peer–peer L2 English speaking assessment setting, Coban and Sert (2020) investigated how interactional troubles were flagged and resolved in test-talk. They found that transitioning to a sub-topic, claiming and demonstrating understanding, as well as completing the other speaker's turn helped to progress the conversation when interactional trouble surfaced through verbal and embodied resources such as silence, hesitation markers, thinking face, gaze aversion, headshakes, and smiles. Rydell (2019), who focused on word searches in a peer–peer speaking test targeted at Swedish L2 learners at the intermediate level, showed that when one member of the test-taker pair invited the other to join a word search, the other either provided a word (although it became evident that lexical precision did not seem to matter much) or did not help but instead indicated understanding of what the speaker wanted to convey. Rydell (2019) argues that the test takers in her study were more focused on ensuring the progressivity of their talk; thus, instead of focusing on the correctness of a specific word, the learners oriented to meaning and by doing so continued the discussion. However, the invitation to participate in the word search was not always taken up by the conversation partner. Rather, the interlocutor may avoid joining the word search, which, as Rydell (2019) argues, can point to "the test takers' awareness of being assessed" (p. 60).

As the literature review suggests, research on the use of word searches and other-initiated repair in oral test settings is scarce. In the past, repair was rarely included in the speaking construct. This study, however, which looks at speaking from a social-interactional perspective, incorporates repair. To fine-tune the operationalization of repair in the present test context, this study intends to determine what repair initiations and other-directed word searches the test takers produce, taking into consideration also the role of the interlocutor, who is either an L2S or an L1S.


Data and setting

Participants and setting

The data focused on in this paper were gathered from two speaking tests, which were conducted in a second-semester Chinese as a Foreign Language (CFL) classroom at a private university in the US. Twenty-eight students were included in this study. The participants were non-heritage speakers of Chinese, who had not studied the language before enrolling in Chinese language classes at college. For most of the participating students, English was their first language.

Prior to the assessments, the students were instructed on repair use in Chinese. The instruction targeted three types of repair practices: (1) other-directed word searches (i.e., the ability to ask for help when encountering lexical trouble in the production of one's own turn; this is a type of forward-oriented repair); (2) other-initiated repair (i.e., the ability to ask for clarification or otherwise display non-understanding of the prior turn produced by the coparticipant; this is a type of backward-oriented repair); and (3) self-completed repair in response to other-initiation of repair (i.e., the ability to respond to a display of non-understanding initiated by a coparticipant, for example, the ability to respond to a clarification request; this is also a type of backward-oriented repair). Repair was taught using two sets of recorded naturally occurring conversations as teaching materials; that is, conversations between L1Ss and conversations between CFL beginning students and L1Ss. The teacher (one of the co-authors of this paper) taught the repair practices using a set of instructional phases, modified from the pedagogical cycle proposed by Barraja-Rohan (2011; Barraja-Rohan & Pritchard, 1997) and further elaborated by Betz and Huth (2014). For a more detailed description of the pedagogical cycle used in this study, see Kley et al. (2021), Kunitz and Yeh (2019), and Yeh (2018a, 2018b).

Procedure and methods

To demonstrate their ability to employ the targeted interactional practices, including the repair practices listed previously, the students were required to engage in two speaking assessments. For one speaking test, which was conducted at mid-term, each student was paired with a peer with whom they did not usually work. For the second speaking test, one month later, the students were asked to find an L1-speaking student of Chinese on campus. Fourteen interactions with an L2S and 28 interactions with an L1S were collected.

For both tests, the task was the same: "get to know your friend." The students were told to learn more about their conversation partner and their life
on campus. The prompt for the conversation with the L2S was as follows: "Chat with your classmate and get to know him/her better." For the conversation with the L1S, the students were prompted as follows: "Chat with your Chinese friend. Get to know him/her." Students were instructed to converse for eight minutes, but most interactions, especially those with L1Ss, lasted longer (in total we collected roughly 400 minutes of video recordings). See Tables 8.1 and 8.2 for a list of the conversations and their lengths. Later, after watching the video recordings, the students analyzed and reflected on their own performance.

The students' conversations were graded based on a set of IC teaching objectives and learning outcomes for the course. For each interaction, with the peer and with the L1S, the students were awarded a repair sub-score on a 5-point scale. Repair scores were assigned with respect to the quality and quantity of the repair practices used.

TABLE 8.1  Time lengths of the conversations with the L2S

Pairs                    Time Length     Pairs                    Time Length

Asher & Baek-hyn         7:57            Sophia & Yoo-jung        6:52
Connor & Walker          11:15           Charlotte & Do Hyn       15:49
Dae-Hyn & Isabella       12:17           Amelia & Charlie         16:18
Jack & Olivia            14:45           Sehyeon & Karl           8:09
Peter & Belli            8:35            Sam & Chloe              11:00
Scott & Marie            9:16            Arthur & Katie           10:12
Bo-kyung & Riley         12:07           Ji Ho & Emma             8:07

TABLE 8.2  Time lengths of the conversations with the L1S

Student        Time Length     Student        Time Length

Baek-hyn       8:16            Sophia         10:57
Asher          9:23            Yoo-jung       8:13
Connor         17:00           Charlotte      15:49
Walker         10:16           Do Hyn         17:20
Dae-Hyn        12:06           Amelia         7:56
Isabella       18:00           Charlie        15:15
Olivia         10:40           Sehyeon        10:59
Jack           9:12            Karl           10:53
Belli          14:36           Sam            18:18
Peter          9:51            Chloe          10:13
Marie          9:16            Katie          9:46
Scott          10:28           Arthur         11:22
Bo-kyung       11:34           Ji Ho          8:44
Riley          9:29            Emma           5:56


Initially, words-only transcripts of all 42 interactions were produced and instances of repair were identified. For the purposes of this study, we decided to focus on the students' ability to recruit help when interactional trouble occurred, either when the students had difficulty in continuing their turn (i.e., when students engaged in word searches and asked for help) or when issues of understanding occurred (i.e., when the students could not understand part or the entirety of the coparticipant's prior turn). That is, we focused on other-directed word searches and on other-initiations of repair. Using CA conventions (Jefferson, 2004), more elaborate transcriptions of these focal sequences were produced, including embodiment notations, to allow for a closer and more detailed analysis of the repair practices accomplished by the students. In the excerpts reproduced in the next section, the co-occurrence of talk and embodied actions is marked with a plus sign. Once we had identified all repair practices, we counted the number of instances in which each practice occurred in the two datasets. It should also be noted that we excluded from the final count the instances of chained sequences; that is, longer repair sequences that included more than one repair initiation.

As mentioned earlier, this study relies on both qualitative and quantitative methods. In the first, qualitative phase, repair instances were identified to create a collection of potential cases. Subsequently, single-case analyses for all instances were developed; thanks to this detailed analysis, it was possible to identify specific empirical categories and to further refine our collection. This kind of procedure is typical of CA work, which starts from single-case analyses to develop collections of similar cases. In the second, quantitative phase of our study, we counted the instances of repair for each category we had identified in the qualitative phase. Quantification was useful to answer our research question; that is, to verify whether there were any differences in repair practices in interactions with L2Ss versus interactions with L1Ss. Though our research question is etic (i.e., researcher-relevant) in nature, the procedures we have used to analyze the interactional data are emic (i.e., participant-relevant) and therefore in line with a CA approach to data analysis.

Data analysis

As mentioned earlier, the analysis focused on instances of other-initiated repair and other-directed word searches. More specifically, we identified nine practices of other-initiated repair (i.e., nine different ways of indicating nonunderstanding and asking for help) and five types of other-directed word searches (i.e., five ways of recruiting the coparticipant’s help in the face of a speaking problem). In what follows, we first provide a description of the different practices, together with some illustrative examples; then, we engage in a quantitative analysis of the practices we have identified in the two datasets.
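To make the tallying step of the quantitative phase concrete, the following minimal sketch shows one way such counts and percentage shares could be computed once every repair instance has been coded for practice type and interlocutor condition. The sketch is illustrative only: the variable names, the example entries, and the coding labels are hypothetical and do not reproduce the authors' actual coding sheet or tools.

from collections import Counter

# Hypothetical coded data: one (interlocutor, practice) pair per identified
# repair instance. The entries below are placeholders, not the study's data.
coded_instances = [
    ("L2S", "repetition of trouble source"),
    ("L1S", "explanation request with repetition of trouble source"),
    ("L1S", "explicit translation request"),
    # ... one tuple for every instance identified in the transcripts
]

def frequency_table(instances):
    """Count each (interlocutor, practice) combination and report its share
    of all coded instances, mirroring the layout of a frequency table."""
    total = len(instances)
    counts = Counter(instances)
    return {key: (n, 100 * n / total) for key, n in counts.items()}

for (interlocutor, practice), (n, pct) in sorted(frequency_table(coded_instances).items()):
    print(f"{practice:55s} {interlocutor}: {n} ({pct:.2f}%)")

Under these assumptions, the share reported for each cell is simply its count divided by the total number of coded instances, which is the same computation that underlies the percentages reported in the quantitative analysis below.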


Repair practices identified in the student interactions

In this section, each repair practice is briefly described. The most frequent practices are also illustrated with an example. The three most frequent practices in the peer–peer interactions are also frequent in the interactions with L1Ss; for reasons of space, we only include examples from the peer–peer interactions for these three practices. All other examples are taken from the interactions with L1Ss. In all examples, the participants' names have been replaced with pseudonyms.

The examples are selected from the interactions of five focal students. Two of them (Sam and Ji Ho) produced no repair practices when interacting with a peer and some repair practices when interacting with the L1S; the other three students (Peter, Katie, and Chloe) produced repair practices in both interactions but produced more instances of repair and more varied practices in the interactions with the L1S. Table 8.3 shows the number of instances and practices produced by each of these students.

Other-initiated repair

A total of nine repair initiation practices in other-repair sequences have been identified in the two datasets. The description of these practices is organized from the more generic practices (e.g., claims of non-understanding, repetition requests, and open-class repair initiators) indicating generic trouble with the previous turn to the more specific practices which pinpoint a specific trouble source in the previous turn (e.g., repetitions of the trouble source and candidate understandings). The difference between these practices has an impact on the interactional affordances for the coparticipant, in that indications of generic trouble may be open to interpretation (as to what the nature of the trouble is and what can be done to repair it), while practices that clearly orient to a specific trouble source mobilize more specific responses.

TABLE 8.3  Repair practices by focal student

Student     Interaction w/peer                       Interaction w/L1 speaker
            Total instances    Types of practices    Total instances    Types of practices

Sam         0                  /                     7                  3
Ji Ho       0                  /                     9                  5
Peter       1                  1                     13                 6
Katie       5                  4                     10                 6
Chloe       6                  2                     11                 4


With claims of non-understanding (e.g., the equivalent of 'I don't know,' 'I don't understand') the participant indicates that there is a generic problem with the coparticipant's previous turn, due to epistemic issues of not understanding or not knowing, without specifying the exact trouble source. In the datasets, these initiations are responded to with repetitions of the trouble turn, reformulations, and translations into English. Similarly, repetition requests (here mostly issued with the formula 'please say it again') and open-class repair initiators (e.g., 'what?', 'uh?', 'sorry?'; see Drew, 1997) also indicate generic trouble with the prior turn; however, they do not specify whether the problem might be due to issues of non-understanding/knowing or of not hearing. These initiations are typically responded to with reformulations in Chinese or translations into English. An example of an open-class repair initiation is provided in Excerpt 1.

Excerpt 1 – What? (Ji Ho & Lin, 8:17–8:33)

01 LIN:    wo duzi e le. ni duzi e ma?
           I belly hungry PART you belly hungry Q
           I am hungry. are you hungry?
02         +(0.8)
           +Ji Ho leans forward
03 JI HO:  [shenme?]
            what
            what?
04 LIN:    [ni dui]zi- nide duzi e, ni e ma?
            you belly- your belly hungry, you hungry Q
            you belly- your belly is hungry, are you hungry?
05         (0.3)
06 JI HO:  O::H! OH. e?
                     hungry
           O::H! OH. hungry?
07         +(0.2)
           +Lin nods
08 JI HO:  oh. wo:: (0.2) wo::: hen °he::n° wo hen e.
               I         I     very  very  I very hungry
           oh. I:: (0.2) I::: am very °very::° I am very hungry.

In line 01, the L1S Lin announces that she is hungry and asks whether Ji Ho, the student, is hungry as well. After a 0.8-second pause (line 02), which suggests that Ji Ho might be having trouble in producing the response
made relevant by the turn in line 01, Ji Ho uses the open-class repair initiator shenme? ('what?') to indicate that there is a problem of hearing or understanding the previous turn. Lin completes the repair with a reformulation of her previous question (line 04) and a breakthrough moment ensues (line 06; note the use of the change-of-state token oh, see Heritage, 1984), with Ji Ho finally providing the relevant response in line 08.

Explanation requests constitute a frequent technique in the datasets (see later). We have identified two different practices: explanation requests without and with repetition of the trouble source. Explanation requests without repetition of the trouble source are designed to ask the meaning of what was said before with formulae like 'what does it mean?' or 'what is it?'; these repair initiations are typically responded to by the coparticipant with a translation into English of the entire turn or part of it. On the other hand, explanation requests with repetition of the trouble source take the format 'what is X?'; these repair initiations are responded to with translations in the interactions with a peer, but may also be responded to with explanations in Chinese by the L1Ss. An example is provided in Excerpt 2, which is taken from the interaction between two students, Sam and Chloe.

Excerpt 2 – What does it mean? (Sam & Chloe, 2:27–2:38)

01 SAM:    wode: ba ma: (0.2) dou shi: shangren.
           my    dad mom     both are businessmen
           my: dad and mom: (0.2) are: both businessmen.
02         (0.6)
03 CHLOE:  shangren. shangre:n (0.5). hh shenme yisi.
           businessman businessman       what   meaning
           businessmen. businessme:n. (0.5). hh what does it mean.
04 SAM:    uh shi::: business.
              is
           uh it's::: business.
05         (0.2)
06 CHLOE:  business!
07 SAM:    mh mh.
08 CHLOE:  (wow.) hen youyisi.
                  very interesting
           (wow.) very interesting.

In line 01, Sam informs Chloe that both her parents are businessmen. After a 0.6-second pause (line 02), Chloe initiates repair on a specific trouble
source; that is, she repeats shangren ('businessmen') and asks for its meaning (shenme yisi., 'what does it mean.', line 03). In line 04, Sam translates one part of the word shangren (literally, 'business people', from shang, 'business', and ren, 'people') by saying business. After repeating this translation (line 06), Chloe provides a positive assessment (line 08) of the informing in line 01.

Another technique consists of repeating (part of) the trouble source without explicitly asking for its meaning. Initiations done with repetitions of the trouble source are responded to with translations into English or with explanations in Chinese, gestures, etcetera. An example is provided in Excerpt 3. Here, the participants (the students Arthur and Katie) are engaged in talking about their university courses. We join their interaction as Arthur is listing the four courses he is taking on Monday, Wednesday and Friday (line 01), which are: Chinese (line 06), physics (line 08), math (line 10), and economics (line 13). Katie displays attentive listenership by nodding at meaningful junctures in Arthur's list (see lines 03, 05, 07, 09, 12). However, after Arthur mentions economics (line 13), Katie initiates repair on the item listed just before; that is, shuxue ke ('math class', line 10). She does so by repeating the trouble source (with inaccurate pronunciation) with upward intonation and frowning (line 15), while looking at Arthur. He responds by repeating the problematic word (with accurate pronunciation) and translating it into English (line 16).

Excerpt 3 – Math? (Arthur & Katie, 5:09–5:41)

01 ARTHUR:  uh xingqi yi, san, (0.9) wu wo you
               week   one three      five I have
            uh on Monday, Wednesday, (0.9) Friday I have
02          wo ye you si
            I  also have four
            I also have four
03          +(1.5)
            +Katie nods
04 ARTHUR:  uh s:::: (0.5) s:i (.) jie ke,
                           four    M   class
            uh f:::: (0.5) f:our (.) classes,
05          +(0.5)
            +Katie nods
06 ARTHUR:  wo you:: uh zhongwen ke:,
            I  have     Chinese  class
            I have:: uh Chinese class:,
07          +(1.3)
            +Katie nods
08 ARTHUR:  wuli (0.2) physics (0.2) ke,
            physics                  class
            physics (0.2) physics (0.2) class,
09          +(0.6)
            +Katie nods
10 ARTHUR:  uh shuxue ke,
               math   class
            uh math class,
11          (0.6)
12          +(0.3)
            +Katie nods
13 ARTHUR:  u:h jingji    ke,
                economics class
            u:h economics class,
14          +(0.3)
            +Katie and Arthur look at each other
15 KATIE:   +/su/xue ke?
              math   class
              math class?
            +frowning

16 ARTHUR:  shuxue uh math.
            math
            shuxue uh math.
17          (0.6)

202  Katharina Kley, Silvia Kunitz, and Meng Yeh

English (line 03). Upon receiving confirmation by Qi (line 06), Sam provides the relevant response (line 07). Excerpt 4 – Chusheng is born? (Sam & Qi, 1:54–2:04)

01 QI:

na  ni  shi chusheng zai bolan then you is  born   in  Poland

02 

haishi chusheng zai meiguo? or    born   in USA then were you born in Poland or in the US?

03 SAM: uh wo: (0.2) shi:: tsh- chusheng shi (0.3)born? uh I    is   born   is uh I: (0.2) was:: tsh- chusheng is (0.3) born? 04

+(0.7) +Sam & Qi look at each other

05 SAM: uh 06 QI:

+mh. +Qi nods

07 SAM: uh hu. (so I) mtsh- (0.3)   I am New-York person uh hu. (so I) mtsh- (0.3) Other-directed word searches

This section concerns the practices with which students other-direct their word searches; that is, the practices with which they attempt to recruit the coparticipant’s help in producing the outcome of the search (i.e., the next due lexical item). A total of five practices for other-directing word searches have been identified. The description of these practices will start with more subtle ways of other-directing a word search (e.g., halting midway and gazing at a coparticipant) and end with more explicit practices that require specific help (e.g., translation requests). The first practice we have identified consists of nonverbal requests for help without an attempted outcome. In these cases, there is a halt in the progressivity of the student’s turn due to the engagement in the word search, the outcome of which is not even attempted (at most it is hinted at). Help seems to be recruited with the use of prosody and embodiment (e.g., gazing at the coparticipant as the turn comes to a halt and perturbations are produced). Sometimes, however, the student does attempt to complete the word search and issues a confirmation request. We have identified two practices for this kind of action, depending on the layers of semiotic resources being

L1–L2 speaker interaction  203

used. The first practice consists of a confirmation request on the outcome of a self-completed search; the request is accomplished verbally and with the use of prosody and/or embodiment. This practice seems particularly frequent in the interactions with L1 speakers. In the example reproduced in Excerpt 5, Chloe, a student, interacts with the L1S Ying. After announcing that she wants to become an ambassador (line 1), Chloe lists the languages she would like to learn. In line 11 she engages in a word search targeting the word for ‘Korean language’, which she produces in the same line (hanwen). Despite Ying’s nodding (line 12), which might actually indicate understanding and acceptance of the word produced by the coparticipant, Chloe initiates repair by saying hanguode? (‘Korea’s?’, line 13). Her turn is produced with upward, try-marked intonation (Sacks & Schegloff, 1979) and is accompanied by the embodied action of tilting her head to face Ying (line 13), thereby making a response relevant from her. With this action, Chloe seems to cast uncertainty on the solution proposed in line 11 (hanwen, ‘Korean language’) and to be looking for (dis)confirmation on that lexical item. Ying confirms that hanwen is correct (line 14), and Chloe repeats it while nodding and looking at Ying (line 15). Excerpt 5 – Korean. Korea’s? (Chloe & Ying, 8:31–8:52)

01 CHLOE:  wo xiang shi ambassador,  I  want am  I want am an ambassador, 02    (0.4) 03 YING:    ng:.  m:h. 04    +(0.6)  +Chloe nods 05 YING:  ni xiang [dang,]  you want  serve  you want to serve, 06 CHLOE:     [soyi] wo: wo: (0.3) YAO, (0.3)   so  I  I     want 07    zhidao, (0.2) zhongwen,  know    Chinese  so I: I: (0.3) WAnt, (0.3) to know, (0.2) Chinese, 08    (0.2) 09 CHLOE:  yingwen,  English,

204  Katharina Kley, Silvia Kunitz, and Meng Yeh

10    (0.3) 11 CHLOE:  han- (0.4) hanwen,   Kor-    Korean  Kor- (0.4) Korean, 12  +(0.2)  +Ying nods 13 CHLOE:   +hanguo[de]?   Korea  N?  Korea’s?  +Chloe tilts head to face Ying 14 YING:     [ha]nwen.       Korean. 15 CHLOE:  +hanwen.  Korean  +Chloe looks at Ying and nods Another example is provided in Excerpt 6, where Katie, a student, tells the L1S Han about her exams. As she lists the first exam she has to take (Chinese; line 1), Katie engages in a word search targeting the equivalent of ‘take-home’. She attempts a solution with hui:°°°ji-°°° (‘go: °°°ho-°°° ’, line 3) as she looks at Han (line 3). This action is responded to by Han with a head nod (line 4) and with the provision of the relevant item (dai huija de kaoshi, ‘take-home test’, line 5; note the presence of the change of stake token oh, see Heritage, 1984), which is then repeated by Katie (line 6). Excerpt 6 – Take-home test (Katie & Han, 1:53–2:15)

01 KATIE: wo:: (.) x:ian you: (0.9) zhong- (.) zhongwen, I first  have    Chin-    Chinese I:: (.) f:irst have: (0.9) Chin- (.) Chinese, 02 

+(0.3) +Han nods

04 

+(0.3) +Han nods

03 KATIE: u:h (1.4) hu- hui- (0.6) hui+:°°°ji-°°°    g- go-      go   ho u:h (1.4) g- go- (0.6) go: °°°ho-°°°    +Katie looks at Han

L1–L2 speaker interaction  205

05 HAN: oh dai  huijia   de kao[shi.]    take go-home  N  test oh take-home test. 06 KATIE:     [dai] huijia de kaoshi,     take go-home N test     take-home test, 07 

(0.3)

08 HAN: ng. mh. In some cases, the confirmation request is done with an additional resource; that is, the outcome of the search, initially produced in Chinese, is then produced in English as well, possibly as a way to clarify what the searched-for word might be and to avoid potential misunderstandings. In other cases, there is no attempt at self-completing the search in Chinese. Rather, English is used to produce the next due item in English, with prosody and embodiment that clearly mobilize the coparticipant’s help in finding the outcome of the search. An example is provided in Excerpt 7, where Peter, a student, is interacting with the L1S Wei. Here Peter is talking about his younger brother, who is also in college (lines 1–4). As he attempts to name his brother’s major, Peter engages in a word search targeting the Chinese word for accounting. At first, his search is self-directed, as indicated by the fact that he looks up (line 6), thereby disengaging from his coparticipant (a typical feature of selfdirected searches; see Goodwin & Goodwin, 1986). Peter then turns to look at Wei as he says accounting? (line 7) with upward intonation and a smile. The switch to English, the prosody of this turn and the embodied action of turning to look at his coparticipant are all semiotic resources with which Peter other-directs the search to Wei and mobilizes a response from him. Indeed, the response is produced in line 9, with Wei providing the translation of English accounting into Chinese kuaiji. Peter repeats the word (line 10) and then recycles it into a reformulated version of his original turn (line 13). Excerpt 7 – Accounting? (Peter & Wei, 1:16–1:37)

01 PETER: uh wo ye     you- oh wo you (.) didi.   I  also   have   I  have younger-brother uh I also have- oh I have (.) a younger brother. 02 

(0.3)

206  Katharina Kley, Silvia Kunitz, and Meng Yeh

03 WEI: mh mh. 04 PETER: ta ye  zai: (0.2) daxue. he also  at      college he is also i:n (0.2)college. 05 

ta shi:  (1.1)  u::hm (0.4) tade zhuanye shi: he is      his major   is he is: (1.1) u::hm (0.4) his major is:

06 

+(1.7) +Peter looks up

07 

+accounting¿ +Peter turns to look at Wei and smiles

08 

+(0.7) +Peter and Wei look at each other

09 WEI: °°accounting, uh j-°° (1.0) kuaiji.     accounting 10  PETER: kauiji. accounting. 11 

(0.3)

12 WEI:

°[ng.]° mh.

13 PETER: [ta]de zhuanye shi kauiji.   his   major is   accounting   his major is accounting. 14 

(0.3)

15 WEI:

mh mh.

Finally, the most explicit and most common practice for other-directing word searches in both datasets consists of explicit translation requests. In these cases, the student stops midway during turn production; the next due item is then produced in English and is accompanied by an explicit translation request from English into Chinese that is packaged with various linguistic formulations (e.g., ‘how do you say X?’, where X is the item in English; ‘X what?’; ‘X is?’; etc.). An example is provided in Excerpt 8, which illustrates the interaction between two students, Peter and Belli. The excerpt picks up the talk as Belli asks Peter about his internship at one of the local colleges (lines 1–2). Peter starts responding by saying tch- wo::: bang u:hm (0.2) wode::: (‘tch- I:::: help u:hm (0.2) my::::’, line 3) and then engages in a word search targeting the next due lexical item. Eventually, after various perturbations including pauses and hesitation tokens, Peter produces the outcome of the search in English (line 4) as he looks at Belli. While this action could already be considered a request for help, Peter upgrades it by explicitly asking Belli if she knows how to say professor (line 5). After tentatively producing the candidate outcome laoshi? (‘teacher?’, line 6), Belli claims that she does not know with interspersed laughter (line 7) and Peter reformulates his answer to Belli’s initial question (lines 1–2) by integrating the English word in his turn (uhm wo- wo bang (0.2) °professor°., ‘uhm I- I help (0.2) a °professor°.’, line 8).

Excerpt 8 – Do you know how to say professor? (Peter & Belli, 1:50–2:10)

01 BELLI: u:hm nide::  (0.2) uhr
               your          intern
02        ni::: (0.7) zai (.) Baylor u:h zuo shenme¿
          you         at                 do  what
          u:hm your:: (0.2) inte:rn uhr what do you::: (0.7) u:h do at Baylor?
03 PETER: tch- wo::: bang u:hm (0.2) wode::: (0.6)
               I     help           my
          tch- I::: help u:hm (0.2) my:::
04        u:hm (0.6) mtsh- (0.7) u:h +°°
                                     +Peter looks at Belli
05        professor zenme shuo [ni  zi zhidao ma¿]
                    how to say  you    know   Q
          do you you know how to say professor?
06 BELLI:  [uh (.) laoshi? u:h]=
                   teacher
           uh (.) teacher? u:h
07        =wo bu(h)  zhi(h)dao(h).
           I  not    know
          I do(h)n’t kno(h)w(h).
08 PETER: uhm wo- wo bang (0.2) °professor°.
              I   I  help
          uhm I- I help (0.2) a °professor°.


Quantitative analysis

Table 8.4 lists the repair practices that were found, together with a frequency distribution showing how often each practice was produced by the focal students in either setting (i.e., interaction with the L2S and the L1S). As Table 8.4 shows, compared to their peer interactions, the students employed both other-directed word searches and other-initiated repair more frequently in the conversations with the L1S. Of the 270 word searches and other-initiated repairs produced in all interactions, the students used 27 word searches (10%) in interaction with a peer, but 117 (43.33%) in interaction with a L1S; other-initiated repair was employed 29 times (10.74%) in the L2S conversations and 97 times (35.93%) in the L1S interactions. In addition, while all repair practices that were identified occurred in the L1S interactions, two practices, repetition requests and nonverbal requests for help without attempted outcome, were produced only in interactions with the L1S and not with the peer.

TABLE 8.4 Frequency distribution of the other-initiated repair practices and other-directed word searches identified in interaction with the L2S and the L1S

Repair Practice                                                          Interaction w/L2S    Interaction w/L1S

OTHER-INITIATED REPAIR
Claims of non-understanding                                              1 (0.37%)            1 (0.37%)
Repetition request                                                       0                    7 (2.59%)
Open-class repair initiation                                             2 (0.74%)            12 (4.44%)
Explanation request w/o repetition of trouble source                     2 (0.74%)            3 (1.11%)
Explanation request w/repetition of trouble source                       8 (2.96%)            35 (12.96%)
Repetition of trouble source                                             10 (3.7%)            29 (10.74%)
Partial repetition of trouble source                                     1 (0.37%)            1 (0.37%)
Candidate understanding in English                                       4 (1.48%)            8 (2.96%)
Candidate understanding in Chinese                                       1 (0.37%)            1 (0.37%)

OTHER-DIRECTED WORD SEARCHES
Nonverbal request for help w/o attempted outcome                         0                    9 (3.3%)
Confirmation request on self-completed outcome (embodiment, prosody)     3 (1.11%)            17 (6.3%)
Confirmation request on self-completed outcome (English)                 4 (1.48%)            4 (1.48%)
Next due item produced in English                                        2 (0.74%)            22 (8.15%)
Explicit translation request                                             18 (6.67%)           65 (24.07%)
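The percentages in Table 8.4 (and in the totals reported above) are proportions of the 270 repair instances identified across both settings, while the per-student figures mentioned later are proportions of the 28 focal students. As a quick arithmetic check, the following minimal Python sketch (not the authors' analysis script) reproduces these figures from the raw counts reported in the chapter:

# Raw counts reported in the chapter, by repair type and setting.
word_searches = {"L2S": 27, "L1S": 117}        # other-directed word searches
other_initiated = {"L2S": 29, "L1S": 97}       # other-initiated repair

total = sum(word_searches.values()) + sum(other_initiated.values())
assert total == 270  # grand total of repair instances

def pct(count, denominator=total):
    """Percentage of the denominator, rounded to two decimals."""
    return round(100 * count / denominator, 2)

for label, counts in [("word searches", word_searches),
                      ("other-initiated repair", other_initiated)]:
    for setting, n in counts.items():
        print(f"{label} with {setting}: {n} ({pct(n)}%)")
# word searches with L2S: 27 (10.0%)
# word searches with L1S: 117 (43.33%)
# other-initiated repair with L2S: 29 (10.74%)
# other-initiated repair with L1S: 97 (35.93%)

# Per-student proportions use the 28 focal students as denominator instead:
print(pct(25, 28))  # 89.29, reported as 89.3% in the chapter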


Taking a closer look at the students, the analysis revealed that more students used repair when conversing with a L1S than with a L2S. That is, 14 out of the 28 students (50%) employed other-directed word searches in their interactions with the peer, while 25 students (89.3%) included the L1S in their word searches. With regard to other-initiated repair, it was found that 16 students (57.14%) other-initiated repair in the L2S conversations and 25 students (89.3%) in the L1S interactions.

The repair practices employed most frequently can be found in Table 8.5. The other practices were produced to a similar extent in both settings but, overall, the students used them rather sparingly. The most frequent practices for other-initiating repair in both settings were explanation requests with repetition of the trouble source and repetitions of the trouble source. This might be an effect of instruction, in that the teacher had insisted (following Wong & Waring, 2010) on the importance of identifying and pinpointing the trouble source, in order to guide the coparticipant as to the kind of help that was needed. At the same time, the frequency of these practices reflects how skilled the students were in repeating lexical items that they could not quite understand; that is, by identifying and repeating words used in prior talk that were difficult for them, the students manifested high monitoring and processing skills.

TABLE 8.5 Frequency distribution of the most frequently used repair practices in interaction with the L2S and the L1S

Repair Practice                                                          Interaction w/L2S    Interaction w/L1S

OTHER-INITIATED REPAIR
Explanation request w/repetition of trouble source                       8 (2.96%)            35 (12.96%)
Repetition of trouble source                                             10 (3.7%)            29 (10.74%)
Open-class repair initiation                                             2 (0.74%)            12 (4.44%)
Candidate understanding in English                                       4 (1.48%)            8 (2.96%)

OTHER-DIRECTED WORD SEARCHES
Explicit translation request                                             18 (6.67%)           65 (24.07%)
Next due item produced in English                                        2 (0.74%)            22 (8.15%)
Confirmation request on self-completed outcome (embodiment, prosody)     3 (1.11%)            17 (6.3%)
Nonverbal request for help w/o attempted outcome                         0                    9 (3.33%)


The students still produced very generic practices to indicate trouble, such as open-class repair initiations; however, these practices did not appear with the same frequency as the other two. Finally, some of the students attempted to provide their candidate understanding of prior talk in English; this practice also indicates (like repetitions of the trouble source with or without an explicit explanation request) high processing skills and engagement with prior talk.

When it comes to practices for other-directing a search, we notice a similar tendency to identify exactly the kind of help that was needed. Be it with explicit translation requests or with the production of the next due item in English, the students indicated exactly what the problem was, and they did so by resorting to English (i.e., in both practices the trouble source was produced in English) and by making relevant (more or less explicitly) a translation in response. The recourse to English was actually discouraged by the teacher, both during instruction and for assessment purposes; this finding therefore shows that – despite the attempts at having the students rely solely on their target language – English (which was the language the students used in their daily life at the university and the L1 for most of them) is treated by the students themselves as an important resource that helps achieve intersubjectivity and aids progressivity, also in testing settings. On the other hand, some students did try to complete the word search themselves and simply offered it for confirmation. Finally, and quite interestingly, only in the interactions with the L1Ss did some students indicate that they needed help without specifying the lexical item they were searching for. That is, they relied solely on the coparticipant’s ability to identify a candidate outcome based on the turn-so-far.

Despite the observed trends illustrated previously, it should be noted that there is also considerable variability in the number of repair practices produced across the 28 students included in this study. Figure 8.1 shows that, in interaction with the L1S, Olivia and Dae-Hyun produced 18 and 16 instances of repair respectively; Connor, however, other-initiated repair and/or engaged in a word search only twice, while Yoo-jung did not repair at all when conversing with the L1S. This variability is also prominent in the interaction with the peer: while Chloe, Dae-Hyun, and Emma each produced six instances of repair, Peter, Belli, and Amelia engaged in repair only once, and Charlie, Sophia, and Yoo-jung did not repair at all.

FIGURE 8.1 Total number of repair practices produced with L1S vs. L2S across all 28 students

This overall trend is also reflected in the individual practices. For example, as can be seen in Figure 8.2, Baek-hyn, Arthur, Sophia, and Yoo-jung did not engage in any explicit translation requests, which nevertheless were the most frequent practices both in the interactions with the L1S and in the interactions with the L2S. In comparison, Belli employed nine explicit translation requests in the L1S interaction, but only one in the interaction with the L2S; Scott produced one explicit translation request in either test setting.

FIGURE 8.2 Frequency of explicit translation requests produced with L1S vs. L2S across all 28 students
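Figures 8.1 and 8.2 plot such per-student counts side by side. The sketch below shows how a grouped comparison of this kind could be drawn with matplotlib; since only the counts of the students named in the text are reported here, the values are restricted to those students, and the plot illustrates the approach rather than reproducing the published figures.

import matplotlib.pyplot as plt

# Counts taken from the text; students whose counts are not reported are omitted.
l1s_counts = {"Olivia": 18, "Dae-Hyun": 16, "Connor": 2, "Yoo-jung": 0}
l2s_counts = {"Chloe": 6, "Dae-Hyun": 6, "Emma": 6, "Peter": 1,
              "Belli": 1, "Amelia": 1, "Charlie": 0, "Sophia": 0, "Yoo-jung": 0}

students = sorted(set(l1s_counts) | set(l2s_counts))
pos = {name: i for i, name in enumerate(students)}
width = 0.4

fig, ax = plt.subplots(figsize=(9, 3))
# Draw a bar only where a count is actually reported in the chapter.
ax.bar([pos[s] - width / 2 for s in l1s_counts],
       list(l1s_counts.values()), width, label="with L1S")
ax.bar([pos[s] + width / 2 for s in l2s_counts],
       list(l2s_counts.values()), width, label="with L2S")
ax.set_xticks(range(len(students)))
ax.set_xticklabels(students, rotation=45, ha="right")
ax.set_ylabel("Repair practices produced")
ax.legend()
fig.tight_layout()
plt.show()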

Discussion and conclusion

This chapter focused on analyzing instances of other-initiated repair and other-directed word searches in two oral classroom-based test settings. For one interaction, the L2 learners of Chinese were paired with a L2S interlocutor, and for a second interaction with a L1S. We focused on these two types of repair because they were the object of instruction and the target of assessment. Furthermore, given our interest in a possible interlocutor effect due to the differential linguistic expertise between the test takers and their interlocutors, we focused on instances of repair in which the test taker actively tried to recruit the interlocutor’s help during the ongoing interaction.

The analysis of both test interactions revealed that the test takers produced nine practices of other-initiated repair and five types of other-directed word searches. Our findings show that the most frequent practices in the L1–L2 data were also the most frequent practices in the L2–L2 data. Specifically, test takers who other-initiated repair typically did so with repetitions of the trouble source, with or without an explicit explanation request. On the other hand, test takers who other-directed their word searches tended to do so by either relying on English (i.e., offering an English word for translation) or by asking for confirmation on the self-completed outcome of a search. That is, it seems that the most common ways of recruiting the coparticipant’s help involved a clear identification of the trouble source (with a backward or forward orientation). Of the 14 repair practices identified, two practices, repetition requests and nonverbal requests for help without attempted outcome, were produced only in the interactions with the L1S, but not with the L2S.

It was also found that the majority of students included in this study engaged in more other-initiated repair and other-directed word searches with the L1S than with the L2S. We speculate that this low level of engagement in both other-initiated repair and other-directed word searches in the interactions with the peer might be a safe strategy to avoid potentially face-threatening situations, for example, displaying non-understanding or asking for help that the peer might not be able to give. In comparison, the high number of other-directed word searches in L1–L2 interaction may be related to the linguistic epistemic asymmetry between the two interlocutors, with the test taker orienting to the L1S as more knowledgeable. At the same time, the test taker seems to orient to the importance of establishing intersubjectivity with the L1S, who may be using words and expressions the student is unfamiliar with. This would explain the higher number of repair initiations.


In conclusion, the L1–L2 interaction seems to provide more affordances for initiating repair. This finding is in line with previous research on interlocutor effects in speaking tests (e.g., Berry, 1993; Chambers et al., 2012; O’Sullivan, 2002) in that our research also found that interlocutor characteristics have an impact on test discourse and language performance.

This study also uncovered a high variability in repair practices produced across the 28 students in both test interactions, with the L2S and the L1S. That is, some students engaged in noticeably more repair than others, and some of these differences in repair use turned out to be large. For example, when conversing with the L1S, one student engaged in repair 18 times, while another student did not produce any repair. The interlocutor effect is relatively pronounced in the present test, which may be due to the fact that the interlocutor variable was not controlled, that is, the interlocutors were not trained for the interactions. In addition, as Chalhoub-Deville (2003) has pointed out, when testing IC, we do not look at an individual’s ability, but rather at how an interaction between participants is co-constructed. Because discourse is co-constructed, every interaction is unique; it also changes with the interlocutor. All in all, it is best to accept the variability caused by the interlocutor effect and to consider it an important part of the construct, rather than seeing it as construct-irrelevant and wanting to eliminate it. To account for the variability, the prevailing opinion among language testers is that test takers should be able to engage successfully with different interlocutors (Galaczi & Taylor, 2018). In our case, we would recommend pairing the test takers with different interlocutors, L1Ss and/or L2Ss, for at least two, if not more, interactions when assessing their interactional competencies, including their use of repair. It is also important to note, however, that, despite the variability of repair use across the students, the findings of the study still suggest that a linguistically asymmetric test setting, in which a learner is paired with a L1S, is most fruitful if the testing objective is to elicit repair practices.

In this study, we did not examine the relationship between the repair practices produced and the repair scores assigned by the teacher. In a follow-up study, it would be interesting to investigate whether, and if so, how the students’ scores and their actual repair use align. Finally, for this chapter, we have looked at the use of repair in peer–peer interactions on the one hand and student–L1S interactions on the other, where the L1S is not an interviewer or tester, but another student. The latter is a classroom-based oral test setting, which has not received much attention thus far. Incorporating it in our research and investigating the L1S interlocutor effect is a contribution to the field of assessing IC.


Recommended further reading, discussion questions, and suggested research projects

Further reading

For those interested in learning more about the organization of repair from a CA perspective, the seminal paper by Schegloff et al. (1977) is an essential read. For readers with an interest in the teaching of IC, of which repair is a part, Betz and Huth’s (2014) paper is highly recommended; the authors propose a set of instructional phases to teach IC. In addition, in their book, Wong and Waring (2010) bridge the gap between CA and second language teaching by strengthening teachers’ knowledge of interactional language use and suggesting ways for them to apply that knowledge to teaching. Finally, to learn more about testing IC, the reader is referred to Roever and Kasper (2018) and Galaczi and Taylor (2018). Roever and Kasper (2018) argue that widening the speaking construct by including IC enhances the validity of speaking assessments. Galaczi and Taylor’s (2018) article provides an overview of how IC as a theoretical conceptualization and construct in testing speaking has developed over time; the authors also describe how language testers currently operationalize IC in tests and assessment scales.

Discussion questions

1. The findings of the study show that the instruction of repair practices guides students to employ appropriate repair strategies during interaction. What challenges do you think classroom teachers may face in teaching repair practices? Discuss possible approaches to overcome these challenges.
2. Repair, as one of the components of IC, has been neglected or rated negatively in interview-formatted speaking tests. Suggest ways to improve oral proficiency interviews to incorporate the assessment of repair practices.
3. The authors suggest conducting a follow-up study to investigate how the students’ repair scores and their actual repair use align. After reading this chapter, what other follow-up studies would you propose to further investigate students’ interactional skills in general and their use of repair in particular?

Suggested research project

Design a lesson on repair or another IC component for the language you teach. How would you approach teaching the interactional phenomenon that you have selected? In a second step, develop two different test tasks to assess your students’ ability to use the interactional phenomenon that you have taught. Like the interlocutor, the test task may also lead to variability in language use. Some tasks may also work better, in the sense that they are more able to elicit the targeted interactional phenomenon. Implement the two tasks you have developed and collect the speaking data. Conduct a conversation analysis. Have the students used the phenomenon you have taught? Has learning taken place? Which task has worked best, and why?

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
Barraja-Rohan, A. M. (2011). Using conversation analysis in the second language classroom to teach interactional competence. Language Teaching Research, 15(4), 479–507. https://doi.org/10.1177/1362168811412878
Barraja-Rohan, A. M., & Pritchard, C. R. (1997). Beyond talk: A course in communication and conversation for intermediate adult learners of English (Student’s book). Western Melbourne Institute of Tafe.
Berry, V. (1993). Personality characteristics as a potential source of language test bias. In A. Huhta, K. Sajavaara & S. Takala (Eds.), Language testing: New openings (pp. 115–124). Institute for Educational Research, University of Jyväskylä.
Berwick, R., & Ross, S. (1996). Cross-cultural pragmatics in oral proficiency interview strategies. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem (pp. 34–54). Cambridge University Press and University of Cambridge Local Examinations Syndicate.
Betz, E., & Huth, T. (2014). Beyond grammar: Teaching interaction in the German language classroom. Die Unterrichtspraxis/Teaching German, 47(2), 140–163. https://doi.org/10.1111/tger.10167
Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better performance. Language Testing, 26(3), 321–366. https://doi.org/10.1177/0265532209104666
Brouwer, C. (2003). Word searches in NNS-NS interaction: Opportunities for language learning? Modern Language Journal, 87(4), 534–545. https://doi.org/10.1111/1540-4781.00206
Brouwer, C., & Wagner, J. (2004). Developmental issues in second language conversation. Journal of Applied Linguistics, 1(1), 29–47. https://doi.org/10.1558/jal.v1i1.29
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25. https://doi.org/10.1191/0265532203lt242oa


Brown, A., & McNamara, T. (2004). “The devil is in the detail”: Researching gender issues in language assessment. TESOL Quarterly, 38(3), 524–538. https://doi.org/10.2307/3588353
Chalhoub-Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20(4), 369–383. https://doi.org/10.1191/0265532203lt264oa
Chambers, L., Galaczi, E., & Gilbert, S. (2012). Test-taker familiarity and speaking tests performance: Does it make a difference? In A. N. Archibald (Ed.), Multilingual theory and practice in applied linguistics: Proceedings of the 45th annual meeting of the British Association for Applied Linguistics (pp. 39–41). Scitsiugnil Press.
Coban, M., & Sert, O. (2020). Resolving interactional troubles and maintaining progressivity in paired speaking assessment in an EFL context. Papers in Language Testing and Assessment, 9(1), 64–94. www.altaanz.org/uploads/5/9/0/8/5908292/2020_9_1__3_hircin-soban_sert.pdf. Accessed 10 February 2020.
Council of Europe. (2020). Common European framework of reference for languages: Learning, teaching, assessment – Companion volume. Council of Europe Publishing.
Csépes, I. (2009). Measuring oral proficiency through paired-task performance. Peter Lang. https://doi.org/10.3726/978-3-653-01227-9
Davis, L. (2009). The influence of interlocutor proficiency in a paired oral assessment. Language Testing, 26(3), 367–396. https://doi.org/10.1177/0265532209104667
De Jong, N. (2018). Fluency in second language testing: Insights from different disciplines. Language Assessment Quarterly, 15(3), 237–254. https://doi.org/10.1080/15434303.2018.1477780
Drew, P. (1997). ‘Open’ class repair initiators in response to sequential sources of troubles in conversation. Journal of Pragmatics, 28(1), 69–101. https://doi.org/10.1016/s0378-2166(97)89759-7
Galaczi, E. (2014). Interactional competence across proficiency levels: How do learners manage interaction in paired speaking tests? Applied Linguistics, 35(5), 553–574. https://doi.org/10.1093/applin/amt017
Galaczi, E., & Taylor, L. (2018). Interactional competence: Conceptualisations, operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3), 219–236. https://doi.org/10.1080/15434303.2018.1453816
Goodwin, M., & Goodwin, C. (1986). Gesture and coparticipation in the activity of searching for a word. Semiotica, 62(1–2), 51–75. https://doi.org/10.1515/semi.1986.62.1-2.51
Greer, T. (2013). Word search sequences in bilingual interaction: Codeswitching and embodied orientation toward shifting participant constellations. Journal of Pragmatics, 57, 100–117. https://doi.org/10.1016/j.pragma.2013.08.002
Gullberg, M. (2011). Multilingual multimodality: Communicative difficulties and their solutions in second-language use. In J. Streeck, C. Goodwin & C. LeBaron (Eds.), Embodied interaction: Language and body in the material world (pp. 137–151). Cambridge University Press.
Hayashi, M. (2003). Language and the body as resources for collaborative action: A study of word searches in Japanese conversation. Research on Language and Social Interaction, 36(2), 109–141. https://doi.org/10.1207/s15327973rlsi3602_2


Hellermann, J. (2011). Members’ methods, members’ competencies: Looking for evidence of language learning in longitudinal investigations of other-initiated repair. In J. K. Hall, J. Hellermann & S. Pekarek Doehler (Eds.), L2 interactional competence and development (pp. 147–172). Multilingual Matters. https://doi.org/10.21832/9781847694072-008
Heritage, J. (1984). A change-of-state token and aspects of its sequential placement. In J. M. Atkinson & J. Heritage (Eds.), Structures of social action: Studies in conversation analysis (pp. 299–345). Cambridge University Press. https://doi.org/10.1017/cbo9780511665868.020
Hosoda, Y. (2000). Other-repair in Japanese conversations between nonnative and native speakers. Issues in Applied Linguistics, 11(1), 39–63. https://doi.org/10.5070/l4111005023
Hosoda, Y. (2006). Repair and relevance of differential language expertise in second language conversations. Applied Linguistics, 27(1), 25–50. https://doi.org/10.1093/applin/ami022
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In G. H. Lerner (Ed.), Conversation analysis: Studies from the first generation (pp. 13–31). John Benjamins. https://doi.org/10.1075/pbns.125.02jef
Kasper, G. (2006). Beyond repair: Conversation analysis as an approach to SLA. AILA Review, 19, 83–99. https://doi.org/10.1075/aila.19.07kas
Kasper, G., & Ross, S. (2007). Multiple questions in oral proficiency interviews. Journal of Pragmatics, 39(11), 2045–2070. https://doi.org/10.1016/j.pragma.2007.07.011
Katona, L. (1998). Meaning negotiation in the Hungarian Oral Proficiency Examination of English. In R. Young & A. W. He (Eds.), Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. 239–267). John Benjamins.
Kley, K., Kunitz, S., & Yeh, M. (2021). Jiazhou? Is it California? Operationalizing repair in classroom-based assessment. In M. R. Salaberry & A. R. Burch (Eds.), Assessing speaking in context: Expanding the construct and its applications (pp. 165–191). Multilingual Matters. https://doi.org/10.21832/9781788923828-008
Koshik, I., & Seo, M. S. (2012). Word (and other) search sequences initiated by language learners. Text & Talk, 32(2), 167–189. https://doi.org/10.1515/text-2012-0009
Kunitz, S., & Yeh, M. (2019). Instructed L2 interactional competence in the first year. In M. R. Salaberry & S. Kunitz (Eds.), Teaching and testing L2 interactional competence: Bridging theory and practice (pp. 228–259). Routledge. https://doi.org/10.4324/9781315177021-10
Kurhila, S. (2006). Second language interaction. John Benjamins. https://doi.org/10.1075/pbns.145
Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews: The case of CASE. Language Testing, 13(2), 151–172. https://doi.org/10.1177/026553229601300202
Lin, F. (2014). A conversation-analytic study of word searches in EFL classrooms. Retrieved from ProQuest Dissertations & Theses. (AAI10088083)
McNamara, T. (1996). Measuring second language performance. Longman.


O’Loughlin, K. (2002). The impact of gender in oral proficiency testing. Language Testing, 19(2), 169–192. https://doi.org/10.1191/0265532202lt226oa
O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance. Language Testing, 19(3), 277–295. https://doi.org/10.1191/0265532202lt205oa
Pekarek Doehler, S. (2019). On the nature and the development of L2 interactional competence: State of the art and implications for praxis. In M. R. Salaberry & S. Kunitz (Eds.), Teaching and testing L2 interactional competence: Bridging theory and practice (pp. 25–59). Routledge. https://doi.org/10.4324/9781315177021-2
Pekarek Doehler, S., & Berger, E. (2019). On the reflexive relation between developing L2 interactional competence and evolving social relationships: A longitudinal study of word-searches in the ‘wild’. In J. Hellerman, S. W. Eskildsen, S. Pekarek Doehler & A. Piirainen-Marsh (Eds.), Conversation analytic research on learning-in-action: The complex ecology of second language interaction ‘in the wild’ (pp. 51–75). Springer. https://doi.org/10.1007/978-3-030-22165-2_3
Pekarek Doehler, S., & Pochon-Berger, E. (2015). The development of L2 interactional competence: Evidence from turn-taking organization, sequence organization, repair organization and preference organization. In T. Cadierno & S. Eskildsen (Eds.), Usage-based perspectives on second language learning (pp. 233–268). Mouton de Gruyter. https://doi.org/10.1515/9783110378528-012
Richards, B. J., & Malvern, D. D. (2000). Accommodation in oral interviews between foreign language learners and teachers who are not native speakers. Studia Linguistica, 54(2), 260–271. https://doi.org/10.1111/1467-9582.00065
Roever, C., & Kasper, G. (2018). Speaking in turns and sequences: Interactional competence as a target construct in testing speaking. Language Testing, 35(3), 331–355. https://doi.org/10.1177/0265532218758128
Ross, S. (1996). Formulae and inter-interviewer variation in oral proficiency interviewer discourse. Prospect, 11(3), 3–16.
Rydell, M. (2019). Negotiating co-participation: Embodied word searching sequences in paired L2 speaking tests. Journal of Pragmatics, 149, 60–77. https://doi.org/10.1016/j.pragma.2019.05.027
Sacks, H., & Schegloff, E. A. (1979). Two preferences in the organization of reference to persons and their interaction. In G. Psathas (Ed.), Everyday language: Studies in ethnomethodology (pp. 15–21). Irvington.
Schegloff, E. A., Jefferson, G., & Sacks, H. (1977). The preference for self-correction in the organization of repair in conversation. Language, 53(2), 361–382. https://doi.org/10.2307/413107
van Moere, A. (2012). Paired and group oral assessment. In C. A. Chapelle (Ed.), Encyclopedia of applied linguistics. Wiley-Blackwell. https://doi.org/10.1002/9781405198431.wbeal0894
Wong, J. (2000). Delayed next turn repair initiation in native/non-native speaker English conversation. Applied Linguistics, 21(1), 244–267. https://doi.org/10.1093/applin/21.2.244
Wong, J., & Waring, H. Z. (2010). Conversation analysis and second language pedagogy: A guide for ESL/EFL teachers. Routledge. https://doi.org/10.4324/9780203852347


Yeh, M. (2018a). Active listenership: Developing beginners’ interactional competence. Chinese as a Second Language Research, 7(1), 47–77. https://doi.org/10.1515/caslar-2018-0003
Yeh, M. (2018b). Teaching beginners topic development: Using naturally occurring conversation. Taiwan Journal of Chinese as a Second Language, 17(2), 89–120.

9
YARDSTICKS FOR THE FUTURE OF LANGUAGE ASSESSMENT
Disclosing the meaning of measurement

Albert Weideman

Stance, positionality, and reflection

Looking back at the various contributions in this book, one is struck by the rich diversity of approaches and a splendid variety of analytical methods that are employed, among them conversation analysis (Kley, as well as Kley, Kunitz & Yeh, in this volume), critical language testing, alternative assessment formats (as examined, for example, by Suzuki as well as by Räsänen & Kivik, in this volume) and critical race theory (Richardson, in this volume). They celebrate Feyerabend’s (1978, pp. 11, 46) exhortation: “Proliferation of theories is beneficial for science, while uniformity impairs critical power.” What all have in common is an implied concern that we can do better; that our ways of assessment can be improved, and that correction is both possible and desirable. More importantly, the various contributions in this book suggest potential improvements in the assessment of the ability to use an additional language. This chapter takes those suggestions one step further, by asking whether we stand back often enough to reflect and first ask what ‘improvement’ is.

To be able to gauge the present state of language assessment, and to have criteria that might also be useful into the future, we need to tackle the criteria explicitly, to conceptualize them as clearly and as comprehensively as possible. We need a larger framework that allows us to evaluate the contribution of each of the specific approaches. To show how that can be undertaken, this chapter strikes a reflective note, in the spirit of an autoethnographic approach (Ellis et al., 2011; Bell, 2002; Macbeth, 2001; Paltridge, 2014; Weideman, 2015), and with a disclosure, at the outset, of the academic stance of the author.


My positionality in writing about language assessment reflects two identities and roles. I write in the first instance from what I call a technical point of view, that of a designer of language assessments and tests. The defining characteristic of design constitutes the nuclear moment of technically qualified actions and roles (Schuurman, 2009, p. 384; Van Riessen, 1949, pp. 623, 625), such as devising how to assess language ability. Second, as will be elaborated in what follows, I  am intent on deriving from theory certain requirements for designing tests responsibly, and therefore in alignment with my convictions as a human agent engaged in employing measurements to take decisions about language ability. Such decisions, in turn, especially in the case of high-stakes tests, affect others either beneficially or detrimentally. There is a connection between the designs we make and the ethical dimension of our experience, and neither my design work as an applied linguist tackling the solutions of pervasive language problems, nor my scholarly efforts to understand and theoretically ground these designs are neutral. Planned solutions to language problems affect people. These solutions need theoretical grounding that takes this into account. Thus, in addition to an orientation to the conceptual frameworks that enable us to reflect on the meaning of language assessment, I  begin by disclosing what my (non-neutral) academic position is in this reflection. Reflection requires analysis, and analysis finds its grounds and meaning in abstracting things, setting them apart by distinguishing and comparing them (Strauss, 2009, pp. 13–16), and ultimately yielding an articulated theory. My approach is to analyze without reduction, in other words to avoid elevating a single abstraction to explain everything. I also need to state, at the beginning, that conceptual work is important and, in my view, too often neglected in discussions of language assessment. A notable exception can be found in the treatment by Fulcher (2015) of various paradigms underlying validity theory. The neglect of attending to concepts and ideas occasions conceptual confusion, obfuscation, and the toleration of contradictions. It encourages misunderstandings that stand in the way of fruitful debate across ideological divides or theoretical paradigms. We owe it to the rise of postmodernism to acknowledge that non-neutrality is acceptable. As Feyerabend (1978, p. 19) points out, “science knows no ‘bare facts’ at all, but the ‘facts’ that enter our knowledge are already viewed in a certain way and are, therefore, essentially ideational.” He goes on to say that it is a highly contestable conviction to consider “ ‘scientific facts’ . . . as being independent of opinion, belief, and cultural background”. Thus, a statement of personal and professional intent is no longer considered to be outside the realm of ‘science’. I aim to show in this contribution that our reflections can in fact productively engage with our scholarly experience: we can analyze the pivotal moments in them, remembering the unexpected


insights, recollect the surprising conclusion or the moments of wonder and hesitation, when we are struck by a finding that challenges our cherished prejudices, and when we are faced with the choice between resisting new perspectives, or understanding them properly before acknowledging their strengths or weaknesses.

Given these starting points, this chapter will engage with the author’s professional experiences as they relate to finding a theoretical rationale for language assessment design. It takes as its point of departure that applied linguistic designs, of which language assessment designs are one subset, are always technically stamped endeavors. In these, the technical imagination of the designer of the language intervention (be it a test, a language policy, or a language course) holds sway, and the theoretical rationale provides the analytical support for it. The latter supports the design, rather than leading or prescribing it ‘scientifically’.

Professional memories, pivotal insights, difficult questions

Our memories of pivotal moments in our scholarly work may remain just that: they are often left unarticulated and unexamined. The background to the first memorable insight is two contradictory statements. The one, by McNamara and Roever (2006, p. 250f.), was the claim that there have been “recent moves to strip validity theory of its concern for values and consequences and to take the field back 80 years to the view that a test is valid if it measures what it purports to measure”, and subsequently citing the source of that alleged regression (Borsboom et al., 2004). Their point of reference for this claim was the then ascendant orthodoxy inspired by Messick (1980, 1981, 1988, 1989). The other is the (textbook) discussion by Davies and Elder (2005) on the same issues, validity and validation. These authors complain about Messick’s “add[ing] to the problems of validity by extending its scope into the social and the ethical” (2005, p. 799). On the one hand, they appear to support the current orthodoxy – that “validity is not tucked up in the test itself but rather resides in the meanings [interpretations] that are ascribed to test scores” (Davies & Elder, 2005, p. 809). But on the other, they claim – contrary to post-Messick orthodoxy – that “in some sense validity does reside in test instruments” (Davies & Elder, 2005, p. 797), and that it is therefore “not just a trick of semantics . . . to say that one test is more valid than the other for a particular purpose” (Davies & Elder, 2005, p. 798). Or, as Davies (2011, p. 38) formulated it elsewhere, can we “validate a test and then argue that it is valid”? If a language test is put through a diligent process of validation, why would it then not be valid? I learned two lessons from this. The first is that one may be rightfully skeptical of current orthodoxies. As Feyerabend (1978, p. 43f.) phrases it,


the success of a theory “may be due to the fact that the theory, when extended beyond its starting point, was turned into rigid ideology,” thereby gaining immunity to being contestable.

The second lesson is that a view of validity as a feature of a test refers to the quality of the objective instrument, as discussed in Chapter 2. It functions in this respect like any other attribute we readily – and apparently unproblematically – ascribe to tests, e.g., that they are useful, implementable, reliable, reputable, fair, just, efficient, and so on. The other side of the coin, no doubt, is the process (or procedure, when it becomes institutionalized) of subjective validation (see the discussion in Weideman & Deygers, in this volume). But ‘validation’ in this instance demonstrably encompasses much more than the interpretability or meaningfulness of the results, as the current orthodoxy would have it. The difficult question that remains, then, is where and how such characteristics of tests fit into a theoretical framework.

Application: how do our test designs measure up?

In appraising where we stand (or wish to go) in assessing language ability, the theoretical framework briefly described in Chapter 2 assists us in converting technical ideas that disclose the design of our assessments into principles. These principles must be given concrete shape in our development of tests (as articulated in Weideman, 2017a, p. 225, from which I take several design issues in the discussion later). Each of the principles discussed next derives its theoretical basis from lodestars for language assessment design, or what are called “regulative ideas” in Chapter  2. When these are given concrete shape in language tests, they disclose the meaning of design. Meaningfulness in design: a lingual yardstick

Meaningfulness in design: a lingual yardstick

The first principle or criterion derives from the analogical echo of the lingual modality within the leading technical mode of design of a language assessment. When the leading technical function anticipates its disclosing connection with the lingual dimension, it yields a criterion for language testing that may ask several questions of designers. Is the construct meaningfully operationalized and articulated in a blueprint? Does that blueprint present us with clear and comprehensible test specifications? Is what the test measures understood by those who take it and by those who will use its results, because sufficient information is available about it? And considering output: Does the test yield interpretable and meaningful results? The value of work on improving language test blueprints and specifications is incontestable (for accessible discussions, see Davidson & Lynch, 2002; Fulcher & Davidson, 2007). One set of examples of the successful


operationalization of a construct, that of academic literacy (Patterson  & Weideman 2013a, 2013b), is the various tests of academic literacy used in many South African universities (Network of Expertise in Language Assessment, 2023; Read, 2016; Weideman, 2021a). In these tests, a construct of the ability to use language competently in academic settings is articulated in a dozen or more components before these are linked to appropriate assessment tasks. In this book, task design has been considered both in Chapters 7 and 8 (Kley, in this volume; Kley, Kunitz, and Yeh, in this volume), where, notably, the format and shape of assessment tasks, focusing on the broader scope of multimodal evidence used to assess competence (Chapter 7) and a broader definition of competence as in the analysis of repairs (Chapter 8), are linked explicitly to issues of fairness. In this case, both lingual and ethical echoes are apparent concerns that require to be addressed. One of the key illustrations of an advance in design, according to the theoretical framework being employed in this chapter, is the anticipation of the future: in this case the identification of unforeseen effects in administering the assessment, and proposing rectifications to test specifications to eliminate them, once exposed. Rambiritch (2012) has dealt with several further kinds of lingual disclosures of the technical in her examination of how informativity and transparency can be built into the design and administration of a test by making enough information available to test takers beforehand, especially a description of the components of the ability being tested, as well as a sample test (see the critique of West and Thiruchelvam, in this volume). Transparency is here being interpreted as the prior knowledge of language test takers about the measurement, and even their familiarity with the format. The positive washback sought and described in Chapter 6 (Räsänen and Kivik, in this volume) in the use of portfolio assessment serves as another example of learners gaining both familiarity and control over their own learning. The articulation of the interests of test takers, and how the voices of test takers are being heard is explicitly discussed in this volume by West and Thiruchelvam (Chapter 4) and Räsänen and Kivik (Chapter 6). The latter, in particular, note how learner engagement with their own learning may contribute positively to the productive washback they have recorded. Once again, lingual disclosures of test design and administration processes are evident. As regards the output side, when we consider the meaningfulness of language test results, we may safely accept that the current conception of validity as being dependent on the interpretation of these results and particularly the requirement that it needs to be demonstrated in an argument-based validation procedure (Kane, 1992, 2010, 2011, 2012, 2016; Kane et  al., 2017) have contributed to a heightened awareness of this broader definition


of validity. All these observations are dependent on clear analogies of the lingual aspect within the technical design of language tests and constitute a first step in advancing the design.

Social requirements: appropriateness and accessibility

Next in line are the echoes of the social dimension, of human interaction with technical measure. Taylor (2013, p. 406) notes that the public expression of what goes into a test, discussed in the previous subsection, prepares the ground for further disclosure of the meaning of language test design: such information achieves greater accessibility and enhanced public communication about what a test measures. In respect of appropriateness, close attention is paid to Target Language Use (TLU), a term used by Bachman and Palmer (1996) to indicate that tests of language ability need to be relevant by measuring language performance in the specific kind of language or social discourse domain in which that ability needs to be demonstrated. These kinds of requirements indicate the significance of both the social embeddedness of language use, and the technical (designed) appropriateness of the measurement.

Several chapters in this volume, and in particular those in Part II, have emphasized the way that tests either offer or deny access to institutional resources. In critical language testing, language tests are examined to see whether they place obstacles in the way of those affected to obtain access to citizenship, or to an academic institution (see West and Thiruchelvam, in this volume). In short, tests sometimes have ‘gatekeeping’ as a function. The way that we positively and effectively address the issue of the technical accessibility of language interventions is a feather in the cap of their designers, and highlighting where this is problematic signals that challenges remain.

Technical utility and efficiency: the economic sides of language tests

Within the systematic framework being employed here, we conceptualize technical efficiency as that which obtains when our designs are disclosed with reference to the economic modality. Ever since Bachman and Palmer (1996, pp. 17, 18; Bachman, 2001, p. 110) placed their emphasis on ‘usefulness’ as a prime characteristic of a language test, technical utility has been an important requirement. The further interest in a wider range of assessment formats, evident in this book in the investigation of portfolio assessment as an alternative (Suzuki, in this volume), constitutes an additional disclosure of useful test design. There is no doubt, once again, that


alternative assessment formats, as well as the highly imaginative employment of older forms (e.g., measuring ability to handle advanced grammar and text relations through an adaptation of cloze procedure and in multiple choice format), have taken us further in language test design. What is more, the development of language assessments that are supported by artificial intelligence or machine learning might well bring about further technical efficiencies.

Alignment as a technical requirement

The alignment or technical harmony of the construct of a test with what it manages to measure is so important that it has been prominent in test development for quite some time. This link between the technical and the aesthetic dimension of our experience is almost taken for granted in current test design. Yet at the same time it is undermined by compromises that sometimes are made between functional or skills-neutral definitions of the language ability being tested and older, traditional skills-based definitions (Weideman, 2021b). Compromises and accommodations in design are inevitable, but when they throw the technical alignment of construct and test out of kilter, re-alignment must be sought (Van Dyk & Weideman, 2004a, 2004b; Patterson  & Weideman, 2013a, 2013b). Thus, though there are advances to be observed, specifically where imaginative designs manage to overcome problematic traditional perspectives, there is also further room for improvement in this respect. Moreover, there is increased attention in the broader field of language teaching to achieving an alignment among the three prime applied linguistic interventions: language policies, language tests and language courses. Ideally, the language policy of an institution must provide for, regulate, and facilitate the use of language within the organization. Taking a university in a multilingual environment (Weideman et al., 2021) as an example, we note that this institution must achieve technical alignment of language interventions in such a way that the policy specifies the level of language ability required to handle academic discourse, that the ability must be measured, and that appropriate support in the form of language courses to develop that competence must be provided (Weideman, 2019). When these three interventions are misaligned, and there is a measure of technical disharmony among them, it will most certainly affect their technical utility and efficiency, referred to in the previous subsection. Where such an alignment is achieved, on the other hand, the design principle of technical harmony is concretely given shape. In this volume, Räsänen and Kivik (Chapter 6) provide an illustration of how assessment and language development can be harmonized,


while West and Thiruchelvam (in Chapter 4) narrate how the undesirable opposite may occur.

The standard of accountability

Tellingly, when one looks at the development of the notion of language assessment literacy (LAL) (see Inbar-Lourie, 2017; Taylor, 2009, 2013), one observes a progression from a concern with the language assessment competence of teachers to that of officials who have to interpret the results of language tests (e.g., Deygers & Malone, 2019). Along with the ideas of technical transparency, informativity, communication, and accessibility, the notion of ‘accountability’ is much more prominent in later reviews of LAL (Taylor, 2013; Inbar-Lourie, 2017) than the founding concepts like reliability, validity, and construct that are called “constitutive concepts” in Chapter 2. There is no doubt that we are now more interested in those regulative technical ideas that we find in the cultural, social, political, and ethical dimensions of language testing (Weideman, 2017b; Weideman, 2009; McNamara & Roever 2006). Such interest signals an advance in assessment design. By extending the notion of LAL to officials who deal with the interpretation of test results and who are tasked with the implementation of decisions flowing from these results, test designers have taken a step in the right direction.

That is not to say that we have done enough to offer information and provide knowledge about what we do to the public at large, by employing a wider range of media than scholarly publications. Technical accountability (Davies, 2008) entails that we need to speak not only to our peers, however important that is, but that we should seek every opportunity of explaining our technical intentions and the benefits of our trade to the public at large. There are huge challenges in doing this adequately, but we should continue to seek such a juridical disclosure of the meaning of language test design.

Becoming and being held accountable for test designs is demonstrated in this book in the various contestations related by West and Thiruchelvam (Chapter 4), with a resistance and even subversion of assessment policy being noted not only in the case of those at the receiving end of this assessment, but also by those tasked with administering it. In Richardson’s analyses in Chapter 3 we find similar examples of the traumatizing experiences of test takers, instead of test designers striving and planning to enable them to become ‘empowered collaborators’. The themes of power relations and trauma emerge in these analyses, but so does the positive side: the empowerment that may result from designing portfolio assessment properly (Suzuki, Chapter 5 in this volume).


Integrity in design, beneficence in impact

The way that our language test designs anticipate the ethical dimension of experience is usually associated with raising issues of justice and fairness (Deygers, 2017, 2018, 2019; Davies, 2010; Kunnan, 2000, 2004; Rawls, 1990). Schuurman (2022, p. 81) views this as the injunction to designers to give shape to “the norms of care and respect for . . . everyone involved . . . and the norms of service, trust, and faith”. The latter two I shall return to, but we may note now that while technical justice refers in the first place to the readiness of the language test designers to correct and rectify their designs (e.g., through careful piloting and diligent analysis) to do justice to the ability being measured, technical care and compassion may be conceived of as the technical beneficence of their tests. West and Thiruchelvam (Chapter 4, in this volume) have focused exactly on this connection between the technical and ethical modes of our experience. The correction that is possible in design is aptly considered by the examination in Chapters 7 and 8 (Kley, in this volume; Kley, Kunitz, and Yeh, in this volume) of changes in task design. The latter are clearly internal juridical analogies in the technical sphere.

As to the ethical sides of design, critical language testers are not alone in alerting us to the detrimental effects that language tests can have, and have had, on test takers. But a more nuanced conclusion can be drawn from interviews about the use of language tests for immigration (Schildt et al., 2023), which indicate that those who use the tests sometimes may have the interests of those who take them at heart. Simply phrased: there is now some empirical evidence that the norms of care and respect are being upheld at least in some cases involving the use of high-stakes tests. The global attention to adopting codes of ethics for language testing is beyond doubt. The ILTA Code of Ethics (ILTA, 2018) and Guidelines for Practice (ILTA, 2020) have been revised several times, and in fact are again in the process of being scrutinized for possible improvement (Deygers & Malone, 2023). They have been translated into close to three dozen languages. Once again there are challenges, but there is evidence of a professional interest in ensuring that language test takers are being and will be treated fairly.

While we could think of the social impact of tests as an external ethical issue, as is evident in the analyses of Chapter 4 (West and Thiruchelvam, in this volume), the internal technical integrity of a language test is equally important. There are innumerable examples where measures like corrections for differential item functioning (DIF), once a segment of the test population that may be unfairly discriminated against has been identified, have not only been calculated, but also acted upon. Analytical methods are available in abundance (McNamara et al., 2019, providing a good set of examples). Second chance tests are often an integral part of the design of new tests of academic literacy. A particularly imaginative design solution for this can be found in Keyser’s (2017) proposal for a three-tier test. These are important internal mechanisms for mitigating the potential negative external effects of language tests.

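The chapter refers to DIF correction without prescribing a particular procedure. The sketch below illustrates one commonly used screen, the Mantel-Haenszel statistic, with invented counts; it is an example of the kind of internal check referred to above, not a procedure endorsed or used in this chapter.

# Sketch of a Mantel-Haenszel DIF screen with hypothetical data.
# For each total-score stratum, a 2x2 table of (reference/focal group) x (correct/incorrect).
from math import log

# strata: (ref_correct, ref_incorrect, focal_correct, focal_incorrect) - invented counts
strata = [
    (40, 10, 30, 20),   # low scorers
    (60, 5, 45, 15),    # mid scorers
    (80, 2, 70, 8),     # high scorers
]

numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = numerator / denominator      # common odds ratio; 1.0 indicates no DIF
delta_mh = -2.35 * log(alpha_mh)        # ETS delta scale; |delta| of about 1.5 or more
                                        # is commonly treated as large DIF

print(f"MH odds ratio: {alpha_mh:.2f}, MH delta: {delta_mh:.2f}")
# With these invented counts the item strongly favors the reference group,
# i.e. the kind of result that would prompt item revision or removal.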

Both external and internal considerations of technical integrity and beneficence provide illustrations of the advances that have been made in giving concrete shape to ethical concerns in language testing.

Trustworthiness, reputability, and certainty

Having dealt with the norm of service to others in the previous section as an ethical trace within the realm of technical design, I return, finally, to what Schuurman (2022) refers to as the norms of trust and faith. The diligence with which language tests are designed has an end goal in mind: to have the test available as a trustworthy and credible measurement. This is much more than reliability and consistency: it is a norm that speaks to the technical reputability of a test, necessarily gained over time. It is exactly the lack of technical reputation that prevents novel attempts at measuring language ability from being adopted quickly. They still need to prove their mettle.

The internal connection of the technical with the dimension of certitude of course relates also to the faith we place in language tests. Such faith is mistaken when it considers tests to be ‘scientifically’ designed measurements and therefore infallible. Neither the most theoretically adequate justifications for tests, nor the careful statistical and other analyses of their empirical properties, can ensure that they will measure infallibly. As accounts of interactions of language testing experts with non-language testers on LAL have shown, it does not befit the former to assume that the latter are deficient in their knowledge of language testing. Baker (2016, p. 82) is rightly critical of such assumptions in our dealings with those professionals in other fields involved in some way with language testing, for example as the interpreters or users of test results. Hubris is nowhere a desirable professional trait, and humility is much more appropriate, given the technically limited insight we have as language test designers. When we believe that we hold the ultimate wisdom as a result of some scientifically gained knowledge, we shall be unable to learn from others. What certainty we have we gain not only from our own experience, but in interaction with others who are professionals in their own right.

Conclusion

In Chapter 1 (Salaberry and Weideman, in this volume), three motivations for putting together this book were given: (a) that a re-examination of construct is necessary; (b) that a broadening of the concept of validity demands that we reconsider our approach to language assessment; and (c) that ethical considerations in devising tests of language ability are today unavoidable. As regards (a), this volume has discussed new ways of broadening the definition of language ability, and in particular the implementation of an Interactional Competence perspective (as, e.g., in Chapters 7 and 8 by Kley, and by Kley, Kunitz, and Yeh, in this volume), while also describing the drawbacks of decidedly problematic constructs (West and Thiruchelvam, in this volume). The reconsideration of the concept of validity referred to in (b), prompted by the broadening of the construct of language competence, is particularly prominent not only in this chapter, but specifically in Chapter 2 (by Weideman & Deygers). As to the third motivation, (c) above, what has been considered in this chapter and in this book are the various sets of evidence that can be brought to bear on assessing language ability in an additional language, with a particular focus on how to align those assessments with the interests of those affected by them. In that sense, the book has dealt with what is conventionally termed issues of validity and validation. I hope to have shown that our intention to design assessments responsibly is much wider than that. The advances that we have noted and discussed in this book demonstrate an unfolding and disclosure of the meaning of design, providing yardsticks for the future. The gains and advances we have made in language assessment are not always settled or even near completion. Even the principles of responsible design that have been identified in this book as leading the realization of this goal are preliminary and contestable. There are many challenges still for responsible language test design. But I also hope to have shown, as have the contributions to this book, that in some very substantial senses we have advanced professionally, and that our intentions are to do better still.

References

Bachman, L. F. (2001). Designing and developing useful language tests. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, T. McNamara & K. O’Loughlin (Eds.), Experimenting with uncertainty: Essays in honour of Alan Davies (pp. 109–116). Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
Baker, B. (2016). Language assessment literacy as professional competence: The case of Canadian admissions decision makers. The Canadian Journal of Applied Linguistics, 19(1), 63–83.
Bell, J. S. (2002). Narrative enquiry: More than just telling stories. TESOL Quarterly, 36(3), 207–213. https://doi.org/10.2307/3588331
Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. https://doi.org/10.1037/0033-295X.111.4.1061


Davidson, F., & Lynch, B. K. (2002). Testcraft: A teacher’s guide to writing and using language test specifications. Yale University Press.
Davies, A. (2008). Accountability and standards. In B. Spolsky & F. M. Hult (Eds.), The handbook of educational linguistics (pp. 483–494). Blackwell. https://doi.org/10.1002/9780470694138
Davies, A. (2010). Test fairness: A response. Language Testing, 27(2), 171–176. https://doi.org/10.1177/0265532209349466
Davies, A. (2011). Kane, validity and soundness. Language Testing, 29(1), 37–42.
Davies, A., & Elder, C. (2005). Validity and validation in language testing. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 795–813). Lawrence Erlbaum Associates.
Deygers, B. (2017). Just testing: Applying theories of justice to high-stakes language tests. International Journal of Applied Linguistics, 168(2), 143–163. https://doi.org/10.1075/itl.00001.dey
Deygers, B. (2018). [Review of the book Evaluating language assessments, by Antony John Kunnan]. Language Testing, 36(1), 154–157. https://doi.org/10.1177/0265532218778211
Deygers, B. (2019). Fairness and justice in English language assessment. In X. Gao (Ed.), Second handbook of English language teaching. Springer. https://doi.org/10.1007/978-3-319-58542-0_30-1
Deygers, B., & Malone, M. E. (2019). Language assessment literacy in university admission policies, or the dialogue that isn’t. Language Testing, 36(3), 347–368. https://doi.org/10.1177/0265532219826390
Deygers, B., & Malone, M. E. (2023). Standing on the shoulders of giants: Revising the ILTA Code of Ethics. Paper read at the Language Testing Research Colloquium (LTRC 2023), annual conference of the International Language Testing Association (ILTA), New York, June.
Ellis, C., Adams, T. E., & Bochner, A. P. (2011). Autoethnography: An overview. Forum: Qualitative Social Research, 12(1), 1–12. https://doi.org/10.17169/fqs-12.1.1589
Feyerabend, P. (1978). Against method: Outline of an anarchistic theory of knowledge. Verso.
Fulcher, G. (2015). Re-examining language testing: A philosophical and social inquiry. Routledge.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. Routledge.
ILTA (International Language Testing Association). (2018). Code of ethics. www.iltaonline.com/page/CodeofEthics. Accessed 26 April 2023.
ILTA (International Language Testing Association). (2020). Guidelines for practice. www.iltaonline.com/page/ILTAGuidelinesforPractice. Accessed 26 April 2023.
Inbar-Lourie, O. (2017). Language assessment literacy. In Language testing and assessment (pp. 257–270). Springer International Publishing. https://doi.org/10.1007/978-3-319-02261-1_19
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527
Kane, M. T. (2010). Validity and fairness. Language Testing, 27(2), 177–182. https://doi.org/10.1177/0265532209349467


Kane, M. T. (2011). Validating score interpretations and uses: Messick Lecture, Language Testing Research Colloquium, Cambridge, April 2010. Language Testing, 29(1), 3–17. https://doi.org/10.1177/0265532211417210
Kane, M. T. (2012). Articulating a validity argument. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 34–48). Routledge. https://doi.org/10.4324/9781003220756
Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198–211. https://doi.org/10.1080/0969594X.2015.1060192
Kane, M. T., Kane, J., & Clauser, B. E. (2017). A validation framework for credentialing tests. In C. W. Buckendahl & S. Davis-Becker (Eds.), Testing in the professions: Credentialing policies and practice (pp. 20–41). Routledge. https://doi.org/10.4324/9781315751672-2
Keyser, G. (2017). Die teoretiese begronding vir die ontwerp van ’n nagraadse toets van akademiese geletterdheid in Afrikaans [The theoretical grounding for the design of a postgraduate test of academic literacy in Afrikaans]. MA dissertation, University of the Free State, Bloemfontein. http://hdl.handle.net/11660/7704
Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 1–14). University of Cambridge Local Examinations Syndicate. https://doi.org/10.1002/9781118411360.wbcla144
Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context (Studies in language testing; 18, pp. 27–45). Cambridge University Press. https://doi.org/10.1177/02655322211057040
Macbeth, D. (2001). On ‘reflexivity’ in qualitative research: Two readings and a third. Qualitative Inquiry, 7(1), 35–68. https://doi.org/10.1177/107780040100700103
McNamara, T., Knoch, U., & Fang, J. (2019). Fairness, justice and language assessment: The role of measurement. Oxford University Press.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Blackwell.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35(11), 1012–1027. https://doi.org/10.1037/0003-066X.35.11.1012
Messick, S. (1981). Evidence and ethics in the evaluation of tests. Educational Researcher, 10(9), 9–20. https://doi.org/10.3102/0013189X010009009
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & I. H. Braun (Eds.), Test validity (pp. 33–45). Lawrence Erlbaum Associates. https://doi.org/10.1002/j.2330-8516.1986.tb00185.x
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (Third edition, pp. 33–45). American Council on Education/Collier Macmillan.
Network of Expertise in Language Assessment [NExLA]. (2023). Bibliography. https://nexla.org.za/research-on-language-assessment/. Accessed 27 April 2023.
Paltridge, B. (2014). What motivates Applied Linguistics research? AILA Review, 27, 98–104. https://doi.org/10.1075/aila.27.05pal
Patterson, R., & Weideman, A. (2013a). The typicality of academic discourse and its relevance for constructs of academic literacy. Journal for Language Teaching, 47(1), 107–123. https://doi.org/10.4314/jlt.v47i1.5


Patterson, R., & Weideman, A. (2013b). The refinement of a construct for tests of academic literacy. Journal for Language Teaching, 47(1), 125–151. https://doi.org/10.4314/jlt.v47i1.6
Rambiritch, A. (2012). Accessibility, transparency and accountability as regulative conditions for a post-graduate test of academic literacy. PhD thesis, University of the Free State, Bloemfontein. http://hdl.handle.net/11660/1571
Rawls, J. (1990). A theory of justice (Revised ed.). The Belknap Press of Harvard University Press.
Read, J. (Ed.) (2016). Post-admission language assessment of university students. Springer.
Schildt, L., Deygers, B., & Weideman, A. (2023). Language testers and their place in the policy web. Language Testing (Online first: article first published online on 17 August). https://doi.org/10.1177/02655322231191133
Schuurman, E. (2009). Technology and the future: A philosophical challenge (Trans. H. D. Morton). Paideia Press [Originally published in 1972 as: Techniek en toekomst: Confrontatie met wijsgerige beschouwingen, Van Gorcum].
Schuurman, E. (2022). Transformation of the technological society. Dordt Press.
Strauss, D. F. M. (2009). Philosophy: Discipline of the disciplines. Paideia Press.
Taylor, L. (2009). Developing assessment literacy. Annual Review of Applied Linguistics, 29, 21–36. https://doi.org/10.1017/S0267190509090035
Taylor, L. (2013). Communicating the theory, practice and principles of language testing to test stakeholders: Some reflections. Language Testing, 30(3), 403–412. https://doi.org/10.1177/0265532213480338
Van Dyk, T., & Weideman, A. (2004a). Switching constructs: On the selection of an appropriate blueprint for academic literacy assessment. Journal for Language Teaching, 38(1), 1–13. https://doi.org/10.4314/jlt.v38i1.6024
Van Dyk, T., & Weideman, A. (2004b). Finding the right measure: From blueprint to specification to item type. Journal for Language Teaching, 38(1), 15–24. https://doi.org/10.4314/jlt.v38i1.6025
Van Riessen, H. (1949). Filosofie en techniek [Philosophy and technology]. J. H. Kok.
Weideman, A. (2009). Constitutive and regulative conditions for the assessment of academic literacy. Southern African Linguistics and Applied Language Studies, Special Issue: Assessing and Developing Academic Literacy, 27(3), 235–251.
Weideman, A. (2015). Autoethnography and the presentation of belief in scholarly work. Journal for Christian Scholarship, 51(3), 125–141. https://pubs.ufs.ac.za/index.php/tcw/article/view/382. Accessed 24 April 2023.
Weideman, A. (2017a). Responsible design in applied linguistics: Theory and practice. Springer International Publishing. https://doi.org/10.1007/978-3-319-41731-8
Weideman, A. (2017b). The refinement of the idea of consequential validity within an alternative framework for responsible test design. In J. Allan & A. J. Artiles (Eds.), Assessment inequalities: Routledge world yearbook of education (pp. 218–236). Routledge. https://doi.org/10.4324/9781315517377
Weideman, A. (2019). Definition and design: Aligning language interventions in education. Stellenbosch Papers in Linguistics Plus, 56, 33–48. https://doi.org/10.5842/56-0-782
Weideman, A. (2021a). Context, construct, and validation: A perspective from South Africa. Language Assessment Quarterly. https://doi.org/10.1080/15434303.2020.1860991


Weideman, A. (2021b). A skills-neutral approach to academic literacy assessment. In A. Weideman, J. Read & T. du Plessis (Eds.), Assessing academic literacy in a multilingual society: Transition and transformation (pp. 22–51, New Perspectives on Language and Education: 84). Multilingual Matters. https://doi.org/10.21832/WEIDEM6201
Weideman, A., Read, J., & Du Plessis, T. (Eds.) (2021). Assessing academic literacy in a multilingual society: Transition and transformation. Multilingual Matters. https://doi.org/10.21832/9781788926218

INDEX

Note: Page numbers in italic indicate a figure and page numbers in bold indicate a table on the corresponding page.

accessibility 15, 40, 225, 227
accountability 11–15, 33–34, 43–44, 60–63, 227
accountable 11, 13, 33, 227
advances 20, 58–59, 70, 139, 224, 226–227, 229–230
affordances 20, 186–188; data analysis 196–211, 197, 208–209, 211; discussion and conclusion 211–213, 212; literature review 188–193; participants and setting 194; procedure and methods 194–196, 195; quantitative analysis 208–211, 208–209, 211; repair practices identified in the student interactions 197–207, 197
alignment 226–227
application 223–229
applied linguistics 32–40, 34–35, 37–38, 39
appropriateness 225
Arizona English Language Learner Assessment (AZELLA) 65–67
assessment 186–188; data analysis 196–211, 197, 208–209, 211; discussion and conclusion 211–213, 212; dynamic assessment (DA) 139–140; ethical design 10–14; Learning Oriented Assessment (LOA) 139–140; literature review 188–193; participants and setting 194; procedure and methods 194–196, 195; quantitative analysis 208–211, 208–209, 211; repair practices identified in the student interactions 197–207, 197; topic initiations and shifts 168–169; see also classroom-based paired speaking assessment; language assessment; L2 portfolio assessment
beneficence 228–229
certainty 229
classroom-based paired speaking assessment 162–163; data and methods 167–169, 167; discussion and conclusion 178–181; goal of the study 166; impact of different prompts and inscribed objects on peer–peer test interaction 163–165; orienting to inscribed objects in social interaction 165–166; results 169–178; task effects in paired and group oral assessments 163
coding 112
collection of portfolio materials 117
competencies 162–163; data and methods 167–169, 167; discussion and conclusion 178–181; goal of the study 166; impact of different prompts and inscribed objects on peer–peer test interaction 163–165; orienting to inscribed objects in social interaction 165–166; results 169–178; task effects in paired and group oral assessments 163
computer-based English test (CBET) 81–82, 87–91
conceptual uncertainty 28–32
confidence 96–98
consequences, intended 68
construct 3–4; ethical assessment design 10–14; re-conceptualization of the construct of language 4–8; validity and validation 8–10
contested measures 3–4
contested validity 80–82; computer-based English test 81–82; critical concept of validity 82–83; data analysis 85–87; discussion 98–101; findings 87–98; methodology 84–85; test-taker perspectives 83–84; see also validity
contested views 28–32
context 3–4; ethical assessment design 10–14; re-conceptualization of the construct of language 4–8; validity and validation 8–10
critical language testing (CLT) 53–54
critical race theory (CRT) 57–61
data 140–143, 166–169, 167, 194–196, 195; analysis 85–87, 112, 168, 196–211, 197, 208–209, 211; coda 112; collection 110–111
design principles 15, 34–36, 42–44, 226; ethical assessment design 10–14; integrity 228–229; responsible design 44, 46, 230; test designs 223–229
disclosure (unfolding) 3, 20, 32, 44, 220, 224–225, 227, 230
discourse, variability in 188–189
discourse analysis 142–143
dynamic assessment (DA) 139–140; see also assessment
economic sides of language tests 225–226
efficiency 225–226
Elementary and Secondary Education Act (ESEA) 61–63
“English for the Children” 64
English language development (ELD) 64
English learners: findings 87–98; language test cases in the US 61–67; language tests’ intended consequences 68; perceptions in L2 portfolio studies 116–123; test-taker perspectives 83–84; testing validity issues 56–57; see also students
Equal Educational Opportunities Act 64
ethicality 3–4
ethics 3–4, 106; discussion 123–126; ethical assessment design 10–14; findings 114–123, 114; method 110–114, 113; portfolio assessment in L2 studies 107–110; re-conceptualization of the construct of language 4–8; validity and validation 8–10
evaluation 40–43, 121–123
Every Student Succeeds Act (ESSA) 61–63
exclusion criteria 111–112
failure 62–64, 67, 84, 89–93, 96, 102
fairness 10–15, 31, 40–44, 65–67, 98
feature, validity as 8–10
feedback 117–119
Flores v. Arizona 64
gains 20, 32, 44, 58, 61, 230
German 162–163; data and methods 167–169, 167; discussion and conclusion 178–181; goal of the study 166; impact of different prompts and inscribed objects on peer–peer test interaction 163–165; orienting to inscribed objects in social interaction 165–166; results 169–178; task effects in paired and group oral assessments 163
goal of the study 166
governmentality see nation-state governmentality
group oral assessments 163
harm 93–98
high-stakes test 80–82; computer-based English test 81–82; critical concept of validity 82–83; data analysis 85–87; discussion 98–101; findings 87–98; methodology 84–85; test-taker perspectives 83–84
ideologies, raciolinguistic 54–56
impact, beneficence in 228–229
inclusion criteria 111–112
independent use portfolio 137–138; language learning in the wild as the resource of 138–139
inscribed objects 162–163; data and methods 167–169, 167; discussion and conclusion 178–181; goal of the study 166; impact of different prompts and inscribed objects on peer–peer test interaction 163–165; orienting to inscribed objects in social interaction 165–166; results 169–178; task effects in paired and group oral assessments 163
insights 222–223
instructors 93–98
integrity 228–229
interaction see interactional competence; L1–L2 speaker interaction
interactional competence 162–163; data and methods 167–169, 167; discussion and conclusion 178–181; goal of the study 166; impact of different prompts and inscribed objects on peer–peer test interaction 163–165; orienting to inscribed objects in social interaction 165–166; results 169–178; task effects in paired and group oral assessments 163
interest convergence 58–59
interpretivist 15, 45
interviews 85
justice 10–15, 40–43, 98, 228
justifications for mitigating harm 93–98
language, re-conceptualization of 4–8
language assessment 53–57, 229–230; application 223–229; critical race theory 57–61; English learner language test cases in the US 61–67; English learner language tests’ intended consequences 68; moving forward 68–70; professional memories, pivotal insights, difficult questions 222–223; stance, positionality, and reflection 220–222
language assessment literacy (LAL) 227
language interventions 32, 222, 225
language learning in the wild 135–140, 156–157; analysis and results 143–154; data and methods 140–143, 141; discussion 154–156
language tests: cases in the US 61–67; economic sides of 225–226; intended consequences 68; peer–peer test interaction 163–165; responsible design 14; validity issues 56–57; see also critical language testing (CLT); high-stakes test
learners see English learners
Learning Oriented Assessment (LOA) 139–140; see also assessment
legislative policies: national 61–64; state 64–67
liberalism, critique of 59–61
lingual yardstick 223–225
linguistics see applied linguistics
literature review 188–193
L1–L2 speaker interaction 186–188; data analysis 196–211, 197, 208–209, 211; discussion and conclusion 211–213, 212; literature review 188–193; participants and setting 194; procedure and methods 194–196, 195; quantitative analysis 208–211, 208–209, 211; repair practices identified in the student interactions 197–207, 197
L2 portfolio assessment 106, 135–140, 156–157; analysis and results 143–154; data and methods 140–143, 141; discussion 123–126, 154–156; findings 114–123, 114; method 110–114, 113; portfolio assessment in L2 studies 107–110
meaningfulness 12, 15, 31, 38, 44, 223–225
measurement 229–230; application 223–229; contested measures 3–4; professional memories, pivotal insights, difficult questions 222–223; stance, positionality, and reflection 220–222
memories, professional 222–223
method/methodology 84–85, 110–114, 113, 140–143, 166–169, 167, 194–196
mitigation 93–98
moral imperative 93–95
moral obligation 96–98
multilingualism 3–7, 63, 68–69
narratives 80–82; computer-based English test 81–82; critical concept of validity 82–83; data analysis 85–87; discussion 98–101; findings 87–98; methodology 84–85; test-taker perspectives 83–84
nation-state governmentality 54–56
nexus-analytical approach 142–143
No Child Left Behind (NCLB) 61–63
normative requirements 40
objects see inscribed objects
objective technical validity 43–44; see also validity
opaque policy 91–93
oral assessments 163

orienting 165–166, 172, 212
orthodoxy 8–9, 28–31, 43, 222–223
other-directed word searches 202–207
other-initiated repair 190–191, 197–202
paired speaking assessment 162–163; data and methods 167–169, 167; discussion and conclusion 178–181; goal of the study 166; impact of different prompts and inscribed objects on peer–peer test interaction 163–165; orienting to inscribed objects in social interaction 165–166; results 169–178; task effects in paired and group oral assessments 163
participants 84–85, 166, 194
pedagogy 124–125
peer–peer test interaction 163–165
perspectives/perceptions of learners: L2 portfolio studies 116–123; test-taking 83–84
policies, opaque 91–93; see also legislative policies
portfolio: doing learning through reflection 151–154; process 140–142, 141; see also L2 portfolio assessment
positionality 220–222
Primary or Home Language Other Than English (PHLOTE) 65
procedure 194–196
process: portfolio 140–142, 141; validation as 8–10
professional memories 222–223
progress 58–59, 63, 108, 116, 120, 193
prompts 163–165
quantitative analysis 208–211, 208–209, 211
questions 222–223
racialization 53–57; critical race theory 57–61; English learner language test cases in the US 61–67; English learner language tests’ intended consequences 68; moving forward 68–70
raciolinguistic ideologies 54–56
re-conceptualization of language 4–8


redefinition 3–4
reflection 120–121, 220–222; portfolio as means of doing learning through 151–154
regressions 29, 222
relationships in the target language 143–149
repair practices 186–188; data analysis 196–211, 197, 208–209, 211; discussion and conclusion 211–213, 212; identified in the student interactions 197–207, 197; literature review 188–193; participants and setting 194; procedure and methods 194–196, 195; quantitative analysis 208–211, 208–209, 211
reputability 15, 223, 229
responsible design 44, 46, 230
selection of portfolio materials 117
setting 194–196, 195
Siheom University 81–82, 84–85, 98, 100
social impact 3, 8, 31, 228
social interaction 165–166
source items 110–111
social requirements 225
speaker interaction see L1–L2 speaker interaction
speaking assessment and competence 162–163; data and methods 167–169, 167; discussion and conclusion 178–181; goal of the study 166; impact of different prompts and inscribed objects on peer–peer test interaction 163–165; orienting to inscribed objects in social interaction 165–166; results 169–178; task effects in paired and group oral assessments 163
stance 220–222
structured English immersion (SEI) 64–67
students: interacting in the wild 143; rebuilding confidence 96–98; repair practices in student interactions 197–207, 197; as victims of the test 87–93
subjective validation 43–44; see also validation

target language: establishing new connections through 149–151; pursuing existing relationships in 143–149
task effects 163
teaching 168–169
technical utility and efficiency 225–226
technical validity, objective 43–44; see also validity
test see critical language testing (CLT); high-stakes test; language tests
test designs 223–229
test discourse 188–189
test procedure 166–168
test task 166–168
theoretical framework 113, 115–116, 125, 223
throwbacks 20
topic 168; teaching and assessing topic initiations and shifts 168–169
topic card 162–163; data and methods 167–169, 167; discussion and conclusion 178–181; goal of the study 166; impact of different prompts and inscribed objects on peer–peer test interaction 163–165; orienting to inscribed objects in social interaction 165–166; results 169–178; task effects in paired and group oral assessments 163
transparency 15, 40, 45, 112–113, 227
trustworthiness 229
uncertainty 28–32
United States: English learner language test cases 61–67
usefulness 9, 29–31, 33, 120, 225
utility 15, 34, 38, 44, 225–226
validation: contested views and conceptual uncertainty 28–32; justice and fairness 40–43; principles of design 35–40, 35, 37–38, 39; as process 8–10; subjective 43–44; theory of applied linguistics 32–35, 34
validity 53–54, 80–82; computer-based English test 81–82; contested views and conceptual uncertainty 28–32; critical concept of 82–83; data analysis 85–87; discussion 98–101; English learner testing validity issues 56–57; as feature 8–10; findings 87–98; justice and fairness 40–43; methodology 84–85; objective technical 43–44; of portfolio assessments 108–110; principles of design 35–40, 35, 37–38, 39; test-taker perspectives 83–84; theory of applied linguistics 32–35, 34
validity theory 10, 28–30, 40–42, 221–222

variability in test discourse 188–189
victims: of an opaque policy 91–93; of the test 87–93
views, contested 28–32
word searches 192–193; other-directed 202–207
yardsticks for the future 229–230; application 223–229; professional memories, pivotal insights, difficult questions 222–223; stance, positionality, and reflection 220–222