
ASSESSING ACADEMIC ENGLISH FOR HIGHER EDUCATION ADMISSIONS

Assessing Academic English for Higher Education Admissions is a state-of-the-art overview of advances in theories and practices relevant to the assessment of academic English skills for higher education admissions purposes. The volume includes a brief introduction followed by four main chapters focusing on critical developments in theories and practices for assessing reading, listening, writing, and speaking, of which the latter two also address the assessment of integrated skills such as reading-writing, listening-speaking, and reading-listening-speaking. Each chapter reviews new task types, scoring approaches, and scoring technologies and their implications in light of the increasing use of technology in academic communication and the growing use of English as a lingua franca worldwide. The volume concludes with recommendations about critical areas of research and development that will help move the field forward. Assessing Academic English for Higher Education Admissions is an ideal resource for researchers and graduate students in language testing and assessment worldwide.

Xiaoming Xi is Chief of Product, Assessment and Learning at VIPKID International, USA.

John M. Norris is Senior Research Director of the Center for Language Education and Assessment Research at Educational Testing Service, USA.


INNOVATIONS IN LANGUAGE LEARNING AND ASSESSMENT AT ETS Series Editors: John M. Norris, Sara Cushing, Steven John Ross, and Xiaoming Xi

The goal of the Innovations in Language Learning and Assessment at ETS series is to publish books that document the development and validation of language assessments and that explore broader innovations related to language teaching and learning. Compiled by leading researchers, then reviewed by the series editorial board, volumes in the series provide cutting-edge research and development related to language learning and assessment in a format that is easily accessible to language teachers and applied linguists as well as testing professionals and measurement specialists.

Volume 1: Second Language Educational Experiences for Adult Learners
John M. Norris, John McE. Davis and Veronika Timpe-Laughlin

Volume 2: English Language Proficiency Assessments for Young Learners
Edited by Mikyung Wolf and Yuko Butler

Volume 3: Automated Speaking Assessment: Using Language Technologies to Score Spontaneous Speech
Edited by Klaus Zechner and Keelan Evanini

Volume 4: Assessing English Language Proficiency in U.S. K-12 Schools
Edited by Mikyung Wolf

Volume 5: Assessing Change in English Second Language Writing Performance
Khaled Barkaoui and Ali Hadidi

Volume 6: Assessing Academic English for Higher Education Admissions
Edited by Xiaoming Xi and John M. Norris


ASSESSING ACADEMIC ENGLISH FOR HIGHER EDUCATION ADMISSIONS

Edited by Xiaoming Xi and John M. Norris


First published 2021
by Routledge
605 Third Avenue, New York, NY 10158
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2021 Taylor & Francis

The right of Xiaoming Xi and John M. Norris to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
A catalog record for this title has been requested

ISBN: 978-0-8153-5063-7 (hbk)
ISBN: 978-0-8153-5064-4 (pbk)
ISBN: 978-1-351-14240-3 (ebk)

Typeset in Bembo by Newgen Publishing UK


CONTENTS

List of Figures
List of Tables
List of Contributors
Series Editors' Foreword
Acknowledgements

1 Framing the Assessment of Academic English for Admissions Decisions
John M. Norris, John McE. Davis, and Xiaoming Xi

2 Assessing Academic Reading
Mary Schedl, Tenaha O'Reilly, William Grabe, and Rob Schoonen

3 Assessing Academic Listening
Spiros Papageorgiou, Jonathan Schmidgall, Luke Harding, Susan Nissan, and Robert French

4 Assessing Academic Writing
Alister Cumming, Yeonsuk Cho, Jill Burstein, Philip Everson, and Robert Kantor

5 Assessing Academic Speaking
Xiaoming Xi, John M. Norris, Gary J. Ockey, Glenn Fulcher, and James E. Purpura

6 Looking Ahead to the Next Generation of Academic English Assessments
Carol A. Chapelle

Index

FIGURES

2.1 Outline of a reading comprehension assessment model
3.1 Contextual facets of a listening test task
5.1 Contextual facets of a speaking test task
5.2 Hours worked at job each week
5.3 Communicative effectiveness
5.4 Constructs and realizations of service encounters
6.1 Components of an interpretive framework for academic English proficiency
6.2 Aspects of interpretation useful for constructing an interpretation/use argument for a test of academic English for admissions in higher education

TABLES

3.1 Summary of key trends and recent developments in academic listening
3.2 A generic construct definition for assessing listening comprehension in EAP settings
3.3 Summary of TOEFL iBT Listening section
3.4 Summary of IELTS Listening section
3.5 Summary of PTE Academic Listening section
4.1 Rhetorical functions, purposes, and imagined audiences for writing in the CEFR
5.1 The construct definition of the TOEFL iBT Speaking section

CONTRIBUTORS

Jill Burstein is a Director of the Personalized Learning & Assessment Lab at the Educational Testing Service (ETS) in the U.S.

Carol A. Chapelle is a Professor in the English Department at Iowa State University in the U.S.

Yeonsuk Cho is a Senior Research Scientist in the Assessment, Learning and Technology Research and Development division at ETS in the U.S.

Alister Cumming is a Professor Emeritus in the Department of Curriculum, Teaching and Learning at the University of Toronto in Canada.

John McE. Davis is a Research Scientist in the Assessment, Learning and Technology Research and Development division at ETS in the U.S.

Philip Everson is a Senior Strategic Advisor in the Assessment, Learning and Technology Research and Development division at ETS in the U.S.

Robert French is an Assessment Designer in the Assessment, Learning and Technology Research and Development division at ETS in the U.S.

Glenn Fulcher is a Professor in the Department of English at the University of Leicester in the U.K.

William Grabe is an Emeritus Regents Professor of Applied Linguistics in the English Department at Northern Arizona University in the U.S.

Luke Harding is a Professor in the Department of Linguistics and English Language at Lancaster University in the U.K.

Robert Kantor is a retired Assessment Designer in the Assessment, Learning and Technology Research and Development division at ETS in the U.S.

Susan Nissan is a retired Senior Director in the Assessment, Learning and Technology Research and Development division at ETS in the U.S.

John M. Norris is Senior Research Director of the Center for Language Education and Assessment Research at ETS in the U.S.

Gary J. Ockey is a Professor in the English Department at Iowa State University in the U.S.

Tenaha O'Reilly is a Principal Research Scientist at ETS in the U.S.

Spiros Papageorgiou is a Managing Senior Research Scientist in the Center for Language Education and Assessment Research at ETS in the U.S.

James E. Purpura is a Professor at Teachers College, Columbia University in the U.S.

Mary Schedl is a retired Assessment Designer in the Assessment, Learning and Technology Research and Development division at ETS in the U.S.

Jonathan Schmidgall is a Research Scientist in the Center for Language Education and Assessment Research at ETS in the U.S.

Rob Schoonen is a Professor in the Department of Language and Communication at Radboud University in the Netherlands.

Xiaoming Xi is Chief of Product, Assessment and Learning at VIPKID International in the U.S.

SERIES EDITORS’ FOREWORD

We are delighted to welcome Volume 6 in the series, a collection of papers on assessing academic English for higher education admissions purposes, edited by Xiaoming Xi and John M. Norris. Assessing language proficiency for admissions decisions has long been a central concern in language testing research and practice, as more and more students around the world seek higher education in English-medium institutions. These institutions need fair and accurate measures of English proficiency to ensure that the students they admit have sufficient language proficiency to meet the academic demands of their degree programs. Standardized international language tests have thus come to play a vital role in the admissions process. The high stakes of admissions decisions and the central role of tests have in turn generated a global industry of materials developers, educators, and test preparation providers. In a competitive marketplace, ethical test developers are faced with the challenges of providing fair, comprehensive, and accurate assessments that can provide useful information for stakeholders without burdening test takers with excessive costs or time.

Some 20 years ago, ETS published a landmark series of framework papers that provided critical reviews of theory and practice in assessing reading, listening, writing, and speaking in academic contexts as a foundation for developing what is now known as the TOEFL iBT. The landscape for assessment has changed dramatically in the intervening years, both in terms of the increased use of digital technology, which continues to reshape both education and assessment, and in terms of the increased global demand for English-medium education.

This collection of papers is a worthy successor to the original framework papers, with chapters on assessing reading, listening, writing, and speaking co-authored by eminent scholars both within and external to ETS. Each of these chapters synthesizes the literature on defining and operationalizing the skill in question, providing an overview of theoretical lenses through which the skill has been conceptualized, and discussing how the skill has been operationalized in major English for Academic Purposes (EAP) tests. The chapters on writing and speaking also include considerations of integrated skills such as listening-speaking and reading-writing or listening-reading-writing. Finally, each chapter provides insights into new developments in task design and scoring, providing guidance for test developers and directions for future research. An introductory chapter by John Norris, John Davis, and Xiaoming Xi contextualizes the papers, outlining the rationale for assessing academic English for admissions purposes, defining the scope of EAP, providing an historical overview of EAP assessment, and introducing each of the framework papers briefly. A final chapter by Carol Chapelle integrates insights from each of the skill chapters in terms of construct definition, task design, and technology, and outlines a framework for test development and validation for an integrated view of EAP.

This authoritative volume will be valuable for language testing and educational assessment professionals, researchers, university academics, and graduate students worldwide. Additionally, a broader audience of applied linguists, including researchers as well as teachers, will be interested in the ways in which academic English language ability is represented in the book. We expect this book to become a standard reference on assessing academic English language proficiency for admissions decisions for many years to come.

Sara T. Cushing
John M. Norris
Steven J. Ross
Xiaoming Xi


ACKNOWLEDGEMENTS

The editors would like to thank all the authors for committing to and following through on this important book project. This book could not have been completed without their hard work, dedication, and patience. Sincere thanks also go to the ETS Associate Editors, Keelan Evanini, John Mazzeo, and Don Powers, for their reviews; to Kim Fryer for coordinating the reviews; and to the technical reviewers and series editors for providing their helpful reviews of the manuscripts.


1 FRAMING THE ASSESSMENT OF ACADEMIC ENGLISH FOR ADMISSIONS DECISIONS
John M. Norris, John McE. Davis, and Xiaoming Xi

1.1  Higher Education Admissions Testing and the Role of English Proficiency

Gaining access to higher education plays a critical role in determining the career and life prospects of millions of individuals worldwide each year (Oliveri & Wendler, 2020). Which students are accepted to study—with what backgrounds, skills, aptitudes, and motivations—also shapes the ways in which educators design university learning experiences, and, in turn, the kinds of contributions that institutions make to society. The process of deciding who is admitted to higher education is undoubtedly high stakes, with consequences for prospective students, universities, academics, and society at large.

Historically, when relatively few institutions of higher learning were available, universities established their own highly selective criteria and processes for making admissions decisions, often involving examinations of their own creation (Wechsler, 2014). In the twentieth century, massive expansion of higher education worldwide, coupled with dramatic changes in student mobility and a general democratization of education, led to both much greater access to higher education and a much higher number and variety of applicants (Goastellec, 2008; Schofer & Meyer, 2005). These changes complicated the challenge of making admissions decisions, raising questions about efficiency and accuracy in identifying those students who are ready to succeed at college, but also concerns with equity and fairness in the kinds of information used to do so (Nettles, 2019).

One result of this evolving educational landscape was the introduction of large-scale standardized testing as a means of facilitating admissions decisions. Though idiosyncratic uses of standardized testing for admissions and other kinds of selection date back centuries, their widespread adoption occurred during the twentieth century (Zwick, 2002). Tests like the SAT and ACT in the U.S. (and others around the globe, see Reshetar & Pitts, 2020) were developed to provide a common, independent, and objective indicator of a student's readiness to pursue academic studies. By assessing knowledge and skills widely held to underlie advanced academic learning (e.g., reading, writing, mathematical reasoning), these standardized tests ostensibly offer an equal opportunity for all students to display their academic potential. They also provide universities with a standard metric that can be used to compare students on a common scale, rather than depending on less trustworthy and/or noncomparable indicators like high school grade point average or state achievement test scores. While this promise of standardized testing led to near wholesale adoption, particularly among U.S. universities in need of discerning among gigantic pools of applicants, it also exposed fundamental concerns with the inequities of access to high-quality education; potential racial, ethnic, socioeconomic, and linguistic biases inherent in the assessments; and a perceived over-emphasis on certain types of knowledge and abilities that could be tested in a standardized format (e.g., Zwick, 2017).

Fundamentally, the question that must be answered for standardized admissions assessments is whether they contribute to useful predictions about the extent to which "a student is likely to be successful with college-level work" (AERA, APA, & NCME, 2014, p. 11). It seems clear today that a variety of factors—only some of which are measured in such assessments—are at play in determining an answer to the question of student readiness for university study. One such factor is language proficiency.

It goes without saying that communication plays a fundamental role in virtually all dimensions of university study. Communication in higher education also takes quite specific forms that differ from other discourse contexts and that call upon sophisticated understanding and use of language. It is because of the importance of communication that tests like the SAT measure some aspects of students' abilities to engage with college-level language use (e.g., evidence-based reading). However, for a particular population of students, the likelihood of success at university depends additionally, and heavily, on the extent to which they have developed global proficiency in the language of study. When the language of communication at a university is different from the first language of a student—as it is for the millions of students who study abroad each year (OECD, 2017)—then it is essential for the student to develop sufficient proficiency in the language such that they can engage successfully with the variety of communication tasks that characterize higher education.

From a university admissions perspective, then, there is a need to identify those students whose language proficiency is sufficient to meet the various communication demands of the academic environment. Standardized proficiency tests have been developed for a variety of languages to meet this precise need (e.g., in German, the Test Deutsch als Fremdsprache; in Dutch, the Certificaat Nederlands als Vreemde Taal). The basic premise of these tests is that they should predict with accuracy which students have sufficient language proficiency to enable them to engage successfully with academic study and which students are not yet ready to do so (Eckes, 2020).

Unpacking this premise, several assumptions underlie standardized language proficiency tests that can inform admissions decisions. First, the assessment should reflect the essential qualities and characteristics of language use in the academic environment, such that students' abilities in that domain can be predicted with accuracy (AERA, APA, & NCME, 2014, p. 191). Academic communication involves tasks that call upon all four skills (reading, writing, listening, and speaking) in distinct ways and for a variety of purposes, the critical dimensions of which would need to be assessed. Second, a determination must be made about how much language proficiency is sufficient to warrant admission. The determination of how much is enough language proficiency—typically realized in the form of cut scores on the assessment's measurement scale—depends on factors such as the degree of linguistic challenge expected in university courses of study, the academic rigor of the university, the availability of language support (e.g., supplementary language courses), assumptions about time needed to complete a degree, and others. Standardized assessments of academic language proficiency, then, are helpful in the admissions process to the extent that they can adequately capture the scope and challenge of language use at university and, in turn, accurately predict a prospective student's abilities to engage in the same. Ultimately, students and universities alike benefit when the predicted language proficiency of the student matches the linguistic demands of the university; both suffer (e.g., increased costs, time, failure) when they do not (Eckes, 2020).

The standardized assessment of academic English proficiency presents a unique case, distinct in several ways from assessments in other languages. Beginning in the post-World War II era and increasing ever since, students from around the world have sought out higher education opportunities in English-dominant countries, including in particular the U.S., the U.K., Australia, Canada, and New Zealand (Institute for International Education, 2017). Based especially on perceptions of educational quality and the sheer number of available universities, second-language speakers of English have opted to study abroad in these locations in mounting numbers. A related factor is the corresponding rise of English as a global language with tremendous influence on a variety of social and professional endeavors, and most recently in its role as the lingua franca of the internet (Crystal, 2003). English has become the de facto language for international communication worldwide, to the extent that Leitner, Hashim, and Wolf (2016) suggested "there is no competitor for English yet in sight, and for the foreseeable decades, its future as a world language is secure" (p. 3). As a result, completing higher education in English opens critical pathways to career opportunities, and proficiency in English is often perceived to be an essential skill in diverse industries. Perhaps most influential, though, is the reality that English has become the international language of scholarship as well as the predominant language of higher education (Northrup, 2013). Not only is a majority of scholarly dissemination worldwide conducted in English, but universities in regions where English is not the dominant language of communication have been adopting English as the medium of instruction for quite some time (e.g., Altbach, 2007; Guilherme, 2007).

One result of these developments vis-à-vis the role of English in the world today, and in its relation to higher education, has been the heightened demand for assessments of academic English proficiency to guide university admissions decisions. A sizable testing industry has also emerged since the earliest standardized test of English proficiency was launched in the 1960s, with an increasing array of options to choose from. Prior to reviewing traditions and innovations in the assessment of academic English, it is worthwhile to first consider in some detail what these assessments are intended to shed light upon—what exactly is meant by academic English?
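Before turning to that question, the notion of cut scores and sufficiency determinations discussed above can be made concrete with a short Python sketch of a score-based screening rule. The score scale, the cut scores, and the rule itself are hypothetical illustrations invented for this example; they do not represent the policy of any actual test or institution, and real admissions decisions weigh test scores alongside many other sources of information.

# Hypothetical illustration of a score-based language screening rule.
# The 0-30 section scale and both thresholds are invented for the example;
# institutions set their own cut scores based on local linguistic demands,
# academic rigor, and the availability of language support.
from dataclasses import dataclass

@dataclass
class ScoreReport:
    reading: int
    listening: int
    speaking: int
    writing: int

    @property
    def total(self) -> int:
        return self.reading + self.listening + self.speaking + self.writing

def meets_language_requirement(report: ScoreReport,
                               total_cut: int = 90,
                               section_cut: int = 20) -> bool:
    """True only if the profile clears the overall cut score and a minimum
    on every section, guarding against sharply uneven skill profiles."""
    sections = (report.reading, report.listening, report.speaking, report.writing)
    return report.total >= total_cut and all(s >= section_cut for s in sections)

# A strong total can still fail a per-section minimum.
applicant = ScoreReport(reading=29, listening=28, speaking=17, writing=26)
print(applicant.total, meets_language_requirement(applicant))  # 100 False

The per-section minimum in this sketch reflects the point that separately reported skill scores can reveal uneven proficiency profiles that a total score alone would mask, one reason the determination of how much proficiency is enough is rarely reducible to a single number.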

1.2  What is English for Academic Purposes?

The phrase English for Academic Purposes (EAP) refers to the specialized use of the English language for a variety of communicative tasks commonly performed in educational contexts, spanning the activities of young learners in primary schools to tasks that adult students perform in higher education. The term EAP also frequently denotes research investigating the academic English language skills needed for post-secondary education. EAP emerged as a focus of research in the 1980s within the broader disciplinary interest of English for Special Purposes (ESP; Hyland & Shaw, 2016). As noted above, in the period after World War II, large-scale immigration to English-speaking countries (among other societal factors) led to a growth in English becoming an important global language, out of which arose a need to help immigrant and sojourning learners use the unique types of English spoken in specific professions and educational programs. ESP addressed the use of English in specialized domains, as well as the approaches to language instruction (and related research) designed to help students acquire professional and academic Englishes (Charles & Pecorari, 2016). EAP primarily emerged in response to the large numbers of international students studying at British universities and an instructional need to help learners develop sufficient English proficiency to study in specialized academic fields, particularly the sciences (De Chazal, 2014; Long, 2015). Since these early beginnings in the U.K. and elsewhere, EAP has become an established research concern driven by the growth of English as the leading language of academic study and research.

In EAP research and practice to date, an important aim has been to define the communicative norms and constraints peculiar to academic discourses and to delineate the linguistic dimensions of English used in academic contexts. A key distinction has been the way in which academic discourse appears to require proficiency in two varieties of academic communication. First, there seem to be certain universal elements of academic expression (formality of style, low-frequency vocabulary, etc.) needed to successfully perform common academic tasks (writing papers, giving presentations, etc.) per the expectations of general academic audiences; this type of discourse is commonly referred to as English for General Academic Purposes (EGAP). Second, the communicative norms of students in the sciences are typically conceived to be distinct from those of students in the humanities, and thus EGAP is distinguished from discipline-specific academic discourse; this type of EAP is known as English for Special Academic Purposes (ESAP; Charles & Pecorari, 2016).

Mastering general or discipline-specific academic English is understood to demand, in particular, control of specific academic vocabulary, collocations, and other word-unit forms that appear frequently in academic discourses but are less frequent in non-academic, social, or professional domains. One established vein of academic and pedagogic research explores how academic communication is characterized by knowledge and appropriate use of academic vocabulary, and researchers have endeavored to identify such forms via analysis of academic corpora (Coxhead, 2016). The appropriate use of academic vocabulary as a key aspect of academic English proficiency further suggests a needed knowledge of how academic communication is more generally "done" for specialized academic tasks. This conception has been explored primarily through genre analysis research, which investigates the ways in which academic discourses are characterized by distinct genres of communication—particularly for writing and speaking—codified in expected registers of communication by academic audiences (Shaw, 2016; Swales, 1990). For example, performing introductions in the genre of an academic presentation calls for certain 'moves' which involve, first, establishing rapport with the audience, followed by providing background information to the topic, then justifying the research, and then proceeding with the main content of the presentation (Charles & Pecorari, 2016). A related contribution to better understanding the unique types of communication within academic English contexts comes from literacy research, which investigates how novice members in academic communities must grapple with acquiring reading and writing literacies, that is, the specific communicative practices situated within disciplinary contexts, academic institutional ideologies, and differential power relations (Lillis & Tuck, 2016).

EAP research has also endeavored to identify the specific underlying linguistic skills needed to perform the range of tasks commonly performed in academic contexts, given their genre specificity, unique register demands, and situated practices. An early instance of such a focus was in composition studies (particularly in the U.S.) with its origins in the study of rhetoric and the development of writing skills taught in first-year writing courses in American universities (Tardy & Jwa, 2016). Other veins of applied linguistics research have investigated the underlying English language notions and functions, multi-word expressions, pragmatics, discourse features, and so on, specific to common academic tasks that require speaking, writing, reading, or listening skills (see Newton et al., 2018).

Within this domain, corpus analysis research is noteworthy given several important contributions to understanding the nature of EAP. Corpus research involves the collection and analysis of many instances of naturally occurring written or spoken language data stored electronically (Nesi, 2016). As noted above, corpus analyses have developed several lists of high-frequency vocabulary and collocations common in academic discourses, including the well-known Academic Word List (Coxhead, 2011) as well as several others encompassing academic "keywords" (Paquot, 2010), collocations (Ackermann & Chen, 2013), and formulae (Simpson-Vlach & Ellis, 2010) in written and spoken academic discourses (Dang, Coxhead, & Webb, 2017). Corpus studies have also used "multidimensional analyses" to focus on larger, multi-word units to understand differences of register between academic text types, registers, genres, and disciplines (Nesi, 2016). An example is the creation and analysis of the TOEFL 2000 Spoken and Written Academic Language Corpus (Biber et al., 2004), which explored differences in language use between common academic registers based on authentic language data collected at North American universities. Results suggested distinctive uses of language along several dimensions of text features such as narrative versus nonnarrative discourse, situation-dependent versus elaborated reference, and overt expression of persuasion versus argumentation (among others).

Notably, in recent years, this research has been informed by the trajectory of the U.S. standards movement in K-12 and higher education, wherein various disciplines have identified key competencies that students are expected to perform at different grade levels, with performance expectations at the higher grades representing the key knowledge/skills needed for college and career readiness (the most widely adopted of these being the Common Core State Standards, National Governors Association, 2010). Critical to such standards are the competencies dedicated to English language communication—or "English language arts"—needed for post-secondary study or the workplace. The creation of these standards has led to corollary sets of competencies developed in English language education to capture the specific linguistic dimensions of academic language. A handful of the most widely used standards include the California English Language Development Standards; TESOL International's PreK–12 English Language Proficiency Standards; and WIDA's English Language Development Standards. For example, the California English Language Development Standards (2014) divide language skills into three broad categories of ability, including being able to (a) "[i]nteract in meaningful ways" (i.e., via collaborative, interpretive, and productive skills), (b) "[l]earn about how English works" (i.e., skills in structuring cohesive texts; expanding/enriching ideas; and connecting/condensing ideas), and (c) "[u]se foundational literacy skills". The WIDA Consortium (2014) standards, by contrast, are divided into domains of academic content (with the exception of a standard dedicated to communication for social and instructional purposes) and are further elaborated into performance indicators and can-do statements in the four language skills. The TESOL (2006) standards were developed to augment the original WIDA standards and likewise articulate language abilities in four skills. All three sets of standards rely to a large extent on "can do"-type statements of student knowledge and skills to define academic language abilities at all grade levels. The WIDA standards, for example, call for the ability to recount and display knowledge or narrate experiences or events in different ways; to explain and clarify the "why" or the "how" of ideas, actions, or phenomena; to argue and persuade by making claims supported by evidence; and to discuss and interact with others to build meaning and share knowledge (WIDA Consortium, 2014).

Recent EAP research has expanded conceptualizations of the language abilities for academic communication beyond those more formalized task types typically recognized as academic. This broadened view relates particularly to performing social, interactive, and navigational tasks in school settings. Of interest in higher education, for example, are the linguistic dimensions of dialogic interactions between students and their lecturers, teachers, or tutors, and during lectures, seminars, classroom discussions, extra-class meetings for collaborative study, student–tutor meetings, office hours, and thesis supervision meetings (Basturkmen, 2016). Such interactions are seen to be crucial to academic life, in that they enable students to consolidate understanding, facilitate disciplinary acculturation, and serve to develop important relationships. The participatory aspect of academic interactions calls for certain communication skills such as asking for clarification, agreeing/disagreeing, initiating comments or responding to comments, criticizing/objecting, appropriate turn-taking, and disagreeing politely. Speakers also need to communicate in specific roles, such as a discussion leader, which can call for encouraging participation and keeping a discussion on track (Basturkmen, 2016). Along these lines, K-12 standards have likewise accounted for the interactional, collaborative types of tasks that comprise an important aspect of academic life and that require specific English communication skills to perform successfully. For example, the California standards express grade 11 and 12 interactional skills as the ability to "[c]ontribute to class, group, and partner discussions, sustaining conversations on a variety of age and grade-appropriate academic topics by following turn-taking rules, asking and answering relevant, on-topic questions, affirming others, and providing coherent and well-articulated comments and additional information" (California Department of Education, 2014, p. 136).

Other relatively recent developments are also reshaping potential conceptualizations of academic English. While much of the research on, and setting of standards for, English communication expectations at university has focused on education settings where English is the dominant language of the country (i.e., Canada, the U.S., the U.K., Australia, New Zealand), the rapid rise of English-medium instruction (EMI) at universities across the globe raises the question of whether these expectations are generalizable to all academic English contexts. These diverse EMI contexts may result in distinct uses for, types of, and expectations about academic English (e.g., Dimova, Hultgren, & Jensen, 2015; Doiz & Lasagabaster, 2020; Macaro, 2018; Owen, Shrestha, & Hultgren, in press). Some of the main differences include: (a) a heavy emphasis on subject- or discipline-specific English (vocabulary, discourse patterns) and associated task types (e.g., English for commerce, or for science and technology); (b) a decreased emphasis on English for social, navigational, and interpersonal communication; (c) a prevalence of distinct regional and non-native varieties of English (pronunciation, pragmatics, vocabulary) among teachers and fellow students; (d) variable roles played by shared first languages among students and teachers, in some contexts (e.g., EMI in Nepal), versus highly multilingual environments in others (e.g., EMI in Sweden); (e) preparedness of teachers to support language and content learning within EMI classes; and others. These unique aspects of academic English in EMI contexts raise important questions about what constitutes "authentic" EAP and whether there is a shared construct that applies across all settings, as well as what the appropriate standards might be—and how they might differ—for determining which students have sufficient English proficiency to enable their participation in academic studies.

A final development, and one that continues to evolve rapidly, has to do with the increasing role played by technology-mediated communication within higher education. Though such changes have been occurring for some time (e.g., Jacoby, 2014; Kim & Bonk, 2006)—including the mounting use of e-mail and social media, widespread deployment of learning management systems, the advent of Massive Open Online Courses, and a shift to "flipped" classes and hybridization of instruction—the realities of the COVID-19 global pandemic have caused a massive transition to online instruction across universities in all parts of the world (Gardner, 2020; Tesar, 2020). From an academic English perspective, it is important to consider the extent to which these developments have led (or will continue to lead) to new and/or different expectations regarding the types of communication tasks students engage in and the associated linguistic demands. Furthermore, questions arise regarding not only features of the digital environment that may require different types of English proficiency (e.g., digital literacies, including language associated with navigating the internet or conferencing platforms) but also those that may offer different kinds of linguistic and learning support (e.g., subtitling of recorded lectures). While earlier iterations of online-delivered university courses emphasized transmission pedagogy, and in particular audio/video-recorded lectures coupled with individual reading assignments and quizzes (Hiltz & Turoff, 2005), technological developments have enabled a shift to much more learner-centered, collaborative, and project-based kinds of activities. The addition of virtual environments, simulation- and game-based learning, and artificial intelligence supported instruction underscores the likelihood of distinct types of English being encountered in higher education (including communication with machines rather than people) (e.g., Vlachopoulos & Makri, 2017; Zawacki-Richter et al., 2019).

To date, little has been done in the way of systematic analysis of the academic English demands in these new/changing higher education settings. One recent study (Kyle et al., in press) analyzed the characteristics of English speaking and writing found in a corpus of recorded technology-mediated university learning environments at U.S. universities. The corpus included the kinds of spoken and written input that students typically encountered in university settings. Detailed multidimensional linguistic analyses enabled a comparison of this corpus with a previous corpus that reflected similar language use within face-to-face university settings (Biber et al., 2004). While considerable overlap in the characteristics of language use was found between the corpora, two important differences emerged. First, on average, the linguistic challenge of spoken input in the technology-mediated setting was found to be substantially higher than in the face-to-face setting. Second, interestingly, the linguistic challenge of written input in the technology-mediated setting was found to be substantially lower than in the face-to-face setting. Though initial, and only focusing on language input (versus language production), these findings point to somewhat different expectations for academic English proficiency in emerging technology-mediated environments.

In sum, academic English in higher education can be understood as the qualities and characteristics of English language communication typically encountered in university classes and other study-related occasions as well as in social and navigational dimensions of academic endeavors. It is associated with sophisticated lexical, grammatical, and discourse features, all of which are deployed in receptive, productive, and interactive tasks involving all four language skills. Academic English is not only the language encountered at universities in English-predominant countries, but also increasingly the language of universities in non-English-predominant settings, or so-called English-medium instruction institutions. The expectations regarding academic English proficiency may, therefore, vary somewhat by context of use, and the evolving nature of higher education delivery—in particular online and other forms of technology-mediated instruction—may also imply changes in certain aspects of academic English proficiency.
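As a concrete illustration of the corpus-analytic methods discussed earlier in this section, in which academic vocabulary is identified by comparing word frequencies in academic corpora against general reference corpora, the toy Python sketch below flags words that are markedly more frequent in a small "academic" sample than in a "general" one. The two samples and the frequency-ratio threshold are fabricated for illustration only; published resources such as the Academic Word List are built from corpora of millions of words and apply additional criteria such as range, dispersion, and word families.

# Toy sketch of frequency-based academic keyword extraction.
# The two "corpora" here are placeholder strings; real studies use very
# large corpora and more elaborate statistics (e.g., keyness measures).
from collections import Counter
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def relative_freq(tokens: list[str]) -> dict[str, float]:
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def academic_keywords(academic_text: str, general_text: str,
                      min_ratio: float = 2.0) -> list[str]:
    """Return words whose relative frequency in the academic sample is at
    least min_ratio times their frequency in the general sample."""
    acad = relative_freq(tokenize(academic_text))
    gen = relative_freq(tokenize(general_text))
    floor = 1e-6  # smoothing for words absent from the general sample
    return sorted(w for w, f in acad.items() if f / gen.get(w, floor) >= min_ratio)

academic_sample = "the analysis indicates significant variance across the data"
general_sample = "the weather was nice so we walked across the town to see friends"
print(academic_keywords(academic_sample, general_sample))
# -> ['analysis', 'data', 'indicates', 'significant', 'variance']

Register-comparison studies such as the multidimensional analyses cited above follow a broadly similar logic, counting grammatical and discourse features rather than individual words, although their statistical machinery is considerably more elaborate than this sketch.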

1.3  Traditions and Innovations in Testing Academic English

Given these general understandings of what constitutes academic English, coupled with a widespread need to determine students' readiness for university study, various approaches to standardized testing of English proficiency have emerged over the past 70-plus years (see Xi, Bridgeman, & Wendler, 2014 for a review). The Test of English as a Foreign Language (TOEFL), developed by the National Council on the Testing of English as a Foreign Language in the U.S., was arguably the first large-scale standardized test that targeted academic English. The first TOEFL test appeared in 1964, and by 1965 it was launched at a global scale by the Educational Testing Service (ETS) (Taylor & Angelis, 2008). The original TOEFL assessed vocabulary knowledge, reading and listening comprehension, and knowledge of English structure and grammar. Since then, numerous additional standardized academic English tests have been developed, though only a few have come to play a role as "large-scale" tests with widespread use across multiple regions of the world. The TOEFL iBT, delivered since 2005 by ETS, is the most recent version of the TOEFL; it is administered in over 150 countries and accepted by institutions in over 130 countries. Another such test is the academic version of the International English Language Testing System (IELTS), which emerged in the early 1980s out of consolidated efforts by the University of Cambridge and the British Council, based on a previous general proficiency test referred to as the English Language Testing System (Davies, 2008; Weir & O'Sullivan, 2017). Another test that has gained some international use is the Pearson Test of English (PTE) Academic, first launched in 2009 (Zheng & De Jong, 2011). Distinct from these tests that are intended for global administrations, and similarly accepted by universities around the world, other standardized tests of academic English have been developed primarily for use in specific national contexts. For example, the Canadian Academic English Language (CAEL) test assesses students' abilities to perform tasks that reflect English specifically as it is used in Canadian university classrooms, and is intended for use by Canadian institutions (CAEL Assessment Office, 2009). Another example is the Global Test of English Communication (GTEC), developed for use in Japan to determine high school students' readiness to engage in academic English communication at university (Kim, Smith, & Chin, 2017).

Each of these academic English tests is distinct from the others in multiple ways, including the specific constructs tested, task types, measurement models, score scales and scoring approaches, and others (Green, 2018). Nevertheless, they also share considerable commonalities. All report scores on each of the four skills independently along with a total score, reflecting the reality that English learners often have distinct profiles of language proficiency that may differentially affect their abilities to participate in university studies, information that has proven particularly useful for informing admissions decisions (Ginther & Yan, 2018). All of them are built upon analyses of the kinds of communication demands encountered in academic English domains and rationales for sampling tasks to represent those domains. Likewise, all of them simulate, replicate, or otherwise represent to varying degrees the real-life communication that takes place in higher education by incorporating task types to elicit the use of language for specific communication purposes. In this sense, all of these tests are task-based, and important validity claims—from content coverage to construct representation to test use consequences—for each test are founded on this basis (Norris, 2018). The important role played by tasks that represent communicative uses of English is perhaps the defining characteristic of these high-stakes tests of academic English. On the one hand, they enable the warranted interpretation of test-takers' abilities to use the language to get things done—that is fundamentally what university score users want to know about students' English proficiency, and it is also how English proficiency is represented in standards and frameworks that describe academic English. On the other hand, the inclusion of real-life communication tasks reflects the express intent to encourage positive consequences in the teaching and learning of English. That is, as learners develop their proficiency to be able to do well on such tests, they are actually developing their abilities to use English for successful academic communication.

TOEFL iBT, launched in 2005, is an example of how a standardized test of academic English proficiency evolved to achieve these goals. While earlier versions of the TOEFL test were extremely popular, there was also a clear need to transition from a selected-response test of receptive skills and language knowledge to something that aligned more directly with how language is used at university. Prior developments at ETS had already set the stage for a new kind of test with a focus on academic communication, including in particular the development of two productive skills tests, the Test of Written English and the Test of Spoken English (Taylor & Angelis, 2008). Beginning in earnest at the turn of the millennium, a series of substantial domain analyses of language use in North American universities led to the establishment of theoretical frameworks for testing each of the four skills in an academic context (later elaborated to include integrated academic skills; Chapelle, Enright, & Jamieson, 2008). Fundamental to the design of the test was the inclusion of tasks in each of the four skills that require test takers to engage in academic English communication, from reading extended texts and listening to academic lectures to writing essays and speaking on a range of topics.

In addition to the incorporation of all four skills and integrated tasks that reflected the academic environment, the delivery and scoring of TOEFL iBT featured important innovations, many a result of technological developments. First, it was designed to be entirely internet delivered with a centralized online scoring and score-reporting system. Test takers around the world could access the same test-taking experience and expect the same secure, high-quality, and unbiased scoring of their responses. Second, careful scoring procedures were developed for writing and speaking tasks, including the expert training of raters and the development and public disclosure of scoring rubrics. Ultimately, based on rapid iterations in automated scoring technologies, a hybrid approach to scoring both writing and speaking tasks was introduced to take advantage of efficiencies and consistencies provided by automated scoring coupled with the insights and accuracy of human raters (ETS, 2020).

A final important innovation associated with the TOEFL iBT was the introduction of an argument-based approach to framing validity inquiry (Kane, 2001). Kane's approach suggested that validity be conceptualized as a series of interconnected claims or arguments, moving from the foundation of an assessment in reference to a given ability or knowledge domain through the elicitation of performances, the evaluation of the same, the development of scores to represent abilities, the extrapolation of scores as evidence of related abilities, and ultimately the justification of specific intended uses for scores. These claims, once stipulated, then serve as the focus of evidence-generating validation studies that would either support or refute the overall argument of the assessment. As detailed in Chapelle, Enright, and Jamieson (2008), the validity argument for the TOEFL iBT provided a comprehensive framework for guiding test development as well as marshalling evidence in support of test score interpretation and use, and hundreds of validity studies have contributed to the evidentiary basis for the test in the intervening years. One such line of research, particularly important for admissions tests (Eckes, 2020), has investigated the extent to which scores on the TOEFL iBT can predict student success at university (e.g., Bridgeman, Cho, & DiPietro, 2016; Cho & Bridgeman, 2012; Harsch, Ushioda, & Ladroue, 2017). Not only has this research highlighted the ways in which TOEFL scores, when used appropriately, can indeed serve as a useful indicator of likelihood of student success (e.g., as measured by first-year grade-point average), but it has also greatly clarified the array of factors that interact in determining student success and advanced the analytic and interpretive methods most helpful in examining predictive validity of admissions tests.

Other academic English tests (e.g., those listed above) have undergone varying degrees of construct- and technology-inspired innovation in an attempt to capture meaningful evidence of test-takers' readiness for using English at university; similarly, again to varying degrees, other tests have also pursued programs of validity research to support such claims. One result of these efforts over the past several decades has been, perhaps inevitably, that potential challenges to the validity claims of these academic English tests have been identified. As one example, an important challenge has to do with the extent to which the communication tasks and varieties of English represented on large-scale tests can be claimed to reflect authentically the uses of English in expanding contexts of higher education and especially in EMI (Hu, 2018; Pinner, 2016). A consequential dimension of this challenge has focused on whether the widespread global use and prioritization of academic English tests may impact the legitimacy and value of other languages and cultures (e.g., Guilherme, 2007). Other areas of challenge for contemporary high-stakes standardized tests of academic English range from the representation of interactional competence, to the inclusion of more extended and integrated task types that typify university communication, to the roles played by emerging genres of technology-mediated communication, and others. Most interesting in this regard is the fact that programs of validity research, and in particular the close examination of different links in an argument-based chain of inferences, have led to the identification of meaningful test design and development targets for the 'next generation' of academic English tests.

One last and most recent development related to high-stakes English tests has to do with the rise of tests that are largely created and solely scored by artificial intelligence technologies and advocated for use in university admissions decisions. Unlike the tests listed above, which found validity claims on the sampling and representation of robust academic tasks that require meaningful communication, a handful of tests have been developed on the basis of short-cut item types that generally tap into discrete underlying dimensions of language knowledge. Rather than testing a student's ability to comprehend extended reading or listening texts, write convincing academic discourse in response to challenging prompts, or speak on authentic academic topics, these tests assess things like whether an examinee can discern English words from nonsense words, whether they can process the meaning of discrete sentences, whether they can fill in the blank with a grammatically correct word, whether they can produce word- or sentence-level speech and writing, and similar (e.g., Brenzel & Settles, 2017). In relation to the domain of academic English outlined above, it is clear that such tests do not reflect the types of communication that typify university study, and as such any interpretation of academic English proficiency on the basis of such tests would be highly indirect at best (Wagner, 2020). However, these kinds of tests are increasingly advocated explicitly for use in making high-stakes university admissions decisions, decisions that rest on interpretations about students' English language readiness for university studies. The professional and social dimensions of this emerging practice are considerable. Such tests run counter to the responsibilities of test developers to provide a thorough, convincing, and evidence-based approach to informing intended interpretations and uses of a test (AERA, APA, & NCME, 2014). Equally, if not more important, the consequences of these tests for language learners, and for universities, including both potential adverse effects on score-based decisions and negative impact on English language learning and teaching practices, remain to be determined.

The recent advent of these kinds of non-academic tests for use in making university admissions decisions, coupled with the dramatic events of 2020 and the disruptions to higher education broadly speaking, while troubling, is also indicative of the need to consider carefully what the future of academic English testing might entail. Certainly, some tests to date have taken seriously the charge of defining an academic English proficiency construct, designing robust tests accordingly, and engaging in substantial evaluation of associated validity claims including construct, predictive, and consequential aspects of validity. Yet, changes in the world around us, in the potential contributions (and drawbacks) of technology, and specifically in the nature of English use for academic purposes point to a timely need for re-assessing the testing of academic English.
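The hybrid scoring arrangement described earlier in this section, in which automated scoring is combined with human judgment for writing and speaking responses, can be pictured with a small sketch. The combination rule, score scale, and discrepancy threshold below are invented for illustration; they are not the operational procedures of ETS or of any other testing program, which rely on trained raters, calibrated scoring engines, and far more elaborate quality-control routines.

# Illustrative (not operational) hybrid scoring of a constructed response:
# an automated engine score and a human rating are averaged when they agree
# closely; large disagreements are routed to a second human for adjudication.
from statistics import mean

def hybrid_score(machine: float, human: float,
                 adjudicator: float | None = None,
                 max_gap: float = 1.0) -> float:
    """Combine machine and human ratings on a shared 0-5 scale."""
    if abs(machine - human) <= max_gap:
        return round(mean([machine, human]), 1)
    if adjudicator is None:
        raise ValueError("Score discrepancy: route response to a second human rater.")
    # When ratings diverge, keep the human judgments and drop the machine score.
    return round(mean([human, adjudicator]), 1)

print(hybrid_score(3.5, 4.0))                   # close agreement -> 3.8
print(hybrid_score(2.0, 4.0, adjudicator=3.0))  # discrepancy -> 3.5

Whatever the specific rule, the design intent noted above is the same: automated scoring contributes efficiency and consistency, while human raters supply the insight and accuracy that current engines cannot fully replicate.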

1.4  The Chapters in This Volume

To establish the conceptual basis for designing the TOEFL iBT test, ETS scientists have, since 1996, collaborated closely with language testing experts to develop framework papers for assessing EAP, most notably the TOEFL 2000 framework papers (Bejar et al., 2000; Butler et al., 2000; Cumming et al., 2000; Enright et al., 2000; Jamieson et al., 2000). These framework papers provided critical reviews of theories and practices in the assessment of reading, listening, writing, and speaking as well as useful guidance for designing academic English tests. However, the evolving nature of academic English, along with advances in language learning and assessment over the last 20 years, has raised the need to revisit how academic English skills might be conceptualized. It is also timely to
consider what innovations in test design, tasks, scoring rubrics, scoring, and score reporting might be supported by new developments in assessment practices and technologies. This volume provides a state-of-the-art review of advances in theories relevant to the assessment of academic English skills for higher education admissions purposes, articulates implications and key trends in practices of academic English proficiency testing, and identifies future development and research needs. The four main chapters focus on recent developments in assessing academic reading, listening, writing, and speaking. Assessment of integrated skills, such as reading-writing, listening-speaking, and reading-listening-speaking, is also addressed in the chapters focusing on writing and speaking. Although we have seen increasing use of integrated tasks in tests of English for admissions purposes to reflect real-world academic environments, we adhered to the separation of the four skills in structuring the chapters. Conceptually, despite some common terms for describing competencies across the four skills in models of language ability (e.g., Bachman & Palmer, 2010), the four skills are fundamentally different and involve distinct task types and contexts of communication as well as unique knowledge, skills, processes, and strategies. It is also a common practice in higher education admissions to use the total score as well as individual scores for decision-making. Additionally, reporting separate scores on skills allows higher education institutions to provide targeted language support for admitted students if needed. While greater attention has been given to the training of integrated skills, ESL and EFL instruction around the world has followed the tradition of separate skill courses, given that the development of each skill involves a different path and that learners may have unevenly developed proficiency profiles (Hinkel, 2010). Therefore, in this volume, each skill is given a separate treatment, and in the writing and speaking chapters, skill integration in language production is discussed.

Chapter 2, Assessing Academic Reading, by Schedl et al. reviews current theoretical perspectives on how reading ability can be conceptualized from ESL and EFL learner perspectives. More specifically, the chapter outlines a model of academic reading that involves several interacting components and suggests a broader definition of the academic reading construct for assessment purposes. Following discussion of the model in terms of reader purpose, text characteristics, and the reader's linguistic and processing abilities, the role of new media in academic reading is considered. It is argued that despite recent advances in reading research, most empirical work on reading comprehension has been conducted with traditional printed text. The chapter then elaborates on the key differences between reading printed text and reading in more modern digital environments. These include the need to evaluate the quality and relevance of sources, the introduction of non-linear text, the variety of multimedia sources, and the social nature of digital interactions. Following this discussion, the chapter presents new task and item types that might contribute to assessing academic reading abilities while taking new media into account. Finally, the chapter makes recommendations for both
near-​term and future directions for research, especially related to reading across multiple texts. Chapter 3, Assessing Academic Listening, by Papageorgiou et al. examines current thinking and research related to L2 listening comprehension and discusses implications for defining the construct of academic listening and the design of test tasks. The chapter first explores theoretical perspectives on listening comprehension derived from cognitive processing and components models as well as findings from needs analyses conducted in university settings. Based on an interactionalist approach to construct definition, the chapter then discusses important contextual facets that should be considered when designing an academic listening test. Key issues in the definition of the construct and the design of test tasks for academic listening tests are then considered. The chapter concludes with suggestions for possible future research in assessing academic listening, including defining the domain, designing new test tasks, and understanding test-​taker behavior. Given the inevitable operational constraints on listening tests, which limit the extent to which listening tasks satisfy situational authenticity, the chapter points out that it is critical for test developers to explore innovative ways to enhance the interactional and contextual authenticity of assessment tasks. Chapter 4, Assessing Academic Writing, by Cumming et al. reviews recent trends in theories, research, and methods for the assessment of writing abilities in EAP in large-​scale, high-​stakes tests.The chapter points out that current conceptualizations of assessing academic writing center on interactionalist principles of task-​based assessment to evaluate the characteristics of a person’s abilities to produce written texts in specific contexts. Most writing for academic purposes involves students displaying and/​or transforming their knowledge in direct relation to the content and contexts of ideas and information they have been reading, hearing, and discussing in academic courses.These premises require new definitions of the construct of writing in terms of construction-​integration, multiliteracies, and genre-​ based models of language, communication, and literacy. At the same time, writing tasks for standardized tests require that test takers produce texts within reasonable periods of time in response to well-​defined expectations that generalize maximally across academic disciplines. Such responses also need to be scored reliably, consistently, and fairly. Accordingly, the chapter then focuses on: (a) the practices, processes, and developments associated with writing from sources; (b) qualities of personal expression, transaction, and commitment in written academic discourse; and (c) the genres and related rhetorical characteristics of academic writing in higher education, including new genres associated with increasingly technology-​ mediated academic communication. The chapter outlines a framework of task characteristics for assessment purposes, defined as domains of academic and practical-​social writing, genres of explanatory and transactional writing, audiences of teachers/​professors, professionals, and student peers, and purposes of writing to inform/​argue/​critique, and to recommend/​persuade. It also suggests three objectives to guide future developments: (a) specifying the expected purposes, audiences,

and genres for writing tasks; (b) elaborating on and extending the evident success of integrated writing tasks through various types of brief, integrated, content-​ responsible tasks; and (c) increasing test reliability and the coverage of genres and contexts for academic writing. Agendas for short-​and long-​term research are proposed. Chapter 5, Assessing Academic Speaking, by Xi et al. reviews current theoretical perspectives on how academic speaking skills should be conceptualized and offers new perspectives on approaches to construct definition for assessment purposes. It also considers current debates around how academic oral communicative competence should be operationalized and makes recommendations for short-​term and long-​term research and development needs. Fundamentally, the chapter argues that needed changes in the construct definition of academic speaking skills are driven primarily by three factors: advances in theoretical models and perspectives, the evolving nature of oral communication in the academic domain, and standards that define college-​ready oral communication skills. Drawing on recent work in approaches to construct definition and models of language abilities, an integrated model of academic speaking skills is proposed. Further discussed are implications of this integrated model for how language use contexts should be represented in construct definition and reflected in task design following an interactionalist approach, how test and rubric design should facilitate elicitation and extraction of evidence of different components of academic speaking skills, and how tests and rubrics might best reflect intended processes and strategies involved in speaking. Additional insights are drawn from domain analyses of academic oral communication and reviews of college-​ready standards for communication. Following consideration of these factors, the chapter then reviews new developments in task types, scoring approaches, rubrics, and scales, as well as various technologies that can inform practices in designing and scoring tests of academic speaking. The chapter concludes by considering likely future directions, pointing to the needs for refining an integrated, coherent theoretical model that underlies academic speaking skills and updating construct definitions to reflect the changing roles and demands of oral communication in EMI contexts. It also discusses the need for continuing explorations of new task types and scoring rubrics and for leveraging technological advances to support innovations in assessment design, delivery, scoring, and score reporting. Finally, it underscores the necessity of supporting the validity of interpretations of and uses for academic speaking tests as well as some challenges presented by distinct approaches to testing. In the final chapter, Looking Ahead to the Next Generation of Academic English Assessments, Chapelle proposes an overall approach to construct definition for academic English and outlines a framework for test task design and validation. The chapter makes an important observation that the four main chapters still reinforce the practice that academic English proficiency is defined in separate modalities rather than as a holistic construct.The chapter argues for further research to define an integrated construct of academic English proficiency where the four modalities

are fused together, reflecting the complex interactive realities of communication in the higher education environment. It then analyzes the way constructs are defined for each of the four modalities, noting that they all emphasize the context-​ dependent nature of academic language use. The chapter also synthesizes the test task characteristics introduced across the chapters to present a model of test task characteristics for academic communication overall. Furthermore, it analyzes the way technology is used in defining score interpretation, designing tasks, conducting validation research, and reshaping the admissions testing space. The chapter articulates the values and concerns expressed in each area that should inform a validity argument for test score interpretation and use. It includes suggestions of research that come from the four chapters as well as other research needed to support a validity argument for a test combining the four areas of language abilities. The chapter concludes by drawing together two notable current debates and proposing some ways forward. The first relates to overall approaches to defining and operationalizing the constructs of academic English for admissions purposes and how these different approaches may impact score-​based interpretations and uses. The second involves balancing appropriate test use with evolving user needs, driven by at times volatile political, financial, and social forces that surround higher education admissions testing in the contemporary era around the world.

References

Ackermann, K., & Chen, Y. H. (2013). Developing the Academic Collocation List (ACL) – A corpus-driven and expert-judged approach. Journal of English for Academic Purposes, 12(4), 235–247. AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: Author. Altbach, P. (2007). The imperial tongue: English as the dominating academic language. Economic and Political Weekly, September 8, 3608–3611. Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford, U.K.: Oxford University Press. Basturkmen, H. (2016). Dialogic interaction. In K. Hyland & P. Shaw (Eds.), The Routledge handbook of English for academic purposes (pp. 152–164). London, U.K.: Routledge. Bejar, I., Douglas, D., Jamieson, J., Nissan, S., & Turner, J. (2000). TOEFL 2000 listening framework: A working paper (TOEFL Monograph No. 19). Princeton, NJ: Educational Testing Service. Biber, D., Conrad, S., Reppen, R., Byrd, P., Helt, M., Clark, V., Cortes, V., Csomay, E., & Urzua, A. (2004). Representing language use in the university: Analysis of the TOEFL 2000 spoken and written academic language corpus (TOEFL Monograph Series). Princeton, NJ: Educational Testing Service. Brenzel, J., & Settles, B. (2017). The Duolingo English test—Design, validity, and value. Retrieved from https://s3.amazonaws.com/duolingo-papers/other/DET_ShortPaper.pdf Bridgeman, B., Cho, Y., & DiPietro, S. (2016). Predicting grades from an English language assessment: The importance of peeling the onion. Language Testing, 33(3), 307–318.
Butler, F.A., Eignor, D., Jones, S., McNamara,T., & Suomi, B. K. (2000). TOEFL 2000 speaking framework: A working paper (TOEFL Monograph No. 20). Princeton, NJ: Educational Testing Service. CAEL Assessment Office. (2009). CAEL Assessment administration manual. Ottawa, CN: Author. California Department of Education. (2014). California English Language Development Standards (electronic edition). Sacramento, CA: State of California Department of Education. Chapelle, C. A., Enright, M. K., & Jamieson J. M. (Eds.) (2008). Building a validity argument for the Test of English as a Foreign Language. New York: Routledge. Charles, M., & Pecorari, D. (2016). Introducing English for academic purposes. New York: Routledge. Cho, Y., & Bridgeman, B. (2012). Relationship of TOEFL iBT® scores to academic performance: Some evidence from American universities. Language Testing, 29(3), 421–​442. Coxhead, A. (2011). The new Academic Word List 10 years on: Research and teaching implications. TESOL Quarterly, 45(2), 355–​362. Coxhead, A. (2016). Acquiring academic and disciplinary vocabulary. In K. Hyland & P. Shaw (Eds.), The Routledge handbook of English for academic purposes (pp. 177–​190). London, U.K.: Routledge. Crystal, D. (2003). English as a global language. Cambridge, U.K.: Cambridge University Press. Cumming, A., Kantor, R., Powers, D. E., Santos, T., & Taylor, C. (2000). TOEFL 2000 writing framework: A working paper (TOEFL Monograph No. 18). Princeton, NJ: Educational Testing Service. Dang, T. N. Y., Coxhead, A., & Webb, S. (2017). The academic spoken word list. Language Learning, 67(4), 959–​997. Davies, A. (2008). Assessing academic English:Testing English proficiency, 1950–​1989:The IELTS solution. Cambridge, U.K.: Cambridge University Press. De Chazal, E. (2014). English for academic purposes. Oxford, U.K.: Oxford University Press. Dimova, S., Hultgren, A. K., & Jensen, C. (2015). English-​medium instruction in European higher education. English in Europe,Volume 3. Berlin, Germany: De Gruyter Mouton. Doiz, A., & Lasagabaster, D. (2020). Dealing with language issues in English-​medium instruction at university: A comprehensive approach. International Journal of Bilingual Education and Bilingualism, 23(3), 257–​262. Eckes,T. (2020). Language proficiency assessments in higher education admissions. In M. E. Oliveri & C. Wendler (Eds.), Higher education admissions practices: An international perspective (pp. 256–​275). Cambridge, U.K.: Cambridge University Press. Educational Testing Service (ETS). (2020). TOEFL® Research Insight Series, Volume 3: Reliability and comparability of TOEFL iBT® scores. Princeton, NJ: Educational Testing Service. Enright, M. K., Grabe, W., Koda, K., Mosenthal, P., Mulcahy-​Ernt, P., & Schedl, M. (2000). TOEFL 2000 reading framework:A working paper (TOEFL Monograph No. 17). Princeton, NJ: Educational Testing Service. Gardner, L. (2020). Covid-​19 has forced higher ed to pivot to online learning. Here are 7 takeaways so far. The Chronicle of Higher Education, March 20, 2020. Ginther, A., & Yan, X. (2018). Interpreting the relationships between TOEFL iBT scores and GPA: Language proficiency, policy, and profiles. Language Testing, 35(2), 271–​295. Goastellec, G. (2008). Changes in access to higher education: From worldwide constraints to common patterns of reform? In D. P. Baker & A. W. Wiseman (Eds.), The worldwide

transformation of higher education (International Perspectives on Education and Society, Vol. 9, pp. 1–​26). Bingley, U.K.: Emerald Group Publishing Limited. Green, A. (2018). Linking tests of English for academic purposes to the CEFR: The score user’s perspective. Language Assessment Quarterly, 15(1), 59–​74. Guilherme, M. (2007). English as a global language and education for cosmopolitan citizenship. Language and Intercultural Communication, 7(1), 72–​90. Harsch, C., Ushioda, E., & Ladroue, C. (2017). Investigating the predictive validity of TOEFL iBT test scores and their use in informing policy in a United Kingdom university setting (TOEFL iBT Research Report Series No. 30). Princeton, NJ: Educational Testing Service. Hiltz, S. R., & Turoff, M. (2005). Education goes digital: The evolution of online learning and the revolution in higher education. Communications of the ACM, 48(10), 59–​64. Hinkel, E. (2010). Integrating the four skills: Current and historical perspectives. In R. B. Kaplan (Ed.), Oxford handbook in applied linguistics (2nd ed., pp. 110–​126). Oxford, U.K.: Oxford University Press. Hu, G. (2018). The challenges of world Englishes for assessing English proficiency. In E. L. Low & A. Pakir (Eds.), World Englishes: Rethinking paradigms (pp. 78–​95). London, U.K.: Taylor and Francis. Hyland, K., & Shaw, P. (2016). Introduction. In K. Hyland & P. Shaw (Eds.), The Routledge handbook of English for academic purposes (pp. 1–​14). London, U.K.: Routledge. Institute of International Education. (2017). A world on the move: Trends in global student mobility. New York: Author. Jacoby, J. (2014). The disruptive potential of the Massive Open Online Course: A literature review. Journal of Open, Flexible, and Distance Learning, 18(1), 73–​85. Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000 framework: A working paper (TOEFL Monograph No. 16). Princeton, NJ: Educational Testing Service. Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–​342. Kim, K. J., & Bonk, C. J. (2006). The future of online teaching and learning in higher education. Educause Quarterly, 29(4), 22–​30. Kim, M., Smith, W. Z., & Chin, T.Y. (2017). Validation and linking scores for the Global Test of English Communication. Tokyo, Japan: Benessee. Kyle, K., Choe, A-​T., Egushi, M., LaFlair, G., & Ziegler, N. (in press). A comparison of spoken and written language use in traditional and technology mediated learning environments (TOEFL Research Report Series). Princeton, NJ. Educational Testing Service. Leitner, G., Hashim, A., & Wolf, H. G. (Eds.) (2016). Communicating with Asia: The future of English as a global language. Cambridge, U.K.: Cambridge University Press. Lillis, T., & Tuck, J. (2016). Academic literacies: A critical lens on writing and reading in the academy. In K. Hyland & P. Shaw (Eds.), The Routledge handbook of English for academic purposes (pp. 30–​43). London, U.K.: Routledge. Long, M. (2015). Second language acquisition and task-​ based language teaching. Oxford, U.K.: John Wiley & Sons. Macaro, E. (2018). English medium instruction. Oxford, U.K.: Oxford University Press. National Governors Association. (2010). Common core state standards.Washington, DC:Author. Nesi, H. (2016). Corpus studies in EAP. In K. Hyland & P. Shaw (Eds.), The Routledge handbook of English for academic purposes (pp. 206–​217). London, U.K.: Routledge. Nettles, M. T. (2019). History of testing in the United States: Higher education. 
The ANNALS of the American Academy of Political and Social Science, 683(1), 38–​55.

Newton, J. M., Ferris, D. R., Goh, C. C., Grabe, W., Stoller, F. L., & Vandergrift, L. (2018). Teaching English to second language learners in academic contexts: Reading, writing, listening, and speaking. New York: Routledge. Norris, J. M. (2018). Task-​based language assessment: Aligning designs with intended uses and consequences. JLTA Journal, 21, 3–​20. Northrup, D. (2013). How English became the global language. New York: Palgrave Macmillan. OECD. (2017). Education at a glance 2017: OECD indicators. Paris, France: OECD Publishing. Oliveri, M. E., & Wendler, C. (Eds.) (2020). Higher education admissions practices: An international perspective. Cambridge, U.K.: Cambridge University Press. Owen, N., Shrestha, P., & Hultgren, K. (in press). Researching academic reading in two contrasting EMI (English as a Medium of Instruction) university contexts (TOEFL Research Report Series). Princeton, NJ: Educational Testing Service. Paquot, M. (2010). Academic vocabulary in learner writing: From extraction to analysis. London, U.K.: Bloomsbury. Pinner, R. S. (2016). Reconceptualising authenticity for English as a global language. Amsterdam, The Netherlands: Multilingual Matters. Reshetar, R., & Pitts, M. (2020). General academic and subject-​based examinations used in undergraduate higher-​education admissions. In M. E. Oliveri & C. Wendler (Eds.), Higher education admissions practices: An international perspective (pp. 237–​255). Cambridge, U.K.: Cambridge University Press. Schofer, E., & Meyer, J. W. (2005). The worldwide expansion of higher education in the twentieth century. American Sociological Review, 70(6), 898–​920. Shaw, P. (2016). Genre analysis. In K. Hyland & P. Shaw (Eds.), The Routledge handbook of English for academic purposes (pp. 243–​255). London, U.K.: Routledge. Simpson-​Vlach, R., & Ellis, N. C. (2010). An academic formulas list: New methods in phraseology research. Applied Linguistics, 31(4), 487–​512. Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge, U.K.: Cambridge University Press. Tardy, C. M., & Jwa, S. (2016). Composition studies and EAP. In K. Hyland & P. Shaw (Eds.), The Routledge handbook of English for academic purposes (pp. 56–​68). London, U.K.: Routledge. Taylor, C., & Angelis, P. (2008). The evolution of TOEFL. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 27–​54). New York: Routledge. Tesar, M. (2020). Towards a post-​Covid-​19 ‘new normality?’: Physical and social distancing, the move to online and higher education. Policy Futures in Education, 18(5), 556–​559. TESOL. (2006). Pre-​K–​12 English language proficiency standards. Alexandria,VA: TESOL. Vlachopoulos, D., & Makri, A. (2017). The effect of games and simulations on higher education: A systematic literature review. International Journal of Educational Technology in Higher Education, 14(1), 22. Wagner, E. (2020). Duolingo English Test, Revised Version July 2019. Language Assessment Quarterly, 17(3), 300–​315. Wechsler, H. S. (2014). The qualified student: A history of selective college admission in America. London, U.K.: Routledge. Weir, C. J., & O’Sullivan, B. (2017). Assessing English on the global stage:The British Council and English language testing, 1941-​2016. London, U.K.: Equinox.

WIDA Consortium. (2014). The WIDA standards framework and its theoretical foundations. Retrieved from https://wida.wisc.edu/sites/default/files/resource/WIDA-Standards-Framework-and-its-Theoretical-Foundations.pdf Xi, X., Bridgeman, B., & Wendler, C. (2014). Tests of English for academic purposes in university admissions. In A. J. Kunnan (Ed.), The companion to language assessment: Evaluation, methodology, and interdisciplinary themes (Vol. 1, pp. 318–337). Chichester, U.K.: Wiley. Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education – where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 39. Zheng, Y., & De Jong, J. H. A. L. (2011). Research note: Establishing construct and concurrent validity of Pearson Test of English Academic. London, U.K.: Pearson Education Ltd. Zwick, R. (2002). Fair game? The use of standardized admissions tests in higher education. London, U.K.: Routledge. Zwick, R. (2017). Who gets in? Strategies for fair and effective college admissions. Cambridge, MA: Harvard University Press.

2 ASSESSING ACADEMIC READING

Mary Schedl, Tenaha O'Reilly, William Grabe, and Rob Schoonen

Reading comprehension is widely agreed to be not one, but many things. At the least, it is agreed to entail cognitive processes that operate on many different kinds of knowledge to achieve many different kinds of reading tasks. Emerging from the apparent complexity, however, is a central idea: Comprehension occurs as a reader builds one or more mental representations of a text message…. Among these representations, an accurate model of the situation described by the text… is the product of successful deep comprehension. (Perfetti & Adlof, 2012, p. 1)

2.1  Introduction

In the past decade, the construct of reading has evolved and innovations in theory, measurement, and delivery systems offer an opportunity to reexamine how L2 reading should be assessed (Sabatini et al., 2012). This chapter presents an overview of the research on reading ability for first (L1) and second (L2) language learners as well as its implications for L2 reading assessment. Specifically, the chapter covers the key theoretical models of reading as well as their implications for the assessment of large-scale, high-stakes L2 academic reading, examining how the construct has evolved and possible new ways to capture these changes in an assessment context. We argue that academic reading is a purpose-driven endeavor. In academic contexts, these purposes often require comprehension of conceptually new (and sometimes challenging) concepts, summary and synthesis skills beyond general main idea comprehension, and uses of information from reading in combination with writing, speaking, and listening. Skilled academic reading also often requires that learners seek, evaluate,
and integrate multiple-​source documents to achieve a particular aim. Given the expansion of digital technology, modern readers are often immersed in a multimedia context as they read and learn new information to achieve their reading goals. As we examine the construct of academic reading, it is also important to recognize that more advanced reading skills are still built on a foundation of basic word recognition, vocabulary and grammar knowledge, reading fluency, and main idea comprehension. We consider the relationship between first and second language reading abilities in Section 2.2.We review major models of reading comprehension in Section 2.3. In Section 2.4 we consider the implications and limitations of these models for developing purpose-​driven tasks for assessment and the components that need to be considered in designing an assessment. In Section 2.5 we consider the influence of non-​print reading media in academic reading and assessment, and in Section 2.6 we consider possible new academic reading tasks for an expanded construct. Directions for future research and development are discussed in Section 2.7.

2.2  The Relationship between L1 and L2 Reading Ability

In this section, we first highlight similarities in reading abilities across various L1s and in reading abilities across L1s and L2s. Over the past decade, considerable evidence has emerged to support both patterns of similarities in reading abilities. These overall similarities allow us to draw upon both L1 and L2 research findings to build arguments for the construct of reading abilities. The case for reading comprehension similarities across differing L1s has been explored in many studies comparing two or more specific languages. There are obvious cases of specific differences in phonology, orthography, vocabulary, and grammar; however, the processes used in each language to develop reading comprehension skills look remarkably similar, particularly at more advanced levels of reading. Even across languages like Chinese and English, or Hebrew and Spanish, for example, the script differences, as well as morphological and syntactic differences, do not override similar linguistic, cognitive, and neurolinguistic processes engaged in reading comprehension, and component abilities that contribute to reading comprehension are remarkably similar (see Frost, 2012; Perfetti, 2003; Seidenberg, 2017; Verhoeven & Perfetti, 2017 for more details). Much as with the case of different L1s, many empirical research studies have indicated great similarity in the overall patterns of component-skills development across L1s and L2s (see Farnia & Geva, 2013; Geva & Farnia, 2012; Jeon & Yamashita, 2014; Kieffer, 2011; Lesaux & Kieffer, 2010; Lipka & Siegal, 2012; Melby-Lervåg & Lervåg, 2014; Nassaji & Geva, 1999; Van Steensel et al., 2016; Verhoeven & Perfetti, 2017; Verhoeven & Van Leeuwe, 2012). Despite some obvious differences in early reading development across L1 and L2 situations, these differences diminish in importance as readers develop stronger reading comprehension skills in their L2s. As lower-level processes become more fluent and automatized, the higher-level
skills critical for academic reading purposes impose greater similarity on reading processes. When one looks at specific linguistic component skills that are best predictors of reading comprehension, it is not surprising that similarities dominate: Word identification, vocabulary, morphology, and listening comprehension all generate high correlations with reading comprehension in both L1 and L2 contexts. Passage fluency, morphology, and grammar seem to be stronger discriminating variables for L2 reading skills than for L1s, but all are important variables for L1 readers at lower proficiency levels. In addition, similar strategy uses such as inferencing, monitoring, summarizing, and attending to discourse structure seem to distinguish stronger from weaker readers in both L1 and L2 contexts. Among the ways that L2 reading differs from L1 reading abilities, most of these differences center, either directly or indirectly, on proficiency levels of the language and linguistic resources that a reader can bring to bear on text comprehension (Farnia & Geva, 2010, 2013; Geva & Farnia, 2012; Melby-​Lervåg & Lervåg, 2014; Netten et al., 2011;Trapman et al., 2014;Van Daal & Wass, 2017). Other differences center on reading experiences. The linguistic and processing limitations (e.g., decoding, word recognition, vocabulary, syntax) create real L1–​L2 differences until the L2 linguistic resources and processing capabilities have grown sufficiently strong and fluent. The lesser overall amount of exposure to reading in an L2 is a limitation that is only overcome by more practice in reading comprehension in the L2. With respect to underlying cognitive processes that contribute to reading comprehension skills, they are involved in L1 and L2 reading in similar ways because they are more general learning abilities (goal setting, inferencing, comprehension monitoring, working memory, speed of processing, motivations for reading, etc.). These cognitive skills and processes are not specific to only L1 or L2 contexts, and they are not specific to language learning as opposed to other types of learning during child development. Decades ago, Alderson (1984) posed the question of whether reading in a foreign language was a reading problem or a (L2) language problem, and he speculated that it was probably more of a language problem before a language threshold was crossed. Since then many studies have decisively supported this view (see Alderson et al., 2016; Farnia & Geva, 2013; Jeon & Yamashita, 2014; Kieffer, 2011;Yamashita & Shiotsu, 2017 for overviews). Learning to read in an L2 is primarily a process that relies on developing L2 language skills until an advanced point where L2 language skills automatize in ways similar to L1 reading skills. Once these foundational skills are automatized, other influences such as more advanced academic reading purposes may separate learners. At a different level of L1–​L2 comparison, several models of reading have been proposed for L1 reading comprehension, but fewer have been proposed for L2 reading. Of interest is whether measurement models of latent constructs identified in L1 reading research are appropriate in predicting L2 academic reading achievement. The strongest case for theoretical similarities of a reading model across both L1 and L2 contexts is found for the Simple View of Reading (Gough

& Tunmer, 1986; Hoover & Gough, 1990), which has generated a considerable number of studies to verify this similarity. However, the larger issue of models of reading abilities and reading development, while more prominent in L1 contexts, is perhaps problematic in both L1 and L2 contexts. We take up this issue in the following section where we address theories of reading and reading development.

2.3  Conceptualizing Reading Proficiency

Reading is a complex process made up of a wide range of subskills. As such, there are several theories of reading, many of which focus on parts of the overall construct, rather than its entirety. As Perfetti and Stafura (2014) state:

24

2.3.1  The Skilled Reading Construct as a Form of Cognitive Expertise

Skilled (academic or professional) reading is a complex cognitive skill that is a learned form of expertise; it is learned as an incremental process that minimally requires extensive exposure to relevant input and some degree of effective explicit instruction over an extended period of time. Skilled reading is not the same as demonstrating very basic literacy skills, but rather is a higher-order skill that relatively few people achieve (see the results of the National Assessment of Educational Progress (NAEP), the Progress in International Reading Literacy Study (PIRLS), and other large-scale assessment programs). In addition, reading abilities vary by the purposes for reading as well as the overall levels of reading proficiency. So the skills (and the reading construct) hypothesized for adult basic literacy students will differ considerably from the skills central to early elementary reading development, and both will differ from skilled academic reading abilities that develop and grow from secondary school contexts through to the end of graduate studies. This academic reading is the complex cognitive ability that is typically the upper bound of the reading construct to be measured by large-scale assessment programs, and it is the appropriate target for proposing a reading construct to guide planning for a large-scale, high-stakes academic reading assessment. Skilled academic reading is a type of cognitive expertise as examined through decades of work by Ericsson and others (see Anderson, 2015; Ericsson, 2009;
Ericsson & Pool, 2016). As such, it requires thousands of hours of practice, careful support by mentors/​teachers, clear goals, self-​regulation and motivation, and consistent deliberate practice. A central question for assessment designers is if this type of expertise can be captured in a current model of reading. If we expect a model of reading to provide precise predictions that can guide assessment (and instructional) development, the answer is no. The complexity embodied by the skilled (academic) reader can be described in terms of many subskills, knowledge bases, environmental factors, and cognitive processes that have been shown to impact the development of reading abilities. But no one theory of skilled reading abilities has been developed that can account for this complexity. Moreover, many of the component variables that contribute to skilled reading can be subdivided. A reasonable conclusion to draw is that there is no single best model of reading that can be empirically validated and encompass all potentially relevant variables. A second conclusion is that skilled reading is a complex cognitive ability that is a form of learned expertise (see also Alexander & the Disciplined Reading and Learning Research Laboratory, 2012; J. R. Anderson, 2015). As such, no one model of reading can provide a precise predictive account of how “reading” works.

2.3.2  Conceptualizing Reading Abilities

Given the complexity described above, there are multiple theories that describe reading ability and each theory tends to emphasize various facets of the "construct." Most theories of reading seek to provide global "explanations" of reading comprehension abilities (more or less fully articulated), but not precise models that can be falsified or make specific predictions of learning performance. The goal of global frameworks is to "explain" how a reader forms a coherent and appropriate interpretation from a text. Generally, these comprehension models are based on the integration of language skills, knowledge bases, and language processes leading to a coherent construction of a text model and a situation model. The text model reflects what the author is trying to convey but may not be precise, and the situation model reflects what the reader decides to take from the text as the information to be retained in long-term memory. One drawback for these global models is that the higher-level processing to build a text model and situation model must be built upon the many lower-level processes and knowledge bases supporting comprehension processes, not all of which are explicitly articulated. Nevertheless, a number of models of reading comprehension have been proposed and supported by empirical studies that (1) measure multiple component skills of reading comprehension (while also controlling for others), and compare these measures to measures of reading comprehension (using correlation, regression, and a number of complex multivariate techniques); (2) focus on skill differences across groups or readers; and (3) use a number of other (usually multivariate) research techniques to add converging support.

In this section we will review four of the most prominent theories of reading comprehension and reading development. These theories include the Simple View of Reading (SVR), the Construction–Integration Model (C–I), the Landscape Model (LM), and the Documents Model (DM). We will then propose that reading comprehension abilities are most likely best "captured" by the more cautious "frameworks" approach proposed by Perfetti and Stafura (2014; see also Perfetti et al., 2005). This approach recognizes the key contributions of the reading models noted above, but also points out that each model does not capture the full range of factors contributing to reading ability. While there is still no single comprehensive model of comprehension (Perfetti & Stafura, 2014), McNamara and Magliano (2009) point out that differences across the models below stem primarily from the different foci of the models and from the different comprehension situations they describe, and are not necessarily contradictory. Models that propose relationships between cognitive skills and linguistic components, such as those discussed below, make it possible to research hypotheses about sources of comprehension weakness that can provide empirical evidence relevant to language teaching, learning, and assessment.

2.3.3  The Simple View of Reading

One of the most widely cited models of reading comprehension is the Simple View of Reading (SVR; Gough & Tunmer, 1986; Hoover & Gough, 1990). Its attractiveness as a model might be due to its parsimonious features. The SVR contends that there are two overarching components to reading ability, decoding (efficient word recognition) and language comprehension. Reading ability is the product of separate decoding and language comprehension abilities such that each component is necessary but not sufficient to ensure adequate reading comprehension. Thus, if students have no decoding ability, they cannot read even if they have adequate language comprehension. In short, both decoding and language comprehension are needed both to comprehend text and to become a skilled reader. Hoover and Gough (1990) demonstrated the validity of the SVR formula in a study of 250 K-4th grade children from 1978 to 1985. High correlations were found for reading comprehension scores based on the formula, and numerous studies since have supported the SVR in L1 language research. Strong support has also been found with L2 readers (Mancilla-Martinez et al., 2011; Spencer & Wagner, 2017; Verhoeven & Van Leeuwe, 2012). When comparing English language learners and students whose first language was English, Farnia and Geva (2013) found support for an augmented SVR that includes cognitive (nonverbal ability and working memory), word-level, and language components. Yamashita and Shiotsu (2017) investigated L1 reading and L2 listening together with L2 linguistic knowledge and found that over 90% of L2 reading variance in their Japanese university sample was explained by these predictors, with most of the shared variance being accounted for by the L2 factors.
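
The multiplicative relationship at the core of the SVR can be stated compactly. The following is a schematic rendering of the formula as it is usually presented (the notation here is illustrative, with each component treated as a proportion between 0 and 1):

\[
RC = D \times LC, \qquad 0 \le D \le 1, \quad 0 \le LC \le 1
\]

where \(RC\) is reading comprehension, \(D\) is decoding (efficient word recognition), and \(LC\) is language comprehension. Because the relationship is multiplicative rather than additive, a value of zero for either component yields zero reading comprehension, which captures the claim that each component is necessary but neither is sufficient on its own.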

Overall, student populations examined for the SVR have included primary-​ grade students through late secondary-​level students, strong readers and weak readers, average performing students and students with reading disabilities. In almost all cases, the outcomes show that word recognition and language skills, in some combination, account for a large percentage of shared variance with reading comprehension, and measurement outcomes vary in expected ways in longitudinal studies. The obvious strength of the SVR is that the approach statistically captures a large share of what readers seem to be doing when they read. The SVR has additional strengths in that the major variables tend to absorb, or account for, covariance from other potential variables. In this way, word recognition skills and language comprehension skills account for other potential contributing variables and thus provide parsimony for the theoretical explanation (Kim, 2017). Moreover, the SVR has the appeal of being very easy to understand and very easy to suggest implications for instruction. In all of these respects, the SVR is remarkably robust and says something very important about reading comprehension abilities across individuals, including L2 readers. Of course, with such a simple view as a sole explanation, there are inevitably drawbacks. The obvious limitation of this approach is that it leaves out many variables seen as contributing to reading abilities (e.g., reading span, background knowledge, text structure, inferencing, etc.). It also does not identify the processes of comprehension that are so central to the other models discussed. In essence, we do not know how it works to generate reading comprehension. In addition, the SVR is simply a language measurement model, and thus is not really a model of reading comprehension at all as much as a fact about some key reading-​related variables. However, it is an excellent and well-​established fact overall, and one that can promote many useful instructional and assessment applications.

2.3.4  The Construction–Integration Model

Kintsch's (1998, 2012) Construction–Integration (C–I) model of comprehension focuses on the psycholinguistic processes that allow comprehension to occur. It emphasizes the construction of mental models and the integration of text with prior knowledge. A reader's "mental model" of a text is a network of interrelated syntactic- and semantic-based propositions having two components – a textbase consisting of information derived directly from the text (the text construction part of the model) and a situation model consisting of information from the reader's background knowledge and experience that integrates prior linguistic and world knowledge with the textbase (the integration part of the model). This integration process improves coherence, completeness, and interpretation of the textbase. Integration is not only necessary for comprehension, but also for learning from text. In the creation of a textbase, irrelevant associations are commonplace because words call up numerous independent associations (e.g., bears and honey versus
bears and football). Background knowledge is needed for text comprehension but also for making correct inferences from the text.The integration process establishes the appropriate meaning in the context and the quick deactivation of contextually inappropriate meanings. Text comprehension is the result of connecting the textbase with relevant background knowledge. The C–​I model offered several strengths as a theoretical model. The construction part of the clause-​by-​clause process developed propositional structures to specify how the syntactic-​semantic units in a text could be combined to form text comprehension. The idea that there is a text model and a situation model to be constructed allowed a way to incorporate inferential support for text comprehension that took account of a reader’s background knowledge and provided a way for reader effort to account for comprehension processes with more challenging texts. In the C–​I model, background knowledge is critical because authors often do not make everything explicit and thus readers need to make knowledge-​ based inferences to fill in conceptual gaps (McNamara et al., 1996; McNamara & Magliano, 2009; O’Reilly & McNamara, 2007). In an L2 context, where readers may not be familiar with culture-​ specific concepts and phrases, background knowledge may play a role in comprehension and learning (Chung & Berry, 2000; Purpura, 2017). However, the precise role of background knowledge in comprehension and learning in L2 contexts needs more research (Purpura, 2017).

2.3.5  The Landscape Model

Like the C–I model, the landscape model (LM) of reading proposed by Van den Broek et al. (1999) focuses on the relationship between comprehension processes and the memory representation that is created and updated in real time during reading. The idea is that the concepts and propositions activated during reading fluctuate as readers move through text. The activation process is conceptualized in accordance with models of working memory, like the C–I model, and the outcome of the reading process is viewed as a mental representation of the text. Each new reading cycle offers four potential sources of activation, similar to the C–I model: (1) the text currently being processed, (2) the immediately preceding reading cycle, (3) concepts processed in earlier cycles, and (4) background knowledge. Each of these potential sources is available for activation as readers process a text. In a sense, the LM is an early version of the "word-to-text integration" process more recently developed by Perfetti and Stafura (2014), which focused more on the role of syntactic parsing and the linked semantic propositions. A "landscape" of activation emerges when one considers the "peaks" and "valleys" of the more or less activated concepts and propositions representing the text. The processing of concepts is accompanied by cohort activation (the activation of other connected concepts). It is assumed that the amount of activation of secondarily derived concepts is based on the strength of their relationship to the
primarily derived concept and the cohorts of concepts that appear across a text as new concepts are added and new associations are formed. Thus, a key contribution of the LM is that comprehension is described as a process that builds on prior cycles as new sources of information and background knowledge are activated. In short, readers build a mental model that contains the central concepts and thematic ideas as they move from one sentence to the next.This is an ongoing process that changes throughout a text and will alter the thematic core being developed. Accordingly, the landscape changes in line with what ultimately is emphasized in a text. Attentional resources are limited, however, and activation depends on working memory. Individual readers differ in numerous ways that influence their comprehension. Working memory, processing abilities, background knowledge, syntactic and semantic knowledge, etc. all influence the resources available to the reader in making sense of the text. In addition to working memory constraints, the level and depth of comprehension is also dependent on the readers’ “standard of coherence,” or their criteria for determining whether the text makes sense or not (Van den Broek et al., 1995). The LM’s predictions about the fluctuating activations during reading were evaluated in several experiments, suggesting that the model captures important aspects of the cognitive processes taking place during reading. For instance, elements of the LM have been evaluated by examining the effects of reading purpose and working memory capacity (WMC) on text processing (Linderholm & Van den Broek, 2002). Linderholm and Van den Broek (2002) argued that readers attempt to achieve standards of coherence that are based on their reading purpose and that they adjust cognitive processes to fit different reading purposes, such as reading to study versus reading for entertainment. In a study designed to look at performance of low-​versus high-​working memory capacity readers across these two different reading purposes, the authors found that both high and low WMC readers adjusted their reading processes in a similar fashion with similar results when their purpose was entertainment. When their purpose was studying, however, although both groups adjusted their reading processes, the high WMC readers used more demanding processes that were more beneficial for recall. When a low standard of coherence is chosen, then, gaps in understanding are acceptable to the reader and some errors in understanding are tolerated. In contrast, when readers adopt a high standard of coherence, they expend additional effort to deepen and embellish understanding to ensure that intricate details of a text are integrated into a coherent model, and they also monitor and repair breaks in their understanding. Clearly, the purpose for reading and the standard of coherence adopted while reading influence what is read, how it is processed, and how deep is the level of understanding that is reached. However, further work is necessary to determine how factors such as a reader’s standard of coherence and other components of the LM predict and explain ELL reading performance.
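
To make the LM's cycle-by-cycle account more concrete, the fluctuation of activation can be sketched schematically. The following is an illustrative formalization based only on the four sources of activation and the working-memory constraint described above; it is not the published parameterization of the model, and the symbols are ours:

\[
A_i(c) = f\big(T_i(c),\; A_{i-1}(c),\; R_i(c),\; B(c)\big), \qquad \sum_{c} A_i(c) \le W
\]

where \(A_i(c)\) is the activation of concept \(c\) in reading cycle \(i\), \(T_i(c)\) is support from the text segment currently being processed, \(A_{i-1}(c)\) is carry-over from the immediately preceding cycle, \(R_i(c)\) is reinstatement of concepts processed in earlier cycles, \(B(c)\) is activation from background knowledge, and \(W\) is the reader's working-memory capacity, which caps the total activation available in any cycle. On this sketch, a reader's standard of coherence and purpose for reading determine how much of that limited capacity is devoted to reinstating earlier material and generating inferences, consistent with the Linderholm and Van den Broek (2002) findings reported above.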

2.3.6  The Documents Model

The documents model (DM) is an effort to extend an explanation of comprehension processes from a single text to an explanation of how a reader would construct a situation model based on multiple texts. Centrally important to the DM is the establishment of a purpose for reading as well as a specific task as a reasonably concrete outcome or goal. Tasks, academic or professional, often require that a reader turn to multiple texts to seek out the most important information to report, the most effective way to present information, and the most appropriate synthesis of information, given a specific task and a specific audience. Efforts to develop a multi-source documents model have been under way over the past 25 years, and this extension of a single-text situation model may be very important for advanced reading instruction and reading skills assessment (Britt et al., 2013; Britt et al., 2018; Perfetti et al., 1999; Rouet, 2006; Rouet & Britt, 2011). One line of early exploration for modeling comprehension from multiple documents was the work of Perfetti and colleagues in the early to mid-1990s (Perfetti et al., 1995; Perfetti et al., 1999). A six-week experimental class was created to read multiple documents on the history of the Panama Canal. A range of data on student reasoning and writing production examined how students acquired a multi-text understanding of the issues around the handover of the Panama Canal. From this work and related studies, Perfetti et al. (1999) created a formalized DM of reading comprehension and synthesis. The model built upon Kintsch's C–I model by adding two layers of representation, the intertext model which includes source information (e.g., who wrote the text and why), and the integrated mental model of the situation that is combined with background knowledge (e.g., resolution of two competing explanations for a phenomenon). For the purposes of this chapter, the DM offers perspectives on how reading synthesis can emerge from reading multiple texts (and more recently, web-based texts) and what might be possible as measurements of multi-text syntheses. The four models of reading reviewed to this point all build on a similar set of processing ideas that move from word identification to text formation, to semantic integration with some emerging set of linked main ideas. The next model emerges out of a very different orientation of experimental research.

2.3.7  Reading Systems Framework Approach: A Way Forward

Perfetti and Stafura (2014) and Stafura and Perfetti (2017) have argued that explanations for the development of reading abilities and the process of skilled reading should be described somewhat more cautiously through their Reading Systems Framework. The genesis of this Reading Systems Framework approach goes back 40 years, starting from Perfetti's Restrictive-Interactive model of reading (Perfetti, 1985). From work in the late 1970s to the present, Perfetti has been interested in the various pressure points among processes that contribute to
reading comprehension abilities. These likely pressure points, or bottlenecks in processing while reading that lead to reading difficulties (and poor comprehension outcomes), were first outlined in Perfetti (1985). Initial explorations of the Restrictive-​Interactive model highlighted the importance of word recognition, partly to respond to unconstrained top-​down models of reading, and partly to identify the importance of the lexicon and the incremental accretion of lexical depth that is key to the many automated processes needed for comprehension (see Perfetti, 1991). Perfetti (1999) extended the idea of pressure points in a first outline of a reading systems framework, and identified key elements needed to take a reader from visual processing to higher-​level comprehension abilities (see Figure 1 in Perfetti & Stafura, 2014). Perfetti et al. (2005) further refined the set of contributing linguistic and cognitive variables likely to be pressure points in comprehension.Their reported path analysis was based on a six-​year longitudinal study by Oakhill et al. (2003; see also Oakhill & Cain, 2012) that followed a cohort of students equivalent to students from 2nd grade to 8th grade. The key idea behind the use of the Reading Systems Framework was to identify important variables impacting reading comprehension while also generating additional research that expanded our understanding of the major processes involved in reading. At about the same time as the first statement of a reading systems framework, Perfetti and colleagues presented their first version of the Lexical Quality Hypothesis, identifying lexical quality (lexical depth variables) as key to several further reading comprehension processes (Perfetti, 2007; Perfetti & Hart, 2002). The major innovation to this approach was the assumption that successful reading is largely dependent on word knowledge and that problems in comprehension are often caused by ineffective, or not fully automated, word identification processes. A high-​quality lexical representation consists of three fully specified and very accurately processed constituents: The orthographic, phonological, and semantic specifications of the word. The orthographic representations are precise. The semantic representations include many rich associations with other word meanings and a high level of integration with many sorts of world knowledge. Fully specified and highly accurate representations emerge incrementally from extensive exposure to print over time and automatization of lexical processing. Lexical quality grows word by word for each individual. Lexical quality can be associated with lexical depth but the purpose is not to contrast lexical breadth and depth. The goal is to turn more words into higher-​quality representations over time, and many words become sight words for automatic access and automatic associations with a wide range of other knowledge –​lexical, syntactic, and discoursal (Landi, 2013). Any representation is considered low quality that does not specify the value of all three constituents. According to the hypothesis, skilled readers are advantaged in having many high-​quality word representations, while less skilled readers have fewer high-​quality representations. A high-​quality lexicon creates spaces for new word forms with certain associations to fit into well-​defined networks of lexical items.Thus, a high-​quality lexicon enhances new word learning, and typically with
fewer repetitions of the new word form.This argument explains why learners with larger vocabulary knowledge learn new words more efficiently, particularly with incidental word learning while reading. It also highlights the idea that lexical quality and new word learning from incidental exposure are both mostly matters of statistical learning processes and implicit learning through additional reading experiences. The Lexical Quality Hypothesis also converges nicely with Share’s (2008) concept of self-​teaching of vocabulary through bootstrapping. The Lexical Quality Hypothesis has made several predictions that draw together much empirical research on vocabulary learning. In addition, the importance of a large vocabulary for reading success has been demonstrated extensively in L2 as well as L1 research (Alderson et al., 2016; Elgort et al., 2016; Elgort & Warren, 2014; Jeon & Yamashita, 2014; Kieffer & Lesaux, 2012a, 2012b; Lervåg & Aukrust, 2010; Lesaux & Kieffer, 2010; Li & Kirby, 2014; Lipka & Siegal, 2012;Van Steensel et al., 2016). The Reading Systems Framework not only specifies a tightly integrated and precise word recognition process; it also proposes that the output of lexical access feeds a second major reading comprehension process, the “word to text integration” process (Perfetti & Stafura, 2014; Perfetti et al., 2008).This process takes the lexical information as it is processed in real time as syntactic parsing (see Figure 3 from Perfetti & Stafura, 2014). The parsing of the sentence structure adds a further contribution to the high-​quality lexical items by forming semantic propositions to be linked into an ongoing overall set of highly active main ideas that are an emerging text model of the text being read.This process of word to text integration converges with the C–​I model and the LM in terms of building real-​time comprehension, but it includes a role for syntactic information contributing to reading, and the rich lexical information provides additional support for how the text model holds together. A further outcome of strong lexical quality as part of the word to text integration is that the need for inference generation is strongly constrained to information in the text model. For example, anaphora resolution requires finding linkages to existing referents in prior text (e.g., “she” refers to “Jill”). Perfetti et al. (2013) extend this argument to suggest that any strong association between a current word being processed and relatable words by some form of strong association in prior text triggers a word-​to-​word linkage without need for the concept of inferencing. This extension reinforces the potential power of thousands of hours of reading practice as well as the hidden role of implicit learning of lexical items. However, Perfetti recognizes that very challenging texts and tasks with new information to be learned will likely require inferences and strategic processing that are normally associated with the strategic reader (Grabe, 2009). In this regard, there is also a place for reading strategies and problem solving often found with advanced academic tasks. The work by Perfetti and colleagues on the DM also extends the Reading Systems Framework approach to reading syntheses across multiple documents.

The point of this lengthy account of Perfetti's 40-year research agenda, and of the arguments supporting the Reading Systems Framework, is that it forms a foundation for examining all facets of reading comprehension processing. It does so in a way that allows the causes of poor comprehension to be explored within a flexible theoretical context. The framework itself can be expanded to incorporate new findings without undermining overall conceptions of reading comprehension knowledge bases and processes. The framework provides a useful way to synthesize research results of many types into a manageable interpretation of how reading comprehension most likely works in most contexts. In effect, it provides a good roadmap for developing a fairly inclusive construct of reading and for identifying factors impacting reading comprehension, reading instruction, and reading assessment practices.

2.4  From Theoretical Models to a Framework for the Assessment of Academic Reading Abilities

Most major L1 and L2 assessments today are based on test frameworks and test specifications that consider various components of the models of reading comprehension above but require decisions about which components to assess and how. Depending on the test purpose, different components of the construct will be more or less central. A diagnostic test and an admissions test have different purposes and tap into different test constructs. For example, in a diagnostic test, more basic skills are important; in an admissions test, it is mostly the situation model that matters. An assessment of academic reading abilities will focus on higher-level abilities that allow readers to construct meaning from text, such as the abilities to locate, connect, and synthesize information. For example, skills tested by the Academic Reading Test of the International English Language Testing System (IELTS) include "reading for gist, reading for main ideas, reading for detail, skimming, understanding logical argument and recognizing writers' opinions, attitudes, and purpose" (https://ielts.org/en-us/about-the-test/testformat). And the Test of English as a Foreign Language (TOEFL) assesses "basic information skills, inferencing skills, and reading to learn skills" using a variety of question types (The Official Guide to the TOEFL test, 2017). Of major importance in the assessment decision-making process is the desire to select representative academic reading abilities from which reasonable inferences about a test taker's overall ability can be drawn. In this respect, the C–I model, the LM, the DM, and the Reading Systems Framework offer guidance. Depending on the text and the purpose for reading, readers will need to apply inferencing and monitoring skills, and engage in more overt reasoning. The LM and the DM focus on processes critical for more advanced reading comprehension with challenging texts. For an assessment of academic reading, individual tasks will often require the integration of information across different parts of the text, and successfully answering comprehension questions will require successfully connecting relevant information. The C–I model introduced the idea of a situation model of comprehension formed at the same time as the text model,
essentially infusing reader knowledge and perspective into the text model as a situation model of comprehension – reflecting the information that is actually learned from a text and stored, however strongly or fragilely, in long-term memory. Any assessment of academic-level reading needs to assess the extent to which the reader has successfully created a situation model of the text. It is assumed that higher-level abilities such as inferencing, synthesizing, and so on depend on word- and sentence-level comprehension resources and all linguistic and writing system components of the Reading Systems Framework. It is also assumed that assessing information points across a reading passage using a variety of task types will activate processing of individual concepts across the landscape of the text and the attempt to form a mental representation of the text, as hypothesized in the C–I and Landscape models.

[FIGURE 2.1  Outline of a reading comprehension assessment model. The figure shows the interactional components of academic reading: the academic reading task; reader purposes/goals (finding information, general comprehension, learning from texts, evaluating information, integrating information across texts); characteristics of text (text type, rhetorical structure, text features); and reader linguistic and processing abilities (awareness of text structure, background knowledge, engagement of comprehension strategies, morphological and phonological knowledge, semantic knowledge, syntactic knowledge, text processing abilities, word and sentence recognition/vocabulary knowledge, and working memory efficiency/pattern recognition).]

In Figure 2.1 we outline an assessment model somewhat related to the RAND heuristic (Snow, 2002) that highlights the importance of reader characteristics, the text, and the activity. For academic reading assessment purposes, three of the components of this model to consider are: (1) the tasks: what are the most relevant, tenable assessment tasks that will provide information about the test takers' abilities? (2) the purposes for academic reading: what purposes should the tasks elicit from test takers to motivate their search for information to answer the questions? and (3) the texts: what kinds are academically appropriate and useful for generating appropriate tasks and purposes? The fourth component, the readers' own linguistic and processing abilities, will determine the extent to which they are able to respond correctly to the test questions. Test takers' purposes for reading in an assessment will be determined by the test tasks and will also determine what effort they bring to the tasks (what standard of coherence to set). Readers' abilities differ, interact with their purposes and goals, and affect their comprehension.

2.4.1  The Tasks

In real-world academic settings, students are given a wide range of tasks to complete. At a broad level, academic tasks require students to understand text content for which they will be accountable (e.g., when tested or when completing a writing assignment). An academic task (for example, read an article and summarize it) determines the purpose for reading. Assessment tasks designed to mimic academic tasks attempt to elicit from test takers performances similar to reading for these academic purposes. In some cases, tasks are relatively simple. For example, students and test takers are often required to understand main ideas and details, locate and understand factual information, make inferences, and answer basic comprehension questions. In other cases, the tasks are more demanding and involve critical-thinking skills, connecting new information to background knowledge, and integrating and synthesizing information across multiple texts. In both cases, students need to understand the task demands and allocate appropriate resources and effort to complete them, and this is equally true in a test-taking situation. Research is needed to identify, select, and sample academic task types suitable for testing (see Sections 2.6 and 2.7 for a discussion of possible new academic reading tasks and directions for future research and development).

2.4.2  The Reading Purposes

Establishing a purpose for reading and a specific task are critical. In effect, reading comprehension is a purpose-driven activity (Alderson, 2000; Britt et al., 2018; Carver, 1997; McCrudden et al., 2011; McCrudden & Schraw, 2007; Rouet & Britt, 2011). Research has shown that students approach reading in different ways depending upon the specific reading goals (e.g., study or entertainment; see Linderholm & Van den Broek, 2002; Van den Broek et al., 2001). The goals of
reading help define what is and what is not important to attend to and how deeply the relevant material should be processed. While there is a wide range of reading purposes in real-​world contexts, there are specific, task-​driven purposes for reading in academic contexts. These reading purposes are generally thought to include reading to find information, reading for general comprehension, reading to learn, and, more recently, reading to integrate information across texts. Test designers and researchers determine which types of tasks are likely to stimulate academic reading purposes, which types of texts are appropriately academic, which ability components will be tested, and how they will be tested based on the assessment construct.

2.4.2.1  Reading to Find Information

Academic readers need to be proficient at finding discrete pieces of information, which involves rapid, automatic identification of words, working memory efficiencies, and fluent reading rates. This also involves knowledge of text structure (Guthrie & Kirsch, 1987; Meyer & Ray, 2011), genre features, and discourse markers to help make the search process in long and complex texts more structured and efficient. Reading purposes that require students to search for information often involve locating small units of text rather than understanding them in light of the larger text as a whole. The ability to locate information is particularly critical in digital reading (see Coiro, 2009; Leu et al., 2013a).

2.4.2.2  Reading to Learn and Reading for General Comprehension In the C–​I model, Kintsch (1998, 2012) distinguishes between a reader’s “text model” and the creation of a “situation model” on which learning from the text is based. Skilled readers update, refine, and build upon situation models when new information is available (Gernsbacher, 1990, 1997). This type of reading, typically in longer texts, requires additional abilities beyond those required for more local and basic comprehension. Skilled readers need to monitor their understanding to ensure that the global representation is coherent, consistent, and in line with the processing demands of the task. More complex purposes demand that students adopt higher “standards of coherence” or deeper levels of understanding, as demonstrated by Van den Broek et al. (1995, 2001, 2011). In learning from text, comprehension difficulties such as gaps in understanding must be resolved. However, it is important to recognize that, to the extent that a reader’s language knowledge (decoding, vocabulary, fluency, etc.) and/or processing capacity is inadequate, complete understanding of the text and, hence, learning from it, will be impeded or limited. Perfetti and Stafura’s (2014) Reading Systems Framework includes language comprehension components that are especially critical for L2 readers. The L2 test populations for academic reading assessments generally have strong reading abilities in their L1 but may have weaknesses in their L2 language
knowledge (morphology, vocabulary, syntax) and associated processes in fluent passage reading. Not only greater effort but also additional practice in reading and improving language knowledge will be required for them to understand increasingly difficult texts in the second language.

2.4.2.3  Integrating Information across Multiple Texts The types of reading discussed above involve forming a more global representation of a single text. However, in academic contexts, assignments often involve asking students to answer “big questions” that span multiple texts. It is argued that the establishment of the reading purpose and specific goals is even more important for the creation of a multiple-​text representation.While particular texts are written for specific purposes, there is no guarantee that any one document contains the answers to a specific reading goal. In light of the increased availability of information on the internet, coupled with the specificity of reading goals, now more than ever people need to be able to read, understand, and integrate multiple documents (Britt et al., 2018; Lawless et al., 2012; Leu et al., 2013b; McCrudden et al., 2011; Rouet & Britt, 2011; Strømsø et al., 2010; Wolfe & Goldman, 2005). Like understanding information in single texts, multiple-​text understanding requires the reader to evaluate text and extract the basic meaning. However, unlike single-​text comprehension, multiple-​text understanding involves additional processing. Perfetti et al. (1999) introduced the term DM to refer to the mental representation of multiple documents. The DM describes the reader’s evaluation of and understanding of (a) the information presented in each individual text, (b) how the various texts are related, and (c) an interpretation or synthesis of the big picture implied by the collection of texts as a whole. Information that is consistent across documents is likely to be included in the reader’s DM, while information that is discrepant is more likely to be excluded from the DM. The DM and newer theories of multiple-​text comprehension (e.g., MD-​ TRACE, Rouet & Britt, 2011) underscore the increased processing demands when students integrate information from multiple sources versus when they read a single source in isolation. Not surprisingly, many students have difficulty integrating and synthesizing information from multiple texts (Britt & Rouet, 2012). When reading multiple texts, the documents are often written by different authors, contain different levels of text complexity, offer multiple perspectives, and are sometimes selected from different genres. More importantly, when interacting with multiple texts, there are no explicit connections among the texts; all similarities, differences, and elements related to the reader’s goal have to be inferred. This process can be very demanding, and it can break down at many points. For instance, when reading a single text, most of the connections among key ideas are made explicit or implied by the single author. In contrast, when reading multiple texts, common ideas need to be identified and integrated in ways that were often not intended by any of the authors of the individual texts. Multiple-​text
comprehension is not new, but the prevalence of electronic media with its vast amounts of diverse information increases the need to be proficient with online multiple-​source reading skills.

2.4.3  The Texts In designing a reading assessment, thought must be given to both the types of texts to be used to measure reading abilities and to the abilities to be measured. The success with which any given reader accomplishes a reading task depends on both the reader’s linguistic and processing abilities and the text and task characteristics. Decisions must be made about the characteristics of the texts to be used because text characteristics will influence test-​taker performance. However, types of text and difficulty characteristics of texts are for the most part not sufficiently specified in reading models to guide assessment frameworks.Therefore, if an assessment is designed to assess the reader’s ability to carry out academic reading tasks, it is usually thought appropriate to use academic-​like reading materials, such as the various types of exposition, description, and argumentation. Most large-​ scale assessments have associated reading frameworks or test specifications that include detailed descriptions of the text types and genres included in the assessments. For example, the TOEFL internet-​based test (iBT) framework considered grammatical and linguistic text features and specified appropriate types of texts by pragmatic features and corresponding rhetorical features (see Enright et al., 2000 for a detailed description). From these types of texts, appropriate tasks can be designed to assess academic reading abilities. It is important that text complexity mirror real-​world academic text complexity in terms of vocabulary and density, for example, because the test purpose is to assess test-​taker abilities to read these types of text. Questions about the texts will discriminate test takers based on their ability to understand the linguistic content of the texts and the efficiency of their text processing abilities, important components of the Reading Systems Framework. Test questions based on linguistically challenging parts of a text are likely to be difficult, even if they test discrete propositional content.When reading for general comprehension, readers might be able to skip over some difficult parts and still get a general, high-​ level understanding of the text. But if understanding of these parts is necessary to forming a complete text model, not understanding them or failing to make correct inferences based on them will result in an incomplete situation model and incomplete learning from the text. It is assumed that those L2 test takers who are able to answer questions about linguistically difficult parts of a text correctly are in a better position to learn from texts, in the ways we have described as reading to learn, because they have superior knowledge of the linguistic and writing system and efficient processing capabilities. In the next section we consider the expanding role of digital reading in academic tasks and how the inclusion of multiple texts in large-​scale assessments of academic reading might expand the construct.

2.5  How do New Media Influence the Reading Construct and the Assessment of the Construct?

Despite recent advances in reading research, most of the empirical work on reading comprehension has been conducted with traditional printed text. Students participating in typical studies are asked to read narrative or expository texts linearly (sentence after sentence, paragraph after paragraph), often on paper-based materials. While paper-based reading has traditionally been the dominant mode of reading, in 21st century learning environments print material represents only a fraction of the reading occurring in academic and naturalistic contexts (Coiro, 2009; Leu et al., 2013b). Modern contexts for reading also involve the use of technology on a host of digital devices. For instance, a recent survey indicated that about 95% of Americans have some type of cellphone, and 77% own a smartphone, up from 35% in 2011 (Pew, 2018). In addition, about two-thirds of Americans own a laptop or desktop computer, about 50% own a tablet, and around 20% own an e-reader. While these figures may not generalize to other countries, many people use digital devices for a wide range of purposes, ranging from checking e-mail and searching for information on the internet to interacting with others on social networking sites and blogs. Thus, the medium and type of materials that students now read have evolved, and assessments that measure new literacies should follow suit. These changes also impact how students learn new material, as sources are often incomplete, contradictory, and sometimes untrustworthy. As such, digital literacy skills need to be cultivated as students search for and learn new information. The prevalence of technology, the internet, and online reading has affected the way that many stakeholders have thought about measuring reading in the 21st century. For instance, all three of the major large-scale international assessments of reading are implementing new assessments that incorporate electronic media and online literacy. At the fourth-grade level, the Progress in International Reading Literacy Study (PIRLS) (Mullis et al., 2009) has developed and implemented an assessment called ePIRLS that is designed to measure online reading (Mullis & Martin, 2015). ePIRLS uses a simulated online reading environment to measure students' ability to navigate, evaluate, integrate, and synthesize online sources in a non-linear environment. Similar trends have occurred for assessments that are designed for older students. Both the reading framework for the Program for International Student Assessment (PISA) for 15-year-olds (Organisation for Economic Co-operation and Development, 2009a) and the literacy conceptual framework for the Program for the International Assessment of Adult Competencies (PIAAC) for adults (Organisation for Economic Co-operation and Development, 2009b) include aspects of technology and digital environments in their frameworks. For instance, the PIAAC literacy framework discusses the impact of navigation on performance and the interactive nature of electronic media and proposes a wide range
of electronic texts that include hypermedia, interactive message boards, and visual accompaniments to text. Clearly, at the international level stakeholders are designing and implementing the next generation of reading assessments with technology and digital literacy in mind.1 In short, technology has the potential to broaden access, facilitate communication, and rapidly expand what it means to read and understand in 21st century learning environments. However, it is somewhat uncertain as to how these new environments will shape the construct of academic reading. On the one hand, technology has the potential to facilitate personalized learning and adaptive testing. On the other hand, technology may also shape the nature of the construct itself. As described below, technology is a moving target, and its effects are hard to predict. Unlike the host of theories and empirical data on reading with traditional printed materials, the empirical and theoretical base on digital literacy is not as extensive, but it is continuing to grow (Coiro, 2009, 2011; Coiro et al., 2018; Leu et al., 2009; Leu et al., 2013a; Leu et al., 2013b; Leu et al., 2014; Sparks & Deane, 2015).The novelty coupled with the ever-​changing face of technology introduces challenges for developing coherent theories of digital literacy (Leu et al., 2013a). For instance, as new theories emerge, unforeseen technologies change the way people interact, think, and process information.This transient technological environment is so fast paced that the theoretical and empirical literature will have difficulty keeping up. Despite these challenges, Leu et al. (2013a) argue there are some common themes, and some generalizations can be made about digital literacy. For example, they argue that online learning is a self-​directed process that involves identifying a problem and locating, evaluating, and synthesizing information. Leu and colleagues also argue that online reading is often a social activity that involves the communication of information to diverse audiences. While the skills involved in online reading are not entirely new, they account for some unique variance over students’ ability to read offline and traditional printed sources (Coiro, 2011). Below, we elaborate on the potential differences between reading with traditional printed text and more modern digital environments. In particular, we argue there are some key differences between online reading and traditional offline reading. These include: The need to evaluate the quality and relevance of sources, the introduction of non-​linear text, the variety of multimedia sources, and the social nature of digital interactions including synchronous and asynchronous communications (Ashely, 2003; Sparks & Deane, 2015; Sparks et al., 2016; Stoller et al., 2018).2

2.5.1  Evaluate the Quality and Relevance of Sources

While the internet has increased access to information for a wider population, this luxury is not without cost. Virtually any person can now publish information on the internet, and there is no central quality control facility to ensure that what is posted is correct and free of bias. In the past, and in current regulated environments
(e.g., academic texts and contexts), most published documents are edited and peer-​ reviewed to ensure editorial and factual quality.With the introduction of the internet, readers need to be especially aware of what they read online and evaluate it on a few dimensions before deciding if the source is credible and “safe” to use. Although there are several ways to evaluate information, Metzger (2007) identified five dimensions that are useful to consider during web-​based learning.These dimensions concern the accuracy, authority, objectivity, currency, and coverage of web-​based information. Accuracy concerns the truth of the information: Is the information valid and error free? Authority refers to the credentials of the author: What are the author’s qualifications (if the author is even identifiable)? Objectivity refers to the concept of bias and involves issues such as distinguishing fact from opinion, identifying conflicts of interest, and monetary gain: Is the author/​source biased in his/​her view? Currency refers to the time at which the information was published: Is the information out of date? Coverage refers to the amount of information contained in the document: Is there enough detail and depth? Is there missing information? As students learn new information, whether in print or digital contexts, they need to exercise these source evaluation and credibility skills to ensure the information is reliable, trustworthy, and accurate. Thus, learning is not a process of absorbing and memorizing information, but also a process of sorting and evaluating it for potential use in relation to one’s goals for reading. In short, Metzger’s (2007) dimensions provide a useful heuristic for helping students evaluate internet sources. In line with the discussion on reader goals and the multiple-​source literature (Britt et al., 2018; Rouet & Britt, 2011), we propose adding a sixth component to Metzger’s model that includes the concept of relevance (McCrudden & Schraw, 2007; McCrudden et al., 2011; O’Reilly & Sheehan, 2009). Relevance refers to the degree to which the source information is important for achieving the reader’s specific goals. When sifting through multiple sources, some of the information will be relevant to the goals of reading while other pieces of information will not. Filtering out irrelevant information can be difficult as students are often distracted or misled by seductive details (McCrudden, & Corkill, 2010). Although students can be trained in web-​based critical-​thinking strategies, obtaining large changes in behavior is often challenging (Graesser et al., 2007). Nonetheless, reflective, strategic metacognitive behavior comprises an important part of learning (Hacker et al., 2009; Schoonen et al., 1998; Van Gelderen et al., 2004). Learning involves seeking information relevant to one’s goals and monitoring progress towards one’s goal. Thus, learning in 21st century environments is possibly more dependent upon the quality of one’s self-​regulation and abilities than it was in the past. More modern, thorough, and empirically based models of how people read in 21st century environments have also been developed. For instance, Sparks and Deane (2015) developed an intricate model of the knowledge, skills, and processes required to be a proficient reader. Their work grew out of an ETS initiative called CBAL, or Cognitively Based Assessment, of, for, and as Learning that was designed
to leverage the research in the learning sciences and instruction into assessment design.3 Specifically, Sparks and Deane proposed a model of research and inquiry that involved three interacting phases: Inquiry and information gathering; analysis, evaluation, and synthesis; and communication and presentation of results. The inquiry phase is characterized by the set of knowledge and skills that orient and direct the inquiry. These include goal setting, planning, and monitoring, as well as gathering sources related to the goal for reading. In addition to formulating and refining research questions, students need to develop plans for carrying out their goal. This involves assessing their information needs relative to their existing background knowledge and conceptual understanding. During the analysis, evaluation, and synthesis phase, students comprehend, evaluate, and reconcile differences among multiple sources. This involves making judgments about whether the sources are useful for the task goals, as well as an appraisal of whether the sources are reliable and credible. Multiple-​text processes are also involved and include the ability to notice the similarities and differences across the arguments, concepts, and perspectives, as well as the ability to reconcile the discrepancies across the sources. During the communication phase, students organize and present their work through writing or other means. This involves attending to the audience and disciplinary expectations of the domain in question. It also involves citing appropriate sources and avoiding plagiarism. While the details of the model go beyond this chapter, each phase is also accompanied by a set of hypothesized learning progressions that help identify what students know and can do in each of the phases. In short, models such as this one are not only useful for identifying what parts of a more complex task a student can or cannot do, but also to potentially use this information to inform instruction. While the research and inquiry model has been developed for students in K-​12, other work has reviewed the literature and provided a definition of digital information literacy for higher education (Sparks et al., 2016) and large-​scale assessment in the areas of online inquiry, collaborative learning, and social deliberation (Coiro et al., 2018).
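For illustration only, the minimal Python sketch below encodes the source-evaluation dimensions discussed in this section (Metzger's accuracy, authority, objectivity, currency, and coverage, plus the relevance component proposed here) as a simple checklist that flags dimensions a reader judges unsatisfactory. The data structure, function, and scoring rule are hypothetical assumptions introduced for this sketch; they are not part of Metzger's (2007) framework or of any operational assessment.

```python
# Illustrative sketch: a hypothetical checklist over the six source-evaluation
# dimensions discussed above. Dimension names follow Metzger (2007) plus the
# added relevance component; the rating scheme itself is an assumption.
DIMENSIONS = [
    "accuracy",     # Is the information valid and error free?
    "authority",    # Is the author identifiable and qualified?
    "objectivity",  # Is the source free of bias or conflicts of interest?
    "currency",     # Is the information up to date?
    "coverage",     # Is there enough detail and depth?
    "relevance",    # Does the source serve the reader's specific goal?
]

def unmet_dimensions(ratings: dict) -> list:
    """Return the dimensions the reader has judged unsatisfactory (False or missing)."""
    return [d for d in DIMENSIONS if not ratings.get(d, False)]

if __name__ == "__main__":
    # A reader's hypothetical judgments about one web source.
    source_ratings = {
        "accuracy": True, "authority": False, "objectivity": True,
        "currency": True, "coverage": False, "relevance": True,
    }
    print("Needs further scrutiny on:", unmet_dimensions(source_ratings))
```

Even in this toy form, the sketch makes the pedagogical point concrete: evaluation is a multidimensional judgment relative to the reader's goal, not a single credibility score.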

2.5.2  Non-linear Text and Hyperlinks

Traditional printed texts contain a series of ideas that are linked conceptually. As such, they are intended to be read linearly. However, in the world of digital literacy, readers can easily deviate across documents (as in the multiple-text example above) or within a document. Many digital texts contain hyperlinks that allow readers to extend their reading on related topics in a non-linear fashion. Hyperlinks may facilitate comprehension by providing readers with access to the meaning of topic-specific vocabulary, or by elaborating on key issues that may be unfamiliar to the reader. When used appropriately, these additional digital supports can support understanding by increasing background knowledge (Akbulut, 2008; Coiro, 2011) or supporting vocabulary development (Akbulut, 2007; Yoshii, 2006). For instance, Akbulut (2008) found that providing annotations
to L2 learners helped students partially compensate for a lack of background knowledge on the topic of the materials. Despite the potential advantages of using hyperlinks to support L2 learning (Chen & Yen, 2013), the evidence on their effectiveness is equivocal. For instance, hyperlinks can be distracting, particularly for low-knowledge or less metacognitively aware students: Salmeron et al. (2010) found that less skilled or low-knowledge readers may select hyperlinks based on their screen position or their own interest rather than the larger goals of the task. This potentially distracting element poses new processing challenges for learners in the digital age. Students must now remain focused on reader goals and be able to quickly identify what is and what is not relevant information to attend to (McCrudden et al., 2011; O'Reilly & Sheehan, 2009). The sheer number of links in digital text, and the variation in their quality, is potentially more demanding on students. This may require more focus and attention as they try to identify and process relevant, high-quality sources. The sheer volume of sources, authors, formats, and types of media also raises issues about cognitive load and its impact on online learning (Bradford, 2011; Chen, 2016). On the one hand, it could be argued that the ability to deal with this variation is construct relevant. That is, the ability to inhibit irrelevant information and to be flexible with a wide range of devices, web formats, and input features is authentic and should be measured. Conversely, assessments need to be feasible and fair, and thus the materials, formats, and context require some degree of standardization. This would not only help formalize the assessment, but may also reduce the cognitive load associated with features that are deemed construct irrelevant. While it is beyond the scope of this chapter to outline which elements of the interface and response formats are construct irrelevant, we believe standardization is necessary, and one consequence could be a reduction in construct-irrelevant elements that increase the cognitive load of the assessment. Future work should explore how human factors principles and cognitive labs can be used to design assessments that measure the construct but remove irrelevant features that increase cognitive load.

2.5.3  Multimedia Printed texts often include illustrations, charts, and graphs to promote interest or to elaborate key points discussed in the texts. Not surprisingly, pictures incorporated in digital texts are effective in supporting vocabulary learning for L2 learners (Yoshii, 2006). However, digital texts can offer much more than pictures, charts, and graphs to elaborate the text content. Unlike traditional printed materials, digital texts often incorporate audio files, videos, and simulations to augment the web page content. Akbulut (2008) found that multimedia such as a video can be a valuable source for learning a second language as it can help improve
comprehension. This effect might be due in part to the fact that videos provide alternate representations that strengthen the encoding of the material (Mayer, 1997; Paivio, 1986), as well as to reduced decoding demands. For instance, according to the Simple View of Reading, both print skills (i.e., decoding, word recognition) and language skills are necessary for understanding (Adlof et al., 2006; Catts et al., 2006; Hoover & Gough, 1990). By eliminating the printed code and replacing it with an audio/video source, readers use their language skills to comprehend the video, and any weaknesses in word recognition and decoding skills will be masked. Thus, if students learn new information by audio and video, they may be able to demonstrate their oral comprehension while their print skills remain weak. Although such media aids may help students who are weak in foundational print skills, they do not solve the problem of improving long-term reading ability. Students still need to be able to read printed material with fluency and comprehension. Moreover, while multimedia can provide additional tools to support student comprehension, the variety and quality of these materials may place extra demands on students' attention and their ability to maintain focus on task-relevant goals.

2.6  Consideration of Possible New Academic Assessment Tasks and Media as Part of an Expanded Academic Reading Construct

In this section we consider tasks and media that are not part of most current large-scale L2 reading assessments but that are important in academic reading. Current reading assessments provide good coverage of general comprehension abilities, but we would argue that additional consideration should be given to the assessment of abilities specifically related to reading to learn and reading to integrate information across texts, given their critical importance in academic reading. If assessments are to add new components of comprehension, the tasks to measure these new components need to be carefully researched from a theoretical as well as an assessment point of view.

2.6.1  Tasks that Might Assess Reading to Learn Abilities

If we want to expand the measurement of reading to learn (and reading to learn across multiple texts), then we need to think about what kinds of reading tasks might require reading to learn strategies and abilities. We have argued that, in addition to the abilities required for general comprehension, learning requires the reader to evaluate and connect information from different parts of a text and across texts to form a coherent situation model (Kintsch, 1986). In addition, the use of higher "standards of coherence," or deeper levels of understanding, is often required when the reader's purpose is learning from texts (Van den Broek et al., 2001, 2011).

Some assessment tasks already measure some reading to learn abilities or can easily be incorporated into current large-​scale assessments. For example, a question that asks for an author’s purpose in providing specific information in a text can call on the reader’s ability to make inferences that are part of a coherent situation model. A simple multiple-​choice task, with or without the text present, might ask test takers to identify a correct summary of the text from a number of written alternatives, or test takers might be asked to complete an outline of the text or a process model. Writing tasks that require the reader to summarize text information draw on the reader’s ability to connect important information in the text into a coherent whole. Some integrated reading-​writing tasks exist, such as the TOEFL iBT writing task that uses both written and aural input as a basis for a writing task. New task types that use longer single or multiple written texts could be designed to focus specifically on reading to learn from written texts or reading to integrate information across multiple texts, which are very common academic tasks. Additional types of reading to learn items that could tap test takers’ ability to create a mental representation of important text information should be considered. For example, taking away the text and leaving test takers with their situation models to answer the reading comprehension questions could provide information about what they have learned from the text. In each case, hypotheses about which reading to learn abilities can be demonstrated by the tasks should be articulated before tasks are tried out and should be used to guide the design of new tasks and evaluate the results.

2.6.2  Tasks That Might Use New Media to Assess Reading Abilities Related to Connecting Information across Multiple Texts

Britt and others (Britt & Rouet, 2012; Britt et al., 2018), including Goldman et al. (2016), have argued that the cognitive processes involved in reading multiple texts are more complex and more diverse than those needed for single texts. Readers need to integrate information both within texts and across texts. The documents framework for multiple texts (Britt et al., 2018; Rouet & Britt, 2011) consists of an intertext model and an integrated mental model of each text. How well a reader understands the texts and makes connections between them determines the quality of learning from the texts. The content and organizational structure of the information are both important, and both contribute to the integrated texts framework generated by the reader, as do the expectations and social conventions of the discipline (Goldman et al., 2016). Reading to learn across texts can be assessed using traditional print media or using new media. Given the prevalence of multiple-text reading in academia, we recommend that some measure of the abilities related to this reading purpose be considered for large-scale, high-stakes assessment. Issues related to new media, considered in Section 2.5, suggest that an appropriate new media design will require considerable thought and assessment design research.

In addition to investigating multiple-​choice tasks based on multiple digital texts, we recommend that designers explore various possibilities for new integrated reading-​writing assessments. Writing is prevalent in higher education and research suggests a positive relationship between a measure of information communications technology skills (related to digital literacy) and writing course outcomes (Katz et al., 2010). Integrated reading and writing assessments are critical for some literacy frameworks such as the CBAL writing assessment (Deane et al., 2008). Emerging data seems to suggest that this integrated approach to a reading and a writing test is feasible and provides added value, at least in the area of argumentation (Deane et al., 2019). Prototypes of L2 writing tasks based on reading passages could be considered for development, ranging from short-​answer tasks to longer essay tasks based on passages of various lengths and on multiple-​text passages. In an academic context, such integrated tasks would draw on the reader’s ability to evaluate text information (Sparks & Deane, 2015), and while reading ability will likely impact the extent to which certain students can write an essay, providing texts from which students can write might decrease the potential confound that background knowledge has on writing ability. Research comparing the effect of background knowledge on writing ability by manipulating reading demands might be fruitful for both modalities. The writing task might allow the reading section to assess deeper understanding, while the reading section might help control for the effects of background knowledge on writing performance. The reading-​writing connection also supports the idea of an integrated skills assessment (see Deane et al., 2008). Grabe and Zhang (2016) have examined reading-​writing relationships in both L1 and L2 academic literacy development. Reading-​writing integrated tasks could provide information about reading and writing as interactive skills, and the possibility of such tasks contributing scoring and performance information to both skills should be investigated. Most of the multi-​text studies to date have focused on strategies used in reading multiple texts in order to write a short essay based on the texts (Plakans, 2009; Spivey, 1990, 1991; Spivey & King, 1989). The development of new tools for online reading research and assessment should also be considered, including digital-​text-​related tools. These tools might include highlighting tools for test takers to mark important information, hyperlinks to provide background knowledge related to specific academic fields, and note-​ taking tools to measure strategy use.
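As a concrete, purely illustrative example of how such digital tools might feed assessment research, the short Python sketch below logs test-taker interactions (highlighting, note-taking, hyperlink use) and tallies them as simple process-data indicators of strategy use. The event schema, field names, and event types are hypothetical assumptions made for this sketch; they do not describe any existing testing platform or scoring approach.

```python
from dataclasses import dataclass
from collections import Counter
from typing import Dict, List

@dataclass
class ReadingEvent:
    """One logged interaction during a digital reading task.

    The schema is hypothetical: the event types and fields are illustrative,
    not those of any operational assessment system.
    """
    test_taker_id: str
    event_type: str   # e.g., "highlight", "note", "hyperlink_open", "item_answer"
    passage_id: str
    timestamp_ms: int

def summarize_strategy_use(events: List[ReadingEvent]) -> Dict[str, dict]:
    """Tally tool use per test taker as simple process-data indicators."""
    summary: Dict[str, Counter] = {}
    for e in events:
        summary.setdefault(e.test_taker_id, Counter())[e.event_type] += 1
    return {tid: dict(counts) for tid, counts in summary.items()}

if __name__ == "__main__":
    log = [
        ReadingEvent("tt01", "highlight", "passageA", 12_000),
        ReadingEvent("tt01", "note", "passageA", 45_500),
        ReadingEvent("tt01", "hyperlink_open", "passageB", 90_250),
        ReadingEvent("tt02", "highlight", "passageA", 15_100),
    ]
    print(summarize_strategy_use(log))
```

Whether and how such tallies relate to the intended construct is, of course, an empirical question of the kind raised throughout this section.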

2.7  Recommendations for Research

2.7.1  Short-term Research

2.7.1.1  Survey the International Academic Reading Domain

It will be important to conduct an international domain and/or needs analysis to provide a context for long-term research into the possible expansion of the
assessment of academic reading. While the PIAAC framework is helpful, more work is needed to specifically identify the needs in a higher education context. An analysis of the international academic reading domain should be conducted from the point of view of:

1. What are the most common and important types of academic reading and the tasks associated with academic reading?
2. What are the reading expectations of professors across different disciplines and in different international environments where English is used as a medium of instruction (for the U.S. context, see N. J. Anderson, 2015; and Hartshorn et al., 2017)?
3. How does the academic reading domain use new media?
   • What is the current role of information communication technology (ICT) in academic study?
   • What technologies and approaches are used in both distance education and as part of regular education?
   • What are the current uses of digital media and non-linear text in university teaching and learning?
   • Are there developments in academic reading, in particular in the use of technologies, that are taking place but are not captured by current reading frameworks?

2.7.1.2  Investigate Assessment in Support of Learning The most common approaches to L2 reading promote reading comprehension, as determined by students’ answers to post-​reading comprehension questions and classroom discussions, but they do not promote reading to learn (Grabe & Stoller, 2019). In placing greater emphasis on reading to learn and reading to integrate information across multiple texts, it will be important to not only communicate to teachers and students of L2 reading what is meant by reading to learn but also to provide support for the teaching and learning of reading to learn. Frameworks in support of learning that connect classroom teaching, learning, and assessment should be developed and supported by research. Formative and summative assessments aligned with the reading to learn construct can incorporate a broader range of item types targeting reading to learn, such as graphic organizers that are not easily incorporated into standardized multiple-​choice tests. New item types can be constructed for learning and attempts to score them efficiently can be pursued for both classroom and large-​scale assessment use. Various types of scaffolding to support learning should be researched and incorporated into teaching and classroom assessments. Research seeking empirical support for reading to learn classroom teaching, learning, and assessment should be pursued and lead to better connections among them. It is not clear whether teachers are prepared to teach students digital literacy skills or if they have the resources, including computer access, to do so (Stoller
et al., 2018). This is an important area where research and development of digital literacy tools may be able to provide support for learning to read in a second or foreign language.

2.7.2  Long-​term Research 2.7.2.1  Multiple Texts Qualitative research can be used to investigate whether multiple texts and tasks elicit construct-​relevant academic strategies such as integrating information across texts and summarizing information from multiple texts. For instance, Wolfe and Goldman (2005) used a think-​aloud procedure for examining sixth grade students’ ability to process contradictory texts. The data was coded for the presence of the type of processing as well as the content mentioned.The think-​alouds revealed that some students paraphrased the material and drew connections within and across texts. Some students also engaged in deeper processing strategies that involved forming elaborations to prior knowledge. In short, their findings indicated that some students were capable of understanding various elements of multiple texts. However, the understanding (depth of reasoning) was predicted by the quality of the think-​alouds. In particular, student responses that increased the coherence of the text displayed more depth of reasoning. Interestingly, this involved both deeper responses (e.g., self-​explanations) as well as surface connections. Some initial prototyping work was carried out at the Educational Testing Service (ETS) on multiple-​text strategies and on display and usability issues. A reading prototype using two different sets of multiple texts and multiple-​choice reading items on different topics was developed to investigate whether strategies and abilities needed to answer questions based on multiple texts differ from strategies and abilities required to answer questions based on single texts (Schedl & Papageorgiou, 2013). A number of specific strategies were frequently associated with multiple-​ text questions that differed from those usually associated with single passages. Associated with all multiple-​text items were: Integrating and synthesizing, summarizing, and generating an overview of information from both passages; comparing the organization, focus, rhetorical structure, and purpose of two passages or parts of two passages; comparing and contrasting as well as recognizing causal information connecting two passages. A second study employed three shorter texts and additional new item types with a focus on modern digital text display and associated issues (Schedl & O’Reilly, 2014). An attempt was made to represent multiple texts as they might appear in multiple-​text online reading situations. Non-​linear texts and hyperlinks were included. Hyperlinks provided supportive visual and textual features to reduce background knowledge needs and to explain difficult vocabulary. Test takers’ self-​ reports indicated that they generally understood how to answer the novel item types, and the difficult items were perceived as being manageable or appropriate. In
general, they preferred having the digital look and feel in the prototype assessment and they did use some of the available features (maps, highlighting tool). Future research should answer more specific research questions. For instance, what features of text increase or decrease the extent to which readers can make meaningful connections across multiple texts? How do author, craft, style, rhetorical structure, and cohesion impact multiple-​text understanding? Does reading purpose influence the extent to which students can connect multiple passages?

2.7.2.2  Electronic Media ICT is a construct in itself so we need to think carefully about conflating it with English language proficiency. The use of technology and the internet is prevalent in academia, but prevalence does not necessarily mean importance. There are several issues related to the introduction of some new media features into the test, such as the degree to which they are important and relevant to academic reading success and their generalizability to the academic domain. The incorporation of audio files, videos, and simulations to augment web page content, the increasingly common use of messaging, e-​mail, blogs, etc., all need to be researched to determine their validity and feasibility in the assessment of international test takers’ reading ability. These phenomena also need to be understood in relation to introductory courses that may prepare students for some of the associated challenges. Students are accustomed to using digital devices for non-​academic tasks, but there may be a gap in their digital literacy skills for reading to learn purposes (Stoller et al., 2018). The issue of evaluating sources is important, and prototypes have been developed with promising results (Coiro et al., 2018; Leu et al., 2009; Sparks & Deane, 2015; Sparks et al., 2016; Stanford History Education Group, 2016). However more work is needed to understand how this line of work would be applied in a large-​scale L2 academic context. Prototypes could be created to investigate the feasibility of obtaining this type of information in an assessment context. Consideration should also be given to the question of whether the assessment of this type of critical ability is within the scope of the test and how native speakers perform on such tasks. New technology allows us to design new types of assessments and create new types of items and tasks. Care must be taken, however, to ensure that the tasks are fair to all groups of test takers and that they do not measure abilities that are irrelevant to the academic reading construct.

2.7.2.3  Factors That Affect Difficulty

We know that it is the interplay of reader purpose, text and task characteristics, and the linguistic and processing abilities of the reader that determines comprehension (Snow, 2002). Difficult parts of texts should pose the greatest comprehension
issues, demand higher “standards of coherence,” and so on. While there is a wealth of studies on how to measure text complexity through automated means (Graesser et al., 2011; Graesser et al., 2014; Sheehan, 2016; Sheehan et al., 2014) and its impact on comprehension (McNamara et al., 1996; O’Reilly & McNamara, 2007; Ozuru et al., 2009; Ozuru et al., 2010), more work is needed to determine how to leverage this notion in a formal assessment context. On the one hand, a simple approach would be to report the complexity of the texts used in the assessment. This would provide users with quantitative information. More complicated are the decisions on how to systematically vary the text complexity in an assessment context and how this information would be reported, interpreted, and used by stakeholders. Thus, additional studies related to the difficulty of text variables in academic texts should be conducted with the goal of informing the development of appropriate test designs and how the tasks can be related to varying levels of text complexity.
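To illustrate the general idea of measuring text complexity through automated means, the minimal Python sketch below computes a few surface-level indices (sentence length, word length, lexical variety, proportion of long words). These indices, and the seven-letter threshold for a "long" word, are illustrative assumptions only; operational systems such as those cited above rely on far richer linguistic feature sets and validated statistical models.

```python
import re

def surface_complexity(text: str) -> dict:
    """Compute a few surface-level text complexity indices.

    Illustrative proxies only: naive sentence and word segmentation,
    no syntactic, cohesion, or word-frequency features.
    """
    # Naive sentence segmentation on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return {}
    return {
        "words_per_sentence": len(words) / len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "pct_long_words": sum(len(w) >= 7 for w in words) / len(words),
    }

if __name__ == "__main__":
    passage = (
        "Lexical quality grows word by word. Fully specified representations "
        "emerge incrementally from extensive exposure to print over time."
    )
    for name, value in surface_complexity(passage).items():
        print(f"{name}: {value:.2f}")
```

Reporting such indices for the passages in an assessment would be the "simple approach" noted above; systematically manipulating them across forms raises the harder design and reporting questions.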

2.7.2.4  Strategic Abilities Research Research should investigate whether the reading strategies and abilities specified for the academic reading assessment are in fact elicited by the test texts and tasks. Skilled readers use strategies to clarify, refine, and elaborate their understanding. Investigating strategy use can help uncover ways that readers process tasks. For instance, the reading to learn construct suggests that readers should be forming a global representation of text. To the extent that we can develop tasks that measure local versus global processing, the strategies used by the reader can be observed to determine whether they match the intended construct-​relevant processing strategies. Academic reading strategies of interest include: Summarizing, paraphrasing, monitoring, rereading for new purposes, inferring, predicting, recognizing text organization and genre, integrating information/​ connecting parts of a text, guessing word meanings in context, adapting the purpose for reading, identifying main ideas, and critiquing the author or the text. We should investigate whether processing strategies differ for different reading purposes and whether different strategies are used by readers of different abilities or whether they use similar strategies but with different results. These kinds of questions are important for both learning and assessment. Future research should identify the types of reading strategies that can be effectively measured in the context of a high-​stakes reading test and determine if and how strategy use can be used to augment or interpret the overall reading score.

2.8  Summary
In summary, the large-scale assessment of academic reading has not typically included some important aspects of academic reading, such as the reading of multiple texts and the skills needed to integrate, evaluate, and synthesize information across them in a purposeful way. Nor has it typically addressed the role of digital texts in academic reading. Modern assessments, and the reading research that supports them, should be updated to reflect these aspects of academic reading. While there is a broad set of theories describing reading, assessment frameworks should be transparent about the rationale for their design, including the constructs measured, the empirical support for those constructs, the types of texts included, and the tasks used to measure reading proficiency. Finally, while there are key differences between reading in L1 and L2 contexts, there is a large overlap of shared resources and processes. Thus, we argue that the extensive work on L1 reading is relevant and may expand the opportunities for learning more about L2 reading and how to measure it effectively.

Notes
1 Note these tests are not high stakes for the test takers.
2 This is not to say that foundational skills such as decoding, word recognition, vocabulary knowledge, reading fluency, and syntactic knowledge are unimportant; they are critical to reading development.
3 Note that this work was developed for students in K-12 settings. An open question is whether it would transfer to an academic context in higher education.

References Adlof, S. M., Catts, H. W., & Little, T. D. (2006). Should the simple view of reading include a fluency component? Reading and Writing, 19(9), 933–​958. Akbulut, Y. (2007). Effects of multimedia annotations on incidental vocabulary learning and reading comprehension of advanced learners of English as a foreign language. Instructional Science, 35(6), 499–​517. Akbulut,Y. (2008). Predictors of foreign language reading comprehension in a hypermedia reading environment. Journal of Educational Computing Research, 39(1), 37–​50. Alderson, J. C. (1984). Reading in a foreign language: A reading problem or a language problem? In J. C. Alderson & A. H. Urquhart (Eds.), Reading in a foreign language (pp. 1–​27). New York: Longman. Alderson, J. C. (2000). Assessing reading. Cambridge, U.K.: Cambridge University Press. Alderson, J. C., Nieminen, L., & Huhta, A. (2016). Characteristics of weak and strong readers in a foreign language. The Modern Language Journal, 100(4), 853–​879. Alexander, P., & The Disciplined Reading and Learning Research Laboratory. (2012). Reading into the future: Competence for the 21st century. Educational Psychologist, 47(4), 259–​280. Anderson, J. R. (2015). Cognitive psychology and its implications (8th ed). New York: Worth Publishers. Anderson, N. J. (2015). Academic reading expectations and challenges. In N. W. Evans, N. J. Anderson, & W. G. Eggington (Eds.), ESL readers and writers in higher education: Understanding challenges, providing support (pp. 95–​109). New York: Routledge. Ashely, J. (2003). Synchronous and asynchronous communication tools. ASAE: The Center for Association Leadership. Retrieved from www.asaecenter.org/​Resources/​articledetail. cfm?itemnumber=13572

Bradford, G. R. (2011). A relationship study of student satisfaction with learning online and cognitive load: Initial results. Internet and Higher Education, 14(4), 217–​226. Britt, M., & Rouet, J. F. (2012). Learning with multiple documents: Component skills and their acquisition. In M. J. Lawson & J. R. Kirby (Eds), The quality of learning: Dispositions, instruction, and mental structures (pp. 276–​ 314). Cambridge, U.K.: Cambridge University Press. Britt, M., Rouet, J. F., & Braasch, J. (2013). Documents as entities: Extending the situation model theory of comprehension. In M. Britt, S. Goldman, & J-​F. Rouet (Eds.), Reading—​From words to multiple texts (pp. 160–​179). New York: Routledge. Britt, M., Rouet, J. F., & Durik, A. M. (2018). Literacy beyond text comprehension: A theory of purposeful reading. New York: Routledge. Carver, R. P. (1997). Reading for one second, one minute, or one year from the perspective of rauding theory. Scientific Studies of Reading, 1(1), 3–​43. Catts, H., Adlof, S., & Weismer, S. (2006). Language deficits in poor comprehenders: A case for the simple view of reading. Journal of Speech, Language & Hearing Research, 49(2), 278–​293. Chen, I., & Yen, J. (2013). Hypertext annotation: Effects of presentation formats and learner proficiency on reading comprehension and vocabulary learning in foreign languages. Computers & Education, 63, 416–​423. Chen, R. (2016). Learner perspectives of online problem-​based learning and applications from cognitive load theory. Psychology Learning and Teaching, 15(2), 195–​203. Chung, T., & Berry, V. (2000). The influence of subject knowledge and second language proficiency on the reading comprehension of scientific and technical discourse. Hong Kong Journal of Applied Linguistics, 5(1), 187–​222. Coiro, J. (2009). Rethinking reading assessment in a digital age: How is reading comprehension different and where do we turn now? Educational Leadership, 66, 59–​63. Coiro, J. (2011). Predicting reading comprehension on the Internet: Contributions of offline reading skills, online reading skills, and prior knowledge. Journal of Literacy Research, 43(4), 352–​392. Coiro, J., Sparks, J. R., & Kulikowich, J. M. (2018). Assessing online collaborative inquiry and social deliberation skills as learners navigate multiple sources and perspectives. In J. L. G. Braasch, I. Bråten, & M. T. McCrudden (Eds.), Handbook of multiple source use (pp. 485–​501). London, U.K.: Routledge. Deane, P., Odendahl, N., Quinlan, T., Fowles, M., Welsh, C., & Bivens-​Tatum, J. (2008). Cognitive models of writing: Writing proficiency as a complex integrated skill (ETS Research Report No. RR-​08-​55). Princeton, NJ: Educational Testing Service. Deane, P., Song,Y.,Van Rijn, P., O’Reilly, T., Fowles, M., Bennett, R., Sabatini, J., & Zhang, M. (2019). The case for scenario-​based assessment of written argumentation. Journal of Reading and Writing, 32, 1575–​1606. https://​doi.org/​10.1007/​s11145-​018-​9852-​7 Elgort, I., & Warren, P. (2014). L2 vocabulary learning from reading: Explicit and tacit lexical knowledge and the role of learner and item variables. Language Learning, 64(2), 365–​414. Elgort, I., Perfetti, C., Rickles, B., & Stafura, J. (2016). Contextual learning of L2 word meanings: Second language proficiency modulates behavioral and ERP indicators of learning. Language and Cognitive Neuroscience, 30(5), 506–​528. Enright, M. K., Grabe, W., Koda, K., Mosenthal, P., Mulcahy-​Ernt, P., & Schedl, M. A. (2000). 
TOEFL 2000 Reading Framework: A working paper (RM-​00-​04,TOEFL-​MS-​17). Princeton, NJ: Educational Testing Service.

Ericsson, K. A. (2009). The influence of experience and deliberate practice on the development of superior expert performance. In K. A. Ericsson, N. Charness, P. Feltovich, & R. Hoffman (Eds.), The Cambridge handbook of expertise and expert performance (pp. 683–​703). New York: Cambridge University Press. Ericsson, K. A., & Pool, J. (2016). Peak: Secrets from the new science of expertise. Boston, MA: Houghton Mifflin Harcourt. Farnia, F., & Geva, E. (2010). Cognitive correlates of vocabulary growth in English language learners. Applied Psycholinguistics, 32(4), 711–​738. Farnia, F., & Geva, E. (2013). Growth and predictors of change in English language learners. Journal of Research in Reading, 36(4), 389–​421. Frost, R. (2012). Towards a universal model of reading. Behavioral and Brain Sciences, 35(5), 263–​279. Gernsbacher, M.A. (1990). Language comprehension as structure building. Hillsdale, NJ: Lawrence Erlbaum Associates. Gernsbacher, M. A. (1997). Two decades of structure building. Discourse Processes, 23(3), 265–​304. Geva, E., & Farnia, F. (2012). Developmental changes in the nature of language proficiency and reading fluency paint a more complex view of reading comprehension in ELL and EL1. Reading and Writing, 25, 1819–​1845. Goldman, S. R., Britt, M. A., Brown, W., Cribb, G., George, M., Greenleaf, C., Lee, C., Shanahan, C., & Project READI. (2016). Disciplinary literacies and learning to read for understanding: A conceptual framework for disciplinary literacy. Educational Psychologist, 51, 219–​246. doi:10.1080/​00461520.2016.1168741 Gough, P. B., & Tunmer, W. E. (1986). Decoding, reading, and reading disability. Remedial and Special Education, 7(1), 6–​10. Grabe, W. (2009). Reading in a second language: Moving from theory to practice. Cambridge, U.K.: Cambridge University Press. Grabe, W., & Stoller, F. (2019). Reading to learn: Why and how content-​based instructional frameworks facilitate the process. In K. Koda & J.Yamashita (Eds.), Reading to learn in a foreign language (pp. 9–​29). New York: Routledge. Grabe, W., & Zhang, C. (2016). Reading-​writing relationships in first and second language academic literacy development. Language Teaching, 49(3), 339–​355. Graesser, A. C., McNamara, D. S., & Kulikowich, J. M. (2011). Coh-​Metrix: Providing multilevel analyses of text characteristics. Educational Researcher, 40(5), 223–​234. doi:10.3102/​ 0013189X11413260 Graesser, A. C., McNamara, D. S., Cai, Z., Conley, M., Li, H., & Pennebaker, J. (2014). Coh-​ Metrix measures text characteristics at multiple levels of language and discourse. The Elementary School Journal, 115(2), 210–​229. doi:10.1086/​678293 Graesser, A. C., Wiley, J., Goldman, S., O’Reilly, T., Jeon, M., & McDaniel, B. (2007). SEEK web tutor: Fostering a critical stance while exploring the causes of volcanic eruption. Metacognition and Learning, 2, 89–​105. Guthrie, J., & Kirsch, I. (1987). Literacy as multidimensional: Locating information in text and reading comprehension. Journal of Educational Psychology, 79, 220–​228. Hacker, D. J., Dunlosky, J., & Graesser, A. C. (2009). Handbook of metacognition in education. New York: Routledge. Hartshorn, K. J., Evans, N., Egbert, J., & Johnson, A. (2017). Discipline-​specific reading expectations and challenges for ESL learners in U.S. universities. Reading in a Foreign Language, 29(1), 36–​60. Hoover, W. A., & Gough, P. B. (1990). The simple view of reading. Reading and Writing: An Interdisciplinary Journal, 2(2), 127–​160.

Mullis, I.V. S., & Martin, M. O. (Eds.). (2015). PIRLS 2016 Assessment Framework (2nd ed). Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College. Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its correlates: A meta-​analysis. Language Learning, 64(1), 160–​212. Katz, I. R., Haras, C. M., & Blaszczynski, C. (2010). Does business writing require information literacy? Business Communication Quarterly, 73(2), 135–​149. Kieffer, M. (2011). Converging trajectories: Reading growth in language minority learners and their classmates, kindergarten to grade 8. American Educational Research Journal, 48(5), 1187–​1225. Kieffer, M. J., & Lesaux, N. K. (2012a). Development of morphological awareness and vocabulary knowledge in Spanish-​speaking language minority learners: A parallel process latent growth curve model. Applied Psycholinguistics, 33(1), 23–​54. Kieffer, M. J., & Lesaux, N. K. (2012b). Direct and indirect roles of morphological awareness in the English reading comprehension of native English, Spanish, Filipino, and Vietnamese speakers. Language Learning, 64(4), 1170–​1204. Kim,Y. S. (2017). Why the simple view of reading is not simplistic: Unpacking component skills of reading using a direct and indirect effect model of reading (DIER). Scientific Studies of Reading, 21(4), 310–​333. Kintsch, W. (1986). Learning from text. Cognition and Instruction, 3(2), 87–​108. Kintsch, W. (1998). Comprehension. New York: Cambridge University Press. Kintsch, W. (2012). Psychological models of reading comprehension and their implications for assessment. In J. Sabatini, E. Albro, & T. O’Reilly (Eds.), Measuring up: Advances in how we assess reading ability (pp. 21–​38). Lanham, MD: Rowman & Littlefield. Landi, N. (2013). Understanding the relationship between reading ability, lexical quality, and reading context. In M. Britt, S. Goldberg, & J-​F. Rouet (Eds.), Reading—​From words to multiple texts (pp. 17–​33). New York: Routledge. Lawless, K. A., Goldman, S. R., Gomez, K., Manning, F., & Braasch, J. (2012). Assessing multiple source comprehension through evidence-​centered design. In J. Sabatini, T., O’Reilly, & E. Albro (Eds.), Reaching an understanding: Innovations in how we view reading assessment (pp. 3–​17). Lanham, MD: Rowman & Littlefield. Lervåg, A., & Aukrust,V. (2010).Vocabulary knowledge is a critical determinant of the difference in reading comprehension growth between first and second language learners. Journal of Child Psychology and Psychiatry, 51(5), 612–​620. Lesaux, N. K., & Kieffer, M. J. (2010). Exploring sources of reading comprehension difficulties among language minority learners and their classmates in early adolescence. American Educational Research Journal, 47(3), 596–​632. Leu, D. J., Kinzer, C., Coiro, J., Castek, J., & Henry, L. (2013a). New literacies: A dual-​ level theory of the changing nature of literacy, instruction, and assessment. In D. E. Alvermann, N. J. Unrau, & R. B. Ruddell (Eds.), Theoretical models and processes of reading (6th ed., pp. 1150–​1181). Newark, DE: International Reading Association. Leu, D. J., Forzani, E., Burlingame, C., Kulikowich, J., Sedransk, N., Coiro, J., & Kennedy, C. (2013b). The new literacies of online research and comprehension: Assessing and preparing students for the 21st century with common core state standards. In S. B. Neuman & L. B. Gambrell (Eds.), C. Massey (Assoc. Ed.), Reading instruction in the age of common core standards (pp. 219–​236). Newark, DE: Wiley-​Blackwell. Leu, D. 
J., Kulikowich, J., Sedransk, N., & Coiro, J. (2009). Assessing online reading comprehension: The ORCA project. Research grant funded by U.S. Department of Education, Institute of Education Sciences. Leu, D. J., Zawilinski, L., Forzani, E., & Timbrell, N. (2014). Best practices in new literacies and the new literacies of online research and comprehension. In L. M. Morrow
& L. B. Gambrell (Eds.), Best practices in literacy instruction (5th ed., pp. 343–​ 364). New York: Guilford Press. Li, M., & Kirby, J. (2014). Unexpected poor comprehenders among adolescent ESL students. Scientific Studies of Reading, 18(1), 75–​93. Linderholm, T., & Van den Broek, P. (2002). The effects of reading purpose and working memory capacity on the processing of expository text. Journal of Educational Psychology, 94(4), 778–​784. Lipka, O., & Siegal, L. (2012).The development of reading comprehension skills in children learning English as a second language. Reading and Writing, 25, 1873–​1898. Mancilla-​Martinez, J., Kieffer, M., Biancarosa, G., Christodoulou, J., & Snow, C. (2011). Investigating English reading comprehension growth in adolescent language minority learners: Some insights from the simple view. Reading and Writing, 24, 339–​354. Mayer, R. E. (1997). Multimedia learning: Are we asking the right questions? Educational Psychologist, 32(1), 1–​19. McCrudden, M.T., & Corkill, A. J. (2010).Verbal ability and the processing of scientific text with seductive detail sentences. Reading Psychology, 31(3), 282–​300. McCrudden, M. T., & Schraw, G. (2007). Relevance and goal-​focusing in text processing. Educational Psychology Review, 19(2), 113–​139. McCrudden, M. T., Magliano, J., & Schraw, G. (Eds.) (2011). Text relevance and learning from text. Greenwich, CT: Information Age Publishing. McNamara, D. S., Kintsch, E., Songer, N. B., & Kintsch, W. (1996). Are good texts always better? Interactions of text coherence, background knowledge, and levels of understanding in learning from text. Cognition and Instruction, 14(1), 1–​43. McNamara, D. S., & Magliano, J. P. (2009). Towards a comprehensive model of comprehension. In B. Ross (Ed.), The psychology of learning and motivation (pp. 297–​384). New York: Elsevier. Melby-​Lervåg, M., & Lervåg, A. (2014). Reading comprehension and its underlying components in second-​language learners: A meta-​analysis of studies comparing first-​ and second-​language learners. Psychological Bulletin, 140(2), 409–​433. Metzger, M. J. (2007). Making sense of credibility on the web: Models for evaluating online information and recommendations for future research. Journal of the American Society for Information Science and Technology, 58(13), 2078–​2091. Meyer, B., & Ray, M. N. (2011). Structure strategy interventions: Increasing reading comprehension of expository text. International Electronic Journal of Elementary Education, 4(1), 127–​152. Mullis, I.V. S., Martin, M. O., Kennedy, A. M., Trong, K. L., & Sainsbury, M. (2009). PIRLS 2011 assessment framework. Boston, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. Retrieved from http://​timssandpirls. bc.edu/​pirls2011/​downloads/​PIRLS2011_​Framework.pdf Nassaji, H., & Geva, E. (1999). The contribution of phonological and orthographic processing skills to adult ESL reading: Evidence from native speakers of Farsi. Applied Psycholinguistics, 20(2), 241–​267. Netten, A., Droop, M., & Verhoeven, L. (2011). Predictors of reading literacy for first and second language learners. Reading and Writing, 24, 413–​425. Oakhill, J., & Cain, K. (2012). The precursors of reading comprehension and word reading in young readers: Evidence from a four-​year longitudinal study. Scientific Studies of Reading, 16(2), 91–​121. Oakhill, J., Cain, K., & Bryant, P. (2003). Dissociation of single-​word reading and text comprehension skills. 
Language and Cognitive Processes, 18(4), 443–​468.

O’Reilly, T., & McNamara, D. S. (2007). Reversing the reverse cohesion effect: Good texts can be better for strategic, high-​knowledge readers. Discourse Processes, 43(2), 121–​152. O’Reilly, T., & Sheehan, K. M. (2009). Cognitively based assessment of, for and as learning: A framework for assessing reading competency. (RM-​09-​26). Princeton, NJ: Educational Testing Service. Organisation for Economic Co-​ operation and Development (OECD). (2009a). PISA 2009 assessment framework: Key competencies in reading, mathematics and science. Paris, France: Author. Retrieved from www.oecd.org/​document/​44/​0,3746,en_​2649_​ 35845621_​44455276_​1_​1_​1_​1,00.html Organisation for Economic Co-​operation and Development (OECD). (2009b). PIAAC literacy: A conceptual framework. Paris, France: Author. Retrieved from www.oecd-​ilibrary. org/​content/​workingpaper/​220348414075 Ozuru,Y., Briner, S., Best, R., & McNamara, D. S. (2010). Contributions of self-​explanation to comprehension of high-​and low-​cohesion texts. Discourse Processes, 47(8), 641–​667. Ozuru,Y., Dempsey, K., & McNamara, D. S. (2009). Prior knowledge, reading skill, and text cohesion in the comprehension of science texts. Learning and Instruction, 19(3), 228–​242. Paivio, A. (1986). Mental representations: A dual coding approach. Oxford, U.K.: Oxford University Press. Perfetti, C. (1985). Reading ability. Oxford, U.K.: Oxford University Press. Perfetti, C. (1991). Representations and awareness in the acquisition of reading comprehension. In L. Rieben & C. Perfetti (Eds.), Learning to read: Basic research and its implications (pp. 33–​45). Hillsdale, NJ: L. Erlbaum Assoc. Perfetti, C. (1999). Comprehending written language: A blueprint of the reader. In C. M. Brown & P. Hagoort (Eds.), The neurocognition of language (pp. 167–​208). Oxford, U.K.: Oxford University Press. Perfetti, C. (2003).The universal grammar of reading. Scientific Studies of Reading, 7(1), 3–​24. Perfetti, C. (2007). Reading ability: Lexical quality to comprehension. Scientific Studies of Reading, 11(4), 357–​383. Perfetti, C., & Adlof, S. (2012). Reading comprehension: A conceptual framework from word meaning to text meaning. In J. Sabatini, E. Albro, & T. O’Reilly (Eds.), Measuring up: Advances in how we assess reading ability (pp. 3–​ 20). Lanham, MD: Rowman & Littlefield Education. Perfetti, C., & Hart, L. (2002). The lexical quality hypothesis. In L. Verhoeven, C. Elbro, & P. Reitsma (Eds.), Precursors of functional literacy (Vol. 11, pp. 67–​86). Amsterdam, The Netherlands: John Benjamins. Perfetti, C., & Stafura, J. (2014). Word knowledge in a theory of reading comprehension. Scientific Studies of Reading, 18(1), 22–​37. Perfetti, C., Britt, M., & Georgi, M. (1995). Text-​based learning and reasoning: Studies in history. Hillsdale, NJ: L. Erlbaum Assoc. Perfetti, C., Landi, N., & Oakhill, J. (2005). The acquisition of reading comprehension skill. In M. Snowling & C. Huome (Eds.), The science of reading: A handbook (pp. 227–​247). New York: Oxford University Press. Perfetti, C., Rouet, J., & Britt, M. A. (1999). Toward a theory of document representation. In H. Van Oostendorp & S. R. Goldman (Eds.), The construction of mental representations during reading (pp. 99–​122). Mahwah, NJ: Lawrence Erlbaum Associates. Perfetti, C., Stafura, J., & Adlof, S. (2013). Reading comprehension and reading comprehension problems: A word-​to-​text integration perspective. In B. Miller, L. Cutting, & P. 
McCardle (Eds.), Unraveling reading comprehension; Behavioral, neurobiological, and genetic components (pp. 22–​32). Baltimore, MD: Paul H. Brookes.

Perfetti, C.,Yang, C., & Schmalhofer, F. (2008). Comprehension skill and word-​to-​text integration processes. Applied Cognitive Psychology, 22(3), 303–​318. Pew Research Center. (2018). Pew internet and American Life project: Trend data (adults). Retrieved from www.pewinternet.org/​Trend-​Data-​(Adults)/​Device-​Ownership.aspx Plakans, L. (2009). Discourse synthesis in integrated second language writing assessment. Language Testing, 26(4), 561–​587. Purpura, J. E. (2017). Assessing meaning. In E. Shohamy, I. G. Or., & S. May (Eds.), Encyclopedia of language and education (Vol. 7): Language testing and assessment (pp. 33–​62). New York: Springer. Rouet, J. F. (2006). The skills of document use. Mahwah, NJ: L. Erlbaum, Assoc. Rouet, J. F., & Britt, M. A. (2011). Relevance processes in multiple document comprehension. In M. T. McCrudden, J. P. Magliano, & G. Schraw (Eds.), Relevance instructions and goal-​focusing in text learning (pp 19–​52). Greenwich, CT: Information Age Publishing. Sabatini, J., Albro, E., & O’Reilly, T. (2012). (Eds). Measuring up: Advances in how we assess reading ability. Lanham, MD: Rowman & Littlefield Education. Salmeron, L., Kintsch, W., & Kintsch, E. (2010). Self-​regulation and link selection strategies in hypertext. Discourse Processes, 47(3), 175–​211. Schedl, M., & O’Reilly,T. (2014). Development and evaluation of prototype texts and items for an enhanced TOEFL iBT reading test. Educational Testing Service, internal report. Schedl, M., & Papageorgiou, S. (2013). Examining the cognitive processing of comprehending different information and points of view in more than one passage. Educational Testing Service, internal report. Schoonen, R., Hulstijn, J., & Bossers, B. (1998). Metacognitive and language-​specific knowledge in native and foreign language reading comprehension. An empirical study among Dutch students in grades 6, 8 and 10. Language Learning, 48(1), 71–​106. Seidenberg, M. (2017). Language at the speed of sight. New York: Basic Books. Share, D. (2008). Orthographic learning, phonological recoding, and self-​teaching. In R. Kail (Ed.), Advances in child development and behavior (pp. 31–​81). New York: Elsevier. Sheehan, K. M. (2016). A review of evidence presented in support of three key claims in the validity argument for the TextEvaluator text analysis tool. (ETS Research Report No. RR-​16-​12). Princeton, NJ: Educational Testing Service. Sheehan, K. M., Kostin, I., Napolitano, D., & Flor, M. (2014).The TextEvaluator tool: Helping teachers and test developers select texts for use in instruction and assessment. The Elementary School Journal, 115(2), 184–​209. Snow, C. (2002). Reading for understanding: Toward an R&D program in reading comprehension. Santa Monica, CA: RAND. Sparks, J. R., & Deane, P. (2015). Cognitively based assessment of research and inquiry skills: Defining a key practice in the English language arts (ETS RR-​15-​35). Princeton, NJ: Educational Testing Service. Sparks, J. R., Katz, I. R., & Beile, P. M. (2016). Assessing digital information literacy in higher education: A review of existing frameworks and assessments with recommendations for next-​generation assessment (ETS RR-​16-​32). Princeton, NJ: Educational Testing Service. Spencer, M., & Wagner, R. K. (2017). The comprehension problems for second-​language learners with poor reading comprehension despite adequate decoding: A meta-​analysis. Journal of Research in Reading, 40(2), 199–​217. Spivey, N. N. (1990). 
Transforming texts: Constructive processes in reading and writing. Written Communication, 7(2), 256–​87.

Spivey, N. N. (1991). The shaping of meaning: Options in writing the comparison. Research in the Teaching of English, 25(4), 390–​217. Spivey, N. N., & King, J. R. (1989). Readers as writers composing from sources. Reading Research Quarterly, 24(1), 7–​26. Stafura, J., & Perfetti, C. (2017). Integrating word processing with text comprehension: Theoretical frameworks and empirical examples. In K. Cain, D. Compton, & R. Prilla (Eds.), Theories of reading development (pp. 9–​31). Philadelphia, PA: John Benjamins. Stanford History Education Group [SHEG]. (2016). Evaluating information: The cornerstone of civic online reasoning –​Executive summary. Retrieved from https://​stacks.stanford.edu/​ file/​druid:fv751yt5934/​SHEG%20Evaluating%20Information%20Online.pdf Stoller, F., Grabe, W., & Kashmar Wolf, E. (2018). Digital reading in EFL reading-​to-​ learn contexts. Language Education in Asia (lEiA), 9, 1–​14. Retrieved from www.leia. org/​LEiA/​LEiA%20VOLUMES/​Download/​LEiA_​V9_​2018/​LEiA_​V9A02_​Stoller_​ Grabe_​Wolf.pdf Strømsø, H. I., Bråten, I., & Britt, M. A. (2010). Reading multiple texts about climate change:The relationship between memory for sources and text comprehension. Learning and Instruction, 20(3), 192–​204. The Official Guide to the TOEFL Test, 5th ed. (2017). New York: McGraw Hill Education. Trapman, M., Van Gelderen, A., Van Steensel, R., Van Schooten, E., & Hulstijn, J. (2014). Linguistic knowledge, fluency and meta-​cognitive knowledge as components of reading comprehension in adolescent low-​achievers: Differences between monolinguals and bilinguals. Journal of Research in Reading, 37(S1), S3–​S21. Van Daal,V., & Wass, M. (2017). First-​and second-​language learnability explained by orthographic depth and orthographic learning: A “natural” Scandinavian experiment. Scientific Studies of Reading, 21(1), 46–​59. Van den Broek, P., Bohn-​Gettler, C., Kendeou, P., Carlson, S., & White, M. J. (2011). When a reader meets a text: The role of standards of coherence in reading comprehension. In M. T. McCrudden, J. P. Magliano, & G. Schraw (Eds.), Text relevance and learning from text (pp. 123–​140). Greenwich, CT: Information Age Publishing. Van den Broek, P., Lorch, R. F., Jr., Linderholm, T., & Gustafson, M. (2001). The effects of readers’ goals on inference generation and memory for texts. Memory & Cognition, 29(8), 1081–​1087. Van den Broek, P., Risden, K., & Husebye-​Hartman, E. (1995). The role of the reader’s standards of coherence in the generation of inference during reading. In R. F. Lorch, Jr. & E. J. O’Brian (Eds.), Sources of coherence in text comprehension (pp. 353–​374). Mahwah, NJ: Lawrence Erlbaum Associates. Van den Broek, P.,Young, M., Tzeng,Y., & Linderholm, T. (1999). The landscape model of reading: Inferences and the on-​line construction of a memory representation. In H.Van Oostendorp & S. R. Goldman (Eds.), The construction of mental representation during reading (pp. 71–​98). Mawhah, NJ: Lawrence Erlbaum Associates. Van Gelderen, A., Schoonen, R., De Glopper, K., Hulstijn, J., Simis, A., Snellings, P., & Stevenson, M. (2004). Linguistic knowledge, processing speed and metacognitive knowledge in first and second language reading comprehension: A componential analysis. Journal of Educational Psychology, 96(1), 19–​30. Van Steensel, R., Oostdam, R., Van Gelderen, A., & Van Schooten, E. (2016). The role of word decoding, vocabulary knowledge and meta-​cognitive knowledge in monolingual and bilingual low-​achieving adolescents’ reading comprehension. 
Journal of Research in Reading, 39(3), 312–​329.

Verhoeven, L., & Perfetti, C. (2017). Operating principles in learning to read. In L. Verhoeven & C. Perfetti (Eds.), Learning to read across languages and writing systems (pp. 1–​30). Cambridge, U.K.: Cambridge University Press. Verhoeven, L., & Van Leeuwe, J. (2012). The simple view of second language reading throughout the primary grades. Reading and Writing, 25(8), 1805–​1818. Wolfe, M. B. W., & Goldman, S. R. (2005). Relations between adolescents’ text processing and reasoning. Cognition and Instruction, 23(4), 467–​502. Yamashita, J., & Shiotsu, T. (2017). Comprehension and knowledge components that predict L2 reading: A latent-​trait approach. Applied Linguistics, 38(1), 43–​67. Yoshii, M. (2006). L1 and L2 glosses: Their effects on incidental vocabulary learning. Language Learning & Technology, 10(3), 85–​101.


3  ASSESSING ACADEMIC LISTENING
Spiros Papageorgiou, Jonathan Schmidgall, Luke Harding, Susan Nissan, and Robert French

3.1  Introduction
Listening comprehension is a complex process (Brunfaut, 2016; Buck, 2001) and a vital language ability for academic success (Powers, 1986; Stricker & Attali, 2010). This chapter examines issues related to defining and assessing the construct of listening comprehension in the context of English-medium higher education. We first explore how listening comprehension has been conceptualized specifically with reference to the academic domain, drawing upon theoretical models, language proficiency standards, and research focused on relevant aspects of the context of language use. We then review how tests of academic listening have defined and operationalized the construct in their design. We also examine key issues related to how listening comprehension is operationalized in assessment, and we consider the various tradeoffs (e.g., between practicality and measurement quality, between construct representation and fairness). We conclude the chapter with a recommendation for future research and a model of general academic listening comprehension based on the interactionalist approach to construct definition (Chapelle, 1998).

3.2  Defining Listening Ability in the Academic Domain
Defining the construct to be assessed is central to test development and subsequent validation activity (McNamara, 2004). A construct may be based on a theory of language learning or use, a needs analysis, a domain analysis, a course syllabus, or other resources external to the test (Bachman & Palmer, 2010). Even when a test has a clearly defined target construct, it is useful to periodically examine the relevant research and developments in theory to ensure that the conceptual basis of the construct remains structurally sound in the face of new perspectives or changes within the target language use (TLU; Bachman & Palmer, 1996) domain. With the central role of construct definition in mind, this section provides a discussion of approaches to conceptualizing English language listening assessment for general academic purposes. We first focus on domain-agnostic cognitive processing and component-based models of the listening construct. Given the differences in approach between cognitive processing and component models of listening ability, as well as the influence they have exerted on current thinking in the field of second language listening assessment, we explore variations of each type. We also review models of second language listening comprehension primarily designed with a pedagogical focus. We further discuss academic standards and how they describe academic listening skills, given that such standards have also had an impact on how developers of language assessments design test content and report test scores. We then summarize recent trends related to the TLU domain of academic listening, which are likely to affect test developers’ and researchers’ understanding of the contextual features relevant to the construct of academic listening. Drawing on this discussion of changes in theories of the construct of academic listening and in the academic domain itself, we propose a generic construct definition for listening comprehension for academic purposes.

3.2.1  Listening Comprehension as a Set of Knowledge, Skills, and Abilities
3.2.1.1  Cognitive Processing Models of Listening Ability
Cognitive processing models attempt to specify phases of processing that occur between a language user’s reception of acoustic (and visual) signals and subsequent response to an assessment task. In test settings, listeners may access various sources of knowledge, such as situational knowledge, linguistic knowledge, and background knowledge (Bejar, Douglas, Jamieson, Nissan, & Turner, 2000). Receptive and cognitive processes interact concurrently to construct a propositional representation of the aural input. Propositions contain “an underlying meaning representing a concept or a relationship among concepts” (Sternberg, 2003, p. 536). Propositional representations are then applied to provide a task response (whether constructed or selected). The construction of propositional representations is also mediated by an individual’s working memory capacity (Londe, 2008). Thus, the cognitive load of a task is an important consideration, as coordinating these processes may be an intensive, if not overwhelming, task for some second language listeners (see Field, 2011). Various processing models have been proposed for second language listening, which share a view that listening involves both lower-level processes of
perception and higher-​level processes of meaning-making working in tandem. Rost (2005) characterized listening as a cognitive process consisting of three primary phases: decoding, comprehension, and interpretation. Although the phases were described in a manner that implied linear processing, Rost stated that they operate in parallel. Processing within each phase is driven towards a particular goal. Decoding requires attentional and perceptual processes and linguistic knowledge to recognize words and parse syntax, the raw material for the comprehension phase. Comprehension utilizes both short and long-​term memory and includes four overlapping sub-​processes: identifying salient information, activating appropriate schemata, inferencing, and updating memory. Interpretation is primarily characterized by processes that support pragmatic understanding of the aural text, and may also include response activities (e.g., backchanneling, follow-​up acts). Field (2013) proposed a model of academic listening based on earlier psycho­ linguistic models (see Anderson, 1985; Cutler & Clifton, 2000) by sub-​dividing the decoding/​perceptual and utilization/​interpretation/​conceptual phases. In Field’s (2013) model, listening occurs across five levels of processing: input decoding, lexical search, parsing, meaning construction, and discourse construction. Field (2011) argues that the higher-​level processes (meaning and discourse construction) may be especially important in academic listening. Meaning construction is hypothesized to begin with propositional input and draw upon various sources of knowledge to construct a meaning representation. A meaning representation still needs to be organized or evaluated in some manner via discourse construction (Field, 2013). Discourse construction is a higher-​order cognitive process that engages processes that may be utilized during meaning construction, including selecting relevant information, monitoring information for consistency, integrating information, and building a coherent information structure that includes macro-​ and micro-​points. Structure building is particularly important for listeners who need to discern the relative importance of points, which is critical, for example, when various arguments are being presented. Memory processes have also come to be seen as integral features within the listening comprehension process. Londe (2008) investigated the relationship between working memory and performance on listening comprehension tests using structural equation modeling (SEM). Drawing upon research in cognitive psychology, Londe hypothesized a structural model1 in which listening comprehension ability was predicted by working memory capacity and short-​term memory capacity as measured in a test taker’s first (L1) and second (L2) languages. Londe’s (2008) model attempted to define the relationships between working memory, short-​term memory, and long-​term memory within the context of performing a listening comprehension task. In addition to these memory systems, the model included memory processes (including retrieval and rehearsal), affective influences, and strategic decision-​ making. Based on her analysis of alternative structural models, Londe argued that working memory is an essential component of the listening comprehension construct. Since working memory has been shown to vary
across individuals and within individuals over time and found to have an impact on listening comprehension scores, Londe noted that elements of an assessment task that impose higher demands on working memory may introduce construct-​ irrelevant variance if working memory is not considered in the construct. A related issue is how information obtained through listening is encoded and stored in memory. Theorists have proposed several types of knowledge representations, including propositional representations, schematic representations (a set of linked propositions), and mental models, which incorporate propositional information and mental imagery (see Sternberg, 2003). These representations assume that what listeners encode and remember is not the exact wording of utterances, but the propositional content of utterances, connections between propositions, and connections between propositions and real-​world knowledge. Therefore, discourse may be understood by a listener without the listener being able to recall the precise wording of the discourse. Thus, a task that requires the listener to recall discourse verbatim might introduce construct-​irrelevant variance because it may not engage the ability as defined by the test construct. Consequently, construct-​irrelevant variance would undermine what Field (2013) described as the “cognitive validity” of a test, or what Bachman and Palmer (2010) refer to as the meaningfulness of score interpretations, unless the construct specifically involves recollection of discourse verbatim. For example, when assessing the learner’s ability to understand the speaker’s intention, a specific part of the stimulus in a listening test might need to be repeated. Several of the theoretical perspectives discussed so far in this section may have implications for the cognitive processing model articulated in construct definitions related to the assessment of academic listening. For example, according to Bejar et al. (2000) two stages occur in listening test settings: listening and response. The output of the listening stage is described by Bejar et al. as a propositional representation, which is then accessed during the response stage in conjunction with other sources of knowledge in order to communicate. Field (2013) describes the meta-​cognitive strategies (e.g., selection, integration, structure building) and sources of knowledge (e.g., pragmatic, external knowledge) that may be used to transform propositional representations into meaning and discourse constructions. In addition, Londe’s (2008) research highlights the important mediating role that working memory may play in the process of listening comprehension, underscoring its centrality to any cognitive processing model.
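To make the structural hypothesis in Londe’s (2008) study more concrete, the latent regression at the heart of such a model can be sketched as below. This rendering is schematic and ours rather than a reproduction of Londe’s actual specification: it simply expresses listening comprehension (LC) as predicted by working memory (WM) and short-term memory (STM) capacities measured in the L1 and the L2, with a residual term capturing what the predictors do not explain.

$$\mathrm{LC} = \beta_1\,\mathrm{WM}_{L1} + \beta_2\,\mathrm{WM}_{L2} + \beta_3\,\mathrm{STM}_{L1} + \beta_4\,\mathrm{STM}_{L2} + \zeta$$

In a full structural equation model, each of these terms would be a latent variable with its own measurement model (e.g., memory span tasks, listening test scores), and the relative size of the estimated paths is what would license claims about the centrality of working memory to the listening construct.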

3.2.1.2  Component Models of Listening Ability
Researchers have also proposed models that delineate components of listening ability to support the teaching or assessment of listening skills. These models often recognize and try to accommodate features of cognitive processing. For example, most of the models emphasize that listening ability requires declarative and procedural knowledge, but they do not specify how processing is performed
in any detail. Despite this lack of specificity, such models may be useful for the purposes for which they were designed: identifying key components of declarative knowledge, fundamental cognitive and meta-​cognitive processes, and important task or contextual characteristics that should be considered when teaching or assessing listening ability. Two models which have had considerable impact in the field (Buck, 2001; Weir, 2005) are structured in a similar manner, owing much to Bachman’s (1990) component model of communicative competence. These models are described below. Buck (2001) provides a model for describing the components of listening comprehension to support test development. The model, adapted from Bachman and Palmer’s (1996) model of communicative competence, includes two parts: language competence and strategic competence. Language competence includes declarative and procedural knowledge related to listening: grammatical, discourse, pragmatic, and sociolinguistic knowledge. Buck’s definition of grammatical knowledge is the ability to understand short utterances “on a literal semantic level” (Buck, 2001, p. 104), including syntax and vocabulary but also features of speech such as stress and intonation. Discourse knowledge is the ability to understand longer utterances and more interactive discourse based on knowledge of discourse features (cohesion, foregrounding, rhetorical schema, etc.). Pragmatic knowledge is similar to Bachman’s (1990) description of illocutionary knowledge, the ability to understand the intended function of an utterance. Sociolinguistic knowledge is the ability to understand language appropriate to particular sociocultural settings. The second major component of Buck’s model, strategic competence, includes cognitive and meta-​cognitive strategies related to listening. Cognitive strategies are processing activities that can be categorized as comprehension (processing input), storing and memory (storing input in memory), or using and retrieval (accessing memory). Meta-​cognitive strategies are executive activities that manage cognitive processes, and include assessing the situation (e.g., available resources, constraints), monitoring, self-​evaluation, and self-​testing. Ultimately, Buck’s intention is to provide a comprehensive model that test developers can adapt based on the aspects of language competence that are included in the construct definition. For example, if test developers targeted the construct “knowledge of the sound system,” they would only need to assess phonology, stress, and intonation. Buck offers a formal definition for a default listening construct to emphasize his view of the most important aspects of language competence. This construct includes “the ability to process extended samples of realistic spoken language, automatically and in real time; to understand the linguistic information that is unequivocally included in the text; and to make whatever inferences are unambiguously implicated by content of the passage” (Buck, 2001, p. 114). Buck also describes a model for a task-​based listening construct, again based on Bachman and Palmer (1996). In this model, key tasks in the TLU domain are identified, and assessment tasks are designed to correspond to these tasks more closely.This model explicitly recognizes the importance of the context in which language use takes
place, and Buck suggests that competence-based and task-based approaches should be combined to define a construct more fully. Alternatively, a purely competence-based approach to construct definition would assume that a test taker’s performance is determined by their competence or ability and may say little about contextual features of language use. Weir’s (2005) model of listening comprehension also includes two main components: language knowledge and strategic competence. Weir characterizes language knowledge as a collection of executive resources that consist of grammatical, discoursal, functional (pragmatic), sociolinguistic, and content knowledge. This description of language knowledge builds on Buck’s by including content knowledge, which may be internal to the test taker (i.e., background knowledge) or external (i.e., provided by the task). Weir’s notion of strategic competence is framed as executive processing whose components include goal setting, acoustic/visual input, audition, pattern synthesizer, and monitoring. Like Buck (2001), Weir describes these components as roughly sequential but overlapping processes. Weir’s approach is more holistic than Buck’s, as it lacks the formal distinction between cognitive and meta-cognitive processing that is critical in Buck’s approach. However, Weir’s goal is similar to Buck’s in that he intends to provide a model that can be used to guide test development and subsequent validation research (see Taylor & Geranpayeh, 2011 for an application). Weir’s model for validating listening tests requires a consideration of the context validity of the test tasks and administration, as defined by features of the setting and demands of test tasks. His approach, which is consistent with an interactionalist approach to construct definition (see Chapelle, 1998), suggests that characteristics of the test takers interact with internal processes (defined by the construct) and contextual factors, and that all of these elements should be considered during test design and validation.

3.2.1.3  Pedagogical Models for Describing Listening Ability
Several models of second language listening comprehension have been designed to improve pedagogy, both in terms of how listening skills are taught in the second language learning classroom, and how content lecturers present information to their students. Flowerdew and Miller’s (2005) model is intended for a variety of contexts; thus, it is primarily concerned with defining dimensions of the individual and context that may impact listening comprehension. Miller’s (2002) model describes lecturing in a second language and attempts to delineate important factors that influence comprehension while emphasizing the roles lecturers and listeners play in this process. While both models recognize the importance of cognitive and meta-cognitive processing, no attempt is made to provide a detailed hypothesis about how phases of processing may unfold. Flowerdew and Miller’s (2005) cognitive processing model is supplemented by eight key dimensions that may impact listening comprehension: individual, cross-cultural, social, contextualized, affective, strategic, intertextual, and critical.
The individualized dimension recognizes individual differences in linguistic processing due to an individual’s unique level of language proficiency. A cross-​cultural dimension is included to account for the belief that schemata and background knowledge can vary according to one’s gender, age, culture, or native language, and may impact the interpretation of speech. The social dimension recognizes that listeners take on a variety of different roles (e.g., side participants, overhearers), and that knowledge is co-​constructed by the speaker and listener. The contextualized dimension emphasizes that listening is integrated with other processes and activities (e.g., reading, observation, writing, speaking) and thus real-​world listening activities are often multi-​modal. For example, the task of listening to a lecture may require taking notes, looking at visuals, and reading handouts. The affective dimension includes four factors that may influence one’s decision to listen: attitude, motivation, affect, and physical feelings. The strategic dimension includes the use of strategies to enhance listening comprehension, that is, meta-​cognitive strategies. The intertextual dimension functions at the level of register and genre that may demand a high level of familiarity with the target language’s relevant cultural referents. The critical dimension requires advanced skills, through which the context of the aural text is deconstructed and analyzed to examine the inequalities in power that it reproduces. Miller (2002) presents a model for lectures in a second language, conceptualized as a prism. The prism has three dimensions at its base: genre, community of learners, and community of practice. Lectures are characterized by a particular genre, and include artefacts (e.g., handouts, visuals) and lecturer behavior (e.g., gestures). Students (or listeners) belong to a community of learners; such a community is in part defined by learning practices or tradition (e.g., teacher-​centered classrooms). Lecturers attempt to initiate students into a community of practice. These communities are viewed as discourse communities that are discipline-​specific, for example, the discourse of an engineering community. Language is a key component that needs to be accessible to the listeners to facilitate comprehension of the content, contextualized in a particular genre and community of practice. The process of lecturing is preceded by expectations possessed by the lecturer and students and proceeds as both the lecturer and students navigate the interactive components of the prism. The product of this process is higher and lower levels of comprehension. Miller’s (2002) model highlights important themes related to the construct definition of academic listening. First, lectures in a second language occur in a multidimensional context, which includes communities of learners and practice. Second, genre is co-​constructed by the lecturer and listener and preceded by differences in expectations due to previous learning and teaching experiences. As the lecture unfolds, differences in behavior among participants (linguistic, pedagogic, strategic) may impact comprehension. For example, the lecturer may provide a handout on which some students take notes while others do not; those who take notes may vary in how they organize them.

3.2.1.4  Standards that Define College-ready Listening Skills
Standards describing academic listening skills differ in their focus and specificity. Consideration of these standards in defining the construct of academic listening is important because they might affect teachers’ expectations about learning objectives in the classroom and the learning goals set in curricula. In the United States, the Common Core State Standards (www.corestandards.org) is an initiative supported by many states to describe the skills and abilities expected of students at each grade level (school year). Skills and abilities for college are classified into two broad categories: Comprehension and Collaboration, and Presentation of Knowledge and Ideas. Listening is not described separately, but it is integrated with Speaking.2 According to statements that primarily describe listening abilities, students are expected to “integrate and evaluate information presented in diverse media and formats, including visually, quantitatively, and orally” (Standard SL.2) and “evaluate a speaker’s point of view, reasoning, and use of evidence and rhetoric” (Standard SL.3). The Common European Framework of Reference (CEFR; Council of Europe, 2001) describes six main levels of language proficiency organized into three bands: A1 and A2 (basic user), B1 and B2 (independent user), and C1 and C2 (proficient user). Its main purpose is to provide “a common basis for the elaboration of language syllabuses, curriculum guidelines, examinations, textbooks, etc. across Europe” (Council of Europe, 2001, p. 1). The CEFR describes what learners are able to do in relation to various communicative activities and language competences across the six levels, with descriptors also available for the “plus” levels A2+, B1+, and B2+. A companion volume (Council of Europe, 2018) offers additional descriptors. Because the CEFR is not language-specific and is intended as a reference document, it does not provide information specific to grade levels as is the case with the Common Core State Standards. Instead, test providers very often refer to its levels to facilitate score interpretation. As a result, “the six main levels of the CEFR have become a common currency in language education” (Alderson, 2007, p. 660) and reference to its scales can “add meaning to the scores” yielded by a test (Kane, 2012, p. 8). Although the CEFR is not designed for the curriculum of a specific country, it is increasingly used as an external standard by governments and international agencies in order to set policy (see Fulcher, 2004; McNamara, 2006). In terms of listening comprehension, the CEFR contains several scales which relate to the processing and utilization of aural input, some of which are relevant to language use in the academic domain (see North, 2014, pp. 62–65). For example, two descriptors define the following listening abilities:

• Can follow the essentials of lectures, talks and reports and other forms of academic/professional presentation which are propositionally and linguistically complex. (Listening as a Member of a Live Audience, Level B2)
• Can understand recordings in standard dialect likely to be encountered in social, professional or academic life and identify speaker viewpoints and attitudes as well as the information content. (Listening to Audio Media and Recordings, Level B2)
In the United Kingdom, following the Education Reform Act 1988, the National Curriculum was introduced in England, Wales, and Northern Ireland, and “Key Stages” define objectives that need to be attained at different grades. In the 2014 program of Key Stage 4 (ages 14–16) for English,3 Listening is embedded with Speaking, similarly to the Common Core State Standards. Students at Key Stage 4 are expected to be taught to speak confidently, audibly, and effectively, including through:

• listening to and building on the contributions of others, asking questions to clarify and inform, and challenging courteously when necessary
• listening and responding in a variety of different contexts, both formal and informal, and evaluating content, viewpoints, evidence and aspects of presentation

In terms of grammar and vocabulary, students should be taught to consolidate and build on their knowledge of grammar and vocabulary by drawing on new vocabulary and grammatical constructions from their reading and listening, and by using these consciously in their writing and speech to achieve particular effects. The Australian Curriculum, Assessment and Reporting Authority (ACARA) has developed the Australian Curriculum, which describes the “core knowledge, understanding, skills and general capabilities important for all Australian students” (www.australiancurriculum.edu.au/). The learning progressions for literacy4 specify listening as one of the core skills:

• Literacy involves students listening to, reading, viewing, speaking, writing and creating oral, print, visual and digital texts, and using and modifying language for different purposes in a range of contexts.

Students are expected to become “increasingly proficient at building meaning from a variety of spoken and audio texts.” Descriptors such as those listed below indicate what a student at the top level is expected to be able to do:

• Evaluates strategies used by the speaker to elicit emotional responses
• Identifies any shifts in direction, line of argument or purpose made by the speaker
• Identifies how speakers’ language can be inclusive or alienating (a speaker using language which is only readily understood by certain user groups such as teenagers or people involved in particular pastimes)

Educational standards such as the ones described in this section define learning goals in the classroom and expectations for learning outcomes. Therefore, it is important to consider how these standards define listening proficiency in the academic domain when conceptualizing and operationalizing the construct of academic listening.


It is also useful to summarize here some aspects of academic listening that appear to be common across standards:

• Listening skills are critical in the academic domain in situations of both collaborative and non-collaborative listening.
• Listening abilities are expected to intertwine with speaking abilities.
• College-ready students are expected to deal with a large variety of complex listening stimuli.
• Comprehension involves skills beyond recognition and literal understanding, for example the ability to synthesize and evaluate oral information.
• The social/cultural context is a crucial facet in listening performance.

Educational standards are typically underspecified by design, as they are intended for application in diverse educational contexts. This is particularly the case with the CEFR, whose use as a policy tool in educational and non-educational contexts has been heavily criticized (Fulcher, 2004; McNamara, 2006). Because of their intentional lack of specificity, educational standards are unlikely to be sufficient for test design (Papageorgiou & Tannenbaum, 2016). However, test developers might find them useful in identifying tasks which test takers are expected to accomplish at different proficiency levels and school grades and, as mentioned earlier, in helping test users interpret test scores.

3.2.2  Listening Comprehension in the Context of English-medium Higher Education

Following an interactionalist approach (Chapelle, 1998), language ability is not only determined by an individual's abilities on a given trait, but by the interaction between those abilities and the demands of a given context. Determining academic listening ability, then, requires a detailed consideration of the contextual elements of academic listening in English-medium higher education. Recent trends in the growth of English Medium Instruction (EMI) programs, increasing mobility of academics and students, new technology, and the spread and growth of English itself have made theorizing listening ability across educational contexts much more complex. In this section, we survey several contextual issues and their implications for conceptualizing listening in academic settings.

3.2.2.1  Academic Needs and Genres

Early academic listening needs analyses found that students needed listening skills to understand academic lectures and participate in small-group discussions, and that the skills engaged did not vary much within these contexts (Ferris, 1998; Ginther & Grant, 1996; Rosenfeld, Leung, & Oltman, 2001; Waters, 1996).


However, subsequent empirical needs analyses (e.g., Kim, 2006; Less, 2003) and studies of spoken academic genres suggest a wider range of academic listening skills are needed than was previously thought. For example, Kim (2006) examined the academic oral communication needs of East Asian international graduate students in non-science and non-engineering majors in the United States. Frequently required tasks included participating in whole-class discussions, raising questions during class (and presumably, understanding responses), and engaging in small-group discussions. Students perceived formal oral presentations and strong listening skills as the most important elements for academic success.

Perceptions of the importance of listening, however, vary across international academic settings. Various needs analyses have been conducted in English-medium academic contexts outside of North America (e.g., Chia, Johnson, Chia, & Olive, 1999; Deutch, 2003; Evans & Green, 2007; Fernández Polo & Cal Varela, 2009; Kormos, Kontra, & Csölle, 2002). These studies investigated academic English needs across four language skills: speaking, writing, listening, and reading. The importance of listening skills varied across studies. For example, Kormos et al. (2002) found that English major students in Hungary ranked listening to lectures and taking notes while listening as the two most important skills in the academic domain. The importance of listening skills was also reported in Chia et al.'s (1999) study in a Taiwanese context. However, Evans and Green's (2007) study in Hong Kong suggested that listening skills are of less concern than the other three skills, although students expressed difficulty in comprehending unfamiliar and technical vocabulary. Also, in the context of legal English courses in Israel (Deutch, 2003), law students ranked listening as the second most important skill next to reading. This study found that productive skills were ranked as unimportant in practical terms. On the whole, these needs analyses demonstrate that there is an expanding circle of English for Academic Purposes (EAP) domains around the world, and that the importance placed on academic listening will depend on both the specific setting and disciplinary norms.

3.2.2.2  Technology-mediated Learning

Distance learning programs have grown in number and accessibility as institutions have increasingly employed technology and utilized the internet as a medium of instruction (Warschauer, 2002; Watson Todd, 2003). Studies have examined various aspects of these distance learning programs, including the types of academic tasks and teaching tools that are used, and various forms of blended learning. Most of these studies focus on undergraduate settings, with a few exceptions that focus on graduate students (e.g., Sampson, 2003; Yildiz, 2009).

Several studies have focused on teaching tools and how they are utilized for academic tasks in distance learning. Cunningham, Fägersten, and Holmsten (2010), for example, examined the use of tools available to assist communication among students and teachers in a synchronous and asynchronous online EAP learning environment at Dalarna University, Sweden.


The program employed a web-based learning platform which provides asynchronous document exchange, collaborative writing tools, e-mail, recorded lectures in a variety of formats, live-streamed lectures that simultaneously allow text-based questions to be submitted to the lecturer in real time, text chat, and audiovisual seminars using desktop videoconferencing systems.

In Yildiz's (2009) study of graduate-level web-based courses, an asynchronous computer conferencing application called SiteScape Forum was introduced and enabled multiple users to share documents and hold threaded discussions. Yildiz reported that this application facilitated active class participation, discussion, and collaboration between instructors and students by posting course information (e.g., questions on assignments or lectures) and sharing professional experiences. While some distance learning contexts have required academic listening skills for live-streamed lectures (e.g., Cunningham, Fägersten, & Holmsten, 2010), academic content is often delivered through an asynchronous web-based computer conferencing forum application where students share documents or post their opinions to participate in discussions without requiring students to listen to the lectures (e.g., Yildiz, 2009). However, as technology for real-time, embedded video conferencing develops (e.g., Zoom), synchronous, computer-mediated group discussion is likely to become an integral part of distance learning.

Various forms of blended learning (such as the use of web-based materials as a supplement to face-to-face classroom learning) have been increasingly employed and appear to be favored by students, which may be due to the increased use of technology in recent years (e.g., Felix, 2001; Harker & Koutsantoni, 2005; Jarvis, 2004). For example, Harker and Koutsantoni (2005) investigated the effectiveness of web-based learning and found that a blended learning mode was more effective in terms of student retention than distance learning for non-credit-bearing optional EAP courses. The benefit of the blended learning mode could be seen in the form of increased interactivity and more efficient communication among class participants. Also, the growth in the use of PowerPoint in delivering lectures has significantly changed the nature of listening and note-taking (Lynch, 2011). Students might be less dependent on understanding lectures aurally but need to refer to and process written dimensions of the lecture input at the same time. Nevertheless, research suggests that this trend has not affected the teaching and learning process radically, but rather has become part of the blend of program delivery.

The use of distance learning in delivering lectures and content has several implications for academic listening needs and assessment. Students increasingly need to use integrated language skills in distance learning contexts since students not only listen to live-streamed or asynchronous lectures but also participate in interactive activities using various teaching tools. For example, students participate in threaded discussions based on their understanding of lectures or they refer to textbooks while listening to lectures.


In addition, students also refer to various resources during lectures: not only textbooks but also web-based materials or visual resources (e.g., PowerPoint slides). Therefore, students' academic listening needs might not be limited to merely understanding an academic lecture but may also include the ability to comprehend a lecture while referring to various resources and to express an understanding of the integrated content either in written or spoken form. However, apart from the operational challenges for a large-scale test of academic English, an additional issue with the inclusion of technology-assisted tools relates to the potential introduction of construct-irrelevant variance, because test takers might differ in their familiarity with such tools and their need to use these tools in the TLU domain.

3.2.2.3  Pragmatic Understanding

Pragmatic understanding is an important subcomponent in most of the models of listening comprehension discussed earlier; however, it is still an under-represented feature of many academic listening tests (see Grabowski, 2009; Roever, 2011; Timpe Laughlin, Wain, & Schmidgall, 2015). We examine this aspect of the listening construct because pragmatic understanding might arguably be one of the key listening competences in higher education contexts where listening spans multiple registers and contexts, and where the ability to draw inferences about, for example, an interlocutor's stance, is of vital importance.

Several component models of pragmatic knowledge make a primary distinction between what Leech (1983) describes as a pragmalinguistic component and a sociopragmatic component. The pragmalinguistic component refers to the relationships between a speaker's utterances and the intended functions of those utterances, often described as speech acts. The sociopragmatic component refers to how language users utilize features of the communicative context in order to deliver appropriate utterances. Based on a model originally developed by Bachman (1990), Bachman and Palmer (2010) described the pragmalinguistic component as functional knowledge, or knowledge of the intended function of an utterance. Types of functions of an utterance include ideational (expressing propositions, knowledge, or feelings), manipulative (instrumental, regulatory, interactional), heuristic (extending knowledge), and imaginative (esthetic, metaphoric, or fantastical) functions. Bachman and Palmer refer to the sociopragmatic component as sociolinguistic knowledge. Sociolinguistic knowledge is composed of a few subcomponents, including knowledge of (a) dialect or variety, (b) register, (c) genre, (d) naturalness of expression, and (e) cultural references and figures of speech.

Buck's (2001) component model of pragmatic knowledge largely follows the organization of Bachman's (1990) model, with several distinctions. Buck labels the pragmalinguistic component pragmatic knowledge and adds three subcomponents to those listed above: indirect meaning/hints, pragmatic implications, and text-based inferences. Buck adopts Bachman's (1990) organization for the subcomponents of sociolinguistic knowledge, renaming knowledge of naturalness to knowledge of the appropriacy of linguistic forms.


Purpura (2004) developed a component model of pragmatic meaning that more explicitly considers the context in which non-literal utterances are produced or interpreted. In Purpura's model, pragmatic meaning may be further deconstructed as (a) contextual meanings, (b) sociolinguistic meanings, (c) sociocultural meanings, (d) psychological meanings, and (e) rhetorical meanings. Contextual meanings are understood within the context of interpersonal relationships wherein mutually understood utterances are the product of personal familiarity or intimacy. Sociolinguistic meanings require a shared understanding of social meanings, which may include social norms, preferences, and expectations. Sociocultural meanings require knowledge of cultural norms, preferences, and expectations. Psychological meanings derive from understanding or communicating affective stance. Finally, rhetorical meanings require knowledge of genre and organizational modes (e.g., the structure of an academic lecture in a particular discipline). Thus far, Purpura's more nuanced component model of pragmatic meaning has not been applied directly in listening assessments, where pragmatic knowledge is rarely assessed explicitly.

Components of pragmatic knowledge have also been discussed in terms of pragmatic phenomena. For example, Roever (2005) describes pragmatic knowledge in terms of three components (speech acts, implicature, and routines), and suggests that the construct of pragmatic knowledge could be expanded by adding two more components: conversational management and sociopragmatic knowledge. Timpe (2013) focuses on four components, roughly designed to correspond to Hudson et al.'s (1992) model: speech acts, implicature, routines, and politeness. Each of these components can of course be further subdivided, and it may be important for a language proficiency test to be specific about how these components are represented. For example, a test that includes items testing understanding of performative verbs can be said to test understanding of speech acts, but it would be testing a very narrow aspect of speech acts. Similarly, there are many types of implicature, so it is important to define the types represented in a test, not only to justify construct claims but to provide clear guidelines for consistent test development. A few types of implicature that might be represented on an English language proficiency test are discussed below.

Although operationalizing a construct of pragmatic knowledge or understanding may provide its own challenges, failing to consider the construct in a broad sense may lead to its underrepresentation (Grabowski, 2009; Roever, 2011, 2013). One way to operationalize pragmalinguistic and sociopragmatic knowledge might be to explicitly contextualize tasks to elicit evidence of a listener's ability to infer pragmatic meanings, and, in particular, to evaluate their appropriateness (e.g., sociolinguistic, sociocultural, and psychological meanings). However, there are several reasons why doing so could be particularly challenging. First, because pragmatic meaning is typically evaluated in terms of its appropriateness (e.g., Grabowski, 2009), speech acts typically are not evaluated in terms of "right" or "wrong," but as "more" or "less" appropriate.


These judgments of appropriateness may not necessarily be consistent across the individuals who form the social, institutional, or cultural reference group, which poses challenges for assessment (Roever & McNamara, 2006). Also, for a domain as broad as that of academic English in higher education, it may not be possible to insert some aspects of pragmatic meaning in a systematic and valid manner into a test task. The TLU domains to which large-scale tests, such as the TOEFL iBT test, aim to generalize include university settings across the globe. Given the variation within this broad TLU domain with respect to contextual, sociolinguistic, and sociocultural groups, contextualizing discourse with respect to each group would provide a substantial challenge for test developers.

3.2.2.4  The Interactive Dimension of Academic Listening

Academic lectures have typically been considered as monologues or as a mode of one-way academic listening (Lynch, 2011). However, classrooms have become more informal and interactive as academic lectures now increasingly involve more than one speaker, or an increased amount of unplanned speech. Students are expected to be more actively involved in various communicative genres that require active listening skills, including interactive lectures, small-group discussions, team projects, tutorials, seminars, conference presentations, and meetings with advisors (Ferris, 1998; Lynch, 2011; McKeachie, 2002; Murphy, 2005). Research suggests that students' challenges in interacting with their advisors during office hours (e.g., Davies & Tyler, 2005; Krase, 2007; Skyrme, 2010) or with tutors (e.g., Basturkmen, 2003) may be traced to comprehension difficulties. This embeddedness of listening within interactive communication also suggests that listening may be measured most authentically through tasks which involve interaction with one or more interlocutors.

Various terms have been employed to characterize the interactive dimension of listening, including active listening (Ulijn & Strother, 1995), reciprocal listening (Lynch, 1995), collaborative listening (Buck, 2001; Rost, 2005), and two-way listening (Lynch, 2011). Ulijn and Strother (1995) refer to active listening skills as the ability to give verbal and non-verbal feedback, such as verbal signals or eye contact; these skills help distinguish skilled and average negotiators.

Several recent studies have conceptualized academic lectures as a social event that encourages student participation, as opposed to merely informational texts. For example, Morita (2000, 2004) examined reciprocal discourse in academic lectures and students' socialization process in classroom speech events, including oral academic presentations and open-ended class discussions. Other researchers have emphasized the role of interactive language use in academic lectures. Morell (2004) analyzed four linguistic aspects of interactive and non-interactive lectures and found differences in the quantity and quality of language use (e.g., personal pronouns, discourse markers, questions, meaning negotiation) based on analyses of a corpus of academic lectures at the University of Alicante, Spain. These findings were then used to enhance student participation in lectures.


Farr (2003) conducted a study that investigated the multi-functional nature of listenership devices (e.g., continuer or back-channel tokens like mm hm) in spoken and aural academic discourse. The study suggested that students benefited from understanding and using these devices for interactive and pragmatic purposes in academic listening. Research based on the TOEFL 2000 Spoken and Written Academic Language (T2K-SWAL) corpus (Biber, Conrad, Reppen, Byrd, Helt, et al., 2004) suggested that academic registers with informational purposes, such as classroom teaching and study groups, involved highly interactive features (e.g., Biber, Conrad, Reppen, Byrd, & Helt, 2002). For example, the use of first- and second-person pronouns, wh-questions, emphatics, and amplifiers reflected interpersonal interaction and a focus on personal stance in academic spoken discourse. Another corpus-based study by Csomay (2006), also based on the T2K-SWAL, investigated linguistic features of class sessions and found, inter alia, that class sessions exhibited characteristics of both face-to-face conversations and academic prose.

In summary, the importance of interactive listening in academic discourse appears to be increasing, at least in certain academic settings, as university classrooms become more interactive and informal, and as students are expected to perform various communicative tasks involving an active listening ability. This argument is supported by empirical findings from both qualitative and corpus-based quantitative studies. The interactive dimension of academic listening has a range of implications for language learning, teaching, and testing. For test developers to increase the authenticity of tasks, it may be helpful to investigate various linguistic and paralinguistic features in interactive listening tasks that might be different from those in monologic lectures (e.g., Ockey, Papageorgiou, & French, 2016; Read, 2002).
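To illustrate how corpus-based findings of the kind reviewed in this section are typically derived, the short Python sketch below counts two of the interactive features mentioned above (first- and second-person pronouns and wh-questions) in a lecture transcript. It is a minimal, hypothetical illustration: the word lists and the per-1,000-word normalization are our own simplifications and do not reproduce the tagging procedures used in the T2K-SWAL studies.

import re

# Simplified, illustrative word lists; a real corpus study would use a proper tagger.
FIRST_SECOND_PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your"}
WH_WORDS = {"what", "why", "how", "who", "where", "when", "which"}

def interactivity_profile(transcript: str) -> dict:
    # Tokenize on letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    n = len(tokens) or 1
    pronoun_count = sum(t in FIRST_SECOND_PRONOUNS for t in tokens)
    # Treat a sentence as a wh-question if it ends in "?" and starts with a wh-word.
    sentences = re.split(r"(?<=[.?!])\s+", transcript.strip())
    wh_count = sum(
        s.rstrip().endswith("?") and s.split()[0].lower().strip("?") in WH_WORDS
        for s in sentences if s.split()
    )
    # Normalize per 1,000 words so transcripts of different lengths can be compared.
    return {
        "pronouns_per_1000_words": 1000 * pronoun_count / n,
        "wh_questions_per_1000_words": 1000 * wh_count / n,
    }

print(interactivity_profile("What do you think the author means? I see your point."))

Comparing such profiles for interactive lectures, monologic lectures, and study-group talk would be one rough way to quantify the differences in interactivity that the studies above report.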

3.2.2.5  Multimodality and Meaning Making

Researchers have pointed out that visual information often accompanies aural information in communicative settings and suggest that listening tasks should include the visual channel to correspond more closely to TLU situations (Buck, 2001; Lynch, 2011). However, the extent to which visual information (still images, animation, or video) adds construct-irrelevant variance has been a matter of contention (see Wagner & Ockey, 2018, for a detailed discussion of this topic related to test development). Gruba (1997) discussed the role of video text in listening assessment and outlined concerns that language testers need to address. These concerns include a good understanding of the role of non-verbal communication in listening comprehension, the practical demands of video for test developers (cost, control, convenience), and methodologies that may be useful to employ in future research – including verbal protocols, direct observation, extended interviews, and text analysis. Based on retrospective verbal reports with test takers, Gruba (2004) proposed a seven-category framework for describing features of video texts used in listening comprehension assessments and suggested that visual information should be considered in an expanded construct of listening comprehension.


The interactions of test takers with video input during listening comprehension tests have been shown to vary. Wagner (2008) found through verbal protocol analysis that the eight test takers of a video listening test varied in the extent to which they attended to non-verbal information. For example, some test takers used speakers' facial expressions to contextualize and anticipate their utterances, while others did not. Ockey (2007) used observation, verbal protocols, and interviews to examine how six test takers interacted with still images or video text in an L2 listening comprehension test. He found that while test takers engaged minimally and similarly with the still images, the ways and degree to which they engaged with the video stimulus varied widely.

Research has also suggested that using video input versus audio-only input impacts test scores. Wagner (2010) used a quasi-experimental design to investigate whether test takers would score higher in an audio-only version of a test (control) or video-enhanced version (experimental). While group differences were not observed on pre-test scores, test takers in the experimental group received significantly higher post-test scores than the test takers in the control group, both overall and across task types (lecture and dialogue). Wagner proceeded to examine 12 items on the post-test where the group difference exceeded 10% and provided a rationale for six of the items. Four of the items included gestures that appeared to enhance comprehension, while two of the items included photographs that aided comprehension. However, no rationale could be provided for the group difference observed in the remaining six items. Wagner also found that test takers reported positive opinions of video texts in a test of L2 listening comprehension. According to test-taker survey responses, the use of video may have lowered their anxiety, increased their interest in the tasks, and helped focus their attention. Interestingly, though, test takers did not believe that the use of video helped them score higher on the test.

The emergence of new research technologies has helped to shed new light on the nature of multi-modal listening comprehension. Suvorov (2015), for example, used eye-tracking to measure the amount of time listeners spent looking at either "content" videos (those where the images matched the content) or "context" videos (those where the images provided information about the setting of the talk, speaker role, etc.). Among other findings, Suvorov found that content videos had a significantly higher dwell time than context videos (58% versus 51%), suggesting that participants found the content videos more informative and useful than the context videos. Batty (2017) also used eye-tracking to examine what listeners pay attention to on a video listening test based on conversational interactions. Batty measured the extent to which participants focused on speakers' faces, hands and bodies, or on objects in the scene or the general setting. Findings indicated that listeners predominantly paid attention to speakers' faces, and particularly to their eyes and mouths. The focus on speakers' faces was more pronounced on items that tested implicit (pragmatic) meaning compared with explicit (propositional) meaning. This research highlights the complexity of the relationship between, on the one hand, features of the video, and on the other, the point tested in the listening item.
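As a rough illustration of how a dwell-time comparison such as Suvorov's (2015) is computed, the sketch below aggregates fixation durations by area of interest and reports each area's share of total looking time. The record format, labels, and numbers are invented for illustration; operational eye-tracking software exports far richer data.

from collections import defaultdict

# Hypothetical fixation log: (area_of_interest, fixation_duration_ms) pairs.
fixations = [
    ("content_video", 420), ("content_video", 310), ("context_video", 280),
    ("off_screen", 150), ("context_video", 230), ("content_video", 390),
]

def dwell_time_shares(records):
    totals = defaultdict(int)
    for aoi, duration_ms in records:
        totals[aoi] += duration_ms
    grand_total = sum(totals.values()) or 1
    # Dwell-time share = proportion of total fixation time spent on each area of interest.
    return {aoi: total / grand_total for aoi, total in totals.items()}

for aoi, share in dwell_time_shares(fixations).items():
    print(f"{aoi}: {share:.1%}")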


The issue of whether to include visual support when assessing listening in a test of academic language proficiency remains an important topic for construct operationalization. However, visuals might impact test takers differently, and the type of visual support might have differential impact on test performance (Suvorov, 2015). For example, test developers should balance the anticipated enhancements to construct operationalization against the complexities introduced to test administration when using video. Use of video would undoubtedly increase the situational authenticity of a listening task. However, previous research which shows low rates of attention to a visual channel in listening assessment suggests less certainty that the use of video would increase interactional authenticity significantly in listening tasks. Situational authenticity, that is, the perceived match between test tasks and real-life tasks (Bachman, 1991; Lewkowicz, 2000), is an important aspect of test design. However, as Bachman (1990) points out, interactional authenticity is another important aspect of test tasks. Interactional authenticity is concerned with the interaction between the language user, the context, and the discourse and focuses on the extent to which test performance can be interpreted as an indication of the test taker's communicative language abilities. Although it is inevitable that operational constraints will limit the situational authenticity of listening test tasks, it is critical that test developers consider ways to maintain the interactional authenticity of these tasks. Situational authenticity and interactional authenticity are of particular concern in listening tasks where test takers are simply overhearers (Papageorgiou, Stevens, & Goodwin, 2012), or where the information conveyed through the visual channel simply complements rather than extends (or actually contradicts) the audio channel. Ultimately, the additional operational effort to include video might not be justified, and still images – or animated PowerPoint-style presentations – might provide sufficient situational authenticity without threatening the practicalities of ongoing test development.

3.2.2.6  Emerging Academic Discourse from the Perspective of English as a Lingua Franca (ELF)

Language proficiency is increasingly being conceptualized as the ability to communicate across linguistic varieties or as a multidialectal competence. However, research suggests that familiarity with specific accents may help facilitate listening comprehension, which raises important measurement issues for language testers from the perspective of English as a lingua franca (ELF), World Englishes, and English as an international language (EIL), terms we discuss below.

As the use of English as a second or foreign language has increased, similar but sometimes competing descriptions of language use communities have developed. The term World Englishes describes local discourse communities in reference to Kachru's (1985) inner, outer, and expanding circle conceptualization, based on traditional sociolinguistic and historical bases of English users.


English as a lingua franca (ELF) refers to the discourse produced by language users from different native language backgrounds who choose to communicate in English, most commonly as a foreign language (Firth, 1996). EIL encompasses the World Englishes and ELF perspectives (Canagarajah, 2006) and can essentially be viewed as a localized version of ELF (Brown, 2004). According to Jenkins (2012), the terms "ELF" and "EIL" have been regarded as synonymous, with EIL less used in recent years because of its ambiguity.

Early ELF research was characterized by an attempt to describe commonalities in discourse across groups of language users, and even proposed simplified versions of English for functional, standardized use (e.g., Basic English/Simple English, Threshold Level English, Globish, Basic Global English). However, this view has since shifted to a view of ELF as a research paradigm seeking to understand what makes communication successful in environments where interlocutors do not share the same linguacultural background. There has been considerable debate about the extent to which non-standard features of World Englishes and ELF overlap to form a common core or set of universals (Chambers, 2004), and the implications for assessment. Distinctions can be made between English varieties based on several key features. Non-native varieties of English are often distinguished by pronunciation features, including reduction of consonant clusters, merging of long and short vowel sounds, reduced initial aspiration, lack of reduced vowels, and heavy end-stress (Kirkpatrick, 2010). The use of vocabulary can be quite different across varieties of English as well, due to the influence of a local language, and code-mixing is prevalent across different varieties.

Jenkins (2006) reviewed the evolving perspectives in World Englishes, ELF, and EIL with an eye on how these changes may impact language assessment. She recognized that many varieties of English are still being documented and that test developers may be hesitant to operationalize criteria for EIL. She expressed a concern that future attempts to operationalize EIL may oversimplify the varieties inherent in these Englishes, producing a "Global Standard English" that is not in accordance with an EIL perspective. Elder and Davies (2006) discussed the implications of the ELF movement for assessment and offered two proposals for expanding construct definitions to include ELF. First, they suggested that ELF users could be accommodated by tests designed around Standard English communication through a series of accommodations. Several of these suggestions are relevant to the implementation of listening stimuli. Initially, they suggested that stimuli be reviewed for potential bias against ELF users who may not have had the chance to explore particular topics or genres. Next, lexical items or structures unfamiliar to ELF users should be glossed or eliminated as they would add construct-irrelevant variance. These accommodations are designed to minimize systematic error that would depress the scores of ELF users, but may serve to create new problems. With these accommodations in place, it is not entirely clear that the construct would be defined similarly for different groups of users.


Elder and Davies' second proposal is to design the test based on an expanded construct definition which includes relevant local norm(s). A test of this kind would need to be coherent with broader norms of ELF pedagogy and communication, which typically de-emphasizes grammatical accuracy in favor of situated language use. Thus, a listening stimulus may include an authentic ELF text (e.g., international news broadcast) that contains a variety of non-native speaker accents. The focus of listening assessment would shift from comprehension of a particular norm to the ability to understand different speech varieties.

Canagarajah (2006) argued that the debate between native-speaker norms of English and World or new Englishes norms may be misleading, since norms are relative, heterogeneous, and evolving at a rapid rate. Citing the increasing amount of contact and fluidity between groups of English users, Canagarajah suggested that providers of proficiency assessments should consider including both inner circle and local varieties, and focus on a language user's negotiation of varieties rather than proficiency in any particular one. As a result, the construct definition would emphasize the following components: language awareness, sociolinguistic sensitivity, and negotiation skills. Language awareness requires language users to "discern the structure, pattern, and rules from the available data in a given language" (Canagarajah, 2006, p. 237), and is based on the presumption that speakers need competence in multiple dialects or language systems. Sociolinguistic sensitivity refers to the awareness that pragmatic norms vary across contexts. Negotiation skills are vital to this approach and may include a range of competencies and strategies, including code switching, speech accommodation, various interpersonal strategies, and attitudinal resources. In a similar proposal, Kirkpatrick (2010) calls for a multilingual model of English proficiency wherein norms based on native-speaker or inner-circle standards are replaced by an emphasis on norms that are constantly shifting. Building primarily on this line of work, Harding and McNamara (2017) identified the ability to tolerate and comprehend different varieties of English, including different accents, different syntactic forms, and different discourse styles, as a key ELF competence, which needs to be modeled in listening assessment. They also pointed out that this type of competence is best assessed within a purpose-built ELF test, as attempting to tap into this type of competence within traditional proficiency testing formats is likely to be unsuccessful.

The rapid rise of multilingual and multicultural student populations in universities around the world has led to an increased use of English as the primary medium of instruction in higher education in non-English-speaking countries (Coleman, 2006; Jenkins, 2011). For example, a three-fold increase in the number of English-medium programs in European universities between the years 2002 and 2007 was reported in Wächter (2008). Although the concept of ELF is a relatively new area of study (e.g., Seidlhofer, 2001), there has been a heightened awareness of ELF research, due in part to the development of two large ELF corpora: the corpus of English as a Lingua Franca in Academic Settings [ELFA] (Mauranen, Hynninen, & Ranta, 2010) and the Vienna-Oxford International Corpus of English [VOICE] (Vienna-Oxford International Corpus of English, 2011).


The one-million-word ELFA corpus of spoken academic discourse has been developed for two main purposes (for more details on the ELFA corpus, see Mauranen, 2006a): (a) to understand how academic discourses work, particularly in many countries using ELF, and (b) to provide a delimited database of ELF that is composed of language uses from linguistically and intellectually challenging situations rather than simple language routines. The other large-scale ELF spoken corpus, VOICE, has been compiled by the Department of English at the University of Vienna. This corpus consists of one million words of spoken ELF interactions in diverse domains (professional, educational, leisure) among ELF speakers with approximately 50 different first languages.

A number of studies have been conducted that use either the ELFA corpora or ELF interaction, investigating a broad range of topics that include managing misunderstandings among speakers (Mauranen, 2006b), syntactic features in ELF (Ranta, 2006), comparing discourse features between native speakers and speakers using ELF (Metsä-Ketelä, 2006), and ELF speakers' perceptions of their accents (Jenkins, 2009). For example, the issue of how intonational structure in ELF interaction differs from that of English native speakers is examined by Pickering (2001, 2009). Based on qualitative analysis of ELF interaction, Pickering argues for a clear role of pitch movement for intelligible and successful interaction in ELF discourse. Pickering found that ELF participants successfully oriented to pitch movement in conjunction with tone and key choices, which are used to signal trouble sources and to negotiate meaning for resolution as a pragmatics resource. However, further research on the role of intonational features in naturally occurring ELF interaction will be necessary, given the constraints of Pickering's research: Pickering (2001) found that difficulties in communication may be due to inadequate intonation choice by non-native international teaching assistants, and the data used in Pickering (2009) were collected under experimental task conditions. Overall, empirical findings show that ELF speakers systematically employ both conventional and creative ways to communicate with each other.

In addition to corpus-based research, the expanding circle of English as academic lingua franca can be seen in an increasing number of research studies in different universities in Sweden (Björkman, 2008, 2011), Finland (Hynninen, 2011), Germany (Knapp, 2011), and Norway (Ljosland, 2011). These studies focus on diverse issues around English as an academic lingua franca, partly owing to the mobility program Erasmus (European Region Action Scheme for the Mobility of University Students), established by the European Union to foster and support temporary student exchanges between higher education institutions. Erasmus has enabled over two million students to study in another European country since 1987 (Janson, Schomburg, & Teichler, 2009). Considering the expanding circle of academic discourse, more attention to investigating academic listening domains, genres, and tasks in globalized academic contexts is needed.


3.2.2.7  Shift in Faculty Demographics

Faculty demographics are an important consideration in determining the types of listening environment learners might encounter. There have been significant changes in faculty employment and policies in North American universities and colleges, including increased employment of part-time instructors, contract faculty, and post-doctoral students rather than tenure-track faculty (Clark & Ma, 2005; Leslie & Gappa, 2002). One of the most notable changes over the last two decades has been the increasing presence of international faculty in US institutions of higher education (Wells, 2007), who bring with them their own ELF practices. Despite this shift in faculty demographics, however, international faculty as a demographic group are under-researched compared to other minority groups (Wells, 2007).

Outside of the North American context, the internationalization of college classrooms in Europe, Australia, and Asia is already a common phenomenon (Manakul, 2007; McCalman, 2007). Luxon and Peelo (2009) described growing numbers of international faculty members in departments across UK higher educational settings. Their study reported on the difficulties that both students and international teaching staff experience by conducting interviews and focus groups to examine language, teaching, bureaucratic, and cultural issues. The authors found that additional support in dealing with how cultural differences impact teaching practices needs to complement support for improving the linguistic competence of international faculty. This issue is being observed elsewhere around the world, including China (Ouyang, 2004), Finland (Groom & Maunonen-Eskelinen, 2006), and Korea (Jon, 2009), indicating a growing diversity of international faculty members at institutions of higher education.

A consequence of the change in faculty demographics is that students increasingly encounter and need to understand different accents and English varieties in various academic subdomains to comprehend lectures and to communicate with international faculty. This change has potential implications for academic listening assessment, because understanding different pronunciation, prosodic conventions, and dialectal varieties of English may need to be considered as part of the construct definition. The topic of understanding accented speech is further discussed in the next section.

3.2.2.8  Comprehension of Accented Speech

Research has shown that several variables may affect listening comprehension (Brindley & Slatyer, 2002). Some of these are related to listeners' personality and experience (e.g., attitudes, motivation, topic familiarity) while others may be considered features of the aural input (e.g., accent, rate of speech, grammar, vocabulary), which may vary based on particular varieties of English (Major, Fitzmaurice, Bunta, & Balasubramanian, 2005).


The variable of speaker accent is of particular interest, as it relates to the need for communication across language varieties in the academic domain (see earlier discussion of English as a lingua franca). There are two prevailing hypotheses about the relationship between speaker accent and L2 comprehension (Harding, 2011). The "own-accent" hypothesis states that listeners more easily comprehend a speaker whose accent is similar to theirs. The "familiarity" hypothesis asserts that listeners more easily comprehend speakers whose accents they are familiar with, including, but not limited to, their speech community. Evidence to support either hypothesis is thin and is often limited by research design. Researchers have investigated the comprehensibility of different English accents by various groups of native and non-native English speakers (e.g., Derwing, Munro, & Thomson, 2008; Munro & Derwing, 1998; Munro, Derwing, & Morton, 2006; Schmidgall, 2013; Venkatagiri & Levis, 2007), but these studies are usually based on the impressionistic judgment of listeners, not a measure of listening comprehension. The results of these studies have suggested that unfamiliar accents may cause difficulty in comprehension, and that familiarity with a speaker's accent might aid comprehension, but findings have generally been mixed. Munro, Derwing, and Morton's (2006) study found very little effect for shared L1 or familiarity on intelligibility measures, with the implication that it is speaker rather than listener factors that are most important in determining intelligibility in NNS/NNS speaker–listener pairs. Schmidgall (2013) found that both speaker and listener factors influenced judgments of comprehensibility, but familiarity with the speaker's accent or native language was not a significant listener-related factor. Rather, the speaker's pronunciation as judged by expert raters was the largest language-related factor.

Among studies that have looked directly at the relationship between accented aural input and comprehension scores, results suggest that there is an impact, although findings are also mixed. Major, Fitzmaurice, Bunta, and Balasubramanian (2002) conducted a study in which four groups of listeners with different native languages (Chinese, Japanese, Spanish, American English) completed an academic English listening comprehension test that included speakers from each native language background. Results were mixed, suggesting that there was not a clear relationship between the native language of the listener and comprehension based on the accent of the speaker. However, results did suggest that introducing non-native varieties into an English comprehension test may create bias. Harding (2012) used differential item functioning (DIF) analyses to investigate whether a listener's native language (Japanese or Chinese) impacted their performance on an academic English listening comprehension test that included three English varieties (Australian-, Japanese-, and Chinese-accented). On some of the items, listeners appeared to be given an advantage by listening to speakers who shared the same native language, but the overall effect was different for native speakers of Japanese and Chinese.


For the native speakers of Japanese, there was no overall effect due to the balance of items favoring non-Japanese listeners. For the native speakers of Chinese, the advantage on Chinese-accented items was strong enough to produce an advantage on the overall test. Harding concluded that when the construct definition does not include the ability to process L2 varieties, the use of L2 varieties may introduce construct-irrelevant variance. However, he also questioned how a construct definition that ignored the ability to process L2 speech could be useful for drawing inferences about listening ability in modern academic settings. In another study, Major et al. (2005) found small but significant differences in listening comprehension scores based on English varieties (Standard American English, Southern American English, Black English, Indian English, Australian English) for groups of native and non-native English speakers in the United States.

Large-scale tests vary in the number of varieties of English in their listening stimuli. For example, IELTS utilizes North American, British, Australian, and New Zealand English varieties, while the TOEFL iBT test was limited to North American (US) varieties until 2013, when varieties similar to those found in IELTS were added, based on research findings discussed below. Empirical evidence supporting decisions to include multidialectal listening stimuli appears to be scarce. Although Harding (2012) showed that sharing a speaker's accent may result in higher scores under certain test conditions, two studies analyzing responses to experimental test items during operational TOEFL iBT test administrations (Ockey & French, 2016; Ockey, Papageorgiou, & French, 2016) offered additional insights into the testing of multidialectal listening by demonstrating that as the perceived strength of accent increases, more effort for comprehension might be required. The authors concluded that it is important to consider strength of accent before including a particular accent in a listening comprehension test.

Canagarajah (2006) described how the speaking test of the First Certificate in English examination could be revised to accommodate a larger variety of norms, including inner-circle and local norms. Specifically, two raters that interacted with test takers during a negotiation task could be required to come from different English-speaking communities, which would support the goal of assessing language use across varieties of English. However, it is unclear whether and how the raters' linguistic background might impact test-taker performance. Similarly, the Pearson Test of English (PTE) Academic claims to test the "ability to understand a variety of accents, both native and non-native" (Pearson Education, 2010), but to the best of our knowledge, there is no published research examining how the various English accents in the test may impact listener performance.
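The DIF analyses cited above are normally run with dedicated psychometric software; purely to illustrate the underlying logic, the Python sketch below computes a Mantel-Haenszel common odds ratio for a single dichotomous listening item, comparing a reference and a focal listener group matched on total-score bands. All counts, group labels, and bands are invented for illustration and do not reproduce any of the reported analyses.

import math

# Each stratum matches reference and focal test takers on total score; counts are
# (reference_correct, reference_wrong, focal_correct, focal_wrong) and are invented.
strata = [
    (40, 10, 35, 15),  # low total-score band
    (55, 5, 45, 15),   # middle band
    (60, 2, 50, 8),    # high band
]

def mantel_haenszel_odds_ratio(strata):
    numerator = denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        numerator += ref_right * foc_wrong / n
        denominator += ref_wrong * foc_right / n
    return numerator / denominator

alpha = mantel_haenszel_odds_ratio(strata)
# Values near 1.0 suggest little DIF; the ETS delta metric is -2.35 * ln(alpha).
print(f"MH odds ratio = {alpha:.2f}, MH D-DIF = {-2.35 * math.log(alpha):.2f}")

In an operational analysis this statistic would be computed for every item, flagged against established effect-size categories, and interpreted alongside qualitative review of the items and their speaker varieties.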


There remain many questions surrounding the potential effects of diverse accents on listening performance, particularly with respect to the interaction between the degree of speaker intelligibility or accentedness, the cognitive demands of the task, the linguistic complexity of the text, and levels of listener experience (see Harding, 2011; Kang, Thomson, & Moran, 2018). Recent research funded by the TOEFL Committee of Examiners (COE) program (Kang et al., 2018), however, found that there are no clear effects on scores when highly comprehensible speakers of outer- and expanding-circle varieties of English are selected for inclusion in listening test input. Setting a precise threshold for acceptable comprehensibility may still require further deliberation at the test construction stage (see also Ockey & French, 2016; Ockey et al., 2016), but these results suggest that, provided speaker selection is performed carefully, diverse speaker accents could potentially be used in listening assessment without a significant impact on task difficulty. Decisions around speaker selection would be further aided by additional empirical data concerning the range of accents found in TLU domains across international settings.
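One practical consequence of these findings is a speaker-screening step at the test construction stage, in which prospective speakers are retained only if panel ratings of their comprehensibility are sufficiently high. The sketch below illustrates such a step; the 9-point scale, the cut-off of 7.5, and the ratings are assumptions made purely for illustration, not operational values used by any test.

from statistics import mean

# Hypothetical comprehensibility ratings (1 = very difficult, 9 = very easy to understand)
# assigned by trained raters to candidate speakers of different English varieties.
ratings = {
    "speaker_A (Australian English)": [8, 9, 8, 9],
    "speaker_B (Indian English)": [8, 8, 9, 8],
    "speaker_C (Japanese-accented English)": [6, 7, 6, 7],
}

COMPREHENSIBILITY_CUTOFF = 7.5  # assumed threshold; would require validation in practice

def screen_speakers(ratings, cutoff=COMPREHENSIBILITY_CUTOFF):
    # Retain only speakers whose mean rating meets or exceeds the cut-off.
    return {speaker: round(mean(r), 2) for speaker, r in ratings.items() if mean(r) >= cutoff}

print(screen_speakers(ratings))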

TABLE 3.1  Summary of key trends and recent developments in academic listening

Academic listening needs and genre analyses
• Students interact more frequently within various communicative genres that require reciprocal and active listening skills (e.g., small-group discussions and meetings with advisors)
• Students need to utilize listening skills in conjunction with other skills, in particular speaking, although further empirical studies are needed for various academic contexts

Technology-mediated learning
• College-level undergraduate settings around the globe employ distance learning due to its effectiveness and accessibility
• Students need to use integrated language skills for synchronous and asynchronous online lectures while referring to texts and participating in threaded discussions
• Various forms of blended learning have become popular in regular classrooms with the increased use of technology, and students use integrated listening skills during such blended learning

Pragmatic understanding
• Pragmatic understanding is an important subcomponent in most models of listening comprehension; however, it is still under-represented in many academic listening tests
• Although operationalizing a construct of pragmatic knowledge or understanding may provide its own challenges, failing to consider the construct in a broad sense may lead to its underrepresentation

Interactive dimension of academic listening
• Empirical studies show that academic registers involve highly interactive features
• Non-verbal (e.g., gestures, facial cues) interactional features need to be considered as construct-relevant variance in listening comprehension

Multimodality and meaning making
• Visual information often accompanies aural information in communicative settings, which suggests that listening tasks need to include the visual channel to correspond to target language use situations more closely
• Test designers need to consider that videos might impact test takers differently, and that the type of visual support might have differential impact on test performance

English as a lingua franca (ELF) perspective
• There is a rapid rise in multilingual and multicultural student populations along with increased use of English as a medium of instruction
• Various empirical studies investigated language use and ELF communication based on large-scale ELF corpora
• The expanding circle of academic discourse implicates the use of non-native accents, English varieties, and certain interactive features that are specific to ELF interaction in listening assessment
• More studies investigating listening domains, genres, and tasks in globalized academic contexts are needed

Shift in faculty demographics
• Increased presence of international faculty in higher educational contexts around the globe
• Students increasingly encounter and need to understand different accents and English varieties in academic settings

Comprehension of accented speech
• Questions surrounding the potential effects of accented speech on listening performance remain, particularly with respect to the interaction between the degree of speaker intelligibility or accentedness, the cognitive demands of the task, the linguistic complexity of the text, and levels of listener experience
• Recent research suggests potential effects may be minimized through a careful speaker selection process to ensure high levels of comprehensibility
• More empirical data are required to understand the range of accents found across different TLU domains

3.2.2.9  Summary of Key Trends

Table 3.1 summarizes the key trends and recent developments in academic listening discourse discussed in this section.

3.2.3  A Generic Construct Definition for Assessing Listening Comprehension in EAP Settings

An interactionalist approach to construct definition emphasizes the importance of language use in context but allows for generalization about ability beyond the set of tasks that are included in the assessment procedure (Chapelle, 1998).


Drawing on the discussion of changes in theories of the construct of academic listening and in the academic domain itself in this section, we propose a generic construct definition for assessing listening comprehension in EAP settings, based on three dimensions, as shown in Table 3.2:

• the intended language subdomains
• the major communication goals
• the foundational and higher-order listening skills that are engaged to fulfill communication goals

The first two dimensions define key aspects of contextual factors, whereas the third dimension indicates knowledge, skills, and abilities on which to make inferences in relation to the context. Table 3.2 can be used by test developers as the starting point for defining the construct to be assessed by their test.

TABLE 3.2  A generic construct definition for assessing listening comprehension in EAP settings

Overall definition: The assessment of English listening comprehension for general academic purposes measures test takers' abilities and capacities to comprehend realistic spoken language in the following subdomains of the English-speaking academic domain: social-interpersonal, academic-navigational, and academic-content. The focus of the construct definition is the academic-content domain, with some representation of the social-interpersonal and academic-navigational domains. To demonstrate their abilities and capacities, test takers are required to use linguistic resources effectively to comprehend aural input sufficiently in order to select, relate, compare, evaluate and synthesize key information from listening stimuli. The listening stimuli relate to campus life scenarios and academic topics typical of the introductory college level.

Intended subdomains: Social-interpersonal; Academic-navigational; Academic-content

Communication goals: Understand main ideas, supporting details, relationships among ideas, inferences, opinions, speaker purpose, and speaker attitude.

Foundational and higher-order abilities:
• to process extended spoken language automatically and in real time
• to make use of phonological information, including intonation, stress, and pauses, in order to comprehend meaning
• to make use of lexical and grammatical information in order to comprehend meaning
• to make use of pragmatic information encoded in talk in order to comprehend meaning
• to process organization devices (cohesive and discourse markers, exemplifications, etc.) in order to understand connections between statements and between ideas
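For test developers who keep their specifications in machine-readable form, the generic definition in Table 3.2 can also be captured as a simple structured document that item writers and reviewers can query when assembling a blueprint. The Python sketch below is one hypothetical encoding; the field names and the helper function are our own and are not part of any published specification.

# Hypothetical encoding of the Table 3.2 construct definition as test-specification data.
LISTENING_CONSTRUCT = {
    "subdomains": ["social-interpersonal", "academic-navigational", "academic-content"],
    "primary_subdomain": "academic-content",
    "communication_goals": [
        "main ideas", "supporting details", "relationships among ideas",
        "inferences", "opinions", "speaker purpose", "speaker attitude",
    ],
    "abilities": [
        "process extended spoken language automatically and in real time",
        "use phonological information (intonation, stress, pauses) to comprehend meaning",
        "use lexical and grammatical information to comprehend meaning",
        "use pragmatic information encoded in talk to comprehend meaning",
        "process organization devices to understand connections between ideas",
    ],
}

def uncovered_goals(item_bank_goals):
    # Return the communication goals not yet targeted by any item in a draft item bank.
    covered = set(item_bank_goals)
    return [g for g in LISTENING_CONSTRUCT["communication_goals"] if g not in covered]

print(uncovered_goals(["main ideas", "inferences", "speaker attitude"]))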


3.3  Operationalizing Academic Listening in Major English Language Tests for Admission Purposes

3.3.1  Overview

In the previous section, we demonstrated that listening ability is multifaceted. For testing purposes, adequately capturing listening involves careful definition of the knowledge, skills, and abilities required to process and interpret spoken language, and consideration of the contextual demands that determine successful performance in a given TLU domain. At the same time, assessing listening presents several key practical constraints, which may restrict the extent to which theoretical and contextual facets of listening can be modeled within assessment tasks. Given these considerations, perhaps it is not surprising that academic listening has been operationalized in a wide variety of ways by different high-stakes language proficiency tests. In this section, we briefly review the approach to listening adopted by three major academic English listening tests used for university admissions purposes: TOEFL iBT, IELTS, and PTE Academic. Each test has been chosen to demonstrate a distinct approach to the operationalization of academic listening proficiency. A short description will be provided along with a more detailed profile of the characteristics of each listening test.

3.3.2  Profiles of Academic English Listening Tests

3.3.2.1  TOEFL iBT

The TOEFL iBT listening section is one of the four sections on the test (the others being reading, writing, and speaking). The listening section focuses chiefly on measuring the English listening skills required to function across a range of academic disciplines, as the test design overall places emphasis on measuring English for academic purposes. Test takers listen to three or four lectures (both monologic and interactive) representing different academic areas, each about five minutes long. Test takers also listen to two or three conversations representing typical campus interactions with faculty, staff, and fellow students, each about three minutes long. The variation in the number of lectures and conversations arises because some test takers are administered additional items, which are not used for scoring but for piloting and other purposes. Listening items aim to assess test takers' ability to understand main ideas and important details, recognize a speaker's attitude or function, understand the organization of the information presented, understand relationships between the ideas presented, and make inferences or connections among pieces of information (ETS, 2020; Ockey et al., 2016; Sawaki & Nissan, 2009). The distinctive elements of TOEFL iBT's operationalization of academic listening are the procedure of note-taking followed by question response, the use of images to provide context and to reinforce content (e.g., showing key terminology on a whiteboard), the focus on pragmatic understanding, the computer-based delivery, and the contextualization of the input in the academic domain (see Table 3.3).


TABLE 3.3  Summary of TOEFL iBT Listening section

Mode of delivery: Computer-based
Timing: 41–57 minutes
Number of listening texts: 5–7 (2–3 conversations; 3–4 lectures)
Number of questions: 28–39
Topics: Conversations relevant to an academic domain; lectures on academic topics
Interactivity: Conversations include two speakers; lectures are both monologic and interactive (interactive lectures include questions to and/or from the audience)
Accents: North American, Australian, British, and New Zealand
Number of times heard: Once only
Visual input: Still images contextualize all input (e.g., an image of a lecturer); some still images reinforce oral content (e.g., a key term on a whiteboard)
Response format: Four-option multiple-choice questions; grids; drag-and-drop
Other details: Questions are answered after recordings have ended; listeners are allowed to take notes while listening, and these notes can be drawn upon to answer the questions. The integrated tasks of the Speaking and Writing sections require test takers to listen to conversations and lectures (at a lower difficulty level than in the Listening section) as input for productive speaking and writing tasks; listening ability is not directly scored in the Speaking or Writing sections.


3.3.2.2  IELTS

Like TOEFL iBT, the IELTS Listening section is one of four sections on the test (the others being reading, writing, and speaking). It is composed of four major parts, with a total of 40 questions (www.ielts.org/en-us/about-the-test/test-format). The first and second parts are related to social contexts and general topics, whereas the third and fourth parts are related to academic contexts (Phakiti, 2016). This substantial focus on the social/general domain is because the same listening section is used across IELTS Academic and IELTS General Training. According to the above website, listening items aim to assess test takers' ability to understand main ideas or important details, and to recognize relationships and connections among pieces of information. The distinctive elements of IELTS's operationalization of academic listening are the range of response formats in the various task types (multiple-choice items, as well as short-answer questions), the focus on argument development, and the brevity of the listening section (about half the length of the other academic listening assessments described here) (see Table 3.4).


TABLE 3.4  Summary of IELTS Listening section

Mode of delivery: Paper-based (some computer-based testing available)
Timing: 30 minutes
Number of listening texts: 4 (2 conversations; 2 lectures)
Number of questions: 40
Topics: One conversation and one lecture on a general topic; one conversation and one lecture on an academic topic
Interactivity: Conversations feature two speakers; lectures are monologic
Accents: British, Australian, North American, and New Zealand
Number of times heard: Once only
Visual input: None
Response format: A range of selected-response (e.g., multiple-choice) and limited-production (e.g., short-answer) questions
Other details: IELTS has been described as a "while-listening" exam in that listeners are expected to complete their answers to questions as they listen, rather than taking notes and answering later. Two notable elements of the IELTS exam are that answers to limited-production items must be spelled correctly and must be grammatically congruent with the stem (presumably to prevent differences in inter-scorer reliability across the vast number of test centers in which IELTS exams are marked on-site).

3.3.2.3  Pearson Test of English Academic

The PTE Academic is a computer-delivered language test which presents a different testing approach compared with TOEFL iBT and IELTS. Listening forms one of three main test sections (the others are Speaking and Writing, and Reading). There are eight tasks, of which three contribute to the listening score only, three contribute to the listening and writing scores, and two to the listening and reading scores (Pearson, 2019). Variation in the duration of the section occurs because different versions of PTE Academic are balanced for total test length (Pearson, 2019). PTE Academic is distinguished from the other tests presented here by its use of multiple shorter, discrete, and decontextualized listening tasks without strong emphasis on the language domain (e.g., dictation based on a short sentence) and by its use of authentic recorded material (rather than texts developed and recorded in-house) (see Table 3.5).


TABLE 3.5  Summary of PTE Academic Listening section

Mode of delivery: Computer-based
Timing: 45–57 minutes
Number of listening texts: 17–25 (ranging from 3–5-second sentences to 60–90-second academic monologues)
Number of questions: 17–25 (several scored for partial credit)
Topics: Ranging from more general to academic
Interactivity: Most texts are monologic
Accents: A range of native and non-native speaker accents
Number of times heard: Once only
Visual input: None
Response format: Summarizing spoken text; multiple-choice questions (single-/multiple-answer); fill in the blanks; select correct summary of text; select missing word; highlight incorrect word(s); dictation
Other details: Several PTE Academic Listening tasks are essentially integrated tasks, with scores feeding into the scores for two different skills. For example, the first task on the test is "Summarizing spoken text"; scores on this task feed into both the listening communicative skill score and the writing communicative skill score.


3.3.3  Summary

This section has demonstrated that there are substantively different approaches to operationalizing listening constructs across the three major large-scale tests of academic English. These variations seem to be the product of different theoretical bases for conceptualizing academic listening, different understandings of the contextual elements most salient to TLU domains, different practical limitations resulting from considerations of test length, mode of delivery, scoring systems, and so on, or interactions among all three elements. Because of these considerations, the tests vary both with respect to their authenticity and specificity as academic tests of English, and in their efficiency and practicality regarding test delivery. For the TOEFL iBT listening test, and other tests in this style, which have sought primarily to represent the range of listening demands learners may face in academic environments, the challenge will remain to enhance the test with respect to construct representation while not compromising the practicality of test delivery. The next section therefore explores these challenges with respect to the changing communicative needs identified in Section 3.2.

3.4  Directions for Future Research

3.4.1  Multimodal Listening: The Use of Visual Information

Still images are typically employed by test developers to offer visual cues during a listening assessment. For example, every listening stimulus in the listening section of the TOEFL iBT test contains at least one context visual, such as one depicting the roles of the speaker or speakers and the context, be it a classroom or office. Some listening stimuli also contain content visuals, which can range from a notepad or whiteboard showing technical terms or names used in the conversation or lecture to visuals that reinforce lecture content, such as a diagram illustrating a process being discussed. To increase authenticity, a test provider might consider presenting listening stimuli on video (Douglas & Hegelheimer, 2007; Vandergrift & Goh, 2012). The use of video would provide context cues to listeners, as still images do, but as noted earlier in this chapter, it would also provide listeners with non-verbal information, including gestures and facial expressions, which listeners have access to in real-world academic listening situations. This use of non-verbal information, as Douglas and Hegelheimer (2007) point out, is an integral part of communicating meaning in interpersonal exchanges such as face-to-face conversations, in lectures, and in visual media such as television and films; therefore, the lack of video could be treated as construct underrepresentation. Even if it is acknowledged that listening tests that lack video underrepresent the construct, it is nevertheless possible to test crucial aspects of listening comprehension without the use of video. Increasing construct representation needs to be weighed against both construct concerns and practical development concerns (see Batty, 2017). One advantage of using still images exclusively relates to the practical challenges of developing video stimuli. Moreover, still images are less likely to raise fairness issues during the measurement of listening skills, as test takers have been found to engage with still images in a limited and uniform way, as opposed to their engagement with video (Ockey, 2007). In addition, as discussed earlier in this chapter, it is not clear whether adding video leads to test scores that provide more valid inferences about listening ability (Wagner, 2010) than scores derived from tests using still images. However, following Wagner (2010), it could be argued that visual elements are part of the TLU domain assessed by academic language tests, in particular given the changing nature of lectures and other academic activities because of technological advances. Consequently, the lack of video in the listening section of an academic language test could be regarded as construct underrepresentation, and hence a threat to the validity of the listening scores. The use of video might also make the processing of more interactive listening easier, as test takers will have more opportunities to distinguish the voices of multiple speakers (Taylor & Geranpayeh, 2011). Based on these considerations, future research should focus on investigating topics such as the following:

• Investigating the extent to which the use of multimedia input enhances the interactional authenticity of a language test
• Investigating the relationship between the use of video input by test takers and their test scores
• Investigating the effects of using different types of visual support (for example, animated PowerPoint slides) with different text types and in different academic learning contexts, including distance learning contexts

3.4.2  ELF Perspectives in Assessing Listening: Including Diverse Speaker Accents

As we discussed earlier in this chapter, adding any type of accent to a listening test of academic language proficiency on the assumption that accents are part of the construct is not best practice. The inclusion of a variety of accents in a listening test requires careful consideration of the TLU domain, the nature of assessing listening when the test taker assumes the role of overhearer, measurement concerns about fairness, and empirical evidence of the impact of accent familiarity on test taker performance. Research (Ockey & French, 2016; Ockey et al., 2016) suggests that using at most a mild version of a British and perhaps Australian accent would not differentially impact the performance of TOEFL iBT test takers; thus this is a justified approach to balancing the tension between construct representation and measurement concerns about fairness. The research did not address the question of the impact on test takers of being exposed to multiple lectures with accents, however, as each test form contained only one lecture with a non-US accent. It is possible that there is a cumulative effect of unfamiliar accents on test-taker performance. Future research should address this question before multiple accents in a test can be considered. The inclusion of multiple stimuli with non-US English accents is an issue not only of test administration considerations (e.g., content comparability of test forms; recruitment and retention of voice actors), but also of construct representation and empirical support. A listening test whose target domain is English-medium universities can justify the use of a variety of accents at the test design level, but this is much harder to implement at the test form level. A student who intends to study at an English-medium university in the Middle East might face very different language demands than a student who intends to study in the United States. A test form that makes use of two or three accents might more appropriately represent some TLU domains. Additionally, empirical evidence is needed to investigate whether accents of any variety do not differentially impact performance, regardless of degree of familiarity, and, following from this, whether an intelligibility "threshold" could be established to ensure at the stimulus recording stage that selected speakers would be equally comprehensible to all listeners. One question that was not addressed by recent research (Ockey & French, 2016; Ockey et al., 2016) was whether listening stimuli using a particular accent should also make use of lexical items and phrases associated with that accent. Adding such items and phrases could support the claim of adequate construct representation. At the same time, adding lexical items and phrases associated with different accents would complicate and probably slow the test development process. Moreover, it is not clear that the inclusion of variety-specific lexis is crucial to anything beyond a sense of speaker authenticity. Because terminology that is technical or likely to be unfamiliar to a large proportion of test takers is routinely glossed (often indirectly) in the test, regionally specific lexical choices would also need to be glossed for consistency, and as such would be unlikely to have any impact on broadening the test construct. Future research could explore a variety of topics related to accent in academic listening tests as follows:

• Explore the nature and distribution of particular accents in academic settings. Abeywickrama (2013) points out that in two major US universities, international teaching assistants constitute about 20% to 30% of all teaching assistants. Thus, surveys of academic institutions are needed to determine the nature and distribution of accents that students encounter, and to inform decisions about the inclusion of specific accents in an academic listening test. This line of research could also explore the extent to which construct considerations dictate the creation of parallel test forms in which the content of the listening section is identical, but the accent of the speakers differs, to more adequately cover accents encountered by students in specific higher education contexts.
• Determine the attitudes of admissions personnel, faculty, and students to the inclusion of non-native English varieties in the listening section of academic language tests such as the TOEFL iBT test. Although language testers generally recognize the need for a variety of accents to assess L2 listening ability, the attitudes of various stakeholders towards multidialectal listening might not be consistent and may pose a barrier to the successful implementation of any change. In a survey by Gu and So (2015), a similar percentage of respondents in four groups (test takers, English teachers, score users, and language testing professionals) stated that all accents found on university campuses should be included in a test of academic English. However, there were fewer positive responses by test takers regarding the need to include a variety of Standard English accents or non-standard accents by highly proficient speakers. Harding (2008, 2011, 2012) found that test takers held complex views on the acceptability of non-native speaker accents in academic listening tests, suggesting that this is a topic where further investigation is required. Future surveys employing different methods (e.g., focus groups and interviews) can contribute to a better understanding of various stakeholder groups' attitudes to the inclusion of non-native speaker accents in an academic listening test.
• Investigate how L2 listeners adapt to accents. Research outside the language testing context suggests that although unfamiliar accents can be difficult to understand, listeners at certain proficiency levels can adapt to various accents with appropriate experience (Baese-Berk, Bradlow, & Wright, 2013). Moreover, the ability to adapt to unfamiliar accents is included in the descriptors of the higher proficiency levels of the CEFR (Council of Europe, 2001, p. 75) and the ACTFL Guidelines (American Council on the Teaching of Foreign Languages, 2012). Therefore, empirical investigations of how L2 listeners adapt to particular accents, such as Harding (2018), might provide support for the inclusion of specific accents, in particular non-standard ones, in the listening section of academic language tests. Such research will also need to consider the test administration context, as test takers might have little time to adapt to unfamiliar accents, whereas real-life listeners might have more opportunities to do so.

3.4.3  Interactivity in Academic Situations and Activities

Academic lectures have traditionally been considered "non-collaborative" listening (Buck, 2001, p. 98). However, as we demonstrated above, listening in the academic context also requires comprehension of more informal and interactive uses of language inside and outside the classroom. It should also be pointed out that, within the practical constraints of typical listening test design, test takers are passive overhearers of discourse (Flowerdew & Miller, 2005), whereas interaction in real-life language use requires a combination of speaking and listening abilities (for a discussion of the role of listening in completing speaking test tasks, see Ockey & Wagner, 2018). Thus, future research should explore the domain of interactive listening tasks. In addition, test takers typically respond to items following monologic or dialogic input, whereas input involving three or more speakers is less common. Because monologic and dialogic input have been found to differ in terms of difficulty (Brindley & Slatyer, 2002; Papageorgiou et al., 2012), more research is needed to obtain a better understanding of the nature of listening input involving one versus two or more than two speakers, and, as Buck (2001) advocates, the listener's role in collaborative listening. Based on these considerations, future research could investigate a variety of topics:

• Identify the types of interactional listening tasks that candidates will encounter in academic settings
• Investigate the cognitive components of interactive listening tasks
• Identify listeners' comprehension strategies in academic settings involving collaborative and non-collaborative listening


3.4.4  Assessing Pragmatic Understanding

The testing of pragmatic comprehension in tests of academic listening typically focuses on speaker intention, speaker meaning, or speaker opinion or stance, whereas the testing of conversational implicature in an interactional context is rare. Implicature (Grice, 1975), that is, hidden or implied meaning in contrast to literal meaning, is an important aspect of interaction; thus, future research should focus on establishing ways to more fully assess ESL learners' ability to comprehend various types of implicature. Grice (1975) defines two types of implicature: conventional implicature, which does not depend on the context in which the particular utterance is used, and conversational implicature, which is linked to certain features of discourse and the context of use. More recent research, discussed in Roever (2011), establishes the following types of implicature: idiosyncratic implicature, which is similar to conversational implicature; formulaic implicature; topic change; and indirect criticism through praise of an irrelevant aspect of the whole. Future research should first establish whether any of these types are already included in an academic listening test, and if not, whether they should be.

3.4.5  Investigating Test-Taker Characteristics

As Bachman and Palmer (1996) point out, the use of test scores to make decisions about individuals or inferences about their language ability requires a clear conceptualization of the characteristics of the test takers. Understanding how test takers respond to test items is therefore critical to supporting the validity of test scores. Qualitative data analysis methods such as verbal protocol analysis and interviews have been employed in the language testing literature to offer insights into test-taker cognitive processing when responding to various item types and stimuli (Ockey, 2007; Wagner, 2008; Wu, 1998). Eye-tracking studies are also becoming increasingly prominent, and could be utilized to understand, for example, how test takers make use of the written prompts and options of selected-response items while performing listening tasks (Winke & Lim, 2014). Prospective studies investigating the cognitive processes of test takers during academic listening tests can offer support for claims about the construct assessed.

3.5  Towards a Model of Assessing Academic Listening

Language use does not take place in isolation. For this reason, the context of language use has been of central concern in the testing literature (Bachman, 1990; Douglas, 2000), and relevant contextual factors are typically discussed based on different paradigms in linguistics, such as sociolinguistics (Hymes, 1972) or pragmatics (Leech, 1983). As pointed out in the chapter on academic speaking proficiency, oral interaction has been the focus of attention in most discussions of language use context in the testing literature because of the important role contextual factors play when participants interact. Because listening is part of real-life oral interaction, and to avoid repetition, a review of definitions of language use contexts can be found in the speaking chapter (Chapter 5, this volume). This section therefore concludes the chapter with a focus on aspects of language use contexts related to the design of listening tasks.

As discussed earlier in this chapter, an important practical constraint on the design of listening tests relates to the fact that test takers are passive overhearers of discourse (Flowerdew & Miller, 2005), whereas real-life interaction requires a combination of speaking and listening abilities. It should also be noted that the listening stimuli of most large-scale testing programs are scripted (Wagner, 2014). Because of the overhearer constraint and the scripted nature of the discourse, the contextual information represented in listening test tasks may not faithfully replicate all aspects of real-life oral interactions (for example, asking for clarification when an utterance is not understood). However, not all contexts of language use involving listening skills require interaction; in fact, non-collaborative listening (for example, comprehension of monologic lectures) remains an important academic task. Although it is inevitable that operational constraints will limit the situational authenticity of listening test tasks, it is critical that test developers consider ways to maintain the interactional authenticity of these tasks, as discussed in this chapter. Another characteristic of listening tasks is the cognitive environment, i.e., what the listener has in mind, which Buck (2001) sees as the type of context that subsumes all contextual factors. Drawing on the approach of the speaking chapter in this volume (Chapter 5, this volume), the context of listening tasks in EAP tests may be defined in a few layers, each of which adds specificity that constrains the contextual parameters of the tasks (Figure 3.1):

• Subdomains of language use. In the general academic context, such subdomains might include the social-interpersonal, navigational, and general academic subdomains.
• Speech genres (Bakhtin, 1986). It will be helpful to incorporate speech genres in the design of listening tests because discourse in each genre may be structured in a way that is distinct from other genres.
• Goal of comprehension for a particular task. Each comprehension goal entails one or more listening comprehension subskills, such as understanding the main idea, recognizing important details, inferring speaker meaning, etc.
• Amount of discourse interactivity. A test taker might listen to a short oral presentation to an imaginary class (monologic), a conversation between two speakers (dialogic), or a group conversation involving more than two interlocutors.
• Characteristics of the setting. In listening test tasks, the time and place of an interaction (the spatial-temporal dimensions) are critical contextual factors that might affect comprehension. For example, a test taker should be able to understand the context of a lecture given by a professor or a conversation between a professor and students to be able to activate their topical knowledge. Participants are another facet of the setting. In listening tasks that involve one-on-one or one-to-many interactions, the roles that interlocutors play, and the social and cultural relationships (e.g., power structure) they have with other interlocutors, impact the test takers' comprehension of the speakers' intended meaning.

FIGURE 3.1  Contextual facets of a listening test task: language use domain (e.g., social-interpersonal, navigational, general academic); speech genres (e.g., everyday narration, everyday conversation, argumentative); goal of comprehension (e.g., understand the main idea, recognize important details, infer speaker intention); discourse interactivity (monologic, dialogic, group); characteristics of the setting (spatial-temporal dimensions, e.g., time and location of the event and audio/visual channel; participants' roles and relationships)


We conclude this chapter by echoing the point made in Chapter 5 (this volume) that the design of academic language proficiency tests should incorporate facets of the context that allow for sufficient domain representation and for measuring variation in test takers' comprehension of aural input.

Notes

1 A structural model specifies the relationships between latent (or unobserved) variables. The fit of the statistical model can be evaluated using a variety of indices developed for use within the structural equation modeling (SEM) framework (Yuan & Bentler, 2007).
2 www.corestandards.org/ELA-Literacy/CCRA/SL
3 www.gov.uk/government/uploads/system/uploads/attachment_data/file/331877/KS4_English_PoS_FINAL_170714.pdf
4 www.australiancurriculum.edu.au/resources/national-literacy-and-numeracy-learning-progressions
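As a minimal, hypothetical sketch of the analysis described in Note 1, the code below fits a simple two-factor measurement model and reports SEM fit indices. It assumes the Python semopy package; the data file, item names, and factor structure are invented for illustration and do not come from any study cited in this chapter.

```python
# Hypothetical sketch of evaluating a structural (latent variable) model
# with SEM fit indices (cf. Note 1; Yuan & Bentler, 2007).
# Assumptions: the semopy package is installed, and "listening_items.csv"
# holds item-level scores q1-q8 for a sample of test takers.
import pandas as pd
from semopy import Model, calc_stats

data = pd.read_csv("listening_items.csv")

# Lavaan-style description: two latent listening variables, each measured
# by four observed items, with their covariance estimated.
description = """
foundational =~ q1 + q2 + q3 + q4
higher_order =~ q5 + q6 + q7 + q8
foundational ~~ higher_order
"""

model = Model(description)
model.fit(data)                   # estimate loadings and factor covariance
fit_indices = calc_stats(model)   # chi-square, CFI, TLI, RMSEA, etc.
print(fit_indices.T)
```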

References Abeywickrama, P. (2013). Why not non-​native varieties of English as listening comprehension test input? RELC Journal, 44(1), 59–​74. Alderson, J. C. (2007). The CEFR and the need for more research. The Modern Language Journal, 91(4), 659–​663. American Council on the Teaching of Foreign Languages. (2012). ACTFL proficiency guidelines. Retrieved from www.actfl.org/​sites/​default/​files/​pdfs/​public/​ ACTFLProficiencyGuidelines2012_​FINAL.pdf Anderson, J. R. (1985). Cognitive psychology and its implications. New York: Freeman. Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, U.K.: Oxford University Press. Bachman, L. F. (1991). What does language testing have to offer? TESOL Quarterly, 25(4), 671–​704. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford, U.K.: Oxford University Press. Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford, U.K.: Oxford University Press. Baese-​Berk, M. M., Bradlow, A. R., & Wright, B. A. (2013). Accent-​independent adaptation to foreign accented speech. The Journal of the Acoustical Society of America, 133(3), 174–​180. Bakhtin, M. M. (1986). The problem of the text in linguistics, philology, and the human sciences: An experiment in philosophical analysis. In Speech genres and other late essays (pp. 103–​131). Austin, TX: University of Texas Press. Basturkmen, H. (2003). So what happens when the tutor walks in? Some observations on interaction in a university discussion group with and without the tutor. Journal of English for Academic Purposes, 2(1), 21–​33. Batty, A. O. (2017). The impact of visual cues on item response in video-​mediated tests of foreign language listening comprehension. (Unpublished doctoral dissertation). Lancaster University. Bejar, I., Douglas, D., Jamieson, J., Nissan, S., & Turner, J. (2000). TOEFL 2000 Listening Framework: A working paper (TOEFL Monograph Series No. 19). Princeton, NJ: Educational Testing Service.


Biber, D. E., Conrad, S., Reppen, R., Byrd, P., & Helt, M. (2002). Speaking and writing in the university: A multidimensional comparison. TESOL Quarterly, 36(1), 9–​48. Biber, D., Conrad, S., Reppen, R., Byrd, P., Helt, M., Clark, V., Cortes, V., Csomay, E., & Urzua, A. (2004). Representing language use in the university: Analysis of the TOEFL 2000 spoken and written academic language corpus (TOEFL Monograph Series MS-​25). Princeton, NJ: Educational Testing Service. Björkman, B. (2008).‘So where we are’: Spoken lingua franca English at a Swedish technical university. English Today, 24(2), 11–​17. Björkman, B. (2011). Pragmatic strategies in English as an academic lingua franca: Ways of achieving communicative effectiveness? Journal of Pragmatics, 43(4), 950–​964. Brindley, G., & Slatyer, H. (2002). Exploring task difficulty in ESL listening assessment. Language Testing, 19(4), 369–​394. Brown, J. D. (2004). What do we mean by bias, Englishes, Englishes in testing, and English language proficiency? World Englishes, 23(2), 317–​319. Brown, J. D. (2014). The future of World Englishes in language testing. Language Assessment Quarterly, 11(1), 5–​26. Brunfaut, T. (2016). Assessing listening. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 97–​112). Boston, MA: Mouton de Gruyter. Buck, G. (2001). Assessing listening. Cambridge, U.K.: Cambridge University Press. Canagarajah, S. (2006). Changing communicative needs, revised assessment objectives: Testing English as an international language. Language Assessment Quarterly, 3(3), 229–​242. Chambers, J. K. (2004). Dynamic typology and vernacular universals. In B. Kortmann (Ed.), Dialectology meets typology (pp. 127–​145). New York: Mouton de Gruyter. Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Second language acquisition and language testing interfaces (pp. 32–​70). Cambridge, U.K.: Cambridge University Press. Chia, H.-​U., Johnson, R., Chia, H.-​L., & Olive, F. (1999). English for college students in Taiwan: A study of perceptions of English needs in a medical context. English for Specific Purposes, 18(2), 107–​119. Clark, R. L., & Ma, J. (Eds.). (2005). Recruitment, retention, and retirement in higher education: Building and managing the faculty of the future. Cheltenham, U.K.: Edward Elgar Pub. Coleman, J. A. (2006). English-​medium teaching in European higher education. Language Teaching, 39(1), 1–​14. Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge, U.K.: Cambridge University Press. Council of Europe. (2018). CEFR companion volume with new descriptors. Retrieved from www.coe.int/​en/​web/​common-​european-​framework-​reference-​languages Csomay, E. (2006). Academic talk in American university classrooms: Crossing the boundaries of oral-​literate discourse? Journal of English for Academic Purposes, 5(2), 117–​135. Cunningham, U., Fägersten, K. B., & Holmsten, E. (2010). “Can you hear me, Hanoi?” Compensatory mechanisms employed in synchronous net-​ based English language learning. International Review of Research in Open and Distance Learning, 11(1), 161–​177. Cutler, A., & Clifton, C., Jr. (2000). Blue print of the listener. In P. Hagoort & C. Brown (Eds.), Neurocognition of language processing (pp. 123–​ 166). Oxford, U.K.: Oxford University Press. Davies, C. E., & Tyler, A. E. (2005). 
Discourse strategies in the context of crosscultural institutional talk: Uncovering interlanguage pragmatics in the university classroom. In K.
Bardovi-​Harlig & B. S. Hartford (Eds.), Interlanguage pragmatics: Exploring institutional talk (pp. 133–​156). Mahwah, NJ: Lawrence Erlbaum Associates. Derwing,T. M., Munro, M. J., & Thomson, R. I. (2008).A longitudinal study of ESL learners’ fluency and comprehensibility development. Applied Linguistics, 29(3), 359–​380. Deutch, Y. (2003). Needs analysis for academic legal English courses in Israel: A model of setting priorities. Journal of English for Academic Purposes, 2(2), 125–​146. Douglas, D. (2000). Assessing language for specific purposes. Cambridge, U.K.: Cambridge University Press. Douglas, D., & Hegelheimer, V. (2007). Assessing language using computer technology. Annual Review of Applied Linguistics, 27, 115–​132. Elder, C., & Davies, A. (2006). Assessing English as a lingua franca. Annual Review of Applied Linguistics, 26, 282–​304. ETS. (2020). TOEFL iBT® Test Framework and Test Development (Vol. 1). Retrieved from www.ets.org/​toefl_​family/​research/​insight_​series Evans, S., & Green, C. (2007). Why EAP is necessary: A survey of Hong Kong tertiary students. Journal of English for Academic Purposes, 6, 3–​17. Farr, F. (2003). Engaged listenership in spoken academic discourse: The case of student–​ tutor meetings. Journal of English for Academic Purposes, 2(1), 67–​85. Felix, U. (2001). A multivariate analysis of students’ experience of web-​based learning. Australian Journal of Educational Technology, 17(1), 21–​36. Fernández Polo, F. J., & Cal Varela, M. (2009). English for research purposes at the University of Santiago de Compostela:A survey. Journal of English for Academic Purposes, 8(3), 152–​164. Ferris, D. (1998). Students’ view of academic aural/​oral skills: A comparative needs analysis. TESOL Quarterly, 32(2), 289–​318. Field, J. (2011). Into the mind of the academic listener. Journal of English for Academic Purposes, 10(2), 102–​112. Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening (pp. 77–​151). Cambridge, U.K.: Cambridge University Press. Firth, A. (1996). The discursive accomplishment of normality. On “lingua franca” English and conversation analysis. Journal of Pragmatics, 26(2), 237–​259. Flowerdew, J., & Miller, L. (2005). Second language listening: Theory and practice. Cambridge, U.K.: Cambridge University Press. Fulcher, G. (2004). Deluded by artifices? The Common European Framework and harmonization. Language Assessment Quarterly, 1(4), 253–​266. Ginther, A., & Grant, L. (1996). A review of the academic needs of native English-​speaking college students in the United States (TOEFL Monograph Series). Princeton, NJ: Educational Testing Service. Grabowski, K. C. (2009). Investigating the construct validity of a test designed to measure grammatical and pragmatic knowledge in the context of speaking. (Unpublished doctoral dissertation). Teachers College, Columbia University. Groom, B., & Maunonen-​Eskelinen, I. (2006).The use of portfolio to develop reflective practice in teaching training: A comparative and collaborative approach between two teacher training providers in the UK and Finland. Teaching in Higher Education, 11(3), 291–​300. Grice, P. H. (1975). Logic and conversation. In P. Cole & J. L. Morgan (Eds.), Syntax and semantics 3: Speech acts (pp. 41–​58). New York: Academic Press. Gruba, P. (1997). The role of video media in listening assessment. System, 25(3), 335–​345. Gruba, P. (2004) Understanding digitized second language videotext. Computer Assisted Language Learning, 17(1), 51–​82.


Gu, L., & So, Y. (2015). Voices from stakeholders: What makes an academic English test ‘international’? Journal of English for Academic Purposes, 18, 9–​24. Harding, L. (2008). Accent and academic listening assessment: A study of test-​ taker perceptions. Melbourne Papers in Language Testing, 13(1), 1–​33. Harding, L. (2011). Accent and listening assessment. Frankfurt am Main, Germany: Peter Lang. Harding, L. (2012). Accent, listening assessment and the potential for a shared-​L1 advantage: A DIF perspective. Language Testing, 29(2), 163–​180. Harding, L. (2018). Listening to an unfamiliar accent: Exploring difficulty, strategy use, and evidence of adaptation on listening assessment tasks. In G. J. Ockey & E. Wagner (Eds.), Assessing L2 listening: Moving towards authenticity (pp. 97–​112). Amsterdam, The Netherlands: John Benjamins. Harding, L., & McNamara, T. (2017). Language assessment: The challenge of ELF. In J. Jenkins, W. Baker, & M. J. Dewey (Eds.), Routledge handbook of English as a lingua franca (pp. 570–​582). London, U.K.: Routledge. Harker, M., & Koutsantoni, D. (2005). Can it be as effective? Distance versus blended learning in a web-​based EAP programme. ReCALL, 17, 197–​216. Hudson, T., Detmer, E., & Brown, J. D. (1992). A framework for testing cross-​cultural pragmatics. Honolulu: University of Hawai’i, Second Language Teaching and Curriculum Center. Hymes, D. (1972). Models of the interaction of language and social life. In J. Gumperz & D. Hymes (Eds.), Directions in sociolinguistics:The ethnography of communication (pp. 35–​71). New York: Holt, Rinehart, Winston. Hynninen, N. (2011). The practice of ‘mediation’ in English as a lingua franca interaction. Journal of Pragmatics, 43(4), 965–​977. Janson, K., Schomburg, H., & Teichler, U. (2009). The professional value of ERASMUS mobility: The impact of international experience on formal students’ and on teachers’ careers. Bonn, Germany: Lemmens. Jarvis, H. (2004). Investigating the classroom applications of computers on EFL courses at higher education institutions in UK. Journal of English for Academic Purposes, 3(2), 111–​137. Jenkins, J. (2006). The spread of EIL: A testing time for testers. ELT Journal, 60(1), 42–​50. Jenkins, J. (2009). (Un)pleasant? (In)correct? (Un)intelligible? ELF speakers’ perceptions of their accents. In A. Mauranen & E. Ranta (Eds.), English as a lingua franca: Studies and findings (pp. 10–​36), Newcastle, U.K.: Cambridge Scholars Publishing. Jenkins, J. (2011). Accommodating (to) EFL in the international university. Journal of Pragmatics, 43(4), 926–​936. Jenkins, J. (2012). English as a Lingua Franca from the classroom to the classroom. ELT Journal, 66(4), 486–​494. Jon, J. E. (2009). “Interculturality” in higher education as student intercultural learning and development: A case study in South Korea. Intercultural Education, 20(5), 439–​449. Kachru, B. B. (1985) Standards, codification and sociolinguistic realism: The English language in the outer circle. In R. Quirk & H. G.Widdowson (Eds.), English in the world:Teaching and learning the language and literatures (pp. 11–​30). Cambridge, U.K.: Cambridge University Press. Kane, M. (2012).Validating score interpretations and uses. Language Testing, 29(1), 3–​17. Kang, O., Thomson, R., & Moran, M. (2018). Empirical approaches to measuring intelligibility of different varieties of English. Language Learning, 68(1), 115–​146. Kim, S. (2006). 
Academic oral communication needs of East Asian international graduate students in non-​science and non-​engineering fields. English for Specific Purposes, 25(4), 479–​489. Kirkpatrick, A. (2010). English as an Asian lingua franca and the multilingual model of ELT. Language Teaching, 44(2), 212–​234.


Knapp, A. (2011). Using English as a lingua franca for (mis-​)managing conflict in an international university context: An example from a course in engineering. Journal of Pragmatics, 43(4), 978–​990. Kormos, J., Kontra, E. H., & Csölle, A. (2002). Language wants of English majors in a non-​ native context. System, 30(4), 517–​542. Krase, E. (2007). “Maybe the communication between us was not enough”: Inside a dysfunctional advisor/​L2 advisee relationship. Journal of English for Academic Purposes, 6(1), 55–​70. Leech, G. (1983). Principles of pragmatics. London, U.K.: Longman. Leslie, D. W., & Gappa, J. M. (2002). Part-​time faculty: Component and committed. New Directions for Community Colleges, 118, 59–​67. Less, P. (2003). Academic and nonacademic skills needed for success by international students in a Master’s of Business Administration program. (Unpublished doctoral dissertation). University of Arkansas, Little Rock. Lewkowicz, J. A. (2000). Authenticity in language testing: Some outstanding questions. Language Testing, 17(1), 43–​64. Ljosland, R. (2011). English as an Academic Lingua Franca: Language policies and multilingual practices in a Norwegian university. Journal of Pragmatics, 43(4), 991–​1004. Londe, Z. C. (2008). Working memory and English as a second language listening comprehension tests: A latent variable approach. (Unpublished doctoral dissertation). University of California, Los Angeles. Luxon, T., & Peelo, M. (2009). Internationalisation: Its implications for curriculum design and course development in UK higher education. Innovations in Education and Teaching International, 46(1), 51–​60. Lynch, T. (1995). The development of interactive listening strategies in second language academic settings. In D. Mendelsohn & J. Rubin (Eds.), A guide for the teaching of second language listening (pp. 166–​185). San Diego, CA: Dominie Press. Lynch, T. (2011). Academic listening in the 21st century: Reviewing a decade of research. Journal of English for Academic Purposes, 10(2), 79–​88. Major, R. C., Fitzmaurice, S. F., Bunta, F., & Balasubramanian, C. (2002).The effects of nonnative accents on listening comprehension: Implications for ESL assessment. TESOL Quarterly, 36(2), 173–​190. Major, R. C., Fitzmaurice, S. M., Bunta, F., & Balasubramanian, C. (2005).Testing the effects of regional, ethnic, and international dialects of English on listening comprehension. Language Learning, 55(1), 37–​69. Manakul, W. (2007). Role of English in internationalization of higher education: The case of the graduate school of engineering, Hokkaido University. Journal of Higher Education and Lifelong Learning, 15, 155–​162. Mauranen, A. (2006a). A rich domain of ELF –​the ELFA corpus of academic discourse. The Nordic Journal of English Studies, 5(2), 145–​159. Mauranen, A. (2006b). Signalling and preventing misunderstanding in English as a lingua franca communication. International Journal of the Sociology of Language, 177, 123–​150. Mauranen, A., Hynninen, N., & Ranta, E. (2010). English as an academic lingua franca:The ELFA project. English for Specific Purposes, 29(3), 183–​190. McCalman, C. L. (2007). Being an interculturally competent instructor in the United States: Issues of classroom dynamics and appropriateness, and recommendations for international instructors. New Directions for Teaching and Learning, 110, 65–​74. McKeachie, W. J. (2002). Teaching tips: Strategies, research, and theory for college and university teachers. Boston, MA: Houghton Mifflin.


McNamara, T. (2004). Language testing. In A. Davies & C. Elder (Eds.), The handbook of applied linguistics (pp. 763–​783). Malden, MA: Blackwell Publishing. McNamara, T. (2006). Validity in language testing: The challenge of Sam Messick’s legacy. Language Assessment Quarterly, 3(1), 31–​51. Metsä-​Ketelä, M. (2006). Words are more or less superfluous: The case of more or less in academic lingua franca English. The Nordic Journal of English Studies, 5, 117–​143. Miller, L. (2002). Towards a model for lecturing in a second language. Journal of English for Academic Purposes, 1(2), 145–​162. Morell,T. (2004). Interactive lecture discourse for university EFL students. English for Specific Purposes, 23(3), 325–​338. Morita, N. (2000). Discourse socialization through oral classroom activities in a TESL graduate classroom. TESOL Quarterly, 34(2), 279–​310. Morita, N. (2004). Negotiating participation and identity in second language academic communities. TESOL Quarterly, 38(4), 573–​604. Munro, M. J., & Derwing, T. M. (1998). The effects of speaking rate on listener evaluations of native and foreign-​accented speech. Language Learning, 48(2), 159–​182. Munro, M. J., Derwing, T. M., & Morton, S. L. (2006). The mutual intelligibility of L2 speech. Studies in Second Language Acquisition, 28(1), 111–​131. Murphy, J. M. (2005). Essentials in teaching academic oral communication. Boston, MA: Houghton-​Mifflin. North, B. (2014). The CEFR in practice. Cambridge, U.K.: Cambridge University Press. Ockey, G. J. (2007). Construct implications of including still image or video in computer-​ based listening tests. Language Testing, 24(4), 517–​537. Ockey, G. J., & French, R. (2016). From one to multiple accents on a test of L2 listening comprehension. Applied Linguistics, 37(5), 693–​715. Ockey, G. J., Papageorgiou, S., & French, R. (2016). Effects of strength of accent on an L2 interactive lecture listening comprehension test. International Journal of Listening, 30(1–​ 2), 84–​98. Ockey, G. J., & Wagner, E. (2018). An overview of interactive listening as part of the construct of interactive and integrated oral tasks. In G. J. Ockey & E.Wagner (Eds.), Assessing L2 listening: Moving towards authenticity (pp. 129–​144). Amsterdam,The Netherlands: John Benjamins. Ouyang, H. H. (2004). Remaking of face and community of practices. Beijing, China: Peking University Press. Papageorgiou, S., Stevens, R., & Goodwin, S. (2012). The relative difficulty of dialogic and monologic input in a second-​language listening comprehension test. Language Assessment Quarterly, 9(4), 375–​397. Papageorgiou, S., & Tannenbaum, R. J. (2016). Situating standard setting within argument-​ based validity. Language Assessment Quarterly, 13(2), 109–​123. Pearson Education. (2010). The official guide to Pearson Test of English (PTE) Academic. Hong Kong, China: Pearson Longman. Pearson. (2019). Score guide version 12. Retrieved from https://​pearsonpte.com/​wp-​ content/​uploads/​2019/​10/​Score-​Guide-​for-​test-​takers-​V12-​20191030.pdf Phakiti, A. (2016). Test-​ takers’ performance appraisals, appraisal calibration, state-​ trait strategy use, and state-​trait IELTS listening difficulty in a simulated IELTS Listening test. IELTS Research Reports Online Series, 6, 1–​3. Pickering, L. (2001).The role of tone choice in improving ITA communication in the classroom. TESOL Quarterly, 35(2), 233–​255.


Pickering, L. (2009). Intonation as a pragmatic resource in ELF interaction. Intercultural Pragmatics, 6(2), 235–​255. Powers, D. E. (1986). Academic demands related to listening skills. Language Testing, 3(1), 1–​38. Purpura, J. (2004). Assessing grammar. Cambridge, U.K.: Cambridge University Press. Ranta, E. (2006). The ‘attractive’ progressive –​Why use the-​ing form in English as a lingua franca? The Nordic Journal of English Studies, 5(2), 95–​116. Read, J. (2002). The use of interactive input in EAP listening assessment. Journal of English for Academic Purposes, 1(2), 105–​119. Roever, C. (2005). Testing ESL pragmatics: Development and validation of a web-​based assessment battery. Frankfurt am Main, Germany: Peter Lang. Roever, C. (2011). Testing of second language pragmatics: Past and future. Language Testing, 28(4), 463–​481. Roever, C. (2013). Assessment of pragmatics. In C. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 1–​8). London, U.K.: Blackwell Publishing. Roever, C., & McNamara, T. (2006). Language testing: The social dimension. International Journal of Applied Linguistics, 16(2), 242–​258. Rosenfeld, M., Leung, P., & Oltman, P. K. (2001). The reading, writing, speaking, and listening tasks important for academic success at the undergraduate and graduate levels (TOEFL Monograph No. 21). Princeton, NJ: Educational Testing Service. Rost, M. (2005). L2 listening. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 503–​527). Mahwah, NJ: Erlbaum. Sampson, N. (2003). Meeting the needs of distance learners. Language Learning & Technology, 7(3), 103–​118. Sawaki, Y., & Nissan, S. (2009). Criterion-​related validity of the TOEFL iBT Listening section (TOEFL iBT Report No. 8). Princeton, NJ: Educational Testing Service. Schmidgall, J. E. (2013). Modeling speaker proficiency, comprehensibility, and perceived competence in a language use domain. (Unpublished doctoral dissertation). University of California, Los Angeles. Seidlhofer, B. (2001). Closing a conceptual gap: The case for a description of English as a lingua franca. International Journal of Applied Linguistics, 11(2), 133–​158. Skyrme, G. (2010). Is this a stupid question? International undergraduate students seeking help from teachers during office hours. Journal of English for Academic Purposes, 9(3), 211–​221. Sternberg, R. (2003). Cognitive psychology. Belmont, CA: Thomson Learning. Stricker, L., & Attali, Y. (2010). Test takers’ attitudes about the TOEFL iBT (TOEFL iBT Report No. 13). Princeton, NJ: Educational Testing Service. Suvorov, R. (2015). The use of eye tracking in research on video-​based second language (L2) listening assessment: A comparison of context videos and content videos. Language Testing, 32(4), 463–​483. Taylor, L., & Geranpayeh, A. (2011). Assessing listening for academic purposes: Defining and operationalising the test construct. Journal of English for Academic Purposes, 10(2), 89–​101. Timpe,V. (2013). Assessing intercultural language learning:The dependence of receptive sociopragmatic competence and discourse competence on learning opportunities and input. (Unpublished doctoral dissertation). Technische Universität Dortmund, Dortmund. Timpe Laughlin,V.,Wain, J., & Schmidgall, J. (2015). Defining and operationalizing the construct of pragmatic competence: Review and recommendations (ETS Research Report No. RR-​15-​ 06). Princeton, NJ: Educational Testing Service.


Ulijn, J. M., & Strother, J. B. (1995). Communicating in business and technology. Frankfurt am Main, Germany: Peter Lang. Vandergrift, L., & Goh, C. C. M. (2012). Teaching and learning second language listening: Metacognition in action. New York: Routledge. Venkatagiri, H. S., & Levis, J. M. (2007). Phonological awareness and speech comprehensibility: An exploratory study. Language Awareness, 16(4), 263–​277. Vienna-​Oxford International Corpus of English. (2011). What is voice? Retrieved from www.univie.ac.at/​voice/​page/​what_​is_​voice Wächer, B. (2008). Teaching in English on the rise in European higher education. International Higher Education, 52, 3–​4. Wagner, E. (2008). Video listening tests: What are they measuring? Language Assessment Quarterly, 5(3), 218–​243. Wagner, E. (2010). The effect of the use of video texts on ESL listening test-​taker performance. Language Testing, 27(3), 493–​513. Wagner, E. (2014). Using unscripted spoken texts in the teaching of second language listening. TESOL Journal, 5(2), 288–​311. Wagner, E., & Ockey, G. J. (2018). An overview of the use of audio-​visual texts on L2 listening tests. In G. J. Ockey & E. Wagner (Eds.), Assessing L2 listening: Moving towards authenticity (pp. 129–​144). Amsterdam, The Netherlands: John Benjamins. Warschauer, M. (2002). Networking into academic discourse. Journal of English for Academic Purposes, 1(1), 45–​58. Waters, A. (1996). A review of research into needs in English for academic purposes of relevance to the North American higher education context (TOEFL Monograph Series No. MS-​06). Princeton, NJ: Educational Testing Service. Watson Todd, R. (2003). EAP or TEAP? Journal of English for Academic Purposes, 2(2), 147–​156. Weir, C. (2005). Language testing and validation: An evidence-​ based approach. Basingstoke, U.K.: Palgrave Macmillan. Wells, R. (2007). International faculty in U.S. community colleges. New Directions for Community Colleges, 138, 77–​82. Winke, P., & Lim, H. (2014). The effects of testwiseness and test-​taking anxiety on L2 listening test performance: A visual (eye-​tracking) and attentional investigation. IELTS Research Reports Online Series, 3, 1–​30. Wu,Y. (1998). What do tests of listening comprehension test? A retrospective study of EFL test-​takers performing a multiple-​choice task. Language Testing, 15(1), 21–​44. Yildiz, S. (2009). Social presence in the web-​based classroom: Implications for intercultural communication. Journal of Studies in International Education, 13(1), 46–​66. Yuan, K.-​H., & Bentler, P. M. (2007). Structural equation modeling. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics. Vol. 26: Psychometrics (pp. 297–​358). Amsterdam, The Netherlands: North-​Holland.


4
ASSESSING ACADEMIC WRITING
Alister Cumming, Yeonsuk Cho, Jill Burstein, Philip Everson, and Robert Kantor

This chapter describes current practices for and recent research on assessing writing in English as a second or foreign language (L2) for academic purposes.1 We begin by reviewing conceptualizations of academic writing through interactionist assessment principles, rating scales, and definitions of the domain of academic writing from surveys, analyses of the characteristics of written texts, and curriculum standards. We next describe and distinguish normative, formative, and summative purposes for assessing academic writing. The third section of the chapter describes how writing is assessed in three major international tests of English. The fourth section reviews the range and types of research, inferences, and evidence recently used to validate aspects of writing tests. The chapter concludes with recommendations for research and development related to the assessment of L2 academic writing: to specify task stimuli and rating criteria, to increase the coverage of genres and domains for academic writing, to expand and elaborate on the design of integrated writing tasks, and to develop automated scoring systems further.

4.1  Conceptualizing Writing Proficiency in English for Academic Purposes

Every assessment of writing conceptualizes writing in a certain way, whether stated explicitly as a theoretical construct to be evaluated or enacted implicitly as procedures and contexts for eliciting and scoring writing. Scholars and educators agree that both writing and language proficiency are complex, multi-faceted, and varied in their realizations. This complexity, however, challenges efforts to formulate a straightforward definition of the construct of writing in a second language that could apply generally across contexts, purposes, and populations for assessment
(Cumming, Kantor, Powers, Santos, & Taylor, 2000; Shaw & Weir, 2007; Weigle, 2002) or research or teaching (Cumming, 2016; Leki, Cumming, & Silva, 2008). The many aspects of writing, written texts, and language forms and functions interact interdependently: therefore, it is difficult to name, let alone evaluate, them all precisely or account systematically for how they vary by contexts and purposes. Component abilities for writing include, for example: (a) micro-processes of word choice, spelling, punctuation, or keyboarding; (b) composing processes of planning, drafting, revising, and editing; and (c) macro-processes of fulfilling genre conventions, asserting a coherent perspective, or expressing membership in a particular discourse community. Guidance from theories is limited because there is no single, universally agreed-upon theory of writing in a second language that could satisfy the many different purposes, audiences, language forms, and societal situations for which people write in, teach, learn, or assess second languages. Current theories provide partial, and often divergent, explanations for integral phenomena such as intercultural rhetoric, composing processes, text genres, or sociocultural development (Cumming, 2016; Ivanic, 2004), despite recognition that “cognitive academic language proficiency” (Cummins, 1984) is a set of abilities that students need and acquire in order to pursue academic studies in first or additional languages. Nonetheless, assessors have agreed for decades, in principle and in practice, that tests of writing in a first or second language must necessarily require test takers to produce coherent, comprehensible written texts (Bachman, 2002; Cumming, 2009; Huot, 1990). The multi-faceted complexity of writing and interdependence of its components oblige test designers to solicit people’s demonstration of their abilities to write effectively rather than attempting to evaluate independently a selection of the innumerable subcomponents of writing, for example, through editing, error correction, translating, or gap-filling tasks, which would under-represent the complexities of actually writing. As Carroll (1961) put it in the mid-20th century, writing in a second language needs to be evaluated holistically in an “integrative” manner rather than “discretely” in respect to its components. These fundamental assumptions have led to three conceptualizations and practices that guide the design, implementation, and evaluation of writing tests: interactionist principles for assessment tasks; rating scales for scoring; and descriptions of the domain of academic writing.

4.1.1  Interactionist Principles for Assessment Tasks

Writing is conventionally assessed through tasks rather than items. That is, test takers’ production of written texts differs fundamentally from the responses conventionally expected in assessments of comprehension with selected response items and of vocabulary and grammar with discrete items. Writing assessments typically ask test takers to compose a piece of writing in response to instructions and prompts that specify a topic, purpose, relevant information sources, time allocation, intended
readers, and expected length of the written text. Prospective test takers are usually made aware of the criteria and procedures by which their written compositions will later be evaluated by human scorers, automated computer protocols, or combinations of both. Assessment tasks focus test takers on specific goals to convey pragmatic meaning through written language, as people would in classroom tasks for language teaching and learning or education generally (Brindley, 2013; Byrnes & Manchón, 2014; Ellis, 2003; Norris, 2009; Robinson, 2011; Ruth & Murphy, 1988). From the perspective of psychometric theories, interactionist theories conceptualize writing in reference to the characteristics of a person’s abilities to interact in a particular context rather than as a trait, independent of a context, or as simply a behavior, independent of a person’s characteristics (Chalhoub-Deville, 2003; Mislevy, 2013; Mislevy & Yin, 2009). Writing assessment tasks follow interactionist principles in two senses. One sense is that test takers’ production of writing involves the construction and interaction of all components of writing together at the same time. The second sense of interactionist principles is that the writing produced in an assessment is interpreted and evaluated by a reader or machine-scoring protocol according to criteria for the fulfillment of the task, specifically, and for the effectiveness of writing, generally. That is, the assessment of writing tasks is based, reciprocally and dynamically, on qualities of the written texts that test takers produce as well as readers’ interpretative processes of comprehending and evaluating those texts. As Mislevy and Yin (2009) explained, interactionist principles of assessment recognize that “performances are composed of complex assemblies of component information-processing actions that are adapted to task requirements during performance. These moment-by-moment assemblies build on past experience and incrementally change capabilities for future action” (p. 250). Interactionist principles for assessment can be realized in contexts that either are rich and complex in information, situations, and purposes—as in teachers’ ongoing assessments of students’ compositions in classrooms—or in contexts that are lean and for general purposes as in large-scale proficiency tests. Context-rich situations can foster students’ gathering of information from various sources and presentation in diverse forms to communicate unique ideas earnestly, for example, to a teacher or other students. In context-lean situations, such as formal tests, the extent of information expected in writing may be restricted in the interests of standardization so that the abilities of a broad range of test takers with different backgrounds and types of knowledge can be evaluated in common ways and judged with minimal biases. In situations of human scoring, experienced raters read and evaluate test takers’ compositions in reference to descriptive criteria in the form of scoring rubrics. Studies of human rating processes have described how raters’ ongoing interpretations and judgments mediate their uses of scoring rubrics in relation to: their perceptions of the qualities of writing, uses of language, and ideas or content in a composition for the assigned task; raters’ control over their own assessment
behaviors; and their knowledge about relevant assessment policies and educational contexts (Cumming, Kantor, & Powers, 2002; Li & He, 2015; Lumley, 2005; Milanovic, Saville, & Shen, 1996). The best of the available commercial systems designed to automatically evaluate writing have been modeled on human scoring processes, making explicit in the form of computer algorithms the linguistic qualities and values of texts expected for certain writing tasks (Attali & Burstein, 2006; Shermis & Burstein, 2013; Shermis, Burstein, Elliot, Miel, & Foltz, 2016). Interactionist principles shape the design of most writing assessments. For example, Shaw and Weir (2007) elaborated a model in current terms that follows from principles and practices used for writing tasks in Cambridge’s English examinations for almost a century. One dimension of their model specifies the cognitive processes that writing tasks are expected to elicit from test takers, including macro-​planning to gather information and identify goals and readership; organization of ideas, relationships, and priorities; micro-​planning to produce local aspects of texts progressively; translation of content into words and phrases; monitoring of spelling, punctuation, and syntax; and revising to correct or improve a text produced. The other dimension of Shaw and Weir’s model describes two aspects of the contexts integral to writing assessment tasks and their administration. One aspect is the setting, involving expectations for response format, purpose, assessment criteria, text length, time constraints, and writer-​reader relationship. The other aspect concerns the linguistic demands expected for writing: task input and output, lexical resources, structural resources, discourse mode, functional resources, and content knowledge.
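To make the idea of algorithmic scoring concrete, the following minimal Python sketch shows the general logic attributed above to automated systems: weights for a few text features are estimated from human-scored essays and then applied to new responses. The features, data, weights, and 0-5 scale here are illustrative assumptions only; the sketch does not reproduce e-rater or any other operational system.

    import numpy as np

    # Hypothetical training data: one row of text features per essay
    # (e.g., length in words, lexical variety, proportion of flagged errors),
    # paired with holistic human ratings on a 0-5 scale.
    features = np.array([[320, 0.52, 0.04],
                         [150, 0.40, 0.11],
                         [410, 0.61, 0.02],
                         [230, 0.47, 0.07]], dtype=float)
    human_scores = np.array([4.0, 2.0, 5.0, 3.0])

    # Estimate weights so that a linear combination of features approximates
    # the human ratings (ordinary least squares with an intercept term).
    design = np.column_stack([np.ones(len(features)), features])
    weights, *_ = np.linalg.lstsq(design, human_scores, rcond=None)

    def machine_score(new_features):
        """Score a new essay's feature vector on the same 0-5 scale."""
        prediction = weights[0] + np.dot(weights[1:], new_features)
        return float(np.clip(prediction, 0.0, 5.0))

    print(machine_score([300, 0.55, 0.05]))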

4.1.2  Rating Scales

Most tests aim to distribute test takers’ performances on a continuum from least to most proficient. Accordingly, the scoring of written tasks is usually done in reference to descriptive criteria ordered on a scale (or sets of scales) that describe characteristics of writing of varying degrees of quality. Scales for evaluating writing are usually numbered in sequence in a fixed range (e.g., from minimum 0 or 1 to a maximum between 4 and 10). Each point on the scale is demarcated on the basis of distinctions that correspond to credible impressions of difference in writing abilities that can be stated coherently and distinguished reliably in scoring (Papageorgiou, Xi, Morgan, & So, 2015). Scale points of 0 or 1 may represent negligible written production. The maximum score point on a scale may represent writing by highly proficient, well-educated users of the language. Points in between these extremes on the scale describe characteristics of writing quality in ascending order. For comprehensive language tests that evaluate writing along with speaking, reading, and listening, scales for writing abilities are usually parallel in format and kind to scales designated for abilities in the other modes of communication. Scales for assessment are impressionistic in that they intend to represent as well as guide raters’ impressions and judgments about key characteristics of test takers’
written texts. As has been widely recognized (e.g., Hamp-Lyons, 1991; Weigle, 2002, pp. 108–139; White, 1985), the organization of scales for scoring writing can take one of two formats. A holistic scale combines all aspects of writing quality in a single scale to produce one score for each writing task. Alternatively, sets of analytic or multi-trait scales score different major aspects of written texts, such as ideas, rhetorical organization, language use, or vocabulary. Research suggests that either format has relative advantages and disadvantages (Barkaoui, 2010; Li & He, 2015; Ohta, Plakans, & Gebril, 2018). Holistic scales may have the advantage of ease, reliability, flexibility, and speed for rating, particularly for large-scale tests. In comparison, analytic or multi-trait scales may have advantages of focusing raters’ attention on distinguishing and analyzing particular aspects of writing and monitoring their own rating behaviors, and so be useful for novice raters, diagnostic purposes, learning-oriented feedback, or research on writing abilities or development. The longstanding method for establishing new scales to rate writing ability has first been to field-test prototype writing tasks on a relevant population. Then panels of experts (e.g., experienced raters, teachers, and/or researchers) rank order, describe, and establish consensus on their impressions of the resulting compositions from best to worst and articulate the primary traits that the writing samples are perceived to display at different points on the scale. Then benchmark texts are selected to exemplify the descriptive criteria at each score point (Enright et al., 2008; Lloyd-Jones, 1977; Papageorgiou, Xi, Morgan, & So, 2015; Purves, Gorman, & Takala, 1988; Ruth & Murphy, 1988; Shaw & Weir, 2007, pp. 154–167). As Banerjee, Yan, Chapman, and Elliott (2015) demonstrated, tools from corpus linguistics are increasingly useful in specifying text features and criteria for scoring rubrics. Certain issues obtain in applying rating scales to evaluate multi-faceted abilities such as writing and language proficiency. As Reckase (2017) has elucidated for tests generally, the logic of rating scales intersects with a wholly different approach to the design of assessments (discussed below): domain description, in which relevant characteristics of an ability are first described, and then assessments are designed to sample systematically from the domain. Rating scales identify comparatively tangible points of difference in performance rather than attempting to account for all aspects of being proficient in writing a second language. Consequently, rating scales demarcate differences in test takers’ performances, but rating scales do not define, reveal, or explain all dimensions at which an examinee is proficient. A prevailing concern, therefore, in applying rating scales to educational practices is that they may be taken to represent, and so easily become confused with, a comprehensive construct of language proficiency or literacy. As discussed further below, this dilemma has been frequently debated for applications to curricula, assessments, or individual people, for example, of the proficiency descriptions derived from the Common European Framework of Reference for Languages (CEFR) (Byrnes, 2007; Council of Europe, 2001, 2009; North, 2000, 2014) or similar frameworks such as the Certificates in Spoken and Written English in Australia (Brindley, 2000, 2013).
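As a concrete illustration of the two formats, the brief sketch below contrasts a single holistic judgment with a set of analytic subscores combined into an overall result. The category names, scale ranges, and simple averaging rule are assumptions for exposition, not the procedure of any particular test.

    # Holistic scoring: one judgment of overall quality on a single scale (here 0-5).
    holistic_score = 4

    # Analytic (multi-trait) scoring: separate judgments for major aspects of the text.
    analytic_scores = {"ideas": 4, "organization": 3, "language_use": 4, "vocabulary": 3}

    # One simple way to report an overall analytic result is to average the
    # subscores; the averaging and one-decimal rounding are illustrative choices.
    overall_analytic = round(sum(analytic_scores.values()) / len(analytic_scores), 1)

    print(holistic_score, overall_analytic)  # 4 3.5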

Differences in points on rating scales in language tests cannot claim to represent progressions of development that learners actually make as they learn to write in a second language over time. There is little empirical evidence, and there are few theoretical foundations, to substantiate such claims about learning or development (Biber & Gray, 2013; Elder & O’Loughlin, 2003; Knoch, Rouhshad, & Storch, 2014; Norris & Manchón, 2012; Polio, 2017). Moreover, samples of writing that receive the same scores on major tests of English display somewhat different qualities of discourse and language use (Friginal, Li, & Weigle, 2014; Jarvis, Grant, Bikowski, & Ferris, 2003; Polio & Shea, 2014; Wang, Engelhard, Raczynski, Song, & Wolf, 2017; Wind, Stager, & Patil, 2017). Evidently, individual learners of languages develop certain aspects of their writing abilities differently depending on their educational, communication, and other experiences even though raters’ impressionistic judgments may reliably group the qualities of their writing together. Measurement issues further complicate these matters. Assumptions are made about continuity in rating scales although the distances between different score points on a rating scale are seldom the same or equivalent. Likewise, statistical methods to analyze language test scores may assume that a rating scale represents a single dimension, when in fact, as observed from the outset of this chapter, a defining characteristic of writing is its multi-faceted complexity, particularly in a second language.

4.1.3  Describing the Domain of Academic Writing

The domain of writing in a second language for academic purposes is usually defined from two perspectives in assessments and in curricula. One perspective addresses broadly the kinds and qualities of writing that are done and expected for academic purposes in school, college, or university programs. The second perspective focuses on the characteristics of written texts that are produced in such contexts and in assessment tasks. The idea of genre mediates the two perspectives by defining conventional rhetorical forms of written texts that writers and readers use routinely to communicate in certain contexts. Domain descriptions justify the selection of writing tasks for assessment in terms of relevance, authenticity, or importance as well as establish specific criteria about knowledge and skills for which qualities of writing are scored. Expectations about writing in academic programs have been defined by surveying writing practices and policies, analyzing characteristics of written texts, and stipulating standards for curricula in educational programs.

4.1.3.1  Surveys of Academic Writing Tasks

Abilities to write are recognized—by educators and students alike—as integrally important to participating in programs of academic study across all disciplines and academic fields (Huang, 2010; Nesi & Gardner, 2012; Rea-Dickins, Kiely, &
Yu, 2007; Rosenfeld, Leung, & Oltman, 2001). Most students’ writing for academic purposes involves them displaying (and, ideally, also showing evidence of their transforming) their knowledge in direct relation to the content and contexts of ideas and information they have been studying, reading, hearing about, and discussing in academic courses (Leki, 2007; Rosenfeld, Leung, & Oltman, 2001; Sternglass, 1997). Numerous studies have identified the types of writing tasks in English most frequently required at colleges and universities. Nesi and Gardner (2012) and Gardner and Nesi (2013) analyzed almost 3,000 texts produced by undergraduate and graduate students in England in the British Academic Written Corpus, identifying 13 genre families across academic fields as well as considerable variation within and across them. Essays were the most frequently occurring genre. Other prevalent genres were case studies, critiques, design specifications, empathy writing, exercises, explanations, literature surveys, methodology recounts, narrative recounts, problem questions, proposals, and research reports. These results resemble the genres identified by Hale et al.’s (1996) survey of writing tasks from 162 undergraduate and graduate courses at eight universities in the United States and Canada. Hale et al. also distinguished differences between writing tasks that were either brief or lengthy; were done either within or outside of classes; and involved either greater (apply, analyze, synthesize, or evaluate) or lesser (retrieve or organize) cognitive demands. Melzer’s (2003, 2009) surveys of thousands of undergraduate courses across a range of disciplines and institutions in the United States have shown that short-answer exams, requiring a few sentences of explanation, are the most frequently assigned, identifiable written genre, followed by journals or logs of events, and then research-related reports including term papers and lab reports. Burstein, Elliot, and Molloy (2016) surveyed educators, interns, and employers about writing tasks in American schools, colleges, and workplaces, finding that the essay genre, summary abstracts, and academic e-mails persist as frequent writing demands for students in schools as well as in higher education. However, Burstein et al. also observed that written expression at the college level “is increasingly concerned with transaction—how to get things done in the world” (p. 122) as well as “attention to citing sources” (p. 123). Functions and genres of writing in schools have also been surveyed and documented, often accompanied by disparaging observations about how little extended or expressive composing students actually do. Applebee and Langer’s (2011) National Study of Writing Instruction found most writing in middle and high schools in the United States was done in English classes; writing in most other school subjects was simply “fill in the blank and short answer exercises, and copying of information directly from the teacher’s presentation” (p. 15). Schleppegrell (2004) described the genres of writing most encountered in middle and secondary schools as argumentation, guided notes, narration, online text conversations, procedures, recounts, reflections, reports, and fictional or poetic texts. Christie (2012) observed developmental progressions in written genres and registers over
the stages of schooling in Australia from (a) young children learning the language of schooling, talking to perform, writing together with drawing, playing with words and grammar, and learning to build abstractions to (b) young adolescents dealing in their writing with abstract knowledge, technical terms, and organizing longer texts, explaining and evaluating the importance and consequences of events and texts, and foregrounding natural phenomena over human actions, and on to (c) late adolescents and young adults writing to engage with theoretical knowledge by developing extended explanations, identifying abstract themes and issues, using dense language, and discussing cultural significance.

4.1.3.2  Characteristics of Written Academic Texts

Writing differs from the fleeting, interactive nature of conversation in being realized through orthographic or print media as fixed texts to be conveyed and potentially analyzed outside of their immediate situational context (by writers themselves as well as their readers and assessors, both real and imagined), though a blurring of this distinction appears in new forms of digitally mediated communications, such as chat. In addition to its relative permanence, academic writing is usually produced at a self-controlled pace with high expectations for precision of expression (Olson, 1996; Williams, 2012). Beyond these few observations, it is difficult to make definitive generalizations about discourse characteristics that span the many genres, topics, modalities, and contexts for academic writing in English. Biber and Gray (2010) concluded that one dominant feature of academic writing is structural compression of information, particularly through extensive embedding of clauses within noun phrases—a feature that was consistently evident in their analyses of corpora of academic research articles and university textbooks and course assignments. They observed that this style is “efficient for expert readers, who can quickly extract large amounts of information from relatively short, condensed texts” (p. 2). The time available and feasible for writing during formal tests usually restricts the sample of the domain of academic writing in assessments to just a few tasks that students are able to write in brief periods rather than the complete, broad universe of academic, professional writing. For this reason, most studies describing the characteristics of academic writing in English for assessment purposes have looked to the texts that students and test takers produce during assessments (despite the obvious circularity of defining writing for assessment purposes in terms of the writing produced for assessment purposes). Accordingly, one popular approach to understanding characteristics of L2 academic writing stems from research on second language acquisition. This approach has focused on ways in which language learners’ texts demonstrate differing degrees of fluency (e.g., represented by lexical variety), accuracy (e.g., represented by grammatical errors), and complexity (e.g., represented by syntactic and discourse complexity) (Wolfe-Quintero, Inagaki, & Kim, 1998). Numerous research studies have used this analytic framework, though with mixed results, seemingly because of the complexity and variability
of writing, languages, contexts, and learner populations. For example, Crossley and McNamara (2014) found “significant growth in syntactic complexity… as a function of time spent studying English” (p. 66) among 57 adult learners of English over one semester in the United States but also that few of the syntactic features successfully predicted human judgments of overall L2 writing quality on the analytic rating scale used for the research. Cumming et al. (2005) found for a prototype version of the Test of English as a Foreign Language, Internet-based Test (TOEFL iBT) significant differences across three levels of English writing proficiency “for the variables of grammatical accuracy as well as all indicators of lexical complexity (text length, word length, ratio of different words to total words written), one indicator of syntactic complexity (words per T-unit), one rhetorical aspect (quality of claims in argument structure), and two pragmatic aspects (expression of self as voice, messages phrased as summaries)” (p. 5). Biber and Gray (2013) analyzed large corpora with discourse analytic techniques to provide detailed, comprehensive accounts of the written and spoken discourse appearing in TOEFL iBT tasks. They found “few general linguistic differences in the discourse produced by test takers from different score levels” (p. 64), observing correctly that the TOEFL’s “score levels are not intended to directly measure linguistic development in the use of particular lexico-grammatical features” (p. 66) per se, but rather represent interactions between complex dimensions of language, discourse, and ideas in written and spoken expression. Nonetheless, their research pointed toward certain salient, albeit variable, differences between texts produced at different levels of English proficiency. Notably, more proficient test takers tended to use “longer words, attributive adjectives, and verb+that clauses” than did less proficient test takers, who tended to use more possibility modals (p. 64). Other approaches to defining the domain of writing in tests have examined rhetorical aspects of discourse. For argumentative essays, which feature in many writing assessments for academic purposes, Toulmin’s (1958) model of argument structure has widely been used as a basis to evaluate the quality of propositions, claims, data, warrants, backing, and rebuttals in tasks written in this genre for writing tests (e.g., Cumming et al., 2005; Ferris, 1994; McCann, 1989). For example, Plakans and Gebril (2017) found for integrated writing tasks on the TOEFL that overall scores for writing related to ratings of “organization and coherence… with quality improving as score increased” (p. 98), but markers of cohesion did not differ significantly across the score levels. L2 writing also entails the acquisition of, and the ability to employ, a large repertoire of phrases suitable for writing in academic registers (Jarvis, 2017; Li & Schmitt, 2009; Macqueen, 2012; Reynolds, 1995; Vo, 2019; Yu, 2010).
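Indicators like those quoted from Cumming et al. (2005) can be approximated with very little code, which may help clarify what such measures do and do not capture. The sketch below is illustrative only: it uses rough regular-expression tokenization, and sentence splitting stands in for T-unit segmentation, which would normally require syntactic parsing.

    import re

    def complexity_indices(text):
        """Rough fluency/complexity indicators of the kind listed above (illustrative)."""
        words = re.findall(r"[A-Za-z']+", text)
        # Sentence boundaries serve here as a crude stand-in for T-units.
        units = [u for u in re.split(r"[.!?]+", text) if u.strip()]
        n = len(words)
        return {
            "text_length": n,
            "mean_word_length": sum(len(w) for w in words) / max(n, 1),
            "type_token_ratio": len({w.lower() for w in words}) / max(n, 1),
            "words_per_unit": n / max(len(units), 1),
        }

    sample = "The results were analyzed carefully. They suggest a clear and consistent pattern."
    print(complexity_indices(sample))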

4.1.3.3  Curriculum Standards

Generally, educational programs and systems assert their own definitions of the domain of academic writing, often explicitly in the form of curriculum policy
documents about aspects of writing to be taught, studied, and assessed but also implicitly through classroom practices, teaching and learning activities, and textbooks and other resource materials. The enormous variability around the world in the sociolinguistic and literacy contexts, purposes, and populations that study and learn writing in additional languages makes it impossible to specify universally common characteristics of the skills and knowledge for English writing for academic purposes. To try to manage this great variability for educational purposes, a few general frameworks have been established through consensus and deliberation among educators to serve as “meta-frameworks” of standards to guide the formulation and interpretation of curricula and assessments for local language programs (Byrnes, 2007; Cumming, 2001a, 2014a). For learners and users of languages in Europe, the Common European Framework of Reference (CEFR) recommends analyzing multiple dimensions of writing assessment tasks (Council of Europe, 2001, 2009; as detailed in ALTE, n.d.). The various proficiency scales in the CEFR describe aspects of writing such as grammatical and lexical range and accuracy, cohesion and coherence, content in relation to task fulfillment, development of ideas, and orthography in relation to the six main levels of the CEFR (A1, A2, B1, B2, C1, and C2)—from basic abilities to exchange information in a simple way up to mastery of a language. Table 4.1 lists the rhetorical functions, purposes, and imagined audiences for writing specified in the CEFR.

TABLE 4.1  Rhetorical functions, purposes, and imagined audiences for writing in the CEFR

Rhetorical functions: Describing; Narrating; Commenting; Expositing; Instructing; Arguing; Persuading; Reporting events; Giving opinions; Making complaints; Suggesting; Comparing/contrasting; Exemplifying; Evaluating; Expressing possibility and probability; Summarizing
Purposes: Referential; Emotive; Conative; Phatic; Metalingual; Poetic
Imagined audiences: Friends; General public; Employers; Employees; Teachers; Students; Committees; Businesses

The CEFR framework has formed the basis
for the development and evaluation of various writing assessments, for example, by Alderson and Huhta (2005), Harsch and Rupp (2011), Harsch and Martin (2012), Hasselgren (2013), Holzknecht, Huhta, and Lamprianou (2018), and Huhta, Alanen, Tarnanen, Martin, and Hirvela (2014). Most major international tests of English reviewed below have benchmarked scores from their tests in reference to the six main levels of the CEFR because of its widespread use in educational systems around the world. For learners of English as an additional language in schools in the United States, the World-​class Instructional Design and Assessment Consortium (WIDA) recommends assessing students’ writing in English for three categories—​linguistic complexity, vocabulary usage, and language control—​at six levels of ability called entering, beginning, developing, expanding, bridging, and reaching (WIDA, 2017). A similar assessment framework, Steps to English Proficiency (STEP), was developed by Jang, Cummins, Wagner, Stille, and Dunlop (2015) for English language learners in schools in Ontario, Canada. Both WIDA and STEP base their descriptions of students’ abilities on curriculum requirements generally rather than specifying particular tasks or genres of writing for assessment.

4.2  Purposes of Writing Assessments

Writing assessments differ in their purposes, which may be normative, formative, or summative.

4.2.1  Normative Purposes

Normative assessments compare the performances of all people who take a particular test, usually to inform high-stakes decisions about selection or placement into academic or language programs, certification of credentials, or employment or immigration. Normative purposes also include tests that monitor trends in an educational system overall, comparing performance across sub-regions, institutions, and time for educational authorities to decide how to develop or improve educational policies, practices, and resources. Normative assessments are usually large-scale tests with major consequences for test takers’ individual careers and therefore also with high expectations for validity, fairness, and impartiality. For that reason, normative tests are usually administered by authorized agencies external to educational jurisdictions, curricula, or programs (to avoid biases or favoritism); involve formal instruments designed, justified, evaluated, and interpreted by highly qualified experts; and follow explicit procedures standardized to be equivalent across administrations, time, and locations and without orientation to individual test takers’ identities or backgrounds. Historically, an initial impetus to establish formal writing tests was to determine admissions to universities and colleges and hiring for employment on the basis of merit and abilities rather than privilege or patronage (Cumming, 2009;
Spolsky, 1995; Weir, Vidaković, & Galaczi, 2013). Conventions for most normative language tests follow from Carroll’s (1975) model of writing, reading, listening, and speaking as four basic “language skills” that combine (subskills of) vocabulary, grammar, discourse, and communication within fixed tasks to be evaluated comprehensively. Normative tests for academic purposes may be international or national in scope. The international administration of normative language tests—such as the TOEFL iBT, International English Language Testing System (IELTS), or Pearson Test of English (PTE) Academic, described below—is organized to facilitate and monitor the movement of students between or within countries and/or academic programs. Normative language tests that are national or regional in scope are used to determine satisfactory or exemplary completion of secondary or tertiary education, for selection into higher education or employment, and/or to monitor at regular intervals the achievements of students and educational systems overall and within regions of one nation or state. Within educational institutions, normative principles to design and organize tests that compare the writing abilities of all students to each other are sometimes used in assessments to place students into appropriate courses (e.g., Ewert & Shin, 2015; Plakans & Burke, 2013) or to determine satisfactory completion of programs of English language study (e.g., Cumming, 2001b; Shin & Ewert, 2015). A fundamental characteristic of normative assessments for national or regional contexts is that they are designed for particular populations and educational programs and so are, like summative purposes of assessment, connected directly to curricula taught and studied within one educational system. In contrast, international normative language tests function independently from any curriculum, pedagogical context, specialized knowledge, or educational program to avoid biases in the content of the test or student populations (Cumming, 2014a). However, when “cut scores” from normative tests are established within institutions or programs, for example for decisions such as admissions to universities, they take on a logic akin to criterion referencing, whereby individual abilities are related directly to standards or criteria for performance rather than compared normatively across all other people taking a test. In such contexts, logical distinctions between norm referencing and criterion referencing become blurred, and potentially problematic, given the consequences of decisions for the lives of test takers.

4.2.2  Formative Purposes

Formative assessments are performed by teachers, tutors, or other students who share ongoing pedagogical commitments and interpersonal relationships. Teachers use formative assessments to decide what to teach; to evaluate, interpret, and guide students’ written texts and subsequent pedagogical or revising actions; to try to understand what functioned well or not in a task or lesson; and to enhance and motivate opportunities to learn. Practices for formative assessments of writing
may be targeted (e.g., to focus on particular language or rhetorical forms taught or studied) or opportunistic (e.g., arise from a teacher’s or peers’ impressions or preferences about a draft) (Mislevy & Yin, 2009). In contrast to standardized normative assessments, formative assessments are rich in shared local contextual information (Mislevy, 2013). Formative and dynamic assessments combine teaching, learning, and assessing rather than separating them institutionally (Leung, 2007; Poehner & Infante, 2017; Poehner & Lantolf, 2005). Certain principles have been synthesized into pedagogical advice for teachers to respond to the writing of second language students. For example, Lee (2013, 2017), Ferris (2012), and Hasselgren (2013) have advocated such principles for formative assessment of writing as: be selective, specific, and focused; draw attention to only a few “treatable” aspects of language systems; suggest concrete actions and ensure students understand them; address intermediate rather than final drafts; relate feedback to material taught and studied; motivate, encourage, and clarify learning; and engage students intellectually and interactively. Meta-analyses of results from research have established that “written corrective feedback can lead to greater grammatical accuracy in second language writing, yet its efficacy is mediated by a host of variables, including learners’ proficiency, the setting, and the genre of the writing task” (Kang & Han, 2015, p. 1). Biber, Nekrasova, and Horn (2011) likewise determined that research across English L1 and L2 contexts indicates that feedback on writing “results in gains in writing development” (p. 1). Their other meta-analytic findings suggested that: “written feedback is more effective than oral feedback”, “peer feedback is more effective than teacher feedback for L2 students”, “commenting is more effective than error location”, and “focus on form and content is more effective than an exclusive focus on form” (p. 1). Such claims need to be evaluated further because the extent and scope of available research is limited, and the situations examined vary greatly. Students’ evaluating their peers’ L2 written drafts has been an increasingly popular approach to classroom formative writing assessments. Peer assessments have proved to be effective after training through teacher modeling and analyses of genre exemplars by student peers and others (Cheng, 2008; To & Carless, 2016), in creating communities of students interested in appreciating each other’s writing and ideas, and by shifting students’ attention away from simply trying to correct grammatical errors toward more sophisticated strategies for responding such as aiming to clarify unclear meanings and offering suggestions and explanations in consultative and praising tones (Lam, 2010; Min, 2016; Rahimi, 2013). Various approaches to diagnostic assessments of writing are also practiced. Diagnostic assessments can take the form of detailed rating scales to identify and explain aspects of students’ writing in need of improvement, such as grammatical accuracy, lexical complexity, paragraphing, hedging claims, presentation and interpretation of data and ideas, coherence, and cohesion (Alderson, 2005; Kim, 2011; Knoch, 2009). Similarly, automated scoring programs like My Access and Criterion can identify problems in, and make suggestions about appropriate choices
for, students’ revisions of their writing at the levels of grammatical errors or of rhetorical organization (Chapelle, Cotos, & Lee, 2015; Hoang & Kunnan, 2016; Klobucar, Elliot, Dress, Rudniy, & Joshi, 2013).

4.2.3  Summative Purposes

Students’ achievements over the period of a course or program are conventionally summarized, documented, and reported in three ways: portfolios, grades, and/or exams. Writing readily facilitates uses of portfolios to document students’ writing activities and progress over time either in researching, drafting, and revising specific writing tasks or assembling a sample of papers to represent students’ best written work during a course (Burner, 2014; Hamp-Lyons & Condon, 2000; Lam, 2018). Portfolio approaches to assessing writing have been widely touted as the optimal way to organize summative assessments in language and writing courses because their format presents several inherent advantages. Portfolios are open to self-evaluation, teacher evaluation, and display to other people. They can promote reflection and analysis of achievements and problems, self-responsibility, and a sense of identity as an L2 writer. Portfolios also can shift the burden and time of teachers’ feedback to students’ self-assessments. For formative purposes, portfolio evaluations are usefully organized to align directly to and focus selectively on: a course’s curriculum and classroom activities; students’ declared abilities, goals, and motivations; and responses from peers, teachers, and others to support self-reflection and analyze learning progress (Burner, 2014; Lam, 2017; Lee, 2017; Little, 2005). Grades are simple, ubiquitous indicators of individuals’ summative achievements in academic courses, whether formulated as a letter grade from A to F, percentage score, or pass/fail. Surprisingly little research has examined how grading is done in L2 writing courses. Chen and Sun (2015a) found multiple interacting factors affecting the grading practices by English teachers in secondary schools in China, including diverse methods of assessment, perceptions of students’ effort and study habits, and differences by the grade level and size of class taught and teachers’ prior training in assessment. Cumming (2001b) found teachers’ practices for assessment of English writing in six countries internationally were more tightly defined in courses for specific purposes or within particular academic disciplines compared to writing courses with more general purposes or populations. In sum, variability is inevitable in assessments of writing for formative and summative purposes because such assessments function within unique educational institutions and programs; are applied to differing curricula, resources, and student populations; and are used at the discretion of individual teachers and students. Assessments of English academic writing for large-scale normative purposes, however, tend toward commonalities because they operate beyond the contexts of particular educational settings, limited time is available for writing in conditions of standardized testing, and certain genres of writing have been established as conventional in such tests in order
to meet the expectations of test takers and score users alike and mediate their high-stakes consequences. For these reasons, before reviewing procedures for validating the writing components of English language tests, we describe briefly in the next section how three major tests of English have operationalized the assessment of writing.

4.3  Academic Writing in Major International Tests of English

All major, high-stakes tests of English for academic purposes feature a writing component alongside assessments of reading, listening, and speaking. Interestingly, no major test of writing in English as a second or foreign language exists independently either nationally or internationally. Presumably, users of scores from tests of English for academic purposes—such as university admission officers, accrediting agencies, or employers—wish to see results about applicants’ writing in the context of, or as a supplement to, scores across a comprehensive range of assessments of speaking, reading, and listening. Three tests currently dominate the international markets for tests of academic English for adults: TOEFL iBT, IELTS Academic, and PTE Academic. Despite similarities in their tasks and timing for writing assessment, the three tests differ in their rating and psychometric scales and uses of human and automated scoring: Both human and automated scoring are applied in the TOEFL iBT; the IELTS uses only human scoring; and the PTE uses only automated scoring.

4.3.1  TOEFL iBT

The TOEFL iBT features two writing tasks done on a standard English keyboard within 50 minutes. Sample test questions and sample responses from the test are available on a website to orient potential test takers (ETS, 2015). Written compositions are scored from 0 to 5 on holistic rating scales defined by descriptive criteria unique for the independent task and for the integrated reading-listening-writing task at each of the six score points. Scoring is done by one trained human rater and also by an automated scoring protocol (e-rater, see Attali & Burstein, 2006); any discrepancies are resolved by another human rater, and task scores are then converted to a scaled score ranging from 0 to 30 for the writing section (ETS, 2019, 2021). The independent task is an adaptation of the Test of Written English, originally developed as an optional supplement to the TOEFL (Stansfield & Ross, 1988). Test takers write a short essay to support an opinion. Test takers “have 30 minutes to plan, write, and revise your essay” and “typically, an effective response contains a minimum of 300 words” (ETS, 2015, p. 25). A sample prompt is “Do you agree or disagree with the following statement? A teacher’s ability to relate well with students is more important than excellent knowledge of the subject being taught. Use specific reasons and examples to support your answer” (ETS, 2015, p. 25). The integrated task was introduced when the TOEFL was revised to become the TOEFL iBT, based on a framework proposed in Cumming et al. (2000) and then
field-​tested internationally (Enright et al., 2008). Test takers write an answer to a content-​oriented question in response to reading and listening to source materials on a specific topic. Test takers are told that “an effective response will be 150 to 225 words” (ETS, 2015, p. 24) and “judged on the quality of your writing and on how well your response presents the points in the lecture and their relationship to the reading passage” (ETS, 2015, p. 23).
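The scoring flow described above can be summarized schematically as follows. The discrepancy threshold, the adjudication rule, and the linear conversion to the 0-30 section scale are illustrative assumptions; ETS’s operational adjudication rules and conversion tables are not reproduced here.

    def task_score(human, e_rater, second_human=None, max_gap=1.0):
        """Combine one human rating with an automated (e-rater) rating on the 0-5
        task scale; if they diverge by more than max_gap, defer to a second human
        rating (threshold and fallback rule are illustrative assumptions)."""
        if second_human is not None and abs(human - e_rater) > max_gap:
            return (human + second_human) / 2
        return (human + e_rater) / 2

    def writing_section_score(independent_task, integrated_task):
        """Map the mean of the two task scores (0-5) onto the reported 0-30 scale
        using a simple linear conversion (an assumption, not the official table)."""
        mean_task_score = (independent_task + integrated_task) / 2
        return round(mean_task_score * 6)

    print(writing_section_score(task_score(4.0, 4.5), task_score(4.0, 3.5)))  # 24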

4.3.2  IELTS Academic

The Academic Module of the IELTS features two writing tasks, revised systematically in the early 2000s (Shaw & Weir, 2007, pp. 161–167) after initial development in the 1980s by Cambridge English Language Assessment in the UK and a consortium of Australian universities (Davies, 2008). As described on the orientation website, Prepare for the IELTS (British Council, 2019), for one task test takers “must summarize, describe or explain a table, graph, chart or diagram” in 20 minutes and at least 150 words. The second task is to write a short essay of at least 250 words in 40 minutes “in response to a point of view, argument or problem”, which can involve “a fairly personal style” (British Council, 2019). Test takers may choose to write either with pen and paper or, in many countries, on a computer while taking the writing, reading, and listening components of the test. Written texts are evaluated on a scale from 0 to 9 demarcated by descriptive criteria in four analytic rating categories: task achievement, coherence and cohesion, lexical resources, and grammatical range and accuracy (Shaw & Weir, 2007, pp. 161–167; Uysal, 2009). The scoring criteria are unique but similar for each of the two tasks; their separate scores are averaged to a whole number to report an overall band score for writing on the test. Only one rater scores the compositions, though scores are checked regularly by central raters, and raters are retrained every two years.
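A minimal sketch of the band arithmetic as described above: the four analytic criteria are averaged for each task, and the two task results are then averaged and reported as a whole-number band. The equal weighting and rounding simply follow the chapter’s description and are assumptions about the operational procedure, which may differ (for example, in task weighting or half-band reporting).

    def task_band(task_achievement, coherence_cohesion, lexical_resource, grammatical_range):
        # Average of the four analytic criteria for one task (0-9 scale).
        return (task_achievement + coherence_cohesion + lexical_resource + grammatical_range) / 4

    def overall_writing_band(task1, task2):
        # Average the two task results and round to a whole band, as described above.
        return round((task1 + task2) / 2)

    print(overall_writing_band(task_band(6, 6, 7, 7), task_band(7, 7, 7, 7)))  # 7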

4.3.3  PTE Academic

The PTE Academic, launched in 2009, evaluates test takers’ writing on six tasks using an automated scoring program (Intelligent Essay Assessor) that applies latent semantic analyses to produce scores that are matched to previous scores from large pools of expert human ratings of large numbers of field-tested and piloted test tasks (Pearson, 2018, 2019). Two tasks directly involve composing: (a) an argumentative essay of between 200 and 300 words written in 20 minutes in response to a prompt and (b) a one-sentence summary (in less than 75 words) of a written text (of up to 300 words) in 10 minutes. Supplementary scores for writing are generated from four more-constrained integrated tasks, one of which involves reading and writing (filling blanks in a written text from multiple-choice options of possible synonyms) and three of which involve listening and writing (writing a one-sentence summary of a lecture, writing to
fill blanks in a written text, and writing from a dictation). The micro-level focus of these latter tasks is presumably to establish reliability to support the overall approach to computer-based scoring. A sample prompt for the argumentative essay appears in Pearson (2018):

    Tobacco, mainly in the form of cigarettes, is one of the most widely used drugs in the world. Over a billion adults legally smoke tobacco every day. The long-term health costs are high – for smokers themselves, and for the wider community in terms of health care costs and lost productivity. Do governments have a legitimate role to legislate to protect citizens from the harmful effects of their own decisions to smoke, or are such decisions up to the individual? (p. 22)

Combined, scaled scores from 10 to 90 are reported for writing ability overall. Raw scores are generated through automated scoring that allocates partial credits of either 0, 1, or 2 (and 3 for content) on traits (initially defined by human raters’ judgments) of: content; development, structure, and coherence; length requirements; general linguistic range; grammar usage and mechanics; vocabulary range; and correct spelling. Additional scores are reported from 10 to 90 for so-called “enabling skills” of sentence-level grammar, spelling, and text-level written discourse—but with a caution that these latter scores may be helpful for diagnostic information but not for “high-stakes decision-making” because they contain “large… measurement error” (Pearson, 2018, pp. 42, 49).
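The partial-credit aggregation described above can be sketched as follows. The trait list mirrors the chapter’s description; the simple summation and linear rescaling onto 10-90 are assumptions for illustration and do not reproduce Pearson’s operational scaling.

    # Maximum partial credit per trait: content allows 0-3, the others 0-2.
    MAX_CREDIT = {
        "content": 3,
        "development_structure_coherence": 2,
        "length_requirements": 2,
        "general_linguistic_range": 2,
        "grammar_usage_and_mechanics": 2,
        "vocabulary_range": 2,
        "spelling": 2,
    }

    def scaled_writing_score(trait_credits):
        """Sum the trait-level partial credits and rescale linearly onto 10-90
        (the linear rescaling is an illustrative assumption)."""
        total = sum(trait_credits[trait] for trait in MAX_CREDIT)
        maximum = sum(MAX_CREDIT.values())
        return round(10 + 80 * total / maximum)

    example = {"content": 2, "development_structure_coherence": 2, "length_requirements": 2,
               "general_linguistic_range": 1, "grammar_usage_and_mechanics": 1,
               "vocabulary_range": 2, "spelling": 2}
    print(scaled_writing_score(example))  # 74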

4.4  Development and Validation of Writing Assessments

The development and validation of writing tests go hand in hand—first as an initial, then as an ongoing, process of argumentation, specification, evidence-gathering, analyses, and refinement to be able to assert that inferences about scores from a test are appropriate and justified. Chapelle (2008) and Xi (2008) have outlined the sequences of logical inferences and accompanying evidence that underpin current argument-based approaches to language test validation: domain definition relating test scores to a target domain; evaluation to link test performance to observed scores; generalization to link observed scores reliably to the universe of scores; explanation to link the universe of scores to construct interpretations in test contexts; extrapolation to link the universe of scores to interpretations about real-world situations; and utilization to link interpretations to uses of score reporting, decisions made based on test results, and long-term consequences for institutions and stakeholders. Recent inquiry has produced evidence for each of these types of inferences to inform the validation of large-scale tests of academic writing in English, even in instances where research has not been conducted in the context of a full validity
argument (Knoch & Chapelle, 2018). It is primarily the major international and national tests of English, however, that have conducted long-​term, systematic programs of validation research because of the significant commitments in resources, research, and expertise to ensure the fulfillment of their high-​stakes purposes. A notable exception is Johnson and Riazi (2017), whose research uniquely analyzed a full set of evidence and argumentation to inform the development of a new writing test in one EFL program in Micronesia—​revealing in the process numerous shortcomings and recommendations for improvement of the test. As Davison and Leung (2009) have rightly urged, concepts, activities, and procedures for validation of large-​scale standardized tests establish precedents and expectations for all teacher-​based and program-​level assessments. In the following paragraphs we highlight some key published validation studies and issues that have emerged and will continue to require further refinement, elaboration, and justification for all writing tests. Studies defining the domain of academic writing in English for academic purposes were reviewed already in Section 1.3 of this chapter.

4.4.1  Evaluation

A fundamental concern of assessments is to specify precisely and comprehensively the knowledge, skills, and abilities that test takers are expected to demonstrate in writing tasks and to ensure that these skills and behaviors are elicited and evaluated appropriately, pragmatically, and in justifiable ways in a test. As explained near the outset of this chapter, assessments of writing occur through evaluators using pre-specified descriptive scales to rate test takers’ performances on composing tasks. Accordingly, as Knoch and Chapelle (2018) observed, two foci have dominated research evaluating whether writing assessments match observed scores with intended characteristics. One focus has been on the properties and functions of rating scales. The second focus has been on raters’ thinking and decision making while rating written compositions. A primary expectation for evaluation evidence is that raters distinguish testees’ writing reliably and accurately into different ability levels. For example, Plakans and Gebril (2013) observed that when composing integrated writing tasks, “high-scoring writers selected important ideas from the source texts and used the listening text as the task prompt instructed” whereas “low scoring writers depended heavily on the reading texts for content and direct copying of words and phrases” (p. 217). Similar findings appeared in Burstein, Flor, Tetreault, Madnani, and Holtzman (2012). A related issue is whether raters use all points on a rating scale consistently. For example, Huhta et al. (2014) reported achieving this kind of match from their multi-faceted Rasch analyses between holistic rating scales derived from the CEFR and rating processes for adolescent students’ writing in English and in Finnish on a national test in Finland. In contrast, Becker (2018) found that teachers in an Intensive English Program in the United States did not use all
points on a rating scale consistently, so he recommended revisions to the content of the scale and the scoring procedures. A different kind of evaluation appeared in studies by Hoang and Kunnan (2016) and Liu and Kunnan (2016), which analyzed the accuracy of error recognition in L2 students’ writing by automated scoring programs and by human raters—for My Access! and WriteToLearn, respectively—finding both of these automated scoring programs greatly lacking in accuracy. A second evaluation expectation is that rating scales address particular hypothesized abilities. One approach to such evaluation has been through factor analytic studies. For example, Sawaki, Quinlan, and Lee (2013) identified factors of comprehension, productive vocabulary, and sentence conventions within scores for TOEFL iBT integrated reading-listening-writing tasks. Zheng and Mohammadi (2013) identified two factors—analytical/local writing and synthetic/global writing—in scores across the writing tasks on the PTE Academic. Further evidence for evaluation inferences has addressed the reliability, veridicality, or sources of unintended variance in raters’ processes of scoring writing, for instance, through process-tracing methods such as verbal reports, interviews, or questionnaires. Lumley (2005) elaborated a model from his analyses of raters’ thinking processes while scoring writing tests in Australia to illuminate how raters, in addition to their expected uses of the scale to score writing, also applied their knowledge from teaching experiences, program conditions, and policy contexts to interpret and judge the quality of writing. Barkaoui (2010), Cumming (1990), and Shi, Wang, and Wen (2003) documented differences in rating processes and criteria between experienced and inexperienced raters and educators. Their findings bolster the longstanding tenet that minimal levels of educational and assessment experience along with orientation, training, and practice are needed to produce consistency among composition raters, an equally integral element for generalization inferences as described below (Attali, 2016; Cotton & Wilson, 2011; Harsch & Martin, 2012; Lumley & McNamara, 1995; Weigle, 1994). A further type of evaluation research has examined raters’ processes of using scoring schemes for written compositions. For example, Becker (2018), Chan, Inoue, and Taylor (2015), Ewert and Shin (2015), Gebril and Plakans (2014), and Harsch and Hartig (2015) have analyzed raters’ applications of scoring rubrics in the process of developing new tasks for particular language tests. Their studies have revealed considerable variability and differences in raters’ judgments of test takers’ writing quality in reference to common criteria or rating scales. Such inquiry is an obvious direction to take in future validation studies to be able to assert that raters really do score writing in ways expected for a test. Schaefer (2008) produced an exemplary study to demonstrate how multi-faceted Rasch analyses can reveal patterns of bias among raters—and provide empirical bases for rater training—in respect to relative leniency or severity in scoring text features like organization, content, language use, or spelling and punctuation.
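For readers unfamiliar with the technique, the many-facet Rasch model behind analyses such as Schaefer’s can be written, in one common rating-scale formulation, as

    \log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k

where \theta_n is the ability of writer n, \delta_i the difficulty of task or criterion i, \alpha_j the severity of rater j, and \tau_k the threshold for awarding category k rather than k-1. Systematic patterns in the estimated \alpha_j (or in added rater-by-criterion interaction terms) are what such studies interpret as rater bias.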

216

126  Alister Cumming et al.

Barkaoui (2013) found that TOEFL iBT writing tasks distinguished effectively between English proficiency levels among a sample of university students but also showed a weak effect related to students’ keyboarding skills. Numerous studies have evaluated whether test takers’ computer skills significantly affect their performance on various tests of English academic writing. Most recent studies have concluded that there are only small effects, if any (Barkaoui & Knouzi, 2018; Brunfaut, Harding, & Batty, 2018; Chan, Bax, & Weir, 2018; Kim, Bowles, Yan, & Chung, 2018; Jin & Yan, 2017). Capitalizing on this point, Jin and Yan (2017) and Yu and Zhang (2017) have argued compellingly that computer communications and skills have become so prevalent in education internationally that computer-based testing should be established as the norm for writing tests so as not to disadvantage students who are relatively unaccustomed to writing with pen and paper. Choi and Cho (2018) studied the effects of correcting or leaving spelling errors in L2 compositions written for a pilot test, finding that trained raters produced significantly higher holistic scores for texts with corrected spelling, particularly at lower or middle levels of English proficiency, suggesting that permitting the use of spellcheckers (as is now seldom done in tests) could have a distinct impact on scores for writing in high-stakes tests.
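For readers less familiar with the multi-faceted Rasch analyses cited in this section (Huhta et al., 2014; Schaefer, 2008), the block below gives the conventional many-facet Rasch formulation; it is the generic model rather than the exact specification used in those studies.

```latex
% Many-facet Rasch measurement (generic form): the log-odds of examinee n
% receiving score category k rather than k-1 from rater j on writing task i
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k
% \theta_n : ability of examinee n     \delta_i : difficulty of writing task i
% \alpha_j : severity of rater j       \tau_k   : difficulty of category k relative to k-1
```

Because the model estimates rater severity and category thresholds separately from examinee ability, analyses of this kind can show whether raters use all scale points consistently and whether particular raters are systematically lenient or severe, which is what the studies above exploit.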

4.4.2  Generalization

Three issues have dominated studies that address generalization inferences from writing tests. The main issue is whether scores obtained from assessments are consistent across parallel forms of a test, across raters assessing writing, and across different examinee populations. To this end, Barkaoui (2018) used multi-level modeling to establish that the writing scores of 1,000 test takers who took the PTE Academic three or more times did not change appreciably (i.e., they increased slightly the second time but then declined) and did not vary by age or gender. Sawaki and Sinharay’s (2018) study of four administrations of the TOEFL iBT used subscore analyses and confirmatory factor analysis to establish that results were consistent across each administration of the test and across three major examinee populations (Arabic, Korean, and Spanish L1s). However, they observed that the writing subscores were less reliable than were subscores for other sections of the test, perhaps for reasons discussed in the next paragraphs (i.e., only two writing tasks on this test and differences in genres or types of writing). Various studies have investigated whether raters who either share or do not share the first language of test takers display biases or preferences in assessing writing on English tests in relation to their knowledge of rhetorical, syntactic, or lexical conventions across languages: for example, Kobayashi and Rinnert (1996) for Japanese; Marefat and Heydari (2016) for Iranian; and Shi (2001) for Chinese. Nonetheless, as Milanovic, Saville, and Shen (1996) concluded, individual raters inevitably exercise somewhat unique styles for evaluating written compositions.

A second issue concerns the number and kinds of writing tasks that (a) represent the domain of academic writing, (b) are feasible for test takers to perform in the context of a formal test, and (c) are feasible for testing agencies to produce in equivalent, meaningful forms. English proficiency tests can evaluate only a limited sample of a person’s writing within the typical time constraints imposed by a standardized, comprehensive language test, so any writing assessment can only select from among the many types of tasks, communication purposes, and knowledge sources with which writing is normally practiced. Concerns about practicality, equivalence, and fairness also obtain. Writing tasks must be readily generated by test developers in large numbers and at consistently high quality, sufficiently different that they do not promote cheating or rote memorization yet logically and operationally equivalent across different administrations of the test, fair to varied populations of test takers, and suitable for (human and automated) scoring processes. Psychometrically, more measurements than the one or two tasks that currently feature in most writing tests would be desirable to make reliable assessments of individuals’ writing abilities. Generalizability studies by Bouwer, Beguin, Sanders, and van den Bergh (2015), Lee and Kantor (2005), and Schoonen (2005) have recommended three to five samples of individual students’ L2 writing as a basis for making reliable estimates of their English writing abilities. Gorman, Purves, and Degenhart’s (1988, pp. 28–40) survey of L1 writing in secondary schools in 14 countries determined that they needed to have each participating student write three compositions in different genres in order to obtain samples of writing that represent the domain of school writing internationally. The third, related issue is whether different types of writing tasks in a test should produce identical or somewhat different scores depending on their respective genres, content, and cognitive demands. Numerous studies have aimed to determine whether integrated writing tasks (that involve reading and/or listening sources) elicit written performances that are distinct from, but also complementary or intrinsically related to, test takers’ performances on writing tasks that are independent of sources in the same test. For example, in a preliminary study for the TOEFL iBT, Cumming et al. (2005) found that “the discourse produced for the integrated writing tasks differed significantly from the discourse produced in the independent essay” (p. 5). Subsequent studies have established direct correspondences between scores on TOEFL iBT integrated writing tasks and both the rhetorical organization and the coherence in test takers’ independent compositions (Gebril & Plakans, 2013). Other kinds of task differences, such as topical or background knowledge, are potential sources of construct-irrelevant variance in assessments of English writing. Expressing ideas or information is, of course, integral to the production of writing as well as its assessment; test takers can be expected to express themselves uniquely in their writing. Nonetheless, most tests of writing for academic purposes deliberately aim to avoid content that may be perceived as contentious, biased, or
involving specialized knowledge. As Shaw and Weir (2007) put it for Cambridge’s writing tests, “At all levels topics that might offend or otherwise unfairly disadvantage any group of candidates are avoided” (p. 133). He and Shi (2012) showed how general topics for writing in a test at one Canadian college produced more equitable and effective writing from international students compared to a relatively more specialized topic. Nonetheless, different topics, prompts, or content for academic writing inevitably involve some variation. Cho, Rijmen, and Novák (2013) determined that variations in scores on the integrated reading, listening, and writing tasks in the TOEFL iBT generally reflect differences in the English abilities of examinee populations but also vary on particular test administrations according to such characteristics of source materials as the distinctness and relative difficulty of ideas within reading and listening prompts. O’Loughlin and Wigglesworth (2003) likewise revealed slight variations across prompts for IELTS writing tasks, pointing to inconsistencies among human raters and/​or topic or rhetorical differences in writing prompts.
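To see why the studies above converge on three or more writing samples, it may help to consider a simplified decision-study calculation for a person-by-task-by-rater design. The variance components in the sketch below are invented for illustration only; they are not estimates from any test or study discussed in this chapter.

```python
# Illustrative decision study: how score reliability changes with the number
# of writing tasks. The variance components below are hypothetical placeholders.

def g_coefficient(var_person, var_person_task, var_person_rater, var_residual,
                  n_tasks, n_raters):
    """Generalizability coefficient for a fully crossed person x task x rater design."""
    relative_error = (var_person_task / n_tasks
                      + var_person_rater / n_raters
                      + var_residual / (n_tasks * n_raters))
    return var_person / (var_person + relative_error)

# Hypothetical variance components: person, person-by-task, person-by-rater, residual
components = dict(var_person=0.50, var_person_task=0.20,
                  var_person_rater=0.05, var_residual=0.25)

for n_tasks in (1, 2, 3, 5):
    coefficient = g_coefficient(n_tasks=n_tasks, n_raters=2, **components)
    print(f"{n_tasks} task(s), 2 raters: generalizability = {coefficient:.2f}")
```

With these hypothetical components, one task and two raters yield a coefficient near .59, whereas three to five tasks raise it to roughly .79 to .85, mirroring the pattern behind the recommendations cited above.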

4.4.3  Explanation

Explanation inferences for assessments in second languages are complicated for reasons of complexity and variation already outlined at the start of this chapter, challenging the formulation of straightforward construct definitions. Further complications follow from the normative purposes, large-scale contexts, and highly diverse populations involved in most tests of English writing for academic purposes. Tests like the IELTS, TOEFL iBT, and PTE aim to make reliable, general distinctions in abilities among enormously varied groups of test takers rather than to produce fine-grained indices of language or literacy development such as might be expected in a longitudinal research experiment or uniform curriculum context. People seldom take or administer formal tests of L2 writing to discover or explain how L2 writing is learned. Chen and Sun (2015b) exposed a fundamental dilemma for explanation inferences about English language learners taking the Ontario Secondary School Literacy Test (a requirement for all students in Ontario to graduate from secondary school) because the test has produced distinctly different results each year for students who are English-dominant and those for whom English is a second language. Chen and Sun observed that English language learners experience literacy differently with respect to their vocabulary knowledge, writing experience, and genre familiarity. Cummins (1984) among others has long argued that tests in schools that have been norm-referenced for majority students are inappropriate to apply to minority-background students who have not had sufficient opportunities or time (i.e., 5 to 7 years) to acquire the full academic language and literacy in the second language to succeed in such tests. One type of research for explanatory inferences on English language tests has focused on test takers’ written texts, seeking evidence of developmental
differences in the writing of test takers at different score levels, aiming to justify the scoring schemes and scales on a test. Banerjee, Franceschina, and Smith (2007), for example, analyzed features of compositions written from bands 3 to 8 on the IELTS to show that indicators of text cohesion, vocabulary richness, and grammatical accuracy improved distinctly across these ability levels, but they also found that these text features interacted variably with different task types. Likewise, Cumming et al. (2005) examined texts written for prototypes of the TOEFL iBT to find that more proficient test takers (according to their overall TOEFL scores) tended across independent and integrated writing tasks to “write longer compositions, use a greater variety of words, write longer and more clauses, demonstrate greater grammatical accuracy, employ better quality propositions and claims in their arguments, and make more summaries of source evidence” (p. 45). Kim and Crossley (2018) determined that “higher rated essays” on the TOEFL iBT tended to contain “more sophisticated words… greater lexical overlap between paragraphs, and longer clauses” (p. 39).
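As a deliberately simplified illustration of the kinds of textual indicators these studies quantify, the sketch below computes word count, lexical variety, and mean sentence length from an essay. It is not the instrumentation used in the cited research, which relies on much richer natural language processing tools and far larger feature sets.

```python
# Toy illustration of text indicators often examined in explanation-oriented
# studies of writing scores (length, lexical variety, sentence length).
import re

def text_features(essay: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay.lower())
    return {
        "word_count": len(words),
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

sample = ("The lecture challenges the reading. It argues that the proposed "
          "benefits are overstated because the supporting evidence is incomplete.")
print(text_features(sample))
```

Indicators of this kind become meaningful only when they are related, as in the studies above, to scores, tasks, and proficiency levels across large samples of test takers.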

4.4.4  Extrapolation

Research supporting extrapolation inferences has evaluated the extent to which the writing done on a language test corresponds to the writing that students do in academic courses. Perceptions about the authenticity of writing tasks feature in certain studies. For example, Cumming, Grant, Mulcahy-Ernt, and Powers (2004) interviewed a range of experienced ESL and writing teachers at North American universities, asking them to review prototype writing tasks for the TOEFL iBT along with samples of students’ performances on them, and found that the teachers thought the new tasks and performances represented the types of writing that students do in English writing courses. In a more extensive questionnaire survey of instructors’ as well as students’ perceptions, Llosa and Malone (2017) found that participants thought writing performance, tasks, and rating criteria on the TOEFL iBT corresponded to writing for course assignments. Other studies have compared features of texts written on language tests and in course papers. Riazi (2016) found that compositions written for the TOEFL iBT by 20 graduate students in several disciplines at Australian universities and samples of their course papers bore similarities as well as differences on indicators of syntactic complexity, lexical sophistication, and cohesion. Llosa and Malone (2019) compared texts written for the TOEFL iBT by 103 undergraduate students with first and second drafts of two course assignments, together with instructors’ grades for the papers and holistic ratings of English proficiency. Reporting various correlations, they concluded that the “quality of the writing on the TOEFL tasks was comparable to that of the first drafts of course assignment but not the final drafts” (p. 235). Other studies have compared scores of writing quality on test tasks and on papers written for academic assignments. Biber, Reppen, and Staples (2016)
found TOEFL iBT scores to be “weak to moderate predictors of academic writing scores” (p. 953). They emphasized the contextual complexities of such comparisons, highlighting variation between integrated and independent writing tasks on the test, scores for language and for rhetorical organization, undergraduate and graduate students, academic disciplines, and genres of course papers. Variability related to language background, academic fields, topics, and language proficiency similarly emerged from Weigle and Friginal’s (2015) multidimensional analysis comparing linguistic features in essays written for English tests with course papers in the Michigan Corpus of successful student writing at that university. An innovative study by Beigman Klebanov, Ramineni, Kaufer, Yeoh, and Ishizaki (2019) used corpus-analytic techniques to create rhetorical/lexical profiles of argumentation in independent essays written for the TOEFL iBT and for the Graduate Record Exam in order to compare them to large samples of academic papers in the Michigan Corpus and opinion essays in The New York Times. They concluded that “test corpora, focusing on argumentation in two standardized tests, are rhetorically similar to academic argumentative writing in a graduate-school setting, and about as similar as a corpus of civic writing in the same genre” (p. 125). A commonly debated extrapolation issue is the so-called “predictive validity” of English language tests in view of their uses to determine whether students from international backgrounds have sufficient language proficiency to succeed in academic studies in English. A consensus from empirical studies is that a moderate relationship exists between scores from English language tests and such indicators of academic success as Grade Point Average, but correlations often tend to be weaker for scores of writing than for scores of reading, speaking, or listening abilities. It must be acknowledged that all such trends are mediated inextricably by many contextual factors such as students’ language backgrounds, range of proficiency levels, individual differences, and prior grades in educational programs; fields and programs of study; institutional selectivity, resources, and expectations; and availability of facilitating supports such as English instruction or tutoring (Bridgeman, Cho, & DiPietro, 2016; Cho & Bridgeman, 2012; Ginther & Elder, 2014; Graham, 1987).

4.4.5  Utilization

Utilization inferences for test validation may involve two functions. One function relates to the information, scores, and decisions from assessments being appropriate and clearly communicated to relevant stakeholders. To this end, follow-up studies such as Malone and Montee’s (2014) and Stricker and Wilder’s (2012) surveys and focus groups with students, instructors, and university administrators have helped to clarify how these stakeholders use the TOEFL iBT and to improve score reporting. A comparative, institutionally grounded perspective appeared in Ginther and Elder’s (2014) analyses of uses and understandings of the TOEFL iBT, IELTS, and PTE Academic by administrators at two universities, one in the United States and
the other in Australia. Kim (2017) analyzed how Korean students in discussion forums claimed they perceived, prepared for, and utilized results from the TOEFL iBT, raising many questions about practices in test preparation courses and about students’ understanding of the test and its score reports. The other set of utilization inferences concerns long-term effects in promoting the interests of users of assessments and their results, what Kunnan (2018, p. 257) called “justice” or “beneficial consequences”. Consequences can be either negative or positive, of course. A longstanding criticism of standardized writing tests is that they can narrow the focus of teaching and learning writing to certain formulaic genres that can be easily coached, may be so widely practiced as to be over-exposed, and limit students’ creativity or invention (Hillocks, 2002; Raimes, 1990). Green’s (2006) study of preparation classes for the IELTS did observe a narrowing of focus in teaching and studying for that test but also documented practices for pedagogy and learning that commonly appear in other contexts of education for English for academic purposes. Detailed case studies by Wall and Horak (2008) demonstrated that the introduction of integrated tasks in the TOEFL iBT had a discernible, long-term, and positive impact on teaching and learning practices in several English language courses, particularly through the mediation of textbook materials that emphasized these complex task types in classroom activities. Zheng and Chen (2008) observed that many educators think the College English Test (CET) has promoted a basic proficiency in English among university students throughout China, but others question this assumption because the test has not emphasized communicative competence. Sun (2016) gathered extensive information from employers and graduate admissions officers at universities about how they used applicants’ scores from the CET, revealing complex but uncertain consequences from this test throughout higher education, businesses, and industries in China. Harrington and Roche (2014) demonstrated the institutional impact of a writing assessment at a university in Oman, showing that it usefully identified students who were linguistically at risk because of their English proficiency and who could benefit from supplementary English instruction to bolster their opportunities to learn and use the language effectively in their academic studies. From a research perspective, experiences participating as raters in studies of writing assessments have frequently been described as enhancing people’s knowledge about writing assessment as well as self-awareness of their own rating and even teaching processes.

4.5  Directions for Development and Future Research

We propose four objectives to guide future development and research on tests of writing for academic purposes. These objectives assume that testers continue to pursue issues and evidence about all aspects of test validity while building on the foregoing design principles and current state of major tests of writing in English for academic purposes. One objective is to specify clearly the expected purposes,
audiences, and domains for writing tasks to guide and engage test takers in producing texts that are written to be appropriate and effective for specific intended reasons, contexts, genres, and information sources and that can be scored explicitly for the fulfillment of such criteria. A second objective is to increase the coverage of written genres and contexts for academic writing by increasing the number and types of writing tasks appearing on tests, thereby improving test reliability. A third objective is to expand on the inherently academic nature of integrated writing tasks by introducing additional types of content-responsible tasks involving personal expression, commitment to relevant ideas and information, and explicit references to source texts and contexts. A fourth objective is to develop further the systems, practices, and applications of automated scoring.

4.5.1  Specify Task Stimuli and Rating Criteria

Yu (2009, 2013) tellingly compared the instructions and prompts for summary tasks on major tests of English, revealing that most tasks were not well specified and differed markedly from each other in their definitions and functional operationalizations of how and why a summary should be written and evaluated. His analyses point toward aspects of task stimuli and rating criteria that could usefully be refined on language tests, including the specification of audience, voice, domains, purposes, genres, and information sources for academic writing. Contextual factors are major sources of variability across the diverse situations and kinds of interactions in which academic writing occurs. Writers’ knowledge and awareness of intended audiences, purposes, and genres for writing play important roles in their composing processes by invoking relevant knowledge (e.g., expected genres, rhetorical structures, or goals), which impacts the quality of writing. Research needs to investigate the extent to which it is feasible to incorporate specific contextual factors into standardized language tests and whether doing so improves a test’s capacities to evaluate test takers’ writing, or, in contrast, whether the inclusion of contextual factors might constrain or bias people from diverse backgrounds internationally under conditions of formal assessment.

4.5.1.1  Audiences and Purposes

The intended audiences for academic writing should be more systematically prompted and evaluated in writing tasks, on the one hand to enhance test takers’ motivation, purposes, and sense of authenticity and, on the other, to enable raters to evaluate whether the pieces of writing produced on a test have achieved their aims. Realistic, engaging, and relevant scenarios for writing should be stipulated to overcome the obvious constraint in a standardized test that the real audiences for test takers’ writing are in fact the raters and scoring systems of the test. To write meaningfully and effectively, test takers need to know to whom
they are supposed to be writing, in what kind of situation, and for what purpose (Magnifico, 2010; Purpura, 2017). In turn, raters need to know these elements in order to be able to evaluate precisely and earnestly whether test takers’ writing has achieved its intentions. “Task fulfillment” appears commonly as a broad criterion for evaluating writing in many tests, but applicable criteria tend to be phrased generally so as to span a range of similar tasks, leaving assessors to their personal impressions as bases for judgments of task fulfillment rather than to address directly the perspectives of specific interlocutors, communication purposes, or realistic contextual factors. Cho and Choi (2018) have started to evaluate this issue, finding that specifying expected audiences for writing on a test had distinct influences on test takers’ source attribution and context statements, although these qualities in writing varied across score levels. In academic contexts, the majority of formal writing tasks require or imply that writing is done for students’ course instructors or professors to whom information, understanding, opinions, and ideas are displayed, analyzed, classified, recounted, or critiqued for the purpose of demonstrating relevant knowledge. Another audience for academic writing is a broad professional community of which students are members, either specifically in an institutional setting (e.g., writing a memo to a manager or negotiating with an educational administrator) or generally (e.g., classifying information for scientific verification). Students also write to their peers or student colleagues in the contexts of gathering information, completing, or reporting on group tasks or assignments, participating in online courses, using multimedia resources, sharing or requesting information with one another informally, or conducting projects for research or course or professional presentations.

4.5.1.2  Personal Voice and Stance

Complementary to the audience addressed while writing are the extent and qualities of personal expression expected in writing tasks. Test takers need to know whether, and how far, they are expected to express, assert, critique, or create new viewpoints on the ideas, topics, or information they write about. Argumentative tasks may put a premium on articulating one’s opinions in writing, but tasks that involve integrated writing from sources are often less transparent about the stance that a writer should take. A longstanding complaint among instructors and assessors alike is that underprepared students, particularly those writing in another language, become overwhelmed by a lack of understanding of source materials when attempting to write about them (F. Hyland, 2001; Macbeth, 2010). Knowing how to express one’s voice appropriately in writing may well be a developmental process in academic literacy and acquisition of an additional language. Wette (2010) succinctly identified voice as one of three key skills that university students need to acquire in order to write from sources in English: “comprehending complexities in texts, summarizing propositional content accurately, and integrating citations with their own voices and positions” (p. 158).

A potential foundation for specifying personal voice for writing and scoring schemes follows from K. Hyland’s (2010) construct of “proximity”, which defines the rhetorical relations of shared understanding that writers establish with readers of their texts through certain linguistic devices. These devices include organization (e.g., in placement of main and supporting points), argument structures (e.g., through appeals, focus, and framing information), establishing credibility (e.g., designating the relative status and authority of information, reporting verbs), stance (e.g., self-​identification, direct quotations, hedges, or claiming affinity), and reader engagement (e.g., through plural first-​or second-​person pronouns, questions). In related work, Zhao (2013) developed and validated a scoring rubric for measuring the quality of authorial voice in TOEFL iBT independent writing tasks that could potentially be incorporated into rating scales for this or other tests. Stimulating motivational factors such as interest in topics, relevance of writing tasks, and perceptions of source materials may also positively influence the quality of students’ writing performance (e.g., Boscolo, Ariasi, Del Favero, & Ballarin, 2011).
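To make the idea of operationalizing voice slightly more concrete, a first step might be to count surface markers of stance and reader engagement, as in the hypothetical sketch below. The marker lists are invented for illustration and fall far short of Hyland’s construct of proximity or Zhao’s validated rubric.

```python
# Hypothetical sketch: counting surface markers of stance and reader engagement.
# The marker lists are illustrative only, not drawn from any operational rating scale.
import re

HEDGES = {"may", "might", "perhaps", "possibly", "appears", "suggests"}
SELF_MENTION = {"i", "my", "we", "our"}
READER_PRONOUNS = {"you", "your", "us"}

def voice_profile(text: str) -> dict:
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "hedges": sum(token in HEDGES for token in tokens),
        "self_mention": sum(token in SELF_MENTION for token in tokens),
        "reader_engagement": sum(token in READER_PRONOUNS for token in tokens),
        "questions": text.count("?"),
    }

print(voice_profile("Perhaps we should ask: what might this evidence suggest to you?"))
```

Whether such counts capture anything of the rhetorical relations Hyland describes is, of course, an empirical question of exactly the kind raised above for rating scales.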

4.5.2  Increase Coverage of Domains and Genres

Large-scale standardized tests of academic writing must limit the domains of writing to those in which students are most frequently and importantly expected to perform competently during educational studies and that generalize across academic disciplines and topics. As observed above, research suggests that several samples of a person’s writing on different tasks are necessary to elicit a set of performances that reliably represent an individual’s abilities to write in key academic genres. While many tests currently make do with a couple of different kinds of writing tasks, more writing tasks would be desirable from the perspectives of psychometric properties as well as domain coverage. Given constraints on the time and concentration that test takers can reasonably sustain on a formal test, this recommendation implies that any tasks added to current tests be brief and manageable while also realizing interactionist principles of composing and assessing. What domains and tasks may be relevant?

4.5.2.1  Academic and Practical-social Domains

Two primary domains for writing in academic contexts are the academic domain and the practical-social domain. The academic domain refers to contexts of language use where people produce coherent written texts to convey their understanding of and beliefs about specific information involving ideas and related knowledge for academic audiences, such as professors, course instructors, staff, or other students. The practical-social domain involves the transaction of information, ideas, and
knowledge in routine situations of written communication to establish, analyze, recount, question, or maintain interactions with people for practical and social reasons while negotiating or navigating contexts of academic studies or work. Beyond these broad distinctions, domains for academic writing are unique and vary according to specific academic fields, technical expertise, specialized terminology, conventions of relations between people, assumed or accepted knowledge, available information, the relative strength of assertions, personal viewpoints, expectations for consequences or impact, and styles of reasoning about issues or problems. As useful as the construct of genre may be for research, theories, or teaching, no language test could reasonably include tasks for all 13 genre families that Gardner and Nesi (2013) showed to be common across disciplines in higher education. For these reasons, domain coverage of academic writing for assessment purposes must necessarily shift up to a level of abstraction that spans specific genres. We suggest that level be rhetorical functions—​for example, to inform, to persuade or recommend, or to argue or critique—​which in turn can be grouped together as having explanatory, transactional, or expressive functions.

4.5.2.2  Explanatory Writing

The primary purpose of explanatory writing is to describe and explain concrete or accepted information. Characterizing short-answer exams as the most frequent form of explanatory writing observed at colleges, Melzer (2009) noted that they “almost always consist of questions that require rote memorization and recall of facts, and the instructor almost always plays the role of teacher-as-examiner, looking for ‘correct’ answers” (p. 256). In other words, the goal of this purpose for writing is to convey students’ understanding of information acquired through reading, lectures, or verbal interactions in classrooms, or combinations of these activities, to audiences such as professors, course instructors, or teaching assistants. Integrated writing tasks in current language tests aim to capture the integral characteristics of written responses in short-answer exams without, however, recreating either the full contexts of learning that real study or research for academic purposes entails or the variety of types of writing expected across differing academic disciplines or fields.

4.5.2.3  Transactional Writing

Transactional writing refers to writing done in anticipation of others’ actions or services (Britton, Burgess, Martin, McLeod, & Rosen, 1975). Transactional writing has a pragmatic purpose and realization that takes place within practical, social situations where readers are often expected to do something as a result of written communications addressed to them. Burstein et al. (2016) identified numerous transactional written genres that students in higher education often produce, including
e-mails, class presentations (e.g., PowerPoint), peer reviews, memos, application letters, and blogs. These types of written genres require writers to pay close attention to their intended audiences and purposes in a specific social context, rather than merely to summarize information “for the sake of displaying knowledge” (Melzer, 2003, p. 98). There are numerous directions and foundations for the design of assessment tasks for transactional purposes. Biesenbach-Lucas (2009), for example, demonstrated that there are rhetorical moves in the genre of academic e-mails written by students to professors or peers that are either obligatory (e.g., opening greetings, provision of background information, requests, closings, signoffs) or optional (e.g., elaborating, justifying, attending to recipient’s status, thanking).

4.5.2.4  Expressive Writing

The most common type of task on tests of English writing is an argumentative essay in which test takers are asked to express their viewpoint on a controversial topic and to elaborate and defend it. The task type may be so frequently practiced, studied, and assessed at all levels of education that it runs the risk of being over-exposed. Nonetheless, producing an extended, coherent text expressing personal knowledge, experience, and/or opinions can offer evidence of test takers’ abilities to critically evaluate, develop, and express in writing a perspective about a debatable topic, which is consistent with standards for academic performance and college readiness. Other forms of expressive writing, such as creatively composing stories or poems or writing about one’s feelings to overcome trauma, are surely valuable for educational or therapeutic reasons, but are not central to writing across academic fields.

4.5.3  Elaborate Integrated Writing Tasks

To assess whether people possess English writing abilities sufficient for academic studies requires tasks that realistically involve test takers interpreting, justifying, and writing about content from source reading, listening, and/or multimedia materials. To this end, integrated tasks conceptualize writing as discourse practices of synthesizing multiple sources of knowledge, texts, and skills, consistent with constructivist (e.g., Kintsch, 1998) or multiliteracies (e.g., Cope & Kalantzis, 2000) theories of communication. Knoch and Sitajalabhorn (2013) defined integrated writing tasks as follows: “Test takers are presented with one or more language-rich source texts and are required to produce written compositions that require (1) mining the source texts for ideas, (2) selecting ideas, (3) synthesising ideas from one or more source texts, (4) transforming the language used in the input, (5) organizing ideas and (6) using stylistic conventions such as connecting ideas and acknowledging sources” (p. 306). The value of integrated tasks for writing assessments has been broadly accepted, but several key issues stand out for further research and development.

4.5.3.1  Threshold Level

Abilities to integrate ideas from source materials effectively and appropriately into writing may involve precisely the set of skills that demarcate those students who possess sufficient proficiency in an additional language from those who do not with respect to their ability to perform competently in English in academic contexts (Cumming, 2013; Grabe & Zhang, 2013; Knoch, 2009; Sawaki, Quinlan, & Lee, 2013). However, research needs to evaluate and confirm this hypothesis. A frequently observed limitation of integrated writing tasks is that test takers require a threshold level of comprehension of the vocabulary, rhetorical situation, and overall discourse of source materials in order to make sufficient sense of them to write coherently about them (Cumming, 2013, 2014b; Knoch & Sitajalabhorn, 2013). That is, integrated writing tasks may be informative about the writing abilities of language learners at high levels of English proficiency but frustrating and counterproductive for learners with low proficiency. Gebril and Plakans (2013) found that grammatical accuracy and the limited extent to which test takers used material from source texts played a distinctive role in judgments of the quality of writing from sources for Egyptian students at lower levels of English proficiency, “whereas other textual features, such as cohesion, content, and organization, are more critical at higher level writing” (p. 9).

4.5.3.2  Cognitive Skills

What might be the requisite skills, knowledge, and behaviors to perform integrated writing tasks? Considerable evidence has accumulated that such abilities can readily be taught and learned (e.g., Cumming, Lai, & Cho, 2016; Wette, 2010; Zhang, 2013). Indications of relevant abilities appear in recent studies involving think-aloud protocols and post-writing interviews to illuminate the processes used by students of English as they performed integrated reading-writing tasks. But it remains to be established that criteria for scoring writing can be developed to evaluate written texts from tests without direct access to data on test takers’ thinking and decision-making processes. Plakans (2009) built on Spivey’s (1997) model of discourse synthesis to identify three basic processes: organizing subprocesses of integrated writing (e.g., “planning content, structuring their essays, and using strategies to understand the readings”, p. 572); selecting parts of source texts for writing (e.g., rereading and reinterpreting source texts, paraphrasing, deciding on citations); and connecting (i.e., “writers found relationships between the topic, their experience, the source texts, and their essays”, p. 574). Unlike Spivey’s studies of English L1 writers, Plakans also identified common language difficulties experienced by L2 writers. These difficulties involved attending to issues of style and vocabulary, particularly understanding terms in the source texts, finding equivalent words in students’ L1 and the L2, and using synonyms when paraphrasing. In addition, L2 writers had issues with
grammar, restructuring their own writing, punctuation, and addressing an audience. In a subsequent study, Plakans (2010) described four iterative phases of task performance: initial task representation by reading instructions and source materials; topic determination (and subsequent reconsideration) by reading, interpreting, and articulating ideas in source materials; genre identification and plans for writing; and decisions about how to use source texts, integrate them into their writing, and adopt appropriate approaches for citations.

4.5.3.3  Textual Borrowing

A related area for research and development concerns identifying and demarcating test takers’ appropriate, effective, and inappropriate uses of borrowing, citing, and paraphrasing ideas and phrases from source texts. Studies such as Currie (1998) and Li and Casanave (2012) have documented how naïve students produce “patchwriting” by inappropriately copying phrases verbatim, or with minor rephrasing, synonyms, or grammatical alterations, from source materials when they write with limited lexical resources about unfamiliar content from short texts in brief periods of time in a second language. A challenge for human evaluations of written texts is simply perceiving the extent to which, and the ways in which, such textual borrowing may have been done, though computer programs can flag strings of several words from source texts that appear in test takers’ compositions (e.g., Cumming et al., 2005). But judging the appropriateness of borrowed phrases and distinguishing them from commonplace formulaic terms is complicated. Indeed, teachers’ judgments about the legitimacy of students’ textual borrowings in their writing differ greatly (Pecorari & Shaw, 2012). Moreover, as Shi (2012) has demonstrated, major differences about the appropriateness of textual borrowing arise from the conventions and expectations of differing scholarly disciplines and are highly complex because they “depend on one’s knowledge of the content, the disciplinary nature of citation practices, and the rhetorical purposes of using citations in a specific context of disciplinary writing” (p. 134). How might scoring rubrics distinguish appropriate from inappropriate textual borrowing in integrated tasks? A conventional distinction is the “triadic model” of “paraphrase, summary, and quotation” (Barks & Watts, 2001, p. 252), which Keck (2006) elaborated into four types of paraphrasing based on the extent of revision of source material evident in English language learners’ compositions: near copy, minimal revision, moderate revision, and substantial revision. Another possible basis for developing scoring rubrics comes from Petric and Harwood’s (2013, p. 114) analyses of the functions of citations in a successful Master’s student’s course assignments, which included defining to explain a concept or approach, acknowledging sources, expressing agreement or disagreement, supporting to justify an idea, and applying a concept or approach.
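The automated flagging mentioned above can be sketched as simple word n-gram overlap between a source text and a test taker’s response. The texts and the five-word window below are arbitrary choices for illustration; operational systems are more sophisticated, and, as the discussion above makes clear, overlap detection alone cannot judge the appropriateness of borrowing.

```python
# Minimal sketch: flag verbatim borrowing as shared word n-grams between a
# source text and a response. Texts and the n-gram length are invented examples.
import re

def ngrams(text: str, n: int = 5) -> set:
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_borrowed_strings(source: str, response: str, n: int = 5) -> list:
    return sorted(ngrams(source, n) & ngrams(response, n))

source = ("The migration patterns of monarch butterflies depend on seasonal "
          "temperature cues across the continent.")
response = ("The author claims the migration patterns of monarch butterflies "
            "depend on food supply rather than temperature.")
print(flag_borrowed_strings(source, response))
```

Distinguishing such flagged strings from commonplace formulaic sequences, and judging their appropriateness, remains the harder problem identified above.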

4.5.4  Develop Automated Scoring Further

The technologies, practices, and applications for automated scoring of academic writing have developed rapidly in recent decades and will continue to do so (Shermis, Burstein, Elliot, Miel, & Foltz, 2016). Applications and efficiency improve with every new generation of systems, but the need for continuing refinement persists. Limitations commonly observed concern underrepresentation of key constructs such as content, discourse, and argumentation (e.g., Deane, 2013), inaccuracies in scoring or error identification or correction (e.g., Hoang & Kunnan, 2016), and the negative consequences for students of writing to a machine rather than communicating with humans (e.g., Ericsson & Haswell, 2006). Natural language processing researchers are working to improve on the first two of these points (e.g., Beigman Klebanov, Ramineni, Kaufer, Yeoh, & Ishizaki, 2019). Ultimately, Deane (2013) asserted that “Writing assessment, whether scored by human or by machines, needs to be structured to support the teacher and to encourage novice writers to develop the wide variety of skills required to achieve high levels of mastery” (p. 20). Future developments of automated scoring systems need to address varied genres of writing beyond the argumentative essay or simple summary as well as features of texts beyond lexical or syntactic variety, discourse coherence, or accuracy of grammar, spelling, and punctuation. No automated writing evaluation systems currently address, for example, features associated with voice, especially with regard to appropriate handling of audience. Standards for practice with automated scoring are only starting to emerge. Among the major English tests reviewed above, the two prevalent automated scoring systems, Intelligent Essay Assessor (IEA) and e-rater®, both model their scoring (albeit very differently) by measuring and weighting diverse features of writing according to their probabilistic relations with large corpora of particular text types that humans have previously scored. But applications of the two systems in tests differ in that the PTE produces scores solely from IEA’s automated scoring whereas the TOEFL iBT combines a human score with a machine score. Research agendas need to address questions such as: what criteria, principles, or user needs might determine optimal or complementary relationships between human and machine scoring of academic writing?
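As a highly simplified illustration of the general logic described above, measuring features of writing and weighting them against human scores, the sketch below fits least-squares weights for three crude features on a tiny invented set of human-scored essays. It is explicitly not e-rater or the Intelligent Essay Assessor, whose feature sets, training corpora, and modeling are far more elaborate.

```python
# Generic sketch of feature-based automated scoring: extract a few text
# features, then fit weights against human ratings. Features, essays, and
# scores are invented for illustration only.
import re
import numpy as np

def features(essay: str) -> list:
    words = re.findall(r"[A-Za-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return [
        len(words),                            # overall length
        len(set(words)) / max(len(words), 1),  # lexical variety
        len(words) / max(len(sentences), 1),   # mean sentence length
    ]

# Hypothetical training data: essays previously scored by human raters.
essays = [
    "Short answer with few words.",
    "A somewhat longer response that develops one idea across several sentences "
    "and varies its vocabulary a little more.",
    "An extended response that elaborates multiple ideas, qualifies its claims, "
    "and draws on varied vocabulary throughout the essay.",
]
human_scores = np.array([1.0, 3.0, 5.0])

X = np.column_stack([np.ones(len(essays)), [features(e) for e in essays]])
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

new_essay = "A medium length answer that states a claim and supports it briefly."
predicted = float(np.array([1.0] + features(new_essay)) @ weights)
print(round(predicted, 2))
```

In practice, the open questions noted above (construct coverage, genre range, voice, and the division of labor between human and machine scores) arise precisely because operational systems must go well beyond surface features of this kind.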

4.6  Conclusion

We have reviewed current practices and research for the assessment of academic writing in English as a second or foreign language. The domain has been conceptualized in reference to tasks and rating scales based on analyses of students’ uses of writing in educational contexts, characteristics of the texts they are expected to produce, and relevant curriculum standards. Evaluators
and educators have established differing practices and expectations for writing assessments for normative, formative, and summative purposes. Researchers have started to exemplify various ways in which writing assessments can be validated to be certain that inferences from their scores are appropriate and justified. Future research and development need to continue, refine, and expand upon such inquiry in addition to specifying task stimuli and rating criteria, increasing the coverage of genres and domains for academic writing, elaborating the design of integrated writing tasks, and developing automated scoring systems.

Note

1 Acknowledgement: We thank John Norris, Spiros Papageorgiou, and Sara Cushing for useful, informed comments on an earlier draft.

References Alderson, J. C. (2005). Diagnosing foreign language proficiency. London, U.K.: Continuum. Alderson, J. C., & Huhta, A. (2005).The development of a suite of computer-​based diagnosis tests based on the Common European Framework. Language Testing, 22(30), 301–​320. Applebee, A., & Langer, J. (2011). A snapshot of writing instruction in middle schools and high schools. English Journal, 100(6), 14–​27. ALTE (Association of Language Testers of Europe). (n.d.). The CEFR Grid for Writing Tasks, version 3.1 (analysis). Strasbourg, France: Council of Europe, Language Policy Division. Retrieved from www.alte.org/​resources/​Documents/​CEFR%20Writing%20Gridv3_​ 1_​analysis.doc.pdf Attali, Y. (2016). A comparison of newly-​trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 95–​115. Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-​rater v.2.0. Journal of Technology, Learning, and Assessment, 4(3). Retrieved from www.jtla.org Bachman, L. (2002). Some reflections on task-​based language performance assessment. Language Testing, 19(4), 453–​476. Banerjee, J., Franceschina, F., & Smith, A. (2007). Documenting features of written language production typical at different IELTS band score levels. IELTS Research Report, 7(5). Banerjee, J.,Yan, X., Chapman, M., & Elliott, H. (2015). Keeping up with the times: Revising and refreshing a rating scale. Assessing Writing, 26, 5–​19. Barkaoui, K. (2010).Variability in ESL essay rating processes:The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54–​74. Barkaoui, K. (2013). Examining the impact of L2 proficiency and keyboarding skills on scores on TOEFL iBT writing tasks. Language Testing, 31(2), 241–​259. Barkaoui, K. (2018). Examining sources of variability in repeaters’ L2 writing scores: The case of the PTE Academic writing section. Language Testing, 36(1), 3–​25. Barkaoui, K., & Knouzi, I. (2018). The effects of writing mode and computer ability on L2 test-​takers’ essay characteristics and scores. Assessing Writing, 36, 19–​31. Barks, D., & Watts, P. (2001). Textual borrowing strategies for graduate-​level ESL writers. In D. Belcher & A. Hirvela (Eds.), Linking literacies: Perspectives on L2 reading-​writing connections (pp. 246–​267). Ann Arbor, MI: University of Michigan Press.

Becker, A. (2018). Not to scale? An argument-​based inquiry into the validity of an L2 writing rating scale. Assessing Writing, 37, 1–​12. Beigman Klebanov, B., Ramineni, C., Kaufer, D.,Yeoh, P., & Ishizaki, S. (2019). Advancing the validity argument for standardized writing tests using quantitative rhetorical analysis. Language Testing, 36(1), 125–​144. Biber, D., & Gray, B. (2010). Challenging stereotypes about academic writing: Complexity, elaboration, explicitness. Journal of English for Academic Purposes, 9(1), 2–​20. Biber, D., & Gray, B. (2013). Discourse characteristics of writing and speaking task types on the TOEFL iBT® test: A lexico-​grammatical analysis (TOEFL iBT Research Report 13). Princeton, NJ: Educational Testing Service. Biber, D., Nekrasova,T., & Horn, B. (2011). The effectiveness of feedback for L1-​English and L2-​ writing development:A meta-​analysis (TOEFL iBT RR-​11-​05). Princeton, NJ: Educational Testing Service. Biber, D., Reppen, R., & Staples, S. (2016). Exploring the relationship between TOEFL iBT scores and disciplinary writing performance. TESOL Quarterly, 51(1), 948–​960. Biesenbach-​ Lucas, S. (2009). Little words that could impact one’s impression on others: Greetings and closings in institutional emails. In R. P. Leow, H. Campos, & D. Lardiere (Eds.), Little words: Their history, phonology, syntax, semantics, pragmatic, and acquisition (pp. 183–​195). Washington, DC: Georgetown University Press. Boscolo, P., Ariasi, N., Del Favero, L., & Ballarin, C. (2011). Interest in an expository text: How does it flow from reading to writing? Learning and Instruction, 219(3), 467–​480. Bouwer, R., Beguin, A., Sanders, T., & van den Bergh. (2015). Effect of genre on the generalizability of writing scores. Language Testing, 32(1), 83–​100. Bridgeman, B., Cho, Y., & DiPietro, S. (2016). Predicting grades from an English language assessment: The importance of peeling the onion. Language Testing, 33(3), 307–​318. Brindley, G. (Ed.) (2000). Studies in immigrant English language assessment (Vol. 1). Sydney, Australia: National Centre for English Language Teaching and Research, Macquarie University. Brindley, G. (2013). Task-​based assessment. In C. Chapelle (Ed.), Encyclopedia of applied linguistics. Malden, MA: Wiley-​Blackwell. doi:10.1002/​9781405198431.wbeal1141 British Council. (2019). Prepare for the IELTS. Retrieved from https://​takeielts. britishcouncil.org/​prepare-​test/​understand-​test-​format Britton, J., Burgess, A., Martin, N., McLeod, A., & Rosen, R. (1975). The development of writing abilities (11-​18). London, U.K.: Macmillan. Brunfaut, T., Harding, L, & Batty, A. (2018). Going online: The effect of mode of delivery on performances and perceptions on an English L2 writing test suite. Assessing Writing, 36, 3–​18. Burner, T. (2014). The potential formative benefits of portfolio assessment in second and foreign language writing contexts: A review of the literature. Studies in Educational Evaluation, 43, 139–​149. Burstein, J., Elliot, N., & Molloy, H. (2016). Informing automated writing evaluation using the lens of genre: Two studies. CALICO Journal, 33(1), 117–​141. Burstein, J., Flor, M., Tetreault, J., Madnani, N., & Holtzman, S. (2012). Examining linguistic characteristics of paraphrase in test-​taker summaries (ETS RR-​12-​18). Princeton, NJ: ETS. Byrnes, H. (Ed.) (2007). Perspectives. Modern Language Journal, 90(2), 641–​685. Byrnes, H., & Manchón, R. (Eds.) (2014). Task-​based language learning: Insights from and for L2 writing. 
Amsterdam, The Netherlands: John Benjamins.

Carroll, J. B. (1961). Fundamental considerations in testing for English language proficiency of foreign students. In Testing the English proficiency of foreign students (pp. 30–​40). Washington, DC: Center for Applied Linguistics. Carroll, J. B. (1975). The teaching of French as a foreign language in eight countries. NewYork:Wiley. Chalhoub-​Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20(4), 369–​383. Chan, I., Inoue, C., & Taylor, L. (2015). Developing rubrics to assess the reading-​into-​ writing skills: A case study. Assessing Writing, 26, 20–​37. Chan, S., Bax, S., & Weir, C. (2018). Researching the comparability of paper-​based and computer-​based delivery in a high-​stakes writing test. Assessing Writing, 36, 32–​48. Chapelle, C. (2008). The TOEFL validity argument. In C. Chapelle, M. Enright, & J. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 319–​352). New York: Routledge. Chapelle, C., Cotos, E., & Lee, J. (2015).Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 32(3), 385–​405. Chen, L., & Sun,Y. (2015a).Teachers’ grading decision making: Multiple influencing factors and methods. Language Assessment Quarterly, 12(2), 213–​233. Chen, L., & Sun, Y. (2015b). Interpreting the impact of the Ontario Secondary School Literacy Test on second language students within an argument-​based validation framework. Language Assessment Quarterly, 12(1), 50–​66. Cheng, A. (2008). Analyzing genre exemplars in preparation for writing: The case of an L2 graduate student in the ESP genre-​based instructional framework of academic literacy. Applied Linguistics, 29(1), 50–​71. Cho,Y., & Bridgeman, B. (2012). Relationship of TOEFL iBT scores to academic performance: Some evidence from American universities. Language Testing, 29(3), 421–​442. Cho, Y., & Choi, I. (2018). Writing from sources: Does audience matter? Journal of Second Language Writing, 37, 25–​38. Cho, Y., Rijmen, F., & Novák, J. (2013). Investigating the effects of prompt characteristics on the comparability of TOEFL iBT™ integrated writing tasks. Language Testing, 30(4), 513–​534. Choi, I., & Cho,Y. (2018). The impact of spelling errors on trained raters’ scoring decisions. Language Education and Assessment, 1(2), 45–​58. Christie, F. (2012). Language education throughout the school years: A functional perspective. Malden, MA: Wiley-​Blackwell. Cope, B., & Kalantzis, M. (Eds.) (2000). Multiliteracies: Literacy learning and the design of social futures. London U.K.: Routledge. Cotton, F., & Wilson, K. (2011). An investigation of examiner rating of coherence and cohesion in the IELTS Academic Writing Task 2 (IELTS Research Report 12). Melbourne, Australia. Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Strasbourg, France: Council of Europe and Cambridge University Press. Retrieved from https://​r m.coe.int/​1680459f97 Council of Europe. (2009). Relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment (CEFR): A manual. Strasbourg, France: Language Policy Division, Council of Europe. Retrieved from https://​r m.coe. int/​CoERMPublicCommonSearchServices/​DisplayDCTMContent?documentId=09 00001680667a2d Crossley, S. A., & McNamara, D. S. (2014). Does writing development equal writing quality? 
A computational investigation of syntactic complexity in L2 learners. Journal of Second Language Writing, 26, 66–​79.

Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–​51. Cumming, A. (2001a).The difficulty of standards, for example in L2 writing. In T. Siliva & P. Matsuda (Eds.), On second language writing (pp. 209–​229). Mahwah, NJ: Erlbaum. Cumming, A. (2001b). ESL/​EFL instructors’ practices for writing assessment: Specific purposes or general purposes? Language Testing, 18(2), 207–​224. Cumming, A. (2009). Research timeline: Assessing academic writing in foreign and second languages. Language Teaching, 42(1), 95–​107. Cumming, A. (2013). Assessing integrated writing tasks for academic purposes: Promises and perils. Language Assessment Quarterly, 10(1), 1–​8. Cumming, A. (2014a). Linking assessment to curricula, teaching, and learning in language education. In D. Qian & L. Li (Eds.), Teaching and learning English in East Asian universities: Global visions and local practices (pp. 2–​18). Newcastle, U.K.: Cambridge Scholars Publishing. Cumming, A. (2014b). Assessing integrated skills. In A. Kunnan (Ed.), Companion to language assessment (pp. 216–​229). Malden, MA: Wiley-​Blackwell. doi:10.1002/​9781118411360. wbcla131 Cumming, A. (2016). Theoretical orientations to L2 writing. In R. Manchón & P. K. Matsuda (Eds.), Handbook of second and foreign language writing (pp. 65–​ 88). Berlin, Germany: Walter de Gruyter. Cumming, A., Grant, L., Mulcahy-​Ernt, P., & Powers, D. (2004). A teacher-​verification study of speaking and writing prototype tasks for a new TOEFL. Language Testing, 21(2), 159–​197. Cumming, A., Kantor, R., Baba, K., Erdosy, U., Eouanzoui, K., & James, M. (2005). Differences in written discourse in independent and integrated prototype tasks for the next generation TOEFL. Assessing Writing, 10(1), 5–​43. Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/​EFL writing tasks: A descriptive framework. Modern Language Journal, 86(1), 67–​96. Cumming, A., Kantor, R., Powers, D., Santos, T., & Taylor, C. (2000). TOEFL 2000 writing framework: A working paper (TOEFL Monograph 18). Princeton, NJ: Educational Testing Service. Cumming, A., Lai, C., & Cho, H. (2016). Students’ writing from sources for academic purposes: A synthesis of recent research. Journal of English for Academic Purposes, 23, 47–​58. Cummins, J. (1984). Bilingualism and special education: Issues in assessment and pedagogy. Clevedon, U.K.: Multilingual Matters. Currie, P. (1998). Staying out of trouble: Apparent plagiarism and academic survival. Journal of Second Language Writing, 7(1), 1–​18. Davison, C., & Leung, C. (2009). Current issues in English-​ language teacher-​ based assessment. TESOL Quarterly, 43(3), 393–​415. Davies, A. (2008). Assessing academic English: Testing English proficiency 1950–​1989 –​The IELTS solution. Cambridge, U.K.: Cambridge University Press. Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–​24. Educational Testing Service (ETS). (2015). TOEFL iBT® test questions. Retrieved from www.ets.org/​s/​toefl/​pdf/​toefl_​speaking_​rubrics.pdf Educational Testing Service (ETS). (2019). Understanding your TOEFL iBT® test scores. Retrieved from www.ets.org/​toefl/​ibt/​scores/​understand/​ Educational Testing Service (ETS). (2021). TOEFL scoring guides (rubrics) for writing responses. Retrieved from www.ets.org/​s/​toefl/​pdf/​toefl_​writing_​rubrics.pdf

41

144  Alister Cumming et al.

Elder, C., & O’Loughlin, K. (2003). Investigating the relationship between intensive English language study and band score gain on IELTS. In R. Tulloh (Ed.), International English Language Testing System (IELTS) research reports 2003 (Vol. 4, pp. 207–​254). Canberra, Australia: IDP: IELTS Australia. Ellis, R. (2003). Task-​based language learning and teaching. Oxford, U.K.: Oxford University Press. Enright, M., Bridgeman, B., Eignor, D., Kantor, R., Mollaun, P., Nissan, S., Powers, D., & Schedl, M. (2008). Prototyping new assessment tasks. In C. Chapelle, M. Enright, & J. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 97–​143). New York: Routledge. Ericsson, P., & Haswell, R. (Eds.). (2006). Machine scoring of student essays:Truth and consequences. Logan, UT: Utah State University Press. Ewert, D., & Shin, S. (2015). Examining instructors’ conceptualizations and challenges in designing a data-​driven rating scale for a reading-​to-​write task. Assessing Writing, 26, 38–​50. Ferris, D. R. (1994). Rhetorical strategies in student persuasive writing: Differences between native and non-​native English speakers. Research in the Teaching of English, 28(1), 45–​65. Ferris, D. (2012). Written corrective feedback in second language acquisition and writing studies. Language Teaching, 45(4), 446–​459. Friginal, E., Li, M., & Weigle, S. (2014). Exploring multiple profiles of L2 writing using multi-​dimensional analysis. Journal of Second Language Writing, 26, 80–​95. Gardner, S., & Nesi, H. (2013). A classification of genre families in university student writing. Applied Linguistics, 34(1), 25–​52. Gebril, A., & Plakans, L. (2013). Toward a transparent construct of reading-​ to-​ write tasks: The interface between discourse features and proficiency. Language Assessment Quarterly, 10(1), 9–​27. Gebril, A., & Plakans, L. (2014). Assembling validity evidence for assessing academic writing: Rater reactions to integrated tasks. Assessing Writing, 21, 56–​73. Ginther, A., & Elder, C. (2014). A comparative investigation into understandings and uses of the TOEFL iBT® Test, the International English Language Testing Service (Academic) Test, and the Pearson Test of English for graduate admissions in the United States and Australia: A case study of two university contexts (TOEFL iBT ® Research Report TOEFL iBT–​24 ETS Research Report No. RR–​14-​44). Princeton, NJ: Educational Testing Service. Gorman, T., Purves, A., & Degenhart, R. (Eds.) (1988). The IEA study of written composition I:The international writing tasks and scores. Oxford, U.K.: Pergamon. Grabe, W., & Zhang, C. (2013). Reading and writing together: A critical component of English for academic purposes teaching and learning. TESOL Journal, 4(1), 9–​24. Graham, J. G. (1987). English language proficiency and the prediction of academic success. TESOL Quarterly, 21(3), 505–​521. Green, A. (2006). Watching for washback: Observing the influence of the International English Language Testing System Academic Writing Test in the classroom. Language Assessment Quarterly, 3(4), 333–​368. Hale, G., Taylor, C., Bridgeman, B., Carson, J., Kroll, B., & Kantor, R. (1996). A study of writing tasks assigned in academic degree programs (Research Report 95-​54). Princeton, NJ: Educational Testing Service. Hamp-​Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-​Lyons (Ed.), Assessing second language writing in academic contexts (pp. 241–​276). Norwood, NJ: Ablex. Hamp-​Lyons, L., & Condon, W. (2000). 
Assessing the portfolio: Principles for practice, theory, and research. Cresskill, NJ: Hampton.

41

145

41

Assessing Academic Writing  145

Harsch, C., & Hartig, J. (2015).What are we aligning tests to when we report test alignment to the CEFR? Language Assessment Quarterly, 12(4), 333–​362. Harsch, C., & Martin, G. (2012).Adapting CEF-​descriptors for rating purposes:Validation by a combined rater training and scale revision approach. Assessing Writing, 17(4), 228–​250. Harsch, C., & Rupp, A. (2011). Designing and scaling level-​ specific writing tasks in alignment with the CEFR: A test-​centered approach. Language Assessment Quarterly, 8(1), 1–​33. Harrington, M., & Roche,T. (2014). Identifying academically at-​r isk students in an English-​ as-​a-​Lingua-​Franca university setting. Journal of English for Academic Purposes, 15, 37–​47. Hasselgren, A. (2013). Adapting the CEFR for the classroom assessment of young learners’ writing. Canadian Modern Language Review, 69(4), 415–​435. He, L., & Shi, L. (2012). Topical knowledge and ESL writing. Language Testing, 29(3), 443–​464. Hillocks, G. (2002). The testing trap: How state assessments of writing control learning. New York: Teachers College Press. Holzknecht, F., Huhta, A., & Lamprianou, I. (2018). Comparing the outcomes of two different approaches to CEFR-​based rating of students’ writing performances across two European countries. Assessing Writing, 37, 57–​67. Hoang, G., & Kunnan, A. (2016). Automated essay evaluation for English language learners: A case study of MY Access. Language Assessment Quarterly, 13(4), 359–​376. Huang, L.-​S. (2010). Seeing eye to eye? The academic writing needs of graduate and undergraduate students from students’ and instructors’ perspectives. Language Teaching Research, 14(4), 517–​539. Huhta, A., Alanen, R., Tarnanen, M., Martin, M., & Hirvela, T. (2014). Assessing learners’ writing skills in an SLA study: Validating the rating process across tasks, scales and languages. Language Testing, 31(3), 307–​328. Huot, B. (1990). The literature of direct writing assessment: Major concerns and prevailing trends. Review of Educational Research, 60(2), 237–​263. Hyland, F. (2001). Dealing with plagiarism when giving feedback. ELT Journal, 35(4), 375–​381. Hyland, K. (2010). Constructing proximity: Relating to readers in popular and professional science. Journal of English for Academic Purposes, 9(2), 116–​127. Ivanic, R. (2004). Discourses of writing and learning to write. Language and Education, 18(3), 220–​245. Jang, E., Cummins, J., Wagner, M., Stille, S., & Dunlop, M. (2015). Investigating the homogeneity and distinguishability of STEP proficiency descriptors in assessing English language learners in Ontario schools. Language Assessment Quarterly, 12(1), 87–​109. Jarvis, S. (2017). Grounding lexical diversity in human judgments. Language Testing, 34(4), 537–​553. Jarvis, S., Grant, L., Bikowski, D., & Ferris, D. (2003). Exploring multiple profiles of highly rated learner compositions. Journal of Second Language Writing, 12, 377–​403. Jin, Y., & Yan, M. (2017). Computer literacy and the construct validity of a high-​stakes computer-​based writing assessment. Language Assessment Quarterly, 14(4), 101–​119. Johnson, R., & Riazi, A. (2017).Validation of a locally created and rated writing test used for placement in a higher education EFL program. Assessing Writing, 32, 85–​104. Kang, E., & Han, Z. (2015). The efficacy of written corrective feedback in improving L2 written accuracy: A meta-​analysis. Modern Language Journal, 9(1), 1–​18. Keck, C. (2006). The use of paraphrase in summary writing: A comparison of L1 and L2 writers. 
Journal of Second Language Writing, 15(4), 261–​278.

416

146  Alister Cumming et al.

Kim, E. (2017). The TOEFL iBT writing: Korean students’ perceptions of the TOEFL iBT writing test. Assessing Writing, 33, 1–​11. Kim, H., Bowles, M., Yan, X., & Chung, S. (2018). Examining the comparability between paper-​and computer-​based versions of an integrated writing placement test. Assessing Writing, 36, 49–​62. Kim, M., & Crossley, S. (2018). Modeling second language writing quality: A structural equation investigation of lexical, syntactic, and cohesive features in source-​based and independent writing. Assessing Writing, 37, 39–​56. Kim, Y. (2011). Diagnosing EAP writing ability using the Reduced Reparameterized Unified Model. Language Testing, 28(4), 509–​541. Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge, U.K.: Cambridge University Press. Klobucar, A., Elliot, N., Dress, P., Rudniy, O., & Joshi, K. (2013). Automated scoring in context: Rapid assessment for placed students. Assessing Writing, 18(1), 62–​84. Knoch, U. (2009). Diagnostic writing assessment: The development and validation of a rating scale. Frankfurt am Main, Germany: Peter Lang. Knoch, U., & Chapelle, C. (2018).Validation of rating processes within an argument-​based framework. Language Testing, 35(4), 477–​499. Knoch, U., Roushad, A., & Storch, N. (2014). Does the writing of undergraduate ESL students develop after one year of study in an English-​medium university? Assessing Writing, 21, 1–​17. Knoch, U., & Sitajalabhorn, W. (2013). A closer look at integrated writing tasks: Towards a more focussed definition for assessment purposes. Assessing Writing, 18(4), 300–​308. Kobayashi, H., & Rinnert, C. (1996). Factors affecting composition evaluation in an EFL context: Cultural rhetorical pattern and readers’ background. Language Learning, 46(3), 397–​437. Kunnan, A. (2018). Evaluating language assessments. New York: Routledge. Lam, R. (2010). A peer review training workshop: Coaching students to give and evaluate peer feedback. TESL Canada Journal, 27(2), 114. Lam, R. (2017). Taking stock of portfolio assessment scholarship: From research to practice. Assessing Writing, 31, 84–​97. Lam, R. (2018). Portfolio assessment for the teaching and learning of writing. Singapore: Springer. Lee, I. (2013). Research into practice: Written corrective feedback. Language Teaching, 46(1), 108–​119. Lee, I. (2017). Classroom writing assessment and feedback in L2 school contexts. Singapore: Springer. Lee,Y-​W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes (TOEFL Monograph 31). Princeton, NJ: Educational Testing Service. Leki, I. (2007). Undergraduates in a second language: Challenges and complexities of academic literacy development. New York: Erlbaum. Leki, I., Cumming, A., & Silva, T. (2008). A synthesis of research on second language writing in English. New York: Routledge. Leung, C. (2007). Dynamic assessment: Assessment for and as teaching. Language Assessment Quarterly, 4(3), 257–​278. Li, H., & He, L. (2015). A comparison of EFL raters’ essay-​rating processes across two types of rating scales. Language Assessment Quarterly, 12(2), 178–​212. Li, J., & Schmitt, N. (2009). The acquisition of lexical phrases in academic writing: A longitudinal case study. Journal of Second Language Writing, 18(2), 85–​102.

416

147

416

Assessing Academic Writing  147

Li, Y., & Casanave, C. (2012). Two first-​ year students’ strategies for writing from sources: Patchwriting or plagiarism? Journal of Second Language Writing, 21(2), 165–​180. Little, D. (2005). The Common European Framework and the European Language Portfolio: Involving learners and their judgements in the assessment process. Language Testing, 22(3), 321–​336. Liu, S., & Kunnan, A. (2016). Investigating the application of automated writing evaluation to Chinese undergraduate English majors: A case study of WriteToLearn. CALICO Journal, 33(1), 71–​91. Llosa, L., & Malone, M. (2017). Student and instructor perceptions of writing tasks and performance on TOEFL iBT versus university writing courses. Assessing Writing, 34, 88–​99. Llosa, L., & Malone, M. (2019). Comparability of students’ writing performance on TOEFL iBT and in required university courses. Language Testing, 36(2), 235–​263. Lloyd-​Jones, R. (1977). Primary trait scoring. In C. Cooper & L. Odel (Eds.), Evaluating writing (pp. 33–​69). Urbana, IL: National Council of Teachers of English. Lumley, T. (2005). Assessing second language writing: The rater’s perspective. Frankfurt am Main, Germany: Peter Lang. Lumley, T., & McNamara, T. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54–​71. Macbeth, K. P. (2010). Deliberate false provisions: The use and usefulness of models in learning academic writing. Journal of Second Language Writing, 19(1), 33–​48. Macqueen, S. (2012). The emergence of patterns in second language writing. Bern, Switzerland: Peter Lang. Magnifico, A. (2010). Writing for whom: Cognition, motivation, and a writer’s audience. Educational Psychologist, 45(3), 167–​184. Malone, M., & Montee, M. (2014). Stakeholders’ beliefs About the TOEFL iBT® Test as a measure of academic language ability (TOEFL iBT ® Research Report TOEFL iBT–​22 ETS Research Report No. RR–​14-​42). Princeton, NJ: Educational Testing Service. Marefat, F., & Heydari, M. (2016). Native and Iranian teachers’ perceptions and evaluation of Iranian students’ English essays. Assessing Writing, 27, 24–​36. McCann, T. (1989). Student argumentative writing knowledge and ability at three grade levels. Research in the Teaching of English, 23(1), 62–​76. Melzer, D. (2003). Assignments across the curriculum: A survey of college writing. Language & Learning Across the Disciplines, 6(1), 86–​110. Melzer, D. (2009). Writing assignments across the curriculum: A national study of college writing. College Composition and Communication, 61(2), 240–​260. Milanovic, M., Saville, N., & Shen, S. (1996). A study of the decision-​making behaviour of composition markers. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment (pp. 92–​114). Cambridge, U.K.: Cambridge University Press. Min, H. T. (2016). Effect of teacher modeling and feedback on EFL students’ peer review skills in peer review training. Journal of Second Language Writing, 31, 43–​57. Mislevy, R. (2013). Modeling language for assessment. In C. Chapelle (Ed.), Encyclopedia of applied linguistics. Malden, MA: Wiley-​ Blackwell. doi:10.1002/​ 9781405198431. wbeal0770 Mislevy, R., & Yin, C. (2009). If language is a complex adaptive system, what is language assessment? Language Learning, 59(1), 249–​267. Nesi, H., & Gardner, S. (2012). Genres across the disciplines: Student writing in higher education. Cambridge, U.K.: Cambridge University Press.

418

148  Alister Cumming et al.

Norris, J. (2009). Task based teaching and testing. In M. Long & C. Doughty (Eds.), The handbook of language teaching (pp. 578–​594). Malden, MA: Wiley-​Blackwell. Norris, J., & Manchón, R. (2012). Investigating L2 writing development from multiple perspectives: Issues in theory and research. In R. Manchón (Ed.), L2 writing development: Multiple perspectives (pp. 221–​244). Berlin, Germany: de Gruyter. North, B. (2000). The development of a common framework scale of language proficiency. New York: Peter Lang. North, B. (2014). The CEFR in practice. Cambridge, U.K.: Cambridge University Press. Ohta, R., Plakans, L., & Gebril, A. (2018). Integrated writing scores based on holistic and multi-​trait scales: A generalizability analysis. Assessing Writing, 28, 21–​36. Olson, D. (1996). The world on paper: The conceptual and cognitive implications of writing and reading. Cambridge, U.K.: Cambridge University Press. O’Loughlin, K., & Wigglesworth, G. (2003). Task design in IELTS academic writing task 1: The effect of quantity and manner of presentation of information on candidate writing. IELTS Research Report, 4, 3. Canberra, Australia: IELTS Australia. Papageorgiou, S., Xi, X., Morgan, R., & So, Y. (2015). Developing and validating band levels and descriptors for reporting overall examinee performance. Language Assessment Quarterly, 12(2), 153–​177. Pearson. (2018). PTE Academic score guide for institutions, version 10. Retrieved from https://​pearsonpte.com/​wp-​content/​uploads/​2020/​04/​Score-​Guide-​16.04.19-​for-​ institutions.pdf Pearson. (2019). Pearson Test of English Academic: Automated scoring. Retrieved from https://​ pearsonpte.com/​wp-​content/​uploads/​2018/​06/​Pearson-​Test-​of-​English-​Academic-​ Automated-​Scoring-​White-​Paper-​May-​2018.pdf Pecorari, D., & Shaw, P. (2012). Types of student intertextuality and faculty attitudes. Journal of Second Language Writing, 21(2), 149–​164. Petric, B., & Harwood, N. (2013). Task requirements, task representation, and self-​reported citation functions: An exploratory study of a successful L2 students’ writing. English for Academic Purposes, 12(2), 110–​124. Plakans, L. (2009). Discourse synthesis in integrated second language assessment. Language Testing, 26(4), 561–​587. Plakans, L. (2010). Independent vs. integrated writing tasks: A comparison of task representation. TESOL Quarterly, 44(1), 185–​194. Plakans, L., & Burke, M. (2013). The decision-​ making process in language program placement:Test and nontest factors interacting in context. Language Assessment Quarterly, 10(2), 115–​134. Plakans, L., & Gebril,A. (2013). Using multiple texts in an integrated writing assessment: Source text use as a predictor of score. Journal of Second Language Writing, 22(3), 217–​230. Plakans, L., & Gebril, A. (2017). Exploring the relationship of organization and connection with scores in integrated writing assessment. Assessing Writing, 31, 98–​112. Poehner, M., & Lantolf, J. (2005). Dynamic assessment in the language classroom. Language Teaching Research, 9(3), 233–​265. Poehner, M., & Infante, P. (2017). Mediated development: A Vygotskian approach to transforming second language learner abilities. TESOL Quarterly, 51(2), 332–​357. Polio, C. (2017). Second language writing development: A research agenda. Language Teaching, 50(2), 261–​275. Polio, C., & Shea, M. (2014). An investigation into current measures of linguistic accuracy in second language writing research. Journal of Second Language Writing, 23, 10–​27.

418

419

418

Assessing Academic Writing  149

Purpura, J. (2017). Assessing meaning. In E. Shohamy, I. Or, & S. May (Eds.), Language testing and assessment: Encyclopedia of language and education (3rd ed., pp. 33–​61). New York: Springer. Purves, A., Gorman, T., & Takala, S. (1988). The development of the scoring scheme and scales. In T. Gorman, A. Purves & R. Degenhart (Eds.), The IEA study of written composition I:The international writing tasks and scores (pp. 41–​58). Oxford: Pergamon. Rahimi, M. (2013). Is training student reviewers worth its while? A study of how training influences the quality of students’ feedback and writing. Language Teaching Research, 17(1), 67–​89. Raimes, A. (1990). The TOEFL test of written English: Causes for concern. TESOL Quarterly, 24(3), 427–​442. Rea-​Dickins, P., Kiely, R., & Yu, G. (2007). Student identity, learning and progression: The affective and academic impact of IELTS on ‘successful’ candidates. IELTS Research Report, 7, 59–​136. Reckase, M. (2017). A tale of two models: Sources of confusion in achievement testing (ETS Research Report No. RR-​17-​44). doi:10.1002/​ets2.12171 Reynolds, D.W. (1995). Repetition in non-​native speaker writing. Studies in Second Language Acquisition, 17(2), 185–​209. Riazi, A. (2016). Comparing writing performance in TOEFL iBT and academic assignments: An exploration of textual features. Assessing Writing, 28, 15–​27. Robinson, P. (2011). Task-​based language learning: A review of issues. Language Learning, 61(1), 1–​36. Rosenfeld, M., Leung, S., & Oltman, P. (2001). The reading, writing, speaking, and listening tasks important for academic success at the undergraduate and graduate levels (TOEFL Monograph 21). Princeton, NJ: Educational Testing Service. Ruth, L., & Murphy, S. (1988). Designing tasks for the assessment of writing. Westport, CT: Greenwood Publishing. Sawaki, Y., Quinlan, T., & Lee, Y.-​ W. (2013). Understanding learner strengths and weaknesses: Assessing performance on an integrated writing task. Language Assessment Quarterly, 10(1), 73–​95. Sawaki, Y., & Sinharay, S. (2018). Do the TOEFL iBT section scores provide value-​added information to stakeholders? Language Testing, 35(4), 529–​556. Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465–​493. Schleppegrell, M. J. (2004). The language of schooling: A functional linguistic perspective. Mahwah, NJ: Erlbaum. Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1–​30. Shaw, S., & Weir, C. (2007). Examining writing: Research and practice in assessing second language writing. Cambridge, U.K.: Cambridge University Press. Shermis, M., & Burstein, J. (Eds.) (2013). Handbook of automated essay evaluation: Current applications and new directions. New York: Routledge. Shermis, M., Burstein, J., Elliot, N., Miel, S., & Foltz, P. (2016). Automated writing evaluation: An expanding body of knowledge. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (2nd ed., pp. 395–​409). New York: Guilford Press. Shi, L. (2001). Native-​and nonnative-​ speaking EFL teachers’ evaluation of Chinese students’ English writing. Language Testing, 18(3), 303–​325. Shi, L. (2012). Rewriting and paraphrasing source texts in second language writing. Journal of Second Language Writing, 21(2), 134–​148.

510

150  Alister Cumming et al.

Shi, L.,Wang,W., & Wen, Q. (2003).Teaching experience and evaluation of second-​language students’ writing. Canadian Journal of Applied Linguistics, 6(2), 219–​236. Shin, S., & Ewert, D. (2015). What accounts for integrated reading-​to-​write task scores? Language Testing, 32(2), 259–​281. Spivey, N. (1997). The constructivist metaphor: Reading, writing, and making of meaning. San Diego, CA: Academic Press. Spolsky, B. (1995). Measured words: The development of objective language testing. Oxford, U.K.: Oxford University Press. Stansfield, C., & Ross, J. (1988). A long-​term research agenda for the Test of Written English. Language Testing, 5(2), 160–​186. Sternglass, M. (1997). Time to know them: A longitudinal study of writing and learning at the college level. Mahwah, NJ: Erlbaum. Stricker, L., & Wilder, G. (2012). Test takers’ interpretation and use of TOEFL iBT score reports: A focus group study (ETS Research Memorandum RM-​12-​08). Princeton, NJ: Educational Testing Service. Sun,Y. (2016). Context, construct, and consequences:Washback of the College English Test in China. Riga, Latvia: Lambert Academic Publishing. To, J., & Carless, D. (2016). Making productive use of exemplars: Peer discussion and teacher guidance for positive transfer of strategies. Journal of Further and Higher Education, 40(6), 746–​764. Toulmin, S. E. (1958). The uses of argument. Cambridge, U.K.: Cambridge University Press. Uysal, H. (2009). A critical review of the IELTS writing test. ELT Journal, 64(3), 314–​320. Vo, S. (2019). Use of lexical features in non-​native academic writing. Journal of Second Language Writing, 44, 1–​12. Wall, D., & Horak, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in central and eastern Europe: Phase 2, coping with change (TOEFL iBT Research Report 5). Princeton, NJ: Educational Testing Service. Wang, J., Engelhard, G., Raczynskia, K., Song, T., & Wolf, E. (2017). Evaluating rater accuracy and perception for integrated writing assessments using a mixed-​methods approach. Assessing Writing, 33, 36–​47. Weigle, S. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–​223. Weigle, S. (2002). Assessing writing. New York: Cambridge University Press. Weigle, S., & Friginal, E. (2015). Linguistic dimensions of impromptu test essays compared with successful student disciplinary writing: Effects of language background, topic, and L2 proficiency. Journal of English for Academic Purposes, 18, 25–​39. Weir, C.,Vidakovíc, I., & Galaczi, E. (2013). Measured constructs: A history of Cambridge English language examinations 1913–​2012. Cambridge, UK: Cambridge University Press. Wette, R. (2010). Evaluating student learning in a university-​level EAP unit on writing using sources. Journal of Second Language Writing, 19(3), 158–​177. White, E. (1985). Teaching and assessing writing. San Francisco, CA: Jossey-​Bass. WIDA (World-​class Instructional Design and Assessment). (2017). Writing rubric of the WIDA consortium, grades 1 to 12. Retrieved from http://​morethanenglish.edublogs.org/​ files/​2011/​08/​WIDA-​writing-​rubric-​2f94y4l.pdf Williams, J. (2012).The potential role(s) of writing in second language development. Journal of Second Language Writing, 21(4), 321–​331. Wind, S., Stager, C., & Patil, Y. (2017). Exploring the relationship between textual characteristics and rating quality in rater-​mediated writing assessments: An illustration with L1 and L2 writing assessments. Assessing Writing, 34, 1–​15.

510

15

510

Assessing Academic Writing  151

Wolfe-​ Quintero, K., Inagaki, S., & Kim, H.-​ Y. (1998). Second language development in writing: Measures of fluency, accuracy and complexity. Honolulu, HI: University of Hawai’i at Manoa. Xi, X. (2008). Methods of test validation. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed., pp. 177–​196). New York: Springer. Yu, G. (2009). The shifting sands in the effects of source text summarizability on summary writing. Assessing Writing, 14(2), 116–​137. Yu, G. (2010). Lexical diversity in writing and speaking task performances. Applied Linguistics, 31(2), 236–​259. Yu, G. (2013). The use of summarization tasks: Some lexical and conceptual analyses. Language Assessment Quarterly, 10(1), 96–​109. Yu, G., & Zhang, J. (2017). Computer-​based English language testing in China: Present and future. Language Assessment Quarterly, 14(2), 177–​188. Zhang, C. (2013). Effect of instruction on ESL students’ synthesis writing. Journal of Second Language Writing, 22(1), 51–​67. Zhao, C. (2013). Measuring authorial voice strength in L2 argumentative writing: The development and validation of an analytic rubric. Language Testing, 30(2), 201–​230. Zheng,Y., & Chen, L. (2008). College English Test (CET) in China. Language Testing, 25(3), 408–​417. Zheng, Y., & Mohammadi, S. (2013). An investigation into the writing construct(s) measured in the Pearson Test of English Academic. Dutch Journal of Applied Linguistics, 2(1), 108–​125.

512

5 ASSESSING ACADEMIC SPEAKING

Xiaoming Xi, John M. Norris, Gary J. Ockey, Glenn Fulcher, and James E. Purpura

This chapter reviews current theoretical perspectives on the conceptualization of academic oral communicative competence (henceforth academic speaking) and recent debates around practices to operationalize it. The chapter argues that needed changes in defining the construct of academic speaking are driven primarily by three factors: advances in theoretical models and perspectives, the evolving nature of oral communication in the academic domain, and standards that define college-ready oral communication skills. Drawing on recent work in these areas, an integrated model of academic speaking ability is proposed that follows an interactionalist approach. Further discussed are implications of this integrated model for how language use contexts should be represented in construct definition and reflected in task design, and how test and rubric design should facilitate elicitation and extraction of evidence of different components of speaking ability. Additional insights into the construct definition and operationalization of academic speaking are drawn from domain analyses of academic oral communication and reviews of college-ready standards for communication. The chapter then compares the conceptualization of speaking ability in major tests of English for higher education admissions. It further reviews new developments in task types, scoring approaches, rubrics, and scales, as well as various technologies that can inform practices in designing and scoring tests of academic speaking. It also discusses opportunities and challenges involved in operationalizing an expanded construct of academic speaking and proposes some ways forward.

The chapter concludes by considering future directions in advancing the assessment of academic speaking. It points to the need for refining an integrated, coherent theoretical model that underlies academic speaking and for updating
construct definitions to reflect the changing roles and demands of oral communication in English-​medium higher education. It also discusses the need for continuing explorations of new task types and scoring rubrics and for leveraging technological advances to support innovations in assessment design, delivery, scoring, and score reporting. Finally, it underscores the necessity of supporting the validity of interpretations and uses of academic speaking tests—​most specifically those used for university admissions decisions—​including investigating intended and unintended consequences.

5.1  Conceptualizing Academic Speaking

Major inquiry into how to define communicative competence started four decades ago (Campbell & Wales, 1970; Hymes, 1971), as indicated by Chapelle et al. (1997) in a review of the definitions of communicative language competence and the implications for defining the constructs of a new TOEFL test. In the early 1980s, the notion of communicative competence became well-known in the field of language learning and teaching through the seminal work of Canale and Swain (1980), and Canale (1983). Their model of communicative competence included four components: grammatical, discourse, sociolinguistic, and strategic competences. Bachman (1990) extended Canale and Swain's (1980) work by proposing a refined model of communicative language ability that was quickly adopted in language testing. He defined communicative competence as "…both the knowledge, or competence, and the capacity for implementing, or executing that competence in appropriate, contextualized communicative language use" (p. 84). This approach is characterized by Chapelle (1998) as the "Interaction-ability" approach, as it recognizes the intricate interaction between a user's language knowledge and competence, and the context of a specific language use situation.

Over the last 20 years, other approaches to defining the construct of communicative competence have emerged. A few of the most influential approaches are discussed in Bachman (2007), including the task-based approach represented by Long and Norris (2000), and three interactionalist approaches: the minimalist interactionalist approach by Chapelle (1998), the moderate interactionalist approach by Chalhoub-Deville (2003), and the strong interactionalist approach by Young (2011, 2012, 2019). In the task-based approach, language abilities are understood in relation to the communication requirements of specific tasks and the extent to which L2 speakers can accomplish them. By contrast, the three interactionalist approaches differ in the extent to which contexts of language use interact with the linguistic resources that a language user brings to a communicative situation, and how interpretations about language ability might be impacted by conceptualizations of context.

Chapelle's (1998) minimalist interactionalist approach emphasizes the component of strategic competence or metacognitive strategies (Bachman & Palmer, 1996) as a mechanism to govern the use of linguistic knowledge and
competence in a language use context. Chalhoub-Deville (2003) defines the construct as "ability-in-individual-in-context" (p. 372) and argues that the ability components that are engaged by a context change the facets of the context and are changed by them. This conceptualization goes a step further than Chapelle's (1998) approach and recognizes the reciprocal roles of context and ability and the changing-from-moment-to-moment, dynamic nature of context. Young's (2011, 2012) approach represents the most extreme view of the interactive nature of language ability. Extending the view that discourse is dynamic and co-constructed by participants, he suggests that language abilities are not owned by individual participants. Rather, they are co-constructed through interaction, thus rendering it difficult to make stable inferences about an individual's language abilities.

The first two interactionalist approaches, although differing somewhat, essentially agree that the construct of a test needs to account for "performance consistency" (Chapelle, 1998) and allow for the generalization of an individual's abilities beyond a set of contexts. The third view casts some doubt on this premise, as the co-constructed nature of interactive discourse seems at odds with the concepts of performance consistency and generalization. Chapelle's interactionalist approach is very practical for defining constructs in language assessments. She simplified the interactions between context and ability and made it manageable to define a context-informed construct and make inferences about test takers' abilities for specific contexts. In her approach, the context of language use is defined as an important part of the construct.

More recently, Purpura advocated for a meaning-oriented approach to defining language abilities (Purpura, 2004, 2016). He focused on the conveyance of different layers of meanings in communication, including functional, propositional, and pragmatic. According to Purpura, communication involves the marshalling of linguistic, cognitive, and propositional resources to express different layers of meanings in context. He argued that dominant models of language abilities have focused on functional meanings expressed in can-do statements but have not given adequate treatment to propositional and pragmatic meanings (Purpura, 2016). He also clarified the various resources that are called upon to express different meanings, pointing out that linguistic and cognitive resources are well represented in the Canale and Swain (1980) and Bachman and Palmer (1996, 2010) models. However, propositional resources (i.e., the capacity to utilize topical knowledge and content knowledge) are less explicitly specified as part of communication skills, except in Douglas (2000) and Purpura (2016). Furthermore, Purpura argued that meanings are situated in context, and different contextual factors may elicit facets of communication that pertain to functional, propositional, and pragmatic meanings.

Each of these perspectives on language ability shapes how academic language use can be understood and how the ability to communicate in academic settings might be assessed. In the following sections, we first review attempts to characterize academic speaking according to a set of ability components and consider
some of the challenges associated with this approach. We then explore the ways in which domain analyses can help inform the construct definition of academic speaking ability, specifically by incorporating important aspects of context into what gets assessed and for what specific purpose.

5.1.1  Models of Components of Oral Communicative Competence

The assessment of speaking ability has benefited from a considerable history of attempts to model the various competencies that comprise oral communication. Mainly, these models have followed the approach initiated by Hymes (1971, 1972) for first language ability, drawing upon relevant theories to logically account for distinct components that encapsulate the human capacity to communicate. In second language assessment, the combined work of Canale and Swain (1980), Canale (1983), Bachman (1990), and Bachman and Palmer (1996, 2010) has articulated an influential set of components that arguably constitute much of second language communicative competence and, by extension, speaking ability. Summarizing briefly, these models suggest that communicative language use engages language abilities comprised of language knowledge and strategic competence (metacognitive strategies that a speaker can apply to transact language use situations), and is mediated by multiple attributes of individuals including topical knowledge (content knowledge relevant to a language use situation), personal characteristics (features of the language user, such as gender, age, and native language), affective responses (emotional reactions to language use situations), and cognitive strategies (strategies employed when executing language comprehension or production plans).

Of primary interest for the assessment of speaking abilities to date has been the central component of language knowledge. According to Bachman and Palmer (2010), the two main categories of language knowledge are organizational and pragmatic knowledge. Organizational knowledge focuses on how utterances and/or sentences are organized individually (grammatical knowledge) and to form texts, including how conversations are organized (textual knowledge). Grammatical knowledge addresses the comprehension and production of accurate utterances or sentences in terms of syntax, vocabulary, phonology, and graphology. Textual knowledge refers to how individual utterances or sentences are organized to form texts, including cohesion and rhetorical organization. By contrast, pragmatic knowledge focuses on the relationship between various forms of language related to the speaker's communicative goals (functional knowledge) and to features of the language use setting (sociolinguistic knowledge).

The components approach to communicative competence outlined above has proved influential in guiding the development of second language (L2) speaking assessments and as a broad rubric for evaluating the various dimensions of a speaking ability construct that might be represented in a given test (Luoma, 2004). At the same time, models like Bachman and Palmer (1996, 2010) have been
critiqued from several key perspectives. First, while components models provide coverage of a wide range of factors that may determine communicative success, this kind of 'checklist' approach (Skehan, 1998) may not capture differentiated abilities required across diverse language use situations or between distinct skills (e.g., speaking versus writing). Without a means for modeling, weighting, and otherwise variegating the relationship between specific components and specific language use contexts (treated by Bachman & Palmer, 1996 as test methods facets as opposed to competence components), it is unclear how speaking performance within a given language use situation may elicit or reveal underlying competences. Including the dimension of strategic competence in models of communicative language use may help, in that an individual's use of a variety of possible strategies demonstrates the extent to which the individual is capable of marshaling language knowledge resources as needed within the corresponding language use situation. At the same time, which strategies are more or less successful in which situations raises the same question as which aspects of grammatical knowledge are more or less called upon for transacting a set of language tasks in a given language use situation. The strategic competence involved in planning and delivering an oral presentation in front of an academic audience differs from that involved in participating in a small-group discussion, so some means for articulating the demands for the corresponding knowledge, skills, and strategic competences in performing communication tasks in a specific language use context is required.

Second, although pragmatic knowledge is a key component in existing models, it has been suggested that the dialogic, cross-cultural, and contextualized nature of pragmatic competence is under-represented in largely cognitivist/individualist models of communicative competence (e.g., McNamara & Roever, 2006; Roever, 2011). While broad aspects of functional or sociolinguistic knowledge may be inferred indirectly from most speaking tasks (e.g., politeness forms in a monologic oral presentation), at stake is the extent to which more nuanced pragmatic understandings may be captured within such models, as they are called upon by the diversity of demands encountered in academic speaking. What is needed, according to Purpura (2004), is a way to identify and measure the conveyance of a host of implied or pragmatic meanings that depend more on an understanding of the context than on a literal understanding of the words arranged in syntax. Pragmatic competence in academic communication can be judged by the extent to which the interlocutor is able to use linguistic resources to understand and express acceptable and appropriate contextual, sociolinguistic, sociocultural, psychological, and rhetorical meanings. By measuring pragmatic meanings in academic oral communication, L2 assessments need to be able to elicit and judge L2 test takers' understanding of, and ability to communicate, different layers of meanings. A key notion is that pragmatic competence cannot be divorced from a thorough understanding of language use context—all instances of language production encode implied or pragmatic meanings (e.g., contextual, sociolinguistic, sociocultural, psychological, and rhetorical) by the information inherent in the context of the situation (e.g., the presuppositions of the interlocutors, the topic of discussion, the setting).

Third, along similar lines, interactional competence has received sustained attention as a particularly important set of abilities that may not be adequately portrayed within existing models of communicative competence. In components models of communicative competence, knowledge of conversational management such as topic nomination, turn-taking, pre-sequencing, and preference organization is included as part of textual competence (Bachman & Palmer, 2010). From other perspectives, looking beyond an individual's cognitive or psycholinguistic capacities, especially with respect to speaking abilities, is essential for capturing the reality that spoken communication involves the joint construction of interactive practices by multiple participants which are specific to certain social situations (Young, 2011, 2012, 2019). Perhaps less helpful for the purposes of individualized assessment, a radical approach to interactional competence might suggest that an individual never possesses agency over the form of interaction that takes place in actual conversation, which is ever-changing and inherently relative. However, like nuanced understandings of pragmatics in context, the less dramatic addition of interactional abilities to components models may be called for, if the objective of speaking assessment is to estimate how well individuals can engage in speaking tasks that involve interaction. Missing from components models of competence, then, are constructs deemed critical to successful L2 (spoken) interaction. These constructs may include intersubjectivity, or publicly displayed mutual understanding regarding what is happening at a moment of interaction, and co-construction, which is the joint creation of performance through shared assumptions, forms, roles, etc., both of which are accomplished and maintained through talk (Jacoby & Ochs, 1995; McNamara, 1997). Such constructs may be perceived as inherently relative to the specific interlocutors and settings, but they may also be generalized (especially in relation to high-frequency contexts of interaction) in the form of additional components of competence required of effective L2 speakers.

Quite distinct from those voices calling for greater representation of pragmatic and interactional competence in models of speaking ability, another criticism of the components approach suggests that such models may be overly complicated because they do attempt, at some level, to capture the interface between language knowledge and language use, and thereby inevitably face the challenges of performance inconsistency and scoring reliability (Van Moere, 2012). Further, it may be that these models do not adequately represent the psycholinguistic realities of L2 speaking (Ellis, 2005; Hulstijn, 2011). Such critiques suggest that speaking ability or proficiency is best understood—or perhaps best initially modeled—through psycholinguistic theories of processing efficiency or automaticity, and that it is best captured through discrete measures of such phenomena. Rather than muddying the waters with the uncontrollable aspects of actual language use tasks and social interactions in a myriad of situations, this argument advocates a focus on what we think we know about the relationship between psycholinguistic capacities and spoken language performance. To meet the goal of predicting performance across or regardless of situations of use, it might be best to use decontextualized measures of language ability rather than richly contextualized measures. Unless
carefully designed, the latter may introduce construct underrepresentation and/or reliability issues owing to the influence of communication context.

While the psycholinguistic approach to understanding speaking ability highlights one key dimension of L2 processing that no doubt underlies successful performance, it also renders any possible interpretations of speaking ability quite distant from language use in actual communicative situations. When aspects of context, such as English for Academic Purposes (EAP), are added to the desired interpretive frame, abstract psycholinguistic measures are hard pressed to do more than exhibit some degree of correlation with performance—the tendency to overgeneralize on their basis, then, is considerable. Quite the opposite of such short-cut estimates of speaking proficiency, components models of communicative competence provide a basis upon which to begin outlining a comprehensive construct for speaking proficiency that reflects multiple dimensions of language in use (rather than psycholinguistic mechanisms underlying possible language use).

It seems clear that the grammatical, textual, pragmatic, and strategic competences of early models have a key role to play in setting the stage for what might be involved in capturing effective L2 speaking performance in academic settings. Building upon this foundation, it also seems clear that, if the objective is to incorporate notions of speaking as a type of communication that is inherently related to context and interlocutor, then additional components are required. While still maintaining a focus on the individual's capacities to speak successfully in target language use situations, it will be necessary to incorporate both the nuanced nature of pragmatic competence and the interactive nature of interactional competence into any comprehensive model of EAP. Further, it seems clear that, on their own, competence models can only take us so far down the road towards defining the construct. It is the interaction of an individual's competences with the demands of the language use situation that underlies constructs such as academic English speaking abilities. To fully model and reasonably assess such a construct, context must be explicitly incorporated into the construct definition. For assessments that claim to assess EAP speaking ability, a key question in need of attention is what degrees and distributions of which components of a full competence model are called upon by targeted speaking tasks. The interaction of components models with thorough domain analyses is an obvious offshoot of this commitment to assessing abilities to use language for specific purposes.

5.1.2  Articulating a Model of Academic Speaking Informed by Current Debates and Issues

The discussion of theoretical perspectives above has highlighted the importance of building an integrated model of academic speaking. In defining test constructs, theoretical component models of language ability help establish the central role of underlying abilities. However, generic theoretical models are typically underspecified for legitimate reasons, thus failing to capture the nuances
Assessing Academic Speaking  159

of language use in a specific domain. Generic component models of language ability do not intend to capture potential hierarchical relationships among the components, nor do they specify how these different components interact with one another in actual speech production and communication. Current generic component models also do not provide any elaborate treatment of how the components of language ability interact with different language use contexts. Further, no generic theoretical models have formalized how speech or writing is produced based on written, spoken, or graphic source materials; nor have they characterized the integrated nature of language use tasks in specific domains such as academic settings. Models of a specific language use domain such as the academic domain, on the other hand, more faithfully represent the complexity and the nuances of key language skills and language use tasks. These domain-​specific models shed light on what the key language use contexts are, and what components of communicative competence are critical to assess, such as aspects of pragmatic competence. Domain models also provide practical insights into how to provide a reasonable representation of the target language use domain in test design. Any test is necessarily a simplification of the target language use domain. Due to practical testing constraints, a one-​to-​one match between test tasks and real-​world tasks is never possible in large-​scale assessments. Appropriate sampling of the domain and careful thinking about the underlying abilities that can account for performance in a variety of language use contexts is necessary to enable extrapolation beyond specific test tasks. In the absence of a strong domain model, rigorous domain analyses satisfy the need while contributing to theory building for that specific domain.

5.1.2.1  Representation of Language Use Contexts and Test Design

Definitions of language use context abound in the language learning and testing literature (Bachman, 1990; Douglas, 2000; Ellis & Roberts, 1987; Hornberger, 1989), yet context is an elusive concept. This section briefly reviews definitions of language use contexts and adopts an approach to specifying contexts for speaking tasks. It also discusses the connection between representation of language use contexts and test design features.

Oral interactions have attracted considerable attention in discussions about the context of language use, partly because oral interactions engage a complicated set of contextual factors ranging from space and time to relationships and shared knowledge of the participants. Hymes (1974) argued that the nature of language used in different oral communicative settings is largely governed by multiple features of the context, captured in his SPEAKING model: Setting, Participants, Ends, Acts sequence, Key, Instrumentalities, Norms, and Genres. Halliday and Hasan (1989) built on this model in defining context of language use according to three major dimensions: field, tenor, and mode. Field refers to the topic, location, and action associated with a language use context. Tenor refers to the participants, how they are related, and what goals need to be accomplished. Mode includes the channel, texture, and genre of the language use context.

Young (2008) defined context as "a place, a time, or conditions in which other things happen" (p. 16). He further explained that "the thing that happens is the focal event and the context is the background within which that event happens" (p. 16). He saw context of speaking events as a multidimensional construct encompassing three dimensions: spatiotemporal, social and cultural, and historical. Context as a spatiotemporal construct defines the time and location of a communicative event. Context as a social and cultural phenomenon highlights the important role that such factors play in shaping the meaning of an interaction, including the relationships between participants, their socioeconomic status, cultural backgrounds, and so on. The historical dimension of a context brings in what was just said or happened in a particular interaction as well as words, events, and activities from the past that bear connections to the current interaction.

In the L2 learning literature, most of the efforts to define language use contexts have been related to speech events which present the most complex, interrelated contextual factors, such as group discussions that involve participants with varying power status. Bachman (1990) broadened the discussion of context of language use to all four language modalities and proposed a framework of test methods for language tests, many aspects of which define the contexts of language test tasks. Bachman and Palmer (2010) extended this framework to focus on test task characteristics. In both frameworks, contextual factors are not singled out as one facet; rather they are interwoven into different test method or test task features, including characteristics of the (test) setting, characteristics of the rubric (instructions, test structure, time allotment, and recording/scoring method), characteristics of the input, and characteristics of the expected response. For example, in Bachman and Palmer's (2010) framework, one aspect of context might be captured as part of the characteristics of the setting, which include participants, defined as "the people who are involved in the task" (p. 68). However, participants in Bachman and Palmer (2010) may refer to test takers who are engaged with one another in a speaking task, or to the test taker/writer and his/her intended audience (e.g., an imaginary pen friend) in a writing task. Other facets of task characteristics which provide contexts are prompts and input for interpretation, categorized under characteristics of the input. Input for interpretation refers to written, oral, or graphic materials that are presented to test takers as stimulus materials. For example, in an integrated task, test takers may be required to listen to part of a lecture and discuss the concepts in a class presentation. The topical characteristics of the audio input are one facet of the context of this language use activity. A prompt in a role-play speaking task may define the goal of the task (e.g., to get an extension for a term paper), the intended listener (e.g., the professor), the setting (e.g., during an office hour), and the register of the language to be used (e.g., please be polite and respectful). All of these are important contextual factors.

Douglas (2000) recognized context as the social, physical, and temporal environments in which the language activity is situated. Adopting Hymes’ SPEAKING model for defining context, he argued that the prompt and the input data in language for specific purpose (LSP) test tasks define the context for the test taker. In his view, the prompt consists of these eight features of the LSP context conceptualized in Hymes’ work.The input data provide additional contextual information on the topic and orient the test taker to the LSP context. The contextual factors represented in speaking test tasks may not faithfully replicate all of those that define the contexts of real-​life oral interactions.While researchers studying real-​life oral interactions are motivated to articulate a comprehensive set of contextual factors that govern the flow of interactions, any attempts to describe the facets of context for speaking test tasks will likely need to focus on those that are prominent in characterizing the target language use domain and defining test tasks, and those that may account for variation in the quality and in the linguistic and discourse features of test takers’ responses. Furthermore, given the critical role that context plays in construct definition in the interactionalist approach, it will be useful to make context prominent in the construct definition, and devise a scheme to characterize contexts of test tasks in ways that facilitate construct operationalization and test score interpretation. One way to achieve this goal is to provide a coherent set of speaking task characteristics which define the contexts of test tasks. For tests of general academic English such as TOEFL iBT, the contexts of speaking tasks may be defined in a few layers, each of which adds specificity that constrains the contextual parameters of the tasks (Xi, 2015).As shown in Figure 5.1, the first layer is a definition of the sub-​language use domains of the general academic context, which include social-​interpersonal, academic-​navigational, and academic-​content. The second layer is speech genres, which refers to “relatively stable types” of utterances that demonstrate some prototypicalities of response patterns (Bakhtin, 1986). Some common speech genres in a general academic domain may include everyday narrative, everyday conversation, and argumentative speech. The third layer is the goal of communication, that is, the goal that participants in a task intend to accomplish. Each communication goal entails one or more language functions such as making an apology, summarizing, or synthesizing information from stimulus materials. The fourth layer constitutes the medium of communication. Media can include: monologic speech, where a test taker delivers a short oral presentation to an imaginary class; one-​on-​one communication, in which a test taker interacts with another test taker or an examiner; or group interaction, where a test taker is engaged in interactions with other test takers/​examiners. Characteristics of the setting are the last layer of the context, also with multiple facets. One facet is the spatial-​temporal dimensions of the task. In speaking test tasks, the time and place of an interaction can be provided as part of the task prompt. For example, in a sample role-​play task, a test taker is expected to ask another test taker, who assumes the role of a neighbor, to turn down

FIGURE 5.1  Contextual facets of a speaking test task. The figure presents the facets as successive layers: intended language use domain (e.g., social-interpersonal, navigational, general academic); speech genres (e.g., everyday narration, everyday conversational, argumentative); goal of communication (e.g., to make a request, to support an opinion, to summarize); medium of communication (monologic; dialogic, one on one; group, one to two or more); and characteristics of the setting, actual or imaginary, comprising spatial-temporal facets (time and location of the event) and participants (for dialogic/group tasks, the roles that participants assume and their social and cultural relationships; for monologic tasks, the role that the participant assumes, the intended imaginary listener(s), and their social and cultural relationships).

the music after midnight. In this task, the temporal condition of the role play, along with other contextual factors, evokes the pragmatic competence of the participants to deal with the situation in a contextually, socially, and culturally appropriate way. Participants are another key facet of the setting. In speaking tasks that involve one-​on-​one or one-​to-​many interactions, the roles that test

takers play, and the social (e.g., power structure) and cultural relationships that they perceive they are having with other participants impact the metacognitive strategies test takers use to approach the task, and the features of the discourse that are produced, including linguistic features, turn-​taking behavior, and so on. In contextualized speaking tasks that elicit monologic speech, the test taker is typically expected to assume a role and speak to an imaginary listener(s). The social and cultural relationships between the test taker and the listeners can be construed based on the information provided in the stimulus materials and/​or the prompt. Although in the interactionalist approach, context is supposed to be an important part of the construct, including all the facets of context would make the construct definition unwieldy and less useful in guiding test design.Therefore, test designers need to prioritize the inclusion of key facets of context when defining the construct, as these key facets need to be represented adequately in test and task design, so that performance consistencies can be adequately measured.The choice of contextual factors reveals test designers’ beliefs about how the target language use domain should be represented in test design, and how key language use activities in the domain should be sampled. In defining and operationalizing test constructs, we should strive to include the most critical facets of context that provide an adequate representation of the relevant domain of language use, and that are expected to account for most of the variation in the quality and features of test takers’ discourse (see Figure 5.1). The other facets of context should be considered and represented as much as possible in modeling the domain through test and task design; however, the three intended sub-​domains and key language functions commonly occurring in the academic domain take priority when practical considerations of test length, test delivery mode, and other factors put constraints on the facets of context that can be reasonably included.
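
One practical way to keep these layered facets visible during test design is to record them as an explicit task specification. The sketch below is a minimal illustration in Python of how such a record might look; the class, its field names, and the sample role-play values are hypothetical and are not drawn from any operational test-authoring system.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeakingTaskContext:
    # Each field mirrors one layer of the contextual scheme described above.
    sub_domain: str                # social-interpersonal, academic-navigational, or academic-content
    speech_genre: str              # e.g., everyday conversation, argumentative speech
    communication_goal: str        # e.g., make a request, summarize source material
    medium: str                    # monologic, dialogic, or group
    setting_time: str              # spatial-temporal facet: when the event takes place
    setting_location: str          # spatial-temporal facet: where the event takes place
    participants: List[str] = field(default_factory=list)   # roles assumed in the task
    relationships: str = ""        # perceived social and cultural relationships

# Hypothetical encoding of the neighbor role-play task mentioned above.
role_play = SpeakingTaskContext(
    sub_domain="social-interpersonal",
    speech_genre="everyday conversation",
    communication_goal="make a polite request",
    medium="dialogic",
    setting_time="after midnight",
    setting_location="the test taker's apartment building",
    participants=["test taker", "neighbor (played by another test taker)"],
    relationships="neighbors of roughly equal status",
)

Making the facets explicit in this way does not settle which of them belong in the construct definition, but it does force the prioritization decisions discussed above to be recorded rather than left implicit.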

5.1.2.2  Components of Academic Speaking Ability and Task and Rubric Design To make interpretations about an examinee’s academic speaking ability, some type and amount of L2 spoken performance must be elicited and observed, and that performance must be evaluated against some set of criteria. A key assumption in the components approach to communicative competence is that performances should tell us something meaningful about the underlying nature of language ability rather than something meaningful about the specific context or task within which the performance occurs (Bachman, 2002).Assessment tasks, even for specific purposes assessment (Douglas, 2000), are mechanisms for eliciting performances to infer examinee abilities, and the achievement of tasks must therefore be understood and controlled in terms of the extent to which they affect language performance. Strictly speaking—​from a component model perspective—​tasks are not

of primary interest as targets of assessment. Instead, the primary question is which tasks should be utilized to sample a sufficient range of language behaviors to allow consistent inferences about the various competency components. Historically, in speaking assessments a relatively truncated and maximally generic set of task types has provided the basis for eliciting performances (Fulcher, 2003): one-​on-​one interviews with an examiner; monologic prompted performance; and more recently, paired and group-​based interactions. Speaking tests are thus comprised of a range of tasks such that evidence is provided (a) for the distinct competency components, and (b) for the depth or level or degree of competency within said components. In order to judge, identify, and select speaking test tasks regarding their capacity to elicit relevant types and amounts of evidence, a specific model of competency is necessary, which historically has taken the form of speaking proficiency scales and descriptors (e.g., Breiner-​Sanders et al., 1999; CEFR, 2001). Such speaking proficiency scales and associated assessments, which seek to divulge examinee’s communicative competence, have paid little attention to the domains to which inferences might or might not relate. In ESP testing (e.g., Douglas, 2000), tasks may be sampled to represent a domain, but their operationalization in the speaking test should involve careful control over contextual factors such that the targeted communicative competence components are adequately elicited. Otherwise, the argument goes, extrapolation of abilities beyond the specific test tasks will not be possible (Bachman, 2002, 2007). Several major challenges are related to assessment task selection that follows this line of reasoning (see, e.g., Brown, 2004; Fulcher, 2003; Norris et al., 1998; Skehan, 1998; van Lier, 1989). First, tasks vary considerably in the competencies that are solicited for their performance, including even supposedly ‘generic’ speaking tasks, such that some tasks may require only minimal grammatical knowledge but maximal pragmatic competence, while other tasks may emphasize organizational competence, and still others will place an emphasis on strategic competence. For a component approach to testing speaking, then, it is essential to select a variety of tasks that elicit the variety of components implied by the given competency model (e.g., a proficiency scale). Second, tasks also vary according to other dimensions not necessarily envisioned within communicative competence models, such as cognitive complexity and affective challenge. An examinee’s performance on a given task may be influenced by factors other than the speaking competence components that are being assessed. For a components approach, these factors must be understood and controlled to produce warranted interpretations about the competency components themselves. Third, tasks that can be efficiently delivered within a testing situation tend to overemphasize certain competencies (e.g., display of grammatical knowledge) at the expense of other competencies (e.g., interactional competence). Fourth, related to the third point, the inferences that are typically made about language ability (e.g., ‘can do’ certain speaking tasks in certain settings) based on competency models may well exceed

the evidence provided by the speaking test, given the likelihood that the constructs being interpreted are under-​represented in the handful of tasks utilized for testing. The major concern with selecting speaking test tasks for competency approaches can be summarized as the extent to which any given task (or set of tasks) will be able to elicit enough evidence for judgments about an examinee’s oral communicative competence, while ruling out extraneous factors that may also influence performance. When global proficiency-​based interpretations about test takers are made, that can be a tall order for speaking test tasks, due to the broad nature of inferences being made (e.g., speakers at X proficiency level can do tasks like A, B, and C with stipulated degrees of competency in components D, E, and F). One approach to resolving this challenge, as alluded to under the section on interactionalist approaches to language testing, has been to include two additional considerations in the selection of speaking test tasks: (a) domain analysis and restriction, and (b) task specification. Summarizing briefly, by restricting the range of intended inferences to particular target language use domains, it may be possible to identify a specific set of tasks or task types that adequately characterize the language use domains to which test performances should generalize (e.g., tasks that characterize English use in adult academic domains), and to analyze and specify the components of speaking competency called for within this set of tasks (Bachman & Palmer, 1996, 2010). While domain analysis and task specification help to reign in the kinds of over-​generalizations typical of more generic competency-​based speaking assessments, this approach also begs the question of what precisely is being tested.Whether speaking tests that follow this approach can still be considered tests of underlying abilities, abilities within a target language use domain, or abilities to do specific tasks within specific situations at specific degrees of success remains a major question for language testing to resolve. Once tasks are selected and performances are elicited, the other dimension of making interpretations about test takers’ competencies has to do with the criteria by which performances are judged. Rating rubrics, scales, and descriptors provide the typical basis for determining the speaking abilities of test takers as exhibited on a set of test tasks. From a component perspective, the key issue has to do with the extent to which a model of communicative competence is captured by the rubric used for rating performances. Once modeled, it is relatively straightforward to ensure that various components of language abilities are represented in a rubric, which, however, is only the first step. A main purpose of speaking assessment is to determine test takers’ levels of ability, which presents some challenges. A first question is the frame of reference for determining levels of ability: How are different levels of ability to be defined? Is a scale intended to reflect the developmental sequence for a given component or combination of components? To what extent is it intended to capture accurate and relevant ideas, beliefs, and other propositions expressed by the test taker? Is it based on an ultimate criterion of ‘native speaker ability’? Is it related to interpretations about actual task success? How many levels are needed to capture the range of abilities

supposedly represented in the communicative competence model, and how are the differences between levels to be determined? Are all components from the model to be represented equally at all levels of the rubric, or is there differentiation in terms of the weighting of specific components? Are answers to these and related questions based on expert judgments, the predilections of test developers, intended test score user interpretations, or empirical evidence for actual language user performance? Historically, for reasons of efficiency, rater effectiveness, simplicity of score reporting, etc., holistic rubrics have been favored for rating overall speaking ability in large-​scale standardized assessment. From a components point of view, however, the challenge with holistic scoring is the extent to which a mix of components is represented within a given scale level and/​or in the mind of the rater, and it is likely that differentiations between scale levels may be arbitrary or idiosyncratic for any given component. Analytic scales provide an alternative, in that each component may be represented by its own rubric, with appropriate weights attached to performance at different levels of the scale, such that an overall determination of ability level may be achieved based on a much more fine-​g rained rating analysis. However, analytic scales come at a price, such as rater training and effort, and complex modeling of the relationships among components represented in the scale (Xi, 2007). Beyond issues of representation, another major challenge for components-​ oriented speaking assessment has to do with the determination of different ability levels for any given competence component. How many degrees of differentiation, and of what size, are needed to adequately capture the actual range of abilities with grammatical knowledge, pragmatic competence, and so on? While many proficiency scales have set such levels arbitrarily or based on expert intuitions, it is also possible to attempt to determine actual levels of ability through empirical analysis (e.g., Bridgeman et al., 2011; Iwashita et al., 2008; Norris, 1996). A recent approach along these lines shows considerable promise by beginning with a careful analysis of speakers’ performances on actual language use tasks. According to Fulcher, Davidson, and Kemp (2011), “The performance data-​ driven approach, on the other hand, places primary value upon observations of language performance, and attempts to describe performance in sufficient detail to generate descriptors that bear a direct relationship with the original observations of language use” (p. 5). By working from the ground up, this approach may enable the identification of which components are called for in a given speaking task, and to what degrees components vary in determining levels of performance quality. The key for such an approach is an understanding of what tasks in which language use domains should be analyzed in the first place, pointing again to the importance of domain restriction and analysis as an a priori condition to effective rubric development. In summary, a component approach to oral communicative competence provides an initial set of target aspects of language to consider in identifying

speaking tasks, and in developing criteria for evaluating examinee performance. However, on their own, models of communicative competence do not enable the adequate selection of test tasks or the creation of rating rubrics, as they tend to equate all possible dimensions of speaking ability with each other and simultaneously view domains and tasks as sources of construct-​irrelevant variance to be controlled for. Especially for an EAP test, it seems clear that additional approaches are needed to determine: (a) which tasks are appropriate for eliciting consistent performances that reveal certain competencies, and (b) what qualities of performance on those tasks may be indicative of what degrees or levels of ability. To the extent that greater specificity is possible through domain analysis, target language use task identification, and task performance specification, the more accurate will be the selection of fitting speaking test tasks and the more warranted will be interpretations of performance based on rubrics that capture the actual qualities and ranges of ability at stake. Ultimately, the key question is what interpretations really need to be made about test takers’ abilities. Where vague, fuzzy, or broad interpretations about test-​taker speaking abilities are acceptable, then a handful of tasks that elicit some evidence about communicative competence components might be acceptable. Where specific interpretations about test-​takers’ abilities to accomplish representative tasks in a target domain are needed, generalization from test performance will require much more careful understanding and representation of the types of tasks in the targeted domain.

5.1.3  The Academic Speaking Construct from the Perspective of Domain Analyses Academic language use contexts have been studied extensively to inform the development of language proficiency assessments, and particularly for English language learners in the U.S. (Bailey & Butler, 2003; Bailey et al., 2007; Collier & Thomas, 1989; Hakuta & Beatty, 2000; García et al., 2006). These efforts have focused on defining academic sub-​domains, or language use contexts (Bailey & Heritage, 2008; Bailey et al., 2001), and on identifying common language functions and associated linguistic features. Typical language functions have also been revealed for some content areas and for different grade levels through analyses of English language proficiency standards, textbooks, and classroom observations by using a functional linguistic approach to domain analysis (Bailey et al., 2007; Chamot & O’Malley, 1994; Schleppegrell, 2004). These studies have enriched our understanding of the nature of language use in the academic domain but have also underscored the need to establish a shared understanding of what an academic domain model might entail, and how it should be organized, given that there are many ways in which a domain can be analyzed. As discussed earlier, the contexts of language use for academic settings can be described in a few layers, including language use sub-​domains, speech genre, goal

of communication, medium of communication, and characteristics of the setting. The domain analyses for academic settings have focused mainly on language use sub-​domains, goals of communication, and medium of communication. The domain of English-​medium instruction has been analyzed extensively in the contexts of K-​12 schools in the U.S. because of the No Child Left Behind (NCLB) Act in 2001, which required all children including English learners to achieve high standards in English language arts and mathematics by 2014. This NCLB requirement spawned wide-​ranging research into the nature of ‘school language’ that can be used for English learners to participate in content learning in English, which includes types of language used, common language functions for different grade levels, and features of academic language. Bailey and her colleagues (Bailey & Heritage, 2008; Bailey et al., 2001, 2007) identified three types of language used in the school setting: Social Language, School Navigational Language, and Curriculum Content Language. Social language is defined as the language used to communicate with family, classmates, and friends on familiar topics. The tasks in this domain tend to involve informal language registers. School navigational language involves communication with teachers, service people, and peers in broad social settings including classrooms. Curriculum content language is used to communicate with teachers and peers on academic course content. The language used for this purpose typically involves more formal and technical texts. Another major finding from this body of research is that academic language is characterized by more complex sentence structures and specialized vocabulary than language sampled from the social domain (Cummins, 2000). Focusing specifically on academic speaking, research on communication in university settings indicates that there is a large difference in the types and amounts of speech that are expected across fields of study. For instance, Ferris (1998) found that North American students majoring in Arts and Humanities reported much more participation in small-​g roup work, class discussions, formal speeches, student-​led discussions, and debates, while students in the ‘hard’ sciences reported that they only needed to speak English during office hours. To ensure fairness across test-​taker groups it is therefore important that various types of tasks are used to assess academic oral ability. As a medium of communication, speech can be categorized as reciprocal and non-​reciprocal. Bachman and Palmer (1996) define reciprocal tasks as ones in which the test taker or language user engages in language use with one or more interlocutors. The Common Core State Standards (CCSS; n.d.) and Kennedy (2008) present similar notions. The CCSS refer to comprehension and collaboration to describe reciprocal speaking activities, and presentation of knowledge and ideas to describe non-​reciprocal speaking tasks. Research suggests that university students are expected to engage in both reciprocal and non-​reciprocal oral communication tasks. Kennedy (2008) reported the amount of time that graduate students spend using reciprocal and non-​reciprocal aural/​oral tasks at a North American university. She found that students spent a little

more than twice as much of their time on non-​reciprocal as reciprocal tasks. The research did not distinguish between listening and speaking, and presumably much of the time spent on non-​reciprocal tasks involved listening to lectures or other types of input. The findings, however, do provide evidence that both reciprocal and non-​ reciprocal oral communication are important in a North American university context. The most studied non-​reciprocal speaking task is the oral presentation, likely because it is commonly encountered in university settings (Zappa-​Hollman, 2007). Both graduate and undergraduate students engage in class presentations, and graduate students give conference presentations, teaching assistant lectures, and dissertation or thesis defense presentations. The ability to give an oral presentation to a specific audience effectively has been shown to rely on fluency, vocabulary, pronunciation, familiarity with discourse genres, delivery skills related to confidence in public speaking, knowledge and ability for use of discourse markers and conversational hedges, and pragmatic competence (see Hincks, 2010; Myles & Chang, 2003; Santanna-​Williamson, 2004; and Zappa-​Hollman, 2007). Pragmatic competence is related to the speaker’s attitude towards a proposition, degree of knowledge, understanding of the audience, and role in the discourse (Recski, 2005). Reciprocal spoken communication is often subcategorized as dialogic versus group. Dialogic speech in the academic domain might usually involve two students, or a student and a teaching assistant, an instructor, or a service person. Research suggests that these types of encounters typically require students to assert themselves respectfully (Krase, 2007; Myles & Chang, 2003; Skyrme, 2010). The general conclusion is that native English-​speaking university students use pragmatic functions to accomplish this goal of being assertive and respectful much more effectively than non-​native English-​speaking university students. Group discourse is also commonly encountered in university classrooms (Kim, 2006). Students work in small groups on course-​related work, and full class discussions are also common. However, few studies have investigated the speech of this type of discourse, and the ones that have used very small sample sizes. This research suggests that active participation in a group discussion is particularly difficult for ESL students (Alco, 2008). Jones (1999) found that turn-​taking rules were important and much more complex in group discussions than dialogic speech. Lee (2009) expanded on this finding by determining that this difficulty stemmed at least in part from the large number of participants and the differences in status among the speakers. Tin (2003) found that participants in group discussions need to understand the relevant importance of illocutionary functions, including contradicting, explaining, and converging. Taken together, these findings suggest that university students are expected to engage in group interaction, but not much domain analysis research has been conducted on the topic. In sum, the research on school and university language has identified three major types of spoken language and revealed that academic language differs from social and navigational school language in important ways. The research on medium of communication indicates that students engage in monologic, dialogic, and group

discourse in university settings. Presentations, one-​on-​one service encounters, and small and large group class discussions are common to most university settings, but with differences in degree of participation across disciplines. It bears repeating, though, that most of the research on academic English has focused on North American contexts, so more research is needed to expand such understandings to other language use situations.

5.1.4  How Academic Speaking is Currently Conceptualized in Major Tests of English for Admissions Purposes Drawing on the theoretical perspectives and issues raised above, we argue for a multi-​ faceted approach to construct definition for speaking assessments. We propose that the construct definition for speaking should consist of key contextual facets such as intended language use domain(s), communication goals, medium of communication, and the characteristics of the setting, as well as foundational and higher-​order skills that are called upon to fulfill the communication goals. Intended language use domains constrain generalizations about speaking ability. Communication goals convey the major language functions that a speaker can fulfill, and foundational and higher-​order skills represent the underlying speaking skills called upon to fulfill these communication goals. The intended language use domains and communication goals are key facets of language use context, and thus need to be represented in a balanced way in designing a test. The foundational and higher-​order skills inform the design of both test tasks and scoring rubrics. The tasks need to be designed in a way to elicit evidence about the foundational and higher-​order skills, which also need to be captured in the scoring rubrics. Ultimately, this information will need to be presented effectively in the assessment score report, which summarizes a test-​taker’s performance on the test and needs to provide relevant, meaningful, and actionable information to different types of test users (e.g., test takers, admissions officers, teachers). The reported information needs to adequately reflect the construct and context, be consistent with the intended use of the test, and be presented in a way that is accessible to its users (Tannenbaum, 2018). Using this multi-​faceted approach to construct definition, in this section we compare the constructs of academic speaking embodied in the speaking sections of three major tests for admissions purposes: TOEFL iBT, IELTS Academic, and PTE Academic.

5.1.4.1  TOEFL iBT Speaking The TOEFL iBT Speaking section includes four tasks (www.ets.org/​toefl/​test-​ takers/​ibt/​about/​content/​speaking/​): Task 1: Independent task –​A test taker is presented two sides of an issue on an everyday familiar topic, states his/​her opinion, and gives reasons to support it.

Task 2: Integrated task –​A test taker reads a short passage about a campus-​related topic, listens to a conversation about the same topic, and restates the opinion expressed in the reading and listening materials and reasons for holding the opinion, or explains a problem and a solution. Task 3: Integrated task –​A test taker reads a short passage about a general academic course topic, listens to a short lecture about the same topic, and explains a concept using examples given in the stimulus materials. Task 4: Integrated task –​A test taker listens to a lecture on a general academic course content and explains theories, concepts, etc. using points and examples from the lecture. The TOEFL iBT Speaking scoring rubric includes three dimensions: delivery, language use, and topic development (www.ets.org/​s/​toefl/​pdf/​toefl_​speaking_​ rubrics.pdf). Delivery refers to the way speech is delivered, including pronunciation, intonation, and fluency. Language use concerns vocabulary diversity and precision and grammatical range and accuracy. Topic development, for the independent task, refers to the clarity and sophistication of the ideas and the coherence of connections among ideas. For the integrated tasks, topic development involves the completeness, representation, and accuracy of the content in the source materials. Note that the TOEFL iBT Speaking rubric includes an explicit dimension of topic development that focuses on meaning and content, which resonates with Purpura’s meaning-​oriented conceptualization of language abilities (Purpura, 2004, 2016). The construct definition, consisting of sub-​domains, communication goals, and higher-​order and foundational skills, is described in Table 5.1.

TABLE 5.1  The construct definition of the TOEFL iBT Speaking section

Overall definition: The TOEFL iBT Speaking section measures test takers’ abilities to communicate effectively in three sub-domains of the English-speaking academic domain including social-interpersonal, academic-navigational, and academic-content. These include the abilities and capacities to use linguistic resources effectively to accomplish the following communication goals: a) to describe events and experiences, and support or disagree with a personal preference or opinion about familiar topics typical of casual or routine social contexts, drawing on personal experience; b) to select, relate, summarize, explain, compare, evaluate, and synthesize key information from reading and listening materials on a typical campus life scenario and on an academic topic typical of the college introductory course level.

Intended sub-domains: Social-interpersonal; Academic-navigational; Academic-content

Communication goals
Social-interpersonal:
• Explain and support or disagree with a personal preference or an opinion
Academic-navigational:
• Explain relationships between issues, positions, and stances
• Describe a given problem and solution(s)
• Evaluate given options or positions
• Propose alternative options or positions
Academic-content:
• Summarize a given concept, idea, or other key given information
• Explain and exemplify a given concept, idea, process, etc.
• Synthesize key information (e.g., different points of view) from multiple sources

Foundational and higher-order abilities (across all sub-domains)
• to pronounce words clearly and intelligibly
• to use linguistic resources such as intonation, stress, and pauses to pace speech and to understand and express meaning precisely
• to use linguistic resources such as vocabulary and grammar to understand and express meaning precisely
• to use organizational devices (cohesive and discourse markers, exemplifications, etc.) to connect and develop ideas effectively and to convey content accurately and completely

5.1.4.2  IELTS Academic Speaking The IELTS Academic Speaking section consists of three parts (https://takeielts.britishcouncil.org/take-ielts/prepare/test-format): Part 1: A test taker answers very general questions from an examiner on everyday conversational topics such as hometown, family, work, studies, and interests. Part 2: A test taker sees a task card which asks him/her to talk about a familiar topic, including points to address, having one minute to prepare and one to two minutes to talk. Then the test taker is asked one or two questions on the same topic by the examiner. Part 3: A test taker answers further questions from the examiner that are connected to the topic of Part 2 and that are more abstract in nature. The test covers one sub-domain, social-interpersonal, and includes two typical communication goals: making a description and stating and defending an opinion. It focuses on everyday conversations and does not cover the other two

sub-​domains related to the academic setting (academic-​navigational and general academic). The communication goals engaged do not include summarization, synthesizing, comparing, and contrasting based on information from source materials. The IELTS speaking scoring rubric includes four major dimensions: fluency and coherence, lexical resources, grammatical range and accuracy, and pronunciation (www.ielts.org/​-​/​media/​pdfs/​speaking-​band-​descriptors.ashx?la=en). Although the oral interview format involves human interactions, it elicits a special kind of discourse with the interviewer controlling the opening and closing of topics, and

therefore eliciting only some aspects of interactional competence (Brooks, 2009); the scoring rubric does not address interactional competence.

5.1.4.3  PTE Academic Speaking The PTE Academic Speaking section includes five task types (https://pearsonpte.com/preparationpath/):

Task 1: Read-aloud – reading a text of up to 60 words out loud
Task 2: Repeat sentence – repeating a series of short sentences presented aurally
Task 3: Describe image – describe an image such as a graph
Task 4: Retell lecture – retell the key points in a short aural lecture
Task 5: Answer questions – answer short questions with one or a few words

academic domain that address basic communication goals such as description, retelling, and summarization. Some of these tasks also elicit speech based on source materials. For both TOEFL iBT and PTE Academic, computer delivery facilitates standardization of the test experience but at the cost of not being able to support interactive communication with a human. Including tasks that elicit interactive communication either with a human or via sophisticated artificial intelligence technology would be an enhancement to both tests. In terms of scoring, it is questionable to what extent all aspects in the scoring rubrics of PTE Academic Speaking can be fully measured, due to limitations of automated scoring technologies. The most important design feature of the IELTS Speaking section is the inclusion of dialogic interactions in social-​interpersonal settings. The other two sub-​domains—​academic-​navigational and academic-​content—​are not covered. Measuring interactive communication is a positive design feature in the IELTS Speaking section and including dialogic interactions in its design has the potential to promote test preparation practices that involve practicing speaking with others. Future opportunities for the IELTS Speaking section would include designing revised or new task types to elicit discourse that is more representative of the academic domain and that allows demonstration of expanded aspects of interactional competence as well as revising the scoring rubric to cover interactional competence to make it less linguistically focused.

5.2  New Possibilities for Operationalizing Academic Speaking Given the broad array of speaking activities in the academic domain and the wide range of speaking skills engaged by these activities, test designers must make conscious choices in test design decisions. These decisions reflect their testing philosophies, the aspects of the constructs they intend to prioritize, the potential tradeoff between the need to capture the construct faithfully and the costs of doing so at scale, and operational constraints including the test delivery system, test length, technology requirements, and so on. In attempting to remain true to the ‘academic’ focus of speaking tests, designers have to prioritize the key contextual factors that are deemed important for representing the domain of post-​secondary academic settings, including the three sub-​domains and major communication goals. Inclusion of higher-​order skills such as pragmatic competence and interactional competence in the construct definition will also impact decisions in designing assessment tasks to elicit those competencies. In this section we first review important considerations related to assessing (a) pragmatic competence and (b) interactional competence, and we present selected examples of task types designed for these purposes. We then review approaches to rubrics and scoring as well as considerations for score reporting, and the use of automated speech scoring technologies, which have the potential to expand the assessment of academic speaking in useful ways.

5.2.1  Assessing the Conveyance of Propositional and Pragmatic Meanings in Academic Speaking An important way to operationalize academic speaking is to identify and measure the literal and intended semantic meanings encoded by linguistic forms, typically referred to as the propositional content of talk (Purpura, 2004). We are interested in the extent to which interlocutors can get their point across in talk or the degree to which they are able to convey intended messages in context. Another way to operationalize academic speaking is to identify and measure the conveyance of a host of implied or pragmatic meanings that depend more on an understanding of the context than on a literal understanding of the words arranged in syntax. These pragmatic meanings are seen to embody a host of meaning extensions, which are superimposed simultaneously on grammatical forms in association with the literal meaning of utterances in communication. The source of implied pragmatic meanings may be contextual (i.e., what is situationally implied, but not stated), sociolinguistic (i.e., what is acceptable or appropriate to say to whom, where, when), sociocultural (i.e., what is acceptable or appropriate to say in this culture given its assumptions, norms, and expectations), psychological (i.e., what is an acceptable or appropriate emotional, attitudinal, or affective stance to use in this situation), or rhetorical (i.e., what is an acceptable or appropriate way to structure ideas, information, or exchanges). Thus, pragmatic meanings in academic communication can be judged by the extent to which the interlocutor is able to use linguistic resources to understand and express acceptable and appropriate contextual, sociolinguistic, sociocultural, psychological, and rhetorical meanings. By measuring pragmatic meanings in academic oral communication, L2 assessments can judge test takers’ understanding of, and ability to communicate, meanings related to functional intent, metaphor, irony, register, formality, politeness, sociocultural norms, humor, emotionality, sarcasm, deference, and so forth. Tasks that present input revolving around contextually impoverished topics or situations are designed to highlight the conveyance of propositional meaning over the conveyance of implied or pragmatic meanings. For example, description tasks are likely to highlight the conveyance of propositional and functional meanings over the conveyance of pragmatic meaning, whereas a task asking a test taker to pitch a product to a client would require skills to convey not only propositional and functional meanings, but a host of other implied meanings. Consider the following example of a task that prioritizes conveyance of semantic meaning. Task: Giving an opinion based on numeric information (Adapted from a task used at Kanda University of International Studies, Ockey et al., 2015) Situation: The following graph shows the average number of hours per week that different types of workers need to be successful. Take 15 seconds to familiarize yourself with the graph (Figure 5.2).

FIGURE 5.2  Hours worked at job each week. (Bar graph: the x-axis, labeled Job, shows Cook, Manager, Banker, Doctor, and Nurse; the y-axis shows Number of Hours; the values plotted are 25, 36, 50, 65, and 80 hours per week.)

Instructions: Tell what the graph indicates about the number of hours worked for each of the five jobs and then select a job and tell why you would prefer this one over the others. Use information in the graph to justify your decision.You have 60 seconds. Analysis:This task aims to assess the ability to describe a visual stimulus with accuracy, clarity, and precision, while also giving the test taker an additional reason for talking about it. The relatively simple input provides a quickly accessible set of information that requires interpretation and explanation. The subsequent requirement to state an opinion and justify it based on the information in the visual provides a reason for speaking, but it does so in a way that emphasizes conveyance of information (propositional content) found in the input. Note that no additional aspects of context are provided (e.g., audience for the task). Now, if the same task were structured to include a more elaborate context in which test takers were asked to give a formal presentation at a conference, or to engage in an informal discussion with classmates, the task would then clearly elicit a range of other pragmatic meanings in addition to semantic meanings.The design feature key to eliciting pragmatic meanings systematically is the context provided by the input and task requirements (see Grabowski, 2008; Plough et al., 2018). Consider this example task that was designed to elicit several dimensions of pragmatic competence. Task: Give constructive feedback to your classmate’s email (Youn, 2010) Situation: In your class, you are discussing how to write appropriate e-​mails to your academic advisor. The class was required to write an e-​mail to a professor to request a meeting.Your classmate has written an e-​mail and asks for your feedback. Instructions: Read your classmate’s e-​mail and give oral feedback for making the message more appropriate. Message: hi this is tom i have some questions for you about the course.

can I meet up with you tomorrow at 3:00pm Yet I do not know where is your office, so can you e-​mail back with the office number? Analysis: This task was designed to assess oral communication with an emphasis on eliciting and measuring pragmatic competence. Note the representation of context provided about the situation, as well as the description of the relationship between the test taker and the hypothetical audience, a classmate.The task instructions along with the message input provide two opportunities for eliciting evidence of pragmatic competence: (1) the pragmatic challenges presented in the email message itself (e.g., register appropriateness, directness of the request) that would be commented on by the test taker, and (2) the demand on the test taker to provide oral feedback to the classmate in a pragmatically appropriate way. In sum, propositional, functional, and pragmatic meanings are expressed in all discourse. The extent to which these are controlled for in academic tasks is a function of the inferences we wish to make about test takers’ ability to communicate in the academic domain. Pragmatic competence is currently not explicitly assessed by any of the major academic English tests for admissions, but it is anticipated that future academic speaking tasks will specify communication goals and intended audience increasingly to elicit responses that contain various pragmatic meanings. However, two issues need to be considered in designing tasks and associated scoring rubrics if pragmatic competence is included in the construct definition. The first pertains to the diversity of English-​medium instructional contexts, including those in traditional English-​dominant countries and those in countries where English is not used as the primary language.The heterogeneity of relevant target language use situations introduces challenges in defining social and cultural norms, against which pragmatic competence can and will be evaluated. The second is related to how to define and operationalize a threshold level of academic speaking ability for admissions purposes, given the diverse social and cultural backgrounds of test takers.These two issues should be carefully considered when designing tasks and rubrics to measure some aspects of pragmatic competence for academic admissions testing purposes.

5.2.2  New Approaches to Assessing Interactional Competence Interactional competence entails the individual attributes—​above and beyond fundamental linguistic knowledge and skills (e.g., pronunciation, fluency, vocabulary, grammar)—​that humans need to engage in reciprocal communication (including appropriate turn-​taking, repairing, opening and closing of gambits, responses to others, and negotiation and co-​construction of topics). Interactional competence involves the understanding and application of systematic structures in organizing interactions associated with accomplishing a communication goal within a social group (Mondada & Pekarek Doehler, 2004; Plough et al., 2018).

Assessing interactional competence requires designing tasks that elicit relevant evidence and developing scoring rubrics that capture critical aspects of the construct. None of the major academic English tests for admissions assesses interactional competence.The IELTS Academic Speaking section, despite the inclusion of tasks that involve both a test taker and an examiner, does not elicit the necessary evidence for evaluating interactive speaking skills for academic purposes; it also uses a traditional scoring scale that does not measure interactional competence. Computer-​delivered speaking tasks such as those in TOEFL iBT or PTE Academic elicit monologic speech only and do not attempt to measure interactional competence. As reviewed earlier, reciprocal spoken communication, including dialogic and group interactions, is prevalent in university settings. These reciprocal speaking tasks have been explored in small-​scale speaking tests; however, in large-​scale speaking tests, they still have not been widely used. Technology has allowed more efficient scaling up of speaking assessment, but the challenge of operationalizing automated interactive tasks is still a major limitation of computer-​based speaking tests that do not involve the use of a human interlocutor. Other advances in technology may make it more feasible to assess human-​mediated interactional competence to varying degrees in large-​scale assessments. Skype-​or Zoom-​like computer-​mediated communication technology, for example, makes it possible for test takers to interact reciprocally with an examiner (e.g., oral interview) and/​or with other test takers (e.g., paired or group speaking). The oral interview has the advantage of a long history of testing and high reliability compared with paired or group tests. However, it suffers from several drawbacks. For instance, even well-​trained interviewers have been shown to affect the validity of score interpretations yielded from oral interviews by providing different levels of support for test takers (Altakhaineh et al., 2019; Brown, 2003; Johnson & Tyler, 1998). The discourse elicited from oral interviews has also been disparaged for eliciting limited interactional competence. Taylor (2000) found that the paired format yielded more varied language, more balanced interaction, and more communicative language functions than an oral interview. Similarly, Brooks (2009) reported more interaction, negotiation of meaning, and complex language in the paired format as compared to an oral interview. Paired or group formats have the advantage of requiring fewer interviewers and the potential of positive washback for classrooms (Hilsdon, 1995; Nevo & Shohamy, 1984). Conversely, some research has indicated that the personal characteristics of other test takers in a group could affect a test taker’s score (Ockey, 2009; O’Sullivan, 2002). This would be of concern in a high-​stakes assessment. Virtual environments, in which test takers communicate in a virtual world through avatars, have also been investigated at ETS and elsewhere (Ockey et al., 2017). This approach may be advantageous from a delivery standpoint, as it would significantly reduce the network bandwidth necessary for delivery, but investigation is required to ensure that the use of avatars does not alter the targeted construct in a significant way. For instance, nonverbal information would not be easily communicated in this format.

It may also be possible to introduce interactivity by using spoken dialogue system (SDS) technologies that consist of automated speech recognition, speech understanding, conversation management, and speech synthesis components (Ramanarayanan et al., 2020; Suendermann-​Oeft et al., 2016). In these cases, the stimulus materials and the task directions provide the context and orient test takers to an expected response. Although no human interlocutors are involved, the responses are expected to show sensitivity to the contextual factors as defined by each conversational cue and contain appropriate content (for example, a simulated admissions interview using SDS where the computer acts as the interviewer).Two general approaches have been used with SDSs for assessment. One approach is to use highly constrained tasks that lead to a somewhat limited range of responses from test takers. This makes it possible to train systems on the range of likely responses. The system can then provide an appropriate response in an ‘interaction’ between the SDS and the test taker (Ramanarayanan et al., 2020). Another approach has been to use broader tasks with an SDS not designed to accurately recognize everything said by the test taker. These systems are programmed to provide opportunities for test takers to demonstrate a particular interactional competence feature, such as clarifying a request when a communication break-​down with the computer occurs. Inaccurate recognition and/​or inappropriate responses by the system are seen as providing opportunities for the test taker to repair communication breakdowns (Chukharev-​Hudilainen & Ockey, in press), which might have potential for assessing and training learner communication strategies in a low-​stakes environment. The first type of SDS can lead to somewhat natural discourse but is limited to rather narrow task and topic coverage, while the second type can employ broader tasks and a range of topics, though discourse can be disconnected and unnatural. SDSs remain limited in their effectiveness to date, but their capability for providing opportunities for test takers to demonstrate certain features of interactional competence may exceed that of human partners (Ockey & Chukharev-​Hudilainen, 2021). Much further development of SDSs is necessary before these systems could replace a human partner in providing open-​ended discussion opportunities and topics as could be expected in real life. Given the importance of assessing interactional competence, a rating scale that captures this ability would also need to be included in a scoring rubric. Most likely, the best way to assess this ability would be to include a separate score dimension specifically designed to measure interactional competence, including taking turns appropriately in a given context, negotiating meaning reciprocally, using opening and closing gambits, giving appropriate responses to others, and developing topics. Below we present a task that aims to assess oral communication with a specific focus on interactional competence. Task: Group discussion—​Polite disagreement Situation: Y   ou and your classmates have been asked to give a presentation in your computer class on the use of the Internet. Each of you will choose a different option that you would like for the presentation. Student A chooses

from the first two options, Student B chooses from the next two options, and Student C chooses from the last two options. From your two options, choose the one that you think would be best for a presentation in your computer class. Try to get your partners to agree with your choice by discussing the three options that were chosen by your group. Be sure to focus on why your choice would make a good presentation topic.

Topic: How in the past 10 years has the Internet changed the lives of:

Student A: children —or— workers at a company
Student B: university students —or— people who work at home
Student C: elderly people —or— people who live in small towns

Instructions: You will have a discussion with two partners. The purpose of the discussion is to complete the task above. You should try to share time equally and keep the discussion going for four minutes. You will be told when it is time to start and stop talking. When asked to begin, anyone can start the discussion. Analysis: This task involves a discussion among three test takers. It is designed to measure both foundational speaking skills and aspects of interactional competence. The task is set up to encourage three test takers to negotiate on a fairly specific topic and persuade others to adopt their own choice of presentation topic, thus allowing a rich and nuanced demonstration of a wide range of interactional competence.
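
Returning to the spoken dialogue system approaches described above, the first (highly constrained) approach can be pictured as a simple turn loop: the system plays a prompt, obtains a speech recognition hypothesis for the test taker's turn, matches it against a small set of anticipated responses, and selects the next move, falling back to a repair prompt when nothing matches. The sketch below is a schematic Python illustration of that loop only; recognize and speak are placeholders for the recognition and synthesis components named in the text, and the prompts and anticipated answers are invented for the example.

def run_turn(prompt, anticipated, fallback, recognize, speak):
    # Play the system's prompt, then obtain an ASR hypothesis for the reply.
    speak(prompt)
    hypothesis = recognize().lower()
    # Match the hypothesis against the small set of responses the task anticipates.
    for keyword, follow_up in anticipated.items():
        if keyword in hypothesis:          # deliberately crude matching, for illustration only
            speak(follow_up)
            return keyword
    # No match: the breakdown itself becomes an opportunity for the test taker to repair.
    speak(fallback)
    return None

# Hypothetical turn for a simulated advising conversation.
advising_turn = (
    "Hi, I'm your advisor. What would you like to talk about today?",
    {"register": "Which course are you trying to register for?",
     "extension": "I see. When is the paper currently due?"},
    "Sorry, I didn't catch that. Could you say it again?",
)

In the second, broader approach described above, the fallback branch is not simply an error state: deliberately imperfect recognition gives the test taker a systematic opportunity to demonstrate repair strategies.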

5.2.3  Approaches to Academic Speaking Scoring Rubrics and Considerations in Score Reporting The possible approaches to scoring speech are most basically defined as analytic or holistic. In analytic scoring, multiple aspects of the performance are scored using different scaled performance descriptors. Each aspect of the performance is either judged impressionistically to yield a score, or is enumerated or counted, such as the number of grammatical errors, or types of hesitation phenomena, which are then converted into a score, and scores are reported independently for each aspect of performance. In holistic scoring, a performance is judged impressionistically as a whole, or different aspects of the performance are evaluated and aggregated to yield a single score. Scales may have a “real-​world” focus, or an “ability focus” (Bachman, 1990, pp. 344–​ 348). In the former, the descriptors contain references to specific tasks, contexts, or performance conditions, such that there is a direct link between the scale and the task type, with an assumption that the task types are representative of tasks in the real-​ world domains to which inferences are to be drawn. Conversely, an ability-​focused

New operational rating scales are rare. Typically, most scales are a priori in nature (Fulcher, 2003) and developed organically from previous rating scales. Scoring rubrics are created based on the test developer's intuitive knowledge of language learning sequences or performance levels. Most current operational speaking tests use a priori rating scales, covering grammar, vocabulary, fluency, and pronunciation. However, in recent years there has been an increasing tendency to include aspects of the speaking construct that have grown in theoretical importance, such as discourse management and interactional competence, though these descriptors remain largely a priori constructions. Only two new scale types have been introduced to evaluate L2 speaking performance in the last 20 years: the empirically derived, binary-choice, boundary-definition scale (EBB) (Turner, 2000; Turner & Upshur, 2002; Upshur & Turner, 1999), and the Performance Decision Tree (PDT) (Fulcher et al., 2011; Kemp & Fulcher, 2013). The former adopted the data-based approach proposed by Fulcher (1987) but used the decisions and rationales of expert judges as data to create binary scoring choices between sets of performances that were placed in high vs. low proficiency piles. An EBB scale replicates these binary choices. Raters are required to ask a set of binary questions when assessing each speech sample, the answers to which lead to a score. A sample set of binary decisions for the construct 'communicative effectiveness' in retelling a story is illustrated in Figure 5.3. The primary advantage of the EBB is its ease of use by raters in live scoring, or in retrospective scoring where time is at a premium. However, a unique EBB scale must be developed for each test task or task type. The resulting score can therefore only be generalized to similar task types, which may be considered an advantage or disadvantage, depending upon the theoretical position taken.

FIGURE 5.3  Communicative effectiveness (Upshur & Turner, 1995, p. 8). [Decision-tree figure: binary questions about the retell (coherent story retell vs. listing; one story element only or "garbles"; three story elements without prompts; little hesitation or use of L1; L2 vocabulary was supplied) lead to a score from 1 to 6.]
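To make the EBB logic concrete, the sketch below applies a fixed sequence of binary questions to a single speech sample and returns a score band. The question wording loosely follows the criteria shown in Figure 5.3, but the branch structure, function name, and example call are illustrative assumptions rather than a reproduction of the published scale.

```python
# A minimal sketch of an EBB-style scorer: each answer to a yes/no question
# moves down one branch until a score band (1-6) is reached. The tree topology
# here is illustrative, not the tree published by Upshur and Turner (1995).

def ebb_communicative_effectiveness(coherent_retell: bool,
                                    single_element_or_garbled: bool,
                                    three_elements_no_prompts: bool,
                                    little_hesitation_or_l1: bool,
                                    l2_vocab_supplied: bool) -> int:
    """Answer a fixed sequence of binary questions about one retell and
    return a score band from 1 (lowest) to 6 (highest)."""
    if not coherent_retell:                  # a listing rather than a coherent story
        return 1 if single_element_or_garbled else 2
    if not three_elements_no_prompts:        # coherent, but partial or prompted
        return 3
    if not little_hesitation_or_l1:          # coherent and complete, but dysfluent
        return 4 if l2_vocab_supplied else 5
    return 6                                 # coherent, complete, and fluent


# Example: a coherent but prompted/partial retell receives band 3
print(ebb_communicative_effectiveness(True, False, False, False, False))
```

Because the questions are task-specific, a separate function (that is, a separate EBB scale) would be needed for each task or task type.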

Maintaining a link between the scoring rubric and the task type strengthens the validation claim for score meaning but restricts generalizability. Making a claim for score meaning beyond the task or task type is questionable and requires significant validation efforts, but may lead to wider extrapolation of score meaning. Turning from EBBs to PDTs, Fulcher et al. (2011) extended the methodology of Fulcher (1996) to incorporate the logic and usefulness of the EBB. Using domain analysis, criteria for successful performance in service encounters were examined under discourse competence and pragmatic competence; the latter also incorporated components of 'interactional competence' such as rapport-building strategies. Each of the components may be scored in binary fashion or by assigning partial credit, and each branch of the tree represents an increase in the score a test taker could receive. Kemp and Fulcher (2013) showed how this could be extended to scoring in the academic domain using existing analyses of university interaction (e.g., Biber, 2006; Rosenfeld et al., 2001), supplemented with further detailed discourse analysis of the same. The design of the PDT requires the collection of performance data either from real-world performance on typical domain-specific tasks or from test tasks modeled on those tasks. The data are then analyzed using methods suitable for the constructs of interest. Fulcher et al. (2011), using discourse analysis, illustrated a model for spoken interaction in service encounters (Figure 5.4). The model and description of linguistic realizations are used to inform the structure of the decision tree, as illustrated in Figure 5.3. The score generated from the tree is a profile of speaker abilities, which may then be communicated to the test taker in addition to the aggregated score and used for diagnostic or educational purposes (cf. Alderson, 2010).

FIGURE 5.4  Constructs and realizations of service encounters (Examples of data to illustrate realizations can be found at http://languagetesting.info/features/rating/pdts.html)

A. Discourse Competence
1. Realization of service encounter discourse structure (the 'script': Hasan, 1985)
2. The use of relational side-sequencing

B. Competence in Discourse Management
3. Use of transition boundary markers
4. Explicit expressions of purpose
5. Identification of participant roles
6. Management of closings
7. Use of back-channelling

C. Pragmatic Competence
8. Interactivity/rapport building
9. Affective factors, rituality
10. Non-verbal communication
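As a rough illustration of how a PDT could turn such criteria into a score profile, the sketch below awards binary or partial credit to each of the ten realizations in Figure 5.4 and aggregates by category. The criterion names, credit values, and aggregation rule are illustrative assumptions, not the scoring procedure of Fulcher et al. (2011).

```python
# A minimal sketch of PDT-style profile scoring: each criterion from Figure 5.4
# receives 0, 0.5, or 1 credit, and totals are reported per construct category
# together with an aggregate score. Names and credit scheme are assumptions.

CRITERIA = {
    "discourse_competence": ["script_realization", "relational_side_sequencing"],
    "discourse_management": ["transition_markers", "explicit_purpose",
                             "participant_roles", "closings", "back_channelling"],
    "pragmatic_competence": ["rapport_building", "affect_and_rituality",
                             "non_verbal_communication"],
}

def pdt_profile(credits: dict) -> dict:
    """credits maps a criterion name to 0, 0.5, or 1; unscored criteria count as 0.
    Returns per-category totals plus an aggregate score."""
    profile = {category: sum(credits.get(name, 0.0) for name in names)
               for category, names in CRITERIA.items()}
    profile["aggregate"] = sum(profile.values())
    return profile


# Example: strong discourse management, weaker pragmatics
print(pdt_profile({"script_realization": 1, "transition_markers": 1,
                   "closings": 0.5, "rapport_building": 0.5}))
# {'discourse_competence': 1.0, 'discourse_management': 1.5,
#  'pragmatic_competence': 0.5, 'aggregate': 3.0}
```

The per-category totals are what would feed a diagnostic profile, while the aggregate could accompany it for reporting purposes.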


Like the EBB method, PDTs assume that scoring will be specific to the task type associated with activities in a domain. As service encounters also take place in the academic environment, a scale developed in the context of one service encounter, such as a travel agent interaction, may be somewhat generalizable to a wider range of encounters in the academic setting. However, further research is required to establish similarity and the range of potential score extrapolations for a given type of task.

From the earliest a priori rating scales, which relied upon expert intuition for the construction of the rubric, through measurement-driven scales that compile rubrics from other scales into a new scale, to data-driven scales that base rubrics upon performance data, a clear trend is emerging. Scales and rubrics in L2 assessment are increasingly related to the types of communicative activities that test takers are likely to perform in specified real-world domains. With the focus on performance, it becomes critical to describe the language used to achieve communicative purposes. This trend takes both language and test purpose very seriously and requires the specification of a test-taker population, the domain-specific communication goals, and the intended interpretations and extrapolations of score meaning.

Score report design is also crucial in making sure that the reported information is aligned with the intended use of a test and presented in a way that facilitates appropriate interpretation and decision-making by score users. As Tannenbaum (2018) argued, four major questions need to be answered in designing a score report: 1) For what purpose is the report designed, formative or summative? Given the purpose, does the report include meaningful, relevant, and actionable information for different stakeholders? 2) Is the report static or dynamic? 3) Are sub-scores reported? 4) Is the imprecision of test scores communicated to different types of users? As L2 assessments increasingly emphasize abilities to use language for communicating successfully on representative or prioritized speaking tasks, score reporting will require considerable additional information to meet users' needs and convey abilities with accuracy.

As discussed previously, the scoring approaches of major academic English tests used for higher education admissions have not integrated the new scale types discussed above that are task- or task-type-specific, largely due to a desire to generalize beyond the test task types and to optimize operational efficiency. All three major academic English tests use holistic scoring and descriptors that are focused on traditional foundational skills or a mix of foundational and higher-order skills. The score reports of major academic English tests for admissions purposes serve the needs of admissions officers and test takers primarily. They are all designed as static reports and do not include dynamic score reports for relevant subgroups that may be of interest to different stakeholders. Given that some universities require minimum scores on specific sections in addition to the total score for admissions, all three tests report sub-scores, including a speaking score. In addition, to provide additional guidance to students, the PTE Academic test reports oral fluency and pronunciation sub-scores.
Unlike PTE and IELTS, in addition to Speaking section scores, the TOEFL iBT test reports levels of performance on speaking tasks representing three different sub-domains (familiar topics, campus situations, and general academic course content), along with performance descriptors that characterize the examinee's abilities within each of these sub-domains. Finally, none of the test score reports communicates information about the measurement imprecision of the scores to test users such as admissions officers and test takers, partly due to the challenges of conveying measurement error to a lay audience effectively. Zapata-Rivera et al. (2018) discussed how to communicate measurement error to parents as well as teachers, and similar guidance could inform approaches to conveying measurement imprecision to admissions officers in a transparent and effective way so that it is carefully considered in setting admissions standards.
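One low-tech way to convey such imprecision is to report a band around the observed score based on the standard error of measurement (SEM) rather than a bare point value. The sketch below is only an illustration of that idea; the score scale, SEM value, and wording are invented for the example and do not describe any operational score report.

```python
# Illustrative only: format a speaking score together with a +/- 1 SEM band,
# clipped to the score scale, so that lay users see a range rather than a point.

def score_band(observed: float, sem: float,
               scale_min: float = 0, scale_max: float = 30) -> str:
    low = max(scale_min, observed - sem)
    high = min(scale_max, observed + sem)
    return (f"Speaking score: {observed:.0f} "
            f"(scores from {low:.0f} to {high:.0f} reflect a similar level of ability)")


print(score_band(observed=24, sem=2.0))
# Speaking score: 24 (scores from 22 to 26 reflect a similar level of ability)
```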

5.2.4  Review of New Scoring Technologies

Automated scoring has seen increasing use in tests of academic English speaking such as PTE Academic and TOEFL iBT in the last decade. Research on automated assessment of non-native speech initially focused on pronunciation (Bernstein et al., 1990; Witt & Young, 2000) and fluency of constrained speech (e.g., read-aloud) (Cucchiarini, Strik, & Boves, 1997). It later expanded to aspects of fluency and pronunciation of spontaneous speech (Cucchiarini, Strik, & Boves, 2002; Xi et al., 2008), and more recently has targeted vocabulary diversity, grammatical complexity and sophistication, content relevance, and aspects of discourse of spontaneous speech (see Zechner & Evanini, 2020 for a review of recent work on new speech features). The typical architecture of a state-of-the-art speech scoring system consists of four components: 1) an automatic speech recognition (ASR) engine that provides word hypotheses based on the test takers' spoken responses; 2) computation modules that generate a set of features for various sub-dimensions of the speaking construct (e.g., pronunciation, fluency); 3) filtering models that detect non-scorable responses (e.g., those with too high a noise level or where test takers do not respond at all); and 4) a scoring model that predicts a task score based on a combined set of features (Higgins et al., 2011).

ETS's research on automated scoring of spontaneous speech, initiated in 2002, has been noteworthy (see Zechner & Evanini, 2020 for a comprehensive overview). A first scoring system, SpeechRaterSM, which mainly used features relating to fluency and pronunciation, was operationally deployed in 2006 to score the low-stakes TOEFL Practice Online (TPO) program. The system has continuously been improved in many ways, including: (1) use of a higher-performance ASR system; (2) substantial extension of the feature set to include vocabulary diversity, grammatical complexity and sophistication, content relevance, and aspects of discourse; and (3) substantial extension of the capabilities of the filtering models to identify non-scorable responses (Zechner & Evanini, 2020).
Recently, SpeechRater has been deployed operationally to score the TOEFL iBT Speaking section in conjunction with human raters. Other efforts at employing automated scoring of spontaneous speech for admissions tests of English include work applied to PTE Academic (Bernstein et al., 2010) and the Duolingo English Test (LaFlair & Settles, 2020), where machine scoring is used solely to score a variety of speaking responses. Another approach to leveraging the benefits of automated scoring is to combine human and machine scores to generate a final task score. Thus, aside from using either method in isolation, the machine score could be a contributory score combined with a human score, or a "check" score that triggers human adjudication in case of discrepancies between human and machine scores above a pre-defined threshold. Yet another approach would be to use human raters for cases where the automated system has difficulties (e.g., scoring of responses that receive a low confidence score from the ASR system, or that are very short). Finally, it would also be conceivable to use an automated engine to score basic aspects of a spoken response (e.g., fluency, pronunciation), let human raters focus only on content and coherence, and then combine the information to yield a task score.
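The sketch below illustrates two of these combination policies in schematic form: a contributory score that blends human and machine ratings, and a "check" score that flags a response for adjudication when the two diverge. The weights, threshold, and function names are illustrative assumptions, not the operational procedure of any of the tests discussed above.

```python
# Schematic sketch of human-machine score combination policies.
# Weight and threshold values below are arbitrary placeholders for illustration.

from statistics import mean

def contributory_score(human_scores: list, machine_score: float,
                       machine_weight: float = 0.3) -> float:
    """Blend the mean human score with the machine score using a fixed weight."""
    return (1 - machine_weight) * mean(human_scores) + machine_weight * machine_score

def check_score(human_score: float, machine_score: float,
                threshold: float = 1.0) -> tuple:
    """Report the human score, flagging the response for adjudication by an
    additional rater when the human-machine discrepancy exceeds the threshold."""
    needs_adjudication = abs(human_score - machine_score) > threshold
    return human_score, needs_adjudication


print(contributory_score([3.0, 4.0], machine_score=3.2))  # about 3.41
print(check_score(human_score=3.0, machine_score=1.5))    # (3.0, True): route to a second rater
```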

5.3  Recommendations for Future Research and Development

In this final section, we synthesize critical observations related to the evolving nature of academic speaking assessment and propose a research and development agenda.

5.3.1  Recommendation 1: Developing an integrated and coherent theoretical model of academic speaking

In Section 5.1, we discussed the limitations of current theoretical perspectives on academic speaking ability in relation to a specific domain. We also emphasized the need to develop an integrated model that consists of descriptions of the contexts of academic language use and components of language abilities that are relevant to the academic domain. In addition, we proposed a scheme for characterizing the contexts of academic speaking tasks to provide guidance in customizing constructs for academic speaking tests. This approach includes key contextual factors such as sub-domain of language use, genre, communication goal, medium of communication, and characteristics of the setting. A domain-specific theoretical model for academic speaking ability is expected to provide better guidance for operationalizing the construct, though extensive work will be needed to define, test, and refine the interconnected components of a model that is specific to the academic domain. A domain-specific theoretical model could also benefit from increasing attention to meaning and content conveyance, as advocated by Purpura (2004), and from clearer explication of the complex interactions among language knowledge, cognitive and metacognitive processes, topical knowledge, and context (Purpura, 2016).
The role of topical and content knowledge, especially for tests of English for specific purposes, warrants further investigation in both defining and operationalizing constructs. While it is sensible to define content coverage and accuracy as part of the construct for integrated speaking tasks that use source materials, the extent to which content, background, and world knowledge should be considered an integral part of the construct remains open to debate when such knowledge is not provided as part of the test stimulus materials.

5.3.2  Recommendation 2: Refining the speaking construct in relation to the evolving nature of oral communication in higher education

The landscape of global higher education is rapidly evolving. Increasing globalization and technological mediation of instruction in higher education may further diversify the contexts in which academic English is used and the very nature and extent of speaking in EAP. As these developments unfold, an adequate academic speaking test will need to reflect the expanding and changing domains and purposes of spoken English language use if it is to be employed for valid decision-making within these contexts.

5.3.2.1  Changing settings of English-medium higher education

Worldwide, English-medium higher education is on the increase. Similarly, sustained growth in the international mobility of college students, coupled with the status of English as the default academic lingua franca, points to a burgeoning need for EAP assessment that extends well beyond typical English-dominant regions (e.g., the U.S., Canada, the United Kingdom, Australia, and New Zealand). However, along with these developments come questions regarding the shifting realities of spoken academic English use in these settings, including: (a) the type and frequency of speaking tasks encountered, (b) the relative importance afforded to speaking ability as a component of academic English competency, (c) the variety of English-language interlocutors encountered, and (d) the expectations of higher education administrators and educators regarding a threshold level of spoken English competency for participating in English-medium instruction. Broader philosophical questions also arise in relation to the norms or standards adopted or assumed for spoken English (e.g., 'native' versus lingua franca varieties), the potentially varied nature of pragmatic and interactional competencies, the opportunities for English-language instructional support within higher education institutions, and so on. Long-term research and development in academic English testing should address these concerns through a vigorous program of needs and domain analysis.

5.3.2.2  Technology-mediated delivery of higher education

Globally, higher education is experiencing profound changes because of the increasing adoption of technological tools and platforms for the delivery of instruction, as well as more varied possibilities for participation in and completion of courses and degree programs (e.g., online and hybrid courses, distance education). Already, most if not all college-level teaching and learning in typical universities is technology-mediated, in the sense that reading materials and other resources are becoming primarily web-based, courses are organized and managed via online tools and virtual meeting spaces (e.g., Blackboard, Canvas, and Zoom), and communication among students and teachers is carried out extensively in electronic environments (e.g., email, chat rooms, blogs). Even the traditional lecture has evolved to incorporate increasingly technology-mediated visual and audio support. More recently, the emerging trend of Massive Open Online Courses (MOOCs) in higher education has gained considerable momentum and is likely to expand in the future, resulting in considerable proportions of college coursework being completed in 100% web-based environments with fully dispersed, and likely maximally diverse, participants. It goes without saying that the advent of global health crises, such as the COVID-19 pandemic, further serves to push higher education into a reliance on technology-mediated communication.

Clearly, these and future innovations will impact the way English is used for oral communication in higher education, and the growing use of technology for facilitating communication may also influence the way we define and operationalize the speaking construct for assessment purposes. Questions arise concerning (a) the relative importance of speaking and oral interaction (versus literacy-based skills), (b) the types and frequency of spoken tasks encountered in technology-mediated instructional settings, and (c) the influence of web-based contexts on oral communication. For tests of academic speaking proficiency to remain relevant, there is a pressing need to monitor these trends in higher education on a continual basis and to determine whether the speaking constructs represented in academic English assessments need to be updated accordingly to reflect ongoing developments.

5.3.2.3  Intended uses for speaking assessment

With changes in the settings for English-language higher education, more fundamental issues related to English proficiency assessment may emerge. It is unclear to what extent current approaches to assessing speaking proficiency will meet the needs and intended uses of actual decision makers and other users of large-scale assessments. For the expanding contexts of English-medium instructional programs, it is uncertain to what extent standards of English language proficiency, and particularly spoken proficiency, will play a defining role in admissions testing. It may also be the case that other uses for assessment will gain more traction in these
settings, such as formative, diagnostic, and placement uses. Similarly, it may be that quite distinct levels and standards of expectation exist across the diverse contexts for English-​medium higher education, including distinct expectations for different sub-​skills (e.g., a prioritization of receptive skills) within the same context. In technology-​mediated instruction, too, the role for speaking proficiency assessment is uncertain, particularly considering the increasing importance of text-​based communication in such environments. Whether current standards for oral language ability, including those held as expectations for first-​language users in conjunction with the Common Core State Standards or typical liberal studies requirements, will be perpetuated for new technology-​mediated environments remains to be seen. In light of these ambiguities vis-​à-​vis the roles for oral communication in future realizations of higher education, another major focus of ongoing investigation should be the needs of assessment users. Whether tests will need to inform decision makers about spoken proficiency in English, what exactly they need to know about it, and what precisely will be done with that information should be the focus of continual monitoring and periodic revision.

5.3.3  Recommendation 3: Developing new task types and rubric designs to reflect the changing construct

As outlined above, it seems clear that new task types will be called upon in order to better reflect the specific construct of academic speaking, especially as the academic context itself continues to evolve and as the characteristics of effective speaking take on new meanings. The introduction of new task types into admissions testing as a way of better reflecting the academic speaking construct has proven worthwhile in initial efforts. Integrated task types, including listen-read-speak tasks, were a major innovation of the TOEFL iBT test and have been well received by test users and language testing researchers alike (Plakans, 2012) since they mimic real-world academic tasks. Integrated tasks also constrain the content of expected responses, minimizing the possibility of test takers adapting memorized responses. Given these advantages, we expect that the development and use of additional integrated task types will continue. These may also represent a wider range of speaking in academic settings, including on-campus service encounters, interactions with peers, and speaking with academic staff about academic course content.

Of considerable current interest is the introduction of task types that tap more directly into an interactive speaking ability component, and several alternatives are being explored, including:

(a) Interaction between candidates in role-play or group discussion scenarios, either face-to-face or via video conferencing.
(b) Interaction with an interlocutor/rater, either face-to-face or via video conferencing.
(c) Interaction with a simulated interlocutor.

Of course, the choice of task types and the participation format do not in and of themselves create 'more' valid assessments of speaking. Rather, these choices create the conditions under which appropriate evidence may be generated to support a valid inference about the constructs of interest. Appropriate scoring rubrics are also needed. Paired or group formats, whether virtual or physical, allow interaction between test takers and/or interviewers, which provides the potential for broadening the speaking construct to include interactional competence. However, factors such as personality and familiarity of partner (Berry, 2007; Ockey, 2009), proficiency level (Nakatsuhara, 2011), gender (O'Sullivan, 2000), speaking style (Brown, 2003), and cultural background (Fulcher & Marquez-Reiter, 2003) introduce variability that is absent from monologic speech (see also Vidakovíc & Galaczi, 2013). The key challenge remains unanswered: in order to assess interactional competence, do we remove constraints on all sources of variability present in real-life communication, or do we restrict variability in order to deliver a test that is at least notionally the same for every test taker at the point of delivery (Fulcher, 2003)? And what balance between the two extremes provides a sound basis for predicting likely future performance in a higher education context? These questions may guide research efforts in this direction. Finally, practically speaking, it is important to ensure that task specifications are sufficiently coherent to generate parallel tasks while maintaining the large task pool necessary for large-scale assessments. Research into the impact of task characteristics on difficulty will provide insights into considerations for task specifications that guide the development of comparable speaking tasks across test forms.

5.3.4  Recommendation 4: Leveraging technological advances to support assessment design innovations

Chapelle (2010) suggested that technology supports language assessments for three main purposes: efficiency, equivalence, and innovation. Efficiency includes making tests shorter, cheaper, and quicker to score and report. Shorter assessments can be achieved through adaptive testing, and tests can be made cheaper by using automated scoring. Equivalence refers to the emphasis placed on making scores from computer-based and paper-and-pencil assessments comparable in contexts where both formats need to co-exist. As Chapelle suggests, real innovation would leverage technology to better assess constructs that may not be measurable with paper-and-pencil or traditional online assessments. This notion of technology-driven innovation fits well with a reconceptualization of language ability constructs to better represent the current academic language use domain.

Innovative speaking assessments, for example, could use multimedia to make tasks more authentic by providing enhanced context. Visual and audio input could support a test taker's understanding of the setting of the task and replicate real-world demands. Technology could also be used to broaden the possible language use domains that could be assessed. For example, by using visual and audio stimuli, it may be possible to set up a context in which abstract language could be more easily assessed. Fantasy and imaginary situations (delivered via Virtual Reality technology) could be depicted in the visual input. Technology may also make it possible to scaffold more complex language use situations to make them accessible for assessment.

Research on state-of-the-art technology needs to be conducted to determine the extent to which it could support computer-mediated reciprocal speaking assessments. In order to develop a viable model for assessing a broad construct of speaking ability, including interactional competence, a multi-phase research effort is called for. This research strand would begin with small-scale studies that investigate and evaluate the delivery and administration procedures needed for assessing reciprocal communication. Various technological possibilities are available, including platforms such as Skype, Cisco WebEx, and Zoom for computer-mediated video communication, and Second Life and Metamersive for virtual worlds with audio communication and avatar representation. The ways in which different interfaces could work (e.g., whether all speakers should be seen at the same time or only the person speaking, and whether speakers should see themselves and, if so, where on the screen) need to be investigated for real-time video formats. A comparison between video and virtual world formats is also needed. For instance, are test takers less anxious with one format than another? Which format best assesses interactional competence?

Another research strand would explore how interactive speaking tasks can be designed to have the desired measurement characteristics and construct fidelity. An emerging body of research on one-on-one, paired, and group assessments could be used to guide this step. Research is needed to demonstrate that such tasks can also be developed in a standardized way to provide a scalable task pool. Finally, research needs to be pursued on the use of simulated dialogues based on speech technologies to better assess interactional competence. This research can build on previous research on intelligent tutoring systems (e.g., D'Mello et al., 2011; Stoyanchev & Stent, 2009), in which response formats are relatively unconstrained, but the domain of discourse is tightly constrained.

5.3.5  Recommendation 5: Continuing research on automated technologies for scoring and feedback

Substantial research is needed to support capabilities for automated scoring of spoken responses, as well as for providing substantive and meaningful feedback on different aspects of test takers' proficiency. Research and development in these
areas will include the conceptualization, development, and evaluation of new feature classes that address new construct areas. Such constructs include the completeness, representation, and accuracy of content derived from input sources; topic development and coherence; and features that address particular aspects of new task types, such as evaluating the pragmatic appropriateness of responses that are addressed to a particular conversational participant. Furthermore, in using automated technologies to provide feedback, investigations are needed into which types of feedback may be of most interest and use to test takers and learners (and teachers, potentially), how reliably feedback can be computed, and in what form it should be presented.

5.3.5.1  Feature development for new constructs

Since most of the automated feature development in the existing literature has focused on basic aspects of speech, such as fluency, pronunciation, prosody, simple measures of vocabulary frequency, vocabulary diversity, and variation in the use of grammatical structures, the focus looking forward will be on deeper features of language use, such as grammatical accuracy and appropriateness of word use in context, as well as topic development consisting of content, discourse coherence, and progression of ideas. Further research to conceptualize, develop, and evaluate features that address pragmatic appropriateness in contextualized speaking tasks is also needed. Current speaking tasks in large-scale assessments are generally de-contextualized; contextualized speaking tasks would specify explicit communication goals and target audiences, requiring the speaker to consider factors such as the setting and the characteristics of the participants when planning and executing a spoken response. These contextual factors may impact features of the responses such as choice of vocabulary, sentence complexity, intonation, and tone, all of which would require innovation in automated speech evaluation capabilities.
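For the basic delivery features mentioned above, the computation is comparatively straightforward once word-level timestamps are available from an ASR system. The sketch below is a simplified illustration of that idea; the pause threshold, feature names, and example data are assumptions made for the example and are not the feature definitions used in any operational scoring engine.

```python
# Simplified delivery-feature extraction from hypothetical ASR word timings.
# Each word is a (token, start_sec, end_sec) tuple; gaps between adjacent words
# longer than a threshold are treated as silent pauses.

def fluency_features(words, response_duration_sec, pause_threshold=0.25):
    """Return a small dictionary of illustrative fluency measures."""
    pauses = [nxt_start - prev_end
              for (_, _, prev_end), (_, nxt_start, _) in zip(words, words[1:])
              if nxt_start - prev_end >= pause_threshold]
    return {
        "speaking_rate_wpm": len(words) / response_duration_sec * 60,
        "long_pause_count": len(pauses),
        "mean_pause_sec": sum(pauses) / len(pauses) if pauses else 0.0,
    }


asr_words = [("the", 0.0, 0.2), ("campus", 0.2, 0.7), ("library", 1.3, 1.9),
             ("closes", 1.9, 2.4), ("early", 2.5, 3.0)]
print(fluency_features(asr_words, response_duration_sec=6.0))
# speaking rate 50 words per minute, one long pause of about 0.6 seconds
```

Deeper features of the kind called for here (grammatical accuracy, coherence, pragmatic appropriateness) would require substantially richer linguistic analysis than such surface timing measures.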

5.3.5.2  Development of feedback for test takers

Feedback provided by automated speech technologies should help identify learners' strengths and weaknesses and inform their future study plans. Automated feedback has been developed for both constrained and spontaneous speech. For constrained speech (e.g., read-aloud tasks), feedback has focused on pronunciation accuracy and fluency, and specific word or phoneme pronunciation errors can also be identified (see Loukina, Davis, & Xi, 2018 for a review). Gu et al. (2020) reported on two types of machine-generated feedback on spontaneous speech, including speech dimension sub-scores and feedback based on individual linguistic features. Automated feedback research on spontaneous speech has largely centered around performance-level feedback on speech dimensions (e.g., language use, topic development, delivery) or individual speech features
(e.g., speech rate, rhythm, pronunciation, and grammatical accuracy). Research on targeted feedback that flags errors or deficiencies in spontaneous speech is still in its infancy, though, and technological capabilities that provide possible corrections to specific errors are almost non-existent. Research and development work will need to focus on providing feedback on features that indicate higher-order speech skills such as content accuracy and representation, rhetorical organization and coherence, and pragmatic appropriateness. For low-level features, feedback capabilities that identify specific pronunciation, vocabulary usage, and grammatical errors are also needed to provide targeted, actionable feedback to users. Furthermore, different ways of presenting automated feedback should be investigated and evaluated to make sure they are meaningful and interpretable to users. The combination of automated feedback and human feedback is also an important area of investigation, to leverage the accuracy and reliability of machine feedback alongside the richness and sophistication of feedback by trained language educators.

In summary, we recommend developing an active and robust research agenda to continue to investigate the following long-term areas of need in the assessment of academic speaking abilities:

•	the relationships between speaking constructs and continually evolving higher education media, settings, and assessment uses
•	the use of technology to support interactive speaking tests to measure interactional and pragmatic competence more fully
•	the use of NLP and speech technologies to provide fine-grained, meaningful diagnostic feedback to test takers and teachers
•	alternative approaches to designing scoring rubrics and human scoring that lead to improved score accuracy and reliability while yielding useful information for providing diagnostic feedback

5.4  Conclusion

In this chapter, we have reviewed the origins of, recent advances in, and current practices of academic speaking assessment, and we have identified key opportunities for the next generation of academic speaking tests. Considering recent perspectives related to the conceptualization of academic speaking ability and domain analysis, we proposed a new approach to defining the construct of academic speaking ability and discussed recent developments in possible task types, scoring rubrics, scoring and score reporting approaches, and automated scoring technologies.

The landscape of global higher education has continued to evolve, including recent and dramatic disruptions caused by the global COVID-19 pandemic. These disruptions have led to significant changes in admissions practices, including broadened and, in some cases, undifferentiated acceptance of English language
tests that are of questionable quality for ensuring that second language international students have the speaking ability needed for English-​ medium academic success. While it may be that the contexts and purposes for speaking have evolved somewhat in the new era, it is also certainly the case that the need to communicate effectively in speech remains critical for all educational settings. Accordingly, we advocate a return to the fundamentals of academic English testing for admissions purposes, which should fulfill the goals of selecting students who have adequate English communication skills and enabling their success in a variety of English-​medium academic environments. Any practice that deviates from these fundamentals will eventually erode the capacity of English admissions tests to serve as a valuable tool to both equalize and expand opportunities for international students around the world.

References

Alco, B. (2008). Negotiating academic discourse in a second language: A case study and analysis. (Unpublished doctoral dissertation). Penn State University, Pennsylvania. Alderson, J. C. (2010). Cognitive diagnosis and Q-Matrices in language assessment: A commentary. Language Assessment Quarterly, 7(1), 96–103. Altakhaineh, A. R., Altkhayneh, K., & Rahrouh, H. (2019). The effect of the gender and culture of the IELTS examiner on the examinees' performance on the IELTS Speaking Test in the UAE context. International Journal of Arabic-English Studies, 19, 33–52. Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, U.K.: Oxford University Press. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford, U.K.: Oxford University Press. Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453–476. Bachman, L. F. (2007). What is the construct? The dialectic of abilities and contexts in defining constructs in language assessment. In J. Fox, M. Wesche, D. Bayliss & L. Cheng (Eds.), Language testing reconsidered (pp. 41–71). Ottawa, Canada: University of Ottawa Press. Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford, U.K.: Oxford University Press. Bailey, A. L., & Butler, F. A. (2003). An evidentiary framework for operationalizing academic language for broad application to K–12 education: A design document (CSE Tech. Rep. No. 611). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Bailey, A. L., Butler, F. A., LaFramenta, C., & Ong, C. (2001). Towards the characterization of academic language in upper elementary classrooms (Final Deliverable to OERI/OBEMLA, Contract No. R305B960002). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Bailey, A. L., Butler, F. A., Stevens, R., & Lord, C. (2007). Further specifying the language demands of school. In A. L. Bailey (Ed.), The language demands of school: Putting academic English to the test (pp. 103–156). New Haven, CT: Yale University Press. Bailey, A. L., & Heritage, M. (2008). Formative assessment for literacy, grades K–6: Building reading and academic language skills across the curriculum. Thousand Oaks, CA: Corwin/Sage Press.

Bakhtin, M. (1986). The problem of speech genres. In Speech genres and other late essays (pp. 60–​102). Austin, Texas: Austin University of Texas Press. Berry,V. (2007). Personality differences and oral test performance. Frankfurt, Germany: Peter Lang. Bernstein, J., Cohen, M, Murveit, H., Rtischev, D., & Weintraub, M. (1990). Automatic evaluation and training in English pronunciation. In Proceedings of the ICSLP-​90: 1990 International Conference on Spoken Language Processing (pp. 1185–​1188). Kobe, Japan. Bernstein, J., Van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27(3), 355–​377. Biber, D. (2006). University Language: A corpus-​based study of spoken and written registers. Amsterdam, The Netherlands: John Benjamins. Breiner-​Sanders, K. E., Lowe, P. Jr., Miles, J., & Swender, E. (1999). ACTFL Proficiency Guidelines Speaking: Revised 1999. Foreign Language Annals, 33(1), 13–​18. Bridgeman, B., Powers, D., Stone, E., & Mollaun, P. (2011). TOEFL iBT speaking test scores as indicators of oral communicative language proficiency. Language Testing, 29(1), 91–​108. Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-​constructing a better performance. Language Testing, 26(3), 341–​366. Brown, A. (2003). Interviewer variation and the co-​construction of speaking proficiency. Language Testing, 20(1), 1–​25. Brown, J. D. (2004). Performance assessment: Existing literature and directions for research. Second Language Studies, 22(2), 91–​139. Campbell, R., & Wales, R. (1970). The Study of Language Acquisition. In J. Lyons (Ed.), New horizons in linguistics (pp. 242–​260). Harmondsworth, U.K.: Penguin Books Ltd. Canale, M. (1983). From communicative competence to communicative language pedagogy. In J. C. Richard & R. W. Schmidt (Eds.), Language and communication (pp. 2–​14). London, U.K.: Longman. Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–​47. CEFR. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge, U.K.: Press Syndicate of the University of Cambridge. Chalhoub-​Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20(4), 369–​383. Chamot, A. U., & O’Malley, J. M. (1994). The CALLA handbook: Implementing the cognitive academic language learning approach. Reading, MA: Addison-​Wesley. Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32–​70). Cambridge, U.K: Cambridge University Press. Chapelle, C. (2010). The spread of computer-​assisted language learning. Language Teaching, 43(1), 66–​74. Chapelle, C., Grabe, W., & Berns, M. (1997). Communicative language proficiency: Definition and implications for TOEFL 2000 (TOEFL Research Memorandum No. RM-​97-​03). Princeton, NJ: Educational Testing Service. Chukharev-​Hudilainen, E., & Ockey, G. J. (in press). The development and evaluation of Interactional Competence Elicitor (ICE) for oral language assessments (TOEFL Research Report Series). Princeton, NJ: Educational Testing Service. Collier, V. P., & Thomas, W. P. (1989). How quickly can immigrants become proficient in school English? Journal of Educational Issues of Language Minority Students, 5, 26–​38. Cucchiarini, C., Strik, H., & Boves, L. (1997). 
Automatic evaluation of Dutch pronunciation by using speech recognition technology. In 1997 IEEE Workshop on Automatic Speech

Recognition and Understanding Proceedings (pp. 622–​629). Santa Barbara, CA: Institute of Electrical and Electronics Engineers. Cucchiarini, C., Strik, H., & Boves, L. (2002). Quantitative assessment of second language learners’ fluency: Comparisons between read and spontaneous speech. The Journal of the Acoustical Society of America, 111, 2862–​2873. Cummins, J. (2000). Language, power and pedagogy: Bilingual children in the crossfire. Clevedon, U.K.: Multilingual Matters. D’Mello, S. K., Dowell, N., & Graesser, A. C. (2011). Does it really matter whether students’ contributions are spoken versus typed in an intelligent tutoring system with natural language? Journal of Experimental Psychology: Applied, 17(1), 1–​17. Douglas, D. (2000). Assessing language for specific purposes. Cambridge, U.K.: Cambridge University Press. Ellis, R. (2005). Principles of instructed language learning. System, 33(2), 209–​224. Ellis, R., & Roberts, C. (1987). Two approaches for investigating second language acquisition in context. In R. Ellis, (Ed.), Second language acquisition in context (pp. 3–​29). Englewood Cliffs, NJ: Prentice Hall International. Ferris, D. (1998). Students’ views of academic aural/​oral skills: A comparative needs analysis. TESOL Quarterly, 32(2), 289–​318. Fulcher, G. (1987). Tests of oral performance: The need for data-​based criteria. English Language Teaching Journal, 41(4), 287–​291. Fulcher, G. (1996). Does thick description lead to smart tests? A data-​based approach to rating scale construction. Language Testing, 13(2), 208–​238. Fulcher, G. (2003). Testing second language speaking. Harlow, U.K.: Longman. Fulcher, G., & Marquez-​Reiter, R. (2003). Task difficulty in speaking tests. Language Testing, 20(3), 321–​344. Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale design. Language Testing, 28(1), 5–​29. García, G. E., McKoon, G., & August, D. (2006). Language and literacy assessment of language-​minority students. In D. August & T. Shanahan (Eds.), Developing literacy in second-​language learners: Report of the National Literacy Panel on Language-​ Minority Children and Youth (pp. 597–​624). Mahwah, NJ: Lawrence Erlbaum Associates Publishers. Grabowski, K. (2008). Measuring pragmatic knowledge: Issues of construct underrepresentation or labeling? Language Assessment Quarterly, 5(2), 154–​159. Gu, L., Davis, L.,Tao, J., & Zechner, K. (2020). Using spoken language technology for generating feedback to prepare for the TOEFL iBT® test:A user perception study. Assessment in Education: Principles, Policy & Practice. doi:10.1080/​0969594X.2020.1735995 Hakuta, K., & Beatty, A. (Eds.) (2000). Testing English-​ language learners in U.S. schools. Washington, DC: National Academy Press. Halliday, M. A. K., & Hasan, R. (1989). Language, context, and text: Aspects of language in a social-​semiotic perspective. Oxford, U.K.: Oxford University Press. Higgins, D., Xi, X., Zechner, K., & Williamson, D. (2011). A three-​stage approach to the automated scoring of spontaneous spoken responses. Computer Speech & Language, 25(2), 282–​306. Hilsdon, J. (1995). The group oral exam: Advantages and limitations. In J. Alderson, & B. North (Eds.), Language testing in the 1990s: The communicative legacy (pp. 189–​197). Hertfordshire, U.K.: Prentice Hall International. Hincks, R. (2010). Speaking rate and information content in English lingua franca oral presentations. English for Specific Purposes, 29(1), 4–​18.

Hornberger, N. H. (1989). Continua of biliteracy. Review of Educational Research, 59(3), 271–​296. Hulstijn, J. H. (2011). Language proficiency in native and nonnative speakers: An agenda for research and suggestions for second-​language assessment. Language Assessment Quarterly, 8(3), 229–​249. Hymes, D. (1971). Sociolinguistics and the ethnography of speaking. In E. Ardener (Ed.), Social anthropology and language (pp. 47–​93). London, U.K: Routledge. Hymes, D. (1972). Models of the interaction of language and social life. In J. Gumperz, & D. Hymes (Eds.), Directions in sociolinguistics:The ethnography of communication (pp. 35–​71). New York: Holt, Rinehart & Winston. Hymes, D. (1974). Foundations in sociolinguistics: An ethnographic approach. Philadelphia, PA: University of Pennsylvania Press. Iwashita, N., Brown, A., McNamara,T., & O’Hagan, S. (2008). Assessed levels of second language speaking proficiency: How distinct? Applied Linguistics, 29(1), 24–​49. Jacoby, S., & Ochs, E. (1995). Co-​construction: An introduction. Research on Language and Social Interaction, 28(3), 171–​183. Johnson, M., & Tyler, A. (1998). Re-​analyzing the OPI: How much does it look like natural conversation? In R.Young, & W. He (Eds.), Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. 27–​51). Philadelphia, PA: John Benjamins Publishing Company. Jones, J. F. (1999). From silence to talk: Cross-​cultural ideas on students’ participation in academic group discussion. English for Specific Purposes, 18(3), 243–​259. Kemp, J., & Fulcher, G. (2013). Performance decision trees: Developing domain-​specific criteria for teaching and assessment. In J. Wrigglesworth (Ed.), EAP within the higher education garden: Cross-​pollination between disciplines, departments and research (pp. 159–​170). Reading, U.K.: Garnet Publishing Ltd. Kennedy, S. (2008). Second language learner speech and intelligibility: Instruction and environment in a university setting (Unpublished doctoral dissertation). McGill University, Montreal, Quebec. Kim, S. (2006). Academic oral communication needs of East Asian international graduate students in non-​science and non-​engineering fields. English for Specific Purposes, 25(4), 479–​489. Krase, E. (2007). “Maybe the communication between us was not enough”: Inside a dysfunctional advisor/​ L2 advisee relationship. Journal of English for Academic Purposes, 6(1), 55–​70. LaFlair, G. T., & Settles, B. (2020). Duolingo English Test: Technical manual. Pittsburgh, Pennsylvania. Retrieved from https://​englishtest.duolingo.com/​research Lee, G. (2009). Speaking up: Six Korean students’ oral participation in class discussions in US graduate seminars. English for Specific Purposes, 28(3), 142–​156. Long, M. H., & Norris, J. M. (2000).Task-​based teaching and assessment. In M. Byram (Ed.), Encyclopedia of language teaching (pp. 597–​603). London, U.K.: Routledge. Loukina, A., Davis, L., & Xi, X. (2018). Automated assessment of pronunciation in spontaneous speech. In O. Kang & A. Ginther (Eds.), Assessment in second language pronunciation (pp. 153–​171). New York: Routledge. Luoma, S. (2004). Assessing speaking. Cambridge, U.K.: Cambridge University Press. McNamara, T. F. (1997). ‘Interaction’ in second language performance assessment: whose performance? Applied Linguistics, 18(4), 446–​466. McNamara, T. F., & Roever, C. (2006). Language testing: The social dimension (Language Learning and Monograph Series). Malden, MA; Oxford, U.K.: Blackwell Publishing.

Mondada, L., & Pekarek Doehler, S. (2004). Second language acquisition as situated practice. The Modern Language Journal, 88(4), 501–​518. Myles, J., & Chang, L.Y. (2003). The social and cultural life of non-​native English speaking international graduate students at a Canadian university. Journal of English for Academic Purposes, 2(3), 247–​263. Nakatsuhara, F. (2011). Effects of test-​taker characteristics and the number of participants in group oral tests. Language Testing, 28(4), 483–​508. Nevo, D., & Shohamy, E. (1984). Applying the joint committee’s evaluation standards for the assessment of alternative testing methods. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans. Norris, J. (1996). A validation study of the ACTFL Guidelines and the German Speaking Test. (Unpublished Master’s thesis). University of Hawaii at Manoa, Honolulu, HI. Norris, J. M., Brown, J. D., Hudson,T., & Yoshioka, J. (1998). Designing second language performance assessments (Vol. SLTCC Technical Report #18). Honolulu, HI: Second Language Teaching and Curriculum Center, University of Hawaii at Manoa. O’Sullivan, B. (2000). Exploring gender and oral proficiency interview performance. System, 28(3), 373–​386. O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-​task performance. Language Testing, 19(3), 277–​295. Ockey, G. J. (2009). The effects of group members’ personalities on a test taker’s L2 group oral discussion test scores. Language Testing, 26(2), 161–​186. Ockey, G. J., & Chukharev-​Hudilainen, E. (2021). Human vs computer partner in the paired oral discussion test. Applied Linguistics. https://doi.org/10.1093/applin/amaa067 Ockey, G. J., Gu, L., & Keehner, M. (2017). Web-​based virtual environments for facilitating assessment of L2 oral communication ability. Language Assessment Quarterly, 13(4), 346–​359. Ockey, G. J., Koyama, D., Setoguchi, E., & Sun, A. (2015).The extent to which TOEFL iBT speaking scores are associated with performance on oral language tasks and oral ability components for Japanese university students. Language Testing, 32(1), 39–​62. Plakans, L. (2012). Writing integrated items. In G. Fulcher, & F. Davidson, (Eds.), The Routledge handbook of language testing (pp. 249–​261). New York: Routledge. Plough, I., Banerjee, J., & Iwashita, N. (Eds.). (2018). Interactional competence. [Special issue]. Language Testing, 35(3). Purpura, J. E. (2004). Assessing grammar. Cambridge, U.K.: Cambridge University Press. Purpura, J. E. (2016). Assessing meaning. In E. Shohamy & L. Or (Eds.), Encyclopedia of language and education. Vol. 7. Language testing and assessment. New York: Springer International Publishing. doi:10.1007/​978-​3-​319-​02326-​7_​1-​1 Ramanarayanan, V., Evanini, K., & Tsuprun, E. (2020). Beyond monologues: Automated processing of conversational speech. In K. Zechner & K. Evanini (Eds.), Automated speaking assessment: Using language technologies to score spontaneous speech (pp. 176–​191). New York: Routledge. Recski, L. (2005). Interpersonal engagement in academic spoken discourse: A functional account of dissertation defenses. English for Specific Purposes, 24(1), 5–​23. Roever, C. (2011). Testing of second language pragmatics: Past and future. Language Testing, 28(4), 463–​481. Rosenfeld, M., Leung, S., & Oltman, P. K. (2001). The reading, writing, speaking, and listening tasks important for academic success at undergraduate and graduate levels (TOEFL Report No. MS-​21). 
Princeton, NJ: Educational Testing Service.

Santanna-​Williamson., E. (2004). A comparative study of the abilities of native and non-​native speakers of American English to use discourse markers and conversational hedges as elements of the structure of unplanned spoken American English interactions in three subcorpora of the Michigan Corpus of Academic Spoken English (Unpublished doctoral dissertation). Alliant International University, San Diego. Schleppegrell, M. J. (2004). The language of schooling:A functional linguistics perspective. Mahwah, NJ: Lawrence Erlbaum Associates Publishers. Skehan, P. (1998). A cognitive approach to language learning. Oxford, U.K.: Oxford University Press. Skyrme, G. (2010). Is this a stupid question? International undergraduate students seeking help from teachers during office hours. Journal of English for Academic Purposes, 9(3), 211–​221. Stoyanchev, S., & Stent, A. (2009). Concept form adaptation in human-​computer dialog. Proceedings of the SIGDIAL 2009 Conference: 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 144–​ 147). Stroudsburg, PA: Association for Computational Linguistics. Suendermann-​Oeft, D., Ramanarayanan, V., Teckenbrock, M., Neutatz, F., & Schmidt, D. (2016). HALEF: An open-​source standard-​compliant telephony-​based spoken dialogue system –​a review and an outlook. In Proceedings of Natural language dialog systems and intelligent assistants (pp. 53–​61). New York: Springer. Tannenbaum, R. J. (2018). Validity aspects of score reporting. In. D. Zapata-​Rivera (Ed.), Score reporting: Research and applications (pp. 9–​18). New York: Routledge. Taylor, L. (2000). Investigating the paired speaking test format. University of Cambridge ESOL Examination Research Notes, 2, 14–​15. The Common Core State Standards (CCSS). (n.d.) Retrieved from www.corestandards. org/​ELA-​Literacy/​SL/​11-​12 Tin, T. B. (2003). Does talking with peers help learning? The role of expertise and talk in convergent group discussion tasks. Journal of English for Academic Purposes, 2(1), 53–​66. Turner, C. E. (2000). Listening to the voices of rating scale developers: Identifying salient features for second language performance assessment. Canadian Modem Language Review, 56(4), 555–​584. Turner, C. E., & Upshur, J. (2002). Rating scales derived from student samples: Effects of the scale marker and the student sample on scale content and student scores. TESOL Quarterly, 36(1), 49–​70. Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49(1), 3–​12. Upshur, J. A., & Turner, C. E. (1999). Systemic effects in the rating of second-​language speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–​111. van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency interviews as conversation. TESOL Quarterly, 23(3), 489–​508. Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3), 325–​344. Vidakovíc, I., & Galaczi, E. (2013). The measurement of speaking ability 1913–​2012. In C. J. Weir, I. Vidakovíc, & E. Galaczi, (Eds.), Measured constructs: A history of Cambridge English language examinations 1913–​2012 (Studies in Language Testing 37, pp. 257–​346). Cambridge, U.K.: Cambridge University Press. Witt, S. M., & Young, S. L. (2000). Phone-​level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2–​3), 95–​108.

Xi, X. (2007). Evaluating analytic scoring for the TOEFL Academic Speaking Test (TAST) for operational use. Language Testing, 24(2), 251–​286. Xi, X. (2015, March). Language constructs revisited for practical test design, development and validation. Paper presented at the annual Language Testing Research Colloquium, Toronto, Canada. Xi, X., Higgins, D., Zechner, K., & Williamson, D. M. (2008). Automated scoring of spontaneous speech using SpeechRater v1.0 (ETS Research Rep. No. RR-​08-​62). Princeton, NJ: Educational Testing Service. Youn, S. J. (2010, October). From needs analysis to assessment: Task-​based L2 pragmatics in an English for academic purposes setting. Paper presented at the Second Language Research Forum, University of Maryland, College Park, Maryland. Young, R. F. (2008). Language and interaction: An advanced resource book. London, U.K.; New York: Routledge. Young, R. F. (2011). Interactional competence in language learning, teaching, and testing. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 426–​443). New York: Routledge. Young, R. F. (2012). Social dimensions of language testing. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 178–​193). New York: Routledge. Young, R. F. (2019). Interactional competence and L2 pragmatics. In N. Taguchi (Ed.), The Routledge handbook of second language acquisition and pragmatics (pp. 93–​ 110). New York: Routledge. Zapata-​Rivera, D., Kannan, P., & Zwick, R. (2018). Communicating measurement error information to teachers and parents. In D. Zapata-​Rivera (Ed.), Score reporting research and applications (pp. 63–​74). New York: Routledge. Zappa-​Hollman, S. (2007). Academic presentations across post-​secondary contexts:The discourse socialization of non-​native English speakers. Canadian Modern Language Review, 63(4), 455–​485. Zechner, K., & Evanini, K. (Eds.). (2020). Automated speaking assessment: Using language technologies to score spontaneous speech. New York: Routledge.

6
LOOKING AHEAD TO THE NEXT GENERATION OF ACADEMIC ENGLISH ASSESSMENTS

Carol A. Chapelle

The chapters in this volume express the imperative for continued research and development to keep pace with the demands for English language testing in higher education. Each one paints a piece of a picture of a higher education context that is increasing in sophistication and complexity while expanding geographically. The evolving character of higher education is fueled by societies with increased connectivity and mobility, both of which are fueled by participants using English as the vehicle for wider communication. Never has the need been so great for English language assessments that can help to convey the language qualifications of prospective candidates to decision makers as well as to provide test takers and educators with specific, actionable guidance about language learning needs. To offer a prospective view of the complex topic of academic English language testing as it is used in admissions testing in higher education, this volume is divided into areas of language modalities in chapters on assessment of academic listening, reading, speaking, and writing. The division reflects long-​standing divisions in language testing representing the “four skills” encompassed by overall language proficiency. The division is reflected in the way that scores have been reported for academic English tests for years, which has created expectations on the part of score users. Even though students in higher education seldom write without having read, or listen without speaking or writing, for example, modern testing of academic language has been conducted by dividing up academic language ability into four constructs often to create four-​part tests and deliver four separate scores in addition to a composite score. Recently, research has shown the value of the four part scores, which provide different or additional information and allow decision makers to use students’ profiles of ability for decision making (Ginther & Yan, 2018). Other research has found that the four part scores add to the interpretation that can be made from the total score alone (Sawaki & Sinharay, 2018).


Another value of the division is evident in the variety of perspectives brought to the understanding of academic language assessment by the review of research and issues in each area of language testing. This chapter capitalizes on the variety of perspectives to look to the future of academic language assessment by, so to speak, putting the construct back together. Focusing on admissions testing, each chapter looks to advances in theory and research on one area of English for academic purposes in addition to innovations in test design, tasks, and scoring methods to chart directions for academic English language testing. All four chapters address the need for interpretive construct frameworks built on theory and research in their respective areas, which are variously referred to as a “model of academic reading,” “construct of academic listening,” “construct of writing,” and “academic speaking ability.” All of the chapters recognize the need for specification of the language, strategies, and contexts relevant to academic language ability, but each offers a different insight as to how to conceptualize the specific modality. All of the chapters call for test tasks to be designed on the basis of ongoing analysis of task demands in higher education, but they differ in the lenses they recommend to describe the relevant features. All four identify the role of technology as critical, but each highlights a different aspect of what should be the larger project that incorporates technology. This chapter synthesizes the contributions of the four chapters in each of these areas.


6.1  Score Interpretation for Use

The four chapters in this volume follow the lead of the 2014 Standards for Educational and Psychological Testing (American Education Research Association [AERA], American Psychological Association [APA], & the National Council on Measurement in Education [NCME], 2014) by considering the construct that a test is intended to assess as central to the interpretation of test scores. The construct interpretation is intended to serve as the "conceptual framework" for the test, which "points to the kinds of evidence that might be collected to evaluate the proposed interpretation in light of the purposes of testing" (AERA/APA/NCME, 2014, pp. 1–2). Although the Standards give construct interpretation a central role, they provide little guidance about what constitutes a good construct interpretation. The lack of specific guidance undoubtedly reflects the fact that what makes a good construct depends on the nature of the interpretation and the intended test use.

What should a construct interpretation for tests of academic English consist of if the scores are to be used for decisions about university admissions and instruction? The chapters of this volume contribute answers to this question by presenting parts of a construct interpretation for academic English divided by language modality. If test score users wanted four scores, one for each of reading, listening, speaking, and writing, and were not concerned with academic English ability as a whole, the work would be well underway. In fact, the four scores are of
interest to many test users, but when the primary test use is to make admissions decisions based on an overall score of academic English proficiency, and in some cases, an overall score plus scores on individual skills such as speaking, it raises the question of how overall academic English proficiency should be defined. For example, should it be conceptualized as an aggregate of constructs of reading, listening, speaking, and writing with interrelationships among them explained, or an integrated construct in which the four skills are embedded in an articulated theoretical framework? One might hope to tackle the job of defining academic English proficiency by merging the construct frameworks presented in the previous chapters. However, the theories and research evidence presented for the respective constructs differ across the chapters, resulting in pieces of a puzzle to be completed if an integrated construct of academic language ability is the goal. The authors of each of the chapters do agree, however, that simplistic accounts of context-free language proficiency are inadequate for extrapolation of score meaning to academic language use. The speaking chapter (Chapter 5) lays out the overall frame for the puzzle with articulation of a need "to develop an integrated model that consists of descriptions of the contexts of academic language use and components of language abilities that are relevant to the academic domain" (this volume, p. x). An interpretive construct framework along the lines suggested in the speaking chapter is sketched in Figure 6.1.

FIGURE 6.1  Components of an interpretive framework for academic English proficiency (the figure shows a set of task types, Tasks A–H, between "Academic contexts of language use" and "Language knowledge, processes and strategies")


Beginning with the academic contexts of language use, an overall academic English proficiency framework might start with the broad categories of situations within academic contexts offered in the listening and speaking chapters (Chapters 3 and 5): the social-interpersonal, navigational, and general academic. However, the reading chapter (Chapter 2) calls for a survey of the domain of extrapolation internationally in view of the fact that the expansion of English-medium instruction in higher education has resulted in academic English tests being called on for decision making throughout the world. The chapter poses some broad questions that such a survey would need to address, which can be expanded to include not only academic reading, but academic English more broadly, as follows:

1. What are the most common and important practices taking place in English in higher education throughout the world?
2. What expectations for students' English use are held by professors and other relevant actors across different disciplines and in different international environments where English is used for instruction?

Such research is needed if claims about the interpretation and use of an academic English test are to be credible across contexts where English-medium instruction is offered. Such research may reveal diverse practices and expectations across the wide range of EMI throughout the world. For example, the social-interpersonal and academic-navigational contexts that are part of the proposed academic English construct framework may look very different in universities where English is used for conveying academic content and asking questions about what is conveyed (Owen et al., in press). Moreover, these conventions are shaped by the affordances of technology in digitally mediated instruction (Kyle et al., in press). Issues at the intersection of assessment and English as the language of wider communication in higher education have been posed by applied linguists for years (Cook, 1999; Jenkins & Leung, 2017; Lowenberg, 1993). These issues are not limited to the social and political dimensions of testing, but also intersect with the technical decisions to be made about the intended construct interpretation (McNamara & Roever, 2006).

The Tasks included in the construct framework refer not to the test tasks on a particular form of a test but to characterizations of types of tasks in academic contexts, such as reading a textbook in preparation for taking a multiple-choice test, working on a lab project and writing a report of the process and results, or listening to lectures to work collaboratively with classmates to produce a PowerPoint presentation. Types of tasks delimit the pool, or universe, of tasks to which test scores can be extrapolated. Such task characterizations require definitions that are sufficiently precise to express the types of tasks referred to, but not so restrictive as to limit the task types to only a few actual academic tasks. Each chapter in this volume introduces a perspective about how tasks should best be defined that highlights the critical aspects of tasks for its respective area.


The speaking chapter defines a task by including the domain in which it is embedded (one of the three academic contexts), the genre of the speech, the goal of communication, the number of interlocutors and nature of interaction, characteristics of the setting (actual or imaginary), the time and place of the event, and the roles and relationships of the participants. The listening construct framework, in contrast, does not specify any task features except for communication goals, which consist of a list of goals that listeners might adopt, such as understanding main ideas. Likewise, the reading construct framework details reader purpose or goals rather than specifying other characteristics of the tasks. The reader purpose and the characteristics of the text to be read are the two critical task components for construct interpretation. The authors of the writing construct framework see task as an important dimension but do not attempt to define task characteristics in a construct framework. A synthesis of these perspectives suggests that a definition of academic English proficiency should take into account the purposes and text types, the number of interlocutors and nature of interaction, characteristics of the setting (actual or imaginary), the time and place of the event, and the roles and relationships of the participants.

The language knowledge, processes, and strategies definitions across the four chapters also approach this aspect in different ways. The chapters on listening and reading identify knowledge and processes that are needed to work toward the readers' or listeners' communication goals, which are prompted by the task. The writing chapter acknowledges three types of processes: "micro-processes of word choice, spelling, punctuation, or keyboarding; composing processes of planning, drafting, revising, and editing; and macro-processes of fulfilling genre conventions, asserting a coherent perspective, or expressing membership in a particular discourse community" (Chapter 4, this volume, p. 108). The speaking chapter explicitly eschews a strictly context-free psycholinguistic processing approach to identifying abilities because it "renders any possible interpretations of speaking ability quite distant from language use in actual communicative situations" (Chapter 5, this volume, p. x). Instead, the authors adopt an assessment-based view of speaking ability as consisting of two primary divisions of organizational and sociocultural competence. The absence of linguistically based discussion of the academic language required for expressing and understanding content knowledge in academics is notable throughout the chapters. Development of this area will undoubtedly draw on research in discourse analysis investigating academic language (e.g., Biber, 2006; Gray, 2015; Hyland, 2009; Martin, Maton, & Doran, 2020) to explore an overall construct definition for academic language proficiency.
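To make the synthesis above concrete, the sketch below shows one way a task characterization of this kind could be recorded in machine-readable form, for example when assembling a pool of task types during domain analysis. It is a minimal illustration only; the field names are hypothetical and are not drawn from any of the chapters' frameworks.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AcademicTask:
    """One entry in a pool of academic task types (a task characterization, not a test item)."""
    context: str                       # e.g., "social-interpersonal", "navigational", "general academic"
    purpose: str                       # communication goal, e.g., "understand main ideas of a lecture"
    modalities: List[str]              # e.g., ["listening", "speaking"] for an integrated task
    genre: Optional[str] = None        # e.g., "lab report", "seminar discussion"
    text_types: List[str] = field(default_factory=list)
    interlocutors: int = 0             # 0 for non-interactive reading or listening
    interaction: Optional[str] = None  # e.g., "dialogic, face-to-face", "technology-mediated"
    setting: Optional[str] = None      # actual or imagined time and place of the event
    roles: Optional[str] = None        # roles and relationships of the participants

# Example: listening to lectures in order to build a group presentation
task = AcademicTask(
    context="general academic",
    purpose="comprehend lecture content to prepare a group presentation",
    modalities=["listening", "speaking"],
    genre="academic lecture and collaborative presentation",
    interlocutors=3,
    interaction="collaborative, technology-mediated",
    setting="course meeting during the semester",
    roles="classmates of equal status",
)
```

Even a simple structure of this kind makes it possible to check how evenly a prototype test samples the contexts, genres, and interaction types that a framework identifies.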

6.2  Task Design

The changes in academic English testing have prompted suggestions for potential advances in test tasks to be included in future tests. Advances have resulted in changes in theory, research, and practices that encompass both cognitive and social
dimensions of performance; recognize the broad range of language and strategic abilities required for performance of academic tasks; and draw upon accessible technology for constructing test tasks. Each of the chapters makes suggestions about test tasks for future academic English tests. These range in their levels of precision and definitiveness about the desirability of certain task types and task features for operational testing, but all are relevant for consideration in tests of academic English.

The writing chapter is most definitive in its recommendations about the need for test tasks to specify the expected purposes, audiences, and domains for writing tasks. The authors also argue for an increased coverage of written genres and contexts for academic writing by increasing the number and types of writing tasks appearing on tests, including additional types of content-responsible tasks. These recommendations extend beyond writing tasks used in major tests of English for academic purposes, which already require test takers to write in response to aural and written texts. These recommendations contribute to the goal of defining academic English assessment tasks as multimodal.

The listening chapter tackles some difficult testing issues in recommending changes in listening test tasks. First, the authors recommend the expanded research and development of multimodal listening tasks that allow or even require test takers to use visual information in conjunction with aural input to comprehend meaning. Discussion about the appropriateness of visual components of tasks for assessment of listening is decades old, but score interpretations that explicitly take into account the context of listening and the nature of listening tasks are useful for clarifying that the intended score meaning is not limited to ability in 'blind' listening under all circumstances. A second task recommendation is that some test tasks should take into account interactivity in academic situations and activities by requiring test takers to listen and respond in a dialogic format. Third, the authors recommend exploring the use of diverse speaker accents to reflect those in higher education. Finally, the authors raise the need to include tasks that assess pragmatic understanding.

The reading chapter recommends tasks that assess reading to learn, either from individual texts or from multiple texts, the latter also requiring integration of information across multiple texts. The authors also recommend the use of new media, which presumably includes multimedia presentations. They suggest the integration of reading assessment with writing tasks to create a variety of read-write tasks, which may hold the potential for circumventing the problem that uneven background knowledge across test takers can affect writing performance. The idea is that, by providing test takers with knowledge in one or more texts, good readers would have sufficient content to write about. The task score would be interpreted as ability to write about what had been learned from reading, that is, reading to write, rather than reading or writing alone. In short, the ideas suggested for exploration in the reading chapter take further strides toward integrated academic English assessment.


The speaking chapter calls for investigation of novel speaking tasks and rubrics that expand the constructs measured to include higher-order skills such as interactional competence and pragmatic competence. It also argues that academic speaking tasks should take into account communication in academic contexts, which includes spoken interaction mediated by technology. The suggestions proposed for exploration emphasize tasks requiring dialogic oral interaction. These include interactions that could be accomplished by having an invigilator interact with the test taker either face-to-face or via video. Other task concepts would build upon the research into group oral interviews with tasks requiring one or more test takers to work together to carry out role plays or group discussions either face-to-face or remotely through video. Still other types of oral interaction tasks to be explored are tasks requiring spoken interaction with a simulated interlocutor. Exploration of all of these task types will work toward the larger project of assessing academic English ability.

6.3  Validation of Interpretations and Uses of Test Scores

The proposals for new task types to assess academic English need to undergo validation research because evidence is needed to support their acceptability in producing scores for use in decision making. The first stage of such research has begun with the chapters in this volume, which have started to survey the theory and research in the field to identify relevant concepts and practices for understanding the abilities required for success. This domain analysis begins with an evidence-centered design (Mislevy, Steinberg, & Almond, 2003) process that provides input to development of the type of interpretive framework outlined in Figure 6.1 as well as the task patterns required to create prototype tasks. The interpretive framework and prototype tasks play important roles in the validation process.

Validation consists of multiple types of research, each of which investigates the tenability of a claim to be made about test score interpretation and use. The claims one would want to support for an academic English test to be used in higher education admissions could be stated generally as follows:

1. The domain of academic English in international higher education has been adequately investigated, yielding a useful conceptual framework for task development and for test score interpretation and use.
2. Test tasks provide appropriate observations of test taker performance.
3. Test scores accurately summarize relevant performance.
4. Test scores reflect performance consistency across tasks within the defined universe of generalization (the domain of generalization).
5. Test scores indicate the intended construct of language knowledge, processes, and strategies (the construct domain).
6. Test scores represent the target performance in the academic domain (the domain of extrapolation).
7. Test scores are useful for making decisions about admissions and learning for students of English as an additional language in higher education.
8. Test score use results in equitable admissions standards and high-quality English-medium education.

These claims, which serve as the basis for the validity argument (Chapelle, Enright, & Jamieson, 2008; Kane, 2013), rely on preliminary review of the academic domain and development of an interpretive framework. The connection between the claims and framework is evident in claims one, four, five, and six, which make direct reference to elements in the interpretive framework, as shown in Figure 6.2. The first claim is about the adequacy of the investigations that provide the basis of the interpretive framework itself. The fourth claim makes reference to the domain of generalization, which refers to the universe of tasks defined by the task framework. The framework identifies relevant tasks based on a conceptual task framework whereas the claim states that performances across tasks sampled from the defined universe will produce scores that reflect performance consistency. The fifth claim states that the test scores reflect the construct of language knowledge, processes, and strategies (the construct domain). The sixth claim is that scores represent the target performance in the academic domain (the domain of extrapolation).

FIGURE 6.2  Aspects of interpretation useful for constructing an interpretation/use argument for a test of academic English for admissions in higher education (the figure shows the pool of tasks A–H between the academic contexts of language use and language knowledge, processes and strategies, annotated with the Domain of Extrapolation, the Domain of Generalization, and the Construct Domain of knowledge, processes and strategies)

The other claims—about task quality (2), response scoring (3), test use (7), and consequences (8)—will be integral to the validity argument for use of a test
of academic English intended for higher education. Research about task quality could help to investigate assumptions about the quality of the task types suggested in each of the four chapters. For example, the reading chapter suggests the need to better understand the factors that affect task difficulty. The listening chapter recommends research to investigate test takers' cognitive processes during task performance to understand what knowledge, strategies, and abilities the tasks assess. Such research is important at early stages of task development to find evidence that new task types are successful at eliciting the intended performance, particularly for assessing new aspects of the construct such as pragmatic ability. The writing chapter explains the need for research focusing on response scoring to investigate assumptions made about the quality of various approaches to scoring test takers' linguistic responses. The speaking chapter also recommends research on automated scoring to investigate the quality of the scoring for obtaining accurate summaries of test takers' performance. In addition, it identifies research on how the use of results produced by various scoring methods to provide feedback to test takers about their responses can be improved. This is one of a few examples of suggestions pertaining to research that would be needed to support the claim about test usefulness (7).

Each chapter identifies research required to develop a good interpretive framework that will serve validation research. As illustrated above, a usefully detailed interpretive framework serves in bringing coherence to lists of ideas resulting from domain analysis activities and provides the foundation for ultimately developing a validity argument by first outlining an interpretation/use argument (Kane, 2013).

Validation, the process of justifying test score interpretation and use, has an amplified significance in the current socio-economic context of higher education. The conceptual background laid by stating and investigating claims that are important for the meaning and use of the test scores is essential in this context. Still, this technical work is a different matter than communicating about the test score interpretation and use to a wide range of score users and decision makers. The job of communicating the meaning of test scores and the strength of support necessarily starts with a coherent technical understanding behind the claims about validity. However, communication needs to extend beyond the technical to the actual test users, among whom assessment literacy is notoriously low. This issue is extremely important in the international, unregulated marketplace, where tests can end up being selected based on considerations other than their validity for making the intended decisions.
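Research on scoring quality of the kind just described is often operationalized, in part, as agreement between automated and human scores. The snippet below is a minimal sketch of one common agreement statistic, quadratic weighted kappa, computed on hypothetical ratings; it illustrates the general idea and is not a description of how any of the tests discussed in this volume evaluate their scoring engines.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_levels):
    """Chance-corrected agreement between two raters on an ordinal score scale."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    idx = np.arange(n_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_levels - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical scores on a 0-4 rubric for ten spoken responses.
human_scores   = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
machine_scores = [3, 2, 3, 3, 2, 2, 4, 4, 2, 3]
print(round(quadratic_weighted_kappa(human_scores, machine_scores, n_levels=5), 3))
```

A statistic like this summarizes accuracy of score summaries but says nothing about the usefulness of feedback derived from them, which is why the chapters call for both strands of research.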

6.4  Technology in Higher Education and Language Assessment

The impact of technology encompasses every aspect of academic English proficiency testing and all facets of higher education. Collectively, the chapters point to the important issues raised by the use of technology for communication, language
analysis, teaching, and learning in higher education as well as for test development, scoring, score reporting, and validation in language assessment. Twenty years ago, technology used for test delivery was viewed as a potential source of systematic bias for test takers, who varied in their levels of computer familiarity (Taylor, Jamieson, Eignor, & Kirsch, 1998). Going forward, in contrast, technology has become an essential resource for conceptualizing language constructs that serve as a basis for the interpretive framework, for developing test tasks, for scoring, for score reporting, for carrying out validation research, and even for reshaping the marketplace where admissions testing operates.

Technology plays an important role in the framework for score interpretation because communication and learning in higher education are mediated through the use of information and communication technologies, requiring both language and technology skills. Therefore, each aspect of the framework needs to take into account the influence of technology in defining what is relevant to interpretation and use of the test scores. In the academic context of language use, the top part of the interpretive framework outlined in Figure 6.1, it is necessary to make explicit the ways in which technology mediates communication and learning in academic contexts and the ways in which these technologies in higher education affect the nature of the linguistic tasks that students need to be able to accomplish (e.g., Kyle et al., in press). The reading chapter points out that academic study entails negotiating the variety of multimedia sources and non-linear text for learning as well as the social communication of synchronous and asynchronous exchanges. It also identifies several skills and strategies associated with use of such information and communication tools, including the need for learners to evaluate the quality and relevance of sources. The fundamental changes in higher education and their consequences prompted the authors of the reading chapter to propose that questions such as the following be addressed in future domain analysis with a goal of updating the interpretive framework:

1. What is the current role of information communication technology (ICT) in academic study?
2. What technologies and approaches are used in both distance education and as part of regular education?
3. What are the current uses of digital media and non-linear text in university teaching and learning?
4. Are there developments in the use of technology for learning that are taking place but are not adequately considered in test score interpretation and use?

Inclusion of the technology dimensions of each part of the interpretive framework should clarify some of the roles that technology should play in the design of test tasks. Technology has become central to operational testing because of the various technologies used in task design, test delivery, scoring, and score reporting. Such
technology use represents more than an efficiency in delivery; it can offer valuable affordances that help testers address the demands of testing academic language ability. On the surface, perhaps the most obvious benefit of technology-mediated test tasks is that they are needed to simulate the types of interactions that students engage in when they are studying at the university. Beyond this most evident need for technology is the range of test-taking conditions that can be created to simulate critical features of the learning episodes where language is used. For example, through the use of interactive video or animation, test takers can engage in simulation of an experiment and then write a brief report describing the process and explaining the outcome. A combination of images, audio, and interaction can simulate key learning moments of a classroom lecture where a professor is using PowerPoint slides and students are responding to questions. Such tasks hold the potential for assessment of multimodal academic language ability.

The back end of technology-mediated tasks experienced by test takers also has significant capabilities, such as the potential for adaptive delivery, automatic item generation, and automated scoring. Adaptive delivery refers to the use of test taker responses to select subsequent test tasks during test taking. Adaptivity typically refers to adaptive selection from a pool of intact tasks during test delivery, whereas automatic item generation refers to automated creation of tasks from a pool of task parts and rules for task assembly; a schematic sketch of adaptive selection appears at the end of this section. Of these three significant potentials, the chapters on speaking and writing in this volume both emphasized the value of automated scoring and therefore the need for improving its quality. In short, the potential positive impact of technological advances in operational testing calls on an expanded range of cross-disciplinary expertise to create the intended task conditions for test takers and the most efficient and fair delivery system.

Technology has also expanded the options for gathering data required for validation research through the investigation of response processes based on data obtained from screen recording, eye tracking and keystroke logs, and interaction data such as response latencies and selection strategies. Response process data were identified years ago as one source of evidence for investigating the validity of test score interpretation and use (Messick, 1989). Depending on the nature of the data, such evidence may be relevant for backing assumptions required for supporting several of the claims in a validity argument. However, until recently little use had been made of such data because of the difficulty or impossibility of gathering detailed data about test-taking behaviors. In this sense, computer technology opens a new avenue for investigation of test performance to support validity arguments (Ercikan & Pellegrino, 2017).

Technology is also largely responsible for the changing social and business environment of English language testing in higher education. The role of technology in creating connections among disparate parts of the world is perhaps nowhere more dramatically felt than in higher education. The time when history and geography prompted institutions to select which tests to accept for admissions has passed. In the global business of English language testing, institutions see themselves as
having options for test selection and in many cases pass the decision making to the applicants, who are left to choose which test scores they would like to submit. In this context, the profession is grappling with an assessment culture in which few people have even fundamental knowledge about assessment. Lacking any substantive knowledge about test quality, decision makers at universities may select English language test scores to be used in admissions on the basis of how many applicants their selection of tests is likely to yield, rather than the demonstrated validity of the test scores for making admissions decisions. When decisions about test selection are passed to the applicants themselves, decisions are likely to be made on the basis of ease of access, affordability, test preparation opportunity, perceptions of the ease of the test, the threat of test security being compromised, and expected success resulting from its use in admissions. In this new reality, access to multiple tests creates potentially positive circumstances for choice, but only to the extent that knowledgeable choices are made.
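As noted above, adaptive delivery selects each next task in light of the test taker's responses so far. A generic way to do this is to choose, at each step, the unadministered item that is most informative at the current ability estimate under an item response theory model. The sketch below illustrates that idea with a Rasch model and a hypothetical item pool; it is a simplified illustration, not a description of any operational test's selection algorithm.

```python
import numpy as np

def rasch_information(theta, difficulty):
    """Fisher information of a Rasch (one-parameter logistic) item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-(theta - difficulty)))
    return p * (1.0 - p)

def next_item(theta_hat, difficulties, administered):
    """Index of the most informative item not yet administered."""
    info = [
        rasch_information(theta_hat, b) if i not in administered else -np.inf
        for i, b in enumerate(difficulties)
    ]
    return int(np.argmax(info))

# Hypothetical pool of ten calibrated items (difficulties on the logit scale).
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
administered = set()
theta_hat = 0.0  # provisional ability estimate

for _ in range(3):
    item = next_item(theta_hat, difficulties, administered)
    administered.add(item)
    print(f"administer item {item} (difficulty {difficulties[item]:+.1f})")
    # In a real system, the response would be scored here and theta_hat updated
    # (e.g., by maximum likelihood) before selecting the next item.
```

Operational systems add exposure control, content balancing, and more sophisticated ability estimation, but the selection logic above is the core of what "adaptive delivery" means.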

6.5  Directions for Academic Language Assessment

The research directions identified in each of the chapters in this volume point toward important areas of inquiry for developing a stronger foundation for assessing each of the language modalities for academic English testing. These lines of inquiry can also inform the larger project of academic English proficiency assessment, but that larger project, developing a construct framework that takes into account the integrated use of language modalities in higher education, would benefit from progress in three areas.

First, the international contexts in which academic English is used and test scores are needed should be investigated. In applied linguistics, a growing body of work in English-medium instruction internationally provides important input to this line of inquiry (Macaro et al., 2018). In view of the foundational role of the academic context of language use in the interpretive framework, this research needs to be consulted and augmented to support decisions about how to characterize an appropriate domain of extrapolation. For example, are the three types of academic settings proposed in the speaking chapter, the nature of the reading purposes in the reading chapter, and the types of writing test tasks proposed in the writing chapter relevant across international academic settings? The evolution of English language testing as both business and profession into an international arena of activity creates conditions for understanding higher education in international contexts (Xi, Bridgeman, & Wendler, 2014); however, the project of integrating international perspectives into interpretive frameworks remains.

Second, the unit of analysis and characteristics for tasks in the interpretive framework need to be standardized to integrate insights from the four chapters. Task frameworks for second language learning and assessment have been developed by multiple authors in the past, most notably the test method facets of Bachman (1990) and the test task characteristics of Bachman and Palmer (1996, 2010). Rather
than adopting or adapting one such framework for defining tasks, each of the four chapters in this volume creates a task framework based on interpretation of research from its respective modality of language use. In second language research and teaching, the variety of theoretical viewpoints brought to bear on the design and implementation of tasks has been productive (Ellis, 2015). In contrast, inconsistency in the way tasks are conceptualized across the modalities for language assessment appears to create a challenge for developing a framework for academic language ability. For any English language test yielding a total score that is to be interpreted as academic language ability, there should be a means of characterizing the tasks that define the domain of language use to which the scores can be extrapolated. This volume leaves open the question of the commensurability of the task frameworks across the chapters, which would be needed to conceptualize academic language tasks. Are the constructs most usefully defined as incommensurable? Is commensurability among constructs contributing to a total score needed for an interpretive framework for the total score? These are some of the questions raised by the chapters in this volume.

Third, the expression "academic language" in the interpretive framework is intended to delimit the scope of meaning of the test scores and to make their relevant uses transparent. The reading and writing chapters in particular hint at the need to better characterize the nature of academic language by noting that academic language tasks need to be concerned with learning academic content. The writing chapter most directly recommends that academic English tests include "content-responsible tasks," which refer to tasks requiring certain content-related qualities such as veracity of statements, accuracy of interpretations, and logic in argumentation. The speaking chapter also discusses key notions regarding the role of content in construct definition, especially for tests of English for specific purposes (Douglas, 2000; Purpura, 2004, 2016). Academic language is about content such as business, biology, engineering, and history. Because academic language is for creating and learning academic knowledge, an interpretive framework for academic English test scores should have a means of highlighting the language choices users make to construe such processes and products in higher education. Long considered a non-linguistic concern to be worked around, content in language tests cannot be ignored in academic English tests whose tasks are intended to simulate the demands of academic learning tasks (Sato, 2011). Recent research has begun to probe relevant aspects of content, for example in response to integrated TOEFL iBT tasks (Frost et al., 2019). This project requires a theory of knowledge (Maton, 2014) that couples with linguistic analysis to provide a meta-language for describing various dimensions of knowledge that are pertinent to language use.

Current research in applied linguistics engages multiple theoretical perspectives and analytic means to contribute to addressing the project begun in the four chapters in this volume. The challenges of incorporating such work into useable frameworks, test tasks, and validity arguments in language assessment have clearly
begun to be addressed in the chapters in this volume. The volume charts directions for creating better academic English assessments for the future.

References

American Education Research Association, American Psychological Association, & the National Council on Measurement in Education (AERA/APA/NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Education Research Association.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, U.K.: Oxford University Press.
Bachman, L., & Palmer, A. (1996). Language testing in practice. Oxford, U.K.: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford, U.K.: Oxford University Press.
Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam, The Netherlands: John Benjamins.
Chapelle, C. A., Enright, M. E., & Jamieson, J. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language. London, U.K.: Routledge.
Cook, V. J. (1999). Going beyond the native speaker in language teaching. TESOL Quarterly, 33(2), 185–209.
Douglas, D. (2000). Assessing language for specific purposes. Cambridge, U.K.: Cambridge University Press.
Ellis, R. (2015). Understanding second language acquisition (2nd ed.). Oxford, U.K.: Oxford University Press.
Ercikan, K. W., & Pellegrino, J. W. (Eds.). (2017). Validation of score meaning in the next generation of assessments: The use of response processes. New York: Routledge.
Frost, K., Clothier, J., Huisman, A., & Wigglesworth, G. (2019). Responding to a TOEFL iBT integrated speaking task: Mapping task demands and test takers' use of stimulus content. Language Testing, 37(1), 133–155.
Ginther, A., & Yan, X. (2018). Interpreting the relationships between TOEFL iBT scores and GPA: Language proficiency, policy, and profiles. Language Testing, 35(2), 271–295.
Gray, B. (2015). Linguistic variation in research articles: When discipline tells only part of the story. Amsterdam, The Netherlands: John Benjamins.
Hyland, K. (2009). Academic discourse: English in a global context. London, U.K.: Continuum.
Jenkins, J., & Leung, C. (2017). Assessing English as a lingua franca. In E. Shohamy et al. (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed., pp. 1607–1616). New York: Springer International.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
Kyle, K., Choe, A-T., Egushi, M., LaFlair, G., & Ziegler, N. (in press). A comparison of spoken and written language use in traditional and technology mediated learning environments (TOEFL Research Report Series). Princeton, NJ: Educational Testing Service.
Lowenberg, P. (1993). Issues of validity in tests of English as a world language. World Englishes, 12, 95–106.
Macaro, E., Curle, S., Pun, J., An, J., & Dearden, J. (2018). A systematic review of English medium instruction in higher education. Language Teaching, 51(1), 36–76.
Martin, J. R., Maton, K., & Doran, Y. J. (2020). Accessing academic discourse: Systemic functional linguistics and legitimation code theory. London, U.K.: Routledge.
Maton, K. (2014). Knowledge and knowers: Towards a realist sociology of education. London, U.K.: Routledge.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Oxford, U.K.: Blackwell Publishing.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan Publishing Co.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62.
Owen, N., Shrestha, P., & Hulgren, K. (in press). Researching academic reading in two contrasting EMI (English as a medium of instruction) university contexts (TOEFL Research Report Series). Princeton, NJ: Educational Testing Service.
Purpura, J. E. (2004). Assessing grammar. Cambridge, U.K.: Cambridge University Press.
Purpura, J. E. (2016). Assessing meaning. In E. Shohamy & L. Or (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (pp. 33–61). New York: Springer International Publishing. doi:10.1007/978-3-319-02326-7_1-1
Sato, T. (2011). The contribution of test-takers' speech content to scores on an English oral proficiency test. Language Testing, 29(2), 223–241.
Sawaki, Y., & Sinharay, S. (2018). Do the TOEFL iBT® section scores provide value-added information to stakeholders? Language Testing, 35(4), 529–556.
Taylor, C., Jamieson, J., Eignor, D., & Kirsch, I. (1998). The relationship between computer familiarity and performance on computer-based TOEFL test tasks (TOEFL Research Reports-61). Princeton, NJ: Educational Testing Service.
Xi, X., Bridgeman, B., & Wendler, C. (2014). Tests of English for academic purposes in university admissions. In A. J. Kunnan (Ed.), The companion to language assessment (pp. 318–337). Oxford, U.K.: John Wiley & Sons, Inc.


INDEX

academic: communication 4–​17, 157, 175; discourse 4–​17, 76–​81, 86 academic content 6, 72, 203, 212 academic domain 16, 50, 61–​72, 83–​88, 89, 134, 152, 159–​169, 172, 174–​185, 202, 206–​207 academic English assessment 1–​17, 187, 200–​213 academic English proficiency see English for Academic Purposes academic listening 15, 61–​99, 91, 200–​201 academic reading 14, 22–​52, 35, 201, 203 academic speaking 16, 96, 152–​193, 201, 206 academic vocabulary 5 academic writing 15–​16, 107–​140, 206 accent 79–​85, 86, 89, 90, 91, 93–​95, 205 admissions: decisions 1–​4, 10–​13, 153, 202, 211; testing 1, 17, 177, 188, 200–​209 analytic rating scale 111, 115, 122, 166, 180 artificial intelligence 8, 12, 174 audience 5, 15, 31, 41, 43, 68, 89, 108, 116, 132–​139, 156, 160, 169, 176–​184, 191, 205 automated scoring 11, 107, 119–​140, 173, 184–​192, 208–​210 Bachman, L. 14, 61–​65, 73, 78, 96, 108, 153–​168, 180, 211 background knowledge 28–​31, 35, 36–​49, 62, 66–​67, 127, 205

Barkaoui, K. 111, 125–​126 Biber, D. 6, 9, 76, 112–​119, 129, 182, 204 blended learning 71–​72, 85 Britt, M. 31, 36–​46 Buck, G. 61–​76, 95–​97 CAEL (Canadian Academic English Language Test) 10 Canale, M. 153–​155 CBAL (Cognitively Based Assessment of, for, and as Learning) 42, 47 CEFR (Common European Framework of Reference for Languages) 68, 70, 95, 111, 116, 117, 124, 164 Chalhoub-​Deville, M. 109, 153–​154 Chapelle, C. 11, 16, 51, 66, 70, 86, 120, 123–​124, 153–​154, 189, 207 cognitive complexity see cognitive load cognitive load 44, 62, 164 cognitive processes 22–​30, 46, 62–​65, 96, 110, 208 cognitive processing 15, 62–​66, 96 collaborative listening 70, 75, 95, 97 Common Core State Standards 6, 68–​69, 168, 188 communication goals 87, 170 complexity: of listening 77, 84, 86; of reading 22, 26, 38–​39, 51; of speaking 159, 170–​177, 172, 183, 191, 204; of writing 107–​108, 112, 128 component abilities 23, 108


component models of language proficiency 15, 34–​39, 64–​67, 74, 108–​109, 153–​159, 163–​167, 185, 202 conceptualizing: academic listening 62, 70, 91; academic reading 25–​26; academic speaking 153; academic writing 107 consequences of admissions testing 1, 10, 13, 117–​121, 123, 131, 153 consequential validity 12, 13, 207 construct domain 206–​207 construct representation 10, 61, 91–​94 Construction–​Integration Model 27–​34 construct-​irrelevant variance 44, 64, 73, 76, 79, 84, 125, 127, 167 construct underrepresentation 92, 158 contextual facets 15, 88, 98, 162, 170 contextual factors 66, 87, 96–​97, 130–​133, 154, 159–​164, 174, 179, 185, 191 corpora 5, 9, 80–​81, 86, 114–​115, 130, 139 corpus analysis 5–​9, 75–​81, 111–​113, 130 coverage: of content 10, 186; of domains 107, 134–​135, 173; of genres and contexts 16, 107, 205, 132, 134, 140; of web-​based information 42 Cumming, A. 13, 15, 108, 110, 115–​121, 125, 127, 129, 137–​138 curriculum standards 68–​69, 107, 115, 139 declarative knowledge 64–​65 decoding 24, 27, 37, 45, 63 diagnostic assessments 34, 111, 119, 123, 182, 188, 192 Differential Item Functioning (DIF) 83 digital literacy 40–​43, 47–​50 distance learning 48, 71–​2, 85, 93, 187, 209 Documents Model (DM) 27, 31–​34, 38 domain see academic domain domain analysis 61, 165–​169, 182, 186, 192, 206, 208 domain of extrapolation 203–​211, 207 domain of generalization 206–​207 dynamic assessments 119 Eckes, T. 3, 12 EAP (English for academic purposes) 4–​8, 13, 15, 71–​72, 86–​88, 87, 97, 107, 121, 124, 131, 158, 167, 186, 201, 205 EIL (English as an international language) 78–​79 ELF (English as a Lingua Franca) 3, 78–​83, 86, 93, 186

EMI (English-​medium instruction) 7–​9, 12, 16, 61, 70–​71, 80, 93, 153, 168, 177, 186–​188, 193, 203, 207, 211 English-​medium higher education see EMI ESP (English for specific purposes) 4, 164 ETS (Educational Testing Service) 9, 11, 13, 42, 49, 88, 121–​122, 171–​172, 178 evaluation inference 11, 123–​125 expertise 25–​26, 124, 135, 210 explanation inference 123, 128 explanatory writing 135 expressive writing 136 extrapolation inference 11, 123, 129, 130, 159, 164, 182, 202–​203, 206 eye-​tracking 77, 96 fairness 1, 61, 92, 93, 117, 127, 168 familiarity hypothesis 83 Field, J. 62–​64 Flowerdew, J. 66, 95, 97 fluency 173 formative purposes for assessment 48, 107, 117–​120, 140, 183, 188 foundational skills 6, 24, 45, 87, 170–​171, 172, 180, 183 four skills 3, 6, 10, 11, 14, 200, 202 Fulcher, G. 68, 70, 164, 166, 181–​182, 189 functional: knowledge 66, 73, 110; meaning 154–​156, 175–​177 generalization inference 86, 123–​126, 154, 167, 206–​207, 207 genre analysis 5 genres 5–​6, 12, 15–​16, 38–​39, 70–​81, 85, 97, 98, 107–​117, 120, 126–​140, 162, 162, 169, 205 Grabe, W. 33, 47–​48, 137 grades 6, 69–​70, 120, 129, 130 grammatical accuracy 80, 114–​129, 137–​139, 171–​176, 191–​192 grammatical complexity 84, 86, 114–​117, 128–​129, 184, 191 grammatical knowledge 65, 155–​156, 164, 166 GTEC (Global Test of English Communication) 10 higher-​order skills 25, 63, 87, 170–​174, 172, 183, 192, 206 high-​stakes testing 1, 10–​15, 22, 25, 46, 51, 88, 117, 121–​126, 178 holistic rating scale 108, 111, 121–​129, 166, 180, 183


human scoring 11, 109–​110, 114–​115, 121–​128, 138–​139, 185, 192 hyperlinks 43–​44, 47, 49 ICT (Information Communication Technology) 48, 50, 209 IELTS (International English Language Testing System) Academic, 10, 34, 84, 88–​89, 90, 118, 121–​122, 128–​131, 170–​174, 177, 184 integrated skills 11–​14, 47, 67, 72, 85, 91, 159, 202, 205, 211, 212; in speaking tasks 68, 89, 160, 171, 186, 188; in writing tasks 46–​47, 89, 107, 115, 121–​140 interactional authenticity 15, 78, 93, 97 interactional competence 7, 12, 157–​158, 164, 173–​174, 177–​182, 186, 189–​192, 206 interactionalist approach (to construct definition) 15–​16, 61, 66, 70, 86, 152–​154, 161–​165 interactionist principles 107–​110, 134 interactive listening 65, 67, 75–​76, 85, 86, 88, 89, 92 interactive tasks 7, 9, 17, 40, 47, 75–​76, 95, 174, 178, 188, 190, 192 internationalization of higher education 82 interpretive framework 158, 201, 202, 206–​212 Kane, M. 11, 68, 207–​208 Kintsch, W. 37, 45, 136 Knoch, U. 112 119, 124, 136–​137 L1-​L2 reading 24 Landscape Model 27–​30, 35 language competence 65, 153 language learning 13, 24, 61, 66, 76, 153, 159, 181, 200, 211 language modality 201, 212 language use 2–​11, 62, 65–​68, 80, 84, 95, 98, 111–​112, 125, 134, 171, 177, 185–​186, 191, 202, 204; contexts of 16–​17, 61, 78, 86, 96–​97, 152–​170, 162, 189–​190, 202, 207, 209–​212 large-​scale testing 9, 12, 15, 22, 25, 39, 40, 43, 45–​51, 73, 75, 84, 90, 97, 109, 111, 117, 120, 123–​124, 128, 134, 159, 166, 178, 189, 191 lexical complexity 115, 119 Lexical Quality Hypothesis 32–​33

linguistic complexity 84, 86, 117 see also grammatical complexity, lexical complexity Long, M. 4, 153 meaning-​oriented approach to language ability 154, 171 medium of communication 40, 161, 162, 168–​170, 185 mental model 28–​31, 35, 38, 46, 64 metacognitive processes and strategies 42, 64–​67, 153, 155, 163, 186 Mislevy, R. 109, 119, 206 multi-​faceted approach to construct definition 170 multilingualism 8, 80, 86 multimedia 14, 23, 41, 44–​45, 93, 133, 136, 190, 205, 209 multimodal ability 76, 86, 92, 205, 210 multimodality see multimodal ability multiple texts 15, 31, 36, 38–​39, 45–​51, 205 multi-​trait scale see analytic rating scale navigational tasks 7–​9, 87, 168–​174, 203 new scoring technologies 184–​185 new task types for assessment 14, 16, 46, 153, 174, 188, 191, 206, 208 non-​linear text 14, 40–​41, 43–​49, 209 non-​verbal communication 27, 75–​77, 85, 92, 178, 182 normative purposes 107, 117–​120, 128, 140 Norris, J. 10, 109, 112, 153, 164, 166 Ockey, G. 76–​77, 84–​85, 88, 92–​96, 175–​179, 189 online reading 40–​41, 47, 49 oral communicative competence 16, 152, 155, 159, 165–​166 overall language proficiency 200 own-​accent hypothesis 83–​84 Palmer, A. 14, 61–​65, 73, 96, 153–​157, 160, 165, 168, 211 Papageorgiou, S. 15, 49, 70, 76, 78, 84, 95, 110, 111 PTE (Pearson Test of English Academic) 10, 84, 88–​90, 91, 118, 121–​122, 125–​128, 130, 139, 170, 173–​174, 177, 183–​185 pedagogical models for describing listening ability 62, 66–​67


peer assessments 119, 120, 133, 136 Perfetti, C. 22–​38 personal voice and stance 15, 74, 76, 132–​136, 172 PIAAC 40, 48 Plakans, L. 47, 111, 115, 118, 124–​125, 127, 137–​138, 188 portfolios 120 practical-​social domains 15, 134–​135 pragmatic: competence 156–​169, 174–​177, 182, 192, 206; knowledge 63–​66, 73–​77, 80, 85, 96, 155–​156; meaning 74–​77, 109, 154–​156, 175–​177 practical considerations in assessment 76, 88, 91–​95, 97, 159, 163 predictive validity 12–​13, 130 pressure points in reading 31–​32 procedural knowledge 64–​65 purpose for reading 14, 22–​23, 30–​39, 35, 42, 44, 45–​46, 50–​51, 204 purpose for writing 108, 110, 133, 135 Purpura, J. 29, 74, 133, 154–​156, 174–​175, 185–​186, 212 raters see human scoring rating scales and criteria: for speaking 165–​167, 179, 181, 183; for writing 107–​108, 110–​112, 115, 119, 121–​125, 129, 132, 134, 139–​140 reader goals see purpose for reading reading strategies 33, 51 Reading Systems Framework 31–​39 reading to learn 34, 37–​39, 45–​51, 205 real-​life tasks 10, 78, 95, 97, 161, 189 rhetorical characteristics, meanings, structure 15, 35, 39, 49–​50, 65, 74, 111–​115, 116, 119, 120, 126–​138, 155–​156, 175, 192 Rost, M. 63, 75 rubric design 14, 16, 109, 111, 125, 134, 138, 152–​153, 155, 160, 163, 165–​166, 170–​174, 177, 179, 180–​183, 188–​189, 192, 206 schools 4, 113, 117, 120, 127–​128, 168 score interpretation 11, 17, 64, 68, 161, 178, 201, 205–​210 scoring rubrics see rubric design setting: academic 7–​9, 15, 36, 70–​71, 75, 80, 82, 84, 85–​86, 87, 94–​95, 120, 130, 133, 154, 158–​159, 167–​170, 172–​174, 178, 183, 186–​188, 191, 193, 211; communicative 65–​66, 76–​77, 97, 98,

119, 155–​157, 159–​161, 162, 185, 192, 204; test 62, 64, 110, 160 Shaw, S. 108, 110, 111, 122, 128 Simple View of Reading 24, 27–​28, 45 situation model 26, 28–​29, 31, 34–​35, 37, 39, 45–​46 situational authenticity 15, 62, 78, 97, 114 skilled reading 25–​26, 31 social-​interpersonal 162, 203 sociolinguistic knowledge 65–​66, 73–​75, 79–​80, 153, 155–​156, 175 speaking ability 152, 155, 157–​158, 163, 166–​167, 170, 177, 185–​188, 190, 192–​193, 201, 204 stakeholders 40–​41, 51, 94, 123, 130, 183 standardized testing 1–​4, 9–​12, 15, 48, 79, 117–​120, 124, 127, 130–​132, 134, 166 standards: college-​ready 16, 68–​70, 136, 139, 152; of language proficiency 6–​10, 16, 61–​62, 80, 112, 115–​118, 167–​168, 186–​188; movement 6 standards of coherence 30, 37, 45, 51 strategic competence 42, 51, 65–​67, 153–​156, 158, 164, 205 student success 12 study abroad 2–​3 subdomains in academic listening 82, 87, 97 summative purposes 48, 107, 117–​118, 120, 183 surveys of domains and tasks 40, 47, 77, 94, 107, 112–​113, 127, 129, 130, 203 Swain, M. 153–​155 syntactic complexity see grammatical complexity target language use (TLU) domain 62, 65, 73, 75–​76, 85, 86, 88–​93, 123, 158–​159, 161, 163, 165–​167, 177 task-​based assessment 10, 15, 65–​66, 153 task design 16, 152, 163, 204–​206, 209 task stimuli 107, 132–​134, 140 technology: in assessment 12, 13, 17, 40–​41, 50, 174–​178, 189–​192, 201, 205, 208–​210; in higher education 8, 48, 50, 70, 206, 208–​210 technology-​mediated: communication 8, 12, 23, 40–​41, 47; instruction and learning 8–​9, 41, 71–​73, 85, 187–​188, 203 TESOL 6 test development 11–​12, 47, 51, 61, 65–​66, 74–​78, 92, 94, 117, 122–​124, 131, 140, 155, 166–​167, 186–​192, 205–​209


test-​taker characteristics 96 text complexity 38–​39, 51 textual borrowing 138 textual features 49, 137 textual knowledge 155, 157–​158 theoretical model(s) 16, 22, 29, 34, 61, 152, 158, 159, 185 think-​aloud 49, 137 threshold: of ability 24, 85, 94; level 79, 137, 177, 185–​186 TOEFL (Test of English as a Foreign Language) 6, 9, 11, 153 TOEFL Committee of Examiners (COE) 84 TOEFL iBT 9, 11–​13, 75–​76; listening 84, 88–​89, 91–​94; reading 34, 39; speaking 161, 170–​174, 172, 177, 184–​185, 188, 212; writing 46, 115, 118, 121, 125–​131, 134, 139 transactional writing 15, 135–​136 TSE (Test of Spoken English) 11 TWE (Test of Written English) 11, 121

validation 11, 16–​17, 61, 66, 123–​125, 130, 182, 206–​210 validity argument 11, 17, 207–​212 Van den Broek, P. 29–​30, 36–​37, 45 varieties of English 8, 12, 78–​84, 86, 94, 186 video input in listening assessment 77, 93 visual information and support 41, 49, 62, 69, 73, 76, 78, 86, 92–​93, 176, 187, 190, 205

undergraduate courses and students 71, 85, 113, 129–​130, 169 utilization inference 123, 130–​131

Young, R. 153, 157, 160, 184

Weigle, S. 108–​112, 125, 130 Weir, C. 10, 65–​66, 108, 110, 111, 118, 122, 126, 128 WIDA (World-​Class Instructional Design and Assessment Consortium) 6–​7, 117 working memory 24, 27, 29, 30, 35, 37, 62–​64 World Englishes 78–​79 writing proficiency and abilities 15, 107, 110–​112, 115, 118, 127, 136–​137 written academic text characteristics 114 Xi, X. 9, 16, 97, 99, 110–​111, 123, 161, 166, 184, 191, 211

Zwick, R. 2
