“Technology has taken over language assessment and we are all becoming familiar with the latest generation of products, taking assessments at home on mobile devices, interacting with avatars and having our performance scored by a machine. In this pioneering collection, Sadeghi and Douglas have brought together some of the leading voices in the field to help us to get to grips with the profound practical and theoretical questions raised by the shift to technology mediated assessment. This book is indispensable reading for everyone who makes or studies language assessments in the digital age.” Tony Green, University of Bedfordshire, UK
“This book presents in-depth discussions by some 40 language testing scholars from all over the world on the use of technology and AI in language assessment. Most chapters recognize that Covid-19 has driven learning, teaching and assessment to the internet and observe how the negative impact of the epidemic on students’ participation in schooling is paired with positive impulses for technological developments, thereby offering new opportunities for remote assessment and for timely and adaptive feedback to students’ learning. The issue of validity is recurrent throughout the book. Yet as long as the language testing field judges the performance of machines by their capability to generate outcomes similar to those produced by humans, automated processes will never reach their full potential.” John H. A. L. De Jong, Educational Testing Service
“Educators have been utilizing technology steadily and increasingly in different areas of education for almost half a century. The assessment field is no exception. However, the COVID-19 Pandemic forced education into an almost complete online system where the infrastructures were not adequate in many places. Further, although some educational organizations offered online courses before the pandemic, many teachers did not have enough knowledge or could not receive training in time to operate in virtual contexts. More importantly, the critical role of assessment in making valid and reliable decisions required more secure technology than those in other fields. In such a critical context, practitioners needed guidelines from specialists and policymakers to cope with the crisis. This book is a pleasant contribution to addressing their needs. The selection of the topics demonstrates the editors’ informed taste and deep concern about the use of technology in language assessment. The variety of the issues addressed in the book will make it helpful to a wide range of practitioners and scholars. An important point that adds to the value of this book is that it will serve as a useful reference even after the COVID-19 pandemic when technology continues to dominate education in general and language assessment in particular.” Hossein Farhady, Professor of Applied Linguistics, Yeditepe University, Istanbul, Turkey
Fundamental Considerations in Technology Mediated Language Assessment
Fundamental Considerations in Technology Mediated Language Assessment aims to address issues such as how the forced integration of technology into second language assessment has shaped our understanding of key traditional concepts like validity, reliability, washback, authenticity, ethics, fairness, test security, and more. Although computer-assisted language testing has been around for more than two decades in the context of high-stakes proficiency testing, much of language testing worldwide has shifted to “at-home” mode, and relies heavily on the mediation of digital technology, making its widespread application in classroom settings in response to the COVID-19 outbreak unprecedented. Integration of technology into language assessment has brought with it countless affordances and at the same time challenges, both theoretically and practically. One major theoretical consideration requiring attention is the way technology has contributed to a reconceptualization of major assessment concepts/constructs. There is very limited literature available on the theoretical underpinnings of technology mediated language assessment. This book aims to fill this gap. This book will appeal to academic specialists, practitioners, or professionals in the field of language assessment, advanced and/or graduate students, and a range of scholars or professionals in disciplines like educational technology, applied linguistics, and teaching English to speakers of other languages (TESOL).

Karim Sadeghi is a Professor of TESOL at Urmia University. He is the founding Editor-in-Chief of the Iranian Journal of Language Teaching Research (Scopus Q1 Journal). His recent publications include Assessing Second Language Reading (2021), Talking about Second Language Acquisition (2022), and Theory and Practice in Second Language Teacher Identity (2022, co-edited with Dr Farah Ghaderi).

Dan Douglas is a Professor Emeritus in the Applied Linguistics and Technology program at Iowa State University, specializing in assessing language ability in specific academic and professional contexts. His books include Understanding Language Testing (2010) and Assessing Languages for Specific Purposes (2000). He was President of the International Language Testing Association from 2005 to 2006 and again from 2013 to 2015, and Editor of the journal Language Testing from 2002 to 2007. In 2019 he received the Cambridge/ILTA Distinguished Achievement Award.
Fundamental Considerations in Technology Mediated Language Assessment Edited by Karim Sadeghi and Dan Douglas
Designed cover image: elenabs via Getty Images First published 2023 by Routledge 4 Park Square, Milton Park, Abingdon, Oxon OX14 4RN and by Routledge 605 Third Avenue, New York, NY 10158 Routledge is an imprint of the Taylor & Francis Group, an informa business © 2023 selection and editorial matter, Karim Sadeghi and Dan Douglas; individual chapters, the contributors The right of Karim Sadeghi and Dan Douglas to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this book may be reprinted or reproduced or utilized in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers. Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe. British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library Library of Congress Cataloging-in-Publication Data A catalog record has been requested for this book ISBN: 978-1-032-27364-8 (hbk) ISBN: 978-1-032-27365-5 (pbk) ISBN: 978-1-003-29239-5 (ebk) DOI: 10.4324/9781003292395 Typeset in Times New Roman by Deanta Global Publishing Services, Chennai, India
This volume was inspired by a similar title by Lyle Bachman: Fundamental Considerations in Language Testing. It is our great honour to dedicate this book to Lyle, who has made substantial contributions to the field of second language assessment.
Contents
Acknowledgements
List of contributors
Foreword
LYLE F. BACHMAN

1 Technology mediated language assessment: Key considerations
KARIM SADEGHI AND DAN DOUGLAS

PART I
Validity concerns in technology mediated language assessment

2 Appraising models and frameworks of spoken ability in remote L2 assessment
F. SCOTT WALTERS

3 Exploring the role of self-regulation in young learners’ writing assessment and intervention using BalanceAI automated diagnostic feedback
EUNICE EUNHEE JANG, MELISSA HUNTE, CHRISTIE BARRON, AND LIAM HANNAH

4 Balancing construct coverage and efficiency: Test design, security, and validation considerations for a remote-proctored online language test
LARRY DAVIS, JOHN NORRIS, SPIROS PAPAGEORGIOU, AND SHOKO SASAYAMA

5 From pen-and-paper trials to computer-based test: Impact on validity
IVY CHEN AND UTE KNOCH

6 Two testing environments and their impact on test equity and comparability of the results: Insights from a national evaluation of learning outcomes
MARITA HÄRMÄLÄ AND JUKKA MARJANEN

PART II
Reliability issues and machine scoring

7 Exploring rater behaviors on handwritten and typed reading-to-write essays using FACETS
SUN-YOUNG SHIN, SENYUNG LEE, AND YENA PARK

8 Validating an AI-driven scoring system: The Model Card approach
BARRY O’SULLIVAN, TREVOR BREAKSPEAR, AND WILLIAM BAYLISS

9 Web-based testing and automated scoring: Construct conceptualization and improving reliability
NATHAN T. CARR

PART III
Impact, security, and ethical considerations

10 Testing young multilingual learners remotely: Prioritizing equity, fairness, and social justice
MARK CHAPMAN, JASON A. KEMP, AHYOUNG ALICIA KIM, DAVID MACGREGOR, AND FABIANA MACMILLAN

11 Improving the impact of technology on diagnostic language assessment
ARI HUHTA

12 Proctoring remote language assessments
ERIK VOSS

PART IV
Options and issues in technology mediated classroom assessment

13 Interdisciplinary collaborations for the future of learning-oriented assessment
NICK SAVILLE AND PAULA BUTTERY

14 Interactional competence in video-mediated speaking assessment in Vietnam
NORIKO IWASHITA, VAN-TRAO NGUYEN, AND GIANG HONG NGUYEN

15 Test-taking strategies in technology-assisted language assessment
ANDREW D. COHEN, TEYMOUR RAHMATI, AND KARIM SADEGHI

16 Researching technology mediated classroom language assessment practices: Methodological challenges and opportunities
KEITH MENARY AND LUKE HARDING

17 Conclusion: The way forward
DAN DOUGLAS AND KARIM SADEGHI

Index
Acknowledgements
An edited book like this is the outcome of a concerted effort by a team of experts. We are in the first place indebted to the contributors to this volume, all internationally accredited scholars, who so kindly accepted our invitation to write for us and be among the first to venture into the untouched area of theoretical issues surrounding technology mediated language assessment. Next to thank are acquisition editors, assistants, and the production team at Routledge who welcomed our proposal and worked with us very diligently at different stages to bring our initial ideas to fruition; Andrea Hartill and Iola Ashby, your support has been great! We are also very grateful to anonymous reviewers of the proposal whose feedback was constructive in better structuring the sections of the book. Any faults are ours, and we welcome feedback by readers to improve future editions of the book. Karim Sadeghi, Urmia University Dan Douglas, Iowa State University
List of contributors
Christie Barron is a PhD candidate in Developmental Psychology and Education at the Ontario Institute for Studies in Education, University of Toronto, Canada. Her research is largely situated within the area of educational measurement and assessment. Christie’s applied work investigates the interrelationships between linguistic, metacognitive, and affective development among diverse student populations using mixed methods and latent variable modeling.

William Bayliss is an experienced teacher, trainer, examiner, examiner trainer, and consultant with the British Council. He has a keen interest in the development and validation of machine-rated speaking assessments. His professional experience includes teaching, academic management, IELTS examining, examiner training, assessment consultancy, and the development of AI-driven EdTech products.

Trevor Breakspear has specialized in the assessment of productive skills since 2014, including the development of validation systems aligning technology and test development. Formerly with the British Council, Trevor is now a Senior Validation Manager at Cambridge University Press and Assessment and holds an MA in Language Assessment from Lancaster University, UK.

Paula Buttery is a Professor of Machine Learning and Language at the University of Cambridge, UK. She is co-Director of Cambridge Language Sciences, an Interdisciplinary Research Centre, and leads research into personalized adaptive technology for learning and assessment within the Cambridge Institute for Automated Language Teaching & Assessment (ALTA).

Nathan Carr is a Professor of TESOL at California State University, US. His research interests are eclectically focused on language assessment, particularly validation, computer-automated scoring of short-answer tasks, test development and revision, rating scale development, test task characteristics, and assessment literacy training.

Mark Chapman joined the Wisconsin Centre for Education Research (WIDA) at University of Wisconsin-Madison, US, in 2015 where he designs and develops large-scale language proficiency tests for young multilingual learners. His
research interests focus on the validation of language tests for young language learners, with specialization in constructed response performance assessments.

Ivy Chen is a Research Fellow at the Language Testing Research Centre, University of Melbourne, Australia. Her research interests include language testing, second language acquisition (SLA), and corpus linguistics. Her PhD dissertation, A corpus-driven receptive test of collocational knowledge, used the argument-based approach to validation and modelled factors affecting item difficulty.

Andrew D. Cohen is Professor Emeritus at the University of Minnesota, US. He has also held the following positions: Peace Corps Volunteer, Bolivia (1965–1967); ESL Section, UCLA, US (1972–1975); Language Education, Hebrew University, Israel (1975–1991); Fulbright Lecturer/Researcher, PUC, São Paulo, Brazil (1986–1987); L2 Studies, University of Minnesota, US (1991–2013); Visiting Scholar, University of Hawaii, US (1996–1997) and Tel Aviv University, Israel (1997); Visiting Lecturer, Auckland University, New Zealand (2004–2005). He co-edited Language Learning Strategies (2007), co-authored Teaching and Learning Pragmatics (2nd ed., 2022), authored Strategies in Learning and Using a Second Language (2011), and authored Learning Pragmatics from Native and Nonnative Language Teachers (2018).

Larry Davis is a Senior Research Scientist in the Center for Language Education and Assessment Research at Educational Testing Service (ETS), Princeton, New Jersey, US. His research interests include development of technology-enhanced speaking and writing tasks, assessment of spoken interaction, creation of rubrics, rater cognition, and automated evaluation of speaking ability.

Dan Douglas is a Professor Emeritus in the Applied Linguistics and Technology program at Iowa State University, US, specializing in assessing language ability in specific academic and professional contexts. His books include Understanding Language Testing (2010) and Assessing Languages for Specific Purposes (2000). He was President of the International Language Testing Association from 2005 to 2006 and again from 2013 to 2015, and Editor of the journal Language Testing from 2002 to 2007. In 2019 he received the Cambridge/ILTA Distinguished Achievement Award.

Liam Hannah is a PhD student in Developmental Psychology and Education at the Ontario Institute for Studies in Education, University of Toronto, Canada. His research is in educational measurement and assessment, utilizing multi-method and multi-modal techniques to broaden educators’ ability to evaluate diverse forms of learning. Much of this work relies on machine learning and natural language processing techniques and includes critical questions of validity.

Luke Harding is a Professor in Linguistics and English Language at Lancaster University, UK. His research interests are in language assessment and applied linguistics more broadly, particularly listening and speaking
assessment, language assessment literacy, and World Englishes and English as a Lingua Franca. His work has been published in journals such as Applied Linguistics, Assessing Writing, Language Assessment Quarterly, Language Teaching, and Language Testing.

Marita Härmälä is a Counsellor of Evaluation at the Finnish Education Evaluation Centre (FINEEC), Finland. Her main research interests are language testing and assessment. She conducts national evaluations of learning outcomes in foreign languages (English, Swedish) and functions as an item writer and member of the Swedish section of the Finnish Matriculation Examination Board. She is an expert member in the European Association for Language Testing and Assessment (EALTA) and has participated in two European Centre for Modern Languages (ECML) projects on language awareness among young learners.

Ari Huhta is Professor of Language Assessment at the Centre for Applied Language Studies, University of Jyväskylä, Finland. His research interests include diagnostic L2 assessment, computer-based assessment, and self-assessment, as well as research on the development of reading, writing, and vocabulary knowledge in a foreign or second language.

Melissa R. Hunte is a PhD candidate in Developmental Psychology and Education at the Ontario Institute for Studies in Education, University of Toronto, Canada. Melissa’s research examines the use of natural language processing and machine learning to assess students’ language proficiency and predict their internal beliefs about learning. She also specializes in psychometrics, mixed-methods research, and assessment reform through machine learning applications.

Noriko Iwashita is an Associate Professor in Applied Linguistics in the School of Languages and Cultures at the University of Queensland, Australia. Her research interests include the interfaces of language assessment and SLA, peer interaction in classroom-based research, and cross-linguistic investigation of four major language traits.

Eunice Eunhee Jang is a Professor at the University of Toronto, Canada, with specializations in diagnostic language assessment, technology-rich learning and assessment, and mixed methods research. Dr Jang has led high-impact provincial, national, and international research with various stakeholders. Her BalanceAI project examines ways to promote students’ cognitive and metacognitive development through innovative learning-oriented assessments based on machine learning applications.

Jason Kemp is an Assessment Researcher at WIDA at University of Wisconsin-Madison, US. He participates in research and content development projects that support multilingual learners and their educators. Jason is interested in social justice initiatives that can influence the design of large-scale language proficiency assessments.
Ahyoung Alicia Kim is a researcher at WIDA at University of Wisconsin-Madison, US, where she examines the language development of K–12 multilingual learners. Her research interests include language assessment, child bilingualism, second language literacy development, and computer-assisted language learning. Alicia holds an EdD in Applied Linguistics from Teachers College, Columbia University, US.

Ute Knoch is the Director of the Language Testing Research Centre at the University of Melbourne, Australia. Her research interests are in the areas of policy in language testing and assessing languages for academic and professional purposes.

Senyung Lee is an Assistant Professor in the TESOL program at Northeastern Illinois University, US. She received a PhD in Second Language Studies from Indiana University, US. Her research interests include language assessment and L2 vocabulary learning. She has extensive experience in designing and developing English tests. She received the 2020 Jacqueline Ross TOEFL Dissertation Award by Educational Testing Service and was a finalist for the 2020 AAAL Dissertation Award.

David MacGregor is an Assessment Researcher for WIDA at University of Wisconsin-Madison, US. He previously served as Director of the Psychometrics and Quantitative Research Team at the Center for Applied Linguistics. He received his PhD in Applied Linguistics from Georgetown University, US. His interests include issues related to validity, standard setting, and computer adaptive testing.

Fabiana MacMillan is the Director of Test Development at WIDA at University of Wisconsin-Madison, US. She oversees the development of a suite of assessments that serve multilingual K–12 learners in the United States and abroad. Her research interests include language assessment, and applications of written discourse analysis to test development and validation.

Jukka Marjanen, MEd, is a Counsellor of Evaluation at FINEEC. His main research interests are psychometrics and educational assessment. He works as a statistician at FINEEC and conducts statistical analysis for the national evaluations of learning outcomes.

Keith Menary is a PhD student in the Department of Linguistics and English Language, Lancaster University, UK. Keith’s background is in language teaching and language assessment development. He is currently conducting doctoral research exploring the assessment practices of English for Academic Purposes teachers in online teaching environments.

Giang Hong Nguyen is a Senior Lecturer in the English Department and the Vice Director of the Master of TESOL (International) at Hanoi University, Vietnam. Her research interests focus on CALL, blended language learning, interactional competence, and teacher professional development.
Van-Trao Nguyen is an Associate Professor of Linguistics and President of Hanoi University, Vietnam. He is the Governing Board Member of Vietnam to SEAMEO RELC 2017–2020 and 2020–2024 and Vice President of Vietnam Association of English Language Teaching and Research (VietTESOL). He is co-author of Professional Development of English Language Teachers in Asia: Lessons from Vietnam and Japan (2018) and has published articles in journals such as The Asian EFL Journal, Asian Studies Review, and Review of Cognitive Linguistics.

John M. Norris is Senior Research Director of the Center for Language Education and Assessment Research at Educational Testing Service (ETS), Tokyo, Japan, where he oversees business support and foundational research. His scholarship focuses on language testing innovation and validation, language teacher development, task-based language teaching, program evaluation, and research synthesis.

Barry O’Sullivan is Head of Assessment Research and Development (R&D) at the British Council with a long interest in assessing the productive skills. His work includes practical and theoretical aspects of test development; designing and leading the development of the Aptis test while contributing to the ongoing development of the socio-cognitive model of test development and validation.

Spiros Papageorgiou is a Managing Senior Research Scientist in the Center for Language Education and Assessment Research at Educational Testing Service (ETS), Princeton, New Jersey, US. His publications cover topics such as standard setting, score reporting and interpretation, and listening assessment, including the edited volume Global Perspectives on Language Assessment: Research, Theory, and Practice (with Kathleen M. Bailey, 2019).

Yena Park is an Assessment Scientist on the test development team for the Duolingo English Test. She holds a PhD in Second Language Studies from Indiana University, US, with a specialization in language assessment both in local and standardized testing settings.

Teymour Rahmati has a PhD in TESOL, and he currently works as an Assistant Professor of English at Guilan University of Medical Sciences, Rasht, Iran. His research interests include language assessment and language teacher motivation. He has published papers in Assessing Writing and RELC.

Karim Sadeghi is a professor of TESOL at Urmia University, Iran, and the founding Editor-in-Chief of the Iranian Journal of Language Teaching Research (a Scopus Q1 journal). His recent publications include Assessing Second Language Reading (2021), Talking about Second Language Acquisition (2022), and Theory and Practice in Second Language Teacher Identity (2022).

Shoko Sasayama is an Associate Professor in the School of Humanities and Social Sciences at Waseda University, Japan. Prior to joining Waseda, she worked as
an Associate Research Scientist at Educational Testing Service (ETS) and as an Assistant Professor at the University of Tokyo, Japan. Her research focuses on second language acquisition, language education, and assessment.

Nick Saville is Director of Thought Leadership, English, for Cambridge University Press and Assessment, UK, and is Secretary-General of the Association of Language Testers in Europe (ALTE). His research interests include Learning Oriented Assessment (LOA) and uses of AI for the automation of language learning and assessment systems.

Sun-Young Shin is an Associate Professor in the Department of Second Language Studies at Indiana University, US. His research interests include integrated assessments and classroom-based language assessment in interactive forms. He has been invited for lectures and workshops on L2 assessment in various places throughout the world including Macau, Mexico, South Korea, Thailand, and the United States. His work has been published in Language Testing, Language Assessment Quarterly, Language Teaching Research, Assessing Writing, Language Learning and Technology, and ReCALL. He is also currently serving on the editorial boards for several journals such as Language Testing and Language Assessment Quarterly.

Erik Voss is an Assistant Professor in the Applied Linguistics and TESOL program at Teachers College, Columbia University, US. His research interests include language assessment and technology and validation research. He has served on the board of the Midwest Association of Language Testers (MwALT) and the International Language Testing Association (ILTA).

F. Scott Walters is an Associate Professor in the Office of International Affairs at Seoul National University of Science and Technology, Korea. His interests include developing CA-informed tests (CAIT) – that is, assessments of L2 oral pragmatics ability based upon the methodology and findings of conversation analysis; assessment-literacy acquisition processes of L2 educators; TESOL teacher education and curriculum development; test validity theory; and the philosophy of science.
Foreword
Lyle F. Bachman
Professor Emeritus, Department of Applied Linguistics, University of California, Los Angeles
The papers in this collection provide insightful discussions of a rich array of both current and past technology mediated language assessments (TMLA), along with the validity and use issues these assessments present. The assessments discussed cover the full range of language use activities (listening, speaking, reading, writing), different age levels of test takers (children to adults), and different uses (low-stakes diagnostic to high-stakes selection/admission). As such, this collection represents the current end point for TMLA. To better appreciate where the field is at this point in time, it may be of interest to glance back a bit and revisit some of the issues that were raised about TMLA in its infancy; issues that I believe are still relevant today.
Evolution of technology-assisted language assessment tasks
It has been nearly 40 years since the first computer-based language tests (CBLTs) were developed.1 One of these was a program available on British Council computers that enabled test developers and classroom teachers to create a gap-filling test (Alderson, 1988). Another, developed by researchers at Brigham Young University (Madsen & Larson, 1986), delivered on the promise of a test that was tailored to the ability levels of individual test takers, making it possible to obtain reliable measures of test takers’ language ability with fewer items, and hence increasing the efficiency of the test. This test used what was the cutting-edge computer technology at the time, along with one of the first applications of item response theory in language testing, for estimating test takers’ levels of ability. However, in terms of its content – what it tested and the types of tasks presented – this test made no advances; it was essentially a multiple-choice paper-and-pencil test delivered in a computer adaptive mode and format. Thus, despite the high-tech computer delivery system and the sophisticated measurement model, this test, and others like it in this first generation of CBLTs, did little more than provide an alternative technology for doing what we had already been doing – multiple-choice and gap-fill – for many years in paper-and-pencil tests. In the four decades since those first CBLTs, we have seen a virtual explosion of technology, as pictures, videos, sounds, and music, along with virtually any piece of information or trivia and the latest news are but a mouse stroke away,
and “surfing” is no longer limited to the boys and girls of endless summer. The web is our wave, and the internet our fast track to information overload. And while computational linguists at the time assured us that machine translation was just around the corner, it has taken until now for computers utilizing artificial intelligence to even begin to approach the capacity for using human language for communicating anything beyond literal meanings. In the same period of time, we have seen a major shift in the measurement zeitgeist, with alternative assessments, variously known as performance assessments or authentic assessments, replacing the once dominant multiple-choice item as the format of choice, and qualities such as authenticity, educational relevance, fairness, and impact usurping reliability as the sovereign of psychometrica. In language testing, we have seen a similar sea change, with test developers now paying much more than lip service to the call from the language teaching profession for tests that engage test takers in actual language use and that are aligned with current teaching practice – in short, language tests that pay greater heed to the qualities of authenticity and impact on instruction, or washback. A common theme throughout this period of development, but in my view also of relevance today, was that of the tension between the hoped-for benefits and potential negative consequences of the use of technology in language tests. Canale (1986) discussed what he called the promises and threats of TMLAs, while Alderson (1988) raised a concern about the use of technology promoting questionable test methods as opposed to enabling test developers and researchers to explore innovative assessment tasks and ways of analyzing test takers’ responses. Chapelle (2003) discussed the issue of efficiency-oriented tunnel vision versus theory-oriented expansion of perspectives. Chalhoub-Deville (2002) distinguished between “sustaining” and “disruptive” uses of technology. A sustaining use of technology consists of “innovations intended to simplify and facilitate current practices and products” (p. 472). A disruptive use of technology, on the other hand, “leads to the accomplishment of something that previously had not been considered possible” (p. 472). Bachman (2003) discussed the kinds of changes that are typically made in assessment task characteristics (TCs) when moving from a pencil-and-paper (P&P) to a web-based (WB) format. He described changes in the TCs for input, expected response, and scoring, and the effects these changes may have on the validity of score-based inferences and on the target language use (TLU) domains to which these inferences generalize. WB language assessments may introduce the following changes by introducing multiple sources of input (text, video, audio, visual/graphic) and a variety of response formats: “large-choice” selected responses (e.g., drag and drop, hot spot), limited production responses (e.g., short answer, completion), and extended production responses (e.g., recall, summary). The introduction of machine scoring of responses raises questions about the requirements for these (mimic human scoring processes, enable richer algorithms, tailored to the intended assessment use, or simply produce similar scores) and their potential limitations (restricted range of processes, more stringent criteria).
Bachman then argued that when we make these changes in assessment TCs, we potentially change the nature of the construct measured as well as the TLU domain(s) to which score-based inferences generalize. By 2006 there was sufficient experience with and interest in computer-assisted language tests (CALTs) in the field for the publication of the first full-length book, Chapelle and Douglas (2006), devoted entirely to computer-assisted language assessment. Chapelle and Douglas provide a comprehensive discussion of the issues mentioned above, as well as those associated with implementing and evaluating CALTs. They conclude with a critical discussion of the revolution in assessment through technology anticipated at that time by researchers in educational assessment, arguing that although substantial changes and advances had occurred in CALT, this revolution had not yet occurred.
Validity issues
TMLA holds a great deal of promise for the creation and delivery of new types of assessment tasks, and for making such tests available in a truly global fashion, allowing delivery anywhere in the world, 24/7. However, a number of validity issues are raised by changing the mode of delivery from paper and pencil to the computer or the web, and by the introduction of new task types and scoring procedures. Our knowledge about how test takers interact with and respond to TMLA tasks, for example, is still limited, as is our understanding of the effects of features introduced by technology on test performance. That is, we know relatively little about the extent to which TMLAs may enrich the representation of the abilities we want to assess, on the one hand, or introduce variation that is not related to these abilities, on the other. Finally, we are not sure whether the use of automated scoring of open-ended responses to TMLA tasks provides an adequate basis for the valid interpretation of scores, with respect to the constructs to be assessed. In addressing these validity issues, either through research, or in practice for any given TMLA, the following questions, posed by language testers during the early years of CBLT and CALT development, may be worth considering:
1. Will the technology require us to refine or to trivialize our construct definitions?
2. How do test design decisions affect or reflect the construct (e.g., the extent of test taker choice, multiple sources of input, variety of response formats)?
3. How do automated scoring criteria and procedures affect or reflect the construct?
4. To what extent and in what ways do test takers process and respond to differing input across different types of TMLA tasks?
5. To what extent and in what ways do humans and machines score responses differently? If there are differences, do they matter, and under what conditions?
6. To what extent and in what ways does machine scoring change the construct? If there are differences, do they matter, and under what conditions?
7. To what extent and in what ways does TMLA enhance our representation of the construct to be measured?
8. To what extent and in what ways does TMLA introduce factors that are irrelevant to this construct?
9. To what extent and in what ways do the TLU domains differ between P&P and TMLA, or across variations in TMLA applications?
10. To what extent and in what ways do TMLA tasks correspond to TLU tasks?
11. To what extent and in what ways do test takers perceive TMLA tasks as being relevant to their TLU domains?
Fairness issues
Notions of fairness and use in assessment were by no means at the forefront in early discussions of CBLTs. Indeed, it was not until the late 1990s that fairness was even discussed in our field, the 1997 special issue of Language Testing (Davies, 1997) providing the first published forum on fairness in language testing. But just as the use of TMLA has mushroomed in the last 40 years, so has our concern with and awareness and understanding of ethical issues become much sharper and refined. Thus, when we want to address issues of fairness and use in TMLA, we need to consider the extent to which, and in what ways, these issues may be either ameliorated or exacerbated in TMLA, or whether they are essentially played out differently from traditional P&P tests. Current standards for evaluating the quality of assessments, e.g., the Standards of the American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education (AERA/APA/NCME) (2014), and the International Language Testing Association’s Code of Ethics (2000) and Guidelines for Practice (2005), treat fairness in different ways. The AERA/APA/NCME Standards, for example, discuss multiple aspects of fairness:
• Fairness in treatment during the testing process;
• Fairness as lack of measurement bias;
• Fairness in access;
• Fairness as validity of individual test score interpretations;
• Fairness as opportunity to learn. (American Educational Research Association et al., 2014, pp. 50–7)
The most comprehensive discussion of fairness in language assessment is that of Kunnan (2018), who proposes two principles: fairness in the assessment process and justice in assessment use. Each of these in turn is broken down into sub-principles. Bachman and Palmer (2010) argue that fairness can be operationalized in the claims and warrants in an assessment use argument (AUA). There are specific claims and warrants in their AUA that address all of the aspects of fairness in the AERA/APA/NCME Joint Standards as well as Kunnan’s principles. These fall into the following categories:
• Fairness in the assessment process:
  • Equitability of treatment;
  • Absence of bias;
• Fairness in assessment use:
  • Fairness of assessment-based decisions;
  • Fairness in fully informing test takers and other stakeholders about how these decisions will be made.
In addressing these fairness issues, either through research, or for any given TMLA, the following questions may be worth considering:
Equitability of treatment in the assessment process
12. Do all test takers have equal access to information about the assessment and the assessment procedures?
13. Do all test takers have equal opportunity to prepare for the assessment?
14. Do all test takers have equal access to the assessment, in terms of location, cost, time or frequency of administration, and familiarity with the procedures and the technology to be used?2
15. Are all test takers given equal opportunity to demonstrate their ability?
16. Are all test takers given the assessment under equivalent and appropriate conditions?
17. Are the results of the assessment reported in a timely manner, and in ways that are clearly understandable to all stakeholder groups, and that insure confidentiality to individuals?
Absence of bias in the assessment process
18. Do the assessment tasks include response formats or content that may either favor or disfavor some test takers?
19. Do the assessment tasks include content that may be offensive (topically, culturally, or linguistically inappropriate) to some test takers?
20. Are the score reports of comparable consistency across different groups of test takers?
Fairness of assessment-based decisions
21. Are existing educational and societal values and relevant legal requirements carefully and critically considered in the kinds of decisions that are to be made?
22. Are existing educational and societal values and relevant legal requirements carefully and critically considered in determining the relative seriousness of false positive and false negative classification errors?
23. Are test takers and other affected stakeholders fully informed about how the decision will be made and whether decisions are actually made in the way described to them?
24. Are cut scores set so as to minimize the most serious classification errors?
25. Do the assessment-based interpretations provide information that is relevant to the decision?
26. Do the assessment-based interpretations provide sufficient information to make the required decisions? (Bachman & Palmer, 2010, pp. 127–30)
A final issue that does not fit neatly into either validity or fairness, but which pertains to the consequences, or impact, of using the assessment, is that of the assessment’s potential for promoting good instructional practice and effective learning. This is particularly pertinent for TMLAs, in my view, since it is often the case that the assessment tasks presented in these assessments do not correspond to the teaching/learning tasks one finds in the language classroom. Thus, another question that we need to ask is:
27. In instructional settings, to what extent does the TMLA help promote good instructional practice and effective learning?
Conclusion
Technology mediated language assessment has taken computer-based language testing well beyond the “sustaining” first step of making it easier to do what we have always done and is leading us to “disruptive uses” of the technology, as we explore its potential for developing new approaches to assessment. Part of this potential is that the abilities we are able to measure are more complex, reflecting current research and thinking about both the nature of language ability and how this is learned. In addition, the tasks that we are able to present to test takers are richer and more complex, potentially better reflecting language use in test takers’ TLU domains. Thus, TMLA is beginning to realize some of the promises that language testing researchers have been discussing for the past 40 years, in terms of what we test, how we test, and how we use test results. The realization of TMLA’s potential, both for test design and use and for research, will require continued validation research, addressing issues such as those that are discussed above. Furthermore, investigating issues such as these will require that we employ a wide range of research approaches, both quantitative and qualitative. The chapters in this collection address many of these issues using appropriate research methodology and thus provide an encouraging perspective on where we are as a field, with respect to TMLA, and where we are likely to be heading in the future.
Notes
1. Over the years, the terminology used to refer to language assessments that employ some form of technology has changed from “computer-based language testing” (CBLT) to “computer-assisted language testing” (CALT) to the current terminology, “technology mediated language assessment” (TMLA). I have followed the terms that were used by language testers who wrote about this topic as the field evolved.
2. Taylor, Kirsch, Eignor, and Jamieson (1999) dealt with test takers’ familiarity with technology as a validity issue, while in current standards frameworks this is considered to be a fairness issue.
References
Alderson, J. C. (1988). Innovations in language testing: Can the microcomputer help? Language Testing Update, Special Report 1 (iv + 44 pp.).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Psychological Association.
Bachman, L. F. (2003). Validity issues in web-based language assessment. Paper presented at the Annual Meeting of the American Association for Applied Linguistics, Arlington, VA.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford University Press.
Canale, M. (1986). The promise and threat of computerized adaptive assessment of reading comprehension. In C. W. Stansfield (Ed.), Technology and language testing (pp. 29–45). TESOL.
Chalhoub-Deville, M. (2002). Technology in standardized language assessments. In R. Kaplan (Ed.), Oxford handbook of applied linguistics (pp. 471–484). Oxford University Press.
Chapelle, C. A. (2003). English language learning and technology: Lectures on applied linguistics in the age of information and communication technology. John Benjamins Publishing.
Chapelle, C. A., & Douglas, D. (2006). Assessing language ability by computer. Cambridge University Press.
Davies, A. (1997). Special Issue: Ethics in language testing. Language Testing, 14(3).
International Language Testing Association. (2000). ILTA code of ethics. Retrieved from https://cdn.ymaws.com/www.iltaonline.com/resource/resmgr/docs/ILTA_2018_CodeOfEthics_Engli.pdf
International Language Testing Association. (2005). Guidelines for practice. Retrieved from https://cdn.ymaws.com/www.iltaonline.com/resource/resmgr/docs/guidelines_for_practice/2020_revised/ilta_guidelines_for_practice.pdf
Kunnan, A. J. (2018). Evaluating language assessments. Routledge.
Madsen, H. S., & Larson, J. W. (1986). Computerized Rasch analysis of item bias in ESL tests. In C. W. Stansfield (Ed.), Technology and language testing (pp. 47–67). TESOL.
Taylor, C., Kirsch, I., Eignor, D., & Jamieson, J. (1999). Examining the relationship between computer familiarity and performance on computer-based language tasks. Language Learning, 49(2), 219–274.
1
Technology mediated language assessment Key considerations Karim Sadeghi and Dan Douglas
1.1 Introduction
Less than two decades ago, in 2006, on the opening page of the opening chapter of their authoritative volume on technology in language assessment, Chapelle and Douglas, while observing that ‘It would be difficult to estimate how many second language learners have taken or will take a language test delivered by computer’ (p. 1), review the relevant research trend and conclude that ‘Taken together, the strands of the technology thread point to an important change in the fabric of language assessment: the comprehensive introduction of technology’ (pp. 1–2) and predict that ‘no matter where second language learners live, they will sooner or later take a computer-assisted language test’ (p. 2). Similarly, close to the outbreak of the COVID-19 pandemic, Winke and Isbell (2017) state that computer-assisted language assessment ‘is becoming normalized’, calling for future research on its ‘fairness for underserved populations’ (p. 313). They were all certainly right in their prediction that language testing in the future will be driven by computer technology, but nobody would have predicted that technology mediated language assessment (TMLA) would soon be the only option left for the field in the near future. And this is exactly what we are experiencing today: a large majority of, if not all, language assessment during the pandemic relies on the mediation of digital technology, which is reflected in a recent interview with Chapelle published in RELC Journal, in which she notes, ‘the current pandemic situation has created the need for more “at home” testing’ (Meniado, 2020, p. 452). Although computer technology is not a newcomer in large-scale testing, its unplanned, emergent, unprecedented and widespread use for much low- and high-stakes testing, including language assessment, at all levels of education at a time of a global health crisis has indeed introduced a revolution in language testing, with both theoretical and practical implications. Such a widespread integration of technology into language assessment, as Chapelle observes, ‘raises many new issues in testing and challenges for technology. … [such] that the technical, business, ethical, and legal dimensions of online testing will occupy us for the foreseeable future’ (Meniado, 2020, p. 452). Similarly, in an editorial where Harding and Winke (2022) reflect on challenges as editors of the journal Language Testing, they introduce changes to the journal’s future publication
policies and plans resulting from the COVID-19 pandemic. One of the challenges for the future of the language testing profession, given the consequences of the pandemic, is how to redefine the boundaries of TMLA. The concern of Harding and Winke has obviously and understandably been focused on challenges as editors of the journal, but they would certainly agree with us that theoretical and methodological issues introduced by the recent surge in TMLA will continue to challenge and distress testing experts and stakeholders for years and perhaps decades to come. Although various forms of computerised testing have been around for several decades in the context of large-scale proficiency testing (with computer-based/adaptive testing dating back to as early as the 1970s), as an aftermath of the COVID-19 pandemic, a majority of testing worldwide had no option but either to stop or to shift to ‘at-home’ mode. This remote assessment mode relied heavily on the mediation of digital technology, without which the language testing enterprise could have come to a complete halt or been severely damaged at best. Integration of technology into language assessment has brought with it countless affordances and challenges at the same time, both theoretically and practically (Sadeghi, 2023). An overarching theoretical consideration requiring attention is the way technology has contributed to a re-conceptualisation of major assessment concepts and the underlying constructs these tests are meant to measure. To the best of our knowledge, there is no book available on the market fully devoted to how technology has affected key issues and concepts in language testing, although numerous book chapters or published papers have made reference to the significance of such research and discussion. The dominance of TMLA has now made it a necessity to examine, for example, whether traditional language assessment constructs need a re-evaluation or a re-definition, whether technological skill should be regarded as an integral part of test-takers’ competence or whether it should be considered as a test method facet. There is a pressing need to understand how the emergency integration of technology into second language assessment has shaped our understanding of fundamental traditional concepts like validity, reliability, washback, authenticity, ethics, fairness, security and other theoretical and practical facets of language assessment. Reporting on the adaptations they made to a university English placement test in response to the COVID-19 pandemic, Green and Lung (2021, p. 9), for instance, reflect on their observations of some in-home assessments:

We could hear mothers scolding children, babies crying, and chickens crowing in the background … That caused us to wonder if we were getting the best sample from our students. One … reported that she had taken the writing test in McDonald’s because of their free wireless.

Such observations, which are increasingly being documented by many others, especially those working in technologically disadvantaged regions of the world, have grave consequences in terms of validity and fairness, which lie at the centre of the language testing enterprise.
Similarly, as Chapelle and Douglas (2006) rightly point out, since technology-assisted language assessment implies ‘new configurations of test methods’, it is accordingly ‘important to understand how computers are likely to be different from other means of presenting language test tasks and thus how technology may affect interpretations of language test performance’ (p. 21). In other words, due to a ‘method effect’, the integration of technology with language assessment adds complexities to how language tasks should be assessed and what test-takers’ performance on them means, and raises issues with test validity and reliability, making a computer-assisted language test (CALT) different from its parallel paper-and-pencil version. Despite Chapelle and Douglas’s (2006) call nearly two decades ago for the need to understand whether and how new constructs are tapped by CALTs and whether issues such as computer literacy should be considered as a component of language ability, especially now, when hardly any aspect of life or education is conceivable without technology, little research has been done and few works have been written addressing this important issue. The question posed by Chapelle and Douglas (2006, p. 80), i.e., ‘If CALT is different from other forms of testing, should computer-based testing be evaluated against a different set of standards from that used to evaluate other tests?’, is still valid. Even though the short answer to their question is a resounding ‘yes’, what these ‘revised or revisited’ criteria are, or how this evaluation should be conducted is less clear at a time when CALT is the dominant rather than a peripheral discourse in the world of language assessment today. Concerns such as questionable construct validity and bias in computer-based tests have not been addressed properly so far and accordingly warrant more immediate and elaborate attention now that so much language testing is conducted with the mediation of computers or other digital technology. This book is an attempt to ignite discussions that can help us better understand what is involved in CALT, computer-assisted language assessment (CALA), computer-mediated language assessment (CMLA), mobile-assisted language assessment (MALA), technology mediated language assessment or any other form of remote, web-based, home-based, online, computerised or other similar form of assessment (we will henceforth use the umbrella term TMLA) conceptually, theoretically and methodologically. In the wake of the current COVID-19 pandemic which has virtually transformed the world in many dimensions, language assessment is hard to find without dependence on technology in one way or another, except for contexts with no technological resources, or sporadic small-scale assessments conducted with limited numbers of test-takers, perhaps administered outdoors (see for example Ockey et al., 2021). During the first months of the pandemic, language assessment had either to stop (which was the case in numerous low-stakes contexts since there was no access to the required hardware and software) or be postponed until the time when lockdowns were removed. Soon realising that the pandemic was here to stay for the foreseeable future, the language testing field had to move forward with whatever technological resources were available (even if on a limited scale).
Although the world has started returning to a ‘new normal’ and although universities, schools, language centres and other educational bodies are
gradually returning to full in-person classes, technology is already a mainstay of all education. Once a luxurious commodity only available in technologically well-resourced contexts and to those who could afford it, technology has now become an integral aspect of educational life everywhere, although there are still many learners across the world without easy access to technological resources for various reasons including wars, economic issues, lack of relevant infrastructure and so on. In the aftermath of the COVID-19 pandemic, there are ongoing attempts to develop new technology-assisted language tests or modify those already existing to serve wider populations and be deliverable remotely (see, for example, Sadeghi, 2023). The language assessment field is experiencing a giant jump and a big shake-up in ‘modernising’ itself. TMLA has now dominated the field of language assessment more than ever, but the thorny issues posed by Chapelle and Douglas in 2006 have not been attended to properly since then, although technological advances that affected language testing have deeply engaged language testers for almost two decades (e.g., Papadima-Sophocleous, 2008). This volume is one of the first attempts to consider theoretical and methodological concerns associated with delivering language tests with the mediation of technology (including computer-, mobile- and tablet-assisted tests, whether delivered synchronously or asynchronously). Although sporadic work exists on issues such as security of technology mediated language tests (e.g., Clark et al., 2021; Garg & Goel, 2022; Ladyshewsky, 2015; Miguel et al., 2015; Papageorgiou & Manna, 2021; Purpura et al., 2021) and their fairness (e.g., Chu & Lai, 2013; Xi, 2010; Green & Lung, 2021), the current health crisis which forced language assessment to migrate from a primarily paper-and-pencil mode to a primarily computer-based one has once again highlighted the need for a fresh look at other fundamental issues such as validity and reliability as well as security, fairness, ethics, authenticity, washback and practicality of the technological delivery of language tests. Indeed, as observed by McNamara and Roever (2006) and McNamara et al. (2019), among others, one key issue in accounting for validity is accounting for fairness and justice, two fundamental concepts that require renewed attention at a time when online/distance assessment, despite being seen as a panacea, raises serious concerns regarding access, social equity and justice, as also emphasised by a number of chapters in this volume. Before introducing the chapters comprising this volume, which have been written with the aim of bridging this gap in our knowledge of how TMLAs may work differently from traditional tests in terms of key testing concepts, we briefly survey the limited literature in this area.
1.2 Research on theoretical and methodological aspects of TMLA

In language assessment, while our ultimate end is to measure an attribute in which we are interested as accurately as possible, ‘how’ we measure (test method) has a direct bearing on ‘what’ we measure: the attribute, ability or construct of interest. Indeed, different methods or procedures used to measure the same construct or
ability may yield different performances, suggesting different levels of ability. For example, measuring oral fluency through role plays, dialogues, task descriptions and so on will each yield somewhat different performances; and even when the same method (face-to-face interaction, for example) is employed, differences in performance are expected that are attributable to the testee’s age, gender, cultural background and linguistic proficiency, the role relationship between the testee and the interactional counterpart, and so on. The addition of a computer medium certainly adds to this already existing complexity, and any minor change in the way a test is delivered (test method) can potentially lead to changes in performance (directly impacting test validity and reliability) which may then misrepresent the ability being measured. While comprehensive frameworks have been proposed (as discussed below) to account for how changes in test methods and procedures can lead to changes in task performance, little research evidence has been produced to document how the ‘how’ of testing affects the ‘what’ of testing. Within the domain of CMLA, with its growing popularity, there is an immediate need to document how the ‘technology’ facet affects test performance and what theoretical, practical and methodological implications such assessment has, especially in terms of reconsidering the seminal concept of validity (the meaningfulness of test scores and the uses made of them), and whether the construct of language ability needs revisiting, a concern this volume hopes to partially address.

In his authoritative classic, Fundamental Considerations in Language Testing, Bachman (1990) provides a very extensive account of test method facets that can affect performance on language tests. These facets are categorised under the general headings of the testing environment or physical and temporal circumstances, rubrics/instructions, and input and expected response, as well as the interactions between the latter two. Similarly, Bachman offers exhaustive treatments of the concepts of test validity and reliability, which have traditionally been regarded as the two most important qualities for evaluating an existing test or constructing a new one. Indeed, the usefulness of a test, which is context-specific, was later regarded as the most important feature of a test, such that an appropriate balance is required between validity and reliability (as well as other characteristics such as authenticity, interactiveness, washback/impact and practicality) (Bachman & Palmer, 1996). The ultimate aim for a test-maker should be, on the one hand, to produce a test that measures what it is intended to measure (valid) and, on the other, to produce a test which measures as accurately as possible (reliable), while keeping an eye on other features like being practical enough, exerting beneficial washback, including real-life tasks and, above all, being fair to all test-takers.

Since Bachman’s (1990) classic text as well as Bachman and Palmer (1996, 2010), no other book has specifically attended to fundamental considerations (like the nature of the language ability construct, test methods, reliability and validity) in a reconceptualised manner. All books on language testing/assessment have touched on these key concepts, treating them in very similar ways but resorting to various metaphors to clarify these traditional concepts (like Douglas, 2010,
using the ‘rubber ruler’ metaphor to illustrate how reliability works in language tests). Attention has also been paid to distinguishing assessment from testing (as well as to alternative assessment, like dynamic as opposed to static assessment), to new frameworks for estimating reliability (like different Item Response Theory (IRT) or Rasch models), and to frameworks for validation studies (such as the Assessment Use Argument (AUA) framework).

One recent edited volume drawing on Bachman’s original volume is Ockey and Green (2020). The volume is a collection of papers originally presented at a conference in 2017 in honour of Bachman and focuses (empirically and theoretically) on a range of fundamental issues such as evolving language ability constructs like English as a lingua franca, validity and validation of language assessment including arguments for ethics-based validation, and understanding the internal structure of language assessments, with no chapter devoted to these key issues in the context of technology mediated language assessment. There is essentially very limited literature available on the theoretical underpinnings of computer-mediated language assessment attending to whether and how fundamental issues and constructs in language assessments that are designed, delivered and scored through the medium of computer technology need reconsideration and reconceptualisation.

The first book to appear on the theme of technology in language testing was Stansfield (1986), a collection of papers presented at the Language Testing Research Colloquium (LTRC) in 1985, where most attention was focused on computer adaptive testing and the application of the latent trait model to item construction and selection. This book was followed by Dunkel (1991), where again the focus of the edited work lies heavily on computer-adaptive testing (CAT) issues. One substantial early authored volume on the use of technology in language testing is Chapelle and Douglas (2006). The book does a great job of introducing trends in computerised language assessment, as well as its implementation, evaluation and impact; however, given that the use of technology in language assessment was not as widespread at the time, very little attention is paid to how technology has shaped traditional assessment concepts. Another full volume dedicated to computerised assessment (but not specifically focusing on language assessment) is Brown et al. (1999). The book draws on a range of expertise to share good practice and explore new ways of using appropriate technologies in assessment. It provides a strategic overview along with pragmatic proposals for the use of computers in assessment. The text deals with computerised assessment in higher education but does not specifically deal with the use of computer technology in L2 assessment, which is the theme of this project. The most recent book by Routledge on language testing is the Routledge Handbook of Second Language Acquisition and Language Testing, edited by Winke and Brunfaut (2021). The handbook is a comprehensive account of how language testing serves Second Language Acquisition (SLA) research, but there is no chapter related to the use of technology in language assessment. A similar handbook by Fulcher and Davidson (2012) also has no chapters devoted to TMLA.
Sporadic book chapters or journal articles have also been devoted to the use of technology in language assessment, such as Brown (1997), Bernstein et al. (2010), Chalhoub-Deville and Deville (1999), Weigle (2013) and Winke and Isbell (2017); only the last of these proposes to investigate the conceptualisation of TMLA, a goal similar to the one we are pursuing in this project. None of the books, chapters or papers mentioned above, however, explicitly deals with theoretical considerations in language assessment as a consequence of technology integration; neither does any directly concentrate on how technology has affected the field of language assessment by revisiting key traditional assessment concepts, a gap partially filled by this volume.

1.2.1 Mode comparison studies

Kyle et al. (2022) rightly argue that the language produced in technology mediated learning environments (TMLEs) can be different from that produced in non-TMLEs in terms of its linguistic features; accordingly, language proficiency assessments need to reflect this. Their previous research indicated substantial differences across the learning environments, an observation that necessitates a reconsideration of the underlying constructs tapped by tests delivered with the medium of technology, as well as a revisiting of surrounding conceptual and theoretical issues. While counting the numerous affordances of technology in language assessment, Chapelle and Douglas (2006, p. 41) also highlight six threats posed by CMLA as being primarily related to validity and security issues, calling for future research in this area. Contributions to this volume are meant to do exactly this: find out more about how TMLA is theoretically and methodologically different from traditional parallel forms of assessment and how testing experts should address the following threats. Four of the six threats that are most relevant to the aims of this collection are:

Performance on a computer-delivered test may fail to reflect the same ability as that which would be measured by other forms of assessment.
Computer-assisted response scoring may fail to assign credit to the qualities of a response that are relevant to the construct that the test is intended to measure.
CALT may pose risks to test security.
CALT may have negative impact on learners, learning, classes and society.
(Emphasis added.)

In a recent study comparing traditional multiple-choice items (MCIs) with technology-enhanced items (TEIs), claiming that the latter ‘offer improved construct coverage’ (p. 1) in that they ‘resemble authentic, real-world tasks’ (p. 4) and acknowledging that little research exists in this area, Kim et al. (2022) collected empirical data from a large number of pupils across grades 1–12 in the USA. English language proficiency scores of 1.2 million candidates who took a large-scale online reading test (ACCESS for ELLS: http://wida.wisc.edu/assess/access)
with a difference in item mode (response format) were compared in terms of item difficulty, item discrimination, item information and item efficiency (the amount of information provided relative to item duration) using Item Response Theory (IRT) (a brief illustrative sketch of these quantities is provided below). The study found MCIs to be more difficult, while TEIs were reported to be more informative at higher reading proficiency levels but generally less efficient, with longer item durations (the time required to present the items).

Two other rare mixed-methods studies by Nakatsuhara and her colleagues targeted the assessment of speaking in a high-stakes test. Nakatsuhara et al. (2017) compared face-to-face assessment of speaking with assessment delivered via the Zoom video conferencing facility in the context of the International English Language Testing System (IELTS), with 32 participants in an IELTS preparation programme in London, and found that while both delivery modes led to similar test scores and comparable language functions, they resulted in differences in examiners’ behaviours (as raters and interlocutors) and in test-takers’ functional output, implying that the use of technology in assessing speaking could potentially have tapped a slightly different construct. In 2021, Nakatsuhara et al. investigated six IELTS examiners’ ratings of 36 test-takers’ performances under three different rating modes: live, audio and video conditions, the last two being non-live. Verbal recall data and rater justifications indicated that examiners identified more negative features in the audio condition than in the live and video conditions and that the richer visual information in the video condition (as well as in the live condition) helped them comprehend the message better, with the implication that they relied less on negative performance features and tended to over-rate in these latter conditions.

Papageorgiou and Manna (2021, p. 38) compared the internet-based Test of English as a Foreign Language (TOEFL iBT), delivered in a testing centre, with its Home Edition, delivered at the test-taker’s home and introduced by Educational Testing Service in November 2020. They rightly acknowledge that ‘Mode effects could be present because of differences in the testing environment (home v. test center), the computer equipment, and steps in setting up the test for delivery through a computer’ and call for further research on this comparability and its implications for test validity. In a similar vein, and also concentrating on TOEFL iBT, Brown and Ducasse (2019) compared performance on the speaking component of TOEFL iBT with face-to-face academic oral assessment tasks. Their observations revealed that face-to-face tasks produced rhetorical moves that were more characteristic of traditional TOEFL tasks, with implications for the TOEFL iBT validity argument in terms of the domain definition inference.

This small but burgeoning body of mode comparison work is significant in that the studies clearly indicate the differential performance of TMLAs in comparison to their traditional counterparts (or even among CMLA versions themselves) and raise questions about whether different response, delivery or even rating modes tap basically the same construct and whether the resulting scores are construct-representative, an issue which requires further discussion and elaboration given that almost all language testing is now conducted online using items or tasks that are technology enhanced and increasingly technology scored.
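To make the IRT quantities referred to in the Kim et al. (2022) comparison more concrete, the following is a minimal illustrative sketch only; it assumes a two-parameter logistic (2PL) model for dichotomously scored items, and the efficiency index shown is one plausible operationalisation (information per unit of item time) rather than necessarily the one used in that study. Under the 2PL model, the probability that a test-taker with latent proficiency \( \theta \) answers item \( i \) correctly is

\[
P_i(\theta) = \frac{1}{1 + \exp\{-a_i(\theta - b_i)\}},
\]

where \( b_i \) is the item’s difficulty and \( a_i \) its discrimination. The (Fisher) information the item contributes at proficiency level \( \theta \) is then

\[
I_i(\theta) = a_i^{2}\, P_i(\theta)\bigl(1 - P_i(\theta)\bigr),
\]

and an efficiency index of the kind described above can be sketched as information per unit of administration time, for example \( E_i(\theta) = I_i(\theta) / t_i \), where \( t_i \) is the average item duration.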
The concern is, in Chalhoub-Deville and Deville’s (1999, p. 292) words,
‘we know all too well that test methods can and do influence scores and thus alter our subsequent use and interpretation of the scores. … no method is a panacea; each comes with its own set of advantages and disadvantages.’

In short, changes introduced to any aspect of the test method facets are likely to change the nature of the test, rendering a test that measures a slightly different construct than the originally intended one. As such, delivering tests through the medium of computer technology (in all its forms) and from a distance (which introduces changes to traditional tests in terms of test method facets such as physical and temporal circumstances as well as the mode of input and expected output) will potentially create issues with validity, reliability, fairness, security, authenticity, interactiveness, washback and so on. Indeed, differences exist among the various forms of technology-assisted language assessment, and these could potentially lead to under- or over-representation of the construct under measurement. For example, an oral interview may be conducted online using Microsoft Teams or any other video conferencing tool where, rather than meeting face to face, the interactants meet virtually. This form of TMLA is totally different from both interaction with a pre-recorded simulated interviewer via computer and live interaction with another human being. Evidently, while both scenarios (interaction with a simulated interviewer and interaction with a human being, both via computer) count as examples of TMLA, they can offer different representations of the construct (oral language proficiency) being measured, which in both cases can still be quite different from that measured in a face-to-face scenario. Chapelle and Douglas (2006) acknowledge this by stating that ‘It will be a long time before computer technology mirrors the flexibility of a good interviewer in an oral test’ (p. 32).

As another example, Ling (2017) surveyed 17,040 TOEFL iBT examinees from 24 countries, investigating whether the keyboard type (US vs non-US) leads to construct-irrelevant variance in writing, and observed that score differences were statistically significantly associated with the type of keyboard used by test-takers during the test as well as with their perceptions and awareness of the keyboard policy. As such, numerous variations arise as mini-facets when technology is added as a new test method facet to already existing test method frameworks, and a new or modified facets framework is required to account for these additional mini-facets and their interactions. CMLA may be more convenient for test-takers and more efficient for testing researchers and test administrators, and although language testing scholars and researchers have long been aware of the relevant complexities in test construction, administration and interpretation of results, not much research and theoretical discussion has been conducted to tease out what technology brings to, and what it possibly takes away from, what we already know about important test qualities. We resonate with concerns (such as validity, practicality, test security, fairness, test-taker rights and access) raised by numerous scholars, like Muhammad and Ockey (2021), who call for research on yet-to-be-answered questions such as the following:
To what extent is the construct of a … [technology]-mediated assessment equivalent to [that of] a face-to-face assessment?
To what extent does familiarity with the technology(ies) used to deliver the test impact a test-taker’s performance?
Is it appropriate to use facial recognition, or other personally invasive procedures, to help ensure that the test-taker of record is in fact taking the test?
(pp. 53–54)
It is hoped that this volume marks the starting point for a more serious treatment of these and other fundamental issues, given the dominance of language tests delivered online nowadays. Next, we briefly introduce the content of this volume where contributors tackle some of the key and perennial theoretical and methodological issues highlighted above, with the hope that future volumes will attend to other outstanding major concepts and issues.
1.3 Structure of this book

The book is made up of 16 full-length chapters, in addition to a short Foreword (by Prof. Lyle F. Bachman) and a Conclusion (by the editors). Chapter contributors are among the most distinguished active scholars in the field of language assessment who were invited to contribute to the volume. Many of these researchers are affiliated with major language assessment organisations (like Educational Testing Service (ETS) or Cambridge Assessment), are members of international professional testing associations or are principal researchers or directors of assessment centres in their corresponding institutions.

The volume is structured into four parts. Following Chapter 1, the editors’ Introduction, Part I deals with Validity Concerns in Technology Mediated Language Assessment in five chapters. In Chapter 2, Scott Walters revisits validity models in distance language assessment, and Jang and her colleagues discuss validity in technology-delivered language tests in Chapter 3. In Chapter 4, Davis et al. evaluate the TOEFL Essentials test in terms of meaningfulness and efficiency; and Chen and Knoch in Chapter 5 report and discuss actions taken to render a traditional test computerised. In Chapter 6, Härmälä and Marjanen evaluate Finnish ninth graders’ learning outcomes by considering the consequential validity of the conclusions drawn.

Part II: Reliability Issues and Machine Scoring consists of three chapters. In Chapter 7, Shin et al. explore rater behaviour in dealing with handwritten and typed essays; in Chapter 8, O’Sullivan et al. propose a taxonomy for terms related to human and machine scoring; and then Carr in Chapter 9 discusses construct conceptualisation and reliability in automated scoring.

Part III: Impact, Security and Ethical Considerations also has three chapters. Chapman et al. in Chapter 10 investigate equity, fairness and social justice in remote assessment. Huhta in Chapter 11 then attends to the impact of technology on diagnostic assessment; and Voss examines proctoring and security issues in remote assessments in Chapter 12.
In Part IV: Options and Issues in Technology Mediated Classroom Assessment, Saville and Buttery in Chapter 13 look at interdisciplinary collaboration in building the future of Learning Oriented Language Assessment. In Chapter 14, Iwashita et al. examine Interactional Competence in online-based speaking; and Cohen et al. in Chapter 15 survey test-taking strategies in technology-assisted language assessment. Finally, Chapter 16 by Menary and Harding introduces methodological challenges and opportunities in TMLA research and practice. The book ends with a Conclusion chapter by the editors.
References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice. Oxford University Press.
Bernstein, J., van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27, 355–377.
Brown, A., & Ducasse, A. M. (2019). An equal challenge? Comparing TOEFL iBT™ speaking tasks with academic speaking tasks. Language Assessment Quarterly, 16(2), 253–270. DOI: 10.1080/15434303.2019.1628240
Brown, J. D. (1997). Computers in language testing: Present research and some future directions. Language Learning & Technology, 1(1), 44–59.
Brown, S., Race, P., & Bull, J. (1999). Computer assisted assessment in higher education. Routledge.
Chalhoub-Deville, M., & Deville, C. (1999). Computer adaptive testing in second language contexts. Annual Review of Applied Linguistics, 19, 273–299.
Chapelle, C. A., & Douglas, D. (2006). Assessing language through computer technology. Cambridge University Press.
Chu, M-W., & Lai, H. (2013). Fairness in computerized testing: Detecting item bias using CATSIB with impact present. Alberta Journal of Educational Research, 59(4), 630–643.
Clark, T., Spiby, R., & Tasviri, R. (2021). Crisis, collaboration, recovery: IELTS and COVID-19. Language Assessment Quarterly, 18(1), 17–25. DOI: 10.1080/15434303.2020.1866575
Douglas, D. (2010). Understanding language testing. Hodder Education.
Dunkel, P. (Ed.). (1991). Computer assisted language learning and testing: Research issues and practice. Newbury House.
Fulcher, G., & Davidson, F. (2012). The Routledge handbook of language testing. Routledge.
Garg, M., & Goel, A. (2022). A systematic literature review on online assessment security: Current challenges and integrity strategies. Computers & Security, 113, 102544. https://doi.org/10.1016/j.cose.2021.102544
Green, B. A., & Lung, Y. S. M. (2021). English language placement testing at BYU-Hawaii in the time of COVID-19. Language Assessment Quarterly, 18(1), 6–11. DOI: 10.1080/15434303.2020.1863966
Harding, L., & Winke, P. (2022). Editorial: Innovation and expansion in Language Testing for changing times. Language Testing, 39(1), 3–6. DOI: 10.1177/02655322211053212
Kim, A. A., Tywoniw, R. L., & Chapman, M. (2022). Technology-enhanced items in grades 1–12 English language proficiency assessments. Language Assessment Quarterly, 19(4), 1–23. DOI: 10.1080/15434303.2022.2039659
Kyle, K., Eguchi, M., Choe, A. T., & LaFlair, G. (2022). Register variation in spoken and written language use across technology mediated and non-technology-mediated learning environments. Language Testing, 1–31. DOI: 10.1177/02655322211057868
Ladyshewsky, R. K. (2015). Post-graduate student performance in ‘supervised in-class’ vs. ‘unsupervised online’ multiple choice tests: Implications for cheating and test security. Assessment & Evaluation in Higher Education, 40(7), 883–897. DOI: 10.1080/02602938.2014.956683
Ling, G. (2017). Is writing performance related to keyboard type? An investigation from examinees’ perspective on the TOEFL iBT. Language Assessment Quarterly, 14(1), 36–53. DOI: 10.1080/15434303.2016.1262376
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Blackwell.
McNamara, T., Knoch, U., & Fan, J. (2019). Fairness, justice, and language assessment: The role of measurement. Oxford University Press.
Meniado, J. C. (2020). Technology integration into language testing and assessment: An interview with Prof. Carol A. Chapelle. RELC Journal, 51(3), 448–453. DOI: 10.1177/0033688220974219
Miguel, J., Caballé, S., Xhafa, F., & Prieto, J. (2015). Security in online web learning assessment: Providing an effective trustworthiness approach to support e-learning teams. World Wide Web, 18, 1655–1676. DOI: 10.1007/s11280-014-0320-2
Muhammad, A. A., & Ockey, G. J. (2021). Upholding language assessment quality during the COVID-19 pandemic: Some final thoughts and questions. Language Assessment Quarterly, 18(1), 51–55. DOI: 10.1080/15434303.2020.1867555
Nakatsuhara, F., Inoue, C., Berry, B., & Galaczi, E. (2017). Exploring the use of video-conferencing technology in the assessment of spoken language: A mixed-methods study. Language Assessment Quarterly, 14(1), 1–18. DOI: 10.1080/15434303.2016.1263637
Nakatsuhara, F., Inoue, C., & Taylor, L. (2021). Comparing rating modes: Analysing live, audio, and video ratings of IELTS speaking test performances. Language Assessment Quarterly, 18(2), 83–106. DOI: 10.1080/15434303.2020.1799222
Ockey, G. J., & Green, B. A. (Eds.) (2020). Another generation of fundamental considerations in language assessment: A festschrift in honor of Lyle F. Bachman. Springer Nature.
Ockey, G. J., Muhammad, A. A., Prasetyo, A. H., Elnegahy, S., et al. (2021). Iowa State University’s English placement test of oral communication in times of COVID-19. Language Assessment Quarterly, 18(1), 26–35. DOI: 10.1080/15434303.2020.1862122
Papadima-Sophocleous, S. (2008). A hybrid of a CBT- and a CAT-based new English placement test online (NEPTON). CALICO Journal, 25(2), 276–304.
Papageorgiou, S., & Manna, V. F. (2021). Maintaining access to a large-scale test of academic language proficiency during the pandemic: The launch of TOEFL iBT Home Edition. Language Assessment Quarterly, 18(1), 36–41. DOI: 10.1080/15434303.2020.1864376
Purpura, J. E., Davoodifard, M., & Voss, E. (2021). Conversion to remote proctoring of the Community English Language Program online placement exam at Teachers College, Columbia University. Language Assessment Quarterly, 18(1), 42–50. DOI: 10.1080/15434303.2020.1867145
Sadeghi, K. (Ed.) (2023). Technology-assisted language assessment in diverse contexts: Lessons from the transition to online testing during COVID-19. Routledge.
Stansfield, C. (1986). Technology and language testing. TESOL.
Weigle, S. (2013). ESL writing and automated essay evaluation. In M. Shermis and J. Burstein (Eds.), Handbook of automated essay evaluation (pp. 36–54). Routledge.
Winke, P. M., & Brunfaut, T. (2021). Routledge handbook of second language acquisition and language testing. Routledge.
Winke, P. M., & Isbell, D. R. (2017). Computer-assisted language assessment. In S. L. Thorne and S. May (Eds.), Language, education and technology (pp. 313–325). Springer.
Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(2), 147–170. DOI: 10.1177/0265532209349465
Part I
Validity concerns in technology mediated language assessment
2 Appraising models and frameworks of spoken ability in remote L2 assessment
F. Scott Walters
2.1 Introduction

As the pandemic caused by the spread of the COVID-19 coronavirus in early 2020 began to surge around the globe, it disrupted many human activities, including those in educational institutions. According to UNESCO data, as of April 2020, complete school closures across some 150 countries affected over 400 million learners, with upwards of 1.2 billion school children (approximately 80% of total enrollees) affected overall. This declined during 2021, and as of February 2022, complete school closures still affected some 43 million learners, with some 620 million more affected by partial closures (UNESCO, 2022). Even assuming that the Omicron variant causes the pandemic to settle into “endemic” status (Powell, 2022), it seems likely that school disruptions will continue to some degree for the duration, creating a need not only for online instruction but also – as a means of avoiding spread of the contagion through face-to-face (f2f) test administrations – for online assessment, including online second language (L2) assessment.

A question naturally arises regarding the validity of the use of the results of online L2 assessments, in particular of those intended to target oral communication skills. There are various dimensions to this issue, including rating condition (Nakatsuhara et al., 2021) as well as delivery mode (Nakatsuhara et al., 2017). However, a more fundamental issue may be held to lie at the theoretical level: namely, the conceptual models and frameworks upon which oral L2 tests are conceived, designed, administered, and scored. In other words, do theoretical descriptions of normative language use that are in currency today reflect the kinds of oral interactions that may be captured by online testing modes? In considering this question, this exploratory study examines a number of these theoretical models, and then surveys some empirical studies into L2 assessment that compare online versus f2f testing conditions, with a view to determining the models’ usefulness in remote L2 oral assessment. Analyses are drawn from the theoretical literature on test validity (e.g., Messick, 1989) and from research within the discipline of conversation analysis (CA) (Pekarek Doehler, 2018, 2021; Sacks et al., 1974; Schegloff, 2007). It should be noted that, despite a growing body of research into online written assessment – the technical term for that object of study being, perhaps misleadingly, online talk (e.g., Meredith, 2019; Paulus et al.,
2016) – that line of research is, for the sake of brevity, beyond the scope of this survey, which focuses on online versus f2f assessment in the oral mode.
2.2 Models

In considering the question of the validity of approaches to L2 assessment, whether f2f or online, it may be helpful to first make broad distinctions among models of language ability, frameworks, and constructs. Following Fulcher and Davidson (2007, p. 36), one may first define a model as an “over-arching and relatively abstract theoretical description of what it means to be able to communicate in a second language.” In contrast, a framework can be defined as “a selection of skills and abilities from a model that are relevant to a specific assessment context.” Clarifying the term construct can be somewhat complex as its usage among L2 testers is variable. For example, Douglas (2000), working from a language for specific purposes (LSP) perspective, notes that construct definition involves four aspects: the level of detail necessary to define the construct, whether strategic competence (for which see below) should be included in the definition, the relationship of the canonical “four skills” of reading, writing, speaking, and listening to the definition, and background knowledge (pp. 36–40). As another example, Yin and Mislevy (2022) note different approaches to construct definition, one in which constructs are “aspects of examinees’ capabilities that are stable across a range of situations” (loosely, a “skill in the head”); another in which constructs are defined behaviorally, as observable features of test-taker behavior on specific language tasks; and yet another in which interlocutors co-construct language output (p. 296). Yin and Mislevy (2022) summarize these approaches by stating that a construct is “the assessor’s inevitably simplified conception of examinee capabilities, chosen to suit the purpose of a given assessment” (p. 296). Finally, Fulcher and Davidson (2007) use the term “construct” to simply mean the components of a model or a framework (pp. 36–37). (For the sake of simplicity, unless otherwise indicated this last usage will be the one employed in this chapter.)

Models and frameworks may also be related to actual L2 assessment via test specifications, which are generative blueprints that a test designer (or teacher) may use to create a specific test or assessment (Davidson & Lynch, 2001; Fulcher & Davidson, 2007): specifically, as Fulcher and Davidson (2007) point out, “[a] framework document mediates between a model … and test specifications” (p. 36); thus, if a framework and/or a model does not faithfully represent communicative reality, this error could negatively impact L2 test development and hence L2 test use, whether remote or f2f.

An early model of language ability is Canale and Swain’s (1980) very influential description of communicative competence. It consists of three components, to which a fourth was added by Canale (1983). The first is grammatical competence, that is, the knowledge of the language code (morphosyntactic rules, vocabulary, pronunciation, spelling, and semantics). A second is sociolinguistic competence, the mastery of the sociocultural code of language use; that is, appropriate application of vocabulary, register, politeness, and style to a given social situation.
The third is strategic competence, the knowledge of communication strategies enabling the learner to overcome difficulties when communication breakdowns occur. As mentioned above, Canale (1983) expanded the model to include discourse competence – the ability to combine language structures into different types of cohesive texts (e.g., political speech, poetry, academic essays). These four components represent both the knowledge required to effect communication and the skills to do so. Given that this model was devised in the early 1980s, well before the World Wide Web became accessible to the general public with the advent of the Mosaic browser at the University of Illinois at Urbana-Champaign in 1993 (Hopgood, 2001), there is no explicit evidence for this model having special applicability to online L2 assessment. Moreover, as Schachter (1990), cited in Celce-Murcia et al. (1995), indicates, the model has limits in that the various competency domains are not very detailed – indeed, as presented they appear somewhat static – and hence its applicability to the specialized domain of online assessment remains largely implicit. However, one may hypothesize that strategic competence, which according to Canale and Swain involves “verbal and non-verbal communication strategies that may be called into action to compensate for breakdowns in communication due to performance variables or to insufficient competence” (p. 30), could be applied to interactions on an online medium prone to connectivity and audio-transmission issues.

A descendant of Canale and Swain’s (1980) model is Bachman’s (1990) description of communicative language ability, explicitly intended to inform L2 assessment use. Unlike earlier models, Bachman’s scheme describes how the various components of language ability interact with one another in a system. For example, Bachman diagrammatically places language competence (or knowledge of language) in an oval separate from and alongside an oval labeled knowledge structures (or knowledge of the world), showing them both informing a diagrammatic bubble labeled strategic competence. This last is defined more broadly than the corresponding component in Canale and Swain, as a “capacity for implementing the components of language competence in contextualized communicative language use” (p. 84). Strategic competence interacts with the speaker’s psychophysiological systems which, in turn, interact with the context of use. Bachman details his language-competence component in a tree diagram (p. 87), with organizational competence branching to the left, and pragmatic competence to the right. The former branches further into grammatical competence (vocabulary, morphology, syntax, and phonology/graphology) and textual competence (cohesion and rhetorical organization). The latter is divided into illocutionary competence and sociolinguistic competence. Illocutionary competence involves the use of language functions such as ideational (expressing information or propositions), manipulative (those that affect the world around the speaker of the language), heuristic (seeking information about the world), and imaginative (aesthetic or humorous use of language). As for sociolinguistic competence, this is also divided into various subcomponents, such as the ability to sense dialects, register, naturalness, and cultural references. (This tree diagram, though widely cited, was abandoned in favor of conventional tables by Bachman
and Palmer [1996] and Bachman and Palmer [2010], with slight modifications to the names of the components/constructs.)

Strategic competence, which is the heart of Bachman’s model, processes the above-mentioned knowledge structures, language competence, and the language-use context (including test settings) through three subcomponents, which respectively assess which language skills are needed in a given situation to achieve communicative goals; plan an utterance by selecting information from the language-competence component as well as an appropriate modality; and execute communicative actions via the psychophysiological organs. As Celce-Murcia et al. (1995; see below) point out, unlike Canale and Swain’s (1980) and Canale’s (1983) models, Bachman’s model separates language knowledge from the cognitive (or metacognitive) skills employed in actual language use, for example, grammatical competence versus strategic competence. As with Canale and Swain’s (1980) and Canale’s (1983) models, there is nothing explicitly relevant to the delimited domain of online assessment modes, although one could speculate that, in addition to strategic competence, the manipulative functions of Bachman’s illocutionary competence might be applied by a user/speaker/examinee negotiating the technical aspects of a remote L2 assessment setting.

Another well-known model is that of Celce-Murcia et al. (1995). This model is explicitly pedagogical in orientation and intent, the authors claiming that earlier models are limited in various ways. For example, Canale and Swain’s (1980) model fails to provide “detailed content specifications for CLT [communicative language teaching] that relate directly to an articulated model of communicative competence” (p. 5). Also, Bachman’s (1990) model is criticized as being relevant only to language testing and as unable to serve as a more general model of communicative competence for teaching and learning. Celce-Murcia et al.’s model recasts the grammatical competence of Canale and Swain as linguistic competence, breaks the earlier sociolinguistic competence into two separate components, sociocultural competence and actional competence, and from Canale (1983) includes both strategic competence and discourse competence. These components are represented schematically: a circle labeled “discourse competence” occupies the center of a triangle. Each corner of this triangle is respectively labeled “sociocultural competence,” “linguistic competence,” and “actional competence.” Each of these “cornered” components is individually connected to the central discourse component with double-headed arrows, indicating bi-directional relationships. Strategic competence is represented by a much larger circle surrounding the entire diagram. Affixed to this outer circle are a few unidirectional arrowheads “circumnavigating” the other components; this outer circle represents the notion that one’s strategic faculties monitor and coordinate among the other components in the act of language use.

For each of these overarching components/constructs, Celce-Murcia et al. provide detailed breakdowns. For example, linguistic competence, similar to corresponding components of earlier models, consists of knowledge of syntax, morphology, the lexicon, phonology, and orthography. Discourse competence, broadly the ability to select, sequence, and arrange words, structures, and utterances
to achieve a unified spoken or written text, comprises subcomponents dealing with cohesion, deixis, coherence, the structure of genres, and conversational structure, the last consisting of abilities to effect turn-taking, perform conversational openings and closings, establish topics, interrupt, and manage adjacency pairs. Actional competence, described overall as “competence in conveying and understanding communicative intent by performing and interpreting speech acts and speech act sets” (p. 9), consists of knowledge of language functions, such as greeting and leave-taking, asking for and giving information, agreeing and disagreeing, expressing and finding out about feelings, and suasion. Sociocultural competence contains subcomponents dealing with social contextual factors, stylistic appropriateness, sociocultural background knowledge of the target language community, and nonverbal factors including kinesics, proxemics, and silence. Finally, Celce-Murcia et al.’s version of strategic competence, which “allows a strategically competent speaker to negotiate messages and resolve problems or to compensate for deficiencies in any of the other underlying competencies” (p. 9), consists of five subcomponents: avoidance or reduction strategies, compensatory strategies (e.g., circumlocution and code switching), time-gaining strategies (e.g., hesitation devices), self-monitoring strategies (self-initiated repair or self-rephrasing), and interactional strategies (e.g., appeals for help, meaning negotiation, and comprehension checks).

Celce-Murcia et al.’s detailed breakdown of language functions (or constructs), published for pedagogical reasons at the dawn of the internet age, does not explicitly deal with online assessment issues. However, their strategic component could, like that of the models summarized earlier, conceivably be applicable to technology-related issues of conversational repair that may arise during remote interactions.
2.3 Frameworks

As mentioned earlier, an assessment framework can be defined as “a selection of skills and abilities from a model that are relevant to a specific assessment context” (Fulcher & Davidson, 2007, p. 36). One such widely used framework is the Common European Framework of Reference for Languages or CEFR (Council of Europe, 2001), intended by its creators to “provide a transparent, coherent and comprehensive basis for the elaboration of language syllabuses and curriculum guidelines, the design of teaching and learning materials, and the assessment of foreign language proficiency.” The “Global Scale” of L2 proficiency measurement consists of six “Common Reference Levels” (i.e., proficiency levels), A1 and A2 indicating “Basic User” [of an L2], B1 and B2 denoting “Independent User,” and C1 and C2 “Proficient User.” There are also intermediate levels styled A2+, B1+, and B2+. The highest level, C2, “is not intended to imply native-speaker or near native-speaker competence” but rather “the degree of precision, appropriateness and ease with the language which typifies the speech of those who have been highly successful learners” (2001, p. 36).
The heart of the CEFR framework consists of three primary descriptor tables, which may be used for assessment purposes, though there is also considerable explanatory documentation accompanying the tables. CEFR Table 1 (see https://rm.coe.int/1680459f97, p. 24) gives the CEFR Global Scale and is holistic in design, with one descriptor paragraph per proficiency level, and may be used by teachers or testers. Table 2 provides a “self-assessment grid” for the use of learners, giving descriptors (cast in the first person) for receptive and productive language actions across the six-level range; for example, for Spoken Interaction an A2-level learner “can handle very short social exchanges, even though I can’t usually understand enough to keep the conversation going myself” (2001, p. 26). Table 3 is an analytic scale for assessment, giving breakdowns at each scale level for the language features of range, accuracy, fluency, interaction, and coherence. These general tables are summaries of “illustrative scales” that describe facets of language use for speaking, writing, reading, and listening; for example, the illustrative scales for speaking include addressing audiences and overall spoken production.

An interesting aspect of this array of descriptors is how some of the language reflects concepts articulated in the above-mentioned models of L2 learning. For example, the illustrative scales given for planning, compensating, and monitoring and repair reflect the aspects of strategic competence respectively given in the models of Canale and Swain (1980), Bachman (1990), and Celce-Murcia et al. (1995); for example, as described under planning, a B1 speaker “[c]an work out how to communicate the main point(s) he/she wants to get across, exploiting any resources available and limiting the message to what he/she can recall or find the means to express” (2001, p. 64).

Another commonly used framework for assessment is that devised by the American Council on the Teaching of Foreign Languages, the ACTFL Proficiency Guidelines (ACTFL, 2012). The measurement scale comprises five levels: Novice, Intermediate, Advanced, Superior, and Distinguished, with the lower three further subdivided into Low, Mid, and High. Each level and sublevel is expressed by detailed text, lengthier in the lower levels, for speaking, listening, writing, and reading. For example, an Advanced High speaker of an L2 can “perform all Advanced-level tasks with linguistic ease, confidence, and competence” and “may demonstrate a well-developed ability to compensate for an imperfect grasp of some forms or for limitations in vocabulary by the confident use of communicative strategies, such as paraphrasing, circumlocution, and illustration” (p. 5). There is no explicit mention of L2 ability pertaining to online language use aside from a short reference to an Intermediate reader, who is “able to understand texts that convey basic information such as that found in announcements, notices, and online bulletin boards and forums” (p. 23).

Having summarized the main features of a few influential models and assessment frameworks, this chapter now briefly surveys some recent empirical studies into L2 assessment which examine, and sometimes compare, language behavior in face-to-face and online assessment modes.
2.4 Studies

Comparisons between f2f and online conversation are not new; they predate the pandemic. For example, O’Conaill et al. (1993), in a study of communication features involved in videoconferencing, found that even with high-quality transmission technology there were differences compared with f2f language use; for instance, speakers interacting remotely did not engage in backchanneling as often as in f2f interaction, and when selecting others as next speaker (“hand over the floor,” p. 420) they used questions more often than in f2f interactions. In another study, O’Malley et al. (1996) found that pairs of speakers, each performing a map-reading task remotely via an audio-visual communications setup, produced dialogues that were longer than those of pairs in an audio-only condition. In addition, when comparing interactions under f2f and video conditions, the researchers found that while both used visual cues to check understanding, speakers on video used eye gaze more often than those who were f2f. The authors hypothesized that users on video were less confident of mutual understanding than those in f2f mode and over-compensated by increasing the level of both verbal and nonverbal exchanges. Finally, in a later, longitudinal study, De Dreu et al. (2009) found that three-person groups in a discussion task under a video-teleconferencing condition took more time for turns, engaged in turn-taking less often, and interrupted each other less often than did groups on the same task in a f2f condition. However, these differences narrowed over time as speakers became more familiar with the video-teleconferencing mode.

More recently, with the onset of the pandemic, a number of studies have emerged which examine whether normal conversational interaction is altered in online environments, a focus that is relevant to the present study inasmuch as possible modifications or additions to the spoken norm will have implications for any model or framework that informs oral L2 assessments. An example is Boland et al. (2021), who examined how transmission lag in Zoom interactions disrupted normal conversational rhythms. They measured transition times between conversational turns over Zoom and in f2f interactions. Whereas transition times for f2f averaged 135 milliseconds (ms), those over Zoom averaged 976 ms. Another study is van Braak et al. (2021), which analyzed online interactional behavior in three online reflection sessions that were part of General Practitioner training in a Dutch medical school. Results showed a number of interactional strategies that differed from f2f interaction. For example, participants “explicitly addressed a specific feature of the online environment by establishing a norm for audio use at the start of one session” (p. 15); in the transition from one part of the discussion to another, they employed three practices unique to the online setting: self-selection as next speaker using an open mic policy, using the online chat function, and explicit teacher moderation. In short, these and other studies (e.g., Gibson, 2014; Oittengen, 2020; Seuren et al., 2020) suggest that online video environments can negatively affect interactions, yet speakers can find new strategies that take advantage of that environment to cooperatively manage interactions.
Given the above-mentioned findings regarding non-testing remote oral interactions, it may be reasonable to investigate whether interactions in remote assessment also evince features that differ somewhat from those in f2f oral test settings. Indeed, a few empirical studies have been conducted in recent years that investigate online L2 assessment procedures and/or results, sometimes comparing them with face-to-face versions of the same task. Two of these were by Craig and Kim (2010) and Kim and Craig (2021), which compared quantitative test results and examinee feedback after administering f2f and online versions of a one-on-one, oral interview test. Regarding the test scores, there was no statistical difference between the two modes, and the post-test interview results indicated that examinees were comfortable with either mode, though examinee pre-test anxiety was reported to be higher regarding the f2f test method.

Another recent study is Nakatsuhara et al. (2017), a mixed-methods study which examined the use of video-conferencing software in a mock IELTS speaking test. In addition to the collection of test scores, observations of examiners’ behavior were made, and examiners’ verbal comments on test-takers’ performances and hand-written reflection notes were analyzed. Results showed that between the two modes there were no statistically significant differences in test scores (in online mode, mean scores were negligibly lower), and there were few significant differences in the language functions uttered by the examinees aside from the functions of comparing and suggesting (used more f2f) and asking for clarification (used more online). In addition, there were

tester-reported differences in the use of [verbal and nonverbal] response tokens [nodding and backchanneling; more frequent online by some testers; less frequent by others], articulation [clearer online], and speed of their [own] speech [slower online], intonation, gestures [perceived as harder to do and read online; limited eye contact may have contributed to this difficulty], turn-taking [perceived as more difficult online], and requests for clarification [received from examinees]. (Nakatsuhara et al., 2017, p. 14)

Yet another study was conducted by Nakatsuhara et al. (2021). This was also a mixed-methods study, which investigated differences in ratings of IELTS speaking performances in f2f, recorded audio, and recorded video assessment conditions. Quantitative results showed that audio ratings were significantly lower than those in the other two modes. Also, testers, in commenting on examinee performances, tended to note negative performance features more often in the recorded audio and video modes than in the live f2f condition. For example, observing high-proficiency examinees uttering a response with confidence on video led one tester to speculate that such expressions of confidence might cause higher ratings due to a halo effect, whereas video images of low-proficiency examinees having trouble with a response might conceivably lead to lower scores. On the other hand, speaking errors on audio were perceived as worse than they were. None of these qualitative observations of negative features appear to have affected scores except
in the case of fluency, where observing struggling examinees appears to have led testers to exercise more patience with those examinees than usual, thus inflating fluency scores. According to Nakatsuhara et al. (2021), post hoc tester comments revealed that being able to observe the examinee on video aided tester comprehension of examinee utterances and allowed testers to pick up on nonverbal cues and to “understand with greater confidence the source of test-takers’ hesitation, pauses, and awkwardness.” The authors speculate that test-takers’ interactional competence (e.g., Pekarek Doehler, 2021) is more accessible in video recordings than on audio, and that this may have elevated the scores of test-takers in the video mode.
2.5 Discussion

While the above discussion of theoretical models of L2 ability and L2 assessment frameworks is in summary form, and while the survey of comparative f2f/online assessment studies is not comprehensive, it may be argued that enough has been presented to allow one to consider the usefulness of these models and frameworks in relation to online assessment. Indeed, as discussed, a number of language-use phenomena appearing in the studies possibly point to areas of online language behavior that differ from f2f conversational practices, and which in turn may require modification of the existing L2 ability models or L2 assessment frameworks. Among them were those in O’Conaill et al. (1993), namely that speakers interacting remotely did not engage in backchanneling as often as in f2f interaction and that, when selecting others to take the floor, they used questions more often than in f2f interactions. Similarly, Boland et al. (2021) showed that online transition times between turns were measurably delayed by technical difficulties. Also, van Braak et al. (2021) revealed that groups of speakers found ways to adapt to shortcomings in the online environment by establishing new conversational practices at the beginning of and during their pedagogical sessions. Further, in Nakatsuhara et al. (2017), there were differential proportions of various language functions, with asking for clarification used more often online, as well as differences in the use of verbal and nonverbal response tokens, phonetic articulation, speech speed, intonation, and gestures, and (as in the O’Conaill et al. study) an effect on turn-taking. Finally, in regard to Nakatsuhara et al. (2021), we may recall that post hoc tester reports indicated that viewing video recordings of examinee performance allowed testers to pick up on nonverbal cues that affected appraisal of test-takers’ interactive competence.

Some of the above phenomena – turn-taking, response tokens, identifying verbal and nonverbal cues – are analyzable within the overall competence components of discourse competence and strategic competence in the Celce-Murcia et al. scheme or in the textual and illocutionary competence of Bachman’s (1990) model. Thus, given the relatively small number of language-use phenomena that were noted in the studies, it could be argued that existing models and frameworks may have overall utility in L2 assessment, requiring only slight modifications.

However, a recurring theme of sorts appears among the models, frameworks, and studies which may point to a larger problem with the existing models
and frameworks. One may see this, in part, in terms used therein, such as "interactional knowledge," "interactional competence," or simply "interaction" (e.g., Celce-Murcia et al., 1995, p. 19; Nakatsuhara et al., 2021, p. 102; Council of Europe, 2001, p. 28). One may note here that "interaction," in these respective theoretical contexts, is but a component of a larger set, hence arguably a construct. In fact, interactional competence (commonly abbreviated to IC) may be seen as constituting yet another model of L2 ability. In the IC model, interaction has a specific meaning: the use of language to accomplish social actions across conversational turns and speakers. A "social action" in this regard may include complimenting someone, responding (or not responding) to a request, or apologizing. In this view, conversation becomes talk-in-interaction (Schegloff, 2007, p. xiii): something co-constructed by more than one interlocutor at a time. Thus, interaction is not exclusively a property or phenomenon generated by, or confined to, an individual speaker of a language, whether in f2f or online contexts.

This model of understanding language use (whether of an L1 or L2), which may also be termed a social-interactionist model, is closely intertwined with the development of the discipline of conversation analysis, or CA (e.g., Sacks et al., 1974; Schegloff, 2007), a field which examines how language use both reflects and maintains social relationships. In CA, interactional competence (IC) is understood as "the universal infrastructure underlying social interaction to which we as human beings orient to produce social order" (Hall, 2018, p. 30). In a similar vein, Firth and Wagner (1997) state, "Language is not only a cognitive phenomenon, the product of the individual's brain; it is also fundamentally a social phenomenon, acquired and used interactively, in a variety of contexts for myriad practical purposes" (p. 768). (See Hall, 2018, and Galaczi and Taylor, 2018, for more detailed accounts of the history of the notion of IC.)

While the model of interactional competence is not commonly represented by an easily recognizable graphical scheme such as those used by Canale and Swain (1980), Bachman (1990), or Celce-Murcia et al. (1995), its subcomponents may be summed up as here by Markee (2000):

[T]he notion of interactional competence minimally subsumes the following parts of the [Celce-Murcia et al. 1995] model: the conversational structure component of discourse competence, the non-verbal communicative factors component of sociocultural competence, and all of the components of strategic competence (avoidance and reduction strategies, achievement and compensatory strategies, stalling and time-gaining strategies, self-monitoring strategies, and interactional strategies). (p. 64)

(One may note here in passing that under Markee's formulation, IC is analyzably a model, composed of various components/constructs, in contrast with the characterization of "interactive competence" under the CEFR framework, where it is arguably a component/construct.) The relevance of the IC model to the present discussion is that the aforementioned L2 models and frameworks embody the
phenomenon of IC poorly, if at all. As Huth points out in his 2021 article (which may well be regarded as seminal in the subfield of L2 assessment research), the CEFR and ACTFL frameworks, and the educational institutions that rely on them, do not well reflect the reality "that human communication is co-constructed"; instead, these frameworks and institutions reflect "primarily a view of language as a psycho-social construct contained in a single learner's mind" (p. 368). It is true, however, that across these models and frameworks, some integral mention of "interaction" is made; for example, the CEFR Table 3, referred to earlier, posits a facet of "Interaction" across the six proficiency levels, suggesting some alignment with CA empirical findings regarding IC. Nonetheless, the bulk of the language within the CEFR framework, in particular that of the Global Scale, is heavily weighted toward a view of the L2 which is primarily focused on the proficiency of an isolated individual. Moreover, it is unclear whether the level descriptors in the "Interaction" column of CEFR Table 3 actually synchronize with the gradations of relative ability in actual L2 learners. As Walters (2021) points out, the empirical picture of what it means for an L2 learner to be at a particular level of IC (the roughly equivalent term used in his study is L2 pragmatic competence) is thus far incomplete, though this area of research is lively and ongoing (e.g., Pekarek Doehler, 2021). Thus, the "Interaction" descriptors may have at best partial empirical support, and this insufficiency of support may be regarded as a threat to test validity (Messick, 1989). A similar observation may be made with regard to the above-mentioned L2 models in general, inasmuch as they too are largely informed by the view of "language proficiency [being] a primarily cognitive-psychological construct located in individual learner minds" (Huth, 2021, p. 368), a paradigm within which dyadic performances of talk-in-interaction (Schegloff, 1986), which are perhaps the very engine of second language acquisition (Markee, 2000; Pekarek Doehler, 2021), are largely invisible.
2.6 Conclusions

The general findings from this survey, then, indicate both strengths and shortcomings among the several models and frameworks from the standpoint of validity in online assessment. On the one hand, the earlier models – Canale and Swain's, Bachman's (and Bachman and Palmer's), Celce-Murcia et al.'s – still seem to possess utility for many test purposes and constructs/components, perhaps requiring only slight adaptations by a test designer. For example, an online L2 teacher, contemplating the findings of Nakatsuhara et al. (2017) regarding the functions of comparing and suggesting and asking for clarification, might, when designing a scoring scale derived from Bachman's (1990) listing of components of illocutionary competence (pp. 87–8), supplement his or her rating descriptors with a mention that the first two tend to be used more often f2f and that the third may be used more often online. This would be a relatively minor modification to the model, one that might ensure slightly more valid inferences about those subskills (Messick, 1989) vis-à-vis comparable test tasks otherwise used and scored in f2f L2 test settings. (Such a slight modification of a model component and of any test
task derived from it would possibly also have implications for positive washback in online L2 instruction.) On the other hand, this survey has also suggested that L2 assessments based on non-social-interactionist models and frameworks may not easily tease apart a learner's ability along an as-yet-unknown continuum of IC – whether in f2f or online modes. This conclusion may well seem unexpected in a study that initially set out merely to appraise the usefulness of a few models and frameworks in regard to online assessment, only to indicate a general shortcoming of these regardless of assessment mode. One might therefore suggest that this finding is but one among several that have emerged as communities across the planet have been forced by the pandemic to re-examine professional practices in such diverse areas as public health, economic relationships, and education.

In any event, it is suggested that the superficial treatment that the above models and frameworks give to IC may limit their use as foundations for L2 assessment, despite the strengths mentioned above. It may be that richer and more valid theoretical bases for the construction of online (and f2f) measures of L2 competence await further integration of empirical findings between CA and L2 assessment research.
References

American Council on the Teaching of Foreign Languages (2012). ACTFL proficiency guidelines 2012. Retrieved June 1, 2022, from https://www.actfl.org/resources/actfl-proficiency-guidelines-2012
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice. Oxford University Press.
Boland, J. E., Fonseca, P., Mermelstein, I., & Williamson, M. (2021). Zoom disrupts the rhythm of conversation. Journal of Experimental Psychology: General. Retrieved June 1, 2022, from https://doi.org/10.1037/xge0001150
Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller (Ed.), Issues in language testing research (pp. 333–342). Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Celce-Murcia, M., Dörnyei, Z., & Thurrell, S. (1995). Communicative competence: A pedagogically motivated model with content specifications. Issues in Applied Linguistics, 2, 5–35.
Council of Europe (2001). Common European framework of reference for languages: Learning, teaching, assessment. Retrieved June 1, 2022, from https://rm.coe.int/1680459f97
Craig, D. A., & Kim, J. (2010). Anxiety and performance in videoconferences and face-to-face oral interviews. Multimedia-Assisted Language Learning, 13(3), 9–32.
Davidson, F., & Lynch, B. K. (2001). Testcraft: A teacher's guide to writing and using language test specifications. Yale University Press.
De Dreu, C. K. W., van der Kleij, R., Schraagen, J. M., & Werkhoven, P. J. (2009). How conversations change over time in face-to-face and video-mediated communication. Small Group Research, 40(4), 355–381.
Douglas, D. (2000). Assessing languages for specific purposes. Cambridge University Press.
Firth, A., & Wagner, J. (1997). On discourse, communication, and (some) fundamental concepts in SLA research. The Modern Language Journal, 81, 285–300.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. Routledge.
Galaczi, E., & Taylor, L. (2018). Interactional competence: Conceptualizations, operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3), 219–236.
Gibson, W. J. (2014). Sequential order in multimodal discourse: Talk and text in online educational interaction. Discourse and Communication, 8, 63–83.
Hall, J. K. (2018). From L2 interactional competence to L2 interactional repertoires: Reconceptualizing the objects of L2 learning. Classroom Discourse, 9(1), 25–39.
Hopgood, B. (2001). History of the Web. Retrieved June 1, 2022, from https://www.w3.org/2012/08/history-of-the-web/origins.htm
Huth, T. (2021). Conceptualizing interactional learning targets for the second language curriculum. In S. Kunitz, N. Markee, & O. Sert (Eds.), Classroom-based conversation analytic research: Theoretical and applied perspectives on pedagogy (pp. 359–381). Springer.
Kim, J., & Craig, D. A. (2012). Validation of a videoconferenced speaking test. Computer Assisted Language Learning, 25(3), 257–275. https://www.tandfonline.com/doi/full/10.1080/09588221.2011.649482
Markee, N. (2000). Conversation analysis. Lawrence Erlbaum.
Meredith, J. (2019). Conversation analysis and online interaction. Research on Language and Social Interaction, 52(3), 241–256.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Collier Macmillan Publishers.
Nakatsuhara, F., Inoue, C., & Taylor, L. (2021). Comparing rating modes: Analysing live, audio, and video ratings of IELTS speaking test performances. Language Assessment Quarterly, 18(2), 83–106. https://doi.org/10.1080/15434303.2020.1799222
Nakatsuhara, F., Inoue, C., Berry, V., & Galaczi, E. (2017). Exploring the use of videoconferencing technology in the assessment of spoken language: A mixed-methods study. Language Assessment Quarterly, 14(1), 1–18. https://doi.org/10.1080/15434303.2016.1263637
Oittinen, T. (2020). Noticing-prefaced recoveries of the interactional space in a video-mediated business meeting. Social Interaction, 3(3), n.p. https://jyx.jyu.fi/handle/123456789/72799
O'Conaill, B., Whittaker, S., & Wilbur, S. (1993). Conversations over video conferences: An evaluation of the spoken aspects of video-mediated communication. Human-Computer Interaction, 8(4), 389–428.
O'Malley, C., Langton, S., Anderson, A., Doherty-Sneddon, G., & Bruce, V. (1996). Comparison of face-to-face and video-mediated interaction. Interacting with Computers, 8(2), 177–192.
Paulus, T., Warren, A., & Lester, J. N. (2016). Applying conversation analysis methods to online talk: A literature review. Discourse, Context and Media, 12, 1–10.
Pekarek Doehler, S. (2018). Elaborations on L2 interactional competence: The development of L2 grammar-for-interaction. Classroom Discourse, 9(1), 3–24.
Pekarek Doehler, S. (2021). Toward a coherent understanding of L2 interactional competence: Epistemologies of language learning and teaching. In S. Kunitz, N. Markee, & O. Sert (Eds.), Classroom-based conversation analytic research: Theoretical and applied perspectives on pedagogy (pp. 19–33). Springer.
Powell, A. (2022, January 20). Omicron optimism and shift from pandemic to endemic. Retrieved June 1, 2022, from https://news.harvard.edu/gazette/story/2022/01/optimism-on-omicron-shift-from-pandemic-to-endemic
Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking in conversation. Language, 50, 696–735.
Schachter, J. (1990). Communicative competence revisited. In B. Harley, P. Allen, J. Cummins, & M. Swain (Eds.), The development of second language proficiency. Cambridge University Press.
Schegloff, E. (1986). The routine as achievement. Human Studies, 9, 111–151.
Schegloff, E. (2007). Sequence organization in interaction: A primer in conversation analysis, Vol. 1. Cambridge University Press.
Seuren, L. M., Wherton, J., Greenhalgh, T., Cameron, D., & Shaw, S. E. (2020). Physical examinations via video for patients with heart failure: Qualitative study using conversation analysis. Journal of Medical Internet Research, 22. Retrieved June 1, 2022, from https://pubmed.ncbi.nlm.nih.gov/32130133/
UNESCO (2022). The COVID-19 pandemic has changed education forever. This is how. Retrieved June 1, 2022, from https://en.unesco.org/covid19/educationresponse#schoolclosures
van Braak, H. M., Schaepkens, S., & Veen, M. (2021). Shall we all unmute? A conversation analysis of participation in online reflection sessions for general practitioners in training. Languages, 6(2), 72. Retrieved June 1, 2022, from https://doi.org/10.3390/languages6020072
Walters, F. S. (2021). Some considerations regarding validation in CA-informed oral testing for the L2 classroom. In S. Kunitz, N. Markee, & O. Sert (Eds.), Classroom-based conversation analytic research: Theoretical and applied perspectives on pedagogy (pp. 383–404). Springer.
Yin, C., & Mislevy, R. J. (2022). Evidence-centered design in language testing. In G. Fulcher & L. Harding (Eds.), The Routledge handbook of language testing (2nd ed.). Routledge.
3

Exploring the role of self-regulation in young learners' writing assessment and intervention using BalanceAI automated diagnostic feedback

Eunice Eunhee Jang, Melissa Hunte, Christie Barron, and Liam Hannah
3.1 Introduction

3.1.1 Young students' writing development during the COVID-19 pandemic

The worldwide COVID-19 pandemic and subsequent school shutdowns profoundly impacted students' learning. More than 1.6 billion students have been affected by the shutdowns worldwide (UNESCO, 2021), accounting for over 90% of school-aged students globally. There is evidence of a more significant impact of school closures on students with lower socioeconomic backgrounds, newcomers, and students with exceptionalities (Gallagher-Mackay et al., 2021). Students from low-income families were more likely to choose online schooling due to their concerns about higher exposure to COVID-19, which may be associated with higher household density and a higher proportion of essential workers in families (Choi & Denice, 2020; Gilbert et al., 2020). Several studies show more significant losses for students learning remotely (Kogan & Lavertu, 2021), younger students (Blainey & Hannay, 2021), racialized students (Kuhfeld et al., 2020), English language learners (ELLs) (Pier et al., 2021), and students with special educational needs (Juniper Education, 2021). For example, in Canada, Ontario's provincial achievement assessment data show that ELLs lag behind by three times as many months, as social isolation limited their social integration and they lacked parental support for schoolwork (Gallagher-Mackay et al., 2021). In addition, a series of studies on educational recovery in UK schools reveals significant delays in young students' written and oral language skills (Howard et al., 2021). These language skills require hands-on support involving specific feedback on the mechanics of writing, coherence, and organization.

However, pandemic disruptions and transitions to virtual learning made it difficult for teachers to support students' writing skill development. Most teachers were faced with unforeseen demands for improvisation and adaptation at the beginning of the pandemic. For instance, teachers in Australia reported that, due to sudden shifts
to online learning, they could not find time to provide personalized feedback on students' writing development (Ewing & Cooper, 2021).

Transitioning from in-person learning to virtual distance learning during school closures placed greater demands on students' ability to regulate their learning. Although it may be too early to fully appreciate the impact of the COVID-19 pandemic on learning, growing research shows that the impact is not the same for all students. Student self-regulation may have played a critical role in differential learning experiences and outcomes. Research shows that, during the lockdown, students experienced difficulty identifying and setting learning goals, organizing activities, coping with distance learning requirements, completing homework on their own, regulating overwhelming emotions (Huber & Helm, 2020; Letzel et al., 2020), and maintaining energy and persistence in learning (Grewenig et al., 2020). Research further reveals that students from disadvantaged families have experienced significantly more barriers to learning due to fewer resources and less support at home (Song et al., 2020).

Writing is a complex cognitive and metacognitive skill that requires self-regulated learning (SRL) approaches in order to coordinate a linguistic repertoire with metacognitive control. SRL allows learners to automate the process of planning, executing, monitoring, and evaluating learning to achieve goals (Zimmerman & Schunk, 2011). A student's SRL is reciprocally associated with writing ability, acting as both a causal and a consequential variable (Perry et al., 2018). It is also developmentally sensitive. Therefore, understanding such a reciprocal relationship between SRL and writing ability for young learners requires a careful consideration of intrapersonal and environmental factors, such as parental and teacher support, even before formal schooling starts (Usher & Schunk, 2017).

3.1.2 Supporting students through technology-rich diagnostic assessment

Given the pandemic's disruptions to students' learning, educators reported significant shifts in assessment practices toward formative assessment (UK Department for Education, 2022) and recognize the critical role that it can play. With formative assessment, teachers can gather baseline information, assess how individual students learn, adapt teaching approaches, and monitor how teaching supports students' learning. Technologies have become essential for actualizing the potential benefits of formative assessment as the educational system was forced to adapt to digital learning. As a result, we have witnessed rapid advances in technological innovations in classroom assessment. These innovations enable assessments to simulate authentic learning tasks that can elicit less restricted oral speech (Evanini et al., 2015) and written performance (Burstein et al., 2004). They are infused with authentic audio and video stimuli (Wagner, 2008), spontaneous performance modes, adaptive scaffolding (Azevedo et al., 2004), and simultaneous assessment of integrative skills through multidimensional scoring and profiling models (Jang, 2005; Leighton & Gierl, 2007).
Furthermore, automated real-time data processing allows for timely and adaptive feedback for students. Previous research on feedback has focused chiefly on the impact of feedback formats (Ellis et al., 2006; Ferris, 2010). Little research, other than work on dynamic assessment (Leung, 2007), has addressed the scaffolding mechanisms of feedback in assessing productive skills in technology-rich environments. Mediation through feedback during task performance can be distinct from end-of-task feedback as it provides simultaneous diagnosis and scaffolding (Lantolf & Poehner, 2011). Scaffolding is a method that incorporates various teaching and learning strategies. It can be self-driven or driven by others (Holton & Clarke, 2006). In particular, dynamic scaffolding requires a coupling between two levels of change: the competence level in a student and the level of competence embedded in the scaffolding (van Geert & Steenbeek, 2005). Technology-rich assessments can afford rich conditions for dynamic scaffolding between mediating and learning agents by synchronizing the time scale between the actual scaffolding process and subsequent learning.
3.2 BalanceAI assessment program

BalanceAI is a digital assessment designed to provide diagnostic information that can be used for dynamic scaffolding between teachers and elementary students. It comprises age-appropriate, multi-trait, and multi-method tasks designed to assess learning orientations, literacy and oral language skills, and cognitive capacities such as working memory and fluid reasoning (Sinclair et al., 2021). Overall, diagnostic markers are categorized into four domains: acoustic, linguistic, cognitive, and psychological. Linguistic features represent both bottom-up and top-down processing and production of information. Acoustic features represent prosodic characteristics of oral speech samples collected from oral tasks. Cognitive features represent auditory and visual perception and working memory, which are highly associated with learning ability. These cognitive features are used as covariates to control for their interaction with linguistic features. Psychological features represent the "big four" learning orientations: self-efficacy in reading and writing (Bandura, 1999), goal orientation (Dweck, 1986), self-regulation (Zimmerman & Schunk, 2011), and grit, also known as perseverance (Duckworth et al., 2007). Psychological features are used to identify distinct latent profiles.

Linguistic skills are elicited by cognitively engineered and psychometrically calibrated tasks using cognitive diagnostic models involving task-by-skill specifications and psychometric classification methods. That is, assessment tasks were designed using Q matrices informed by theories of child development and curricular expectations in school learning, and they were calibrated psychometrically using Item Response Theory (IRT) and Cognitive Diagnostic Modelling (CDM).

The BalanceAI assessment comprises various digital tasks involving different response modes, such as self-reports, selected responses, and performance. For performance tasks, students are asked to write essays, tell stories based on what they hear and read, describe a picture strip, and read a short story aloud. Reading
comprehension testlets use selected-response item types, while psychological trait inventories use self-reports. These multimodal data are processed through natural language processing (NLP) and machine-learning techniques. Real-time feedback is delivered through user dashboards for teachers and students respectively and can be used to facilitate both self-directed and expert-mediated scaffolding.

BalanceAI's conceptual framework of scaffolding encompasses three core characteristics of dynamic complexity theory (Larsen-Freeman, 2011; Wolf-Branigin, 2013). First, its overarching goal is to promote individual students' self-organizing agency (Mainzer, 2007). Similar to SRL, self-organization is an iterative process through which learners engage with learning-oriented tasks, receive real-time feedback, constantly consider options and constraints, and assemble resources dynamically (Thelen & Smith, 1994). Self-organizing learners can initiate and organize their learning, identify their own goals, raise awareness of their strengths and weaknesses, actively seek feedback, and regulate learning strategies. BalanceAI scaffolding plays a central role in supporting the iterative self-organizing process. While the human mediator can offer expert scaffolding to promote skill development (Holton & Clarke, 2006), BalanceAI's automated feedback can facilitate self-scaffolding through human-to-machine interactions, thereby heightening the learner's metacognitive awareness and SRL.

Second, BalanceAI's dynamic scaffolding supports co-adaptation. Human learning is analogous to a complex system, comprising the coupling and co-adaptation of many interacting agents. BalanceAI engages multiple agents by providing them with critical information about what the learner can do, how the learner responds to mediations, and what change emerges from the mediations. As learners interact with their teacher or tutor, their cognitive, affective, and metacognitive resources are altered and they experience the coupling of complex systems (Thelen & Smith, 1994). Actions taken in response to various task features lead to emerging patterns in learning. This iterative co-adaptation process prompts the emergence of stable patterns from local learning.

Lastly, students' initial conditions matter significantly. Complex systems are chaotic and dynamic. Change in learning occurs gradually in nonlinear patterns and is not necessarily proportional to intervention dosage, owing to sensitivity to initial conditions. This sensitivity is known as the butterfly effect: small changes in the initial state lead over time to a more significant and unpredictable impact (Lorenz, 1963). The automated diagnostic assessment in BalanceAI provides critical information about individual students' initial conditions, and its subsequent scaffolding affords changes to the initial conditions over time.
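To make the task-by-skill (Q matrix) specification mentioned in the task-design description above more concrete, the following minimal sketch shows what such a matrix looks like in code. It is an illustration only: the task and skill labels are hypothetical placeholders rather than BalanceAI's actual specification, and the snippet simply displays the matrix with pandas.

import pandas as pd

# A Q matrix records which skills each task is assumed to require (1) or not (0).
# Task and skill names are hypothetical placeholders, not BalanceAI's actual design.
skills = ["idea_development", "organization", "vocabulary", "grammar"]
q_matrix = pd.DataFrame(
    [[1, 1, 0, 1],   # essay task
     [0, 1, 1, 0],   # picture-strip description
     [1, 0, 1, 1]],  # story retell
    index=["essay", "picture_strip", "story_retell"],
    columns=skills,
)
print(q_matrix)
# A cognitive diagnostic model combines these assumed task-skill links with
# examinees' response patterns to classify each learner as a master or
# non-master of each skill.

In practice, as described above, such a specification would be built from theories of child development and curricular expectations, and the calibration itself would be carried out with IRT and CDM estimation tools rather than a simple display like this.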
3.3 Study overview

In the wake of the COVID-19 pandemic, the BalanceAI assessment program has been used in various settings, including formal classrooms, one-to-one tutoring, and community outreach programs supporting newcomers and refugee children. As part of a larger research project, the present study closely examined the
reciprocal relationship between SRL and students' writing skill development during the COVID-19 pandemic. We paid particular attention to two focal groups of students: ELLs and students with learning difficulties. Students' initial conditions were statistically modelled. Building on the resulting SRL and writing skill profiles, we further examined the emergent self-regulation and writing ability patterns through diagnostic scaffolding over one-to-one tutoring sessions. Real-time diagnostic feedback served as a mechanism for co-adaptation. This study was guided by the following research questions:

1. What are the characteristics of students' self-reported SRL and writing during the COVID-19 pandemic?
2. How does the relationship between SRL and writing ability uniquely differ for ELLs and students with learning difficulties?
3. How does dynamic scaffolding offered over individual tutoring sessions further support students' SRL and writing ability?
3.4 Method

3.4.1 Participants

The study participants consisted of 361 students in grades 3 to 8 (177 identified as male, 168 identified as female, and 17 identified as non-binary or other). Among the 361 students, 121 were in grade 6, 95 were in grade 4, 79 were in grade 5, 59 were in grade 3, five were in grade 8, and two were in grade 7. Sixty-two students were classified as ELLs. Given the complex nature of students' language backgrounds, we considered students' self-reported first language status and self-rated language proficiency levels. A total of 62 students reported that they were more proficient in other languages than in English.

Sixty-five students were identified as having a learning difficulty (LD). LDs represented a range of conditions that affected students' abilities to take in, store, or use information. Students were identified as having an LD in three ways: first, if parents indicated their child had an individual education plan (IEP) for a reason other than being identified as "gifted." The local school district uses IEPs to identify and accommodate exceptional students. Second, if parents indicated their child was in a special education program, and third, if parents indicated their child had an LD, whether officially diagnosed or not.

Among the 361 study participants, 25 students participated in the one-to-one BalanceAI tutoring program. Each tutee was paired with a trained BalanceAI tutor and interacted with the tutor over eight to ten sessions. Table 3.1 presents the study participants by grouping variables.

3.4.2 Measures and data collection

3.4.2.1 Writing ability

Students were presented with a three-minute video stimulus on children's use of social media and were instructed to write an essay describing "some good and bad things about social media and suggest how children should use it."
Table 3.1 Study Participants' Demographics

          Grade 3   Grade 4   Grade 5   Grade 6   Grade 7   Grade 8   Total
ELL       3         26        19        11        0         3         62
LD        6         11        17        28        1         2         65
Tutees    5         5         4         6         1         4         25
Total     59        95        79        121       2         5         361
Figure 3.1 Screenshots of BalanceAI writing task with scaffolding steps.
Figure 3.1 depicts how BalanceAI scaffolds students' writing by asking them to first write down questions about the topic, then outline their writing by brainstorming "some good and bad things about using social media." They were then instructed to use examples from the video and their ideas from the planning section to write a response with an introduction, body, and conclusion. They were also encouraged to use complete sentences and paragraphs and write as much as possible with no time limit. Finally, upon submission, students were prompted one final time to check and edit their writing before their final submission. Once students submitted their essays, they were asked to reflect on and self-assess their performance on a five-point Likert scale ranging from "definitely not" to "yes definitely."

3.4.2.2 Self-regulated learning (SRL)

Participating students completed an SRL self-assessment. The SRL self-assessment tool consists of ten statements concerning students' self-assessment of their
ability to plan, monitor, and evaluate their learning on a five-point Likert scale ranging from "definitely not" to "yes definitely." The measure had high internal reliability (α = 0.85). A confirmatory factor analysis revealed that a three-factor model fit moderately well: χ2(31) = 95.26, p < 0.001, comparative fit index = 0.98, Tucker–Lewis index = 0.97, standardized root mean square residual = 0.03, root mean square error of approximation = 0.08 (90% CI = 0.06, 0.10). Items had high factor loadings and communalities, with one theoretically justifiable residual covariance and high latent variable correlations.

3.4.2.3 One-to-one tutoring sessions

BalanceAI's individualized tutoring was delivered synchronously via video conferencing. Data regarding language backgrounds and learning needs were collected from parents during registration, and trained tutors integrated knowledge of students' initial states to determine session length. Tutoring sessions ranged from 30 to 60 minutes, and all tutees progressed through the BalanceAI tasks in the same order. Task performance data were automatically collected and scored by the BalanceAI platform. At the start of each session, tutees created and reviewed learning goals with their tutors. After each session, tutors completed a sessional report documenting changes in the tutees' attitudes, learning behaviours, learning goals, reflections on learning, levels of enjoyment, and responses to feedback. At the end of the program, parents were asked to provide feedback on improvements they noticed relating to students' acoustic, linguistic, cognitive, and learning skills.

3.4.3 Analysis

3.4.3.1 Scoring written essays

Ten human raters scored students' writing samples using an analytic rubric with four categories: idea development, organization and coherence, vocabulary and expression, and grammatical range and accuracy. Each writing sample was rated by two raters on a four-point rating scale. Building on the human scoring, machine learning (ML) models were engineered, trained, and cross-validated to predict human rating scores on a sample of the data. The ML models were first built using 27 writing features extracted using NLP. Feature extraction was conducted in Python (version 3.7) using both the spaCy (version 2.2; Honnibal et al., 2020) and NLTK (version 3.4.5; Bird et al., 2009) packages. A random forest classifier from the scikit-learn toolkit (version 0.22; Pedregosa et al., 2011) used the extracted writing features to predict the human scores.

Human inter-rater reliability and human–machine inter-rater reliability are presented in Table 3.2. Quadratic weighted kappa (QWK) compares the similarity of two raters' scores, accounting for agreement by chance alone. According to Ramineni et al. (2012), human–human and human–machine reliability must exceed a QWK and a Pearson's correlation coefficient (PCC) of 0.7, and human–machine reliability must be within 0.1 of the human–human reliability. They also suggest having a standardized mean score difference (d) that does not exceed an absolute value of 0.15.
Table 3.2 Comparison of Human–Human and Human–Machine Reliability

        Human–Human                   Human–Machine
Task    QWK      PCC      d           QWK      PCC      d
TF      0.744    0.758    −0.175      0.806    0.810    −0.003
OC      0.712    0.720    −0.143      0.791    0.798    −0.048
VE      0.668    0.690    −0.255      0.781    0.787    0.049

Note: QWK = quadratic weighted kappa, PCC = Pearson correlation coefficient, d = standardized mean difference.
Table 3.2 indicates that some human–human reliability metrics fall below these benchmarks, namely QWK, PCC, and d for vocabulary and expression, and d for task fulfilment. The human–machine metrics all show strong alignment.

3.4.3.2 Latent profile analyses

Latent profile analysis (LPA) was conducted to identify latent profiles that demonstrate distinct relationships between SRL and writing ability. LPA is a model-based method of identifying latent subpopulations of relatively homogeneous profiles in observed responses (Nylund et al., 2007). The LPA analyses in this study are exploratory, which coincides with complexity theory in that it captures the heterogeneity and nuance of the multivariate relationship between diverse students' SRL and writing. LPA models with two to six latent profiles were run in Mplus version 8.4 using a robust maximum likelihood estimator (Muthén & Muthén, 1998–2017). Item means and variances were freely estimated across profiles. LPA was conducted using z-scores of three SRL factors (planning, monitoring, and evaluating) extracted from a three-factor confirmatory factor analysis and three machine-scored analytic writing scores (task fulfilment, organization, and vocabulary). Dummy-coded LD and ELL variables were included as covariates using Vermunt's three-step procedure (Asparouhov & Muthén, 2014). Various statistical criteria (e.g., the Bayesian information criterion [BIC] and the bootstrapped likelihood ratio test [BLRT]) and substantive criteria (e.g., interpretability, number of students in each profile) were used to determine the optimal number of latent profiles.
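As a rough illustration of the scoring pipeline and agreement checks described in 3.4.3.1, the sketch below extracts a few simple surface features with spaCy, fits a random forest classifier to predict human ratings, and computes QWK, PCC, and the standardized mean difference d. It is a toy example: the essays, scores, and four features below are placeholders standing in for the study's rated sample and its 27 NLP features, and it makes no claim to reproduce the actual BalanceAI models.

import numpy as np
import spacy
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

nlp = spacy.blank("en")          # lightweight pipeline; no model download required
nlp.add_pipe("sentencizer")

def extract_features(text):
    """A few simple surface features standing in for the full 27-feature set."""
    doc = nlp(text)
    tokens = [t for t in doc if t.is_alpha]
    n_tokens = max(len(tokens), 1)
    n_sents = max(len(list(doc.sents)), 1)
    n_types = len({t.lower_ for t in tokens})
    return [
        n_tokens,                                      # essay length
        n_types / n_tokens,                            # type-token ratio
        n_tokens / n_sents,                            # mean sentence length
        float(np.mean([len(t) for t in tokens])),      # mean word length
    ]

# Placeholder data: in the study, essays and adjudicated 1-4 human ratings
# would come from the rated sample of student writing.
base = "Social media helps children talk to friends, but it can also waste their time. "
human_scores = [1, 2, 2, 3, 3, 3, 4, 4]
essays = [base * s for s in human_scores]   # longer toy essays stand in for higher-rated ones

X = np.array([extract_features(e) for e in essays])
y = np.array(human_scores)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
machine = model.predict(X)

# Agreement metrics of the kind benchmarked in Table 3.2.
qwk = cohen_kappa_score(y, machine, weights="quadratic")
pcc, _ = pearsonr(y, machine)
pooled_sd = np.sqrt((y.var(ddof=1) + machine.var(ddof=1)) / 2)
d = (machine.mean() - y.mean()) / pooled_sd            # standardized mean difference
print(f"QWK = {qwk:.3f}, PCC = {pcc:.3f}, d = {d:.3f}")

The benchmarks cited from Ramineni et al. (2012) would then be applied to statistics of this kind, as summarized in Table 3.2.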
3.5 Results

3.5.1 Relationship between SRL and writing ability during COVID-19

Overall, as shown in Table 3.3, the five-profile model solution provided a richer and more theoretically justifiable depiction of students' SRL and writing ability than the competing solutions. The entropy value was 0.90, indicating acceptable classification accuracy.
Table 3.3 LPA Model Fit Statistics

Number of classes   BIC       Adjusted BIC   BLRT p-value   VLMR p-value   Entropy
2                   5296.76   5226.97        0.000          0.000          0.94
3                   5029.77   4928.24        0.000          0.013          0.94
4                   4925.18   4791.93        0.000          0.118          0.91
5                   4876.89   4711.92        0.000          0.050          0.90
6                   4901.01   4704.31        0.156          0.926          0.91
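The model-comparison logic behind Table 3.3 can be approximated with open-source tools: an LPA with freely estimated means and variances is closely related to a Gaussian mixture model with diagonal covariances, so fitting mixtures with two to six components and comparing BIC, alongside a normalized entropy value, mirrors the selection procedure. The sketch below is only a rough analogue run on simulated placeholder data; the study itself fitted the models in Mplus 8.4 with a robust maximum likelihood estimator and additionally used the BLRT and VLMR tests.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated stand-ins for the six z-scored indicators: three SRL factor scores
# (planning, monitoring, evaluating) and three writing scores (task fulfilment,
# organization, vocabulary).
n = 361
centers = rng.normal(0, 1, size=(5, 6))       # five "true" profiles in the simulation
labels = rng.integers(0, 5, size=n)
X = centers[labels] + rng.normal(0, 0.5, size=(n, 6))

for k in range(2, 7):
    gmm = GaussianMixture(
        n_components=k,
        covariance_type="diag",   # analogous to freely estimated item variances
        n_init=10,
        random_state=0,
    ).fit(X)
    resp = gmm.predict_proba(X)
    # Normalized entropy, as reported alongside BIC in Table 3.3.
    entropy = 1 - (-np.sum(resp * np.log(resp + 1e-12)) / (n * np.log(k)))
    print(f"profiles={k}  BIC={gmm.bic(X):.1f}  entropy={entropy:.2f}")

In the study's actual Mplus results (Table 3.3), BIC was lowest for five profiles and the BLRT was no longer significant at six, which, together with interpretability, supported the five-profile solution.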
Figure 3.2 Estimated means on the SRL and writing skills based on a five-profile model with unique item variances.
Figure 3.2 depicts the five profiles from the resulting LPA solution. Students in profile 1 represented about 15% of the sample (n = 57). They were characterized by below-average self-reported SRL and the lowest writing performance in task fulfilment, organization, and vocabulary on the writing task. Profile 2 represented 26% of the total sample (n = 94); these students' SRL and writing were slightly below average, though their writing ability was slightly better than that of profile 1. Profile 3 included 9% of the sample (n = 30), who self-rated their SRL as high despite very low writing performance. Though the smallest group, profile 3 appeared to exhibit an inflated view of SRL. Profile 4 included 20% of the sample (n = 71), who reported moderately high SRL and had average writing performance. Finally, profile 5 included 30% of the sample (n = 110); these students demonstrated average SRL and above-average writing ability. These results suggest latent profiles that mainly differ in the alignment of reported SRL and writing performance rather than in within-construct differences.

3.5.2 Relationship between SRL and writing ability for ELLs and students with LD

Next, ELL and LD status were incorporated into the latent profile analysis as covariates. Pairwise comparisons were used to examine the likelihood of profile
Table 3.4 Multinomial Logistic Regression of Latent Profile Membership by ELL and LD

Profile          Covariate               Logit estimate   Logit p-value   Odds ratio   95% CI lower   95% CI upper
Profile 1 vs 5   Intercept (Reference)   0.77             0.507
                 LD                      0.47             0.324           1.60         0.63           4.09
                 ELL                     −1.83*           0.038           0.16         0.03           0.90
Profile 1 vs 4   Intercept (Reference)   1.42             0.239
                 LD                      0.28             0.600           1.32         0.46           3.79
                 ELL                     −1.87*           0.037           0.15         0.03           0.90

Note: LD = learning difficulties, ELL = English language learner, * = p
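The odds ratios in Table 3.4 are simply the exponentiated logit estimates; the short check below reproduces the reported values, with the logits taken directly from the table.

import math

# Odds ratio = exp(logit estimate); logit values taken from Table 3.4.
for label, logit in [("LD, profile 1 vs 5", 0.47), ("ELL, profile 1 vs 5", -1.83),
                     ("LD, profile 1 vs 4", 0.28), ("ELL, profile 1 vs 4", -1.87)]:
    print(f"{label}: exp({logit}) = {math.exp(logit):.2f}")
# Prints 1.60, 0.16, 1.32, and 0.15, matching the odds-ratio column.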