GLOBAL PERSPECTIVES ON LANGUAGE ASSESSMENT
The sixth volume in the Global Research on Teaching and Learning English series offers up-to-date research on the rapidly changing field of language assessment. The book features original research with chapters reporting on a variety of international education settings from a range of diverse perspectives. Covering a broad range of key topics—including scoring processes, test development, and student and teacher perspectives—contributors offer a comprehensive overview of the landscape of language assessment and discuss the consequences and impact for learners, teachers, learning programs, and society. Focusing on the assessment of language proficiency, this volume provides an original compendium of cutting-edge research that will benefit TESOL and TEFL students, language assessment scholars, and language teachers.

Spiros Papageorgiou is Managing Senior Research Scientist at Educational Testing Service.

Kathleen M. Bailey is Professor at Middlebury Institute of International Studies at Monterey, USA.
GLOBAL RESEARCH ON TEACHING AND LEARNING ENGLISH
Co-published with The International Research Foundation for English Language Education (TIRF)
Kathleen M. Bailey & Ryan M. Damerow, Series Editors

Crandall & Bailey (Eds.), Global Perspectives on Language Education Policies
Carrier, Damerow, & Bailey (Eds.), Digital Language Learning and Teaching: Research, Theory, and Practice
Crandall & Christison (Eds.), Teacher Education and Professional Development in TESOL: Global Perspectives
Christison, Christian, Duff, & Spada (Eds.), Teaching and Learning English Grammar: Research Findings and Future Directions
Bailey & Damerow (Eds.), Teaching and Learning English in the Arabic-Speaking World
Papageorgiou & Bailey (Eds.), Global Perspectives on Language Assessment: Research, Theory, and Practice

For additional information on titles in the Global Research on Teaching and Learning English series, visit www.routledge.com/books/series/TIRF
GLOBAL PERSPECTIVES ON LANGUAGE ASSESSMENT
Research, Theory, and Practice
Edited by Spiros Papageorgiou and Kathleen M. Bailey
A co-publication with The International Research Foundation for English Language Education (TIRF)
TIRF
The International Research Foundation for English Language Education
First published 2019 by Routledge 52 Vanderbilt Avenue, New York, NY 10017 and by Routledge 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN Routledge is an imprint of the Taylor & Francis Group, an informa business © 2019 Taylor & Francis The right of Spiros Papageorgiou and Kathleen M. Bailey to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers. Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Library of Congress Cataloging-in-Publication Data Names: Papageorgiou, Spiros, editor. | Bailey, Kathleen M., editor. Title: Global perspectives on language assessment : research, theory, and practice / edited by Spiros Papageorgiou, Managing Senior Research Scientist at Educational Testing Service ; Kathleen Bailey, Professor at Middlebury Institute of International Studies at Monterey, USA. Description: Routledge : New York, NY, 2019. | Series: Global research on teaching and learning English series ; volume 6 | Includes bibliographical references and index. Identifiers: LCCN 2018044719 | ISBN 9781138345362 (hardback) | ISBN 9781138345379 (pbk.) | ISBN 9780429437922 (ebk) Subjects: LCSH: English language–Study and teaching–Foreign speakers. | English language–Ability testing. | Language and languages–Ability testing. | Second language acquisition. Classification: LCC PE1128.A2 G556 2019 | DDC 428.0071–dc23 LC record available at https://lccn.loc.gov/2018044719 ISBN: 978-1-138-34536-2 (hbk) ISBN: 978-1-138-34537-9 (pbk) ISBN: 978-0-429-43792-2 (ebk) Typeset in Bembo by Newgen Publishing UK
This volume is dedicated to language learners and their teachers around the world, who need to face many tests in their lives, as well as to the language testing professionals who strive to improve language assessment instruments, procedures, and experiences through their research.
CONTENTS
Preface
Acknowledgments

PART 1
Raters and Rating

1 Analytic Rubric Format: How Category Position Affects Raters’ Mental Rubric
Laura Ballard

2 Rater Training in a Speaking Assessment: Impact on More- and Less-Proficient Raters
Larry Davis

3 The Impact of Rater Experience and Essay Quality on the Variability of EFL Writing Scores
Özgür Şahan

4 Assessing Second Language Writing: Raters’ Perspectives from a Sociocultural View
Yi Mei

PART 2
Test Development and Validation

5 Assessing Clinical Communication on the Occupational English Test®
Brigita Séguis and Sarah McElwee

6 Updating the Domain Analysis of the Writing Subtest of a Large-scale Standardized Test for K-12 English Language Learners
Jing Wei, Tanya Bitterman, Ruslana Westerlund, and Jennifer Norton

7 The Effect of Audiovisual Input on Academic Listen-Speak Task Performance
Ching-Ni Hsieh and Larry Davis

8 The Reliability of Readability Tools in L2 Reading
Alisha Biler

9 Eye-Tracking Evidence on the Role of Second Language Proficiency in Integrated Writing Task Performance
Mikako Nishikawa

10 The Effects of Writing Task Manipulations on ESL Students’ Performance: Genre and Idea Support as Task Variables
Hyung-Jo Yoon

11 Cyberpragmatics: Assessing Interlanguage Pragmatics through Interactive Email Communication
Iftikhar Haider

PART 3
Test-taker and Teacher Perspectives

12 Did Test Preparation Practices for the College English Test (CET) Work? A Study from Chinese Students’ Perspectives
Jia Ma

13 Intended Goals and Reality in Practice: Test-takers’ Perspectives on the College English Test in China
Ying Bai

14 How Teacher Conceptions of Assessment Mediate Assessment Literacy: A Case Study of a University English Teacher in China
Yueting Xu

15 Sociocultural Implications of Assessment Practices in Iranian, French, and American Language Classes
Soodeh Eghtesad

About the Contributors
Subject Index
Author Index
PREFACE
This sixth volume in the Global Research on Teaching and Learning English Series focuses on the assessment of language proficiency. The title of the volume, Global Perspectives on Language Assessment: Research, Theory, and Practice, reflects the diverse contexts in which the authors of the 15 chapters conducted their research. As the editors, we had the privilege of assembling a book that illustrates the breadth of innovative research happening in the domain of language assessment. But beyond research, there is no doubt that assessment is an indispensable part of the language teaching profession. Teachers need to assess their students’ language proficiency on an ongoing basis for important purposes, such as monitoring progress, offering feedback, and identifying areas for improvement. Teachers may need to use external language proficiency tests or develop their own assessment instruments. Despite the central role of assessment in teaching and learning, the literature points out that many teachers lack sufficient training in developing assessment instruments, selecting tests that are suitable for a given purpose, and using test scores in ways that are useful and relevant for decision-making (Malone, 2013). This gap in training in language assessment might explain to some extent why studies on language assessment literacy, which refers to any stakeholder’s familiarity with and knowledge of language assessment principles, have focused primarily on teachers rather than on other stakeholders. (See the special issue of Language Testing edited by Inbar-Lourie, 2013.) Our experience working with pre-service and in-service language teachers actually suggests that language assessment is often perceived as a technical field that should be left to measurement “experts”. However, raising teachers’ language assessment literacy is critical. Scarino (2013) notes that there is ample evidence in the literature to support the (perhaps) common notion that the teacher is the most
important factor in influencing students’ learning. Even teachers’ beliefs about assessment might affect how they go about evaluating their students’ language proficiency (Xu, this volume). Tsagari and Papageorgiou (2012) make the case for raising teachers’ language assessment literacy for two reasons: The first is that teachers with a strong background in assessment are likely to be able to develop their own assessment instruments and apply innovative techniques in evaluating their students’ proficiency in the classroom. The second reason is that teachers with a good working knowledge of assessment principles will be equipped to complement their own assessment instruments with external assessments, which in turn are relevant to students’ needs, offer useful and relevant information about students’ language proficiency, and support learning.
Purpose of the Book and Intended Audience
TIRF (The International Research Foundation for English Language Education) has a long tradition of emphasizing the central role of assessment in language education. Since 2008, language assessment has been one of the Foundation’s research priorities, and several of the proposals submitted every year for the TIRF Doctoral Dissertation Grants (DDGs) explore topics related to this particular research priority. The authors of these proposals are doctoral candidates in relevant PhD and EdD programs, and funding from TIRF has been instrumental in helping successful applicants complete their work. This volume offers some of the recent awardees a venue to publish their research. The purpose of the book is to offer readers up-to-date research on the rapidly changing field of language assessment, and to familiarize them with the latest developments in theory and practice on this broad topic. There are several edited volumes offering a comprehensive overview of recent developments in language assessment (for example, Coombe, Davidson, O’Sullivan, & Stoynoff, 2012; Tsagari & Banerjee, 2016); however, the present collection of papers is unique in that it includes 12 studies that report on previously unpublished parts of doctoral research, and as such they have been vetted both by the research committees of the authors’ respective universities and in a blind review process by an international evaluation committee of TIRF’s DDG proposal reviewers. (All 12 of these studies received DDG funding.) We therefore believe that these chapters will be of interest to doctoral students and researchers in the field of language education. This book is not the first one containing papers by TIRF grantees who focused on language assessment. Christison and Saville (2016) edited a collection of 11 chapters, all authored by TIRF grantees receiving DDG funding. One distinctive feature of the present volume is that in addition to the 12 chapters reporting on doctoral research, there are three chapters written by experienced language assessment professionals. These chapters report specifically on issues related to test development or validation and are likely to be of interest in particular to readers involved in designing assessments or analyzing assessment results.
The consequences that language assessments have on learners, teachers, teaching programs, and society overall are a complex issue and a popular research topic in the field (Bailey, 1999; Wall, 2006). In an attempt to explicitly connect assessment research to teaching practice, each chapter in this book concludes with implications not only for future research, but also for educational policy and practice. Therefore, policy makers, as well as language teachers and educators, may benefit from reading this book.
Organization of the Book
The chapters in this volume have been thematically organized under three parts, which are briefly described below. Part I contains four chapters dealing with the scoring process for productive skills. Investigating the rater’s decision-making, the design of scoring rubrics, and training of raters is critical, as such aspects of the scoring process affect the quality of the speaking and writing scores on which important decisions about learners and teaching programs are made. In her chapter, Laura Ballard investigates the position effect of the different categories of a writing scoring rubric on the raters’ decision-making processes. Larry Davis explores the impact of training on raters of speaking, and Özgür Şahan focuses on the impact of rater experience and essay quality on variability of scores. Yi Mei responds to the call in the literature for attention to the social aspects of rating, in particular how raters resolve tensions during the essay scoring process, such as when they disagree with other raters about the score to be awarded.

Part II contains seven chapters dealing with test development and validation. The first three chapters are authored by professionals in the field of language testing. Brigita Séguis and Sarah McElwee present work related to the target language use (TLU; Bachman & Palmer, 2010) domain of a clinical communication English test. The TLU domain of the writing section of a K-12 test in the United States is the focus of the chapter by Jing Wei, Tanya Bitterman, Ruslana Westerlund, and Jennifer Norton. Ching-Ni Hsieh and Larry Davis (both former DDG recipients) present their research on the effect of audiovisual input on students’ performance on an integrated, listen-speak academic task. The remaining four chapters in this part are all authored by recent DDG recipients. Alisha Biler explores the extent to which readability tools can predict the difficulty of passages as indicated by students’ performance on the reading comprehension questions related to these passages. Mikako Nishikawa employed eye-tracking methodology to address the effects of source texts and graph features, as well as the roles of reading and test-taking strategies when students take integrated, reading-to-write tasks. Writing is also the focus of Hyung-Jo Yoon’s chapter, in which the author examined the effect of genre type (specifically argumentative and narrative essays) and amount of idea support on writing test scores. The last chapter in this part is by Iftikhar Haider. It examines the extent to which pragmatics features, in
particular politeness strategies, differ among learners of different proficiency levels when performing an email writing task. Part III contains four chapters by DDG recipients. The common theme of these four chapters is student and teacher perspectives related to language assessment, a critical area of inquiry if assessment results are to be used in ways that benefit language learners. The first three chapters are situated in the Chinese educational context. Jia Ma explores the characteristics of test takers of a high-stakes English language test and their perceptions of preparation courses. Student perspectives are also examined in Ying Bai’s chapter, specifically in relation to test design and score use. Yueting Xu examines how a teacher’s belief system about the nature and purpose of assessment is related to that teacher’s assessment literacy. In the last chapter, Soodeh Eghtesad compares assessment practices of language teachers in Iran, France, and the United States.
Conclusion
As the editors of this volume, we are grateful to all authors for their hard work, for being receptive to our feedback and requests for revisions, and for adhering to tight deadlines. We enjoyed working with them, and seeing these chapters take their final form through the iterative process of reviewing and revising. This was a rewarding experience for us, and hopefully for all authors. We are particularly pleased to be able to support TIRF’s mission by publishing this volume on language assessment, and we believe that readers will find it especially useful.
The Editors
Spiros Papageorgiou, Educational Testing Service
Kathleen M. Bailey, Middlebury Institute of International Studies at Monterey
References
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford, UK: Oxford University Press.
Bailey, K. M. (1999). Washback in language testing (TOEFL Monograph Series MS-15). Princeton, NJ: Educational Testing Service.
Christison, M., & Saville, N. (Eds.). (2016). Advancing the field of language assessment: Papers from TIRF doctoral dissertation grantees. Cambridge, UK: Cambridge University Press.
Coombe, C., Davidson, P., O’Sullivan, B., & Stoynoff, C. (Eds.). (2012). The Cambridge guide to language assessment. Cambridge, UK: Cambridge University Press.
Inbar-Lourie, O. (2013). Guest editorial to the special issue on language assessment literacy. Language Testing, 30(3), 301–307.
Malone, M. E. (2013). The essentials of assessment literacy: Contrasts between testers and users. Language Testing, 30(3), 329–344.
Scarino, A. (2013). Language assessment literacy as self-awareness: Understanding the role of interpretation in assessment and in teacher learning. Language Testing, 30(3), 309–327.
Tsagari, D., & Banerjee, J. (Eds.). (2016). Handbook of second language assessment. Boston, MA: De Gruyter Mouton.
Tsagari, D., & Papageorgiou, S. (2012). Special issue on language testing and assessment in the Greek educational context: Introduction. Research Papers in Language Teaching and Learning, 3(1), 4–7.
Wall, D. (2006). The impact of high-stakes examination on classroom teaching. Cambridge, UK: Cambridge University Press.
ACKNOWLEDGMENTS
We wish to acknowledge the support of some key people who made the production of this volume possible. First, we are indebted to Mr. Wyatt Boykin, our wonderful editorial assistant and project manager. Wyatt’s participation in this endeavor was made possible as part of his requirement for an internship with an international organization, a portion of his Master’s degree program in International Education Management at the Middlebury Institute of International Studies at Monterey. Wyatt worked for months without charging TIRF a fee—a fact which makes his highly professional management of the editorial process even more precious.

Second, we want to thank Mr. Ryan Damerow, TIRF’s Chief Operating Officer and one of the series editors for Global Research on Teaching and Learning English—a co-publication by Routledge and TIRF. Ryan’s timely and careful reading of all the chapters enabled us to proceed on schedule and to produce a more consistent volume of studies from our doctoral dissertation grantees and other authors.

And of course, we want to recognize the contributions of the authors themselves. This is the sixth book in the TIRF-Routledge series, and we are pleased to showcase the work of 12 TIRF grant recipients. It is no small task to convert massive dissertation reports of research methods and findings into short, readable texts for a more general audience. We also want to acknowledge the chapter contributions of language assessment professionals from Cambridge English Language Assessment, the Center for Applied Linguistics (CAL), and Educational Testing Service (ETS). As editors we are very pleased that all the authors worked so diligently to incorporate our feedback. We particularly want to note that authors who contribute to books in this series agree to forego royalties or honoraria, so that any proceeds from the book sales can be reinvested in TIRF’s operations and programs.
Finally, we want to acknowledge the ongoing support of our colleagues Karen Adler, Siobhan Murphy, and Emmalee Ortega at Routledge, as well as that of Lisa Cornish, who served as the copy editor on this project, and Sarah Green, Project Manager, Newgen Publishing UK. We are grateful for the energy, enthusiasm, and professionalism of all these colleagues.
Spiros Papageorgiou
Kathleen M. Bailey
PART 1
Raters and Rating
1 ANALYTIC RUBRIC FORMAT: HOW CATEGORY POSITION AFFECTS RATERS’ MENTAL RUBRIC
Laura Ballard
Issues that Motivated the Research
Rater scoring has an impact on performance test reliability and validity. Thus, there has been a continued call for researchers to investigate issues related to rating. In second language writing assessment, the emphasis on investigating the scoring process and how raters arrive at particular scores has been seen as critical “because the score is ultimately what will be used in making decisions and inferences about writers” (Weigle, 2002, p. 108). In the current study, I answer the call for continued research on the rating process by investigating rater cognition in the context of writing assessment. Research on raters’ cognitive processes “is concerned with the attributes of the raters that assign scores to student performances, and their mental processes in doing so” (Bejar, 2012, p. 2). A theme central to research on rater cognition is the way in which raters interact with scoring rubrics, because test designers uphold the assumption that “raters use the categories on the rubrics in the intended manner, applying the rubrics consistently and accurately to judge each performance (or product)” (Myford, 2012, pp. 48–49). Only by understanding this interaction between raters and rubrics will test designers be able to improve rubrics, rater training, and test score reliability and validity.
Context of the Research
In this chapter, I focus on rater-rubric interactions, which continue to be of interest because, despite rater-training efforts, variance in rater behavior and scores persists (Lumley & McNamara, 1995; McNamara, 1996; Weigle, 2002), which may lead to reliability problems. Though the goals of rater training are to give raters a
common understanding of the rubric’s criteria and to help raters converge on a common understanding of scoring bands (Bejar, 2012; Roch, Woehr, Mishra, & Kieszczynska, 2012), many studies on rater behavior have shown that raters do not always use rubrics in a consistent way (i.e., they have low intra-rater reliability). Raters do not consistently score (i.e., they have low inter-rater reliability), and they do not use the same processes to arrive at a given score (Cumming, Kantor, & Powers, 2002; Eckes, 2008; Lumley, 2002).

One potential explanation for rater behavior and problems with inter-rater reliability may be the primacy effect, as noted by Winke and Lim (2015). The primacy effect is a psychological phenomenon that shows that the positionality of information in a list (e.g., a rubric) affects a listener’s or reader’s assignment of importance to that information (Forgas, 2011). The primacy effect seems particularly relevant for helping to explain how raters pay attention to rubric criteria. Primacy potentially impacts inter-rater reliability in using analytic rubrics. To the best of my knowledge, however, no researcher has directly investigated the role of primacy in rater cognition and its potential effects on rater scoring (but Winke and Lim posited that primacy effects were observable in their 2015 study). Thus, in the current study, I investigate primacy effects in relation to rater-rubric interactions. I also examine whether primacy effects impacted rater behavior, such as mental-rubric formation, attention to criteria, and rater scoring, when raters used an analytic rubric.

In the field of psychology, the primacy effect (also known as serial position or ordering effects) refers to the better recall of information that is presented first in a list of information. Much of the research conducted on ordering effects has investigated the underlying cause of the phenomenon. The most widely accepted model that accounts for the primacy effect is the attention decrement hypothesis (Crano, 1977; Forgas, 2011; Hendrick & Costantini, 1970; Tulving, 2008). This hypothesis asserts that as a list of information is presented to someone, the attention of the person progressively declines over the course of the presentation. As Crano (1977) described this phenomenon, “The relative influence of a descriptor varies as a function of its serial position” (p. 90). Thus, in the process of impression formation, information that is presented later is weighed less heavily than information presented earlier. In other words, the reason that primacy effects occur is because people fail to process later items as carefully and attentively as earlier ones (Crano, 1977; Forgas, 2011).

The attention decrement hypothesis is even thought to be supported by biological evidence. Tulving (2008) proposed that a biological process called camatosis, a slowing of neuron activity in the brain, is related to memory and information retention. When a list of information is presented, the brain becomes fatigued and has fewer resources available to attend to later-presented information. Tulving proposed that a measured decrease in neural activity could be observed throughout the course of information presentation, resulting in the most attention being paid at the beginning, and a decreasing amount of attention being paid as information
presentation continues. In other words, primacy is a psychological phenomenon thought to be undergirded by a biological phenomenon. Decision making has been one of the primary areas in which researchers have investigated the real-world implications of the primacy effect. This focus on primacy effects and decision making has been popular because the order in which information is presented and the amount of attention paid to the information have a measurable impact on subsequent decisions (Forgas, 2011; Hendrick & Costantini, 1970; Luchins & Luchins, 1970; Rebitschek, Krems, & Jahn, 2015). What previous researchers have found is that primacy is the brain’s default setting. As new information is presented, attention to that information decreases. Stronger memory links are formed with first-order information than with later information. This linking leads to decisions that favor the first-order information. The act of scoring itself is a decision-making process. Scoring is meant to be informed by criteria, which are presented on the rubric and are exemplified throughout rater training (Cumming et al., 2002; Roch et al., 2012). During the training, raters build a unique mental model of the rubric, or a “mental scoring rubric” (Bejar, 2012, p. 4) to help them understand and internalize what is important to consider when making a scoring decision. As the rubric is initially presented or read, the order of the rubric categories may impact raters’ formation of their mental rubric. Two components of this mental rubric are (1) which criteria are stored and (2) how important those criteria are considered to be. In this study, I look particularly at how category ordering on a rubric could affect novice raters’ construction of their mental rubrics. As raters engage in the scoring process, they rely on their mental rubric, and if the mental rubric has been impacted by ordering effects, these effects would be expected to carry over into raters’ scores themselves. Analytic rubrics lay out detailed criteria descriptors listed by category and score band (Barkaoui, 2011; Weigle, 2002). According to Hamp-Lyons and Henning (1991), the large number of descriptors included in analytic rubrics has the potential to cause a heavy cognitive load, which may become unmanageable for raters. The vast amount of information presented on these rubrics paired with the heavy cognitive load they may cause is at odds with the corresponding process of analytic rating, which assumes that raters will attend equally to all rubric criteria (Barkaoui, 2007; Weigle, 2002). Raters may not have the attention or capacity to equally attend to all rubric descriptors, thus paving the way for primacy effects to take hold. This view, however, is only speculation, as very little research has been done on the effects of rubric format. One study by Winke and Lim (2015) investigated raters’ cognitive processes while rating essays using a version of the Jacobs, Zinkgraf, Wormuth, Hartfiel, and Hughey (1981) analytic rubric. They found that raters spent the majority of their rubric-use time attending to the left-most categories, and the least amount of time focusing on the right-most categories. The authors suggested that a probable explanation for the raters’ lack of attention to the right-most categories in the rubric was due to the primacy effect. However,
because their study was not set up to investigate primacy effects, they were only able to suggest primacy as a likely explanation for the observed rater behavior. To my knowledge, no study in language testing or educational measurement has explicitly investigated primacy effects in relation to rater-rubric interactions.
Research Question Addressed
In this study, I sought to answer the following research question: To what extent do raters show evidence of ordering effects in their mental-rubric formation after rater training?
Research Methods
Data Collection Procedures
I recruited participants from a large, Midwestern university who were native speakers of English, were undergraduate students, had no essay-rating experience, and were pursuing a social-science major or minor. I recruited them through social science courses and through flyers posted around campus. Seventy-three students expressed interest in participating, but only 31 met the established criteria. Participants received $150 to compensate them for approximately 12 hours of data collection. The 31 participants had a mean age of 21 (SD = 1.77). I randomly assigned participants to one of two groups (i.e., Group A and Group B) according to their availability for data collection. The two groups were necessary for counterbalancing purposes, and the groups differed only in the presentation order of the two rubrics. For rater training and rating, Group A used the standard rubric in Round 1, followed by the reordered rubric in Round 2, and Group B first used the reordered rubric in Round 1, followed by the standard rubric in Round 2. The study included two rounds of data collection, and each round consisted of two sessions: rater training and rating. The Round 1 and Round 2 procedures were identical, except that Round 1 included a consent form at the beginning of the rater training session, and Round 2 included a background questionnaire at the end of the rating session. The data-collection procedure is shown in Table 1.1. The purpose of the rater-training session was to train the participants on a rubric and to collect their pre-training and post-training beliefs about the rubric criteria. In these sessions, the following took place: administration of a pre-training criteria importance survey (CIS; described later), rater training, and a post-training criteria importance survey. These sessions took place in groups of two to five participants and lasted approximately three hours. The purpose of the rating session was to have the participants rate essays, to collect the participants’ scores on these essays, and to collect their pre-rating and
TABLE 1.1 Procedure Summary

Phase      Round 1                 Round 2
Training   CIS1, Training, CIS2    CIS5, Training, CIS6
Rating     CIS3, Rating, CIS4      CIS7, Rating, CIS8

Note. CIS = Criteria Importance Survey.
post-rating recall of and beliefs about the rubric criteria. The data collection for the rating sessions took place individually on campus within two days (but not on the same day) of the completion of the rater-training session. During the rating session, all participants did the following: a pre-rating criteria importance survey, rubric reorientation, essay rating, and a post-rating criteria importance survey. The rating session lasted approximately three hours. Round 2 of rater training and the subsequent rating took place approximately five weeks after Round 1. The main difference between Round 1 and Round 2 was that raters were trained on and used a differently ordered rubric for each round. During the rating sessions, participants rated 20 ESL essays, which were provided to me by the university’s English Language Center. I used two analytic rubrics that contained the same content but differed in layout. The base rubric was adapted from Polio (2013). This rubric is a five-category analytic rubric containing the following categories, with their possible points given in parentheses: Content (20), Organization (20), Vocabulary (20), Language Use (20), and Mechanics (10). To adapt the rubric for the current study, I made one significant modification: I increased the points in the Mechanics category from 10 to 20, thus giving equal point values to each category. It was important to weight the categories equally so as to indirectly reinforce to raters that each category was of equal value and importance in the rating process. The standard rubric categories were ordered in the following way: Content, Organization, Vocabulary, Language Use, and Mechanics. The standard rubric can be found in the Appendix. The reordered rubric is identical to the standard rubric, but the categories appeared in the reverse order: Mechanics, Language Use, Vocabulary, Organization, and Content. Including this reordered rubric in the study allowed me to investigate whether cognitive focus on the rubric categories was linked to the position on the rubric (i.e., order) or whether it was linked to the category itself. During rater training, I trained the participants on the rubric and benchmark essays. The session included an orientation to the rubric, a discussion
TABLE 1.2 Summary of CIS Administrations

Round  Administration  Description                                          Rubric
1      1               Pre-rater training                                   1
1      2               Post-rater training                                  1
1      3               Pre-rating                                           1
1      4               Post-rating                                          1
2      5               Pre-rater training (not yet exposed to new rubric)   1
2      6               Post-rater training                                  2
2      7               Pre-rating                                           2
2      8               Post-rating                                          2

Note. Rubric 1 is the standard rubric for Group A and the revised rubric for Group B. Rubric 2 is the revised rubric for Group A and the standard rubric for Group B.
of benchmark essays, and a round of practice scoring accompanied by group discussion. I modeled this protocol after the standard rater-training protocol at the university’s English Language Center, but with special care to spend equal time focusing on each category of the rubric during rubric orientation and essay discussions, thus controlling for the time spent on each category during the training. The criteria importance survey (CIS) is a Likert-scale questionnaire modeled after the criteria importance questionnaire used by Eckes (2008). For this survey, I asked participants to mark on a 10-point Likert scale how important they thought each criterion was when scoring an essay. Each criterion was taken from the analytic rubric used in this study (with only minor modifications to clarify descriptors when presented independently). I administered the CIS eight times throughout the study, and each time, the descriptors were presented in a different order. Table 1.2 summarizes the CIS administrations.
Data Analysis Procedures
In analyzing the CIS data, I sought to investigate whether the rubric that raters trained on, that is, the order in which they encountered each category, affected how important the raters considered the descriptors in each category to be. I collected data at multiple time points throughout the study. Because the repeated-measures, cross-over design would make it difficult to meaningfully analyze all the data in one model, I selected specific administrations to analyze that would uncover ordering effects, if ordering effects were present. Round 1 Pre-rating (CIS3) provided a snapshot of the participants’ beliefs about the criteria after a short-term delay in exposure to the initial rubric (i.e., one to two days after exposure). The data from Round 2 Pre-training (CIS5) provided a snapshot after a long-term delay in exposure to the initial rubric (i.e., five weeks after exposure).
After inspecting the descriptive statistics to understand general trends in the data, I sought to uncover differences in category importance beliefs over time and between the two groups. Thus, I computed a 2 (Group: A, B) x 2 (Time: CIS3, CIS5) x 5 (Category: Content, Organization, Vocabulary, Language Use, Mechanics) repeated measures ANOVA. I examined the data to ensure that they met the assumptions of the model. I inspected the output of Mauchly’s test of sphericity, a test that examines whether the assumption of compound symmetry (i.e., sphericity) has been met. The results were statistically significant (p < .05), indicating a violation of sphericity. I report a Greenhouse-Geisser correction, which is used when sphericity is not present. I did not find any important deviations from normality or homogeneity of variances.
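For readers who wish to run a comparable analysis, the sketch below shows how a simplified version of this kind of model could be set up in Python with the pandas and pingouin packages: a Group (between-subjects) by Category (within-subjects) mixed ANOVA on a single CIS administration, preceded by Mauchly’s test and with Greenhouse-Geisser-corrected results requested. It is an illustration only, not the code used in this study; the long-format data frame, its column names, and the file name are assumptions, and the full 2 x 2 x 5 design with two within-subjects factors would typically be fit in more specialized ANOVA software (for example, the afex package in R).

# Illustrative sketch only (not the study's code): a simplified mixed ANOVA
# with Group as a between-subjects factor and Category as a within-subjects
# factor, run on one CIS administration. Column names and the CSV file are
# hypothetical assumptions about how the ratings might be stored.
import pandas as pd
import pingouin as pg

# Long format: one row per rater x category, with columns
# rater, group ('A' or 'B'), category, importance (1-10 Likert rating).
cis3 = pd.read_csv("cis3_long.csv")

# Mauchly's test of sphericity for the within-subjects factor.
print(pg.sphericity(cis3, dv="importance", within="category", subject="rater"))

# Mixed ANOVA; with correction=True, Greenhouse-Geisser-corrected p-values
# are reported for the within-subjects effects when sphericity is violated.
aov = pg.mixed_anova(data=cis3, dv="importance", within="category",
                     between="group", subject="rater", correction=True)
print(aov.round(3))

# Follow-up pairwise comparisons between categories (the adjustment method
# here is an illustrative choice; the chapter reports 97.5% confidence intervals).
posthoc = pg.pairwise_tests(data=cis3, dv="importance", within="category",
                            subject="rater", padjust="bonf")
print(posthoc.round(3))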
Findings and Discussion
To uncover the extent to which raters showed evidence of ordering effects in their mental-rubric formation, I administered CISs to measure how important raters considered each rubric descriptor to be and to understand raters’ beliefs about the importance of each category as a whole. Descriptive statistics for the CIS data are given in Table 1.3. On the Likert scale, the options were labeled as follows: 1 = unimportant, 4 = somewhat important, 7 = important, and 10 = very important.

TABLE 1.3 Criteria Importance Survey (CIS) Means

Round  Time                Group  Content    Organization  Vocabulary  Language Use  Mechanics
1      1 (pre-training)    A      7.7 (1.1)  7.3 (1.5)     5.7 (0.8)   6.4 (1.3)     6.5 (1.7)
1      1 (pre-training)    B      7.3 (1.2)  8.1 (1.1)     7.0 (0.9)   7.3 (1.4)     7.5 (1.3)
1      2 (post-training)   A      8.4 (1.2)  8.6 (1.2)     7.6 (1.4)   8.0 (1.2)     7.4 (1.3)
1      2 (post-training)   B      8.3 (1.4)  8.7 (1.3)     7.8 (1.7)   8.2 (1.6)     7.9 (1.8)
1      3 (pre-rating)      A      8.2 (1.3)  8.4 (1.1)     7.7 (1.4)   7.9 (1.4)     7.4 (1.2)
1      3 (pre-rating)      B      7.9 (1.3)  8.5 (1.0)     7.8 (1.7)   7.9 (1.4)     7.8 (1.3)
1      4 (post-rating)     A      8.2 (1.2)  8.8 (1.1)     7.5 (1.4)   7.9 (1.5)     7.3 (2.0)
1      4 (post-rating)     B      8.4 (1.3)  8.8 (1.3)     8.1 (1.5)   8.6 (1.1)     8.6 (1.2)
2      5 (pre-training)    A      8.1 (1.0)  8.5 (1.2)     7.6 (1.4)   7.6 (1.5)     6.9 (2.0)
2      5 (pre-training)    B      8.4 (1.2)  8.5 (1.4)     7.8 (2.0)   7.9 (1.9)     7.7 (2.2)
2      6 (post-training)   A      8.3 (1.4)  8.6 (1.2)     7.8 (1.5)   8.3 (1.2)     7.4 (1.8)
2      6 (post-training)   B      8.3 (1.2)  8.8 (1.2)     7.9 (1.8)   8.3 (1.4)     8.0 (1.5)
2      7 (pre-rating)      A      8.4 (1.2)  8.6 (1.3)     7.8 (1.4)   7.7 (1.5)     7.5 (1.6)
2      7 (pre-rating)      B      8.1 (1.4)  8.8 (1.3)     8.0 (1.8)   8.0 (1.6)     8.0 (1.7)
2      8 (post-rating)     A      8.5 (1.3)  8.7 (1.3)     7.8 (1.3)   7.9 (1.5)     7.5 (1.5)
2      8 (post-rating)     B      8.5 (1.3)  8.9 (1.3)     8.1 (1.7)   8.2 (1.7)     8.3 (1.5)

Note. Standard deviations are in parentheses. Group A: n = 13. Group B: n = 14.

General trends in the descriptive data
show that Group A starts with a lower, wider spread of categories than Group B. At CIS5 (after the five-week lapse, which is the most likely time for ordering effects to manifest), Group A’s ratings replicate the general order of the standard rubric, with participants rating the left-most categories (Content and Organization) as most important and the right-most categories (Language Use and Mechanics) as least important. Group B, however, has a narrower spread between categories, with participants rating Content and Organization as most important, and Vocabulary, Language Use, and Mechanics clustered together. Comparing CIS3 and CIS7 (both pre-rating administrations), each group maintained similar beliefs about category importance between the two time points, even though the CIS administrations at these two time points were based on different rubric exposure. Importantly, for Group A, Mechanics was generally rated lowest across the time points, but for Group B, Mechanics fell toward the middle.

For the repeated measures ANOVA model (CIS3 and CIS5), the Group-by-Time-by-Category interaction was not statistically significant (F(2, 56) = 1.060, p = .359, ηp² = .041), nor was the Group-by-Category interaction (F(3, 67) = 1.539, p = .216, ηp² = .058). For main effects, Category was statistically significant (F(3, 67) = 1.539, p < .001, ηp² = .369), but Time (F(1, 25) = 0.160, p = .692, ηp² = .006) and Group (F(1, 25) = 0.123, p = .729, ηp² = .005) were not. In other words, when looking at differences in category importance beliefs over time and between the two groups, I only found statistical differences between categories, which I then examined with pairwise comparisons by group. Pairwise comparisons of Category for Group A on CIS3 revealed that only Organization and Mechanics were significantly different (mean difference = 1.013, p = .006, 97.5% CI [0.148, 1.878]). Note that the confidence interval (CI) indicates a 97.5% confidence that the true population value would fall within the given interval. For CIS5, Organization was significantly different from Vocabulary (mean difference = 0.883, p = .014, 97.5% CI [0.054, 1.712]) and Mechanics (mean difference = 1.648, p = .002, 97.5% CI [0.387, 2.910]). For Group B on CIS3, only Content and Organization were significantly different (mean difference = -0.656, p = .020, 97.5% CI [-1.295, -0.017]). For CIS5, no categories were significantly different. Group B’s ratings of importance for the five categories were much more similar to one another, with the only significant difference being between Organization and Content, with Organization being deemed more important. A discussion of these findings follows.

The goal of rater training is to bring raters onto the same page, that is, to instill in them the same “mental scoring rubric” (Bejar, 2012, p. 4; see also Cumming, 1990) by which they will score essays. After developing a common mental rubric during rater training, raters should produce scores that are similar, regardless of who they are. Thus, the raters should be interchangeable, meaning that they should each score in the same way and produce the same results.
To investigate this mental-rubric representation in raters’ minds, raters completed Criteria Importance Surveys (CISs) at several points during training and rating over time. These data gave a snapshot of which descriptors and categories raters considered to be important. Within this study, I trained the raters on rubrics that differed as to how the analytic categories on the rubric were ordered. Thus, with this design, I was able to investigate any ordering effects on the raters’ beliefs about category importance. In other words, I could see whether there was a primacy effect stemming from the rubric layout, which would dictate that the categories that came first on the rubric would be viewed by the raters as more important.

In the CIS data, one overarching trend emerged. After the two groups of raters were trained on the rubric, regardless of the order in which the analytic categories were presented on the rubric, both groups of raters indicated that Organization was the most important category. Raters maintained this belief across rounds, regardless of the rubric on which they trained. If primacy effects were strongly at work, raters would have considered the left-most categories to be most important during initial training (e.g., Group A, Content and Organization; Group B, Mechanics and Language Use). However, Organization was the prevailing category for both groups (regardless of which rubric the raters were first exposed to). This finding suggests that this group of raters (though novices) brought with them beliefs about important aspects of quality writing, perhaps learned from their collegiate writing courses or perceived while applying the rubric itself. Furthermore, these important criteria transcended primacy, showing that even if there are ordering effects, they are (or can be) softened or mediated by other factors.

Despite the prominence of the Organization category, there was evidence in the data suggesting ordering effects. The primacy effect predicts that ordering of information shapes one’s impressions about information importance, and these impressions persist over time even when contradictory information is presented (Forgas, 2011; Hendrick & Costantini, 1970; Luchins & Luchins, 1970). There are two notable trends pervading the CIS data that suggest that raters’ initial impressions were affected by category ordering and that these impressions persisted even when new category ordering was introduced. Group A initially trained on the standard rubric, in which Content appeared first and Mechanics appeared last. The ordering may have subconsciously affected the raters’ beliefs that Content is most important and Mechanics is least important. These beliefs carried through to five weeks after the initial training (i.e., CIS5), and then persisted even after the raters trained on a new rubric (i.e., CIS7) and had the opportunity to reconsider their beliefs about category importance. In other words, their beliefs about category importance were consistent with their initial impression and did not seem to change even after receiving different (i.e., somewhat contrary, opposite-ordered) input. Group B’s mental rubric formation, however, was likely impacted by primacy effects and by the presence of certain categories in certain positions. At CIS3, raters in Group B indicated that they believed Mechanics (left-most on their
rubric of initial exposure) to be just as important as all other categories. At the five-week mark (i.e., CIS5), when primacy effects may be strongest, Group B did not have any statistically significant differences among their category-importance beliefs, suggesting that the order in which the categories were presented had a leveling effect (i.e., indicating equal importance across categories). This result was contrary to that of the raters in Group A, who showed a distinct difference at CIS5 in their beliefs about Mechanics, that it was significantly less important than other categories. Thus, while both groups demonstrated evidence of primacy effects in their beliefs about criteria importance, there were distinct patterns in the groups’ beliefs which related to the position of particular categories.
Implications for Policy, Practice, and Future Research
In this study, I examined one key aspect of the rating process, mental-rubric formation, in order to develop a preliminary understanding of how ordering effects may influence rater beliefs. The data seem to show the following: As novice raters train on a new rubric, the raters’ behavior pertaining to categories in the outermost positions (e.g., left-most and right-most) seems most susceptible to ordering effects. That is, the findings of this study have provided some evidence that the position of a category affected the raters’ beliefs about what criteria are the most and least important when scoring an essay. While these effects were not always present for both groups, the data did suggest that the category itself and its position were both impactful. That is, the nature of the category and the order in which the categories appeared on the rubric did matter.

Primacy does not explain everything, however. Even though the participants in this study were all novice raters who had never previously scored essays or been trained on a rubric, these raters still brought their own ideas about quality writing to the task. These ideas could have come from their own experiences in collegiate writing courses. All of the participants had completed at least two such courses. While raters do bring their own biases with them (Barkaoui, 2011; Cumming, 1990; Lumley & McNamara, 1995; Weigle, Boldt, & Valsecchi, 2003; Winke, Gass, & Myford, 2012), this study has shown that rater training and exposure to a rubric can effectively shape raters’ beliefs and behaviors in the scoring process, as has been expounded upon before (Davis, 2016; Weigle, 1994; Wolfe, Matthews, & Vickers, 2010). Though this study only involved two rounds of rating over the course of five weeks, many raters in rating programs score essays over long periods of time. I posit that as raters continue to be exposed to and rate using a single rubric, the primacy effect could potentially cause a deep entrenchment of beliefs that shape raters’ scoring behavior, since Luchins and Luchins (1970) showed that primacy effects grow stronger over time. Such an entrenchment would be problematic because, on many analytic rubrics, the categories are meant to be treated with equal importance and should be scored accordingly (Lumley, 2002).
However, ordering effects could lead raters to stray from ideal rating behavior, thus compromising essay score interpretation and, in turn, test validity. Overall, the results of this study show that the psychological phenomenon described by psychologists like Underwood (1975)—that people assign more importance to information that comes first on a list—is important for language testing programs that use analytic rubrics. For this group of novice raters, information ordering (on a rubric) mattered in terms of how the raters processed the rubric descriptors. Given the findings of this study, it would be beneficial for test designers to carefully consider the layout and ordering of analytic rubrics used in operational testing. Rubric designers could leverage ordering effects to their benefit by fronting any categories that are typically, perhaps unintentionally, seen as less important. Since raters may become more and more entrenched in their beliefs over long periods of time, test designers could also consider creating an online rater-training and scoring platform (see Knoch, Read, & von Randow, 2007; Wolfe et al., 2010), which would encourage raters to pay equal attention to each rubric category. One example may be a digital platform that presents raters with a randomized, forced order of training, norming, and scoring, which may reduce raters’ conditioning to attend more to certain categories than to others.

Further research is needed to elucidate exactly how ordering effects impact the rating process. Researchers could investigate other components of mental-rubric formation, such as criteria recall. Other critical aspects to investigate are (1) whether and how primacy may impact raters’ attention to criteria during the rating process, (2) whether and how primacy may impact rater scoring, and (3) how meaningful the impact of primacy may be in real-life scoring. In addition to investigating other components of the rating process, it is important to consider the rater population. The participants in this study were novice raters whose first language is read from left to right. To better understand the effects of primacy on raters, future research should consider (1) expert raters who have already developed a strong sense of criteria importance, and (2) raters whose text-directionality bias is right-to-left or top-to-bottom, which may impact one’s primacy directionality.

This study has provided evidence that ordering effects in the categories of the scoring rubric are present in the rater training process. Rater training programs could now use this information to carefully craft rubrics and rater training such that any ordering effects are intentional and to the betterment of the program. Otherwise, category ordering needs to be thoughtfully controlled so that ordering effects will not take hold and become detrimental to the rating program over time.
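As a concrete illustration of the randomized-ordering idea mentioned above, the short sketch below shows one way a training or scoring platform might shuffle the presentation order of the five rubric categories independently for each rater and session. The function, the seeding scheme, and the identifiers are hypothetical; this is not part of the study or of any existing platform.

# Hypothetical sketch: give each rater and session its own randomized category
# order so that no single category consistently benefits from the primacy effect.
import random

CATEGORIES = ["Content", "Organization", "Vocabulary", "Language Use", "Mechanics"]

def category_order(rater_id, session_id):
    """Return a reproducible, randomized presentation order for one rater/session."""
    rng = random.Random(f"{rater_id}:{session_id}")  # seed for reproducibility
    order = CATEGORIES.copy()
    rng.shuffle(order)
    return order

if __name__ == "__main__":
    # Two raters in the same scoring round see the categories in different orders.
    for rater in ("rater_01", "rater_02"):
        print(rater, category_order(rater, "round1_rating"))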
References
Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12(2), 86–107.
Barkaoui, K. (2011). Think-aloud protocols in research on essay rating: An empirical study of their veridicality and reactivity. Language Testing, 28(1), 51–75.
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9.
Crano, W. D. (1977). Primacy versus recency in retention of information and opinion change. The Journal of Social Psychology, 101(1), 87–96.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–51.
Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86(1), 67–96.
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117–135.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185.
Forgas, J. P. (2011). Can negative affect eliminate the power of first impressions? Affective influences on primacy and recency effects in impression formation. Journal of Experimental Social Psychology, 47(2), 425–429.
Hamp-Lyons, L., & Henning, G. (1991). Communicative writing profiles: An investigation of the transferability of a multiple-trait scoring instrument across ESL writing assessment contexts. Language Learning, 41(3), 337–373.
Hendrick, C., & Costantini, A. F. (1970). Effects of varying trait inconsistency and response requirements on the primacy effect in impression formation. Journal of Personality and Social Psychology, 15(2), 158–164.
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(1), 26–43.
Luchins, A., & Luchins, E. (1970). The effects of order of presentation of information and explanatory models. The Journal of Social Psychology, 80(1), 63–70.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246–276.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54–71.
McNamara, T. F. (1996). Measuring second language performance. London, UK: Longman.
Myford, C. M. (2012). Rater cognition research: Some possible directions for the future. Educational Measurement: Issues and Practice, 31(3), 48–49.
Polio, C. (2013). Revising a writing rubric based on raters’ comments. Paper presented at the Midwestern Association of Language Testers (MwALT) conference, East Lansing, Michigan.
Rebitschek, F. G., Krems, J. F., & Jahn, G. (2015). Memory activation of multiple hypotheses in sequential diagnostic reasoning. Journal of Cognitive Psychology, 27(6), 780–796.
Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta-analytic review of frame-of-reference training. Journal of Occupational and Organizational Psychology, 85(2), 370–395.
Tulving, E. (2008). On the law of primacy. In M. A. Gluck, J. R. Anderson, & S. M. Kosslyn (Eds.), Memory and mind: A festschrift for Gordon H. Bower (pp. 31–48). New York, NY: Taylor & Francis Group.
Underwood, G. (1975). Perceptual distinctiveness and proactive interference in the primacy effect. The Quarterly Journal of Experimental Psychology, 27(2), 289–294.
Weigle, S. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223.
Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.
Weigle, S. C., Boldt, H., & Valsecchi, M. I. (2003). Effects of task and rater background on the evaluation of ESL student writing. TESOL Quarterly, 37(2), 345–354.
Winke, P., Gass, S., & Myford, C. (2012). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231–252.
Winke, P., & Lim, H. (2015). ESL essay raters’ cognitive processes in applying the Jacobs et al. rubric: An eye-movement study. Assessing Writing, 25, 37–53.
Wolfe, E. W., Matthews, S., & Vickers, D. (2010). The effectiveness and efficiency of distributed online, regional online, and regional face-to-face training for writing assessment raters. The Journal of Technology, Learning and Assessment, 10(1), 4–21.
APPENDIX
Standard Rubric Content

Content
20–16: Thorough and logical development of thesis; substantive and detailed; no irrelevant information; interesting; a substantial number of words for the amount of time given; well-defined relationship to the prompt.
15–11: Good and logical development of thesis; fairly substantive and detailed; almost no irrelevant information; somewhat interesting; an adequate number of words for the amount of time given; clear relationship to the prompt.
10–6: Some development of thesis; not much substance or detail; some irrelevant information; somewhat uninteresting; limited number of words for the amount of time given; vague but discernable relationship to the prompt.
5–0: No development of thesis; no substance or details; substantial amount of irrelevant information; completely uninteresting; very few words for the amount of time given; relationship to the prompt not readily apparent.

Organization
20–16: Excellent overall organization; clear thesis statement; substantive introduction and conclusion; excellent use of transition words; excellent connections between paragraphs; unity within every paragraph.
15–11: Good overall organization; clear thesis statement; good introduction and conclusion; good use of transition words; good connections between paragraphs; unity within most paragraphs.
10–6: Some general coherent organization; minimal thesis statement or main idea; minimal introduction and conclusion; occasional use of transition words; some disjointed connections between paragraphs; some paragraphs may lack unity.
5–0: No coherent organization; no thesis statement or main idea; no introduction and conclusion; no use of transition words; disjointed connections between paragraphs; paragraphs lack unity.

Vocabulary
20–16: Very sophisticated vocabulary; excellent choice of words with no errors; excellent range of vocabulary; idiomatic and near native-like vocabulary; academic register.
15–11: Somewhat sophisticated vocabulary, although usage not always successful; good choice of words with some errors that don't obscure meaning; adequate range of vocabulary but some repetition; approaching academic register.
10–6: Unsophisticated vocabulary; limited word choice with some errors obscuring meaning; repetitive choice of words; little resemblance to academic register.
5–0: Very simple vocabulary; severe errors in word choice that often obscure meaning; no variety in word choice; no resemblance to academic register.

Language Use
20–16: No major errors in word order or complex structures; no errors that interfere with comprehension; only occasional errors in morphology; frequent use of complex sentences; excellent sentence variety.
15–11: Occasional errors in word order or complex structures; a variety of complex structures, even if not completely successful; almost no errors that interfere with comprehension; some errors in morphology; frequent use of complex sentences; good sentence variety.
10–6: Frequent errors in word order or attempts at complex structures; some errors that interfere with comprehension; frequent errors in morphology; minimal use of complex sentences; little sentence variety.
5–0: Serious errors in word order or complex structures; frequent errors that interfere with comprehension; many errors in morphology; almost no attempt at complex sentences; no sentence variety.

Mechanics
20–16: Appropriate layout with well-defined paragraph separation; no spelling errors; no punctuation errors; no capitalization errors.
15–11: Appropriate layout with clear paragraph separation; no more than a few spelling errors in less frequent vocabulary; no more than a few punctuation errors; no more than a few capitalization errors.
10–6: Appropriate layout with somewhat clear paragraph separation; some spelling errors in less frequent and more frequent vocabulary; several punctuation errors; several capitalization errors.
5–0: No attempt to arrange essay into paragraphs; several spelling errors even in frequent vocabulary; many punctuation errors; many capitalization errors.

Note. The formatting of the rubric was modified for publication purposes.
2
RATER TRAINING IN A SPEAKING ASSESSMENT
IMPACT ON MORE- AND LESS-PROFICIENT RATERS
Larry Davis
Issues that Motivated the Research
In tests of productive language, it is the responsibility of the rater to perform the act of measurement; that is, to assign a number to an observation through the application of a rule (Stevens, 1946). How and why raters make scoring decisions therefore has important consequences for test reliability and the validity of score interpretations, and indeed, the nature of rater-associated variability in scores has been a persistent source of concern for language performance tests (Norris, Brown, Hudson, & Yoshioka, 1998) and speaking tests in particular (Fulcher, 2003). Understanding the nature of such variability is no simple matter, however, given the complexity of the scoring task and the many factors that may influence raters' judgments. The apparent simplicity of a score belies the fact that language performance tests often measure complex phenomena and that raters must make judgments on the basis of brief scoring rubrics, where the fit between performance and score is not always obvious (Cumming, Kantor, & Powers, 2002; Lumley, 2005). Yet, despite these challenges, raters are still capable of producing reliable scores that rank test-takers in similar order (e.g., Huot, 1990; McNamara, 1996).
Although raters may generally agree on the relative quality of a response in terms of score, such consistency does not necessarily mean that raters were identical in the way they applied the scoring criteria. A substantial body of research in both speaking and writing assessment has observed diversity in the features of performance attended to by raters. For example, Vaughan (1991) noted variability across 14 categories of comments mentioned by raters while scoring essays holistically, with one rater who seemed to focus on introductions, another who focused on mechanics, and a third who commented most often on content issues. Similar findings were also reported by Sakyi (2000) and Milanovic, Saville, and Shen
(1996), again with raters engaged in holistically scoring essays. One might expect that such variability could be reduced by the use of analytic scoring, where more explicit guidance is given regarding how to score different aspects of performance. However, Smith (2000) reported that while agreement in analytic scores for specific aspects of performance was fairly high among six raters, the actual criteria used to produce the scores were less consistent. In one case, general agreement that the essay was a "fail" was accompanied by disagreement regarding which of the scoring criteria had actually gone unmet.
Similar variability in raters' scoring criteria has been reported for speaking tests. Meiron (1998) found that of six raters who scored audio-recorded responses to the SPEAK test (Speaking Proficiency English Assessment Kit), each showed a unique pattern in the frequency with which different features were mentioned in verbal reports. In a more dramatic example, Orr (2002) observed multiple cases where raters scoring the First Certificate in English (FCE) speaking test gave the same score to a response but made opposite comments regarding the quality of a particular language feature. For example, in one case, two raters both awarded a score of three for "grammar and vocabulary" (on a scale of 1–5), with one rater commenting "the vocabulary was pretty inappropriate" while the other rater commented "the vocabulary was appropriate" (p. 146). Similarly, May (2009) observed that among four raters who scored an asymmetric interaction in a paired speaking task, there was at least one instance where the group was equally split regarding the appropriateness of an interactional sequence. Similar variation in judgments was seen regarding which test-taker should be penalized or compensated for the asymmetric nature of the interaction. Finally, Winke, Gass, and Myford (2011) reported that of eight major categories of comments made by trained undergraduate raters who scored responses to the TOEFL iBT® Speaking Test, only three categories were mentioned by all 15 raters. In addition, two of the three issues noted by all raters had little relationship to the scoring rubric (i.e., the examinee's accent and volume/quality of voice).
It is clear that raters may have differing perceptions of the same response, or may attend to irrelevant features of performance. A major purpose of rater training is therefore to guide and standardize rater perceptions, and it is widely accepted in the language testing community that rater training is necessary to ensure the reliability and validity of scores produced in language performance tests (Fulcher, 2003; Weigle, 2002). The major goals of rater training are (1) to ensure the consistency of scores within and between raters, and (2) to ensure that raters are operationalizing the construct of language ability in the same way when applying the scoring criteria. Training of naïve raters has been shown to be helpful in making raters more internally consistent and may improve inter-rater agreement or accuracy (e.g., Fahim & Bijani, 2011; Furneaux & Rignall, 2007; Kang, 2008; Lim, 2011; Weigle, 1998), but improvement in scoring performance is not guaranteed. A number of studies have reported little impact of training on rater bias (Elder, Barkhuizen,
Knoch, & von Randow, 2007; Knoch, 2011) or on variation in rater severity (Brown, 1995; Lumley & McNamara, 1995; Myford & Wolfe, 2000). The impact of training on rater perceptions is less clear, however, and variability in scoring criteria and decision-making processes has also been seen among raters who have undergone training (Meiron, 1998; Orr, 2002; Papajohn, 2002; Winke, Gass, & Myford, 2011). These studies demonstrate that trained raters may still differ in their perceptions, but the extent to which rater training actually influences raters' scoring criteria has not been directly investigated by comparing naïve raters' perceptions before and after training. The purpose of the current study was to investigate this issue in a speaking assessment context.
Context of the Research
This chapter reports on one part of a larger study of rater expertise; a portion of this study was reported in Davis (2016). The overall focus of the study was to investigate the contributions of rater training and scoring experience to rater performance in terms of the scores raters produced, their behavior while scoring, and their application of the scoring criteria. Davis (2016) reported findings related to the effects of training and experience on score quality; in brief, there were four major findings:
• Individuals who were untrained as raters but were experienced teachers of English achieved levels of agreement and scoring consistency, as indicated by multifaceted Rasch measurement (MFRM) severity and model fit, that are typical for standardized tests. In a scoring session prior to training, raters were provided with exemplar responses but not a scoring rubric; the exemplars apparently were sufficient for raters to approximate the rating scale in their scoring decisions.
• However, following a training session, between-rater agreement and agreement with established reference scores (scoring accuracy) improved.
• Additional experience gained in a series of three scoring sessions over approximately 10 days resulted in further gains in scoring accuracy, but no additional improvement was seen in terms of inter-rater correlations or MFRM severity and model fit measures.
• More accurate and consistent raters took longer to reach a scoring decision, in part because they reviewed the exemplar responses or replayed the test-taker's response more often while scoring.
The final finding came from a part of the study in which subsets of raters with relatively good, average, and poor scoring performance were selected and compared in terms of their scoring behavior and the aspects of performance they attended to while scoring. This chapter elaborates on this latter part of the study and describes the scoring criteria used by relatively strong- and weak-performing raters, before
and after training. The purpose was to compare the effects of training and experience, as well as to provide a comparison among individuals who, for whatever reason, demonstrated desirable or undesirable scoring patterns.
Research Questions Addressed
The study examined the following research questions: In what ways does rater cognition differ between raters who are more and less proficient in scoring? Specifically, what differences, if any, are seen in attention paid to distinct language features and in explicit attention given to the scoring process?
Research Methods
Data Collection Procedures
For the analyses reported here, nine participants were selected from a group of 20 experienced teachers of English who participated in the larger study. None had previously worked as scorers for the TOEFL iBT Speaking Test, although all had two or more years of experience in teaching English. The participants were divided into three groups: more-proficient raters, who demonstrated relatively good scoring performance during the study; less-proficient raters, whose performance was relatively poor; and intermediate raters, whose performance was of intermediate quality (see Appendix). Information regarding the scoring performance of individual raters is reported in Davis (2016). There was little difference in the backgrounds of the raters selected for each group, although the more-proficient raters had, on average, more teaching experience: 8.5 years, versus 5.7 and 6.0 years, respectively, for the less-proficient and intermediate rater groups. Also, the majority (five of six) of the more-proficient and intermediate raters had taught some sort of TOEFL preparation class, while none of the less-proficient raters had done so.
Materials based on the TOEFL iBT Speaking Test were used in the study, which included responses from the TOEFL Public Use Dataset (Educational Testing Service, 2008), the TOEFL iBT speaking scoring rubric (adapted to a modified scale as described below), and a limited number of scoring rationales provided by Educational Testing Service (ETS) and used for rater training. Raters scored responses to two independent speaking tasks in which test-takers provided an opinion (task 1) or were asked to choose one of two options and explain their choice (task 2). Responses came from a single test form taken from the TOEFL Public Use Dataset, with 240 test-takers responding to the same two tasks to produce a total of 480 audio files. Responses were scored holistically using the TOEFL iBT speaking rubric, which provides scoring criteria for three domains: delivery (pronunciation, fluency), language use (grammar and vocabulary), and topic development (detail and coherence of content) (Educational Testing Service, 2007). While the scoring
criteria provided in the rubric were used unchanged, the rating scale was modified from the four-point scale used operationally to a six-point scale by adding half points between scores 2, 3, and 4. This modification was intended to make the scoring task more challenging, as needed for the investigation of scoring performance that was carried out in the broader study. For evaluating scoring accuracy using the modified rating scale, reference scores were collected from 11 TOEFL iBT scoring leaders who each scored all responses. Their scores were then combined to produce "gold standard" reference scores. It should be noted that TOEFL iBT materials were used for reasons of convenience, and the study was neither designed nor intended to support specific claims regarding the TOEFL iBT Speaking Test.
Scoring was done independently by raters who accessed materials online and entered their scores into an interactive Adobe Acrobat document. Raters worked at their own locations and were unsupervised, although they were required to complete each scoring session within the same day and were requested to work diligently while scoring. At the start of the study, each rater completed an individual orientation with the researcher, in which (1) the procedures for accessing materials and scoring were explained, (2) it was confirmed that the participant could successfully access and use the materials, and (3) the stimulated recall procedure used to collect rater perceptions was reviewed and practiced. Following orientation, raters completed four scoring sessions over a period of roughly two to three weeks. These sessions included a pre-training session, a session the day after rater training, and two subsequent sessions at intervals of five to seven days.
Each session started with a mandatory review of six exemplar responses to the independent speaking task being scored first, which demonstrated performance at each point of the scale. Then 50 responses to the task were scored, followed by an additional 10 responses which were used for stimulated recall. This process was then repeated with the other independent speaking task, for a total of 100 responses scored and 20 stimulated recall protocols. The order in which each task was presented was randomly assigned for each rater. In the first session, the exemplar responses were the only scoring materials provided; the rubric was not made available. This approach was taken to collect raters' unguided perceptions regarding the aspects of performance relevant for a given score level. Once the scoring rubric had been introduced in the training session, raters were required to review the scoring rubric before scoring each task type in all subsequent sessions. Each task type typically required approximately two and a half hours to complete, making a total of roughly five hours for each scoring session.
In the rater training, participants first read through the scoring rubric, after which they reviewed exemplar responses. These exemplars were accompanied by scoring commentaries which gave a brief (one-paragraph) rationale for awarding a given score, and which focused on the features of topic development expected in the response. The scoring commentaries were obtained either from published
materials or directly from ETS. A total of 10 commentaries were used that addressed five different independent speaking tasks, including the two specific tasks used to elicit the oral responses in the study. Following review of the scoring commentaries, raters completed practice scoring of ten responses for each task type, where the reference score for each response was provided as feedback immediately after the raters confirmed their scores. The training was presented in an interactive PDF document and required approximately one and a half to two hours to complete.
Data regarding rater cognition were collected using verbal reports, specifically, stimulated recall (Gass & Mackey, 2000). Although concurrent verbal report techniques such as think-aloud have been widely used in the study of writing assessment (and in at least one study of rater cognition in a speaking test; Meiron, 1998), retrospective verbal reporting techniques such as stimulated recall are generally considered to be less likely to interfere with the scoring process within speaking test contexts (Lumley & Brown, 2005). They may also require less training because they are easier to produce than concurrent verbal reports (Gass & Mackey, 2000).
Raters were trained in how to produce stimulated recalls during the orientation session for the study. The stimulated recall training started with the researcher giving an explanation of the general purpose for collecting the stimulated recall as well as a description of the type of comments the raters should be making. In particular, it was stressed that the raters should "think out loud" rather than attempting to give an "explanation" or narrative of their scoring. This focus on verbalizing internal speech has been suggested to provide relatively direct access to the thought process being reported (Ericsson & Simon, 1993). Raters were also asked to try to recall their original thoughts while scoring, to the extent possible, and some questions were provided to guide raters in case they had difficulty in thinking of something to say. Raters first scored the response and then were prompted to record their recall. During the recall, raters replayed the test-taker's response and made comments; raters were asked to pause the test-taker's response at least two or three times to record their thoughts, but were also allowed to speak over the response without pausing, stop and replay part of the response to make comments, and comment after the replay was finished. Generally, most raters paused the response several times to make comments and then made additional comments at the end. As mentioned earlier, ten recalls were done at the end of the scoring of the first task type (halfway through the overall scoring session), with another ten recalls done for the other task at the end of the scoring session.
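To make the modified scale and the reference-score setup concrete, the short Python sketch below snaps an average of leaders' scores to the nearest point on the six-point scale. The chapter does not state how the 11 scoring leaders' scores were combined, so the rounding-of-the-mean rule, the exact set of scale points, and the example values are assumptions made only for illustration.

    # A minimal sketch (not the operational procedure) of the modified rating
    # scale and one possible way to combine leaders' scores into a single
    # reference score. The combination rule is an assumption.

    # Four-point operational scale plus half points between 2, 3, and 4.
    SCALE_POINTS = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0]

    def to_scale(value):
        """Snap a raw average to the nearest point on the modified scale."""
        return min(SCALE_POINTS, key=lambda p: abs(p - value))

    def reference_score(leader_scores):
        """Combine several leaders' scores into one reference score.
        Here: the mean snapped to the nearest scale point (illustrative only)."""
        mean = sum(leader_scores) / len(leader_scores)
        return to_scale(mean)

    # Hypothetical example: 11 leaders score the same response.
    leaders = [3.0, 3.0, 2.5, 3.0, 3.5, 3.0, 2.5, 3.0, 3.0, 3.5, 3.0]
    print(reference_score(leaders))  # -> 3.0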
Data Analysis Procedures
Word-for-word transcriptions were produced for all stimulated recalls. Ten recalls for each rater (five for independent task 1 and five for task 2) were randomly selected for coding from scoring sessions 1, 2, and 4 (before training, the day after,
and 10–14 days after), which represented half of the recalls actually produced in these sessions. This sampling of the data was done to limit the time required for coding to a manageable level; responses collected in session 3 (five to seven days post-training) were not coded for the same reason. The transcripts of the selected recalls were then segmented into units wherein each segment represented a single cognitive process or action (Green, 1998). Such segments often corresponded to a clause, but ranged in length from a single phrase to several clauses, and in any case captured specific actions that were not obviously decomposable into sub-actions. A simple coding scheme was then developed based on three randomly selected recalls from each rater, collected in scoring session 1. The coding scheme was further refined as coding continued, with final codes for all recalls reviewed and revised as needed at the end of the coding process.
The coding scheme focused on two aspects of rater cognition: (1) language features attended to, and (2) mention of scores and the scoring process (Davis, 2012). Language features mentioned by raters were coded to evaluate the claim that rater training functions to clarify the criteria used for scoring (e.g., Fulcher, 2003; Weigle, 2002). Accordingly, specific language features were coded into one of the three categories used in the TOEFL scoring rubric: delivery (pronunciation and fluency), language use (grammar and vocabulary), and topic development (detail and coherence of content). Comments regarding scoring were divided into two categories: (1) mention of a score for all or part of a response (such as "This person is about a three so far"), and (2) comments regarding the scoring process (scoring meta-talk such as "I checked the rubric here"). Other categories that were coded included a "general" category for comments addressing the response as a whole, an "unclear" category where the language focus of the comment could not be determined (such as when the rater simply repeated an examinee's utterance), a "technical" code used for comments about technical issues (such as poor sound quality), and an "other" category used for comments that did not fit elsewhere, such as speculation regarding the examinee's background or intentions.
A total of 40% of the stimulated recall transcripts were coded by a second individual, who was an experienced teacher of English as a foreign language and was trained to use the coding system. The overall inter-coder agreement was 82.3% (974 of a total of 1,184 coded units).
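The inter-coder agreement figure above is a simple percent-agreement calculation over the double-coded units. The sketch below shows that arithmetic; the code labels and example data are hypothetical and are not taken from the study's coding files.

    # Percent agreement between two coders over the same segmented units.
    # The category labels and example segments are hypothetical.

    def percent_agreement(codes_a, codes_b):
        """Proportion of units assigned the same code by both coders."""
        assert len(codes_a) == len(codes_b)
        agreed = sum(1 for a, b in zip(codes_a, codes_b) if a == b)
        return agreed / len(codes_a)

    # The chapter reports 974 agreements out of 1,184 double-coded units:
    print(974 / 1184)  # ~0.823, i.e., 82.3%

    # Usage with (hypothetical) coded segments:
    coder1 = ["delivery", "topic", "scoring", "other"]
    coder2 = ["delivery", "language", "scoring", "other"]
    print(percent_agreement(coder1, coder2))  # 0.75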
Findings and Discussion
There was considerable individual variation in the frequency of comments made concerning various language features, and on the whole there appeared to be few obvious or consistent differences between the three groups of raters (Table 2.1). This variation, along with the various ways that raters organized and elaborated on their comments, suggests that different raters had different approaches to scoring that were not necessarily tied to the severity or consistency of their scores. It is interesting to note that this finding stands somewhat in contrast to a finding of the broader study, which was that better-performing raters tended to check the exemplar responses more often and took longer to reach a scoring decision (Davis, 2016). However, we should also note that this finding and the results reported below should be considered preliminary, given the very small sample of raters in the current qualitative analysis.

TABLE 2.1 The Relative Frequency of Raters' Comments on Rubric Categories

                 More-Proficient       Intermediate          Less-Proficient
                 R101  R102  R123      R117  R120  R125      R108  R109  R113
Session 1
  Delivery        44%   38%   14%       20%   29%   31%       24%   47%   22%
  Language Use    30%   43%   35%       28%   38%   22%       26%   44%   28%
  Topic Dev.      27%   20%   51%       52%   32%   47%       50%    8%   49%
Session 2
  Delivery        39%   33%   30%       19%   29%   16%       14%   32%   25%
  Language Use    18%   28%   27%       29%   22%   16%        6%   29%   13%
  Topic Dev.      44%   39%   43%       52%   49%   69%       80%   39%   62%
Session 4
  Delivery        46%   41%   30%       18%   43%   16%       21%   32%   33%
  Language Use    17%   31%   45%       37%   11%   38%       13%   38%   35%
  Topic Dev.      37%   28%   25%       46%   46%   46%       66%   31%   31%

Note: Percentages for each rater may not add to exactly 100% due to rounding.

Despite the limited scope of the data, a few trends over time were apparent. In particular, following training (session 2), the proportion of comments focused on topic development increased and became more consistent across raters. This increase was seen for raters in all three groups, with more dramatic differences seen for less-proficient and intermediate raters. For some individual raters, a dramatic increase in the frequency of such comments was seen. For example, only 8% of Rater 109's comments related to language phenomena were about issues of topical development in session 1, but this figure increased to 39% in session 2. Moreover, by session 2, most raters made more comments regarding topical development than they did for either delivery or language use. While comments regarding delivery remained at approximately the same level across sessions (typically 20% to 40%), comments addressing language use decreased in relative frequency for all but one rater, suggesting that the increase in attention given to topical development may have come at the expense of attention to language use issues. This change may be related to the scoring commentaries provided during the rater-training session, which focused exclusively on matters of topical development. No such instructions were provided for the other two categories of
the scoring rubric (delivery and language use). The focus on topic development seemed to diminish by the final scoring session, however, where roughly equal mention was made of the three language categories included in the scoring rubric.
Comments about the scoring process were also generally more frequent following rater training, particularly for less-proficient raters (Table 2.2). In session 1, the frequency of scoring comments as a percentage of all comments ranged from 3% to 12% for this group, increasing to 13% to 28% following training and suggesting a more explicit focus on the scoring process.

TABLE 2.2 The Frequency of Raters' Comments on Scoring, or Not Directly Addressing Scoring Criteria, as a Percentage of All Comments

                 More-Proficient       Intermediate          Less-Proficient
                 R101  R102  R123      R117  R120  R125      R108  R109  R113
Session 1
  Scoring          6%    3%   13%       15%   19%   27%        3%    7%   12%
  Other            7%   21%   17%       10%   44%   17%       49%   15%   18%
Session 2
  Scoring          5%    9%   26%       19%   28%   24%       13%   28%   16%
  Other            6%   12%    9%        4%   12%   17%       31%   11%   20%
Session 4
  Scoring          6%   13%   32%       16%   14%   33%       13%   27%   28%
  Other           11%    2%    9%        8%   10%    6%       25%    3%   16%

Given that reflective practice has been reported as one characteristic of experts (Feltovich, Prietula, & Ericsson, 2006; Tsui, 2003), the observation that less-proficient raters talked more about the scoring process seems counterintuitive. On the other hand, it is possible that a focus on the scoring process was prompted by negative feedback received during training, or by the provision of the scoring rubric. Comments made by Rater 109 illustrate both of these possibilities. In comments made during session 2, Rater 109 repeatedly referred to the rater training, in which she received feedback that her scores were too low. (In fact, in session 1, she was the most severe of all 20 raters in the larger study.) Comments included several statements like the following:
"mm so far I'm thinking this is a four. uh I tend to grade a little bit harsher than what happened in the rater training, so I'm trying to see if it would hit five or not, or, yeah, but so far, I'm thinking it's, perfect four, which, I end up doubting this judgment now because, on the rater training when I thought something was a perfect four, uh it was a five so."
In addition, it was also clear in both session 2 and session 4 that Rater 109 was consulting the scoring rubric while scoring. In one example, the rubric is quoted verbatim (with quoted material enclosed in quotation marks):
"so I'm seeing a number two with 'the response demonstrating a limited range and control of grammar and vocabulary', and it 'prevents the full expression of ideas', and there's difficulty with fluidity and, although sentences are very simple, and some connections, (to) things are a little bit unclear, so that part, is two."
In this sequence, Rater 109 provides two direct quotes and a paraphrase of the scoring rubric description of language use expected for a score of two. This behavior is much like that observed by Kim (2011), who found that inexperienced raters in a speaking test relied more on the scoring rubric than did more experienced raters.
In Kim's (2011) study, inexperienced raters also mentioned construct-irrelevant features of performance, and similar patterns were seen in this study. Table 2.2 shows the frequency with which raters made general, unclear, or miscellaneous comments. Comments in each of these categories typically did not exceed 20% of
all comments, although there were two noticeable exceptions in the first session, where such comments accounted for more than 40% of the total for one less-proficient rater (R108) and one intermediate rater (R120). The frequency of unfocused comments decreased after training for seven of nine raters, and in general the frequency was lowest at the final scoring session. However, individual variability was relatively large compared to differences among groups, suggesting that scoring performance was not closely associated with the frequency of irrelevant comments. Nonetheless, irrelevant comments were generally fewer following training, suggesting that raters' attention became more focused on language features mentioned in the scoring rubric, a pattern that was particularly evident among those who made the most such comments at the beginning. These findings are generally in keeping with results from research on writing assessment, such as studies by Cumming (1990) and Wolfe, Kao, and Ranney (1998), which reported that raters with more experience in either teaching or scoring tended to make more mention of specific linguistic features and made fewer comments regarding phenomena not mentioned in the scoring rubric.
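The percentages reported in Tables 2.1 and 2.2 are relative frequencies of coded comment units per rater and session. The following sketch shows that tabulation on invented records; the data structure and labels are illustrative only and do not reproduce the study's coding files.

    from collections import Counter

    # Each coded unit is (rater, session, category); the records below are
    # invented examples, not the study's data.
    coded_units = [
        ("R101", 1, "Delivery"), ("R101", 1, "Topic Dev."), ("R101", 1, "Delivery"),
        ("R101", 1, "Language Use"), ("R101", 1, "Scoring"),
    ]

    def category_shares(units, rater, session, categories):
        """Relative frequency of each rubric category among one rater's
        rubric-related comments in one session (as in Table 2.1)."""
        counts = Counter(cat for r, s, cat in units
                         if r == rater and s == session and cat in categories)
        total = sum(counts.values())
        return {cat: counts[cat] / total for cat in categories} if total else {}

    rubric_cats = ["Delivery", "Language Use", "Topic Dev."]
    print(category_shares(coded_units, "R101", 1, rubric_cats))
    # e.g., {'Delivery': 0.5, 'Language Use': 0.25, 'Topic Dev.': 0.25}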
Implications for Policy, Practice, and Future Research
While the rater training used in this study positively influenced scoring performance, the impact on raters' internalized scoring criteria was less clear. More- and less-proficient raters differed little in the frequency with which they mentioned different scoring criteria, and a diligent effort by a less-proficient rater to rigorously apply the scoring rubric (R109, above) did not lead to greatly improved scoring performance in terms of severity or agreement with reference scores.
This finding also raises the question of the degree to which explicit descriptions of performance, such as those found in scoring rubrics, actually influence the scores awarded. Raters are faced with the task of applying relatively vague scoring criteria to complex phenomena, and written scoring criteria might be used more to justify a decision made intuitively than to actually make the decision itself (Lumley, 2005). Rather than being a definitive statement of the language features measured by a particular test, it seems possible that the scoring rubric simply provides one piece of evidence, and that the actual construct operationalized in a particular test resides just as much in the specific exemplars and training procedures used to develop a shared understanding among raters. If this hypothesis is true, then careful development of materials beyond the scoring rubric will be important for ensuring the construct validity of raters' judgments.
Moreover, when the current findings are viewed in light of the findings reported in Davis (2016), it seems possible that the ability to quantify the quality of test-taker performances could be somewhat distinct from the ability to focus on appropriate aspects of performance. The provision of exemplars alone was sufficient for most raters to achieve reasonably good scoring performance, and score accuracy improved following training (Davis, 2016). However, while training affected the language features noticed by raters and the attention given to the scoring process, raters' perceptions were far from uniform. It would seem the training protocol used in this study was more effective in standardizing raters' magnitude judgments and less effective in standardizing their qualitative observations. This difference could be an indication that developing each ability requires different types of learning experiences. For example, extensive practice scoring of many examples might promote scoring consistency and accuracy, while intensive qualitative training in understanding and applying the scoring criteria might help to standardize raters' perceptions of relevant language phenomena. Such a distinction is speculative, but it raises interesting possibilities for further research, which might directly compare the impact of learning experiences designed to promote either internalization of the rating scale or a more standardized and explicit understanding of the features of good performance.
References
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1–15.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–51.
Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86(1), 67–96.
Davis, L. E. (2012). Rater expertise in a second language speaking assessment: The influence of training and experience (Doctoral dissertation, University of Hawai'i at Manoa). Retrieved from https://scholarspace.manoa.hawaii.edu/handle/10125/100897
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117–135.
Educational Testing Service. (2007). The official guide to the new TOEFL iBT (2nd ed.). New York, NY: McGraw Hill.
Educational Testing Service. (2008). TOEFL iBT public use dataset [data files, score prompts, scoring rubrics]. Princeton, NJ: Educational Testing Service.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online training program for L2 writing assessment. Language Testing, 24(1), 37–64.
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (2nd ed.). Cambridge, MA: MIT Press.
Fahim, M., & Bijani, H. (2011). The effects of rater training on raters' severity and bias in second language writing assessment. Iranian Journal of Language Testing, 1(1), 1–16.
Feltovich, P. J., Prietula, M. J., & Ericsson, K. A. (2006). Studies of expertise from psychological perspectives. In K. A. Ericsson, N. Charness, P. J. Feltovich, & R. R. Hoffman (Eds.), The Cambridge handbook of expertise and expert performance (pp. 41–67). Cambridge, UK: Cambridge University Press.
Fulcher, G. (2003). Testing second language speaking. Harlow, UK: Longman.
Furneaux, C., & Rignall, M. (2007). The effect of standardization training on rater judgements for the IELTS Writing Module. In L. Taylor & P. Falvey (Eds.), IELTS collected papers: Research in speaking and writing assessment (pp. 422–445). Cambridge, UK: Cambridge University Press.
Gass, S. M., & Mackey, A. (2000). Stimulated recall methodology in second language research. Mahwah, NJ: Lawrence Erlbaum Associates.
Green, A. (1998). Verbal protocol analysis in language testing research: A handbook. Cambridge, UK: Cambridge University Press.
Huot, B. (1990). The literature of direct writing assessment: Major concerns and prevailing trends. Review of Educational Research, 60(2), 237–263.
Kang, O. (2008). Ratings of L2 oral performance in English: Relative impact of rater characteristics and acoustic measures of accentedness. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 6, 181–205.
Kim, H. J. (2011). Investigating raters' development of rating ability on a second language speaking assessment (Unpublished doctoral dissertation, Teachers College, Columbia University). Retrieved from ProQuest Dissertations and Theses database. (UMI No. 3448033)
Knoch, U. (2011). Investigating the effectiveness of individualized feedback to rating behavior – a longitudinal study. Language Testing, 28(2), 179–200.
Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560.
Lumley, T. (2005). Assessing second language writing: The rater's perspective. Frankfurt, Germany: Peter Lang.
Lumley, T., & Brown, A. (2005). Research methods in language testing. In E. Hinkle (Ed.), Handbook of research in second language teaching and learning (pp. 833–856). Mahwah, NJ: Lawrence Erlbaum.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54–71.
May, L. (2009). Co-constructed interaction in a paired speaking test: The rater's perspective. Language Testing, 26(3), 397–421.
McNamara, T. (1996). Measuring second language performance. London, UK: Longman.
Meiron, B. E. (1998). Rating oral proficiency tests: A triangulated study of rater thought processes (Unpublished master's thesis). California State University Los Angeles, Los Angeles, California.
Milanovic, M., Saville, N., & Shen, S. (1996). A study of the decision-making behaviour of composition markers. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (LTRC), Cambridge and Arnhem (pp. 92–114). Cambridge, UK: Cambridge University Press.
Myford, C. M., & Wolfe, E. W. (2000). Monitoring sources of variability within the Test of Spoken English assessment system (TOEFL Research Report No. 65, RR-00-6). Princeton, NJ: Educational Testing Service.
Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998). Designing second language performance assessments. Honolulu, HI: University of Hawai'i, National Foreign Language Resource Center.
Orr, M. (2002). The FCE Speaking test: Using rater reports to help interpret test scores. System, 30(2), 143–154.
Papajohn, D. (2002). Concept mapping for rater training. TESOL Quarterly, 36(2), 219–233.
Sakyi, A. A. (2000). Validation of holistic scoring for ESL writing assessment: How raters evaluate compositions. In A. J. Kunnan (Ed.), Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 129–152). Cambridge, UK: Cambridge University Press.
Smith, D. (2000). Rater judgments in the direct assessment of competency-based second language writing ability. In G. Brindley (Ed.), Studies in immigrant English language assessment (pp. 159–190). Sydney, Australia: National Centre for English Language Teaching and Research, Macquarie University.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Tsui, A. B. M. (2003). Understanding expertise in teaching: Case studies of ESL teachers. Cambridge, UK: Cambridge University Press.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111–125). Norwood, NJ: Ablex.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287.
Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.
Winke, P., Gass, S., & Myford, C. (2011). The relationship between raters' prior language study and the evaluation of foreign language speech samples (TOEFL iBT Research Report No. 16, RR-11-30). Princeton, NJ: Educational Testing Service.
Wolfe, E. W., Kao, C.-W., & Ranney, M. (1998). Cognitive differences in proficient and nonproficient essay scorers. Written Communication, 15(4), 465–492.
APPENDIX
Selection Criteria for Raters

More-proficient raters
Severity: Severity as measured by multi-facet Rasch measurement (MFRM) consistently within +/- .5 logit of the mean (zero) in all scoring sessions
Consistency: MFRM rater infit mean square values .9 or lower for all scoring sessions
Accuracy: Pearson correlation with the reference scores .8 or higher for all sessions

Less-proficient raters
Severity: At least one session with MFRM severity greater than +/- 1 logit away from the mean; relatively large changes in severity across sessions that do not trend towards the mean
Consistency: No specific criterion
Accuracy: Pearson correlation with the reference scores .75 or lower, with no consistent upward trend over time

Intermediate raters
Severity: No specific criterion
Consistency: MFRM infit mean square values consistently decreasing over time, with a total decrease of at least .2
Accuracy: Pearson correlations with the reference scores consistently increasing over time, with a total increase of at least .1
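The selection criteria listed above can be read as a rule-based classification over each rater's per-session statistics. The sketch below encodes the more-proficient and less-proficient rules in Python; the input format (per-session severity, infit, and accuracy values) is assumed for illustration, and the trend-based parts of the criteria (severity changes across sessions, upward trends in accuracy) are simplified to keep the example short.

    # Rule-based grouping of raters using the Appendix criteria (simplified).
    # Input: lists of per-session MFRM severity (logits), infit mean square,
    # and Pearson correlation with the reference scores. Format is assumed.

    def is_more_proficient(severity, infit, accuracy):
        return (all(abs(s) <= 0.5 for s in severity)   # within +/- .5 logit of the mean
                and all(f <= 0.9 for f in infit)        # infit mean square .9 or lower
                and all(r >= 0.8 for r in accuracy))    # correlation with reference scores .8+

    def is_less_proficient(severity, accuracy):
        far_from_mean = any(abs(s) > 1.0 for s in severity)  # at least one session beyond +/- 1 logit
        low_accuracy = all(r <= 0.75 for r in accuracy)      # accuracy .75 or lower
        return far_from_mean and low_accuracy

    # Hypothetical rater record across four scoring sessions:
    severity = [0.2, -0.3, 0.1, 0.4]
    infit = [0.85, 0.80, 0.88, 0.82]
    accuracy = [0.84, 0.86, 0.88, 0.90]
    print(is_more_proficient(severity, infit, accuracy))  # True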
3
THE IMPACT OF RATER EXPERIENCE AND ESSAY QUALITY ON THE VARIABILITY OF EFL WRITING SCORES
Özgür Şahan
Issues That Motivated the Research
While assessing writing, raters go through three main stages in which they (1) read the text to create a text image, (2) analyze the text image, and (3) make their evaluations (Freedman & Calfee, 1983). Although raters go through these stages in the same order, they might rely on their own idiosyncratic experiences, beliefs, values, world knowledge, and understanding of writing when creating the text image. In this sense, raters seem to play a prominent role in the assessment process (Lumley, 2005). Furthermore, the rating assigned to a composition reflects multiple interactions of several factors, including the examinee, the writing prompt or task, the composition itself, the rater(s), and the scoring scale (Hamp-Lyons, 1990; McNamara, 1996). As such, the ratings given to students' compositions can be subjective in that they reflect not only the quality of the performances but also the quality of scorers' judgments (McNamara, 2000), putting the rater at the center of the rating process (Attali, 2015).
Rater-related factors have been examined from two perspectives: the essay features that raters attend to during their assessments and the effects of rater characteristics on essay scoring (Weigle, 2002). As a rater characteristic, previous scoring experience seems to play a key role in the assessment process (Barkaoui, 2010). Although one might assume that experienced raters are more reliable scorers, the literature suggests that there is no clear relationship between rater experience and scoring consistency and leniency. Some research has found that more-experienced raters tend to be more consistent in their scoring. For example, Cumming (1990) found that novice and expert raters assigned significantly different scores to the sub-scales of content and rhetorical organization, but they did not differ significantly in their scores on the language use component. Moreover, Cumming contended
that expert raters were more consistent than their novice peers in the scores they assigned to the aforementioned three aspects. In the same vein, Lim (2011) noted that the rating quality of novice raters increased with practice, suggesting a positive relationship between rating quality and experience. Similarly, Leckie and Baird (2011) found that raters became more homogeneous as they scored more essays. Other research, however, has produced contradictory findings on the relationship between rater experience and scoring severity, with some research finding that less-experienced raters tended to assign higher scores to essays (Rinnert & Kobayashi, 2001) and other research suggesting that inexperienced raters were more severe than experienced raters in their evaluations (Weigle, 1999). Investigating the differences in scores assigned using different rating methods, Song and Caruso (1996) concluded that more-experienced raters gave significantly higher holistic scores than less-experienced raters, although they did not differ significantly in their analytic ratings. In another study, both experience groups were more lenient in their analytic scores than in their holistic scores, and novice raters were more lenient than expert raters when using both scoring methods (Barkaoui, 2010).
Research has also investigated the effect that text quality has on raters' reactions and decisions concerning students' compositions (Brown, 1991; Engber, 1995; Ferris, 1994; Han, 2017; Huang, 2008; Huang, Han, Tavano, & Hairston, 2014). Brown (1991) examined the differences between the ratings assigned to essays written by English as a second language (ESL) and native English-speaking (NES) students, and the results suggested no significant differences between the scores assigned to the two student groups' compositions. Huang (2008) similarly investigated the reliability of the scores given to essays composed by ESL and NES students. He found that ESL students' performance was rated significantly lower than that of the NES students. Although the largest variance component contributing to score variability for both groups was found to be the writing skills of the students, the residual variance for ESL students' ratings was significantly larger than that for NES students' ratings, suggesting that more unexplained variance contributed to the variability of scores assigned to ESL students' writing performance than to those of NES students. In addition, students' L1 background was found to be a salient factor contributing to the differences between the ratings assigned to the two groups. Huang suggests that these findings indicate a potential fairness problem in the writing assessment of ESL and NES students.
Considering the writing abilities of students studying English as a foreign language (EFL), Han (2017) examined the ratings assigned to low-, medium-, and high-quality essays. The results revealed that raters assessed high-quality papers more consistently than they assessed low-quality papers. In the same vein, Huang et al. (2014) found that raters gave similar scores to high-quality essays but varied in the scores they gave to low-quality EFL essays. In addition, Ferris (1994) found that raters awarded higher scores to ESL compositions in which students used diverse syntactic and lexical tools correctly. Similarly, Engber (1995) concluded that there was a positive correlation between the correct use of lexical items and the ratings assigned to ESL texts.
The discussion above highlights the importance of the rater in writing assessment. However, previous research has primarily focused on differences between novice and expert raters. Thus, this study attempted to address a research gap by examining raters with varying levels of experience. Moreover, the contradictory findings of previous research on the role of experience, together with the scarcity of research investigating the impact of text quality on ESL/EFL writing assessment, motivated me to investigate the role of rating experience and essay quality in the variability of essay scores in the Turkish higher education context.
Context of the Research
The purpose of this study was to investigate the variability of ratings assigned to EFL essays of different qualities in Turkish higher education by raters with varying levels of experience. EFL writing assessment is carried out mainly for three purposes in Turkey. Firstly, according to the regulations of the Council of Higher Education, universities offer one-year intensive English preparatory programs to students enrolled in departments where the medium of instruction is English. In these programs, students' writing abilities are assessed in high-stakes entrance and exit tests. Secondly, EFL writing is offered as a separate course in different forms, such as advanced writing or academic writing, in the preparatory programs and English-major departments, and students' progress is evaluated through in-term and end-of-term exams. Thirdly, some universities require a good command of writing as a prerequisite for participation in international exchange programs. University test development units in Turkey are responsible for preparing the exams for the abovementioned purposes; however, different rating protocols and the lack of assessment standards between institutions, and even within the same educational body, raise reliability concerns in language performance assessment. Given that the assessment of students' EFL writing performance in higher education plays an important role in students' academic careers, ensuring reliability to provide students with fair judgments is essential in tertiary education in Turkey.
Research Questions Addressed
Building on previous empirical research, this study investigates the impact of rating experience along a gradient with respect to the assessment of essays of distinct text quality by addressing the following research questions (RQs):
1. Are there any significant differences among the analytic scores assigned to low- and high-quality EFL essays?
2. Are there any significant differences among the analytic scores assigned by raters with varying previous rating experience?
3. What are the sources of score variation that contribute most (relatively) to the score variability of the analytic scores of EFL essays?
4. Does the reliability (e.g., dependability coefficients for criterion-referenced score interpretations and generalizability coefficients for norm-referenced score interpretations) of the analytic scores assigned by raters differ based on their amount of experience?
Research Methods
The data used in this study were obtained from a larger study, which used a convergent parallel design as a mixed-methods approach (Şahan, 2018). In this approach, the qualitative and quantitative strands of the research interact during data collection and interpretation, but are independent during data analysis (Creswell, 2011). This chapter reports on the quantitative aspect of that research.
The research included 33 EFL raters who were employed at higher education institutions in Turkey. While 15 participants were full-time EFL instructors in the English preparatory program of the same institution, 16 raters worked in the same position at 13 different state universities. The remaining two raters served as research assistants in the English Language Teaching (ELT) departments of their respective universities. The raters had all graduated from ELT or English Language and Literature departments in Turkey and had the same L1 background (Turkish). The participants had varying levels of experience in teaching and assessing writing in EFL contexts. The raters were divided into three experience groups based on their reported experience scoring EFL essays: low-experienced raters, who reported four or fewer years of experience (n = 13); medium-experienced raters, with five to six years of experience (n = 10); and high-experienced raters, who reported seven or more years of experience (n = 10).
Data Collection Procedures
The essays used in this study were collected from first-year students in an ELT department (prospective English teachers) at a state university in Turkey. Using a holistic scale, 104 essays were evaluated by three expert EFL writing raters in order to divide them by quality. These expert raters were not included as raters in the main data collection portion of the study. For the main data collection process, 25 low- and 25 high-quality essays were selected, as agreed upon by the expert raters. The participating raters scored these essays using a 10-point analytic scoring scale (see Appendix A) adapted from Han (2013). The scoring scale included five components with differing score weights: grammar (1.5 pts.), content (3 pts.), organization (2.5 pts.), style and quality of expression (2 pts.), and mechanics (1 pt.). Each sub-scale had five performance bands with varying cut points and score intervals. The score distribution of the analytic rubric was arranged in accordance with the feedback that raters provided following a rubric adaptation process in which raters were asked to comment on descriptors, weight distribution, and performance band cut-scores. The scoring process yielded 1,650 total scores and 8,250 sub-scores.
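Because the analytic scale's five components carry different maximum points, an essay's total is simply the sum of its weighted sub-scores out of 10 (and 33 raters scoring 50 essays on five components yields the 1,650 total scores and 8,250 sub-scores reported above). The short Python sketch below, with hypothetical sub-score values and variable names, illustrates that arithmetic; it is not the instrument itself.

    # Maximum points per component of the 10-point analytic scale.
    WEIGHTS = {
        "grammar": 1.5,
        "content": 3.0,
        "organization": 2.5,
        "style_and_expression": 2.0,
        "mechanics": 1.0,
    }

    def total_score(sub_scores):
        """Sum the five sub-scores; each must not exceed its component maximum."""
        for component, value in sub_scores.items():
            assert 0 <= value <= WEIGHTS[component], f"{component} out of range"
        return sum(sub_scores.values())

    # Hypothetical rating of one essay:
    essay = {"grammar": 1.0, "content": 2.5, "organization": 2.0,
             "style_and_expression": 1.5, "mechanics": 0.75}
    print(total_score(essay))  # 7.75 out of 10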
Data Analysis Procedures
The data obtained from the essay ratings were analyzed using descriptive and inferential statistics in SPSS 24.0. Descriptive statistics were used to observe the general trends in the data. An independent-samples t-test was used to investigate whether the scores assigned to essays of different qualities showed statistically significant differences, and Kruskal-Wallis and Mann-Whitney U tests were conducted to examine whether the rater groups differed from each other significantly in their essay scores. Furthermore, generalizability (G-) analysis was used to investigate the effect of raters on the variability of essay scores. G-theory is a theoretical framework used to estimate multiple sources of variation or measurement error within a given assessment context (Brown & Bailey, 1984; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991). G-theory analysis consists of two stages, a generalizability study (G-study) and a decision study (D-study). The former is used to estimate variance sources and their relative contributions to the ratings within the given assessment, while the latter stage allows the decision-maker to estimate the generalizability (G, or Eρ²) and dependability (Φ) coefficients by varying the number of conditions of a facet (Brennan, 2001). The information obtained from D-studies is used for future test designs and assessment practices.
In the current study, students (persons, p) were treated as the object of measurement, and given that all participating raters (r) assessed essays of two distinct qualities (q), the G-study design was completely crossed as p × r × q. Using this person-by-rater-by-quality random effects G-study design, seven independent sources of variation were obtained: persons (p), raters (r), quality (q), person-by-rater (p × r), person-by-quality (p × q), rater-by-quality (r × q), and person-by-rater-by-quality (p × r × q). Moreover, separate G-study analyses were conducted for high-quality and low-quality essays to estimate the relative contributions of three variance components, persons (p), raters (r), and person-by-rater (p × r), to the variability of ratings. In addition to calculating G and Φ coefficients, I conducted D-studies to compare the G and Φ indices for high- and low-quality essay scores assigned by low-, medium-, and high-experienced raters. Although several programs, including GENOVA, ETUDGEN, SPSS, SAS, and MATLAB, can be used for G-theory analysis, I opted to use the computer program EDUG 6.0 (Swiss Society for Research in Education Working Group, 2006) because of its user-friendly interface (Clauser, 2008; Teker, Güler, & Uyanık, 2015).
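The separate high- and low-quality analyses use a single-facet, fully crossed p × r random-effects design. To make the mechanics of a G-study and D-study concrete, the sketch below estimates the three variance components from ANOVA mean squares and then computes generalizability (Eρ²) and dependability (Φ) coefficients for a chosen number of raters. This is only an illustrative Python sketch on simulated scores; it is not the EDUG 6.0 analysis used in the study, and the function names and simulated data are assumptions.

    import numpy as np

    def g_study_p_by_r(scores):
        """ANOVA-based variance components for a fully crossed p x r design
        (rows = persons/essays, columns = raters, one score per cell)."""
        scores = np.asarray(scores, dtype=float)
        n_p, n_r = scores.shape
        grand = scores.mean()
        person_means = scores.mean(axis=1)
        rater_means = scores.mean(axis=0)

        ss_p = n_r * ((person_means - grand) ** 2).sum()
        ss_r = n_p * ((rater_means - grand) ** 2).sum()
        ss_total = ((scores - grand) ** 2).sum()
        ss_pr = ss_total - ss_p - ss_r

        ms_p = ss_p / (n_p - 1)
        ms_r = ss_r / (n_r - 1)
        ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

        var_pr_e = ms_pr                          # person-by-rater interaction (with error)
        var_r = max((ms_r - ms_pr) / n_p, 0.0)    # rater main effect
        var_p = max((ms_p - ms_pr) / n_r, 0.0)    # person (object of measurement)
        return var_p, var_r, var_pr_e

    def d_study(var_p, var_r, var_pr_e, n_raters):
        """Generalizability (relative) and dependability (absolute) coefficients
        for a D-study with n_raters raters."""
        g_coef = var_p / (var_p + var_pr_e / n_raters)
        phi = var_p / (var_p + (var_r + var_pr_e) / n_raters)
        return g_coef, phi

    # Simulated example: 25 essays scored by 10 raters.
    rng = np.random.default_rng(0)
    true_quality = rng.normal(7, 1, size=(25, 1))
    rater_effect = rng.normal(0, 0.3, size=(1, 10))
    scores = true_quality + rater_effect + rng.normal(0, 0.5, size=(25, 10))

    var_p, var_r, var_pr_e = g_study_p_by_r(scores)
    print(d_study(var_p, var_r, var_pr_e, n_raters=10))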
Findings and Discussion

To answer RQ1, a t-test analysis revealed significant differences in the ratings assigned to high-quality (M = 7.67, SD = 0.50) and low-quality essays (M = 4.65, SD = 1.09; t (33.5) = 12.55, p < .001, two-tailed). Furthermore, when the ratings assigned by the three experience groups were analyzed using descriptive statistics, I observed that high-experienced raters tended to assign higher scores than
less-experienced raters (mean scores are reported on the 10-point analytic scoring scale). The low-, medium-, and high-experienced raters assigned mean scores of 7.40, 7.53, and 8.17, respectively, to high-quality essays, and mean scores of 4.25, 4.58, and 5.24, respectively, to low-quality essays. These descriptive results suggest a relationship between raters' experience level and the essay scores they assigned. To investigate whether this pattern was statistically significant between rater groups, non-parametric tests were conducted to answer RQ2. A Kruskal-Wallis test revealed no statistically significant differences in the mean scores assigned by rater groups (p > .05) to high-quality essays; however, a Kruskal-Wallis test did reveal a statistically significant difference in the mean scores assigned to low-quality essays across the three experience groups (Gr1, n low-experienced = 13; Gr2, n medium-experienced = 10; Gr3, n high-experienced = 10): χ2 (2, n = 33) = 6.72, p = .04. The high-experienced group awarded a higher median score (Mdn = 5.27) than the other two groups: median values of 4.28 for the low-experienced group and 4.38 for the medium-experienced group. Given the findings of the Kruskal-Wallis test, follow-up Mann-Whitney U tests were conducted to determine whether the scores of the groups were statistically significantly different from each other. The Mann-Whitney U test revealed a statistically significant difference between the low-experienced (Mdn = 4.28, n = 13) and high-experienced (Mdn = 5.27, n = 10) groups (U = 23, z = -2.61, p = .009, r = .54). Using the same non-parametric tests, the scores assigned to the sub-scale components of the analytic scoring system were also analyzed to investigate whether raters' scores assigned to the rubric components differed from each other significantly. A Kruskal-Wallis test revealed statistically significant differences in the mean scores assigned to the mechanics component of low-quality essays across the three experience groups (Gr1, n low-experienced = 13; Gr2, n medium-experienced = 10; Gr3, n high-experienced = 10): χ2 (2, n = 33) = 6.06, p = .048. For the mechanics component of the low-quality essays, the high-experienced group recorded a higher median score (Mdn = 0.68) than the low-experienced group (Mdn = 0.50) and the medium-experienced group (Mdn = 0.58). Given the findings of the Kruskal-Wallis test, follow-up Mann-Whitney U tests were performed to determine which of the rater experience groups were statistically significantly different from each other. The Mann-Whitney U test revealed a statistically significant difference between the low-experienced (Mdn = 0.50, n = 13) and high-experienced (Mdn = 0.68, n = 10) rater groups in terms of the mechanics component scores they awarded (U = 29.5, z = -2.20, p = .028, r = .46). No significant differences between groups were found for the other components of either high- or low-quality essays. As for RQ3, G-studies were conducted for mixed-quality essays, high-quality essays, and low-quality essays in order to estimate the variance sources contributing to the score variability. Table 3.1 reports the variance components for mixed-quality ratings and their relative effects on scoring.
TABLE 3.1 Variance Components for Random Effects P × R × Q Design

Source    df       σ2       %
P         24       2.59     45.3
R         32       0.09     1.6
Q         1        -0.03    0
PR        768      0.43     7.6
PQ        24       0.01     0.2
RQ        32       0.80     14
PRQ       768      1.79     31.3
Total     1,649             100
Ep2                         .98
Ф                           .98

As shown in Table 3.1, the largest variance (45.3%) was attributable to persons (students), indicating that the writing task worked well to distinguish students'
EFL writing performances. This result is desirable, given that the purpose of the assessment task is to differentiate students' writing abilities. The second greatest variance component was the residual (31.3%), which was obtained from the interaction of raters, compositions, essay quality, and other unexplained unsystematic and systematic sources of error. The third largest source of variance (14%) contributing to the score variability was the interaction between raters and essay quality, indicating that raters differed substantially while scoring compositions of distinct qualities. As can be seen in Table 3.1, the fourth largest variance source was the interaction between persons and raters (7.6%); this result suggests that there was inconsistency between certain raters in terms of severity while assessing certain essays. The remaining variance sources, including rater, essay quality, and person-by-quality, were negligible, since their relative contributions to the variability of scores were small (1.6%, 0%, and 0.2%, respectively). In order to estimate the score variance components for the scoring of essays of different qualities, separate G-study analyses were conducted for high- and low-quality essays. Table 3.2 reports the variance sources and their relative contributions to the dependability of both high- and low-quality essay scorings.
TABLE 3.2 Variance Components for Random Effects P × R Design

                       High-quality            Low-quality
Source    df         σ2        %             σ2        %
P         24         0.20      6.8           1.15      30.4
R         32         1.15      39            0.88      23.4
PR        768        1.60      54.2          1.74      46.2
Total     824                  100                     100
Ep2                            .81                     .96
Ф                              .71                     .94

Table 3.2 reveals that the greatest variance component was found to be the residual for both high-quality (54.2%) and low-quality (46.2%) essays, indicating that a large variance source cannot be explained in these designs due to the interaction between persons, raters, and other systematic and unsystematic error sources. For high-quality essays, the second largest variance component following the residual was the rater facet (39%), indicating that raters' scores assigned to high-quality papers were markedly inconsistent. However, the contribution of the rater facet to the score variance for low-quality essays was relatively small (23.4%), indicating that raters were more consistent while grading low-quality essays. The smallest variance was attributable to persons (6.8%) in the scorings of high-quality essays, while persons explained a larger variance (30.4%) in the scorings
of low-quality essays, indicating that low-proficiency students differed substantially from one another in their writing abilities compared to students in the higher proficiency band. Given the particular selection of high- and low-quality essays for the analyses, smaller variances were expected from the object of measurement. That is, because the essays were split into two quality divisions (low- and high-quality essays) for analysis, the students within each subgroup were not expected to vary greatly from one another in terms of their writing abilities. Finally, as can be seen in Table 3.2, the dependability (Ф) and generalizability (G) coefficients observed from the scoring of low-quality essays were larger than those obtained from the scorings of high-quality essays, suggesting that raters were more consistent while grading low-quality essays. Although the residuals obtained in the G-study analyses were high, near-perfect generalizability (G) and dependability (Ф) indices were attained. This finding might be due to the number of conditions in the rater facet (a high number of raters), in that a greater number of conditions leads to higher dependability indices (Brennan, 2001; Shavelson & Webb, 1991). Thus, D-studies were conducted by decreasing the number of raters in order to find the minimum number of raters required for reliable scoring. Furthermore, to address RQ4, G and Ф coefficients were calculated for the scorings of mixed-, high-, and low-quality essays assigned by the three rater experience groups. Table 3.3 shows the G and Ф indices pertaining to each rater group's scorings.
TABLE 3.3 Generalizability and Dependability Coefficients across Rater Groups

                                                  EFL Compositions
                               Mixed-quality (n = 50)   Low-quality (n = 25)   High-quality (n = 25)
Rater Group          NRaters     Ep2      Ф               Ep2      Ф               Ep2      Ф
Low-experienced      13          .95      .93             .89      .85             .57      .44
Medium-experienced   10          .93      .92             .85      .79             .52      .46
High-experienced     10          .95      .93             .87      .85             .47      .27

As can be seen in Table 3.3, rater groups scored mixed-quality and low-quality essays more consistently than they did high-quality essays. When considering the scores assigned to high-quality essays, lower coefficients were observed for every rater group. In particular, the most-experienced raters were the most inconsistent group while grading high-quality essays. Using the variance components in the G-study designs, several D-studies were conducted until acceptable dependability coefficients (i.e., above .80) were obtained. In doing so, the number of raters was decreased for mixed-quality and low-quality essay scorings, since the current scenario resulted in high G and Ф
coefficients. However, for high-quality essays, assessment scenarios with more raters were modeled until an acceptable degree of reliability was attained. The analyses revealed that three raters would be enough to observe acceptable generalizability (.85) and dependability (.81) coefficients for scoring mixed-quality essays. As for low-quality essays, ten raters would be needed to obtain acceptable coefficients (G = .87 and Ф = .81). When it comes to the scorings of high-quality essays, a minimum of 58 raters would be necessary to produce coefficients above .80 (G = .88 and Ф = .81). These findings suggest that double-grading, or even including three raters, would be necessary for fair judgment in the assessment of high-stakes EFL writing tests. However, reliable scoring could only be obtained in the evaluation of EFL compositions at similar performance bands if an unrealistic number of raters were included, with ten raters needed for low-quality essays and 58 raters needed for high-quality essays. These findings suggest that, while raters were able to distinguish between high- and low-quality essays and this quality division was reflected in their assigned scores, raters were less reliable when assessing essays of similar quality. More-experienced raters were found to be more lenient while scoring EFL essays compared to their less-experienced peers, a result which corroborates Weigle's (1999) findings. However, unlike the findings of the present study, Song and Caruso (1996) found that raters with varying experience did not differ in their analytic scorings. In addition, other research has suggested that less-experienced raters tend to assign higher scores (Rinnert & Kobayashi, 2001) and be less severe in their analytic ratings (Barkaoui, 2010, 2011; Sweedler-Brown, 1985). The tendency of more-experienced raters to assign higher scores in this study might be explained by high-experienced raters' broader repertoire of written performances, resulting in more realistic judgments. The G-study analyses revealed that rater groups were inconsistent while grading high-quality essays, which contradicts the findings of previous research suggesting that raters assigned more consistent scores while grading high-quality texts (Han, 2017; Huang et al., 2014). Although raters gave higher scores to high-quality essays compared to low-quality essays, the inconsistency between raters
regarding their judgments of high-quality essays would be of particular concern in high-stakes test situations (e.g., when ratings are used to determine scholarships, admission to academic programs, or eligibility for English-medium departments). In such situations, score differences of a single point could have practical outcomes for students. In line with previous research (e.g., Engber, 1995; Ferris, 1994; Han, 2017; Huang et al., 2014; Song & Caruso, 1996), raters’ tendency to assign similar scores to low-quality essays in this study suggests that they agree more in their judgment of EFL essays with poor text features. The variation among raters while assessing high-quality essays suggests a fairness problem in EFL assessment for student writers with better writing abilities. In other words, while raters’ interpretations of poor EFL text features were similar, they varied in their understanding of a quality construct for high-quality essays even when guided by detailed scoring criteria. This variation might be related to rater bias in that I informed the raters about the students’ academic background prior to the assessment task and thus some raters might have expected better performances from their future colleagues, awarding lower scores to high-quality essays than they perhaps deserved.
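The D-study arithmetic behind the reported panel sizes can be approximately reproduced from the rounded variance components in Table 3.2. The sketch below (Python) projects the G (Ep2) and Ф coefficients for the panel sizes reported above; it is a simplified recomputation for illustration, with invented function names, not the EDUG output itself.

```python
# Variance components from Table 3.2 (random effects p x r design, rounded).
COMPONENTS = {
    "high-quality": {"p": 0.20, "r": 1.15, "pr": 1.60},
    "low-quality":  {"p": 1.15, "r": 0.88, "pr": 1.74},
}

def d_study(var_p: float, var_r: float, var_pr: float, n_raters: int):
    """Project the relative (Ep2) and absolute (Phi) coefficients for a
    hypothetical panel of n_raters fully crossed with the same persons."""
    ep2 = var_p / (var_p + var_pr / n_raters)
    phi = var_p / (var_p + (var_r + var_pr) / n_raters)
    return ep2, phi

# Panel sizes reported in the chapter: 10 raters for low-quality essays,
# 58 raters for high-quality essays.
for quality, n_raters in (("low-quality", 10), ("high-quality", 58)):
    v = COMPONENTS[quality]
    ep2, phi = d_study(v["p"], v["r"], v["pr"], n_raters)
    print(f"{quality}, {n_raters} raters: Ep2 = {ep2:.2f}, Phi = {phi:.2f}")
# -> low-quality, 10 raters: Ep2 = 0.87, Phi = 0.81
# -> high-quality, 58 raters: Ep2 = 0.88, Phi = 0.81
# These projections closely match the D-study coefficients reported above.
```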
Implications for Policy, Practice, and Future Research

In light of the findings, this chapter provides four avenues for future research in line with implications for policy and practice. First, continuous and systematic rater training is necessary, even for experienced raters, to reduce the gap between ratings. Furthermore, traditional rater training models can be revisited, as score differences between experience groups might be related to varying interpretations of or emphasis on specific essay components. Although they applied the same analytic scoring criteria, the high- and low-experienced raters in this study differed in their judgments of low-quality essays, particularly with respect to mechanics. In this regard, future research can examine the impact of a rater-training model that guides scorers to consider multiple aspects of an essay in accordance with the scoring criteria. Second, as in many EFL/ESL language assessment contexts, some English programs in Turkey have standard assessment procedures, such as double-grading. However, this research suggests that double-grading may not necessarily provide reliable ratings. Because this study found that high-experienced raters were more lenient compared to their less-experienced peers, English programs can consider pairing relatively high-experienced and relatively low-experienced raters when double-grading students' essays. In other words, if two less-experienced raters grade a certain EFL essay, they might assign lower scores, while the same performance could be awarded a higher score if assessed by two high-experienced raters. In this regard, the effect of institutional assessment policies and practices on the reliability of scoring requires re-examination in light of empirical evidence. Third, although analytic scales are considered more reliable than holistic scoring, variation was observed in the scores assigned by raters when using an
analytic scoring scale in this study. As a result, future research is needed to improve the implementation of analytic scoring scales (e.g., by exploring the differences between traditional scoring scales and context-bound scoring criteria that blend local, cultural, and institutional dynamics with respect to raters' scoring standards). Fourth, the dependability coefficients obtained from the D-studies suggest that three raters would be needed to attain reliable scores. However, including three raters for the assessment of EFL essays would not be feasible in most contexts in Turkey, particularly in large-scale assessment. Therefore, to provide students with fair judgments and to maximize the reliability of ratings in writing performance assessments, increasing the number of tasks might be more cost-efficient than increasing the number of ratings per task (Lee, Kantor, & Mollaun, 2002). Thus, future research can explore the effects of using multiple tasks or topics, rather than a single task or topic, on the reliability of scores within a given test. As with all research, this study was limited in some respects. First, although raters were oriented to use the analytic scoring scale during the adaptation process, the results may have differed had the raters received more in-depth training prior to the assessment task. Second, to avoid pressuring volunteer raters in the research context, I did not control the rating time and conditions. However, differences in rating time and conditions might have affected the dependability of the scorings. Third, although a familiar and interesting topic was chosen for this research, students' writing and raters' grading performances might have been different with a different topic. In addition to the highlighted areas of research, each of these limitations suggests an avenue for further research.
References

Attali, Y. (2015). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99–115.
Barkaoui, K. (2010). Do ESL essay raters' evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1), 31–57.
Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279–293.
Brennan, R. L. (2001). Generalizability theory: Statistics for social science and public policy. New York, NY: Springer-Verlag. Retrieved on April 30, 2017 from www.google.com.tr/search?hl=tr&tbo=p&tbm=bks&q=isbn:0387952829
Brown, J. D. (1991). Do English and ESL faculties rate writing samples differently? TESOL Quarterly, 25(4), 587–603.
Brown, J. D., & Bailey, K. M. (1984). A categorical instrument for scoring second language writing skills. Language Learning, 34(4), 21–42.
Clauser, B. E. (2008). A review of the EDUG software for generalizability analysis. International Journal of Testing, 8(3), 296–301.
Creswell, J. W. (2011). Educational research: Planning, conducting, and evaluating quantitative and qualitative research (4th ed.). New Delhi, India: PHI Learning Private.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: Wiley.
Cumming, A. (1990). Expertise in evaluating second language composition. Language Testing, 7(1), 31–51.
Engber, C. A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4(2), 139–155.
Ferris, D. R. (1994). Lexical and syntactic features of ESL writing by students at different levels of L2 proficiency. TESOL Quarterly, 28(2), 414–420.
Freedman, S. W., & Calfee, R. C. (1983). Holistic assessment of writing: Experimental design and cognitive theory. In P. Mosenthal, L. Tamor, & S. A. Walmsley (Eds.), Research on writing: Principles and methods (pp. 75–98). New York, NY: Longman.
Hamp-Lyons, L. (1990). Second language writing: Assessment issues. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 69–87). New York, NY: Cambridge University Press.
Han, T. (2013). The impact of rating methods and rater training on the variability and reliability of EFL students' classroom-based writing assessments in Turkish universities: An investigation of problems and solutions (Unpublished doctoral dissertation). Atatürk University, Turkey.
Han, T. (2017). Scores assigned by inexpert raters to different quality of EFL compositions, and the raters' decision-making behaviors. International Journal of Progressive Education, 13(1), 136–152.
Huang, J. (2008). How accurate are ESL students' holistic writing scores on large-scale assessments? A generalizability theory approach. Assessing Writing, 13(3), 201–218.
Huang, J., Han, T., Tavano, H., & Hairston, L. (2014). Using generalizability theory to examine the impact of essay quality on rating variability and reliability of ESOL writing. In J. Huang & T. Han (Eds.), Empirical quantitative research in social sciences: Examining significant differences and relationships (pp. 127–149). New York, NY: Untested Ideas Research Center.
Leckie, G., & Baird, J. A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48(4), 399–418.
Lee, Y.-W., Kantor, R., & Mollaun, P. (2002, April). Score dependability of the writing and speaking sections of new TOEFL. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans. Abstract retrieved from ERIC. (ERIC No. ED464962)
Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560.
Lumley, T. (2005). Assessing second language writing: The rater's perspective. New York, NY: Peter Lang.
McNamara, T. F. (1996). Measuring second language performance. London, UK and New York, NY: Addison Wesley Longman.
McNamara, T. F. (2000). Language testing. Oxford, UK: Oxford University Press.
Rinnert, C., & Kobayashi, H. (2001). Differing perceptions of EFL writing among readers in Japan. The Modern Language Journal, 85(2), 189–209.
Şahan, Ö. (2018). The impact of rating experience and essay quality on rater behavior and scoring (Unpublished doctoral dissertation). Çanakkale Onsekiz Mart University, Turkey.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native English-speaking and ESL students? Journal of Second Language Writing, 5(2), 163–182.
Sweedler-Brown, C. O. (1985). The influence of training and experience on holistic essay evaluation. English Journal, 74(5), 49–55.
Swiss Society for Research in Education Working Group. (2006). EDUG user guide. Neuchâtel, Switzerland: IRDP.
Teker, G. T., Güler, N., & Uyanık, G. K. (2015). Comparing the effectiveness of SPSS and EduG using different designs for generalizability theory. Educational Sciences: Theory & Practice, 15(3), 635–645.
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6(2), 145–178.
Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.
APPENDIX A
Analytic Scoring Scale

Rater's Name:                                   Your Score: ….. /10

Score and Criteria

Grammar
0–0.2: Many recurring errors in syntax and morphology, resulting in ungrammatical sentences that hinder meaning. Numerous errors such as tense, subject-verb agreement, article, and preposition errors, run-on sentences and fragments, sentence structure errors. The text is not clear or understandable.
0.3–0.5: Errors in syntax and morphology that hinder meaning. Frequent errors such as incorrect use of auxiliaries and modals, tense agreement, article, and preposition errors, fragments and run-on sentences, difficulty forming complex sentences. The text is difficult to understand and meaning is often lost.
0.6–0.9: Moderate-level accuracy in syntax and morphology. Some errors such as occasional incorrect use of auxiliaries and modals, tense agreement, articles and prepositions, fragments and run-on sentences. Some local errors in forming complex sentences. The text is mostly clear with some meaning loss.
1.0–1.2: Generally accurate use of syntax and morphology. Few errors such as incorrect uses of auxiliaries and modals, tense agreement, article and preposition errors. Rare use of fragments or run-on sentences. A few local errors in forming complex sentences. The text is generally clear and understandable.
1.3–1.5: Almost no errors with syntax and morphology. Almost no errors such as incorrect use of auxiliaries, modals, tense agreement, article or preposition errors. Correct use of complex sentences. The text is clear and understandable.

Content
0–0.5: Off-topic. The text contains almost no elaboration of a single topic or introduces multiple topics. Almost no evidence of a thesis statement and supporting details. Lacks evidence of critical thought.
0.6–1.1: Wanders off-topic. Ideas are loosely connected and underdeveloped. Major problems with thesis statement and supporting details. Poor evidence of critical thought.
1.2–1.8: Generally on-topic. Ideas are not fully developed. Some problems with thesis statement and supporting details. Occasionally includes irrelevant details. Some evidence of critical thought.
1.9–2.4: On-topic. Ideas are generally developed. Almost no problems with thesis statement and supporting details. A few irrelevant details are given. Satisfactory evidence of critical thought.
2.5–3.0: On-topic. Ideas are fully developed. Clear thesis statement and relevant supporting details. Strong evidence of critical thought.

Organization
0–0.4: Almost no introduction or conclusion. The body of the text lacks unity and cohesion. Almost no organization.
0.5–0.9: Poor introduction and conclusion. The body of the text lacks organization and transition between ideas. Weak unity and cohesion. Poor logical organization of paragraphs.
1.0–1.5: Fair introduction and conclusion. The body of the text partly lacks flow of ideas, appropriate transitions, and clear supporting ideas. Some issues with unity and cohesion. Some issues with logical organization of paragraphs.
1.6–2.0: Clear introduction and conclusion. The body of the text mostly includes clear supporting ideas and transitions. Almost no issues with unity and cohesion. Almost no issues with logical organization of paragraphs.
2.1–2.5: Exemplary introduction and conclusion. The body of the text includes clear supporting ideas and transitions. Demonstrates exemplary unity and cohesion. Logical organization of paragraphs.

Style and quality of expression
0–0.3: Many language errors that interfere with meaning. Many direct translations from Turkish. Weak and inappropriate vocabulary. Unrelated and repetitive sentences.
0.4–0.7: Frequent language errors and direct translations from Turkish. Limited and repetitive vocabulary. Lacks sentence variety.
0.8–1.2: Some language errors and direct translations from Turkish. Moderate use of vocabulary. A few repetitive sentences.
1.3–1.6: Few language errors and direct translations from Turkish. Appropriate and varied vocabulary. Sufficient sentence variety.
1.7–2.0: Exemplary language use with almost no errors. A wide range of advanced vocabulary. Exemplary sentence variety.

Mechanics
0–0.1: Numerous and recurring spelling, punctuation, and capitalization errors that interfere with meaning.
0.2–0.3: Many spelling, punctuation, and capitalization errors that occasionally interfere with meaning.
0.4–0.6: Some spelling, punctuation, and capitalization errors that rarely interfere with meaning.
0.7–0.8: A few spelling, punctuation, and capitalization errors that do not interfere with meaning.
0.9–1.0: Almost no spelling, punctuation, or capitalization errors.
4
ASSESSING SECOND LANGUAGE WRITING: RATERS' PERSPECTIVES FROM A SOCIOCULTURAL VIEW

Yi Mei
Issues that Motivated the Research

The essay format is widely used within large-scale assessments to assess writing directly. Despite the recent prominence of automated scoring, most essays are still scored by human raters. Researchers have attempted to uncover the nature of essay rating by examining how raters respond to writing tasks, essays, and rating scales, and how these interactions affect raters' decision-making processes and results (e.g., Barkaoui, 2010; Shi, 2001; Wolfe, Kao, & Ranney, 1998). Researchers have conducted extensive studies on the writing aspects to which raters attend when assessing essays under different conditions (e.g., Broad, 2003; Wolfe et al., 1998). Various models have been proposed to determine how raters respond to key texts (e.g., writing tasks, essays, rating scales) and the rating sequences raters follow to form scoring decisions (e.g., Cumming, Kantor, & Powers, 2002; Lumley, 2005). While these studies enrich our understanding of this human cognitive process, they often assume a decontextualized view that does not address the interactions between raters and the sociocultural contexts where the essay rating occurs. Consequently, rating studies conducted in different sociocultural contexts often yield discrepant results about the writing aspects raters focus on and the rating processes they follow. These conflicting results suggest that raters may interact not only with those key texts while rating essays, but also with the sociocultural environment. Therefore, essay rating ought not to be seen as a solely cognitive process but rather as a socially situated practice with socially constructed meanings, motives, and consequences. A situated understanding of essay rating may yield contextualized interpretations, with important implications for the understanding of scoring reliability and improvement in essay-rating practices.
Researchers have called for attention to the social aspects of essay rating. Lumley (2005) investigated the influences of some social factors (e.g., test stakes, rater training, institutional requirements) on rater decision-making. Lumley's milestone work included a framework that attempted to offer insights into raters' minds during decision-making. However, the indeterminate process of how raters resolve rating tensions remains unaddressed. While these studies have started to reveal the influences of sociocultural contexts on raters' decision-making, they are confined by the cognitive approach they follow and hence cannot fully address rater-context interactions. This gap reiterates the need for a comprehensive understanding of essay rating that situates this activity within its sociocultural context.
Context of the Research

Testing has played an important role in Chinese social and educational life, having wide social acceptance and recognition as a fair process to select talented individuals into the social hierarchy. Among the numerous tests Chinese students take throughout their years of education, the national University Entrance Examination, known as the Gaokao, may bear the highest stakes, as university admission decisions are based solely on Gaokao results. The Gaokao, administered to over nine million students every June, indirectly drives educational reforms because of its strong influences on teaching and learning in secondary education. The context for the study was the Gaokao's compulsory English component, the National Matriculation English Test (NMET). The NMET includes an essay writing section, worth 25 of the NMET's 150 points. In the writing task, students are given a prompt and asked to respond with a purposive letter of approximately 100 words. Recruited high school English teachers typically work centrally at a provincial rating center. They rate essays by all provincial test-takers based on a five-band rating scale, with five points within each band. Due to the high-stakes impact of the Gaokao and the role of English teachers as NMET raters, it is essential to examine these raters' essay-rating activity as socially situated. My study drew on Engeström's (1999, 2001) cultural-historical activity theory (CHAT) from sociocultural theory to reconceptualize essay rating. CHAT has evolved into a well-established approach to examining human activities from a sociocultural perspective, having been applied in research areas including linguistics, cognitive science, and education (Lantolf & Thorne, 2006). CHAT acknowledges the central role of social relationships and culturally constructed artifacts in organizing human cognition (Lantolf & Thorne, 2006). Rather than being restricted to simple stimulus-response reflexes, humans are able to make indirect connections, or mediate between incoming stimulation and their responses, through various links to achieve certain goals. Those mediated, goal-directed actions are not isolated events; they are related, and can be understood with respect to the collective human activity.
Human activity is a hierarchical process relating to different levels of consciousness (Engeström, 1999). This process includes individuals' goal-directed actions (the physical or cognitive acts an individual does to achieve desired results) and the collective's object-oriented activity (sets of various actions motivated by socially or culturally constructed objectives or purposes). This hierarchical structure directs our attention to understanding, from a sociocultural perspective, how mental and observable activity can be regarded as an integrated unit of analysis, and how interactions between the two affect both individuals and the environment (Yamagata-Lynch, 2010). This human activity can be represented in the form of an activity system (Engeström, 1999), which encompasses (1) individuals as subjects performing goal- or object-oriented activities; (2) tools, which include physical objects and signs that facilitate the activities; (3) rules that dictate how the tools are used and regulate how the subjects act to obtain the objects; (4) the community of which the subjects are members when engaging in the activities; (5) the division of labor that describes the continuously negotiated distribution of responsibilities among community members; and (6) the outcomes that are the results or consequences of an activity. I categorize tools, rules, community, and division of labor as mediators, which are the indirect connections between the subjects and their desired object (or goals) that are shaping and being shaped by the activity. These four mediators constantly interact with one another, during which systemic contradictions emerge. Resulting tensions within the system can affect the interactions between mediators, thus either promoting or hindering the subjects' ability to attain an object (Yamagata-Lynch, 2010). With respect to CHAT, essay-rating activity (ERA) can be viewed as a complex, socially mediated activity with two hierarchical layers. The cognitive layer (individual raters' goal-directed decision-making actions; reading essays or assigning scores, for instance) is understood in relation to the social layer (e.g., their collective object-oriented ERA system; a team of essay raters making scoring decisions used for university admission decisions). The sociocultural contexts are the external, socially organized, historical, and cultural worlds inseparably linked with raters' cognition. An ERA sociocultural context includes the four mediators and the interactions among them within the activity system when a subject (rater) tries to achieve the object (goals). The cognition is distributed in the individual and in the sociocultural context (Cole & Engeström, 1993). The unit of analysis is the collective activity, because the durable collective activity makes individuals' actions understandable by bringing social meanings to the short-lived, goal-directed actions. For example, when assessing essays a rater may compare an essay against the rating scale descriptors to assign a score, while sometimes the same rater might compare essay A with essay B to accomplish the same goal. It is hard to understand why the same rater behaves inconsistently if one looks only at the individual actions. Instead, the rater's behavior might be better understood within its context, that is, the rating of essays used for high-stakes purposes (e.g.,
university admission decisions). In this context, the rater might have tried to act as expected by the rating institution while attempting to assign fair scores when encountering rating difficulties.
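Purely as an organizing aid, and not as part of the study's method, the six activity-system components just described can be sketched as a simple data structure. The class and field names below are hypothetical, and the NMET entries are illustrative placeholders drawn from the context described in this chapter.

```python
from dataclasses import dataclass

@dataclass
class ActivitySystem:
    """The six components of an activity system (Engeström, 1999); tools,
    rules, community, and division of labor act as mediators between the
    subject and the object of the activity."""
    subject: str             # individuals performing the activity
    tools: list              # physical objects and signs that facilitate the activity
    rules: list              # what regulates how subjects act and use the tools
    community: list          # the groups the subjects belong to during the activity
    division_of_labor: dict  # negotiated distribution of responsibilities
    outcome: str             # results or consequences of the activity

# Illustrative values only (a sketch of the NMET essay-rating activity).
nmet_era = ActivitySystem(
    subject="NMET essay rater (recruited high school English teacher)",
    tools=["NMET essays", "writing task", "rating scale", "rater training"],
    rules=["institutional rating procedures"],
    community=["peer raters", "rating-center staff"],
    division_of_labor={"raters": "score essays",
                       "rating center": "organize and monitor the rating"},
    outcome="scores used for university admission decisions",
)
```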
Research Questions Addressed

This study sought to understand the extent to which ERA was socially mediated by the sociocultural context of the NMET at one provincial rating center in China. This center typifies provincial centers found across the country, with similar structure and work procedures for essay rating. Two research questions were addressed:

1. (Cognitive layer) How do raters assess NMET essays to achieve their goals?
2. (Both cognitive and social layers) What is the nature of NMET ERA within the local sociocultural context?
Research Methods

I employed a qualitative multiple-method case study design based on multiple perspectives (Merriam, 2009). The first research question was addressed using data from think-aloud protocols, stimulated recalls, and interviews produced by eight NMET essay raters. The second research question was addressed using data from documents and interviews with rating-center directors, team leaders, and high school teacher essay raters, in addition to the data from the same eight raters for the first research question. A total of 15 participants were involved. The data were analyzed using open and axial coding techniques, before being integrated and mapped onto the CHAT framework for a comprehensive and contextualized understanding.
Data Collection Procedures

Data were collected between December 2014 and July 2015. I first established rapport by interviewing the eight raters about their NMET rating experiences. These eight raters then produced think-aloud protocols individually while assessing eight NMET sample essays, before providing stimulated recalls and follow-up interviews. Studies employing think-aloud protocols usually involve a small number of participants, as they focus on providing rich data on thought processes (e.g., Lumley, 2005). The interviews focused on the raters' perceptions and thought processes. I also interviewed five team leaders and two directors within this rating community. Team leaders selected anchor essays, discussed operational rating guidelines, conducted rater training, reviewed score discrepancies, and monitored essay-rating quality. These team leaders provided their perspectives as go-betweens
who interacted with both raters and directors. The directors provided perspectives regarding administrative support for the rating activity. Documents (e.g., news articles on Gaokao rating, NMET test syllabi, participants' work notes) were also collected to understand the context.
Data Analysis Procedures

Think-aloud protocols, stimulated recalls, and interviews were audio-recorded and transcribed. All transcripts and collected documents were imported into MAXQDA 11 (qualitative data analysis software) and manually coded. In the absence of a second coder capable of working with large volumes of verbal data in Chinese, I discussed the coding scheme regularly with my supervisor, a Chinese speaker, throughout this process to minimize possible inconsistent coding by a single researcher. To further enhance the study's trustworthiness and rigor, I maintained a chain of evidence (Yin, 2009), which enabled me to identify and track the sources of evidence leading to a particular statement. I also conducted member checking by summarizing and sharing my major findings with my participants for feedback, to ensure the results captured their perspectives. My data analysis was guided by Strauss and Corbin's (1990) open and axial coding techniques. The analysis of think-aloud protocols and stimulated recalls was conceptually informed by the descriptive framework of rater decision-making behaviors from Cumming et al. (2002). After analyzing the data, I used the CHAT framework for data integration by mapping my findings onto different components in the framework and identifying tensions within the activity system to form a comprehensive picture of the two-layered rating activity.
Findings and Discussion

The findings and discussion are organized as follows: (1) raters' goal-directed rating actions (cognitive layer); (2) raters' object-oriented NMET ERA (social layer); and (3) roles of tensions and solutions (connecting the cognitive and social layers).
Raters' Goal-directed Rating Actions

Raters' goal-directed rating actions—the cognitive layer of the activity—include their rating foci, sequence, and influential factors. In terms of the rating foci, raters attended most frequently to three writing aspects: content coverage, language quality, and writing tidiness. Content coverage refers to raters' notions of content that an essay ought to include. Language quality consists of various linguistic features, including syntax and morphology; lexis; the frequency and gravity of errors; comprehensibility and naturalness; and cohesion and coherence. Writing tidiness refers to legibility and neatness of students' handwritten work.
Two general trends arose. The raters paid the most attention to language quality and the least to writing tidiness; and within language quality, raters most frequently attended to syntax or morphology, and least frequently to cohesion and coherence. These aspects were largely aligned with the prescribed rating scale descriptors (available in Chinese only). Within each of the five bands, the scale requires that judgment be made primarily based on task completion in three dimensions: content coverage, coherence, and the range and accuracy of the lexical and grammatical resources used. These findings are consistent with previous studies in which raters were found to focus on a wide range of writing aspects, such as content, grammar, handwriting, organization, coherence, and task completion (e.g., Barkaoui, 2010; Broad, 2003; Li & He, 2015). With regard to the rating sequence, while attending to the three foci, all eight raters followed a similar four-step process to rate NMET essays. They formed a general impression by scanning the essay (in whole or in part). They made an initial scoring decision based on this impression. They would then read the rest of the essay or skim it again, looking primarily for evidence to support their initial decisions, particularly on content coverage and language quality. Finally, they articulated their final score decisions based on their overall impressions of a composition, fine-tuning their scores within the band range. These findings support previous cognitive approach studies showing that raters appear to follow a certain procedure when rating essays, although the sequence may differ. For example, some researchers have found that raters tend to employ different approaches or styles (e.g., Wolfe et al., 1998), while others found that raters go through a similar process, either sequential (e.g., Crisp, 2012) or complex and interactive (e.g., Lumley, 2005). Raters in this study demonstrated uniformity in their rating foci and sequence, suggesting a strong training effect acquired from NMET rating sessions. Factors that influenced raters' decision-making can be grouped into five categories: institutional requirements, high-stakes consequences for students, saving "face", prior teaching and rating experience, and collegial advice. Institutional requirements were raters' dominant considerations, derived from the documents setting out the rating scale and specifications, rater training, and the rating quality indicators used to monitor performance. Participants considered the NMET rating scale and specifications documents to be "framework documents" (according to the rating center directors) that established the fundamental rating principles and criteria. Rater training, through anchor essays and leadership teams, complemented the rating scale and specifications by norming raters' interpretation and application of the NMET rating criteria. The rating quality indicators, which include the ratios of raters' valid ratings and error ratings to the total number of essays rated, also influenced raters' decisions. The high-stakes implications for students that arise from scoring decisions led raters to score essays more leniently. However, they did not do so to the extent
that it would threaten the fairness to other students, as raters' heightened sense of responsibility and dedication to being impartial remained at the forefront. Saving "face" means that a rater tried to avoid being judged as less competent than other raters while claiming deference from them. Information about individuals' rating quality indicators was often shared within each team. Raters sometimes felt the shared statistics potentially threatened their face. To avoid losing face, raters attended to maintaining or improving their own statistics, which included considering the highest and lowest scores a second rater might assign to the same essay and then assigning a score within this range. Raters' prior teaching and rating experience was used to judge essay language quality and to envision students' personal situations (e.g., proficiency level, attitude). Prior NMET rating experience was also cited by returning raters as a factor that contributed to rating speed and confidence about final score decisions. Finally, the rating sequences and score decisions of some first-time raters were shaped by advice from experienced NMET rating colleagues. This advice included how to read NMET essays more efficiently and reduce scoring leniency. In previous studies, raters' decision-making was found to be primarily affected by factors relating to the raters themselves, the writing task, the essay, and the rating scale (e.g., Barkaoui, 2010). Some studies added sociocultural factors, such as consequences of scoring decisions for test-takers, institutional constraints, test purposes, and raters' community of practice (e.g., Cumming et al., 2002; Lumley, 2005; Mei & Cheng, 2014). One factor, saving face, was uniquely identified in this study. Otherwise, this study contained no unexpected findings when compared to previous cognitive approach studies. The influencing factors showed that although similar rating foci and sequential procedures among raters seemed to suggest a strong training effect, their decision-making actions in fact involved interactions among various contributing factors, which were far more complicated than the assumed training effect. The investigation of rating foci and sequence alone thus seems inadequate to capture such complexity. To further understand these raters' decision-making, it is important to situate their actions within the wider sociocultural context where those actions—that is, their object-oriented rating activity—took place.
Raters' Object-oriented NMET ERA

The essay raters' participation at the rating center can be conceptualized as an NMET ERA system, with essay raters being the subject of the rating activity. There was a wide range of mediating tools available to these raters, including:

• resources provided by the rating center (NMET essays, writing tasks, the rating scale and specifications, rater training, and rating quality indicators); and
• additional influential factors, including possible scores by a second rater, teaching and rating experience, and collegial advice.
Raters intended to achieve two objects in this activity: (1) rating with high accuracy and speed, as expected by the rating center and related to raters' concern with saving face; and (2) being accountable to student writers, due to raters' awareness of the high-stakes consequences of their ratings. Four rules (two institutional, two sociocultural) guided these raters in this activity. The institutional rules were that raters need to follow the rating scale and specifications, and that they should maintain high accuracy and speed. Similar institutional rules can be found in the previous literature on essay rating (e.g., Lumley, 2005). The sociocultural rules included the high-stakes nature of the Gaokao for students, a pervasive concern among all raters, and raters' attempts at saving face, which was highly pronounced in this study. Face—a person's social self-image—is not only a means of social interaction through which social relations are formed, but also functions as an objective of social relations (Qi, 2011). In this study, the matter of face involved judgments of a rater's capacity to perform the NMET rating task and the rater's social standing relative to other raters, which was related to a sense of the rater's fulfilment of obligations (e.g., rating accuracy and speed) as a member of the NMET rating community (Qi, 2011). Evaluation of a rater's face involves both internal (self) and external sources (social others), the interaction between the rater's own perception of the rating statistics and that of other raters, and a sense of being watched by others. Face is quantifiable in the sense that gaining or losing face amounts to an addition or subtraction from an existing stock of face; and face can also be threatened or saved in social interactions (Qi, 2011). For example, when raters saw that their valid ratings or rating speed were lower than their colleagues', they felt their face threatened; but they gained face when the opposite was true. If their valid ratings were below the bar, they lost face. As face is subject not only to external judgment but also to internal assessment, this face evaluation often leads to subsequent, varied emotions within face holders (Goffman, 1967; Qi, 2011). Participants' performance varied with their emotions: Raters felt proud when gaining face; nervous when their face was threatened; and embarrassed when losing face. The power of face also influences a person's behavior, motivating the person to enhance or change face states (Qi, 2011). To save face, raters would try to improve their rating performance. Through face, raters interacted with others in the rating community, and saving face could create an incentive for raters to behave in certain ways; therefore, saving face acted as another sociocultural rule. In brief, the institutional and sociocultural rules together influenced raters' decision-making and dictated their activities. The community to which these raters belonged included their peer essay raters, team leaders, and directors at the rating center. The division of labor within this community was shared as follows:

• raters followed institutional rules while coping with rating difficulties;
• team leaders elaborated the rating criteria, standardized raters' application of those criteria, settled rating disagreements, and monitored rating quality and progress; and
• directors provided administrative support, ensuring overall rating quality and timely completion of the rating task.
The outcome of this activity was that most raters managed to satisfy institutional requirements while ensuring accountability to student writers. The findings about raters’ object-oriented activity provided a broadened view by situating raters’ cognitive actions within the surrounding sociocultural context. This activity, however, is not without tensions. Such tensions connect the cognitive and social layers and illustrate the complexity of this rating activity.
Roles of Tensions and Solutions

Tensions are the pressures arising from the internal, systemic contradictions that raters encounter while participating in the NMET ERA. These contradictions are caused by the interactions among the mediators—tools, rules, community, and division of labor—in the activity system. As tensions can either promote or hinder raters' abilities to attain their objects, they are the driving forces of action and change in the activity system. The conditions of the activity system investigated here caused six related tensions within and among different components. Within the rules component, the two institutional rules conflicted with each other, forming a circular Tension A between rating essays based on the rating criteria and maintaining high rating accuracy and speed. Raters were required to score essays based on the stated criteria despite encountering numerous textual features that may land outside those criteria. All participants described the pressure of applying the inadequately defined rating criteria (band descriptors) to certain essays. Tension A further triggered another issue, Tension B, with these raters' object, that is, trying to maintain high rating accuracy and speed while working with inadequately defined criteria. Participants all attempted to maintain rating accuracy and speed as a self-generated goal associated with institutional requirements and face-saving concerns. This situation generated tensions between raters' division of labor and their two objects (rater reliability being the first and accountability the second), Tensions C and D, respectively. Raters' responsibilities led to self-developed strategies (e.g., studying the rating criteria more carefully, examining anchor essays, predicting possible scores assigned by other raters) to cope with rating difficulties. These self-developed strategies may also have weakened rater reliability and accountability. Meanwhile, a fifth issue, Tension E, formed between tools and one of the raters' objects, that is, dealing with insufficient resources while raters tried to maintain accuracy and speed. This tension exacerbated Tension B and caused Tension F between the object of rating accuracy and speed and the sociocultural
rule of saving face. Participants expressed concerns about losing face in front of others if they failed to maintain high rating accuracy and speed. The tensions revealed the rating difficulties and challenges raters encountered. In an attempt to alleviate the above six tensions, raters employed various mediating tools provided by the rating center and brought in additional mediating tools to attain their objects; both sets of tools contained factors that affected raters' decision-making. I summarize below how raters addressed those six tensions. To alleviate the circular Tension A within rules, raters studied the features of anchor essays and used the scores assigned by expert raters as reference points. Raters also consulted team leaders for clarification or assistance with score decisions or drew on their teaching or rating experience to help interpret the band descriptors. Some raters used their rating quality indicators to identify the extent to which their scores deviated from the norm by strategically using the scores they assumed a second rater might assign as a point of reference. Raters used their prior experience along with their colleagues' advice to alleviate Tension B between rules and their first object, and Tension E between tools and their first object, coping with the inadequate rating criteria and resources. Their prior teaching experience helped them to envision a student's English proficiency level, while their NMET experience made the rating process "faster and more accurate" (according to teacher participants). In addition, raters discussed uncertain ratings and strategies with colleagues to address rating challenges. Similarly, to reduce Tensions C and D between division of labor and their two objects (i.e., following rules while maintaining rater reliability and accountability), raters considered teaching and rating experience and collegial advice. Some raters also compared essays against each other to ensure rating equitability and used rating quality indicators to check their rating performance. If they noticed their rating performance declining, they also predicted possible scores assigned by a second rater to ensure the scores they assigned did not treat students unfavorably. Raters strategically attempted to ensure satisfactory rating quality indicators to relieve Tension F, the one between their first object and the sociocultural rule of saving face. They learned from their colleagues' experiences, but some compromised their own judgments by assigning an essay score that would fall within the score range of a second rater to avoid generating a rating error. In general, raters were trying to maintain a good rating record while preserving face. The similarities among the six tensions led the raters to apply the same tools to solve different tensions, and sometimes different tools for a particular tension. For ease of discussion, I have separated these tensions and their corresponding solutions; in real life, they were concurrent and intermingled. For example, raters used their rating quality indicators to identify whether or not their performance deviated from the norm. If their indicators were not good enough, this triggered three types of concerns associated with raters' objects and the rules: inability to meet institutional requirements; possible assignment of inequitable scores; and appearing less competent than others, a potential threat to their face. The use of various mediating tools was a
strategy to minimize the discrepancies between their ratings and those of their peer raters. As a result, those inherent tensions were alleviated but never resolved, which explains why those raters still struggled at the end of their rating experiences, despite the rating-center system evaluating their rating performance as satisfactory. The rating tensions and raters' corresponding solutions represent one unique finding in this study. Rating criteria usually establish a frame of reference by identifying key writing aspects so that essays with numerous different features can be assessed. Any set of rating criteria will remain inadequate to cover all the situations that raters may encounter while rating numerous essays (Lumley, 2005). In this study, this inadequate tool created conflict within the two institutional rules, resulting in a cascading series of tensions. Lumley identified a tension between the inadequate rating scale and the intuitive impression in rater decision-making, whereby raters had to reconcile their impression of an essay and the rating scale wording before producing a final essay score. My findings confirmed the presence of this tension, and additionally revealed that such a tension was masked by a complex tension mechanism in rater decision-making that involved constant interactions between different mediators in the activity system. These findings furthered the understanding of how raters' cognition interacted with this testing context. Guided by the CHAT framework and my findings, I proposed a surface structure and a deep structure conceptualization to better understand how raters assessed NMET essays to achieve their goals. I borrowed the linguistic terms surface structure and deep structure from Chomsky's (1965) work in transformational grammar. Deep structures are generated by phrase-structure rules that form the basis of abstract (semantic) meanings, while surface structures are specific sentences (phonetic forms) that express those meanings derived from deep structures (through a series of transformations). With this bilevel conception of grammatical structures, Chomsky attempted to explain why sentences with different phonetic forms may have similar semantic meanings while similar phonetic forms may have different semantic meanings. I see a similar relationship between the cognitive and social layers of ERA. The surface structure reflects what raters' decision-making looked like on the cognitive layer, including findings about raters' rating foci, rating sequence, and factors influencing their decision-making. The deep structure explains the how and the why of raters' decision-making by drawing on evidence from the social layer, visualized as an activity system highlighting interactions between raters' cognition and the sociocultural context of the activity. The surface and deep structures are interrelated through a series of tensions and raters' corresponding solutions. This bilevel conceptualization can explain why raters may turn to the same tools to solve different tensions or adopt different tools to solve similar systemic tensions.
Implications for Policy, Practice, and Future Research
This study adopted a sociocultural perspective through the CHAT framework to postulate a bilevel conceptualization (i.e., surface and deep structures) and
to examine NMET essay rating as both cognitive and social processes of rater decision-making, revealing its socially mediated nature. The findings highlight the value of assuming a sociocultural view of cognition, where social interaction plays a fundamental role in cognitive functioning. Looking into the social layer provides nuanced insights into rater decision-making embedded in the sociocultural context. Raters' interactions with the sociocultural context may provide a plausible and meaningful explanation of the indeterminate process of how raters resolved tensions first introduced in Lumley's (2005) study. This study has two major implications for language education practitioners and policy makers. First, the findings have implications for improving NMET essay rater training practices. Rater concerns about interrater agreement could have both positive and negative impacts on raters' essay rating; NMET rating centers should thus be careful to avoid overemphasizing interrater agreement. Instead of imposing an emphasis on rules that allow few opportunities to consider raters' viewpoints, training ought to include regular sessions in which raters discuss with their peers the essays they have difficulty scoring, so that they can construct and deepen their understanding of the intricacies of the rating criteria. Second, this study suggests the practical value of applying CHAT as a tool for improving essay-rating practices. A CHAT analysis can help us better understand what works and what does not in essay-rating practices. Use of the CHAT framework as a sociocultural approach also broadens opportunities for essay-rating research. Future studies can replicate the current study in different contexts or conditions, which may include essay rating with different stakes or different purposes (e.g., selection, certification, placement, diagnosis). Research could also involve raters from different cultural contexts or different rater environments (e.g., working alone or collaboratively, at home or at a rating center). CHAT analyses may help us better understand differences in raters' rating processes and results that are related to the different contexts or conditions. Future essay-rating research can also apply CHAT to help understand inconsistent findings from cognitive evidence. Previous studies have suggested that raters may employ different strategies for reading essays, interpreting rating scales, and assigning scores (e.g., Milanovic, Saville, & Shen, 1996; Vaughan, 1991). CHAT could bring insight into how and why raters adopt different styles to accomplish the same rating task.
References
Barkaoui, K. (2010). Do ESL essay raters' evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1), 37–57.
Broad, B. (2003). What we really value: Beyond rubrics in teaching and assessing writing. Logan, UT: Utah State University Press.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: Massachusetts Institute of Technology Press.
Cole, M., & Engeström, Y. (1993). A cultural-historical approach to distributed cognition. In G. Salomon (Ed.), Distributed cognitions: Psychological and educational considerations (pp. 1–46). Cambridge, UK: Cambridge University Press.
Crisp, V. (2012). An investigation of rater cognition in the assessment of projects. Educational Measurement: Issues and Practice, 31(3), 10–20.
Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86(1), 67–96.
Engeström, Y. (1999). Activity theory and individual and social transformation. In Y. Engeström, R. Miettinen, & R.-L. Punamaki (Eds.), Perspectives on activity theory (pp. 19–38). New York, NY: Cambridge University Press.
Engeström, Y. (2001). Expansive learning at work: Toward an activity theoretical reconceptualization. Journal of Education and Work, 14(1), 133–156.
Goffman, E. (1967). On face-work: An analysis of ritual elements in social interaction. In Interaction ritual: Essays in face-to-face behavior. London, UK: Penguin Books.
Lantolf, J. P., & Thorne, S. L. (2006). Sociocultural theory and the genesis of second language development. New York, NY: Oxford University Press.
Li, H., & He, L. (2015). A comparison of EFL raters' essay-rating processes across two types of rating scales. Language Assessment Quarterly, 12(2), 178–212.
Lumley, T. (2005). Assessing second language writing: The rater's perspective. New York, NY: Peter Lang.
Mei, Y., & Cheng, L. (2014). Scoring fairness in large-scale high-stakes English language testing: An examination of the National Matriculation English Test. In D. Codium (Ed.), English language education and assessment: Recent developments in Hong Kong and Chinese Mainland (pp. 171–187). Singapore: Springer Science+Business Media LLC. http://doi.org/10.1007/978-981-287-071-1_11
Merriam, S. B. (2009). Qualitative research: A guide to design and implementation. San Francisco, CA: Jossey-Bass.
Milanovic, M., Saville, N., & Shen, S. (1996). A study of the decision-making behaviour of composition markers. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (LTRC) (pp. 92–114). Cambridge, MA: Cambridge University Press.
Qi, X. (2011). Face: A Chinese concept in a global sociology. Journal of Sociology, 47(3), 279–295.
Shi, L. (2001). Native- and nonnative-speaking EFL teachers' evaluation of Chinese students' English writing. Language Testing, 18(3), 303–325.
Strauss, A., & Corbin, J. (1990). Basics of qualitative research: Grounded theory procedures and techniques. Newbury Park, CA: Sage Publications.
Vaughan, C. (1991). Holistic assessment: What goes on in the raters' minds. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111–126). Norwood, NJ: Ablex Publishing Corporation.
Wolfe, E. W., Kao, C., & Ranney, M. (1998). Cognitive differences in proficient and nonproficient essay scorers. Written Communication, 15(4), 465–492.
Yamagata-Lynch, L. C. (2010). Activity systems analysis methods: Understanding complex learning environments. Boston, MA: Springer.
Yin, R. K. (2009). Case study research: Design and methods (4th ed.). Thousand Oaks, CA: Sage Publications.
PART 2
Test Development and Validation
5 ASSESSING CLINICAL COMMUNICATION ON THE OCCUPATIONAL ENGLISH TEST®
Brigita Séguis and Sarah McElwee
Issues that Motivated the Research
The role of a healthcare professional is complex and multi-faceted. A high degree of expertise and specific clinical knowledge is expected, as well as resilience, conscientiousness, and the highest ethical standards. Increasingly, healthcare approaches are shifting from systems where the professional is considered the central authoritative figure to more patient-centered approaches that emphasize the importance of relationship-building, inclusion, and appropriate communication of diagnoses, treatment, and outcomes. For example, in the UK, the General Medical Council (GMC), the main regulatory body for the medical profession, makes specific reference to the expectation that graduate doctors will be able to communicate effectively and appropriately with both patients and colleagues in medical contexts by listening, sharing, and responding, including in difficult or sensitive situations (GMC, 2015). Increasing global mobility and internationalization affect healthcare systems and the diversity of professionals within them. According to the latest National Health Service (NHS) England figures, as of September 2017, 12.5% of all staff (doctors, nurses, and clinical support) held a non-British nationality, with 200 different nationalities reported. Doctors made up the biggest proportion of overseas health professionals, with 36% qualified outside the UK. Ensuring the linguistic competence of international workforce members so they can function safely and effectively is crucial, but there is also an increasing need to address broader communicative elements of the healthcare role which may be culturally specific. This chapter outlines the introduction of clinical communication skills as an extension of the construct underpinning the speaking subtest of the Occupational English Test® (OET), a test of English for specific purposes in the healthcare domain.
The rationale and process for the revision of the speaking test are described, in particular the standard-setting exercise for the new additional assessment criteria. We then consider the effects that these changes have on the consequential validity of the test—the impact on candidates, stakeholders, and society; the opportunity to create positive washback that translates into professionals' own practice; and the implications for wider policy.
The Occupational English Test
The OET is a specific-purpose language test designed to assess the linguistic proficiency of healthcare professionals who seek to register and practice in an English-speaking environment. It is owned by Cambridge Boxhill Language Assessment Trust (CBLA), a joint venture between Cambridge Assessment English and Box Hill Institute. As of 2018, the test is recognized by regulatory healthcare authorities in the UK, Ireland, Australia, New Zealand, Singapore, Dubai, and Namibia. It currently serves 12 major professions: general medicine, nursing, physiotherapy, occupational therapy, dentistry, pharmacy, dietetics, optometry, speech pathology, podiatry, radiography, and veterinary science. It consists of four components: listening, reading, writing, and speaking. The subtests for receptive skills (listening and reading) are the same for all professions, whereas productive skills (speaking and writing) are assessed according to occupation. Following feedback from senior doctors and nurses that some candidates scored well on the speaking subtest yet did not always display the expected level of communicative and pragmatic competence in the actual workplace (Vidakovic & Khalifa, 2013), an extensive research project was carried out. Its main aim was to align the criteria used to assess candidates' performance on the speaking component of the test with health professionals' views of what they perceive as the crucial components of effective communication in the workplace. It appeared that the primarily linguistic orientation of the speaking criteria used in the OET was not tapping some of the communicative aspects valued by professionals in real-life situations. This apparent gap prompted a thorough analysis of the target language use situations to develop what has been referred to as more "indigenous" assessment criteria (Jacoby & McNamara, 1999, p. 214) for the OET speaking subtest. The notion of indigenous assessment criteria was described by Jacoby (1998) in her doctoral thesis, which focused on a group of physicists giving feedback to one another while rehearsing conference presentations. Some members of the group spoke English as their second language and had some limitations in their language ability. However, when giving feedback, the attention of the group was mainly directed towards other issues which were perceived as more relevant in the given context (Pill, 2016). Jacoby concluded that the group members relied on an implicit inventory of assessment criteria shared within the group, which she referred to as "indigenous" assessment criteria (Jacoby, 1998, p. 311, cited in Pill, 2016).
This chapter focuses on the process of developing the indigenous assessment criteria for clinical communication on the OET speaking subtest and establishing defensible cut-scores. It also addresses the implications for wider policy and practice.
Context of the Research
There is consensus in the language testing literature that tests which assess language for specific purposes can be distinguished from more general language tests in two ways, namely, task authenticity and the interaction between language knowledge and specific-purpose content knowledge (Douglas, 2000). The OET is often cited as a good example of specific-purpose language testing, where the situational and interactional authenticity of the input is potentially high, which is especially evident in the evaluation of productive skills. For example, the speaking subtest, which is always profession-specific, consists of two role-plays between the candidate and an interlocutor. The candidates assume the role of a health professional while the interlocutor acts as a patient. During the speaking performance, the candidates advise the interlocutor on the most suitable treatment, discuss medication options, and give further details regarding the patient's condition. Such role-play tasks give them an opportunity to engage both language and background knowledge, as well as to perform interactional tasks similar to the ones that they would normally carry out in their day-to-day work, such as reassurance, persuasion, or explanation. Given the close relationship between the tasks that the candidates perform on the test and in real-life workplace situations, the potential for eliciting a rich and authentic performance in terms of specific-purpose language ability is very high. An example of the general medicine candidate's role-play card is included in Figure 5.1. The candidate's role-play card contains explicit contextual information about the setting, participants, content, and language functions. However, some of the information is implicit and is derived from the candidate's specific-purpose background knowledge. While some information about shingles, possible medication, and non-pharmacological approaches is included in the prompt, the candidate is also encouraged to rely on additional background knowledge when explaining the cause of the patient's symptoms and answering questions about excision of the affected area. Moreover, there is also an implicit requirement for the candidate to demonstrate empathy while addressing the patient's concerns and reassuring them that surgery is not required, which again may require drawing on their profession-specific background knowledge as well as their interactional competence.
FIGURE 5.1 Sample OET Candidate Role-play Card—General Medicine (B. Zhang (Cambridge Boxhill Language Assessment), personal communication, May 15, 2018). [Card header: No. 3, Medicine, Doctor.]
The original OET speaking scale was adapted from the one used by the US Government Foreign Service Institute to assess the language skills of foreign service officers (Douglas, 2000; McNamara, 1996). Thus, while OET candidates had the opportunity to demonstrate some of their profession-specific background knowledge, the actual speaking assessment criteria—Overall Communicative Effectiveness, Intelligibility, Fluency, Appropriateness of Language, and Resources of Grammar and Expression—did not reflect many of the characteristics of specific-purpose language ability or the aspects of communication typically encountered in the target language use domain. As a result of the primarily linguistic orientation of the OET assessment criteria, there were some observations that OET test-takers struggled with communication skills in the workplace despite getting relatively high scores for their speaking performance (McNamara, 1996; Wette, 2011). There appeared to be a mismatch between the criteria applied by healthcare professionals to evaluate the communicative abilities of their international colleagues and what was being assessed by the test itself. In fact, McNamara (1996) concluded that "whatever the doctors were complaining about was not being captured by the OET" (p. 241). This realization shifted the test developers' attention specifically to the assessment criteria, particularly the lack of indigenous assessment criteria.
In Search of Indigenous Assessment Criteria
The search for indigenous assessment criteria began when the Language Testing Research Center (LTRC) at the University of Melbourne was commissioned to carry out a research project focusing on communication in the clinical workplace environment (Elder, McNamara, Woodward-Kron, Manias, McColl, Webb,
& Pill, 2013). This project, initiated in 2009, constituted the first stage of the OET speaking test revision. The main aim of the project was "to investigate the indigenous assessment criteria of health professionals in the context of their interaction with patients" (Pill, 2016, p. 178). General medicine, nursing, and physiotherapy were chosen as the focus, as these professions comprise the largest OET candidature. The research team sought to capture the most important aspects of interaction between the health professional and patient and decide which would be most amenable to assessment on a language test. Drawing on these findings, the LTRC researchers considered how the OET speaking assessment criteria, as well as the underlying speaking test construct, could be revised and possibly extended to achieve closer alignment with aspects of communication valued by healthcare professionals in their day-to-day workplace interactions. Not only would this construct expansion increase the predictive validity of the test, but it would also create immediate positive washback: by preparing for the test, the candidates would also learn about clinical communication, the doctor-patient relationship, and the patient-centered model of care prevalent in the majority of Western countries where they were likely to work.
Language Ability, Communication, and Interaction
The data for this project were derived from thematic analysis of the commentary given by experienced clinical educators on the performance of novices in their professions, which allowed the researchers to extract a wide range of aspects that matter to healthcare professionals and group them into broader themes. Given that the OET is a language test, aspects directly related to professional issues (i.e., clinical and practitioner skills) were automatically excluded, thus limiting the remaining themes to Interaction tools and Communication skills (Pill, 2016). Examples of indicators pertaining to the theme of Interaction tools that were deemed appropriate to be assessed on the OET are signposting changes of topic and signposting before sensitive personal questions, structuring the interaction clearly and purposefully using appropriate information-eliciting techniques, asking open-ended questions, finding out what the patient already knows, and checking whether the patient has understood. Some of the indicators classified under the theme of Communication skills included supporting the patient's narrative with active listening, not interrupting the patient, demonstrating a positive and respectful attitude, and interacting in an approachable manner. The indicators pertaining to the two themes were further classified into four groups of similar behaviors to create a checklist consisting of 24 indicators. The following four groups were identified: Indicators of Professional Manner (seven descriptions), Indicators of Patient Awareness (five descriptions), Indicators of Information-gathering (six descriptions), and Indicators of Information-giving (six descriptions).
FIGURE 5.2 Relationship Among Language Ability, Interactional Competence, and Clinical Communication on the OET Speaking Test. [Diagram: three overlapping areas labeled language ability, clinical communication, and interactional competence.]
The checklist of indicators allowed the researchers to establish authentic values singled out by the healthcare professionals and draw on these values to develop assessment criteria for use on the OET speaking subtest (Elder et al., 2013). One immediate advantage of using indicators was the potential to assess the interaction between the health professional candidate and the simulated patient, rather than solely the candidate's linguistic performance. Doing so expanded the construct of the OET speaking subtest to incorporate linguistic performance, interactional competence, and clinical communication, as demonstrated in Figure 5.2. As Figure 5.2 shows, ideally, in order to make a valid prediction about a candidate's readiness to function in the real-life medical workplace, the speaking subtest should assess the overlapping area in the middle of the diagram where the three main aspects come together. The second advantage of having more detailed indicators was that the existing holistic criterion of Overall Communicative Effectiveness could be replaced with more explicit and transparent means of assessing how candidates apply such interactional strategies as paraphrasing and signposting in facilitating their consultations with patients. Similarly, the indicators subsumed under the theme of Communication skills offer a clearly defined means of assessing the extent to which the candidate displays professional manner and awareness of the patient's needs, both of which help the candidates maintain a patient-centered approach throughout the consultation.
From Themes to Clinical Communication Criteria
During the second stage of the speaking subtest revision process, the checklist of indicators was reviewed by researchers at Cambridge Assessment English, in consultation with Dr. Jonathan Silverman, an expert in the field of clinical communication and co-author of a core text on this subject (Silverman, Kurtz, &
Draper, 2005). Dr. Silverman proposed further amendments in order to achieve a closer alignment with the clinical communication framework used to assess medical graduates, as outlined in the Calgary-Cambridge Guides to the Medical Interview (Kurtz, Silverman, & Draper, 2005; Silverman et al., 2005). The revised checklist comprises 20 indicators presented in four groups, namely Indicators of Relationship Building, Indicators of Understanding and Incorporating the Patient's Perspective, Indicators of Providing Structure, and Indicators of Information Gathering-Information Giving. These groups were converted into four clinical communication criteria, with Information Gathering and Information Giving assessed together as one criterion. Following the speaking assessor trial, carried out after the standard-setting study described in this chapter, the decision was made to split the criterion of Information Gathering-Information Giving into two separate criteria in order to facilitate the marking process. The final list of clinical communication criteria, as well as the corresponding indicators, is included in the Appendix. One of the more significant changes at the second stage was the inclusion of empathy as an aspect of relationship building. Empathy has long been recognized as fundamental for ensuring quality in medical care, enabling health professionals to perform key medical tasks more accurately and thus contributing to enhanced health outcomes overall (Neumann, Bensing, Mercer, Ernstmann, Ommen, & Pfaff, 2009). While there has been some debate about the extent to which empathy can be taught, according to Silverman (2016), many of the core skills of clinical communication, such as attentive listening, facilitating the patient's narrative, and picking up cues, demonstrate to patients a genuine interest in hearing about their thoughts. By employing these techniques, health professionals can create an environment that facilitates disclosure and enables the first step of empathy to take place, namely understanding and acknowledging the patient's predicament, feelings, and emotional state.
Setting Standards on Clinical Communication Criteria
During the third and final stage, which is described in this chapter, a standard-setting exercise was conducted with a panel of healthcare practitioners and OET examiners to inform the setting of cut-scores on the new test. Kane (2001) defines standard-setting as the decision-making process of classifying candidates into a number of levels or categories (as cited in Papageorgiou, 2010a). The "boundary between adjacent performance categories" (Kane, Crooks, & Cohen, 1999, p. 344, as cited in Papageorgiou, 2010a) is called a cut-score. It can be described as the "point on a test's score scale used to determine whether a particular score is sufficient for some purpose" (Zieky, Perie, & Livingston, 2008, p. 1, as cited in Papageorgiou, 2010a). During a standard-setting exercise, a panel of expert judges goes through the process of determining cut-scores under the guidance of one or more facilitators.
The initial stage allows the panelists to familiarize themselves with the test and the assessment criteria, with an opportunity to practice and understand the procedure that will be used for making judgments. Once the familiarization process is complete, the panelists decide on the cut-scores. More than one round of judgments is typically organized so that the panelists can discuss their decisions and, if necessary, make adjustments. Ideally, the standard-setting exercise should be evaluated in terms of procedural validity, internal validity, and external validity. The aim of procedural validity is to ensure that the procedures followed were practical and properly implemented, that feedback given to the judges was effective, and that relevant documentation has been carried out. Internal validity is concerned with the accuracy and consistency of results, while external validity refers to the process of collecting evidence from independent sources which supports the outcome of the standard-setting exercise (Papageorgiou, 2010b).
The Bookmark Method
While several methods can be used to perform a standard-setting exercise, the bookmark method was chosen for the OET speaking subtest (Mitzel, Lewis, Patz, & Green, 2001). One of the characteristics of the bookmark method is that all of the items in a test are arranged in order of difficulty. The panelists' task is to place a bookmark, that is, to select a point in the progression of items where a minimally competent test-taker is likely to answer correctly (Zieky, 2013). The advantages of the bookmark method are that panelists can mark multiple cut-scores (Morgan & Michaelides, 2005) and that the method is relatively easy to understand and apply.
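To make this judgment logic concrete, the following minimal Python sketch shows one way the variant used in this study could be represented: benchmark performances are ordered from weakest to strongest, each panelist marks the first performance they judge minimally acceptable, and a recommended cut-score is read off from the scores at those bookmark positions. The function name and the intermediate score values are hypothetical illustrations (only the 14 performances, the endpoints of 3 and 29, and the resulting spread of marked scores echo figures reported later in the chapter); this is a sketch, not the operational OET procedure.

```python
from statistics import mean, median

def bookmark_cut_score(ordered_scores, bookmark_positions):
    """Derive a recommended cut-score from bookmark judgments.

    ordered_scores: total scores of the benchmark performances, arranged
        from weakest to strongest.
    bookmark_positions: for each panelist, the index of the first
        performance judged minimally acceptable.
    Returns the median and mean of the scores at the marked positions.
    """
    marked = [ordered_scores[i] for i in bookmark_positions]
    return median(marked), mean(marked)

# Hypothetical illustration: 14 performances scored out of 30, ordered
# worst to best, and 13 panelists' bookmark positions (indices).
scores = [3, 5, 8, 10, 12, 14, 16, 17, 18, 19, 21, 24, 27, 29]
bookmarks = [8, 8, 9, 8, 9, 8, 8, 8, 8, 7, 7, 7, 8]

print(bookmark_cut_score(scores, bookmarks))  # -> (18, ~17.9)
```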
Research Questions Addressed
The aim of this chapter is twofold. The first aim is to describe the standard-setting exercise that was performed in order to arrive at the cut-scores for the clinical communication criteria. The second is to analyze the implications of introducing clinical communication criteria as part of the OET speaking subtest with regard to stakeholders and wider policy. The following research questions will be addressed:
1. What are the appropriate cut-scores for the clinical communication criteria on the OET speaking subtest?
2. How confident were the panelists in their ratings and in the final cut-scores that they arrived at?
3. What are the implications of introducing clinical communication criteria as part of the OET test for language testing practice and the stakeholders, as well as for wider healthcare policy?
Research Methods
Part of the validation of the standard-setting exercise is making sure that a rigorous and principled approach is followed throughout, in order to minimize arbitrariness and randomness of results. This section of the chapter provides an overview of the panel and the methods used for collecting data during the standard-setting exercise.
Data Collection Procedures
The panel assembled for the standard-setting exercise comprised five senior OET speaking assessors and eight healthcare domain experts. The healthcare practitioners were from a variety of health professions and had experience supervising international medical graduates. The panelists included nurses, occupational therapists, pharmacists, physical therapists, speech pathologists, and healthcare managers. One of the advantages of having domain experts involved in the standard-setting exercise is that their judgments are likely to be based on the real-life clinical context. The views of domain experts can therefore help to ensure that the test is designed appropriately and remains fit for purpose by offering an adequate assessment of the candidate's readiness to cope with spoken communicative tasks in the actual clinical setting. On the other hand, given that the format of the test is a role-play between the healthcare professional and the simulated patient, and that the marks will be assigned by language experts, ensuring the commensurability of the views of the two groups is of paramount importance. The standard-setting study started with the familiarization stage, during which the panelists were told what clinical communication is and what the clinical communication criteria are. They were informed about the rationale for introducing those criteria as part of the speaking subtest. The panelists were then talked through each of the indicators in greater detail and were given a glossary that contained explanations and illustrations of each indicator. Next, the panel was split into smaller groups to discuss the qualities and markers that characterized the minimally competent person, which was followed by a plenary discussion. At the end, the panelists indicated their confidence at having arrived at a good common understanding of the minimally competent person. Once the familiarization stage was complete, the panelists moved on to the first judgment stage. Following the bookmark method, during the judgment stage panelists listened to 14 audio-recorded test performances that had been selected across a range of proficiency levels and were arranged from worst to best. Prior to the standard-setting exercise, these performances were marked on the clinical communication criteria by four experienced raters. At the time of the study, a final way of reporting and weighting the clinical communication criteria had not yet been decided, so the results reported were based on multiple markings with a maximum
possible score of 30 (i.e., the raters' scores were added to arrive at the final total score for each performance). The top performance was assigned a score of 29, whilst the bottom performance received only three points. The recordings were played, and the panelists selected the first performance that they deemed to be minimally acceptable. Following the first round of judgments, the panelists were shown statistics related to their judgments. As will be seen in the Results section, there was such a high level of consensus in their first round of judgments that the standard-setting facilitator decided to forgo a second round of judgments. Following the exercise, the panelists also completed evaluation forms, which are part of the evidence for the procedural validity of the standard-setting. They were asked to rate their level of confidence in the judgments that they made, and whether the cut-scores derived from their judgments would provide appropriate recommendations for the speaking subtest. Additionally, they were asked to indicate whether they felt they had been given enough time to make their judgments. Both the proposed cut-scores and the evaluations will be analyzed in greater detail in the next section.
Data Analysis Procedures
This section focuses on the analysis of the cut-scores obtained during the judgment task. Table 5.1 gives an overview of the cut-scores recommended by each panelist, on a scale from 0 to 30. Summary statistics for the panelists' standard-setting judgments were computed, including mean and median values, as well as the standard deviation and standard error of judgment. The inclusion of both mean and median values allows the presentation of a full spectrum of the data since, unlike the mean, the median is not affected by extreme values. The standard deviation (SD) indicates the amount of variability in the panelists' cut-score recommendations. The smaller the standard deviation, the greater the consistency in the individual recommendations (Tannenbaum & Katz, 2008). As the data in Table 5.1 demonstrate, the level of consensus among the panelists was very high, with eight judges selecting performances with a score of 18, three judges selecting performances with a score of 17, and two judges selecting performances with a score of 19, out of a maximum possible score of 30. The high level of agreement becomes even more evident when the mean cut-score and standard deviation are inspected.
TABLE 5.1 Overall Cut-score Judgments for OET Speaking Subtest Standard-setting

Panelist no.   Round 1 Scores
1              18
2              18
3              19
4              18
5              19
6              18
7              18
8              18
9              18
10             17
11             17
12             17
13             18

Mean: 17.9; Median: 18.0; SD: 0.6; Min: 17; Max: 19
All of these figures further confirm that there was little variation between the panelists' recommendations and provide additional validity evidence for the study. After the judgment task, the panelists were asked to evaluate the standard-setting exercise using 4-point Likert-type scales, where 4 is best ("confident"/"enough time") and 1 is worst ("not at all confident"/"more time needed"). Their responses were calculated and analyzed as part of the procedural validity of the standard-setting exercise. When asked how confident they were in their judgments, the panelists expressed a relatively high level of confidence (an average score of 3.5 out of 4). They also indicated a high level of confidence that the cut-scores derived from their judgments would provide appropriate cut-score recommendations for the clinical communication criteria (3.7 out of 4). Finally, the panelists also felt that they were given sufficient time to complete their judgments (3.7 out of 4). These figures can be regarded as additional procedural validity evidence to further support the findings of the standard-setting study. Following the standard-setting exercise, the last remaining task was to convert the panelists' judgments into actual scores. The finalized scale for marking the clinical communication criteria, which can be found in the Appendix, consists of four options: "ineffective use," "partially effective use," "competent use," and "adept use." Assuming these categories are translated into scales of 0 to 3, the two responses that were scored 18 obtained 0, 1, 1, 1 and 1, 1, 1, 1, respectively. The two responses that were scored 19 both obtained 1, 1, 1, 2. Thus, if marked from 0 to 3 on a total of four clinical communication criteria, the cut-score would appear to be an average of the total score of the four responses, i.e., 4 out of a maximum score of 12.
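As a quick check on the arithmetic behind these figures, the short Python sketch below recomputes the Table 5.1 summary statistics from the thirteen Round 1 judgments and then reproduces the conversion to the 0–3 clinical communication scale described in the preceding paragraph. Two assumptions are made: the reported SD is the conventional standard deviation (sample or population—both round to 0.6), and the standard error of judgment is taken as SD divided by the square root of the number of panelists, which is one common definition.

```python
from statistics import mean, median, stdev

# Round 1 cut-score judgments from Table 5.1 (13 panelists, 0-30 scale).
judgments = [18, 18, 19, 18, 19, 18, 18, 18, 18, 17, 17, 17, 18]

print(round(mean(judgments), 1))       # 17.9
print(median(judgments))               # 18
print(round(stdev(judgments), 1))      # 0.6 (sample SD)
print(min(judgments), max(judgments))  # 17 19
print(round(stdev(judgments) / len(judgments) ** 0.5, 2))  # ~0.18 (assumed SE of judgment)

# Converting the borderline performances to the 0-3 clinical communication
# scale, using the per-criterion marks quoted in the text (four criteria
# at the time of the study).
borderline_marks = [
    [0, 1, 1, 1],  # one performance scored 18 overall
    [1, 1, 1, 1],  # the other performance scored 18
    [1, 1, 1, 2],  # one performance scored 19
    [1, 1, 1, 2],  # the other performance scored 19
]
totals = [sum(marks) for marks in borderline_marks]
print(totals, mean(totals))  # [3, 4, 5, 5] 4.25 -> roughly 4 out of a maximum of 12
```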
Findings and Discussion
In addition to the immediate benefit of arriving at a defensible cut-score for the clinical communication criteria, the present study has demonstrated the practicality of consulting both domain experts and language testing assessors to define the standards for candidates' performance. The results of the standard-setting study indicate that the ranking of candidates' clinical communication performances by healthcare professionals is comparable to the ranking assigned by language-trained assessors. As explained earlier in the chapter, the level of agreement between the two groups of panelists was so high that the standard-setting facilitator decided to forgo the second round of judgments. This finding is both reassuring and important, since it suggests that domain experts and language assessors alike were drawn to similar qualities in the candidates' performances and any differences in their judgments were not substantial enough to have any major impact. It is also reassuring to confirm that the views of healthcare professionals, who arguably have a better understanding of clinical communication aspects than do language assessors, are nevertheless shared by the latter group. This agreement
suggests that, given proper training and relevant background information, language assessors have the capacity to assess not only linguistic but also interactional and communicative features of candidate performance that have been shown to be important to healthcare professionals who operate in the clinical settings where these features are actually practiced. Consequently, this finding suggests that adopting the new criteria and involving medical professionals in setting cut-scores can lead to more valid conclusions about test-takers' speaking performances and their readiness to function in actual workplace environments. In particular, the new criteria will lead to greater recognition of strongly competent communicative and interactional ability, rather than merely a strongly competent linguistic performance. While the inclusion of clinical communication criteria can further increase the overall fitness for purpose of the OET, there are several outstanding issues that need to be addressed. Since the clinical communication criteria are quite distinctive, fairness issues may potentially arise which could threaten the validity of the score outcomes. As the cut-score arrived at during the standard-setting study is relatively low, it is possible that test-takers could earn some extra points by simply memorizing key generic responses or questions without having an adequate and competent grasp of their actual usage. Two possible ways of mitigating any fairness issues and negative washback from inappropriate test preparation would be adjusting the weighting of the criteria to ensure that the clinical communication and linguistic criteria are appropriately balanced, or raising the cut-score. Thus, the four linguistic criteria (Intelligibility, Fluency, Appropriateness of Language, and Resources of Grammar and Expression) will be assessed on a scale from 0 to 6, while the newly introduced clinical communication criteria will be assessed on a scale from 0 to 3. However, once the new criteria are examined, the typical score profiles on these criteria will be monitored and, if necessary, the weighting will be further modified.
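Because the chapter does not specify how the criterion scores will be aggregated, the brief sketch below simply assumes they are summed, purely to show how the stated scale ranges (0–6 for the linguistic criteria, 0–3 for the clinical communication criteria) would distribute weight between the two families. The number of clinical communication criteria is left as a parameter, since four were used at the time of the standard setting and five appear in the final list; the function and the summing rule are illustrative assumptions, not the published OET scoring model.

```python
def max_contributions(n_linguistic=4, n_clinical=4,
                      linguistic_max=6, clinical_max=3):
    """Maximum points each criterion family can contribute if criterion
    scores are simply summed (an illustrative assumption only)."""
    ling = n_linguistic * linguistic_max
    clin = n_clinical * clinical_max
    return ling, clin, round(ling / (ling + clin), 2)

print(max_contributions())              # (24, 12, 0.67): linguistic criteria carry two-thirds of the points
print(max_contributions(n_clinical=5))  # (24, 15, 0.62) with the final five clinical communication criteria
```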
Implications for Policy, Practice, and Future Research
Clinical communication and the importance of patient-centeredness are now taught extensively as part of the medical school curriculum in a number of English-speaking countries, including the UK. The emergence and rise of clinical communication can be attributed to a variety of drivers, ranging from such political and sociological influences as the rise of neo-liberalism, globalization, and cultural shifts, to the prominent attention given to the importance of clinical communication in such influential policy documents as Tomorrow's Doctors, produced by the UK's GMC. (See Brown, 2008, for an in-depth overview.) Bearing these socio-political and cultural shifts in mind, it is important to consider what implications they will have for OET test-takers' readiness to work in the medical environment. Existing studies (McCullagh, 2011; Verma, Griffin, Dacre, & Elder, 2016) suggest that international medical graduates trained
overseas perform less well in clinical skills assessments compared to their UK-trained counterparts. A great majority of OET test-takers come from countries where doctor-centered, as opposed to patient-centered, models of clinical decision-making are perceived as the norm. As a result, a question arises as to what can be done to support them in their transition to a more patient-centered model of care. The extent to which this change can be accommodated within the realms of the OET, which, first and foremost, is a language test, must also be considered. A key reason for introducing clinical communication criteria on the speaking subtest was an attempt to bridge an identified gap between candidates' linguistic performances on the OET and their performances in real-world communication in the workplace. Being assessed on clinical communication criteria will certainly influence how candidates prepare or are prepared for the test, thus potentially creating immediate positive washback. It is hoped that any preparation done for taking the test will support and complement candidates' preparation for further exams they will need to take in order to obtain their licenses to practice. In the UK, for example, the majority of international medical graduates will be required to take the Professional and Linguistic Assessment Board test, which assesses not only the candidates' clinical knowledge, but also their ability to apply knowledge to the care of patients during a mock consultation. Making the new OET speaking criteria available to test users will familiarize them with how clinical communication is performed within the patient-centered approach, and will encourage development of skills that will benefit them beyond the immediate context of the OET. This additional emphasis on the long-term educational benefits of test preparation has the potential to contribute to OET's positive impact on stakeholders and society in the long term.

Acknowledgement: The authors would like to thank Dr. Gad Lim, who conducted the OET standard-setting study in Melbourne, Australia, in November 2017.
References
Brown, J. (2008). How clinical communication has become a core part of medical education in the UK. Medical Education, 42(3), 271–278.
Douglas, D. (2000). Assessing languages for specific purposes. Cambridge, UK: Cambridge University Press.
Elder, C. A., McNamara, T. F., Woodward-Kron, R., Manias, E., McColl, G. J., Webb, G. R., & Pill, J. (2013). Towards improved healthcare communication: Development and validation of language proficiency standards for non-native English speaking health professionals (Final report for the Occupational English Test Center). Melbourne, Australia: The University of Melbourne, Language Testing Research Center.
General Medical Council (GMC). (2015). Tomorrow's doctors. Retrieved from www.gmc-uk.org/media/documents/Outcomes_for_graduates_Jul_15_1216.pdf_61408029.pdf
Jacoby, S. W. (1998). Science as performance: Socializing scientific discourse through the conference talk rehearsal (Unpublished doctoral dissertation). University of California, Los Angeles.
Jacoby, S., & McNamara, T. (1999). Locating competence. English for Specific Purposes, 18, 213–241.
Kane, M. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 53–88). Mahwah, NJ: Lawrence Erlbaum Associates.
Kane, M., Crooks, T., & Cohen, A. S. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.
Kurtz, S., Silverman, J., & Draper, J. (2005). Teaching and learning communication skills in medicine. Abingdon, UK: Radcliffe Publishing Limited.
McCullagh, M. (2011). Addressing the language and communication needs of IMGs in a UK context: Materials development for the Doctor–Patient interview. In B. J. Hoekje & S. M. Tipton (Eds.), English language and the medical profession: Instructing and assessing the communication skills of international physicians (pp. 221–228). Leiden, The Netherlands: Brill.
McNamara, T. F. (1996). Measuring second language performance. London, UK: Addison Wesley Longman.
Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark Procedure: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 249–281). Mahwah, NJ: Lawrence Erlbaum Associates.
Morgan, D. L., & Michaelides, M. P. (2005). Setting cut scores for college placement (Research Report No. 2005–9). New York, NY: The College Board.
Neumann, M., Bensing, J., Mercer, S., Ernstmann, N., Ommen, O., & Pfaff, H. (2009). Analyzing the 'nature' and 'specific effectiveness' of clinical empathy: A theoretical overview and contribution towards a theory-based research agenda. Patient Education and Counseling, 74, 339–346.
Papageorgiou, S. (2010a). Investigating the decision-making process of standard setting participants. Language Testing, 27(2), 261–282.
Papageorgiou, S. (2010b). Setting cut scores on the Common European Framework of Reference for the Michigan English Test (Technical Report). Ann Arbor, MI: University of Michigan.
Pill, J. (2016). Drawing on indigenous criteria for more authentic assessment in a specific-purpose language test: Health professionals interacting with patients. Language Testing, 33(2), 175–193.
Silverman, J. (2016). Relationship building. In J. Brown, L. Noble, J. Kidd, & A. Papageorgiou (Eds.), Clinical communication in medicine (pp. 72–75). Oxford, UK: John Wiley & Sons.
Silverman, J., Kurtz, S., & Draper, J. (2005). Skills for communicating with patients (3rd ed.). Boca Raton, FL: CRC Press.
Tannenbaum, R. J., & Katz, I. R. (2008). Setting standards on the Core and Advanced iSkills™ Assessments (ETS Research Memorandum No. RM-08-04). Princeton, NJ: Educational Testing Service.
Verma, A., Griffin, A., Dacre, J., & Elder, A. (2016). Exploring cultural and linguistic influences on clinical communication skills: A qualitative study of international medical graduates. BMC Medical Education, 16(1), 162. doi: https://doi.org/10.1186/s12909-016-0680-7
Vidakovic, I., & Khalifa, H. (2013). Stakeholders' perceptions of Occupational English Test (OET): An exploratory study. Research Notes, 54, 29–32.
Wette, R. (2011). English proficiency tests and communication skills training for overseas-qualified health professionals in Australia and New Zealand. Language Assessment Quarterly, 8(2), 200–210.
Zieky, M. J. (2013). So much has changed: How the setting of cutscores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 33–66). Mahwah, NJ: Lawrence Erlbaum Associates.
Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Appendix: Clinical Communication Criteria
In the role-play, there is evidence of the test taker:

A. Indicators of Relationship Building
A1 initiating the interaction appropriately (greeting, introductions, nature of interview)
A2 demonstrating an attentive and respectful attitude
A3 adopting a non-judgmental approach
A4 showing empathy for feelings/predicament/emotional state

B. Indicators of Understanding and Incorporating the Patient's Perspective
B1 eliciting and exploring patient's ideas/concerns/expectations
B2 picking up patient's cues
B3 relating explanations to elicited ideas/concerns/expectations

C. Indicators of Providing Structure
C1 sequencing the interview purposefully and logically
C2 signposting changes in topic
C3 using organising techniques in explanations

D. Indicators for Information Gathering
D1 facilitating patient's narrative with active listening techniques, minimising interruption
D2 using initially open questions, appropriately moving to closed questions
D3 NOT using compound questions/leading questions
D4 clarifying statements which are vague or need amplification
D5 summarising information to encourage correction/invite further information

E. Indicators for Information Giving
E1 establishing initially what patient already knows
E2 pausing periodically when giving information, using response to guide next steps
E3 encouraging patient to contribute reactions/feelings
E4 checking whether patient has understood information
E5 discovering what further information patient needs

Each group of indicators (A–E) is rated on a four-point scale: 0 – Ineffective use; 1 – Partially effective use; 2 – Competent use; 3 – Adept use.
6 UPDATING THE DOMAIN ANALYSIS OF THE WRITING SUBTEST OF A LARGE-SCALE STANDARDIZED TEST FOR K-12 ENGLISH LANGUAGE LEARNERS
Jing Wei, Tanya Bitterman, Ruslana Westerlund, and Jennifer Norton
Issues that Motivated the Research
This chapter reports on a project updating the domain analysis of the writing subtest of a standardized test for K–12 English language learners (ELLs) in the US. Conducting a domain analysis is a standard procedure used in the initial process of test development. In language testing, one part of a domain analysis is to identify what tasks and situations test-takers typically encounter in the target language use (TLU) domain (Bachman & Palmer, 2010). Establishing the correspondence between the test tasks and the tasks in real-world communicative contexts helps provide support for the validity claim that test-takers' performance on a test can be used to make inferences about their ability to perform relevant tasks in real life. Domain analysis is important both when a test is being initially developed and when a test is undergoing revisions. In this chapter, we describe the process of updating the domain analysis when refining the writing subtest, with the goal of supporting the argument for the validity of the test. ACCESS for ELLs 2.0® is a large-scale assessment of English language proficiency that is administered annually to approximately two million Kindergarten to twelfth-grade ELLs in 39 WIDA Consortium member states. Operational since 2005, it assesses developing English language proficiency as expressed in four domains: listening, reading, speaking, and writing. The writing subtest (referred to as the ACCESS Writing Subtest hereafter) is designed to monitor ELLs' progress in developing English language proficiency as expressed in the domain of writing (WIDA Consortium, 2014). With the relatively recent implementation of college- and career-ready standards, including the Common Core State Standards (CCSS, 2010), the language demands placed on students have increased and classroom
instructional practices have changed (Frantz, Bailey, Starr, & Perea, 2014). To that end, it became essential to re-examine tasks that are typically required in classroom settings and to ensure that new tasks developed for the ACCESS Writing Subtest align with current standards and practices. This alignment would help ensure that students' writing responses elicited by the ACCESS Writing Subtest can be used to make valid inferences about students' ability to meet the language demands of tasks typically encountered in the TLU domain. Consideration is currently being given to refining the ACCESS Writing Subtest design and to developing new task specifications to accompany the refined test design. The first phase of revisions focuses on writing tasks targeting the language used in English Language Arts (ELA) and Social Studies (SS) classes. Before developing the new task specifications, an updated domain analysis must be carried out to better understand tasks used in the TLU domain.
Context of the Research

Test Validation Frameworks
Domain analysis serves as a foundational step in validation frameworks that are commonly used to guide test development work. The ACCESS Writing Subtest was developed using the Center for Applied Linguistics (CAL) test validation framework (Kenyon, 2014), which is a synthesis of the evidence-centered design (ECD) (Mislevy, Steinberg, & Almond, 2002) and assessment use argument (AUA) approaches (Bachman & Palmer, 2010). In ECD, test design is conceptualized as a four-stage process: domain analysis, domain modeling, construction of a conceptual framework, and deployment of an operational assessment (Hines, 2010; Mislevy et al., 2002). Domain analysis is the stage in which test designers investigate the construct that is being assessed, that is, what knowledge, skills, and abilities are important for successfully dealing with task demands in contexts outside of the testing situation. In the AUA approach, Bachman and Palmer contend that one should start the test development process by thinking about what beneficial consequences will be brought about by a test and what decisions will be made based on the test results. Then, one needs to carefully design the test to ensure that it leads to the intended outcomes. By synthesizing the ECD and AUA approaches into a single framework, the CAL validation framework not only highlights the importance of taking test consequences and decisions into consideration in the initial design stage, but also pinpoints the concrete steps that need to be taken to attain the desirable test consequences (Kelly, Renn, & Norton, 2018; Kenyon, 2014). Grounded in the CAL validation framework, we started the ACCESS Writing Subtest revision process by considering the types of decisions that are made based on the test results (i.e., serving as one of the measures to determine whether an ELL is ready to exit English language support services) and by rigorously following the recommended process of domain analysis, domain modeling, and conceptual framework construction in test development.
Domain Analysis

As an essential stage in the test validation framework, domain analysis has been incorporated into the development processes of a number of large-scale standardized tests. Various methods have been used to conduct domain analyses for language tests, including reviewing standards, reviewing curricula, surveying and interviewing stakeholders, and conducting corpus analyses of the characteristics of language used in the TLU domain. For example, the domain analysis of the TOEFL Primary® and TOEFL Junior® tests was conducted by reviewing English language standards, curricula, and textbooks from both the US and other countries and by reviewing literature on language used in academic contexts (Cho, Ginsburgh, Morgan, Moulder, Xi, & Hauck, 2016; So, Wolf, Hauck, Mollaun, Rybinski, Tumposky, & Wang, 2015). The domain analysis of the TOEFL iBT® was conducted by gathering data from an even wider variety of sources. First, literature on the English skills needed in English-medium higher education institutions was reviewed to identify what knowledge, abilities, and skills are needed to succeed in that context and what tasks and materials are typically encountered in that specific setting (Jamieson, Eignor, Grabe, & Kunnan, 2008; Taylor & Angelis, 2008). Undergraduate and graduate students and faculty members in higher education institutions were also surveyed on the importance and relevance of each of the skills and tasks that had been identified as essential in the literature (Rosenfeld, Leung, & Oltman, 2001). Finally, corpus analyses were conducted to examine the authenticity of the test stimulus material (Biber, Conrad, Reppen, Byrd, Helt, Clark, Cortes, Csomay, & Urzua, 2004).

This body of literature informed the methods we employed in conducting the domain analysis for the ACCESS Writing Subtest. Similar to the methods employed in the domain analysis studies of the TOEFL Family of Assessments, we collected information about the TLU domain by reviewing the relevant literature and conducting surveys and focus group interviews with educators in the grades 1–12 context. Kindergarten was not the focus of the test revisions and therefore was not included in the domain analysis.
Genre Theory

In this domain analysis study, we used genre as the unit of analysis because the WIDA English language development standards that serve as the foundation of the test are based on a functional view of language (WIDA Consortium, 2014). To identify the genres of writing expected in grades 1–12 ELA and SS, disciplinary expectations for the corresponding disciplines were analyzed along with the genre literature specific to ELA and SS. For ELA, the CCSS for writing were reviewed. For SS, the College, Career, and Civic Life Framework for Social Studies State Standards (C3 Framework) was consulted. The CCSS for writing list narrative, informative/explanatory, and argumentative as the main text types for writing in
ELA. The SS expectations as stated in the C3 Framework require that students use explanations and arguments to communicate their conclusions (National Council for the Social Studies, 2013).

After studying the disciplinary expectations as delineated in the standards, genre literature was consulted to provide more nuanced descriptions of the genre families. Various typologies for organizing genres were reviewed (e.g., Brisk, 2015; Christie & Derewianka, 2008; Coffin, 2006). Some typologies were overly nuanced for the purposes of this study. Therefore, they were condensed and referenced against the CCSS and C3 Framework. The genre family of recounts includes personal, imaginative, and procedural recounts, autobiographies, empathetic autobiographies, biographies, and historical recounts (Brisk, 2015). The informative/explanatory category includes reports and procedures, and the argumentative category refers to exposition (one side of the argument), discussion (both sides of the argument), and critical response (Brisk, 2015).

In SS, the C3 Framework and the work of genre theorists such as Coffin (2006) and Christie and Derewianka (2008) guided the selection of genres. In her work with historical discourse, Coffin (2006) differentiates between factorial and consequential explanations. Factorial explanations are concerned with the factors leading up to an event, whereas consequential explanations foreground the consequences of an event. An antecedent genre to explanations in grades 1–5 is the report, of various kinds (descriptive, comparative, classifying). In SS, historical reports are particularly important because they give information about a particular period of time or site (Derewianka & Jones, 2016). In the genre family of recounts, Derewianka and Jones (2016) identify key genres such as historical recounts (grades 5–12) and biographical recounts (grades 3–12), with the personal recount as the antecedent genre. Historical recounts record, explain, and interpret important events in a society’s past. The purpose of biographical recounts is to retell episodes from another person’s life. Argument is a key genre that starts in grades 3–5 in SS. In grades 1–2, students are only expected to express opinions, rather than to write logically reasoned arguments supported by evidence.

The last step in the process of identifying genres was to condense the number of written genres identified from the literature and group the genres by content area and grade-level cluster. The Appendix shows the list of genres of writing by content area for grades 1 through 12. For each grade level, the genres in which students are expected to write independently are marked as “Ind.” The genres that are not applicable to a grade level are marked as “N/A.” As the Appendix shows, genres that are only applicable to elementary school students include the following: procedures, personal recounts, narratives, and reports. The genres that are mostly applicable to middle and high school students include critical responses, historical recounts, causal explanations, consequential explanations, factorial explanations, systems explanations, and arguments from evidence. The Appendix also summarizes a variety of ways in which lower-grade students are first introduced to a genre. They gain experience with a genre by constructing a text
with their teachers, identifying features of genre through reading, or practicing with oral and written activities or labeled diagrams.
Research Questions

This study was guided by two main research questions:

1. What topics and genres are typical of writing tasks used in grades 1–12 ELA and SS instruction?
2. How are ACCESS Writing Subtest tasks different from and similar to writing tasks used in grades 1–12 ELA and SS?
Research Methods

Data Collection Procedures

To identify tasks that are most relevant to the TLU domain, we conducted a mixed-methods study that included two phases. In Phase I, we administered a survey questionnaire to investigate educators’ perspectives about the genres and topics that are typically used in the focal content areas across grades 1–12. In Phase II, we conducted virtual focus groups to probe deeper into educators’ perceptions of the features of effective classroom writing tasks, of current ACCESS Writing Subtest tasks, and of the similarities and differences between ACCESS Writing Subtest tasks and classroom writing assignments. In the following sections, we describe in detail the process of designing and administering the survey questionnaire and focus group interviews. We then outline our approach for analyzing the survey and focus group interview data.

The first phase of data collection involved a questionnaire. Using the genres of writing identified above, we designed a survey to elicit educators’ perspectives on the importance of different writing genres, as well as additional information about the types of writing tasks in the focal content areas. We divided the survey questionnaire into three sections. In the first section, we asked educators to select the grade level that they were most familiar with. Then we asked if they had had experience teaching ELA and SS at that grade level in the past three years. Their answers to these questions determined the remaining questions they would be given. Only educators who reported experience teaching the specified grade level and content area were presented with questions about genres of writing in that content area.

The second section focused on genre importance ratings. For each genre, we asked educators, “How important is it for students in [grade level] to be competent at producing [genre (definition)]?” Response options consisted of a 4-point Likert-type scale: 1 = Not important at all, 2 = Somewhat important, 3 = Important, 4 = Very important. Educators were also asked to provide an example of a typical task that students in the selected grade level are asked to write for every genre they rated as Important or Very important.
TABLE 6.1 Number of Survey Participants by Grade-Level Cluster and Content Area

Grade-Level Cluster   English Language Arts   Social Studies
Grade 1               34                      22
Grades 2–3            52                      41
Grades 4–5            44                      25
Grades 6–8            54                      22
Grades 9–12           34                      13
Total                 218                     123
The third section asked additional questions about characteristics of classroom writing tasks. These questions included open-ended items about the amount of writing that students are expected to complete in 20 minutes, the percentage of writing tasks students complete independently versus with support, typical writing tasks that students complete independently, and forms of support that are typically provided for students when writing in the classroom. The survey participants were any educators who (1) worked in one of the 39 member-states of the WIDA Consortium, (2) taught students in grades 1–12, and (3) had taught ELA and/or SS in the past three years. A total of 315 educators were recruited for Phase I through a convenience sample. As shown in Table 6.1, across grades 1–12, 218 educators completed the questions about ELA genres and 123 educators completed the questions about SS genres. Because the ACCESS Writing Subtest consists of test forms which combine grade levels into grade-level clusters, survey responses were grouped by grade-level cluster for analysis. The second phase of data collection consisted of focus group interviews. At the end of the questionnaire, educators were asked if they would be willing to participate in a virtual focus group interview. From the pool of interested survey participants, a total of 26 educators were recruited to participate. Prior to participating in the virtual focus groups, participants submitted examples of effective writing tasks they had used with students. A total of seven one-hour virtual focus groups were conducted to elicit information on the types of writing tasks educators typically give to students, and to elicit perspectives on the similarities and differences between writing tasks used by educators and tasks on the ACCESS Writing Subtest. For each focus group, the number of participants varied from three to five. Focus groups were moderated by one researcher, with a second researcher taking detailed notes.
Data Analysis Procedures

Both quantitative and qualitative data were collected and analyzed following different procedures. The quantitative data included educators’ ratings of each of the genres deemed important in the standards and genre theories. The mean and standard deviation of educators’ responses to each genre-related question were calculated to gauge the relative importance of each genre and the extent to which educators agree with one another on each genre’s importance. The mean importance ratings, which ranged from 1.0 (Not important at all) to 4.0 (Very important), were then ranked for presentation, as shown in Table 6.2. In addition, responses to the open-ended questions were reviewed qualitatively and synthesized thematically to provide illustrative examples of typical writing tasks related to specific genres.

TABLE 6.2 Mean Importance Ratings of ELA Writing Genres at Each Grade-Level Cluster

Genre                Grade 1        Grades 2–3     Grades 4–5     Grades 6–8     Grades 9–12
                     Mean   SD      Mean   SD      Mean   SD      Mean   SD      Mean   SD
Personal Recounts    3.35   0.65    3.24   0.79    3.33   0.61    2.98   0.80    3.12   0.82
Procedures           2.97   0.72    3.24   0.79    2.86   0.99    2.65   0.88    2.53   0.85
Narratives           2.88   0.81    3.15   0.83    3.40   0.73    3.22   0.74    3.35   0.74
Arguments            2.38   0.99    2.84   0.86    3.36   0.52    3.76   0.71    3.35   0.48
Critical Responses   2.15   0.89    2.68   0.84    3.15   0.73    3.07   0.81    3.35   0.95

The focus group data from Phase II were also analyzed qualitatively. Interview notes were reviewed and coded, and patterns were extracted from the data. To examine the similarities and differences between the ACCESS Writing Subtest and classroom writing assignments, themes were extracted related to (1) topics and tasks that are typically used in ELA and SS classes, (2) characteristics that make classroom tasks effective, and (3) characteristics of the ACCESS Writing Subtest tasks that are perceived by educators as similar to and different from classroom tasks. The themes in educators’ responses provided a deeper understanding of the features found in effective writing tasks across grades 1–12, as well as of the processes that educators use to teach and assess writing with their students.
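Returning to the quantitative ratings, the short sketch below illustrates how mean importance ratings like those in Table 6.2 could be computed and ranked with standard tooling. It is an illustration only, not the project’s actual analysis script, and the file name and column names (grade_cluster, genre, rating) are hypothetical.

```python
# Illustrative only: summarize 4-point Likert importance ratings by genre and
# grade-level cluster, assuming a hypothetical long-format file with one row
# per educator rating and columns: grade_cluster, genre, rating (1-4).
import pandas as pd

ratings = pd.read_csv("ela_genre_ratings.csv")

# Mean and standard deviation per genre within each grade-level cluster,
# analogous to the values reported in Table 6.2.
summary = (ratings
           .groupby(["grade_cluster", "genre"])["rating"]
           .agg(mean="mean", sd="std")
           .round(2))

# Rank genres within each cluster by mean rating to identify the
# highest-rated genres for presentation.
summary["rank"] = (summary["mean"]
                   .groupby(level="grade_cluster")
                   .rank(ascending=False, method="min"))

print(summary.sort_values(["grade_cluster", "rank"]))
```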
Findings and Discussion

This section discusses the findings of both phases of data collection: the questionnaire and the focus groups.
Phase 1: Survey Findings

The survey findings revealed the ELA and SS writing genres that educators considered most important for students to be competent at producing in grades 1, 2–3, 4–5, 6–8, and 9–12, respectively, as shown in Tables 6.2 and 6.3.
Across grades 1–12, the importance ratings of ELA writing genres indicated a progression from personal recounts, procedures, and narratives in the lower elementary grades to arguments, narratives, and critical responses in the middle and high school grades. In grades 1 and 2–3, the most important writing genre was personal recount, yet in grades 4–5 it was rated third, and it was not among the three most important writing genres for grades 6–8 and 9–12. The grades 4–5 cluster straddled lower elementary and middle and high school in terms of the typical genres of ELA writing: Personal recounts were still important, but arguments and critical responses replaced procedures and narratives in the importance ratings of the top three ELA writing genres.

The importance ratings of the ELA genres for each grade-level cluster generally aligned with the grade-level designations found in various content standards and the literature we reviewed, as summarized in the Appendix, with exceptions in grades 1, 6–8, and 9–12. One difference is that the narrative genre was perceived as being important for a broader range of grade-level clusters than it was in the content standards and the genre literature. For grade 1, the genre analysis indicated that writing in this grade is typically joint construction, meaning that the teacher and students jointly construct writing. In addition, the narrative genre, the third most important genre in the survey for grade 1 ELA, was not found in the genre analysis to be a genre that students wrote at that grade; rather, it was more common that students would identify features of narratives when reading. Educators also rated narrative as the second most important genre for grades 6–8 and the most important for grades 9–12, but the genre analysis included only argument, autobiography, and critical response as key ELA writing genres for these grades.

Ratings of SS writing genres showed a similar pattern across grade-level clusters, with the genres in the lower elementary grade-level clusters focusing more on recounts, whether recounting personal information, procedures, or steps. The most important middle and high school-level writing genres were reported as arguments from evidence, historical recounts, and causal explanations. Similar to the ELA writing genres for grades 4–5, the SS writing genre ratings for grades 4–5 straddled the upper and lower grades. The ratings showed that grades 4–5 students were expected to produce more sophisticated genres, such as argument from evidence, while teachers simultaneously continued to expect grades 4–5 students to produce the genres of report, sequential explanation, and biographical recount. However, it was also noticeable that the mean importance ratings of all SS writing genres in grades 1–5 were below 3.0, meaning that, on average, none of these genres was rated as important by educators. Only grades 6–8 and 9–12 had SS writing genres with a mean importance rating higher than 3.0, suggesting that writing genres specific to SS were not considered as important in the lower grades as in the higher grades. Also, some SS genres for grades 1 and 2–3 had standard deviations higher than 1, suggesting that educators of those grades were more likely to disagree with one another about the importance of those SS genres.
TABLE 6.3 Mean Importance Ratings of SS Writing Genres at Each Grade-Level Cluster

Genre                        Grade 1        Grades 2–3     Grades 4–5     Grades 6–8     Grades 9–12
                             Mean   SD      Mean   SD      Mean   SD      Mean   SD      Mean   SD
Reports                      2.73   0.94    2.85   0.82    2.88   0.78    3.00   0.82    2.85   0.99
Sequential Explanations      2.45   1.18    2.68   1.04    2.84   0.80    3.00   0.63    2.92   0.86
Biographical Recounts        2.27   1.08    2.54   0.87    2.80   0.82    2.62   0.80    2.54   0.78
Causal Explanations          1.62   1.12    2.41   0.88    2.76   0.66    3.14   0.65    3.23   0.83
Arguments from Evidence      1.48   1.29    2.44   0.97    2.84   0.85    3.30   0.92    3.46   0.78
Systems Explanations         1.43   1.25    2.00   0.76    2.36   0.91    2.61   0.92    2.69   1.03
Consequential Explanations   1.38   1.16    2.28   0.89    2.64   0.76    3.05   0.60    3.23   0.83
Historical Recounts          1.36   1.26    2.29   0.93    2.68   0.80    3.24   0.83    3.38   0.65
Factorial Explanations       1.19   1.21    2.13   0.77    2.44   0.82    2.95   0.69    3.15   0.69
The educators’ ratings of the SS genres presented in Table 6.3 aligned with the genre analysis at the different grade-level clusters. For grades 1 and 2–3, reports, sequential explanations, and biographical (or personal) recounts, which were rated highest in importance by educators, were also indicated as genres typical of these grades, as shown in the Appendix. For grades 4–5, reports, sequential explanations, arguments from evidence, and biographical recounts were all found to be typical of the grade-level cluster in the genre analysis, which was also substantiated by the educators’ ratings for those grades. Grades 6–8 and 9–12 yielded a comparable result, with the three highest-rated SS writing genres of argument from evidence, historical recount, and causal explanation also indicated as grade-level appropriate according to the literature.
Phase 2: Focus Group Findings

The focus groups explored in more depth how the ACCESS Writing Subtest tasks are similar to and different from classroom writing tasks. In the grade 1 group, the educators expressed the idea that writing tasks are typically completed with substantial support and that they also focus on building fine motor skills, mechanics, and vocabulary. They reinforced the point that writing expectations at this grade are the same for both ELA and SS. The educators emphasized that students write best when they write about something that is developmentally appropriate for their young age; when they have a choice of topics; when they have interest
in a topic; and when they draw or discuss the topic as a pre-writing activity. The educators perceived the grade 1 ACCESS Writing Subtest tasks to be geared towards end-of-the-year grade 1 students who can write a few sentences.

The grades 2–3 group reported that typical writing tasks included significant support during the editing and revision process. They felt that the directions for completing the tasks should be very specific and detailed, and that the tasks should have topics that are engaging to students in grades 2–3. According to the educators, classroom tasks happen one step at a time, over time, with substantial teacher support, whereas ACCESS Writing Subtest tasks occur in one longer sitting and are completed fairly independently. This point suggests that educators spend time on the writing process and instruct students very closely over time, rather than conducting independent writing assignments as assessments.

For grades 4–5, educators reported that students are most often assigned evidence-based writing or are asked to provide opinions based on reading. For a task to be effective, students need input, such as texts, to support their output. At these grades, students make diagrams to organize ideas prior to writing and often have the opportunity to work in groups during the writing process.

The grades 6–8 ELA group explained that writing consists of many short writing tasks across a range of genres, and that key elements of effective writing tasks at these grades include clear expectations, modeling, an authentic audience and purpose, and aligning assessment tasks to instruction. In SS at grades 6–8, a wide range of writing tasks are important, from critical thinking and citing evidence to writing from a character’s perspective or writing about current global issues. This finding was consistent with the Phase 1 finding that argument and critical response were identified as two of the most important genres for grades 6–8. Both long and short writing tasks are common. Like the ACCESS Writing Subtest tasks, classroom writing tasks often involve a checklist for students to assess their own writing. They also ask students for details and give students options to choose from when writing.

Grades 9–12 writing was characterized by the group as largely source-based writing, encompassing a wide range of genres and lengths, including argumentative, expository, and narrative writing as well as citing evidence, which supported the Phase 1 findings. Teacher or peer models and peer editing were cited as common features of writing instruction and as key elements of effective writing tasks.
Implications for Policy, Practice, and Future Research

The first research question investigated which topics and genres are typical of writing assignments used in grades 1–12 ELA and SS classes. The participants’ importance ratings of the writing genres have important implications for test developers. They provided test developers with the opportunity to examine the alignment between genres taught in the TLU domain and genres assessed on the ACCESS Writing Subtest, which resulted in a broadening of the writing genres targeted on the test. Such comparative analysis is particularly useful for domain definition and
the construction of tasks that are authentic to the TLU domain (Cushing, 2017). Nationally used standards, such as the CCSS, the C3 Framework, and the WIDA Standards, provided a theoretical conceptualization of the language development trajectory in the domain of writing. Findings from our study provided educators’ perspectives on the developmental trajectory in terms of the genres needed in the TLU domain. Such an inquiry into educators’ perspectives can be useful both for providing empirical evidence to back up the standards and for evaluating the correspondence between the test tasks and tasks in the TLU domain.

Through data from the focus groups, the second research question addressed educators’ perspectives on the differences and similarities between ACCESS Writing Subtest tasks and writing tasks assigned in content area classes. One important difference is that educators tend to teach writing through a process of drafting, editing, and revising, and they are less likely to assign writing to be completed independently in one sitting, especially in the younger grades. Given that the ACCESS Writing Subtest is conducted in one sitting and aims to capture a snapshot of students’ writing as a rough first draft, these constraints are reflected in the domain modeling, which aims to mirror the domain analysis while making conscious decisions to accommodate them. For example, in grades 1 and 2–3, although the support from the teacher is different in the classroom setting, the ACCESS Writing Subtest can continue to utilize guided scripting (i.e., scripted language that test administrators read aloud to guide students through the task completion process) to assist the younger students in accessing the input so they may in turn demonstrate their best writing in English.

At the upper grades, a primary feature of classroom writing is the reliance on texts and evidence to support claims. This important feature of classroom writing requires careful transformation to the ACCESS Writing Subtest, which cannot afford as much reading load or reading time as the classroom. Furthermore, given that students may be accustomed to a specific and often lengthy writing process, ACCESS Writing Subtest tasks must provide input that is sufficient but not overly complex or reliant on background knowledge to stimulate a meaningful written response. Such considerations are key decision points for creating effective task types that reflect the target language use domain while also fitting the purpose of ACCESS for ELLs 2.0 and the subsequent scores that will be used to inform decisions about English learners’ language services.

This study illustrates the usefulness of updating the domain analysis in the process of refining the writing subtest of a large-scale standardized language test. It demonstrates the value of collecting data from multiple sources related to the TLU domain, including content standards, genre literature, and educators, in re-examining task specifications and refining test tasks. Given the developmental differences of students across grades 1–12 and the importance of writing in the TLU domain,
this updated domain analysis serves as an especially critical and nuanced foundation. As those developing assessments consider approaches to conducting or updating a domain analysis, we recommend analyzing multiple sources of evidence, including input from stakeholders, such as educators, who can provide realistic examples that richly illustrate the TLU domain.
References

Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford, UK: Oxford University Press.
Biber, D., Conrad, S. M., Reppen, R., Byrd, P., Helt, M., Clark, V., Cortes, V., Csomay, E., & Urzua, A. (2004). Representing language use in the university: Analysis of the TOEFL 2000 spoken and written academic language corpus. Princeton, NJ: Educational Testing Service.
Brisk, M. (2015). Engaging students in academic literacies: Genre-based pedagogy for K-5 classrooms. New York, NY: Routledge.
Cho, Y., Ginsburgh, M., Morgan, R., Moulder, B., Xi, X., & Hauck, M. C. (2016). Designing the TOEFL® Primary™ test (ETS Research Memorandum No. RM-16-02). Princeton, NJ: Educational Testing Service. Retrieved from www.ets.org/Media/Research/pdf/RM-16-02.pdf
Christie, F., & Derewianka, B. (2008). School discourse. New York, NY: Continuum.
Coffin, C. (2006). Historical discourse: The language of time, cause and evaluation. New York, NY: Bloomsbury.
Common Core State Standards Initiative (2010). Common Core State Standards for English Language Arts & Literacy in History/Social Studies, Science, and Technical Subjects. Retrieved from www.corestandards.org/wp-content/uploads/ELA_Standards1.pdf
Cushing, S. T. (2017). Corpus linguistics in language testing research. Language Testing, 34(4), 441–449.
Derewianka, B., & Jones, P. (2016). Teaching language in context (2nd ed.). Oxford, UK: Oxford University Press.
Frantz, R. S., Bailey, A. L., Starr, L., & Perea, L. (2014). Measuring academic language proficiency in school-age English language proficiency assessments under new College and Career Readiness Standards in the United States. Language Assessment Quarterly, 11(4), 432–457.
Hines, S. (2010). Evidence-centered design: The TOEIC® speaking and writing tests (Research Report TC-10-07). Princeton, NJ: Educational Testing Service. Retrieved from www.ets.org/Media/Research/pdf/TC-10-07.pdf
Jamieson, J. M., Eignor, D., Grabe, W., & Kunnan, A. (2008). Frameworks for a new TOEFL. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 55–95). New York, NY: Routledge.
Kelly, J., Renn, J., & Norton, J. (2018). Addressing consequences and validity during test design and development. In J. E. Davies, J. M. Norris, M. E. Malone, T. H. McKay, & Y. Son (Eds.), Useful assessment and evaluation in language education (pp. 185–200). Washington, DC: Georgetown University Press.
Kenyon, D. (2014, May). From test development to test use consequences: What role does the CEFR play in a validity argument? Paper presented at the 2014 EALTA Conference, Coventry, England.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-based language assessment. Language Testing, 19(4), 477–496.
National Council for the Social Studies (2013). The College, Career, and Civic Life (C3) Framework for Social Studies State Standards: Guidance for enhancing the rigor of K-12 civics, economics, geography, and history. Silver Spring, MD: NCSS.
Rosenfeld, M., Leung, P., & Oltman, P. K. (2001). The reading, writing, speaking, and listening tasks important for academic success at the undergraduate and graduate levels. Princeton, NJ: Educational Testing Service.
So, Y., Wolf, M. K., Hauck, M. C., Mollaun, P., Rybinski, P., Tumposky, D., & Wang, L. (2015). TOEFL Junior® design framework (ETS Research Report No. RR-15-13). Princeton, NJ: Educational Testing Service. Retrieved from www.ets.org/research/policy_research_reports/publications/report/2015/junw
Taylor, C., & Angelis, P. (2008). The evolution of the TOEFL. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 27–54). New York, NY: Routledge.
WIDA Consortium (2014). The WIDA Standards framework and its theoretical foundations. Retrieved from www.wida.us/aboutUs/AcademicLanguage/
Appendix
Genres of Writing by Grade Level and Content Area

English Language Arts

Procedure (to instruct someone how to do something through a sequence of steps): Grade 1 = Joint construction; Grade 2 = Ind; Grade 3 = Ind; Grade 4 = Ind; Grade 5 = N/A; Grades 6–8 = N/A; Grades 9–12 = N/A.
Personal Recount (to recount a sequence of events and to share a personal response to those events): Grade 1 = Joint construction; Grade 2 = Ind; Grade 3 = Ind; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = N/A; Grades 9–12 = N/A.
Narrative (to narrate a sequence of complicating events and their resolution, and to evaluate events and their outcome): Grade 1 = Identify features in reading; Grade 2 = Identify features in reading; Grade 3 = Ind; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = N/A; Grades 9–12 = N/A.
Autobiography (to recount the significant events of one’s own life): Grade 1 = Ind (oral + written); Grade 2 = Ind; Grade 3 = Ind; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.
Argument (to argue for a particular point of view on an issue, or to discuss two or more points of view or a range of perspectives on an issue before making a judgment or recommendation): Grade 1 = Ind (oral opinions); Grade 2 = Ind (oral + written opinions); Grade 3 = Ind (opinions); Grade 4 = Ind (opinions); Grade 5 = Ind (opinions); Grades 6–8 = Ind; Grades 9–12 = Ind.
Critical Response (to analyze and challenge the message, characters, and values): Grade 1 = N/A; Grade 2 = N/A; Grade 3 = N/A; Grade 4 = N/A; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.

Social Studies

Report (to describe and provide generalized information on a topic): Grade 1 = Ind; Grade 2 = Ind; Grade 3 = Ind; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = N/A; Grades 9–12 = N/A.
Biographical Recount (to recount the significant events of another person’s life): Grade 1 = Personal recount; Grade 2 = Personal recount; Grade 3 = Ind; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.
Historical Recount (to recount events from the past at particular stages in history before making a judgment or drawing a conclusion): Grade 1 = N/A; Grade 2 = N/A; Grade 3 = N/A; Grade 4 = N/A; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.
Causal Explanation (to explain phenomena in a linear sequence showing how one step causes the next): Grade 1 = N/A; Grade 2 = N/A; Grade 3 = N/A; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.
Sequential Explanation (to explain phenomena in a linear sequence, e.g., how a bill becomes a law): Grade 1 = Ind (oral + labeled diagram); Grade 2 = Ind (oral + labeled diagram); Grade 3 = Ind (written + labeled diagram); Grade 4 = Ind (written + labeled diagram); Grade 5 = Ind (written + labeled diagram); Grades 6–8 = Ind; Grades 9–12 = Ind.
Consequential Explanation (to explain consequences of a historic event, e.g., consequences of Westward Expansion): Grade 1 = N/A; Grade 2 = N/A; Grade 3 = N/A; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.
Factorial Explanation (to explain factors that contributed to an outcome, e.g., factors that led to World War II): Grade 1 = N/A; Grade 2 = N/A; Grade 3 = N/A; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.
Systems Explanation (to explain how a system works, e.g., how governments work): Grade 1 = N/A; Grade 2 = N/A; Grade 3 = N/A; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.
Argument from Evidence (to argue from evidence by developing claims and counterclaims): Grade 1 = N/A; Grade 2 = N/A; Grade 3 = Ind; Grade 4 = Ind; Grade 5 = Ind; Grades 6–8 = Ind; Grades 9–12 = Ind.

Notes. Ind: genres in which students are expected to write independently. N/A: genres in which students are not expected to write. Joint construction: genres in which students are expected to write with support from teachers. Identify features in reading: genres in which students are not expected to write; rather, students are expected to identify features of that genre in reading. Oral + written: oral and written activities that prepare students to write in a genre. Oral opinions: oral activities that prepare students to write in the genre of “opinions.” Oral + written opinions: oral and written activities that prepare students to write in the genre of “opinions.” Oral + labeled diagram: in the lower grades, students are expected to produce sequential explanations through oral explanations or diagrams with labels, rather than writing in that genre. Written + labeled diagram: for grades 3–5, students are expected to produce sequential explanations either through writing or using diagrams with labels.
7

THE EFFECT OF AUDIOVISUAL INPUT ON ACADEMIC LISTEN-SPEAK TASK PERFORMANCE

Ching-Ni Hsieh and Larry Davis
Motivation for the Research

A common feature of language use in academic contexts is the need to integrate information obtained from source materials into a written or spoken report. The ability to successfully carry out such integrated language tasks has long been identified as an important element of academic language proficiency (Brown, Iwashita, & McNamara, 2005; Butler, Eignor, Jones, McNamara, & Suomi, 2000; Cumming, 2014; Douglas, 1997). While a substantial body of research on the use of integrated tasks to assess writing has accumulated (e.g., Cumming, 2013; Cumming, Lai, & Cho, 2016; Plakans, 2010), the use of integrated tasks to assess speaking ability has been studied less often (Frost, Elder, & Wigglesworth, 2012; Huang, Hung, & Hong, 2016).

Source materials used in integrated speaking tasks, which require the use of multiple language skills such as reading, listening, and speaking, have typically utilized written texts or audio recordings as the primary means of conveying content. While assessments of general speaking ability have long included the description of visual materials such as photographs or cartoons (Butler et al., 2000), content in tests of academic speaking has usually been delivered via readings or audio. Visuals, when used in language assessment, have generally been still images used to supply contextual information, such as depicting the setting and speaker (Ginther, 2002; Suvorov, 2009). Audiovisual input that involves the use of video and other types of visual aids, such as on-screen key words, is uncommon despite the widespread use of audiovisuals in language classrooms and the increasing use of audiovisuals to provide content in online instruction and as a modality for real-life, online communication (Cross, 2011; Sydorenko, 2010; Weyers, 1999).
While the use of video in L2 listening assessment and its impact on test-taker performance have garnered much research attention (see the review below; e.g., Batty, 2015; Coniam, 2001; Feak & Salehzadeh, 2001; Ginther, 2002; Ockey, 2007; Suvorov, 2009; Wagner, 2010, 2013), little research exists that investigates the feasibility of audiovisual input in integrated tasks and whether audiovisual input impacts test-taker performance (Cubilo & Winke, 2013). Because the production of audiovisuals requires substantial resources to ensure scalability and comparability across test versions in large-scale language assessment, sound justifications for their use are greatly needed. It is also critical to investigate the potential impact of audiovisuals on integrated task performance in order to ensure valid score interpretations. At the time of this writing, there is no known publicly available empirical study on the use of audiovisuals in integrated speaking tasks. To address this gap in the literature, we set out to investigate the effect of audiovisual versus audio-only source materials on the language produced by L2 learners responding to an integrated listen-speak task within an English for Academic Purposes (EAP) assessment context by comparing performance differences in aspects of fluency, grammar, vocabulary, and content.

Greater authenticity has been a major argument for the use of audiovisuals in language assessment, given that audiovisuals have been used for several decades as a source of language input in the language classroom. With the advent of online instruction, video has become increasingly common as a conduit for instruction and communication. A more general authenticity argument for the use of audiovisuals in language assessment is that many real-world target language use domains require the simultaneous processing of visual and aural input (Wagner, 2010). Non-verbal cues such as gestures and body movements are also used by listeners to understand and interpret verbal messages (Kellerman, 1992; Sueyoshi & Hardison, 2005), and it has been argued that the presence of visual input in listening tests might support test takers in comprehending spoken texts and help them to achieve their best performance (Wagner, 2013). Nonetheless, the use of audiovisuals in language assessments has been uncommon, which may in part be due to the ongoing debate within the language testing community regarding whether the ability to process visual information in support of an oral message should be considered part of listening ability. On one side of the debate are those who argue that the ability to interpret non-verbal information reflects many contexts of real-world language use and should therefore be considered part of the listening construct (Gruba, 1997; Ockey, 2007; Wagner, 2008, 2010). On the other side are those who prefer to define the listening construct more narrowly in terms of the abilities specifically required to understand spoken texts (Buck, 2001; Coniam, 2001).

Although audiovisuals have yet to see widespread use in speaking assessment, a variety of studies have investigated the impact of video input in listening assessment. A majority of such studies have examined the impact of video-based
stimulus on test performance or scores, typically comparing assessments where listening input was provided via video versus audio only. Such studies have reported mixed results, with some finding that test takers achieved higher scores in the video input condition (Latifi & Mirzaee, 2014; Sueyoshi & Hardison, 2005; Wagner, 2010, 2013), while other studies observed little or no difference between video and audio conditions (Batty, 2015; Coniam, 2001; Londe, 2009). A few studies have reported that test takers performed better when input was audio-only (Pusey & Lenz, 2014; Suvorov, 2009). Some studies also found that language proficiency could serve as a mediating factor influencing test takers’ abilities to utilize the supporting characteristics of the video input (Ginther, 2002; Latifi & Mirzaee, 2014; Sueyoshi & Hardison, 2005).

The relative impact of video on performance also depends on the degree to which test takers actually pay attention to the visual input. Considerable individual variation has been reported in small-scale studies, both in the amount of time test takers spent looking at the video and in their perceptions of whether the video was helpful or distracting (Ockey, 2007; Wagner, 2008). Suvorov (2015) used eye-tracking technology to investigate the manner in which test takers attended to video input; while individuals varied in their viewing behavior, such differences were not associated with scores (but see Nishikawa, this volume). Inconsistency in these findings may be a result of both methodological shortcomings and the many possible interactions between different types of input and the language proficiency of the test takers (Batty, 2015; Ginther, 2002). Interactions among test facets may include the nature of the task, the type of visual, and learner characteristics (Ginther, 2002; Sueyoshi & Hardison, 2005; Suvorov, 2015). One example is the study by Latifi and Mirzaee (2014), who found no difference in the performance of advanced learners who received video versus audio-only input, but higher scores among intermediate-level test takers who received video input.

The way in which visual input supports the verbal message may also moderate the effect of video or other visuals on listening performance. An issue of particular interest has been the impact of “content” versus “context” visuals (i.e., visual materials that communicate content relevant to making a response versus visuals that provide only information about the context of communication, such as who is speaking). Bejar, Douglas, Jamieson, Nissan, and Turner (2000) distinguish between content visuals that provide new information, which may complicate the listening task, and those that replicate, illustrate, or organize information provided verbally, which are expected to promote comprehension. In a study using still images, Ginther (2002) found that the presence of content visuals in the listening comprehension section of the TOEFL® CBT (Computer-based Test) had a slight positive effect on comprehension of mini-lectures, whereas context visuals had a slightly negative effect. Ockey (2007) reported that test takers spent little time looking at context photos accompanying an audio passage, although some individuals reported that the photos were helpful to establish an understanding of
the situation. It has been suggested that such setting of the scene may be helpful for comprehension by activating background knowledge or indicating who is speaking (Suvorov, 2009), with the latter being especially helpful when listening to conversational input (Buck, 2001).
Research Context

This study investigated the effect of audiovisual input on performances elicited with integrated speaking tasks that require the use of listening and speaking skills. From a theoretical perspective, research on cognitive load theory of multimedia learning (Mayer, 2001) suggests that instructional materials presented in both visual and verbal modalities can assist in comprehension more than presentations delivered in a single modality (i.e., visual or verbal only). This facilitative effect occurs because learners are able to use both the visual and verbal mental systems to process more input information than would otherwise be possible with only one of the systems. Empirical research findings in computer-assisted language learning also indicate that L2 learners are generally equipped with strategies for processing input in multiple modalities and can benefit from multimodal presentations of input materials (Sydorenko, 2010; Taylor, 2005). Nevertheless, results of some studies in multimedia learning also show that multimodal presentation of input information could result in split attention, imposing a heavy cognitive load especially for novice learners (e.g., Mayer, Heiser, & Lonn, 2001).

Inferring from these findings, we consider it possible that, when engaged in integrated speaking tasks that use audiovisual input, L2 learners may need to split their attention. This process could leave fewer cognitive resources available for learners to effectively synthesize and reproduce the source information in the subsequent oral response. If this is the case, integrated speaking task performance could be adversely impacted, particularly for less proficient speakers.

The conflicting views and results from previous research suggest that the impact of audiovisual input on language test or task performance is complex and can vary depending upon contextual factors such as L2 learners’ proficiency. Given the impact of video input on performances elicited with integrated writing tasks and on listening test scores, an examination of performance differences in audio-mediated versus audiovisually mediated integrated speaking tasks can provide important guidance for the application of multimodal presentations of source materials in such tasks.
Research Questions

In this study, we surmised that the use of audiovisual input in integrated listen-speak tasks could facilitate information processing during the listening phase of task completion because learners would have access to both audio and visual input to assist comprehension. We assumed that learners having access to dual-mode
input would have more resources left for subsequent speech production, compared to those having access to only one mode. We hypothesized that L2 learners responding to an audiovisually mediated listen-speak task would benefit from the dual-mode presentation of the source material and outperform those responding to an audio-only mediated task. Therefore, the study addressed the following research questions:

1. To what extent does the use of audiovisual input in an integrated listen-speak task result in better performance than the use of audio-only input?
2. Is the effect of input type related to L2 learners’ speaking proficiency?
Research Methods

The study consisted of two groups of participants: audio and audiovisual. These participants were chosen from two separate pools of official TOEFL iBT® test takers, with 135 test takers in each group. The participants were selected based on their speaking proficiency, gender, and first languages (L1s). There were 126 male and 144 female participants, with a mean age of 24.1 (SD = 6.73). They represented 49 native countries and spoke 39 languages as their first language. We used the test takers’ TOEFL iBT speaking scores to select the participants at three levels of proficiency (low, medium, and high), with 45 participants at each level for both the audio and audiovisual groups. In order to examine the effect of speaking proficiency, we selected an equal number of participants at each score point for both groups so that the mean speaking score for each level was the same between task conditions. The mean score was 15.9 (SD = 1.61) for the low-proficiency level, 20.3 (SD = 1.26) for the medium-proficiency level, and 25.1 (SD = 1.84) for the high-proficiency level, out of 30 points possible.
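The matching logic described above can be illustrated with the sketch below. It is an assumed example of the general idea rather than the authors’ sampling procedure, and the file and column names (condition, speaking_score) are hypothetical.

```python
# Assumed illustration: draw equal numbers of audio and audiovisual test takers
# at every TOEFL iBT speaking score point so that group means match by design.
import pandas as pd

pool = pd.read_csv("candidate_pool.csv")  # hypothetical columns: id, condition, speaking_score

picks = []
for _, by_score in pool.groupby("speaking_score"):
    audio = by_score[by_score["condition"] == "audio"]
    audiovisual = by_score[by_score["condition"] == "audiovisual"]
    n = min(len(audio), len(audiovisual))      # same count at this score point
    picks.append(audio.sample(n, random_state=0))
    picks.append(audiovisual.sample(n, random_state=0))

matched = pd.concat(picks)
print(matched.groupby("condition")["speaking_score"].agg(["count", "mean"]))
```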
Data Collection Procedures

One integrated TOEFL iBT listen-speak task was used as the prompt to gather spoken responses. The task required the participants to listen to an academic lecture given by a professor and then summarize the lecture in a 60-second response. In the lecture, the professor began by defining a concept and then discussed two important aspects related to it. The participants were allowed to take notes while listening to the lecture.

The lecture was presented in two versions: audio and audiovisual. For the audio version, the lecture was recorded by a professional voice actor and delivered via audio with a still photo depicting the speaker standing in front of a classroom. For the audiovisual version, a researcher was video-recorded delivering the same lecture. The researcher was an L1 American English speaker and an experienced language teacher. To simulate a classroom setting where key lecture information is often presented in PowerPoint slides or on a whiteboard, three key words
representing the key concept and the two aspects of the topic discussed in the lecture were shown in a slideshow format alongside the video. The key words were chosen so that substantive source information was not given away, in order to avoid any potential impact on performance.

The audio group of participants responded to the audio version of the integrated listen-speak task described above as part of their official TOEFL iBT speaking test. The audiovisual group responded to the audiovisual version of the task as part of a larger research project shortly after the participating students completed the TOEFL iBT test; in other words, the audiovisual task was not included in the participants’ official TOEFL iBT speaking test. In the larger project, all research participants were asked to respond to six innovative academic speaking tasks; the audiovisual version of the task used in the current study was one of them.

To answer the research questions, we compared the two groups of participants’ performances on features of fluency, grammatical complexity, vocabulary, and content. The fluency features included Pause duration (mean pause length in seconds), Speech rate (number of words per second), and Mean length of run (mean number of words per run, where a run is defined as a segment whose boundaries are set by a silence of two-tenths of a second or longer). Two grammatical complexity measures were used: Clause length (number of words per clause) and Sentence length (number of words per sentence). Vocabulary was measured by Word type (number of unique word types) and Word token (number of tokens). The quality of speech content was measured by Key points (number of accurately reproduced key points). These features represented the major rating criteria used in the TOEFL iBT scoring rubrics and have been commonly employed in analyses of spoken performances (e.g., Brown et al., 2005).

Features of fluency, grammatical complexity, and vocabulary were computed using SpeechRater, an automated scoring engine developed by Educational Testing Service (Zechner, Higgins, Xi, & Williamson, 2009). In this analysis, the transcripts produced by human transcribers were used to generate the automated language measures. For content, we followed the “key points coverage” approach employed by Frost et al. (2012). Two coders coded the number of accurately reproduced key points in all the responses. Exact agreement between the two coders was 71%. The discrepant cases were resolved through discussion until 100% agreement was reached.
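As a rough illustration of the kinds of measures listed above, the sketch below computes speech rate, mean length of run, pause duration, and word types and tokens from a word-level transcript with timestamps. It is not SpeechRater, and the input format (a list of word, start-time, end-time triples) is assumed for the example.

```python
# Simplified, assumed illustration of the fluency and vocabulary measures; the
# operational features were computed by the SpeechRater engine, not this code.
from typing import Dict, List, Tuple

PAUSE_THRESHOLD = 0.2  # a run boundary is a silence of 0.2 seconds or longer

def spoken_features(words: List[Tuple[str, float, float]]) -> Dict[str, float]:
    tokens = [w for w, _, _ in words]
    total_time = words[-1][2] - words[0][1]

    # Split the response into runs wherever the silence between two adjacent
    # words reaches the threshold, collecting the pause lengths along the way.
    runs, current, pauses = [], [tokens[0]], []
    for (_, _, prev_end), (word, start, _) in zip(words, words[1:]):
        gap = start - prev_end
        if gap >= PAUSE_THRESHOLD:
            pauses.append(gap)
            runs.append(current)
            current = [word]
        else:
            current.append(word)
    runs.append(current)

    return {
        "speech_rate": len(tokens) / total_time,                # words per second
        "mean_length_of_run": sum(map(len, runs)) / len(runs),  # words per run
        "pause_duration": sum(pauses) / len(pauses) if pauses else 0.0,
        "word_token": len(tokens),
        "word_type": len({t.lower() for t in tokens}),
    }

# Fabricated three-word fragment, shown only to indicate the expected input shape.
demo = [("the", 0.0, 0.2), ("professor", 0.25, 0.8), ("explains", 1.1, 1.6)]
print(spoken_features(demo))
```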
Data Analysis Procedures

We conducted a series of two-way ANOVAs to examine the main effects of input type and proficiency and the interaction effect between these two variables. Each spoken feature was analyzed in a separate ANOVA where type of input and proficiency level were the independent variables and the feature value was the dependent variable. Prior to the ANOVA analyses, we examined multicollinearity among the features to ensure that none of the features included were highly
correlated with each other (r > .90). We also checked the assumption of normality using the Shapiro-Wilk test. Levene’s tests were examined for homogeneity of variances for all dependent variables. We found that some data were skewed and variances were not homogeneous. In the skewed cases, we used log-transformed data and re-ran the ANOVAs, a typical remedy for violations of the normality assumption (Field, 2013). The same statistically significant results were found as with the untransformed data. ANOVA is fairly robust to violations of normality and homogeneity of variance when group sample sizes are equal, which is the case in the current study. Therefore, for consistency, we reported the ANOVA results using untransformed data, because transformed values are hard to interpret. For all analyses, we set the significance level at .05.
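A compact sketch of this analysis logic is given below. It is not the authors’ code; the data file and column names (input_type, proficiency, word_type) are assumed for illustration, and it runs the checks and the two-way ANOVA for a single feature.

```python
# Assumed illustration: two-way ANOVA (input type x proficiency) for one
# feature, with the normality and homogeneity checks described above.
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("listen_speak_features.csv")

# Assumption checks for one dependent variable.
print(stats.shapiro(df["word_type"]))        # Shapiro-Wilk test of normality
cells = [g["word_type"].values for _, g in df.groupby(["input_type", "proficiency"])]
print(stats.levene(*cells))                  # Levene's test of homogeneity of variance

# Main effects and the interaction effect.
model = ols("word_type ~ C(input_type) * C(proficiency)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)

# Partial eta squared for each effect: SS_effect / (SS_effect + SS_residual).
ss_residual = anova.loc["Residual", "sum_sq"]
effects = anova.drop(index="Residual").copy()
effects["partial_eta_sq"] = effects["sum_sq"] / (effects["sum_sq"] + ss_residual)
print(effects)
```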
Findings and Discussion

Descriptive statistics for the eight features are shown in Table 7.1. There was no significant main effect of input type for the three fluency measures, Pause duration, F(1, 263) = 1.28, p = .258, Speech rate, F(1, 263) = 0.28, p = .597, and Mean length of run, F(1, 263) = 0.045, p = .833; the two grammatical complexity measures, Clause length, F(1, 263) = 1.77, p = .184, and Sentence length, F(1, 263) = 0.637, p = .425; or the vocabulary measure of Word token, F(1, 263) = 3.35, p = .068. A significant main effect of input type was found for Word type with a small effect size, F(1, 263) = 7.02, p = .009, partial η² = .026, showing that the audio group performed slightly better than the audiovisual group. To determine whether the effect of input type on Word type was related to the participants’ proficiency, independent-samples t tests were performed to compare the mean differences at each proficiency level. A significant difference was found for the medium-proficiency level, t(87) = 2.375, p = .02, d = 0.50, indicating that the medium-proficiency participants in the audio condition produced a significantly larger number of unique word types than those in the audiovisual condition.

A significant main effect of input type was also found for Key points with a small effect size, F(1, 263) = 4.21, p = .04, partial η² = .015, showing that the audio group outperformed the audiovisual group in the number of accurately reproduced key points. An interaction effect was observed with a small effect size, F(2, 263) = 7.70, p = .001, partial η² = .055. The low- and medium-proficiency participants in the audio condition outperformed their counterparts in the audiovisual group, but the high-proficiency participants in the audiovisual group performed better than their counterparts in the audio group. This result suggests that the effect of input type on Key points was different for learners at different proficiency levels. Independent-samples t tests yielded a significant difference for the low-proficiency level, t(88) = 3.59, p = .001, d = 0.75, demonstrating that the difference in the content measure was most noticeable among the low-proficiency speakers.
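The follow-up comparisons reported above can be illustrated with the short sketch below: an independent-samples t test with Cohen’s d for one proficiency level. It is an assumed example, reusing the hypothetical data frame and column names from the previous sketch.

```python
# Assumed illustration of a follow-up comparison at one proficiency level:
# audio versus audiovisual on the Key points measure, with Cohen's d.
import pandas as pd
from scipy import stats

df = pd.read_csv("listen_speak_features.csv")
low = df[df["proficiency"] == "low"]
audio = low.loc[low["input_type"] == "audio", "key_points"]
audiovisual = low.loc[low["input_type"] == "audiovisual", "key_points"]

t_stat, p_value = stats.ttest_ind(audio, audiovisual)

# Cohen's d based on the pooled standard deviation of the two groups.
n1, n2 = len(audio), len(audiovisual)
pooled_sd = (((n1 - 1) * audio.std() ** 2 + (n2 - 1) * audiovisual.std() ** 2)
             / (n1 + n2 - 2)) ** 0.5
cohens_d = (audio.mean() - audiovisual.mean()) / pooled_sd
print(f"t({n1 + n2 - 2}) = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")
```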
TABLE 7.1 Descriptive Statistics of Performance Features by Input Type and Proficiency

           Low                              Medium                           High
           Audiovisual      Audio           Audiovisual      Audio           Audiovisual      Audio
Feature    M       SD       M       SD      M       SD       M       SD      M       SD       M       SD
PD         0.61    0.24     0.56    0.19    0.51    0.15     0.52    0.17    0.48    0.09     0.46    0.11
SR         1.6     0.40     1.7     0.35    1.9     0.33     1.9     0.35    2.2     0.31     2.2     0.35
LR         3.8     1.30     4.0     1.29    4.9     1.65     4.8     1.69    6.1     1.72     6.0     1.66
CL         8.6     2.10     8.1     1.94    8.3     1.81     8.3     1.81    9.5     2.29     9.5     2.29
SL         13.7    5.57     13.1    4.70    14.1    5.24     13.4    4.22    14.1    4.24     14.0    4.21
WP         55.3    13.69    60.6    11.23   64.1    10.00    69.4    11.02   77.5    10.30    78.1    11.62
WT         92.2    27.35    98.7    22.17   107.9   21.93    114.4   21.49   130.2   19.42    132.4   22.74
KP         2.2     1.23     3.2     1.22    3.3     1.27     3.7     1.03    4.4     0.96     4.0     1.14

Note. PD = Pause duration; SR = Speech rate; LR = Mean length of run; CL = Clause length; SL = Sentence length; WP = Word type; WT = Word token; KP = Key points.
Implications for Policy, Practice, and Future Research

This study compared performances elicited from an audio versus an audiovisual version of an integrated listen-speak task as a way to examine the effect of audiovisual input on L2 learners’ performances on such tasks. The prediction was that L2 learners in the audiovisual condition would outperform those in the audio condition. The prediction was not supported by the data. Performances of the two groups of participants did not differ significantly in measures of fluency, grammatical complexity, and the number of words produced. This finding echoes the results of Cubilo and Winke (2013), who found little difference in integrated writing task performances between audio and video task conditions.

Contrary to our initial hypothesis, the audio group outperformed the audiovisual group in measures of word types and content, albeit with small effect sizes. The differences were most marked for the low- and medium-proficiency groups of learners, who obtained mean TOEFL iBT speaking scores of 15.9 and 20.3, respectively. The results showed that the positive effects of audiovisuals on performance were most pronounced for the high-proficiency group of speakers, who had a mean TOEFL iBT speaking score of 25.1. This finding differs from the results of Latifi and Mirzaee (2014), who found that intermediate-level test takers benefited most from video input in listening assessment. Our study results were more intriguing when we looked at the interaction effect between input type and proficiency level on the content measure. The high-proficiency speakers in the audiovisual group reproduced more accurate key points than did their audio counterparts; on the other hand, the low- and medium-proficiency speakers in the audio group exceeded those in the audiovisual group. It is possible that the high-proficiency speakers were better able than the less-proficient speakers to utilize the visual cues and the on-screen key words to help them recall, synthesize, and reproduce the source materials.

From a theoretical standpoint, the findings of the study can be explained by the split-attention effect of the cognitive load theory of multimedia learning, which suggests that integration of information presented in multiple modes can pose an additional cognitive load for novice learners because learners’ attention is split between different modes of information (Mayer et al., 2001). The process of splitting attention between listening to the lecture, watching the video, reading the on-screen key words, and taking notes may be very taxing for L2 learners and may overload some individuals, particularly the less-proficient ones. Splitting attention in this way could deplete their attentional resources, and thus the less-proficient students may not have been able to take full advantage of the additional visual aids. As a result, their performance was affected (Mayer & Moreno, 1998).

Although we had assumed that the use of audiovisual input could foster the coordination of the visual and aural materials and enhance task performance, our study results reveal that the adoption of audiovisuals in integrated listen-speak
tasks, which are already very demanding for many L2 learners (Brown et al., 2005), can impose an extra cognitive load. L2 learners' attentional resources are more likely to be split and disrupted when they must construct connections between the visual and auditory information, which can hinder their performances. Nevertheless, the interaction effect observed for the content feature also indicates that L2 speakers' ability to accurately reproduce the source materials may be enhanced by the visual information if the individuals have reached a high enough level of proficiency. We speculate that there may be a threshold level of speaking proficiency above which L2 learners' performance on audiovisually mediated integrated listen-speak tasks can be boosted and below which the ability to process and reproduce the source materials might be weakened.

From a pedagogical viewpoint, the results of the study indicate that the use of audiovisual texts in language classrooms should take students' language proficiency level into consideration, because audiovisually mediated integrated speaking activities are likely to require threshold levels of ability for competent performance. The findings suggest that audiovisual source materials in academic listen-speak learning activities have the potential to assist comprehension and speech production for more advanced L2 students. However, this type of input material may distract low-proficiency students and overload their visual channels, and it should therefore be used selectively in language classrooms.

A few limitations should be pointed out. In this study, we used only one TOEFL iBT listen-speak task, and the sample size was small. In addition, each participant responded to only one version of the task. Therefore, the results cannot be generalized outside of the study context. Despite these limitations, our results have implications for task design and the validity of test results. Unlike listening assessment, which requires language learners to process only aural input, integrated listen-speak tasks require multiple language skills and thus can be very cognitively demanding for L2 learners. For large-scale language assessments that involve the use of integrated tasks, test developers should carefully examine the effect of task demands on performance if audiovisuals are to be used, in order to ensure valid score interpretations. On the other hand, while the use of audiovisuals may increase cognitive demands, we consider multimodal communication to be a good reflection of authentic language use. Therefore, the performance differences related to proficiency level seen in this study could very well be viewed as construct-relevant. Taking this view, we maintain that the use of audiovisual input in integrated speaking tasks might actually improve the inferences we make regarding a language learner's ability to use language in the target domain.

We believe that the field would benefit from investigation of the impact of audiovisual input on integrated speaking task performance in different learning and assessment contexts. This line of research could provide further guidance and best practices for the design and development of assessment tasks that use multimodal presentations of source materials. We also recommend that future research investigate how L2 learners interact with visual cues and how the interaction
contributes to learners’ abilities to recall and synthesize information in integrated tasks. The results of such research could help us better understand the construct measured in integrated speaking tasks that use audiovisual input materials.
References

Batty, A. O. (2015). A comparison of video- and audio-mediated listening tests with many-facet Rasch modeling and differential distractor functioning. Language Testing, 32(1), 3–20.
Bejar, I., Douglas, D., Jamieson, J. M., Nissan, S., & Turner, J. (2000). TOEFL 2000 listening framework: A working paper (TOEFL Monograph No. MS-19). Princeton, NJ: Educational Testing Service.
Brown, A., Iwashita, N., & McNamara, T. F. (2005). An examination of rater orientations and test taker performance on English for Academic Purposes speaking tasks (TOEFL Monograph No. MS-29). Princeton, NJ: Educational Testing Service.
Buck, G. (2001). Assessing listening. Cambridge, UK: Cambridge University Press.
Butler, F. A., Eignor, D., Jones, S., McNamara, T., & Suomi, B. K. (2000). TOEFL 2000 speaking framework: A working paper (TOEFL Monograph No. MS-20). Princeton, NJ: Educational Testing Service.
Coniam, D. (2001). The use of audio or video comprehension as an assessment instrument in the certification of English language teachers: A case study. System, 29, 1–14.
Cross, J. (2011). Comprehending news videotexts: The influence of the visual content. Language Learning & Technology, 15(2), 44–68.
Cubilo, J., & Winke, P. (2013). Redefining the L2 listening construct within an integrated writing task: Considering the impacts of visual-cue interpretation and note-taking. Language Assessment Quarterly, 10, 371–397.
Cumming, A. (2013). Assessing integrated writing tasks for academic purposes: Promises and perils. Language Assessment Quarterly, 10(1), 1–8.
Cumming, A. (2014). Assessing integrated skills. In A. J. Kunnan (Ed.), The companion to language assessment. Volume I: Abilities, contexts, and learners (pp. 216–229). West Sussex, UK: Wiley.
Cumming, A., Lai, C., & Cho, H. (2016). Students' writing from sources for academic purposes: A synthesis of recent research. Journal of English for Academic Purposes, 23, 47–58.
Douglas, D. (1997). Testing speaking ability in academic contexts: Theoretical considerations (TOEFL Monograph No. MS-08). Princeton, NJ: Educational Testing Service.
Feak, C. B., & Salehzadeh, J. (2001). Challenges and issues in developing an EAP video listening placement assessment: A view from one program. English for Specific Purposes, 20, 477–493.
Field, A. (2013). Discovering statistics using IBM SPSS statistics. London, UK: Sage.
Frost, K., Elder, C., & Wigglesworth, G. (2012). Investigating the validity of an integrated listening-speaking task: A discourse-based analysis of test takers' oral performances. Language Testing, 29(3), 345–369.
Ginther, A. (2002). Context and content visuals and performance on listening comprehension stimuli. Language Testing, 19(2), 133–167.
Gruba, P. (1997). The role of video media in listening assessment. System, 25(3), 335–345.
Huang, H.-T. D., Hung, S.-T. A., & Hong, H.-T. V. (2016). Test-taker characteristics and integrated speaking test performance: A path-analytic study. Language Assessment Quarterly, 13(4), 283–301.
Kellerman, S. (1992). "I see what you mean": The role of kinesic behaviour in listening and implications for foreign and second language learning. Applied Linguistics, 13(3), 239–258.
Latifi, M., & Mirzaee, A. (2014). Visual support in assessing listening comprehension: Does it help? International Journal of Research Studies in Educational Technology, 3(2), 13–20.
Londe, Z. C. (2009). The effects of video media in English as a second language listening comprehension tests. Issues in Applied Linguistics, 17(1), 41–50.
Mayer, R. E. (2001). Multimedia learning. Cambridge, UK: Cambridge University Press.
Mayer, R. E., Heiser, J., & Lonn, S. (2001). Cognitive constraints on multimedia learning: When presenting more material results in less understanding. Journal of Educational Psychology, 93(1), 187–198.
Mayer, R. E., & Moreno, R. (1998). A split-attention effect in multimedia learning: Evidence for dual processing systems in working memory. Journal of Educational Psychology, 90, 312–320.
Ockey, G. (2007). Construct implications of including still image or video in computer-based listening tests. Language Testing, 24(4), 517–537.
Plakans, L. (2010). Independent vs. integrated writing tasks: A comparison of task representation. TESOL Quarterly, 44(1), 185–194.
Pusey, K., & Lenz, K. (2014). Investigating the interaction of visual input, working memory, and listening comprehension. Language Education in Asia, 5(1), 66–80.
Sueyoshi, A., & Hardison, D. M. (2005). The role of gestures and facial cues in second language listening comprehension. Language Learning, 55(4), 661–699.
Suvorov, R. (2009). Context visuals in L2 listening tests: The effects of photographs and video vs. audio-only format. In C. Chapelle, H. G. Jun, & I. Katz (Eds.), Developing and evaluating language learning materials (pp. 53–68). Ames, IA: Iowa State University.
Suvorov, R. (2015). The use of eye-tracking in research on video-based second language (L2) listening assessment: A comparison of context videos and content videos. Language Testing, 32(4), 463–483.
Sydorenko, T. (2010). Modality of input and vocabulary acquisition. Language Learning & Technology, 14(2), 50–73.
Taylor, G. (2005). Perceived processing strategies of students watching captioned video. Foreign Language Annals, 38(3), 422–427.
Wagner, E. (2008). Video listening tests: What are they measuring? Language Assessment Quarterly, 5(3), 218–243.
Wagner, E. (2010). The effect of the use of video texts on ESL listening test-taker performance. Language Testing, 27(4), 493–513.
Wagner, E. (2013). An investigation of how the channel of input and access to test questions affect L2 listening test performance. Language Assessment Quarterly, 10, 178–195.
Weyers, J. H. (1999). The effect of authentic video on communicative competence. The Modern Language Journal, 83(3), 339–349.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51, 883–895.
8
THE RELIABILITY OF READABILITY TOOLS IN L2 READING
Alisha Biler
Issues that Motivated the Research

The task of assessing the difficulty, or readability, of a text for first language (L1) readers has been a pursuit of researchers for almost a century. For educators, the ability to clearly categorize the level of a text is essential for matching learners to appropriate texts in order to maximize learning. Textbook writers, materials developers, and test developers need to know how to construct or select texts carefully so that meaning can be conveyed clearly. Reading is foundational to the learning process and, as such, the need for readable texts cannot be overstated.

While the significance of identifying text difficulty is generally acknowledged, methods for objectively determining and quantifying the difficulty of a given text are contested. Traditionally, the US education system has relied upon readability formulas measuring word and sentence complexity as the primary means of assessing L1 text difficulty (e.g., the Flesch-Kincaid Grade Level Index [Kincaid, Fishburne, Rogers, & Chissom, 1975]). However, critics have argued that such measures only evaluate surface reading characteristics and ignore the deeper psychological processes involved in reading (Carrell, 1987). Other researchers have proposed that the structure of a text is another source of comprehension difficulty. A text with fewer conceptual gaps in its organization, or a highly coherent text, has been found to aid L1 readers' comprehension (Britton & Gulgoz, 1991). Some studies have provided evidence for including cohesion in formulas that determine the difficulty of a text for L1 readers, especially readers with minimal knowledge of the topic of the text (see, e.g., McNamara, Kintsch, Butler-Songer, & Kintsch, 1996).

For second language (L2) learners, the challenge of ascertaining a text's difficulty is equally important. In terms of learning a second language, errors made
in productive skills, like grammatical errors in writing or pronunciation errors in speaking, are easily detected and diagnosed. However, when learners encounter comprehension difficulties in reading, there is no simple method for isolating the source of the breakdown; the difficulty is likely a combination of multiple factors at the sentential, textual, and contextual levels. Thus, readability metrics used in L1 reading, which assess only sentence-level difficulty, are ineffective at predicting difficulty for L2 readers despite their frequent use by L2 educators and materials developers.
Context of the Research

Since the 1920s, educators and researchers have been using readability indices to assess the level of text difficulty for L1 readers, and over 200 readability formulas have been developed and used with consistent success in native-language reading (Carrell, 1987). Despite the prevalence of readability tools in L1 reading, there is only a small but growing body of empirical research on readability tools and L2 reading comprehension. Although additional tools are available, the two tools reviewed in this study are the Flesch-Kincaid Grade Level Index (Kincaid et al., 1975) and Coh-Metrix (Graesser, McNamara, Louwerse, & Cai, 2004).

The Flesch-Kincaid Grade Level Index (FKGL) assesses text readability by analyzing two aspects of a text: syntactic complexity and semantic complexity. Syntactic complexity is measured by the number of words per sentence, and semantic difficulty is measured by word length. Traditional readability formulas such as this one are built on the premise that long, complex sentences with longer, and thus lower-frequency, words are more difficult for readers than shorter sentences with shorter, more frequently used words. FKGL is appealing for researchers and educators due to its simple measurement and ease of access; the tool is available to anyone with access to Microsoft® Office. Once a text is analyzed, FKGL provides a measurement indicating the corresponding grade level in the US K-12 school system, with a range of grades one through twelve. The scores can also be interpreted as the number of years of education required to understand a text. For example, an FKGL score of 9.0 implies that the text requires ninth-grade reading knowledge or the equivalent of nine years of schooling.

While traditional readability tools such as FKGL have been used in formal educational settings, they are not without criticism. The predominant critique of such formulas is that semantic and syntactic complexity are shallow features of a text. They only measure the lower-level processes of reading discussed earlier; they ignore the higher-level processes of reading, such as text integration. Thus, those measurements alone cannot predict reading success, since they do not acknowledge the deeper processes that occur in reading comprehension (McNamara, Louwerse, McCarthy, & Graesser, 2010). For instance, Carrell (1987, p. 24) provides the following example of two sentences that would yield the same readability score from traditional formulas:
1. The uneven numbers are one, three, five, seven, nine, eleven, and thirteen.
2. The squares of the absolute values of the transition amplitudes are summed.

While acknowledging the detriment of taking two sentences in isolation, which reduces the statistical power that traditional readability formulas rely on, Carrell argues that the above sentences are not equal in processing demands. Although FKGL would assign both sentences the same readability index because they have the same number of syllables per word (1.4) and words per sentence (12), the more advanced grammar and unfamiliar words would render the second sentence more difficult to read (a worked calculation appears below). Carrell's critique aligns with the literature previously discussed regarding the higher-level factors involved in discourse reading, namely cohesive devices and individual reader differences in background knowledge and motivation.

Readability tools are used frequently in L2 environments despite the limited number of empirical studies exploring their effectiveness in predicting L2 comprehension. One of the first studies to explore this relationship is Hamsik (1984), who found that L2 readers' comprehension scores were correlated with Flesch-Kincaid readability predictions; however, the study was limited in size, with only 40 participants of heterogeneous language backgrounds. Similarly, Greenfield (1999) used traditional readability tools, including Flesch-Kincaid, to predict the difficulty of 31 short academic texts and compared the results to cloze test comprehension scores from 200 Japanese L2 learners of English. The resulting correlation between FKGL and cloze score was .85, suggesting readability tools are useful in predicting L2 reading. However, not all studies indicated such a relationship. Brown (1998) looked at 2,300 Japanese L2 readers' cloze scores on 50 randomly chosen English books. He found that the correlation between cloze score and FKGL predictions ranged from .48 to .55 and concluded that traditional readability formulas were not well suited to predicting L2 reading comprehension.

A possible explanation for the lack of consistency in experimental studies of traditional readability formulas is that L2 learners have a paucity of vocabulary and significant cultural gaps in their background knowledge, causing them to rely more strongly on the text and its structure for comprehension. Thus, traditional readability formula predictions are likely to be even more problematic for L2 readers than for L1 readers due to such formulas' exclusive focus on surface features of texts. Specifically, shorter sentences may leave out connections and relations between ideas necessary for building connections within the text, forcing readers to make inferences and placing a greater burden on the processor (Blau, 1982). In these instances, longer sentences with rich context and multiple sentence connectors (e.g., because or secondly) facilitate L2 reading more effectively than the shortest and simplest syntactic structures. Thus, cohesion may play a significant role for L2 learners, a factor that is not represented in traditional readability tools but is present in a more recently established readability formula, which will be discussed below.
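To make Carrell's point concrete, the sketch below applies the published Flesch-Kincaid Grade Level formula (0.39 × words per sentence + 11.8 × syllables per word − 15.59; Kincaid et al., 1975) to the surface statistics reported for the two sentences above. The helper function and the direct use of precomputed averages are illustrative choices only, not part of any tool discussed in this chapter; because both sentences share the same surface statistics, the formula assigns them identical grade levels even though their processing demands differ.

```python
def fkgl(words_per_sentence: float, syllables_per_word: float) -> float:
    """Flesch-Kincaid Grade Level (Kincaid et al., 1975).

    Takes the two surface statistics that FKGL relies on and returns the
    predicted US grade level.
    """
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Both of Carrell's example sentences have 12 words and an average of
# 1.4 syllables per word, so FKGL rates them as equally readable.
sentence_1 = fkgl(words_per_sentence=12, syllables_per_word=1.4)
sentence_2 = fkgl(words_per_sentence=12, syllables_per_word=1.4)
print(round(sentence_1, 2), round(sentence_2, 2))  # identical scores, about grade 5.6
```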
In light of these critiques, another readability tool has recently emerged, made possible by developments in computational linguistics. Coh-Metrix is a readability tool that analyzes texts using over 108 different measurements, including 50 types of cohesion. This web tool is a single interface that accesses multiple modules, including various corpora, part-of-speech categorizers, syntactic parsers, and statistical representations of world knowledge. It is freely available to the public and can be accessed at http://cohmetrix.com/. The distinctive feature of Coh-Metrix is that it measures readability at the discourse level through metrics that analyze the cohesion of a text, or how related sentences are within paragraphs and within whole texts (Graesser et al., 2004). The web tool includes FKGL in its analysis as well as its own composite L2 readability index (Crossley, Dufty, McCarthy, & McNamara, 2007). The L2 readability index of Coh-Metrix (RDL2) analyzes three distinct aspects of a text: (1) syntactic complexity, as measured by the number of words per sentence; (2) co-referentiality, as measured by content word overlap, or the number of repeated nouns, between adjacent sentences; and (3) word frequency, as measured by Center for Lexical Information (CELEX) frequency scores. (A simplified sketch of the content word overlap component appears at the end of this section.)

In terms of Coh-Metrix applications to L2 reading, Crossley, Greenfield, and McNamara (2008) replicated the study by Greenfield (1999) using Coh-Metrix RDL2 as well as FKGL. They used the same 31 passages and cloze-test scores as Greenfield (1999) and found that RDL2 scores had a correlation of .925 with the cloze scores, while the correlation for FKGL was .85. The authors concluded that RDL2 was as successful as traditional readability formulas in predicting L2 reading comprehension. In the limitations section, however, the authors acknowledge that cloze tests are not the ideal proficiency measure, as they evaluate word- and sentence-level understanding, similar to FKGL measures; therefore, cloze test scores may actually be more highly correlated with FKGL. Thus, the authors call for further studies that rely on more traditional comprehension assessments, such as the reading of a text with objective comprehension questions (e.g., selected-response test items), rather than cloze tests, to gain a more accurate understanding of L2 reading comprehension.

In sum, the literature on the reliability of readability tools in predicting L2 reading comprehension is limited and lacks consensus, particularly in exploring the role of cohesion in L2 reading. Additionally, the few studies available all make use of cloze testing as a means of measuring reading comprehension, a procedure which has been identified as problematic when examining readability tool predictions. The current research aims to fill this gap in the literature by being one of the first studies to use traditional reading comprehension assessments, as opposed to cloze tests, as indicators of readability tools' predictive power.
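The sketch below illustrates only the co-referentiality idea described above: the proportion of adjacent sentence pairs that repeat a content word. It is not Coh-Metrix's implementation; the regex tokenizer, the tiny stop list, and the example sentences are simplifying assumptions standing in for Coh-Metrix's part-of-speech tagging and noun-overlap measures.

```python
import re

# A tiny illustrative stop list; Coh-Metrix uses part-of-speech tagging to
# isolate content words (nouns in particular), which this sketch only approximates.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "is", "are", "in", "to", "it"}

def content_words(sentence: str) -> set:
    """Lowercase alphabetic tokens that are not in the stop list."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def adjacent_overlap(sentences: list) -> float:
    """Proportion of adjacent sentence pairs sharing at least one content word."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if content_words(a) & content_words(b))
    return shared / len(pairs)

text = [
    "Glaciers carve valleys as they move.",
    "The valleys left behind are often U-shaped.",
    "Rivers, by contrast, cut narrow channels.",
]
print(adjacent_overlap(text))  # 0.5: only the first pair repeats a content word ("valleys")
```

A more cohesive version of the same passage (one that kept repeating its key nouns from sentence to sentence) would score closer to 1.0, which is the kind of discourse-level signal traditional formulas such as FKGL cannot capture.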
Research Question Addressed

The literature on reading comprehension is extensive and broad, yet for L2 reading, there exists a less comprehensive understanding of what facilitates readers'
comprehension beyond L2 lexical and syntactic proficiency. The L1 reading literature is in agreement that cohesion facilitates comprehension for low-knowledge readers and, to some extent, recall for all readers. However, there is little research on the extension of these effects of cohesion into L2 reading. The present study builds upon the existing literature investigating cohesion as a source of text difficulty through the use of readability tools for predicting L2 reading comprehension. The guiding research question for this study is:

RQ 1: Does a readability tool analyzing cohesion predict L2 learner comprehension across proficiency levels better than a traditional readability formula, as assessed by a traditional reading comprehension assessment?

This study fills a gap in the literature by using traditional reading comprehension assessment tasks (which include a passage and comprehension questions based on it) rather than cloze tasks, which may have biased previous results towards tools that only measure lexical and grammatical complexity.
Research Methodology

Data Collection Procedures

Following the call from Crossley et al. (2008), comprehension scores from a non-cloze reading assessment were used for analysis in this study. The scores were taken from a set of reading assessments used by an English language program at a major southeastern university. The assessments were written and designed by the language program faculty so that every test has a specific and uniform structure: students first read an expository text of 300–500 words taken from a level-appropriate L2 textbook and then answer objective, multiple-choice and short-answer questions covering various reading skills. Each test is standardized at ten questions worth a total of ten points. There are 15 tests available for analysis in the current study. The language program has five proficiency levels (beginner to high-intermediate) and three tests intended to be of equal difficulty for each level. The rationale for having three different tests per level is that if students do not pass the course at the end of a term, they can repeat the level two additional times without taking the same reading assessment twice.

Scores on these 15 tests were gathered from a university database over a two-year period, resulting in 1,131 reading comprehension scores from 569 English L2 students (378 males, mean age: 20.1; 191 females, mean age: 20.1) with varying L1 backgrounds, including Arabic, Chinese, French, Hindi, Japanese, Korean, Nepali, Portuguese, Spanish, Thai, Turkish, and Vietnamese. The distribution of these scores by level in the language program and by test is summarized in Table 8.1.
TABLE 8.1 Total Reading Comprehension Scores Collected and Mean (M) Accuplacer© Proficiency Scores (Max Score = 120)

Level        Test A: Scores Collected (M)   Test B: Scores Collected (M)   Test C: Scores Collected (M)   Total: Scores Collected (M)
2            56 (48.4)                      57 (47.5)                      42 (49.0)                      155 (48.21)
3            64 (65.4)                      75 (64.1)                      66 (65.3)                      205 (64.85)
4            71 (85.2)                      85 (82.4)                      86 (83.3)                      242 (82.34)
5            94 (93.5)                      99 (94.5)                      107 (95.1)                     300 (94.48)
6            76 (104.7)                     83 (104.4)                     70 (102.4)                     229 (103.92)
Grand Total                                                                                               1,131
Since these scores were collected from a language program, most students have more than one test score represented in the data set. For instance, students who did not pass their class took a different test at the same level the following term, while students who progressed took a test at the next level. Despite this, scores can be compared within levels, as all students taking a given test are of the same proficiency level. The language program ensures that all students within a reading class are of the same proficiency level by using an independent placement exam (the Accuplacer© ESL reading test, College Board), which is taken when students first enter the program and again at the end of each term to determine progression. Mean Accuplacer© reading test scores for all students at each testing time are presented in Table 8.1.
Data Analysis Procedures

Within each of the five proficiency levels, the comprehension scores on the three tests used at that level were gathered and compared to determine whether any of the three tests received significantly higher or lower comprehension scores, which would indicate that the text used in the test was either too difficult or too easy for the students at that level. Next, the readability predictions from both RDL2 and FKGL were calculated for each of the 15 texts used in the tests. As a reminder, RDL2 measures: (1) CELEX word frequency; (2) syntactic similarity; and (3) content word overlap. FKGL measures only two components of a text: (1) the number of syllables per word and (2) the number of words per sentence. These factors combine into a grade-level score predicting readability according to the US grade school system. Finally, the readability predictions were compared to the comprehension score analysis to determine whether one or both tools were reliable in predicting the differences in reading comprehension scores.
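The within-level comparison described above (and reported in the next section) takes the form of a one-way ANOVA followed by Tukey post-hoc comparisons. The sketch below shows how such an analysis could be run for a single proficiency level using SciPy and statsmodels; the variable names and the short placeholder score lists are invented for illustration, not the study's data, and the original analysis may have been conducted in other statistical software.

```python
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder comprehension scores (max = 10) for the three tests at one level;
# the real data set contains far more scores per level.
test_a = [7, 6, 8, 5, 7, 6, 9, 6]
test_b = [6, 7, 7, 8, 6, 7, 8, 7]
test_c = [5, 4, 6, 5, 6, 4, 5, 6]

# Omnibus one-way ANOVA across the three tests within the level.
f_stat, p_value = f_oneway(test_a, test_b, test_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey post-hoc comparisons to locate which test(s) differ
# (statsmodels handles unequal group sizes, i.e., the Tukey-Kramer case).
scores = test_a + test_b + test_c
labels = ["A"] * len(test_a) + ["B"] * len(test_b) + ["C"] * len(test_c)
print(pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05))
```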
Findings and Discussion

The comprehension scores on the reading assessments for all 15 tests are reported by level. The internal consistency of the scores was similar across tests, with alpha values ranging from .55 to .62. The average score and standard deviation by test and level are given in Table 8.2 below.

In order to determine whether comprehension on tests within a level varied, a one-way analysis of variance (ANOVA) with a significance threshold of .05 was conducted by level, such that the comprehension scores for each of the three tests were compared to one another. Results of the ANOVA are discussed below. As a reminder, all tests were identical in the types of questions asked, with each test worth a maximum of 10 points. Additionally, all scores for each test represent students who were confirmed to be of the same proficiency level through a separate proficiency measure, allowing for comparison across tests within levels.

In levels two and five, there were no significant differences in comprehension scores across the three tests (p = .211 and p = .701, respectively), indicating students comprehended the material in all three of the texts approximately equally. Within levels three, four, and six, however, comprehension scores differed significantly on at least one of the tests. Post-hoc analyses using the Tukey-Kramer comparison found that mean comprehension scores for test 3B were significantly higher than for test 3A (p < .0001) and 3C (p = .0009), while comprehension scores for tests 3A and 3C did not differ from one another (p = .573). At level four, test 4C yielded significantly lower comprehension scores than did test 4A (p = .017) and test 4B (p = .018), while 4A and 4B had no significant differences (p = .959). For level six, test 6C also had significantly lower scores than tests 6A (p = .002) and 6B (p < .001), but 6A and 6B did not differ from each other (p = .366).

Thus, the analyses show that comprehension among all of the tests within levels three, four, and six is not equal. Since the test developers had run item analyses for each test, there were no outlier questions skewing the data. Also, the proficiency of participants was independently controlled through a placement exam, ensuring that students in each level were of the same proficiency.
TABLE 8.2 Mean Comprehension Scores (Max = 10) by Tests and Levels

Level   Test A M (SD)   Test B M (SD)   Test C M (SD)
2       6.59 (2.22)     6.33 (1.94)     5.79 (3.41)
3       5.82 (2.36)     7.14 (2.33)     5.42 (2.26)
4       6.47 (1.74)     6.40 (1.81)     5.67 (1.94)
5       6.47 (2.11)     6.28 (1.74)     6.42 (1.93)
6       7.11 (1.78)     7.51 (1.95)     5.99 (1.82)
This limits the possibility of participant reading ability skewing the results. Therefore, we turn to the readability of the passages used in the tests to address the variability in comprehension scores across tests.

In order to investigate readability as the possible source of incongruity in comprehension performance within levels, readability predictions from both Coh-Metrix (RDL2) and the Flesch-Kincaid Grade Level Index (FKGL) are reported by level and by test. RDL2 scores range from 0–40, and lower scores indicate more difficult texts; FKGL has a narrower range (1–12), and higher scores indicate more difficult texts. Predictions from both tools for each test are presented in Table 8.3 below.

Of central interest to this analysis is whether one or both readability tools reliably predict the actual learner reading comprehension scores on each test. In previous studies, Pearson's correlation was used to inferentially demonstrate the relationship between the two variables (that is, the readability predictions and learner comprehension scores) on data sets of at least 30 tests. However, the analysis in the current study involves only three tests per level, which is not a large enough sample to use Pearson's correlation reliably. Therefore, this study carefully analyzes the raw means and variance descriptively and makes suggestions for future research. This analysis looks at the set of three texts for each proficiency level and compares the readability predictions from both RDL2 and FKGL to comprehension score differences on the same tests. The comparison is made by measuring the distance between predictions; greater distances between two tests' predictions indicate a larger predicted difference in ease or difficulty. The distances between readability predictions for pairs of tests within levels are presented in Table 8.4.
TABLE 8.3 Readability Scores for Each Test by Coh-Metrix (RDL2) and Flesch-Kincaid (FKGL)

Level     Index   Test A    Test B    Test C
Level 2   RDL2    25.312    28.933    27.476
          FKGL    6.563     5.114     6.212
Level 3   RDL2    21.193    28.717    20.092
          FKGL    6.188     7.841     8.619
Level 4   RDL2    17.944    21.395    12.683
          FKGL    9.176     7.169     8.043
Level 5   RDL2    16.158    17.529    21.325
          FKGL    10.767    9.292     8.426
Level 6   RDL2    12.959    10.862    5.06
          FKGL    11.919    9.468     11.605
TABLE 8.4 Comparison of RDL2 Predictions of Difficulty and Student Comprehension

Level   Tests   Difference in RDL2 Readability   Difference in FKGL Readability   p-value from Comprehension Scores
2       A/C     -2.164                           -0.351                           p = .185
2       B/A     3.621                            1.449                            p = .803
2       C/B     -1.457                           -1.098                           p = .185
3       A/C     1.101                            2.431                            p = .573
3       B/A     7.524                            -1.653                           p < .001
3       C/B     -8.625                           -0.778                           p < .001
4       A/C     5.261                            -1.133                           p = .018
4       B/A     3.451                            2.007                            p = .952
4       C/B     -8.712                           -0.874                           p = .018
5       A/C     -3.167                           -2.341                           p = .945
5       B/A     2.371                            1.475                            p = .694
5       C/B     0.796                            0.866                            p = .852
6       A/C     7.899                            -0.314                           p < .001
6       B/A     -2.097                           2.451                            p = .366
6       C/B     -5.802                           -2.137                           p < .001
It is important to recall that lower RDL2 scores indicate more difficult texts; in other words, a negative distance in Table 8.4 indicates that RDL2 predicts the first test in the pair to be a more difficult text than the other text at that level. For example, it can be seen in Table 8.4 above that when comparing Tests A/C for Level 2, the distance in RDL2 score is -2.164, indicating that Test 2A is more difficult than Test 2C. Conversely, lower FKGL scores indicate easier texts, so a negative distance means that a text is relatively easier than another text at that level. For example, the FKGL distance between Tests A and C for Level 2 is -0.351, meaning FKGL predicts Test 2A to be easier than Test 2C. It is also important to note that the range of RDL2 is 0–35 and that of FKGL is 1–12; therefore, it is not possible to compare the size of differences across tools, only the relative size of distances within RDL2 scores alone or within FKGL scores alone. Finally, it should be noted where the two tools' predictions differ (e.g., for Tests A/C at Level 2, RDL2 indicates that Test 2A is the harder test whereas FKGL predicts that Test 2A is the easier test).
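Because the two scales run in opposite directions, it can help to encode the interpretation rules explicitly. The sketch below applies the rules just stated (lower RDL2 = harder; higher FKGL = harder) to a pair of tests and reports whether the tools agree; the helper functions and scores are invented placeholders for illustration, not values taken from Tables 8.3 or 8.4.

```python
def harder_by_rdl2(score_1: float, score_2: float) -> str:
    # Lower RDL2 scores indicate more difficult texts.
    return "test 1" if score_1 < score_2 else "test 2"

def harder_by_fkgl(grade_1: float, grade_2: float) -> str:
    # Higher FKGL grade levels indicate more difficult texts.
    return "test 1" if grade_1 > grade_2 else "test 2"

# Invented example scores for two tests at the same level.
rdl2_verdict = harder_by_rdl2(score_1=18.2, score_2=24.9)
fkgl_verdict = harder_by_fkgl(grade_1=7.1, grade_2=8.4)

print("RDL2 says the harder text is", rdl2_verdict)   # test 1 (lower RDL2)
print("FKGL says the harder text is", fkgl_verdict)   # test 2 (higher FKGL)
print("Tools agree" if rdl2_verdict == fkgl_verdict else "Tools disagree")
```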
Texts with the largest differences in readability predictions within RDL2 or within FKGL (i.e., tests predicted to be notably easier or harder than the others at their level) are of particular interest, as is what happened to learners' comprehension scores on those tests. To examine this, Table 8.4 pairs the readability differences with the results of the ANOVA conducted on the comprehension scores for these tests (Table 8.2), discussed above. For a readability tool to be considered reliable, larger distances between tests' readability predictions should be paired with significant comprehension score differences (p < .05).

For the RDL2 scores presented in Table 8.4, every pair of tests whose predictions differ by more than five points was also found to have significantly different comprehension scores. In other words, if RDL2 predicted a test to be more difficult, the test received significantly lower comprehension scores, and a test predicted to be easier received higher comprehension scores. Test 4C's RDL2 prediction is 8.7 points lower than 4B's and 5.26 points lower than 4A's; comprehension scores mirror these results, with 4C having lower comprehension scores (Mean = 5.67) than 4A (Mean = 6.47; p = .018) and 4B (Mean = 6.40; p = .018). Test 6C is 5.8 points lower in RDL2 prediction than 6B and 7.9 points lower than 6A. Again, comprehension scores for 6C (Mean = 5.99) are lower than for 6B (Mean = 7.51; p < .001) and 6A (Mean = 7.11; p = .002).