Contemporary Second Language Assessment Volume 4: Contemporary Applied Linguistics 9780567632746, 9781474295055, 9780567147066

Includes chapters on key aspects of second language assessment such as test construct, diagnosis, exam design, and the g

English Pages [329] Year 2016

Table of contents:
Cover page
Half-title page
Series page
Title page
Copyright page
CONTENTS
Introduction Jayanti Banerjee and Dina Tsagari
Preamble
Recent Activity in Language Testing and Assessment
Contemporary Issues in Language Assessment
References
PART ONE Key Theoretical Considerations
1 Modelling Communicative Language Ability: Skills and Contexts of Language Use
Modelling Communicative Language Ability: Skills and Contexts of Language Use
Design of Study
Results
Discussion
Conclusion
References
2 Presenting Validity Evidence: The Case of the GESE
Introduction
Current Thinking on Validity
Evidencing Validity
A Case Study
Methodology
Findings
Limitations of the Study
Follow-up Studies
Benefits of Using a Validation Framework
Conclusion
References
APPENDICES
Appendix 2. Summary of Panelists’ Comments on the GESE According to the Validation Model
3 A New National Exam: A Case of Washback
Introduction
Study
Results
Summary and Discussion
Conclusion
References
Appendix 1: Questionnaire
4 Linking to the CEFR: Validation Using a Priori and a Posteriori Evidence
Research Background
A Priori Validation
A Posteriori Validation
Discussion: Advantages and New Developments
Conclusion
References
PART TWO Assessment of Specific Language Aspects
5 Authentic Texts in the Assessment of L2 Listening Ability
Overview of Authenticity in L2 Listening Assessment
How Scripted Spoken Texts and Unscripted Spoken Texts Differ
The Types of Spoken Texts Used in L2 Listening Assessment
How the Type of Spoken Text Might Affect Test-Taker Performance
The Study
Construct Implications of Using Scripted and Unscripted Spoken Texts
Possible Washback Effect on Learners
Suggestions for Integrating Unscripted Spoken Texts into L2 Listening Tests
Areas for Future Research
Conclusion
References
6 The Role of Background Factors in the Diagnosis of SFL Reading Ability
Diagnosing SFL Reading
The DIALUKI Project
Review of Previous Research into Learners’ Backgrounds
Description of the Background Questionnaires
Findings
Discussion
Conclusion
References
Appendix 1
Appendix 2
7 Understanding Oral Proficiency Task Difficulty: An Activity Theory Perspective
Introduction
Defining Task
Task Difficulty in Second Language Acquisition and Language Testing
Activity Theory
The Study
Results
Discussion
Conclusion
References
Appendix 1
8 Principled Rubric Adoption and Adaptation: One Multi-method Case Study
Introduction
Literature Review
Research Context
Methodology
Results
Discussion
Conclusion
Notes
References
Appendix A. Category Measures, Fit Statistics, Separation Values for the Baseline Study
Appendix B. Levels Contrasts for the Original and Revised Rubrics
9 The Social Dimension of Language Assessment: Assessing Pragmatics
Testing of Second Language Pragmatics
Construct of Second Language Pragmatics
Tests of Second Language Pragmatics
Sample Test Project
Conclusion
References
PART THREE Issues in Second Language Assessment
10 Applications of Corpus Linguistics in Language Assessment
Introduction
Focal Study
Discussion
Conclusion
References
11 Setting Language Standards for International Medical Graduates
Introduction
Overview of Standard Setting Procedures
Research Design
Materials Used in the Study
Stakeholder Focus Groups (SFGs)
Results and Discussion of Findings from Initial Panels
Final Panel Deliberations and Recommendations
Discussion of Problems Encountered in the Standard-setting Procedures
General Medical Council’s Response to Recommendations
Conclusion: The Way Forward
References
12 Fairness and Bias in Language Assessment
Fairness in Testing
Differential Item Functioning
DIF Research in Language Assessment Literature
Some Problems with DIF Analysis
Profile Analysis
Applying Profile Analysis
Conclusion
Note
References
13 ACCESS for ELLs: Access to the Curriculum
Introduction
A Federal Mandate to Adequately Support ELLs
What Is Academic Language?
How Is Academic Language Represented in the WIDA ELD Standards?
How Is Academic Language Operationalized in the WIDA Assessment System?
What Research Supports the Use of ACCESS for ELLs?
Continual Refinement: Research and Development for an Online Speaking Test
Conclusion
Notes
References
14 Using Technology in Language Assessment
Using Technology in Language Assessment
Automated Scoring
Automated Measurements of Text Characteristics
Automated Scoring in AutoTutor CSAL
AutoTutor and Learning-oriented Assessment
Conclusion
Author Note
References
15 Future Prospects and Challenges in Language Assessments
Introduction
Purposes and Contexts
Constructs and Criteria
Formats and Methods
Advances in Quantitative Methods of Analysis
Agency in Assessment
Uses and Consequences
Concluding Remarks
References
INDEX

Contemporary Second Language Assessment


Contemporary Applied Linguistics
Series Editor: Li Wei, Birkbeck College, London, UK

Contemporary Applied Linguistics Volume 1: Language Teaching and Learning, edited by Vivian Cook and Li Wei
Contemporary Applied Linguistics Volume 2: Linguistics for the Real World, edited by Vivian Cook and Li Wei
Discourse in Context: Contemporary Applied Linguistics Volume 3, edited by John Flowerdew
Cultural Memory of Language: Contemporary Applied Linguistics Volume 5, Susan Samata


Contemporary Second Language Assessment
Contemporary Applied Linguistics Volume 4

Jayanti Banerjee and Dina Tsagari

Bloomsbury Academic An imprint of Bloomsbury Publishing Plc

LONDON • OXFORD • NEW YORK • NEW DELHI • SYDNEY


Bloomsbury Academic
An imprint of Bloomsbury Publishing Plc

50 Bedford Square, London, WC1B 3DP, UK
1385 Broadway, New York, NY 10018, USA

www.bloomsbury.com

BLOOMSBURY and the Diana logo are trademarks of Bloomsbury Publishing Plc

First published 2016

© Jayanti Banerjee and Dina Tsagari and Contributors, 2016

Jayanti Banerjee and Dina Tsagari have asserted their right under the Copyright, Designs and Patents Act, 1988, to be identified as the Authors of this work.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers.

No responsibility for loss caused to any individual or organization acting on or refraining from action as a result of the material in this publication can be accepted by Bloomsbury or the authors.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: HB: 978-0-5676-3274-6
ePDF: 978-0-5671-4706-6
ePub: 978-0-5676-2871-8

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

Series: Contemporary Applied Linguistics

Typeset by RefineCatch Limited, Bungay, Suffolk


CONTENTS

Introduction 1
Jayanti Banerjee and Dina Tsagari

Part One: Key Theoretical Considerations 15

1 Modelling Communicative Language Ability: Skills and Contexts of Language Use 17
Lin Gu

2 Presenting Validity Evidence: The Case of the GESE 37
Elaine Boyd and Cathy Taylor

3 A New National Exam: A Case of Washback 61
Doris Froetscher

4 Linking to the CEFR: Validation Using a Priori and a Posteriori Evidence 83
John H.A.L. de Jong and Ying Zheng

Part Two: Assessment of Specific Language Aspects 101

5 Authentic Texts in the Assessment of L2 Listening Ability 103
Elvis Wagner

6 The Role of Background Factors in the Diagnosis of SFL Reading Ability 125
Ari Huhta, J. Charles Alderson, Lea Nieminen, and Riikka Ullakonoja

7 Understanding Oral Proficiency Task Difficulty: An Activity Theory Perspective 147
Zhengdong Gan

8 Principled Rubric Adoption and Adaptation: One Multi-method Case Study 165
Valerie Meier, Jonathan Trace, and Gerriet Janssen

9 The Social Dimension of Language Assessment: Assessing Pragmatics 189
Carsten Roever

Part Three: Issues in Second Language Assessment 207

10 Applications of Corpus Linguistics in Language Assessment 209
Sara Cushing Weigle and Sarah Goodwin

11 Setting Language Standards for International Medical Graduates 225
Vivien Berry and Barry O'Sullivan

12 Fairness and Bias in Language Assessment 243
Norman Verhelst, Jayanti Banerjee, and Patrick McLain

13 ACCESS for ELLs: Access to the Curriculum 261
Jennifer Norton and Carsten Wilmes

14 Using Technology in Language Assessment 281
Haiying Li, Keith T. Shubeck, and Arthur C. Graesser

15 Future Prospects and Challenges in Language Assessments 299
Sauli Takala, Gudrun Erickson, Neus Figueras, and Jan-Eric Gustafsson

Index 317

Introduction
Jayanti Banerjee and Dina Tsagari

Preamble

In a ten-year review of research activity in the field of language testing and assessment, Alderson & Banerjee (2001, 2002) comment on the volume and variety of work that had taken place since the previous review. They say that, “[t]he field has become so large and so active that it is virtually impossible to do justice to it . . . and it is changing so rapidly that any prediction of trends is likely to be outdated before it is printed” (2001, p. 213). In the intervening fifteen years the challenge of capturing the essence of language testing and assessment enquiry has only increased. The International Language Testing Association (ILTA, http://www.iltaonline.com) maintains a bibliography of publications in the field. The most recent iteration of this bibliography (Brunfaut, 2014) catalogs research published in the period 1999–2014 and comprises 4,820 items, a more than seven-fold increase from the first bibliography published in 1999 (Banerjee et al., 1999). At the beginning of the twenty-first century, Alderson & Banerjee (2001, 2002) identified a number of new and perennial concerns. The effect of a test on teaching and learning as well as on education policy, textbooks, and other resources (i.e., test washback) was a growing area. Standards of practice in language assessment were of great interest. The Association of Language Testers in Europe (ALTE, http://www.alte.org) published a Code of Practice (ALTE, 2010) in order to ensure comparable levels of quality among language tests prepared by their members, and ILTA adopted a Code of Ethics to guide professional conduct (ILTA, 2000). A new framework for language learning, teaching and assessment—the Common European Framework of Reference for Languages: learning, teaching, assessment (CEFR, Council of Europe, 2001)—had just been published and its influence had not yet been fully felt. There was increasing standardization of test design and use, particularly the establishment of national examination boards in contexts where there previously had been none. At the same time, researchers and practitioners began exploring the ethics and politics of language testing, asking questions about the appropriateness of tests as well as the effect of politics at the national and local level on the design and practice of testing (see Alderson, 2009).

Among the perennial concerns were those of construct, test validation, and test development in the different language skill areas of listening, reading, writing, and speaking as well as for different test-taker groups such as young learners. Alderson & Banerjee (2001, 2002) noted that despite the work already done, there was still much to be understood about how a correct answer to a question might be interpreted in terms of what the test taker has actually comprehended and can do in the language. They closed their review by looking at issues that they believed would “preoccup[y] the field . . . for some time to come” (2002, p. 98). One of these issues was authenticity, both in terms of how similar a test task is to the task that might be encountered in real life (situational authenticity) and how much the task requires the test taker to use language competences that would be engaged in real life (interactional authenticity) (see Bachman, 1991). Another was the question of how tests might be designed so that the scores can be interpreted and explained to score users. This entails understanding what a test taker’s performance on a test means in terms of what he/ she can do with the language. In the last decade or so, collaborations between second language acquisition (SLA ) and language assessment researchers have added particularly to our understanding of the language features typical at different levels of language proficiency as elicited by writing tasks (e.g., Banerjee et al., 2007; Gyllstad et al., 2014), but recent work by Alderson et al. (2013) shows how difficult it is to disentangle the construct components underlying reading ability. Indeed, this remains an area where there are more questions than answers, a veritable “black box.” A final issue was the activities related to test validation. At the time of Alderson & Banerjee’s (2002) survey there was an energetic debate about validation and the extent to which test developers provide adequate validation evidence for their tests. Since the survey, Kane’s (1992) argument-based approach to validation has gained considerable traction in the field and, indeed, is a core reference for many of the chapters in this collection.

Recent Activity in Language Testing and Assessment

Apart from the ascendance of Kane’s (1992) argument-based approach to validation, what are the other recent trends in language testing and assessment? Perhaps one indication of the trends is the topics of the special issues in the main language assessment journals: Language Testing and Language Assessment Quarterly. The areas covered have been very varied and include the testing of language for specific purposes (such as Aviation English), alternative assessment, task-based language assessment, translating and adapting tests, teacher assessment, and automated scoring and feedback. The twenty-fifth anniversary issue of Language Testing focused a great deal on the teaching of language testing and assessment (textbooks and courses). Indeed, language assessment literacy was also the subject of a special issue of the journal in 2013. There have also been special issues that focus on geographical regions, such as language testing in Europe and in Asia as well as in specific countries such as Taiwan, Australia, New Zealand, and Canada. This suggests that regional and national issues are very important and is one of the reasons why we made an effort to include contributors from Europe, Asia, Latin America, and the United States. Even in this effort, we were constrained by the availability of potential contributors, and we regret not being able to include voices from the Middle East or Africa, a gap we hope future volumes are able to fill.

Contemporary Issues in Language Assessment

So, given the breadth and range of topics that have engaged and still perplex the field of language testing and assessment, how does a volume like this proceed in its selection of topics and concerns to focus on? We set ourselves the objective of preparing a collection of articles that give an overview of an important aspect of the field and report on an empirical study. Each of the resulting chapters offers a case study of contemporary issues in the field of second language assessment. We have delimited three main foci, each with a dedicated section. Part One, comprising four chapters and titled “Key Theoretical Considerations,” is on enduring considerations in the field, asking the questions: What are we testing, how do we know we are testing it, what is the effect of the test on teaching and learning, and what is the relationship between the test and established language frameworks? In the first chapter, Lin Gu addresses the question of what we are testing by discussing the need for tests that are well grounded in applied linguistics and educational measurement theories. Her chapter “Modelling Communicative Language Ability: Skills and Contexts of Language Use” takes as a starting point the dominant view of language ability in the last thirty-five years, communicative language ability (Canale & Swain, 1980). Gu comments that research to date has focused on the skills-based aspects of Canale & Swain’s (1980) model and has not incorporated context and the interaction between skills and the contexts of language use. Gu’s research fills this gap. Her chapter examines the dimensionality of the TOEFL iBT® test using confirmatory factor analysis, investigating the role of cognitive skill and the context of language use (defined by whether the context was instructional or non-instructional). Gu’s results were constrained by the context definitions applied; the influence of context on test performance was not as strong as the influence of skills. It is possible that a more detailed task analysis that accounted for more context variables (e.g., formality, social distance between interlocutors) would have been even more revealing. However, this study takes an important step forward in the area of construct definition. It demonstrates that it is insufficient to place tasks in different language use contexts. Rather, the unique characteristics of tasks in different language use contexts must be fully understood and carefully replicated to ensure that context-related language abilities can be reliably identified. These results contribute important insights into both the nature of communicative language ability and the practice of item development. In Chapter 2, Elaine Boyd and Cathy Taylor take an explicit ethical stance with respect to the test taker; they emphasize the test developer’s responsibility to ensure that a test is accurate in its claims about a test taker’s language proficiency. Their chapter, entitled “Presenting Validity Evidence: The Case of the GESE,” addresses the question of how we support our claims about what we are testing. Boyd and Taylor provide an overview of current views of validity and also the process of test validation. They make the important point that test validation consists of gathering
evidence for the validity of a test and presenting that evidence in the form of an argument (Kane, 1992). They also tackle head-on the resourcing constraints faced by small language assessment organizations and show how it is possible to assemble validity evidence even with limited resources. Boyd and Taylor’s case study of the Graded Exams in Spoken English (GESE, Trinity College) is a very small-scale investigation of the construct of the exam in relation to the Canale & Swain (1980) framework of communicative language ability. It utilizes O’Sullivan’s (2011) reconceptualization of Weir’s (2005) sociocognitive framework to carefully analyze completed tests at each of the GESE’s twelve grade levels. The aim was to find evidence of each of the competences described in Canale & Swain’s (1980) model: linguistic, sociolinguistic, strategic, and discourse competences. Despite the constraints of this study, Boyd and Taylor report that the investigation identified many subsequent studies, which together provide substantial validity evidence for the GESE. They also suggest how contemporary learner corpora allow new insights into how validity can be evidenced. Chapter 3 considers the effect of the test on teaching and learning (test washback and impact). At the turn of the twenty-first century, this area was relatively new and growing fast (see Alderson & Banerjee, 2001). There is now considerable evidence of how tests affect the content of teaching and the nature of teaching materials (Alderson & Hamp-Lyons, 1996; Tsagari, 2009; Watanabe, 1996). There is also some evidence of what test takers study (Abbasabady, 2009; Watanabe, 2001), but mixed evidence of whether tests affect how teachers teach (Cheng, 2005; Glover, 2006). This is the theoretical context in which Doris Froetscher situates her chapter titled “A New National Exam: A Case of Washback.” Froetscher administered questionnaires to teachers of English, French, Italian, and Spanish in order to better understand the effect of a reformed secondary school-leaving examination on teaching and learning in Austrian classrooms. Her results showed considerable positive and negative washback: for example, teachers mirror the methods used and the skills included in the school-leaving exam, and make heavy use of past papers. Froetscher’s findings were constrained by her data-collection methodology; questionnaires provide self-report information, but a direct observational approach would be useful to confirm and clarify the claims that teachers make in their questionnaire responses. This is among the many suggestions that Froetscher makes for future washback studies. Nevertheless, her findings make a powerful contribution to existing calls for teacher training in language testing and assessment (Vogt & Tsagari, 2014).
They also provide a posteriori evidence from a CEFR linking exercise as well as concordances with other measures of English language proficiency. Their approach is among the most thorough available in the field today.
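To make the idea of a score concordance concrete, the sketch below shows one common way such links are derived: equipercentile linking, in which a score on one measure is mapped to the score on another measure that has the same percentile rank. The function and the toy data are illustrative assumptions only; this is not the procedure or data used for PTE Academic.

    import numpy as np

    def equipercentile_concordance(scores_x, scores_y, x_points):
        """Map selected scores on test X to the scale of test Y by matching
        percentile ranks in two (hypothetical) samples of observed scores."""
        x = np.sort(np.asarray(scores_x, dtype=float))
        y = np.sort(np.asarray(scores_y, dtype=float))
        table = {}
        for point in x_points:
            rank = np.mean(x <= point) * 100.0            # percentile rank of the X score
            table[point] = float(np.percentile(y, rank))  # Y score at the same rank
        return table

    # Toy example: five test takers who took both measures
    print(equipercentile_concordance([42, 55, 61, 70, 88], [31, 40, 47, 52, 65], [55, 70]))

Operational concordance studies use much larger samples and smoothing, but the underlying logic is the same: scores are linked through their relative standing in a common group of test takers.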

The second section of the volume uses the assessment of different language skills (such as listening or writing) to explore new theoretical or methodological developments in the field. Part Two is titled “Assessment of Specific Language Aspects” and comprises five chapters. The first—Chapter 5 in the volume—has been written by Elvis Wagner and is titled “Authentic Texts in the Assessment of Second Language (L2) Listening Ability.” Wagner’s research is situated in the broader discussion of authenticity in language assessments (Bachman, 1991). This is a characteristic that is considered central to any test of communicative language ability, but its operationalization is open to question. Lewkowicz (1997, 2000) points out that even experienced teachers and linguists are unable to accurately identify authentic texts from those that have been written specifically for a test. She also finds that test takers are less concerned with test authenticity than are testing theorists. In this chapter, Wagner acknowledges the fact that listening tests are not real-life language events. In his view, however, this does not mean that listening input materials cannot reflect the characteristics of real-life aural language (i.e., authentic listening material). He describes the differences between scripted and unscripted spoken texts, pointing out that both occur in real-life aural language. Wagner also presents the theoretical and pedagogical arguments for using primarily scripted texts for language instruction. However, he argues that the exclusive use of scripted texts leaves language learners poorly prepared for real-life language situations. He advocates the use of both scripted and unscripted texts as appropriate for the different target language use situations. For instance, an academic lecture exemplifies a scripted aural text while a discussion between friends of a film they saw the night before exemplifies an unscripted aural text. Wagner then argues that language testing organizations primarily use scripted texts in their listening tests. To make his point, Wagner undertakes an analysis of the spoken texts used on four high-stakes tests of English for academic purposes. His analysis shows that most of these tests use scripted spoken texts. In Wagner’s view, the choice of the types of spoken texts affects the construct validity of those tests and can have a washback effect on learners, teachers, and even second language acquisition research. The chapter concludes with practical suggestions for how to integrate unscripted spoken texts into listening tests. Chapter 6 is situated in the broader discussion of the diagnosis of language ability. Arguably the act of diagnosing language ability is poorly understood. Alderson (2005) points out that it is often confused with placement decisions. However, if a diagnosis of language ability is to provide useful and usable information about a language learner’s strengths and weaknesses in the language, then the focus of the diagnosis needs to be much better understood. This was the explicit aim of the DIALANG project (http://www.lancaster.ac.uk/researchenterprise/dialang/about), a language diagnosis system that is aligned with the CEFR and that can provide test takers with information about what they can and cannot currently do in any of fourteen European languages. Ari Huhta, J. Charles Alderson, Lea Nieminen, and Riikka Ullakonoja pick up this discussion. 
In their chapter, titled “The Role of Background Factors in the Diagnosis of SFL Reading Ability,” they consider how second or foreign language reading ability might be predicted by a range of linguistic, cognitive, motivational, and background factors. The authors argue that in order to better diagnose second/foreign language (SFL) reading abilities, it is important to fully understand how those abilities develop. This understanding is perhaps best
reached by examining the context in which language learners develop their SFL reading abilities, that is, their home environment, their parents’ reading practices, and their own reading practices. This information can be matched to the language learners’ performance on reading tests (measures of their SFL reading ability). Huhta et al.’s data is taken from the DIALUKI project (https://www.jyu.fi/hum/laitokset/ solki/tutkimus/projektit/dialuki/en) and covers both L2 Finnish and foreign language (FL ) English learners in Finland. The L2 Finnish learners were typically the children of Russian immigrants, while the FL English learners were typically Finnish as a first language (L1) learners of English. The findings presented in this chapter were complex and fascinating. Parental background alone did not explain much of the variance in the learners’ SFL reading ability, but factors like age of arrival in Finland had an important effect on progress in their SFL reading. Huhta et al. acknowledge, however, that background factors such as those explored in this chapter are less influential in the prediction of SFL reading ability than other components of reading, such as first language and motivation to learn. Chapter  7 by Zhengdong Gan is titled “Understanding Oral Proficiency Task Difficulty: An Activity Theory Perspective.” Task difficulty is a major concern in language assessment; researchers and practitioners are particularly interested in the variables that influence task difficulty (Brindley & Slatyer, 2002; Cho et al., 2013; Elder et al., 2002; Filipi, 2012) as this might help test developers to make a priori task difficulty predictions. Gan uses activity theory to explore the difficulty of a speaking test. Activity theory is an ethnographic approach to research that makes actions and the motivations for those actions explicit. The participants in the research reflect concretely on their feelings, their understandings, and their motivation (see Roth, 2005). Gan’s research participants are two trainee English language teachers in Hong Kong. Both are speakers of English as a second language (ESL ). As part of their teacher training, they are required to pass the Language Proficiency Assessment for Teachers (LPAT ). On their first attempt, they failed the test. The study reported in this chapter follows the trainee teachers as they prepare for their second attempt. Gan takes his lead from Commons & Pekker (2006), who comment: “If we could understand why people fail at tasks, we might be able to offer a better framework for success” (p. 3). Gan therefore studies these two test takers as they reflect on the reasons for their failure. He also documents the steps they take to improve their spoken English. By doing so, Gan both uncovers the very different reasons why each test taker failed the LPAT as well as demonstrates the insights into task difficulty to be gained from activity theory. Chapter  8 addresses rating scale development and use (the subject of a special issue of Assessing Writing in October 2015). Discussions of scale design are relatively uncommon (Banerjee et  al., 2015); this chapter by Valerie Meier, Jonathan Trace, and Gerriet Janssen, titled “Principled Rubric Adoption and Adaptation: One Multimethod Case Study,” therefore makes a positive contribution to our understanding of how a rating scale functions as well as how scales can be revised so they are better fit for purpose. 
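For orientation, the many-facet Rasch measurement model on which such rubric analyses rest is commonly written (in its rating scale form) as:

    \ln\!\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = B_n - D_i - C_j - F_k

where P_nijk is the probability that examinee n receives category k rather than k-1 on scoring criterion i from rater j, B_n is the examinee's ability, D_i the difficulty of the criterion, C_j the severity of the rater, and F_k the difficulty of the step up to category k. This is a standard formulation of the model, given here for reference; it is not an equation reproduced from the chapter.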
Meier et al.’s starting point is the well-known and widely used analytic rating scale developed by Jacobs et al. (1981) that had been adopted for a placement exam at a Colombian university. A baseline multifaceted Rasch measurement (MFRM) study revealed that the rating scale was too complex to function optimally. Consequently, Meier et al. embarked on a series of actions intended to improve the efficacy of the rating scale in the context in which it was being used. In a first step, Meier et al. combined score points so that each scoring category had fewer performance levels. They then revised the scoring descriptors to correspond to the new score points, working with raters embedded in the testing context to develop scoring descriptors that were meaningful and usable. The chapter investigates how well these changes have worked and shows the benefits of combining both empirical data and expert intuition in rating scale development. Chapter 9, the final chapter in Part Two, is grounded in research into the assessment of second language pragmatics. Though an important feature of Canale & Swain’s (1980) model of communicative language ability, pragmatics has proved very difficult to isolate and to test (Plough & Johnson, 2007). It is very difficult to assess pragmatic competence without constructing a task that is rich in context. Such tasks, however, are long and can be impractical in a testing situation. Consequently, pragmatic competence is often mentioned only in the rating scales for speaking assessments. In recent years, however, important strides forward have been made—notably work by Grabowski (2009, 2013) and Walters (2004, 2013). Carsten Roever’s chapter, titled “The Social Dimension of Language Assessment: Assessing Pragmatics,” gives an overview of current approaches in the assessment of second language pragmatics. He explains the construct of second language pragmatics and its roots in theories of social interaction, such as interactional sociolinguistics and conversation analysis. Roever then presents the validity evidence for a web-based test of pragmatics that includes multiturn discourse completion tasks as well as appropriateness judgments and single-turn items. Adopting the prevailing approach to test validation—the argument-based approach (Kane, 1992)—Roever provides evidence for all but one of the inferences in the interpretive argument (domain description, evaluation, generalization, explanation, and extrapolation). Evidence for the utilization inference is not provided because the test is not currently used for decision-making. He shows that it is possible to test second language sociopragmatics in a relatively efficient manner, but he says that questions remain about the use of pragmatics tests. Such tests remain resource intensive, and serious questions must be asked about their ability to add value to existing testing instruments and their ability to capture real-world pragmatic competence. The final section of the volume, Part Three, is entitled “Issues in Second Language Assessment.” The aim of this section is to identify current trends in the field, aspects of language testing and assessment that are likely to engage researchers and practitioners in the years to come. Part Three contains six chapters. Five of these focus on a specific assessment issue while the final chapter in the volume broadens the discussion by looking into the future at the opportunities and challenges that might arise going forward. But we risk getting ahead of ourselves. Part Three begins with a chapter (Chapter 10 in the volume) by Sara Cushing Weigle and Sarah Goodwin. Titled “Applications of Corpus Linguistics in Language Assessment,” this chapter addresses the recent explosion in the use of corpora in language assessment.
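As a minimal illustration of the kind of frequency analysis such corpus work relies on, the sketch below counts recurring multiword sequences (here, trigrams) across a small set of essays. The texts and the threshold are invented for illustration and have nothing to do with the corpus analyzed in the chapter.

    from collections import Counter

    def multiword_units(essays, n=3, min_count=2):
        """Count n-word sequences across a list of essay strings and keep
        those that recur at least min_count times."""
        counts = Counter()
        for essay in essays:
            tokens = essay.lower().split()
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
        return {seq: c for seq, c in counts.items() if c >= min_count}

    essays = [
        "on the other hand the results suggest that the policy failed",
        "the results suggest that more research is needed on the other hand",
    ]
    print(multiword_units(essays))

Dedicated tools such as AntConc add concordancing and comparison against reference corpora on top of counts like these.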
Enabled by the growth in and availability of corpus tools such as AntConc (Anthony, 2014) and Coh-Metrix (McNamara et al., 2014), corpora can be used much more efficiently and effectively in test development and test validation. Weigle and Goodwin’s main focus is the analysis of a corpus of essays from the Georgia State Test of English Proficiency (the GSTEP), a test of academic English proficiency. They are concerned with the use of multiword units (also referred to as formulaic sequences and formulae) in the essays, specifically whether the occurrence and use of multiword sequences is different in essays written by test takers that have been judged to be more or less proficient. Their premise was that successful language learners (and therefore, the test takers that should receive high scores on the GSTEP) are more likely to use multiword units that are common in academic discourse. Conversely, less successful language learners (i.e., the test takers that receive low scores on the GSTEP) are less likely to use academic English formulae and are also more likely to copy multiword units from the source material provided. Though the corpus was relatively small (332 essays), Weigle and Goodwin’s results were in line with previous research (Weigle & Parker, 2012), showing that high-scoring essays did indeed use more academic formulae, while low-scoring essays were more likely to include multiword units copied from the source material. This provides validity evidence for the scoring of the GSTEP; that is, evidence that the essay task and the rating process are able to capture differences in test-taker proficiency as measured by the use of multiword units. As such, it is an excellent case study of how corpus tools can be used for test validation. Chapter 11, by Vivien Berry and Barry O’Sullivan, is situated in the contested area of test use. Typically, tests are designed with a particular purpose. For instance, the International English Language Testing System (IELTS™) was originally designed as a test of English for academic purposes. Its main use was intended to be as an indicator of English language proficiency for individuals seeking admission to English-medium universities—generally in the United Kingdom, Australia, and the United States. However, as Fulcher & Davidson (2009) state, tests can come to be used “for purposes other than those intended in the original designs” (p. 123). Fulcher & Davidson (2009) question changes in test use in particular, as this creates difficulties in the interpretation of test scores. Additionally, McNamara & Roever (2006) discuss the use of language tests as part of immigration requirements. They acknowledge the influence of language proficiency on employment outcomes, but they worry that language tests might become tools for excluding immigrants rather than a safeguard against long-term unemployment in the host country. The governing concern in Berry and O’Sullivan’s chapter is that of patient safety when patients are in the care of international medical graduates. Titled “Setting Language Standards for International Medical Graduates,” the chapter describes a project undertaken on behalf of the British General Medical Council (GMC), which gathered evidence on the use of the IELTS™ as an indicator of language ability for International Medical Graduates (IMGs) wishing to work in the United Kingdom. Though not expressly designed for this use, the IELTS™ is widely available around the world and can be taken prior to the IMG’s arrival in Britain. It is intended to function as a measure of language proficiency only and is not a substitute for measures of medical skills; these are assessed by the Professional and Linguistic Assessments Board (PLAB, http://www.gmc-uk.org/doctors/plab.asp).
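Standard-setting exercises of the kind described next typically ask each panelist to judge the score a minimally competent candidate would need, and then aggregate those judgments into a recommended cut score. The sketch below is a toy aggregation with invented panel names and numbers; it is not the procedure or data from Berry and O'Sullivan's study.

    from statistics import mean, stdev

    # Hypothetical panel judgments: each panelist's view of the minimum
    # acceptable overall band score for a newly arrived doctor.
    judgments = {
        "patients_panel":    [7.0, 7.5, 7.5, 8.0],
        "clinicians_panel":  [7.0, 7.0, 7.5, 7.5, 8.0],
        "medical_directors": [7.5, 7.5, 8.0],
    }

    for panel, ratings in judgments.items():
        print(f"{panel}: recommended cut = {mean(ratings):.2f} "
              f"(spread = {stdev(ratings):.2f} across {len(ratings)} panelists)")

    overall = mean(r for ratings in judgments.values() for r in ratings)
    print(f"pooled recommendation: {overall:.2f}")

In practice such a recommendation is only one input; as the chapter shows, impact data and policy considerations also shape the cut score that is finally adopted.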
Berry and O’Sullivan describe the procedures used, working with different stakeholder panels (patients, health professionals, and medical directors) to identify the minimally acceptable IELTS™ score for a newly arrived IMG. They also explain the difficulties that arise in such an endeavor. For instance, the panel comprising patients was very uncomfortable with the concept of a minimally competent candidate. Panel members conflated the concepts of English language proficiency and medical skills. The panels also had difficulty relating the
IELTS™ tasks to the real-life language use of doctors. This, perhaps, gets to the nub of the problem when tests are used in contexts for which they were not originally designed or intended. The results of Berry and O’Sullivan’s work are also very interesting, for they demonstrate the interaction between empirical data and politics in decision-making. The cut-scores that were finally set by the General Medical Council appeared to be strongly influenced by impact data and the practices of medical councils in other parts of the English-speaking world, perhaps a convergence to the norm. The next chapter (Chapter 12), by Norman Verhelst, Jayanti Banerjee, and Patrick McLain, titled “Fairness and Bias in Language Assessment,” is grounded in questions of test fairness and bias. The Standards for Educational and Psychological Testing (2014, henceforth referred to as Standards) lists twenty-one fairness standards with the central aim of “identify[ing] and remov[ing] construct-irrelevant barriers” (p. 63). Fairness in an exam should be safeguarded from the start, before an item is written. Most testing organizations have fairness and bias guidelines that inform item writing and review as well as the test compilation process. These guidelines ensure that the content of the test is not derogatory or upsetting for test takers and enjoin test developers to avoid assessing construct-irrelevant skills. However, the existence of and adherence to fairness guidelines does not guarantee that a test is fair. Post-hoc fairness investigations typically use analyses such as Differential Item Functioning (DIF) to explore possible sources of test bias. Verhelst et al. use a generalized form of DIF called profile analysis (PADIF). PADIF is a relatively new approach to DIF analysis and offers a number of benefits. First, it is neutral with respect to test construct; it does not assume that all the items measure the target construct of the test. Second, it allows for more than two groups to be defined. This makes the approach more flexible than the better-known approaches. Verhelst et al. demonstrate how PADIF can be applied both to an artificial dataset and to real data from the Michigan English Test (MET®), a multilevel test of general English language proficiency. Their chapter raises important questions, however, about how the results of DIF analyses can be used (a minimal illustration of a classical DIF statistic is sketched at the end of this paragraph). The growth in immigration around the world has resulted in an increased demand for language support services for children whose first (or home) language is not the language of mainstream education. In the United Kingdom, research with children with English as an additional language (EALs) has highlighted the need for teacher training so that teachers do not implicitly apply native-speaker norms when judging the performance and progress of EALs (Leung & Teasdale, 1996). In the United States, the progress of English language learners (ELLs)—a growing sector of the K–12 context—is monitored through a variety of measures (Llosa, 2011, 2012). Avenia-Tapper & Llosa (2015) take research in this area a step further by analyzing the linguistic features of content-area assessments. Their aim is to highlight the inequities created for ELLs when the content-area assessments use language that is too complex, thus introducing construct-irrelevant language variables into the assessment of content knowledge. Chapter 13 is situated in this rich area of theory and practice.
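As promised above, here is a generic illustration of post-hoc DIF screening in its most classical form, the Mantel-Haenszel common odds ratio, which compares item performance for two groups matched on total score. It is an illustration only, not the PADIF method used by Verhelst et al., and the data layout is hypothetical.

    import numpy as np

    def mantel_haenszel_odds_ratio(item_correct, reference_group, total_score):
        """Common odds ratio for one item, stratified by matched total score.
        item_correct: 0/1 responses; reference_group: True for the reference
        group, False for the focal group; total_score: the matching variable."""
        item_correct = np.asarray(item_correct)
        reference_group = np.asarray(reference_group, dtype=bool)
        total_score = np.asarray(total_score)
        num = den = 0.0
        for k in np.unique(total_score):
            stratum = total_score == k
            a = np.sum(stratum & reference_group & (item_correct == 1))
            b = np.sum(stratum & reference_group & (item_correct == 0))
            c = np.sum(stratum & ~reference_group & (item_correct == 1))
            d = np.sum(stratum & ~reference_group & (item_correct == 0))
            t = a + b + c + d
            if t > 0:
                num += a * d / t
                den += b * c / t
        return num / den if den else float("nan")

A value near 1 indicates comparable item functioning for matched test takers; values far from 1 flag the item for review. As described above, PADIF generalizes this kind of screening by working with score profiles and allowing more than two groups.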
Written by Jennifer Norton and Carsten Wilmes and titled “ACCESS for ELLs: Access to the Curriculum,” the chapter describes the research and development of the ACCESS test, a test of English language proficiency designed to identify the academic English language needs of ELLs at various stages of readiness for engagement with the mainstream curriculum. Norton and Wilmes present a major refinement to the test, the online delivery of the speaking test. They describe the use of cognitive labs to validate the new delivery mode. Despite the disruptions that can occur when computer systems malfunction and the challenges of ensuring that all the test takers have the computer familiarity necessary to take a computer-based test, Norton and Wilmes are very positive about the value of innovative technologies in language assessment. Technology is the main theme of the penultimate chapter in the volume (Chapter 14). Written by Haiying Li, Keith T. Shubeck, and Arthur C. Graesser and titled “Using Technology in Language Assessment,” the chapter looks at the use of intelligent systems and automated measurements of text characteristics such as Coh-Metrix (McNamara et al., 2014) in the assessment functions of intelligent tutoring systems (ITS). They describe and demonstrate the functionality of AutoTutor, an ITS that has been developed to help adults learn and practice their reading. The system comprises a teacher agent and a student agent, both of whom interact with the human learner. The system is designed to evaluate and respond to the human learner’s answers as well as to encourage the learner to expand on or develop their answers. If the human learner struggles with a particular text (i.e., gets less than 70 percent of the answers correct), then the next text presented will be easier. In contrast, if the human learner performs well on a text, then the next text presented will be more difficult (a toy version of this adaptation rule is sketched at the end of this section). In this way, AutoTutor can be situated within a relatively new area of language assessment research, learning-oriented assessment (LOA; Turner & Purpura, 2016). Carless (2006) describes LOA as opportunities for learning during assessment activities. Students are involved either as peer evaluators or in self-evaluation. In this way, the on-the-way checks that teachers engage in during a lesson and explicit assessments of language progress are simultaneously assessments and opportunities for further language learning. AutoTutor provides an excellent example of how technology can be used to involve learners in learning and to provide the automated mini-checks that can enable appropriate targeting of new language learning challenges. As we have already said, the final chapter (15) in the volume looks into the future, a possibly Sisyphean task given the range of activity in the field. In their chapter “Future Prospects and Challenges in Language Assessments,” Sauli Takala, Gudrun Erickson, Neus Figueras, and Jan-Eric Gustafsson reflect on the state of language assessment and discuss the possible next steps that the field needs to take. They structure their wide-ranging chapter according to the purposes and contexts of language tests, their underlying constructs, the different formats and methods available and on the horizon, the agents of language assessments, and the uses and consequences of tests. Among the many issues that they believe will be important going into the future, they urge language testers not to forget the less easily operationalized language competencies such as reflection, critical analysis, and problem solving. They also believe that more attention needs to be given to the agency of students in language learning and assessment. Finally, they emphasize the responsibility of all test users when making assessment decisions.
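To make the AutoTutor text-adaptation rule described above concrete, here is a toy version in code. The 70 percent threshold follows the figure mentioned in the chapter summary; the level scale, scoring, and step size are invented for illustration and are not taken from the system's implementation.

    def next_text_level(current_level, n_correct, n_questions,
                        min_level=1, max_level=5, threshold=0.70):
        """Move to an easier text after a weak performance, a harder one
        after a strong performance (toy version of the adaptation rule)."""
        proportion_correct = n_correct / n_questions
        if proportion_correct < threshold:
            return max(min_level, current_level - 1)   # struggled: easier text
        return min(max_level, current_level + 1)       # did well: harder text

    # A learner at level 3 answers 5 of 8 questions correctly (62.5% < 70%),
    # so the next text comes from level 2.
    print(next_text_level(3, 5, 8))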
In summary, this book represents the best of current practice in second language assessment, and, as a one-volume reference, is invaluable for students and researchers looking for material that extends their understanding of the field. Given its scope and nature, the collection aspires to become a primary source of enrichment material for a wide audience, such as postgraduate students, scholars and researchers, professional educational organizations, educational policy makers and administrators, language testing organizations and test developers, language teachers and teacher trainers, and special/general educators and school psychologists. We hope that the key aspects covered in the volume and the epistemological dialog between language assessment and neighboring disciplines ensure that this volume becomes essential reading for anyone interested in the testing and assessment of aspects of language competence, applied linguistics, language study, language teaching, special education, and psychology.

References

Abbasabady, M. M. (2009). The Reported Activities and Beliefs of the Students Preparing for the Specialised English Test (SPE) (Doctoral thesis). Lancaster University, UK.
Alderson, J. C. (2005). Diagnosing Foreign Language Proficiency. London, UK: Continuum.
Alderson, J. C. (2009). The Politics of Language Education: Individuals and Institutions. Bristol, UK: Multilingual Matters.
Alderson, J. C., & Banerjee, J. (2001). Language testing and assessment (Part 1). Language Teaching, 34(4), 213–236.
Alderson, J. C., & Banerjee, J. (2002). Language testing and assessment (Part 2). Language Teaching, 35(2), 79–113.
Alderson, J. C., & Hamp-Lyons, L. (1996). TOEFL preparation courses: A study of washback. Language Testing, 13(3), 280–297.
Alderson, J. C., & Kremmel, B. (2013). Re-examining the content validation of a grammar test: The (im)possibility of distinguishing vocabulary and structural knowledge. Language Testing, 30(4), 535–556.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA, APA, NCME). (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Anthony, L. (2014). AntConc (Version 3.4.3) [Computer Software]. Tokyo, Japan: Waseda University. Available from: http://www.laurenceanthony.net/. [9 August 2015].
Association of Language Testers in Europe. (2010). The ALTE Code of Practice. Available from: http://www.alte.org/attachments/files/code_practice_eng.pdf. [4 August 2015].
Avenia-Tapper, B., & Llosa, L. (2015). Construct relevant or irrelevant? The role of linguistic complexity in the assessment of English language learners’ science knowledge. Educational Assessment, 20(2), 95–111.
Bachman, L. F. (1991). What does language testing have to offer? TESOL Quarterly, 25(4), 671–704.
Banerjee, J., Clapham, C., Clapham, P., & Wall, D. (1999). ILTA Language Testing Bibliography 1990–1999. Lancaster, UK: Centre for Research in Language Education (CRILE), Lancaster University.
Banerjee, J., Franceschina, F., & Smith, A. M. (2007). Documenting features of written language production typical at different IELTS band score levels. In S. Walsh (Ed.), IELTS Research Report Volume 7. Canberra, Australia: British Council and IDP: IELTS Australia, pp. 241–309.
Banerjee, J., Yan, X., Chapman, M., & Elliott, H. (2015). Keeping up with the times: Revising and refreshing a rating scale. Assessing Writing, 26, 5–19.
Brindley, G., & Slatyer, H. (2002). Exploring task difficulty in ESL listening assessment. Language Testing, 19(4), 369–394.
Brunfaut, T. (2014). ILTA Bibliography of Language Testing 1999–2014 by Category. Birmingham, AL: International Language Testing Association (ILTA).
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Carless, D. (2006). Learning-oriented assessment: Conceptual bases and practical implications. Innovations in Education and Teaching International, 44(1), 57–66.
Cheng, L. (2005). Changing Language Teaching through Language Testing: A Washback Study. Cambridge, UK: Cambridge University Press.
Cho, Y., Rijmen, F., & Novák, J. (2013). Investigating the effects of prompt characteristics on the comparability of TOEFL iBT™ integrated writing tasks. Language Testing, 30(4), 513–534.
Commons, M., & Pekker, A. (2006). Hierarchical Complexity and Task Difficulty. Available from: http://www.dareassociation.org/Papers/HierarchicalComplexityandDifficulty01122006b.doc. [8 August 2015].
Council of Europe. (2001). Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge, UK: Cambridge University Press.
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer? Language Testing, 19(4), 347–368.
Filipi, A. (2012). Do questions written in the target language make foreign language listening comprehension tests more difficult? Language Testing, 29(4), 511–532.
Fulcher, G., & Davidson, F. (2009). Test architecture, test retrofit. Language Testing, 26(1), 123–144.
Glover, P. (2006). Examination Influence on How Teachers Teach: A Study of Teacher Talk (Doctoral thesis). Lancaster University, UK.
Grabowski, K. (2009). Investigating the Construct Validity of a Test Designed to Measure Grammatical and Pragmatic Knowledge in the Context of Speaking (Doctoral thesis). Teachers College, Columbia University, New York, NY.
Grabowski, K. (2013). Investigating the construct validity of a role-play test designed to measure grammatical and pragmatic knowledge at multiple proficiency levels. In S. Ross & G. Kasper (Eds.), Assessing Second Language Pragmatics. New York, NY: Palgrave Macmillan, pp. 149–171.
Gyllstad, H., Granfeldt, J., Bernardini, P., & Källkvist, M. (2014). Linguistic correlates to communicative proficiency levels of the CEFR: The case of syntactic complexity in written L2 English, L3 French and L4 Italian. EUROSLA Yearbook, 14, 1–30.
International Language Testing Association. (2000). Code of Ethics. Available from: http://www.iltaonline.com/images/pdfs/ilta_code.pdf. [4 August 2015].
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hugley, J. (1981). Testing ESL Composition: A Practical Approach. Rowley, MA: Newbury House.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535.
Leung, C., & Teasdale, A. (1996). English as an additional language within the national curriculum: A study of assessment practices. Prospect, 12(2), 58–68.
Lewkowicz, J. A. (1997). Investigating Authenticity in Language Testing (Doctoral thesis). Lancaster University, UK.
Lewkowicz, J. A. (2000). Authenticity in language testing: some outstanding questions. Language Testing, 17(1), 43–64.
Llosa, L. (2011). Standards-based classroom assessments of English proficiency: A review of issues, current developments, and future directions for research. Language Testing, 28(3), 367–382.
Llosa, L. (2012). Assessing English learners’ progress: Longitudinal invariance of a standards-based classroom assessment of English proficiency. Language Assessment Quarterly, 9(4), 331–347.
McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated Evaluation of Text and Discourse with Coh-Metrix. New York, NY: Cambridge University Press.
McNamara, T., & Roever, C. (2006). Language Testing: The Social Dimension. Malden, MA: Blackwell Publishing.
O’Sullivan, B. (2011). Language testing. In J. Simpson (Ed.), The Routledge Handbook of Applied Linguistics. Oxford, UK: Routledge, pp. 259–273.
Plough, I., & Johnson, J. (2007). Validation of an expanded definition of the grammar construct. Paper presented at the annual Language Testing Research Colloquium (LTRC), Barcelona, Spain.
Roth, W.-M. (2005). Doing Qualitative Research: Praxis of Method. Rotterdam, The Netherlands: Sense Publishers.
Tsagari, D. (2009). The Complexity of Test Washback (Language Testing and Evaluation Volume 15). Frankfurt am Main: Peter Lang.
Turner, C. E., & Purpura, J. E. (2016). Learning-oriented assessment in second and foreign language classrooms. In D. Tsagari & J. Banerjee (Eds.), Handbook of Second Language Assessment. Berlin, Germany: De Gruyter Mouton, pp. 255–273.
Vogt, K., & Tsagari, D. (2014). Assessment literacy of foreign language teachers: Findings of a European study. Language Assessment Quarterly, 11(4), 374–402.
Walters, F. S. (2004). An Application of Conversation Analysis to the Development of a Test of Second Language Pragmatic Competence (Doctoral thesis). University of Illinois, Urbana-Champaign.
Walters, F. S. (2013). Interfaces between a discourse completion test and a conversation analysis-informed test of L2 pragmatic competence. In S. Ross & G. Kasper (Eds.), Assessing Second Language Pragmatics. New York, NY: Palgrave Macmillan, pp. 172–195.
Watanabe, Y. (1996). Does grammar translation come from the entrance examination? Preliminary findings from classroom-based research. Language Testing, 13(3), 318–333.
Watanabe, Y. (2001). Does the university entrance examination motivate learners? A case study of learner interviews. In Akita Association of English Studies (Ed.), Trans-Equator Exchanges: A Collection of Academic Papers in Honour of Professor David Ingram. Akita City, Japan: Faculty of Education and Human Studies, Akita University, pp. 100–110.
Weigle, S. C., & Parker, K. (2012). Source text borrowing in an integrated reading/writing assessment. Journal of Second Language Writing, 21(2), 118–133.
Weir, C. J. (2005). Language Testing and Validation: An Evidence-Based Approach. London, UK: Palgrave Macmillan.


PART ONE

Key Theoretical Considerations


1 Modelling Communicative Language Ability: Skills and Contexts of Language Use
Lin Gu

ABSTRACT

Language ability is such a complex construct with multiple components that it remains unclear what the makeup of this ability is or how the constituent parts interact. The current view in applied linguistics conceptualizes language ability as communicative in nature. In this study, I demonstrate how our understanding of communicative language ability can be enhanced and verified by examining the dimensionality of a language test. Situated in the context of the internet-based Test of English as a Foreign Language (TOEFL iBT®), confirmatory factor analysis was performed to investigate how skill-based cognitive capacities that operate internally within individual language users and context-dependent capacities of language use interact in a communicative competence framework. The results contribute an important insight into the nature of communicative language ability. The outcome of investigating the internal structure of the test also has implications for test item development.

Modelling Communicative Language Ability: Skills and Contexts of Language Use

The Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 2014; hereafter referred to as Standards)
regards validity as the most fundamental consideration in developing and evaluating tests. Messick (1989) defined validity as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (p. 13; original emphases in italics). As pointed out by Bachman (1990), this unitary view of validity has been “endorsed by the measurement profession as a whole” (p. 236). A crucial aspect of this unitary validity concept is the centrality of construct validation. The Standards (AERA et al., 2014) define construct as “the concept or characteristic that a test is designed to measure” (p. 11) and highlight the importance of providing a conceptual framework that delineates the construct for a test. Hence, the centrepiece of a validation inquiry is to develop a framework that specifies the nature of the construct based on both theoretical and empirical grounds. As language testing researchers attempt to determine what constitutes the construct of language ability, the field has witnessed rapid growth in theoretical positions and empirical techniques. The current view in applied linguistics conceptualizes language ability as communicative in nature (Canale, 1983). The definition of communicative language ability (CLA) encompasses both skill-based cognitive capacities of individual language users and their ability to respond to the language demands of specific contexts (Chapelle et al., 1997). Empirical investigations into the nature of language ability have focused mainly on cognitive language skills and the relationships among them, and have led to a profusion of skill-based language ability models (for a comprehensive review of language ability models, see Gu, 2011). The most researched cognitive skills are a language user’s ability to listen, read, write, and speak. The common approach adopted by these researchers is to conduct factor analysis to examine the dimensionality of a test. The dimensionality (or internal structure) of a test refers to the latent factorial structure that underlies observed test performance. The latent factor structure of a test cannot be directly observed, and is represented by observed scores (also called indicators) on the test. As explained by Chapelle (1999), a dimensionality analysis examines the fit of empirical test performance to a psychometric model based on a relevant construct theory. The results of such an analysis allow us to evaluate the degree to which an empirically derived representation of the construct based on test performance reflects the proposed theoretical configuration of the construct the test intends to measure. The consensus reached by the research community is that language ability is multidimensional as opposed to unitary, although it is still unclear what this ability construct consists of and how the constituent parts interact (Chalhoub-Deville, 1997; Kunnan, 1998). Missing from these skill-based models are context-embedded ability components; that is, the ability to interact with the context in which communication is situated. Hymes (1972) highlighted the interactive nature of language use in his interpretation of communicative competence and identified eight features that can be used to establish a language use context: setting, participants, ends, act sequence, key, instrumentalities, norms of interaction and interpretation, and genre.
Canale & Swain (1980) and Canale (1983) included the competence of understanding and responding to the social context in which language is used in their communicative competence framework. The relationship between context and the cognitive competences residing within individual learners was also part of the language ability model proposed by Bachman (1990). Although contexts of language use are integral to the conceptualization of CLA, context-dependent language abilities do not appear to have been included in any of the previous empirical investigations. In summary, the current understanding of the nature of CLA is limited because language ability models established thus far are largely skill-based. Excluded from these skill-based models is the context in which communication is situated. The nature of context-embedded language abilities and the relationships among them are largely unknown due to the scarcity of empirical inquiries. To advance our understanding of CLA, in the current study, I investigated the role of both cognitive skill and context of language use in defining CLA by examining the latent structure underlying the performances on the TOEFL iBT® test. The reasons for choosing this test are explained below. First, the theoretical reasoning behind the development of the test reflects the current thinking in applied linguistics, which views language ability as being communicative in nature. The purpose of the TOEFL iBT® test is to measure communicative language proficiency in academic settings (Chapelle et al., 2008). The proposed construct model makes an explicit distinction between the context of language use and the internal capacities of individual language users, and portrays the relationships between the two as dynamic and integrated, as it is believed that "the features of the context call on specific capacities defined within the internal operations" (Chapelle et al., 1997, p. 4). This understanding of language ability in relation to its context of use sets the theoretical foundation for the TOEFL iBT® test development framework (Jamieson et al., 2000). Second, both cognitive skills and contexts of language use are clearly specified in the design of the TOEFL iBT® test. The test adopts a four-skills approach to test design, and each task measures one of the four skills, namely, listening, reading, writing, and speaking. Each test task is also designed with one of the two language use situations in mind, one relating to university life and the other pertaining to academic studies. This situation-based approach to test design "acknowledges the complexities and interrelatedness of features of language and contexts in communicative language proficiency" (Chapelle et al., 1997, p. 26). The interplay between skills and contexts is reflected in both the conceptualization and design of the test. Therefore, performance on the test should enable us to empirically examine the nature of CLA. Situated in the context of TOEFL iBT® testing, in this study, I investigated how skill-based cognitive capacities and context-embedded capacities of language use interact in a communicative competence framework by examining the internal structure of the TOEFL iBT® test. The research question to be answered was: To what extent do cognitive skill and language use context play a role in defining CLA?

Design of Study

Study Sample

TOEFL iBT® public dataset Form A and its associated test performance from 1,000 test takers were used for this study. The average age of these test takers
was 24 at the time of testing. The majority of the test takers (about 86%) were between the ages of 15 and 30. They were evenly distributed across gender and were from 98 countries or regions. A total of 66 different native languages were represented in the sample, and the most frequently spoken native languages in order of the number of their speakers were Chinese, Korean, Japanese, Spanish, and Arabic. Native speakers of these five languages made up about 59 percent of the total sample.

Instrument

The test instrument used in this study was an operational form of the TOEFL iBT® test administered during the fall of 2006. This test consisted of seventeen language tasks, organized into four sections by skill modality. There were six listening tasks, three reading tasks, six speaking tasks, and two writing tasks. The listening and reading tasks required selected responses. The tasks in the speaking and writing sections elicited constructed responses. In the context of TOEFL iBT® testing, the language use situation is characterized by five variables: participants, content, setting, purpose, and register (Jamieson et al., 2000). The contexts of language use in these tasks are described below. The first listening task (L1) was situated in a conversation between a male student and a female biology professor at the professor's office. The participants talked mainly about how to prepare for an upcoming test. The nature of the interaction was consultative with frequent turns. The second listening task (L2) presented part of a lecture delivered by a male professor in an art history class. The language used by the professor was formal. There was no interaction between the professor and the audience. The third listening task (L3) presented part of a lecture in a meteorology class in a formal classroom setting. Three participants were involved: one male professor, one male student, and one female student. The language used was formal with periods of short interaction between the professor and the students. The fourth listening task (L4) was situated in a conversation between a female student and a male employee at the university housing office. They spoke about housing opportunities on and off campus. The nature of the interaction was consultative with relatively short turns. The fifth listening task (L5) presented part of a lecture in an education class in a formal classroom setting. The participants were a female professor and a male student. The language used was formal with periods of short interaction between the professor and the student. The last listening task (L6) presented part of a lecture in an environmental science class in a formal classroom setting. The male professor was the only participant. The language used was formal with no interaction between the professor and the audience. All three reading tasks were based on academic content. Test takers were asked to respond to questions based on reading passages on the topic of psychology in the first task (R1), archeology in the second task (R2), and biology in the third task (R3). The language used in the readings was formal in register. There were six tasks in the speaking section. The first two tasks were independent in that the completion of the tasks did not relate to information received through other modalities of language use (e.g., listening). The remaining four tasks were
integrated tasks which required the use of multiple skills. Speaking task one (S1) and two (S2) asked test takers to orally respond to a prompt relating to university life. The third speaking task (S3) had a reading and a listening component. The topic was the university’s plan to renovate the library. The reading component required test takers to read a short article in the student newspaper about the change. The listening component involved a conversation between a male and a female student discussing their opinions about this renovation plan. The two participants interacted with each other frequently with relatively short turns. The fourth speaking task (S4) also had a reading and a listening component. The reading component required test takers to read an article from a biology textbook. The listening component presented part of a lecture on the same topic delivered by a male professor in a formal classroom setting. The language used in this task was formal, and there was no interaction between the professor and the audience. The fifth speaking task (S5) contained a listening component, which was situated in a conversation between a male and a female professor in an office setting. The focus of their dialogue was related to the class requirements for a student. The register of the language was consultative in nature with frequent turns between the two participants. The last speaking task (S6) had a listening component, which presented part of a lecture delivered by a female anthropology professor in a classroom setting. The language used was formal with no interaction between the professor and the audience. In the writing section, the first task (W1) was integrated with reading and listening. Test takers were first asked to read a passage about an academic topic in biology and then to listen to part of a lecture on the same topic delivered by a male professor in a classroom setting. There was no interaction between the professor and his audience during the entire listening time. The second task (W2), an independent writing task, asked test takers to write on a non-academic topic based on personal knowledge and experience. As illustrated above, context of language use could be identified at the task level. Individual items within a task shared a common language use context that was either related to academic studies or university life. Tasks related to academic studies were situated in instructional contexts. The language acts were specified or assumed to take place inside classrooms. The content of these tasks was mainly based on scholarly or textbook articles in the realm of natural and social sciences. In this type of setting, interactions among the participants, when applicable, were usually sporadic and the language used was academically oriented and formal. Tasks that were associated with university life involved topics related to courses or campus life. The development of these tasks did not rely on information in a particular academic area. These tasks took place outside classrooms, in other words, in non-instructional contexts. In this type of setting, interactions were usually frequent and the language used tended to be less formal. This study investigated whether context, when examined together with skill, influences test performance. To this end, all seventeen tasks were categorized to be either in an instructional context or a non-instructional context. There were ten tasks in the first category, and seven in the second category. 
Table 1.1 summarizes the test tasks by skill and by context of language use.

TABLE 1.1 Categorizing the Tasks by Skill and Context of Language Use

| Task | Skill | Context of Language Use |
| --- | --- | --- |
| L1 | Listening | Non-instructional |
| L2 | Listening | Instructional |
| L3 | Listening | Instructional |
| L4 | Listening | Non-instructional |
| L5 | Listening | Instructional |
| L6 | Listening | Instructional |
| R1 | Reading | Instructional |
| R2 | Reading | Instructional |
| R3 | Reading | Instructional |
| S1 | Speaking | Non-instructional |
| S2 | Speaking | Non-instructional |
| S3 | Speaking | Non-instructional |
| S4 | Speaking | Instructional |
| S5 | Speaking | Non-instructional |
| S6 | Speaking | Instructional |
| W1 | Writing | Instructional |
| W2 | Writing | Non-instructional |

Analysis

Analyses were performed by using Mplus (Muthén & Muthén, 2010). Raw task scores were used as the level of measure. Each listening task had a prompt followed by five or six selected response questions. Each reading task had a prompt followed by twelve to fourteen selected response questions. The majority of the listening and reading items were dichotomously scored, except for three reading items that were worth two or three points. For the listening and reading sections, a task score was the total score summed across a set of items based on a common prompt. A task score in the writing and speaking sections was simply the score assigned for a task. Speaking tasks were scored on a scale of 0–4 and writing tasks were scored on a scale of 0–5. Using task scores instead of item scores
allowed all variables to be treated as continuous. Using this level of measure, as Stricker & Rock (2008) suggested, would also help alleviate the problem caused by the dependence among items associated with a common prompt. All task scores were treated as continuous variables. There were seventeen observed variables in the study. A confirmatory factor analysis (CFA ) approach to multitrait-multimethod (MTMM ) data (Widaman, 1985) was adopted in this study. CFA is commonly used to investigate the impact of latent ability factors on test performance by performing model specification at the onset of the analysis. The original purpose of a CFA approach to MTMM data was to investigate the influences of both trait and test method (e.g., multiple-choice questions, short answer questions, etc.) on test performance. This methodology was employed in this study to examine the influences of both skill and context of language use on test performance. The first step during the analysis was to establish a baseline model with both skill factors and contextual factors. Each language task was designed to measure one of the four skills: listening, reading, speaking, and writing. Results from previous factor analytic studies based on the TOEFL iBT ® test (Gu, 2014; Sawaki et  al., 2009; Stricker et al., 2008) showed that the relationships among the four skills could be captured by three competing models: a correlated four-factor model, a higher-order factor model, and a correlated two-factor model. In addition, language tasks were also classified into two categories by context: instructional and non-instructional. In this study, it was hypothesized that the two context factors related to each other because both were associated with higher-education in an English medium environment. The two context factors were added to the skill-based models. Three competing baseline models were therefore obtained in which each language task was specified to load on two factors, a skill-based factor and a context-based factor. In other words, each model proposed that performance on a task was accounted for by the skill being measured as well as the context in which language use was situated. Figure 1.1 shows the first competing baseline model. The relationships among the skills are summarized by a correlated four-factor structure in which tasks in each skill section load on their respective skill factors, namely, listening, (L), reading (R), speaking, (S), and writing (W), and the four skills correlate with each other. The second competing baseline model is shown in Figure 1.2, in which items in each skill section load on their respective skill factors and the four skills are independent, conditional on a higher-order general factor (G); that is, the correlations among the skills are explained by G. The third competing baseline model, as illustrated in Figure  1.3, has two correlated skill factors. Tasks in the listening, reading, and writing sections load on a common latent factor (L/R/W), and tasks in the speaking section load on a second factor (S). In all three models, in addition to their respective skill factor, each task also loads on one of two correlated context factors, instructional or non-instructional. The three competing models were fitted to the data. The model that had the best model-data fit was chosen as the baseline model. Using Chi-Square difference tests, the fit of the baseline model was compared to the fit of alternative models in the following steps.
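To make the specification of the third competing baseline model more concrete, the sketch below writes it out in lavaan-style syntax and fits it with the Python package semopy. This is an illustration only: the original analysis was run in Mplus with the MLM estimator, the file name task_scores.csv is hypothetical, and a strict correlated trait–correlated method setup would additionally constrain the skill–context covariances, a detail omitted here for brevity.

```python
# Illustrative sketch (not the authors' Mplus setup): third competing
# baseline model with two correlated skill factors and two correlated
# context factors; each task loads on one skill and one context factor.
import pandas as pd
import semopy

MODEL_DESC = """
# Skill factors
LRW =~ L1 + L2 + L3 + L4 + L5 + L6 + R1 + R2 + R3 + W1 + W2
S   =~ S1 + S2 + S3 + S4 + S5 + S6

# Context factors (instructional vs. non-instructional language use)
INSTR    =~ L2 + L3 + L5 + L6 + R1 + R2 + R3 + S4 + S6 + W1
NONINSTR =~ L1 + L4 + S1 + S2 + S3 + S5 + W2

# Factor correlations
LRW ~~ S
INSTR ~~ NONINSTR
"""

# Hypothetical file holding the seventeen task-level scores (columns L1-W2).
data = pd.read_csv("task_scores.csv")

model = semopy.Model(MODEL_DESC)
model.fit(data)                   # maximum likelihood by default
print(semopy.calc_stats(model))   # chi-square, df, CFI, RMSEA, and so on
```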

FIGURE 1.1 Competing Baseline Model One.

FIGURE 1.2 Competing Baseline Model Two.

FIGURE 1.3 Competing Baseline Model Three.

The test of skill effects was first performed with the purpose of examining whether or not the influence of the skills on the test performance was negligible. The baseline model was compared to a model with only context but no skill factors. Finding a significant Chi-Square difference would suggest choosing the baseline model over the alternative model, supporting the proposition of including the skill factors in accounting for the test performance. This was followed by testing context effects to evaluate the influence of the contexts on the test performance. The baseline model was compared to a model with only skill but no context factors. A significant Chi-Square difference test result would suggest choosing the baseline model over the alternative model, providing support for the inclusion of the context factors in accounting for the test performance. To assess global model fit, the following indices were used: Chi-Square test of model fit (χ2), root mean square error of approximation (RMSEA ), comparative fit index (CFI ) and standardized root mean square residual (SRMR ). A significant χ2 value indicates a bad model fit, although this value should be interpreted with caution because it is highly sensitive to sample size. As suggested by Hu & Bentler (1999), a CFI value larger than 0.9 shows the specified model has a good fit. RMSEA smaller than 0.05 can be interpreted as a sign of good model fit, while values between 0.05 and 0.08 indicate reasonable approximation of error (Browne & Cudeck, 1993). A SRMR value of 0.08 or below is commonly considered to be a sign of acceptable fit (Hu et al., 1999).
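As a small worked illustration of how these cutoffs can be applied, the helper below checks a set of global fit indices against the thresholds just cited; the example call uses the baseline-model values reported in the Results section.

```python
def check_global_fit(cfi: float, rmsea: float, srmr: float) -> dict:
    """Compare fit indices with the cutoffs cited in the text:
    CFI > 0.90 (Hu & Bentler, 1999), RMSEA < 0.05 good or 0.05-0.08
    reasonable (Browne & Cudeck, 1993), SRMR <= 0.08 (Hu & Bentler, 1999)."""
    return {
        "CFI > 0.90": cfi > 0.90,
        "RMSEA < 0.05 (good)": rmsea < 0.05,
        "RMSEA 0.05-0.08 (reasonable)": 0.05 <= rmsea <= 0.08,
        "SRMR <= 0.08": srmr <= 0.08,
    }

# Baseline-model values reported later in the chapter
print(check_global_fit(cfi=0.987, rmsea=0.035, srmr=0.019))
```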

Results

Preliminary Analysis

Data were inspected for normality so that an informed decision could be made regarding the choice of an appropriate estimation method. The most commonly used estimator for latent variable analysis with continuous variables is maximum likelihood (ML), which estimates parameters with conventional standard errors and Chi-Square test statistics. Since ML estimation is sensitive to non-normality, when the distributions of the observed variables are non-normal, a corrected normal theory method should be used to avoid bias caused by non-normality in the dataset, as recommended by Kline (2005). Table 1.2 summarizes the descriptive statistics for the observed variables, including possible score range, mean, standard deviation, kurtosis, and skewness.

TABLE 1.2 Descriptive Statistics for the Observed Variables

| Variables | Range | Mean | Std. Dev. | Kurtosis | Skewness |
| --- | --- | --- | --- | --- | --- |
| L1 | 0–5 | 3.23 | 1.22 | –0.40 | –0.38 |
| L2 | 0–6 | 3.46 | 1.39 | –0.54 | –0.13 |
| L3 | 0–6 | 2.85 | 1.57 | –0.70 | 0.15 |
| L4 | 0–5 | 4.32 | 1.02 | 2.61 | –1.69 |
| L5 | 0–6 | 4.21 | 1.37 | –0.27 | –0.59 |
| L6 | 0–6 | 4.61 | 1.50 | 0.40 | –1.07 |
| R1 | 0–15 | 6.71 | 2.72 | –0.23 | 0.33 |
| R2 | 0–15 | 9.82 | 3.23 | –0.73 | –0.37 |
| R3 | 0–15 | 9.70 | 3.26 | –0.86 | –0.17 |
| S1 | 0–4 | 2.50 | 0.76 | –0.29 | 0.02 |
| S2 | 0–4 | 2.58 | 0.78 | –0.47 | 0.11 |
| S3 | 0–4 | 2.50 | 0.77 | –0.10 | –0.02 |
| S4 | 0–4 | 2.42 | 0.84 | –0.08 | –0.10 |
| S5 | 0–4 | 2.55 | 0.79 | 0.08 | –0.10 |
| S6 | 0–4 | 2.53 | 0.85 | –0.07 | –0.24 |
| W1 | 1–5 | 3.11 | 1.22 | –0.99 | –0.20 |
| W2 | 1–5 | 3.41 | 0.86 | –0.34 | –0.04 |

Variable L4 had a kurtosis value larger than two. The histograms of the variables revealed that the distributions of Variables L4 and L6 exhibited a ceiling effect. Univariate normality did not hold in these two cases, indicating that the distribution of this set of variables could deviate from multivariate normality. A corrected normal theory estimation method, the Satorra-Bentler estimation (Satorra & Bentler, 1994), was implemented to correct global fit indices and standard errors for non-normality by using the MLM estimator in Mplus. Model estimation problems could occur when some variables are too closely correlated with each other (Kline, 2005). Pairwise collinearity was checked by inspecting the correlation matrix of the variables. As shown in Table 1.3, dependence among all pairs of variables was moderate (.32–.69). No extremely high correlation coefficients were found. Kline also suggested calculating the squared multiple correlation between each variable and all the rest to detect multicollinearity among three or more variables. A squared multiple correlation larger than 0.90 suggests multicollinearity. None of the squared multiple correlations exceeded the value of 0.90. Therefore, there was no need to eliminate variables or combine redundant ones into a composite variable.
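The screening steps described above are straightforward to reproduce with standard tools. The sketch below computes univariate skewness and kurtosis for each task score and the squared multiple correlation (SMC) of each variable with all the others, the quantity Kline recommends for detecting multicollinearity; task_scores is a hypothetical data frame holding the seventeen observed variables.

```python
import numpy as np
import pandas as pd

def screen_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Univariate normality screen plus Kline's multicollinearity check."""
    corr = df.corr().to_numpy()
    # SMC of each variable with the remaining ones, obtained from the
    # inverse of the correlation matrix: SMC_i = 1 - 1 / (R^-1)_ii
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    return pd.DataFrame({
        "skewness": df.skew(),
        "kurtosis": df.kurtosis(),              # excess kurtosis
        "kurtosis_flag": df.kurtosis().abs() > 2.0,
        "SMC": smc,
        "multicollinearity_flag": smc > 0.90,   # Kline's 0.90 rule of thumb
    }, index=df.columns)

# Usage (hypothetical): screen_variables(task_scores)
```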

The Baseline Model

The three competing baseline models were fitted to the data. The estimation of the first two models did not succeed, and both models were therefore discarded. The test results of the third model indicated that, at the global level, this model fit the data well (χ2 = 220.849, df = 100; CFI = 0.987; RMSEA = 0.035; SRMR = 0.019). This model was accordingly chosen as the baseline model. The baseline model had two correlated skill factors, one related to speaking and the other associated with listening, reading, and writing. It also contained two factors that represented the abilities underlying language performance in two contexts, instructional and non-instructional. According to this model, performance on a task was accounted for by the skill (L/R/W or S) as well as the context in which language use was situated (instructional or non-instructional).

Tests of Skill and Context Effects

The baseline model was compared to two alternative models, the skill-only and context-only models, to examine the impact of skill and context on test performance by using Chi-Square difference tests. Table 1.4 summarizes the fit indices for the models being compared and the results of the Chi-Square difference tests. To examine the influence of the skills, the baseline model was compared to the context-only model (Figure 1.4). The results showed that the baseline model performed significantly better than the context-only model (χ2S-B|DIFF = 826.219, df = 18, Sig. p = 0.01), providing evidence for the influence of the skills on the test performance.
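For readers who wish to reproduce this type of comparison, the sketch below implements the widely used Satorra-Bentler scaled chi-square difference formula typically applied to MLM chi-square values. The scaling correction factors are not reported in the chapter, so the values in the example call are purely illustrative.

```python
from scipy.stats import chi2

def sb_scaled_chi2_diff(t0, df0, c0, t1, df1, c1):
    """Satorra-Bentler scaled chi-square difference test.

    t0, df0, c0: scaled chi-square, degrees of freedom, and scaling
    correction factor of the more restricted (nested) model.
    t1, df1, c1: the same quantities for the less restricted model.
    """
    d_diff = df0 - df1
    cd = (df0 * c0 - df1 * c1) / d_diff   # scaling factor for the difference test
    t_diff = (t0 * c0 - t1 * c1) / cd     # scaled difference statistic
    return t_diff, d_diff, chi2.sf(t_diff, d_diff)

# Baseline vs. context-only comparison; the correction factors (1.10, 1.05)
# are made up for illustration because they are not given in the chapter.
t_diff, d_diff, p = sb_scaled_chi2_diff(1064.119, 118, 1.10, 220.849, 100, 1.05)
print(f"scaled chi-square difference = {t_diff:.3f}, df = {d_diff}, p = {p:.4f}")
```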

TABLE 1.3 Correlations of the Observed Variables

|  | L1 | L2 | L3 | L4 | L5 | L6 | R1 | R2 | R3 | S1 | S2 | S3 | S4 | S5 | S6 | W1 | W2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L1 | 1.00 | | | | | | | | | | | | | | | | |
| L2 | 0.36 | 1.00 | | | | | | | | | | | | | | | |
| L3 | 0.42 | 0.48 | 1.00 | | | | | | | | | | | | | | |
| L4 | 0.41 | 0.34 | 0.35 | 1.00 | | | | | | | | | | | | | |
| L5 | 0.43 | 0.43 | 0.48 | 0.46 | 1.00 | | | | | | | | | | | | |
| L6 | 0.47 | 0.42 | 0.48 | 0.54 | 0.55 | 1.00 | | | | | | | | | | | |
| R1 | 0.35 | 0.43 | 0.49 | 0.34 | 0.43 | 0.42 | 1.00 | | | | | | | | | | |
| R2 | 0.39 | 0.49 | 0.51 | 0.43 | 0.52 | 0.56 | 0.56 | 1.00 | | | | | | | | | |
| R3 | 0.42 | 0.47 | 0.53 | 0.42 | 0.50 | 0.58 | 0.58 | 0.69 | 1.00 | | | | | | | | |
| S1 | 0.41 | 0.39 | 0.39 | 0.39 | 0.41 | 0.45 | 0.36 | 0.35 | 0.39 | 1.00 | | | | | | | |
| S2 | 0.36 | 0.36 | 0.36 | 0.41 | 0.35 | 0.41 | 0.32 | 0.35 | 0.36 | 0.55 | 1.00 | | | | | | |
| S3 | 0.43 | 0.37 | 0.39 | 0.45 | 0.42 | 0.45 | 0.34 | 0.37 | 0.38 | 0.55 | 0.57 | 1.00 | | | | | |
| S4 | 0.41 | 0.37 | 0.44 | 0.44 | 0.42 | 0.51 | 0.34 | 0.40 | 0.45 | 0.56 | 0.55 | 0.59 | 1.00 | | | | |
| S5 | 0.43 | 0.39 | 0.41 | 0.44 | 0.41 | 0.44 | 0.37 | 0.40 | 0.42 | 0.54 | 0.56 | 0.56 | 0.59 | 1.00 | | | |
| S6 | 0.44 | 0.38 | 0.43 | 0.48 | 0.46 | 0.52 | 0.36 | 0.39 | 0.41 | 0.57 | 0.59 | 0.60 | 0.61 | 0.63 | 1.00 | | |
| W1 | 0.51 | 0.47 | 0.54 | 0.50 | 0.55 | 0.62 | 0.53 | 0.62 | 0.62 | 0.47 | 0.46 | 0.51 | 0.51 | 0.53 | 0.55 | 1.00 | |
| W2 | 0.49 | 0.48 | 0.47 | 0.48 | 0.49 | 0.55 | 0.48 | 0.54 | 0.57 | 0.52 | 0.55 | 0.53 | 0.55 | 0.55 | 0.55 | 0.64 | 1.00 |

TABLE 1.4 Results of Model Testing and Comparisons

| Model | χ2S-B | df | CFI | RMSEA | SRMR |
| --- | --- | --- | --- | --- | --- |
| Baseline model | 220.849 | 100 | 0.987 | 0.035 | 0.019 |
| Context-only model | 1064.119 | 118 | 0.896 | 0.090 | 0.056 |
| Skill-only model | 530.729 | 118 | 0.955 | 0.059 | 0.039 |

| Comparison | χ2S-B difference | df difference | p |
| --- | --- | --- | --- |
| Baseline vs. Context-only | 826.219 | 18 | Sig. (p = 0.01) |
| Baseline vs. Skill-only | 304.148 | 18 | Sig. (p = 0.01) |

FIGURE 1.4 Context-only Model.

To investigate the influence of the contexts, the baseline model was compared to the skill-only model (Figure  1.5). The results indicated that the baseline model demonstrated a significantly better fit to the data than the skill-only model (χ2S-B|DIFF = 304.148, df = 18, Sig. p = 0.01), supporting the inclusion of the context factors in accounting for the test performance.

FIGURE 1.5 Skill-only Model.

The baseline model fit significantly better than the context-only and skill-only models, indicating that the performance on the TOEFL iBT ® test was influenced by both skill and context factors. To further investigate the influence of skill and context on the test performance, standardized factor loadings of the baseline model, as shown in Table  1.5, were examined. Taking the first listening task (L1) as an example, the loading of L1 on the skill factor (L/R/W) was 0.458, meaning that 21 percent (0.4582) of the variance in L1 could be accounted for by this skill factor. The loading of L1 on its context factor (non-instructional) was 0.434, meaning that 18.8 percent (0.4342) of the variance in L1 could be explained by this context factor. Therefore, in the case of the first listening task, the proportion of the variance of this measure that could be explained by its skill factor exceeded the proportion of the variance that could be explained by its context factor. This suggested that the skill-based ability factor had a stronger influence on the test performance than the context-specific ability factor. The same pattern was observed for most of the variables, except for three of the speaking tasks. For Speaking Tasks 2, 3, and 6, the loading on the context factor was higher than the loading on the skill factor. Among all the loadings on the context factors, the ones for the reading tasks were the lowest, suggesting that the performance on the reading tasks was least affected by context factors. A closer look at the significance of the factor loadings revealed that, while all loadings on the skill factors were significant, nine out of the seventeen factor loadings on the context factors were not significant. Furthermore, the correlation between the

TABLE 1.5 Standardized Factor Loadings of the Baseline Model

| Task | L/R/W | S | Instructional | Non-instructional |
| --- | --- | --- | --- | --- |
| L1 | 0.458 | | | 0.434 |
| L2 | 0.571 | | 0.231 | |
| L3 | 0.623 | | 0.247 | |
| L4 | 0.463 | | | 0.457 |
| L5 | 0.590 | | 0.338 | |
| L6 | 0.633 | | 0.405 | |
| R1 | 0.684 | | 0.091 | |
| R2 | 0.825 | | 0.070 | |
| R3 | 0.827 | | 0.093 | |
| S1 | | 0.524 | | 0.501 |
| S2 | | 0.504 | | 0.530 |
| S3 | | 0.515 | | 0.559 |
| S4 | | 0.584 | 0.511 | |
| S5 | | 0.571 | | 0.511 |
| S6 | | 0.543 | 0.608 | |
| W1 | 0.708 | | 0.405 | |
| W2 | 0.619 | | | 0.489 |

two context factors was estimated to be as high as 0.964, which signalled linear dependency between the two latent factors. In other words, the abilities to respond to language demands in the instructional and non-instructional contexts did not emerge as distinct factors. To summarize, at the global level, the baseline model satisfied all the criteria to be considered a well-fitting model. This model also outperformed the skill-only and context-only models in terms of model-data fit. However, the examination of the individual parameters exposed model specification problems, suggesting that some of the hypothesized relationships among the variables did not hold.
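The variance decomposition used in the preceding paragraphs is simply the square of each standardized loading; a minimal worked example with the loadings reported for the first listening task:

```python
# Proportion of observed variance in task L1 explained by each factor
# equals the squared standardized loading (values from Table 1.5).
skill_loading, context_loading = 0.458, 0.434
print(f"skill factor (L/R/W): {skill_loading ** 2:.3f}")                  # about 0.210, i.e. 21 percent
print(f"context factor (non-instructional): {context_loading ** 2:.3f}")  # about 0.188
```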

Discussion

This study examined the role of skill and context in defining CLA. To this end, context-related factors were modelled together with skill-specific factors in search of the best empirical representation of CLA as manifested in the test performance on the TOEFL iBT® test. The results contributed an important insight into the nature of CLA. The outcome of investigating the internal structure of the test also had implications for test item development. At the global level, the baseline model had adequate model-data fit. This model had both skill and context factors, demonstrating the influence of skill and context on test performance. In other words, both skill-specific and context-specific factors were needed to represent the ability construct measured by the test. This finding is consistent with the theoretical framework of communicative language competence, which posits that both internal cognitive abilities and the abilities to interact with the context of language use are part of the definition of the construct. Two distinct skill factors were represented in the final model: speaking ability and the ability to listen, read, and write. According to this configuration, listening, reading, and writing shared a common latent factor and were statistically indistinguishable, whereas the speaking skill was distinct from the other three. One possible explanation for the distinctiveness between a speaking factor and a nonspeaking factor could be instruction or lack of it. The speaking section became mandatory with the introduction of the TOEFL iBT® test, whereas listening, reading, and writing had long been required as part of the TOEFL testing routine before the TOEFL iBT® test. The emergence of the two-factor structure could be, as Stricker et al. (2005) have suggested, a reflection of English language instruction practice that places more emphasis on training in listening, reading, and writing skills than on the development of oral proficiency. In sum, a lack of test preparation and language training for speaking could have contributed to the identification of a speaking ability that was different from listening, reading, and writing combined. In contrast to skills, contexts of language use did not emerge as separate factors. The language abilities demanded in instructional versus non-instructional contexts could not be distinguished based on this dataset. The test instrument used in the study may have limited the extent to which context-dependent language abilities can be differentiated. Since the TOEFL iBT® test aims to measure English language proficiency in academic settings, both instructional and non-instructional language use contexts are situated in a higher-education environment, and therefore, inevitably share a great deal of similarities. For instance, typical speaker roles on campus include students, teaching staff, and non-teaching staff. We observed that these speaker roles appeared in both instructional and non-instructional language tasks. If we had used a different instrument that included contexts that were very distinct from each other (e.g., language use in a classroom and language use off campus), the chances of finding each context factor unique would have risen. At the individual parameter level, it was found that the performance on most of the tasks was more strongly influenced by the skill factor than by the context factor, and that more than half of the loadings on the context factors were not statistically significant. These results indicated that the language abilities elicited by the test were
predominantly skill-based. One potential reason for this finding could be that when interpreting context, one must consider many parameters to adequately represent the target language use domain, such as, in the case of TOEFL iBT ®, participants, content, setting, purpose, and register. Consequently, tasks belonging to the same language use context inevitably vary. For example, both the first and fourth listening tasks were categorized as non-instructional. However, the former focused on an upcoming exam between a student and a professor at the professor’s office and the latter was situated between a student and a non-teaching university staff member on the topic of housing options. As a result, the influence of contexts on test performance could have been weakened and have become less pronounced than the skills. To address this challenge, a more detailed task analysis must examine context dimensions, such as participant role or distance, formality, and so on. However, this finer-grained task categorization would result in a larger number of context factors, and subsequently, fewer indicators for each context factor, which would make the dataset unsuitable for the latent analysis adopted for this study.

Conclusion

In conclusion, the results of the study indicated the important role of both skill and context in defining CLA as measured by the TOEFL iBT® test. The findings also revealed discrepancies between the theoretical conception of the test construct and the empirically obtained internal structure of the test. These outcomes contribute to our understanding of communicative language assessment in the following aspects. First, the construct definition of a communicative language assessment must consider both skill-based and context-oriented language ability factors as well as the interaction between the two. Second, when defining and operationalizing language use contexts, a systematic approach must be taken to ensure that each context possesses unique characteristics that distinguish it from the others. In this study, similarities in task characteristics (e.g., speaker roles) across instructional versus non-instructional contexts in a higher-education environment made it statistically impossible to distinguish between the two factors. Future researchers are advised to clearly identify characteristics that are unique to each language use context, and those that are common across different contexts. Unique characteristics must be highlighted when designing tasks for different contexts to ensure that different context-dependent ability components can be reliably distinguished from one another. On the other hand, common characteristics need to be applied consistently across all contexts with the goal of preventing the influences of these characteristics from masking context uniqueness. A few limitations need to be pointed out. First, the analyses were based on parcel scores summed across items associated with a common prompt instead of item-level scores. An item-level modelling approach would allow a more detailed analysis of the relationships among the items. With more variables included in the analyses, an item-level approach would also require a much larger sample size. Using item-level scores was not feasible in this study due to a limited sample size. Future studies are encouraged to model the relationships among the latent constructs based on item-level scores if the sample size permits. Second, when investigating the latent structure
of the construct, the writing factor was represented by only two indicators. In factor analysis, a factor needs at least two indicators for the model to be identifiable. This is the minimum requirement, and in practice, it is advised to have more than two indicators for each factor. Future researchers are advised to test the latent structure of communicative language ability with each factor represented by a sufficient number of indicators. This study demonstrates how the nature of the construct of CLA could be examined by investigating the internal structure of the TOEFL iBT® test. The findings demonstrate the influences of both skill and context in defining CLA, but also raise questions about the separability of context-dependent ability components. Recommendations for assessing communicative language ability are offered for future consideration.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford, UK: Oxford University Press.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing Structural Equation Models. Newbury Park, CA: Sage, pp. 136–162.
Canale, M. (1983). From communicative competence to communicative language pedagogy. In J. C. Richards & R. W. Schmidt (Eds.), Language and Communication. London, UK: Longman, pp. 2–27.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks and test construction. Language Testing, 14(1), 3–22.
Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254–272.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Test score interpretation and use. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a Validity Argument for the Test of English as a Foreign Language. New York, NY: Routledge, pp. 1–25.
Chapelle, C., Grabe, W., & Berns, M. (1997). Communicative Language Proficiency: Definition and Implications for TOEFL 2000 (TOEFL Monograph Series Report No. 10). Princeton, NJ: Educational Testing Service.
Gu, L. (2011). At the Interface Between Language Testing and Second Language Acquisition: Communicative Language Ability and Test-Taker Characteristics (Doctoral dissertation). Retrieved from the ProQuest Dissertations and Theses database.
Gu, L. (2014). At the interface between language testing and second language acquisition: Language ability and context of learning. Language Testing, 31(1), 111–133.
Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55.
Hymes, D. (1972). Towards Communicative Competence. Philadelphia: Pennsylvania University Press.
Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000 Framework: A Working Paper (TOEFL Monograph Series Report No. 16). Princeton, NJ: Educational Testing Service.
Kline, R. B. (2005). Principles and Practice of Structural Equation Modeling (2nd edn.). New York, NY: Guilford.
Kunnan, A. J. (1998). Approach to validation in language assessment. In A. J. Kunnan (Ed.), Validation in Language Assessment. Mahwah, NJ: Lawrence Erlbaum Associates, pp. 1–16.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd edn.). New York, NY: American Council on Education and Macmillan, pp. 13–103.
Muthén, L. K., & Muthén, B. O. (2010). Mplus User's Guide (6th edn.). Los Angeles, CA: Authors.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent Variables Analysis: Applications for Developmental Research. Thousand Oaks, CA: Sage, pp. 399–419.
Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL Internet-based test. Language Testing, 26(1), 5–30.
Stricker, L. J., & Rock, D. A. (2008). Factor Structure of the TOEFL Internet-Based Test Across Subgroups (TOEFL iBT Research Report No. 07; Educational Testing Service Research Report No. 08-66). Princeton, NJ: Educational Testing Service.
Stricker, L. J., Rock, D. A., & Lee, Y.-W. (2005). Factor Structure of the LanguEdge™ Test Across Language Groups (TOEFL Monograph Series Report No. 32). Princeton, NJ: Educational Testing Service.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9(1), 1–26.


2 Presenting Validity Evidence: The Case of the GESE

Elaine Boyd and Cathy Taylor

ABSTRACT

This chapter outlines current thinking on the necessity for rigorous and extensive evidence for validity. This includes an overview of the different aspects of validity together with a consideration of how these aspects might be verified. A case study that utilized O'Sullivan's reconceptualization of Weir's (2005) sociocognitive framework is presented and discussed. This study was conducted in order to develop the construct argument for Trinity College London's Graded Exams in Spoken English (GESE) and demonstrates how a framework can be used to substantiate validity. The study considers the range of validity arguments the model can substantiate as well as the limitations of the model and, further, how the findings can define the need for, and scope of, subsequent studies. The chapter concludes by suggesting how contemporary learner corpora, such as the Trinity Lancaster Corpus of spoken learner language, will allow new insights into how validity can be evidenced.

Introduction

Validity—"the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests" (AERA et al., 2014, p. 11)—is the core or essence of a test (Bachman & Palmer, 1996; Messick, 1989). By scoping and shaping the nature of validation work, the theoretical notion of validity drives the construction of a compelling body of evidence that defines the purpose, identity, and value of a test. It is not a single element but a multifaceted nucleus that lays the foundation for the very existence of the test. It speaks to the test taker and test user, answering their questions of why should I do this test, what will the test demonstrate, what effect will it have on my learning, is it fair?

Since the 1990s, shifts in education toward assessment for learning (Black & Wiliam, 1998) together with a better understanding of test impact (Alderson & Wall, 1993) have allowed language tests to be seen more as living, organic activities that form part of a learner’s narrative rather than as flat, one-dimensional events isolated by the boundaries of their stated purpose. This has fed a growing recognition that tests do not necessarily function solely as, for example, a summative “report” (even though that might be the intention), but can influence, and even drive, the test taker’s learning (Carless et al., 2006). A deeper understanding of this effect has, in turn, led to a much greater appreciation of ethics (Davies, 2014) and a consequent awareness and concern for how far all tests—whether high stakes or not—affect, and have the potential to damage, a test taker’s self-esteem and identity. This knowledge brings with it a recognition of the responsibility that language testers have toward stakeholders. They have to be able to prove that their test is representing the claims they make about it and its assertions about a test taker’s abilities.

Current Thinking on Validity

Chapelle (1999) offers a useful summary of how work through the 1990s, prompted by Bachman's (1990) and Messick's (1989) perspectives, led to a much broader and more detailed understanding of validity, which identified its philosophical underpinnings. This resulted in an important change in thinking and what emerged was the current view of validity as an argument (Chapelle, 1999, p. 258; Kane, 1992) rather than an incontrovertible characteristic. This thinking shifted the view of reliability so that it has come to be seen as a part of validity, that is, as one piece of verification (albeit sometimes a rather dominant one) rather than a separate or discrete element. This means there is nowadays a much more unified and interconnected approach whereby, although separate validity strands are—and should be—distinguishable from each other, there is greater recognition of how these strands work together and impact on each other. For example, however important and self-standing reliability studies appear to be, they need to be firmly linked to validity in order to be convincing. Even though modern validity theory defines construct validity as the overarching purpose of validity research, subsuming all other types of validity evidence (Messick, 1995), test designers need to understand what the sub-types are—and thus be able to specify the range of proof required to support any validity claims. As Cumming explains, validity:

. . . has come to be conceived as a long-term process of accumulating, evaluating, interpreting, conveying, and refining evidence from multiple sources about diverse aspects of the uses of an assessment to fulfill [sic.] particular purposes related to its guiding construct(s), including the methods of assessment and reporting as well as other uses of information and activities relevant to the assessment, in both the short and long terms, by diverse populations. (2012, p. 1)

Cumming captures the wealth of separate elements that need to be investigated—and investigated continuously—to create a robust validity argument as well as the
rigor involved in these investigations. In an era when the consequences of test misuse are much better understood (see Davies, 2014), test designers must take a much more responsible and consistent approach when building evidence to show that their tests are fair and that the predictive outcomes they claim are appropriate. This entails a more systematic approach adopted from reliability studies, whereby claims are supported by a scrupulous scientific methodology in data gathering. There are different ways of describing the array of studies that make up the separate elements of a validity argument (see Chapelle, 1999; Xi, 2008) though Cumming (2012, pp. 4–5) outlines an accessible summary:

1 Content relevance—do the test tasks, topics, and so on, represent the kind of language events that either the test taker has been learning or that external judges would recognize as reasonable for the purpose of the test?
2 Construct representation—does the test cover an adequate and representative range of the various sub-skills and language that is needed to be proficient in what the test claims to test?
3 Consistency among measures—does the score analysis confirm consistency across different versions or administrations of the test or similar assessments to exclude construct-irrelevant factors?
4 Criterion-related evidence—is there a relationship between test results and relevant criteria and/or predictive outcomes?
5 Appropriate consequences—does the test have appropriate consequences in the wider context of its setting, for example, schools, curricula, etc.?

Although this outline is at a macro level, it illustrates the concept that the more evidence a test designer can present, the stronger the validity argument is likely to be.

Evidencing Validity

Validity has a dual dimension: it represents not just the features of a test, but also how those features are verified. So there are two layers—validity and the validation framework which is used to demonstrate that validity. In other words, validity guides validation practices. As we entered the twenty-first century, Alderson & Banerjee (2001, 2002) conducted a review that took stock of the language testing research in the previous ten years and noted several unresolved issues, some of which still hold true today. One of these was around this complex relationship between validity and validation. It is worth noting that any design and/or evaluation of language tests needs to acknowledge the fact that there are no definitive answers to this issue as yet—it is all in the strength (Kane's "plausibility," 1992, p. 527) of the argument. The test features outlined by Cumming (2012) can be robustly corroborated or validated by using a variety of methods. Table 2.1 presents a summary of the validation methods described in Xi (2008, pp. 182–189). The list is not exhaustive, but serves as an example of the types of work that could be undertaken. It reinforces the view that almost all elements need to be supported by a range of studies and emphasizes the need for both judgmental, qualitative evidence and quantitative, statistical evidence.

TABLE 2.1 Types of Validity

| Validity Element | Research Focus | Evidence to be Collected |
| --- | --- | --- |
| Evaluation: linking test performance to scores | Quantitative and qualitative | Influence of test conditions; match between scoring rubrics (mark schemes) and skills tested; rater bias |
| Generalization: linking scores to universe scores | Quantitative | Score reliability analysis |
| Explanation: linking scores to interpretation | Qualitative and quantitative | Statistical analysis of items; group differences studies; verbal protocols on processes; conversation or discourse analysis of test language; questionnaires/interviews and/or observation of processes; logical (judgmental) analysis of test tasks |
| Extrapolation: linking universe scores to interpretations | Quantitative and qualitative | Judgmental evidence that test tasks represent domain; statistical correlation analyses |
| Utilization: linking interpretation to uses | Qualitative and quantitative | Score reporting practices; decision-making processes; empirical evidence of consequences |

The different studies support and complement each other; they are equally important and must be given equal attention. However, within this, there is varying complexity in that some aspects of validity are much more easily demonstrated than others. And this is where validity butts up against its partner—reliability. There is a tension between authenticity and reliability. The closer an assessment task replicates a real life exchange or communicative event, the more this very strong content validity introduces variables that inevitably impede efforts to ensure robust reliability. Equally, a task that is very fixed, such as objective multiple choice items, can achieve good reliability of measurement, but it can then be hard to prove a meaningful relationship between the test task and the world outside the test. In each case, the test developers may have to compromise aspects of task design in order to meet the overarching consideration of test purpose. This means that tasks which allow some flexibility will also need some elements to constrain them in order to meet standardization requirements. Equally, very controlled or fixed tasks will have to demonstrate how they are connected to a relevant external measure—predictive validity—in order to justify their existence in the test. Consequential validity is also an important consideration. Messick (1989) states that “. . . it is not that adverse social consequences of test use render the use invalid but, rather, that adverse social consequences should not be attributable to any source of test invalidity such as construct-irrelevant variance” (pp. 88–89).

The resources required for the scope, range, and depth of work needed to validate a test must sound daunting to the novice test designer or small organization and—as Cumming (2012) points out—it has traditionally been the province of large-scale testing boards. However, designers of even low stakes tests—including teachers—must acknowledge the ethical responsibilities of testing and should think about how they can support the validity of their test. We hope that the following case will show how a relatively small-scale and easily administered pilot study, if conducted systematically within a framework, can yield a wealth of evidence and insight, reassure and guide the test designers, as well as provide a fertile source for future directions.

A Case Study

This study sought to validate the Graded Exams in Spoken English (GESE), a communicative Speaking and Listening test in English for speakers of other languages offered by Trinity College London. The test is complex in that it comprises a set of tasks that allow some personalization by the test taker, making it highly authentic (and thus giving it good predictive validity) but presenting a challenge for standardization and generalizability. The overarching purpose of the study was to establish initial construct validity evidence (confirming that the test was consistently testing what it claimed to be testing) in relation to current frameworks of communicative language ability (Bachman & Palmer, 1996; Canale, 1983; Canale & Swain, 1980).

Trinity College London Exams

Trinity College London (Trinity) is an international examinations board that provides accredited qualifications in English language, teacher training, music, and the performing and creative arts. The underlying philosophy of all the Trinity exams emphasizes candidate performance and participation. Trinity believes that an assessment which allows personal choice, preparation, and a supportive environment provides candidates with the maximum opportunity for success—what Swain (1985) has called "bias for best"—and promotes learning beyond the exam. Each subject suite has a number of "micro" grades or levels to accommodate as wide a range of abilities as possible, and provides a stepped pathway to achievement rather than requiring sizeable leaps in competences between, for example, CEFR levels. This rewarding achievement approach is reinforced by the fact that the Trinity assessments are criterion-referenced rather than norm-referenced, and are expressed in positive "can do" statements.

Description of the GESE Suite

The first GESE exams were held in 1938, and the suite has evolved as theories of language learning and proficiency have changed. By the 1980s, the test reflected the influence of communicative teaching methodology (Canale & Swain, 1980), and it has since remained a test that prizes successful communication above all. The GESE examination consists of four developmental stages (Initial, Elementary, Intermediate,
and Advanced), with three grades at each stage, making twelve grades in all. The twelve grades provide a continuous measure of linguistic competence and take the learner from absolute beginner (Grade 1, below A1 CEFR ) to full mastery (Grade 12, C2 CEFR ). The graded system sets realistic objectives in listening to and speaking with expert English speakers. A detailed description of the stages and grades can be found in Table 2.2. The GESE is a face-to-face oral interview between a single examiner and a single candidate. The examination simulates real-life communicative events in which the candidate and the examiner exchange information, share ideas and opinions, and debate topical issues. Each grade assesses a range of communicative skills and linguistic functions that are listed by grade in the GESE Exam Information Booklet (available at http://www.trinitycollege.com/site/?id=368). Skills and functions are cumulative across the grades. Examiners are not provided with a prescribed script for the interview, but rather base their questions and contributions on a structured test plan focused on the specific skills and functions defined for each grade. The test plan ensures examiners elicit the language required by the grade the candidate is attempting, while allowing space for the candidate to contribute in a highly personalized way. Assessment is made on holistic performance descriptors and scores are awarded on a Distinction, Merit, Pass or Fail format.

Methodology

As already mentioned, the main aim of this study was to establish the construct validity of the GESE in relation to current frameworks of communicative language ability. The Canale & Swain (1980) model is attractive because it includes the more subjective/abstract components of strategic and sociolinguistic features of communication alongside discourse and grammar. These features should be particularly evident in the GESE because of its emphasis on real-life communication. Bachman & Palmer's (1996) model of language ability is referenced particularly for its coverage of strategic competence and metacognitive strategies. Table 2.3 presents competences of interest and provides brief definitions. The goal was to investigate whether, and to what extent, the GESE tapped into these communicative competences. It was always recognized that, following any results, further studies would be necessary to develop a fuller validation argument for the GESE. It can be a challenge to investigate the more abstract competences such as sociolinguistics, pragmatics, and metacognition, and Trinity needed a model to investigate these in a principled way. The team selected O'Sullivan's validation model (2011, Figure 2.1), which is an updated version of Weir's (2005) sociocognitive framework for test development and validation (see O'Sullivan, 2011; O'Sullivan & Weir, 2011). There are two important points to note about the model:

1 It is particularly suited to Kane's (1992) "validity as argument" approach because it recognizes the equal importance of qualitative data in its contribution to the interpretative argument and allows for the inclusion of studies that are constrained by resources and that might originally have been excluded because of their limited size or scope.

TABLE 2.2 The Graded Examinations in Spoken English (GESE)

| Stage | Grades | CEFR Grades | Timing (minutes) | Conversation | Topic Discussion | Interactive Task | Topic Presentation | Monologue Listening |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Initial | 1–3 | Pre-A1–A2 | 5–7 | ✓ | | | | |
| Elementary | 4–6 | A2–B1 | 10 | ✓ | ✓ | | | |
| Intermediate | 7–9 | B2 | 15 | ✓ | ✓ | ✓ | | |
| Advanced | 10–12 | C1–C2 | 25 | ✓ | ✓ | ✓ | ✓ | ✓ |

TABLE 2.3 Theoretical Bases of Communicative Approaches to Second Language Teaching and Testing

| Component | Definition |
| --- | --- |
| Grammatical | Knowledge of lexical items and of rules of morphology, syntax, sentence-grammar semantics, and phonology |
| Sociolinguistic | Knowledge of the sociocultural rules of language and of discourse |
| Discourse | Ability to connect sentences in stretches of discourse and to form a meaningful whole out of a series of utterances |
| Strategic | The verbal and nonverbal communication strategies that may be called into action to compensate for breakdowns in communication due to performance variables or due to insufficient competence |


FIGURE 2.1 O’Sullivan Test Validation model (2011).

2 Its focus on cognitive elements (Appendix 1, Section 1) and the broader scope of its approach to task details (Appendix 1, Section 2)—including, for example, discourse, function, structure, and content—meant it provided a very good fit for marrying Canale & Swain's (1980) competency framework with the intended communicative constructs of the GESE test. In other words, it allowed the exam to be investigated in minute detail to see what emerged. This validation model is dynamic and flexible in that it considers three elements: the test system, the test taker, and the scoring system. The interaction of these three elements provides a qualitative framework with which to substantiate elements of test validity. The model is not only useful and practical, but is also particularly suited to identifying cost- and resource-efficient starting points that nevertheless generate a considerable amount of data. This can then be used as the springboard for a variety of further validation studies, such as those mentioned in Table 2.1. It was anticipated that an independent panel of external experts would be able to infer the GESE construct from the empirical evidence generated by the model. The panel would also comment on the underlying language model using Canale & Swain's (1980) and Canale's (1983) model of communicative competence. If necessary, given the GESE's long history, the outcomes would allow the test to be retro-fitted to its purpose. Fulcher & Davidson (2009) define this as the process of "making an existing test more suitable for its original stated purpose, by ensuring that it meets new or evolving standards, . . ." (p. 124).


Study Design

The GESE suite of tests is complex: not only does the number of tasks increase at each GESE developmental stage, but the responsibility to initiate and maintain the interaction also shifts from the examiner to the candidate as the level increases. Therefore, it was challenging to ensure that each grade and each task would be evaluated in a systematic way to enable comparison between performances at the same grade and also with performances at grades above and below.

1 The panel

The study gathered and analyzed the judgments of a panel of four experts on aspects of the GESE. The panelists were a second language acquisition expert, an expert in pedagogy (both external), Professor Barry O'Sullivan (the designer of the validation model, with expertise in similar studies), and an internal member of the Trinity academic team. The rationale for including a member of the Trinity academic team was to ensure that the panel did not omit important information or misunderstand an aspect of the exam. The team was small because of the time and resources available, but it was nevertheless deemed to be of value in the context of a pilot or initial study.

2 The data

The data bank comprised video-recorded GESE tests that had been prepared for standardization purposes. They showed genuine candidates and examiners from the international panel, but had been recorded after the candidate had taken their own test. Twelve performances were selected that the panel would be asked to watch and listen to, one from each GESE grade. In order to gather the data systematically on all three elements of the validation model, a series of tables was created (see Appendix 1). Each table was based on a different aspect:

1 An overview of the grade from the candidate and cognitive domain.
2 A specific look at the parameters that define the grade. This section was subdivided into five tables, to reflect each GESE task.
3 An overview of the appropriacy of the scoring system.
4 Some general comments on the underlying language model (see Appendix 1).

To cover all twelve GESE grades while simultaneously working within both time and cost constraints, it was decided that each panelist would watch either four or five GESE exams at different grades. Everyone watched a GESE grade 7 (CEFR B2) exam, the rationale being to help the panelists establish common ground in how they interpreted and reacted to the exam. The lead expert (O'Sullivan) watched four videos and reported on all four GESE stages. The SLA and pedagogy experts each watched five videos covering all four GESE stages. The Trinity representative watched three videos covering three GESE stages. Table 2.4 shows how the performances were distributed.


TABLE 2.4 Design of the Completion Rota

Panelist                 Initial Stage   Elementary Stage   Intermediate Stage   Advanced Stage   Common Grade 7
Lead Expert              G1              G6                 G7                   G12              yes
SLA Expert               G2              G5                 G8                   G11              yes
Pedagogy Expert          G3              G4                 G9                   G10              yes
Trinity Representative   –               G5                 –                    G11              yes

Process

The project was divided into three parts:

1 An introduction, led by Professor O'Sullivan, to the aim and scope of the project together with clarification of any issues.

2 A three-day session during which the panel completed the relevant pro forma (Appendix 1) while watching their assigned exams. In order to ensure consistent completion of the documentation, the respondents were asked to consider how each grade reflected the parameters of each domain in the O'Sullivan model. Some of the parameters, for example, the cognitive domain, presented a challenge in that respondents might not have been able to understand easily or fully how they worked. Nevertheless, the panel was encouraged to use its expertise to draw conclusions as this would facilitate a fuller discussion in the final stage of the project. The Trinity representative completed documentation to provide an overview of the GESE for the respondents. This included categories such as: purpose of the test, intended population, number of examiners, and rating scale type.

3 A two-day seminar with the full panel, led by Professor O'Sullivan, to discuss the evidence gathered from the observation exercise and reach a degree of agreement regarding the final overview of the twelve GESE grades.

Findings

The aim of the study was to investigate whether the GESE construct reflects Canale & Swain's (1980) and Canale's (1983) communicative competence model. The external panel completed detailed tables on all twelve GESE grades for the test taker, test task, and scoring system, and recommended research for each. Appendix 2 summarizes the results. The completed tables generated a wealth of qualitative comments from the panel that could be applied to the competences outlined in Canale & Swain (ibid.). These were then categorized according to how far they informed each of the competences.

Evidence of Competences

Strategic Competence

At the Initial stage in the GESE (grades 1–3), there was little indication of candidates needing to use their strategic competence. However, at higher grades, the amount of spontaneous language generated by the tasks increases. As a result, the candidates' need for strategic competence increases too. Strategic competence was observed at grade 3 (CEFR A2.1), the highest grade in the Initial stage. From the Elementary stage, grade 4 (CEFR A2.2) onward, there was more evidence of strategic competence.

Discourse Competence

This exhibited a very similar pattern to strategic competence. At grades 1 (CEFR pre-A1) and 2 (CEFR A1), candidates are required to produce very little continuous discourse. However, at grade 3 (CEFR A2.1), evidence of emergent discourse competence was observed, and as the GESE grades progress, this becomes increasingly apparent. Some tasks highlighted this more than others; for example, the interactive task at the Intermediate stage.

Sociolinguistic Competence

There is very little evidence of sociolinguistic competence until the Elementary stage (grades 4–6). At grade 6 (CEFR B1.2), there were indications of sociolinguistic competence in the conversation phase. At the Intermediate stage (grades 7–9, CEFR B2), there is clear evidence of sociolinguistic competence in the Interactive task. However, the panel noted that only pragmatic competence is assessed.

Linguistic Competence

Similar to the other competences, there is relatively little indication of linguistic competence in grades 1 (CEFR pre-A1) and 2 (CEFR A1). At grade 3 (CEFR A2.1), where candidates are asked to produce some continuous discourse, linguistic competence is evident, and this increases as the grades progress. The final conclusion from the panel was that, overall, the GESE construct was communicative in relation to the Canale & Swain (1980) and updated Canale (1983) models. The panel observed that the communicative construct appears to expand as the GESE grade increases and that it was clearest in the Advanced stage (grades 10–12, CEFR C1–C2). The panel noted that communicative competences were only emerging at grade 3 (CEFR A2.1), and therefore, there was an argument for a different construct for grades 1 and 2 (CEFR A1 and below). Similarly, at the Elementary stage (GESE grades 4–6, CEFR A2–B1), the panel considered that the communicative competences were not fully evident and again a modified construct to reflect this would be feasible.

Limitations of the Study

The GESE is a complex suite of tests given the range of constructs it seeks to report on across micro-levels. Although the validation model facilitated a detailed analysis of all the GESE grades, the panel advised that it was not possible to observe all the parameters from the candidate performances. This applies particularly to the cognitive domain: information about pre-articulation processing and planning cannot be accessed through observation alone. The experts' evaluation was therefore based on conjecture about the cognitive processes. Further research using verbal protocols is necessary to confirm some of the conclusions of this study. The study was also small-scale, with only one exam from each GESE grade included in the sample. A larger sample, being more representative of the typical candidature for the grade, may well have resulted in some different conclusions from the panel.

Follow-up Studies

As described previously, the key to a robust validity argument is to carry out a variety of validation studies that review the test from different angles and use a mix of quantitative and qualitative methods. One of the benefits of this study was that Trinity now had a large pool of research ideas for future validation studies, some of which have since been undertaken.

GESE Construct Variation

The panel observed variations in the levels of the competences tapped as the GESE grades increased and mooted separate constructs for the GESE stages. To investigate this further at the Intermediate stage, Dunn (2012) carried out a comparative analysis of GESE grades 7, 8, and 9. The aim was to provide empirical insights into whether it is reasonable to claim the existence of a continuum of spoken language competence, and into the capacity of these grade exams to elicit language along this continuum. In addition, further qualitative research on the pragmatic competences elicited by the interactive task was conducted (Hill, 2012) to establish what patterns exist between examiners' assessments/grading of candidate performance and the detail of both parties' contributions to the interaction. Trinity has subsequently modified the examiner training and item writer guidelines based on the recommendations of these studies.


Assessment of Listening Skills

Listening is an integral skill in the GESE suite, and although not formally assessed, it is nonetheless a constituent part of the assessment. The term interactive listening, as defined by Ducasse & Brown (2009), is applicable to the type of listening skills used throughout the GESE suite. Interactive listening skills are those that support effective interaction between the participants, for example, showing signs of comprehension by carrying out a task (applicable to the GESE Initial stage) and supportive listening through back-channelling. In order to understand more fully the role that interactive listening skills play in the GESE exams, Trinity commissioned a study by Nakatsuhara & Field (2012). Audio recordings of twenty examiners, each interacting with a low-level candidate and a high-level candidate, were analyzed. Six contextual parameters were selected for analysis from Weir's (2005) framework. One of the parameters researched was examiner speech rate. It is to be expected that by task 2, the interactive task, examiners have estimated the candidate's speaking and listening proficiency. Analysis of speech rates indicated that examiners tended to slow their speech rate with basic pass level candidates by including additional pauses, whereas they did not do so with high-scoring candidates. This indicates that examiners are sensitive to listening proficiency, as would happen in the world outside the test. This variation allows candidates the best opportunities in the exam, supporting the "bias for best" argument.

Discrete Listening Task at the Advanced Stage

In response to the limitations in the Advanced listening task noted by the panel, Trinity commissioned an independent investigation into tasks which assessed L2 listening comprehension at the Advanced stage. Brunfaut & Révész (2013) confirmed there were issues with the task. This gave Trinity the necessary weight of evidence to implement changes to the task.

Microlinguistic Skills

The panel analysis showed that the grammar skills used by the candidates did not always match the GESE grade requirements. Though the levels had been validated by Papageorgiou (2007), the panel observed that eliciting micro-skills across some of the levels is very difficult. For example, they suggested it is not possible to predict the grammar elicited at the micro level (such as the expectation that the zero and first conditionals are elicited at grade 6, CEFR B1.2). This has proved a rich vein to explore, and Trinity has commissioned research into this aspect, some of which is ongoing. A separate study by Inoue (2013) looked at the functions used by candidates in GESE grades 4, 6, 8, and 10. Generally, candidates used the functions specified for their grade, and higher-scoring candidates used a wider range of functions. However, at higher levels, not all the functions listed in the specifications were elicited by the tasks. Further analysis is being carried out by a major corpus project—the Trinity Lancaster corpus project (described later). This big dataset is unique in allowing robust quantitative arguments to be constructed to support validation.


Rating Process

The panel expressed some concern over the rating process, which rewards the production of skill indicators. As a result, Trinity commissioned a study into the rating scales (Green, 2012), and a review of the test as a whole will focus on this aspect. Separately, Taylor (2011) conducted a study into whether there were any differences in the interview and assessment techniques of experienced and novice examiners. This indicated small differences, for example, in how they dealt with errors, which were then addressed through training.

Benefits of Using a Validation Framework

There were several benefits to using this particular validation framework, as outlined above. First, it confirmed several aspects of the GESE construct that had not been formally researched prior to this work. It also provided a system for accessing the views of external experts who knew comparatively little about the GESE suite, and who thus brought a more objective perspective than, for example, senior examiners would have done. An additional benefit was their differing areas of expertise, which resulted in well-rounded discussion. The two-day format of the final stage was successful as it permitted a full discussion without undue time constraints. The scope of this validation framework proved to be comprehensive, covering, as it did, the test taker, test system, and scoring system. The data are all valuable evidence in developing part of the validity argument for a test. Operationally, the framework is comparatively easy to use and flexible, and it can be adapted for other tests. The pro forma for this study, although lengthy, are accessible, and participants can work remotely and in their own time. This means that the framework and this approach are accessible to relatively small testing operations—even within schools, for example—allowing them to create initial validity arguments and have some early confidence that their test is fit for purpose. A welcome, and very much desired, outcome of this study was the significant amount of additional information generated by the investigation, which has instigated and guided further research programs to support and extend the ongoing GESE validity argument. This shows that the concerns and queries raised in even a small-scale study can be invaluable for test designers and test managers seeking to strengthen their validity arguments. As we have said, a one-time piece of validation research is not sufficient. Tests need to be constantly reviewed to ensure that the arguments are continually added to or adapted as necessary. From this study, Trinity recognized the need to carry out further research into related aspects of the GESE construct to strengthen the validity argument. This included not only more traditional investigations (outlined above), but also studies that look into wider aspects of the test, such as its consequences, which address the challenges of modern language testing, as well as studies that embrace the newly available opportunities offered by technology.


Conclusion

Challenges for Validation Research

Cumming (2012) outlines some remaining challenges that test designers need to be aware of as they set out on their validation pathway.

● Resources and expertise: The more academically sophisticated approach to validity and validation frameworks that is expected today is not intended to put test designers off. However, while needing to be aware of the investment and skills needed to conduct even small-scale studies, we should not encourage short cuts or be tempted to forgo certain studies, but should instead focus on how validity can be demonstrated in other, more economical ways, perhaps provided by technology.

● Resistance to change: All experienced test designers are confronted with the fact that however forward-looking the test-taking population—and they are becoming increasingly so—there are stakeholders who can become gatekeepers to change. This can include teachers who may be reluctant to change, or test users who are nervous about risks associated with what they can perceive as moving the parameters when a test changes. This response to change indicates how critical it is to get stakeholder buy-in for new or revised tests.

● Generalizability: Cumming (2012) warns against local or individual assessments over-reaching their data or results. This has become particularly relevant in an environment in which there has been an increase in teacher-involved formative assessment. As a consequence, a strong argument is emerging for raising assessment literacy among teachers. In our view, this is not so much to enable them to create their own tests, but so that they do not reach potentially damaging conclusions from what may be a flawed instrument.

● SLA match: There is currently not much convergence between theories of SLA and the practices of language assessment. Some of this might be solved by shifting to tests that have a very strong relationship with the world outside the test, despite the fact that this brings its own challenges in terms of reliability and standardization.

Evidencing Consequential Validity

The initial study did not consider the impact of the test on learning. If, as we claim, the test assesses communicative competence, we would expect it to promote communicative practices in the classroom. Since there is now a better understanding of the potential ethical consequences of testing (Davies, 2014), we should make sure that the GESE either has or supports the intended impact. In so doing, we should note the debate about how far language test designers should engage with consequential validity and how far this might distract from their focus on construct and reliability (Alderson & Banerjee, 2002; McNamara, 2006). The division of this concern into professional responsibility and social responsibility outlined by McNamara (2006, p. 44) is helpful in allowing test designers to focus on the former unless they wish to engage in political stances. Thus, it would seem sensible to at least check the effects of the test, and in this spirit, Trinity Italia conducts regular surveys on perceptions of the GESE by school principals, students, and teachers (available at: http://www.trinitycollege.it). The surveys ask about the examination experience and perceptions of the exam as well as the perceived impact on learning. Teachers consistently report that they change their classroom practice to employ more communicative methodology. Clearly, these claims would need to be supported by observational studies, but they are an initial indication that the test may be having the desired impact. If this can be further validated, it would mean that the construct is rooted in communicative competences as we would wish, and that the test is supporting teachers as well as learners. This might seem like a substantial claim, but it is arguably what all modern tests should seek to do.

New Opportunities in Validation Research: Corpus Data

The benefits of cross-disciplinary approaches to research are becoming increasingly evident. This fusion allows test designers to be aware of, and draw on, a much wider range of resources using technology, corpora, psychology, and so on. The caution would be that, for any conclusions to be part of a credible and robust argument for validity, any cross-disciplinary study would have to take account of the drivers of the different strands from those other subject areas, and the challenges these may present, in order to ensure that the results were bound into a unified argument. The qualitative study into the GESE proved worthwhile because specialist expert judgment is extremely valuable, but the results would, of course, be significantly enhanced if they could be supported by wider or more quantitative studies. Technology offers this by allowing us to study much larger datasets—and in more complex ways—to enrich these qualitative studies. Limitations caused by small samples can now be ameliorated by more robust quantitative checks that tackle big datasets. This is where modern corpora are becoming an increasingly important instrument in the test validation toolbox. In 2013, Trinity partnered with the Centre for Corpus Approaches to Social Science (CASS) at Lancaster University to create a large-scale spoken learner corpus, the Trinity Lancaster Corpus, based on GESE test recordings. To date, the corpus exceeds three million words from speakers with a range of first languages, ages, educational backgrounds, and other metadata. Some promising early investigations of this corpus data have been carried out by the team at CASS Lancaster. One of the first formal studies focused on one of the core pragmatic skills, the expression of epistemic stance in spoken communication (Gablasova et al., 2015). The study used data from the Trinity Lancaster corpus of spoken learner language to examine expressions of epistemic stance in spoken production, both in different task types and by speakers of different L1 and cultural backgrounds. By focusing on stance-taking and engagement in discourse, it aimed to contribute to our understanding of the communicative competence of advanced L2 speakers. The results demonstrated that each task within the GESE Advanced levels targets a different communicative construct. While not a replication of the case study outlined here, this supporting study shows how the big datasets that corpora offer can be analyzed to provide effective quantitative supporting evidence on aspects of validity. The promising results are an early indication of the powerful arguments corpus data can generate. Consequently, further studies into strategic and pragmatic competences are planned, and the design of the Trinity test, with its focus on personal expression, offers a rich dataset. A limitation of this approach might be the frequent criticism that corpora constructed from test settings can be self-locking. In other words, the restrictions around the test tasks and the often controlled nature of the language elicited mean that the corpus can be highly limited in its applications. The more restrictive and standardized the test, the more likely this is to be true. However, the Trinity Lancaster corpus benefits from the fact that each GESE test, while constrained by level, allows the candidates to construct and negotiate meaning jointly with the examiner as they utilize the language of the grade. The fact that each test is structured but not scripted and has tasks that represent language pertinent to communicative events in the wider world allows this test to be more reflective of naturally occurring speech than many other oral tests. Corpus analysis is likely to become more sophisticated in the future, especially with multiple layers of corpus annotation that allow searching according to different linguistic and background criteria. The Trinity Lancaster Corpus has an aspiration to become a leading research tool in this respect.
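To give a flavor of the kind of quantitative check such a corpus makes possible, the minimal sketch below counts a handful of epistemic stance markers (hedges and boosters) in a toy set of transcribed responses and aggregates them by task type. The transcripts, metadata fields, and marker list are invented for illustration only; they do not reflect the actual format, annotation scheme, or query tools of the Trinity Lancaster Corpus or the analysis reported in Gablasova et al. (2015).

```python
# Illustrative sketch: count a few epistemic stance markers per task type
# in a small, invented set of transcribed responses.
import re
from collections import Counter

# Hypothetical mini-corpus: (metadata, transcript) pairs.
corpus = [
    ({"task": "discussion", "L1": "Spanish"},
     "I think it is maybe a good idea, but I am not sure."),
    ({"task": "presentation", "L1": "Mandarin"},
     "This clearly shows that recycling certainly helps everyone."),
    ({"task": "discussion", "L1": "Italian"},
     "Perhaps we could say that it is probably true."),
]

# A small, illustrative set of stance markers (hedges and boosters).
MARKERS = ["i think", "maybe", "perhaps", "probably", "certainly", "clearly", "not sure"]

def count_markers(text):
    """Count occurrences of each stance marker in a lower-cased transcript."""
    text = text.lower()
    return Counter({m: len(re.findall(r"\b" + re.escape(m) + r"\b", text))
                    for m in MARKERS})

# Aggregate marker counts by task type, mirroring a task-type comparison.
by_task = {}
for meta, transcript in corpus:
    by_task.setdefault(meta["task"], Counter()).update(count_markers(transcript))

for task, counts in by_task.items():
    print(task, sum(counts.values()), dict(counts))
```

In a real study, the marker inventory, normalization per 1,000 words, and metadata filters (L1, age, grade) would of course be far more principled; the point here is only that richly annotated spoken test data lends itself to this kind of scalable, repeatable counting.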

References

Alderson, J. C., & Banerjee, J. (2001). State of the art review: Language testing and assessment Part 1. Language Teaching, 34(4), 213–236.
Alderson, J. C., & Banerjee, J. (2002). State of the art review: Language testing and assessment Part 2. Language Teaching, 35(2), 79–113.
Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14(2), 115–129.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford, UK: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford, UK: Oxford University Press.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5(1), 7–71.
Brunfaut, T., & Révész, A. (2013). Text characteristics of task input and difficulty in second language listening comprehension. Studies in Second Language Acquisition, 35(1), 31–65.
Canale, M. (1983). On some dimensions of language proficiency. In J. Oller (Ed.), Issues in Language Testing Research. New York, NY: Newbury House Publishers, pp. 33–42.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Carless, D., Joughin, G., Liu, N-F., & Associates. (2006). How Assessment Supports Learning. Hong Kong: Hong Kong University Press.
Chapelle, C. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254–272.
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge, UK: Cambridge University Press.
Cumming, A. (2012). Validation of language assessments. In The Encyclopedia of Applied Linguistics. Oxford, UK: Blackwell Publishing Online.
Davies, A. (2014). Fifty years of language assessment. In A. J. Kunnan (Ed.), The Companion to Language Assessment (Volume 1). Oxford, UK: Blackwell, pp. 3–22.
Ducasse, A. M., & Brown, A. (2009). Assessing paired orals: Raters' orientation to interaction. Language Testing, 26(3), 423–443.
Dunn, K. (2012). Comparative analysis of GESE grades 7, 8 and 9: Quantitative study. Unpublished report commissioned by Trinity College London, UK.
Fulcher, G., & Davidson, F. (2009). Test architecture, test retrofit. Language Testing, 26(1), 123–144.
Gablasova, D., Brezina, V., McEnery, A., & Boyd, E. (2015). Epistemic stance in spoken L2 English: The effect of task type and speaker style. Applied Linguistics. Available from: http://applij.oxfordjournals.org/content/early/2015/10/31/applin.amv055.full
Green, A. (2012). Trinity College London Graded Examinations in Spoken English performance descriptors: A review and recommendations for development. Unpublished report commissioned by Trinity College London, UK.
Hill, C. (2012). An Interactional Pragmatic Analysis of the GESE Interactive Task. Unpublished report commissioned by Trinity College London, UK.
Inoue, C. (2013). Investigating the use of language functions for validating speaking test specifications. Paper presented at the Language Testing Forum, Nottingham Trent University. Available from: http://www.beds.ac.uk/crella/resources/presentations. [29 July 2015].
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement. New York, NY: American Council on Education and Macmillan Publishing, pp. 13–103.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749.
Nakatsuhara, F., & Field, J. (2012). A Study of Examiner Interventions in Relation to the Listening Demands They Make on Candidates in the GESE Exams. Report commissioned by Trinity College London, UK.
O'Sullivan, B. (2010). GESE Construct Project: Report on the Outcomes of the Construct Definition Procedure. Internal report for Trinity College London, UK.
O'Sullivan, B. (2011). Language testing. In J. Simpson (Ed.), The Routledge Handbook of Applied Linguistics. Oxford, UK: Routledge, pp. 259–273.
O'Sullivan, B., & Weir, C. J. (2011). Language testing and validation. In B. O'Sullivan (Ed.), Language Testing: Theories and Practices. Oxford, UK: Palgrave, pp. 13–32.
Papageorgiou, S. (2007). Relating the Trinity College London GESE and ISE Examinations to the Common European Framework of Reference. Final Project Report, February 2007. Available from: http://www.trinitycollege.com/site/viewresources.php?id=1245. [29 July 2015].
Swain, M. (1985). Large-scale communicative language testing: A case study. In Y. P. Lee, A. C. Y. Y. Fok, R. Lord, & G. Low (Eds.), New Directions in Language Testing. Oxford, UK: Pergamon Press, pp. 35–46.
Taylor, C. (2011). An Exploratory Study into Interview and Assessment Techniques of Experienced and Novice Oral Examiners (Master's dissertation). University of Lancaster.
Trinity College London. (2014). GESE Exam Information Booklet. Available from: http://www.trinitycollege.com/site/?id=368. [30 July 2015].
Trinity College London. (2014). Trinity College London Integrated Skills in English (ISE) Specifications. Available from: http://www.trinitycollege.com/site/?id=3192. [29 July 2015].
Weir, C. J. (2005). Language Testing and Validation: An Evidence-Based Approach. London, UK: Palgrave Macmillan.
Xi, X. (2008). Methods of test validation. In E. Shohamy & N. Hornberger (Eds.), Encyclopedia of Language and Education, Volume 7: Language Testing and Assessment. New York, NY: Springer, pp. 177–196.


APPENDICES

Appendix 1. Respondent Sheets for the Validation Model

1. The Candidate and Cognitive Domain [complete one table only but refer to all tasks at each grade; columns for Grade 1, Grade 6, Grade 7, and Grade 12]

CANDIDATE CHARACTERISTICS
Physical/physiological: Short-term ailments; Longer-term disabilities; Age; Sex
Psychological: Memory; Personality; Cognitive style; Affective schemata; Concentration; Motivation; Emotional state
Experiential: Education; Examination preparedness; Examination experience; Communication experience; TL-country residence

COGNITIVE PROCESSES
Conceptualizer; Pre-verbal message; Linguistic formulator; Phonetic plan; Articulator; Overt speech; Audition; Speech comprehension; Monitoring

COGNITIVE RESOURCES [content knowledge]
Internal; External

2. Task Details [complete one table for each task reviewed—again refer to table for explanation; columns for Grades 1, 6, 7, and 12]

TASK NUMBER: 1   TASK TYPE: Conversation

TASK SETTINGS
Format; Purpose; Response format; Known criteria; Time constraints; Intended operations

TASK DEMANDS
Input: Channel; Discourse mode; Structural range; Functional range [use list provided]; Nature of information; Content knowledge
Output: Channel; Lexical range; Structural range; Functional range
Interlocutor: Candidate role; Variety of accent; Acquaintanceship; Number of speakers; Gender

TASK ADMINISTRATION
Physical conditions; Uniformity of administration; Security


Appendix 2. Summary of Panelists' Comments on the GESE According to the Validation Model

Each validation model category below is followed by the panelists' comments.

Review of the cognitive and candidate domain

Overall: Tasks are appropriate in terms of age and gender and free from bias. Tasks are dyslexia-friendly and can easily be taken by visually impaired candidates. Memorization is not necessary as notes are permitted, and at lower levels there is no advantage in memorizing as the examiner guides the discussion. Concentration: Short tasks and the level of candidate control over the interaction help maintain concentration. Cognitive processing: The candidate is not under pressure to respond; therefore, the likelihood of useful cognitive processing is high. Higher levels of cognitive skills are required from grade 7 (CEFR B2).

Review of the GESE tasks

Overall: Topic discussion, presentation, interactive task, and conversation tasks function as intended. All the tasks elicited a broadening range of language and discourse functions as the grades increased. Microlinguistic skills: In all the tasks, the grammar structures used in the exams did not consistently match those prescribed in the specifications. However, the prescribed functional language was largely appropriate. Listening task, Advanced (grades 10–12, CEFR C1–C2): This does not elicit the range of competencies that the other tasks do. Three short test items are unlikely to be a reliable measure of candidates' listening abilities.

Review of the scoring system and rating process

Overall: Awarding a holistic score for each task was considered to offer a potentially reliable and valid score. The use of a single rater raised some concerns, but Trinity's rigorous live and audio monitoring process mitigated this. Listening skill: This is not formally assessed from grades 1–9. The panel recommended that it should be, since comprehension is essential to carry out the tasks. Trinity maintains that interactive listening is assessed within the communication skills criteria. Skills indicators: Broader communicative and functional skills can be assessed, but grammar is difficult to predict and assess at the micro-level.


3 A New National Exam: A Case of Washback

Doris Froetscher

ABSTRACT

This chapter centers on exam washback—the effects of external exams on what happens in the language classroom. Unlike washback on teaching or learning, washback on classroom assessment is under-researched. The research reported here was conducted in the context of the reformed Austrian secondary school-leaving examination. Using an online questionnaire, the study investigated teachers' perceptions of the new examination's washback effect on summative class tests. Results show considerable washback. In their class tests, teachers mirror the methods used and the skills included in the school-leaving exam, and make heavy use of past papers. Matura experience, school type, and training emerge as mediating factors. While some of the findings can be seen as positive washback, others are more negative. The results call for further research into washback on classroom assessment both generally and in the specific context, and highlight the need for teacher training in language testing and assessment.

Introduction

Whenever a new or reformed test is introduced into a system, the test may have effects on its stakeholders, for example, teachers and students. In the language testing and assessment (LTA) literature, such effects are called washback or backwash (Alderson & Wall, 1993; Hughes, 1989). In general terms, washback is "the effect of testing on teaching and learning" (Hughes, 1989, p. 1). Messick specifies that the teachers' and learners' actions are influenced by washback in that they "do things they would not otherwise do that promote or inhibit learning" (Messick, 1996, p. 241). Shohamy (1999, p. 711) gives examples of where such change may be situated: "in educational matters such as curriculum, teaching methods, teaching and learning strategies, material and courseware, assessment practices, and the content of instruction." Although the term washback is sometimes used interchangeably with impact, the latter embraces an even broader area (Bachman et al., 1996; Hamp-Lyons, 1997; Wall, 1996). As Wall (1997) explains, impact refers to "any of the effects that a test may have on individuals, policies or practices, within the classroom, the school, the educational system or society as a whole" (p. 291). Since the 1990s, test washback on language teaching and learning has been investigated in several contexts and countries (e.g., Alderson & Hamp-Lyons, 1996; Cheng, 2005; Hawkey, 2006; Shohamy et al., 1996; Wall, 2005; Wall & Horák, 2006; Watanabe, 1996, 2000). In many studies, washback has been shown to be a complex phenomenon (Cheng, 2005; Tsagari, 2009; Wall, 2005), which may be affected by other factors such as teacher or test characteristics (Alderson & Hamp-Lyons, 1996; Green, 2007; Watanabe, 1996, 2004). Washback can take a positive or negative direction (Bachman & Palmer, 1996; Bailey, 1996; Messick, 1996), and it can be intended or unintended by those who introduce or reform a test (Cheng, 2008; Qi, 2004). Washback can be strong or weak, general or specific, and may exist for a short or long period of time (Watanabe, 2004). Some scholars (Messick, 1996; Shohamy et al., 1996) see positive washback as evidence of test validity (consequential validity), whereas others argue that tests with high or low validity can lead to both positive and negative washback, thus questioning the existence of a direct link between washback and validity (Alderson, 2004; Alderson & Wall, 1993; Bailey, 1996).

Washback on Classroom-Based Assessment (CBA)

Even though Wall & Alderson (1993), Wall & Horák (2006), and Watanabe (2000) stressed the importance of investigating how external tests affect assessment in the classroom, few studies in the field of LTA have addressed this area. One early study found that "multiple-choice tests only have limited influence on classroom test format" in the Netherlands (Wesdorp, 1982, p. 51). The Sri Lankan washback study by Wall & Alderson (1993) and Wall (2005), on the other hand, revealed that the newly introduced O-level exam at the end of secondary schooling had both negative and positive effects on teachers' CBA. In designing tests, teachers stopped testing listening and speaking skills as these were not part of the exam (negative washback), and gave reading and writing skills more attention than grammar (positive washback). They used the same item or task types as in the exam for reading and writing tests, which was judged to be "positive when these have also appeared in the textbook, but negative when they have not and when certain types are over-used" (Wall & Alderson, 1993, p. 66). In addition, teachers copied questions and passages directly from past papers (negative washback). A study by Tsagari (2009) found that teachers in Greek FCE courses chose skill areas and test formats for their classroom tests that mirrored the exam. In the Austrian context, two studies have found evidence of washback of its reformed school-leaving exam on CBA. In research pertaining to the same project as this chapter, Froetscher (2013, 2014) analyzed class test reading tasks using a specially designed instrument. A comparison between tasks used pre and post the introduction of the reformed school-leaving exam showed test method alignment to the new exam as well as an increase in task quality characteristics such as instructions, example items, or distracters. Kremmel et al.'s (2013) similar comparative analysis of class test writing tasks showed a convergence to school-leaving exam task types, a change in the communicative functions targeted, as well as a rise in task quality characteristics such as authenticity or specification of readership and expected task type. Further evidence for washback on CBA can be found in the general education literature. Most of these studies reflect on the implementation of the No Child Left Behind Act in the United States in 2002, which "requires all states in the nation to set standards for the grade-level achievement and to develop a system to measure the progress of all students" (Mertler, 2010, p. 3). Although the studies differ widely in terms of the geographical area in which they are situated, the subjects, and the levels of education investigated, there are striking commonalities in their results: with the exception of one study (Mabry et al., 2003), the studies found washback of a standardized test on the task types used by teachers in CBA (see Abrams et al., 2003; McMillan et al., 1999; Mertler, 2010; Stecher et al., 1998). Washback on CBA in general education is also mediated by factors such as teachers' assessment literacy (Tierney, 2006) and whether the test is high-stakes or low-stakes (Abrams et al., 2003). The former finding is particularly relevant in the light of Tsagari & Vogt's study (2014), which showed that European language teachers feel they lack assessment literacy across the areas Classroom-focused LTA, Purposes of testing, and Content and Concepts of LTA.

Methods Used in Washback Research

Most washback studies adopt a mixed-method data collection approach, which increases the validity of washback research (Tsagari, 2009). Watanabe (2004, p. 20) explains that "there is no single correct methodology." He suggests researchers take the following steps to arrive at the most appropriate methodology for their study: identify the problem, describe the context, identify and analyze the potentially influential test, produce predictions, and then design the research. The two most frequent methods in washback research are questionnaires and interviews with teachers or students, and often a combination of the two (e.g., in Qi, 2005). It is important to stress, however, that questionnaire and interview data rely on self-reports, and therefore, results need to be treated with caution. Indeed, where self-report data were combined with classroom observation, it was found that what teachers say they do is not always accurate (Wall & Alderson, 1993). Therefore, a third method in washback studies is classroom observation. It offers the researcher the possibility to directly observe what is happening in the classroom without a filter through teachers or learners (Watanabe, 2004). A further method in washback studies is the analysis of documents such as test tasks or textbooks (Bonkowski, 1996; Hawkey, 2006; Saville & Hawkey, 2004; Tsagari, 2009). An important consideration in the design of washback studies regards the timing of data collection. Studies of a newly introduced exam should collect baseline data that "describe an educational context before the introduction of an innovation intended to cause change" and "can serve as a point of comparison [. . .] to determine whether change has indeed occurred" (Wall & Horák, 2007, p. 99).


Study Context: The Reformed Austrian School-leaving Examination

The Austrian school-leaving examination. This washback study was conducted in the context of the national Austrian secondary school-leaving examination (hereafter, "the Matura"). The Matura is a compulsory set of exams taken at the end of secondary schooling. Depending on the focus and duration of the upper-secondary school (four-year general or five-year vocational schools), this will be in Year 12 or Year 13 of schooling, usually at the age of eighteen or nineteen. As an examination that represents a university entrance qualification, the Matura is high stakes. The Matura has a written component followed by an oral component; this study focuses on the former. The written Matura consists of a series of three to four exams taken within a week. The exact subject combination depends on the school type, school focus, and on the students' choices, but in most cases includes German, Maths, and first and/or second foreign languages (Austrian Ministry of Education, 2012a).

The reform. Although the Matura is high stakes, it was until recently produced, administered, and marked by the individual class teacher. In 2007, the Ministry of Education, BIFIE (the Federal Institute for Education Research, Innovation and Development of the Austrian School System) and the University of Innsbruck launched a project to develop a standardized national Matura for English and French. This standardized Matura was administered in general secondary schools from 2008 on a voluntary basis. After several successful administrations, in 2010, a national exam reform for both general and vocational schools was passed by parliament. This reform called for the core subjects traditionally taken in the written Matura to be standardized: German, first and second foreign languages, Maths, Latin, and Ancient Greek. In 2013, the first voluntary administration of the standardized Matura in vocational schools took place. In 2015, the new examination became compulsory for all general secondary schools, and in 2016, it will be compulsory for vocational secondary schools (Austrian Ministry of Education, 2012a).

Foreign languages. English is a compulsory subject in all Austrian secondary schools. In some schools, students also pick from a choice of two or more second foreign languages. Foreign language instruction in Austria is based on a relatively new curriculum (introduced in 2004), which states a target level of language ability as defined in the Common European Framework of Reference for Languages (CEFR, Council of Europe, 2001) for listening, reading, speaking, and writing for each grade (Austrian Ministry of Education, 2004). While the old written Matura for English covered listening and writing only, and for the second foreign languages covered writing only (Austrian Ministry of Education, 1990), the reformed exam follows the four-skill approach of the 2004 curriculum. The new standardized written Matura developed for English, French, Italian, and Spanish consists of listening, reading, language in use and writing papers. As can be seen in Table 3.1, the CEFR levels targeted in the foreign languages are B1 and B2, depending on the duration of instruction a school offers. The language in use paper is only administered in general schools (BIFIE, 2015).


TABLE 3.1 CEFR Level Targeted by Foreign Language Matura (BIFIE, 2015)

First foreign language (English), general schools (8 years of instruction): Listening B2; Reading B2; Language in use B2; Writing B2
First foreign language (English), vocational schools (9 years of instruction): Listening B2; Reading B2; Language in use –; Writing B2
Second foreign language (French, Italian, Spanish), general schools (4 years of instruction): Listening B1; Reading B1; Language in use B1; Writing B1
Second foreign language (French, Italian, Spanish), general schools (6 years of instruction): Listening B1; Reading B2; Language in use B1; Writing B1
Second foreign language (French, Italian, Spanish), vocational schools (5 years of instruction): Listening B1; Reading B1; Language in use –; Writing B1

TABLE 3.2 Test Methods and Task Types Used (BIFIE, 2014a, 2014b)

Listening: multiple choice; multiple matching; short answer questions
Reading: multiple choice; multiple matching; short answer questions; true/false with justifications
Language in use: multiple choice; banked gap-fill; open gap-fill; editing; word formation
Writing: essay; article; report; email/letter; blog comment/blog post; leaflet

The foreign language Matura is also standardized in terms of the task types and test methods it includes. The overview in Table 3.2 shows the test methods used for the different skills. Examples can be viewed on the BIFIE website under http://www.bifie.at. The new Matura undergoes quality control mechanisms at various stages of the test development cycle. Tasks are produced by trained item writers and moderated by language testing experts. Psychometric data from field tests as well as judgment data from standard setting panels are collected to ensure all tasks comply with good practice in LTA as described in the EALTA guidelines (BIFIE, 2013; EALTA, 2006).

Focus on Class Tests

Under the old Matura, the setup and content of class tests were completely in teachers' hands. Generally, in the last year of secondary school, class tests were aligned with the writing-dominated old Matura examination. In 2012, the Ministry of Education passed laws stating that in preparation for the standardized Matura, class tests needed to include the skills tested in the Matura, and could include recommended standardized test methods (Austrian Ministry of Education, 2012b).


The following year, guidelines for the construction of upper secondary class tests were published in order to ensure optimal preparation for the new Matura (Austrian Ministry of Education, 2013). The existence of these documents and their use of the term standardized class test indeed suggest intended washback of the new Matura on class tests, thus enforcing the presence of all four skills (listening, reading, language in use, and writing) as well as the Matura test methods. The study reported in this chapter is part of a two-phase multimethod project addressing the research gap regarding washback on CBA. It focuses on the washback of the new Austrian Matura on summative classroom-based tests. These class tests are required in the curriculum for foreign language teaching and form an important component of CBA in that they account for a large part of the semester and end-of-year grade. Typically, they take place twice a semester with a duration of up to three consecutive lessons or 150 minutes (Austrian Ministry of Education, 2004). The research project looks specifically at washback on the classroom-based assessment of reading, and employs both qualitative and quantitative data collection and analysis methods (see Table 3.3). Following the task analyses in phase 1 (Froetscher, 2013, 2014), a pilot study explored, from the teachers' point of view, whether washback on classroom tests is present. This pilot data is reported in this chapter. In the second phase of the project, this thread will be taken further. Using interviews, phase 2 will investigate teachers' approaches and strategies in designing and selecting reading class test tasks.

TABLE 3.3 The Larger Research Project

Phase 1. Focus: characteristics of class-test reading tasks pre and post the introduction of the new Matura. Instrument: Instrument for the Analysis of Reading Tasks. Data analysis method: task analysis—qualitative; statistical analysis—quantitative.
Pilot. Focus: teachers' perceptions of washback on classroom tests. Instrument: questionnaire. Data analysis method: statistical analysis—quantitative; coding analysis—qualitative.
Phase 2. Focus: teachers' approaches for designing and selecting reading class test tasks. Instrument: interview. Data analysis method: coding analysis—qualitative.

Instruments

For the study presented in this chapter, a questionnaire was chosen as the data collection instrument in the hope of reaching a larger number of participants. This is an important consideration given the qualitative nature of the other phases of the study. The questions focused on class test characteristics potentially subject to washback from the Matura: presence of language skills, test methods employed, input texts, the use of past papers, and the direction (positive/negative) of perceived washback. The questionnaire contained eleven questions, including yes/no questions, 4-point and 5-point Likert-type questions (e.g., 4 points like never—seldom—sometimes—often), as well as open-ended questions. Open-ended questions offer the possibility for "authenticity, richness, depth of response, honesty and candour" in responses that might otherwise not be caught (Cohen et al., 2007, p. 330). The questionnaire was piloted with a small group of three colleagues and subsequently revised. As the implementation of the Matura reform had already begun, it was unfortunately not possible to collect baseline data. Baseline information could therefore only be elicited by asking the participants to think of their classroom tests before the standardized Matura and to compare them with their present work. The questionnaire (translated from German) can be found in the Appendix. The questionnaire was set up using LimeSurvey (LimeSurvey Project Team et al., 2012) and administered online through a link sent to participants. It was embedded as a washback subsection in a larger teacher questionnaire developed by BIFIE, accompanying the standardized test papers delivered to schools for the May 2013 Matura. The advantage of this approach was that the entire population of teachers teaching a standardized Matura language class in 2013 could be contacted. The drawbacks, however, were the restricted length of the washback subsection, its necessary similarity with the other sections, and the length of the total questionnaire. The total questionnaire consisted of twenty-one sections dealing with aspects of the exam paper, marking, training, and implementation measures.

Participants While 1,514 teachers received the questionnaire, only 196 completed it, a disappointingly low response rate of 13 percent. Additionally, the respondents did not necessarily complete all the questions. This resulted in varying response rates for each question. The length of the total questionnaire is likely to have contributed to this low response rate. To contextualize the results in terms of the response rate for each question, frequencies are reported in rounded valid percentages, giving the number of respondents in brackets. The gender distribution of the respondents—84 percent were female; 16 percent, male (N = 177)—reflects the gender distribution of teachers in Austrian secondary school language classrooms. The participants had a teaching experience of three years minimum and forty-two years maximum (Mean = 21; Mode = 30; N = 177). Table 3.4 gives an overview of their teaching experience. The majority were teachers of English (77%, N = 193), which reflects the fact that there are more English teachers in Austria. A vast majority, 80 percent, were general school teachers, while 20 percent were vocational school teachers (N = 193). Table 3.5 shows details of their school type and the language. (Note that in 2013, vocational test versions had not yet been introduced for the second foreign languages French, Italian, and Spanish.) Bearing in mind that in general schools, the standardized Matura had been available on a voluntary basis since 2008, Table 3.6 shows that in 2013 almost half of the respondents had prepared students for the standardized Matura for the first

68

CONTEMPORARY SECOND L ANGUAGE ASSESSMENT

TABLE 3.4 Teaching Experience n

%

1–9 years

36

20%

10–19 years

41

23%

20–29 years

46

26%

30+ years

54

31%

177

100%

Total

TABLE 3.5 Language and School Type Taught n

%

38

20%

English general

111

58%

French general

29

15%

Italian general

9

5%

Spanish general

6

3%

193

100%

English vocational

Total

TABLE 3.6 Number of Standardized Maturas

                  n      %
1–2 Maturas      82    47%
3–4 Maturas      49    28%
5–6 Maturas      45    26%
Total           176   100%

Bearing in mind that in general schools the standardized Matura had been available on a voluntary basis since 2008, Table 3.6 shows that in 2013 almost half of the respondents had prepared students for the standardized Matura for the first or second time. About a quarter had already prepared students for the standardized Matura three or four times, and yet another quarter five or six times (N = 176). This suggests that Austrian schools and teachers had been quite quick to adopt the standardized Matura for the foreign languages, with relatively few teachers choosing to do so only when it was clear that it would become compulsory in 2015 and 2016.


While 83 percent reported having received training regarding the new Matura or teaching development, 17 percent indicated they had not (N = 181).

Results

Changes in Class Tests

The vast majority of the respondents, 91 percent, indicated that they changed their class tests because of the new Matura (N = 178). The reasons given by those teachers who had not changed their class tests fell into three categories:

● the skills tested in the Matura had already been present in their class tests (n = 8)
● test methods like in the Matura had already been present in their class tests (n = 2)
● they had not yet taught before the introduction of the new Matura (n = 2)

Although these responses indicate no change, all of them seem to refer to the Matura as a model in some sense.

Changes in the Presence of Skills

The vast majority of the respondents (91%) indicated that they changed the proportionate presence of skills in their class tests (N = 160). The follow-up question about how they changed this reveals that the majority increased the proportion of listening and reading, and decreased that of writing (N = 142). The proportion of language in use was increased by about two-thirds of the participants (N = 115; note that this question was only answered by general school teachers). The results are illustrated in Figure 3.1. The responses to question 3 ("Please briefly describe what you changed") confirm these results and further illustrate the change (Table 3.7). One participant's response puts the change regarding skills in a nutshell: "always testing all skills; balanced weighting of the skills, moving away from over-weighting of writing." Some respondents (n = 6) indicated that they set aside separate grammar tasks in favor of language in use tasks, for example, "not testing single grammar chapters separately, but mixed as general language in use task."

Changes in Test Methods Used

More than three-quarters of the participants (79%) indicated that they changed the test methods used in class tests (N = 159). Table 3.8 presents the types of test method changes reported. A strikingly high number of teachers said that they adopted the Matura test methods for their class tests.


FIGURE 3.1 Changes in the Presence of Skills.

TABLE 3.7 Changes in the Presence of Skills

Inclusion of all skills (n = 43)
  "I ALWAYS make room for listening, reading, and language in use in addition to writing."
  "inclusion of all areas (listening wasn't part of my class tests before)"
Balanced weighting of the skills (n = 21)
  "weighting of 25% each for listening, reading, language in use, and writing"
More value on receptive skills (n = 21)
  "stronger focus on listening and reading"
Less value on writing (n = 20)
  "less importance of written production"
More value on language in use (n = 14)
  "Before, there was no language in use."

Changes in Texts Used

Just under two-thirds of the participants (64%) indicated that they changed the texts they use for listening and reading tasks in class tests (N = 158). In response to the question about what they changed, however, few teachers mentioned text characteristics. The majority repeated their responses to the preceding question regarding test methods, indicating that they used Matura-like test methods (n = 42). The respondents whose comments related to texts (summarized in Table 3.9) reported, for example, that they changed the topics, used more difficult texts, used more texts than before, and no longer developed the tasks themselves.


TABLE 3.8 Changes in Test Methods Used

Change toward Matura test methods (n = 60)
  "I systematically try to cover all the Matura test methods."
  "use of test methods that are Matura-like or similar"
More closed question types (n = 5)
  "for listening and reading no more questions that are answered by students in their own words; rather, task types like multiple choice, multiple matching"
Other kinds of writing task types (n = 3)
  "Writing task has a particular 'format'; that was not the case before"
No more integrated tasks (n = 1)
  "listening and reading often used to be combined with a writing task"

TABLE 3.9 Changes in Texts Used (n)

Other topics than before                                 4
More difficult texts                                     3
More texts                                               3
No longer search for text and develop task themselves    3
More authentic listening texts                           1
Topic less important                                     1
Shorter texts                                            1
Replace difficult vocabulary in reading texts            1

Changes in Writing Tasks

Several responses to question 3 ("Please briefly describe what you changed.") indicate washback for writing; this is in line with the earlier reported result that writing gave way to a more balanced testing of all skills. Table 3.10 shows the changes in writing class test tasks. Regarding the increased focus on task types, one teacher explained: "For writing, I now always keep an eye on the task types, which was also not the case in the past."


TABLE 3.10 Changes in Writing Class Test Tasks (n)

Changed text types                    10
Bullet points in prompt                7
Decreased word length                  5
Use of Matura assessment scale         2
More focus on writing task types       2

Perception of Changes

The majority of the respondents felt positively about their changes in class tests: 38 percent considered them entirely positive, 58 percent partly positive, and only 3 percent not positive at all (N = 160). Respondents' criticism of the changes included:

● too high volume of class tests (amount of photocopying and paper) (n = 3),
● restriction in variety of topics and ideas in writing tasks (n = 3),
● increased time and effort needed (n = 2).

Use of BIFIE Tasks

As the institution responsible for the development and implementation of the new Matura, BIFIE regularly publishes practice tasks and selected past papers of the exam. These tasks reflect the demands the students will face when they take the Matura. Participants reported a frequent use of BIFIE tasks. For listening, reading, and language in use, BIFIE tasks were used sometimes to often by the majority (N = 177; N for language in use = 139). Writing tasks, on the other hand, were used seldom to sometimes (N = 175). The frequency of use of BIFIE tasks is illustrated in Figure 3.2.

FIGURE 3.2 Use of BIFIE Tasks across Skills.

Mediating Factors

To investigate differences between subgroups of the sample, cross-tabulations with Chi-Square tests or Mann-Whitney-U tests (for ordinal dependent variables) were performed. While no significant associations were found regarding participants' sex, language taught, and teaching experience, significant differences regarding training received, Matura experience, and school type emerged. It should be noted, however, that distributions among the subgroups are unequal (e.g., 83% reported having received training, while 17% did not). The unequal distributions resulted in expected counts of less than 5 in some of the Chi-Square tests, which may undermine the validity of the results (Cohen et al., 2007). However, all of the cross-tabulations have an expected count above 3.

Training received

Teachers who had received training regarding the new Matura or teaching development were more likely to change the test methods in their class tests. Table 3.11 reports the results of a Chi-Square test, giving the Pearson Chi-Square value (χ2), the degrees of freedom (df), the level of significance (p), and a measure of the effect size (Cramer's V). The values show that there is a highly significant association between Training received and Change in test methods. Cramer's V, however, indicates that the effect size is small.
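To make the reported statistics concrete, the sketch below shows how a Pearson chi-square and Cramer's V can be computed for a cross-tabulation of this kind. The 2 x 2 counts and the use of Python with scipy are illustrative assumptions only; they are not the study's raw data or the software the author used.

```python
# Hypothetical cross-tabulation: rows = training received (yes/no),
# columns = changed test methods (yes/no). Counts are made up for illustration.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[110, 22],
                     [ 15, 12]])

# Pearson chi-square without continuity correction, as is usual for reporting chi2 itself
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# Cramer's V: effect size for an r x c contingency table
n = observed.sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}, Cramer's V = {cramers_v:.3f}")
```

For a 2 x 2 table, Cramer's V reduces to the phi coefficient, which is why values around .2-.3, as in Table 3.11, are conventionally read as small effects.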

Teachers' Matura experience

This variable indicates how often a teacher had taught classes taking the standardized Matura since its introduction in 2008. Cross-tabulations revealed more washback in the less experienced teacher groups. First, a significantly higher proportion of teachers with little Matura experience changed their class tests, compared to their colleagues with more Matura experience. Second, changes in the presence of listening tasks in class tests also seemed to be linked to teachers' Matura experience. Teachers with less Matura experience showed a significantly higher increase of listening in their class tests. Table 3.11 reports the Chi-Square test results. The values show that there is a significant but weak association between Matura experience and Class test change, and between Matura experience and Change in presence of listening.


School type

An analysis of differences between the responses of vocational and general school teachers is particularly relevant because the exam reform started several years earlier in general schools (see the Context section), and therefore washback might have started earlier and might differ. First, more teachers from general schools (95%) than from vocational schools (78%) reported that they changed their class tests. Table 3.11 gives the Pearson Chi-Square test results for a cross-tabulation of School type and Class test change. From these data, it can be concluded that there is a highly significant but weak association between School type and Class test change. Second, there are also differences between general and vocational school teachers regarding change in the presence of listening tasks (but not reading and writing). Vocational school teachers reported a greater increase in the presence of listening in their class tests than general school teachers. On the other hand, general school teachers reported a more frequent use of BIFIE tasks than vocational school teachers. Mann-Whitney-U tests indicate that these differences are statistically significant. However, the strength of association, as indicated by the small effect size, is weak. The results of the Mann-Whitney-U tests are reported in Table 3.12, giving the U value (U), the Z-score (Z), the level of significance (p), and a measure of the effect size (r).
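The sketch below illustrates how a Mann-Whitney U test and the effect size r = Z / sqrt(N) reported in Table 3.12 can be obtained. The two samples of ordinal ratings are hypothetical stand-ins, not the study's data, and scipy is an assumption about tooling; the normal approximation here ignores tie corrections for simplicity, so the Z value may differ slightly from statistical packages.

```python
# Hypothetical ordinal ratings (e.g., frequency of BIFIE task use on a 1-4 scale)
# for two groups of teachers; values are invented for illustration.
import numpy as np
from scipy.stats import mannwhitneyu

vocational = np.array([4, 4, 3, 4, 2, 3, 4, 3, 4, 4, 3, 2])
general    = np.array([3, 2, 3, 2, 1, 3, 2, 2, 3, 1, 2, 3, 2, 2])

u_stat, p_value = mannwhitneyu(vocational, general, alternative="two-sided")

# Normal approximation of U to obtain a Z-score, then r = Z / sqrt(N)
n1, n2 = len(vocational), len(general)
mu_u = n1 * n2 / 2
sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # tie correction omitted
z = (u_stat - mu_u) / sigma_u
r = abs(z) / np.sqrt(n1 + n2)

print(f"U = {u_stat:.1f}, Z = {z:.2f}, p = {p_value:.3f}, r = {r:.2f}")
```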

TABLE 3.11 Chi-Square Test Results

                                                        χ2    df    p     Cramer's V
Training received * Change in test methods            15.06   1   .000      .309
Matura experience * Class test change                   8.48   2   .014      .220
Matura experience * Change in presence of listening     8.69   2   .013      .248
School type * Class test change                        10.54   1   .001      .243

TABLE 3.12 Mann-Whitney-U Test Results

                                                U        Z      p      r
School type * Presence of listening           1033     −2.61  0.009  0.20
School type * Use of BIFIE tasks listening    1914     −2.72  0.007  0.20
School type * Use of BIFIE tasks reading      1988.5   −2.16  0.031  0.16
School type * Use of BIFIE tasks writing      1976.5   −2.02  0.043  0.15


Summary and Discussion

It was surprising to see how many teachers reported changing their class tests. The changes seem to have been directly caused by the new Matura and not its underlying curriculum, which had been in effect since 2004. This seems to support the measurement-driven approach to educational reform that sees high-stakes tests as "curricular magnets" (Popham, 1987, p. 681). Participants reported changes regarding the skills represented and the test methods used in class tests, where the characteristics of the Matura are adopted. Teachers also heavily used tasks provided by BIFIE. These results confirm findings of the few studies in LTA addressing this kind of washback (Froetscher, 2013, 2014; Kremmel et al., 2013; Tsagari, 2009; Wall, 2005; Wall & Alderson, 1993) as well as findings from general education research (Abrams et al., 2003; McMillan et al., 1999; Mertler, 2010; Stecher et al., 1998). This study also confirms the complexity of washback and the existence of mediating factors.

Skills tested

Teachers reported substantial changes in class tests regarding the skills represented. They started to include listening and reading, but also (for general school teachers) language in use in their class tests, and to use the Matura test methods for these skills. For the same skills, teachers heavily used BIFIE tasks in their class tests. Writing, on the other hand, has decreased in importance in class tests; the other language skills are now on par with writing, as stipulated in the curriculum. In line with its decreased presence in class tests, writing seems to show a lesser extent of change: comparatively few participants reported changes in writing tasks or use of BIFIE writing tasks in their class tests. The latter could also be linked to the central role writing had until recently and, consequently, to teachers' experience in designing writing tasks. The changes in writing tasks described by this study's participants correspond with the findings of Kremmel et al.'s (2013) writing task analysis.

Beyond test methods

While participants reported strong washback of the new Matura on the test methods they use, few referred to other characteristics of tasks. There were many inadequate answers to question 9 ("Please briefly describe what you changed regarding the input texts used for listening and reading in your class tests."). Most of the respondents focused on test methods and only a few mentioned characteristics such as length, difficulty, topics, authenticity, or the reading/listening behaviors targeted. This could indicate that the participants have a somewhat superficial understanding of what the test is about, in which the test method seems the most important characteristic. This would suggest that the ministry, BIFIE, and teacher training institutions should foster teachers' LTA literacy and awareness. Assessment-literate teachers, aware of underlying characteristics beyond test methods, will consciously choose tasks for class tests because they are fit for purpose.

Direction of washback

The washback identified can be seen as partly positive and partly negative. A first positive aspect is that teachers test a fuller spectrum of language skills where


previously there was a focus on writing. This makes class tests more content valid. Second, the mirroring of Matura test methods in class tests contributes to students’ familiarity with these methods, which could reduce their anxiety as well as possible test method bias toward the Matura. Third, although Wall & Alderson (1993) see the copying of past papers as negative washback in their Sri Lankan study, I would argue that it also has positive potential in the Austrian context. By using ready-made, professionally developed, standardized test tasks in their class tests, teachers can reduce the pressure of producing (new) class tests every year, and free up time to focus on their other professional duties. On the other hand, an externally caused restriction of class test methods can be seen as negative washback. While the Matura as a high-stakes examination has to observe inherent restrictions, these do not necessarily apply to classroom tests. By focusing too much on Matura test methods and past papers, teachers fail to embrace the wealth of methods and task types available for classroom testing. In addition, similar to “teaching to the test,” a restriction to and overuse of Matura test methods and past papers in class tests could lead to frustration on the part of the students and teachers.

Mediating factors

Some aspects of washback were found to be stronger for some subgroups in this study. Reported change in test methods was more frequent among trained teachers (cf. Tierney, 2006). In addition, teachers who were less experienced in teaching toward the new Matura more frequently reported change in class tests and an increased presence of listening tasks. Even within groups, washback varied. For instance, more vocational school teachers than general school teachers reported an increase in the use of listening tasks in class tests (an indicator of washback), but vocational school teachers did not appear to use BIFIE tasks as frequently as general school teachers (an indicator of less washback).

Limitations and methodological considerations

The limitations of this study are, first, that it relied on teacher self-report data. Second, the response rate of 13 percent is lower than the approximate aim of 30 percent in second language research (Dörnyei & Tatsuya, 2010). However, given that the whole population was contacted, 13 percent lies above the "magic sampling fraction" of 1–10 percent (Dörnyei & Tatsuya, 2010, p. 62). Third, although the sample is representative of the wider teaching population, its size causes limitations in the interpretation of inferential statistics for subgroups, and their results need to be treated with caution. A sample of 100 teachers per subgroup, in this case 500 participants in total, would have been ideal (Dörnyei & Tatsuya, 2010). Fourth, the coding reliability of the open answers hinges on the researcher's correct interpretation of the respondents' answers. Fifth, some open questions used in the questionnaire might not have been clear enough. Question 9, in particular, might not have successfully prompted participants to think of text or task characteristics beyond test methods. Nonetheless, the results of this study offer guidance for the design of further research instruments for tapping deeper into the washback of the Matura on class tests. With a combination of data collection methods, a more complete picture of the complex phenomenon of washback will hopefully be reached.


Conclusion

This study sheds some light on the little-researched area of washback on CBA. It shows, for the Austrian context, the strong influence of the Matura, a high-stakes test, on the way teachers test their students in the foreign language classroom. As LTA or language education professionals, we need to take account of this form of washback, in addition to washback on teaching, learning, and teaching materials. We should aim to promote positive washback on all these areas. Enough teacher training needs to be provided to ensure that, despite practicality constraints, CBA is valid and reliable. From the results of this study, further questions for the larger research project have arisen and will be taken up in the interview phase:

● What characteristics other than test method do teachers consider in preparing their class tests, and how important are they?
● What characteristics of the Matura particularly influence class tests?
● Which aspects of washback on class tests do teachers perceive as positive or negative?
● What role does specific training in LTA play as a mediating factor?
● Are teachers aware of the legal regulations and Ministry guidelines regarding class tests?

On a more general level, the results also suggest that washback on CBA merits further research. This could take a number of directions. First, high-stakes tests for purposes other than school-leaving, such as international language tests, could be investigated. In addition to their high teacher and student numbers, large-scale tests also offer the opportunity for research across countries and contexts. Second, future research in this area of washback should also strive to go beyond summative tests and address assessment in its wider sense, including formative assessment and incidental assessment. Classroom observation coupled with teacher interviews could be useful instruments for studies of this kind. Third, studies could investigate washback on how class tests, not only but particularly for writing, are marked by the teacher. Fourth, we do not yet know how positive washback on CBA could be promoted, or how it varies in the long run. Such research could involve teacher cognition approaches (see Borg, 2006; Glover, 2006). Fifth, it would be interesting to investigate to what extent washback on teaching and washback on CBA occur together and share characteristics. Sixth, a further focus could be the voice of students on how high-stakes exams influence their CBA , and how they prepare for and master these; to this end, diaries or interviews could be used. Finally, following Green’s (2007) investigation into IELTS score gains related to washback on teaching, the relationship between washback on CBA and assessment outcomes could be explored.

References

Abrams, L. M., Pedulla, J. J., & Madaus, G. F. (2003). Views from the classroom: Teachers' opinions of statewide testing programs. Theory into Practice, 42(1), 18–29.


Alderson, J. C. (2004). Foreword. In L. Cheng, Y. Watanabe, & A. Curtis (Eds.), Washback in Language Testing: Research Contexts and Methods. Mahwah, NJ : Lawrence Erlbaum Associates, pp. ix–xii. Alderson, J. C., & Hamp-Lyons, L. (1996). TOEFL preparation courses: A study of washback. Language Testing, 13(3), 280–297. Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14(2), 115–129. Austrian Ministry of Education. (1990). Verordnung des Bundesministers für Unterricht, Kunst und Sport vom 7. Juni 1990 über die Reifeprüfung in den allgemeinbildenden höheren Schulen. Available from: https://www.ris.bka.gv.at/Dokumente/ BgblPdf/1990_432_0/1990_432_0.pdf. [27 July 2015]. Austrian Ministry of Education. (2004). Verordnung der Bundesministerin für Bildung, Wissenschaft und Kultur, mit der die Verordnung über die Lehrpläne der allgemein bildenden höheren Schulen geändert wird. Available from: http://www.bmbf.gv.at/ schulen/unterricht/lp/11668_11668.pdf?4dzgm2. [27 July 2015]. Austrian Ministry of Education. (2012a). Bundesgesetzblatt 174 für die Republik Österreich. Änderung der Prüfungsordnung AHS. Available from: http://www.bmbf.gv.at/ schulen/recht/erk/bgbl_ii_nr_174_2012_22504.pdf?4dzi3h. [27 July 2015]. Austrian Ministry of Education. (2012b). Bundesgesetzblatt 352 für die Republik Österreich. Änderung der Verordnung über die Lehrpläne der allgemeinbildenden höheren Schulen. Available from: https://www.ris.bka.gv.at/Dokumente/BgblAuth/ BGBLA_2012_II_255/BGBLA_2012_II_255.pdf. [27 July 2015]. Austrian Ministry of Education. (2013). Der Weg zur kompetenzorientierten Reifeprüfung. Leitfaden zur Erstellung von Schularbeiten in der Sekundarstufe 2 – AHS. Lebende Fremdsprachen Englisch, Französisch, Italienisch, Spanisch, Russisch. Available from: http://oesz.at/download/Leitfaden_Schularbeiten_NEU_02_2014.pdf. [27 July 2015]. Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford, UK : Oxford University Press. Bailey, K. (1996). Working for washback: A review of the washback concept in language testing. Language Testing, 13(3), 257–279. BIFIE . (2013). Standardisierte Kompetenzorientierte Reifeprüfung/ Reife- und Diplomprüfung. Grundlagen—Entwicklung—Implementierung. Retrieved from https:// www.bifie.at/node/2045. [27 July 2015]. BIFIE . (2014a). AHS SRP 2014/15—Überblick über Testmethoden (lebende Fremdsprachen). Available from: https://www.bifie.at/node/2405. [27 July 2015]. BIFIE . (2014b). BHS SRP 2014/15 – Überblick über Testmethoden (lebende Fremdsprachen). Available from: https://www.bifie.at/node/2404. [27 July 2015]. BIFIE . (2015). Lebende Fremdsprachen. Available from: https://www.bifie.at/node/78. [27 July 2015]. Bonkowski, J. F. (1996). Instrument for Analysis of Textbook Materials. Unpublished manuscript. Department of Linguistics and Modern English Language, Lancaster University, Lancaster, UK . Borg, S. (2006). Teacher Cognition and Language Education: Research and Practice. London, UK : Continuum. Cheng, L. (2005). Changing Language Teaching Through Language Testing: A Washback Study. Cambridge, UK : Cambridge University Press. Cheng, L. (2008). Washback, impact and consequences. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of Language and Education. Volume 7: Language Testing and Assessment (2nd edn.). New York, NY: Springer, pp. 349–364. Cohen, L., Manion, L., & Morrison, K. (2007). Research Methods in Education (revised edn.). Abingdon, UK : Routledge.


Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge, UK : Cambridge University Press. Dörnyei, Z., & Tatsuya T. (2010). Questionnaires in Second Language Research: Construction, Administration, and Processing. New York, NY: Routledge. EALTA . (2006). EALTA Guidelines for Good Practice in Language Testing and Assessment. Available from: http://www.ealta.eu.org/guidelines.htm. [27 July 2015]. Froetscher, D. (2013). A national exam’s washback on reading assessment in the secondary classroom. Paper presented at the EALTA Conference, Istanbul, Turkey. Froetscher, D. (2014). Reading assessment in the secondary classroom: Washback by a national exam. Paper presented at the EALTA CBLA SIG Meeting, Warwick, UK . Glover, P. (2006). Examination Influence on How Teachers Teach: A Study of Teacher Talk (Doctoral thesis). Lancaster University, UK . Green, A. (2007). IELTS Washback in Context: Preparation for Academic Writing in Higher Education. Cambridge, UK : Cambridge University Press. Hamp-Lyons, L. (1997). Washback, impact and validity: Ethical concerns. Language Testing, 14(3), 295–303. Hawkey, R. (2006). Impact Theory and Practice: Studies of the IELTS Test and Progetto Lingue 2000. (Studies in Language Testing, Volume 24). Cambridge, UK : Cambridge University Press. Hughes, A. (1989). Testing for Language Teachers. Cambridge, UK : Cambridge University Press. Kremmel, B., Eberharter, K., Konrad, E., & Maurer, M. (2013). Righting writing practices: The impact of exam reform. Poster presented at the EALTA Conference, Istanbul, Turkey. LimeSurvey Project Team & Schmitz, C. (2012). LimeSurvey: An Open Source Survey Tool. Hamburg, Germany. Available from: http://www.limesurvey.org. [27 July 2015]. Mabry, L., Poole, J., Redmond, L., & Schultz, A. (2003). Local impact of state testing in southwest Washington. Education Policy Analysis Archives, 11(22), 1–35. McMillan, J., Myran, S., & Workman, D. (1999). The impact of mandated statewide testing on teachers’ classroom assessment and instructional practices. Paper presented at the American Educational Research Association Annual Meeting, Montreal, Canada. Mertler, C. A. (2010). Teachers’ perceptions of the influence of No Child Left Behind on classroom practices. Current Issues in Education, 13(3), 1–35. Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256. Popham, W. J. (1987). The merits of measurement-driven instruction. Phi Delta Kappan, 68(9), 679–682. Qi, L. (2004). Has a high-stakes test produced the intended changes? In L. Cheng, Y. Watanabe, & A. Curtis (Eds.), Washback in Language Testing: Research Contexts and Methods. Mahwah, NJ : Lawrence Erlbaum Associates, pp. 171–190. Qi, L. (2005). Stakeholders’ conflicting aims undermine the washback function of a highstakes test. Language Testing, 22(2), 142–173. Saville, N., & Hawkey, R. (2004). The IELTS impact study: investigating washback on teaching materials. In L. Cheng, Y. Watanabe, & A. Curtis (Eds.), Washback in Language Testing: Research Contexts and Methods. Mahwah, NJ : Lawrence Erlbaum Associates, pp. 73–96. Shohamy, E. (1999). Language testing: impact. In B. Spolsky & R. Asher (Eds.), Concise Encyclopedia of Educational Linguistics. Oxford, UK : Pergamon, pp. 711–714. Shohamy, E., Donitsa-Schmidt, S., & Ferman, I. (1996). Test impact revisited: Washback effect over time. Language Testing, 13(3), 298–317.


Stecher, B. M., Barron, S. L., Kaganoff, T., & Goodwin, J. (1998). The Effects of Standards-Based Assessment on Classroom Practices: Results of the 1996–1997 RAND Survey of Kentucky Teachers of Mathematics and Writing. Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing. Tierney, R. (2006). Changing practices: Influences on classroom assessment. Assessment in Education: Principles, Policy and Practice, 13(3), 239–264. Tsagari, D. (2009). The Complexity of Test Washback (Language Testing and Evaluation Volume 15). Frankfurt am Main: Peter Lang. Tsagari, D., & Vogt, K. (2014). Assessment literacy of foreign language teachers: findings of a European study. Language Assessment Quarterly, 11(4), 374–402. Wall, D. (1996). Introducing new tests into traditional systems: Insights from general education and from innovation theory. Language Testing, 13(3), 334–354. Wall, D. (1997). Impact and washback in language testing. In D. Corson & C. Clapham (Eds.), Encyclopedia of Language and Education. Volume 7: Language Testing and Assessment. Dordrecht: Kluwer, pp. 291–302. Wall, D. (2005). The Impact of High-Stakes Examinations on Classroom Teaching: A Case Study Using Insights from Testing and Innovation Theory (Studies in Language Testing Volume 22). Cambridge, UK : Cambridge University Press. Wall, D., & Alderson, J. C. (1993). Examining washback: The Sri Lankan impact study. Language Testing, 10(1), 41–69. Wall, D., & Horák, T. (2006). The Impact of Changes in the TOEFL® Examination on Teaching and Learning in Central and Eastern Europe: Phase 1, the Baseline Study. Princeton, NJ : Educational Testing Service. Wall, D., & Horák, T. (2007). Using baseline studies in the investigation of test impact. Assessment in Education: Principles, Policy and Practice, 14(1), 99–116. Watanabe, Y. (1996). Does grammar translation come from the entrance examination? Preliminary findings from classroom-based research. Language Testing, 13(3), 318–333. Watanabe, Y. (2000). Washback effects of the English section of the Japanese university entrance examinations on instruction in pre-college level EFL . Language Testing Update, 27, 42–47. Watanabe, Y. (2004). Methodology in washback studies. In L. Cheng, Y. Watanabe, & A. Curtis (Eds.), Washback in Language Testing: Research Contexts and Methods. Mahwah, NJ : Lawrence Erlbaum Associates, pp. 19–36. Wesdorp, H. (1982). Backwash effects of language-testing in primary and secondary education. Journal of Applied Language Study, 1(1), 40–55.


Appendix 1: Questionnaire

Please think of class tests which you prepared before the standardised school-leaving exam, and compare them with your current work.

1 Did you change your class tests because of the standardised school-leaving exam? (yes/no)
2 Why was there no need to change your class tests? (open)
3 Please briefly describe what you changed. (open)
4 Did you change the presence of the language skills in the class tests? (yes/no)
5 How did you change the presence of the skill Listening/Reading/Writing/Language in Use? The presence now is . . . (much more/slightly more/about the same/slightly less/much less)
6 Did you change the test methods that you use in your class tests? (yes/no)
7 Please briefly describe what you changed. (open)
8 Did you change the input texts used for listening and reading in your class tests? (yes/no)
9 Please briefly describe what you changed. (open)
10 How do you perceive these changes? (entirely positively/partly positively/not at all positively)
11 Do you use practice tasks and past papers from BIFIE for listening/reading/language in use/writing in your class tests? (never/seldom/sometimes/often)


4 Linking to the CEFR: Validation Using a Priori and a Posteriori Evidence

John H.A.L. de Jong and Ying Zheng

ABSTRACT

Linking tests to international standards, such as the Common European Framework of Reference for Languages: learning, teaching, assessment (CEFR, Council of Europe, 2001), is a way of establishing criterion-referenced validity. This chapter reports on how CEFR scales were operationalized in practice in the course of developing the Pearson Test of English Academic. Measures to link the test to the CEFR were studied at different stages of test development. A posteriori statistical evidence was also collected from both field tests and live tests. Field test data were used to establish the extent to which scores from this test can be linked to the CEFR, which involved both a test-taker-centered approach and an item-centered approach.

Research Background

Achieving test validity is an essential concern in test development, particularly when a test is used for high-stakes purposes. However, as Messick (1992) commented, "Many test makers acknowledge a responsibility for providing general validity evidence of the instrumental value of the test, but very few actually do it" (p. 18). More recently, Weir (2005) reported that, while most examinations claim different aspects of validity, they often lack validation studies of actual tests that demonstrate evidence to support inferences from test scores.


Messick’s (1995) unified view of validity predicated that validity is a multifaceted concept, which can only be established by integrating considerations of content, criteria, and consequences into a comprehensive framework for empirically testing rational hypotheses about score meaning and utility. It is widely recognized that the validation process should start from the very beginning of test development. Schilling (2004) maintained that, in addition to a posteriori validity evidence (which traditionally focused on scoring validity, criterion-related validity, and consequential validity); a priori validity evidence (such as test design decisions and the evidence that supports these decisions) also makes a significant contribution to the establishment of validity. Similarly, Weir (2005) highlighted the importance of a priori validity evidence when he stated that “the more fully we are able to describe the construct we are attempting to measure at the a priori stage, the more meaningful might be the statistical procedures contributing to construct validation that can subsequently be applied to the results of the test” (p.  18). The reason is that the statistical analysis at the a posteriori stage does not generate conceptual labels by themselves, and therefore, to make the scores meaningful, the test developers can never escape from the need to define what is being measured at the beginning of test development. The Common European Framework of Reference for Languages (CEFR , Council of Europe, 2001) has had a major impact on language education worldwide. Byram & Parmenter (2012) noted that the CEFR has not only helped develop both strategic language policy documents and practical teaching materials, but it has also become the most reliable reference for curriculum planning. Linking tests to international standards such as the CEFR is a way of establishing criterion-referenced validity. As is widely acknowledged, validation is a continuous process of quality monitoring (AERA , APA , and NCME , 2014). The central concept involved in relating contextualized examinations to the CEFR is validity (North et al., 2010), including (1) internal validity, which is the quality of the test in its own right, comprising of content validity, theory-based validity, and scoring validity (Weir, 2005); (2) external validity or concurrent validity, which adds value to the test by relating it to an external framework such as the CEFR or by using expert judgments on a CEFR panel, either inside or outside the testing organization (Papageorgiou, 2007). The validation framework, as a basis, should be embedded from the starting point of test development to later stages of test administration or test data analyses in order to provide both a theoretical perspective and a practical process for generating validity evidence. There has been a plethora of empirical studies concerning the establishment of this type of linking argument in recent years. Martyniuk’s (2010) book gathered a series of studies that looked into linking a single test to the CEFR as well as linking a suite of exams to the CEFR . The majority of these studies undertake the systematic stages of familiarization, specification, standardization and empirical validation as recommended by the Council of Europe (2009). For example, Kantarcioglu et  al. (2010) reported on a study linking the Certificate of Proficiency in English (COPE ) to the CEFR B2 level. 
The study closely followed the guidelines described in the preliminary pilot version of the publication usually referred to as the “Manual” (Council of Europe, 2003) and all four interrelated stages were undertaken. The study involved familiarization with activities suggested in the Manual, supplemented


by a series of in-house quizzes. They used a graphical profile in the specification stage and the examinee-paper selection method was employed for the writing paper. The Angoff and the YES /NO method were used for the reading and listening papers in the standardization stage, to establish the reliability of the cut-off score. Evidence was gathered from the live exam, teacher judgments, and a correlation study that was used to establish the validity of COPE . In addition, O’Sullivan’s (2010) study provided empirical evidence that aimed to confirm the link between a single test, the City and Guilds’ Communicator, and the CEFR B2 level by following the stages of familiarization, specification, standardization, and empirical validation. Other researchers also used similar procedures to establish the criterion-referenced validity of the test(s) they are examining and wishing to link to the CEFR . For example, Downey & Kollias (2010) attempted to link the Hellenic American University’s Advanced Level Certificate in English examination (ALCE ) to the CEFR . The project executed the first three stages proposed in the Manual and designed an empirical validation stage for future research. The project involved an in-depth familiarization with the content and levels of the CEFR , followed by the mapping of the ALCE examination to the categories and levels of the CEFR using a graphic profile. The results of the study suggested that the ALCE test is targeted at the C1 level of the CEFR . Similarly, Khalifa et al. (2010) applied the familiarization and specification procedures to confirm the alignment of the First Certificate in English (FCE ) with the CEFR B2 level. This study demonstrated that the Manual’s methodology can be constructively utilized for the development of a linking argument and also for the maintenance of the alignment of a test to the CEFR . In addition, Kecker & Eckes (2010) studied all four English skills, that is, writing, reading, listening, speaking, of the Test of German as a Foreign Language (TestDaF), following the steps described in the Manual to examine the relation between the TestDaF and the relevant CEFR level. The receptive skills (reading and listening) adopted a modified Angoff approach, whereas the productive skills (writing and speaking) used teachers’ judgments on test performance by TestDaF candidates. Wu & Wu (2010) attempted to establish an alignment of the reporting levels of the General English Proficiency Test (GEPT ) reading comprehension test to the CEFR levels by following the internal validation procedures, including familiarization, specification, and standardization. The project was undertaken to meet the Taiwanese Ministry of Education requirement (issued in 2005) that all major language tests needed to be mapped to the CEFR . The majority of the studies reviewed above are concerned with establishing a posteriori validity evidence for the existing test(s) by linking to the relevant CEFR levels. Not much effort has been made to collect a priori validity evidence alongside the test development procedures. This chapter reports on how CEFR scales were operationalized in practice in the course of developing the Pearson Test of English Academic (PTE Academic™). PTE Academic™ is a computer-based international English language test launched globally in 2009. The purpose of the test is to assess English language competence in the context of academic programs of study where English is the language of instruction. 
It is targeted at intermediate to advanced English language learners. In order to claim that PTE Academic™ is fit for purpose, validity evidence has been collected during the various stages of test development through to its administration.


The constructs measured in PTE Academic™ are the communicative language skills needed for reception, production, and interaction in both oral and written modes, as these skills are considered necessary to successfully follow courses and to actively participate in the targeted tertiary level education environment. The CEFR describes the skills language learners need to acquire in order to use a language for communication and effective action. Language ability is described within the CEFR with reference to a number of scales, which include a global scale, skill specific communicative competency scales, and linguistic competency scales. In the context of PTE Academic™, measures to link the test to the CEFR have been studied at different stages of test development. A priori measures include activities that incorporate the use of CEFR scales in item writing. A posteriori evidence includes the statistical validation procedures used to establish the extent to which PTE Academic™ scores can be linked to the CEFR .

A Priori Validation

Since test scores of PTE Academic™ are used for university admission purposes, the high-stakes nature of the decisions requires this test to be valid for the inferences the test users make, that is, whether test takers have adequate English proficiency to participate in English-medium tertiary settings. In developing valid test items, quality assurance measures were adopted at each stage of the test development process. Qualified item writers are trained to become familiar with two essential test development documents, that is, the Test Specification (hereafter the Specification) and the Item Writer Guidelines (hereafter the Guidelines). The Specification serves as an operational definition of the constructs the test intends to assess. The Guidelines include the detailed test specification of PTE Academic™, reproduce the relevant CEFR scales, specify in detail the characteristics of each item, and give item writers rules and checklists to ensure that the test items they develop are fit for purpose and suitable for inclusion in the item bank. In developing reading and listening items, item writers are trained in three aspects: 1) familiarization with the Target Language Use (TLU) situations; 2) selection of appropriate reading or listening texts; and 3) familiarization with the CEFR scales on reading and listening. The Guidelines explain the characteristics of reading and listening passages through which test takers can best demonstrate their abilities. For the reading items, this includes text sources, authenticity, discourse type, topic, domain, text length, and cultural suitability. For the listening items, they include text sources, authenticity, discourse type, domain, topic, text length, accent, text speed, how often the material will be played, text difficulty, and cultural suitability. In developing writing and speaking items, the Guidelines explain TLU situations with details of the CEFR scale from levels B1 to C2. In the Guidelines for writing, the purpose of writing discourse and the cognitive process of academic writing are presented in a matrix format with recommendations for preferred item types. The purposes of writing tasks are defined as 1) to reproduce, 2) to organize or reorganize, and 3) to invent or generate ideas. Three types of cognitive processing are differentiated: to learn, to inform, and to convince or persuade. In the Guidelines for speaking, item writers are instructed to produce topics focusing on academic interests


and university student life. A list of primary speaking abilities is also provided, including the ability to comprehend information and to deliver such information orally, and the ability to interact with ease in different situations.

Writing to the CEFR Levels

This section describes the specific procedures involved in the writing of test items according to the CEFR levels, and Table 4.1 presents an overview of the four main stages in the CEFR familiarization training for item writers.

TABLE 4.1 Item Writer CEFR Training Stages

STAGE 1: Familiarization with the definitions of some basic terms used in the CEFR
  For example: general language competence, communicative language competence, context, conditions and constraints, language activities, language processes, texts, themes, domains, strategies, tasks.

STAGE 2: Familiarization with the common reference levels: the global descriptors
  Proficient user (C2 and C1): precision and ease with the language; naturalness; use of idiomatic expressions and colloquialisms; language used fluently and almost effortlessly; little obvious searching for expression; smoothly flowing, well-structured language.
  Independent user (B2 and B1): effective argument; holding one's own; awareness of errors and correcting oneself; maintains interaction and gets across intended meaning; copes flexibly with problems in everyday life.
  Basic user (A2 and A1): interacts socially; simple transactions in shops, etc.; skills uneven; interacts in a simple way.

STAGE 3: Familiarization with the subscales for the four skills
  CEFR Overall Spoken Production and subscales; Overall Spoken Interaction and subscales; Overall Listening Comprehension and subscales; Overall Reading Comprehension and subscales; Overall Written Interaction and subscales; Overall Written Production and subscales.

STAGE 4: Rating candidates' performances and rating communicative tasks on CEFR scales
  Rate individually; express reasons and discuss with colleagues; compare ratings with experts' marks.


Item writers are instructed to write items with a targeted difficulty level from B1 to C2 on the CEFR scale. The CEFR estimate is one of the item dimensions that item writers need to provide when they submit an item, among other item dimensions such as item source, accent and speech rate for speaking items, prompts for writing items, and distractor options for multiple-choice items. Item writers' CEFR estimates of item difficulty levels were empirically validated when the items were analyzed, either through field testing or through a live item seeding process. As shown in Table 4.1, there are four stages in the item writers' CEFR familiarization training. The first stage covers the instruction of some key terms used in the CEFR descriptors, aiming to help item-writer trainees understand the CEFR in general. By introducing the global descriptors at each level, the second stage gives trainees an idea of what kinds of tasks test takers are expected to perform at the targeted levels, and how well. The third stage provides the trainees with more detailed descriptions of the can-do statements. Finally, after becoming familiar with the CEFR scales, item writers are asked to rate several example performances and communicative tasks individually, discuss their ratings with their colleagues, and compare their scores and reasons with those given by experts. Table 4.2 shows an example of the CEFR overall written production scale and subscales that were used in the item-writer training.

TABLE 4.2 An Example of CEFR Overall Written Production and Subscales

CEFR Overall Written Production
C2  Can write clear, smoothly flowing, complex texts in an appropriate and effective style and a logical structure that helps the reader find significant points.
C1  Can write clear, well-structured texts on complex subjects, underlining the relevant salient issues, expanding and supporting points of view at some length with subsidiary points, reasons and relevant examples, and rounding off with an appropriate conclusion.
B2  Can write clear, detailed texts on a variety of subjects related to his/her field of interest, synthesizing and evaluating information and arguments from a number of sources.
B1  Can write straightforward connected texts on a range of familiar subjects within his/her field of interest, by linking a series of shorter discrete elements in a linear sequence.
A2  Can write a series of simple phrases and sentences linked with simple connectors like and, but, and because.
A1  Can write simple isolated phrases and sentences.

CEFR Writing Subscales
Creative writing; Correspondence; Orthographical control; Reports and essays; Overall written interaction; Notes, messages, and forms; Note taking; Processing text; Thematic development; Coherence and cohesion; General linguistic range; Vocabulary range; Vocabulary control; Coherence; Propositional precision.

© Council of Europe 2001


In summary, in the context of PTE Academic™, the concepts and approach of the CEFR are built into the essential test development documentation, that is, the Specification and the Guidelines. This approach is then implemented at the initial stage of item writing. Item writers provide a CEFR estimate for each item, which is then cross-validated at the a posteriori stage using statistical evidence.

A Posteriori Validation

This section reports on the statistical validation procedures used to establish the alignment of PTE Academic™ scores to the CEFR scales. Statistical procedures for relating PTE Academic™ scores to the levels of the CEFR scales involved both a test-taker-centered approach and an item-centered approach.

Linking to the CEFR: A Test-Taker-Centered Approach

For the test-taker-centered approach, data in an incomplete, overlapping design containing responses from 3,318 test takers on close to 100 items representing three item types were used. Five responses were available per test taker: a written essay (one item), an oral description of an image (two items), and an oral summary of a lecture (two items). The essay writing task has eleven score categories (0–10 points), the oral description of an image has eight score categories (0–7 points), and the oral summary of a lecture has five score categories (0–4 points), adding up to a total of twenty-three score categories. Each response was rated on the relevant CEFR scales for writing and speaking by two human raters, independently of the ratings produced to score the test. A total of forty raters were randomly assigned to items, each rater rating on average 500 responses. Given the probabilistic and continuous nature of the CEFR scale, ratings at adjacent levels were expected in the model. The relation between ability estimates based on scored responses on the above PTE Academic™ test items and the CEFR is displayed in Figure 4.1, with one chart for the written responses and the other for the oral responses. The horizontal axis ranges from CEFR levels A2 to C2. The vertical axis shows the truncated PTE Academic™ theta scale within the range from −2 to +2. The box plots show substantial overlap across adjacent CEFR categories as well as an apparent ceiling effect at C2 for writing. CEFR levels, however, are not to be interpreted as mutually exclusive categories. Language development is continuous and does not take place in discrete stages. Therefore, the CEFR scale and its levels should be interpreted as probabilistic: learners of a language are estimated to be most likely at a particular level, but this does not reduce to zero their probability of being at an adjacent level. The overlap between the box plots in Figure 4.1 is therefore in agreement with the model. Although the official CEFR literature does not provide information on the minimum probability required to be at a CEFR level, the original scaling of the levels (North, 2000) is based on the Rasch model, where cut-offs are defined at 0.5 probability.


FIGURE 4.1 CEFR Level Distribution Box Plots.

A mathematical feature that is often overlooked in aligning tests to the CEFR is therefore that the cut-offs for the ability of test takers are necessarily at the midpoint between the lower and upper boundary of the level for the task difficulties. This feature is also described by Adams & Wu (2002, pp. 197–199) for the level definitions in PISA (Programme for International Student Assessment). The distance of approximately 1 or 2 logits between the CEFR levels implies that anyone typically reaching a probability of around 0.8 at level X has a 0.5 probability of being at level X+1, and is therefore exiting level X and entering level X+1. Having a probability of 0.5 of being at level X implies a probability of 0.15 of being at level X+1 and as little as 0.05 at level X+2. This probabilistic relation between task difficulty and test-taker performance can also be seen in Figure 4.2. A deterministic level definition, where any performance is either at one level or at another, would result in a step-like function. It is clear from that figure, however, that the PTE Academic™ theta increases monotonically from A2 to C2. Based on this monotone increase, a positive relation between the CEFR scale and the PTE Academic™ scale is established. To find the exact cut-offs on the PTE Academic™ theta scale corresponding to the CEFR levels, that is, the position where a particular CEFR rating becomes more likely than a rating below, the first stage is to establish the lower bounds of the CEFR categories based on the independent CEFR ratings. For this purpose, the CEFR ratings were scaled using FACETS (Linacre, 1988; 2005). The estimates of category boundaries on the CEFR theta scale are shown in Table 4.3.


TABLE 4.3 Category Lower Bounds on CEFR Theta

Category   CEFR Level   CEFR Theta (lower bound)
1          A2           −4.24
2          B1           −1.53
3          B2            0.63
4          C1            2.07
5          C2            3.07
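As a numerical check of the logit arithmetic described in the preceding paragraphs, the following sketch assumes a simple dichotomous Rasch model and level boundaries 1.5 logits apart; the chapter cites a spacing of roughly 1 to 2 logits, so the exact value used here is an assumption. It reproduces the approximate probabilities cited for adjacent levels.

```python
# Minimal Rasch-model check of the probabilities discussed above (illustrative only).
import math

def p_success(theta: float, b: float) -> float:
    """Rasch probability of succeeding on a task of difficulty b at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

spacing = 1.5            # assumed distance between level boundaries, in logits
b_x = 0.0                # lower bound of level X (arbitrary origin)
b_x1 = b_x + spacing     # lower bound of level X+1
b_x2 = b_x + 2 * spacing # lower bound of level X+2

# A test taker exactly at the cut-off for level X+1 (probability 0.5 there) ...
theta = b_x1
print(round(p_success(theta, b_x1), 2))   # 0.5   -> entering level X+1
print(round(p_success(theta, b_x), 2))    # ~0.82 -> "around 0.8" at level X

# A test taker exactly at the cut-off for level X (probability 0.5 there) ...
theta = b_x
print(round(p_success(theta, b_x1), 2))   # ~0.18 -> roughly the 0.15 cited
print(round(p_success(theta, b_x2), 2))   # ~0.05 -> the 0.05 cited
```

With a slightly wider spacing (closer to 2 logits) the same arithmetic yields values closer to the 0.15 and 0.05 quoted in the text, which is why the levels are best read as probabilistic bands rather than boxes.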

FIGURE 4.2 Relation between CEFR Theta and PTE Theta.

The relationship between the scale underlying the CEFR levels and the PTE Academic™ theta for those test takers about whom we had information on both scales (n = 3,318) is shown in Figure 4.2. The horizontal axis shows the CEFR theta, and the vertical axis shows the PTE Academic™ theta estimate. The correlation between the two measures is 0.69. A better-fitting regression is obtained with a higher-order polynomial (the unbroken curved line), yielding an r² of slightly over 0.5. This regression function was used to project the CEFR cut-offs from the CEFR-scaled ratings onto the PTE Academic™ theta scale. Because of noisy (messy and unpredictable) data at the bottom end of the scales, the lowest-performing 50 candidates were removed. Further analyses were conducted with the remaining 3,268 subjects. Figure 4.3 shows the cumulative frequencies for these 3,268 candidates for whom theta estimates are available on both scales (the CEFR scale and the PTE Academic™ scale).
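A minimal sketch of the projection step just described: fit a low-order polynomial of PTE theta on CEFR theta and evaluate it at the CEFR category lower bounds from Table 4.3. The paired theta values below are simulated stand-ins for the candidate-level data (which are not reproduced in the chapter), and the polynomial degree is an assumption.

```python
# Illustrative projection of CEFR lower bounds onto the PTE theta scale.
import numpy as np

rng = np.random.default_rng(0)
cefr_theta = rng.normal(0.0, 2.0, size=3318)                     # hypothetical ability estimates
pte_theta = 0.4 * cefr_theta + rng.normal(0.0, 0.8, size=3318)   # noisy, related scale

# Fit a polynomial regression of PTE theta on CEFR theta (degree 3 chosen as an example)
coeffs = np.polyfit(cefr_theta, pte_theta, deg=3)
predict = np.poly1d(coeffs)

# CEFR lower bounds on the CEFR theta scale, taken from Table 4.3
cefr_lower_bounds = {"A2": -4.24, "B1": -1.53, "B2": 0.63, "C1": 2.07, "C2": 3.07}

for level, bound in cefr_lower_bounds.items():
    print(f"{level}: CEFR theta {bound:+.2f} -> projected PTE theta {predict(bound):+.2f}")
```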


FIGURE 4.3 Cumulative Frequencies for CEFR Levels on CEFR and PTE Theta Scales.

FIGURE 4.4 Cumulative Frequencies on CEFR and PTE Theta Scales after Equipercentile Equating.


TABLE 4.4 Final Estimates for CEFR Lower Bounds on the PTE Theta Scale

CEFR Level   Theta PTE   Frequency   Percentage   Cumulative Frequency
A2            −1.155          677        21%             0.25
B1            −0.496        1,471        45%             0.70
B2             0.274          769        24%             0.93
C1             1.105          170         5%             0.98
C2            >1.554           55         2%             1.00
Totals                       3,268       100%

The cumulative frequencies on the two scales are closely aligned, though the PTE scale shows slightly less variance. In the next stage, equipercentile equating was chosen to express the CEFR lower bounds on the PTE theta scale. Equipercentile equating of tests determines the equating relationship as one where a score has an equivalent percentile on either test. The cumulative frequencies after equating are shown in Figure 4.4, which shows a complete alignment of the two scales. The resulting projection of the CEFR lower bounds onto the PTE theta scale, together with the observed distribution of field test candidates over the CEFR levels, is shown in Table 4.4.
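The following is a minimal sketch of equipercentile equating, assuming two vectors of theta estimates are available for the same candidates. The simulated values and the helper function are illustrative only, not the operational procedure used for PTE Academic™; only the CEFR lower bounds come from Table 4.3.

```python
# Equipercentile equating sketch: map a value on one scale to the value with the
# same percentile rank on the other scale.
import numpy as np

rng = np.random.default_rng(1)
cefr_theta = rng.normal(0.0, 2.0, size=3268)
pte_theta = rng.normal(0.0, 0.9, size=3268)   # same candidates, different scale and variance

def equipercentile(value: float, from_scores: np.ndarray, to_scores: np.ndarray) -> float:
    """Map `value` on the first scale to the score with the same percentile rank
    on the second scale (linear interpolation between order statistics)."""
    pct = (from_scores < value).mean() * 100      # percentile rank on the source scale
    return float(np.percentile(to_scores, pct))   # score at that percentile on the target scale

cefr_lower_bounds = {"A2": -4.24, "B1": -1.53, "B2": 0.63, "C1": 2.07, "C2": 3.07}
for level, bound in cefr_lower_bounds.items():
    print(f"{level}: lower bound at {equipercentile(bound, cefr_theta, pte_theta):+.3f} on the PTE theta scale")
```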

Linking to the CEFR: An Item-centered Approach

As reported above, at the item development stage, item writers were required to indicate, for each item, the level of ability, expressed in terms of the CEFR levels, that they intended to measure, that is, the level they thought test takers would need to be at in order to correctly solve the item. In the item review process, these initial estimates from item writers were evaluated and, if needed, corrected by the item reviewers. Based on observations from field tests, the average item difficulty was calculated for the items that, according to the item writers, fall into a particular category. Table 4.5 provides the mean observed difficulty for each of the CEFR levels targeted by the item writers. However, rather than the average difficulty of the CEFR levels, the cut-offs between these levels as they are projected onto the PTE Academic™ theta scale need to be established. To this effect, from the data, given item difficulty, the likelihood of any item having been assigned to any of the CEFR levels was estimated. The cut-off between two consecutive levels is the location on the scale where the likelihood of belonging to the first category becomes less than the likelihood of belonging to the next category. In this way, the PTE theta cut-offs based on the items were found. The estimated lower bounds of the difficulty of items targeted at each of the CEFR levels were plotted against the lower bounds of these levels as estimated from the independent CEFR ratings of test takers' responses by human raters.


TABLE 4.5 Intended and Observed Item Difficulty

Intended CEFR Level | Mean Observed Difficulty
A2 | 0.172
B1 | 0.368
B2 | 0.823
C1 | 1.039
C2 | 1.323

In Figure 4.5, the horizontal axis represents the theta scale of the CEFR cut-offs from the test-taker-centered analysis, while the vertical axis represents the scale of the CEFR cut-offs from the item-centered analysis. Note that because the analyses were conducted independently, each scale has its own origin and measurement unit.

FIGURE 4.5 Lower Bounds of CEFR Levels Based on Targeted Item Difficulty versus Lower Bounds Based on Equated CEFR Ratings of Candidates’ Responses.


Both estimates, derived independently, agree to a high degree (r = 0.99) on the relative distances between the cut-offs. Provisions for maintaining the relationship with the CEFR also need to be made continuously, as new test items are written by item writers in adherence with both the Test Specification and the Item Writer Guidelines. These new items are systematically seeded into the operational test forms to gather live test-taker responses. When enough responses have been collected, these items are scored and analyzed together with other precalibrated test items. The analysis adopts a concurrent calibration design, whereby the new test items, following analysis of the results, are benchmarked to CEFR-referenced item difficulties.

Concordance with Other Measures of English Language Competencies

In order to triangulate the estimated relationship with the CEFR, concordance studies were conducted during the field and beta testing stages between PTE Academic™ and other measures of English language competencies claiming alignment to the CEFR. Test takers self-reported their scores on other tests of English, including TOEIC®, TOEFL® PBT, TOEFL® CBT, TOEFL iBT®, and IELTS™, taken within two months prior to or after their participation in the PTE Academic™ field tests. In addition, test takers were asked to send in a copy of their score reports from these tests. About one in four of all test takers who provided self-reported scores also sent in their official report. The correlation between the self-reported results and the official score reports was .82 for TOEFL iBT® and .89 for IELTS™. This finding is in agreement with earlier research on self-reported data. For example, Cassady (2001) found students' self-reported GPA scores to be "remarkably similar" to official records. The data are also consistent with published concordances between tests. When the TOEFL iBT® was first released, the Educational Testing Service prepared a document (ETS, 2005) with concordances between the TOEFL iBT® and its predecessors, the TOEFL® CBT and the TOEFL® PBT, to help users of previous versions of the TOEFL transition to the Internet-based version. In this now historical document (ETS, 2005, p. 7), the score range 75–95 on TOEFL iBT® is comparable to the score range 213–240 on TOEFL® CBT and to the score range 550–587 on TOEFL® PBT. Table 4.6 shows the means of the self-reported scores on those tests and their corresponding correlations with PTE Academic™ during field testing, and Table 4.7 shows the corresponding correlations with PTE Academic™ during beta testing. In addition, a score range of 800–850 on TOEIC® corresponds to a score range of 569–588 on TOEFL® PBT (Wilson, 2003), which is in line with the data collected during the PTE Academic™ field test (see Table 4.6). Based on the data presented in Tables 4.6 and 4.7, concordance coefficients were generated between PTE Academic™ and the other tests of English using linear regression. The regression coefficients were then used to predict the scores of PTE Academic™ BETA test takers on TOEFL iBT® and IELTS™. Table 4.7 shows the self-reported mean scores and those from the official reports, the mean scores for the same test takers as predicted from their PTE Academic™ scores, and the correlations between the reported scores and the predictions from PTE Academic™.
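A minimal sketch of this regression step is given below, using invented paired scores (none of the values reproduce Tables 4.6 or 4.7): a least-squares line is fitted on the field-test pairs and then used to predict another test's scores from PTE Academic™ scores.

```python
import numpy as np

def fit_concordance(pte_scores, other_scores):
    """Least-squares line for predicting another test's score from a PTE
    Academic score; returns (slope, intercept)."""
    slope, intercept = np.polyfit(pte_scores, other_scores, deg=1)
    return float(slope), float(intercept)

# Hypothetical paired field-test data: PTE scores and self-reported TOEFL iBT
# scores (all values invented for illustration).
rng = np.random.default_rng(2)
pte_field = rng.uniform(30.0, 90.0, 140)
toefl_field = 20.0 + 1.0 * pte_field + rng.normal(scale=8.0, size=140)

slope, intercept = fit_concordance(pte_field, toefl_field)

# Predict TOEFL iBT scores for beta test takers from their PTE scores.
pte_beta = np.array([55.0, 68.0, 80.0])
print(slope * pte_beta + intercept)
```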


TABLE 4.6 Means and Correlations of PTE Academic™ Field Test Takers on Other Tests

Test | Self-reported: n (valid) | Self-reported: Mean | Self-reported: Correlation | Official report: n | Official report: Mean | Official report: Correlation
TOEIC® | 327 | 831.55 | .76 | n/a | n/a | n/a
TOEFL® PBT | 92 | 572.3 | .64 | n/a | n/a | n/a
TOEFL® CBT | 107 | 240.5 | .46 | n/a | n/a | n/a
TOEFL iBT® | 140 | 92.9 | .75 | 19 | 92.1 | .95
IELTS™ | 2,432 | 6.49 | .76 | 169 | 6.61 | .73

TABLE 4.7 Correlation and Prediction of PTE Academic™ BETA Test Takers

Test | Self-reported: n | Self-reported: Mean | Self-reported: Predicted | Self-reported: Correlation | Official report: n | Official report: Mean | Official report: Predicted | Official report: Correlation
TOEFL iBT® | 42 | 98.9 | 97.3 | .75 | 13 | 92.2 | 98.2 | .77
IELTS™ | 57 | 6.80 | 6.75 | .73 | 15 | 6.60 | 6.51 | .83

TABLE 4.8 TOEFL iBT® CEFR Cut-offs Estimated by ETS

CEFR | TOEFL iBT® Estimated by ETS | Alignment to CEFR Estimated from PTE Academic™
C1 | 110 | 110–111
B2 | 87 | 87–88
B1 | 57 | 57–58

Table 4.8 shows two independent approaches to estimating CEFR cut-scores for the TOEFL iBT®. One column shows the cut-scores arrived at through research conducted by ETS (Tannenbaum & Wylie, 2008). The other column shows the estimated cut-scores based on the PTE Academic™ concordance with TOEFL iBT®. The high level of agreement between these two approaches provides independent support for the validity of the CEFR cut-offs on the PTE Academic™ reporting scale as presented in this study.


Discussion: Advantages and New Developments

Studies that link tests to the CEFR have relevance beyond providing supporting evidence for test validity. Once sufficiently convincing evidence for the alignment has been established, it can also help stakeholders to understand the meaning of test scores in a more comprehensive way, because test scores can be interpreted in terms of the "can do" statements in the CEFR. In other words, both teachers and learners can gain a better understanding of what students with a particular score are likely to be able to accomplish in terms of the descriptive system of the CEFR and its level descriptors. Such understanding can help learners self-assess their learning and assist teachers to reflect on their teaching, which could potentially make learning and teaching more effective. Future studies might also investigate the impact and validity of the linking study from teaching and learning perspectives.

The reporting scale of PTE Academic™ uses the Global Scale of English (GSE), a linear scale used by Pearson to express the increasing difficulty of language tasks as well as the growing ability of language users. The scale runs from ten to ninety, thereby offering a more granular scale for measuring progress in English language proficiency than the six levels of the CEFR can offer. Other tests can be developed that report on the same scale. In order to understand how this is possible, we need to remind the reader of the process by which the CEFR common reference levels were created. The six levels of the CEFR are described by a set of holistic paragraphs presented in Table 4.1 of the CEFR (Council of Europe, 2001, p. 24). These paragraphs, as well as those presented in Tables 4.2 (Council of Europe, 2001, pp. 26–27) and 4.3 (Council of Europe, 2001, pp. 28–29), in fact constitute summaries of sets of illustrative descriptors or "can do" statements (Council of Europe, 2001, p. 25) estimated to describe language proficiency at six predefined intervals on an underlying latent language proficiency variable. The locations of the descriptors on the underlying continuous variable were estimated in a research project reported by North (2000) and summarized in Appendix B of the CEFR (Council of Europe, 2001, pp. 217–225). After calibrating the bank of close to 500 descriptors, North (2000) divided the continuous variable into intervals which later became the CEFR levels. Going back to the illustrative descriptors allows for the subdivision of the CEFR levels and makes the development of functional language competence reportable with greater precision. The finer resolution of the granular scale is defined by the descriptors in the form of "can do" statements. However, in order to realize this granularity at all levels of the CEFR, more illustrative descriptors are needed than are available in the CEFR or in North (2000). The CEFR offers about three times more descriptors over the range from A2 to B2 than over the rest of the scale, and more than half of all descriptors relate to speaking. In a large-scale project at Pearson, hundreds of new descriptors are being developed and scaled using the original North (2000) descriptors as anchors. The descriptors (grouped as "General Adult," "Academic," "Professional," and "Young Learners") together describe the Global Scale of English. The learning objectives for adults, learners of Professional English, and academic English are already available free of charge


(GSE; see http://www.english.com/gse), and a version tailored for young learners will be made available in due course. The possibility of defining a universal scale of functional language ability across first and second and/or foreign languages (De Jong, 1984, 1988), with validity across various school systems (De Jong, 1986) and across various first language backgrounds (De Jong & Oscarson, 1990), had already been suggested and supported by a number of research projects compiled in De Jong (1991). The requirements for the psychometric definition of levels that compartmentalize a latent continuum were suggested in the context of reporting the PISA 2000 results by De Jong in an e-mail to Ray Adams and described by Adams & Wu (2002). Like the CEFR, the GSE can be used to create syllabuses, course material, and examinations, but at a more granular level, so as to make progress observable within a school year or even a semester. Further studies are needed to chart the learning time required to make progress on the scale, depending on parameters such as the distance of the first language from English, exposure to English, and intrinsic motivation. Furthermore, studies can be conducted on the relative efficacy of learning and teaching methods. On the one hand, such studies could deepen our understanding of the relation between language proficiency levels and actual teaching practices in the classroom. On the other hand, these studies could help to develop realistically attainable curricular standards and national language policies.

Conclusion

Linking a test to a common scale such as the CEFR presents its merits alongside its challenges. Establishing concurrent validity entails establishing the degree to which results from a test agree with the results from other measures of the same or similar constructs. One caveat with this type of validity evidence, as Moller (1982) reminds us, is that we need to check whether or not the criterion measure itself is valid. If it is not valid, or not designed to measure the same construct, then one cannot claim that a test has criterion-related validity simply because it correlates highly with another test or external criterion of performance. Linking efforts of this kind should involve intertwined stages, from test development to statistical validation. This chapter has reported on the measures taken to support the concurrent validity of PTE Academic™: establishing the link from the beginning of the test development process by writing items according to CEFR criteria, gathering statistical evidence to demonstrate alignment with the CEFR, and comparing results with other tests of a similar nature that have claimed alignment with the CEFR. The establishment of a valid link to the CEFR helps facilitate the interpretation of test scores for test users worldwide and potentially across tests of a similar nature. This link should be supported by both qualitative and quantitative evidence. The linking study reported in this chapter combines a priori validation and a posteriori validation. To further consolidate the validity evidence, it would be advisable to adopt a variety of other approaches to collect more qualitative data, for instance introspective justifications, retrospective group discussions, and think-aloud protocols or systematic recordings.


References

Adams, R., & Wu, M. (2002). PISA 2000 Technical Report. Paris: OECD.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
Byram, M., & Parmenter, L. (2012). The Common European Framework of Reference: The Globalisation of Language Education Policy. Bristol, UK: Multilingual Matters.
Cassady, J. C. (2001). Self-reported GPA and SAT scores. ERIC Digest. ERIC Identifier: ED 458216.
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge, UK: Cambridge University Press.
Council of Europe. (2003). Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Preliminary Pilot Version DGIV/EDU/LANG (2003), 5.
Council of Europe. (2009). Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment: A Manual. Strasbourg, France: Author.
De Jong, J. (1984). Listening: A single trait in first and second language learning. Toegepaste Taalwetenschap in Artikelen, 20, 66–79 (ERIC: ED 282 412).
De Jong, J. (1986). Achievement tests and national standards. Studies in Educational Evaluation, 12(3), 295–304.
De Jong, J. (1988). Rating scales and listening comprehension. Australian Review of Applied Linguistics, 11(2), 73–87.
De Jong, J. (1991). Defining a Variable of Foreign Language Ability: An Application of Item Response Theory (ISBN 90-9004299-7) (Doctoral dissertation). Twente University.
De Jong, J., & Oscarson, M. (1990). Cross-national standards: A Dutch–Swedish collaborative effort in national standardized testing. AILA Review, 7, 62–78.
Downey, N., & Kollias, C. (2010). Mapping the Advanced Level Certificate in English (ALCE) examination onto the CEFR. In W. Martyniuk (Ed.), Aligning Tests with the CEFR: Reflections on Using the Council of Europe's Draft Manual. Cambridge, UK: Cambridge University Press, pp. 119–130.
Educational Testing Service. (2005). TOEFL® Internet-Based Test: Score Comparison Tables. Princeton, NJ: Educational Testing Service.
Kantarcioglu, E., Thomas, C., O'Dwyer, J., & O'Sullivan, B. (2010). Benchmarking a high-stakes proficiency exam: The COPE linking project. In W. Martyniuk (Ed.), Aligning Tests with the CEFR: Reflections on Using the Council of Europe's Draft Manual. Cambridge, UK: Cambridge University Press, pp. 102–118.
Kecker, G., & Eckes, T. (2010). Putting the manual to the test: The TestDaF-CEFR linking project. In W. Martyniuk (Ed.), Aligning Tests with the CEFR: Reflections on Using the Council of Europe's Draft Manual. Cambridge, UK: Cambridge University Press, pp. 50–79.
Khalifa, H., ffrench, A., & Salamoura, A. (2010). Maintaining alignment to the CEFR: The FCE case study. In W. Martyniuk (Ed.), Aligning Tests with the CEFR: Reflections on Using the Council of Europe's Draft Manual. Cambridge, UK: Cambridge University Press, pp. 80–101.
Linacre, J. M. (1988). A Computer Program for the Analysis of Multi-Faceted Data. Chicago, IL: Mesa Press.
Martyniuk, W. (Ed.). (2010). Aligning Tests with the CEFR: Reflections on Using the Council of Europe's Draft Manual. Cambridge, UK: Cambridge University Press.


Messick, S. (1992). Validity of test interpretation and use. In M. C. Alkin (Ed.), Encyclopedia of Educational Research (6th edn.). New York, NY: Macmillan, pp. 1487–1495.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749.
Moller, A. D. (1982). A Study in the Validation of Proficiency Tests of English as a Foreign Language (Unpublished doctoral thesis). University of Edinburgh, Scotland.
North, B. (2000). The Development of a Common Framework Scale of Language Proficiency. New York, NY: Peter Lang.
North, B., Martyniuk, W., & Panthier, J. (2010). Introduction: The manual for relating language examinations to the Common European Framework of Reference for Languages in the context of the Council of Europe's work on language education. In W. Martyniuk (Ed.), Aligning Tests with the CEFR: Reflections on Using the Council of Europe's Draft Manual. Cambridge, UK: Cambridge University Press, pp. 1–17.
O'Sullivan, B. (2010). The City and Guilds Communicator examination linking project: A brief overview with reflections on the process. In W. Martyniuk (Ed.), Aligning Tests with the CEFR: Reflections on Using the Council of Europe's Draft Manual. Cambridge, UK: Cambridge University Press, pp. 33–49.
Papageorgiou, S. (2007). Relating the Trinity College London GESE and ISE exams to the Common European Framework of Reference. London, UK: Trinity College London.
Schilling, S. G. (2004). Conceptualizing the validity argument: An alternative approach. Measurement: Interdisciplinary Research and Perspectives, 2(3), 178–182.
Tannenbaum, R. J., & Wylie, E. C. (2008). Linking English-Language Test Scores onto the Common European Framework of Reference: An Application of Standard-Setting Methodology. Educational Testing Service Research Report RR-08-34, TOEFL iBT-06. Princeton, NJ: Educational Testing Service.
Weir, C. J. (2005). Language Testing and Validation: An Evidence-Based Approach. Oxford, UK: Palgrave.
Wilson, K. (2003). TOEFL® Institutional Testing Program (ITP) and TOEIC® Institutional Program (IP): Two On-Site Testing Tools from ETS at a Glance. Handout from Educational Testing Service at Expolingua, Berlin, Germany.
Wu, J. R. W., & Wu, R. Y. F. (2010). Relating the GEPT reading comprehension tests to the CEFR. In W. Martyniuk (Ed.), Aligning Tests with the CEFR: Reflections on Using the Council of Europe's Draft Manual. Cambridge, UK: Cambridge University Press, pp. 204–224.

PART TWO

Assessment of Specific Language Aspects


5 Authentic Texts in the Assessment of L2 Listening Ability

Elvis Wagner

ABSTRACT

This chapter presents an overview of the issue of authenticity in the assessment of L2 listening ability, focusing on the use of unscripted spoken texts. It describes how scripted and unscripted spoken texts differ, and argues that the exclusive use of scripted texts in L2 listening assessment can present threats to construct validity. An analysis of the spoken texts used on four high-profile L2 tests in North America is then presented. The results show that most of these tests do not use unscripted spoken texts. How this might affect the construct validity of those tests is examined, and it is argued that the choice of the types of spoken texts can have a washback effect on learners, teachers, and even SLA research. The chapter concludes with practical suggestions for how to integrate unscripted spoken texts into listening tests.

Overview of Authenticity in L2 Listening Assessment

In life, authenticity is almost always considered a positive attribute. If something is "authentic," it is considered to be real and of value. In contrast, inauthentic almost always has a pejorative sense, as being of less value, a "knockoff" or a fake. In language testing, authenticity is also valued for a number of reasons, especially as it relates to construct validity. Bachman & Palmer (1996) describe authenticity as "the degree of correspondence of the characteristics of a given language test task to the features of" (p. 23) a real-world language situation. In other words, the more closely the characteristics of the test task align with and are similar to real-world language use, the more authentic that test task is. Bachman & Palmer (1996) argue that authenticity


is thus linked to construct validity. The more authentic the test task, the more confident one can be about test takers’ ability beyond the testing context, and the more valid are the inferences made about test takers’ ability based on their test performance. This idea of authenticity is also related to “communicative competence,” the notion widely accepted in the field of second language acquisition (SLA ) that being proficient in a language means that a person is able to communicate with other speakers in that language (Canale, 1983; Canale & Swain, 1980; Gilmore, 2011). For the assessment of L2 listening, then, assessing the communicative competence of test takers necessarily involves assessing their ability to understand real-world, authentic, spoken language. But tests, by their very nature, are inauthentic. Tests are not real-life language situations in which speakers are using language to communicate; rather, test takers are aware that they are “being asked not to answer a question by providing information but to display knowledge or skill” (Spolsky, 1985, p. 39). Different aspects of a testing situation can have varying degrees of authenticity, thus complicating the matter. That is, for a single listening test task, the spoken texts used, the listening task context, and the listener’s response to the task might all have varying degrees of authenticity. It is not possible, then, to simply say that a test task is “authentic” or “inauthentic.” Because of the inherent multidimensionality involved, researchers have coined different terms and frameworks to investigate the idea of authenticity in language testing. Bachman (1990) described the idea of interactional authenticity, which later became interactiveness in Bachman & Palmer (1996). They describe how the interactiveness of a test task involves “the ways in which the test taker’s areas of language knowledge, metacognitive strategies, topical knowledge, and affective schemata are engaged by the test task” (p. 25). Similarly, Field (2013) writes extensively about cognitive validity, which he defines in relation to assessing listening as “the extent to which the tasks employed succeed in eliciting from candidates a set of processes which resemble those employed by a proficient listener in a real-world listening event” (p. 77). While these notions of a test or test task’s interactiveness and cognitive validity are multidimensional constructs, the focus of this chapter is on only one of these dimensions—the authenticity of the spoken texts used in L2 listening tests.

How Scripted Spoken Texts and Unscripted Spoken Texts Differ

Wagner (2014) describes how spoken texts can be seen as ranging on a continuum of scriptedness. At the "scripted" end of the continuum are texts that are entirely scripted. These texts are planned, written, revised, edited, polished, and then read aloud. At the "unscripted" end of the continuum are spontaneous spoken texts that involve no advance planning, and thus are composed and uttered at virtually the same time. And obviously, there are texts with varying levels of scriptedness in between these two ends of the continuum. There has been a great deal of research that describes how the characteristics of written texts or scripted spoken texts differ from unscripted, real-world spoken language (e.g., Buck, 2001; Field, 2008, 2013; Gilmore, 2007; Rost, 2011; Wagner, 2013, 2014). Wagner & Wagner (forthcoming) synthesized the literature and stated


that spoken texts at the “unscripted” end of the continuum tend to have more hesitation phenomena (filled and unfilled pauses, false starts, redundancies), a faster speech rate, more connected speech (due to linking, assimilation, reduction, epenthesis, deletion, etc.), more listener response features (e.g., back channels, interruptions, overlaps), and more informal lexical usage by the speakers (e.g., slang and colloquial language). They also report that unscripted spoken texts tend to have less linear and systematic organizational patterns in extended discourse in comparison to scripted texts. Finally, they describe how unscripted spoken texts also tend to have different grammatical systems or “rules” than scripted spoken texts (i.e., “spoken” grammar). The consensus in the literature seems to be that these linguistic, organizational, and articulatory characteristics that are typically found in unscripted, real-life, spoken language tend to make these types of texts more difficult for L2 listeners to comprehend, especially for L2 learners with little exposure to these types of spoken texts. Recognizing this, publishers, materials developers, and many L2 teachers, when creating spoken texts for L2 materials, seek to make them comprehensible for learners by planning and writing texts that are systematically and linearly organized, have minimal colloquial language and use written grammatical norms. Then, voice actors speak the texts slowly and enunciate clearly, which results in a slower speech rate, little or no hesitation phenomena, and minimal connected speech or listener response features. These resulting texts are the “textbook texts” so often found in materials for L2 learners. However, an argument can also be made that some of these characteristics of unplanned spoken texts might lead to increased comprehension for L2 listeners. An interesting example is hesitation phenomena such as filled pauses (e.g., uh, um), false starts, and redundancies, which have been shown to present difficulties to L2 listeners (Griffiths, 1990, 1991; Kelch, 1985). L2 listeners often do not recognize these phenomena as hesitations and part of the speaker’s composing process, but instead devote limited processing resources to try and interpret them semantically. At the same time, these hesitation phenomena also serve to slow down the speech rate, and the literature has consistently shown that a lowered speech rate (e.g., King & East, 2011; Zhao, 1997) leads to increased comprehension. A similar argument can be made in relation to the organizational patterns of spoken texts. Planned texts, in which the speaker has time to plan and revise the text before uttering it, often result in texts that are more linearly and systematically organized. In contrast, unplanned texts, in which the speaker faces time constraints when composing and uttering simultaneously, tend to be less linear and organized. While the more linear and systematic organization should make these types of texts easier for listeners to comprehend than unplanned texts (Buck, 1995, 2001), an alternative viewpoint is that listeners are better able to understand spoken language that is more typically “oral” than “literate” (e.g., Shohamy & Inbar, 1991; Rubin, 1980). 
In this view, the “oral” features that include a higher degree of shared context between the speaker and listener, the prominence of prosodic features, more informal vocabulary, less complex syntax, and shorter idea units, all serve to make texts from the “oral” end of the continuum more comprehensible for listeners than “literate” texts. An obvious example of this, which many people have experienced, is listening to a formal (often academic) paper that is read aloud. Listening to these types of “literate” texts can be cognitively demanding and mentally draining, especially for the L2 listener, and can actually lead to lower levels of comprehension than more “oral” texts.


It is too simplistic, then, to argue that scripted spoken texts are more (or less) difficult for L2 listeners to comprehend than unscripted spoken texts. Rather than focusing on the perceived level of difficulty, SLA researchers, language testing researchers, and teachers should focus on how L2 listeners respond to and process the different types of spoken texts they are likely to encounter in real-world listening contexts (Wagner, 2013, 2014).

The Types of Spoken Texts Used in L2 Listening Assessment

The previous section described how textbook publishers, curriculum developers, and many language teachers recognize that the linguistic, organizational, and vocal characteristics of unplanned spoken texts might present comprehension difficulties for L2 listeners. For that reason, they create spoken texts that lack many of these characteristics in an attempt to make them more comprehensible for the learners. There are many sound theoretical and pedagogical reasons for doing so. Richards (2006) writes about the importance of using comprehensible texts when teaching L2 listening, and that using authentic texts can be counterproductive because it can increase learner frustration and negative affect. However, the exclusive use of these scripted, highly edited, and polished texts ("textbook texts") is problematic. It can result in the all-too-common phenomenon of the language learner who is highly proficient in the language classroom, but has great difficulty understanding the language spoken in real-world language contexts. Another danger of the reliance on the exclusive use of scripted spoken texts in textbooks and other classroom materials is that this genre of spoken texts then becomes the "default" genre of text that is used in many L2 language tests. In an L2 listening assessment context, a scripted spoken text can be authentic if it has a high degree of correspondence to spoken texts the L2 listeners would typically encounter outside the testing context. Similarly, an unscripted text can be considered inauthentic if it is not the type of text a particular group of L2 listeners would typically encounter outside the testing situation. One way to ensure that the spoken texts used on an L2 listening assessment are authentic is to utilize audio or audiovisual recordings taken from an actual real-world language event. This is an option that test developers do sometimes utilize, including using recordings of radio broadcasts, academic lectures, and two-person interactive conversations. However, it seems that most spoken texts that are actually used in L2 listening assessment are scripted. Wagner (2013, 2014) describes how spoken texts used in L2 teaching and testing are usually scripted, with writers creating a text, then extensively editing and revising it, usually as part of the process of crafting the comprehension items that the learners or test takers will have to answer. After extensive revision, the text is finalized, and then voice actors are recorded while reading the texts aloud. Often these voice actors are trained in diction, and seek to enunciate clearly and speak deliberately and clearly. The resulting spoken texts often lack many of the characteristics of unplanned, unscripted spoken texts that are found in real-life language situations. In other words, they lack authenticity.


As stated above, there is an extensive body of research describing how the characteristics of unplanned, real-world spoken language differ from scripted and planned spoken language, yet there is surprisingly little empirical research investigating the types of spoken texts used on L2 listening assessments. Perhaps the most extensive analysis of spoken texts used in L2 listening tests was conducted by Field (2013), who assessed the cognitive validity of the Cambridge ESOL suite of tests. As part of this validation exercise, Field analyzed the spoken texts used in the listening tasks, focusing on how representative the spoken texts were of real-world language, including whether they were scripted, improvised, authentic, or rerecorded. Field also examined the extent to which the process of listening to the test texts was similar to the process of listening to real-world spoken language, as well as the extent to which the spoken texts had the linguistic features commonly found in unplanned spoken discourse, particularly focusing on connected speech. He concluded that the tests in the Cambridge ESOL suite had varying levels of cognitive validity. Another finding was that while the higher level tests in the suite tended to require a wide range of levels of cognitive processing, these higher level tests also tended to under-represent the characteristics of real-world spoken language, and that some of the spoken texts used "are unconvincing as samples of speech" (2013, p. 111). Field critiqued the procedures by which the listening tests were constructed, finding that the writers of the exam usually used written texts as the source of the spoken texts, and concluded that the "material was thus only authentic in the sense that its content derived from written texts not designed for language teaching" (p. 112).

How the Type of Spoken Text Might Affect Test-Taker Performance

There is also a surprising dearth of research examining how the performance of L2 test takers might differ on scripted versus unscripted spoken texts. Studies by Henrichsen (1984) and Ito (2001) showed that L2 learners have more difficulty comprehending spoken texts that have those articulatory features (e.g., reduced forms, assimilations, and contractions that are a result of rapid articulation, and are typical of unplanned spoken language) than spoken texts without these features. While informative, these studies included only sentence-level input, in very controlled conditions. There appears to be only one empirical research project that has directly


investigated this notion at the discourse level. The participants in Wagner & Toth (2014) included two comparable groups of American university learners of Spanish as a foreign language. The first group listened to an unscripted text and the second group listened to a scripted version of the same text. The unscripted texts included the linguistic, organizational, and articulatory features commonly found in unscripted, real-world spoken language, while the scripted texts had features more commonly found in planned, formal spoken language. The group that listened to the scripted texts scored significantly higher on the listening comprehension test than the group that listened to the unscripted texts. Wagner & Toth's (2014) findings mirrored the results of the studies of sentence-level input (Henrichsen, 1984; Ito, 2001). However, Wagner & Toth's (2014) findings went beyond sentence-level input and included longer, discourse-level input, the types of authentic spoken texts that L2 learners would encounter outside the classroom. Wagner & Toth (2014) conclude that using unscripted spoken texts on L2 listening tests can result in varying test-taker performance, and that if the target language use (TLU) domain of interest includes real-world spoken language (which is typically unscripted), then this variance in test performance would be construct relevant variance.

The Study

The review of the literature demonstrated that there is a great deal of research describing how scripted and unscripted spoken languages differ. There seems to be a consensus that scripted and unscripted spoken texts have very different phonological, lexico-grammatical, and organizational characteristics, although there is only a limited amount of research that directly investigates this notion. However, there is an even more fundamental gap in the literature: there are almost no studies that have examined the phonological, lexico-grammatical, and organizational characteristics of spoken texts that are used on high-stakes English listening proficiency tests in North America. Field (2013) examined the spoken texts on the Cambridge suite of tests used primarily in the United Kingdom and Europe, while Wagner & Wagner (forthcoming) examined the spoken texts used on high-stakes listening tests in Asia. But there do not seem to be any studies that have examined the types of spoken texts that are used on high-stakes English proficiency tests in the North American context. Therefore, this study examines the spoken texts used in the listening sections of four large-scale, high-stakes tests of English proficiency used in North America for university admissions purposes: the International English Language Testing System (IELTS™), the Michigan English Language Assessment Battery (MELAB®), the Pearson Test of English Academic (PTE Academic™), and the Test of English as a Foreign Language Internet-Based Test (TOEFL iBT®). These four tests are taken by millions of test takers each year and are used by North American universities for admissions purposes. Thus, the tests have a huge potential for both positive and negative washback on test stakeholders and are worthy of study. The following research question is addressed: To what extent do the spoken texts used on the IELTS™, MELAB®, PTE Academic™, and TOEFL iBT® have the characteristics of authentic, unscripted spoken language?


Methodology

The spoken texts used in these four tests were analyzed qualitatively to examine the extent to which they have the linguistic features characteristic of scripted or unscripted spoken texts. The analysis focused on eight textual features in particular: connected speech (e.g., assimilation, reduction, contraction), speech rate, filled pauses (uh, um, you know), unfilled pauses (noticeable silent pauses), hesitation phenomena (false starts, repetitions, and repeats), spoken grammatical/lexical norms (oral turn starters, oral turn closers, instances of ungrammatical speech), slang/colloquial language, and listener response features (back channels, interruptions, and overlaps). The analysis is similar to that performed on three Asian high-stakes proficiency tests in Wagner & Wagner (forthcoming). These eight textual features were chosen because they have been identified in the literature (e.g., Buck, 2001; Field, 2008, 2013; Wagner, 2013, 2014; Wagner & Wagner, forthcoming) as important in differentiating between scripted and unscripted spoken texts. The analysis conducted here is an exploratory one and relies on my subjective and impressionistic judgment. I first found complete sample test versions of the listening sections on the test developers' websites. I then listened to the spoken texts individually, made notes about the different characteristics of the texts, and then listened to each of the texts at least two more times to confirm my initial impressions. It is important to acknowledge that this involves impressionistic reactions, and it is possible to measure at least some of the features more reliably. Nevertheless, this is an exploratory analysis and serves to make a general relative comparison among the spoken texts used on the listening tests as well as a comparison between these texts and real-world spoken texts.
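As an aside, one possible way to record such impressionistic judgments in a structured form is sketched below. This is not the author's instrument; it is a hypothetical data structure (names and labels invented, with example values paraphrasing Table 5.2) that makes the eight-feature comparison easy to tabulate.

```python
from dataclasses import dataclass

# The eight textual features examined in the analysis (from the chapter).
FEATURES = (
    "connected speech",
    "speech rate",
    "filled pauses",
    "unfilled pauses",
    "hesitation phenomena",
    "spoken grammatical/lexical norms",
    "slang/colloquial language",
    "listener response features",
)

@dataclass
class TextRating:
    """One listening text (or group of texts) and the impressionistic label
    assigned to each feature (e.g., 'none', 'minimal', 'moderate')."""
    test_section: str
    labels: dict

    def summary(self) -> str:
        cells = "; ".join(f"{f}: {self.labels.get(f, 'n.a.')}" for f in FEATURES)
        return f"{self.test_section} -> {cells}"

# Example entry paraphrasing the IELTS ratings reported in Table 5.2.
ielts_sections_1_and_3 = TextRating(
    test_section="IELTS Listening, Sections 1 and 3",
    labels={
        "connected speech": "minimal; some contractions",
        "speech rate": "normal",
        "filled pauses": "minimal, a few instances (uh, um)",
        "unfilled pauses": "none",
        "hesitation phenomena": "minimal, some repeats and restatements",
        "spoken grammatical/lexical norms": "oral turn starters; there's for there are",
        "slang/colloquial language": "none",
        "listener response features": "a few instances",
    },
)
print(ielts_sections_1_and_3.summary())
```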

Listening Sections of the IELTS™, MELAB®, PTE Academic™, and TOEFL iBT®

IELTS™

The IELTS™ Listening section takes approximately thirty minutes and is composed of forty items (a variety of item types including multiple-choice (MC), matching, labeling, etc.). The listening skills assessed include the ability to understand main ideas and specific information, to recognize the attitudes, opinions, and purpose of the speaker, and to follow the development of an argument. A number of native-speaker accents are used (i.e., British, American, Australian), and test takers can listen to the text only once.

MELAB®

The MELAB® Listening section takes thirty-five to forty minutes to complete and is composed of sixty MC items. The listening skills assessed include global skills (i.e., understanding the main idea, identifying the speaker's purpose, synthesizing ideas from the text); local skills (i.e., identifying details, understanding vocabulary, synthesizing details, recognizing restatements); and inferential skills (i.e., making inferences,


inferring details, understanding rhetorical function and pragmatic implications). Each text is heard only once.

PTE Academic™

The PTE Academic™ listening section takes forty-five to fifty-seven minutes to complete. There is a wide variety of item types, including MC, fill in the blank, summary, and dictation. Many different accents are used, including American, British, and Australian as well as nonnative speakers. Each spoken text is played only once. There are two parts to the listening section, and the speaking section also involves a listening component.

The TOEFL iBT®

The TOEFL iBT® listening section takes sixty to ninety minutes to complete. The listening skills assessed include listening for basic comprehension (i.e., main idea, major points, and important details); listening for pragmatic understanding (i.e., recognizing a speaker's attitude, degree of certainty, and function or purpose); and connecting and synthesizing information (i.e., recognizing organization and relationships between ideas presented; making inferences and drawing conclusions; making connections among pieces of information; and recognizing topic changes, introductions, and conclusions). A variety of native-speaker accents are used (i.e., British, American, Australian) and test takers can listen to the text only once. There are two separate sections in the listening section, and there are two additional integrated tasks that have a listening component. A summary of the features of the listening components of the four tests is provided in Table 5.1.

Results of the Analysis of the Spoken Texts Used on the Tests

IELTS™

The speaking texts for sections 1 and 3 of the IELTS™ involve multiple-person conversations, while the texts for sections 2 and 4 are one-person monologues. Most of the texts involve TLU domains in which one would expect mostly unscripted spoken texts. For example, the spoken texts previewed here include a phone conversation between a customer and a shipping agency, two friends discussing their university studies, and an academic lecture with student questions and comments. However, a few of the texts were simulated radio programs that typically would include scripted texts. Only the texts in which the TLU domain would dictate unscripted texts were analyzed here (see Table 5.2).

The analysis indicated that the spoken texts used in the IELTS™ listening had some of the characteristics of unscripted spoken texts.

TABLE 5.1 The Listening Components of the IELTS™, MELAB®, PTE Academic™, and TOEFL iBT®

Test | Parts of the Listening Section | Text Type (Number of Speakers, Genre) | Length of Text | Item Type
IELTS™ | Section 1 | 2-Person Conversations | 3–5 minutes | Varies—Combination of Form Completion, MC, Short Answer, Sentence Completion, etc.
IELTS™ | Section 2 | 1-Person Monologues | 2–4 minutes | Varies
IELTS™ | Section 3 | Multiple-Person Conversations (Usually 2 people, but sometimes up to four people) | 2–5 minutes | Varies
IELTS™ | Section 4 | 1-Person Monologue | 3–5 minutes | Varies
MELAB® | Type 1 | 1-Person Statement or Question | 4 seconds | MC
MELAB® | Type 2 | 2-Person Conversations and 1-Person Statements/Questions | 10–15 seconds | MC
MELAB® | Type 3 | Multiple-Person Radio Interviews | 2–4 minutes | MC
PTE Academic™ | Listening Part 1 | 1-Person Academic Mini-lecture | 90 seconds | Written Summary of Spoken Text
PTE Academic™ | Listening Part 2 | 1-Person Monologue | 40–90 seconds | MC
PTE Academic™ | Listening Part 2 | 1-Person Monologue | 30–60 seconds | Fill in the Blanks
PTE Academic™ | Listening Part 2 | 1-Person Monologue, Some with Content Video (e.g., PowerPoint slides) | 30–90 seconds | Highlight Correct Summary
PTE Academic™ | Listening Part 2 | 1-Person Monologue | 30–60 seconds | MC
PTE Academic™ | Listening Part 2 | 1-Person Monologue | 20–70 seconds | Select Missing Word
PTE Academic™ | Listening Part 2 | 1-Person Monologue | 15–70 seconds | Highlight Incorrect Word
PTE Academic™ | Listening Part 2 | 1-Person Sentence Read Aloud | 4 seconds | Dictation
PTE Academic™ | Speaking 1 | 1-Person Spoken Sentence | 3–9 seconds | Sentence Repeat (Listen to and then repeat a sentence orally)
PTE Academic™ | Speaking 2 | 1-Person Mini-Lecture | Up to 90 seconds | Oral Summary (Listen to/watch a lecture, retell the lecture in own words)
PTE Academic™ | Speaking 3 | 1-Person Asking a Question | 3–9 seconds | Oral Short Answer
TOEFL iBT® | Listening Part 1 | 1-Person Lecture with Students' Questions and Comments | 3–5 minutes | MC, 6 per Lecture (4–6 lectures)
TOEFL iBT® | Listening Part 2 | 2-Person Conversations | 2–3 minutes | MC, 5 per Conversation
TOEFL iBT® | Integrated Writing Task | 1-Person Lecture (Read/Listen/Write) | 2 minutes | Summary Writing
TOEFL iBT® | Integrated Speaking Task | 1-Person Lecture and 2-Person Conversations (Listen/Speak) | 1–2 minutes | Summary Speaking


TABLE 5.2 Characteristics of the Spoken Texts Used on the IELTS™

Feature | Sections 1 and 3 | Sections 2 and 4
Connected Speech | Minimal; some contractions | Minimal; some contractions
Speech Rate | Normal | Normal
Filled Pauses | Minimal, a few instances (uh, um) | Almost none
Unfilled Pauses | None | None
Hesitation Phenomena | Minimal, some repeats and restatements | Minimal
Spoken Grammatical/Lexical Norms | Oral turn starters (and, right, OK, So, Well); there's instead of there are | None
Slang/Colloquial Language | None | None
Listener Response Features | A few instances | n.a.

There were some (minimal) contractions and very minimal amounts of connected speech. The speech rate seemed normal, and there were a few instances of filled pauses (uh, um), although there were virtually no unfilled pauses. There were also some hesitation phenomena present, including repeats and restatements as well as numerous turn starters such as And, Right, and So. There were very few instances of listener response features, such as back channels. Not surprisingly, these instances of hesitation phenomena, spoken grammatical norms, and listener response features were more prominent in sections 1 and 3, which involved multiple speakers, than they were in sections 2 and 4, which involved monologues. However, while the spoken texts used on the IELTS™ listening had some of the characteristics of unplanned spoken texts, it also seemed obvious that the texts were indeed scripted, composed, written, and then read aloud by trained voice actors. There were fewer filled pauses and hesitation phenomena than one would expect in informal, unscripted, spoken discourse. There was no colloquial language used, and there were surprisingly few listener response features in the conversational texts, when in real-life contexts they would be much more prominent. Even the filled pauses that were present seemed artificial, as if they had been scripted and then read/performed by the voice actors.

MELAB®

The speaking texts for the listening sections of the MELAB® involve monologic, two-person, and multiple-person conversations. Most of the texts involve TLU domains in which one would expect unscripted spoken texts. For example, the spoken texts previewed here include short statements or questions, a two-person


phone conversation between a customer and a representative of the water company, and a radio news program interview. However, the third part of the listening section of the MELAB ®, which involves a radio news program, is ambiguous, in that some radio programs are entirely scripted, while some others that involve interviews would include unscripted texts. Indeed, two of the MELAB ® radio programs examined in the sample test were examples of news reports which would be totally scripted, while two of the radio program texts included interviews (which are examples of unscripted texts). Only the texts in which the TLU domain would suggest unscripted texts are analyzed here (see Table 5.3). As a whole, the spoken texts used in the MELAB ® listening had few of the characteristics of unscripted spoken texts, and it was readily apparent that the spoken texts were scripted and read aloud. There were only very minimal instances of connected speech. While the speech rate seemed normal, there were virtually no instances of filled or unfilled pauses, hesitation phenomena or listener response features, and very few instances of spoken grammatical/lexical norms. The lack of features of unscripted spoken texts was especially noticeable in the two-person short conversations of part 2. There are about twenty of these two-person informal conversations texts in part 2, and there are usually two to four turns per conversation. In the real-world contexts that these texts represent, one would expect numerous filled pauses, back channels, overlaps, interruptions and spoken grammatical/lexical norms, yet they are almost completely absent here. However, there were many instances of slang and colloquial language used in the MELAB ® listening, especially in the two-person conversation texts.

TABLE 5.3 Characteristics of the Spoken Texts Used on the MELAB®

Feature | Part 1 | Part 2 | Part 3
Connected Speech | Minimal; some contractions | Minimal; some contractions | Minimal; some contractions
Speech Rate | Normal | Normal | Normal
Filled Pauses | None | None | None
Unfilled Pauses | None | None | None
Hesitation Phenomena | None | None | None
Spoken Grammatical Norms | None | None | None
Slang/Colloquial Language | Prominent | Prominent (e.g., in the red, gets a kick out of) | Minimal
Listener Response Features | n.a. | None | None


PTE Academic™

The first PTE Academic™ listening section involves an academic lecture text, while the second listening section involves a number of different types of monologues, generally ranging from fifteen seconds to ninety seconds, as well as a short dictation task. Most of these texts involve TLU domains in which we would expect unscripted spoken texts (the dictation task is the obvious exception). In addition, the speaking section has three tasks involving listening. Of these three tasks, only one (an academic lecture) involves a TLU domain where unscripted spoken texts are expected. Again, only the texts in which the TLU domain would dictate unscripted texts are analyzed here (see Table 5.4).

The spoken texts used in the listening sections and the spoken text used in the retell lecture speaking task of the PTE Academic™ had many of the features typical of unplanned spoken texts. There was a moderate amount of connected speech, including contractions, reduced speech, assimilation, blending, and so on. The speech rate was normal, and some of the texts had seemingly rapid speech. There were numerous instances of filled pauses and some unfilled pauses. Likewise, there were numerous instances of hesitation phenomena and spoken grammatical/lexical norms, and there were even some notable instances of colloquial language used. Since all of the texts examined were monologic, there were no instances of listener response features.

TABLE 5.4 Characteristics of the Spoken Texts Used on the PTE Academic™

Feature | PTE Listening Section 1 | PTE Listening Section 2 | PTE Speaking (Retell Lecture)
Connected Speech | Moderate | Moderate | Minimal; some contractions
Speech Rate | Normal | Normal, some rapid speech | Normal
Filled Pauses | Numerous instances (uh, um) | Numerous instances (uh, um) | Some instances
Unfilled Pauses | Some | Some | Some
Hesitation Phenomena | Numerous repeats and restatements | Numerous repeats and restatements | Moderate repeats and restatements
Spoken Grammatical Norms | Extensive | Extensive | Some
Slang/Colloquial Language | Minimal, some instances | Minimal, some instances | Prominent
Listener Response Features | n.a. | n.a. | n.a.


While performing the analysis of the spoken texts used on the PTE , it quickly became apparent that many of them were, in fact, authentic, unscripted spoken texts. The amount of connected speech, pauses, hesitation phenomena, and spoken grammatical/lexical norms was much more extensive than was encountered in the spoken texts from the other tests. In addition (and unlike most of the other spoken texts analyzed in this research), the instances of hesitation phenomena, filled and unfilled pauses, and connected speech also sounded natural, unforced, and authentic. It was obvious that the speakers in these texts were not voice actors trained to sound natural. These speakers seemed to be simultaneously composing and uttering their words, and the connected speech, pauses, and hesitations were all part of the composing process.

TOEFL iBT®

The spoken texts used with the different TOEFL tasks involve both one-person monologues and multiple-person conversations. All of the texts reviewed involve TLU domains in which one would expect unscripted spoken texts (e.g., an academic lecture with student comments and questions; a service encounter between a student and a registrar). The speaking texts examined include those used in sections 1 and 2 of the listening component of the TOEFL as well as the speaking texts used in the integrated writing task (read/listen/write) and the integrated speaking task (listen/speak) (see Table 5.5).

The spoken texts used in the TOEFL listening section had some features typical of unplanned spoken discourse. Although there was minimal connected speech, the speech rate was normal and there were a number of instances of filled and unfilled pauses. There were minimal instances of repeats and restatements, although spoken grammatical/lexical norms were more prominent. There were virtually no instances of colloquial language or listener response features. The spoken texts used in the integrated writing and integrated speaking tasks had even fewer features typical of unplanned spoken discourse than was found in the spoken texts from the listening section of the TOEFL. Similar to the spoken texts used on the IELTS™ and the MELAB®, it seemed obvious that the spoken texts used on the TOEFL were scripted, composed, and then read aloud by trained voice actors. While the texts had some of the features typical of unscripted spoken language, and the speakers were able to integrate some of these features (i.e., filled pauses, restatements) into their speech somewhat naturalistically, the texts overall seemed qualitatively different from real-world unplanned spoken language.

Construct Implications of Using Scripted and Unscripted Spoken Texts

As described earlier, much of the research examining how L2 listeners engage with scripted or unscripted spoken texts seems to focus on whether scripted texts are easier or more difficult for L2 listeners than unscripted texts, while very little research has been conducted examining the types of spoken texts that are actually used on high-stakes tests of L2 listening proficiency.


TABLE 5.5 Characteristics of the Spoken Texts Used on the TOEFL iBT®

Feature | TOEFL 1 (Lectures) | TOEFL 2 (Conversations) | TOEFL Integrated Writing | TOEFL Integrated Speaking
Connected Speech | Minimal | Minimal; some contractions | Almost none | Almost none
Speech Rate | Normal | Normal | Normal | Normal
Filled Pauses | Minimal, a few instances (uh, um) | Some (uh, um) | Almost none | Almost none
Unfilled Pauses | Some | Minimal | Almost none | Almost none
Hesitation Phenomena | Some repeats and restatements | Some repeats and restatements (you know, I mean) | Almost none | Almost none
Spoken Grammatical Norms | Oral Turn Starters | Oral Turn Starters, Oral Turn Closers | None | Some Oral Turn Starters; lecture discourse markers (OK? Well,)
Slang/Colloquial Language | None | None | None | Almost none
Listener Response Features | None | n.a. | None | None

From an assessment standpoint, it is essential to consider the construct implications of the types of spoken texts used on an L2 listening assessment. Numerous longitudinal research studies (e.g., Gilmore, 2011; Herron & Seay, 1991) have demonstrated that exposure to and focused instruction on the different linguistic, organizational, and vocal characteristics of unscripted spoken texts lead to L2 learners' increased ability to comprehend these types of texts. Yet some students learning the target language in foreign language contexts might hear very little English beyond the textbook texts provided in their EFL classroom. As Bachman & Palmer (1996) describe, it is necessary when developing a test to identify the TLU domain—the "situation or context in which the test taker will be using the language outside of the test itself" (p. 18).


domain dictates the types of spoken texts that will be used as the stimulus on the test, and thus, it is necessary to identify the types of spoken texts that are found in the TLU domain. Making sure that the spoken texts used on the test have many of the same characteristics as the spoken texts found in the TLU domain will enable the test developer to make more valid inferences about the test takers' listening ability outside the test context. For the four tests examined here, the TLU domain is academic listening involving spoken texts that are typical of a higher education context. In this TLU domain, the vast majority of the spoken texts are unscripted (e.g., academic lectures and other forms of classroom discourse, interactions with classmates, informal social interactions, service encounters, etc.). Messick (1989, 1996) describes two major threats to the construct validity of a test: construct-irrelevant variance and construct underrepresentation. If the spoken texts used to assess test takers' ability in this academic TLU domain include only scripted spoken texts, they would lack many of the phonological, organizational, and vocal characteristics of real-world spontaneous (unscripted and unplanned) spoken language. This can result in construct underrepresentation, a serious threat to the construct validity of the assessment. Of the four tests examined here, three did not use unscripted spoken texts. Even the PTE Academic™, which did utilize unscripted spoken texts, only included monologic unscripted texts. None of the tests examined here utilized unscripted spoken texts that involved conversational interaction. When taking a listening test that utilizes unscripted spoken texts, some test takers will be better able to process and comprehend unscripted spoken language than other test takers. As described above, research has found that a learner who has had extensive exposure to unscripted spoken texts should be able to comprehend this type of language better than a learner who has limited or no exposure to it. Similarly, a student in a class whose teacher focuses on listening to authentic spoken language and provides practice and strategies for processing these types of texts would likely score higher on an authentic listening test than a student whose teacher only exposes him or her to scripted, textbook texts. These differing ability levels are exactly what tests are trying to measure—construct-relevant variance. The failure to include unscripted spoken texts (if they are part of the TLU domain) would lead to construct underrepresentation, and disadvantage those test takers who are better able to comprehend this type of spoken input. More recently, the field of L2 assessment has moved toward an argument-based examination of validity (e.g., Kane, 2001; Weir, 2005). If test developers are creating an assessment use argument for the validity of their test, they need to examine the importance of unscripted spoken texts as part of the domain definition and analysis. For the three tests analyzed here that did not include unscripted spoken texts, it would seemingly be difficult for the test developers to provide evidence for claims that the scores from their test lead to valid inferences about the test takers' ability to understand real-world spoken language.

Possible Washback Effect on Learners

The previous discussion regarding the importance of using authentic, unscripted spoken texts on L2 listening assessments focused on construct validity, and how doing


so can affect the validity of the inferences made from the results of the test. However, it is also important to consider this issue in relation to consequential validity, and the impact this decision can have on test stakeholders. Messick (1989) associated construct validity with the idea of "consequential validity." He argued that it is necessary to consider the influence or impact that a test has on teaching and learning when examining the validity of that test. Hughes (2003) argues even more explicitly that the test developer must consider how to maximize positive washback through the design of the test. This issue of test washback is an increasingly important area of L2 assessment research (e.g., Alderson & Wall, 1993; Shohamy, 2001), and numerous washback studies have been conducted that clearly show that high-stakes tests impact how and what L2 learners study, how L2 teachers teach, and how L2 curriculum developers develop curricula (e.g., Cheng, 2008; Shih, 2010; Vongpumivitch, 2012). However, there does not seem to be any research that focuses on the possible washback effect that using authentic, unscripted spoken texts on L2 listening tests might have on test takers and test users. Part of the reason for the lack of washback research may be that so few high-stakes L2 listening tests actually use authentic, unscripted spoken texts (Wagner & Wagner, forthcoming). Indeed, this issue presents a bit of a "chicken or the egg" dilemma for test developers. On the one hand, if L2 learners are only exposed to scripted, planned, and polished textbook texts in the classroom, it would be unfair to include unscripted spoken texts on a test that these learners would take. On the other hand, if high-stakes L2 listening tests only use scripted spoken texts, then L2 learners, teachers, and curriculum developers will have little incentive to include authentic, unscripted spoken texts in the L2 classroom (Wagner, 2013, 2014). For this dilemma to be resolved, both test developers and curriculum developers will need to acknowledge the importance of the use of unscripted spoken texts in the teaching and testing of communicative competence, and integrate these types of texts into their products. This seems to be exactly what the developers of the PTE Academic™ are doing; their website stresses the authentic nature of the test, and argues that the use of genuine academic lecture texts demonstrates the test developer's "commitment to ensuring students are better prepared to use their English for academic study" (http://www.pearsonpte.com/wp-content/uploads/2014/07/RelevantFactsheet.pdf).

Suggestions for Integrating Unscripted Spoken Texts into L2 Listening Tests

There are many legitimate reasons why publishers and test developers might be reluctant to include unscripted spoken texts on L2 listening tests. Wagner (2014) describes a number of these reasons, including the fact that such texts might sound unprofessional; using authentic spoken texts can sometimes be frustrating and demotivating for lower-ability learners; copyright and security issues involved might present hurdles to their use; and if learners are only exposed to scripted, textbook texts in their classrooms, it would be unfair to include unscripted spoken texts on the test. However, perhaps the primary reason test developers do not use unscripted spoken texts is that in most test development contexts, it is more efficient to create a spoken text specifically for use on that particular test. Test developers often have


specific, predetermined test task specifications dictating the length of spoken texts used, the number (and types) of items for each task, the genre and content of the texts, and the types of abilities to be assessed (Carr, 2011; Wagner, 2013). Finding authentic texts that fit these predetermined task specifications can be difficult, and thus, many test developers choose to create spoken texts for use in their tests. Test writers can create the script to match their test specifications in terms of time, length, vocabulary level, and so on, and the text can be tweaked to test specific information and types of abilities that might be mandated by the test specifications. Most importantly for test developers, this process is seen as more efficient, and makes it much easier for item writers to meet test specifications. This approach has been criticized by researchers (e.g., Buck, 2001; Carr, 2011; Wagner, 2013) on theoretical grounds, but defended by test developers because of practicality and efficiency constraints. The alternative, in which an unscripted spoken text is identified, and then the response items are created to fit the text, allows for the use of authentic texts, but might be less efficient and more difficult if the test task specifications are already set (Buck, 2001; Carr, 2011). However, the use by PTE Academic™ of genuine (authentic) lectures as spoken texts in the listening sections indicates the feasibility of their use. And again, the developers of the PTE Academic™ trumpet the fact that "the lectures are genuine academic lectures, not actors reading scripts" (http://www.pearsonpte.com/wp-content/uploads/2014/07/RelevantFactsheet.pdf). Another possibility for incorporating more authentic texts is through the use of semi-scripted texts (Buck, 2001). With semi-scripted texts, the voice actors are given the basic outline of a text but are not given the line-by-line, word-for-word texts to be spoken. Buck (2001) argues that the use of semi-scripted texts is efficient in that the outline of the text can be structured to match the required test specifications and that the resulting texts can have many of the characteristics of natural, unplanned speech. Wagner & Toth (2014) and Wagner (2008) used semi-scripted texts in their research and reported that the resulting texts had many of the characteristics of unplanned, spontaneous speech. Clark (2014) described in his study how he revised the listening section of a placement test by creating a number of semi-scripted texts meant to simulate academic lectures. He analyzed these semi-scripted lecture texts, and found that they had many of the features of natural academic language, including natural hesitation phenomena, self-corrections, the use of questions oriented to the audience, and many instances of the first person, and concluded that if a listening test seeks to assess an "examinee's ability to understand 'real' speech, the method employed here is a viable alternative to the use of scripted listening passages" (Clark, 2014, p. 21). The use of semi-scripted texts should result in broader construct coverage, and avoid the construct underrepresentation that results from the exclusive use of scripted texts.
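As a concrete illustration of the spec-matching problem just described, the sketch below screens a pool of candidate recordings against a hypothetical task specification (an unscripted text of two to three minutes and 300–500 words). The field names, thresholds, and candidate texts are all assumptions made for the example; they do not describe the procedure of any of the tests discussed here.

```python
from dataclasses import dataclass

@dataclass
class SpokenText:
    title: str
    duration_s: float   # length of the recording in seconds
    word_count: int
    scripted: bool

# Hypothetical task specification: an unscripted text of 2-3 minutes and roughly 300-500 words.
def meets_spec(text: SpokenText) -> bool:
    return (not text.scripted
            and 120 <= text.duration_s <= 180
            and 300 <= text.word_count <= 500)

candidates = [
    SpokenText("Biology lecture excerpt", 165.0, 410, scripted=False),
    SpokenText("Registrar service encounter", 95.0, 240, scripted=False),
    SpokenText("Studio-recorded dialogue", 150.0, 380, scripted=True),
]

usable = [t.title for t in candidates if meets_spec(t)]
print(usable)  # ['Biology lecture excerpt']
```

The point of the sketch is simply that item writing then has to follow the text, rather than the text being written to fit pre-authored items, which is the efficiency trade-off discussed above.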

Areas for Future Research

Although using semi-scripted texts would seem to be an effective strategy for test developers, there seems to be little research examining the extent to which semi-scripted texts do indeed have the same linguistic, organizational, and vocal characteristics as unscripted, spontaneous, authentic spoken texts. More research is


needed to ensure that using semi-scripted texts is a viable alternative for L2 listening test developers. Another aspect of the issue in need of more research is how individual learners differ in their ability to process and comprehend unscripted, spontaneous spoken texts. Individual learner characteristics, including experience with these types of texts, classroom strategy training, and working memory, all might contribute to learners' differing ability to comprehend unscripted spoken texts. Longitudinal research (e.g., Gilmore, 2011; Herron & Seay, 1991) on how explicit classroom instruction can help L2 listeners develop the ability to understand unscripted spoken texts has been very informative, but a broader research agenda is needed to investigate these other aspects. Finally, more research is needed to examine which of the characteristics of unscripted, authentic spoken language present the most difficulty for L2 learners. A more nuanced understanding of how the linguistic (i.e., grammatical, lexical), vocal (i.e., connected speech, hesitation phenomena), and organizational (i.e., patterns of discourse structure, listener response features) characteristics of unplanned spoken discourse affect learners' processing and comprehension could result in better informed pedagogical practices.

Conclusion

This chapter has stressed that the TLU domain should dictate the types of spoken texts used on L2 listening assessments. I have advocated the use of authentic, unscripted spoken texts on L2 listening assessments since these types of texts would seem to be part of virtually every domain of interest to test users trying to assess the communicative competence of test takers. This is certainly true of the four tests examined here, which have an academic TLU domain. I am not advocating the exclusive use of unscripted texts because being able to understand scripted spoken texts is also part of the TLU domain of interest in most language testing contexts. As Buck (2001) states, "not every text needs to be authentic" (p. 165). Rather, it is important that the texts used be representative of the types of spoken texts that are found in the TLU domain. While the results of this study indicate that most of the spoken texts used in these four high-stakes L2 listening assessments are scripted spoken texts, PTE Academic's™ use of unscripted spoken texts demonstrates that it can be done. Using authentic, unscripted spoken texts on L2 listening assessments can help ensure that the communicative competence of test takers is actually being assessed, provide evidence supporting the argument about the validity of the test, and promote positive washback among test stakeholders.

References

Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14(2), 115–129.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford, UK: Oxford University Press.


Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford, UK: Oxford University Press.
Buck, G. (1995). How to become a good listening teacher. In D. Mendelsohn & J. Rubin (Eds.), A Guide for the Teaching of Second Language Listening. San Diego, CA: Dominie Press, pp. 113–131.
Buck, G. (2001). Assessing Listening. Cambridge, UK: Cambridge University Press.
Canale, M. (1983). From communicative competence to communicative language pedagogy. In J. Richards & R. Schmidt (Eds.), Language and Communication. London, UK: Longman, pp. 2–27.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Carr, N. (2011). Designing and Analyzing Language Tests: Oxford Handbooks for Language Teachers. Oxford, UK: Oxford University Press.
Cheng, L. (2008). The key to success: English language testing in China. Language Testing, 25(1), 15–37.
Clark, M. (2014). The use of semi-scripted speech in a listening placement test for university students. Papers in Language Testing and Assessment, 3(2), 1–26.
Field, J. (2008). Listening in the Language Classroom. Cambridge, UK: Cambridge University Press.
Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining Listening: Research and Practice in Assessing Second Language Listening. Cambridge, UK: Cambridge University Press, pp. 77–151.
Gilmore, A. (2007). Authentic materials and authenticity in foreign language learning. Language Teaching, 40(2), 97–118.
Gilmore, A. (2011). "I Prefer Not Text": Developing Japanese learners' communicative competence with authentic materials. Language Learning, 61(3), 786–819.
Griffiths, R. (1990). Speech rate and NNS comprehension: A preliminary study in time-benefit analysis. Language Learning, 40(3), 311–336.
Griffiths, R. (1991). Pausological research in an L2 context: A rationale and review of selected studies. Applied Linguistics, 12(4), 345–364.
Henrichsen, L. (1984). Sandhi-variation: A filter of input for learners of ESL. Language Learning, 34(3), 103–126.
Herron, C., & Seay, I. (1991). The effect of authentic oral texts on student listening comprehension. Foreign Language Annals, 24(6), 487–495.
Hughes, A. (2003). Testing for Language Teachers (2nd edn.). Cambridge, UK: Cambridge University Press.
Ito, Y. (2001). Effect of reduced forms on ESL learners' input-intake process. Second Language Studies, 20(1), 99–124.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.
Kelch, K. (1985). Modified input as an aid to comprehension. Studies in Second Language Acquisition, 7(1), 81–89.
King, C., & East, M. (2011). Learners' interaction with listening tasks: Is either input repetition or a slower rate of delivery of benefit? New Zealand Studies in Applied Linguistics, 17(1), 70–85.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational Measurement (3rd edn.). New York, NY: American Council on Education and Macmillan, pp. 13–103.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 242–256.
Richards, J. (2006). Materials development and research—making the connection. RELC Journal, 37(1), 5–26.


Rost, M. (2011). Teaching and Researching Listening (2nd edn.). Harlow, UK: Pearson Education.
Rubin, A. (1980). A theoretical taxonomy of the difference between oral and written language. In R. Spiro, B. Bruce & W. Brewer (Eds.), Theoretical Issues in Reading Comprehension. Hillsdale, NJ: Erlbaum, pp. 411–438.
Shih, C.-M. (2010). The washback of the General English Proficiency Test on university policy: A Taiwan case study. Language Assessment Quarterly, 7(3), 234–254.
Shohamy, E. (2001). The Power of Tests: A Critical Perspective on the Uses of Language Tests. Essex, UK: Longman.
Shohamy, E., & Inbar, O. (1991). Validation of listening comprehension tests: The effect of text and question-type. Language Testing, 8(1), 23–40.
Spolsky, B. (1985). The limits of authenticity in language testing. Language Testing, 2(1), 31–40.
Vongpumivitch, V. (2012). Motivating lifelong learning of English? Test takers' perceptions of the success of the General English Proficiency Test. Language Assessment Quarterly, 9(1), 26–59.
Wagner, E. (2008). Video listening tests: What are they measuring? Language Assessment Quarterly, 5(3), 218–243.
Wagner, E. (2013). Assessing listening. In A. Kunnan (Ed.), Companion to Language Assessment (Volume 1). Oxford, UK: Wiley-Blackwell, pp. 47–63.
Wagner, E. (2014). Using unscripted spoken texts to prepare L2 learners for real world listening. TESOL Journal, 5(2), 288–311.
Wagner, E., & Toth, P. (2014). Teaching and testing L2 Spanish listening using scripted versus unscripted texts. Foreign Language Annals, 47(3), 404–422.
Wagner, E., & Wagner, S. (Forthcoming). Scripted and unscripted spoken texts used in listening tasks on high stakes tests in China, Japan, and Taiwan. In V. Aryadoust & J. Fox (Eds.), Current Trends in Language Testing in the Pacific Rim and the Middle East: Policies, Analyses, and Diagnoses. Newcastle upon Tyne, UK: Cambridge Scholars Publishing.
Weir, C. (2005). Language Testing and Validation: An Evidence-Based Approach. Hampshire, UK: Palgrave Macmillan.
Zhao, Y. (1997). The effects of listeners' control of speech rate on second language comprehension. Applied Linguistics, 18(1), 49–68.


6 The Role of Background Factors in the Diagnosis of SFL Reading Ability

Ari Huhta, J. Charles Alderson, Lea Nieminen, and Riikka Ullakonoja

ABSTRACT

In order to diagnose strengths and weaknesses in second/foreign language (SFL) reading, we need a much more detailed understanding of the construct and the development of SFL reading abilities. This understanding is also necessary to design useful feedback and interventions. In this chapter, we report findings from DIALUKI, a research project into the diagnosis of SFL reading that examined hundreds of Finnish learners of English as a foreign language (10- to 18-year-olds) and Russian learners of Finnish as a second language (10- to 15-year-olds). The study aimed to predict SFL reading ability with a range of linguistic, cognitive, motivational, and background measures. We cover both L2 Finnish and FL English learners, focus on the findings regarding learners' background, and examine to what extent such learner characteristics can predict SFL reading and possibly distinguish between L2 and FL readers.

Diagnosing SFL Reading

The ability to read in a second or foreign language (SFL) is important for business people, for academic researchers, for politicians, and for professionals in many fields. It also provides a useful basis for the ability to use the language in general and to interact with its speakers, for professional, educational, social, and leisure purposes. The assessment of a person's SFL reading ability may serve several different purposes:


To identify whether a person has sufficient reading proficiency in order to study at university, to test a person's progress in reading during a language course, or to diagnose a person's strengths and weaknesses in reading in an SFL. However, to date, the design and construction of diagnostic SFL reading tests are not well understood. Little is known about how SFL reading ability develops, how to identify strengths and weaknesses in SFL reading ability, how teachers can best facilitate such reading abilities, or which factors contribute most to the development of overall SFL reading performance. Thus, research into the factors that influence progress in SFL reading is an important and growing area of applied linguistics as well as of assessment, and this chapter presents one aspect of the growing field of the diagnostic assessment of SFL reading.

The DIALUKI Project

The research described in this chapter is part of a four-year, multidisciplinary project studying the relationship between reading in one's first language (L1) and one's second or foreign language. The research team included applied linguists, language assessment specialists, psychologists, and researchers into reading problems such as dyslexia. The project examined two separate groups of informants: Finnish-speaking learners of English as a foreign language, and Russian-speaking learners of Finnish as a second language in Finland. The English learners (henceforth labeled the FIN-ENG group) form three separate subgroups: mostly 10-year-old learners in primary school grade four (n = 211), mostly 14-year-old learners in grade 8 of lower secondary school (n = 208), and mostly 18-year-old learners in the second year of upper secondary gymnasia (n = 219). The Russian-speaking Finnish learners (henceforth known as the RUS-FIN group) are the children of immigrants to Finland, and constitute two groups, the primary group and the lower secondary group. The division of the RUS-FIN group into two broader subgroups was based on the fact that recently arrived immigrant children are not always placed in the grade level that matches their age because of their proficiency in Finnish and the time of the year when they arrive. Some of them are thus older than their Finnish-background fellow students. The participants in the primary group came from grades 3–6 and were mostly 9–12 years old (n = 186); the lower secondary group came from grades 7–9 and were 13–16 years old (n = 78). The project involved several substudies: a cross-sectional study, part of which we deal with here; a longitudinal study; and several intervention studies. In this chapter, we report on a range of background factors that may affect the SFL reading ability of the fourth and eighth graders in the FIN-ENG group, and both the primary and lower secondary groups in the RUS-FIN group.

Review of Previous Research into Learners' Backgrounds

A wide range of individual and contextual factors that relate to, or even cause, differences in L1 and SFL skills have been investigated in previous research, including


but not limited to, learners' motivation, intelligence, aptitude, personality, L1 skills, and contextual characteristics such as the teacher, the classroom, and the school more generally. Such variables can indeed correlate with, or even cause, foreign language proficiency (see, e.g., Alderson et al., 2014; Sparks & Ganschow, 1995; Sparks et al., 2006). In this chapter, however, we focus on those background factors that were included in the DIALUKI study, namely certain characteristics of the learner's home, for example, parents' education, reading and writing habits, language skills and possible reading problems, and of the learners themselves, for example, gender, reading/writing habits, use of the SFL, age of learning to read in L1, and so on. Therefore, our review of previous research focuses on these aspects of the learners' background in particular. A considerable amount of literature exists on the relationship between different background factors and reading in a first language. Parents' socioeconomic status (SES) is one of the key factors that affect their children's L1 reading comprehension. More specifically, middle- or high-SES parents engage more in joint book reading with their children and have more books available for their children than low-SES parents. They also talk more to their children and use more elaborate and abstract (context-independent) language (see, e.g., Hoff, 2006; Mol & Bus, 2011). In Melby-Lervåg and Lervåg's (2014, p. 412) words, "the weight of evidence suggests that SES affects the quality and the quantity of the language to which children are exposed." Consistent findings from the international PISA studies of 15-year-olds reading in the language of education (often their L1) show that girls outperform boys when reading, although the gender gap in performance is narrower in digital reading than in print reading (OECD, 2011, p. 19), and children of more highly educated parents from more affluent families outperform their lower SES peers (OECD, 2010, p. 35). Canadian studies have also shown that parents' socioeconomic status is related to their children's second language (L2) performance (Geva, 2006). Geva's review concluded that the children of higher SES parents achieved better results in L2 reading tests. The European Survey of Language Competences (ESLC) was a study of more than 50,000 learners' foreign language achievement in fourteen European Union countries (EC, 2012) that provides information about learners' background and foreign language (FL) proficiency. The study focused on students at the end of lower secondary education (typically aged 14–16) and covered reading, writing, and listening. The study included extensive background questionnaires for the students, teachers, and schools. The findings indicated that, in most countries, the number of languages that the students studied in school was significantly associated with their test performance: The more languages the students studied, the better their SFL reading and writing. The parents' knowledge of the target language, as reported by the learner, was positively related to their child's test results in that language, especially for writing. Target language use at home was also correlated with higher performance in that language. Learners' exposure to and use of the FL through media was strongly associated with their test performance in all the skills in almost all countries (see EC, 2012, pp. 205–220).
A recent Finnish national evaluation of learning outcomes in foreign languages covered several background factors and focused on grade 9 in the lower secondary school. The results of English as an FL were based on a sample of almost 3,500


students (Härmälä et  al., 2014). Unlike in the other foreign languages studied, in English reading, girls did not outperform boys, but both genders achieved equally good results. However, the parents’ educational level was strongly associated with performance in English. The difference between the children of parents without general upper secondary education and the children of parents who both had a general upper secondary education, was clearest for listening (effect size d = 0.84), but quite sizable also for the other three major skills (d = 0.78–0.79). The differences were most marked in students who had the highest English proficiency. However, the amount of time the students reported doing homework for English did not correlate with their performance on the tests of English. The students’ use of English in their free time was fairly strongly associated with their reading performance (.39). The two activities with the highest correlation with reading performance were watching movies and video clips in English, and reading online texts in English (.40 and .37, respectively) (see Härmälä et al., 2014, pp. 76–84). In addition, another recent study (Pietilä & Merikivi, 2014) of the same age group of Finnish pupils found that students who reported reading in English in their free time had significantly larger receptive and productive vocabularies than those who did not. To summarize, previous research on the relationship between background factors and reading in L1 or SFL has highlighted the importance of the parents’ socioeconomic background: Middle and high SES children tend to read better than their low SES peers. Parents’ knowledge of the SFL is also helpful for their children’s SFL skills, as also is learner’s use of the language in their free time.
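For reference, the effect size d reported for these comparisons is not defined in the chapter itself; assuming the conventional Cohen's d (an assumption on my part, not something stated in the evaluation report), it expresses the mean difference between two groups in pooled standard deviation units:

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
```

On this scale, values around 0.8 are conventionally interpreted as large effects, which is consistent with how the parental-education differences are described above.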

Description of the Background Questionnaires

In the DIALUKI project, questionnaires were administered to the students participating in the study and to their parents, and covered factors considered relevant for the development of SFL reading and writing skills. The questionnaires for the students and parents in the FIN-ENG groups were in Finnish, but for the RUS-FIN students and their parents, the questionnaires were bilingual in Russian and Finnish. The two-page parents' questionnaire was completed by 95–98 percent of the parents, in about 80 percent of cases by the mother. The questionnaire covered the socioeconomic status of the family (i.e., parents' education and income level), how often the parents read and write at home, and whether the students' parents, siblings, or parents' siblings have had problems in reading. The parents were also asked to reply to a set of questions about prereading activities that a family member might have done with their child when he or she was learning to read (e.g., reading books to them or playing word games with them), and to state the age at which the child had learned to read in their L1. Furthermore, the FIN-ENG learner's parent completing the questionnaire was asked to self-assess his or her English oral, reading, and writing skills, and the RUS-FIN learner's parent was asked to evaluate his or her Finnish skills, if his or her first language was not Finnish. The four-page student questionnaire was filled in by about 90 percent of the students in the RUS-FIN groups and 95–99 percent of the students in the FIN-ENG groups. The questions focused on:

● languages known by the student (L1 and other languages) and used in his or her home,
● amount of homework done on normal school days,
● attitude to reading and writing in free time,
● amount of reading and writing in free time,
● frequency of reading different kinds of material (fifteen types of material, e.g., e-mail, Facebook, online discussions, on-line news, magazines, newspapers, factual texts, fiction),
● frequency of writing different kinds of material (thirteen types of material, e.g., text messages, e-mails, Facebook messages, online discussions, letters or cards, notes, stories),
● frequency of using SFL in free time—for reading, writing, speaking, listening,
● knowledge of SFL before starting to study it at school,
● living in an English-speaking country or attending an English-medium school and for how long (only in the FIN-ENG group).

In addition, each student was asked the age at which he or she had learned to read, the same question that had been asked of the parents. For the second language learner group, this question was asked separately for Finnish and Russian.

Findings

Descriptive Characteristics of the Informant Groups: Language and Reading

The family language practices and environment of our informants can be described from several different perspectives. First, the parents' L1: In the FIN-ENG groups, the parents almost exclusively reported their first language to be Finnish. In the RUS-FIN groups, more than 90 percent of mothers and 80 percent of fathers reported Russian as their L1, but in addition Bulgarian, Tatar, Chechen, Ukrainian, Belarusian, Estonian, Karelian, Chinese, and Arabic were also mentioned as first languages. Finnish was the L1 of 14 percent of the fathers but of only 0.7 percent of the mothers, that is, only two mothers. These figures tell us that quite a few of the children in the RUS-FIN groups lived in a bilingual family, in which one of the parents had an L1 other than Russian. When the students were asked about their L1 and any other languages used at home, the answers from the FIN-ENG pupils were similar to their parents' reports: Their L1 was Finnish and they used Finnish at home. However, in the RUS-FIN groups, the linguistic situation was different from that of their parents. Russian was reported to be the L1 of 65 percent of primary school children and 70 percent of those in lower secondary school, but 19 percent of the primary and 14 percent of the secondary school children identified themselves as bilinguals and 11–12 percent reported that Finnish was their first language. The gradual movement toward the dominant language of Finnish society is not unexpected. In fact, according to


Montrul (2012), immigrant children are very likely to adopt the language of the surrounding society as their first language by the time they are adolescents. The linguistic situation in RUS -FIN homes was also reported differently by the children compared to their parents. 72 percent of the children said that Russian was the language mostly used at home, but 17 percent of the primary level and 22 percent of the secondary level students reported that both Finnish and Russian were used at home. This is likely to reflect the children’s own language use: They may speak Finnish with their siblings and also reply in Finnish to their parents who speak Russian to them. Both of these situations have been reported to be common in multilingual families (e.g., Mäntylä et al., 2009). The use of the second or foreign language in free time is the second perspective to be considered in the linguistic environment of the student groups. As expected, the RUS -FIN students used Finnish a lot in their free time. More than 80 percent reported speaking and reading in Finnish and listening to Finnish at least once or twice a week outside the daily routines of Finnish schools. Although the amount of speaking and reading in English in their free time in the FIN -ENG groups was much lower, the percentages were still rather high, especially for the fourth graders who had started learning English at school less than a year and half earlier. Around 47 percent of the fourth graders and 60 percent of the eighth graders reported speaking English at least once or twice a week. For reading in English, the percentages were 55 percent and 58 percent, respectively. Listening to English was even more common: 78 percent of the fourth graders and 95 percent of the eighth graders listened to English weekly in their free time. English enjoys quite a high status in Finland: It is the most popular foreign language taught in school, and for most pupils, it is also the first foreign language they learn from the third grade onward. English is seen as the language that will benefit one most in the future. It is also the language of video games, TV programs, and popular music, all very important factors for young people. It has sometimes been said that English is actually no longer a foreign language in Finland, but has become a second language for many younger people. The status of the Russian language in the RUS -FIN group is interesting. The vast majority of children both in the primary and lower secondary school said that they listen to and speak Russian every day, but reading in Russian in their free time was much rarer. Only approximately half of the children reported reading Russian daily, and as much as 15 percent of the primary and 10 percent of the lower secondary students claimed that they hardly ever read in Russian. This is likely to be at least partly due to the fact that there is less Russian reading material available than Finnish. If the family does not buy a Russian newspaper or magazine and there are no Russian books at home, children have to look for Russian reading material in libraries or on the Internet. The responsibility for developing reading in Russian is left to the parents and the students themselves, since 55 percent of the primary and 38 percent of the secondary students said they had never attended a Russian school, and thus had most likely first learned to read in their second language. 
Although most children (primary: 65%; secondary: 78%) had attended Russian language classes in Finnish schools, their Russian reading skills were not much used outside the classroom. In order to create a more precise picture of the students, we also asked about their literacy history as well as their interests and attitudes toward reading. According to


the parents’ reports in all groups, the majority of students learned to read (in any language) when they were six or seven years of age, which means they learned to read during the first year at school. In the FIN -ENG groups, the vast majority of the fourth graders reported liking to read to some extent (47%) or a lot (46%). For the eighth graders, the situation had changed. By that age, 25 percent did not like to read and the proportion of those who liked reading a lot had dropped to only 29 percent. The RUS -FIN primary and lower secondary pupils did not differ so clearly, and the proportions between “not liking,” “liking to some extent,” and “liking a lot” remained quite similar in both age groups (primary: 14%–58%–27%; secondary: 21%– 49%–29%). What and how much did these children read in their free time? Of the fourth graders in the FIN -ENG group, 72 percent reported reading only once or twice a month or even less frequently, whereas for the eighth graders, the proportion of infrequent readers was 45 percent. Thus, despite their positive attitude toward reading, the fourth graders seemed to read less frequently than the eighth graders. In the RUS FIN group, the trend was similar: 62 percent of the primary students reported reading once or twice a month or even less frequently, whereas for the lower secondary students the proportion of infrequent readers was only 35 percent. If the time spent on reading is divided between digital (Internet, games, SMS , e-mails, etc.) and print media (books, newspapers, magazines, etc.), then a similar trend can be seen in both language groups: The younger students read books, newspapers, magazines, and other print media, whereas the teenagers read digital media more frequently.

Correlates of SFL Reading

To investigate the relationship between SFL reading performance and learners' background variables, we ran correlational analyses (Spearman's rho) and regression analyses for both language groups. Table 6.1 reports the significant correlations between the parental background variables and reading in English as a foreign language and Finnish as a second language. Table 6.2 reports the corresponding findings for the student-related background variables. See Appendix 1 for the descriptive statistics concerning these variables. Parents' educational level and self-assessed proficiency in English were the most consistent parental background correlates of their children's reading in English in the FIN-ENG groups. Although the correlations were significant, they were quite modest and only ranged from .15 to .28, which may be partly due to the rather short ordinal scales used in some of the background variables (see Appendix 1). For the RUS-FIN students, only their parents' (mostly the mother's) oral skills in Finnish were correlated with their children's L2 Finnish reading test scores. Interestingly, the RUS-FIN parents' reading and writing habits were negatively correlated with their children's Finnish test scores. The reason for this might be that the parents (mothers in particular) were often native speakers of Russian, and therefore, probably read and wrote mostly in Russian. In addition, many of them were not working, which allowed them to spend a lot of time reading and writing. It is worth noting that neither the families' income level nor the kinds of potentially supportive reading activities during the time when their child was learning to read


TABLE 6.1 Spearman Rank-order Correlations between Parental Background Variables and Reading in FL English and L2 Finnish

| Variable | FIN-ENG 4th grade | FIN-ENG 8th grade | RUS-FIN Primary | RUS-FIN Lower sec. |
| Mother's educational level | ns | 0.207 (p = .003) | ns | ns |
| Father's educational level | 0.178 (p = .016) | 0.215 (p = .004) | ns | ns |
| Household income level | ns | ns | ns | ns |
| Amount of reading parent does at home, in any language | ns | 0.15 (p = .038) | ns | −0.328 (p = .009) |
| Amount of writing parent does at home, in any language | ns | ns | ns | −0.276 (p = .030) |
| Parent's self-assessment of oral skills in FL English/L2 Finnish | 0.156 (p = .029) | 0.182 (p = .011) | ns | 0.266 (p = .035) |
| Parent's self-assessment of reading in FL English/L2 Finnish | 0.281 (p = .000) | 0.217 (p = .002) | ns | ns |
| Parent's self-assessment of writing in FL English/L2 Finnish | 0.147 (p = .039) | 0.199 (p = .005) | ns | ns |
| Parents' self-assessment (mean across 3 skills) of FL English/L2 Finnish | 0.211 (p = .003) | 0.224 (p = .002) | ns | ns |
| Parents' prereading activities (mean) | ns | ns | ns | ns |

were associated with SFL reading test results in either of the language groups. The reason why family income level was not associated with the child’s SFL reading is not entirely clear. We speculate that the relatively small income differences in Finland may play a role in this (parents’ educational level was correlated with children’s SFL reading, but a higher educational level does not automatically mean a higher income level, and vice versa, although the two are correlated aspects of socioeconomic status). It may also be that parents’ income, and particularly, their supportive reading practices are more relevant to learning to read in one’s L1 than in an SFL . The most consistent correlation between student-related variables and students’ SFL reading across all the groups was the age at which the child had learned to read. However, there was an important difference between the foreign and second language groups. For the Finnish-speaking learners of English, learning to read in L1 Finnish before going to school was associated with higher performance on FL English reading tests even several years after learning to read. For the Russian-background


TABLE 6.2 Spearman Rank-order Correlations between Learner Background Variables and Reading in FL English and L2 Finnish

| Variable | FIN-ENG 4th grade | FIN-ENG 8th grade | RUS-FIN Primary | RUS-FIN Lower sec. |
| Age of the student | ns | ns | ns | ns |
| Age of learning to read in Finnish (according to parents) | −0.309 (p = .000) | −0.253 (p = .000) | N/A | N/A |
| Age of learning to read (in any language, but in practice either Russian or Finnish) (according to parents) | N/A | N/A | ns | ns |
| Age of learning to read in Finnish (according to child) | −0.222 (p = .002) | −0.269 (p = .000) | −0.171 (p = .049) | −0.317 (p = .015) |
| Age of learning to read in Russian (according to child) | N/A | N/A | 0.191 (p = .030) | ns |
| Feeling about reading in free time | ns | ns | ns | ns |
| Feeling about writing in free time | ns | ns | ns | ns |
| Time spent doing homework in general | ns | ns | ns | −0.370 (p = .003) |
| Amount of time spent on reading in free time | ns | ns | ns | ns |
| Amount of time spent on writing in free time | ns | ns | ns | ns |
| Frequency of listening in English/Finnish outside school | ns | ns | ns | ns |
| Frequency of reading in English/Finnish outside school | ns | 0.338 (p = .000) | ns | ns |
| Frequency of speaking in English/Finnish outside school | 0.155 (p = .031) | 0.279 (p = .000) | ns | ns |
| Frequency of writing in English/Finnish outside school | ns | ns | ns | ns |
| Number of languages the student reports knowing | ns | 0.209 (p = .003) | 0.172 (p = .039) | 0.460 (p = .000) |
| Frequency of free time reading (mean across all activities) | ns | ns | ns | ns |
| Frequency of free time writing (mean across all activities) | ns | ns | ns | −0.267 (p = .033) |
| Frequency of using English/Finnish overall (mean across 4 skills) | 0.158 (p = .026) | 0.288 (p = .000) | ns | ns |

learners of L2 Finnish, early learning to read in L1 Russian was not associated with better L2 reading performance. In fact, we found the opposite: The earlier they learned to read in L1 Russian, the worse their L2 Finnish reading test performance was. We discuss possible reasons for this in the final part of the chapter. The use of the SFL outside school was associated with better reading performance, especially in the foreign language (FIN -ENG ) groups. Frequency both of speaking English, and for the older age group, reading in English correlated significantly with reading skills. The number of different languages that the student reported knowing to any extent was also associated with SFL reading except for the youngest foreign language group, in which probably very few students could be expected to know languages other than their L1 and the foreign language they had just started to learn at school. Perhaps somewhat surprisingly, students’ attitudes toward reading or writing and the frequency of reading and writing were not correlated with their SFL reading in any of the groups. This is probably because those background questions were not language-specific, but aimed at tapping their attitude and habits toward reading and writing in general, across any languages they happened to know and use.
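A minimal sketch of the kind of rank-order correlation analysis summarized in Tables 6.1 and 6.2 is shown below. The data frame, column names, and values are toy placeholders; the actual analyses were run on the full DIALUKI datasets.

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy data standing in for one learner group; not the DIALUKI data.
df = pd.DataFrame({
    "reading_score":       [12, 18, 9, 22, 15, 20, 11, 17],
    "age_learned_to_read": [7, 5, 8, 5, 6, 6, 7, 6],
    "freq_sfl_use":        [2, 4, 1, 5, 3, 4, 2, 3],
})

for var in ["age_learned_to_read", "freq_sfl_use"]:
    rho, p = spearmanr(df["reading_score"], df[var])
    print(f"{var}: rho = {rho:.3f}, p = {p:.3f}")
```

Spearman's rho is used here, as in the tables above, because most of the background variables are ordinal (short rating scales) rather than interval-level measures.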

Regression Analyses

To find out more about the relationship between SFL reading and learners' background, we conducted a series of stepwise multiple linear regression analyses. These analyses included pair-wise deletion and used the Rasch measure from Winsteps analyses of the SFL reading test as the dependent variable, and the background variables as independent (predictor) variables. The exception to this procedure was the older RUS-FIN group, in which the small size of the group did not make a Rasch analysis meaningful. Two analyses were done: first with parents' background variables as predictors, then with the students' background variables (see Appendix 2). The analyses with parents' background variables included:

• mother's and father's educational level,
• household income level,
• amount of reading and writing that the responding parent does at home,
• parents' mean prereading activities,
• mean self-assessed FL English or L2 Finnish skills of the parent who completed the questionnaire.

The analyses with students' background variables included:

• the age at which they learned to read in their first language,
• attitudes to reading and writing in their free time,
• amount of time spent on reading and writing in their free time,
• frequency of using English, Finnish, or Russian outside school,
• frequency of free time reading and writing,
• the number of languages the student reported knowing.

Findings for the Fourth Grade Finnish Learners of English as a Foreign Language

Parents' background

In the regression analysis, only the parent's (mostly the mother's) self-assessed English skills in reading, writing, and speaking accounted for a significant amount of variance in the child's English reading test score. However, only 5 percent of the variance was explained (for details of the regression analyses, see Appendix 2).

Student’s background The result of the regression analysis with only the student background variables was similar to that of the parents’ background as only one significant predictor emerged, namely, the age at which the student learned to read in Finnish. The earlier the child had learned to read, the better his or her English test result in grade 4. Again, the amount of variance explained in the English test scores was quite modest, only 9.7 percent.

Findings for the Eighth Grade Finnish Learners of English as a Foreign Language

Parents' background

Only the father's educational level accounted for a significant amount of variance in the child's English reading test score, but the amount of variance explained was only 3.3 percent.

Student’s background In contrast to the parental background variables that only explained a very small proportion of English reading, the student background variables accounted for 11.1


percent of the variance in the scores. Three different variables turned out to be significant: frequency of using English in free time, age of learning to read in Finnish, and amount of time spent on reading in any language in one's free time.

Findings for the Russian-speaking Learners of Finnish as a Second Language

Parents' background

No single background variable in the regression analysis explained variation in L2 Finnish reading for primary school learners. In the lower secondary school group, the amount of reading the parent does at home explained 9 percent of the variance, but the relationship was negative, that is, less parental reading was related to better reading scores for the child. The somewhat skewed distribution of responses (43% of the parents reported reading more than ten hours a week) and the likelihood that their reading is done mostly in Russian could explain this negative relationship.

Student’s background In the primary school group, the frequency of using Russian overall, across all skills, and the number of languages the student reported knowing explained 8 percent of the variance in reading. In the lower secondary school group, the number of languages the student knew, frequency of free time writing, and amount of homework emerged as predictors, explaining 34 percent of the variance.

Findings for Certain Dichotomous Background Variables

Finally, we compared the students' performances on those background variables that are best treated as dichotomous rather than as continuous variables, using independent-samples t-tests. These include gender, occurrence of reading problems in the family, and knowledge of the SFL before starting school.
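A minimal sketch of the independent-samples comparison used for these dichotomous variables is given below; the score values are invented for illustration and are not the study data.

```python
from scipy.stats import ttest_ind

# Hypothetical reading scores for learners with and without prior knowledge of the SFL.
with_prior    = [18, 22, 20, 25, 19, 23]
without_prior = [15, 17, 14, 20, 16, 18]

t, p = ttest_ind(with_prior, without_prior)
print(f"t = {t:.2f}, p = {p:.3f}")
```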

Gender

Boys and girls in the four groups were rather equally balanced. In the fourth-grade FIN-ENG and the lower secondary RUS-FIN groups, the split was almost exactly 50 percent boys and 50 percent girls. In the other two groups, about 53–55 percent of the students were girls. In the fourth-grade FIN-ENG group, there was no difference between boys and girls in their performance in English reading. However, in the eighth-grade group, boys slightly but significantly outperformed girls. In the Russian-background group, the girls outperformed boys in the primary school sample, whereas boys were better in the lower secondary school sample, but the difference was not statistically significant.

Reading problems in the family

The background questionnaire asked parents to indicate whether the biological mother, father, child's siblings, mother's siblings or parents, or father's siblings or parents had had or still had problems in reading. About 35 percent of the families in both FIN-ENG groups indicated that at least one member of their extended family


had experienced problems in reading. In the RUS -FIN groups, the corresponding percentage was 13 percent. In both FIN -ENG groups, the students with no reading problems among their biological family members achieved significantly higher scores in the English reading test than their peers with even one such problem. No statistical differences were found for the two RUS -FIN groups. However, the low percentage of RUS -FIN parents reporting familial reading problems suggests that the occurrence of such problems may have been under-reported in this study.

Knowing the SFL to some degree before starting to learn it at school

This question was put to the two language groups slightly differently. The FIN-ENG students were asked if they knew any English ("except perhaps for a few words," as specified in the question) before starting to study it at school in grade three. 22 percent of the fourth graders and 17 percent of the eighth graders reported knowing at least some English before commencing their formal studies of the language. In contrast, the RUS-FIN students were asked if they knew Finnish, except for a few words, before going to school in Finland, as the language of instruction in their schools was Finnish. About half of both age groups reported they knew some Finnish before going to school in Finland. In the FIN-ENG groups, the students with some prior knowledge of English outperformed their peers without such knowledge. The difference was particularly clear among the fourth graders, and also quite substantial in the eighth graders' group. A similar, although not quite as strong, difference was found for the students in the younger RUS-FIN group: those who knew Finnish before going to school outperformed those who did not. However, in the lower secondary group, the difference was not statistically significant.

Discussion

The DIALUKI study aimed to explore the diagnosis of strengths and weaknesses in reading a second or foreign language. The focus of this chapter was on the relationship between learners' background and their SFL reading in order to pave the way for a better understanding of the diagnostic potential of such background information. Consistent with results from previous studies on both L1 and SFL reading, parents' educational level was correlated with their children's English reading performance (see, e.g., Härmälä et al., 2014; OECD, 2010). Interestingly, no such relationship was found for the Russian-background children. In neither language group was parents' income related to children's test results. A meta-analysis (Melby-Lervåg & Lervåg, 2014) and the European ESLC study (EC, 2012) suggested that parents' proficiency in the SFL may contribute to children's SFL proficiency, and our study provided some support for that. Only one significant positive correlation was found between parents' reading habits and their children's reading performance in English. However, for the Russian-background parents, this relationship was negative, which was probably due to their reading and writing more in Russian than in Finnish. The interplay of background factors in the Russian-background families appears quite complex


and requires more investigation: For example, these parents were often quite highly educated in Russia, but in Finland they are more likely to be unemployed than Finnish-speaking parents. The occurrence of first language reading problems among the biological family members was a significant indicator of weaker English reading performance in both groups of Finnish-background learners. The reason is most likely related to the fact that some L1 reading problems have a genetic basis (see Richardson & Lyytinen, 2014) and that L1 and SFL reading problems are often related. Therefore, students whose family members had experienced L1 reading problems were likely to have some L1 (and FL) reading problems themselves, and therefore did less well on the foreign language reading tests. For the Russian-background learners, no such difference was found between students with or without such family risks, for which we have no plausible explanation. However, the low percentage of parents reporting familial reading problems compared to the FIN-ENG parents (13% vs. 35%) suggests that the phenomenon may have been under-reported in the RUS-FIN group. On the whole, parental background alone could not explain much variance in SFL reading. For the Finnish-background students, only 3–5 percent of variance was accounted for, and for the Russian-background learners, also less than 10 percent. Student-related background variables explained more variance in learners' SFL reading: about or slightly more than 10 percent among the English FL learners and the younger L2 Finnish learners. In the older RUS-FIN group, explained variance rose to over 30 percent. The most consistent correlation across all the groups was the age at which the child had learned to read. For the Finnish-speaking FL English learners, preschool learning to read in L1 Finnish was associated with higher performance in FL English reading even years after learning to read. For the Russian-background learners of L2 Finnish, early learning to read in L2 Finnish was related to better L2 Finnish reading. However, early learning to read in L1 Russian indicated weaker L2 Finnish reading performance. This finding is explained by the age of arrival of the pupils: The pupils who were born in Finland or immigrated to Finland prior to school age would have learned to read earlier (and better) in Finnish than those who immigrated to Finland after having learned to read in Russian first and who thus became exposed to Finnish later.
Students’ attitudes toward reading or writing, and the frequency of reading and writing activities were not related to SFL reading. This is probably because the questions were not language-specific, but aimed at tapping their attitudes toward, and habits of, reading and writing in general, across any languages they happened to know.

Conclusion

From a diagnostic point of view, it is important to distinguish background factors that can be affected by parents, teachers, or the students themselves from those that cannot be changed. In order to predict weak performance and to prepare for possible problems at the institutional and teaching level, many background factors are potentially useful, for example, in order to organize remedial teaching and materials, and to hire specialist staff. However, in order to act on diagnostic information at the level of the individual student, factors that are amenable to improvement are most useful (Alderson et al., 2014). An obvious example of such factors is the recreational use of the language: Since more frequent use of the language is usually associated with higher language proficiency (and may even cause it), it makes sense to encourage learners to use the language as much as possible in their free time and to provide them with ideas and opportunities to do so. However, we need to put the role of background factors into a wider perspective. On the basis of the diagnostic studies on SFL reading we have conducted so far, the background factors reported here enable us to better understand the bigger picture of which factors relate to strengths and weaknesses in SFL reading. Nevertheless, compared with other components of reading, such as L1, basic cognitive/psycholinguistic processes, specific linguistic components (e.g., lexical and structural knowledge), and motivation, these background factors may play a much smaller role (see Sparks et al., 2006; Alderson et al., 2014). More research is needed, however, especially in second language contexts, because the predictive power of the student and family background appeared to differ between FL and L2 learners. Furthermore, L2 contexts are probably more complex (see, e.g., Jang et al.’s 2013 study of different kinds of L2 learners), and therefore, more information and a wider range of L2 learners should be included in future studies to get a more accurate picture of the similarities and differences in the importance of background factors in SFL development.

References

Alderson, J. C., Haapakangas, E.-L., Huhta, A., Nieminen, L., & Ullakonoja, R. (2014). The Diagnosis of Reading in a Second or Foreign Language. New York, NY: Routledge.

EC (European Commission). (2012). First European Survey on Language Competences. Final Report. Brussels: European Commission. Retrieved November 10, 2014 from http://www.ec.europa.eu/languages/policy/strategic-framework/documents/language-survey-final-report_en.pdf.

Geva, E. (2006). Second-language oral proficiency and second-language literacy. In D. August & T. Shanahan (Eds.), Developing Literacy in Second-Language Learners: Report of the National Literacy Panel on Language. Mahwah, NJ: Lawrence Erlbaum, pp. 123–139.

Hoff, E. (2006). How social contexts support and shape language development. Developmental Review, 26(1), 55–88.

Härmälä, M., Huhtanen, M., & Puukko, M. (2014). Englannin kielen A-oppimäärän oppimistulokset perusopetuksen päättövaiheessa 2013 [Achievement in English according to syllabus A at the end of comprehensive education in 2013]. Helsinki, Finland: National Board of Education. Retrieved November 10, 2014 from http://www.oph.fi/download/160066_englannin_kielen_a_oppimaaran_oppimistulokset_perusopetuksen_paattovaiheessa.pdf.

Jang, E. E., Dunlop, M., Wagner, M., Kim, Y-H., & Gu, Z. (2013). Elementary school ELLs’ reading skill profiles using Cognitive Diagnosis Modeling: Roles of length of residence and home language environment. Language Learning, 63(3), 400–436.

Mäntylä, K., Pietikäinen, S., & Dufva, H. (2009). Kieliä kellon ympäri: perhe monikielisyyden tutkimuksen kohteena [Languages around the clock: A family as the target for studies in multilingualism]. Puhe Ja Kieli, 29(1), 27–37.

Melby-Lervåg, M., & Lervåg, A. (2014). Reading comprehension and its underlying components in second-language learners: A meta-analysis of studies comparing first- and second-language learners. Psychological Bulletin, 140(2), 409–433.

Mol, S., & Bus, A. (2011). To read or not to read: A meta-analysis of print exposure from infancy to early adulthood. Psychological Bulletin, 137(2), 267–296.

Montrul, S. (2012). Is the heritage language like a second language? In L. Roberts, C. Lindqvist, C. Bardel, & N. Abrahamsson (Eds.), EUROSLA Yearbook Vol 12. Amsterdam, The Netherlands: John Benjamins, pp. 1–29.

OECD. (2010). PISA 2009 Results: What Students Know and Can Do. Student Performance in Reading, Mathematics and Science. Retrieved November 10, 2014 from http://www.oecd.org/pisa/pisaproducts/48852548.pdf.

OECD. (2011). PISA 2009 Results: Students on Line: Digital Technologies and Performance (Vol VI). Retrieved November 10, 2014 from http://dx.doi.org/10.1787/9789264112995-en.

Pietilä, P., & Merikivi, R. (2014). The impact of free-time reading on foreign language vocabulary development. Journal of Language Teaching and Research, 5(1), 28–36.

Richardson, U., & Lyytinen, H. (2014). The GraphoGame method: The theoretical and methodological background of the technology-enhanced learning environment for learning to read. Human Technology, 10(1), 39–60.

Sparks, R., & Ganschow, L. (1995). A strong inference approach to causal factors in foreign language learning: A response to MacIntyre. Modern Language Journal, 79(2), 235–244.

Sparks, R., Patton, J., Ganschow, L., Humbach, N., & Javorsky, J. (2006). Native language predictors of foreign language proficiency and foreign language aptitude. Annals of Dyslexia, 56(1), 129–160.


Appendix 1

Descriptive Statistics for the Variables Used in the Study

PARENTAL VARIABLES (mean / SD; columns in order: FIN-ENG learners, 4th grade | FIN-ENG learners, 8th grade | RUS-FIN learners, primary | RUS-FIN learners, lower sec.)

Mother’s educational level (1 = basic education . . . 5 = higher university degree): 3.14 / 1.22 | 3.20 / 1.20 | 3.51 / 1.38 | 3.53 / 1.40
Father’s educational level (scale as above): 2.95 / 1.26 | 2.96 / 1.28 | 3.34 / 1.40 | 3.46 / 1.31
Household income level (1 = under 14,000€/year . . . 8 = over 140,000€/year): 4.49 / 1.26 | 4.82 / 1.51 | 2.73 / 1.51 | 2.94 / 1.54
Amount of reading parent does at home (1 = < 1h/week . . . 4 = > 10h/week): 2.66 / .84 | 2.78 / .82 | 3.01 / .89 | 3.07 / .94
Amount of writing parent does at home (as above): 1.93 / .78 | 2.04 / .94 | 2.26 / .99 | 2.37 / 1.16
Parents’ self-assessment of oral skills in FL English/L2 Finnish (1 = weak . . . 5 = excellent): 2.52 / .80 | 2.52 / .83 | 2.33 / .79 | 2.26 / .84
Parent’s self-assessment of reading in FL English/L2 Finnish (as above): 2.70 / .76 | 2.71 / .85 | 2.67 / .84 | 2.55 / .89
Parent’s self-assessment of writing in FL English/L2 Finnish (as above): 2.43 / .82 | 2.44 / .83 | 2.18 / .78 | 2.07 / .81
Parents’ mean self-assessment of FL English/L2 Finnish (as above): 2.55 / .73 | 2.56 / .78 | 2.33 / .71 | 2.31 / .77
Parents’ prereading activities, mean (1 = never or less often than 1–2/month . . . 5 = daily or almost daily): 3.13 / .49 | 3.16 / .54 | 3.2 / .59 | 3.28 / .51; ranges: 2.00–.00 | 1.00–4.00 | 1.43–4.00 | 1.57–4.00
N of respondents: n = 183–200 | n = 184–204 | n = 116–151 | n = 67–78


STUDENT VARIABLES (mean / SD; columns in order: FIN-ENG learners, 4th grade | FIN-ENG learners, 8th grade | RUS-FIN learners, primary | RUS-FIN learners, lower sec.)

Age: 9.96 / .37 | 14.03 / .37 | 10.90 / 1.25 | 14.65 / 1.14
Age of learning to read in Finnish (according to parents): 6.38 / .88, range 3–9 | 6.38 / .85, range 3–8 | N/A | N/A
Age of learning to read, Russian or Finnish (according to parents): N/A | N/A | 6.00 / 1.12, range 3–10 | 5.74 / .90, range 3–7
Age of learning to read in Finnish (according to child): 5.90 / 1.00, range 3–9 | 6.27 / .99, range 3–9 | 6.84 / 1.76, range 3–12 | 9.20 / 3.36, range 4–16
Age of learning to read in Russian (according to child): N/A | N/A | 5.81 / 1.83, range 1–10 | 5.53 / 1.86, range 1–11
Feeling about reading in free time (0 = doesn’t like . . . 2 = likes a lot): 1.37 / .61 | 1.03 / .74 | 1.12 / .61 | 1.08 / .71
Feeling about writing in free time (as above): 1.12 / .63 | .91 / .66 | 1.03 / .61 | 1.07 / .60
Time spent doing homework (0 = not at all . . . 4 = over 2h/day): 1.70 / .80 | 1.54 / .75 | 1.73 / .85 | 2.12 / .92
Time spent on reading outside school (as above): 1.86 / 1.14 | 1.65 / 1.23 | 1.59 / 1.12 | 1.97 / 1.38
Time spent on writing outside school (as above): 1.16 / 1.04 | 1.07 / 1.06 | 1.03 / 1.30 | 1.79 / 1.33
Listening in English/Finnish outside school (0 = never or less often than 1–2/month . . . 3 = daily or almost daily): 2.13 / 1.04 | 2.82 / .53 | 2.45 / .89 | 2.66 / .82
Reading in English/Finnish outside school (as above): 1.27 / 1.17 | 1.72 / 1.06 | 2.23 / .96 | 2.36 / .91
Speaking in English/Finnish outside school (as above): 1.31 / 1.13 | 1.72 / 1.11 | 2.49 / .85 | 2.66 / .70
Writing in English/Finnish outside school (as above): .93 / 1.11 | 1.45 / 1.17 | 2.09 / 1.11 | 2.36 / .99
Using English/Finnish outside school, mean (0–3 scale): 1.43 / .87 | 1.93 / .76 | 2.32 / .74 | 2.48 / .68
Number of languages the student reports to know: 2.80 / 1.12, range 1–6 | 3.52 / .76, range 1–6 | 3.48 / .79, range 2–6 | 4.12 / .89, range 3–6
Frequency of free time reading, mean (0 = never or less often than 1–2/month . . . 3 = daily or almost daily): 1.18 / .58, range .07–3.00 | 1.64 / .49, range .30–2.80 | 1.36 / .60, range .07–3.00 | 1.68 / .52, range .20–2.73
Frequency of free time writing, mean (as above): .93 / .63, range .00–3.00 | 1.10 / .48, range .10–2.30 | 1.18 / .72, range .00–3.00 | 1.26 / .56, range .00–3.00
N of respondents: n = 184–201 | n = 193–201 | n = 124–145 | n = 71–77

SFL READING MEASURES (RAW SCORES)* (mean / SD)

4th grade: Pearson Young Learners Test of English (k = 12, score range 0–12; Cronbach’s alpha = .78): 5.28 / 3.11
8th grade: Pearson Test of English General & DIALANG English (k = 49, score range 7–49; Cronbach’s alpha = .87): 30.02 / 8.42
Primary: ALLU test of Finnish (k = 11, score range 1–11; Cronbach’s alpha = .80): 7.13 / 2.51
Lower sec.: DIALANG Finnish (k = 30, score range 4–29; Cronbach’s alpha = .88): 18.75 / 6.35

*Note: in the correlational and regression analyses, interval scale Rasch scores were used instead of raw scores.
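For readers who want the formal definitions behind the reliability and scaling indices quoted above, the following are the standard formulations of Cronbach’s alpha for a k-item test and of the dichotomous Rasch model that underlies interval-scale person scores. They are given here as conventional reference points, not as a statement of the exact estimation procedures used in DIALUKI.

```latex
% Cronbach's alpha for a k-item test, with item variances \sigma^2_{Y_i}
% and total-score variance \sigma^2_X
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)

% Dichotomous Rasch model: probability that person p, with ability \theta_p,
% answers item i, with difficulty b_i, correctly (both on the logit scale)
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
```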


Appendix 2

Regression Analyses

FIN-ENG Fourth Graders: Parental Background Variables as Predictors of Reading in FL English

Parents’ self-assessed English skills (mean): B = .548, SE B = .168, β = .235
R² = .055; Adjusted R² = .050; F = 10.566**
*p < .05 **p < .01
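As a reading aid for the regression tables in this appendix, the reported quantities are related as follows under ordinary least-squares regression with n cases and p predictors; these are the conventional relations, not a description of the authors’ software settings.

```latex
% Standardized coefficient for predictor j, where s_{x_j} and s_y are sample standard deviations
\beta_j = B_j \, \frac{s_{x_j}}{s_y}

% Adjusted R^2 for n cases and p predictors
\bar{R}^{2} = 1 - (1 - R^{2})\,\frac{n - 1}{n - p - 1}

% Overall F test of the regression model
F = \frac{R^{2}/p}{(1 - R^{2})/(n - p - 1)}
```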

FIN-ENG Fourth Graders: Student Background Variables as Predictors of Reading in FL English

Age at which the child learned to read in L1 Finnish: B = −.618, SE B = .137, β = −.320
R² = .102; Adjusted R² = .097; F = 20.286***
*p < .05 **p < .01 ***p < .000

FIN-ENG Eighth Graders: Parental Background Variables as Predictors of Reading in FL English

Father’s educational level: B = .144, SE B = .055, β = .196
R² = .038; Adjusted R² = .033; F = 6.743*
*p < .05 **p < .01


FIN-ENG Eighth Graders: Student Background Variables as Predictors of Reading in FL English

Mean frequency of using English in free time: B = .264, SE B = .087, β = .214
Age at which the child learned to read in L1 Finnish: B = −.220, SE B = .078, β = −.200
Time spent on reading in one’s free time, in any language: B = .105, SE B = .053, β = .137
R² = .126; Adjusted R² = .111; F = 8.861***
*p < .05 **p < .01 ***p < .000

RUS-FIN Primary School Group: Student Background Variables as Predictors of Reading in L2 Finnish

Mean frequency of using Russian in free time: B = −.577, SE B = .192, β = −.254
Number of languages that the student reports to know: B = .470, SE B = .195, β = .204
R² = .089; Adjusted R² = .075; F = 6.369**
*p < .05 **p < .01

RUS-FIN Secondary School Group: Parental Background Variables as Predictors of Reading in L2 Finnish

Amount of reading the parent does at home: B = −2.260, SE B = .889, β = −.333
R² = .111; Adjusted R² = .094; F = 6.470*
*p < .05 **p < .01


RUS-FIN Secondary School Group: Student Background Variables as Predictors of Reading in L2 Finnish

Number of languages that the student reports to know: B = 3.393, SE B = .762, β = .477
Frequency of free-time writing (mean across all activities): B = −3.523, SE B = 1.217, β = −.310
Length of studying and doing homework on a normal school day: B = −1.575, SE B = .735, β = −.228
R² = .371; Adjusted R² = .338; F = 11.220***
*p < .05 **p < .01
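A minimal sketch of how regressions of this general shape could be reproduced on comparable data follows. It is not the authors’ code: the DataFrame and its column names (reading_rasch, n_languages, freetime_writing) are hypothetical, and the use of pandas and statsmodels is an assumption; only the reported quantities (B, SE B, β, R², adjusted R², F) mirror the layout of the tables above.

```python
# Illustrative sketch only; column names and the pandas/statsmodels choice are assumptions.
import pandas as pd
import statsmodels.api as sm


def background_regression(df: pd.DataFrame, outcome: str, predictors: list):
    """Fit an OLS regression and return the quantities reported in Appendix 2."""
    data = df[[outcome] + predictors].dropna()

    # Unstandardized model: yields B and SE B.
    fit = sm.OLS(data[outcome], sm.add_constant(data[predictors])).fit()

    # Refit on z-scored variables: the slopes are the standardized betas.
    z = (data - data.mean()) / data.std(ddof=1)
    fit_std = sm.OLS(z[outcome], sm.add_constant(z[predictors])).fit()

    return {
        "B": fit.params[predictors],
        "SE B": fit.bse[predictors],
        "beta": fit_std.params[predictors],
        "R2": fit.rsquared,
        "adjusted R2": fit.rsquared_adj,
        "F": fit.fvalue,
        "p(F)": fit.f_pvalue,
    }


# Example call with the assumed column names:
# background_regression(df, "reading_rasch", ["n_languages", "freetime_writing"])
```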

7 Understanding Oral Proficiency Task Difficulty: An Activity Theory Perspective

Zhengdong Gan

ABSTRACT

As a result of the popularity of performance assessment as opposed to traditional discrete-point item-based testing, there has been an increasing awareness of the value of tasks as a vehicle for assessing learners’ second or foreign language ability (Elder et al., 2002). This chapter first reviews how tasks are defined in the testing and assessment context. The chapter then discusses the use of a psycholinguistic approach to characterizing task difficulty, and suggests that this may not fully capture the difficulty or complexity of second or foreign language activity. Grounded in an activity theory perspective, this chapter reports on a case study about the difficulties two ESL trainee teachers encountered while they were preparing for speaking tasks in an English language proficiency assessment for teachers in Hong Kong. As such, it exemplifies the insights into task difficulty to be gained from activity theory.

Introduction

One of the unique characteristics of task-based assessment is that it entails the integration of topical, social, and/or pragmatic knowledge along with knowledge of the formal elements of language (Mislevy et al., 2002). This presents considerable challenges for researchers and practitioners to define and investigate L2 oral assessment tasks, and gives rise to a number of unresolved questions, such as what makes a task
difficult or complex, and how it can be realistically identified. These issues are currently being debated not only in the language testing and assessment community, but also in the fields of second language pedagogy and discourse analysis. This chapter reviews the relevant literature in order to clarify the definition of task in the testing and assessment context. The chapter then discusses a cognitive approach to characterizing task difficulty frequently reported in the second language acquisition literature. In my view, SLA task-based researchers tend to focus on the pure intrinsic cognitive demands of a task, which are believed to contribute to task variation in spoken and other kinds of performance (Robinson, 2005). It seems, however, that they have paid less attention to the fact that the cognitive processing of certain kinds of information is likely to be socially or environmentally driven (Dörnyei, 2009; O’Sullivan, 2002). For example, Swain (2001) argues that mental activities such as attention, planning, and problemsolving are mediated activities whose source is the interaction that occurs between individuals. According to Ellis (2003), the essence of a sociocultural perspective of learning is that external mediation serves as the means by which internal mediation is achieved. Lantolf (2000) further suggests that mediation in second language learning and use involves: (1) mediation by others in social interaction; (2) mediation by self through private speech; and (3) mediation by artifacts, such as tasks and technology. Significantly, sociocultural theory attempts to capture the context, action, and motives of language events between individuals that are simultaneously social and cognitive (Nunn, 2001). Nunn suggests that a sociocultural perspective is most suitable for analyzing task or activity-mediated language learning and use since the central focus of sociocultural theory is learners using language in tasks. In recent years, one of the sociocultural theoretical paradigms that has become very popular among educational researchers is activity theory. This theory provides an analytical tool that conceptualizes individuals and their environment as a holistic unit of analysis to illustrate complex language learning and use situations. It has been noted that research into task difficulty in speaking tests relying on a cognitive approach has so far failed to identify task features or conditions that can predict successful completion of a task (Fulcher & Marquez-Reiter, 2003). A major interpretation of the difficulty of establishing criteria for task difficulty in terms of task features or conditions is that such task features or conditions are confounded with nontask properties, thus making it difficult to determine their contribution to the successful completion of a task (Commons & Pekker, 2006). Typical nontask properties include the level of preparedness of the candidate (Cheng, 2006), level of support for problem solving (Commons & Pekker, 2006), and level of the candidate’s engagement in the process of learning and developing the subject knowledge needed for problem solving (Platt & Brooks, 2002). While educational measurement researchers agree that these nontask factors constitute sources of task difficulty and affect task performance, they do not appear to know how these nontask properties may be operationalized and used in estimations of task difficulty (Commons & Pekker, 2006). 
Another limitation of the studies relying on a cognitive approach is their focus on the relationships between task features or conditions and task performance at one time and within one particular task setting. In the study described in this chapter, two individual ESL teacher trainees took almost one year to learn the sources of difficulty of an oral proficiency task and to develop the subject-related knowledge and ability needed to pass the test. Obviously, the activity system of the two ESL teacher trainees preparing for the speaking test task consists of many sublevel learning and practicing activities. The two ESL teacher trainees’ perceptions of speaking task difficulty resulted from complex and unstable interactions between different language learning or language use situations and oral proficiency task conditions or features. These perceptions can best be illustrated and understood through the lens of activity theory. Specifically, informed by an activity theory perspective, this chapter examines the difficulties two ESL teacher trainees encountered while they were coping with speaking tasks in the Language Proficiency Assessment of Teachers (LPAT) in Hong Kong. In this study, difficulty was experienced as challenges, dilemmas, and discoordinations. The term discoordination is used to refer to any language practice or use situation where the participants in an activity/interaction fail to fully communicate, or where neither of the participants in an activity/interaction has a satisfactory experience or outcome. There were two research questions:

1 What challenges, dilemmas, and discoordinations did two ESL teacher trainees encounter while they were preparing for the LPAT oral proficiency tasks?

2 How and to what extent were these challenges, dilemmas, and discoordinations resolved?

Defining Task

Although task-based approaches to teaching and assessment are currently popular in L2 pedagogy and L2 testing and assessment, researchers and practitioners do not agree on what constitutes a task. Skehan (1998), Ellis (2003), and Bygate et al. (2001) describe tasks mainly in terms of communicative purpose. Bygate et al. (2001, p. 11) propose a fairly basic, all-purpose definition: “a task is an activity which requires learners to use language, with emphasis on meaning, to attain an objective.” For others, however, communicative purpose is not an essential criterion. The “real-world” characteristic is sufficient for an activity to qualify as a task. For instance, Norris et al. (1998) define tasks “as those activities that people do in everyday life and which require language for their accomplishment” (p. 331). Nevertheless, researchers and practitioners generally agree that pedagogic tasks provide a means of engaging learners in meaningful use of the target language (Brumfit, 1984). In the assessment context, Bygate et al. (2001) further specify that the task is “a contextualized, standardized activity . . . which will elicit data which can be used for purposes of measurement” (p. 112). Bachman and Palmer (1996) adopt a similar approach. They define a language use task in the assessment context as an activity that involves individuals in using language for the purpose of achieving a particular goal or objective in a particular situation (p. 44). This allows a range of “tasks” to be included, “assessment tasks as well as target language use (TLU) tasks, or tasks in relevant settings outside the test itself, including tasks intended specifically for language teaching and learning purposes” (Bachman, 2002, p. 458). This notion of tasks can be characterized as a construct-based approach that focuses on language proficiency and the underlying competencies and abilities. It is also possible, however, to define assessment tasks in relation to their real-world characteristics (cf. Norris et al., 1998). This is characterized as a task-based approach that focuses on functional
language performance in relation to a real-world and contextualized activity (Bachman, 2002). In the L2 assessment literature, Davies et al. (1999) observe that the terms item and task tend to overlap. The difference is that there is usually an implication that an item is smaller and less complex, while a task is larger and more complex. “A test would typically include either a large number of relatively short items or a small number of relatively complex tasks” (Davies et al., 1999, p. 196). The Association of Language Testers in Europe defines task as a combination of instructions, item and response that relate to a test (see also Brindley & Slatyer, 2002). There are also some L2 testing and assessment researchers who seem to equate test method with task. Chalhoub-Deville (2001) refers to several popular oral assessment methods such as OPI (oral proficiency interview), SOPI (simulated oral proficiency interview), and CoSA (contextualized speaking assessment) as different assessment tasks. Similarly, He & Young (1998) explicitly refer to a speaking test interview as a task: On the surface, at least, interviews appear to be an authentic task: learners are involved in spoken interaction with a proficient speaker of the language, a task that appears to be very similar to many kinds of face-to-face conversations in the target language community outside the testing room (pp. 2–3).

Task Difficulty in Second Language Acquisition and Language Testing

In Skehan’s (1998) and Bachman and Palmer’s (1996) models of oral test performance, one source of influence on a candidate’s test score is task characteristics and performance conditions. This is because different tasks are likely to encourage different types of language processing, and variations in task characteristics and task conditions have the potential to influence learners’ linguistic output, and may eventually impact on their test score. A main objective in researching language tasks in pedagogic contexts has been to identify the key factors that affect task difficulty. This is considered essential to facilitate the selection of appropriate tasks and to provide effective feedback about students’ oral abilities (Nunan, 2004). L2 testing and assessment researchers also believe that research on task characteristics has potential for identifying a variable of task difficulty in assessing learner spoken performance (Iwashita et al., 2001). Consequently, if a hierarchy of task difficulty is established, “students with greater levels of underlying ability will then be able to successfully complete tasks which come higher on such a scale of difficulty” (Skehan, 1998, p. 184). Task “difficulty” in language testing has been examined by investigating the impact on scores awarded for speakers’ performance across different tasks (e.g., Chalhoub-Deville, 1995; Fulcher, 1996). In language pedagogic studies, a cognitive, information-processing perspective has been widely adopted to characterize and investigate language task difficulty (cf. Robinson et al., 2009; Skehan, 1998, 2001, 2009). Skehan (1998) conceptualizes task difficulty in terms of three sets of features: (1) code complexity: the language required to accomplish the task; (2) cognitive
complexity: the thinking required to accomplish the task; and (3) communicative stress: the performance conditions for accomplishing a task. Bachman (2002) points out that two of Skehan’s task difficulty features, that is, cognitive complexity and communicative stress, “entail assumptions about the kinds and amounts of processing required of the test taker in order to accomplish a given task” (p. 466). For example, cognitive complexity involves a function of two components: cognitive processing and cognitive familiarity, both of which may vary from one learner to another. Bachman thus thinks that these two task difficulty features are, in fact, the functions of the interaction between the test taker and the task. Consequently, according to Bachman, one problem with Skehan’s formulation of task difficulty features is that the difficulty features confound the effects of the test taker’s ability with the effects of the test tasks. Robinson (2001a, 2001b, and Robinson et  al., 2009) makes a distinction between task complexity and task difficulty. In Robinson’s view, task complexity involves the cognitive demands of a task that contribute to task variation in learner performance, whereas task difficulty involves those learner factors that contribute to differences between learners in their performance on a task. In other words, Robinson operationalizes task difficulty in terms of individual learners’ perceptions of a task. Such perceptions are usually elicited through Likert-scale questions. Empirical studies (e.g., Brindley, 1987; Brown et al., 1984; Elder et al., 2002; Elder & Iwashita, 2005; Iwashita et al., 2001; Norris et al., 1998; Skehan, 1998; Skehan & Foster, 1997; Waks & Barak, 1988; Weir et al., 2004) that have been based on either Skehan’s or Robinson’s framework have so far resulted in mixed findings concerning the impact of those complex factors on learners’ language performance. For example, Weir et  al.’s study reveals that simply altering a task along a particular dimension may not result in a version that is equally, more, or less difficult for all test candidates. “Instead, there is likely to be a variety of effects as a result of the alteration” (Weir et al., 2004, p. 143). It appears therefore that Skehan’s and Robinson’s frameworks are insensitive in a language testing context (Fulcher & Marquez-Reiter, 2003). In fact, the cognitive approach underlying each framework focuses mainly on individual learners’ cognitive processing of a given task, obscuring the role of sociocultural and interpersonal characteristics in defining task difficulty and interpreting learner task performance. In particular, such a cognitive approach may ignore “the complex social construction of test performance, most obviously in the case of interactive tests such as direct tests of speaking” (McNamara, 2001, p. 333). Implicit in this argument is that sociocultural approaches that prioritize qualitative research methodology and pay close attention to the settings and participants in interactions have the advantage of developing a full and thorough knowledge of a particular phenomenon.

Activity Theory

The origins of activity theory have often been attributed to the Soviet Russian sociocultural psychology of Vygotsky, Leont’ev, Luria, and Ilyenkov (Tsui & Law, 2007). According to the principles of activity theory, an activity is a coherent and
dynamic endeavor directed to an articulated or identifiable object. This means an activity is undertaken by a community, and it has an object and a motive (Bakhurst, 2009). The object is “the goal of an activity, the subject’s motives for participating in an activity, and the material products that subjects gain through an activity” (Yamagata-Lynch & Haudenschild, 2009, p. 508). The object thus distinguishes one activity from another. The subject of an activity system is usually a person or group with agency oriented to attain some object through the use of a socioculturally constructed tool that can be material or psychological. The subject relates to the community via rules that refer to explicit and implicit regulations, norms, and conventions, which constrain actions and interactions within the activity system (Engeström, 2001). The community is the organization to which the subject belongs. The driving force of change and development in activity systems is internal contradiction and usually exists in the form of resistance to achieving the goals of the intended activity and of emerging dilemmas, challenges, and discoordinations (Roth & Tobin, 2002). As Virkkunen & Kuutti (2000) emphasize, “contradictions are fundamental tensions and misalignments in the structure, which typically manifest themselves as problems, ruptures, and breakdowns in the functioning of the activity system” (p.  302). Engeström (1987) commented that primary and inner contradictions exist within each constituent component of an activity system, while secondary contradictions are found between the constituents. In the present study, the subject is the ESL teacher trainees who participated in a locally-designed English speaking proficiency test and whose activity was influenced by the community within which they were located. Activity refers to the ESL students’ preparation for and participation in the LPAT speaking tasks. The object was to demonstrate their best possible performance in the speaking tasks and to achieve a pass in the assessment. The community that supported the ESL students was the testing organization and the university in which the ESL student teachers were enrolled in their undergraduate studies. The outcomes of the ESL students participating in the speaking test refer to the language knowledge and abilities displayed while performing the required assessment tasks and derived from “meeting the demands of a particular situation” (Connelly & Clandinin, 1988, p. 25), and to the development and maintenance of learner agency in autonomous language learning as a result of such activity experience.
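The mapping between these abstract components and the present study can be restated schematically as a small data structure. The sketch below is purely illustrative: the class and field names are editorial assumptions rather than the chapter’s terminology, and the field values paraphrase the mapping described above (the rules entry in particular is an inference, since the chapter does not name the rules explicitly).

```python
# Illustrative sketch only: names are assumptions; values paraphrase the chapter's mapping.
from dataclasses import dataclass


@dataclass
class ActivitySystem:
    subject: str    # the person or group with agency
    obj: str        # the object/motive the activity is directed at
    tools: list     # mediating artifacts, material or psychological
    rules: str      # norms and conventions constraining action
    community: str  # the organization(s) the subject belongs to
    outcome: str    # what the activity produces


lpat_speaking_preparation = ActivitySystem(
    subject="two ESL teacher trainees preparing for LPAT Speaking",
    obj="demonstrate their best possible performance and obtain a pass in the assessment",
    tools=["past LPAT papers", "recordings of native speakers", "classroom speaking opportunities"],
    rules="LPAT task specifications and scoring scales (assumed)",
    community="the testing organization and the university where the trainees are enrolled",
    outcome="language knowledge and abilities displayed in the tasks, and sustained learner agency",
)
```

A different activity, such as the trainees’ regular coursework, would be modelled as a separate instance with its own object and tools, which is what allows test preparation to be treated as an activity system in its own right.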

The Study

Setting

In 2001, the Hong Kong Examinations and Assessment Authority (HKEAA) and the Education Bureau (EDB) of the Government of Hong Kong introduced the LPAT to assess the English language proficiency of English language teachers in Hong Kong. This assessment consists of oral and written papers, covering Listening, Speaking, Reading, and Writing. Candidates who attain Level 3 or above in all papers of the assessment are deemed to meet the English Language Proficiency Requirement. This chapter focuses on ESL student teachers’ experiences in coping with LPAT Speaking.


The speaking component of the LPAT involves three major assessment tasks: (1) Reading Aloud, (2) Recounting an Experience/Presenting Arguments, and (3) Group Interaction. According to HKEAA (2011), each of the three tasks assesses candidates on two scales. Task 1 assesses candidates on: 1) Pronunciation, Stress, and Intonation; and 2) Reading Aloud with Meaning. Task 2 assesses candidates on: 1) Grammatical and Lexical Accuracy and Range, and 2) Organisation and Cohesion. Task 3 assesses candidates on: 1) Interacting with Peers, and 2) Discussing Educational Matters with Peers (see Language Proficiency Assessment for Teachers [English Language] Handbook, 2011). Statistics show that about one third of the students in the Bachelor of Education (B.Ed.) program at the participants’ university failed the LPAT Speaking on their first attempt. As LPAT is conducted once a year, test takers who fail the exam must wait a tortuous twelve months to retake the LPAT.

Research Methodology

This study is part of a larger research project concerning L2 oral proficiency tasks that draws on the test-taking experiences of a cohort of thirty-three Hong Kong ESL trainee teachers. The study reported here focuses on two of these thirty-three students. The two students, Jessica (a Mandarin mother-tongue speaker) and George (a Cantonese mother-tongue speaker), were selected for analysis as they were unanimously seen by the project team to exemplify the typical difficulties ESL teacher trainees encountered in preparing for and coping with LPAT oral proficiency tasks. Pseudonyms are used for these two ESL students to ensure participant anonymity. At the time when the study was conducted, the two participants were trainee teachers in the third year of a B.Ed. program, majoring in English language teaching, at a university in Hong Kong. This program generally adopts a Content and Language Integrated Learning (CLIL) approach that aims to help ESL trainee teachers develop subject content knowledge and improve their English language proficiency at the same time. Unlike Robinson (2001a), who tends to use Likert-scale questionnaire items (e.g., “I thought this task was easy”) to investigate learners’ perceptions of second language task difficulty, this study operationalizes difficulty as contradictions inherent in the ESL students’ learning and test preparation activity systems that exist in the form of challenges, dilemmas, and discoordinations. It adopts a case study approach. The data for the case study reported in this chapter came from in-depth interviews. Each interview lasted two hours. A list of main interview questions appears in Appendix 1. The interviews were transcribed and analyzed qualitatively. As mentioned above, activity theory provides an analytical approach that conceptualizes individuals and their environment as a holistic unit of analysis. This approach distinguishes between short-lived goal-directed actions and durable, object-oriented activity systems (Engeström, 2000). As an analytical tool, activity theory can provide a response to what learners do in interaction with an aim through analysis at the level of: 1) activity, 2) action, and 3) operations (Nunn, 2001).


According to Nunn, at the global level, there is activity that is the frame or context in which something occurs. The second level is goal-directed action that illustrates “what” must be done to get from A to B, and thus, through this implies a motive to resolve a problem or challenge. The third level is operations which describe “how” something is done. So, data were analyzed in three phases: (1) First, each interview transcript was read in its entirety to gain an overview assessment; (2) Second, I performed a line-by-line review within each transcript and made notes or marginal remarks; and (3) I then began to unitize the data by circling, underlining, and highlighting any units of data that indicated problems, dilemmas, challenges, or discoordinations the two ESL learners encountered, and the actions that were implemented to resolve those problems, dilemmas, challenges, or discoordinations.

Results

As mentioned above, an activity theory perspective on language task difficulty maintains that such difficulties may have their source in a complex chain of events, knowledge, and abilities, rather than resulting solely from the cognitive operations of the mind, as is often assumed. That chain might partly consist of social, strategic, and language-based problems that end up as difficulties in coping with a particular kind of oral language assessment task. The vignettes below hence illustrate such problems or challenges the two ESL teacher trainees encountered and to what extent these problems and challenges were resolved. We particularly focus on what actions these two ESL learners took to overcome the problems or challenges in the object-oriented activity systems of learning for and coping with LPAT speaking assessment, and how such actions were actually implemented.

Jessica

Jessica took LPAT for the first time in the spring of 2013. She passed in Listening and Reading, but failed in Speaking and Writing. She said that before she took the Speaking test, she had no knowledge about the specifications of the LPAT Speaking test as she did not prepare for the test. During the test, she soon noticed many problems:

● She was not sure about her pronunciation of many words in the Reading Aloud task.
● In the second part of the Speaking test, that is, the Recounting and Presenting task, she did not even know what she was talking about.
● Her speech lacked organization, and topic sentences were not well supported with evidence.
● In the Group Interaction task, she had no confidence at all as one of her group members spoke almost like a native-speaker.

Her conclusion from her performance in the LPAT Speaking test was that her oral English was far from reaching the required level of a pass in the assessment.


In the months that followed, preparing to retake the LPAT Speaking test the next year and obtaining a pass became one of the most important goals of Jessica’s learning as a university ESL trainee teacher. She started with efforts to improve her pronunciation. She reported: There are native-speakers reading aloud novels in YouTube. For example, someone is reading aloud Yellow Wallpaper, and I just read after him. Moreover, the transcript of Yellow Wallpaper can also be available online. I then traced words in the transcript and read after the native-speaker, and I thus could notice the differences in pronunciation, stress and intonation between the native speaker and me. After a sustained period of studying recordings of native-speakers, Jessica said she felt she had made good progress in her pronunciation and intonation. During the semester after her first LPAT experience, Jessica also opted for a phonology course, which improved her English pronunciation and enabled her to guess how to pronounce words she had never seen before. Realizing that the Recounting/Presenting and Group Interaction tasks in LPAT assessment would constitute the biggest challenge, Jessica decided to make better use of speaking opportunities in classes. She paid particular attention to assignments that required oral presentation, doing her best to rehearse every presentation as many times as possible. In her spare time, she also managed to get a part-time job teaching a native English-speaking boy Mandarin, which she found offered her even more English speaking opportunities than during her immersion program in the United Kingdom, where she primarily lived and studied with her Chinese speaking classmates. “Although the boy was only eight, he has an amazingly large English vocabulary. Communicating in English with him expanded my vocabulary and brought about tangible progress in my spoken English,” Jessica spoke with excitement in the interview. Jessica started the official preparation for the LPAT Speaking test one month prior to the test, which took place in spring 2014. During that month, she said she largely relied on doing past LPAT speaking papers to practice for the three major LPAT speaking task types. While she was practicing, she tape-recorded her own reading, recounting, or presentation. She then listened to these recordings and tried to spot if there were any grammatical errors and what improvement should be made. She said: I noticed that “subject-verb agreement” error apparently occurred frequently, especially when I spoke fast. I then particularly focused on this issue for a while. For example, when I see “he likes,” I slightly paused and deliberately practiced this structure several times so that next time I would be able to say “he likes” more readily rather than “he like.” I then further discovered that when the subject of a clause is a third person singular form, I will pay special attention to the verb involved. After this chain of reading, recounting/presenting, and listening to her recordings, Jessica felt she was beginning to have better pronunciation and better “grammar awareness,” which also seemed to have resulted in more effective control over her language use in communication with her lecturers or classmates. By the time she

took the LPAT speaking test for the second time, she felt she was more confident: “Although I still felt a bit nervous, I had a different feeling this time, and I had greater confidence this year.” In the Reading-Aloud task, she came across a few unknown names of persons and places that she dealt with generally smoothly, relying on her knowledge acquired from her phonetics and phonology class. In the Recounting and Presentation task, she ensured that she would not make grammatical errors by speaking slightly slowly, by not rushing, and by thinking a little bit before saying things aloud. Also, in the Recounting and Presentation task, she tried to ensure that there would be no problems in her organization of ideas by following the topic sentences she prepared during the allowed preparation time. “While using examples to support my topic sentences, I used those easy-to-express ones. Using those difficult-to-express ones would be problematic,” Jessica further commented. Asked how she felt about her performance in the final LPAT Speaking task this year, that is, the group discussion task, she described it as “normal”: . . . it cannot be said to be very good. The whole group discussion was like a chat among peer classmates. There was one member in my discussion group who was trying to dominate, but this candidate’s English was not so good. This candidate’s dominance somewhat left me and another candidate not many opportunities to speak. Otherwise, I think we three could all get even higher marks in a better and harmonious atmosphere which I believe could result in more things to say.

George

After he graduated from a Chinese-medium school, George studied in an associate degree program for a year at another Hong Kong university before he was admitted to the B.Ed. program at his current university. Like Jessica, George took LPAT in 2013, but failed the speaking paper. He said in the interview that last year he was too nervous and he didn’t know how to transition from the Reading-Aloud task to the personal Recounting/Presenting task: I think the most difficult part was the middle part, personal recounting and presenting. That’s because you need to suddenly transit yourself from reading aloud to work with your own words. This happens immediately after your reading aloud and you don’t have much time. What I mean is that during the transit part you need to be accurate and fluent. Otherwise, you may come up with some situations that you cannot speak, which will kill you in this test. Asked how he dealt with that “middle part” this year, George described: The format is that they usually give you a question. For this year my question is having elderly at home. I basically come up with two supportive ideas during the preparation time. Based on the past failure experience, I realized that I shouldn’t come up with something too difficult. If I say something that can be understood, that’s fine. That kind of production, that kind of articulation that I made should
be suitable to my own level. That means if I use too much complex sentences that are beyond my ability, I will probably fail because you cannot articulate too many sentences. The key is that if you cannot express some complex things, I would say try to keep it simple. Coherence is (also) rather important in such an examination. Last year, the question asked about something about the technology. Rather technical, I think. I tried so hard to come up with something unique . . . eventually [I] found that it was quite hard to comprehend. George mentioned in the interview that his priority in the LPAT Speaking test this year was to get a pass. He said that his oral English proficiency is not like that of people who could achieve Level four or Level five. The failure in his first LPAT experience made him realize that the exam is like a game: “If you don’t know how to play the game, it’s over, no matter how proficient you are.” When he took the LPAT speaking test the second time, he appeared to be completely familiar with the “game,” as he pointed out: The “game” consists of three parts. In the Reading-Aloud part, you need to read aloud with your mood, your emotion. For the second part, you need to be very organized and quick. That means you need to have a framework, a very simple framework like first I agree with the statement that having elderly at home is good for the children, and followed by two arguments that I prepared to say in the third and fourth paragraphs. The examiners are your audience. You need to have eye contact with them and you need to speak naturally. Actually in group discussion, you don’t need to speak that much. I had two members with me this time. I figured out they talked too much without responding to each other’s questions. I expect they lost some marks. George said that the specific strategy he adopted in the Group Discussion task this year was to respond to the other group members most of the time, but only initiate one argument. He particularly emphasized that he learned this strategy from his failure in the last LPAT speaking assessment. While George clearly knew the crucial importance of how to play the “game” in the LPAT assessment context, he also believed that productive language skills like speaking cannot be improved in a short period of time: “For students who speak very much in daily life, you can find they are very confident and pass the test with almost no difficulty. For students who don’t speak a lot in daily life, they need to practice a lot.” George had studied in a Chinesemedium secondary school, and later, took one year studying for an associate degree in a tertiary program, and thus, did not experience many opportunities to speak English. While he was in the B.Ed. program at his current university, he started to have more opportunities to speak and listen to native-speakers. He said he was trying to imitate native-speakers’ pronunciation and intonation in class besides studying content area knowledge. In the semester after his first LPAT experience, George was in the United Kingdom on an immersion program. He still had the LPAT speaking exam on his mind. “I have to make good use of the opportunity because next year I have LPAT to prepare,” George told himself. He thus talked to his host family often throughout his stay, and the result is that “I became less fear English.”


Discussion

Contradiction as Driver of Change and Development in L2 Oral Performance

A key feature of Engeström’s (1987, 2001) activity theory is that contradictions (i.e., dilemmas, challenges, disturbances, and discoordinations) internal to human activity constitute the drivers of change. Contradictions are thus conceptualized as sources of change and development in the activity systems. The contradictions the two participants, Jessica and George, experienced in learning for and coping with the oral proficiency assessment tasks appeared to be varied and complex, but also share some similar features. These contradictions faced by the two ESL trainee teachers emerged as challenges, dilemmas, and discoordinations and existed in the form of resistance to achieving their goal of obtaining a pass in the LPAT Speaking assessment. Jessica’s first LPAT speaking assessment experiences led her to conclude that her oral performance was almost a nightmare (e.g., pronunciation problems, organization problems, lack of confidence, etc.). In the course of improving her speaking, she also realized she needed to have better grammatical accuracy and to expand her vocabulary. As can be seen in the above description and analysis, to resolve all these varied types of challenges, dilemmas, and discoordinations, Jessica actively appropriated contextual resources, such as online learning materials, past test papers, and significant others such as lecturers, peer classmates, and even her Mandarin student. The mediating tools she developed to achieve her goal in her learning efforts included listening to native speakers read aloud and imitating their pronunciation and intonation, maximizing speaking opportunities in her daily life, and doing past LPAT Speaking test papers. Evidently, all the challenges and dilemmas Jessica encountered in her first LPAT experience as well as subsequently while preparing for the second LPAT assessment appeared to be reasonably well resolved by the time she retook the test. This explains why she felt much better when she retook the test and was more confident about her ESL speaking. Even though she encountered a group member who dominated in the group interaction task, something that was obviously beyond her control, she was nevertheless able to face it calmly and reacted appropriately as can be seen from her remarks: “We were all probably able to get higher marks without that student dominating in the interaction.” This suggests that Jessica was not only aware that dominating behavior would do no good to that candidate, but also aware that she and the third test taker had been deprived of opportunities to contribute to the supposedly co-constructed group discussion. In the case of George, he failed in the LPAT Speaking tasks the first time because of his mishandling of the second task of the assessment. According to him, the challenge lies in achieving a good transition between the two different tasks, one of which was a Reading-Aloud task, while the other was a personal Recounting/Presenting task with different marking criteria. For example, “The weird thing I found in the speaking part is in the personal Recounting. The examiners don’t assess you in pronunciation or the intonation with recount. They only assess you in grammar and organisation” (George). Also in the personal Recounting task last year,
George presented an argument that was complicated and not well understood by the examiners or himself. The lesson he learned from his first LPAT experience was that doing well in LPAT Speaking tasks is like playing a “game,” and that it is important to know how to play this “game.” His official two-week preparation prior to the second LPAT speaking test concentrated on familiarizing himself with this “game.” Eventually, he was very satisfied with his performance in the second test. He was also able to comment on how well other group candidates performed. For example, he could see that in the Group Discussion task, the other two candidates talked too much without responding to each other’s questions. This type of candidate behavior, in his view, would violate the “game” and would lead the candidates to lose marks in the assessment. George also felt that, overall, obtaining a pass in LPAT Speaking was a difficult task for an ESL trainee teacher like him who had graduated from a Chinese-medium secondary school and who previously had limited English-speaking opportunities. Implicit in this perception is that his background learning experience constituted a contrast to classmates who graduated from English-medium secondary schools, who typically speak English a lot in their daily school life and outside of school, and consequently, could achieve a Level 4 or 5 in LPAT assessment. This LPAT-mediated awareness of the gap (or contradiction) in speaking proficiency between him and his classmates enabled him to work hard to make better use of speaking opportunities in the current B.Ed. program, to imitate native speaker lecturers’ pronunciation and intonation, and to develop confidence in his oral skills.

The Role of Agency in ESL Trainee Teachers’ Resolving of Difficulties

An essential component of an activity system is the subject of the activity system as a person or group with agency (Engeström, 2001). From an activity theory perspective, individuals can transform social reality in the production of new activities in everyday social practices and in that sense they “lead” emerging forms of activity through individual agency (Feryok, 2005; Stetsenko & Arievitch, 2004). A key concept underlying this perspective is that human beings have the ability to influence their lives and environment, while they are also shaped by the social contexts and cultural tools available to them (Lasky, 2005). In the case of learning for and coping with L2 oral proficiency assessment tasks, resolving the difficulties associated with LPAT Speaking assessment tasks and demonstrating the required L2 oral performance is the outcome of a dynamic relationship. This relationship involves learners’ conceptual resources (e.g., beliefs and motivation), the physical resources available (e.g., equipment and materials), and the affordances and constraints of the learning and assessment context. Central to this relationship is learner agency (Kelly, 2006). The results of this study thus affirm the agency of the two ESL trainee teachers resolving various challenges, dilemmas, and discoordinations in their environments while coping with the assessment tasks through using resources that were contextually developed. Such agency was revealed when they demonstrated a high degree of commitment and motivation, and appeared to be able to maintain their agency by making good use of opportunities to enhance their speaking proficiency. Consequently,
they developed and adopted a variety of tools (i.e., strategies) that enabled them to resolve those challenges, dilemmas, and discoordinations, and eventually, achieve better L2 oral performance in the LPAT speaking tasks.

Conclusion The study reported in this chapter adopted activity theory as both a theoretical framework and an analytical tool for investigating learner perceptions of oral proficiency task difficulty in the activity systems of individual ESL teacher trainees coping with speaking tasks in the LPAT in Hong Kong. Data analysis in this chapter illustrates how individual learners pinpointed the sources of their speaking task difficulty, and how they learned and developed the subject-related knowledge and ability required to pass the speaking test, which clearly resulted in target language development—what Engeström calls expansive learning. For example, on the occasion of participating in the LPAT the first time, Jessica realized that her oral English proficiency was far below the standard (e.g., oral accuracy and fluency) required for a “pass” in the assessment. An action she took to resolve this conflict was to do a phonetics course in which she further discovered that she pronounced many phonetic symbols incorrectly. In Jessica’s view, the phonetics course undoubtedly provided good opportunities to model phonetic symbols that previously appeared to be difficult and confusing to her. Following up on this development, Jessica studied recordings of native English speakers reading novels aloud. Clearly all these actions contributed to expansive developmental transformations. In other words, all such actions narrowed the gap between Jessica’s existing oral proficiency level and the target level Jessica wished to reach in the LPAT Speaking Test. These findings thus suggest that the activity systems of the two ESL students learning and practicing for oral assessment tasks were transformational and enlightening in nature, resulting in not only tangible growth in their language proficiency, but also development and maintenance of learner agency in autonomous language learning. Significantly, the nature and level of engagement in narrowing the gap between each ESL teacher trainee’s existing oral proficiency level and the level they reached in the LPAT Speaking Test represents how difficult the speaking task was for each of them. While this case study sheds light on issues related to the difficulty of oral proficiency tasks, it is important to note its limitation. The study is highly reliant on test candidates’ self-report data, as is the case with much phenomenographic research (Ellis et  al., 2008). Future studies thus could be beneficially complemented with observational data about what test candidates actually do in the process of preparing for and coping with oral speaking test tasks.

References Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453–476. Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford, UK : Oxford University Press.


Bakhurst, D. (2009). Reflections on activity theory. Educational Review, 61(2), 197–210. Breen, M. (1987). Learner contributions to task design. In C. Candlin & D. F. Murphy (Eds.), Language Learning Tasks. Englewood Cliffs, NJ : Prentice-Hall International and Lancaster University, pp. 23–46. Brindley, G. (1987). Factors implicated in task difficulty. In D. Nunan (Ed.), Guidelines for the Development of Curriculum Resources. Adelaide, Australia: National Curriculum Resource Centre, pp. 45–56. Brindley, G., & Slatyer, H. (2002). Exploring task difficulty in ESL listening assessment. Language Testing, 19(4), 369–394. Brown, G., Anderson, A., Shilcock, R., & Yule, G. (1984). Teaching Talk: Strategies for Production and Assessment. Cambridge, UK : Cambridge University Press. Brumfit, C. J. (1984). Communicative methodology in language teaching. Cambridge, UK : Cambridge University Press. Bygate, M., Skehan, P., & Swain, M. (Eds.). (2001). Researching Pedagogic Tasks, Second Language Learning, Teaching and Testing. Harlow, UK : Longman. Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups, Language Testing, 12(1), 16–33. Chalhoub-Deville, M. (2001). Task-based assessments: characteristics and validity evidence. In M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks: Second language learning, teaching and testing. Harlow, UK : Longman, pp. 210–228. Cheng, L. S. (2006). On varying the difficulty of test items. Paper presented at the 32nd annual conference of the International Association for Educational Assessment, Singapore. Commons, M., & Pekker, A. (2006). Hierarchical Complexity and Task Difficulty. Available from: http://dareassociation.org/Papers/HierarchicalComplexityandDifficulty01122006b.doc [7 June 2015]. Connelly, F. M., & Clandinin, D. J. (1988). Teachers as Curriculum Planners: Narratives of Experience. New York, NY: Teachers College Press. Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of Language Testing. Cambridge, UK : Cambridge University Press. Elder, C., & Iwashita, N. (2005). Planning for test performance: Does it make a difference? In Ellis, R. (Ed.), Planning and Task Performance in a Second Language. Amsterdam, The Netherlands: John Benjamins, pp. 219–237. Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer? Language Testing, 19(4), 347–368. Ellis, R. (2003). Task-Based Learning and Teaching. Oxford, UK : Oxford University Press. Ellis, R.A., Goodyear, P., Calvo, R.A., & Prosser, M. (2008). Engineering students’ conceptions of and approaches to learning through discussions in face-to-face and online contexts. Learning and Instruction, 18(3), 267–282. Engeström, Y. (1987). Learning by Expanding: An Activity-Theoretical Approach to Developmental Research. Helsinki, Finland: Orienta-Konsultit. Engeström, Y. (2000). Activity theory as a framework for analyzing and redesigning work. Ergonomics, 43(7), 960–974. Engeström, Y. (2001). Expansive learning at work: toward an activity theoretical reconceptualization. Journal of Education and Work, 14(1), 133–156. Feryok, A. (2005). Activity theory and language teacher agency. The Modern Language Journal, 96(1), 95–107. Fulcher, G. (1996). Testing tasks: issues in task design and the group oral. Language Testing, 13(1), 23–51. Fulcher, G., & Marquez-Reiter, R. (2003). Task difficulty in speaking tests. Language Testing, 20(3), 321–344.


He, A.W., & Young, R. (1998). Language proficiency interviews: a discourse approach. In R. Young & A. W. He (Eds.), Talking and Testing: Discourse Approaches to the Assessment of Oral Proficiency. Amsterdam/Philadelphia, PA : John Benjamins, pp. 1–26. Hong Kong Examinations and Assessment Authority. (2011). Language Proficiency Assessment for Teachers (English Language) Handbook 2011. Available from: http:// www.hkeaa.edu.hk/DocLibrary/Local/Language_Proficiency_Assessment_for_Teachers/ LPATE_Handbook.pdf [27 March 2015]. Iwashita, N., McNamara, T., & Elder, C. (2001). Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information-processing approach to task design. Language Learning, 51(3), 401–436. Kelly, P. (2006). What is teacher learning? A socio-cultural perspective. Oxford Review of Education, 32(3), 505–519. Lantolf, J. P. (2000). Second language learning as a mediated process: Survey article. Language Teaching, 33(2), 79–96. Lasky, S. (2005). A sociocultural approach to understanding teacher identity, agency and professional vulnerability in a context of secondary school reform. Teaching and Teacher Education, 21(8), 899–916. Littlewood, W. (2004). The task-based approach: Some questions and suggestions. ELT Journal, 58(4), 319–326. McNamara, T. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18(4), 333–349. Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-based language assessment. Language Testing, 19(4), 477–496. Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998). Designing Second Language Performance Assessments (Vol. SLTCC Technical Report #18). Honolulu, HI : Second Language Teaching and Curriculum Center, University of Hawai’i. Nunan, D. (2004). Task-based language teaching. Cambridge, UK : Cambridge University Press. Nunn, B. (2001). Task-based methodology and sociocultural theory. The language teacher. Available from: http://www.jalt-publications.org/old_tlt/articles/2001/08/nunn. [22 March 2015]. O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance. Language Testing, 19(3), 277–295. Platt, E., & Brooks, F. B. (2002). Task engagement: A turning point in foreign language development. Language Learning, 52(2), 365–400. Robinson, P. (2001a). Task complexity, task difficulty and task production: Exploring interactions in a componential framework. Applied Linguistics, 22(1), 27–57. Robinson, P. (2001b). Task complexity, cognitive resources, and syllabus design. In P. Robinson (Ed.), Cognition and Second Language Instruction. Cambridge, UK : Cambridge University Press, pp. 287–318. Robinson, P. (2005). Cognitive complexity and task sequencing: Studies in a componential framework for second language task design. International Review of Applied Linguistics, 43(1), 1–32. Robinson, P., Cadierno, T., & Shirai, Y. (2009). Time and motion: Measuring the effects of the conceptual demands of tasks on second language speech production. Applied Linguistics, 30(4), 533–544. Roth, W-M., & Tobin, K. (2002). Redesigning an “urban” teacher education program: An activity theory perspective. Mind, Culture, and Activity, 9(2), 108–131. Skehan, P. (1998). A Cognitive Approach to Language Learning. Oxford, UK : Oxford University Press.


Skehan, P. (2001). Tasks and language performance assessment. In M. Bygate, P. Skehan, & M. Swain (Eds.), Researching Pedagogic Tasks, Second Language Learning, Teaching and Testing. Harlow, UK : Longman, pp. 167–185. Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy, fluency, and lexis. Applied Linguistics, 30(4), 510–532. Skehan, P., & Foster, P. (1997). Task type and task processing conditions as influences on foreign language performance. Language Teaching Research, 1(3), 185–211. Stetsenko, A., & Arievitch, I. (2004). The self in cultural historical activity theory: Reclaiming the unity of social and individual dimensions of human development. Theory and Psychology, 14(4), 475–503. Swain, M. (2001). Examining dialogue: another approach to content specification and to validating inferences drawn from test scores. Language Testing, 18(3), 275–302. Tsui, A., & Law, D. (2007). Learning as boundary-crossing in school-university partnership. Teaching and Teacher Education, 23(8), 1289–1301. Virkkunen, J., & Kuutti, K. (2000). Understanding organizational learning by focusing on “activity systems.” Accounting, Management and Information Technologies, 10(4), 291–319. Waks, S., & Barak, M. (1988). Characterization of cognitive difficulty level of test items. Research in Science and Technological Education, 6(2), 181–191. Weir, C., O’Sullivan, B., & Horai, T. (2004). Exploring difficulty in speaking tasks: An intra-task perspective. IELTS Research Reports Volume 6. Available from: http://www. ielts.org/pdf/Vol6_Report5.pdf. [22 July 2015]. Yamagata-Lynch, L., & Haudenschild, M. (2009). Using activity systems analysis to identify inner contradictions in teacher professional development. Teaching and Teacher Education, 25(3), 507–517.

Appendix 1

Key questions asked in the interviews

1) How important is LPAT to you?
2) How did you learn and prepare for your first LPAT Speaking test?
3) How did you feel about your performance in your first LPAT Speaking test?
4) Which aspects of your English speaking did you think you had the most difficulty with?
5) How did you learn and prepare for your second LPAT Speaking test?
6) How did you feel about your performance in your second LPAT Speaking test?
7) Did you feel tangible progress in your spoken English after your LPAT Speaking test-taking experience?
8) If a prospective LPAT Speaking test candidate asked you to give advice about LPAT Speaking test preparation, what would you tell him or her?


8 Principled Rubric Adoption and Adaptation: One Multi-method Case Study

Valerie Meier, Jonathan Trace, and Gerriet Janssen

ABSTRACT

The development or adaptation of rubrics, used to guide exam marking and score reporting, is a critical element of second language performance assessments. Nevertheless, descriptions of rubric development projects—whether from a priori (i.e., intuitive or theory-based) or empirically-based perspectives—are relatively rare. This study illustrates a research cycle focused on the revision of the Jacobs et al. (1981) analytic rubric, which was used to score the writing section of a high-stakes English placement exam at one Colombian university. A baseline study (n = 542) of rubric function indicated that the rubric was reliably separating examinees of different abilities; however, each rubric category contained too many possible scores to function optimally, which spurred the development of a revised rubric. This chapter describes how the scoring structure was simplified and, more specifically, how locally-meaningful performance descriptors were developed, and presents an evaluation of this revised rubric.

Introduction

Among applied linguists there has been an ongoing and important conversation focused on understanding and representing linguistic competence. These representations have increasingly been addressed in the development of course
outcomes and assessment practices that approximate authentic language use, reflecting the myriad ways in which language is used by individuals in their local contexts. It is not surprising then that multiple-choice assessment tasks have lost ground because of their frequent lack of authenticity and interactiveness (Bachman & Palmer, 1996), and that performance assessment has grown in popularity (Lane & Stone, 2006). The use of performance assessment, however, introduces several challenges of its own. One of these is that performance assessments generally entail the use of a rating scale that qualifies the degrees of success regarding the task in question or the underlying proficiency the task is designed to tap. Such rating scales can be costly, time-consuming, and difficult to develop, particularly when they are intended for high-stakes assessments, in which establishing validity and reliability is essential. In the face of these challenges, Brindley (2013) writes that “it is not surprising that many test developers opt for the easiest way to define assessment criteria and describe language performance by turning to existing rating systems, either in unmodified or adapted form” (p. 4). This chapter reconsiders this common practice of unquestioningly adopting rubrics into new contexts. Specifically, we are looking at how an established analytic writing rubric developed by Jacobs et al. (1981, p. 90) worked in the context of a high-stakes entrance exam at one research university in Colombia and how it could be better adapted for use in this new context.

Literature Review

Performance Assessment

In their analysis of the components and design of assessment alternatives, Norris et al. (1998) describe the primary features of performance assessments:

What distinguishes performance assessments from other types of tests, then, appears to be that (a) examinees must perform tasks, (b) the tasks should be as authentic as possible, and (c) success or failure in the outcome of the tasks, because they are performances, must usually be rated by qualified judges. (p. 8)

Thus, at its core, performance assessment is centred on examinees actually doing something with language—going beyond providing abstract knowledge—to complete tasks that are situated within real-world contexts or uses (Bachman, 2000; Wigglesworth, 2008) and evaluated according to the constructs defined by the tasks. For this chapter, we require a more in-depth theoretical understanding of the third feature presented by Norris et al. (1998)—the rating of performances.1 Performance assessments are complex in that they usually do not rely on purely objective measurement techniques (e.g., multiple-choice or cloze tests); rather, scoring decisions are typically based on some kind of rating. Ratings, in turn, depend on clear definitions of the language constructs of the task (Bachman, 2007) and what can be extrapolated about the examinees' language ability based on the different characteristics of their performance. In one context, evaluators may be concerned with the examinees' ability to complete the task, in line with a strict view of task-
based learning (Long & Norris, 2000). In other assessment settings, evaluators may be more interested in what task performance represents in terms of underlying language constructs (Bachman, 2007). The rating of performance assessments thus entails some degree of subjective judgment on the part of the raters. This is true even when rigorous training practices are in place (McNamara, 2000). Still, it is important to uphold the standards of reliability and validity for scoring decisions that any assessment should address (Brown & Hudson, 1998). The question then is how to score performances in a way that is reliable (i.e., maximizes rater consistency) and valid (i.e., correctly identifies and distinguishes among different characteristics of language ability among examinees). Therefore, it should come as no surprise that the development and use of rating scales is a key component of performance assessment design (Allaei & Connor, 1991; Brindley, 2013; Lane & Stone, 2006).

Rating Scales

Rating scales, or scoring rubrics, are central in performance assessment because they explicitly or implicitly represent how the construct being assessed is operationalized. Human raters typically mediate the relationship between rating scale criteria and assigned scores, and one line of research has investigated how raters interpret rating scales in their scoring decisions for writing assessments (cf. Barkaoui, 2007; Cumming et al., 2002; Eckes, 2008; Johnson & Lim, 2009; Lim, 2011; Schaefer, 2008). In this chapter, however, we concern ourselves with the way such scales are developed. The most common types of rating scales are holistic rubrics, which ask raters to assign a single score for overall quality, and analytic rubrics, which ask raters to assign scores for multiple dimensions, or traits, of the performance. Perhaps less familiar are the types of scales that guide raters through a series of branching yes/no questions to arrive at a final score (for an example of a performance decision tree, see Fulcher et al., 2011; for an example of an Empirically-derived, Binary-choice, Bound-decision scale, see Upshur & Turner, 1995). Unlike holistic or analytic rubrics, which can be written at various levels of generality, these types of scales are necessarily task- and context-specific. Typically, the choice of rating scale type will orient the rubric development process. For holistic or analytic rubrics, Weigle (2002) has proposed that preliminary decisions first be made about the type of scale to use, the intended audience, which traits to include, the range of possible scores, and how to report scores. This is followed by the writing of performance descriptors based on a priori assumptions or on an examination of sample performances. In contrast, the alternative rating scales described above demand that the development process begin with a detailed analysis of performance samples. The selection of rating scale traits and the writing of performance descriptors are particularly important as these features most clearly embody the test construct. As described above, there are two main approaches to selecting traits and writing descriptors: a priori or empirical. Some a priori approaches are theoretically grounded, while others are primarily intuitive, relying on individuals to draw on their expertise in the relevant domain. As Fulcher et al. (2011) note, intuitively-derived scales "are
not without theory, but theory impacts through experience and remains largely implicit” (p.  7). Empirical approaches can involve scaling performance descriptors using Rasch analysis, as was done for the CEFR (see also North & Schneider, 1998), or the close examination of examinee responses to prototypical tasks in order to identify salient features that discriminate among different proficiency levels. Each approach to writing scale descriptors has its relative disadvantages. Critics of intuitive approaches have charged that when descriptors are written independently of actual performance samples, they can refer to linguistic features that will not actually be elicited by the target task or do not co-occur at the given level of proficiency (Upshur & Turner, 1995). On the other hand, researchers who have attempted to base scales on existing theories of communicative competence or L2 writing ability have found these theories inadequate on their own (Knoch, 2011). At the same time, those who have developed descriptors based on performance samples acknowledge how laborious this approach can be (Fulcher, 1996; Fulcher et  al., 2011) and that descriptors can be unpredictably influenced by the samples chosen for the scale development process (Turner & Upshur, 2002). Thus, any rating scale development project entails some degree of compromise.
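To make the branching scale type mentioned above more concrete, the sketch below encodes a small performance decision tree as ordinary conditional logic. It is a minimal illustration only: the yes/no questions and the band meanings are invented for this example and are not the descriptors used by Fulcher et al. (2011), Upshur and Turner (1995), or the exam discussed in this chapter.

def decision_tree_score(answers: dict) -> int:
    """Assign a band (1-4) by walking a fixed series of yes/no questions.

    `answers` maps hypothetical question keys to the rater's yes/no judgments,
    e.g. {"task_fulfilled": True, "ideas_connected": True, "errors_impede": False}.
    """
    if not answers["task_fulfilled"]:
        return 1  # the response does not address the task
    if not answers["ideas_connected"]:
        return 2  # task addressed, but ideas are listed rather than linked
    if answers["errors_impede"]:
        return 3  # coherent response, but errors interfere with communication
    return 4      # coherent response whose errors rarely interfere


# Example: one rater's judgments for a single performance
print(decision_tree_score(
    {"task_fulfilled": True, "ideas_connected": True, "errors_impede": False}
))  # prints 4

The point of such a scale is that the rater never weighs verbal descriptors against one another; the final score falls out of a fixed sequence of binary decisions, which is also why these scales are necessarily task- and context-specific.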

Research Context

This research was conducted at a Colombian research university, where prospective PhD students are required to take an English language placement exam as part of several entrance conditions. Due to the high-stakes nature of this exam, it has been the topic of various studies since its creation in 2009 (e.g., Janssen et al., 2014a, 2014b; Meier, 2013; Trace et al., forthcoming). This chapter examines how the writing subtest scoring rubric has been functioning. Test developers initially adopted the Jacobs et al. rubric because of its perceived content validity and its recognition in the field of second language writing assessment. This rubric had been used with apparent success for approximately three years of test administrations. However, baseline analyses (n = 542) conducted using multifaceted Rasch measurement (MFRM) indicated that while the rubric was generally reliable in separating differently proficient examinees, the individual rubric category scores were problematic (Janssen et al., 2014b; Meier, 2013). In particular, the scale structure for each rubric category indicated considerable redundancy among scores. This finding was echoed in interviews with exam raters, for whom the rubric seemed unnecessarily complicated. This study briefly considers the creation of a revised rubric and the application of a variety of statistical (MFRM, profile analysis) and qualitative methods to evaluate the revised rubric in the three following key areas:

1 How do the original and the revised rubrics compare in terms of category scale structure function?

2 How do the original and the revised rubrics compare in terms of distinguishing different placement levels?

3 In what ways do the original and the revised rubrics influence how raters assign rubric category scores?


Methodology

Rubric Revision and Rescoring

The rubric described in Jacobs et al. (1981, p. 90) asks raters to score writing performance according to five main categories: Content, Organization, Vocabulary, Language Use, and Mechanics. Each rubric category is subdivided into four broad ability bands (excellent to very good; good to average; fair to poor; very poor), and each band contains a set of written descriptors and a range of possible numerical scores to be assigned. It is worth noting that the Jacobs et al. rubric categories are not equally weighted—they have different maximum scores—and they also contain a different range of possible score points, from a low of nine possible score points for Mechanics to a high of twenty-one for Language Use. Unsurprisingly, our baseline Rasch model indicated that there were too many possible score points for category scales to function optimally (Janssen et al., 2014b; Meier, 2013). In particular, graphic and numerical data indicated that specific scores did not correspond to clearly defined segments of examinee ability. Furthermore, some specific scores were not being assigned frequently enough to produce stable measurements. Informed by these data, we combined adjacent scores to provide a scale with fewer possible score points, experimenting with 4-, 5-, 6-, and 7-point scales for each rubric category. Rasch measurement confirmed that the 6-point scale was most efficient. These scale changes necessitated the revision of scoring descriptors characterizing typical performances at each of the six points of each rubric category. To do this, we relied on descriptors from the original rubric as well as from an internal document that raters had created to deepen their understanding of the original Jacobs et al. descriptors. We also reworked the original descriptors to eliminate features that appeared in more than one category and to sharpen distinctions across categories and between levels. During this process, it became clear that Mechanics would be better served by a 4-point scale. All three authors were involved in this iterative process, and the final draft of the revised rubric was submitted to raters for their comment.

To compare the function of the original and the revised rubrics, in the follow-up study, we drew a random sample of eighty essays from the complete set of 542 essays. These eighty essays were rated by two of the authors, using both the original and the revised rubrics. To minimize any possible treatment order effects, half of the essays were rated using the original rubric first, with the other half of the essays rated using the revised rubric first. Because these raters were not in the same physical location, they independently rated approximately ten essays at a time and then discussed their discrepant scores over Skype. When using the original rubric, they negotiated any category scores that were discrepant by more than two points (as was routine in the operational testing situation), while with the revised rubric, they negotiated all category scores that did not agree exactly. In the end, all eighty essays were rated twice over the course of approximately one month. Note that we analyzed both negotiated and nonnegotiated scores and found that negotiation did not have an impact on scale structure (Trace et al., 2015). In order to allow for direct comparison with the results of our baseline study (where we used negotiated scores), results presented in this chapter are for negotiated scores.
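As a rough sketch of the scale collapsing described above, the code below recodes original Content scores into a 6-point scale by grouping adjacent score points. It assumes the standard Jacobs et al. Content range of 13–30, and the cut points are hypothetical: the chapter does not report the exact mapping that was adopted.

import bisect

# Hypothetical upper bounds of the first five collapsed bands for Content;
# scores above the last cut fall into band 6.
CONTENT_CUTS = [16, 19, 22, 25, 28]

def collapse_content(original_score: int) -> int:
    """Map an original Content score (13-30) onto a hypothetical 6-point scale."""
    if not 13 <= original_score <= 30:
        raise ValueError("Content scores on the original rubric run from 13 to 30")
    return bisect.bisect_left(CONTENT_CUTS, original_score) + 1

# Example: rescoring a handful of original Content scores
for score in (13, 17, 22, 24, 27, 30):
    print(score, "->", collapse_content(score))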


Analyses Using Multi-faceted Rasch Measurement

In order to compare the different components of rubric function, we conducted an MFRM using FACETS 3.67 (Linacre, 2010) as this permits the simultaneous incorporation of many different elements or facets into a single statistical model. In MFRM, raters, items, test takers, and different rubric categories can all be included within one model and then compared against each other (Bachman, 2004, pp. 141–142; Bond & Fox, 2007, p. 147). Moreover, these facets can be mapped on a single vertical ruler, permitting easy visual interpretation.2 Our Rasch model included three facets: raters, examinees, and rubric categories. As a first step, we used infit mean-square measures (Bond & Fox, 2007, p. 35) to establish that our data adequately fit the model and ensure that MFRM was appropriate for our data. To compare the way category rating scales for the original and the revised rubric functioned, we studied two visual representations produced by MFRM: the vertical ruler and the probability curves (or category response curves) for each rubric category. To assess the degree to which scores on each rubric category scale were clustered together or where broad tracts of examinee ability were represented by only a few scores, we began our analyses of these rubrics by looking at the vertical rulers. Next, we examined the probability curves, which visually represent the scale structure for each rubric category. An efficient scale will produce a probability curve with a series of well-defined peaks, indicating that each score is, in turn, the most probable score (indicated by the vertical axis) for test takers at a given ability level (measured on the horizontal axis). We also examined the complementary numerical data that accompany the probability curves. First, we checked frequency counts to identify scores that raters seldom used. Second, taking standard error into account, we calculated the distance between step difficulties, which indicate when it becomes more likely for a test taker of a certain ability to receive a particular score rather than the previous one, and compared the resulting threshold distances to recommended minimum values (1.4 logits; see Bond & Fox, 2007, p. 224). Threshold distances that do not meet the minimum values suggest adjacent scores do not correspond to a distinct segment of test-taker ability, and are therefore redundant (Linacre, 2010, p. 174).
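The logic of these numerical checks can be sketched in a few lines. The step difficulties below are invented for illustration (in practice they would be read from the FACETS output); the sketch simply differences adjacent thresholds, flags distances under the 1.4-logit minimum, and tallies how often each score was assigned.

from collections import Counter

MIN_THRESHOLD_DISTANCE = 1.4  # recommended minimum advance between steps, in logits

# Hypothetical step difficulties for one rubric category, ordered from the
# lowest score transition to the highest.
step_difficulties = [-3.8, -2.9, -2.6, -1.1, 0.2, 0.4, 1.9, 3.1]

for i in range(1, len(step_difficulties)):
    distance = step_difficulties[i] - step_difficulties[i - 1]
    flag = "" if distance >= MIN_THRESHOLD_DISTANCE else "  <- redundant with previous score"
    print(f"transition {i} -> {i + 1}: {distance:+.2f} logits{flag}")

# Frequency counts identify scores that raters seldom used.
ratings = [24, 25, 25, 26, 26, 26, 27, 22, 25, 26, 24, 26]  # hypothetical category scores
print(Counter(ratings))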

Analyses Using Profile Analysis

To gauge the degree to which the function of the two rubrics was similar in terms of placement, profile analysis (PA) was also used. Profile analysis, though seldom reported (cf. Kondo-Brown, 2005; Irie, 2005), is a useful method of examining rubric function in terms of categories and placement. This multivariate approach is a repeated measures analysis of variance (ANOVA), which compares means across several dependent variables—here, the five rubric categories—for different groups of participants, in this case, the five English course level classifications: Pre, 1, 2, 3, and 4. This analysis can help determine the degree to which each course level is distinctive in relation to other course levels as well as in relation to each of the rubric categories.


Qualitative Analysis

To understand how the two rubrics affected raters' scoring decisions, we examined raters' discussions of discrepant scores for evidence of how they were conceptualizing examinee performance. Fifteen out of sixteen negotiation sessions were audio recorded, and two pairs of recordings were transcribed and read repeatedly in order to gather evidence about how raters were using the rubric to assign scores. In the first pair, the same ten essays were rated using first the original rubric and then the revised rubric. In the second pair, a different set of nine essays was rated using the revised rubric first and then the original rubric. By examining these two pairs of recordings, it was possible to compare raters' discussions of the same essay as well as to explore broad similarities and differences across the raters' use of the different rubrics.

Results

Analyses Using Multi-faceted Rasch Measurement

For all datasets, we assessed the Rasch model's fit by examining how well each of the infit mean square (IMS) values fell within the range of 0.60–1.40 suggested for rating scales (Bond & Fox, 2007, p. 243).

TABLE 8.1 Follow-up Study: Category Measures, Fit Statistics, Separation Values for the Jacobs et al. and Revised Rubric

Jacobs et al.
Rubric Categories    Measure    SE     Infit MS
Content               0.85      0.06   1.41
Organization         −0.32      0.08   0.91
Vocabulary            0.76      0.07   0.85
Language Use         −0.40      0.06   0.63
Mechanics            −0.89      0.10   1.16
Separation = 8.95; Reliability = .99; χ2 p = .00

Revised Rubric
Rubric Categories    Measure    SE     Infit MS
Content              −0.23      0.16   1.17
Organization          0.34      0.17   0.98
Vocabulary            0.56      0.15   0.90
Language Use          0.83      0.15   0.66
Mechanics            −1.50      0.18   1.19
Separation = 5.03; Reliability = .96; χ2 p = .00

Note: n = 80.
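The fit screening described above can be expressed directly. The sketch below checks the IMS values reported in Table 8.1 against the 0.60–1.40 range suggested for rating scales; it is a re-expression of the tabled numbers, not part of the original analysis.

# Infit mean-square values from Table 8.1 (follow-up study, n = 80).
IMS = {
    "Jacobs et al.": {"Content": 1.41, "Organization": 0.91, "Vocabulary": 0.85,
                      "Language Use": 0.63, "Mechanics": 1.16},
    "Revised": {"Content": 1.17, "Organization": 0.98, "Vocabulary": 0.90,
                "Language Use": 0.66, "Mechanics": 1.19},
}

LOW, HIGH = 0.60, 1.40  # suggested infit mean-square range for rating scales

for rubric, categories in IMS.items():
    for category, value in categories.items():
        status = "fits" if LOW <= value <= HIGH else "misfits"
        print(f"{rubric:13s} {category:13s} IMS = {value:.2f} ({status})")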


Table 8.1 shows how the model created using data from the revised rubric has somewhat better fit than the original rubric, as the IMS value for Content goes from slightly misfitting the model (IMS = 1.41) to fitting the model appropriately (IMS = 1.17). The IMS values are, broadly speaking, within the suggested range, and the model is highly reliable (see Appendix A for baseline study fit values). Vertical rulers permitted direct comparison of the different facets modelled within MFRM, which are arranged along the vertical axis according to increasing proficiency or difficulty. Figure 8.1 shows how the original and the revised rubrics compared in the follow-up study (n = 80). As in the baseline study, the original rubric generally resulted in many scores being closely clustered together, especially in the middle. The revised rubric, however, generated scores that are more evenly spaced out. A similar trend is visible in Figure 8.2, which presents the probability curve for one representative category (Language Use). Quite clearly, scores in the middle of the original rubric scale (top) overlap extensively, indicating redundancy, while scores in the revised rubric (bottom) are characterized by distinct peaks and little overlap with adjacent scores, indicating that they correspond to a clear segment of test-taker ability. We cross-checked this with the statistical results provided by FACETS (e.g., threshold distances and score frequency counts) and confirmed that for all five categories the scale structure was notably more efficient in the revised rubric.

FIGURE 8.1 Vertical Rulers (n = 80), Comparing the Jacobs et al. Rubric and the Revised Rubric. [Figure not reproduced: two vertical rulers (Jacobs et al. Rubric; Revised Rubric) mapping raters, test takers, and the category scales for Content, Organization, Vocabulary, Language Use, and Mechanics onto a common logit measure.]

FIGURE 8.2 Category Response Curves (n = 80) Comparing the Jacobs et al. Rubric (top) and the Revised Rubric (bottom) for Language Use. [Figure not reproduced: probability curves showing, for each Language Use score, the probability of that score across the range of examinee ability in logits.]


Profile Analysis

Profile analysis displays differences in terms of (1) levels, (2) parallelism, and (3) flatness (Tabachnick & Fidell, 2013). Levels examines the degree to which differences exist between placement groups as a whole (i.e., the differences between the course levels Pre-, 1, 2, 3, and 4), similar to between-subjects main effects in repeated measures ANOVA. Parallelism measures whether or not profiles for each placement level vary in similar ways across the categories as a whole (i.e., whether the patterns of mean differences across the DVs are similar for all levels). This is analogous to within-group interaction effects in repeated measures ANOVA. Last, flatness looks into the degree to which the individual rubric categories are generating different responses by placement level (i.e., whether the DVs are all acting similarly for all groups). This test is similar to within-group main effects in repeated measures ANOVA.
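In ANOVA terms, levels, flatness, and parallelism correspond to the between-subjects main effect, the within-subjects main effect, and their interaction. The chapter does not name the software used for the profile analysis; the sketch below shows one way to obtain the equivalent decomposition with the third-party pingouin package on simulated long-format data (examinee, placement level, rubric category, percentage score), where every score is fabricated.

import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
levels = ["Pre", "1", "2", "3", "4"]
categories = ["Content", "Organization", "Vocabulary", "Language Use", "Mechanics"]

# Simulated long-format data: 16 examinees per placement level, one percentage
# score per rubric category (values fabricated for illustration).
rows = []
for g, level in enumerate(levels):
    for examinee in range(16):
        subject_id = f"{level}-{examinee}"
        for category in categories:
            rows.append((subject_id, level, category, 50 + 8 * g + rng.normal(0, 10)))
df = pd.DataFrame(rows, columns=["examinee", "placement", "category", "score"])

# Between effect ~ levels, within effect ~ flatness, interaction ~ parallelism.
aov = pg.mixed_anova(data=df, dv="score", within="category",
                     between="placement", subject="examinee")
print(aov)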

Data cleaning

As profile analysis functions similarly to repeated measures ANOVA, the same assumptions concerning data screening require verification before analysis. We first converted both score sets from the original and the revised rubrics to percentage scores for each individual category to account for different scoring bands between and within the rubrics. Assumptions concerning normality and univariate and multivariate outliers were then checked, with no apparent problems.

Original rubric analysis

Beginning with the original rubric, Table 8.2 displays a summary of the repeated measures ANOVA comparisons. Effects for both within-subjects (flatness and parallelism) and between-subjects (levels) are displayed in the first column, followed by sum of squares (SS), degrees of freedom (df), mean squares (MS), F values, and partial effect size (η2). Statistically significant differences were found for flatness (F(4, 300) = 8.79, p = .00), but not for parallelism (F(16, 300) = 1.57, p = .10). This indicates that, while the individual categories were performing differently from each other, similar patterns of performance were found across each placement level.

TABLE 8.2 Summary of Profile Analysis for the Original Rubric

Source                                     SS        df      MS        F      p     η2
Within-group
  Category (Flatness)                    2616.20       4    654.05    8.79   .00   .11
  Category × Placement (Parallelism)     1871.34      16    116.96    1.57   .10   .08
  Error                                 22322.47     300     74.41
Between-group
  Placement (Levels)                    44374.54       4  11093.64   31.72   .00   .63
  Error                                 26227.01      75    349.69
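The quantities in Table 8.2 are related in the usual way: MS = SS/df, each F is the effect MS divided by the corresponding error MS, and partial η2 = SS(effect) / (SS(effect) + SS(error)). The sketch below recomputes these statistics from the tabled SS and df values as a check; because the tabled sums of squares are rounded, the recomputed effect sizes can differ from the tabled ones in the last decimal.

# Sums of squares and degrees of freedom from Table 8.2 (original rubric).
effects = {
    "Flatness": {"ss": 2616.20, "df": 4, "ss_err": 22322.47, "df_err": 300},
    "Parallelism": {"ss": 1871.34, "df": 16, "ss_err": 22322.47, "df_err": 300},
    "Levels": {"ss": 44374.54, "df": 4, "ss_err": 26227.01, "df_err": 75},
}

for name, e in effects.items():
    ms = e["ss"] / e["df"]
    ms_error = e["ss_err"] / e["df_err"]
    f_ratio = ms / ms_error
    partial_eta_sq = e["ss"] / (e["ss"] + e["ss_err"])
    print(f"{name:12s} MS = {ms:9.2f}  F = {f_ratio:5.2f}  partial eta^2 = {partial_eta_sq:.3f}")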


In other words, scores varied by rubric category, but these variations were similar for all placement levels. Note, however, that the partial effect size for flatness is quite low (η2 = .11), indicating that the degree of variation between categories is minimal. Differences were also found for levels (F(4, 75) = 31.72, p = .00) with a larger effect (η2 = .62) that accounts for about 40 percent of the variance. As it is easier to make sense of these findings visually, Figure 8.3 displays the profiles of each of the placement levels across the five rubric categories by mean scores. We can see that the five lines (i.e., placement levels) are separate and seem to pattern in the same way, indicating that the rubric is distinguishing among placement levels and that categories are functioning similarly regardless of level in most cases. Two notable exceptions are the categories Organization and Mechanics. For Organization, we can see that the profile lines for the course levels 2 and 3 are nearly touching. This might indicate that raters are having difficulty distinguishing

FIGURE 8.3 Profiles for Exam Placements Using the Original Rubric with Percentage Scores (n = 80).


levels for this category, or that the performance indicators for this category are too similar at this score. In addition, Mechanics acts differently at the Pre-level. This is likely explained by the fact that this category has the narrowest range of possible scores (1–5), and as such, making fine-grained distinctions at lower proficiency levels is likely to measure error variance rather than real difference in ability. Also, note that the lines are not flat, and we can see variation in scores across categories. This is a desired effect as it points to categories that are providing unique information regarding placement rather than categories that are seemingly redundant. Because we are primarily interested in the direct effects of placement by the rubric, contrasts were performed for levels to determine where differences were occurring among the placement levels. This was done through comparisons of marginal means for the grouping variable using a Scheffe post-hoc alpha adjustment. Statistically significant differences were only found for the adjacent course levels 1 and 2 (for a complete summary of results see Appendix B). A lack of differences for Pre- might be explained by the low number of scores for this group (n = 4), and therefore, not enough data to make true comparisons. In regards to levels 2 and 3, this finding is likely the result of both Content and Organization scores being very similar for both levels, and thus, differences are not as pronounced overall as they are for other levels. Likewise, for levels 3 and 4, while the category scores themselves are separated, the curvature of the line reveals that total scores for these two levels are in close proximity (course 3 M = 74.09, SE = 1.70; course 4 M = 83.65, SE = 2.01).
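The contrast logic can be sketched as follows: for each pair of placement levels, the difference between marginal means is compared against a Scheffé critical difference built from the between-subjects error term. All numbers in the example below are toy values rather than the study's data; the function only shows the form of the criterion.

from itertools import combinations
from scipy.stats import f

def scheffe_pairwise(means, sizes, ms_error, df_error, alpha=0.05):
    """Pairwise Scheffé comparisons of group means."""
    k = len(means)
    f_crit = f.ppf(1 - alpha, k - 1, df_error)
    results = {}
    for a, b in combinations(means, 2):
        diff = abs(means[a] - means[b])
        # Scheffé critical difference for the contrast mean_a - mean_b.
        criterion = ((k - 1) * f_crit * ms_error * (1 / sizes[a] + 1 / sizes[b])) ** 0.5
        results[(a, b)] = (diff, criterion, diff > criterion)
    return results

# Toy example with five groups; in practice the observed marginal means, group
# sizes, and the between-subjects error term from the ANOVA would be used.
means = {"A": 52.0, "B": 61.0, "C": 67.0, "D": 75.0, "E": 84.0}
sizes = {"A": 10, "B": 18, "C": 20, "D": 20, "E": 12}
for pair, (diff, crit, sig) in scheffe_pairwise(means, sizes, ms_error=95.0, df_error=75).items():
    print(pair, f"diff = {diff:.2f}", f"criterion = {crit:.2f}", "significant" if sig else "ns")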

Revised rubric analysis

Summary statistics for scores using the revised rubric are displayed in Table 8.3. Statistically significant differences were again found for flatness (F(4, 300) = 7.06, p = .00), indicating that rubric categories were performing differently from each other. As with the original rubric, the effect size for these differences was small (η2 = .09), indicating little meaningful effect. Differences were also found among the patterns of scores across the categories by placement level at F(16, 300) = 2.73, p = .00. These differences, too, were nearly nonexistent in terms of partial effect (η2 = .13).

TABLE 8.3 Summary of Profile Analysis for the Revised Rubric

Source                                     SS         df      MS        F      p     η2
Within-group
  Category (Flatness)                     2743.41       4    685.85    7.06   .00   .09
  Category × Placement (Parallelism)      4239.60      16    264.97    2.73   .00   .13
  Error                                  29132.985    300     97.11
Between-group
  Placement (Levels)                     48247.24       4  12061.81   30.21   .00   .62
  Error                                  29947.03      75    399.29


As with the original rubric, only levels were found to be significantly different (F(4, 75) = 30.21, p = .00) with a large degree of variation (η2 = .62). Figure 8.4 displays the profile analysis plot by placement level, and we see very similar patterns for the revised rubric compared to what we saw with the original rubric. The separation between levels is mostly clear, with the possible exception of Organization for levels 2 and 3 as well as similarly mixed profiles for Mechanics at lower proficiency levels. Contrasts for levels were also carried out for the revised rubric in similar fashion to the original rubric. Adjacent levels were significantly different for both levels 1 and 2 as well as levels 3 and 4. As with the original rubric, the small sample size is likely to blame for overlap between the levels Pre- and 1. When looking at Figure 8.4, we can see that for levels 2 and 3, both Organization and Mechanics are nearly identical, which is likely causing the levels to appear similar overall.

FIGURE 8.4 Profiles for Exam Placements Using the Revised Rubric with Percentage Scores (n = 80).


What is important to note is that through the revision of the rubric, the levels and profiles of the groups by categories appear to have improved in terms of separation, which is important for our purposes of placement. While there remains some overlap among levels and categories (e.g., Language Use, Mechanics), the overall picture is an encouraging one. Both rubrics are similar enough to indicate that the function of the rubric itself has not changed drastically, but differences reveal that the overall effectiveness and clarity of scores by placement level are improved.

Qualitative Analysis

Analysis of how raters negotiated discrepant scores suggests that the two rubrics shaped what raters used as their point of reference for judging examinee performances. When using the original rubric, raters frequently interpreted examinee performance in terms of the adjectives Jacobs et al. (1981) use to define the four broad ability bands (i.e., excellent to very good; good to average; fair to poor; very poor), and then employed words such as strong, weak, high, or low to qualify their assessment. The almost exclusive use of these descriptive labels indicates that while raters seem to have internalized the broad ability bands and subbands used in Jacobs et al. (1981), they have not done the same for individual score points. In contrast, when using the revised rubric, raters interpreted examinee performance in terms of distinct score points. This difference is highlighted in the excerpts below, in which the same two raters, during two different negotiation sessions, discussed discrepant scores for the same category (Content) of the same essay.

Original Rubric

R8: . . . especially compared to the other, the one that we just rated as an average this one definitely shows more [development], so I think I will make it a good . . . uhm, so I'm gonna do 24

Revised Rubric

R7: the first paragraph definitely seems higher than other 4s. What do you think? The second paragraph to me seems really vague, though, but I'm wondering how you read it.
R8: (12 s. reading) No, I think the second one is pretty, the second one to me is a 4.

In this example, no clear rationale is given for one score over an adjacent score in the same band that would seem to fulfil the same criteria (i.e., a score of 25 would also have qualified as good). Throughout the transcripts, there is evidence to suggest that when using the original rubric, raters chose scores somewhat impressionistically, but when using the revised rubric, they were very clear about what each numeric score meant. Although the reduced number of choices in the revised rubric seems to have helped raters to assign individual scores with more precision, it also seems to have caused some problems when raters came across an essay that they deemed to be “in between” the established levels of performance. This can be seen in the extract below, in which raters 7 and 8 deliberate about whether to give the examinee a 5 or a 4 for Language Use.


R7: I don't know this was hard because again there wasn't a lot of data to work with (9 s. pause) yeah, I'm not quite sure because it's simple but it's clear
R8: yeah and the one complex sentence at the end of the second paragraph is still accurate
R7: mhmm (9 s. pause) yeah (3 s. pause) well maybe we'll just we'll agree to disagree on this one too just because it really feels like it's kind of in between

In general, when raters used the revised rubric, they more frequently expressed trouble deciding between two adjacent scores. In these instances, as in the example above, raters frequently proposed the "agree to disagree" strategy as one way to create a kind of in-between score. One possible interpretation of this phenomenon is that raters may need to differentiate between more than six levels of performance. However, given the fact that students are placed into one of five levels of English classes, a 6-point scale seems reasonable. Another possibility is that the difficulty raters sometimes experienced in assigning scores is due to ambiguous descriptors or the conflation of several different characteristics (e.g., clarity and development of a main idea) in one descriptor. These are issues that are certainly worth pursuing.

Discussion

RQ1. How do the original and the revised rubrics compare in terms of category scale structure function?

The graphics presented by the MFRM succinctly represent the improvements found in the revised rubric. In the revised rubric, scores are both more evenly and broadly distributed across the vertical ruler. This is echoed in the probability curves, which show that the reduced number of possible scores has resulted in category scale structures that are much more efficient in the revised rubric. With a six-point scale (4 points for Mechanics), each score more precisely corresponds to a specific segment of examinee abilities. Importantly, this study indicates a relatively straightforward way that MFRM can be used first to assess the function of an existing rubric and inform principled revision, and then used again to confirm the degree to which modifications were helpful.

RQ2. How do the original and the revised rubrics compare in terms of distinguishing different placement levels?

Profile analysis demonstrated that the revised rubric creates profiles similar to those of the original rubric in terms of level separation as well as overall category function within particular placement levels. In other words, the revised rubric succeeds in placing students similarly to how the original rubric placed students, and in some cases, it is better able to distinguish among levels. As we might expect with any performance rating, placements at the highest levels were the most distinct, but there was still some ambiguity among levels in the middle, and this is consistent with impressions from raters found in the baseline study (Janssen et al., 2014b). In terms of categories, the degree of separation across placements was, for the most part, consistent between both the original and the revised rubrics. Content is one
category that seems to be markedly improved in the revised rubric, especially between the course levels 2 and 3. Whereas Content scores were previously similar for both groups, now there seem to be clear differences in the two placements. Differences in Organization at these same levels remain somewhat unclear, and mark one area where further revision of the performance descriptors is likely needed. The only other problem area continues to be Mechanics, which for both rubrics is hard to distinguish across placement levels (e.g., course levels 2 and 3), and actually displays opposite profiles for lower levels (i.e., examinees placed in Pre- had higher scores for Mechanics than those placed in level 1). While the latter is probably best explained as a result of the particular sample of essays rated, the problem remains that this category is behaving oddly compared to the other rubric categories. One explanation for this is that Mechanics has the most restricted range of scores (1–4), and the qualitative differences between these different performance levels are themselves small. For example, it is quite likely that writers of a variety of skill levels can use simple punctuation (e.g., periods) quite well, so the differences for this category are not as consistent by placement levels as something like Vocabulary or Content. These small issues aside, we seem to have met the overall goal of consistency across the two rubrics in terms of overall function. Happily, this indicates that the function of the revised rubric has not changed despite revisions to internal features such as scale structure and performance descriptors.

RQ3. In what ways do the original and the revised rubrics influence how raters assign rubric category scores?

Qualitative analysis of rater negotiation sessions supports the MFRM findings that there are too many possible score points in the Jacobs et al. rubric. When using this rubric, raters attached concrete meaning to the seven descriptive labels that Jacobs et al. use to categorize different levels of performance, rather than to the individual scores, which they tended to assign somewhat impressionistically. In contrast, when using the revised, six-point rubric, raters attached concrete meaning to individual scores. This suggests that with fewer options, raters can more easily and reliably assign scores. At the same time, it was not uncommon for raters using the revised rubric to express doubt about which of two adjacent scores an essay merited, suggesting that further revision may be in order, particularly of category descriptors.

Conclusion

This chapter began with a quote from Brindley (2013) regarding the potential dangers of blindly adopting rating scales and performance descriptors when faced with the challenge of creating valid and reliable assessment tools. Through this study, we have tried to suggest one possible solution by describing an accessible yet systematic approach to rubric scale revision that could be used to adapt existing rating scales in other language programs. Importantly, our findings describe how changes could be made to the features of a rubric while also preserving its overall function. Like restoring a car, we wanted to overhaul the individual pieces while maintaining the original structure and function of the rubric to make it run smoothly and reliably.


It is our hope that this study can provide an approach to the analysis of the rubric structure from a richly contextualized position. The revision process itself took place within the program and was carried out with the assistance and guidance of program insiders. While we believe that our analyses show that the revisions were successful, this interpretation is really only tenable in light of what teachers, raters, and learners within the program actually do in terms of academic second language writing.

Recommendations

As performance assessment increasingly becomes the norm for many programs as a way of testing learners in authentic and valid ways, rating scales will continue to be a crucial aspect of the testing situation. Despite this, rubric development remains a difficult process for local test designers focused on ensuring that their assessment tools are constructed and used in a valid, fair, and reliable way for placement, diagnostic, or achievement purposes. While some authors have attempted to do this in an empirical way (see Bejar et al., 2000, as one example), it remains a laborious process that, in many cases, still turns out to be partially intuitive in design. One of the largest problems with strictly theoretically-informed approaches continues to be the need to clearly match descriptors, levels, and constructs to theory, typically in the form of a hierarchy of language acquisition. In other words, if we want to have an empirically bound rubric that distinguishes learners at different levels in their learning, we need to know what those different levels are first. As we might imagine, this is not easy to accomplish. While the fields of second language acquisition and psycholinguistics attempt to provide verifiable proof of these hierarchies, with varying results, the issue is further confounded if we consider that there may not be a strict step-by-step path to acquisition as we might find in a sociocultural or complexity theory of language learning (Larsen-Freeman, 2006; van Lier, 2004).

Does this, then, leave the test designer with nothing other than intuition to go on in order to create rating scales? This chapter demonstrates that this is not the case. Our focus should be directed away from constructing a scale from an either/or perspective of empirical or intuitive models. Instead, we are best served by utilizing both empirical data and our intuition, and furthermore, linking these to the specific context within which our assessment is being used (Bachman & Palmer, 2010). Regardless of the method we use for scale development, the main goal is to identify constructs, levels, and performance descriptors that have a direct link to program outcomes in relation to the target language uses of the examinees. What is essential is that the conclusions we are drawing are situated within the context of the learners, the program goals, and what learners can actually do with the language.

Notes

1 There are several texts that we think provide a thorough overview of several important themes concerning performance assessment, whether one is creating or adapting such an assessment for local use. For good conceptualizations of task, see Norris et al. (2002) or
van den Branden et al. (2009). For canonical treatments of authenticity, see Bachman & Palmer (1996, 2010). Test development procedures are described thoroughly in Bachman & Palmer (2010), Brown (2005), Kane (2013), and Schmeiser & Welch (2006).
2 For research into the rating of academic writing using Rasch, see Eckes (2008), Knoch (2009), Schaefer (2008), Sudweeks et al. (2005), and Weigle (1998).

References Allaei, S. K., & Connor, U. (1991). Using performative assessment instruments with ESL student writers. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts. Norwood, NJ : Ablex, pp. 227–240. Bachman, L. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1–42. Bachman, L. (2004). Statistical Analyses for Language Assessment. Cambridge, UK : Cambridge University Press. Bachman, L. (2007). What is the construct? The dialectic of abilities and contexts in defining constructs in language assessment. In J. Fox, M. Wesche, D. Bayliss, L. Cheng, C. Turner, & C. Doe (Eds.), Language Testing Reconsidered. Ottawa, Canada: University of Ottawa Press, pp. 41–71. Bachman, L., & Palmer, A. (1996). Language Testing in Practice. New York, NY: Oxford University Press. Bachman, L., & Palmer, A. (2010). Language Assessment in Practice. New York, NY: Oxford University Press. Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12(2), 86–107. Bejar, I., Douglas, D., Jamieson, J., Nissan, S., & Turner, J. (2000). TOEFL 2000 Listening Framework: A Working Paper. TOEFL Monograph Series. Princeton, NJ : Educational Testing Service. Bond, T., & Fox, C. (2007). Applying the Rasch Model: Fundamental Measurement in the Human Sciences (2nd edn.). New York, NY: Routledge. Brindley, G. (2013). Task-based assessment In C. A. Chapelle (Ed.), The Encyclopedia of Applied Linguistics. Oxford, UK : Blackwell Publishing Ltd. Brown, J. D. (2005). Testing in Language Programs: A Comprehensive Guide to English Language Assessment. New York, NY: McGraw-Hill. Brown, J. D., & Hudson, T. (1998). The alternatives in language assessment. TESOL Quarterly, 32(4), 653–675. Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL /EFL writing tasks: a descriptive framework. The Modern Language Journal, 86(1), 67–96. Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185. Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing, 13(2), 208–238. Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees, Language Testing, 28(1), 5–29. Irie, K. (2005). Stability and Flexibility of Language Learning Motivation (Doctoral thesis), Temple University, Tokyo, Japan. Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hugley, J. (1981). Testing ESL Composition: A Practical Approach. Rowley, MA : Newbury House.

Janssen, G., Meier, V., & Trace, J. (2014a). Classical test theory and item response theory: Two understandings of one high-stakes performance exam. Colombian Applied Linguistics Journal, 16(2), 3–18. Janssen, G., Meier, V., & Trace, J. (2014b). Building a better rubric: Towards a more robust description of academic writing proficiency. Paper presented at the Language Testing Research Colloquium (LTRC), Amsterdam, The Netherlands. Johnson, J., & Lim, G. (2009). The influence of rater language background on writing performance assessment, Language Testing, 26(4), 485–505. Kane, M. (2013), Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1) 1–73. Knoch, U. (2009). Diagnostic assessment of writing: A comparison of two rating scales. Language Testing, 26(2), 275–304. Knoch, U. (2011). Rating scales for diagnostic assessment of writing: What should they look like and where should the criteria come from? Assessing Writing, 16(2), 81–96. Kondo-Brown, K. (2005). Differences in language skills: Heritage language learner subgroups and foreign language learners. The Modern Language Journal, 89(4), 563–581. Lane, S., & Stone, S. (2006). Performance assessment. In R. Brennan (Ed.), Educational Measurement (4th edn.). Westport, CT: American Council on Education/Praeger, pp. 387–431. Larsen-Freeman, D. (2006). The emergence of complexity, fluency, and accuracy in the oral and written production of five Chinese learners of English. Applied Linguistics, 27(4), 590–619. Lim, G. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Learning, 28(4), 543–560. Linacre, J. (2010). FACETS (Version 3.67.0). Chicago, IL : MESA Press. Long, M., & Norris, J. (2000). Task-based teaching and assessment. In M. Byram (Ed.), Routledge Encyclopedia of Language Teaching and Learning. London, UK : Routledge, pp. 597–603. McNamara, T. F. (2000). Language Testing. Oxford, UK : Oxford University Press. Meier, V. (2013). Evaluating rater and rubric performance on a writing placement exam. Working papers of the Department of Second Language Studies, University of Hawai’i, 311, 47–100. Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998). Designing Second Language Performance Assessments. Honolulu, HI : University of Hawai’i. Norris, J., Brown, J. D., Hudson, T., & Bonk, W. (2002). Examinee abilities and task difficulty in task-based language performance assessment. Language Testing, 19(4), 395–418. North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217–262. Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465–493. Schmeiser, C., & Welch, C. (2006). Test development. In R. Brennan (Ed.), Educational Measurement (4th edn.). Westport, CT: American Council on Education/Praeger, pp. 307–353. Sudweeks, R., Reeve, S., & Bradshaw, W. (2005). A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing, 9(3), 239–261. Tabachnick, B., & Fidell, L. (2013). Using multivariate statistics (6th edn.). Boston, MA : Pearson.

Trace, J., Janssen, G., & Meier, V. (Forthcoming). Measuring the impact of rater negotiation in writing performance assessment. Language Testing. Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student samples: Effects of the scale maker and the student sample on scale content and student scores. TESOL Quarterly, 36(1), 49–70. Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49(1), 3–12. Van den Branden, K., Bygate, M., & Norris, J. (2009). Task-Based Language Teaching. Philadelphia, PA : John Benjamins. Van Lier, L. (2004). The Ecology and Semiotics of Language Learning: A Sociocultural Perspective. Norwell, MA : Kluwer Academic Publishers. Weigle, S. (1998). Using FACETS to model rater training. Language Testing, 15(2), 263–287. Weigle, S. (2002). Assessing Writing. Cambridge, UK : Cambridge University Press. Wigglesworth, G. (2008). Tasks and performance-based assessment. In E. Shohamy & N. Hornberger (Eds.), Encyclopedia of Language and Education: Language Testing and Assessment (Vol. 7, 2nd edn.). New York, NY: Springer, pp. 111–122.

Appendix A. Category Measures, Fit Statistics, Separation Values for the Baseline Study

TABLE A1 Baseline Jacobs et al. Study: Category Measures, Fit Statistics, Separation Values

Rubric Categories    Measure    SE    Infit MS
Content                 .05     .02     1.43
Organization           −.08     .03      .86
Vocabulary             −.03     .03      .66
Language Use           −.50     .02      .89
Mechanics               .57     .03     1.16

Separation     13.07
Reliability      .99
χ²               .00

Note: (n = 542). Infit MS values demonstrate that there was reasonably good fit for the data to the original model created for the baseline study. From Janssen et al. (2014b).

Appendix B. Levels Contrasts for the Original and Revised Rubrics

TABLE B1 Levels Contrasts for the Original and Revised Rubrics

Exam Placement       M Difference     SE      P
Original
  0 vs. 1                 6.80       4.56    .70
  1 vs. 2                11.12       2.52    .00
  2 vs. 3                 7.00       2.68    .16
  3 vs. 4                 9.76       0.02    .04
Revised
  0 vs. 1                 4.17       4.88    .95
  1 vs. 2                11.09       2.70    .00
  2 vs. 3                 7.30       2.86    .18
  3 vs. 4                11.67       3.17    .01

9 The Social Dimension of Language Assessment: Assessing Pragmatics

Carsten Roever

ABSTRACT

This chapter gives an overview of current approaches in the assessment of second language pragmatics and distinguishes between two major categories of pragmatics assessments. The more long-standing and widespread type grew out of general pragmatics research, specifically speech act pragmatics, and encompasses atomistic, multi-item tests of particular aspects of pragmatic ability. In contrast, a more recent development is discursive tests assessing extended interaction that are rooted in theories of social interaction, such as interactional sociolinguistics and Conversation Analysis. Roever et al.'s (2014) test of sociopragmatics tried to bridge the two traditions through the inclusion of multi-turn discourse completion tasks as well as appropriateness judgments and single-turn items. Findings from this study will be discussed to illustrate the tension in pragmatics assessment between construct coverage and practicality.

Testing of Second Language Pragmatics

Pragmatics is an important component of models of communicative competence, but it is rarely explicitly tested or mentioned in rating scales. However, research on testing second language pragmatics has grown rapidly and various tests have been developed. This chapter outlines the construct of second language pragmatics, gives an overview of approaches to testing, and discusses a recent test in depth.

Construct of Second Language Pragmatics

Pragmatics has been variously defined (Austin, 1962; Crystal, 1997; Levinson, 1983; Mey, 2001), though the most commonly cited definition, by Crystal (1997), sees it as investigating language use in social interaction, accounting for the choices language users make, and describing the effect of these choices on other interlocutors in a particular context. Concisely put, the contextualized, interlocutor-affected, and interlocutor-affecting nature of language use is at the heart of pragmatics. Pragmatics plays a central role in models of communicative competence. In the early models of communicative competence by Canale & Swain (1980) and Canale (1983), pragmatics was covered under sociolinguistic competence, which describes knowledge of language use appropriate to the social situation, and discourse competence, which relates to the organization of spoken and written texts. In their influential model of language knowledge, Bachman & Palmer (2010) put pragmatic knowledge at the top of their taxonomy by distinguishing between organizational knowledge and pragmatic knowledge. The former encompasses grammatical knowledge, including vocabulary, syntax, and phonology/graphology, and textual knowledge of cohesion and rhetorical or conversational organization. Pragmatic knowledge is subdivided into functional knowledge and sociolinguistic knowledge. Functional knowledge relates utterances to the communicative goals of language users and includes the encoding of language functions, for example, exchanging information, making requests, solving problems, or being humorous. Sociolinguistic knowledge is language users' knowledge of appropriate language use in a particular setting, such as dialect, genres, registers, idiomatic expressions, and cultural references. Purpura (2004) takes a similar approach, showing how grammatical form serves to realize language meaning, which, in context, becomes pragmatic meaning. Purpura (2004) further subdivides pragmatic meaning into sociolinguistic meaning (markers of social identity, politeness, and formality), sociocultural meaning (references, collocations, cultural norms), psychological meaning (stance, attitude, and affect), rhetorical meaning (coherence, genre), and contextual meaning (interpersonal meanings, metaphor). It is perhaps curious that models of communicative competence bear little connection to research in pragmatics, but rather echo work in other traditions. For example, ethnography of communication (Hymes, 1972) influenced Canale & Swain's model, and systemic-functional linguistics (Halliday, 1970; Halliday & Hasan, 1976) influenced Bachman & Palmer's and Purpura's models. However, pragmatics research has outlined the competences that language users must possess to use language effectively and appropriately in social contexts. Most importantly, Leech (1983, 2014) distinguishes between two types of pragmatic competence: pragmalinguistic and sociopragmatic competence. Sociopragmatic competence describes language users' knowledge of the social rules of the target language applicable to the relevant language use setting. It includes knowledge of norms of what is appropriate given a certain social relationship, determined by factors like power differential, social distance, and degree of imposition (Brown & Levinson, 1987), as well as age, gender, social class, social role in the setting, and so on. It further encompasses knowledge of "expected ways of thinking, feeling and acting" (Ochs, 1986, p. 2), knowledge of taboos and of what constitutes matters that require delicate handling, conventional courses of action, and, as Fraser et al. (1981) put it so succinctly, "what you do, when and to whom"

(p. 79). Sociopragmatic knowledge concerns the societal interpretations of the various aspects of the communicative situation and the parameters within which a social actor is expected to operate to perform a particular social role. Of course, interactants do not follow sociopragmatic rules like robots, but rather act against the backdrop of these norms and evaluate others’ performance in relation to these norms. Pragmalinguistic competence encompasses language users’ knowledge of the linguistic tools for implementing their sociopragmatic knowledge, comprehending their interlocutor’s stance and positioning, and generally accomplishing social actions through language. Leech (1983) describes pragmalinguistics as the linguistic interface of pragmatics, encompassing knowledge of language resources for conveying speech intentions. This potentially includes a language user’s entire linguistic competence, but certain linguistic resources are more pivotal for pragmatic goals than others since they make it clear what type of social action is being done, for example, request, admonition, throwaway comment, joke, and so on. Clark’s (1979) distinction between conventions of means and conventions of form is useful for exploring configurations of linguistic knowledge for pragmatic purposes. Conventions of means are the range of conventionally available strategies for conveying speech intentions, for example, refusals can be made in a number of ways, including direct performative and nonperformative statements, indirect statements of regret, wish, explanation, and so on. (Eslami, 2010). Conventions of form describe the language knowledge necessary to implement these strategies, for example, how to produce explanations that have the illocutionary force of a refusal (“Can you help me draft this letter?” —“I have to pick up the kids now.”), and require control of a wide range of grammatical structures and morphological features. Research has shown that features available to second language learners as part of their general proficiency are not necessarily immediately available pragmatically (Salsbury & Bardovi-Harlig, 2001). Pragmalinguistic competence also includes knowledge of how to implement different formality levels and registers, index oneself as belonging to a certain social, regional, or age group, and deploy turn-initial expressions such as oh, ah, well, and so on (Heritage, 2013). Sociopragmatic competence and pragmalinguistic competence are tightly interrelated in production and acquisition. In terms of production, it can be difficult to define the origin of pragmatic errors (Thomas, 1995). For example, if a learner asks for an extension on a term paper from a professor with whom she has had little previous contact by saying “I want more time for my paper” (which would generally be considered inappropriately curt and direct), is this due to lack of sociopragmatic knowledge of power relations and role expectations, or due to lack of pragmalinguistic tools for making a complex request? In acquisitional terms, increasing pragmalinguistic competence makes it easier for learners to identify linguistic features that fulfil pragmatic functions, and they have a wider range of linguistic tools available for performing in social roles. Conversely, increasing sociopragmatic competence allows them to fine-tune the deployment of their pragmalinguistic skills and to avoid excessively polite language use (Al-Gahtani & Roever, 2014; Blum-Kulka & Olshtain, 1986). 
A recent construct that is gaining popularity in second language pragmatics research is Interactional Competence (Hall, 1993; Kramsch, 1986; Young, 2011), defined by Hall & Pekarek Doehler (2011) as “our ability to accomplish meaningful social

actions, to respond to co-participants’ previous actions and to make recognizable for others what our actions are and how these relate to their own actions” (p. 1). Interactional Competence as a concept is more strongly interactively oriented than pragmalinguistic and sociopragmatic competence, and accounts for language users’ ability to claim membership in a social group through “effective, morally accountable participation” (Kasper & Wagner, 2011, p.  118). Following the idea of interaction being a co-constructed achievement of interlocutors, Kasper & Wagner (2011) emphasize the emergent nature of interactional competence, which can be viewed as cooperatively achieved by interlocutors rather than as a stable trait in an individual’s mind (Kasper, 2009). This poses obvious issues for testing, which needs to assign scores to individuals indicating the strength of the attribute under measurement. Tests of second language pragmatics have therefore grappled with the interactional competence construct.

Tests of Second Language Pragmatics

Testing of second language pragmatics has a relatively short history. The earliest test focusing on the pragmatic dimension of linguistic competence is Farhady's (1980) test of functional English, which can be placed more in a notional-functional curriculum tradition (van Elk, 1976) than a pragmatic tradition. Farhady's test consisted of multiple-choice appropriateness judgments and assessed whether test takers had sufficient sociopragmatic knowledge to understand the social situation as well as accurate mapping of pragmalinguistic knowledge to choose an appropriate utterance. Shimazu's (1989) test was similar, focusing on appropriateness judgments of requests and employing NNS as well as NS utterances as response options. The first larger-scale test design effort for pragmatics tests was Hudson et al.'s (1992, 1995) project to develop tests of ESL pragmatics for Japanese learners of English. Hudson et al.'s test battery was very much in the speech act tradition and relied strongly on the cross-cultural work previously done by Blum-Kulka et al. (1989). Hudson et al. (1992, 1995) investigated learners' ability to recognize and produce appropriate requests, apologies, and refusals under different settings of the social context factors Power, Social Distance, and Imposition (Brown & Levinson, 1987). They employed a variety of instruments:

● Written discourse completion tasks (DCTs): situation descriptions with a gap for test takers to fill in the request, apology, or refusal they would utter in the situation, which was later scored by trained raters.
● Oral DCTs: same as written DCTs, but test takers speak their responses.
● Multiple-choice DCTs: test takers choose an appropriate utterance from three response options, and their response is dichotomously scored.
● Role plays: role play scenarios in which test takers produce a request, apology, and refusal to be later scored by raters.
● Self-assessments for DCTs: test takers provide self-assessments of their ability to produce speech acts appropriately in DCT situations.
● Self-assessments for role plays: test takers provide self-assessments of their ability to produce speech acts appropriately in role play situations.

Hudson (2001a, 2001b) reported on the piloting of the instrument with Japanese ESL learners and found inter-rater and Cronbach's alpha reliabilities between .75 and .9 for the test sections, but excluded the multiple-choice DCT. Yoshitake (1997) administered Hudson et al.'s (1995) instrument to Japanese EFL learners, Yamashita (1996) adapted it for Japanese as a target language, and Ahn (2005) used some sections for testing Korean as a second language. Brown (2001) analyzed data from Yamashita's (1996) and Yoshitake's (1997) test administrations and found acceptable reliabilities for all instruments except for the multiple-choice DCT. Tada (2005) and Liu (2006) tackled the specific issue of multiple-choice DCTs and attained more satisfactory reliability levels of .75 in Tada's (2005) case and .88 in Liu's (2006) study. Tada supported his instrument with video snippets, and Liu went through a detailed bottom-up test construction process, which may account for the improved reliabilities. Hudson et al.'s (1995) test was groundbreaking in that it was a methodologically oriented test validation project in second language pragmatics with a strong theoretical anchoring in speech act and politeness theory. Prior to their project, it was not even certain that productive pragmatic ability could be tested reliably, but Hudson et al. showed that this is possible, though there are noticeable instrument effects. A drawback of their approach was its lack of practicality: a) the instrument was designed for a specific L1-L2 pairing, which limited its usefulness; b) it involved rater scoring of productive responses and role plays, which are time-consuming to administer and score; and the most practical component of the test (the multiple-choice DCT) did not work well. Hudson et al.'s test was also limited to speech acts, which was in line with mainstream second language and cross-cultural pragmatics research at the time, but it neglected other aspects of pragmatic competence. Roever (2005, 2006) set out to develop a test that assessed the construct of pragmatic ability more broadly. He developed a web-based test incorporating three sections:

● a speech act section consisting of written DCT items for the speech acts request, apology, and refusal,
● a multiple-choice implicature section assessing test takers' comprehension of two types of indirect language use previously identified by Bouton (1988, 1999),
● a multiple-choice routines section assessing test takers' recognition of situationally bound formulaic expressions.

Roever’s test was more practical to administer and score than Hudson et  al.’s (1995), and it attained overall high reliability just above .9. Notably, it was not contrastively designed for a specific L1-L2 pair and focused less on politeness and sociopragmatic appropriateness and more on pragmalinguistic knowledge. In a similar vein, Itomitsu (2009) developed a test of Japanese as a foreign language, using multiple-choice items to assess learners’ recognition of formulaic expressions, speech styles, intended meaning of speech acts, and grammatical accuracy. He also attained a strong reliability of nearly .9 overall. Aiming to expand the sociopragmatic side of the pragmatics construct while still maintaining high levels of practicality, Roever et al. (2014) developed a test of sociopragmatic competence, which is discussed in detail below.

The third tradition of testing in L2 pragmatics takes a much more interactional approach. It is characterized by extended discourse elicited through role plays and a theoretical orientation toward discursive pragmatics and interactional competence rather than speech acts and isolated components of pragmatics. However, the test constructs vary. Grabowksi (2009, 2013) administered four role plays varying Social Distance and Imposition and scored them following Purpura’s (2004) conceptualization of pragmatic meaning. She attained overall a high Cronbach alpha reliability of .94, though alpha reliabilities for sociocultural appropriateness and psychological appropriateness were .5 and .64, respectively. In another interactionally oriented study, Youn (2013) had test takers complete two role plays, one with a higher-power interlocutor, the other with an equal-power interlocutor as well as three monologic tasks. Interlocutors followed a fairly tightly prescribed structure, and Youn analyzed her data bottom-up, creating rating categories based on Conversation Analysis, including content delivery, language use, sensitivity to the situation, engagement with the interaction, and turn organization. While she does not report inter-rater or Cronbach alpha reliabilities, she used multi-faceted Rasch measurement to show that raters produced consistent ratings although their severity differed appreciably. Youn (2013) used Kane’s (2006, 2012) argumentbased validation approach to support her interpretation of scores as indicators of learners’ pragmatic abilities. In a series of studies, Walters (2004, 2007, 2009, 2013) developed a pragmatics test informed by Conversation Analysis, involving a listening component, DCTs, and role plays. The test assessed knowledge of responses to compliments, assessments, and pre-sequences, but was hampered by low Cronbach alpha reliabilities. Finally, Timpe (2014) straddles research traditions by including both judgments of appropriateness and extended discourse in her test. She followed Bachman & Palmer’s (2010) construct of pragmatics and assessed comprehension of the illocutionary force of speech acts (offers, requests), judgment of naturalness of routine formulae, and comprehension of phrases and idioms. Timpe also included four role play tasks, administered via Skype, varying Power and Social Distance. She rated aspects of discourse organization (discourse competence), appropriateness to interlocutor and situation (pragmatic competence), and overall performance. She attained an overall Cronbach alpha of .87 for the sociopragmatic comprehension test though the reliability of the speech act section was low at α = .53, demonstrating again the challenge of testing speech acts with multiple-choice items. Inter-rater correlations for the role play ratings between Timpe herself (who rated the whole dataset) and four raters (who rated subsets) were strong in the high .8 to mid .9 region. From this review, several observations of the state of testing L2 pragmatics can be made. For one thing, a great deal of work has been undertaken since Hudson et al.’s (1995) seminal project. Whereas previously two test instruments (Farhady, 1980; Shimizu, 1989) existed, this number has since increased multifold. A wide range of aspects of L2 pragmatics have been tested, including speech acts, routine formulae, implicature, speech styles, politeness/appropriateness, formality and extended discourse. 
It has been found that pragmatic aspects of language are reliably testable and that scores from pragmatics tests can enable inferences about learners’ ability to engage in social language use. This rise of pragmatics testing supplements

the assessment of more traditional aspects of language competence, whether conceptualized as skills (reading/writing/listening/speaking) or formal features (grammar/vocabulary/pronunciation). At the same time, though pragmatic aspects may be sporadically mentioned in the scoring rubrics for speaking, such as in the test of German for foreign learners of German administered by the TestDaF Institute (TestDaF), it is notable that none of the large language tests contain a set of pragmatics items, including the Test of English as a Foreign Language (TOEFL ®), the International English Language Testing System (IELTS ™), the Pearson Test of English Academic (PTE Academic™), or the Chinese Standard Exam (Hanyu Shuiping Kaoshi, HSK ). So testing of pragmatics has so far been limited to a research endeavor and pragmatics tests have not been used for real-world decisions. This is problematic since it means that current language tests are not assessing a significant part of the construct of communicative competence. This construct underrepresentation potentially threatens the defensibility of score inferences unless it can be shown that pragmatic ability is linearly related to general language proficiency. However, this does not appear to be the case (Kasper & Rose, 2002; Salsbury & Bardovi-Harlig, 2001). One reason for the limited uptake of pragmatics in language tests may be the particularly acute tension between construct coverage and practicality in pragmatics assessment. Pragmatics tasks must establish situational context in a way that is comprehensible to all test takers since otherwise general proficiency would influence pragmatics measurement strongly. At the same time, it is unclear how much information is necessary for establishing context (Billmyer & Varghese, 2000), and while it appears that video input is helpful (Tada, 2005) video scenarios are expensive to create. Similarly, administration and scoring of pragmatics tasks tends to be resource intensive with role plays requiring a live interlocutor and scoring often necessitating human raters. Finally, it is not clear what aspects of the overall construct of pragmatics should be tested and how they should be weighted relative to each other: For example, is comprehension of implicature as important as production of speech acts? Also, certain parts of the construct have not been investigated much, such as pragmatics of written communication, for example, writing emails. On the whole, it appears that there is still quite a bit of research to be done on assessment of second language pragmatics. In the next section, a recent research study in this area is described in depth.

Sample Test Project

Roever et al.'s (2014) test of second language sociopragmatics was developed to contribute to testing of L2 sociopragmatic ability. It falls between the first and second tradition of pragmatics tests, focusing on learners' sociopragmatic knowledge of social norms and conventional procedures while trying to cover a broader range of areas of pragmatic competence than just speech acts. This construct definition was based on considerations about the relationship between construct coverage and practicality. As Ebel (1964) remarked, practicality is an essential part of validity since tests that are less practical are less likely to be used, and therefore, real-world decisions will be informed by evidence collected through other (possibly less

defensible) means. To ensure a high degree of practicality, Roever et al. aimed to develop a test that is computer-based, can be completed in one hour, and does not involve one-on-one, real-time interactions. This precluded testing of coherent discourse, as was the case in the third tradition (Grabowski, 2009, 2013; Youn, 2013), and necessitated a limitation of the construct to offline abilities. The construct of sociopragmatic ability was defined as:

● judgment of appropriateness of speech acts,
● judgment of the appropriateness of conversational second pair parts (responses to initiating utterances),
● judgment of conventional discourse structure,
● production of appropriate second pair parts,
● production of turns in extended conventional interactions.

This definition was informed by previous research in which appropriateness judgments of target-like politeness levels and conventional utterances figure very centrally (e.g., Bardovi-Harlig & Dörnyei, 1998; Matsumura, 2003, 2007; Schauer, 2006), and which found that unconventional discourse structure can lead to misunderstandings (Gumperz, 1982; House, 2003; Young, 1994). The production of second pair parts was a variation on Discourse Completion Tasks (Kasper, 2008), while the completion of an extended conversation with turns missing was experimental although it had been previously mentioned by Ishihara & Cohen (2010). Since this construct of sociopragmatics is oriented toward offline knowledge rather than online, real-time processing ability, it allows inferences only with regard to what learners can potentially do, but not what they would actually do in real-world language use. However, it is important to note that even a one-on-one interactive test is not an actual replication of real-world talk (see Stokoe, 2013, for differences between role plays and actual conversation). The study was designed as a test validation study using Kane’s (2006, 2012, 2013) argument-based approach with an additional inference (Explanation) added, following Chapelle (2008). In the argument-based approach, two lines of argument are used. In the interpretive argument, a series of inferences lead from the envisioned goal of measurement (target trait in target domain) to test use. The inferences are Domain Description (Do the test items represent the universe of generalization?), Scoring/ Evaluation (Are scores derived defensibly?), Generalization (Does the test reliably sample from the universe of generalization?), Explanation (Does the test measure the target trait?), Extrapolation (Does the test allow conclusions to be extended to the entire target domain?), and Utilization (What uses are justifiable?). Each inference needs to be supported by evidence, which constitutes the validity argument. In the following section, Roever et al.’s validation argument is sketched out.
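For readers who find it helpful to see the interpretive argument laid out explicitly, the sketch below represents the chain of inferences as a simple ordered structure. This is an illustration added here, not part of Roever et al.'s (2014) study; the inference names and guiding questions are taken from the paragraph above, while the evidence entries are hypothetical placeholders.

```python
# A minimal sketch (not from Roever et al., 2014) laying out the interpretive
# argument as an ordered chain of inferences. The guiding questions come from
# the text above; the evidence lists are hypothetical placeholders for the
# kinds of backing each inference might receive.
INTERPRETIVE_ARGUMENT = [
    ("Domain Description",
     "Do the test items represent the universe of generalization?",
     ["task and domain analysis", "item panelling"]),
    ("Scoring/Evaluation",
     "Are scores derived defensibly?",
     ["scoring rules and weighting rationale", "rater training"]),
    ("Generalization",
     "Does the test reliably sample from the universe of generalization?",
     ["internal consistency and inter-rater reliability studies"]),
    ("Explanation",
     "Does the test measure the target trait?",
     ["group comparisons", "correlational and factor analyses"]),
    ("Extrapolation",
     "Does the test allow conclusions to be extended to the entire target domain?",
     ["relationships with real-world performance indicators"]),
    ("Utilization",
     "What uses are justifiable?",
     ["washback and decision studies"]),
]

# The validity argument consists of marshalling evidence for each link in turn.
for name, question, evidence in INTERPRETIVE_ARGUMENT:
    print(f"{name}: {question}\n  evidence: {', '.join(evidence)}")
```

Laid out this way, it is easy to see that the validity argument is only as strong as the weakest inference for which evidence is offered.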

Domain Description

The target domain for the test was the amorphous domain of "everyday language use." The universe of generalization drawn from this domain included acquaintances and strangers interacting in a variety of situations and settings, such as social

gatherings, workplaces, and service encounters. The power differential varied between equal, higher, and lower power for the imaginary interlocutor, and utterances to be judged could be appropriate, impolite, or overly polite. The trait under investigation was sociopragmatic knowledge as applied productively and receptively to these social settings. The test itself was a web-based test consisting of five sections. All items were shown to test takers on screen and played as audio files.

● Appropriateness Judgments: Nine items describing situations ending with the target utterance, which was a request, apology, refusal, or suggestion. Test takers judged the target utterance on a five-point scale from "very impolite/very harsh" to "far too polite/soft."
● Interactive appropriateness judgments: Twelve items with the social relationship held equal as friends and a two-turn interaction, in which the second turn was either appropriate or inappropriate in that it might be unresponsive to the first, too formal, too demanding, or too direct. Test takers judged appropriateness as "appropriate/inappropriate."
● Interactive appropriateness correction: If a test taker judged an interactive appropriateness item as inappropriate, he or she was prompted to provide a correction.
● Extended DCT: Five macro-items consisting of a dialog in which some of the turns from one interlocutor were missing. Three items focused on service encounter talk (booking a table, ordering at a restaurant, making a dental appointment) and two items focused on small talk (at a party, in a taxi). Four of the items had four gaps each, while the item "making a dental appointment" had eight gaps. Test takers were asked to complete the gaps in writing in such a way that the conversation made sense.
● Dialog choice: Four items consisting of two dialogs each, one of which included a misunderstanding due to the core utterance occurring unconventionally late or in an opaque manner. Test takers indicated which dialog was more "successful" and also gave a reason why (not scored).
● C-Test: A C-test, consisting of three texts, was also included as an independent proficiency measure.

Answers and response times were recorded, but the test itself was not timed and was not run under supervised conditions. The test was administered to 368 ESL learners in Australia recruited from language schools and university programs, 67 EFL learners from year eleven and twelve high school classes in Chile, and 50 Australian English native speakers. The ESL learners’ three most common native languages were Chinese (accounting for nearly half the learner sample), Indonesian and Spanish, in addition to another 41 languages. The native language of all the EFL learners was Chilean Spanish. Due to the study’s recruitment strategy, learners can generally be assumed to be at least at intermediate level, roughly high A2 or B1 and above on the Common European Framework of Reference for Languages (Council of Europe, 2001). This sampling approach was intended to limit the influence of pure second language proficiency as

a construct-irrelevant factor. Learners should, at least, be able to understand the items and the instructions and should not answer incorrectly only due to lack of overall proficiency.

Scoring

As Brennan (2013) argues, the scoring inference frequently does not receive the attention it deserves. Scoring is an extremely important step since it translates test taker performance into a numerical value, and this numerical value is henceforth taken as a true representation of performance. Scoring in this test proved far more complex than the test makers had anticipated. For the Appropriateness Judgment section, scoring followed an approach used by Matsumura (2003), which referenced partial-credit scores to native speaker preferences. Where a test taker chose the appropriateness judgment also chosen by the majority of the native speaker comparison group, they received two points for the item. If they chose a judgment that corresponded to one of the next two largest groups of native speakers, they received one point, as long as that choice had attracted at least 10 percent of the native speaker group. All other choices received zero points. It is admittedly questionable whether preference can be equated with competence and whether a native speaker standard is appropriate. Interactive Appropriateness Judgments and Dialog Choice items were less problematic since they could be dichotomously scored. The Extended DCT and Appropriateness Correction items were rated by native speaker raters on a scale of 0–3. After excluding under-performing items, the sections had the numbers of items and raw total scores shown in Table 9.1. Using raw scores would have led to a total score of 124, more than half of which would have been due to the Extended DCT. On the other hand, using percentage scores for the sections would have given the Dialog Choice section the same weight as the Extended DCT. In the end, all sections received a maximum score of eight except for the Dialog Choice section, whose maximum score was four. This approach was based on the rationale that all sections should contribute evenly to the measurement, but that the Dialog Choice section required only receptive, binary judgments of a small number of items and was therefore most vulnerable to guessing, which justified its lower weight.

TABLE 9.1 Items per Section and Possible Total Scores

Section                                   Number of Items    Possible Maximum Raw Score
Appropriateness Judgments                        8                       16
Interactive Appropriateness Choice               8                        8
Interactive Appropriateness Correction           8                       24
Extended DCT                                 24 gaps                     72
Dialog Choice                                    4                        4

This clearly demonstrates that scoring itself is a matter of argument, value judgment, and debate. Even though it was not done here, a careful scoring argument justifying the differential weighting of items and sections should be considered in test development. Further validity evidence in support of the scoring inference included item panelling and piloting, the removal of several items with low discrimination from the final analysis, and rater training.
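To make the scoring rule and the weighting decision concrete, the following sketch implements them for hypothetical data. It reflects one reading of the rule described above rather than the authors' actual scoring script; the function names, the threshold handling, and the example judgments are assumptions for illustration only.

```python
# Minimal sketch of the native-speaker-referenced partial-credit rule and the
# section weighting described above (not the authors' implementation; names
# and data are illustrative).
from collections import Counter

def score_appropriateness_item(test_taker_choice, native_speaker_choices):
    """2 points for the majority NS choice, 1 point for one of the next two
    largest NS choices provided it attracted at least 10% of the NS group,
    otherwise 0."""
    counts = Counter(native_speaker_choices)
    n_ns = sum(counts.values())
    ranked = [choice for choice, _ in counts.most_common()]

    if test_taker_choice == ranked[0]:
        return 2
    for choice in ranked[1:3]:
        if test_taker_choice == choice and counts[choice] / n_ns >= 0.10:
            return 1
    return 0

def weight_section(raw_score, raw_max, section_max=8):
    """Rescale a section's raw total so each section contributes at most eight
    points (four for Dialog Choice), as in the weighting decision above."""
    return raw_score / raw_max * section_max

# Hypothetical NS judgments on a five-point scale for one item
ns_judgments = [3] * 26 + [2] * 14 + [4] * 7 + [1] * 3
print(score_appropriateness_item(3, ns_judgments))  # 2: majority NS choice
print(score_appropriateness_item(2, ns_judgments))  # 1: second-largest choice, >=10% of NS
print(score_appropriateness_item(1, ns_judgments))  # 0: outside the top three NS choices

print(weight_section(raw_score=50, raw_max=72))               # Extended DCT rescaled to /8
print(weight_section(raw_score=3, raw_max=4, section_max=4))  # Dialog Choice kept at /4
```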

Generalization

The generalization inference extends scores on the particular set of items used in the test to the entire universe of generalization, and is supported by reliability studies. Table 9.2 shows the reliability of the sections and the whole test as well as the mean inter-item correlation. Reliability is lowest for Dialog Choice, which consisted of only four items, and also quite low for Appropriateness Judgments. The overall reliability of .811 and inter-item correlation of .161 are acceptable, and may have been negatively impacted by the reduced spread in test-taker abilities, which reduced the effect of proficiency as a variable. Inter-rater reliabilities were consistently in the .9 range, indicating a very high level of agreement between raters.

TABLE 9.2 Reliability

Section                        N       α
C-test                        426    .886
Appropriateness Judgments     435    .567
Extended DCT                  432    .661
Dialog Choice                 435    .529
Appropriate Choice            429    .724
Appropriate Correction        429    .769
Total                         426    .811
Mean Inter-item Correlation          .161
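Reliability estimates of the kind reported in Table 9.2 can be computed directly from a persons-by-items score matrix. The sketch below shows a standard Cronbach's alpha calculation on simulated data; it is illustrative only and does not reproduce the study's figures.

```python
# Minimal sketch: Cronbach's alpha for a persons-by-items score matrix
# (illustrative only; the simulated data below are not from the study).
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = test takers, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
# Eight dichotomous items loosely driven by a common ability (hypothetical data)
items = (ability + rng.normal(scale=1.2, size=(200, 8)) > 0).astype(int)
print(round(cronbach_alpha(items), 3))
```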

Explanation

This inference is essentially concerned with construct validation: Does the test indeed measure the target trait? Validity evidence includes comparisons between
groups that would be expected to differ on the strength of the trait, and correlational analyses to investigate whether the structure of the construct is reflected in the test. For sociopragmatic knowledge, native speakers would be expected to outperform ESL learners, who would, in turn, be expected to outperform EFL learners (Bardovi-Harlig & Dörnyei, 1998; Byon, 2004). This is indeed the case for all sections, though an analysis of variance showed that the effect of group membership (ESL, EFL, and native speaker) on scores varied strongly across sections, from η² = 0.47 in the Dialog Choice section to η² = 0.23 in the Appropriateness Choice section. Overall, group membership resulted in an effect size of η² = 0.21. When looking at the background variables proficiency, length of residence, and amount of English language use, proficiency alone accounted for 19.3 percent of the variance in total scores. When residence, amount of English use, and the interactions between all three variables were added, the total amount of explained variance more than doubled to 40.7 percent. The fact that these three variables account for nearly half the variance on the test is consistent with the test measuring sociopragmatic knowledge, which can be expected to be strongly influenced by these factors. For correlational analyses, it is common to conduct a factor analysis of all items to identify underlying scales, and a correlation analysis of the sections to identify section overlap. Exploratory principal component analysis with varimax rotation identified four factors: one for the Appropriateness Judgment section, one for extended discourse (including the Dialog Choice and Extended DCT items), and one each for the "overly polite" and "impolite" items in the Appropriateness Choice and Correction sections. Table 9.3 shows the correlation matrix for the sections. While statistically significant, all section correlations were small to medium (Plonsky & Oswald, 2014), which indicates that common abilities affected all sections, but sections also measured distinct attributes, which is consistent with measuring subtraits of an overall "sociopragmatic knowledge" trait.

TABLE 9.3 Section Correlations

                                          Appropriateness    Extended    Interactive      Interactive
                                          Judgments          DCT         Appropriateness  Appropriateness
                                                                         Choice           Correction
Extended DCT                                   .332
Interactive Appropriateness Choice             .325           .314
Interactive Appropriateness Correction         .331           .370          .853
Dialog Choice                                  .182           .293          .338             .342
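The correlational evidence above can be approximated with standard tools. The sketch below builds a section-score correlation matrix and an unrotated principal component solution for simulated section scores; it is not the authors' analysis (which used a varimax-rotated exploratory solution at the item level), and all variable names and data are hypothetical.

```python
# Minimal sketch: section correlations and an (unrotated) principal component
# analysis of section scores. Hypothetical data; not the study's analysis.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
trait = rng.normal(size=300)  # shared "sociopragmatic knowledge" component
sections = pd.DataFrame({
    "AppropriatenessJudg": trait + rng.normal(scale=1.5, size=300),
    "ExtendedDCT":         trait + rng.normal(scale=1.2, size=300),
    "InteractiveChoice":   trait + rng.normal(scale=1.0, size=300),
    "InteractiveCorrect":  trait + rng.normal(scale=1.0, size=300),
    "DialogChoice":        trait + rng.normal(scale=2.0, size=300),
})

# Section inter-correlations (compare with the pattern in Table 9.3)
print(sections.corr().round(3))

# Unrotated PCA via the eigendecomposition of the correlation matrix
corr = np.corrcoef(sections.to_numpy(), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
print("Variance explained by components:", (eigvals[order] / eigvals.sum()).round(3))
# A varimax rotation of the retained loadings would be the next step
# (e.g., with the factor_analyzer package), as in the analysis reported above.
```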

Extrapolation

The extrapolation inference provides evidence that the test scores account for the interaction of the trait with the entire domain, not just with the carefully defined "universe of generalization" covered in the test. Strong extrapolation evidence would consist of measures of actual performance in the target domain, which are difficult to attain. In this study, this measure was a general question about how easy test takers found interaction in English on an everyday basis. Sociopragmatic competence should have at least some effect on everyday communicative success. In comparing the scores of the four groups of test takers who reported that they found interacting "very easy," "mostly easy," "sometimes easy," and "sometimes difficult," an ANOVA found a significant difference among groups (F(3, 355) = 9.056, p < .001), accounting for 7.1 percent of the variance. A Scheffé post-hoc test indicated that the "very easy" group had significantly higher scores than the "sometimes easy" and "sometimes difficult" groups. This demonstrates an impact of sociopragmatic competence, as measured by this instrument, on everyday language use.
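The group-comparison evidence used in the Explanation and Extrapolation sections rests on a one-way ANOVA together with an eta-squared effect size. The sketch below shows one way to compute both; the scores and group sizes are invented for illustration, and the Scheffé post-hoc comparison used in the study would follow the omnibus test.

```python
# Minimal sketch: one-way ANOVA plus eta-squared, as used for the
# group-comparison evidence above. Hypothetical scores, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = {
    "very easy":           rng.normal(loc=30, scale=5, size=90),
    "mostly easy":         rng.normal(loc=28, scale=5, size=120),
    "sometimes easy":      rng.normal(loc=26, scale=5, size=100),
    "sometimes difficult": rng.normal(loc=25, scale=5, size=50),
}

f_stat, p_value = stats.f_oneway(*groups.values())

# Eta-squared = between-group sum of squares / total sum of squares
all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.2f}, p = {p_value:.4f}, eta^2 = {eta_squared:.3f}")
# A Scheffé (or other) post-hoc test would then locate which group pairs differ.
```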

Utilization

This final inference requires validity evidence from test use studies, demonstrating that the test leads to beneficial washback or to appropriate and ethical decisions. Since this test is not currently in use for decision-making purposes, such evidence is unavailable, but suggested uses can be outlined, subject to validation. For instance, the test could be used as a stand-alone measure in low-stakes pedagogical and placement decisions, self-assessment, and learner guidance to find appropriate resources and improve sociopragmatic abilities. The test might also be used as part of a larger, proficiency-oriented test battery to supplement information about learners' language competence with information about their pragmatic competence. However, any specific use (e.g., overseas assignments, immigration) would need to be specifically validated.

Conclusion

Work on testing of second language pragmatics has greatly increased in the past twenty years, but is still not as developed as work on other areas in language assessment, even though pragmatics is a large component of communicative competence. Questions to be investigated in future research include:

1 To what extent do different types of pragmatics tests mirror real-world ability for use? Just as the relationship between test and real world is important for other language tests, this is a central extrapolation question that deserves a great deal of additional validation work.

2 What is the unique value-add of testing pragmatics? In other words, what additional information do pragmatics tests provide that is not already captured by tests of general proficiency?

3 How can extended discourse tasks be made less resource intensive? Developments in speech recognition software and artificial intelligence may be helpful here, but even simply investigating the relationship between monologic and dialogic productive abilities would be interesting as well.

4 How can written pragmatics be measured? There is still very little work on "writing pragmatically," even though this is an area of potentially large importance to learners living, working, and studying in the target language context.

5 How does pragmatic ability overlap with noncognitive characteristics, such as empathy or introversion/extraversion? The contribution of personality may be much stronger and more direct in pragmatics than in general language proficiency.

References Ahn, R. C. (2005). Five Measures of Interlanguage Pragmatics in KFL [Korean as Foreign Language] Learners (Doctoral thesis). University of Hawai’i, Hawai’i. Al-Gahtani, S., & Roever, C. (2014). The Development of Requests by L2 Learners of Arabic: A Longitudinal and Cross-Sectional Study. Unpublished manuscript. Austin, J. L. (1962). How To Do Things With Words. Oxford, UK : Oxford University Press. Bachman, L. F., & Palmer, A. S. (2010). Language Assessment in Practice: Developing Language Tests and Justifying Their Use in the Real World. Oxford, UK : Oxford University Press. Bardovi-Harlig, K., & Dörnyei, Z. (1998). Do language learners recognize pragmatic violations? Pragmatic vs. grammatical awareness in instructed L2 learning. TESOL Quarterly, 32(2), 233–262. Billmyer, K., & Varghese, M. (2000). Investigating instrument-based pragmatic variability: Effects of enhancing discourse completion tests. Applied Linguistics, 21(4), 517–552. Blum-Kulka, S., & Olshtain, E. (1986). Too many words: Length of utterance and pragmatic failure. Studies in Second Language Acquisition, 8(2), 165–179. Blum-Kulka, S., House, J., & Kasper, G. (1989). Cross-Cultural Pragmatics: Requests and Apologies. Norwood, NJ : Ablex. Bouton, L. F. (1988). A cross-cultural study of the ability to interpret implicatures in English. World Englishes, 7(2), 183–197. Bouton, L. F. (1999). Developing non-native speaker skills in interpreting conversational implicatures in English: Explicit teaching can ease the process. In E. Hinkel (Ed.), Culture in Second Language Teaching and Learning. Cambridge, UK: University of Cambridge, pp. 47–70. Brown, J. D. (2001). Six types of pragmatics tests in two different contexts. In K. Rose & G. Kasper (Eds.), Pragmatics in Language Teaching. Cambridge, UK : Cambridge University Press, pp. 301–325. Brown, P., & Levinson, S. (1987). Politeness: Some Universals in Language Use. Cambridge, UK : Cambridge University Press. Byon, A. S. (2004). Sociopragmatic analysis of Korean requests: Pedagogical settings. Journal of Pragmatics, 36(9), 1673–1704.

Canale, M. (1983). From communicative competence to communicative language pedagogy. In J. Richards and R. Schmidt (Eds.), Language and Communication. London, UK : Longman, pp. 2–27. Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47. Chapelle, C. A. (2008). The TOEFL validity argument. In C. A. Chapelle, M. E. Enright, & J. Jamieson (Eds.), Building a Validity Argument for the Test of English as a Foreign Language. New York, NY: Routledge, pp. 319–350. Clark, H. H. (1979). Responding to indirect speech acts. Cognitive Psychology, 11(4), 430–477. Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge, UK : Cambridge University Press. Crystal, D. (1997). English as a Global Language. Cambridge, UK : Cambridge University Press. Ebel, R. L. (1964). The social consequences of educational testing. In Educational Testing Service (Ed.), Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, NJ : Educational Testing Service, pp. 130–143. Eslami, Z. R. (2010). Refusals: How to develop appropriate refusal strategies. In A. Martinez-Flor & E. Uso-Juan (Eds.), Speech Act Performance: Theoretical, Empirical and Methodological Issues. Amsterdam, The Netherlands: John Benjamins, pp. 217–236. Farhady, H. (1980). Justification, Development, and Validation of Functional Language Testing (Doctoral thesis). University of California, Los Angeles. Fraser, B., Rintell, E., & Walters, J. (1981). An approach to conducting research on the acquisition of pragmatic competence in a second language. In D. Larsen-Freeman (Ed.), Discourse Analysis. Rowley, MA : Newbury House, pp. 75–81. Grabowski, K. (2009). Investigating the Construct Validity of a Test Designed to Measure Grammatical and Pragmatic Knowledge in the Context of Speaking (Doctoral thesis). Teachers College, Columbia University, New York, NY. Grabowski, K. (2013). Investigating the construct validity of a role-play test designed to measure grammatical and pragmatic knowledge at multiple proficiency levels. In S. Ross & G. Kasper (Eds.), Assessing Second Language Pragmatics. New York, NY: Palgrave MacMillan, pp. 149–171. Gumperz, J. (1982). Discourse strategies. Cambridge, UK : Cambridge University Press. Hall, J. K. (1993). The role of oral practices in the accomplishment of our everyday lives: The sociocultural dimension of interaction with implications for the learning of another language. Applied Linguistics, 14(2), 145–166. Hall, J. K., & Pekarek Doehler, S. (2011). L2 interactional competence and development. In J. K. Hall, J. Hellermann, & S. Pekarek Doehler (Eds.), L2 Interactional Competence and Development. Bristol, UK : Multilingual Matters, pp. 1–15. Halliday, M. A. K. (1970). Functional diversity in language as seen from a consideration of modality and mood in English. Foundations of Language, 6(3), 322–361. Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London, UK : Longman. Heritage, J. (2013). Turn-initial position and some of its occupants. Journal of Pragmatics, 57, 331–337. House, J. (2003). Misunderstanding in intercultural university encounters. In J. House, G. Kasper, & S. Ross (Eds.), Misunderstanding in Social Life. London, UK : Longman, pp. 22–56. Hudson, T. (2001a). Indicators for cross-cultural pragmatic instruction: some quantitative tools. In K. Rose & G. Kasper (Eds.), Pragmatics in Language Teaching. 
Cambridge, UK : Cambridge University Press, pp. 283–300.

Hudson, T. (2001b). Self-assessment methods in cross-cultural pragmatics. In T. Hudson & J. D. Brown (Eds.), A Focus on Language Test Development: Expanding the Language Proficiency Construct Across a Variety of Tests. Honolulu, HI : University of Hawai’i, Second Language Teaching and Curriculum Center, pp. 57–74. Hudson, T., Detmer, E., & Brown, J. D. (1992). A Framework for Testing Cross-Cultural Pragmatics. Honolulu, HI : University of Hawai’i, Second Language Teaching and Curriculum Center. Hudson, T., Detmer, E., & Brown, J. (1995). Developing Prototypic Measures of CrossCultural Pragmatics. Honolulu, HI : University of Hawai’i, National Foreign Languages Resource Center. Hymes, D. (1972). On communicative competence. In J. Pride & J. Holmes (Eds.), Sociolinguistics: Selected Readings. Harmondsworth, UK : Penguin, pp. 269–293. Ishihara, N., & Cohen, A. (2010). Teaching and Learning Pragmatics: Where Language and Culture Meet. London, UK : Pearson. Itomitsu, M. (2009). Developing a Test of Pragmatics of Japanese as a Foreign Language (Doctoral thesis), Ohio State University, Columbus, OH . Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th edn). Westport, CT: American Council on Education/Praeger Publishers, pp. 17–64. Kane, M. T. (2012). All validity is construct validity. Or is it? Measurement: Interdisciplinary Research and Perspectives, 10(1–2), 66–70. Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. Kasper, G. (2008). Data collection in pragmatics research. In H. Spencer-Oatey (Ed.), Culturally Speaking (2nd edn.). London and New York: Continuum, pp. 279–303. Kasper, G. (2009). Locating cognition in second language interaction and learning: Inside the skull or in public view? IRAL, 47(1), 11–36. Kasper, G., & Rose, K. R. (2002). Pragmatic Development in a Second Language. Oxford, UK : Blackwell. Kasper, G., & Wagner, J. (2011). A conversation-analytic approach to second language acquisition. In D. Atkinson (Ed.), Alternative Approaches to Second Language Acquisition. London, UK : Routledge, pp. 117–142. Kramsch, C. (1986). From language proficiency to interactional competence. Modern Language Journal, 70(4), 366–372. Leech, G. (1983). Principles of Pragmatics. London, UK : Longman. Leech, G. (2014). The Pragmatics of Politeness. Oxford, UK : Oxford University Press. Levinson, S. C. (1983). Pragmatics. Cambridge, UK : Cambridge University Press. Liu, J. (2006). Measuring Interlanguage Pragmatic Knowledge of EFL Learners. Frankfurt, Germany: Peter Lang. Matsumura, S. (2003). Modelling the relationships among interlanguage pragmatic development, L2 proficiency, and exposure to L2. Applied Linguistics, 24(4), 465–491. Matsumura, S. (2007). Exploring the after effects of study abroad on interlanguage pragmatic development. Intercultural Pragmatics, 4(2), 167–192. Mey, J. (2001). Pragmatics: An Introduction. Oxford, UK : Blackwell Publishers. Ochs, E. (1986). Introduction. In B. B. Schieffelin & E. Ochs (Eds.), Language Socialization Across Cultures. Cambridge, UK : Cambridge University Press, pp. 1–17. Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912. Purpura, J. (2004). Assessing Grammar. Cambridge, UK : Cambridge University Press. Roever, C. (2005). Testing ESL Pragmatics. Frankfurt, Germany: Peter Lang. Roever, C. (2006). Validation of a web-based test of ESL pragmalinguistics. 
Language Testing, 23(2), 229–256.

Roever, C., Fraser, C., & Elder, C. (2014). Testing ESL Sociopragmatics: Development and Validation of a Web-Based Test Battery. Frankfurt, Germany: Peter Lang. Salsbury, T., & Bardovi-Harlig, K. (2001). “I know your mean, but I don’t think so” Disagreements in L2 English. In L. Bouton (Ed.), Pragmatics and Language Learning. University of Illinois, Urbana-Champaign: Division of English as an International Language, pp. 131–151. Schauer, G. A. (2006). Pragmatic awareness in ESL and EFL contexts: Contrast and development. Language Learning, 56(2), 269–318. Shimazu, Y. M. (1989). Construction and Concurrent Validation of a Written Pragmatic Competence Test of English as a Second Language (Doctoral thesis). University of San Francisco, CA . Stokoe, E. (2013). The (in)authenticity of simulated talk: comparing role-played and actual interaction and the implications for communication training. Research on Language and Social Interaction, 46(2), 165–185. Tada, M. (2005). Assessment of ESL Pragmatic Production and Perception using Video Prompts (Doctoral thesis). Temple University, Philadelphia, PA . Thomas, J. (1995). Meaning in Interaction. London, UK : Longman. Timpe, V. (2014). Assessing Intercultural Language Learning. Frankfurt, Germany: Peter Lang. Walters, F. S. (2004). An Application of Conversation Analysis to the Development of a Test of Second Language Pragmatic Competence (Doctoral thesis), University of Illinois, Urbana-Champaign, IL . Walters, F. S. (2007). A conversation-analytic hermeneutic rating protocol to assess L2 oral pragmatic competence. Language Testing, 27(2), 155–183. Walters, F. S. (2009). A conversation analysis-informed test of L2 aural pragmatic comprehension. TESOL Quarterly, 43(1), 29–54. Walters, F. S. (2013). Interfaces between a discourse completion test and a conversation analysis-informed test of L2 pragmatic competence. In S. Ross & G. Kasper (Eds.), Assessing Second Language Pragmatics. New York, NY: Palgrave, pp. 172–195. Yamashita, S. (1996). Six Measures of JSL Pragmatics. Honolulu, HI : University of Hawai’i, National Foreign Languages Resource Center. Yoshitake, S. (1997). Interlanguage Competence of Japanese Students of English: A Multi-Test Framework Evaluation (Doctoral thesis). Columbia Pacific University, San Rafael, CA . Youn, S.J. (2013). Validating Task Based Assessment of L2 Pragmatics in Interaction using Mixed Methods (Doctoral thesis). University of Hawai’i, Honolulu, HI . Young, L.W. L. (1994). Crosstalk and Culture in Sino-American Communication. Cambridge, UK : Cambridge University Press. Young, R. F. (2011). Interactional competence in language learning, teaching, and testing. In E. Hinkel (Ed.), Handbook of Research in Second Language Teaching and Learning (Vol. 2). New York, NY: Routledge, pp. 426–443.


PART THREE

Issues in Second Language Assessment


10 Applications of Corpus Linguistics in Language Assessment
Sara Cushing Weigle and Sarah Goodwin

ABSTRACT

The recent proliferation of large computerized databases (corpora) of naturally occurring speech and writing has been accompanied by an increased interest in using corpus linguistics to inform and enhance language assessment. This chapter provides a brief introduction to corpus linguistics and then presents an overview of how language assessments have been and could be informed by the use of corpora in various contexts. We cover two overarching themes: using corpora to inform test development (e.g., for item writing) and using corpora in test validation (e.g., in the analysis of language learner speech or writing). The focal study describes the analysis of linguistic features in a corpus of essays from the GSTEP (Georgia State Test of English Proficiency), a locally developed test of academic English for nonnative speakers. The conclusion offers suggestions for how corpora can be used more effectively to better inform the design, analysis, and use of language assessments.

Introduction

A corpus is an electronically-stored database of spoken or written language used for linguistic analysis. As computers became more powerful and accessible in the 1970s and 1980s, scholars in second language studies and language teaching began compiling and using corpora to address a variety of linguistic issues, such as the differences in language use across writing and speaking and the frequency of particular combinations of words in different contexts. These analyses have allowed scholars to produce materials such as dictionaries and textbooks that reflect natural language examples, rather than relying only on examples from their own intuition or consulting a limited supply of texts (see Römer, 2011, for further examples of corpus
applications in second language instruction). However, large-scale databases were not commonly used to inform language assessment projects until closer to the end of the twentieth century (Barker, 2006). Alderson (1996, p. 254) noted that the use of corpora could benefit test construction, compilation, and item selection, and assessment professionals have been increasingly recognizing the usefulness of corpus linguistics resources in the creation and validation of examinations. When we discuss language tests in this context, we, following Barker’s (2006) description, are referring to large-scale language proficiency exams used for academic or professional purposes; however, we note that corpora can, of course, also have benefits for classroom-based or smaller-scale tests. It may be useful here to distinguish between a reference corpus and a learner corpus (Barker, 2013). A reference corpus is a corpus of naturally occurring texts (spoken or written) produced by proficient language users; such a corpus can be considered to represent the target domain of a test. Examples of reference corpora, including samples of representative native or proficient language, include the Brown Corpus of American English (Kucˇ era & Francis, 1967), the Lancaster-Oslo/Bergen Corpus (LOB Corpus, 1970–1978), the American National Corpus (Reppen et al., 2005), and the British National Corpus (2007). A learner corpus, as the name implies, is a corpus consisting of texts produced by language learners and is designed to help scholars understand the ways in which learner language differs from the language of proficient users. Learner corpora have been used in materials development (Milton, 1998; Tseng & Liou, 2006), second language acquisition studies (Gilquin et al., 2007), and assessment projects (Hasselgren, 2002; Taylor & Barker, 2008). As noted above, various corpora have been used as tools consulted by test developers. Some offer a view of language as it occurs in many different scenarios, while others are highly specialized to a certain domain of language, such as spoken academic discourse. Many large-scale corpora are synchronic, providing a one-time snapshot of language in use. Others such as the Corpus of Contemporary American English (COCA ; Davies, 2008) are monitor corpora, as opposed to fixed-size corpora, meaning they are incrementally being added to. This type of corpus can allow scholars to examine a collection of highly up-to-date language or be able to note potential language change over time. Corpora have been valuable for testing professionals because, by consulting these real-life samples of text, it is possible to investigate the structure(s), function(s), and use(s) of language produced by language users (Barker, 2013). In the corpus compilation process, there are design features that language testing researchers should be aware of: these include the representativeness of texts, the characteristics of the speakers or writers, the permissions surrounding the use of particular texts, or which transcription or notation conventions to use (for corpus compilation recommendations, see McEnery & Hardie, 2011; Sinclair, 1991). As Sinclair explained, “The beginning of any corpus study is the creation of the corpus itself. The results are only as good as the corpus” (1991, p.  13). Corpora can thus be beneficial in assessment because a language test must permit us to evaluate the speech and writing that test takers will need to demonstrate for a particular purpose. 
Ultimately, the information afforded to us by corpus data reflect what speakers and writers actually do with language.


The application of corpora to language testing has been realized in various ways. Park’s (2014) comprehensive review article details a number of these developments. Barker’s (2006, 2013) writings also clearly and expertly elaborate on how corpus resources have been applied throughout language assessment. Drawing on these contributions to the language testing literature as well as on our own experiences, we first turn to the use of corpora in second language test development and validation. Any test development project takes place in a number of (iterative) steps: (1) defining the construct, that is, the underlying trait or skill to be assessed; (2) creating a test plan, including the item types and response formats; (3) writing and trialling test items; (4) combining items into a complete test form and pilot testing; (5) administering and scoring the test; (6) conducting validation research. These steps, while listed sequentially here, are not necessarily completed in a linear fashion; for example, some validation research is typically done throughout the entire process. In the discussion below, we describe how corpora can be used both in test development (steps 1 through 3) and in test validation (step 6). Of particular interest to language testing professionals may be the use of reference corpora for item writing (step 3). Test developers, in order to create plausible or expected samples of language, often consult corpora, and they can do so at a lexical or syntactic (small-scale) level or at a sentential or discoursal (larger-scale) level. Item writers may focus on language at the word or grammar structure level when crafting discrete items, in order to determine which words a certain vocabulary item tends to co-occur with. For example, if the phrase make a decision appears more frequently with the subsequent word about rather than the word to, as in make a decision about versus make a decision to, grammar item editors can discover this information by consulting appropriate corpora and crafting as naturalsounding a sentence as possible. At the sentential or discoursal level, when producing item sets, they can consult corpora to create authentic-sounding input and to determine the appropriate proficiency levels for listening and reading passages. For example, for an English language test for businesspeople, a reading passage might be adapted by drawing from a corpus of business text types, such as memos or financial reports. This can help ensure that the exam contains representative texts that also contain appropriate linguistic features for the level(s) tested. The corpus resources that assessment developers refer to should thus be relevant to the target domain(s) of the test, and by extension, to real-world language use in that context. To be able to find particular parts of speech, the use of an annotated corpus may be necessary. Annotated corpora contain a tag, or note, on each word in a text. There may also be notes on particular textual features in a database; such tags can include part-of-speech information, functional relationships among words, or the characteristics of the speaker or writer. Analyses involving the use of corpora may focus on the word level or the syntactic level, or researchers may consider a combination of linguistic features. Both of these levels of analysis—patterns at the word or collocation level, or sets of patterns that tend to occur together—are valuable in assessment research. 
For example, tools such as Coh-Metrix have been used to calculate indices of not only lexical measurements like word frequency or polysemy, but also syntactic complexity measures or density of cohesive connectives (see Crossley & McNamara, 2009; Graesser et al., 2004).


Previously, we discussed corpus resources being used at the lexical or syntactic level, such as for item writing purposes. We expand on this point here. It is important to consider the naturalness or typicality (see Hunston, 2002, pp. 42–44) of the language presented in a language test. From the perspective of current psycholinguistic approaches to second language acquisition (SLA), formulaicity and language patterning are crucial for language acquisition. Although language assessment professionals may not overtly align with this line of SLA studies, much of their research falls in line with this sentiment as well, especially if they use corpus methodologies to inform their tests and research. McEnery & Hardie (2011), in their book Corpus Linguistics, discuss the use of corpus data in modelling SLA; they note that learner corpora, especially longitudinal corpora, may give a glimpse into the language learning process, especially how certain words or phrases pattern with others. “Words mean things in the context of other words,” notes Ellis (2008, p. 1), and this understanding may thus impact assumptions about the language included on an assessment.

For example, item writers for discrete vocabulary items may be trained to write stems that should create a tip-of-the-tongue reaction to the word belonging in the gap. Consider the following discrete multiple-choice format vocabulary item. Examinees are to select the one word that best fits in the blank:

Guests are encouraged to come _______ for a natural setting with proper clothing, shoes, hats, sun protection, insect spray and water. (Meeks, 2011)

prepared*
arranged
ready
fit

This sentence comes from a newspaper article indexed in COCA. The sentence containing the blank is the stimulus or stem, the word prepared is the intended key (marked so with an asterisk), and the words arranged, ready, and fit are distractors. When crafting this item, the test developer knew that she wanted to assess the phrase come prepared, so she opened the COCA web interface, conducted a search for the phrase, and chose the sentence that fit the context she wanted to assess. This item uses an unaltered phrase directly from corpus hits. However, due to considerations of fair use or copyrighted material in texts, it may be preferable to draft a new stem based on how the language of the key appears to be used across many different examples in a corpus. To craft the distractors for this item, the test developer used a thesaurus to find words similar in meaning to prepared, but that do not appear to collocate as strongly with the preceding word come. Other ways to create distractors could be to search through learner or proficient-user corpora to find possible verb-adjective forms (this is where a corpus with part-of-speech data may prove beneficial) or functional contexts similar to the target structure(s) the developer desires to assess. The item writer may also want to check the frequency of the stem and distractors. In discrete vocabulary item writing, it is generally held that neither the language in the stem nor in the distractors ought to require a test taker to comprehend more difficult or lower frequency vocabulary than the word or phrase that is being tested.
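Where a suitable reference corpus can be queried offline, the kind of collocation check described above can be approximated with a few lines of code. The sketch below is only an illustration: the sample text stands in for a large proficient-user corpus (COCA itself is searchable only through its own interface), and the candidate words are those of the example item.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count lowercased word bigrams in a reference corpus supplied as one string."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(tokens, tokens[1:]))

# Invented stand-in for a large reference corpus of proficient-user English.
reference_text = ("Visitors should come prepared for hot weather. "
                  "They come prepared with water and sun protection. "
                  "The schedule was arranged in advance.")

counts = bigram_counts(reference_text)
for candidate in ["prepared", "arranged", "ready", "fit"]:
    # How often does 'come' immediately precede each candidate word?
    print("come", candidate, counts[("come", candidate)])
```

A distractor that turned out to collocate strongly with come in the reference data would be a candidate for replacement, mirroring the thesaurus-plus-corpus procedure described above.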


Corpus data can also be useful in revising or updating tests. For example, in a review of a test published in 1972 that is still widely used, Ellingburg & Hobson (2015) discuss one item testing the structure prefer [N] to [N], which many examinees in their sample answered incorrectly. A search of COCA for this structure revealed that it occurs relatively infrequently in the corpus; prefer [N] rather than [N] is approximately four times as frequent. Information of this type can be useful to test developers to ensure that their tests remain current as language changes.

Corpus linguistics studies are also important for language test validation. While reference corpora may be more useful in test development by describing the language structures employed by competent or expert language users, learner corpora are useful in describing differences between expert and nonexpert language users, and in documenting growth in grammatical and lexical structures across time or across proficiency levels. For example, researchers involved with the English Profile project (englishprofile.org), a collaborative project designed to provide information on lexical and grammatical structures at different levels of the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001), have recently documented how learners at different levels of the CEFR use the passive voice in their writing, using data from the Cambridge Learner Corpus (2015). These levels are documented in a so-called “Grammar Gem” or discussion of a specific grammatical topic (see http://www.englishprofile.org/english-grammar-profile/grammar-gems). Based on the Cambridge data, A2-level (upper beginner) English learners use the affirmative passive voice with familiar verbs (e.g., “it was made by”), while B1 (lower intermediate) learners produce a greater variety of passive verbs and forms, including negative passives (e.g., “the sheets aren't printed properly”). Reliable information of this nature may be very helpful in test validation; for instance, examinees’ use of the passive in their spoken and/or written production can be compared to the English Profile data. If the data can successfully help distinguish A2 from B1 learners, testing professionals can feel confident about their claims regarding the ability of the test to discriminate at the appropriate level(s).
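As an illustration of how such criterial features might be tallied in a learner corpus, the sketch below counts affirmative and negative passives with a crude part-of-speech heuristic. It is not the English Profile methodology; it assumes the NLTK library with its tokenizer and tagger models installed, and the heuristic will inevitably miss or over-count some passives.

```python
import nltk  # requires: pip install nltk, plus the 'punkt' and POS-tagger models via nltk.download()

BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

def count_passives(text):
    """Crude heuristic: a form of BE followed (optionally via 'not'/'n't' or an
    adverb) by a past participle (VBN) counts as one passive; negation is tracked."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    affirmative = negative = 0
    for i, (word, _) in enumerate(tagged):
        if word.lower() not in BE_FORMS:
            continue
        negated = False
        j = i + 1
        # Skip over negators and single adverbs, e.g. "aren't printed properly".
        while j < len(tagged) and (tagged[j][0].lower() in ("not", "n't") or tagged[j][1] == "RB"):
            negated = negated or tagged[j][0].lower() in ("not", "n't")
            j += 1
        if j < len(tagged) and tagged[j][1] == "VBN":
            if negated:
                negative += 1
            else:
                affirmative += 1
    return affirmative, negative

print(count_passives("It was made by hand. The sheets aren't printed properly."))
```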

Focal Study

This section of the chapter focuses on a study that uses corpus linguistics tools for the analysis of learner language. The research concerns the use of multi-word units in argumentative essays in the context of an integrated (reading to writing) academic English test. We present it here as an example of how learner corpora can be used to support a test validity argument (see Chapelle et al., 2010); specifically, if the test accurately distinguishes between high- and low-proficiency writers, a corpus analysis should reveal differences in the ways these writers use multi-word units in their essays.

Background for the Study

Multi-word units can be problematic for language learners. Successful language learners are expected to accurately use vocabulary, not only at an individual-word
level, but also at the phrase level (Ellis et al., 2008; Pawley & Syder, 1983), and many scholars (e.g., Ellis & Cadierno, 2009; Römer, 2009) advocate the explicit teaching of language patterns, including specific combinations of words, that commonly occur in academic writing. At the same time, students are often warned not to copy strings of words directly from source texts in their writing. It is therefore likely to be difficult for novice writers to know whether a particular string of words that they encounter in a reading text is an accepted formulaic sequence that can, or indeed should, be used in their own writing or a string that is particular to that specific text, and thus, requires a paraphrase and/or an appropriate attribution to the source.

With this in mind, the goal of the study was to investigate writers’ use of multi-word units (i.e., recurrent strings of words, or n-grams) on an integrated writing task in the context of a large-scale assessment. In particular, we wanted to compare the most commonly used n-grams in test essays both to those found in the source texts, and to a list of academic formulas recently compiled by Simpson-Vlach & Ellis (2010): the Academic Formulas List (AFL).

Simpson-Vlach & Ellis (2010) define academic formulas (or formulaic sequences, terms used interchangeably) as “frequent recurring patterns in written and spoken corpora that are significantly more common in academic discourse than in nonacademic discourse and which occupy a range of academic genres” (pp. 487–488). The AFL was created as a companion to Coxhead's (2000) Academic Word List. Simpson-Vlach & Ellis began with four academic corpora, two spoken and two written: the Michigan Corpus of Academic Spoken English (MICASE; Simpson et al., 2002) and academic speech from the British National Corpus (BNC) comprised the spoken data, and Hyland's (2004) research article corpus and selected BNC files were included in the written corpus subsection. Two quantitative measures were used to determine whether certain word strings within those corpora consisted of formulaic sequences: frequency (at least ten occurrences per million) and a mutual information figure, which calculates the strength of association between words. The higher the mutual information value, the more strongly the words in the phrase cohere beyond what would be expected by chance; examples of formulas with high values include circumstances in which, see for example, a wide variety of, and it can be seen that. As a qualitative measure, Simpson-Vlach & Ellis asked experienced ESL instructors and language testers at the University of Michigan English Language Institute to judge whether or not the formulas constituted a fixed chunk, whether their meaning or function was clear, and whether the group of words was worth teaching or assessing. The frequency data, mutual information figures, and value judgments were then used to identify the formulas that became part of the AFL.

Presuming that AFL formulas were an important feature of academic writing, we hypothesized that more proficient writers would use more AFL formulas. Based on previous research (Weigle & Parker, 2012), we also hypothesized that lower proficiency examinees would use more and longer strings of text directly from the source texts. Thus, our research questions were:

1 Do low and high proficiency writers differ in their use of n-grams from the source texts?
2 Do low and high proficiency writers differ in their use of n-grams from the AFL?


Procedure

Test Format

The Georgia State Test of English Proficiency (GSTEP) is a paper-based examination given to assess the academic English proficiency of current or prospective students whose first language is not English. One task is a seventy-minute integrated reading and writing section in which examinees read two argumentative passages, respond to eight open-ended questions about the reading passages, and write an argumentative essay on a topic related to the texts (see Weigle, 2004, for a complete description). Candidates may refer to the reading passages while they are writing. The essays, which are the focus of our study, are scored by at minimum two trained human raters using an analytic rubric; raters’ scores are averaged to calculate a score out of twenty for Rhetoric (content and organization) and twenty for Language (accuracy and range of grammar and vocabulary), for a maximum total score of forty.
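The scoring arithmetic described above reduces to averaging the raters' component scores and summing the two components. The sketch below illustrates that arithmetic only; the data layout and the rater values are invented, not part of the operational GSTEP procedure.

```python
def gstep_total(rater_scores):
    """Average at least two raters' analytic scores and sum the two components;
    each component is out of 20, so the total is out of 40."""
    rhetoric = sum(r["rhetoric"] for r in rater_scores) / len(rater_scores)
    language = sum(r["language"] for r in rater_scores) / len(rater_scores)
    return rhetoric + language

# Hypothetical pair of ratings for one essay.
print(gstep_total([{"rhetoric": 14, "language": 12},
                   {"rhetoric": 15, "language": 13}]))  # -> 27.0
```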

The Corpus

The corpus consisted of 332 GSTEP essays written between 2009 and 2012 on a single topic: whether schoolchildren should use computers in the classroom, particularly in early childhood. There are thus three source texts: the two reading passages (276 and 219 words, respectively) and the prompt for the writing task (50 words). The writers are representative of the GSTEP examinee population with regard to native country, native language, and range of proficiency levels represented. Essays were transcribed, saved in plain-text format, and corrected for spelling so that search terms in corpus tools would appear together. All spellings were changed to standardized North American English spellings (e.g., practise to practice). Interlanguage vocabulary or grammar errors were not changed, nor were errors of a nonexistent morphological form in English (such as informations or stuffs). If an examinee wrote an actual word, such as thinks, but the context indicated that the word things was likely intended, the word was not altered.

The corpus was divided approximately in half, with a score of 24 serving as the cut-off. Scores ranged from 4 to 39 (out of 40); the mean score was 23.8 (standard deviation 6.36). Writing sample length ranged from 22 to 536 words; the mean word count was 283.55 words (standard deviation 93.22). Essays below the cut-off score constituted the lower-scoring corpus, or Low subcorpus (n = 163; mean length = 238.58), and essays scoring 25 and above were in the higher-scoring corpus (n = 169; mean length = 326.93), henceforth called High subcorpus. The total corpus comprised 94,576 words; the Low subcorpus contained 39,037 words and the High subcorpus 55,539 words.
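A minimal sketch of the subcorpus division and the descriptive statistics reported above is given below; the essay data structure is hypothetical, and the toy examples merely show the mechanics of splitting at the cut-off of 24.

```python
import statistics

def split_subcorpora(essays, cutoff=24):
    """essays: a list of dicts with 'score' and 'text' keys (hypothetical layout).
    Essays at or below the cut-off form the Low subcorpus; essays scoring
    25 and above form the High subcorpus."""
    low = [e for e in essays if e["score"] <= cutoff]
    high = [e for e in essays if e["score"] > cutoff]
    return low, high

def describe(subcorpus):
    lengths = [len(e["text"].split()) for e in subcorpus]
    return {"essays": len(subcorpus),
            "mean_length": round(statistics.mean(lengths), 2),
            "total_words": sum(lengths)}

# Toy data; the real study used 332 transcribed GSTEP essays.
essays = [{"score": 18, "text": "computers in education is very important for children"},
          {"score": 31, "text": "the use of computers at an early age supports learning"}]
low, high = split_subcorpora(essays)
print(describe(low), describe(high))
```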

Corpus Analysis Software

AntConc (Anthony, 2014), a free concordancing program, was used to examine the use of formulas within the essays. The Low and High subcorpora were each loaded separately, and n-grams, or co-occurring strings of n words, were extracted under the Clusters/N-grams tab in the program interface. The n-gram size span was set at a
minimum of three words and a maximum of ten, the minimum frequency was set to three occurrences, and the minimum range was set to two texts. In other words, multi-word constructions could be from three to ten words in length, but had to have occurred at least three times in the entire corpus, by two or more unique writers. Frequency outputs were normalized as occurrences per 10,000 words, the largest power of 10 that did not exceed the total word count of the corpus. Because AntConc treats contracted words with an apostrophe as two separate words, an item such as don’t have was considered to be a trigram of the three “words” don, t, and have. Searches are not case- or punctuation-sensitive (e.g., For example, in would appear in the same search result as for example in).
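For readers without access to AntConc, the extraction settings described above can be approximated as follows. This is a rough sketch rather than a re-implementation of AntConc: tokenization differs in detail, although, like AntConc, it splits at apostrophes so that don't becomes the two "words" don and t.

```python
import re
from collections import Counter, defaultdict

def extract_ngrams(texts, n_min=3, n_max=10, min_freq=3, min_range=2):
    """Return n-grams meeting the frequency and range (unique-writer) settings
    described above, with frequencies normalized per 10,000 words."""
    freq, writers, total_words = Counter(), defaultdict(set), 0
    for writer_id, text in enumerate(texts):
        tokens = re.findall(r"[a-z0-9]+", text.lower())  # apostrophes split words
        total_words += len(tokens)
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                freq[gram] += 1
                writers[gram].add(writer_id)
    kept = {g: c for g, c in freq.items()
            if c >= min_freq and len(writers[g]) >= min_range}
    normed = {g: round(c / total_words * 10000, 2) for g, c in kept.items()}
    return kept, normed

# Toy demonstration with five very short "essays".
hits, per_10k = extract_ngrams(["using computers in education is very important"] * 3 +
                               ["using computers in education is not important"] * 2)
print(per_10k.get("using computers in education"))
```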

Data Analysis

To answer the first research question, the most frequent n-grams in each subcorpus were gathered and compared to n-grams in the source texts. To answer the second question, all n-grams from the AFL that appeared in any of the essays were compiled. Note that, for the purposes of this chapter, we are only presenting descriptive data and not statistical differences, so differences between the subcorpora should be interpreted with caution: the sample size is too small to generalize from, and the division of essays into the two subcorpora using the cut-off point of 24 is somewhat arbitrary, with the goal of dividing the entire sample into two approximately even groups.
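The AFL comparison reported in Table 10.2 amounts to counting each formula in the two subcorpora, normalizing per 10,000 words, and taking the difference. The sketch below shows that logic with a simple substring count and a two-formula placeholder list standing in for the full AFL.

```python
def afl_comparison(afl_formulas, low_texts, high_texts):
    """Normed (per 10,000 words) frequency of each AFL formula in the Low and
    High subcorpora, plus the High minus Low difference used in Table 10.2."""
    def normed_counts(texts):
        joined = " ".join(t.lower() for t in texts)
        words = len(joined.split())
        # Simple substring count; a production version would match on token boundaries.
        return {f: joined.count(f) / words * 10000 for f in afl_formulas}
    low, high = normed_counts(low_texts), normed_counts(high_texts)
    return {f: (round(low[f], 2), round(high[f], 2), round(high[f] - low[f], 2))
            for f in afl_formulas}

# Placeholder formulas and texts; the full AFL has hundreds of entries.
print(afl_comparison(["the use of", "we have to"],
                     ["we have to think about the use of computers"],
                     ["the use of computers supports the development of skills"]))
```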

Results

Table 10.1 shows the most frequent n-grams from the Low and High subcorpora, respectively. The n-grams are displayed in decreasing frequency order with their normed occurrences per 10,000 words, along with a Range column indicating how many unique writers used each phrase. Note that some strings were counted multiple times; for instance, in the High subcorpus, item 2 the use of and item 4 use of computers are sometimes occurrences of a ninth most frequent item in that subcorpus, the use of computers. The phrases that are not bold/italicized are direct phrases from the source material.

Several observations can be made from the table. First, almost all the most frequent n-grams in both subcorpora reflect verbatim or slightly paraphrased text from the essay prompt. Sixteen of the most frequent n-grams in the Low subcorpus are verbatim words from the essay prompt; this number rises to 18 if we disregard the lack of plural s in using computer in and computer in education. For the High group, the use of verbatim material from the prompt is much less pronounced: Not only are there fewer strings taken directly from the prompt, but these strings are generally shorter; only four are longer than three words. The High subcorpus includes more phrases reflecting paraphrase strategies, such as nominalization (e.g., the use of) and synonymy (e.g., at an early age). Furthermore, the normed frequencies of verbatim strings are often higher in the Low group than in the High group. For example, is very important has a normed frequency of 25.87 times per 10,000 words in the Low subcorpus, but only 8.64 in the High subcorpus.
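The normalization used throughout Tables 10.1 and 10.2 is simply the raw hit count divided by the subcorpus word count and scaled to 10,000 words, for example:

```latex
\text{normed frequency} = \frac{\text{raw hits}}{\text{subcorpus word count}} \times 10{,}000,
\qquad \frac{101}{39{,}037} \times 10{,}000 \approx 25.87
\quad (\textit{is very important}, \text{Low subcorpus}).
```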

TABLE 10.1 Most Frequent N-grams in the Low and High Subcorpora

Low Subcorpus
Rank  Hits  Normed  Range  String
1     101   25.87   63     is very important
2     79    20.24   57     using computers in
3     75    19.21   50     in early childhood
4     69    17.68   47     computers in education
4     69    17.68   51     in education is
6     56    14.35   41     education is very
7     55    14.09   40     education is very important
8     54    13.83   41     in education is very
9     53    13.58   40     in education is very important
10    51    13.06   38     even in early
11    50    12.81   38     a lot of
11    50    12.81   37     using computers in education
13    49    12.55   29     to use computer
14    47    12.04   36     even in early childhood
15    45    11.53   37     computers in education is
16    44    11.27   33     using computer in
17    43    11.02   30     they are older
18    38    9.73    32     using computers in education is
19    37    9.48    25     computer in education
19    37    9.48    30     computers in education is very
19    37    9.48    30     computers in education is very important

High Subcorpus
Rank  Hits  Normed  Range  String
1     87    15.66   55     using computers in
2     85    15.30   40     the use of
3     74    13.32   47     computers in education
4     71    12.78   34     use of computers
5     67    12.06   43     in early childhood
6     59    10.62   35     a lot of
7     57    10.26   41     to use computers
8     50    9.00    35     in education is
9     48    8.64    39     is very important
9     48    8.64    28     the use of computers
11    45    8.10    28     of computers in
11    45    8.10    34     until they are
13    43    7.74    26     computers in school
14    42    7.56    32     using computers in education
15    40    7.20    32     i believe that
16    39    7.02    30     they are older
17    38    6.84    31     that children should
18    36    6.48    27     until they are older
19    34    6.12    22     an early age
19    34    6.12    28     children should not
19    34    6.12    26     computers in education is
19    34    6.12    28     education is very
19    34    6.12    26     how to use


To answer our second research question, we compiled all AFL formulas that appeared in either subcorpus (see Table 10.2). The formulas are presented in descending order of the difference in normed frequencies between the High and Low subcorpora (the rightmost column in the table). Thus, the formulas at the top of the table are more frequent in the High subcorpus, while those at the bottom are more frequent in the Low subcorpus. We have highlighted the formulas where the difference is greater than one or less than minus one, with bold and italics, respectively.

TABLE 10.2 AFL Formulas Used by Writers

Formula               Low Hits (Range)   Low Normed   High Hits (Range)   High Normed   High – Low
the use of            30(18)             7.69         85(40)              15.30         7.61
as well as            3(3)               0.77         20(17)              3.60          2.83
point of view         4(4)               1.02         20(15)              3.60          2.58
the development of    6(5)               1.54         18(17)              3.24          1.70
they do not           4(3)               1.02         15(13)              2.70          1.68
should not be         34(27)             8.71         57(43)              10.26         1.55
the ability to        6(5)               1.54         17(13)              3.06          1.52
the fact that         4(4)               1.02         14(11)              2.52          1.50
important role in     3(3)               0.77         11(11)              1.98          1.21
the other hand        11(11)             2.82         21(18)              3.78          0.96
there is a            4(4)               1.02         11(11)              1.98          0.96
on the other hand     9(9)               2.31         18(15)              3.24          0.93
on the other          10(10)             2.56         19(16)              3.42          0.86
as a result           4(4)               1.02         9(9)                1.62          0.60
the idea of           4(3)               1.02         9(8)                1.62          0.60
there is no           7(6)               1.79         13(12)              2.34          0.55
to use the            13(12)             3.33         21(18)              3.78          0.45
it is important to    6(5)               1.54         11(11)              1.98          0.44
are able to           6(5)               1.54         10(8)               1.80          0.26
if they are           6(5)               1.54         9(7)                1.62          0.08
in order to           9(8)               2.31         13(10)              2.34          0.03
it is important       11(10)             2.82         14(11)              2.52          −0.30
it is not             15(13)             3.84         19(17)              3.42          −0.42
the most important    9(8)               2.31         10(10)              1.80          −0.51
at the same time      10(8)              2.56         11(11)              1.98          −0.58
first of all          22(22)             5.64         26(25)              4.68          −0.96
we have to            16(12)             4.10         13(9)               2.34          −1.76
because it is         9(8)               2.31         3(3)                0.54          −1.77
for example in        10(10)             2.56         4(4)                0.72          −1.84
we can see            10(10)             2.56         2(2)                0.36          −2.20

It is interesting to note that six of the nine formulas used more frequently by the High writers are parts of complex nominals (e.g., the ability to, the fact that), suggesting that nominalization is a writing strategy more frequently used by High writers. In contrast, three of the four AFL formulas used by the Low writers are parts of finite verb phrases (because it is, we have to, we can see). While this is a small dataset and only a few AFL phrases are represented, the differences between the High and Low subcorpora here are consistent with the progression in syntactic complexity proposed by Biber et  al. (2011) as writers learn to produce academic writing: from finite dependent clauses to more complex noun phrases with extensive phrasal embedding.

Discussion

The GSTEP data presented here could lend themselves to several other analyses (e.g., a functional analysis of the formulaic sequences produced by test takers), but the two analyses presented here are illustrative of the kinds of information that can be gleaned from a simple n-gram analysis of a written corpus. First, consistent with previous research (Weigle & Parker, 2012), we found that lower scoring essays included more, and longer, verbatim strings from the prompt in their essays than did
higher scoring essays. Second, we found that higher scoring essays included more phrases from the AFL , particularly with regard to complex nominals, than did lower scoring essays. This is consistent with other research suggesting that academic writers go through a developmental progression toward more complex nominal structures (e.g., Biber et al., 2011). As their proficiency increases, test takers appear to move away from relying on formulas taken verbatim from the source texts to formulas that allow them to produce more abstract academic writing. From a test validity standpoint, our results provide preliminary evidence that the GSTEP essay is validly capturing differences in writing proficiency, at least in these two areas.

Conclusion

In the last section of this chapter, we discuss how corpora can be used more effectively to better inform the design, analysis, and use of language assessments. Park (2014) outlines several areas where more work is needed in this regard. First, there is a need for more learner corpora with error annotations, preferably accompanied by parallel corpora with suggested edits or corrections. Second, more work needs to be done on identifying criterial features of language at different proficiency levels such as the CEFR. Third, as the field of applied natural language processing progresses, the insights gained on how to automatically parse and analyze language need to feed into assessment in ways that are user-friendly and do not require an extensive background in programming or computer science. Finally, there are two areas within assessment that could benefit from corpus linguistics: dynamic assessment and assessing local varieties of a language. In dynamic assessment, in which learners are assessed both on what they can do independently and what they can do with support from an expert, an obvious use of corpora would be to provide instances of naturally occurring language that is tailored to the needs of a learner in a testing situation (e.g., Park, 2014). In terms of assessing local varieties such as Singapore English, where vocabulary and syntax may differ from international standards such as British or American English, reference corpora of the local standard can be useful in providing guidance to test developers working in these contexts. (More about fostering localization can be found in Brown, 2014; Brown & Lumley, 1998.)

While there is clearly much more to be done with corpus linguistics and language assessment, current work in this area shows promise. In the development and validation of language tests, the use of corpora for assessment applications is likely to increase (Barker, 2013). It is important to remember, however, that information from corpora needs to be assessed in light of and with reference to other non-corpus-based data (Gilquin & Gries, 2009), including other forms of validity evidence such as the perceptions of test takers and raters, analysis of test performance, and empirical studies of test impact (see, for example, Enright & Tyson, 2008). Barker (2013) reminds us that “corpora are not the only tool available to language testers to help them to design assessments, and that a corpus’ fitness for purpose should always be established and balanced with theoretical and experimental findings and language testers’ own expertise” (p. 1015).


References Alderson, J. C. (1996). Do corpora have a role in language assessment? In J. Thomas & M. Short (Eds.), Using Corpora for Language Research: Studies in the Honour of Geoffrey Leech. London, UK : Longman, pp. 248–259. Anthony, L. (2014). AntConc (Version 3.4.3w) [Computer Software]. Tokyo, Japan: Waseda University. Available from: http://www.laurenceanthony.net/. Barker, F. (2006). Corpora and language assessment: Trends and prospects. Research Notes, 26, 2–4. Barker, F. (2013). Using corpora to design assessment. In A. J. Kunnan (Ed.), The Companion to Language Assessment. Hoboken, NJ : Wiley-Blackwell, pp. 1013–1028. Biber, D., Gray, B., & Poonpon, K. (2011). Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly, 45(1), 5–35. The British National Corpus, version 3 (BNC XML Edition). (2007). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. Available from: http:// www.natcorp.ox.ac.uk/. Brown, A., & Lumley, T. (1998). Linguistics and cultural norms in language testing: A case study. Melbourne Papers in Language Testing, 7(1), 80–96. Brown, J. D. (2014). The future of world Englishes in language testing. Language Assessment Quarterly, 11(1), 5–26. Cambridge Learner Corpus. (2015). Cambridge, UK : Cambridge University Press/ Cambridge English Language Assessment. Available from: englishprofile.org. Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an Argument-Based Approach to Validity Make a Difference? Educational Measurement: Issues and Practice, 29(1), 3–13. Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge, UK : Cambridge University Press. Coxhead, A. (2000). A new Academic Word List. TESOL Quarterly, 34(2), 213–238. Crossley, S. A., & McNamara, D. S. (2009). Computational assessment of lexical differences in L1 and L2 writing. Journal of Second Language Writing, 18(2), 119–135. Davies, M. (2008) The Corpus of Contemporary American English: 450 Million Words, 1990–Present. Available from: http://www.corpus.byu.edu/coca/. Ellingburg, D., & Hobson, H. (2015). A Review of the Michigan English Placement Test. Unpublished seminar paper, Georgia State University. Ellis, N. C. (2008). Phraseology: The periphery and the heart of language. In F. Meunier & S. Granger (Eds.), Phraseology in Foreign Language Learning and Teaching. Amsterdam, The Netherlands: John Benjamins, pp. 1–13. Ellis, N. C., & Cadierno, T. (2009). Constructing a second language: Introduction to the special edition. Annual Review of Cognitive Linguistics, 7, 111–139. Ellis, N. C., Simpson-Vlach, R., & Maynard, C. (2008). Formulaic language in native and second-language speakers: Psycholinguistics, corpus linguistics, and TESOL . TESOL Quarterly, 42(3), 375–396. Enright, M., & Tyson, E. (2008). Validity Evidence Supporting the Interpretation and Use of TOEFL iBT Scores. TOEFL iBT Research Insight. Princeton, NJ : Educational Testing Service. Gilquin, G., Granger, S., & Paquot, M. (2007). Learner corpora: The missing link in EAP pedagogy. Journal of English for Academic Purposes, 6(4), 319–335. Gilquin, G., & Gries, S. T. (2009). Corpora and experimental methods: A state-of-the-art review. Corpus Linguistics and Linguistic Theory, 5(1), 1–26.


Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavioral Research Methods, Instruments, and Computers, 36(2), 193–202. Hasselgren, A. (2002). Learner corpora and language testing: Small words as markers of learner fluency. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam, The Netherlands: John Benjamins, pp. 143–173. Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge, UK : Cambridge University Press. Hyland, K. (2004). Disciplinary Discourses: Social Interactions in Academic Writing. Ann Arbor, MI : University of Michigan Press. Kucˇera, H., & Francis, N. (1967). Computational Analysis of Present-Day American English. Providence, RI : Brown University Press. The LOB Corpus, original version. (1970–1978). Compiled by Geoffrey Leech, Lancaster University, Stig Johansson, University of Oslo (project leaders), & Knut Hofland, University of Bergen (head of computing). Available from: http://www.helsinki.fi/varieng/ CoRD /corpora/LOB /index.html. McEnery, T., & Hardie, A. (2011). Corpus Linguistics: Method, Theory and Practice. Cambridge, UK : Cambridge University Press. Meeks, F. (2011). Stay-cation options abound in Clear Lake, Galveston areas. Houston Chronicle. Available from: http://www.chron.com/news/article/Stay-cation-optionsabound-in-Clear-Lake-1394187.php. [24 May 2015]. Milton, J. (1998). Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment. In S. Granger (Ed.), Learner English on Computer. London, UK : Addison-Wesley, Longman, pp. 186–198. Park, K. (2014). Corpora and language assessment: The state of the art. Language Assessment Quarterly, 11(1), 27–44. Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and Communication. London, UK : Longman, pp. 191–225. Reppen, R., Ide, N., & Suderman, K. (2005). American National Corpus, Second Release. Available from: http://www.americannationalcorpus.org/. Römer, U. (2009). The inseparability of lexis and grammar: Corpus linguistic perspectives. Annual Review of Cognitive Linguistics, 7, 141–162. Römer, U. (2011). Corpus research applications in second language teaching. Annual Review of Applied Linguistics, 31, 205–225. Simpson, R., Briggs, S. L., Ovens, J., & Swales, J. M. (2002). The Michigan Corpus of Academic Spoken English. Ann Arbor, MI : The Regents of the University of Michigan. Simpson-Vlach, R., & Ellis, N. C. (2010). An Academic formulas list: New methods in phraseology research. Applied Linguistics, 31(4), 487–512. Sinclair, J. M. (1991). Corpus, Concordance, Collocation. Oxford, UK : Oxford University Press. Taylor, L., & Barker, F. (2008). Using corpora for language assessment. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of Language and Education (2nd edn.), Volume 7: Language Testing and Assessment. New York, NY: Springer US , pp. 241–254. Tseng, Y.-C., & Liou, H.-C. (2006). The effects of online conjunction materials on college EFL students’ writing. System, 34(2), 270–283. Weigle, S. C. (2004). Integrating reading and writing in a competency test for non-native speakers of English. Assessing Writing, 9(1), 27–55. Weigle, S. C., & Parker, K. (2012). Source text borrowing in an integrated reading/writing assessment. 
Journal of Second Language Writing, 21(2), 118–133.


11 Setting Language Standards for International Medical Graduates
Vivien Berry and Barry O'Sullivan

ABSTRACT

Setting standards is the process by which the critical boundary point used to interpret test performance is established. Critical decisions around standard setting include the general approach to be taken, and the nature and source of any contributing human judgments. In the project we present in this chapter, the standard is the criterion level on the International English Language Testing System (IELTS) test that overseas-trained doctors must attain in order to be allowed to practice medicine in Britain. The major focus of the chapter is on the standard-setting procedures that were carried out, the problems that were encountered, and the decisions that were reached. Cut-scores arrived at by means of subjective judgments, descriptive analysis of score equivalent Bands, and MFRM analysis of appropriate skills papers are presented and compared. We conclude by proposing suggestions for improving future standard-setting events in medical contexts.

Introduction

Deciding on the language requirements for such purposes as professional certification, immigration, and study has been debated in the language testing literature for some time (cf. Banerjee, 2003; McNamara & Roever, 2006; Shohamy, 2001; Van Avermaet et al., 2013; inter alia). It is not the purpose of this chapter to focus on the justification for including language in tests used for such gatekeeping purposes. Even when a recognized language framework has been used to describe the language levels required of immigrants, the rationale for choosing this level has rarely been made explicit. What is consistently lacking in the literature is evidence of any empirically driven approaches to setting the language level required of a particular group of
immigrants. However, Van Avermaet & Rocca (2013) refer to the linking of teaching programs to assessment, to ensure the appropriateness of test-driven decisions. This chapter, therefore, presents an approach to setting the language requirements of one particular immigrant group, international medical graduates (IMG s), using a transparent empirical methodology. When the project on which this chapter is based was commissioned, the language proficiency (as indicated by an IELTS score) expected of incoming medical doctors to the UK was Band 7.0. This score had been determined by the General Medical Council (GMC ) following an earlier study by Banerjee (2004), which investigated the speaking and writing requirements for overseas doctors. However, as in the aviation context, there are many risks associated with medical professionals lacking the ability to communicate efficiently in English. Over time, significant languagerelated failures by a number of overseas doctors working in the United Kingdom led to concern being raised by many within the medical profession. When combined with criticism of the population size in the earlier research, the need for an up-todate study, investigating all four skills and taking into account a broader and more representative population of stakeholders, was recognized by the GMC . The aim of this current project was to either reaffirm the existing English proficiency level for incoming medical doctors to the United Kingdom, or empirically establish an alternative level as indicated by the IELTS test results. Before this could be done, evidence that IELTS was perceived by key stakeholders as an appropriate test of the language proficiency of these doctors was first required. That this was the case is reported in some detail by Berry & O’Sullivan (2016) and Berry et al. (2013). In the following section, we focus on two aspects of the project under discussion. The first of these is the rationale for using a standard setting approach to establish appropriate cut-scores for IELTS within a UK medical doctor selection procedure. The second is to present the argument for using two different standard-setting approaches for this project.

Overview of Standard Setting Procedures

In this section, we present a brief overview of standard-setting procedures. For a more complete overview, see Cizek & Bunch (2007), Cizek (2012), and Kaftandjieva (2004), all of which offer comprehensive, accurate, and practically useful guidelines.

Standard setting is the process by which a critical cut-score (or scores), which will be used to interpret test performance, is established. In other words, it is the process used to establish important boundary points such as pass/fail or pass/merit boundaries in our examinations. It is vitally important to set such cut-scores in a systematic and transparent manner as these are the most critical score points in the decision-making process. Deciding whether a performance deserves a fail, pass, or merit is central to the claims we intend to make based on test performance.

It has been argued elsewhere that the essential approaches to setting standards are the test-centered approach, which focuses on the items in the test, and the examinee-centered approach, which focuses on the candidates’ responses (Cizek & Bunch, 2007; Jaeger, 1989). Although other approaches have been suggested (Cizek
& Bunch, 2007, 9–11), the test/examinee approaches represent the most common and intuitively appropriate procedures used today. This is not to say that these represent two fixed sets of processes. Quite the opposite: the reference works referred to earlier (Cizek & Bunch, 2007; Cizek, 2012; Kaftandjieva, 2004) all contain detailed discussions of the many variations that have appeared in the literature over the years.

Why a Standard-setting Approach? Standard setting is essentially a systematic procedure to establish important boundary points for a test, and we were tasked by the GMC with establishing such a boundary point (or points) for incoming medical doctors. Thus, it seemed clear that a standardsetting approach would be the most obvious way to undertake our task. The difference between the aim of this project and that of a typical standard-setting event lies in the relationship between the test and the decision makers. In a typical event, the decision makers are often described as an expert panel. The individuals making up the panel are expected to have some significant expertise in the context (e.g., where IELTS or another test is a university entrance requirement, the panel members should have a solid understanding of the level of language proficiency required for university study), and also of the test itself (so the panel members should be aware of what the test is attempting to measure and how it reports performance). The combination of these two areas of expertise allows the panel members to make thoughtful judgments with regard to the level of language required as reported by the test. In the case of the current project, members bring to the panel their wealth of experience either working in, or engaging with, health professionals. In order to be in a position to make systematic, specific test-related (i.e., IELTS ) judgments of the language level required by incoming medical doctors, they need, in addition, to develop a good understanding of the IELTS test papers and of the IELTS reporting scale. The facilitation of this additional expertise allows the members to make the sort of judgments required of an expert panel member in estimating the appropriate cut-scores. As will be seen in the methodology section below, this training formed the basis of the introduction of the panel members to the process. Having decided on a standard-setting approach, the issue then was to decide on the details of the procedure to be used.

Test- and Examinee-Centered Approaches

The test-centered approach to setting standards focuses on the items in the test itself. Typically, a test of reading or listening (both receptive skills) consists of a series of independent items. This means that each item is a separate entity, measuring a separate aspect of the ability. In order to consider where a cut score might be placed, the expert panel member is asked to consider first the minimum amount of that skill a target individual might be expected to demonstrate (in measurement terms referred to either as the minimally acceptable person or minimally competent candidate—in
this study, we will use the latter term)—and then whether that person might respond appropriately to each item. In addition, the member may be asked to consider what percentage of such “persons” might get the item right (a higher percentage means an easier item). The important thing to consider here is that the expert panel member is asked to estimate how test candidates will behave. Large numbers of experts are required in order to eliminate any judge effect, where one expert is very harsh or lenient and pushes the cut score in one way or another away from where it should be. For this reason, a total of sixty-two panel members were included in the various panels in the study. Of course, this approach cannot work for a speaking or writing test, in which there are normally far fewer items (or tasks) for the experts to judge. Instead of looking at these items/tasks and estimating how well (or poorly) test candidates might perform on them, the experts have the advantage of seeing or hearing actual performances. The focus has moved from the test itself to the performance of test candidates. This change in focus means that the approach is often referred to as an examinee-centered approach. The task here for the experts is to review a range of performances at different levels of proficiency and to identify the acceptability of each performance in terms of the minimally competent candidate (MCC ). This time, the latter is defined as the person who demonstrates enough of the ability being tested to be deemed acceptable to the expert panel member. Since IELTS comprises four papers, two targeting receptive and two targeting productive skills, it was clear that both approaches would need to be used in this project.
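The arithmetic behind a test-centered judgment of this kind can be illustrated with an Angoff-style calculation: each panelist estimates the probability that a minimally competent candidate answers each item correctly, and the cut score is the panel-mean probability summed over items. The sketch below is a generic illustration with invented judgments, not a description of the exact computations used in this project.

```python
def angoff_cut_score(judgments):
    """judgments: one list per panelist, giving the judged probability that a
    minimally competent candidate answers each item correctly. The cut score is
    the panel-mean probability summed over items (Angoff-style arithmetic)."""
    n_items = len(judgments[0])
    item_means = [sum(panelist[i] for panelist in judgments) / len(judgments)
                  for i in range(n_items)]
    return sum(item_means)

# Three hypothetical panelists judging a five-item test.
panel = [[0.9, 0.7, 0.6, 0.8, 0.5],
         [0.8, 0.6, 0.7, 0.9, 0.4],
         [0.85, 0.65, 0.6, 0.85, 0.45]]
print(round(angoff_cut_score(panel), 2))  # expected raw cut score out of 5
```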

Research Design

A mixed-methods approach consisting of the use of quantitative probability-based statistical analysis, the many-facet Rasch measurement model (MFRM) analysis, plus a detailed qualitative analysis of the focus-group data, was employed in this study. In addition to the use of MFRM to support the decisions of the stakeholder focus groups (SFGs), the approach included an additional validation process, whereby the recommendations were debated by an expert focus group who made the final recommendations included in the report (Berry et al., 2013). The SFGs were chosen to best reflect the relevant stakeholders in the project, and the procedures followed for their selection were based on the recommendations of Hambleton et al. (2012, p. 53). Figure 11.1 shows the basic design of the approach.

Materials Used in the Study

Two complete sets of IELTS test materials were supplied by Cambridge English Language Assessment, and were either live (currently in use) or IELTS tests standardized for training purposes. See http://www.ielts.org/test_takers_information.aspx for an overview of the different sections of the IELTS test from which the descriptions of various parts of the test have been taken, together with sample tasks/questions.


FIGURE 11.1 Approach (based on Berry et al., 2013, p. 10).

Writing Test

The IELTS writing test consists of two tasks. Task 1 requires candidates to summarize information presented graphically. Task 2 requires candidates to write an essay in response to a point of view, argument, or problem.

In the Banerjee (2004) study, exemplar writing performances were used from Task 1, presumably because Task 1 appeared to represent a more realistic writing task for doctors than writing an essay. However, while both tasks contribute to the overall score awarded for writing, the second task is more substantial and is double-weighted, meaning that it contributes twice as much to the final writing score as Task 1. For this reason, the exemplar performances in this study were represented using Task 2 only. This was done to limit the amount of work for the judges while still gaining as true a picture of the level as possible. Twelve exemplar Task 2 writing papers ranging in standardized ability from Band 5 to Band 8.5 were provided. Eight of the twelve papers were used in the actual standardization procedure; the remaining papers were used for training purposes.

Speaking Test

The IELTS speaking test consists of a one-to-one interview in three parts. Sixteen speech samples, ranging in assessed ability from Band 5 to Band 8.5, were provided. Twelve speech samples were used in the actual standardization procedure; the remaining four speech samples were used for training purposes.

Reading Test

The IELTS reading test consists of three sections with a total text length of 2,150–2,750 words. Each reading test has forty questions; each correct answer is awarded 1 mark. Scores out of 40 are converted to the IELTS 9-Band scale. Two reading tests were provided, complete with item statistics (facility value and item discrimination) and raw score to Band conversion tables. One test was used in
the standardization procedure; the other was used for training purposes. Scores awarded on both tests counted toward the final recommended Band.

Listening Test

The IELTS listening test consists of four sections. Each test has forty questions and each correct answer is awarded 1 mark. Scores out of 40 are converted to the IELTS 9-Band scale. Two listening tests were provided, complete with item statistics (as above) and raw score to Band conversion tables. One test was used in the standardization procedure; the other was used for training purposes. Scores awarded on both tests counted towards the final recommended Band.

Additional Materials

IELTS Band descriptors for writing and speaking skills, plus Common European Framework of Reference for Languages (CEFR) “can do” statements for C1 and C2 level reading, writing, listening, and speaking, were adapted to supplement the “can do” statements produced by the panels.

Stakeholder Focus Groups (SFGs)

Selection of SFGs

There is no best way of selecting participants for stakeholder groups, and the number of judges to have on a panel is a matter of some debate, ranging from not less than 5 (Livingston & Zieky, 1982), 7 to 11 (Biddle, 1993; Maurer et al., 1991), 15 (Hurtz & Hertz, 1999; Loomis, 2012), or even 20 (Jaeger & Mills, 2001). The most important factor is to be able to demonstrate that the issue of composition has been seriously considered and that there is a defensible rationale for the composition of the panel that has been formed (Hambleton et al., 2012).

Five stakeholder groups were identified—patients/members of the public, doctors, nurses, allied health professionals, and medical directors—because there was a concern that issues of seniority and status might impact negatively on group dynamics if stakeholders were together in a single panel. This concern is supported by Papageorgiou (2007), who also cautions against the constitution of larger groups. His evidence suggests that, in such groups, the impact of group dynamics may prove to be destabilizing if some participants dominate the judgment process. It was also important to provide a range of diversity in participants’ background and personal characteristics, for example, age, gender, ethnicity, geographical locations, level of seniority, and healthcare settings, which reflected the larger population from which the panels were drawn. This all suggested that the optimum approach was a number of groups working independently, with a single, final decision-making group. Consequently, eleven initial stakeholder panels were
convened comprising three subpanels of doctors (15 participants), three subpanels of nurses (15 participants), three subpanels of patients/members of the public (20 participants), one panel of allied health professionals (5 participants) and one panel of medical directors (7 participants). A total of 62 people, 30 males and 32 females, 53 white and 9 ethnic minorities, ranging in age from early twenties to late sixties, from all regions of the United Kingdom and Northern Ireland, participated in the initial panels.

Training of Participants The importance of training panel participants to ensure they are familiar with their task has been highlighted by many standard-setting theorists (Hambleton et al., 2012; Skorupski & Hambleton, 2005; inter alia). We initially considered sending materials in advance so participants could familiarize themselves at their leisure, as is common in many standard-setting events, including the Banerjee (2004) study. However, this approach was ultimately rejected for several reasons:

● Two full days were scheduled for each panel, giving sufficient time for thorough face-to-face training, as recommended by Skorupski & Hambleton (2005).
● Assuring the confidentiality of the IELTS materials was a major concern and could not be guaranteed if part or all of the sections were studied outside the meeting.
● Research has consistently shown (e.g., Loomis, 2012) that panelists often do not find time to prepare for the meetings in advance. We anticipated that this would probably be the case for the medical professionals involved.

A framework for conducting the panel discussions was developed, as shown in Figure 11.2. Following introductions, participants on each panel were asked to consider the language characteristics they believed an MCC should possess for each skill. These were summarized as "can do" statements and supplemented by IELTS descriptors and CEFR "can do" statements. IELTS Band descriptors (performance level descriptors, or PLDs) and score-to-Band conversion tables were then studied. It is recognized that understanding PLDs and the concept of an MCC is cognitively very challenging for panel participants (Hein & Skaggs, 2010), so it was important to spend enough time ensuring that the characteristics of each of these had been thoroughly internalized. Participants then took listening and reading tests to familiarize themselves with the format, tasks, and question types of each subtest. After each of the receptive skills tests, they were asked to estimate the probability that an MCC would answer each question correctly. They then studied examples of writing scripts and speech samples to assist them in interpreting the rating scales. Hambleton et al. (2012, p. 58) suggest it is likely that panelists will set more realistic performance standards if they have personal experience of both the assessment and its associated scoring keys and instructions/rubrics. At all stages of the training, participants were encouraged to take notes, ask questions, and discuss any matters of concern to them. Evidence shows that the more panelists converse with each other, the more likely they are to reach agreement, often with higher cut-scores, but also with greater consensus (e.g., Clauser et al., 2009; Fitzpatrick, 1989; Hurtz & Auerbach, 2003).

FIGURE 11.2 Framework for Conducting the Initial SFG Sessions.

Standard-setting Procedures Training for the four IELTS subtests took one half day of the two days allocated for the standard-setting procedure, a timescale that is considered fairly typical (Cizek, 2012). On completion of the training for each skill, participants completed the relevant subtest.

Receptive Skills—Reading and Listening Participants completed a reading test (three texts, forty questions) and estimated the probability of an MCC correctly answering each question. They then marked each question separately, added their responses, giving a score out of 40 for reading. Scores were averaged with scores awarded on the practice test and converted to a Band according to the raw scores to Band conversion table provided. Based on the conversion, panel members were informed of the average Band they had decided on for the reading skill, and invited to discuss this and confirm their decision.
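
To make the arithmetic of this step concrete, the sketch below works through a single judge's estimates for one receptive subtest. It is only an illustration: the per-item probability estimates, the practice-test score, and the raw-score-to-Band conversion table are all invented, since the real IELTS conversion tables are confidential.

```python
def estimated_raw_score(p_correct):
    """Sum of one judge's per-item probability estimates for the MCC (an Angoff-style judgment)."""
    return sum(p_correct)

def to_band(raw_score, conversion):
    """Map a raw score out of 40 to the highest Band whose cut-off it reaches."""
    band = None
    for cutoff, b in conversion:              # conversion is sorted by ascending cut-off
        if raw_score >= cutoff:
            band = b
    return band

# Hypothetical raw-score-to-Band conversion table: (minimum raw score, Band).
CONVERSION = [(16, 5.0), (23, 6.0), (30, 7.0), (33, 7.5), (35, 8.0), (37, 8.5), (39, 9.0)]

judge_estimates = [0.9] * 25 + [0.7] * 10 + [0.5] * 5          # forty items
standardization_score = estimated_raw_score(judge_estimates)   # 32.0
practice_score = 34.0                                          # score from the practice test
average_score = (standardization_score + practice_score) / 2   # both tests count
print(to_band(average_score, CONVERSION))                      # Band 7.5 under this toy table
```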


The same procedure was followed for the listening test (four sections, 40 marks). The decision as to the final Band they had agreed for each of the receptive subtests constituted their subjective judgment for the purposes of analysis.

Productive Skills—Writing and Speaking Participants read eight scripts and listened to individual speaking tests from twelve candidates, all of which had been given standardized ratings in the range Bands 5 to 8.5. They were then asked to determine whether each script/speech sample was acceptable or not as representative of writing/speaking by a doctor. Finally, they were asked to rate each script/speech sample as follows:

5 = very good writing/speaking
4 = acceptable writing/speaking
3 = borderline acceptable
2 = borderline not acceptable
1 = not acceptable writing/speaking

Initial individual decisions for the writing and speaking tests were analyzed by averaging the score given to each writing and speech sample. Those in the 4 or over range were definitely acceptable. Those in the 3–4 range were almost, but not definitely, acceptable. So, panel members were asked to discuss their decisions and come to a collective agreement as to the acceptability of each writing and speech sample, and the definition of the writing/speaking competence of a minimally competent doctor. These were then converted to Band levels and the panel was informed of the Band they had agreed on and asked to discuss it and confirm their decision. The discussions formed an important part of the judgment process, their purpose being to provide panelists with an opportunity to reconsider their initial ratings and to clarify any confusions or misunderstandings that had occurred. There is evidence to show that panelists feel more confident about the accuracy and validity of their ratings if there has been discussion and feedback (Hambleton et al., 2012). Once the judgments were confirmed, the data generated during the procedure were then subjected to MFRM analysis.

Results and Discussion of Findings from Initial Panels Following completion of the initial stakeholders' panels, all judgments and data were analyzed. The results are presented in their entirety in Berry et al. (2013). Since the focus of this chapter is primarily on the standard-setting procedures followed and problems encountered, the results will simply be summarized as Bands converted from scores of three analyses (see Table 11.1).

TABLE 11.1 Subjective Judgments, Band Score Equivalents, and MFRM Analyses of the IELTS Skills Subtests

Skill       Subjective Judgment Band   Score Equivalent Band   MFRM Analysis
Reading     Band 7.5                   Band 8                  n.a.
Writing     Band 7.5                   Band 8.5                Band 8.5
Listening   Band 8.5                   Band 8.5                n.a.
Speaking    Band 7.5                   Band 8                  Band 8

First, subjective judgments in each subtest based on an average of each panel's Band decisions are presented. Second, a descriptive analysis of score-equivalent Bands awarded in each subtest is presented; scores for reading and listening were converted to Bands according to the raw-score-to-Band conversion table supplied by Cambridge English Language Assessment; Bands for writing and speaking were allocated according to the standardized Bands awarded to the minimally acceptable script/speech sample. Finally, MFRM analysis of the data generated in the two productive skills subtests confirmed the score-equivalent decisions. Subjective judgments averaged for all categories of panels suggest that the preferred levels were Band 7.5 overall, with no skill lower than Band 7.5, but with listening at Band 8.5. In fact, most panels found it difficult to agree on an overall Band score and were reluctant to specify anything other than a profile. In the end, most agreed to an overall Band score by averaging the subtest scores. Since a mean score of 7.75 would be rounded down to Band 7.5, this means that the average of the subjective judgments was Band 7.5. Descriptive statistics analyzing the scores for reading and listening, and the scores awarded to writing and speech exemplars, present the following different profile of an MCC: Reading = Band 8, Writing = Band 8.5, Listening = Band 8.5, Speaking = Band 8. MFRM analysis of the writing and speaking scores confirms the findings of the descriptive, quantitative analysis. These scores can be averaged to produce an overall requirement of Band 8 since a mean score of 8.25 would be rounded down to Band 8.
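
The averaging and rounding described here (7.75 reported as Band 7.5, 8.25 as Band 8) amounts to taking the mean of the four subtest Bands and rounding it down to the nearest half Band. A minimal sketch of that rule as stated in this chapter (other published IELTS rounding conventions may differ):

```python
import math

def overall_band(subtest_bands):
    """Average the subtest Bands and round down to the nearest half Band,
    following the examples in this chapter (7.75 -> 7.5, 8.25 -> 8.0)."""
    mean = sum(subtest_bands) / len(subtest_bands)
    return math.floor(mean * 2) / 2

print(overall_band([7.5, 7.5, 8.5, 7.5]))   # subjective judgments -> 7.5
print(overall_band([8.0, 8.5, 8.5, 8.0]))   # score equivalents    -> 8.0
```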

Final Panel Deliberations and Recommendations Final Panel Composition and Procedures Representatives from each category of panel (two patients, two doctors, two nurses, one allied health professional, and one medical director) participated in a final confirmatory panel, the purpose of which was to review the decisions of the initial SFGs and make recommendations to the GMC.


Participants were presented with summaries of “can do” statements for all skills, summarized comments from all the initial panels, summaries of qualitative judgments, and quantitative analyses. Impact data in the form of required IELTS scores for registration of overseas doctors by medical councils in Australia, Canada, Ireland, New Zealand, and South Africa were presented and discussed, as were requirements for registration with nonmedical professional bodies within the United Kingdom. Language requirements of European Economic Area countries for overseas registration of doctors were also presented and discussed. Decisions were made regarding appropriate levels to recommend as follows: Listening—Band 8.5; Speaking—Band 8; Reading—Band 7.5; Writing—Band 7.5; Overall—Band 8. Recommendations to be offered to the GMC were noted, related back to the final panel, further discussed and confirmed.

Discussion of Problems Encountered in the Standard-setting Procedures Definition of MCC The first problem we encountered was with the patients’ panels (twenty participants) in that they were uncomfortable with the term minimally competent candidate. Many of them commented that they were not interested in having a minimally competent doctor—they wanted a very competent one. We explained that their task was to decide on the minimal level of competence in the language skills they would accept for a doctor, even a very competent doctor, but there was still discomfort with the term. Our solution was to drop the term MCC for the patients’ panels and ask them to determine what level of skill they would accept for a doctor, with reference to the “can do” statements they had previously determined. This appears to have been successful as the patients eventually produced Band requirements that were consistent with those of nurses, both of which were only half a Band different from the other three categories of panels.

Estimating the Probability of Obtaining a Correct Answer to an Item on IELTS Receptive Tests Previous research (e.g., McGinty, 2005; Plake, 2008) has shown that one of the difficulties panelists have with judgments is a confusion between prediction and value judgment, that is, between “will answer an item correctly” and “should answer an item correctly.” McGinty (2005) also notes that panelists often feel pressure to set high standards because they are concerned with how well people do their jobs. This would certainly seem to be the case in our study, especially with decisions made regarding the listening test. All panelists from every category insisted that a very high standard was required for the listening test based on what they believed doctors should be able to do.


Our solution to achieve an estimate of correct responses on the listening test was again to refer the participants back to the "can do" statements they had developed for the skill. We asked them to consider the cognitive requirements of each statement, to try to relate these to the listening test questions, and then to use that frame of reference to make their judgments. This was successful for one panel of doctors, predominantly psychiatrists, who eventually determined that Band 7.5 was the cut-score required. However, two panels (doctors and nurses) insisted on nothing less than Band 9, making the average subjective judgment across all panels Band 8.5.

Relating the Tasks in IELTS Productive Skills Tests to Doctors’ Real-life Use of Language As with the listening test, panelists had difficulty in making connections between the types of writing a doctor would need to do in real life and the type of writing required in Task 2 of the IELTS writing test. The range of Bands considered acceptable by each panel was the most diverse of all the subtests. One panel initially insisted that a script rated as Band 6.5 was acceptable, whereas another thought none of the scripts was acceptable and insisted that only Band 9 could be recommended. This highlights the difficulties encountered by the participants in judging the writing samples, and is also consistent with the contradictory findings for writing that emerged from the Banerjee (2004) study. Our solution to this problem was to encourage open and frank discussion among the panelists to ensure they all had the same understanding of the PLD s. All panels ultimately subjectively arrived at a recommendation of Band 7.5 for Writing.

Determining an Overall IELTS Band Score to Recommend All categories of panels, including the final confirmatory panel, were reluctant to recommend an overall IELTS Band level, preferring to ask for a profile of different Bands for different skills. This supports Chalhoub-Deville & Turner’s (2000) suggestion in terms of university admissions, that good selection practice should take into account not only an overall Band score, but also scores in the different skill areas. In terms of doctors, it could be argued that their oral/aural skills are more critical than their reading/writing ability. However, as an overall Band score is an integral part of IELTS reporting and as our initial research brief was to recommend an overall cut-score Band level, in addition to Band levels for each of the subtests, we suggested to the panels that if they did not recommend an overall cut-score, the GMC was almost certain to do it themselves by averaging the separate subtest scores. We then invited further discussion on this, with the result that the subjective judgment of all panels was that an overall cut-score of IELTS Band 7.5 should be recommended. With the exception of the listening cut-score, on which they were immoveable from Band 8.5, a subjective cut-score of Band 7.5 was agreed and recommended for Reading, Writing, Speaking, and overall.


Objective versus Subjective Legitimacy: Judgments versus Scores for Determining Cut-scores Subjective judgments averaged for all categories of panels suggested that Band 7.5 overall, with no skill lower than Band 7.5, but with listening at Band 8.5, were the preferred cut-scores. However, descriptive statistics analyzing the scores for reading and listening, and the scores awarded to writing and speech exemplars presented a different profile of a minimally competent candidate. Using the IELTS overall Band score system, these scores would be averaged to produce an overall requirement of Band 8. Clearly, with the exception of Listening, the subjective judgments and score equivalents for the other skills provide different cut-scores. The final panel discussed these differences at length, initially concluding, on the grounds of patient safety, that the cut-scores to be recommended should be based on the objective scores given rather than on the subjective decisions that had been made. However, the score equivalent requirement of Band 8.5 for writing, although supported by the MFRM analysis, was considered unrealistic. It could therefore be argued that the decision to use the objective scores to determine the cut-scores was, in fact, made subjectively. This is supported by the final decision to recommend an overall cut-score of Band 8, but to give prime importance to the oral/aural skills with recommended cut-scores of Band 8.5 for Listening and Band 8 for speaking, respectively. However, for the Reading and Writing cut-scores they decided to allow flexibility with recommended levels of Band 7.5 for each.

General Medical Council's Response to Recommendations The evidence from this research indicated that the current IELTS cut-score of Band 7 overall, with no separate skills score below Band 7, was no longer adequate. In July 2014, the GMC announced that the IELTS requirement for overseas doctors wishing to take the PLAB test would be Band 7.5 overall, with no skill score lower than Band 7. This decision may have been based on the impact data rather than on the research presented in this study. It merely brings the United Kingdom roughly into line with the New Zealand Medical Council's IELTS standards (Band 7.5 overall, Band 7.5 for Speaking and Listening, no lower than Band 7 for Reading and Writing), rather than raising those of the United Kingdom above any other English-speaking country, as recommended by Berry et al. (2013). If this is the case, it would confirm Skorupski's (2012) observation that cut-scores are often systematically lowered when impact data are provided.

Conclusion: The Way Forward This chapter has demonstrated that it is feasible to create a transparent and empirically sound approach to setting language standards. We have outlined a study commissioned to recommend appropriate IELTS Band scores to be obtained by overseas doctors as a first step in gaining recognition from the GMC for registration to practice medicine in the United Kingdom and Northern Ireland. The standard-setting procedures carried out have been described, and the problems encountered, as well as the solutions we adopted to counter them, have been explained. Any situation that has cut-scores associated with it must, of necessity, also have consequences, which in this case may mean that overseas-trained doctors are not able to practice medicine in the United Kingdom. The validity of the standard-setting procedure is therefore of paramount concern. Although we made every effort to find solutions to the problems we encountered in this study, there are a number of additional measures that could be considered for future standard-setting events in a language-related medical context.

Composition of Panels Based on the quality and depth of the discussions that took place in each panel, we believe that the separation of panels into categories representing different stakeholders was justified. However, none of the participants in any of the panels were experts in all aspects of the standard-setting procedures and none of them had any knowledge of the IELTS test. Although participants on the patients’ panels had considerable experience of interacting with professionals in medical settings, they had no specialist medical knowledge and no specific language expertise. The other stakeholder groups had differing degrees of medical expertise, but no specific language expertise. In a situation such as this, when trying to set language standards within a medical context, it may in future prove useful to include a number of participants with specific language expertise and/or IELTS experience on each of the panels in order to facilitate a shared understanding of the many issues to be dealt with.

Training of Panel Participants Research findings suggest that panelists often report problems understanding PLDs, the definition of the MCC, and even the standard-setting method (Impara & Plake, 1998; Papageorgiou, 2010; Plake & Impara, 2001). Several researchers have also commented on an overall lack of confidence on the part of panelists in the judgments they have made (Ferdous & Plake, 2005; Papageorgiou, 2010; Skorupski & Hambleton, 2005). This may have been at least partially the case in this study, as one of the participants in the final panel commented, "I wonder looking at us, I think we became more confident in our decision making as the process went on." This comment complements the findings of Skorupski & Hambleton (2005), who noted that panelists' confidence increases as the standard-setting activity progresses. It may therefore be useful to develop a measure that would assess participants' understanding of the standard-setting method, the MCC concept, and the PLDs. This could be used after training to assess participants' readiness to start the standard-setting procedure proper.


Materials Required for Standard-setting Procedures As stated at the beginning of this chapter, we rejected the idea of giving panel participants materials in advance. We still believe that, for this study, this was the right decision, particularly with regard to maintaining the confidentiality of the IELTS materials. However, given the complexities of setting IELTS standards with such diverse groups of stakeholders, it may be worth considering developing a specific package of training materials that does not contain confidential IELTS materials. This would not, of course, obviate the finding that participants often do not find enough time to complete training materials in advance. However, with careful wording it may be that most would feel obliged to at least read the materials before coming to the event. Since two days was the absolute maximum time available for all the procedures, and training took at least half a day out of that, this might perhaps be one way of extending the time available for the actual standard-setting event.

References

Banerjee, J. (2003). Interpreting and Using Proficiency Test Scores (Doctoral thesis). University of Lancaster, UK.
Banerjee, J. (2004). Study of the Minimum English Language Writing and Speaking Abilities Needed by Overseas Trained Doctors. Report to the General Medical Council. July 2004.
Berry, V., & O'Sullivan, B. (2016). Language standards for medical practice in the UK: Issues of fairness and quality for all. In C. Docherty & F. Barker (Eds.), Language Assessment for Multilingualism, Proceedings of the ALTE Paris Conference, April 2014, Studies in Language Testing volume 44. Cambridge, UK: UCLES/Cambridge University Press, pp. 268–285.
Berry, V., O'Sullivan, B., & Rugea, S. (2013). Identifying the Appropriate IELTS Score Levels for IMG Applicants to the GMC Register. Manchester, UK: General Medical Council. Available from http://www.gmc-uk.org/Identifying_the_appropriate_IELTS_score_levels_for_IMG_applicants_to_the. . ..pdf_55207825.pdf.
Biddle, R. (1993). How to set cutoff scores for knowledge tests used in promotion, training, certification, and licensing. Public Personnel Management, 22(1), 63–79.
Chalhoub-Deville, M., & Turner, C. (2000). What to look for in ESL admissions tests: Cambridge certificate exams, IELTS and TOEFL. System, 28(4), 523–539.
Cizek, G. J., & Bunch, M. B. (2007). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests. Thousand Oaks, CA: Sage.
Cizek, G. J. (2012). An introduction to contemporary standard setting: Concepts, characteristics, and contexts. In G. J. Cizek (Ed.), Setting Performance Standards: Foundations, Methods, and Innovations (2nd edn.). New York and London: Routledge, pp. 3–14.
Clauser, B. E., Harik, P., Margolis, M. J., McManus, I. C., Mollon, J., Chis, L., & Williams, S. (2009). An empirical examination of the impact of group discussion and examinee performance information on judgments made in the Angoff standard-setting procedure. Applied Measurement in Education, 22(1), 1–21.
Ferdous, A. A., & Plake, B. S. (2005). Understanding the factors that influence decisions of panelists in a standard-setting study. Applied Measurement in Education, 18(3), 223–232.
Fitzpatrick, A. (1989). Social influences in standard setting: The effects of social interaction on group judgments. Review of Educational Research, 59(3), 315–328.


Hambleton, R. K., Pitoniak, M. J., & Copella, J. M. (2012). Essential steps in setting performance standards on educational tests and strategies for assessing the reliability of results. In G. J. Cizek (Ed.), Setting Performance Standards: Foundations, Methods, and Innovations (2nd edn.). New York and London: Routledge, pp. 47–76.
Hein, S. F., & Skaggs, G. E. (2010). Conceptualizing the classroom of target students: A qualitative investigation of panelists' experiences during standard setting. Educational Measurement: Issues and Practice, 29(2), 36–44.
Hurtz, G. M., & Auerbach, M. A. (2003). A meta-analysis of the effects of modifications to the Angoff method on cutoff scores and judgment consensus. Educational and Psychological Measurement, 63(4), 584–601.
Hurtz, G. M., & Hertz, N. (1999). How many raters should be used for establishing cutoff scores with the Angoff method? A generalizability theory study. Educational and Psychological Measurement, 59(6), 885–897.
Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69–81.
Jaeger, R. M. (1989). Certification of student competences. In R. L. Linn (Ed.), Educational Measurement: Issues and Practice (3rd edn.). New York, NY: Macmillan, pp. 485–514.
Jaeger, R. M., & Mills, C. N. (2001). An integrated judgment procedure for setting standards on complex, large-scale assessments. In G. J. Cizek (Ed.), Setting Performance Standards: Concepts, Methods and Perspectives. Mahwah, NJ: Erlbaum, pp. 313–338.
Kaftandjieva, F. (2004). Reference Supplement to the Preliminary Pilot version of the Manual for Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Section B: Standard Setting. Strasbourg, France: Council of Europe. Available from: https://www.coe.int/t/dg4/linguistic/CEF-refSupp-SectionB.pdf. [10 April 2015].
Livingston, S., & Zieky, M. (1982). Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational Tests. Princeton, NJ: Educational Testing Service.
Loomis, S. C. (2012). Selecting and training standard setting participants: State of the art policies and procedures. In G. J. Cizek (Ed.), Setting Performance Standards: Foundations, Methods, and Innovations (2nd edn.). New York and London: Routledge, pp. 107–134.
Maurer, T. J., Alexander, R. A., Callahan, C. M., Bailey, J. J., & Dambrot, F. H. (1991). Methodological and psychometric issues in setting cutoff scores using the Angoff method. Personnel Psychology, 44(2), 235–262.
McGinty, D. (2005). Illuminating the "black box" of standard setting: An exploratory qualitative study. Applied Measurement in Education, 18(3), 269–287.
McNamara, T., & Roever, C. (2006). Language Testing: The Social Dimension. Malden, MA: Blackwell Publishing.
Papageorgiou, S. (2007). Setting Standards in Europe: The Judges' Contribution to Relating Language Examinations to the Common European Framework of Reference (Doctoral thesis). University of Lancaster, UK.
Papageorgiou, S. (2010). Investigating the decision making process of standard setting participants. Language Testing, 27(2), 261–282.
Plake, B. S. (2008). Standard setters: Stand up and take a stand! Educational Measurement: Issues and Practice, 27(1), 3–9.
Plake, S., & Impara, J. C. (2001). Ability of panelists to estimate item performance for a target group of candidates: An issue in judgmental standard setting. Educational Assessment, 7(2), 87–97.
Shohamy, E. (2001). The Power of Tests: A Critical Perspective on the Uses of Language Tests. Harlow, England, and New York: Longman.


Skorupski, W. P. (2012). Understanding the cognitive processes of standard setting participants. In G. J. Cizek (Ed.), Setting Performance Standards: Foundations, Methods, and Innovations (2nd edn.). New York and London: Routledge, pp. 135–147.
Skorupski, W. P., & Hambleton, R. K. (2005). What are panelists thinking when they participate in standard-setting studies? Applied Measurement in Education, 18(3), 233–256.
Van Avermaet, P., & Rocca, L. (2013). Language testing and access. In E. Galaczi & C. Weir (Eds.), Exploring Language Frameworks: Proceedings of the ALTE Kraków Conference, July 2011. Cambridge, UK: Cambridge University Press, pp. 11–44.


12 Fairness and Bias in Language Assessment

Norman Verhelst, Jayanti Banerjee, and Patrick McLain

ABSTRACT

The problem of fairness and bias in testing is complex and there are no widely accepted solutions. The field of language assessment (particularly foreign language assessment) typically applies a toolkit to detect whether a test functions differently in different populations. The most widely used techniques are known under the name Differential Item Functioning (DIF). This chapter provides an overview of these techniques and discusses their weaknesses. It shows that statistical power is gained using a generalized form of DIF called profile analysis. In this approach, investigators first hypothesize how DIF might occur (such as an interaction between item domain and test-taker age), then group items according to this hypothesis, and finally, apply profile analysis. The chapter demonstrates this approach on both artificial datasets and real data—a multilevel test of English language proficiency. The chapter then discusses the methodological and policy implications of profile analysis.

Fairness in Testing The Standards for Educational and Psychological Testing (henceforth referred to as Standards) was created to help promote good testing practices and provide a frame of reference to address issues relevant to test development, evaluation, and validation. It acts as a set of guidelines which test developers, users, and other stakeholders are encouraged to follow. Among the many issues addressed in the Standards, one of the most important relates to fairness in testing.


The 2014 Standards provide a set of twenty-one fairness standards, the first of which (standard 3.0) is intended to be viewed as a general guiding principle for test fairness. It states that the purpose of fairness in testing is to "identify and remove construct-irrelevant barriers" in order to maximize the performance of all examinees (AERA, APA, NCME, 2014, p. 63). The remaining twenty standards (3.1–3.20) provide more details as to how this can be done. They are divided into four clusters, each of which focuses on different aspects of test fairness, including standardization and accessibility. Two fairness standards pertain specifically to the fairness of test items. Standard 3.2 states that test developers are responsible for ensuring that tests measure the intended construct while minimizing the potential influence of construct-irrelevant characteristics. Standard 3.6 notes that when evidence is found that suggests test scores differ in meaning for different subgroups of the test population, test developers should work to determine the potential cause of the difference (AERA, APA, NCME, 2014). The fairness of an exam and its items is a primary concern in the field of language assessment. An important aspect of fairness in testing is to ascertain that no identifiable groups of test takers are advantaged or disadvantaged because some parts of the test require a skill that is irrelevant to the construct targeted by the test and on which two or more groups show systematic differences. If a test unfairly advantages or disadvantages a group of test takers, then the validity of the test results can be called into question. This aspect of fairness is closely tied to the construct validity of the test as it points to construct-irrelevant variance. It is therefore important for test developers to employ techniques to identify items, or groups of items, that may be unfairly biased against groups of test takers.

Differential Item Functioning Differential Item Functioning (DIF) analysis is commonly used in testing and assessment to identify items that may be biased against a group of test takers. The term DIF is used to describe items that function differently for groups of test takers with comparable ability levels. For example, many DIF studies have explored whether items favor one gender over another. Ideally, if boys and girls were matched on the scale representing the target construct of the test (i.e., language proficiency), girls and boys of the same language proficiency would have the same likelihood of getting any given item correct. If an item demonstrates DIF with respect to gender, then this means that girls and boys of the same language proficiency do not have the same likelihood of getting that item correct. In other words, the difficulty of the item is unequal for the two groups. This suggests that finding the correct answer on this item is driven by a variable other than the target construct, and that this extra determinant differs systematically in the two groups. There are two types of DIF (Pae & Park, 2006):

1 Uniform DIF, which occurs when an item differs in difficulty across groups (such as when an item is systematically more difficult for male test takers than for female test takers).


2 Non-uniform DIF, which occurs when an item differs in discrimination across groups (such as when an item discriminates better between weak and strong performers in the male than in the female population).

It is important to note, however, that DIF analysis is a necessary but not sufficient condition for identifying biased items (McNamara & Roever, 2006). Items that do not have DIF are not biased, but items that do have DIF require more analysis to determine the source. DIF is said to be the result of item bias if and only if the source of the DIF is "irrelevant to the test construct" (Camilli, 2006, p. 234). Importantly, if the source of the DIF is relevant to the test construct, then the item is not biased, and the DIF is said to be the result of item impact (Zumbo, 2007)—that is, true differences between test taker groups in an underlying ability targeted by the test.
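
To make the two definitions concrete, they can be expressed in terms of group-specific item response functions. The sketch below uses a two-parameter logistic parameterization purely for illustration; it is standard IRT background rather than a formula taken from this chapter.

```latex
P_g(X = 1 \mid \theta) \;=\; \frac{\exp\{a_g(\theta - b_g)\}}{1 + \exp\{a_g(\theta - b_g)\}},
\qquad g \in \{\text{Reference},\ \text{Focal}\}.
```

Uniform DIF corresponds to b_R ≠ b_F with a_R = a_F (the two curves are shifted but do not cross); non-uniform DIF corresponds to a_R ≠ a_F (the curves differ in slope and can cross); no DIF means both parameters are equal across the groups.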

Approaches to DIF analysis Camilli (2006) states that there are two primary approaches to DIF analysis: item response theory (IRT) methods and observed score methods. IRT-based DIF analysis requires that the sample size is large enough to obtain accurate estimates of the IRT parameters for the groups of interest and that the data fit the selected IRT model. Model fit is typically assessed through analysis of item and person fit (e.g., using measures such as infit and outfit mean square) by looking at the individual test items and candidates. In general, IRT-based DIF analysis is done through model comparison of the groups' item response functions (IRF) (Camilli, 2006; McNamara & Roever, 2006). In a one-parameter IRT or Rasch model setting, the only potential difference between the two IRF functions is one parameter (threshold, also known as item difficulty). The DIF analysis is then conducted on the parameter estimates using a two-sample t-test. In the case of a two- or three-parameter IRT model, there are more parameters (i.e., threshold, discrimination, guessing), so the analysis becomes more complex. For these models, DIF analysis can be performed using the likelihood ratio test or by examining the difference between the IRT parameters using multivariate methods. Observed score methods are similar to IRT-based methods, but they have different underlying measurement models that the data need to fit, and they tend to have less restrictive requirements on sample size (Camilli, 2006). One of the most commonly used techniques for detecting DIF using observed scores is known as the Mantel-Haenszel procedure. This method uses a common odds ratio to obtain a measure of DIF effect size that can be interpreted as the relative likelihood that two groups of test takers with similar ability levels would answer an item correctly (Camilli, 2006). The Mantel-Haenszel odds ratio is often converted to the delta scale, which was developed by the Educational Testing Service, to help classify items based on DIF size (Camilli, 2006). A second observed score method, known as the standardized difference or standardization method, obtains a weighted measure of the average difference in percent or proportion correct for test takers of similar ability levels (Camilli, 2006). Like the Mantel-Haenszel procedure, the measure provided by the standardization method also provides a measure of DIF effect size. Another observed score technique used in DIF analysis is known as differential bundle functioning (DBF). DBF is used to determine "whether two distinct groups with equal ability differ in their probability of answering a bundle of items correctly" (Banks, 2013, p. 43). A computer program known as the Simultaneous Item Bias Procedure (SIBTEST) can be used to perform DIF analyses on bundles of items in addition to single items (Shealy & Stout, 1993a, 1993b). This method of DIF analysis is useful because, by allowing for the analysis of several items at once, it takes the same approach as profile analysis. However, SIBTEST and profile analysis approach the calculation of DIF in different ways. We explore the differences between these two techniques later in the chapter.
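
As an illustration of the observed-score approach, the following sketch computes the Mantel-Haenszel common odds ratio and its ETS delta transformation for a single item from 2 × 2 tables formed at each matching score level. The counts are invented; the formulas are the standard Mantel-Haenszel ones rather than anything specific to the studies cited here.

```python
import math

def mantel_haenszel(tables):
    """Mantel-Haenszel common odds ratio and ETS delta for one item.

    `tables` holds, for each matching score level k, the counts
    (A_k, B_k, C_k, D_k) = (reference correct, reference incorrect,
                            focal correct,     focal incorrect).
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    alpha = num / den                   # common odds ratio
    delta = -2.35 * math.log(alpha)     # ETS delta scale
    return alpha, delta

# Hypothetical counts at three matching score levels for one item.
tables = [(40, 10, 30, 20), (60, 15, 50, 25), (80, 5, 70, 15)]
alpha, delta = mantel_haenszel(tables)
print(round(alpha, 2), round(delta, 2))  # an odds ratio above 1 favours the reference group here
```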

DIF Research in Language Assessment Literature DIF studies and analyses are common in language assessment research and literature. Several researchers have made use of IRT-based DIF analysis. Takala & Kaftandjieva (2000) and Pae & Park (2006) investigated gender DIF on an English vocabulary test and an English reading test, respectively. These studies utilized different methods to identify items with DIF. Takala & Kaftandjieva (2000) tested for differences in the threshold parameter, while Pae & Park (2006) used the likelihood ratio test. Geranpayeh et al. (2007) also used the likelihood ratio test in their investigation of age-related DIF on the listening section of an English exam. Other researchers have used observed score methods for their DIF analysis. Harding (2012), for example, used both the Mantel-Haenszel procedure and the standardization method to investigate a shared L1 advantage on an English listening test featuring speakers with L2 accents. Abbott (2007) used SIBTEST to investigate DIF and DBF of reading test items for two different native language groups. A persistent problem in DIF analysis is determining the source of DIF in the items. When DIF studies are conducted as an exploratory analysis, the first step is to analyze all items for DIF and the second step is to review the content of DIF items to try and determine why DIF has occurred. This is unfortunately complicated by the fact that items do not necessarily show "obvious signs of bias" (Takala & Kaftandjieva, 2000, p. 330). As a result, even expert judges are often unable to identify the DIF source (Angoff, 1993). Geranpayeh & Kunnan (2007) present an excellent example of this phenomenon. In their study of DIF in terms of age, they identified three age groups: 17 and younger, 18 to 22, and 23 and older. They engaged a panel of five content experts to judge each item for bias in favor of or against one or more of the age groups. Taking each age group in turn, the panel rated each item on a 5-point Likert scale (from strongly advantage to strongly disadvantage). This resulted in three judgments per item—one for each age group. Additionally, the content experts were asked to comment on item features that they believed would cause bias in favor of or against the age groups. Their results were far from conclusive; the experts' judgments only corresponded with the DIF analysis in the case of one item, suggesting that the item features that were salient for the content experts were not salient in affecting test-taker performance.

Confirmatory Approach to DIF Analysis One solution to this conundrum might be to take a confirmatory approach to the DIF analysis by first developing hypotheses as to what features of an item would cause DIF among groups. In their study of the listening section of the Michigan English Test (MET®, http://www.cambridgemichigan.org), Banerjee & Papageorgiou (2011) hypothesized that items in the occupational domain would be relatively more difficult for test takers with no workplace experience. The MET® is a general language proficiency test designed for test takers at A2–C1 of the CEFR (high-beginner to low-advanced). Because the target context of the test is general language, the items are situated in a range of domains including personal, occupational, and educational. It stands to reason, however, that younger test takers would be unfamiliar with language use in the workplace and could be disadvantaged by the occupational domain items. Two questions were central in this study:

1 Would young test takers—with no occupational experience—be disadvantaged by occupational domain items in comparison to older test takers?
2 If young test takers are disadvantaged by the occupational domain items, should these items remain in the test or should they be removed?

The data were collected in an incomplete design: a total of 133 items were tested on four different test forms, each containing forty-six items. The test forms had a number of items in common to make an IRT analysis possible. The items belonged to four different domains: personal, public, educational, and occupational; forty-two items belonged to the occupational domain, and ninety-one to the three other domains (personal, public, and educational). Because this study focused on the effect of age on performance in occupational domain items, it was important to have unimpeachable definitions of age. In particular, it was important to delimit two groups, one of which was highly unlikely to have work experience (group 1) and another which was highly likely to have work experience (group 3). A middle group was defined in order to account for test candidates who fell between the two categories and whose performance profiles might not be as revealing. The resulting definitions were:

Group 1: Test candidates younger than seventeen years old who were below the legal school-leaving age and who could be expected to have little or no workplace knowledge.
Group 2: Test candidates between seventeen and twenty-six years old who were above the legal school-leaving age, but who might not have entered the workforce, perhaps because they are studying at a university. This group may have some workplace knowledge by virtue of holiday jobs or part-time jobs.
Group 3: Test candidates older than twenty-six who had definitely entered the workforce and who could be expected to have substantial workplace knowledge.

Table 12.1 shows the distribution of candidates across age groups.

TABLE 12.1 The Three Age Groups

Group     Age Range        Number of Candidates
Group 1   < 17             460
Group 2   ≥ 17 and ≤ 26    1756
Group 3   > 26             645
Total                      2861

DIF analyses, as influenced quite strongly by the work of Holland & Thayer (1988), commonly distinguish two groups of test takers; one is called the Reference group; the other, the Focal group. The Focal group is usually the group for which there is a suspicion of DIF. This is relatively straightforward when investigating the effect of gender or first language on language test performance. It is generally accepted that a test taker is either male or female or that a test taker can only have one first language. However, in the Banerjee & Papageorgiou (2011) study, three groups were distinguished to have a clear contrast between test takers who probably had no workplace experience and those who probably did have workplace experience. This required the exclusion of a group of test takers (group 2) who may or may not have had workplace experience. An overview of the results is given in Table 12.2.

TABLE 12.2 Number of Items Demonstrating DIF, Significant at 5 percent Level

                     Occupational Items   Other Items
Easier for Group 1   1                    9
Easier for Group 3   3                    4

Some Problems with DIF Analysis Banerjee & Papageorgiou's (2011) results are appropriate for explicating the issues involved in DIF analyses. First, seventeen out of the total of 133 items showed significant DIF, but only four of the forty-two occupational domain items showed DIF, including one in the reverse direction of the hypothesis. Furthermore, a cross-validation of these findings by analyzing the four test forms separately revealed that no items demonstrated DIF in all the test forms in which they appeared. Consequently, any attempt to explain these results risks pure speculation. Second, DIF is susceptible to two types of statistical decision errors (usually referred to as Type I and Type II errors). The first (Type I) error occurs if the null hypothesis ("there is no DIF") is true, and yet we find a significant result that makes us say that there is DIF for this item. The probability of this occurring (if the null hypothesis is true) equals the significance level. So this means that, in 5 percent of the cases (approximately seven items), even if there is no DIF at all, we can expect to get a significant result. This makes the interpretation of our findings in Table 12.2 yet more problematic: almost a third of the seventeen significant cases might have occurred by accident (i.e., they would probably not occur again in an independent sample), but we cannot know which ones they are from the current analysis. The second (Type II) error occurs if the null hypothesis is false ("there is DIF"), but we do not find significance; that is, there is DIF for the item but the statistical test fails to identify it. The probability of such an error depends on a number of factors. The most important ones are the sample size and the seriousness of the "falseness" of the null hypothesis (i.e., the effect size). Even the most trivial effect will be statistically significant if the sample size is large, and conversely, when a dataset is small, it is less likely that even a nontrivial effect will be found statistically significant. This may be the issue in Banerjee & Papageorgiou's (2011) work. So the initial hypothesis might be true (that the younger test takers have a lower probability of getting the occupational items right), but the difference might be so small that significance is not reached in many of the DIF tests. In technical terms, one says that the test has low statistical power. In summary, there are clearly problems with the approach taken. The hypothesis was formulated in terms of a category of items (occupational items), but the investigation was at the level of individual items rather than at the level of the category. There are forty-two occupational items in the dataset (distributed across four test forms). A small effect at the item level might add up to a serious effect in a test form with a substantial number of items of this category.
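
The "approximately seven items" figure follows directly from the significance level and the number of items tested:

```latex
E[\text{false positives}] \;=\; \alpha \times k \;=\; 0.05 \times 133 \;\approx\; 6.7 \;\approx\; 7 \ \text{items}.
```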

Artificial Example Data were generated for 1,000 "young" candidates (group 1) and 1,000 "older" candidates (group 3), with abilities drawn from a normal distribution with a mean of zero and an SD of 0.8. The test consists of thirty items, twelve from the occupational domain and eighteen from the other domains. The data were generated using the Rasch model, and the item parameters are displayed in Figure 12.1. The horizontal axis represents test-taker ability. The same axis also indicates the item difficulty parameter: the higher the value, the more difficult the item. For each group, the twelve hollow diamonds represent the occupational items and the seventeen filled diamonds represent the other (nonoccupational) items. The middle filled diamond represents two items with the same difficulty. The item parameters used for the young group are in the upper row of the figure. The parameters for the older group are given in the lower row. As the dashed lines indicate, the occupational items are systematically a little bit easier (0.15 points on the scale) for group 3 than for group 1, while the other items have pair-wise the same difficulty in both groups. In other words, these data are completely in accordance with the hypothesis advanced by Banerjee & Papageorgiou (2011).

FIGURE 12.1 Item Parameters as Used for the Generation of the Data.

The DIF analysis on these artificial data yielded four significant DIF values, two occupational items and two other items. But for a good understanding of DIF, it should be clear that it is impossible to discover from an analysis of the data how the data were constructed. In our DIF analysis, the data for the two groups were analyzed separately (using the Rasch model), and this means that two scales are constructed that have an arbitrary origin. So, the estimates of the item parameters in either one of the groups may be shifted by an arbitrary amount. Now, if we shift the parameters in group 3 by an amount of +0.15 (to the right), we obtain the item representation as given in Figure 12.2. From the dashed lines, it is clear that the occupational items (the hollow symbols) now have pair-wise the same value in both groups, while the other items are systematically more difficult in group 3, an interpretation completely opposite to the one we used to generate the data. This means that we cannot know from the analysis which of the two interpretations is correct. We may say that one interpretation is more plausible than the other, but the plausibility does not follow from the analysis. In fact, there are infinitely many different representations possible, all of them statistically equivalent. This means that we have to be careful with interpretations.

FIGURE 12.2 Configuration of Item Parameters, Equivalent to Figure 12.1.

A way to interpret DIF without inferring any special cause or reason is by considering pairs of items: either homogeneous pairs (where both items are from the same domain category—occupational or nonoccupational) or heterogeneous pairs (where the items are from different domain categories). There are two pairs circled in Figure 12.3, one for each age group. Each item in the circled pair belongs to a different category, and each pair falls in the same difficulty position relative to other items for that group, that is, the fifth nonoccupational item and the third occupational item on the ability axis. It is clear that comparing the difficulty of these two items is not univocal: in the population represented by group 3, they are of equal difficulty, but in the group 1 population, the occupational item is more difficult. What we can say in general is this: for any heterogeneous pair of items, the difference between their difficulties is not the same in both age groups. In contrast, for any homogeneous pair (two occupational items, or two nonoccupational items) the comparison (i.e., the difference of the difficulty) is identical across the two groups.

FIGURE 12.3 Comparison of Items in Homogeneous and Heterogeneous Pairs.

In Figure 12.3, we consider two examples. The first one consists of the pair of occupational items (hollow diamonds) from which a vertical dashed line starts. The geometric figure formed by the hollow diamonds of these items in the two groups is a rectangle, so that it is immediately clear that the difference in difficulty between the two items is the same in both populations. The second example is given by the pair of items represented by the filled diamonds at the endpoints of the double arrows: these arrows have the same length, and therefore, the difference between the difficulties of these two items is the same in both populations. In more general and abstract terms, then, one can say that it is not meaningful to speak of DIF with respect to a single item or a category of items. One should always refer to a comparison item or category of items. In the example, we can say that for the younger group, the occupational items are relatively, that is, in comparison to the nonoccupational items, more difficult than for the older group (group 3). This means we cannot speak of the occupational items without simultaneously considering the nonoccupational items.
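
A minimal sketch of how an artificial dataset of this kind can be generated under the Rasch model. The specific item difficulties below are invented (the values behind Figure 12.1 are not listed in the chapter); only the design follows the description above: the same ability distribution in both groups and occupational items that are 0.15 logits harder for the younger group.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PER_GROUP = 1000
N_OCC, N_OTHER = 12, 18                 # occupational and other (nonoccupational) items

# Hypothetical item difficulties in logits; the actual values behind Figure 12.1 are not given.
b_occ = np.linspace(-1.5, 1.5, N_OCC)
b_other = np.linspace(-2.0, 2.0, N_OTHER)

def simulate(theta, b):
    """Generate 0/1 responses under the Rasch model:
    P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

# Both groups have the same ability distribution (mean 0, SD 0.8), as in the chapter.
theta_young = rng.normal(0.0, 0.8, N_PER_GROUP)   # group 1
theta_old = rng.normal(0.0, 0.8, N_PER_GROUP)     # group 3

# Occupational items are 0.15 logits harder for the young group; other items are identical.
data_young = np.hstack([simulate(theta_young, b_occ + 0.15), simulate(theta_young, b_other)])
data_old = np.hstack([simulate(theta_old, b_occ), simulate(theta_old, b_other)])

# Proportion-correct differences on the occupational items tend to be small and negative.
print((data_young.mean(axis=0) - data_old.mean(axis=0))[:N_OCC])
```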

Profile Analysis Like differential bundle functioning, profile analysis is a technique to investigate differential functioning of several items at the same time. It is, in some respects, similar to the technique developed by Shealy & Stout (1993a, 1993b) for their computer program SIBTEST, but it also has some unique characteristics:

● SIBTEST assumes that all items in the test measure the target construct of the test. Profile analysis only requires the definition of two, three, or four meaningful characteristics of items (such as domain). It is neutral with respect to the concept of test construct.
● SIBTEST, like most DIF techniques, only allows for two groups of students. Profile analysis allows for an arbitrary number of groups.
● SIBTEST is a nonparametric technique—it does not presuppose an IRT model where parameters have to be estimated. Profile analysis, on the other hand, assumes that the test data have been analyzed with an IRT model; the parameter estimates of the items (and in some cases, also of the candidates) are required to carry out the analysis.

Definitions Before explaining how profile analysis is applied in item functioning investigations, some definitions are helpful.

Observed profile When the items of a test are partitioned into m (≥2) classes or categories, we can determine (by simple counting) the observed score of a single test taker on all the items jointly in each category. An observed profile is the m-tuple of these partial scores. Looking at Table 12.3, where there are two categories of items, A and B, the observed profile of a single test taker would comprise his or her score on all the items in each category and be presented as follows: (number Category A correct, number Category B correct) or (4, 2). The sum of the partial scores is the test score by definition.

Expected profile If one can treat the item parameters as known (because one has a good estimate of them), then one can use the mathematics of the measurement model to compute the conditional expected subscore in each category (see Verhelst, 2012a, for an explanation of the computation). An example is given in Table 12.3. The topmost row represents the observed profile of some test taker whose total test score is 6. Two categories of items are used, called A and B, and the test taker obtains a partial score of 4 on the category A items and a partial score of 2 on the B items. The parameters of the measurement model can be used to compute the conditional expected subscore in each of the two categories. By conditional is meant "given the total test score of this test taker," and by expected one means "on the average." Combining these two requirements, we can understand the conditional expected subscore to be the average subscore that a very large sample of test takers, all with the same total score (of six in the example), would obtain in each category. These two subscores, taken jointly, form the expected profile. It is clear that the sum of the conditionally expected subscores equals the total score. Written as an ordered pair, the expected profile is (4.406, 1.594).

TABLE 12.3 Observed, Expected and Deviation Profile

            Category A   Category B   Sum
Observed    4            2            6
Expected    4.406        1.594        6
Deviation   −0.406       +0.406       0
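
The chapter refers to Verhelst (2012a) for the computational details. As general background (a sketch of the standard Rasch-model result, not a formula quoted from the chapter), the reason this conditional expectation is well defined is that, given the total score r, the probability of a response pattern no longer depends on the test taker's ability:

```latex
P(X = x \mid r) \;=\; \frac{\prod_i \varepsilon_i^{\,x_i}}{\gamma_r(\varepsilon)},
\qquad \varepsilon_i = \exp(-b_i), \quad r = \sum_i x_i,
\qquad
E\Big[\textstyle\sum_{i \in A} X_i \,\Big|\, r\Big] \;=\; \sum_{x\,:\,\sum_i x_i = r} \Big(\sum_{i \in A} x_i\Big)\, P(X = x \mid r),
```

where γ_r(ε) denotes the elementary symmetric function of order r of the ε_i.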

Deviation profile The deviation profile is the difference between the observed and the expected profile (see Table  12.3). Notice that the sum of the deviations is always zero. Written as ordered pairs, we find that the deviation pair is given by (4, 2) – (4.406, 1.594) = (−0.406, +0.406).
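
To make the observed/expected/deviation bookkeeping concrete, here is a small brute-force sketch. It assumes the Rasch model with known item difficulties; the seven difficulties, the two-category partition, and the observed profile below are all invented, so the numbers will not reproduce Table 12.3 or Table 12.4. The last few lines also preview the Mathematics and Statistics subsection: they give the conditional distribution of the category A subscore and the one-sided probability of a result at least as extreme as the observed one.

```python
from itertools import product
from math import exp

# Hypothetical known Rasch difficulties; the 'A'/'B' labels partition the items.
ITEMS = [("A", -1.2), ("A", -0.5), ("A", 0.3), ("A", 0.9),
         ("B", -0.8), ("B", 0.1), ("B", 0.7)]

def conditional_patterns(total):
    """All response patterns with the given total score, with their conditional
    probabilities P(x | total); under the Rasch model these do not depend on ability."""
    weights = {}
    for x in product((0, 1), repeat=len(ITEMS)):
        if sum(x) == total:
            weights[x] = exp(-sum(xi * b for xi, (_, b) in zip(x, ITEMS)))
    norm = sum(weights.values())
    return {x: w / norm for x, w in weights.items()}

def subscore_a(x):
    """Number of correct answers among the category A items in pattern x."""
    return sum(xi for xi, (cat, _) in zip(x, ITEMS) if cat == "A")

observed = (2, 2)                                  # observed profile: (category A, category B)
total = sum(observed)
dist = conditional_patterns(total)

expected_a = sum(p * subscore_a(x) for x, p in dist.items())
expected = (expected_a, total - expected_a)        # expected profile
deviation = (observed[0] - expected[0], observed[1] - expected[1])
print(expected, deviation)

# Distribution of the category A subscore given the total score, and the one-sided
# probability of a result at least as extreme as the observed one.
subscore_dist = {}
for x, p in dist.items():
    subscore_dist[subscore_a(x)] = subscore_dist.get(subscore_a(x), 0.0) + p
tail = sum(p for a, p in subscore_dist.items() if a <= observed[0])
print(subscore_dist, round(tail, 3))
```

With these invented difficulties the category A subscore comes out a little below its conditional expectation, so the deviation profile has the same sign pattern as the worked example in Table 12.3.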

How Profile Analysis Works

Profile analysis is the study of the deviation profiles. The general question is whether we can learn something new or original about individual test takers, or groups of test takers, by considering (only) the deviation profile. It may be useful to try to understand first what we cannot infer if only the deviation profile is given:

● We cannot infer whether there are more category A items in the test than B items.

● We cannot infer whether the A items are on average more difficult than the B items.

● We cannot infer whether the observed score on the A items is higher or lower than the observed score on the B items; the same holds for the expected scores.

● We cannot infer from the deviation profile the total number of items in the test, and hence, we cannot infer whether the total test score is high or low.

Summarizing, we could say that all the things one usually reports from an analysis of test data are stripped away when we consider only deviation profiles. This means that if something interesting is left, it is probably an aspect that is not noticed in the common reports. From Table 12.3, it is clear that the observed score on category A is lower than expected, while (by necessity) it is higher than expected for the items of category B. This can be an interesting finding at the individual level, but in the context of this chapter, it is more interesting to make group comparisons.

Mathematics and Statistics

At the heart of profile analysis is the enumeration of all possible profiles that could result in a given test score, and the determination, for each profile, of the probability that it will occur (see Verhelst, 2012a, for details; a computer program and a manual are available online, Verhelst, 2012b). Table 12.4 presents a simple example for a test score of 6. Since a (partial) score cannot be negative, the enumeration of the possible observed profiles is very simple: The first value of the profile can take the values 0, 1, . . ., 6 and the second value is 6 minus the first value. The seven possible profiles are listed in column one of Table 12.4. In the second column, the first value (called “X1”) of the profile is listed separately. In column three (labeled “prob”), the probability of each profile is listed (see Verhelst, 2012a, 2012b, for details). In the final two columns, the operations are given to compute the mean (X̄) and the variance.


TABLE 12.4 Distribution of Profiles and Computation of Mean and Variance

Profile   X1    prob     X × prob   (X − X̄)² × prob
(0,6)     0     0.001    0.000      0.010
(1,5)     1     0.002    0.002      0.023
(2,4)     2     0.026    0.052      0.151
(3,3)     3     0.141    0.423      0.279
(4,2)     4     0.348    1.392      0.057
(5,1)     5     0.361    1.805      0.127
(6,0)     6     0.122    0.732      0.310
Sum                      4.406      0.957
                         (Mean)     (Variance)

Their values are given in the bottom row. The mean of the second value of the profile is just 6 − 4.406 = 1.594, and the variance is the same as for the first value.

Suppose a test taker has an observed profile equal to (4, 2); then his or her deviation profile equals (see Table 12.3) (−0.406, +0.406), meaning that this test taker performs worse than expected on the items of the first category. But the important question to answer is how serious this deviation from the expected value is. Is it “small,” occurring by accident, or is it so serious that we do not believe or accept that such a deviation is in accordance with the measurement model we use? Answering this question requires a statistical test. Obtaining a result that is at least as bad as the one we are considering means obtaining a profile with a partial score on the first element of the profile of not more than 4, that is, 0, 1, . . ., 4, and the probability that this happens is the sum of the five topmost probability values in Table 12.4, giving a total of 0.517. Such a probability is not particularly small, and therefore, this statistical test does not provide a reason to worry about a student with an observed profile of (4, 2). In other words, the profile of this test taker does not indicate that the items are biased.

If we are interested in how categories of items have functioned, then the preceding procedure is not efficient, as it uses only the answers of a single student. It is more efficient to use data from groups of test takers, and the adaptation of the testing procedure is as follows:

1 For each test taker in the group, one computes a table like Table 12.4. Note that the tables will be different if the test scores of the students are different. From the table and the observed profile, we compute the deviation dij of an arbitrary value i in the observed profile of student j. To fix the ideas, we just take the first value, i.e., i = 1. The variance in the table will be indicated as σ²ij. Note that the variance depends not only on the value of i (the category under consideration), but also on the total test score of the test taker j.

2 One computes the average deviation in the group,

   d̄i = (1/n) Σj dij,

   where j indexes the test takers of the group, dij is the deviation value of the j-th test taker for the i-th element in the deviation profile, and n is the number of test takers in the group. For example, suppose there is a small group of three test takers, all obtaining a test score of 6, so we can use Table 12.4. Their observed partial scores on the first category of items are 2, 3, and 5. From Table 12.4, we obtain d11 = 2 − 4.406 = −2.406, d12 = 3 − 4.406 = −1.406, and d13 = 5 − 4.406 = 0.594. The average deviation (for the first value of the profile) is d̄1 = −1.0727.

3 Since the test takers answer the items independently of each other, the variance of the sum equals the sum of the variances, Var(Σj dij) = Σj σ²ij, whence it follows that

   Var(d̄i) = (1/n²) Σj σ²ij.

   In the example, we find Var(d̄1) = (3 × 0.957)/3² = 0.319.

4 In the final step, we do not need the tables any longer because we can have recourse to a fundamental result of theoretical statistics, the central limit theorem. This states that, if the sample size n becomes large, then the quantity

   z = d̄i / √Var(d̄i)

   follows the standard normal distribution. Using a significance level of 5 percent, if the absolute value of z is larger than 1.96, we cannot accept the null hypothesis for this group. The square root of Var(d̄i) is called the standard error of the average deviation. In the example, the standard error of the (first) average deviation is √0.319 = 0.565, and the z-value is −1.0727/0.565 = −1.899. In this case, however, we cannot have recourse to the normal distribution because the sample size is trivially small.
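As a minimal sketch of steps 1 through 4, the code below reproduces the worked example using only the numbers given in the text: because all three test takers happen to have the same total score of 6, the conditional mean (4.406) and variance (0.957) from Table 12.4 apply to each of them. In a real analysis these two quantities would be computed separately for every test taker's total score.

```python
import math

# Worked example from the text: three test takers, all with total score 6,
# with observed category A subscores of 2, 3, and 5.
observed_subscores = [2, 3, 5]
cond_mean, cond_var = 4.406, 0.957   # E[X_A | total = 6] and Var[X_A | total = 6] from Table 12.4

deviations = [x - cond_mean for x in observed_subscores]   # d_1j for j = 1, 2, 3
n = len(deviations)
d_bar = sum(deviations) / n                                # average deviation
var_d_bar = n * cond_var / n**2                            # (1/n²) times the sum of the variances
se = math.sqrt(var_d_bar)                                  # standard error of the average deviation
z = d_bar / se

print(round(d_bar, 4), round(se, 3), round(z, 3))          # -1.0727 0.565 -1.899
```

With only three test takers the normal approximation is of course not trustworthy, as noted above; the z-value is shown here only to make the arithmetic explicit.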


Applying Profile Analysis

Table 12.5 displays the results for the artificial example. In each line, the cell with the standard error gives the standard error for both average deviations; in the case of two categories, they are equal. The average deviations themselves are displayed in the two leftmost cells in each line. We see that for the younger group (group 1), the test takers perform less well than expected on the occupational items, and that this deviation from zero is significant at the 5 percent level, but not at the 1 percent level; the z-value is less than 2.58 in absolute value. We find the complementary conclusion for group 3: they perform better than expected on the occupational items. But the stronger result follows when the two average deviations are compared to each other, and this result is given in the bottom row of the table: The difference is 0.087 − (−0.087) = 0.174, and the corresponding z-value is 3.358, which is highly significant. This difference means that the older group performs better than the younger group on the occupational items. In other words, the occupational items are easier for the older group than for the younger group, and, as an automatic consequence, the reverse holds for the other (nonoccupational) items. The standard error of the difference is computed as the square root of the sum of the squared standard errors of the two groups, √(0.0367² + 0.0364²) = 0.0517, and the statistical test for the other items yields a z-value of −0.174/0.0517 = −3.366.

TABLE 12.5 Results of the Profile Analysis for the Artificial Example

Group        Occupational   Other     Standard Error   z-value
Group 3      0.087          −0.087    0.0367           2.376
Group 1      −0.087         0.087     0.0364           −2.373
Difference   0.174          −0.174    0.0517           3.366

Before we proceed, a technical remark is in order: The two averages in the same row sum to zero, and this is a consequence of the definition of a deviation profile. But in Table 12.5, the two averages in the same column also sum to zero, and this is caused by the fact that, in the artificial data, the two groups are of equal size.1 Such a scenario is unlikely to be observed with real data unless the number of test takers in each group is approximately the same.

Having explored the concepts with the artificial data, we now turn to the MET® dataset. The results of the profile analysis for the MET® are displayed in Table 12.6. Note that a column has been added with the number of test takers in each group (n). Here we see that (unlike Table 12.5) the sum of the averages across groups is not zero. Moreover, we see that the larger the group, the closer the average deviation is to zero, and this is a consequence of the parameter estimation: The middle group outnumbers the two other groups, and this means that the parameter estimates will automatically be influenced most by the characteristics of this group and less by the two other groups. If we work with unequal groups, then the following rule applies: The weighted average of the deviations across groups is zero, where the weights are the number of test takers per group.


TABLE 12.6 Results of the Profile Analysis of the MET® Data (1)

Group                      n      Occupational   Other     Standard Error   z-value
Less than 17               459    −0.215         0.215     0.0652           −3.300
From 17 to 26              1741   0.021          −0.021    0.0334           0.625
Older than 26              638    0.101          −0.101    0.0546           1.857
Difference (<17 vs. >26)          −0.316         0.316     0.0850           −3.718

Once we understand this, it will be clear that considering the z-value per group is not a very wise strategy. It would lead to the conclusion that the occupational items are more difficult for the younger test takers, but not (significantly) easier for the middle group or the older test takers, which is contrary to the approach taken above. So, the essential ingredient in the approach is to compare groups. In the present case, this may be done in two ways. In the first approach, we just leave out the middle group and take differences between the first and the last group. The resulting z-value is highly significantly different from zero, indicating that the younger group is performing less well on the occupational items than the older group, and consequently, better on the other items. In the second approach, we can use an interpretation from Table 12.6, saying that the middle group is more like the older group than the younger one in the average deviation profile. Hence, we might use all data and run a new profile analysis with two groups, where the original middle and older groups are merged into a single group. The results of this analysis are given in Table 12.7. The conclusion from this analysis is much the same as the one from Table 12.6.

This example demonstrates the promise of profile analysis in identifying an explanation for differential item functioning. Profile analysis looks at groups of items (rather than looking at each item in isolation), and can therefore simultaneously account for the small effects of each item in a category. In the artificial example as well as in the real-life example, the differential functioning of two categories of items in different age groups was convincingly shown.

TABLE 12.7 Results of the Profile Analysis of the MET® Data (2)

Group          Occupational   Other     Standard Error   z-value
Less than 17   −0.215         0.215     0.0652           −3.300
More than 17   0.043          −0.043    0.0285           1.490
Difference     −0.258         0.258     0.0712           −3.621
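The z-value in the Difference row follows from the independence of the two groups: the difference of the average deviations is divided by the square root of the sum of the squared standard errors. A minimal sketch, using the rounded values reported in Table 12.7:

```python
import math

# Average deviations on the occupational items and their standard errors,
# taken (rounded) from Table 12.7.
d_younger, se_younger = -0.215, 0.0652   # Less than 17
d_older, se_older = 0.043, 0.0285        # More than 17

diff = d_younger - d_older                          # -0.258
se_diff = math.sqrt(se_younger**2 + se_older**2)    # about 0.0712
z = diff / se_diff                                  # about -3.63 (Table 12.7 reports -3.621 from unrounded inputs)

print(round(diff, 3), round(se_diff, 4), round(z, 2))
```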


Conclusion

In this chapter, we have engaged with the concepts of fairness and bias. We have explained how bias in the underlying content/context of the test can result in differential item functioning. We have also discussed in detail the assumptions made during DIF analysis, and we have presented an alternative approach to investigating differential item functioning: profile analysis. Each methodological development brings us closer to better identifying biased items. Banerjee & Papageorgiou (2011) noted that thirteen of the nonoccupational items demonstrated DIF (just over 14 percent of the ninety-one nonoccupational domain items in the dataset), and only four of the occupational items demonstrated DIF (just under 10 percent of the forty-two occupational domain items in the dataset). For them, this raised the important question of whether they had overlooked something important when formulating their hypothesis. Profile analysis offers a methodology for exploring this question; further investigations of the MET® dataset could take this question further.

However, the most difficult aspect of DIF investigations is deciding how to proceed based on the results, and this question remains unresolved. To date, most research has fallen silent on this matter or has presented tentative suggestions. This is perhaps to be expected since the results are often weak or unclear. For instance, Geranpayeh & Kunnan (2007) were unable to confidently explain in all cases the reasons why items in their dataset exhibited DIF. It was therefore difficult for them to make recommendations for test development. Indeed, determining what should be done with the results of DIF analysis is a largely unanswered question that poses a serious challenge to the testing community. One exception is Takala & Kaftandjieva (2000), who recommend that items be retired as soon as DIF is detected. However, the systematic removal of items that exhibit DIF might unwittingly narrow the construct of the exam. Let us take the example of items in a vocabulary test that demonstrate DIF because certain words are more familiar to male test takers or to female test takers. Takala & Kaftandjieva (2000) argue that this is because the vocabulary contexts for these test takers are different: male test takers are more familiar with the words used in technology and science, while female test takers are more familiar with the words used in humanistic activities such as child care. This might be the case, but it is not clear that the solution is to remove these words from testing, for this sends the message that these words are not part of the general vocabulary construct for all test takers. In the case of the MET®, a very similar question would need to be answered, that is, whether items in the occupational domain are an important aspect of the test construct. If yes, then it might not be wise to remove occupational items that demonstrate DIF.

And here, we come full circle. At the start of the chapter, we cited the Standards for Educational and Psychological Testing (AERA, APA, NCME, 2014), saying that test developers must ensure that their test measures the intended construct. It is unclear what a test developer should do if a portion of the target test population is disadvantaged by a portion of the intended construct.
Additionally, we should ask whether an item should be removed as soon as it exhibits DIF, or whether it should be monitored to establish whether it consistently exhibits DIF, that is, whether every time that item is presented to test takers it exhibits DIF (and in the same direction). Banerjee & Papageorgiou’s (2011) work demonstrated


that DIF cannot be assumed to be consistent. For each of the items in the MET® dataset that demonstrated DIF, this only occurred on one test administration. This suggests an interaction between the item and the context in which it is tested (i.e., the other items in that administration). It is not entirely clear how that might be disentangled and investigated.

So, two issues remain. First, what actions are defensible for items that demonstrate DIF? Instead of removing DIF items, might we argue that there are too many items of that category in the test and that the construct would be better served by placing fewer items from a particular category (in the case of the MET®, occupational domain items) on the test? Alternatively, when a test appears to have an age effect, test developers might wish to reconsider their definition of the target test population, perhaps narrowing it in order to define the construct more appropriately. In the case of the MET®, the evidence from this study forms an early argument for an MET®-for-teens, that is, a variant of the MET® that is targeted more directly at younger test takers.

Second, once biased items have been confidently identified, how might we confirm the source of the bias? Alderson (1990) and Alderson & Kremmel (2013) remind us that content experts find it difficult to agree on what an item is testing. The test taker, however, might be able to throw light on the causes of differential item functioning. Eye-tracking methodology might be combined with stimulated recall interviews to explore the processes that test takers engage in when they answer test items (particularly those processes that prove to be statistically biased).

Note

1 The criteria for achieving equal average deviations (in absolute value) are even more strict: The groups must be of equal size, the data used for the profile analysis must be the same as the ones used for the estimation of the item parameters, and the estimation procedure for the item parameters must be conditional maximum likelihood.

References

Abbot, M. L. (2007). A confirmatory approach to differential item functioning on an ESL reading assessment. Language Testing, 24(1), 7–36.

AERA, APA, NCME (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education). (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

Alderson, J. C. (1990). Testing reading comprehension skills (Part 1). Reading in a Foreign Language, 6(2), 425–438.

Alderson, J. C., & Kremmel, B. (2013). Re-examining the content validation of a grammar test: The (im)possibility of distinguishing vocabulary and structural knowledge. Language Testing, 30(4), 535–556.

Angoff, W. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 3–24.


Banerjee, J., & Papageorgiou, S. (2011). Looking out for the test taker: Checking DIF. Paper presented at the annual EALTA Conference, Siena, Italy.

Banks, K. (2013). A synthesis of the peer-reviewed differential bundle functioning research. Educational Measurement: Issues and Practice, 32(1), 43–55.

Camilli, G. (2006). Test fairness. In G. Brennan (Ed.), Educational Measurement (4th edn.). Westport, CT: Praeger Publishers, pp. 221–256.

Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the Certificate in Advanced English examination. Language Assessment Quarterly, 4(2), 190–222.

Harding, L. (2012). Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective. Language Testing, 29(2), 163–180.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test Validity. Hillsdale, NJ: Lawrence Erlbaum, pp. 129–145.

McNamara, T., & Roever, C. (2006). Language Testing: The Social Dimension. Malden, MA: Blackwell Publishing.

Pae, T.-I., & Park, G.-P. (2006). Examining the relationship between differential item functioning and differential test functioning. Language Testing, 23(4), 475–496.

Shealy, R. T., & Stout, W. F. (1993a). An item response theory model for test bias and differential test functioning. In P. Holland & H. Wainer (Eds.), Differential Item Functioning. Hillsdale, NJ: Erlbaum, pp. 197–240.

Shealy, R. T., & Stout, W. F. (1993b). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.

Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17(3), 323–340.

Verhelst, N. D. (2012a). Profile analysis: A closer look at the PISA 2000 reading data. Scandinavian Journal of Educational Research, 56(3), 315–332.

Verhelst, N. D. (2012b). Profile-G. Available from http://www.ealta.eu.org/resources.htm.

Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4(2), 223–233.

13 ACCESS for ELLs: Access to the Curriculum

Jennifer Norton and Carsten Wilmes

ABSTRACT

In response to the steadily increasing number of English language learners (ELLs) entering grades K–12 U.S. classrooms, federal accountability laws in the United States now require schools to provide ELLs with meaningful instruction and assessment of both academic content and academic English. This chapter briefly traces the history of the American educational standards and assessment movement, and discusses the shift from the first generation of English proficiency assessments to academic language proficiency assessments. It then explains how research on ACCESS for ELLs supports the WIDA Consortium’s1 overall assessment system. The research and development of the initial test forms are described, along with research related to a major refinement to the test: cognitive labs on the speaking test redesign for computer delivery. The chapter concludes by discussing the opportunities and challenges of computer-based English language proficiency testing in the K–12 school context.

Introduction

This chapter provides an overview of work related to the assessment of English language learners (ELLs) in the United States of America. Similar to a trend in other English-speaking countries, the number of ELLs in the United States has risen steadily over the past few decades, and, with approximately 4.5 million students, is now the fastest growing subpopulation in public schools (Amundsen, 2012; Migration Policy Institute, 2010; National Education Association, 2008; Ontario Ministry of Education, 2008). Indeed, it is estimated that by 2025 as many as a quarter of all U.S. public students in grades kindergarten to 12 (K–12) will be ELLs (TESOL International, 2013). Once given only scant attention, ELLs are now the focus of increasingly strict legal requirements, which have culminated in a federal accountability system that requires U.S. states to provide them with rigorous instruction and assessment to promote their academic success. This steadily growing group of students has historically underperformed compared to their native English-speaking peers. It is for that reason that current federal law mandates significant accountability measures that track ELLs’ academic and linguistic progress and that rely heavily on the use of language assessments. Starting with the background on ELLs and a historical perspective on their instruction and assessment in the U.S. context, we discuss the nature of academic language, which is key to success in school and beyond. More specifically, we address how it is derived from educational standards, how those standards have, since 2004, been operationalized for an English-language proficiency (ELP) assessment called ACCESS for ELLs, and how research reflected in assessment use arguments underpins this work.

Who Are ELLs?

U.S. federal law defines an ELL as someone who is (1) 3–21 years of age, (2) enrolled or about to be enrolled in elementary or secondary school, and (3) whose native language is not English (U.S. Department of Education, 2001). In the 2011–2012 school year (the most recent year for which data were available), 9.1 percent of students in U.S. public schools were ELLs, and at least 60 percent of all U.S. schools enrolled at least one ELL, with a wide range of population patterns and higher incidence in urban areas (U.S. Department of Education, 2012). Approximately 67 percent of ELLs were at the elementary school level (Gottlieb, 2006). Across the country, ELLs accounted for 14.2 percent of students in city schools, 9 percent in suburban schools, 6.2 percent in towns, and only 3.9 percent in rural schools. Perhaps surprisingly, the vast majority of ELLs are actually born in the United States: 77 percent in elementary school and 56 percent in secondary school (Capps et al., 2005). Most ELLs speak Spanish as their native language (79%), followed by Vietnamese (2%), Hmong (1.6%), Cantonese (1%), and Korean (1%) (Loeffer, 2007). ELLs are therefore an extremely heterogeneous group, including such varied members as Native American students on reservations, recent or second- and third-generation Mexican immigrants, and very recent arrivals, including refugees.

Why Is It Important to Support These Students?

Although they are indeed a heterogeneous group, ELL students of all the backgrounds mentioned above share a need for linguistic support that enables them to succeed in academic content classes and, ultimately, in American society. Providing adequate support to these students ensures that (1) students who are ELLs are identified appropriately and receive the support services they need, and (2) appropriate instruction is provided to ensure that ELLs “develop oral and written language skills that will make them academically competitive” (Goldenberg et al., 2010, p. 63). ELLs’ language proficiency and progress are measured on an annual basis, and they are exited from support services


when they have attained sufficient proficiency to succeed in English-language classrooms on their own. The challenge that this population faces is the dual task of learning both English and academic content at the same time (Short & Fitzsimmons, 2007). Not only do they need to learn the academic content like their peers, but they simultaneously need to gain the English language skills needed to access the academic content being taught in their classes. As may be expected, without additional support ELL s historically have struggled to meet grade-level expectations. For example, among fourth graders, only 30 percent of ELL s scored at or above Basic Level in Reading on the National Assessment of Educational Progress in 2007, compared to 69 percent for non-ELL students (U.S. Department of Education, 2010). The task of supporting ELL s is complicated by the fact that public education in the United States is the domain of individual states, which determine their own policies and criteria related to fulfilling legal requirements imposed by federal law. This has resulted in a diversity of criteria, policies, instructional standards, and assessments used across states, even relating to such basic issues as to how ELL s are defined and identified. In recent years, the federal government has been trying to bring states together by making grant opportunities and other funding streams available for efforts leading to shared policies. One example of these efforts is the WIDA Consortium, a nonprofit collaborative of thirty-six U.S. states that provides an integrated system of English language development (ELD ) standards, curricular support, research, and a common ELP assessment, ACCESS for ELL s.

A Federal Mandate to Adequately Support ELLs

Early efforts to provide ELLs with better opportunities to succeed in society and close the achievement gap with other students focused on providing appropriate instructional support and not on assessment per se. The U.S. Civil Rights Act (1964) and the Bilingual Education Act (1968) created funding mechanisms for school programs to support ELLs in their native languages (Hakuta, 2011). This was further solidified by the U.S. Supreme Court decision Lau v. Nichols (1974), which established language minority students as a protected class and created a need to identify and place these ELLs into instruction. This need, in turn, led to the adoption of a first generation of large-scale K–12 ELP assessments. However, rather than developing new assessments for these new purposes, various existing commercially available ELP assessments, such as Language Assessment Scales (LAS), were repurposed and utilized at the local school and district level (Del Vecchio & Guerrero, 1995; Wolf et al., 2008). Since these assessments focused on social, everyday language, which is likely not a sufficient indicator of the ability to access grade-level curriculum and succeed in school and later in the society at large, the validity and reliability of decisions made on the basis of these assessments is somewhat questionable (Abedi, 2007, p. 5; Bailey, 2007; Butler & Stevens, 2001; Stevens et al., 2000). Combined with a lack of consistent English-language proficiency curricular standards at the state level (never mind at the federal level), “Commercial tests were usually based on unexamined assumptions of what English-language proficiency meant. It was an inexact science, arguably one that resulted in a wide range of inconsistent decisions from one district to the next” (Boals et al., 2015). The passage of the No Child Left Behind (NCLB) Act (U.S. Department of Education, 2001) marked a watershed moment as it placed much greater emphasis


on assessment and accountability at the state level. Most relevant to the present discussion, it required states—not districts—to develop and implement appropriate state-wide English-language development (ELD ) standards aligned to existing state content standards, and crucially, state-wide ELP assessments based on these standards (Abedi, 2007). These assessments had to be adequately aligned to a state’s ELD standards and assess all four domains: listening, reading, speaking, and writing. Consequently, many states had to abandon their existing assessments, many of which merely assessed social language or language arts, and lacked the rigor of the current generation of assessments. The NCLB Act also strengthened accountability requirements by prescribing annual assessment of all ELL s, both on the ELP assessment and the existing regular state content accountability assessments. This legislation effectively closed the loopholes that had previously allowed the exclusion of ELL s from accountability systems (Lacelle-Peterson & Rivera, 1994). As noted above, this next generation of assessments was to be built on new ELD standards. State, not federal, standards are a driving force in teaching and learning in the current American education system and set expectations for what students are to learn in school. For example, beginning with the National Council of Teachers of Mathematics in 1989, national groups in the United States have promoted the development of content standards “to describe what students should know and should be able to do as the result of schooling” (Gottlieb, 2006, p. 31). In the absence of national educational standards, many states have adapted standards to their own local needs and purposes to meet federal accountability purposes. More recently, the federal government has attempted to foster the development of shared minimum college and career-readiness standards for Mathematics and Language Arts, oftentimes referred to as the Common Core. The aim is to establish comparable standards of academic rigor across the United States in order to enable U.S. students to be competitive with those of other nations (National Governors Association Center for Best Practices and Council of Chief State School Officers, 2010). Typically, teachers must teach specific content standards in order to help students to achieve high scores on required standardized tests. Now, since the adoption of the NCLB Act in 2001 and its requirement to annually assess ELL s’ progress in learning academic English, ELD standards have been developed to serve as the foundation of these required summative assessments. These standards are also intended as resources for educators who seek guidance in designing language instruction and assessment for English language learners (ELL s) at different levels of proficiency in academic English. In meeting this federal mandate, states have the option of developing their own ELD standards and their own ELP assessments. However, due to the high cost and capacity required to carry out such a large and complex undertaking, many states have opted to join a consortium that includes other states and have elected to use the same standards and assessment. States with large ELL populations, such as California and New York, have developed their own standards and assessments.

What Is Academic Language?

In recent years, much research has been conducted on academic language or academic English, which is the language required to access academic content and to succeed in


school. It is also the language ELLs need to successfully acquire in order to succeed in school (Albers et al., 2008; Anstrom et al., 2010; Bailey, 2007; Scarcella, 2003). This evolving understanding of academic English underpins the current generation of ELD standards, such as those articulated by the English Language Proficiency Assessment for the 21st Century (ELPA21) and WIDA consortia. Conceptually, academic language is an outgrowth of the concept of “cognitive academic language proficiency” (CALP), which Cummins (1984) differentiated from “basic interpersonal communication skills” (BICS) (Snow et al., 1989). The term BICS refers to a facility in the language needed to interact in social and instructional settings, while CALP requires the ability to communicate and comprehend more abstract topics and decontextualized language, which is typical of the language of the content areas taught in school. According to Anstrom et al. (2010), academic language is a concept that refers to a register of English that can be specific to a given content area (p. 11). Academic language is characterized not only by technical, subject-specific vocabulary or vocabulary denoting abstract concepts, but also by sentence-level and discourse-level features that require explicit instruction (Anstrom et al., 2010; Scarcella, 2003). Refer to Boals et al. (2015) for a comprehensive review of current conceptualizations of academic language.

How Is Academic Language Represented in the WIDA ELD Standards?

The WIDA ELD Standards are one example of the sets of next-generation ELD standards driving ELL assessment and instruction in the United States. According to the WIDA Consortium, the standards exemplify their “mission of advancing the academic language development and academic achievement of ELLs” (WIDA Consortium, 2012, p. 14). The WIDA ELD Framework defines academic language as “the oral and written text required to succeed in school that entails deep understanding and communication of the language of content within a classroom environment” (WIDA Consortium, 2012, p. 112). While not content standards themselves, the WIDA ELD Standards address the language of the main content areas and the social/instructional context for Prekindergarten through high school: (1) social and instructional language, (2) the language of Language Arts, (3) the language of Mathematics, (4) the language of Science, and (5) the language of Social Studies (WIDA Consortium, 2012). These standards are the foundation for both assessment and instruction. At the classroom level, the standards are meant to be a resource for both second language teachers and content teachers as they design language-rich curriculum, instruction, and assessment to foster academic English development and academic achievement. For accountability purposes, WIDA’s large-scale annual summative English proficiency assessment, ACCESS for ELLs, as well as the WIDA Screener, are aligned directly to the WIDA ELD Standards (Center for Applied Linguistics, 2013). The WIDA ELD Standards define five proficiency levels along the continuum of English language development: (1) Entering, (2) Emerging, (3) Developing, (4) Expanding, and (5) Bridging (WIDA Consortium, 2012). The performance definitions


for each level relate to the discourse, sentence, and word/phrase levels of language. That is to say, the performance criteria describe linguistic expectations or demands in terms of linguistic complexity, language forms and conventions, and vocabulary usage. Table 13.1 summarizes the framework organization by standard, grade, language domain, and proficiency level.

TABLE 13.1 Organization of the WIDA ELD Standards (WIDA Consortium, 2012)

Standards            Social and Instructional Language, Language of Language Arts, Language of Mathematics, Language of Science, Language of Social Studies
Grades               K, 1, 2, 3, 4, 5, 6, 7, 8, 9–10, 11–12
Language Domains     Listening, Reading, Speaking, Writing
Proficiency Levels   (1) Entering, (2) Emerging, (3) Developing, (4) Expanding, (5) Bridging

To provide models of language expectations at each proficiency level in a manner that is developmentally appropriate for students at different ages, the framework is articulated for each grade from kindergarten to grades 11–12. From there, for each grade or grade cluster, the standards are described for each of the four language domains: listening, reading, speaking, and writing. To make the WIDA ELD Standards tangible, they are realized in matrices of model performance indicators at each proficiency level (1–5), for every domain (Listening, Reading, Writing, and Speaking), for every grade cluster (K, 1, 2, 3, 4, 5, 6, 7, 8, 9–10, and 11–12), and for every standard (Social and Instructional language, language of Language Arts, language of Mathematics, language of Science, and the language of Social Studies). The model performance indicators in the matrices are examples for educators and test developers to transform to fit their instructional and assessment needs. These example indicators are composed of a language function, a content stem, and a level of support (which is greater for students at the lower end of the proficiency continuum). For instance, a model performance indicator for the language of Science, grade 2, Writing, Level 3, Developing is: “Describe the stages of life cycles using illustrated word banks and graphic organizers” (WIDA Consortium, 2012, p. 61). In this example, the language function of describe represents the productive domain of writing and the content stem is life cycles, which is a typical science topic in grade 2. At Level 3, Developing, students benefit greatly from visual organizers, so the example supports given are “illustrated word banks and graphic organizers.” Note that the emphasis is not on a student’s mastery of the science content of the life cycle, but on the student’s ability to use academic language to describe the life cycle through writing, with supports. The WIDA ELD Standards also emphasize “the particular context in which communication occurs,” which in this instance may be a lab report for a science class (WIDA Consortium, 2012, p. 112).

In addition to a specific focus on academic language as opposed to simply general social language, WIDA’s assessments are based on the WIDA English Language Development Standards. These standards champion a “can do” philosophy that


emphasizes “believing in the assets, contributions, and potential of linguistically diverse students” (http://www.wida.us). This mindset focuses on what ELL s bring to the classroom, whether cultural, linguistic, or academic, rather than what they lack in English language proficiency, and it is reflected in the standards as well as the WIDA assessments. In particular, the recognition of the supports needed by an ELL at a given proficiency level is a way of focusing on what the student “can do” rather than on deficits. A newcomer to the United States may not be able to write an essay in English, but he or she can do higher order thinking and can demonstrate writing in English by labelling pictures with words or phrases. Likewise, an ELL at Level 3, Developing, may struggle to complete an extended recount completely independently, but with the support of “word banks and graphic organizers,” as in the example above, the student can take advantage of word banks effectively and can author comprehensible sentences. Ultimately, the purpose of using the WIDA ELD Standards is to assist students in moving from one proficiency level to the next, until they can access grade level content independently. So, understanding a student’s current proficiency level allows teachers to tailor instruction to their current level and provide some guided challenges to help the student make progress.

How Is Academic Language Operationalized in the WIDA Assessment System?

ACCESS for ELLs (the WIDA summative assessment) and the WIDA Screener (the WIDA ELL identification/screener test) fulfill the need to determine the proficiency level of those students who are identified as English language learners.2 The overall purpose of the assessment system is to measure ELLs’ annual academic English growth and to ensure they are progressing toward Level 6, Reaching, which is considered the end of the continuum. At this level, ELLs can access grade level content without requiring ELL support services; it is not defined as another level of proficiency. Though the specific identification processes and English language support programs vary by state and by district, Figure 13.1 provides a general overview of this process. When a student enters a school system, the family completes a home language survey, which asks about the language that a student may speak, read, write, or be exposed to. If the home language survey responses suggest that a student may be an ELL, the student takes a screener test. However, best practice requires that several data sources, such as teacher observations and not solely test results, be used in the high-stakes decision of ELL identification. The WIDA Screener assesses academic English language proficiency to determine whether a student would be classified as an ELL. The resulting proficiency score is initially used for identification and may also be used to determine the type or level of English support services that the student may need. Those services represent a range of programs, depending on the local offerings and context and on the proficiency level of the student. Also shown in Figure 13.1, the ACCESS for ELLs test is the summative assessment that is administered annually to measure students’ yearly progress in acquiring academic English. Its purpose is to ensure that students who are ELLs are


FIGURE 13.1 WIDA English Language Proficiency Screener and Summative Assessments.

demonstrating growth on an annual basis and moving toward the point where they will be able to engage in the mainstream classroom without additional English language support. Students who are still in the process of acquiring English would, of course, continue to receive support. In addition, their ACCESS for ELL s proficiency level scores could inform specific instructional goals to help ensure the student makes progress throughout the coming school year. The student would be administered ACCESS the following year, and hopefully, demonstrate the anticipated growth. Each state sets an “exit score” that, once attained, indicates that a student could be “exited” from English language support services. This policy decision has a critical impact at the school and student level: schools’ programmatic decisions are based on the state’s approved exit score and the extent that ELL s receive English language support or not depends on their scores on ACCESS for ELL s. As such, ACCESS for ELL s is a high-stakes assessment. The rigor of the test itself and its validity as a measure of ELL s’ language proficiency are therefore of the utmost importance.

What Research Supports the Use of ACCESS for ELLs?

An argument-based approach to validity has been used to frame the research, development, and refinement of the ACCESS for ELLs test. Grounded in this validation framework, this section presents highlights of the initial test development, followed by an explanation of cognitive lab research carried out to support the refinement of the speaking test for computer delivery.

Argument-based Validation Framework

In an argument-based validation approach, the validity of an assessment lies in the ability to provide evidence that supports the claims that are made in the interpretation


FIGURE 13.2 The Center for Applied Linguistics Validation Framework (Center for Applied Linguistics, 2013, p. 30).

of the test scores (Kane, 1990). Figure 13.2 shows a seven-step validation framework that “connects test design and scores to intended consequences” (Center for Applied Linguistics, 2013, p.  28). Adapted from Bachman & Palmer (2010) and Mislevy et al. (2004), this argument-based framework depicts “Consequences” as step 1 and “Plan” as step 7 because the planning must anticipate the decisions and consequences of the assessment’s scores from the inception of the test. At each step, the claims, or assertions, being made must have evidence to support each action associated with that claim. In other words, all decisions and actions related to the test development, scores, score uses, and the consequences must be justified. In the “Design” step, the definition of what is being measured and how the given construct will be measured are established and described. Within this step, the development process involves a trialling phase to ensure quality of the assessment implementation and delivery as well as the scores (“Assessment Records”) and the validity of the score interpretations (“Interpretations”). Score interpretations must be valid and related to steps 2 and 1, the decisions and consequences resulting from those scores.

ACCESS for ELLs Initial Pilot Test

An initial test design was created and pilot test items were drafted. Before collecting extensive field test data of full test forms, a pilot test was conducted. The pilot


examined six focal areas that were critical to ensuring the validity of this language proficiency test’s scores and the ability to interpret those test scores for accountability and instructional decisions. Kenyon (2006) outlined the six claims that were made and investigated during the pilot of the test: 1 The overall test format is appropriate for ELL s at each of the grades. 2 The format of having three different, leveled forms per grade cluster (to be taken by students according to their approximate proficiency level) is appropriate. 3 The timing of each domain subtest is appropriate. 4 The items assess different proficiency levels. 5 The test administration instructions are appropriate. 6 The test content is appropriate for the different grade clusters. To address these six focal areas, data were collected in two phases at twenty different schools in three different states (Phase 1 N = 509; Phase 2 N = 805) (Kenyon, 2006). In each phase, one researcher assumed the role of test administrator, while a second researcher observed and recorded notes, which included all deviations from the script, students’ questions, the timing of each item prompt and question, response times, the timing of entire forms as well as any other issues that arose, such as room arrangement (Kenyon, 2006). Written and oral feedback was also solicited from teachers and students. After Phase 1 of piloting, test administrator scripts and items were revised. Because items were revised between pilot phases and the sample size in Phase 2 was insufficient for Rasch analyses, only a basic item analysis (checking percent correct) was conducted. Table 13.2 summarizes the six focal areas of the pilot test and the actions and evidence associated with each claim, based on the data collected in both phases of the pilot (Kenyon, 2006). Regarding Claim 1, students interacted with the test materials as expected, which showed that the overall test format functioned appropriately. Claim 2 referred to the three leveled forms per grade cluster: Tier A, for newcomers with a small amount of English proficiency; Tier B, aimed at students in levels 2–4; and Tier C, geared toward higher proficiency students who may be close to exiting from ELL support services. The claim was that leveling the forms would allow students to see test items that were slightly challenging without being frustratingly difficult or easy. The pilot data shows that the leveling did have the intended effect, which confirmed the design decision. Based on the pilot evidence for Claim 3, only the timing for the Speaking Test needed to be changed, from ten to fifteen minutes to account for the additional time needed for higher proficiency students to complete the tasks (Kenyon, 2006). Claim 4 was that the test administration procedures were clear and able to be implemented in a standardized way by all test administrators. The test administration instructions were revised between rounds of piloting and were found to be “clear and efficient” by the close of the pilot phase (Kenyon, 2006, p. 13). Claim 5 was that the items were able to assess different proficiency levels, and that the test could therefore distinguish among students of different abilities. The field test collected extensive quantitative data to investigate this claim, whereas the pilot


TABLE 13.2 Summary of Pilot Research Claims, Actions and Evidence (Kenyon, 2006)

Claims 1 and 2: Appropriateness of overall test format; appropriateness of three different, leveled forms per grade cluster (taken by students according to approximate proficiency level)
Actions: • Observed students’ ability to interact with the test and understand what to do • Solicited qualitative feedback from students and teachers on the difficulty of the test • Gathered quantitative data in the field test
Evidence: • Students reported understanding what was being assessed • Students and teachers reported that the test seemed not overly easy or difficult

Claim 3: Appropriate timing of each domain subtest
Action: • Collected administration times for each pilot form
Evidence: • More time needed for the Speaking Test; increased time by 5 minutes • Other domain test times were on average as long as anticipated

Claim 4: Appropriateness of test administration instructions
Action: • Observed test administration, identified weaknesses, and then refined instructions
Evidence: • After making refinements based on Phase 1, observations showed smooth test administration for administrators and students

Claim 5: Ability of items to assess different proficiency levels
Actions: • Solicited qualitative feedback from students and teachers on the difficulty of the test items • Item pass rates analyzed
Evidence: • Students and teachers reported on difficulty of the items • Overall, items show increasing difficulty as intended • Field test data collected to provide quantitative backing for this claim

Claim 6: Appropriateness of test content for the different grade clusters
Actions: • Enlisted teachers to write the initial test items • Submitted all items to Content, Bias and Sensitivity Reviews
Evidence: • Reviews by teachers confirmed appropriateness of items; any items deemed inappropriate were jettisoned or revised


provided mainly qualitative backing. The student and teacher feedback on the Listening and Reading Test items showed that heavier use of graphics at the lower proficiency level items was successful, and that the items aimed at higher proficiency levels posed the anticipated level of challenge. Mean item difficulty showed the desired progressive increase in difficulty. For the Writing Test, confusion among students about what to do indicated that some instructions needed to be clarified. In addition, the pilot revealed that students at the lowest proficiency levels had difficulty interacting independently with the Writing Test. Additional interaction with the test administrator was effective in guiding those students. This evidence led to increasing the support in the test administrator script for those students. Additionally, students at a high proficiency level tended to produce shorter writing samples. This evidence prompted changes to the task design to require an extended writing sample in order to sufficiently assess students’ writing at the higher proficiency levels. For the Speaking Test, the individual interview-style format was effective and elicited the desired language levels. Finally, Claim 6 was that the content of the test items was appropriate for the grade clusters. In the pilot, both student and teacher feedback on the content appropriateness was generally positive, with some concern in the grades 1–2 cluster regarding differences in literacy levels between students in these two grades. An additional action to support Claim 6 was that teachers from each grade cluster served as item writers. As experts in the grade level content, they are well placed to ensure that items are at grade level. Following the pilot test phase, the item specifications were further refined to reflect the characteristics that made items clear and appropriate for students. Then, more items were developed to complete the pool of items to undergo field testing. Following the pilot, but before field testing, all items underwent a content and fairness review to ensure that the content presented to students was appropriate for the grade level and fair for all students, regardless of subgroup. The purpose of the review is to have a test that is fair to all students, neither favoring nor disfavoring any subgroup of student for his or her gender, cultural background, religious beliefs, socioeconomic status, or regional experience. That is, the test is intended to distinguish students’ academic English language proficiency, not to discriminate differentially according to other personal characteristics. As such, these reviews lend evidence to support the claim that the test items are appropriate. Based on the pilot test evidence in support of these six claims, test development was completed for full test forms, including overage items, in anticipation of the field test, which collected “extensive data on items and forms in order to equate forms both horizontally (i.e., across tiers within same grade clusters) and vertically (i.e., across grade clusters) as well as to judge the strength of individual items” (Kenyon et al., 2006, p. 5). The field test established valid and reliable proficiency scores on a standard scale. Also, in order to ensure the usability of the test results, all the test forms for grades K–12 had to be equated to the same scale. 
Equating is “the process of putting all the tests on the same scale, such that the results mean the same regardless of which test items the test taker takes” (Kenyon et al., 2006, p. 7). Every year since the initial test became operational in 2004, nearly half the test items have been replaced with new test items in order to avoid student overexposure to test items and to continually refine the assessment as more is learned about assessing the construct of academic English language proficiency.
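The chapter does not spell out the equating method used for ACCESS for ELLs, so the following is only a generic, hedged illustration of the underlying idea of putting two forms on a common scale: mean/mean linking of Rasch item difficulties through items shared by both forms. Every item label and value below is invented for the example; this is not the procedure used operationally.

```python
# Hypothetical Rasch difficulty estimates (logits) for anchor items that appear
# on an old form (already on the reporting scale) and on a new form (calibrated
# on its own, arbitrary scale). All values are invented for illustration only.
anchor_old = {"item_01": -0.40, "item_02": 0.10, "item_03": 0.95}
anchor_new = {"item_01": -0.15, "item_02": 0.38, "item_03": 1.17}

# Mean/mean linking: the constant that shifts the new form onto the old scale is
# the average difference between the two sets of anchor difficulty estimates.
shift = sum(anchor_old[i] - anchor_new[i] for i in anchor_old) / len(anchor_old)

# Apply the shift to every item calibrated on the new form, anchors and new items alike.
new_form = {"item_01": -0.15, "item_02": 0.38, "item_03": 1.17, "item_99": 0.60}
new_form_on_reporting_scale = {i: b + shift for i, b in new_form.items()}
print(round(shift, 2), new_form_on_reporting_scale)
```

Operational programs typically use more elaborate linking designs and check the stability of the anchor items before accepting a link; the sketch is meant only to show what “putting tests on the same scale” involves.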


Continual Refinement: Research and Development for an Online Speaking Test

In addition to other refinements to the test over the years, ACCESS for ELLs transitioned to online delivery in 2015. This change was motivated by three factors: (1) the paper-based format imposed a heavy requirement to train test administrators to score the Speaking Test reliably, (2) difficulties in monitoring the standardization of test administration, and (3) the lack of control over reliability when scoring locally across thousands of schools. Standardized delivery through an online format allowed for “uniformity of elicitation procedures” (Hughes, 1989, p. 105). Accent, rate of speech, and extent of repetition can all be controlled for when presenting recorded audio. Additionally, students’ performances on the online test are recorded, allowing for centralized scoring and alleviating concerns about variances among local raters. Centralized raters can be trained and monitored, with careful checking of inter-rater reliability. The computer format afforded several other opportunities: (1) increased student engagement; (2) increased ability to present innovative performance-based tasks, rather than just traditional multiple choice; (3) decreased burden on test administrators; (4) faster test results; (5) increased uniformity of administration across all four domains; and (6) increased reliability of scoring the Speaking Test, which previously had been locally scored, but is now scored centrally by professional raters. However, careful research and development was required to create a test where students would be comfortable and confident giving an oral response to the computer and would also successfully record their answers. In the transition from the paper-based format, the Speaking Test was fully redesigned for grades 1–12 from a one-on-one interview format scored locally by the test administrator to a small-group administered design in which the student is guided through the test by a “virtual” test administrator and a “model student” and the student self-records his or her responses. This means that students wear headsets to hear the task input from the virtual test administrator and the model student, and then record their oral responses into a microphone for electronic capture and central scoring. Figure 13.3 shows an example screen from the online Speaking Test.

Speaking Test Pilot Phase During the development of the new type of test items, an approach called retrospective cognitive interviews, or cognitive labs was used. Cognitive labs are stimulated recall interviews which ask participants to explain their thinking and processes when completing a given task (Gass & Mackey, 2007; Willis, 1999). It is an effective qualitative method for gathering data on the clarity, possible interpretations, and unintended interpretations of a task. “Verbal probes” are utilized to determine the source of confusion in the task (Willis, 1999). Questions about the task are asked as close as possible following the task completion to minimize recall difficulty in the interview. Prototypes of Speaking Test items were developed and then trialled with students to gather qualitative interview and observational data. The purpose was to research

FIGURE 13.3 Online Speaking Test Sample (WIDA Consortium, 2015).

and develop the prototypes iteratively and then scale up item development. The cognitive labs were conducted to gather evidence related to three claims that are critical to the validity of the Speaking Test scores: 1 Students understand how to use the computer interface to respond to speaking tasks. 2 Speaking test tasks elicit language at the targeted proficiency levels. 3 Model responses help students know how to answer test questions i.e., how extensive their responses should be (Center for Applied Linguistics, 2014). The cognitive labs were individual 45–60 minute video-recorded student interviews conducted in four rounds, as summarized in Table  13.3. The first two rounds of cognitive labs were conducted with students in grades 4–5. This grade level was selected as the sample in order to garner information relevant to both younger and older students, with a later cognitive lab round focusing specifically on the youngest students who would take the test. The iterative design of the cognitive labs allowed revisions to be made within and between rounds and for their effectiveness to be confirmed, thereby strengthening the validity of the new design. For example, instructions that had been confusing to a student were revised immediately after that cognitive lab interview and then repiloted with the next student. In addition, after analyzing the observational and interview data, revisions were made to improve the prototypes. First individual students viewed computer-delivered instructions, practiced responding to tasks, and had the opportunity to ask questions. If any other difficulties were observed, the interviewer asked the student about them. Then, the student

TABLE 13.3 Speaking Test Cognitive Labs (Center for Applied Linguistics, 2014)

Round 1: Grades 4–5; N = 15; higher level proficiency. Initial investigation of research questions with prototype instructions, practice, and tasks.

Round 2: Grades 4–5; N = 10 (higher level proficiency) and N = 8 (lower level proficiency). Confirm Round 1 findings using revised tasks; investigate format for students with lower English proficiency.

Round 3: Grades 1–3; N = 13; higher level proficiency. Try out tasks with young students.

Round 4: Grades 1–12; N = 21; lowest level of proficiency. Confirm tasks are accessible to students with very low proficiency at all grades.
responded to a set of tasks and afterward responded to verbal probes. This was repeated with a second set of tasks, followed finally by general interview questions about the students’ experience interacting with these tasks and with other computerized tests. In each round, the results were positive: students enjoyed listening to and viewing the prompts and tasks on the computer and then recording their answers. The rounds also yielded constructive feedback that was incorporated into subsequent iterations of the prototypes and then either confirmed as effective or further refined. Table 13.4 summarizes the evidence for each claim emanating from the cognitive labs as well as the next step carried out to address any weaknesses in the test or task design. The refinements made to prototype items on the basis of the cognitive labs provided principles that were applied to the item specifications to support the development of the full field test item pool. Field testing was carried out for all four domains, after which item selection for the operational test was carried out based on the results.

Conclusion

The adoption of computer-delivered testing presents new opportunities for improving language assessment for ELLs in K–12 school settings. More innovative task types can be implemented that offer the potential to assess the ELP construct in a more refined and efficient manner. The introduction of game-based assessment features may result in enhanced student engagement in the assessment process. At the same time, the electronic scoring of assessments brings with it the opportunity to reduce the turnaround time for score reporting, potentially improving the usability of scores generated from large-scale standardized assessments, such as ACCESS for ELLs.


TABLE 13.4 Issues Revealed in Researching the Online Speaking Test (Center for Applied Linguistics, 2014)

Claim 1. Issue: Overall understanding of interface, how to record answers, engagement with the test. Next step/action: Confirmed with young learners and students with low proficiency in Rounds 2 and 3.

Claim 1. Issue: In the first practice, some confusion regarding which buttons to push to start and stop recording. Next step/action: Simplified text in instructions and added graphic support such as arrows.

Claim 1. Issue: Student frustration when forgetting names of people in tasks, e.g., historical figures or story characters. Next step/action: Repeated names in prompt and showed names on screen as needed.

Claim 1. Issue: Newcomers struggled to record based on the instructions, though a demonstration by the interviewer was effective. Next step/action: Further reduced listening load in instructions; added visual features to show rather than tell the student what to do.

Claim 1. Issue: Beginning proficiency students across grade levels were able to use the computer interface, but often needed modelling and support. Next step/action: Ensured modelling and support were present and clear.

Claim 2. Issue: Some students were overwhelmed by amount of input/language on the test. Next step/action: Created a test form for lowest proficiency students so that they can demonstrate proficiency via picture description tasks without becoming discouraged.

Claim 2. Issue: Short responses to highest level task. Next step/action: Revised task instructions; made expectations explicit; enhanced the model response.

Claim 2. Issue: Long prompt input. Next step/action: Chunked input into shorter amounts.

Claims 2, 3. Issue: Language elicitation was inconsistent across tasks. Next step/action: Conducted a study of task characteristics and language elicitation.

Claim 3. Issue: Lack of interest in the model student’s response. Next step/action: Made role of model explicit; enhanced the model response; required complete listening of model before advancing to the next screen.

Innovative technologies, such as the Accessible Portable Item Protocol (APIP ), bring with them the promise of assessments specifically tailored to the needs of students with special needs. At the same time, the increased adoption of computerized-testing also introduces challenges. Not all students, or schools for that matter, are ready to use technology in this manner. Paper and pencil assessments remain a tried-and-true mode of administration. This is particularly problematic for some ELL s who may be newcomers from other countries and may particularly lack experience keyboarding responses on writing assessments. Technology is also prone to malfunctions as recently highlighted by large-scale testing breakdowns in several U.S. states. Looking at current ELP assessment practices in the United States more broadly, it seems highly plausible that the current focus on rigorous assessment of all students along with their inclusion in state accountability systems, including those students with special needs, will be reduced again with an expected renewal of related federal legislation. Areas of particular concern are the use of assessments for teacher evaluation and a general sense that students, including ELL s, are over-tested and that testing is taking up too much valuable instructional time. Whether accountability tests pervade schools or not, ensuring that language proficiency tests provide valid and reliable scores, and that the decisions and consequences based on those scores are part of the planning and design is critical to meaningful assessment practices.

Notes

1 Please see http://www.wida.us for more information about WIDA.
2 WIDA also offers an interim assessment, which is another summative assessment that is intended to be given between the annual accountability assessments, should it be desired to monitor progress during the school year.

References Abedi, J. (2007). Chapter 1: English language proficiency assessment and accountability under NCLB Title III : An overview. In J. Abedi (Ed.), English Language Proficiency Assessment in the Nation: Current Status and Future Practice. University of California, Davis, School of Education, pp. 3–10. Albers, C. A., Kenyon, D. M., & Boals, T. J. (2008). Measures for determining English language proficiency and the resulting implications for instructional provision and intervention. Assessment for Effective Intervention, 34(2), 74–85. Amundsen, S. (2012). Union ACT Branch English as an Additional Language or Dialect Report. Available from: http://www.aeuact.asn.au/uploads/file/EALD %20%20 Report%20May%202012.pdf. [28 July 2015]. Anstrom, K., DiCerbo, P., Butler, F., Katz, A., Millet, J., & Rivera, C. (2010). A Review of the Literature on Academic Language: Implications for K–12 English Language Learners. Arlington, VA : The George Washington University Center for Equity and Excellence in Education. Bachman, L. F., & Palmer, A. S. (2010). Language Assessment in Practice. Oxford, UK : Oxford University Press.


Bailey, A. L. (2007). The Language Demands of School: Putting Academic English to the Test. New Haven, CT: Yale University Press. Boals, T., Kenyon, D. M., Blair, A., Cranley, M. E., Wilmes, C., & Wright, L. J. (2015). Transformation in K–12 English language proficiency assessment: Changing contexts, changing constructs. Review of Research in Education, 39(1), 122–164. Butler, F. A., & Stevens, R. (2001). Standardized assessment of the content knowledge of English language learners K–12: Current trends and old dilemmas. Language Testing, 18(4), 409–427. Capps, R. M., Fix, J., Murray, J., Ost. J., Passel, J., & Herwantoro, S. (2005). The New Demography of America’s Schools: Immigration and the No Child Left Behind Act. Washington, DC : The Urban Institute. Center for Applied Linguistics. (2013). Annual Technical Report for ACCESS for ELLs® English Language Proficiency Test, Series 301, 2012–2013 Administration (WIDA Consortium Annual Technical Report No. 9). Available from: http://www.wida.us/ assessment/ACCESS /. [27 July 2015]. Center for Applied Linguistics. (2014). Speaking Test Research and Development. Unpublished Internal Report. Cummins, J. (1984). Bilingualism and Special Education: Issues in Assessment and Pedagogy. Clevedon, UK : Multilingual Matters. Del Vecchio, A., & Guerrero, M. (1995). Handbook of English Language Proficiency Tests. Albuquerque, NM : Evaluation Assistance Center, Western Region New Mexico Highlands University. Gass, S. M., & Mackey A. (2007). Data Elicitation for Second and Foreign Language Research. Mahwah, NJ : Lawrence Erlbaum Associates. Goldenberg, C., & Coleman, R. (2010). Promoting Academic Achievement Among English Learners: A Guide to the Research. Thousand Oaks, CA : Corwin Press. Gottlieb, M. (2006). Assessing English Language Learners: Bridges from Language Proficiency to Academic Achievement. Thousand Oaks, CA : Corwin Press. Hakuta, K. (2011). Educating language minority students and affirming their equal rights: Research and practical perspectives. Educational Researcher, 40(4), 163–174. Hughes, A. (1989). Testing for Language Teachers. Cambridge, UK : Cambridge University Press. Kane, M. T. (1990). An Argument-Based Approach to Validation. ACT Research Report Series 90-13. Available from: http://www.act.org/research/researchers/reports/ pdf/ACT _RR 90-13.pdf. [7 December 2014]. Kenyon, D.M. (2006). Technical Report 1: Development and Field Test of ACCESS for ELLs. Center for Applied Linguistics. Available from: http://www.wida.us/assessment/ ACCESS /. [27 July 2015]. Kenyon, D. M., MacGregor, D., Ryu, J. R., Cho, B., & Louguit, M. (2006). Annual Technical Report for ACCESS for ELLs® English Language Proficiency Test, Series 100, 2004– 2005 Administration. (WIDA Consortium Annual Technical Report No. 1). Available from: http://www.wida.us/assessment/ACCESS /. [27 July 2015]. Lacelle-Peterson, M. W., & Rivera, C. (1994). Is it real for all kids? A framework for equitable assessment policies for English language learners. Harvard Educational Review, 64(1), 55–76. Loeffer, M. (2007). NCELA Fast FAQ 4: What Languages Do ELLs Speak? Washington, DC : National Clearing House for English-Language Acquisition and Language Instruction. Migration Policy Institute. (2010). Number and Growth of Students in US Schools in need of English Instruction. Available from: http://www.migrationpolicy.org/sites/default/files/ publications/FactSheet%20ELL 1%20-%20FINAL _0.pdf. [8 December 2014].


Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2004). A Brief Introduction to EvidenceCentered Design (CSE Report 632). Los Angeles, CA : Center for Research on Evaluation, Standards, and Student Testing. National Education Association. (2008). English Language Learners Face Unique Challenges. Available from: http://www.nea.org/assets/docs/HE /ELL _Policy_Brief_ Fall_08_(2).pdf. [8 December 2014]. National Governors Association Center for Best Practices and Council of Chief State School Officers. (2010). Common Core State Standards. Washington, DC : Authors. Ontario Ministry of Education. (2008). Supporting English Language Learners: A Practical Guide for Ontario Educators Grades 1 to 8. Toronto, Canada: Queen’s Printer for Ontario. Scarcella, R. (2003). Academic English: A Conceptual Framework. Santa Barbara, CA : University of California Linguistic Minority Institute. Short, D., & Fitzsimmons, S. (2007). Double the Work: Challenges and Solutions to Acquiring Language and Academic Literacy for Adolescent English Language Learners: A Report to Carnegie Corporation of New York. Washington, DC : Alliance for Excellent Education. Snow, M. A., Met, M., & Genesee, F. (1989). A conceptual framework for the integration of language and content in second/foreign language instruction. TESOL Quarterly, 23(2), 201–217. Stevens, R., Butler, F. A., & Castellon-Wellington, M. (2000). Academic Language and Content Assessment: Measuring the Progress of ELLs. Los Angeles, CA : University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST ). TESOL International Association. (2013). Overview of the Common Core State Standards Initiatives for ELLs. Alexandria, VA : Author. U.S. Department of Education. (2001). Language Instruction for Limited English Proficientand-Immigrant Students. (Title III of the No Child Left Behind Act of 2001). Washington, DC : Author, pp. 107–110. U.S. Department of Education. National Center for Education Statistics (2012). Schools and Staffing Survey: Public School, BIE School, and Private School Data Files 2011–12. Available from: http://nces.ed.gov/surveys/sass/tables/sass1112_2013312_s12n_002.asp. [28 July 2015]. U.S. Department of Education. National Center for Education Statistics, Institute of Education Sciences. (2010). The Condition of Education 2010 (NCES 2010-028), Washington, DC : U.S. Government Printing Office. WIDA Consortium. (2012). 2012 Amplification of the English Language Development Standards Kindergarten–Grade 12. Madison, WI : Board of Regents of the University of Wisconsin System. WIDA Consortium. (2015). ACCESS for ELLs 2.0 Test. Madison, WI : Board of Regents of the University of Wisconsin System. Willis, G. B. (1999). Cognitive Interviewing: A How-To Guide. Short course presented at the 1999 Meeting of the American Statistical Association. Wolf, M. K., Kao, J. C., Herman, J., Bachman, L. F., Bailey, A. L., Bachman, P. L., Farnsworth, T., & Chang, S. M. (2008). Issues in Assessing English Language Learners: English Language Proficiency Measures and Accommodation Uses—Literature Review. CRESST Report 731. Los Angeles, CA : University of California, Los Angeles.


14 Using Technology in Language Assessment

Haiying Li, Keith T. Shubeck, and Arthur C. Graesser

ABSTRACT

Advanced technologies such as natural language processing and machine learning, conversational agents, intelligent tutoring systems (ITSs), and epistemic games have accelerated developments in language assessment. The missions of language assessment can be substantially enhanced by integrating automated text analysis, automated scoring, automated feedback, and conversational agents. This chapter includes an empirical study in which we examine the reliability of automated assessment in an ITS. The first section of the chapter is a selective overview of language assessment. The second section introduces Coh-Metrix, an automatic text analysis tool for the analysis of text characteristics, including text difficulty. The third section introduces conversational agents in an ITS, spanning one-on-one tutoring, conversational trialogs, tetralogs, and multiparty interaction, illustrated with empirical studies.

Using Technology in Language Assessment

The focus of this chapter is on automated scoring in various contexts. Automated scoring involves a computer system receiving electronic versions of texts or writing, processing them with computational algorithms, and finally producing an analysis at various levels of language and discourse structure. The various applications of automated text analysis and automated assessment of texts each present a unique set of challenges and problems to address.


The assessment of reading comprehension requires a complex interaction among characteristics of the reader, the types of reading tasks, and the sociocultural context of the reader in addition to the properties of the text (McNamara, 2007; National Reading Panel, 2000; Snow, 2002). Precise measures of the properties of a text have facilitated the investigation of such interactions at various levels of text analysis (e.g., words, sentences, paragraphs, entire texts). These various levels of analyses assist with measuring and recording the deficits of readers who may excel at some levels, but not at others. Reading interventions that target specific characteristics of the text are expected to improve reading. Thus, automated text analysis tools have been applied to detect the appropriate reading levels for texts used in the interventions. The relatively modest reliability (i.e., .71 on average; Breland et al., 1999) and the high cost of human essay ratings have emphasized the need for applying automated natural language processing techniques for the development of automated scoring, which can supplement human scoring. Similar to automated text analysis, automated scoring can grade thousands of texts on hundreds of measures in a short amount of time. Some measures of language and discourse cannot be automated reliably, such as complex, novel metaphors, humor, puns, and conversations with cryptic provincial slang. In such cases, human experts need to annotate the texts systematically. Interestingly, automated scoring has been compared with human experts, and the results have been surprisingly impressive. Essay grading on a large scale typically involves the grading of essays on academic topics, which can be analyzed by computers as reliably as by expert human graders (e.g., Attali et al., 2012; Landauer et al., 2003; Shermis et al., 2010). Some people might be sceptical of the notion that computers can analyze language and discourse because they believe that only humans have sufficient intelligence and depth of knowledge. Errors in computer assessment have been controversial when used to grade high-stakes tests or psychological diagnoses of individuals or groups. People complain about computer errors even when trained human experts are no better, if not worse, than computer assessments. Computers, however, have unchallenged advantages, such as never getting fatigued, providing instantaneous feedback, being consistent and unbiased, measuring on many dimensions, and applying sophisticated algorithms that humans could never understand and apply (Graesser & McNamara, 2012; Shermis et al., 2010). This chapter includes three sections. The first section reviews the automated scoring of essays (i.e., texts that contain at least 150 words) and of short-answer responses (i.e., responses that range from a few words to two sentences). The second section introduces Coh-Metrix, an automatic text analysis tool for the analysis of text characteristics and text difficulty. The third section introduces automated scoring in an ITS that includes one-on-one tutoring (i.e., a user with a tutor agent), conversational trialogs (i.e., a user with a tutor agent and a student agent), tetralogs (i.e., two users with a tutor agent and a student agent), and multiparty interaction.

Automated Scoring

The review of automated scoring in this section consists of scoring essays and short-answer responses. The accuracy of automated scoring of these two types has reached the level of expert human raters (Attali et al., 2012; Attali & Burstein, 2006; Burstein,

2003; Elliot, 2003; Landauer et al., 2003; Rudner et al., 2006; Shermis et al., 2010; Valenti et  al., 2003). The automated scoring model was established on a humangraded corpus. Usually, two, three, or even more expert human raters grade a large sample of essays after they receive training on a scoring rubric. Five to seven grading scales and more recently eighteen scales (Attali et  al., 2012), are adopted in the rubric. The samples are divided into a training corpus and a validation corpus. A set of computational algorithms have been fine-tuned to optimally fit essay grading in a training set. The computational algorithms are tuned using linguistic features (e.g., grammar, usage, style, lexical complexity, etc.) and other higher order aspects of writing (e.g., idea quality, organization, development) (Attali et al., 2006; Attali et  al., 2012). The typical algorithms for this step include a linear multiple regression formula or a set of Bayesian conditional probabilities between the linguistic features and grading levels. The selected solution is applied to the essays in the validation corpus, and then the scores are compared with the human rating scores using a correlation coefficient (Pearson r or Spearman’s rho) or the percentage of exact agreements in the scores to examine the accuracy of the model. The automated scoring is considered successful if the scores between the computer and humans are about the same as the scores between humans. The most successful automated scoring systems for essays include the e-rater® system developed at the Educational Testing Service (Attali & Burstein, 2006; Burstein, 2003), the Intelligent Essay Assessor™ developed at Pearson Knowledge Technologies (Landauer et al., 2003), and the IntelliMetric™ Essay Scoring System developed by Vantage Learning (Elliot, 2003; Rudner et  al., 2006). These systems surprisingly have slightly higher performance measures than agreement between trained human expert raters. They have been used in scoring essay writing for high-stakes tests, such as the Analytic Writing Assessment of the Graduate Management Admission Test (GMAT ), Graduate Record Examination (GRE ), Test of English as a Foreign Language (TOEFL ), and Scholastic Aptitude Test (SAT ). The above tests require the test taker to finish writing within twenty-five or thirty minutes and measure the test taker’s ability to think critically and to present ideas in an organized way. Some tasks involve an analysis of an issue or opinion by citing relevant reasons or evidence to support the test taker’s perspective on a topic of general interest. Other tasks are the analysis of an argument by reading a brief argument, analyzing the reasoning and evidence behind it and writing a critique of the argument. Automated scoring systems such as Criterion™ (Attali & Burstein, 2006) and IntelliMetric™ (Elliot, 2003) are also used in electronic portfolio systems to help students improve their writing skills. To achieve this, these systems instantaneously report the students’ writing scores and provide holistic and diagnostic feedback on several features of their essays. More specifically, feedback is provided on the students’ essay organization, sentence structure, grammar usage, spelling, mechanics, and conventions. 
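
The train-and-validate workflow described above can be sketched as follows. The feature set, data, and rubric here are hypothetical and far simpler than those used by operational systems such as e-rater®; the sketch only illustrates fitting a regression on linguistic features and then checking agreement with human ratings on held-out essays.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

# Hypothetical features for ten essays:
# [grammar errors, mean word length, lexical diversity, discourse connectives]
X = np.array([
    [9, 4.1, 0.40, 3], [8, 4.2, 0.44, 4], [7, 4.3, 0.48, 5], [6, 4.5, 0.52, 6],
    [5, 4.6, 0.55, 7], [4, 4.8, 0.58, 8], [3, 5.0, 0.63, 9], [2, 5.2, 0.66, 11],
    [1, 5.4, 0.71, 12], [0, 5.6, 0.75, 14],
])
human = np.array([1, 1, 2, 2, 3, 3, 4, 5, 5, 6])   # ratings on a six-point rubric

# Divide the human-graded sample into a training corpus and a validation corpus
X_train, y_train, X_val, y_val = X[:7], human[:7], X[7:], human[7:]

model = LinearRegression().fit(X_train, y_train)   # tune the scoring model
raw_pred = model.predict(X_val)
rounded = np.clip(np.rint(raw_pred), 1, 6)         # map predictions onto the rubric

r, _ = pearsonr(raw_pred, y_val)                   # correlation with human scores
exact = float(np.mean(rounded == y_val))           # proportion of exact agreements
print(f"validation: Pearson r = {r:.2f}, exact agreement = {exact:.2f}")
```
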
Automated natural language processing techniques have been applied to the development of automated essay scoring, such as the Project Essay Grade (PEG ; Page, 2003), IntelliMetric™ (Elliot, 2003), e-rater® (Attali & Burstein, 2006; Burstein et  al., 1998), and the Intelligent Essay Assessor™ (IEA ; Landauer et  al., 2003). The e-rater® program grades essays on four types of analyses that are aligned with human scoring rubrics: (1) number of errors in grammar, usage, mechanics and

style; (2) discourse elements such as organization and development; (3) lexical complexity such as frequency and word length; and (4) vocabulary usage in the way of content vector analysis such as score point value and cosine correlation value (for details see Attali & Burstein, 2006). IntelliMetric™ analyzes more than 300 features at the levels of semantics, syntax, discourse, mechanics, and conventions. The semantic level of analysis is used to assess the development and elaboration of ideas, and the use of context and support for concepts. The Intelligent Essay Assessor™ (Landauer et al., 2003) analyzes the conceptual content of the essays with latent semantic analysis (LSA ; Landauer et al., 2007) as well as style (e.g., coherence) and mechanics (e.g., misspelled words). LSA is a mathematical, statistical technique for representing knowledge about words on the basis of a large corpus of texts that attempts to capture the knowledge of a typical test taker. LSA computes the conceptual similarity between words, sentences, paragraphs, or passages. Specifically, the meaning of a word W is reflected in other words that surround word W in natural discourse (imagine ten thousand texts or millions of words). Two words are similar in meaning to the extent that they share similar surrounding words. For example, the word love will have nearest neighbors in the same functional context, such as jealousy, affection, kiss, wedding and sweetness. LSA uses singular value decomposition (SVD ) to reduce a large corpus of texts to 100–500 dimensions (Landauer et  al., 2007). The conceptual similarity between any two text excerpts (e.g., word, clause, sentence, entire essay) is computed as the geometric cosine between the values and weighted dimensions of the two text excerpts. The cosine value typically varies from approximately 0 to 1 (see Landauer et  al., 2007, for more details). The Intelligent Essay Assessor™, uses a variety of expert essays or knowledge source materials, which are each associated with a “level” on their scoring rubric (e.g., content, style, and mechanics). LSA is then used to compare unscored test takers’ essays with different levels of expert essays or knowledge source materials. Automated essay scoring can produce reliable scores that have a high correlation with human scores (Attali et al., 2012), similar to interrater reliability among human rating scores (Burstein et  al., 1998; Elliot, 2003; Landauer et al., 2007). However, automated essay scoring is limited in its assessment of higher order aspects of writing (e.g., quality of ideas, essay development, and essay organization). This may raise a suspicion that automated grading techniques focus merely on word features and do not construct deep structured meanings. To some degree, this objection is correct. However, two counter arguments support the application of automated scoring. First, the essays in the high-stakes tests are written under extreme time pressure, which results in a large amount of ungrammatical content, semantically ill-formed content, and content that lacks cohesion. These language features are easily assessed by automated scoring tools. Second, human essay scoring can be less reliable than automated scoring tools. This is partly because human graders get tired. It is also because, when grading different essays, human raters may place emphasis on different rating criteria. 
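
A minimal sketch of the LSA pipeline just described (a term-by-document matrix reduced by singular value decomposition, with conceptual similarity computed as a cosine) is given below. The toy corpus and the choice of three dimensions are purely illustrative; real LSA spaces are trained on corpora of millions of words and use 100–500 dimensions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; an operational LSA space would be built from thousands of texts
corpus = [
    "the wedding guests felt affection and love for the couple",
    "jealousy can poison the love and affection between old friends",
    "running and walking are aerobic exercises that improve health",
    "regular walking improves heart health and lowers the risk of injury",
    "a good summary states the main idea and the most important details",
]

term_doc = CountVectorizer().fit_transform(corpus)   # term-by-document matrix
svd = TruncatedSVD(n_components=3, random_state=0)   # SVD reduces it to latent dimensions
doc_vectors = svd.fit_transform(term_doc)

# Conceptual similarity of two excerpts = cosine between their reduced vectors
sim = cosine_similarity(doc_vectors[2:3], doc_vectors[3:4])[0, 0]
print(f"similarity between the two exercise texts: {sim:.2f}")
```
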
These human-rater inconsistencies do not occur in automated scoring; the program grades the essay objectively based on typical and theoretically well-grounded features. Furthermore, human essay grading is time-consuming and expensive. As a compromise, the essays in large-scale tests, such as the GRE and TOEFL, are graded by one human rater and e-rater®. If there is a big discrepancy between the

human rater and e-rater® (e.g., greater than .5 standard deviation), another human rater is required to grade the essay. The average of the two human raters’ scores is then used as the final score.
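
The adjudication rule described above might be sketched as follows. The half-standard-deviation trigger comes from the description above; how the scores are combined when the human rater and e-rater® agree is an assumption made here for illustration, and the function and parameter names are invented.

```python
def final_essay_score(human_score, erater_score, score_sd, get_second_human):
    """Combine one human rating with an automated rating, calling in a second
    human rater when the two disagree by more than half a standard deviation."""
    if abs(human_score - erater_score) > 0.5 * score_sd:
        second_human = get_second_human()        # adjudication by another human rater
        return (human_score + second_human) / 2  # average of the two human scores
    # Assumption for illustration: when the scores agree, average human and machine
    return (human_score + erater_score) / 2

# Hypothetical essay scored on a 0-6 scale with a standard deviation of 1.0
print(final_essay_score(4.0, 5.5, 1.0, get_second_human=lambda: 4.5))  # -> 4.25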

Automated Measurements of Text Characteristics

Beyond a holistic grade, specific feedback on different characteristics of essays is valuable for students and instructors. As mentioned earlier, the criteria in e-rater® and the Intelligent Essay Assessor™ provide feedback on problems with spelling, vocabulary, syntax, cohesion of the message, missing content, and elements of style.

Automated Assessment of Syntax and Cohesion Coh-Metrix, another popular automated text analysis tool, has been increasingly applied in essay grading. Coh-Metrix was originally developed to provide automated analyses of printed text on a range of linguistic and discourse features, including word information (e.g., word frequency, word concreteness, part of speech), syntax, semantic relations, cohesion, lexical diversity, and genre (Graesser & McNamara, 2011a; Graesser et  al., 2011b; Graesser et  al., 2004; McNamara et  al., 2014). Recently, a formality measure has been added to Coh-Metrix for discourse style analysis (Graesser et al., 2014c; Li et al., 2013). Coh-Metrix exists as a free online public version (http://www.cohmetrix.com, version 3.0) and as an internal version. The public version provides 106 measures of language and discourse, whereas the internal research version has nearly a thousand measures that are at various stages of testing. According to a principal component analysis conducted on a large corpus of 37,351 texts (Graesser et  al., 2011b), the multiple measures provided by CohMetrix funnel into the following five primary dimensions: 1 Narrativity. Narrative text has a story-like style, with characters, events, places, and things that are familiar to the reader. Narrativity is closely affiliated with everyday oral conversation. 2 Deep cohesion. Causal, intentional, and temporal connectives help the reader form a more coherent and deeper understanding of the text. 3 Referential cohesion. High cohesion texts contain words and ideas that overlap across sentences and the entire text, forming threads that connect the explicit text together for the reader. 4 Syntactic simplicity. Sentences with few words and simple, familiar syntactic structures are easier for the reader to process and understand than complex sentences with structurally embedded syntax. 5 Word concreteness. Concrete words evoke vivid images in the mind and are more meaningful to the reader than abstract words. One of the primary goals of Coh-Metrix is to investigate the cohesion that distinguishes text types and predicts text difficulty (Graesser et al., 2011b; Graesser

et al., 2014c; McNamara et al., 2010). Previous studies have used Coh-Metrix to analyze essays and other writing samples. For example, McNamara et  al. (2010) used Coh-Metrix to distinguish high- and low-proficiency essays written by undergraduate students. They found that linguistic features associated with language sophistication could distinguish high-proficiency essays from low-proficiency essays. The higher quality essays featured less familiar words, more complex syntax and a greater diversity of ideas. Lexical proficiency indicators, such as lexical diversity, word hyponymy, and content word frequency, explained about half of the variance of human perception of lexical proficiency for essays written by the English language learner (Crossley et  al., 2010). In addition, lexical proficiency indicators can accurately classify 70 percent of essays into high and low proficiency (Crossley et al., 2011a). By contrast, many factors were not predictive of essay quality, such as word overlap between sentences, conceptual overlap measured by LSA , deep cohesion measured by causal connectives, and the use of various types of connectives. It is quite intuitive to find that the lexis and syntax are predictive of higher quality essays. These results are compatible with the culture of English teachers who encourage more erudite language. This is also confirmed by the research conducted by Crossley et  al. (2011b), in which both high school and college students apply more sophisticated words and more complex sentence structures in their essays as grade level increases. McNamara et al. (2015) adopted an automated essay scoring approach that involves hierarchical classification. The first level in their hierarchy divides essays according to their length (i.e., shorter essays and longer essays) in terms of the total number of words and paragraphs in each essay. Discriminant function analyses were subsequently conducted by iteratively predicting the essay scores through multiple rounds. For example, the first round focused on the length of the essay, the second round focused on relevance, and so on. McNamara et al. (2015) found that the quality of an essay at one level in the hierarchical analysis was not determined by the same indices of essay quality at another level in the hierarchy. Their hierarchical analysis for automated essay scoring revealed that the quality of shorter essays was predicted by high cohesion (i.e., lexical diversity, paragraph-toparagraph), the increased usage of sophisticated vocabulary and grammar, and the absence of references to more mundane and familiar topics such as social processes and religion.

Formality

Formality is a composite metric computed from five Coh-Metrix Text Easability Assessor (TEA) components to measure text complexity (Graesser et al., 2014c; Li et al., 2013). Formal discourse is the language of print, or sometimes of preplanned oratory, when there is a need to be precise, coherent, articulate, and convincing to an educated audience. Informal discourse has a solid foundation in oral conversation and narrative, replete with pronouns, verbs, adverbs, and reliance on common background knowledge. Formal language is expected to increase with grade level and with informational (over narrative) text. Therefore, we have formulated a composite scale on formality via Coh-Metrix that increases with more informational text, syntactic complexity, word abstractness, and high cohesion.
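
One way to operationalize such a composite, consistent with the description above, is to average standardized component scores while reversing the components that make a text less formal (narrativity, syntactic simplicity, and word concreteness). The sketch below illustrates this idea; it is not the exact published Coh-Metrix formula, and the component values are invented.

```python
from statistics import mean

def formality(narrativity, deep_cohesion, referential_cohesion,
              syntactic_simplicity, word_concreteness):
    """Composite formality from five (already standardized) easability components:
    higher cohesion raises formality; narrativity, syntactic simplicity, and
    word concreteness lower it."""
    return mean([deep_cohesion, referential_cohesion,
                 -narrativity, -syntactic_simplicity, -word_concreteness])

# Invented z-scores for an informational science text vs. a simple narrative
print(formality(narrativity=-1.2, deep_cohesion=0.8, referential_cohesion=0.6,
                syntactic_simplicity=-0.9, word_concreteness=-0.7))  # relatively formal
print(formality(narrativity=1.1, deep_cohesion=-0.2, referential_cohesion=0.1,
                syntactic_simplicity=0.8, word_concreteness=0.9))    # relatively informal
```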


Graesser et al. (2014c) reported that the Coh-Metrix formality scores had high correlations with Flesch-Kincaid (FK ) grade level (r = .72) in both the TASA (Touchstone Applied Science Associates, Inc.) corpus (37,520 texts from kindergarten through twelfth grade, Questar Assessment Inc.) and a Common Core Exemplar corpus (246 texts with four text samples used to compare the seven tools for scaling texts on difficulty; see Nelson et al., 2011, for details). The Coh-Metrix formality index was also highly correlated with Lexile scores (r = .66) in the TASA corpus. As expected, writing quality increased with grade level; therefore, we assume the Coh-Metrix formality index might be sensitive to the writing quality. As grade levels increase, students’ writings tend to be more informational, produce more complex syntactic structures, contain more abstract words, and are more cohesive. This hypothesis needs to be tested in a future study with human-annotated writing corpora.
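
Since the Flesch-Kincaid grade level serves above as a point of comparison, a minimal sketch of how it is computed follows. The grade-level formula is the standard one (0.39 x words per sentence + 11.8 x syllables per word - 15.59); the syllable counter here is a crude vowel-group heuristic used only for illustration.

```python
import re

def count_syllables(word):
    """Crude heuristic: count groups of consecutive vowels (minimum of one)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

sample = ("Academic language places heavy demands on students. "
          "It requires precise vocabulary and complex sentence structures.")
print(round(flesch_kincaid_grade(sample), 1))
```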

Automated Scoring in AutoTutor CSAL Technology for language analysis has also been applied in intelligent tutoring systems (ITS s). ITS s are computerized learning and assessment environments with computational models that track content knowledge through automated scoring, strategies, and other psychological states of learners or test takers, a process called student modelling (Sleeman & Brown, 1982; Sottilare et  al., 2013; Woolf, 2009). This student modelling process can be used in the assessment and evaluation of student language proficiency. An ITS adaptively evaluates and responds to students’ constructed-response or open-ended response with natural language that is both sensitive to these states and the advancement of the instructional curriculum. The interaction between student and computer follows a number of alternative trajectories that is tailored to individual students. Assessment of student performance in this turn-by-turn tutorial interaction is absolutely essential in any ITS . One assessment is the constructed responses that are selections among a fixed set of alternatives as in the case of multiple-choice questions, true-false questions, ratings, or toggled decisions on a long list of possibilities. Another key feature of an ITS is its ability to assess student input written in natural language. A number of ITS s have been developed that hold conversations in both constructed response and natural language as part of language learning and assessment. A number of other systems on reading and writing have been developed, such as iSTART (Levinstein et al., 2007; McNamara et al., 2004) and Writing pal (Allen et al., 2014; McNamara et al., 2013; Roscoe & McNamara, 2013). However, the remainder of this chapter will focus on AutoTutor CSAL (Center for the Study of Adult Literacy, http://www.csal.gsu.edu/content/homepage; Graesser et al., 2014b) because it has been systematically implemented and tested on the extent to which the computer accurately scores the students’ constructed and verbal responses. This system is also designed for language learners. AutoTutor CSAL (hereafter called AutoTutor) is an ITS that helps students learn about reading comprehension, computer literacy, physics, critical thinking skills, and other technical topics by holding conversations in natural language (Graesser et al., 2014a; Graesser et  al., 2008; Graesser et  al., 2014b). AutoTutor is an adaptive,

interactive tutoring system that helps the struggling adult readers learn and practice reading strategies to improve reading for understanding. Improving these skills ultimately improves their lives in the context of the workforce, family, and civic arenas. AutoTutor is a trialog-based ITS , that is, a human learner interacts with a teacher agent and a student agent. The learner interacts with the teacher agent (Cristina) and the student agent (Jordan). AutoTutor is particularly designed to help low knowledge and less skilled readers better understand practical text. The dialogue moves are implemented in this system and AutoTutor analyzes the natural language that the human learner provides. The teacher agent asks the questions and evaluates the responses of the learner and Jordan. The student agent, Jordan, sometimes demonstrates how to learn, but sometimes learns from the human learner. Considering the lower level of the learners and their computer skills, the majority of the questions are presented in conversation or in the format of multiple choice questions, whereas others are short-answer questions. Student responses are evaluated by the system and the teacher agent provides instant feedback to both the learner and Jordan. AutoTutor also produces dialogue moves to encourage the students to generate content and to improve their answers to the challenging questions or problems. AutoTutor focuses on both the training of strategy use and the accuracy of content understanding. AutoTutor facilitates the learning of reading strategies by simulating face-to-face tutoring. The students are first provided with information about strategies for review and typical rhetorical structures of the strategies (e.g., compare and contrast, problem solution, cause and effect). Then some adaptive and practical reading passages are provided to assess their understanding. Each lesson focuses on a particular reading strategy in a way that adapts text difficulty to the students’ performance. The student starts with a text at an intermediate level of difficulty. A text at an easy level of reading is subsequently presented if the student’s performance is low (e.g., lower than 70 percent of accuracy) on the first text of intermediate difficulty. In contrast, the reading difficulty level is increased on the second text if the student’s performance is high on the first text. AutoTutor attempts to mimic dialogue moves as human tutors do, which are described as below: 1 Short Feedback. The feedback is either positive (“Great!” “Fantastic!” *big smile*), negative (“Sorry, this is not correct!” “Oops, you did not get it!” *head shake*), or neutral (“hmm,” “Alright”). 2 Pumps. Nondirective pumps are provided to the student to try to get the student to talk (“What else?” “Say more about it!”). 3 Hints. The tutor provides hints to the learner to get the student to do the talking or the task. Hints are more specific than pumps in that they direct the learner along a conceptual path. Hints range from being generic statements or questions (“Why?” “Why not?”) to more direct speech acts that lead a student to a specific answer. Hints are ideal for scaffolding and promoting active student learning while also directing the student to focus on relevant material. 4 Prompts. The tutor asks a very specific question (i.e., a leading question) in order to get the students to articulate a particular word or phrase. Prompts are necessary because the students can sometimes say or contribute very

little. Prompts are sometimes needed to get the student to say something very specific. 5 Assertions. The tutor expresses a fact or state of affairs, such as stating the correct answer. Pump-hint-prompt-assertion cycles are frequent in an ITS to extract or cover particular expectations. For one particular question, once all of the expectations are covered, the exchange is finished for the main question or problem. During this process, the students occasionally ask questions, which are immediately answered by the tutor, and the students occasionally express misconceptions, which are immediately corrected by the tutor. Consequently, there are other categories of tutor dialogue moves: answers to student questions, corrections of student misconceptions, summaries, mini lectures, or off-topic comments. AutoTutor implements objective and subjective assessment items with the automated scoring. Figure  14.1 shows a snapshot of the interface for AutoTutor. There is a teacher agent (Cristina, top left), a student agent (Jordan, top right), and the human learner who interacts with the agents. Jordan is often overwhelmed by the information delivered from the text, so he asks for help processing information. Cristina suggests the human learner help Jordan summarize the text based on the Venn diagram, which lists the similarities (area of overlap between the two circles) and differences (the non-overlapping parts of the circles) between running and

FIGURE 14.1 Snapshot of Summarizing Strategy in the AutoTutor CSAL.

walking. First, the learner selects the best summary of the two (the first one). Then, Cristina asks the learner why this summary is better. The learner types in the answer in the chat box (bottom of Figure 14.1) and clicks the submit button to submit the answer. Table 14.1 shows the conversations that demonstrate a systematic conversational structure that is called an expectation and misconception-tailored (EMT ) trialog, which is described in Figure 14.1. We annotate in brackets with italics some of the dialogue move categories. It should be noted that sometimes Jordan, the student agent, gives correct or incorrect answers based on the probability of correctness of the human learner’s responses. Jordan’s score is always lower than the learner’s score in order to increase the human learner’s confidence. Cristina, the teacher agent, has the ground truth. Tim (the human learner) needs to choose and then elaborate on his answer. Therefore, the human is an active participant rather than merely a vicarious observer.

TABLE 14.1 Example of Trialog Conversation with AutoTutor CSAL

Cristina: This chart clearly shows differences and similarities between running and walking.
Jordan: Yeah.
Cristina: Tim, using the chart, which is the best summary of the passage? [Main Question]
Tim: (Clicks the first answer) Running and walking are both excellent aerobic exercises to improve health. However, running is more rigorous and more likely to cause injury than walking.
Cristina: Tim, could you please explain why you think this summary is good? [Elaboration Question]
Tim: This summary includes the similarities and differences.
Cristina: O.K. [Neutral Feedback]
Jordan: I agree with Tim. This summary seems to point out both the similarities and differences between running and walking.
Cristina: It is true these are similarities and differences, but how are they related to a good summary? [Hint]
Jordan: Oh, I remember a good summary is related to how important the information is. [Peer Hint] Tim, can you help me identify what information is included in this summary?
Tim: This summary includes the important information.
Cristina: Great, Tim! [Positive Feedback] This summary includes the most important information, but not the minor details. The summary has the most important similarities and differences between running and walking. [Summary]


The trialogs in AutoTutor help students evaluate the good summary with justification by using the automated evaluation of the human learner’s response. Two types of the learner’s answers are evaluated. The first consists of the accuracy of answers to the multiple choice questions, which is assessed based on the proportion of correct items. For example, in AutoTutor there are three choices for each question. In our code, the correct answer choice is labelled as correct and the other two choices are labelled as incorrect. If a student makes a correct choice, his or her score will be calculated based on a rubric (described below). Also, the student’s choice action will be saved in the student’s log file. As all of the assessment is objective, results between the automated assessment and human judgments showed the 100 percent correctness for the automatic scoring for the multiple choice question. There are automated scoring rubrics on the quality of the test taker’s multiturn interactions during the conversation. For example, if the learner’s answer is correct the first time, 1 point is given; if it is correct the second time after a hint or prompt, 0.5 point is given, and otherwise 0 credit is given. Therefore, the performance of the test taker is tracked to implement the adaptive characteristics of the learning or assessment system. A second evaluation of answers consists of answers to the open-ended responses. The semantic match between a student’s verbal input and the expectations are evaluated by Regular-Expressions (RegEx; Thompson, 1968) and latent semantic analysis (LSA ; Landauer et al., 2007). In AutoTutor, LSA measures the similarity between types of answers, such as correct answer, partial answer, incorrect answer, and misconception in the rubrics versus test taker’s answer in terms of words, clauses, sentences, and entire answer. The conceptual similarity is computed as the geometric cosine between the values and weighted dimensions of the two types of answers. Content authors and subject matter experts generate prototypical “good” and prototypical “bad” answers for each question. The value of the cosine typically varies from approximately 0 to 1. The threshold of matching score is set up as between .3 and .7. If the good match (i.e., match between student input and a target “good” answer) of either RegEx or LSA is greater than .7 and the bad match (i.e., match between student input and a target “bad” answer) is less than .7, a good answer will be detected. If the good match is greater than .3, but less than .7, a partial correct answer is identified. If the good match is less than .3 and bad match is greater than .7, an incorrect answer is detected. If both good match and bad match are greater than .7, an undetermined answer is detected. An irrelevant answer will be detected if both the good match and bad are less than .3. If there is no input, a blank answer is detected. Additionally, other speech act categories can be detected based on regular expressions, including metacognition (e.g., “I have no idea.”), metacommunication (e.g., “Can you say it again?”) and question (e.g., “How should I find the alert table?”). The type of feedback the student receives is based on the answer category the student’s input most closely matches. In order to investigate the reliability of automated assessment for the open-ended questions, we compared the automated answer categorization with the human judgment of the answer type in a pilot study. 
Participants were English learners (n = 39), who learned a summarizing strategy with AutoTutor. The short-answer question requires them to justify why they chose that particular summary. Three English native speakers classified the answer types based on the instructions in Table 14.2.


TABLE 14.2 Definitions of Answer Categories

Good: The answer covers the main information in the good answer.
Bad: The answer covers the main information in the bad answer.
Irrelevant: The answer is not related to either good or bad answers.
Metacognition: Statements that indicate that a person does or does not understand something (e.g., “I don’t know.”).
Metacommunication: Statements about communicating (e.g., “I don’t understand what you said.”).
Blank: There is no input.
Any: None of the above categories is identified.

Note: Above are descriptions of the answer categories that were provided to the human raters in the pilot reliability study.

Both good and bad answers were also provided to the judges as references. Good answer: A good summary should contain the main idea and important information of the passage. Bad answer: A good summary should contain more than just the supporting information. The reliability of the automated AutoTutor system was assessed against three human raters. Reliabilities included AutoTutor versus the three human judges and human versus human. Our results revealed reliabilities between the automated assessment and the human judges that were as high as, or higher than, those between humans. For example, the reliability (Cronbach α) of the good-answer category between AutoTutor and human raters was as good as that between human raters (.72 vs. .76). The reliability of irrelevant answers between AutoTutor and human raters was much higher than that between human raters (.81 vs. .70), as was the reliability for blank answers (.87 vs. .78). Future steps will involve fine-tuning our regular expressions and the alternative answers by using information from the human responses. We predict that these changes will improve the accuracy of our automated categorization. For example, the regular expression for metacognition in AutoTutor is as follows: [no idea | clue | sure | i forgo | et | (do | did not | don’t | don’t | didn’t | didn’t | doesn’t | doesn’t understand | know) | confus]. Two raters rated one participant’s answer “guess” as metacognition. As “guess” is not included in the regular expression, AutoTutor failed to detect this type of response. However, we can modify the regular expression and add “guess” and its alternatives to it. In the future, AutoTutor will detect this answer category.
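
The categorization logic described in this section (good and bad match scores compared against thresholds of .3 and .7, plus regular-expression categories such as metacognition) can be sketched as follows. The regular expression below is an illustrative variant that already includes “guess,” as proposed above; it is not the operational AutoTutor pattern, and the function name and the handling of scores exactly at the thresholds are assumptions.

```python
import re

# Illustrative metacognition pattern; the operational AutoTutor pattern differs
METACOGNITION = re.compile(
    r"no idea|not sure|i forg[eo]t|guess|(do|does|did)\s*(not|n'?t)\s*(know|understand)|confus",
    re.I,
)

def categorize(student_input, good_match, bad_match):
    """Classify an open-ended response from its semantic-match scores.
    good_match / bad_match are similarity scores (e.g., LSA cosines or RegEx
    matches) against expert-authored good and bad answers, in [0, 1]."""
    if not student_input.strip():
        return "blank"
    if METACOGNITION.search(student_input):
        return "metacognition"
    if good_match >= 0.7 and bad_match < 0.7:
        return "good"
    if good_match >= 0.7 and bad_match >= 0.7:
        return "undetermined"
    if 0.3 < good_match < 0.7:
        return "partial"
    if good_match <= 0.3 and bad_match >= 0.7:
        return "incorrect"
    return "irrelevant"

print(categorize("It has the main idea and the important details.", 0.82, 0.41))  # -> good
print(categorize("I guess I have no idea.", 0.10, 0.15))                          # -> metacognition
```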


AutoTutor is also an adaptive reading system: automated assessment is used to drive adaptive reading tasks. AutoTutor tracks students’ performance during the learning process and then assigns the next task based on their current performance. For example, each lesson in AutoTutor consists of three difficulty levels: easy, medium, and difficult. The student starts the task at the medium difficulty level, which usually consists of ten questions. AutoTutor keeps a record of the student’s responses and scores them cumulatively for the task. At the end of the task, AutoTutor automatically computes the final score. Whether the student’s next task will be easy or difficult is determined by whether the score on the most recent task exceeds a threshold (e.g., above 70 percent correct). More specifically, if the score is above the threshold, a more difficult reading is provided to challenge the student in the second task. If the score is below the threshold, an easier reading is provided to reduce difficulty. The assignment of tasks is reliable because it is based on an algorithm that calculates the proportion of correct items out of the total items. Therefore, automated scoring in AutoTutor can identify the learner’s current learning status and learning gaps, provide instant scaffolding, and deliver instant, high-quality feedback in the learning process. This instant scaffolding and feedback are essential components in a relatively new line of research that has begun to focus on learning-oriented assessment (Turner & Purpura, 2016).
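
A minimal sketch of this adaptive assignment rule is shown below. The function name is invented, and the behavior at the extreme levels (staying at easy or difficult) is an assumption, since the description above only covers moving up or down from the starting level.

```python
def next_difficulty(current_level, n_correct, n_items, threshold=0.70):
    """Choose the next reading task's difficulty from performance on the current
    task, following the rule described above (e.g., 70 percent correct)."""
    levels = ["easy", "medium", "difficult"]
    score = n_correct / n_items
    i = levels.index(current_level)
    if score >= threshold:
        return levels[min(i + 1, len(levels) - 1)]   # challenge the student
    return levels[max(i - 1, 0)]                      # reduce difficulty

print(next_difficulty("medium", 8, 10))   # 80% correct -> "difficult"
print(next_difficulty("medium", 5, 10))   # 50% correct -> "easy"
```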

AutoTutor and Learning-oriented Assessment

Standard student assessments have been described as problematic for a variety of reasons. A major drawback of standard student assessment, particularly summative assessment, is that it tends to be administered at the end of a course, thereby creating a “high-stakes” situation. In a standard classroom environment, the feedback students receive from summative assessments can be untimely and ineffective. Such assessments can also encourage students to memorize content rather than establish a deeper understanding of the material, and little learning occurs during the assessment itself. The learning-oriented assessment (LOA) framework, on the other hand, emphasizes student learning during assessment. It recognizes that L2 learning involves an interplay of individual knowledge and social-cognitive skills, and that simple memorization of content is insufficient for mastering a new language. LOA posits that the assessment process can itself be a learning opportunity. In an LOA framework, feedback can be provided promptly, allowing instructors to take note of students’ current knowledge space and allowing students to correct their misunderstandings before progressing through the course content. Turner & Purpura (2016) describe how L2 learning can take advantage of the LOA framework, particularly because assessment is already a central feature of L2 learning. Similarly, AutoTutor’s approach to reading comprehension and language learning places student assessment at the center of the learning process. LOA has several key dimensions, many of which fit neatly into AutoTutor’s existing learning framework. First, a dimension specific to LOA in L2 is the elicitation dimension (Turner & Purpura, 2016). This dimension involves planned language elicitations from L2 learners, designed by the instructor to observe the learner’s state of knowledge, skills, and abilities (KSAs) pertaining to the subject material. Within the ITS domain, these KSAs are referred to as the “student model” (Sottilare et al., 2013; Woolf, 2009). More specifically, in AutoTutor the student model is frequently updated with information elicited from the learner through dialogue moves and planned assessments. In their chapter, Turner & Purpura provide an example of a planned elicitation in which learners work with peers to compare their answers with an answer key. Similarly, AutoTutor’s trialog system situates the human learner in a learning environment along with a teacher agent and a student agent; the human learner interacts with both agents, and the student agent’s answers can be corrected by the human learner. As explained in the section describing automated scoring in AutoTutor CSAL, student assessments are essential in the development of the student model, which is used to determine when students can move forward to the next topic and at what pace. AutoTutor’s ability to assess students’ natural language responses automatically and accurately allows students to learn while they are being assessed, whether via the pump-hint-prompt-assertion cycles, the multiple-choice questions, or the short-answer questions. Finally, AutoTutor’s ability to categorize student answers into one of seven answer categories as effectively as or better than human raters enables it to provide very specific feedback to learners quickly.
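As an illustration of how a student model can gate progression, the following sketch keeps a running record of assessment scores per topic and advances the learner once a mastery criterion is met. It is a simplified, hypothetical example: the class name, the mastery rule (mean of the last three scores at or above 0.7), and the data structures are not taken from AutoTutor.

    from collections import defaultdict

    class StudentModel:
        # Hypothetical record of a learner's assessed performance, kept per topic.

        def __init__(self, mastery_threshold=0.7, window=3):
            self.scores = defaultdict(list)   # topic -> list of scores in [0, 1]
            self.mastery_threshold = mastery_threshold
            self.window = window

        def record(self, topic, score):
            # Update the model with the score from the latest planned assessment.
            self.scores[topic].append(score)

        def has_mastered(self, topic):
            # Advance only when the mean of the most recent scores reaches the threshold.
            recent = self.scores[topic][-self.window:]
            return len(recent) == self.window and sum(recent) / len(recent) >= self.mastery_threshold

    model = StudentModel()
    for s in (0.5, 0.8, 0.7, 0.9):
        model.record("main-idea comprehension", s)
    print(model.has_mastered("main-idea comprehension"))   # True: mean of (0.8, 0.7, 0.9) = 0.8

The point of the sketch is only that assessment results feed a persistent learner record, and that pacing decisions are read off that record rather than off any single response.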

Conclusion

Automated assessment of natural language and discourse has made enormous progress in recent decades, owing to advances in computational models and algorithms, machine learning, linguistic databases, and the theoretical understanding of discourse processes. These techniques have been implemented in text analyses and enable the prediction of text difficulty, text cohesion, genre, accuracy of grammar and word choice, quality of student contributions in intelligent tutoring systems, and student affective states reflected in discourse. All of these contribute to automated essay scoring and to the processing of students’ natural language in tutoring systems. Automated analyses of text and discourse have facilitated the development of language assessment, providing reliable and valid assessment at low cost compared with human assessment. Automated essay scoring, automated assessments, and feedback in tutoring systems are as good as, if not better than, human assessments. Automated assessment of essays and open-ended responses has evolved from low-level features such as word count, type-token ratio, and spelling to syntactic complexity, cohesion, genre, and the organization of essays. Current and future developments should continue to improve the reliability and validity of automated assessments of literal language, as well as techniques for assessing nonliteral or rhetorical language in essays.

Author Note

The research reported in this chapter was supported by the National Science Foundation (0325428, 633918, 0834847, 0918409, 1108845) and the Institute of Education Sciences (R305A080594, R305G020018, R305C120001, R305A130030). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these funding sources.

References Allen, L. K., Crossley, S. A., Snow, E. L., & McNamara, D. S. (2014). L2 writing practice: Game enjoyment as a key to engagement. Language Learning and Technology, 18(2), 124–150. Attali, Y., & Burstein, J. (2006), Automated Essay Scoring With e-rater® V.2. Journal of Technology, Learning and Assessment, 4(3), 1–31. Attali, Y., Lewis, W., & Steier, M. (2012). Scoring with the computer: Alternative procedures for improving the reliability of holistic essay scoring. Language Testing, 30(1), 125–141. Breland, H. M., Bridgeman, B., & Fowles, M. E. (1999). Writing Assessment in Admission to Higher Education: Review and Framework (College Board Report No. 99–3). New York, NY: College Entrance Examination Board. Boud, D. (2000). Sustainable assessment: rethinking assessment for the learning society. Studies in Continuing Education, 22(2), 151–167. Burstein, J. (2003). The E-rater scoring engine: Automated essay scoring with natural language processing. In M. D. Shermis & J. C. Burstein (Eds.), Automated Essay Scoring: A Cross-Disciplinary Perspective. Mahwah, NJ : Erlbaum, pp. 113–122. Burstein, J. C., Kukich, K., Wolff, S., Lu, C., & Chodorow, M. (1998). Computer analysis of essays. Paper presented at the annual meeting of the National Council of Measurement in Education, San Diego, CA . Crossley, S. A., Salsbury, T., & McNamara, D. S. (2011a). Predicting the proficiency level of language learners using lexical indices. Language Testing, 29(2), 243–263. Crossley, S. A., Salsbury, T., McNamara, D. S., & Jarvis, S. (2010). Predicting lexical proficiency in language learner texts using computational indices. Language Testing, 28(4), 561–580. Crossley, S. A., Weston, J. L., Sullivan, S. T. M., & McNamara, D. S. (2011b). The development of writing proficiency as a function of grade level: A linguistic analysis. Written Communication, 28(3), 282–311. Elliot, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. Burstein (Eds.), Automated Essay Scoring: A Cross-Disciplinary Perspective. Hillsdale, NJ : Erlbaum, pp. 71–86. Graesser, A. C., Keshtkar, F., & Li, H. (2014a). The role of natural language and discourse processing in advanced tutoring systems. In T. Holtgraves (Ed.), The Oxford Handbooks of Language and Social Psychology. New York: Oxford University Press, pp. 491–509. Graesser, A. C., Li, H., & Forsyth, C. (2014b). Learning by communicating in natural language with conversational agents. Current Directions in Psychological Science, 23(5), 374–380. Graesser, A. C., & McNamara, D. S. (2011a). Computational analyses of multilevel discourse comprehension. Topics in Cognitive Science, 3(2), 371–398. Graesser, A. C., & McNamara, D. S. (2012). Automated analysis of essays and open-ended verbal responses. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA Handbook of Research Methods in Psychology, Vol. 1: Foundations, Planning, Measures and Psychometrics. Washington, DC : American Psychological Association, pp. 307–325.

Graesser, A. C., McNamara, D. S., & Kulikowich, J. (2011b). Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher, 40(5), 223–234. Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004) Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, and Computers, 36(2), 193–202. Graesser, A. C., McNamara, D. S., Cai, Z., Conley, M., Li, H., & Pennebaker, J. (2014c). Coh-Metrix measures text characteristics at multiple levels of language and discourse. Elementary School Journal, 115(2), 210–229. Graesser, A. C., Rus, V., D’Mello, S. K., & Jackson, G. T. (2008). AutoTutor: Learning through natural language dialogue that adapts to the cognitive and affective states of the learner. In D. H. Robinson & G. Schraw (Eds.), Recent Innovations in Educational Technology that Facilitate Student Learning. Charlotte, NC : Information Age Publishing, pp. 95–125. Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated Scoring and Annotation of Essays with the Intelligent Essay Assessor™. In M. D. Shermis & J. Burstein (Eds.), Automated Essay Scoring: A Cross-Disciplinary Perspective. Hillsdale, NJ : Erlbaum, pp. 87–112. Landauer, T. K., McNamara, D., Dennis, S., & Kintsch, W. (Eds.). (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum. Levinstein, I. B., Boonthum, C., Pillarisetti, S. P., Bell, C., & McNamara, D. S. (2007). iSTART 2: Improvements for efficiency and effectiveness. Behavior Research Methods, 39(2), 224–232. Li, H., Graesser, A. C., & Cai, Z. (2013). Comparing two measures of formality. In C. Boonthum-Denecke & G. M. Youngblood (Eds.), Proceedings of the 26th International Florida Artificial Intelligence Research Society Conference. Palo Alto, CA : AAAI Press, pp. 220–225. McNamara, D. S. (Ed.). (2007). Theories of Text Comprehension: The Importance of Reading Strategies to Theoretical Foundations of Reading Comprehension. Mahwah, NJ : Erlbaum. McNamara, D. S., Crossley, S. A., & McCarthy, P. M. (2010). The linguistic features of quality writing. Written Communication, 27(1), 57–86. McNamara, D. S., Crossley, S. A., & Roscoe, R. (2013). Natural language processing in an intelligent writing strategy tutoring system. Behavior Research Methods, 45(2), 499–515. McNamara, D. S., Crossley, S. A., Roscoe, R. D., Allen, L. K., & Dai, J. (2015). A hierarchical classification approach to automated essay scoring. Assessing Writing, 23(1), 35–59. McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge, UK : Cambridge University Press. McNamara, D. S., Levinstein, I. B., & Boonthum, C. (2004). iSTART: Interactive strategy training for active reading and thinking. Behavior Research Methods, Instruments and Computers, 36(2), 222–233. National Reading Panel. (2000). Teaching Children to Read: An Evidence-Based Assessment of the Scientific Research Literature on Reading and its Implications for Reading Instruction (NIH Pub. No. 00–4769). Jessup, MD : National Institute for Literacy. Nelson, J., Perfetti, C., Liben, D., & Liben, M. (2011). Measures of Text Difficulty: Testing Their Predictive Value for Grade Levels and Student Performance. New York, NY: Student Achievement Partners. Page, E. B. (2003). Project essay grade: PEG . In M. D. Shermis & J. Burstein (Eds.), Automated Essay Scoring: A Cross-Disciplinary Prospective. Mahwah, NJ : Lawrence Erlbaum Associates, pp. 43–54. Roscoe, R. D., & McNamara, D. S. (2013). 
Writing pal: Feasibility of an intelligent writing strategy tutor in the high school classroom. Journal of Educational Psychology, 105(4), 1010–1025.

Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of the IntelliMetric essay scoring system. Journal of Technology, Learning and Assessment, 4(4), 1–22. Shermis, M. D., Burstein, J., Higgins, D., & Zechner, K. (2010). Automated essay scoring: Writing assessment and instruction. In E. Baker, B. McGaw, & N. S. Petersen (Eds.), International Encyclopedia of Education (3rd edn.). Oxford, UK : Elsevier, pp. 75–80. Sleeman, D., & Brown, J. S. (Eds.). (1982). Intelligent Tutoring Systems. Orlando, FL : Academic Press. Snow, C. (2002). Reading for Understanding: Toward an R&D Program in Reading Comprehension. Santa Monica, CA : RAND Corporation. Sottilare, R., Graesser, A., Hu, X., & Holden, H. (2013). Design Recommendations for Intelligent Tutoring Systems: Learner Modeling (Vol. 1). Orlando, FL : Army Research Laboratory. Thompson, K. (1968). Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6), 419–422. Turner, C. E., & Purpura, J. E. (2016). Learning-oriented assessment in second and foreign language classrooms. In D. Tsagari & J. Banerjee (Eds.), Handbook of Second Language Assessment. Berlin, Germany: De Gruyter Mouton, pp. 255–273. Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research on automated essay grading. Journal of Information Technology Education, 2(2), 319–330. Woolf, B. P. (2009). Building Intelligent Interactive Tutors. Burlington, MA : Morgan Kaufmann.


15 Future Prospects and Challenges in Language Assessments Sauli Takala, Gudrun Erickson, Neus Figueras, and Jan-Eric Gustafsson

ABSTRACT

The future prospects and challenges in language assessment are context-dependent: They emerge from past and anticipated developments. For this reason, the current chapter seeks to situate the discussion in the context of major disciplinary developments by addressing some key concepts and approaches in language assessment. By the same token, future prospects and challenges are seen as possible or probable trends needing or deserving attention, and some responses are suggested to new opportunities deemed possible or probable. The text is structured by addressing the whys, whats, hows, whos, and ands . . .? in language assessment, seen as key issues related to good and ethical practice. Connected to this view, learning, teaching, and assessing are considered a “pedagogical trinity.” The chapter concludes by outlining some possible/probable/desirable scenarios.

Introduction

The aim of this concluding chapter is to reflect on future prospects and challenges in language assessment, situating the discussion in major past developments. In the best of scenarios, using Sadler’s (1989, 2010) terminology based on Björkman (1972), it may also have a “feedforward” function in continued discussions and developments, thus making a constructive contribution to assessment literacy, as does the volume at large. However, in working on the text, and in reflecting on earlier discussions of developments in language testing and assessment (e.g., Davies, 2013b; Spolsky, 1977, 1995), we have come to realize that the title may also be somewhat misleading. It seems that prospects and challenges tend to remain fundamentally the same, while their particular manifestations may vary depending on conceptual and contextual developments. Recently, Bachman (2013) has suggested the same. The main reasons for the alleged persistence of challenges—plus ça change, plus c’est la même chose—have to do with: (1) assessment and evaluation being an integral part of all acting and being (Scriven, 1998), and (2) the strong links among teaching, learning, and assessment, which form an intrinsic and interacting “pedagogical trinity.” It is interesting to note that definitions and perceptions of assessment vary, historically and contextually, from strictly external and measurement-related definitions to pedagogically integrated views, and from basically negative or apprehensive opinions to distinctly hopeful and positive perceptions. Our own reasoning builds on the assumption that educational assessment has a dual function, with two overall, non-interchangeable aims: to promote learning and to promote equity, at individual as well as pedagogical and structural levels. In this, a number of characteristics common to all types of assessment can be identified, related in particular to transparency, validity, reliability, and respect. These concepts are broad and may be interpreted in different ways. However, it may be fruitful to start out from the general perspective of learning and fairness, and to relate this to fundamental assessment quality parameters. Based on this, different contextual and practical decisions can be made. There are obvious conceptual, ethical, and practical challenges involved, related both to the learning aim and to the equity aim. One of these is what is actually meant by assessment: how assessment is related to testing and measurement, and how it is related to the continuous communication of and about knowing that is embedded in regular teaching and feedback (Erickson & Gustafsson, 2014). Put differently, what message is conveyed to students if or when they are told not to worry about tests because their teacher prefers to assess all the time? The Common European Framework of Reference for Languages (CEFR, Council of Europe, 2001) cautions about the consequences if students perceive everything as assessment that is documented and used for summative purposes: “Continuous assessment . . . can, if taken to an extreme, turn life into one long never-ending test for the learner and a bureaucratic nightmare for the teacher” (p. 185). Our text is built around five fundamental questions, which focus on the whys, whats, hows, whos, and ands . . .? in language assessment, hence on purposes and contexts, constructs and criteria, formats and methods, agency, and uses and consequences. Throughout the chapter, there is special emphasis on what can be assumed to be the most immediate challenges, be they in everyday assessment in classrooms or in large-scale, high-stakes contexts.

Purposes and Contexts

Drawing on the above discussion, Figure 15.1 presents a possible conceptualization of how different purposes/functions of assessment are related to a range of contexts. The top half of each concentric circle represents a set of expanding contexts, and the bottom half typical types/forms of assessment related to each context.

FIGURE 15.1 Purposes of Assessment Related to Contexts.

This diagram illustrates two fundamental considerations: (1) activity at the individual and institutional levels is embedded in ever-broadening levels of generality, and (2) assessment needs and forms have to be adapted to the particular form and level of contextual embeddedness. The inner circles represent contexts where the processes of teaching (interaction in classrooms) and studying contribute directly to the quality and quantity of learning, which is assessed in a variety of ways. In some contexts there is testing and assessment at the school district level, and in many contexts there are national tests. These assessments often have the function of improving fairness in grading and/or accountability. The outer circles focus more on the yield of the educational system, which is often assessed using representative samples (periodic national assessments) and internationally produced instruments (international assessments). International language tests have a long history and serve a range of specific purposes. In the inner circles, individuals matter crucially as individuals, while at the more general levels they act as members of a representative sample of the whole universe of eligible individuals. Although obviously a simplified and static picture of the various dynamic contexts in which assessments are used, the diagram clearly shows the multiplicity of assessment purposes and uses, and also the potentially great impact the field of language testing and assessment has in language education. There has been a clear increase in the number of research reports related to the three outer circles, and publications on test development and use have also proliferated in the last twenty years. In the international field, Takala et al. (2013) have described the origins, evolution, issues, and challenges of joint international assessments, pioneered by the IEA (International Association for the Evaluation of Educational Achievement) in the 1960s. The IEA assessments of English and French as a foreign language in the early 1970s were followed up forty years later by the European Union-sponsored European Survey on Language Competences (European Commission, 2012). At the national or state level, several reports have been published on the development and use of language assessments organized and administered by exam boards (local or international) and/or government authorities, sometimes in collaboration with universities (for example, Austria, Germany, Sweden, University of St. Petersburg). Research related to the two inner circles, classroom assessment and teacher-led, learner-oriented assessment, is less visible and accessible, at least in the more traditional language testing fora—although much more in focus now than twenty years ago. This is somewhat surprising, as the first two chapters in the very first edition of Educational Measurement (Lindquist, 1951) discussed the functions of measurement in the facilitation of learning (Cook, 1951) and in the improvement of instruction (Tyler, 1951), with Tyler stating that “educational measurement is conceived, not as a process quite apart from instruction, but rather as an integral part of it” (p. 47). Several approaches, often relatively closely related, have been proposed. The interdisciplinary Assessment Reform Group in the United Kingdom, beginning in the 1980s and particularly active between 1996 and 2010, has had a visible impact on assessment research as well as on practice. The expression “assessment for learning,” often used interchangeably with “formative assessment,” is widely known, and work by researchers such as Black, Gipps, Harlen, Stobart, and Wiliam is often cited and used in teacher education and educational planning. Another “movement,” with an obvious relation to ideas of assessment for learning, is Dynamic Assessment (DA), based on Vygotskian sociocultural theory, specifically the concept of the “zone of proximal development” (ZPD; Vygotsky, 1978, p. 86). In DA, mediation of examinee performance is regarded as an integral part of the assessment process. There has also been research on this type of assessment in relation to language education (Lantolf & Poehner, 2008). A fairly recent development attracting increasing attention is learning-oriented assessment (LOA), which, according to Carless (2007), is governed by three principles: assessment tasks should stimulate learning, assessment should actively involve students, and feedback should also feed forward, thereby contributing to students’ learning. There is accordingly an attempt to reconcile the perspectives of formative and summative assessment, or, put differently, assessment for and of learning. Research on these forms of assessment is increasing, as are publications and presentations in different fora. The impact of these trends on the strengthening of assessment’s two basic functions, support for learning and support for fairness and equity, remains to be seen.

Constructs and Criteria

Construct Definition in Language Assessment

Bachman (2007) describes the different construct approaches in language testing and assessment across almost five decades, illustrating the long-standing debate on to
what extent abilities and language use contexts are closely related or clearly distinct and affect performance in language assessment tasks. He proposes a distinction among ability-based, task-based, and interaction-based approaches to construct definition. Questions about the construct are fundamentally questions about validity: What is focused on, and what interpretations and decisions are intended? Validity is usually considered the most important determinant of sound assessment practices and also the most difficult concept to explicate and teach. The latest Standards for Educational and Psychological Testing (2014, henceforth Standards) states unequivocally that validity is the most fundamental consideration in developing and evaluating tests. Statements about validity always need to refer to particular interpretations for specified uses, and it is declared that it is simply incorrect to use the unqualified phrase “the validity of the test” (p. 11). Developments in the concept of validity show that construct validity has been receiving increasing attention. For the past thirty years, the concept of unified construct validity elaborated on by Messick (1989) has played a key role: The validity of assessment-based interpretations is enhanced by ensuring a high degree of construct-relevant variance and a minimum of construct-irrelevant variance. Due to the impact of technology and digital communication, the way language(s) are accessed, learned, and used is changing rapidly. New styles of language use and new activities may force a reconceptualization of the abilities and traits to describe, especially for reading and writing. All of this affects the nature of language to be assessed and the way the abilities to be assessed (constructs) are defined. Where do the constructs come from? The main problem in answering this question is that there are hardly any “true” constructs waiting to be discovered through clever strategies. Constructs need to be defined, not discovered. Definitions are typically based either on theories or hypotheses concerning phenomena, for instance, communicative competence. Empirical definitions, in turn, are often derived from various kinds of needs analyses. Contextually, varying traditions of language education and assessment also play a role in how constructs are defined. Emerging patterns of communication and changing assessment needs will ensure that the definition of constructs will remain a perennial issue. The “native speaker” as a model is increasingly challenged and one alternative is “effective international communicator.” International and technological developments will continue to challenge current accepted norms. One obvious change in the view of constructs in language education and assessment was when the long-lived dichotomy between “active” and “passive” language skills was discarded and replaced by “receptive” and “productive.” Developments in cognitive psychology, in particular, had shown that reading and listening are very active processes (being an effort after meaning). The CEFR distinguishes among different communicative language activities and strategies: receptive, productive, interactive, and mediation (pp. 57–88). Models of communicative competence, inspired largely by Hymes (1972), have dominated the discussion on constructs since the seminal article by Canale & Swain (1980). Several modifications to their original model have since been proposed (e.g., Bachman, 1990; Bachman & Palmer, 2010). 
Six volumes in the Cambridge Language Assessment Series (2000–2006), edited by J. Charles Alderson and Lyle Bachman, represent a skills-oriented approach to
language assessment by surveying the conceptual and theoretical foundations of the specific skills and presenting how they have been and are being interpreted and implemented in assessment practices. In language assessment focused mainly on L1, models with a strong psycholinguistic/cognitive orientation have been proposed drawing on the work on speaking by Willem Levelt. His seminal work on the construct of speaking (Levelt, 1989) was slightly recast as a “blueprint of the speaker” (Levelt, 1999). Following Levelt, similar blueprints have been produced for listening, reading, and writing. They all illustrate the components and processes of the four skills.

Criteria

Where do criteria of quality come from? As with constructs, there are no self-evident criteria to be discovered. The most usual point of reference has been the native speaker, but, as noted above, this has been problematized (e.g., Davies, 2013a). Whose norms count? In empirical terms, criteria for levels of proficiency usually refer to linguistic accuracy and range, intelligibility, and appropriacy in terms of purpose and context (comprehension and comprehensibility). Criteria in achievement-related assessment usually also use syllabuses/curricula as a guideline.

Task-based Assessment

Task-based assessment (TBA), a paradigm that challenged the dominant ability-based assessment, emerged as a dynamic approach in the early 1990s and soon became a topic of intensive research and development work (for a good overview see Norris, 2009). Task-based assessment has obvious links to the teaching of language for specific purposes and addresses language primarily as a tool for “real-world communication” rather than a display of language knowledge. It has been claimed to facilitate the provision of concrete formative feedback and to make assessment results easier for different assessment users (stakeholders) to interpret and utilize. However, McNamara (1996) also pointed out that this kind of performance assessment faces some obvious challenges. He proposed a distinction between the strong and weak senses of TBA in second language performance. This distinction concerns the criteria: in the strong approach, “real world” task performance criteria are employed, whereas in the weak one the focus remains on language performance. Thus, task performance plays a different role in the criteria applied. The dilemma identified by McNamara remains a challenge for construct definition. Other issues are the generalizability of the outcomes and the modelling and reliable judging of the difficulty of tasks. A conceptual model that may prove useful in the analysis of these issues is provided by Crooks et al. (1996). We believe that the two approaches (ability-based and task-based) in fact complement each other. In our view, the primary function of language is the communication of meaning, and this communication is related to tasks in situations. To what extent ability versus task is of primary concern depends on the purpose of assessment.

Formats and Methods Operationalizing any construct definition requires making decisions on formats and methods. Such decisions imply the planning, development and choice of the instruments and the procedures that will be used to collect and analyze evidence. There are several textbooks, manuals, and online materials (e.g., Alderson et  al., 1995; ALTE , 2011; Downing & Haladyna, 2006) that address the technical aspects of such implementations and that provide very useful advice on how to approach the complex task. There are also Guidelines, Codes of Good Practice, Codes of Ethics and Standards, which address principles such as transparency, accountability, responsibility, and fairness and that provide frameworks for the identification and analysis of the implications of the social consequences of test use (see Spolsky, 2013, for a good overview of the influence of ethics in language assessment). With the amount of relevant literature available, it might be expected that the choices of formats and methods for the development of tests and exams for the three outer circles in Figure 15.1 are straightforward. This would seem the case despite their complexity and also despite the need to take more advantage of what technology has to offer in terms of development, delivery, analysis, reporting, item banking, and item archiving. The picture is less clear when decisions on formats and methods are to serve the two inner circles in Figure 15.1 or when they aim at crossing circle boundaries, as coexisting dimensions (Harlen, 2012) or modes (Bachman & Palmer, 2010) of assessment, such as formative-summative, implicit-explicit, spontaneous-planned, informal-formal. Those making the decisions need to ascertain how to make classroom assessments and institutional assessments useful and relevant to one another, and how to identify which conceptual or pragmatic accommodations need to be made (Harlen, 2012) to avoid either unnecessary repetition or conflict of interests. An additional issue to consider in such cases is how to resolve the tension between the growing need for accountability, reliability, portability, and transparency by social agents on the one hand, and the need for specificity and diversification so that individual needs can be met on the other.

Standard Setting

An important milestone in assessment was the development of the criterion-referenced approach (Glaser, 1963) as an alternative to norm-referenced assessment. It involved generalization into a well-defined content domain and the establishment of a pass score. Related to criterion referencing is a more recent approach usually called standard-based or standard-referenced assessment. It has emerged as a strong trend, which is likely to expand even if some assessment experts recommend caution about its use without a careful consideration of its pros and cons. The 2014 Standards define standard-based assessment as “an assessment of an individual with respect to systematically described content and performance standards” (p. 224), and refer explicitly to “standard setting.” Standard setting consists of a set of structured procedures, usually involving expert judgments at different stages, that map scores (in a broad sense) onto discrete performance levels specified by performance level descriptors (PLDs). PLDs usually consist of (1) content standards, which are statements defining what students should know and be able to do, and (2) performance standards, which are statements describing what good performance with respect to the content is expected to be. As such, this is not a particularly radical turn, as curricular frameworks have served the same function. The widely used CEFR, with its six-point global scale and fifty-two more specific scales related to communicative and language competencies, immediately attracted attention in language assessment circles. The Council of Europe was asked to provide guidance on how language examinations could be related to the CEFR reference level standards. A pilot manual was published in 2003 and the revised version in 2009 (Council of Europe, 2009); a reference supplement to the manual is also available (Council of Europe, 2004, 2009). Using the manual and the extensive standard-setting literature (e.g., Cizek & Bunch, 2007), a number of national and international examinations and tests are reported to have been successfully linked to the CEFR levels. Kaftandjieva (2010) provides a comprehensive review of standard setting and reports results from six different methods used to set standards on a test of reading in EFL. A large (and steadily growing) number of standard-setting methods have been developed, and new ones keep emerging. However, the methods can basically be divided into two groups: experts (panelists) judge the level of tasks/items (test-centered methods), or they judge the level achieved by candidates (examinee-centered methods). The latter approach presupposes that the panelists have sufficient first-hand knowledge of the candidates’ language abilities. Standard setting presents considerable challenges (see, e.g., Figueras et al., 2013). Reporting the results of standard setting requires sufficiently detailed documentation of procedures and indicators of how good the evidence is for the defensibility of the set cut-scores. If standard setting is based on the CEFR, the comparability of the cut-scores is also a major issue, often phrased as “How do I know your B1 is my B1?” If it is a matter of international standard setting, where the same tests are used, the challenge is somewhat smaller. If the standard setting concerns tests or examinations in a particular country, some form of (international) cross-validation seems useful or even necessary.
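To make the mapping from scores to performance levels concrete, the sketch below classifies a raw test score against a set of cut-scores of the kind a standard-setting panel might produce. The cut-score values and the CEFR-style labels are purely illustrative assumptions, not the outcome of any actual standard-setting study.

    # Hypothetical cut-scores from a standard-setting panel: the minimum raw score
    # (out of 40 items) required for each performance level, highest level first.
    CUT_SCORES = [("B2", 33), ("B1", 25), ("A2", 18)]

    def classify(raw_score):
        # Return the highest level whose cut-score the raw score reaches.
        for level, cut in CUT_SCORES:
            if raw_score >= cut:
                return level
        return "below A2"

    print(classify(27))   # B1
    print(classify(17))   # below A2

Comparability questions of the “How do I know your B1 is my B1?” kind arise precisely because different panels, tests, and methods can produce different cut-scores for nominally the same level.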

Advances in Quantitative Methods of Analysis

The toolbox of quantitative methods to support the development and use of language assessments provides a rich set of concepts, models, statistical techniques, and rules of thumb. This toolbox has been under development for more than 100 years, and there are no signs of development slowing down. While this is reassuring, the increasing number, complexity, and technical sophistication of psychometric techniques also presents challenges and bewilderment to users. Even developers of psychometric techniques sometimes experience a lack of contact between the substantive issues and the modern technology of measurement (Wilson, 2013). Classical Test Theory (CTT) models, such as the Spearman (1904) model and generalizability theory (GT; Cronbach et al., 1972), primarily investigate variance among persons and the influence of different sources of error. These models focus on individual differences in performance, which have been, and will continue to be, of great interest in much language assessment. But the focus on variance brings the disadvantage that the characteristics of an assessment are determined both by the instrument itself and by the group of persons investigated. Modern test theory, or Item Response Theory (IRT), models such as that developed by Rasch (1960) postulate that items and persons have invariant properties, which determine the probability of a particular response. Under certain assumptions, this approach yields estimates of item parameters that are the same across different groups of persons. With such items, IRT can be used to create a large number of tests that all produce comparable estimates of ability. Such scales can be very useful in criterion-referenced procedures, with the IRT models providing information about the amount of measurement error at different points along the scale. They can also support standard-setting procedures, and investigations of the properties of the items can give insight into the meaning of the scale. For these possibilities to be realized, the data need to satisfy some assumptions. Thus, all items should measure one and the same underlying ability. This unidimensionality assumption may be tested with different techniques, and frequently it must be rejected. However, it is also often found that the items satisfy the assumption of essential unidimensionality (Gustafsson & Åberg-Bengtsson, 2010), there being one dominant dimension along with a set of minor dimensions, which may be quite sufficient for trustworthy application of IRT techniques. IRT models are currently being extended in different directions, for example by allowing for multidimensionality, and IRT models that allow for the classification of persons into different categories on the basis of their response patterns are now also becoming available. However, even though we are likely to be provided with more complex and more powerful modelling techniques, it must be realized that good practice in language assessment will always require valid instruments/procedures and the sensible use of techniques that are appropriate to the problems at hand, whether these are classified as classical or modern methods.
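For readers less familiar with these two traditions, the contrast can be stated compactly. The formulation below is the standard textbook one rather than anything specific to this chapter: in CTT an observed score is decomposed into a true score and error, with reliability defined through the variances of those components, whereas in the Rasch model the probability of a correct response depends only on the difference between a person parameter and an item parameter.

    % Classical test theory: observed score = true score + error,
    % with reliability as the ratio of true-score to observed-score variance.
    X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^{2}}{\sigma_X^{2}} = \frac{\sigma_T^{2}}{\sigma_T^{2} + \sigma_E^{2}}

    % Rasch (1960) model: probability of a correct response by person v to item i,
    % given person ability \theta_v and item difficulty b_i.
    P(X_{vi} = 1 \mid \theta_v, b_i) = \frac{\exp(\theta_v - b_i)}{1 + \exp(\theta_v - b_i)}

Because the person and item parameters enter only through their difference, item parameters estimated in one group of persons are, under the model’s assumptions, the same as those estimated in another, which is what underlies the claim of invariance and the construction of comparable test forms.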

Agency in Assessment

The question Who? in language assessment, or more precisely the issue of agency, can be conceptualized and approached in different ways. The basic issue is who does what, and in what situations. Obviously, aspects of aims as well as outcomes, uses, and consequences are connected to this, emphasizing agency as an essential aspect of validity. In this, collaboration among different agents is of prime importance.

Learners

The agency of learners can be manifested in different ways, where self-assessment adds a dimension that may have beneficial internal as well as external impact—improving learners’ learning and teachers’ teaching. By reflecting on and estimating
their own ability in relation to explicit competences, students become potentially more aware both of where they are in their learning and where they are, or should be, heading. This, in turn, has been shown to enhance motivation and learning (Black & Wiliam, 2012; Harlen, 2012; Sadler, 1989). Self-assessment has a key function in the European Language Portfolio (ELP), an operative, pedagogical document accompanying the CEFR. Pedagogically, it can best be described as a means of linking learning and assessment with special emphasis on the individual learner’s development of language competence and of autonomy in this process (Little, 2005; 2009). Another example of assessment with learners as the main agents is students collaborating in peer-assessment, focusing on each other’s work with the aim of discussing and improving quality (Cheng & Warren, 2005). Like self-assessment, this requires substantial attention and practice, and it also presupposes a positive and respectful relationship among the students to function in the intended way. There are obvious advantages in involving students in assessment practices, be they classroom-based or large-scale, but also definite challenges. There are aspects of power involved in all types of assessment, hence also in those where students are the main agents (Gipps, 1999; Shohamy, 2001). For example, aspects of hierarchy among students, and between students and teachers, as well as effects of compliance, need to be taken into account, as well as the role of culture in a wide sense.

Teachers Historically, the role of teachers in assessment has attracted relatively little attention. Given the fact that this type of assessment constitutes the vast majority of all educational assessment, this is somewhat surprising. One reason may be that formal tests, in particular, large-scale exams, are often more crucial for individual students’ further education and life chances. Another reason may be the relative difficulty of distinguishing continuous assessment from regular teaching: While different in several respects, they both share important elements of good practice, regarding, for example, construct coverage, transparency, sensitivity, fairness, and respect (Erickson, 2010). Gradually, however, more attention has been paid to teacher assessment activities as an intrinsic part of pedagogy and classroom practice (the pedagogical trinity; cf. the reference to the work by the Assessment Reform Group in the United Kingdom). Regarding teachers’ role, the concept of feedback is crucial, as stressed, for example, by Hattie & Timperley (2007), Harlen (2012), Ramaprasad (1983), Sadler (1989) and Stobart (2012). Feedback plays an important role in students’ understanding of their own performances in relation to goals. It should serve as a bridge between current and intended performances in showing what can be done “to alter the gap in some way” (Ramaprasad, 1983, p. 4). Feedback needs to be given at the right time, be linked to specified goals, help students understand these goals, and provide information and strategies on how to proceed. Furthermore, students need to consider feedback worthwhile in order to use it (Stobart, 2012) and to be actively involved in the feedback process (Harlen, 2012). Conceptually, as well as practically, feedback also needs to be considered in relation to the whole pedagogical process
of learning and teaching, and regarded as a means of communicating about and co-constructing knowing as expressed in the goals to be achieved for the activity at hand (Erickson & Gustafsson, 2014). Thus, feedback has a clear function at the practical as well as at the metacognitive level.

External Agents External agents have an obvious role in assessment, for example, in providing teaching and assessment materials, and exams. This requires great responsibility, which, to some extent, is unique, but also distinctly related to the basic quality parameters mentioned previously. One of the main challenges is to handle the inevitable tension between requirements of transparency and collaboration on the one hand, and aspects of “trade secret,” and sometimes, commercial interests on the other. The Standards (2014) devote considerable attention to the proper role of external agents.

Uses and Consequences

The last question to be focused on in this chapter concerns the uses and consequences of educational assessment and the challenges related to them. A more direct way of asking this question, in parallel to the Why? What? How? and Who? previously addressed, would be And . . .?, namely, what will happen to the outcome of the assessment, be it a test of a traditional kind or, for example, performance assessment in a more or less authentic context? Using the terminology of Moss et al. (2006), what IDAs—inferences, decisions, and actions—will follow, and, adding Messick’s (1989) concern, what will the consequences be? It is essential to take three levels into account when approaching the issue of uses and consequences, namely the individual, pedagogical, and political levels. Also, a distinction between content/procedure and outcome seems called for, albeit with some degree of overlap. The way an assessment is designed and what it focuses on is one thing; the results obtained are another. It can be assumed that students perceive what is visibly assessed as an indication of what is considered important, and consequently of what should be learned, irrespective of whether it is for insight-related or instrumental reasons. Furthermore, assessment results affect individual students’ continued studies and learning, educational choices, and life chances. In addition, and never to be neglected, results, and the way they are communicated and handled, have an impact on motivation, self-perception, and self-esteem. Moreover, language assessment also takes place outside institutions and in contexts in which the consequences are even more drastic, for example in the case of migration and citizenship. This further emphasizes the aspect of power as well as the ethical dimension and responsibility embedded in assessment. And it obviously reminds us, again, of the necessity to analyze carefully the motives for assessments and, when engaging in different activities, to optimize the quality of construct definitions and methods in a wide sense, not least regarding procedures, instruments, and analyses. Assessments, including their outcomes, influence not only teachers’ pedagogical planning and performance, but also prioritizations at the school level. That this
happens is seldom questioned, but there is very little empirical evidence as to exactly what it concerns, and how and to what extent this impact is manifested in teaching, and even more complicated to establish, in learning outcomes. Studies have been conducted within the field (e.g., Alderson & Wall, 1993; Bailey, 1999; Purpura, 2009; Wall & Horak, 2011), but results are partly inconclusive. Also, a question that needs to be asked is whether assessment should be used to actively support, or even drive, changes in curricula and teaching practices. In analyzing this, a number of issues need to be taken into account, for example, whether such measures mainly risk influencing teaching in the direction of the test, or the opposite, function as a means of aligning teaching more closely to the [desired] curriculum. Impact of assessments also concerns the societal or political level. This can be exemplified by school results—be they local or national—affecting political opinion and also decisions concerning schooling. In addition, international surveys have an increasing influence on national analyses of education (the outermost circle in Figure  15.1). Examples of this are numerous such as the different shock waves following the publication of the Programme for International Student Assessment (PISA ) and Progress in International Reading Literacy Study (PIRLS ) findings. The reactions have concerned either issues related to international ranking or national trends or both. There is much to say about this, regarding the constructs and methods of the surveys as such, and not least, about the national interpretations and uses made of the results. However, the impact of international studies at the political level seems obvious and is strongly related to issues of quality and ethics of assessment in a wide sense.

Concluding Remarks In spite of more than fifty years of systematic research and development work on language assessment, it seems that the overarching, permanent challenge is the need to promote and strengthen assessment literacy in the widest possible sense to avoid frequent misconceptions and unrealistic expectations. Assessment literacy is likely to attract increasing attention in the whole field of education. The relationships, for example, between formative and summative purposes, diagnostic and placement aims, classroom assessment and external, sometimes large-scale, assessment, need to be studied in greater depth to elucidate what they have in common and what their specific features and uses are. This is of obvious importance to ascertain to what extent assessment data can be used to draw valid conclusions and to make valid decisions beyond their intended, primary uses and purposes. In this, one of the essential issues seems to be if, how, and to what extent, summative assessment can, or should, be used for formative purposes, and vice versa. We will conclude by briefly looking forward, relating our discussions to the five fundamental questions focused on at the outset of the chapter, that is, those concerning purposes and contexts, constructs and criteria, formats and methods, agency, and finally, uses and consequences. In relation to constructs underlying assessment and testing practices, a perennial challenge (see, e.g., Masters & Forster, 2003) is to avoid over-emphasizing more easily measured skills at the expense of competencies, such as reflection, critical analysis,
and problem solving. One concrete aspect of this would be to reflect on the question of whether we should focus only on testing comprehension of text or also give some attention to learning from text (e.g., in Content and Language Integrated Learning, CLIL ). Another example may be to explore closer links between testing/assessment in L1 and L2, perhaps especially, at higher levels of L2 proficiency—to bring in the assessment of literary/aesthetic uses of language (appreciation and use), translation, and interpretation (mediation). What would such a test look like? What would the criteria of quality be? Early literacy is also likely to emerge as a challenge for language assessment as its benefits as a complement and counterbalance to the strong focus on early oracy are being increasingly discussed. Looking at formats and methods, while many current approaches will continue to have valid uses, new methodological approaches need to be developed in response to emerging problems and challenges, and to make effective use of new conceptual and technological tools. Frameworks for developing tests and other assessment approaches, including standard setting, will be developed further, as will descriptions of stages and levels of language proficiency. This work will draw on progress in second language acquisition research determining criterial linguistic features characteristic of different stages/levels. A challenge in standard setting using performance level descriptions is related to the number of levels (cut scores) to be set. If a broad range of levels should be distinguished in a single test, the assessment tasks need to provide a challenge to, and yield useful information about, an equally broad range of proficiency. This might involve greater use of open-ended tasks that allow students to respond at a variety of levels, with considerable concomitant challenge to reliability. In principle, this approach is relatively easy to implement in classroom assessment: direct observations and judgments by teachers of students’ work and performances over time (e.g., in a portfolio). Another option is adaptive testing using an item bank. All this sets high demands on task construction and also on reliability. Integrative testing/assessment will be investigated more actively as modern information technology makes multimodal language use more common. This is one reason why the definition of constructs will also require continuous attention. The analysis of the structure of language ability may be revisited after the intensive work in the 1950s and 1960s. Regarding agency in assessment, there are many important issues to be addressed, not least concerning the importance and weight given to students’ self-evaluations. In what way should they, or should they not, be used by teachers in their pedagogical planning and individual decision making? Related to this, there is the question of students’ role and possible influence in low- and high-stakes testing contexts: In what way are their ideas elicited, voices heard and expertise utilized? These are questions that need to be analyzed from a wide perspective, related to aspects of effects as well as ethics. Another aspect of the Who? question has to do with the object of the assessment, frequently referred to as the student, not the knowledge of the student. This may be regarded as mere semantics, but is in fact of prime importance in the way assessment is defined, perceived, and used. 
It seems crucial to work for a change of expression, not allowing words to take power over thinking—or as Bachman (1990) puts it: “Whatever attributes or abilities we measure, it is important to
understand that it is these attributes or abilities and not the persons themselves that we are measuring” (p. 20). Finally, as for the uses and consequences of language testing and assessment, the challenges are vast, spanning the whole spectrum from classroom assessment to large-scale examinations and international surveys. This entails considerable, multifaceted responsibility at several levels: teachers and principals, test developers, researchers, administrators, test users, policy makers, and politicians. Responsibility is best supported by solid knowledge, which requires high-quality education at all levels. Furthermore, a consistent dialogue among different stakeholder groups enhances the possibility of valid uses of assessment, meaning adequate inferences, decisions, and actions leading to as many beneficial consequences as possible at the individual, pedagogical, and societal levels. To enable and enhance this dialogue, international networks will become increasingly important, supported by associations such as the International Language Testing Association, the European Association for Language Testing and Assessment, and the Association of Language Testers in Europe. Undoubtedly, more challenges and prospects could be added to those mentioned in the current text and volume, as it inevitably does not reflect the whole spectrum of views among the worldwide community engaged in language assessment. A quote from William Cowper’s eighteenth-century poem “Hope” expresses well what can be viewed as a key challenge of language assessment: “And diff’ring judgements serve but to declare | That truth lies somewhere if we knew but where.” Seeking the “truth,” or “truths,” entails the prospect of an un-ended effort after, and quest for, valid interpretations and decisions.

References Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge, UK : Cambridge University Press. Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14(2), 115–129. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC : American Educational Research Association. Association of Language Testers in Europe (ALTE ). (2011). Manual for Test Development and Examination: For use with the CEFR . Strasbourg, France: Council of Europe. Available from: http://www.coe.int/t/dg4/linguistic/Manuel1_EN.asp. [14 January 2015]. Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford, UK : Oxford University Press. Bachman, L. F. (2007). What is the construct? The dialectic of abilities and contexts in defining constructs in language assessment. In J. Fox, M. Wesche, D. Bayliss, & L. Cheng (Eds.), Language Testing Reconsidered. Ottawa, Canada: University of Ottawa Press, pp. 41–71. Bachman, L. (2013). Ongoing Challenges in Language Assessment. In A. J. Kunnan (Ed.), The Companion to Language Assessment: Volume III (1st edn.). Malden, MA : Wiley, pp. 1–18. Bachman, L. F., & Palmer, A. S. (2010). Language Assessment in Practice. Oxford, UK : Oxford University Press.


INDEX

academic formulas
  definition 214
  list (AFL) 214
academic language 264–268
activity theory 151–152
agency (in language assessment) 307–309, 311
authenticity
  definition 2, 103–104
  in language assessment 5, 103–123
automated scoring
  definition 281
  systems 282–285
automated text analysis 285–287
AutoTutor 287–293
  trialog 288–294
bias (see fairness)
cognitive labs 273–275
Coh-Metrix 285–287
Common Core 264
Common European Framework of Reference (CEFR)
  linking to 84–85
communicative language ability 17–34
consequences 309–310
construct 302–304
context of language use 17–34
conversational agent 287–293
corpora 52–53, 209–221
  learner corpus 210, 213–221
  reference corpus 210
diagnosis 125–140
Differential Bundle Function (DBF) 245–246
Differential Item Functioning (DIF) 244–249
English language
  development (ELD standards) 264
  learners (definition) 261
error
  Type I 248
  Type II 248
exam reform 64–66
factor analysis
  confirmatory 23
fairness (also bias)
  of a test 243–260
Georgia State Test of English Proficiency (GSTEP) 215
Global Scale of English 97–98
Intelligent Tutoring System (ITS) 287–293
interactional competence 191–192
International English Language Testing System (IELTS) 229–230
International Medical Graduates (IMGs) 225–239
Language Proficiency Assessment of Teachers (LPAT) 152–153
latent semantic analysis 284
learners’ background 126–128
Learning-oriented Assessment 293–294
multi-faceted Rasch Measurement 171–175
No Child Left Behind (NCLB) Act 263–264
Pearson Test of English (PTE) Academic 85
performance assessment 166–167
pragmalinguistic ability (see pragmalinguistic competence)
pragmalinguistic competence 190–191
pragmalinguistic knowledge (see pragmalinguistic competence)
pragmatics
  assessment 7, 189–202
  definition 190–191
profile analysis
  application of 243–260
  deviation profile 253
  expected profile 252–253
  observed profile 252
rating scales (also rubrics)
  revision 165–187
  types 167
reading
  diagnosis 125–139
  in a foreign language 125–139
  in a second language 125–139
rubrics (see rating scales)
sociopragmatic ability (see sociopragmatic competence)
sociopragmatic competence 190–191
sociopragmatic knowledge (see sociopragmatic competence)
speaking assessment 147–163, 273–275
spoken texts
  scripted 104–106
  unscripted 104–106
standard setting 225–241, 305–306
  examinee-centred approach 226
  expert panel 227
  Minimally Competent Candidate (MCC) 227–228
  stakeholder focus group 230–231
  test-centred approach 226
standards-based assessment 261–279
task
  definition 149–150
  difficulty 6, 150–151
validation (see validity)
validity
  a posteriori evidence 84, 89–96
  a priori evidence 84, 86–89
  argument-based 38–40, 196–201, 268–272
  construct 38
  definition 37
washback 61–77
