Virtual Standard Setting: Setting Cut Scores

Language Testing and Evaluation Series editors: Claudia Harsch and Günther Sigott

Volume 46

Notes on the quality assurance and peer review of this publication

Prior to publication, the quality of the work published in this series is reviewed by the editors of the series.

Charalambos Kollias

Virtual Standard Setting: Setting Cut Scores

Bibliographic Information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available online at http://dnb.d-nb.de.

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

ISSN 1612-815X
ISBN 978-3-631-80539-8 (Print)
E-ISBN 978-3-631-88904-6 (E-Book)
E-ISBN 978-3-631-88905-3 (E-PUB)
DOI 10.3726/b20407

© Peter Lang GmbH
Internationaler Verlag der Wissenschaften
Berlin 2023
All rights reserved.

Peter Lang – Berlin · Bern · Bruxelles · New York · Oxford · Warszawa · Wien

All parts of this publication are protected by copyright. Any utilisation outside the strict limits of the copyright law, without the permission of the publisher, is forbidden and liable to prosecution. This applies in particular to reproductions, translations, microfilming, and storage and processing in electronic retrieval systems.

This publication has been peer reviewed.

www.peterlang.com

To my wife Voula and my sons Raf and Thanos

Abstract: In an attempt to combat the high costs associated with conducting face-​to-​ face (F2F) cut score studies, standard setting practitioners have started exploring other avenues such as virtual standard setting. However, the impact that a virtual communication medium can have on panellists and their cut scores in a virtual standard setting has yet to be fully investigated and understood. Consequently, the aims of this study were to explore whether reliable and valid cut scores could be set in two synchronous e-​communication media (audio and video), to explore the panellists’ perceptions towards the two media, and to investigate whether virtual cut scores derived in a virtual environment were comparable with those derived in F2F environments. Forty-​five judges were divided into four synchronous virtual standard setting panels, each panel consisting of 9 to 13 judges. Each panel participated in a virtual workshop consisting of two sessions conducted through a different e-​communication medium (audio or video). In each session, judges employed the modified Yes/​No Angoff method for Rounds 1 and 2 and provided an overall judgement for Round 3 to set cut scores on two equated language examination instruments. To cater for order effects, test form effects, and e-​communication media effects, an embedded, mixed methods, counterbalanced research design was employed. Data were collected from three main sources: (1) panellists’ judgements; (2) survey responses; and (3) focus group interviews. The panellists’ judgements were evaluated through the many-​facet Rasch measurement (MFRM) model and classical test theory (CTT), the survey data were analysed through CTT, and the focus group interview data were analysed through the constant comparison method (CCM). The results were further interpreted through the lens of media naturalness theory (MNT). To compare virtual cut score results with F2F cut score results, data collected from an earlier F2F cut score study using the modified percentage Angoff method for two rounds were used. The findings from the MFRM and CTT analyses reveal that reliable and valid cut scores can be set in both e-​communication media. While no statistically significant differences were observed within and across groups and media regarding the panellists’ overall cut score measures, analysis of the open-​ended survey responses and focus group transcripts revealed that judges differed in their perceptions regarding each medium. Overall, the panellists expressed preference towards the video medium, a finding in line with MNT. The comparison of virtual cut score measures with F2F mean cut score measures yielded non-​significant results for Round 1 and Round 2. However, when final virtual cut score measures (Round 3) were compared with final F2F cut score measures (Round 2), the same was not observed for one of the two virtual groups in the video medium. Group 3 video medium cut score measures differed in a statistically significant way from the F2F cut score measures. This difference may be attributed to Group 3’s idiosyncrasies when setting a cut score in the video medium, as Group 3 set the highest cut score measure compared to the other groups. This study adds to the current limited literature of virtual standard setting, expands the MFRM framework for evaluating multiple virtual cut score studies, and proposes a framework for conducting, analysing, and evaluating virtual cut score studies.
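As a point of orientation for the MFRM analyses mentioned in the abstract, the following is a minimal sketch, in generic many-facet Rasch notation, of how dichotomous Yes/No standard setting judgements are commonly modelled. The facets shown here (judge leniency, item difficulty, and an e-communication medium effect) are illustrative assumptions rather than the study's exact specification, which is reported in Chapters 3 and 4.

```latex
% Illustrative MFRM formulation for a dichotomous Yes/No Angoff judgement:
% the log-odds that judge j rates item i "Yes" (i.e., judges that a borderline
% candidate would answer it correctly) when working in medium m.
\ln\!\left(\frac{P_{jim}(\mathrm{Yes})}{P_{jim}(\mathrm{No})}\right) = \beta_j - \delta_i - \gamma_m
% \beta_j  : leniency of judge j (higher values make a "Yes" rating more likely)
% \delta_i : perceived difficulty of item i
% \gamma_m : effect of e-communication medium m (audio or video)
```

Parameters of this general kind underpin the judge-level and group-level indices, and the differential functioning analyses (DJF, DMF, DGF), reported in Chapter 4.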

Acknowledgements This book is a revised version of my PhD thesis completed at Lancaster University in 2017. I would like to take this opportunity to express my appreciation to all those who contributed to this study either directly or indirectly. To begin with, I would like to thank all those who participated in either the pilot phase or the main study. Next, I would like to thank colleagues and friends, especially Karen Lee, Catherine Georgopoulou, and Nikos Pylarinos, as well as institutions such as the Panhellenic Federation of Language Centre Owners and the Hellenic American Union for assisting with recruiting participants for the main study. Further thanks go to two awarding organisations, the Hellenic American University for allowing me access to raw data and test instruments and to an international awarding body, wishing to remain anonymous, for providing me with sensitive information. A special thanks go to the following people who shared their expertise with me during this study. I would like to begin with Dr Mike Linacre for his ongoing Rasch measurement support throughout the study and Dr Thomas Eckes for also providing Rasch support. Next, I would like to thank Dr Spiros Papageorgiou, Dr Sauli Takala, and Dr Richard Tannenbaum for sharing their standard setting experiences with me. A further thanks to Dr Richard Tannenbaum for sharing his virtual standard setting experiences with me, paving the way for my research. Further, I would especially like to thank my PhD supervisor, Dr Luke Harding, for his tremendous support, patience, and professional guidance, without which the original thesis would never have been completed. Next, I would like to thank the editors of this book, Dr Claudia Harsch and Dr Günther Sigott for their valuable feedback and suggestions throughout the revision process that has led to this book extending far beyond the original thesis. Finally, I would like to thank my parents and friends, especially Konstantinos Drivas (aka the IT expert), for their patience, encouragement, and support. I must thank my family, my wife, Paraskevi (Voula) Kanistra (aka the co-​facilitator) and my two wonderful children, Raf and Thanos, for their patience, support, and understanding. Their unconditional love gave me the strength to complete this study. I must add that Voula provided extensive feedback and suggestions throughout the write-​ up of my thesis and the revised manuscript. I am extremely fortunate to have access to a language assessment expert, especially one with (virtual) standard setting experience, with whom I can exchange views

24/​7 literally. There is never a dull moment in our house when we are discussing language assessment, standard setting, and analysing data through the many-​ facet Rasch measurement (MFRM) model.

Table of contents

List of figures  19
List of tables  21
List of acronyms  23

Chapter 1: Introduction  25
1.1 Overview of the study  25
1.2 Scope of the study  27
1.3 Outline of the chapters  28

Chapter 2: Literature review  31
2.1 Background to standard setting  31
2.2 The importance of setting valid cut scores  33
2.2.1 Standard setting methods  34
2.2.1.1 Examples of test-centred methods  34
Variants of the Angoff method  34
The Bookmark method  36
The Objective Standard Setting (OSS) method  37
2.2.1.2 Examples of examinee-centred methods  39
The Borderline Group (BG) method and the Contrasting Group (CG) method  39
The Body of Work (BoW) method  39
2.2.2 Evaluating and validating standard setting methods  40
2.3 Standard setting in language assessment  42
2.3.1 Current LTA standard setting research  43
2.3.1.1 The first publicly available CEFR alignment studies  43
2.3.1.2 Studies investigating understanding of method or CEFR  44
2.3.1.3 Studies investigating external validity evidence  46
2.3.1.4 Studies proposing new methods/modifications  48
2.4 Challenges associated with standard setting  49
2.4.1 Theoretical and practical challenges  49
2.4.2 Logistics  50
2.5 Virtual standard setting  51
2.5.1 Virtual standard setting: Empirical studies  51
2.5.2 Challenges associated with virtual standard setting  55
2.6 Media naturalness theory  58
2.6.1 Re-evaluating virtual standard setting studies through MNT  59
2.7 Summary  60

Chapter 3: Methodology  63
3.1 Research aim and questions  63
3.2 Methods  64
3.2.1 Embedded MMR design  65
3.2.2 Counterbalanced workshop design  66
3.2.3 Instruments  67
3.2.3.1 Web-conferencing platform and data collection platform  67
3.2.3.2 Test instrument  70
3.2.3.3 CEFR familiarisation verification activities  71
3.2.3.4 Recruiting participants  73
3.2.3.5 Workshop surveys  77
3.2.3.6 Focus group interviews  80
3.2.3.7 Ethical considerations  83
3.3 Standard setting methodology  83
3.3.1 Rationale for the Yes/No Angoff method  83
3.3.2 Pre-workshop platform training  83
3.3.3 In preparation for the virtual workshop  86
3.3.4 Description of the workshop stages  86
3.3.4.1 Introduction stage  87
3.3.4.2 Orientation stage  88
3.3.4.2.1 CEFR familiarisation verification activity A  88
3.3.4.2.2 CEFR familiarisation verification activity B  89
3.3.4.2.3 Familiarisation with the test instrument  91
3.3.4.3 Method training stage  92
3.3.4.4 Judgement stage  93
Round 1 Stage  93
Round 2 Stage  94
Round 3 Stage  96
3.4 Data analysis methods and frameworks  98
3.4.1 CEFR verification activities analysis  99
3.4.2 Internal validity of cut scores  99
Classical test theory (CTT)  99
Rasch measurement theory (RMT)  100
The many-facet Rasch measurement (MFRM) model  102
3.4.3 Comparability of virtual cut score measures  103
3.4.4 Differential severity  104
3.4.5 Survey analysis  105
3.4.6 Focus group interview analysis  105
3.6 Summary  107

Chapter 4: Cut score data analysis  111
4.1 Cut score internal validation: MFRM analysis  111
4.1.1 Rasch group level indices  113
4.1.2 Judge level indices  118
4.2 Cut score internal validation: CTT analysis  123
4.2.1 Consistency within the method  124
4.2.2 Intraparticipant consistency  125
4.2.3 Interparticipant consistency  128
4.2.4 Decision consistency and accuracy  129
The Livingston and Lewis method  129
The Standard Error method  132
4.3 Comparability of cut scores between media and environments  133
4.3.1 Comparability of virtual cut score measures  134
4.3.2 Comparability of virtual and F2F cut score measures  138
4.4 Differential severity between medium, judges, and panels  140
4.4.1 Differential judge functioning (DJF)  140
4.4.2 Differential medium functioning (DMF)  141
4.4.3 Differential group functioning (DGF)  142
4.5 Summary  145

Chapter 5: Survey data analysis  147
5.1 Survey instruments  147
5.2 Perception survey instrument  147
5.2.1 Evaluating the perception survey instruments  148
5.2.2 Analysis of perception survey items  149
Qualitative comments for communication item 1  152
Audio medium  152
Video medium  153
Qualitative comments for communication item 2  156
Audio medium  156
Video medium  157
Qualitative comments for communication item 3  160
Audio medium  160
Video medium  161
Qualitative comments for communication item 4  163
Qualitative comments for communication item 5  165
Audio  165
Video medium  166
Qualitative comments for communication item 6  169
Audio medium  169
Video medium  170
Qualitative comments for communication item 7  173
Audio medium  173
Video medium  173
Qualitative comments for communication item 8  176
Audio medium  176
Video medium  177
Qualitative comments for communication item 9  180
Audio medium  180
The video medium  181
5.3 Procedural survey items  189
5.3.1 Evaluating the procedural survey instruments  189
5.4 Summary  190

Chapter 6: Focus group interview data analysis  193
6.1 Analysis of transcripts  193
6.2 Findings  196
6.2.1 Psychological aspects  196
Distraction in the video medium  196
Self-consciousness in the video medium  197
Lack of non-verbal feedback in the audio medium  197
Inability to distinguish speaker in the audio medium  198
Inability to discern who was paying attention in audio medium  199
Cognitive strain in the audio medium  199
6.2.2 Interaction  200
Lack of small talk in virtual environments  200
No digression from the topic in virtual environments  201
Differences in amounts of discussion between virtual and F2F settings  201
6.2.3 Technical aspects  202
Technical problems in virtual environments  203
Turn-taking system  205
6.2.4 Convenience  206
Time saved in virtual environments  206
Freedom to multi-task in virtual environments  206
Less fatigue in virtual environments  207
6.2.5 Decision-making in virtual environments  208
6.3 Summary  209

Chapter 7: Integration and discussion of findings  211
7.1 Research questions  211
7.1.1 Research questions 1, 2, and 3  211
7.1.2 Research question 4  212
7.1.3 Research question 5  219
7.2 Limitations  220
7.3 Summary  220

Chapter 8: Implications, future research, and conclusion  223
8.1 Significance and contribution to the field  223
8.2 Guidance for conducting synchronous virtual cut score studies  224
Demands for facilitators and/or co-facilitators  225
Establishing a virtual standard setting netiquette  225
Selecting a suitable virtual platform  225
Selecting an appropriate medium for the workshop  226
Recruiting online participants  228
Training in the virtual platform  228
Uploading materials  228
Monitoring progress and engaging judges  229
8.3 Recommendations for future research  229
8.4 Concluding remarks  230

Appendices  233
Appendix A CEFR verification activity A (Key)  233
Appendix B Electronic consent form  237
Appendix C Judge background questionnaire  238
Appendix D Focus group protocol  241
Introductory statement  241
Focus group interview questions  242
Appendix E Facilitator's virtual standard setting protocol  243
Appendix F CEFR familiarisation verification activity results  246
Appendix G: Facets specification file  248
Appendix H: Intraparticipant consistency indices  252
Appendix I: Group 5 group level and individual level Rasch indices  255
Appendix J: Form A & Form B score tables  256
Appendix K: DJF pairwise interactions  257
Appendix L: DGF pairwise interactions  263
Appendix M: Wright maps  266

References  271
Author index  289
Subject index  293

List of figures

Figure 2.1 The media naturalness scale  59
Figure 3.1 The study's embedded MMR design  66
Figure 3.2 Overview of counterbalanced virtual workshop design  66
Figure 3.3 The e-platform placed on the media naturalness scale  69
Figure 3.4 CEFR familiarisation verification activities  73
Figure 3.5 Surveys administered to each panel during each workshop  80
Figure 3.6 Focus group sessions  82
Figure 3.7 Example of e-platform: equipment check session  84
Figure 3.8 Example of e-platform: audio medium session  85
Figure 3.9 Example of e-platform: video medium session  85
Figure 3.10 Overview of the workshop stages for each session  87
Figure 3.11 Example of CEFR familiarisation verification activity A  89
Figure 3.12 Example of CEFR familiarisation verification activity B  90
Figure 3.13 Example of CEFR familiarisation verification activity feedback 1  90
Figure 3.14 Example of CEFR familiarisation verification activity feedback 2  91
Figure 3.15 Example of grammar subsection familiarisation  92
Figure 3.16 Example of Round 1 virtual rating form  93
Figure 3.17 Example of panellist normative information feedback  94
Figure 3.18 Example of Round 2 virtual rating form  96
Figure 3.19 Group 1 normative information and consequences feedback  97
Figure 3.20 Round 3 virtual rating form  97
Figure 3.21 Overview of the quantitative and qualitative data collected  99
Figure 3.22 Data analysis for internal validity: CTT  100
Figure 3.23 Data analysis for internal validity: RMT  103
Figure 3.24 CCM process for analysing focus group transcripts  106
Figure 3.25 Coding process within CCM  107

List of tables

Table 2.1 Summary of elements for evaluating standard setting  40
Table 2.2 Summary of standard setting expenses  50
Table 3.1 BCCE™ GVR section: Original vs. shortened versions  70
Table 3.2 Summary of workshop participants  76
Table 3.3 Examples of survey adaptations  78
Table 3.4 Materials uploaded onto virtual platforms  86
Table 3.5 Virtual session duration  98
Table 3.6 Overview of RQs, instruments, data collected, and analysis  108
Table 4.1 Group 1 group level Rasch indices  115
Table 4.2 Group 2 group level Rasch indices  116
Table 4.3 Group 3 group level Rasch indices  117
Table 4.4 Group 4 group level Rasch indices  118
Table 4.5 Group 1 individual level Rasch indices  120
Table 4.6 Group 2 individual level Rasch indices  121
Table 4.7 Group 3 individual level Rasch indices  121
Table 4.8 Group 4 individual Rasch level indices  122
Table 4.9 Psychometric characteristics of Test Form A and Test Form B  123
Table 4.10 All groups internal consistency within method check  125
Table 4.11 Intraparticipant consistency indices per round and test form  126
Table 4.12 Changes in ratings across Round 1 and Round 2  127
Table 4.13 Logit changes in ratings across Round 2 and Round 3  127
Table 4.14 Interparticipant indices: Form A  128
Table 4.15 Interparticipant indices: Form B  129
Table 4.16 Accuracy and consistency estimates for Form A raw cut scores  130
Table 4.17 Accuracy and consistency estimates for Form B raw cut scores  131
Table 4.18 Form A and Form B pass/fail rates  131
Table 4.19 Percentage of correct classifications per group and test form  133
Table 4.20 Round 1 virtual cut score measure comparisons  135
Table 4.21 Round 2 virtual cut score measure comparisons  136
Table 4.22 Round 3 virtual cut score measure comparisons  137
Table 4.23 Round 1 virtual and F2F cut score measure comparisons  138
Table 4.24 Round 2 virtual and F2F cut score measure comparisons  139
Table 4.25 Round 3 virtual & Round 2 F2F cut score measure comparisons  139
Table 4.26 DMF analysis of all judgements per medium  141
Table 4.27 DMF analysis of all judgements per medium, within test form  142
Table 4.28 DGF analysis across all judgements between media per group  142
Table 4.29 Round 1 DGF pairwise interactions within groups  143
Table 4.30 Round 2 DGF pairwise interactions  144
Table 4.31 Round 3 DGF pairwise interactions  144
Table 5.1 Psychometric characteristics of perception survey instruments  148
Table 5.2 Frequency data of the perception survey instruments  149
Table 5.3 Wilcoxon signed-rank test/Sign test communication item 1  151
Table 5.4 Wilcoxon signed-rank test/Sign test communication item 2  155
Table 5.5 Wilcoxon signed-rank test/Sign test communication item 3  159
Table 5.6 Wilcoxon signed-rank test/Sign test communication item 4  162
Table 5.7 Wilcoxon signed-rank test/Sign test communication item 5  164
Table 5.8 Wilcoxon signed-rank test/Sign test communication item 6  168
Table 5.9 Wilcoxon signed-rank test/Sign test communication item 7  172
Table 5.10 Wilcoxon signed-rank test/Sign test communication item 8  175
Table 5.11 Wilcoxon signed-rank test/Sign test communication item 9  179
Table 5.12 Wilcoxon signed-rank test/Sign test communication item 10  182
Table 5.13 Wilcoxon signed-rank test/Sign test communication item 11  184
Table 5.14 Wilcoxon signed-rank test/Sign test platform item 1  186
Table 5.15 Wilcoxon signed-rank test/Sign test platform item 2  188
Table 5.16 Psychometric characteristics of procedural survey instruments  189
Table 5.17 Frequency data of procedural survey instruments  190
Table 6.1 Coding scheme  195
Table 8.1 Virtual standard setting platform framework  227

List of acronyms

1PL  One-parameter Logistic Model
3DC  Data-Driven Direct Consensus
BCCE™  Basic Communication Certificate in English
BG  Borderline Group
BPLDs  Borderline Performance Level Descriptor
BoW  Body of Work
CAE  Cambridge English Advanced
CCM  Constant Comparative Method
CEFR  Common European Framework of Reference for Languages
CG  Contrasting Group
CR  Criterion-Referenced
CRT  Criterion-Referenced Testing
CTT  Classical Test Theory
DGF  Differential Group Functioning
DMF  Differential Medium Functioning
DJF  Differential Judge Functioning
DRM  Digital Rights Management
ESL  English as a Second Language
F2F  Face-to-Face
FDR  False Discovery Rate
FWER  Family-Wise Error Rate
GDPR  General Data Protection Regulation
GMAT  Graduate Management Admissions Test
ICC  Intraclass Correlation Coefficient
IRT  Item Response Theory
IT  Information Technology
KSAs  Knowledge, Skills, & Abilities
LTA  Language Testing & Assessment
M  Mean measure
MBA  Master of Business Administration
MC  Multiple-Choice
MFRM  Many-Facet Rasch Measurement Model
MMR  Mixed Methods Research
MNT  Media Naturalness Theory
MPI  Misplacement Index
MST  Measurement Scale Theory
NDA  Non-Disclosure Agreement
NR  Norm-Referenced
NRT  Norm-Referenced Testing
OIB  Ordered Item Booklet
OSS  Objective Standard Setting
PC  Personal Computer
PL  Performance Label
PLDs  Performance Level Descriptors
RMSE  Root Mean-Square Standard Error
RMT  Rasch Measurement Theory
RP50  Probability of correct response .50
RP67  Probability of correct response .67
RQs  Research Questions
S.D.  Standard Deviation
S.E.  Standard Error
SEc  Standard Error of Cut Score Measure
SEM  Standard Error of Measurement
SMEs  Subject Matter Experts
SR  Selected Response
TOEFL®  Test of English as a Foreign Language™
TOEIC®  Test of English for International Communication™
TSE®  Test of Spoken English™
TWE®  Test of Written English™
UREC  University's Research Ethics Committee

Chapter 1: Introduction

The purpose of this chapter is to provide a broad introduction to the study. The chapter is divided into three main sections, with the first section providing an overview of the study. The next section discusses the scope of the study, while the final section presents the structure of the study.

1.1 Overview of the study

The overall aim of the study was to further investigate virtual standard setting by examining the feasibility of replicating a F2F standard setting workshop conducted in 2011 in two virtual environments, audio-only (henceforth "audio") and audio-visual (henceforth "video"), and to explore factors that may impact cut scores. First, standard setting, as used in the study, is defined and the practical challenges associated with it are presented. Next, an overview of the findings from the few empirical virtual standard setting studies that have been conducted is presented and areas of virtual standard setting which warrant further investigation are discussed. Finally, the rationale for the study and the contributions it sought to make are presented.

Standard setting is a decision-making process of setting a cut score – a certain point on a test scale used for classifying test takers into at least two different categories (Cizek, Bunch, & Koons, 2004; Hambleton & Eignor, 1978; Kaftandjieva, 2010). The standard setting process usually entails recruiting a group of panellists to complete a variety of tasks with the aim of recommending a cut score which usually equates to a pass/fail decision on a certain test instrument. Some of the key challenges associated with conducting a standard setting workshop range from purely academic issues such as selecting the most appropriate method to set cut scores to very practical issues such as recruiting panellists and arranging accommodation. It is such practical issues involved in conducting a cut score study that may result in such workshops either not being replicated at regular intervals (Dunlea & Figueras, 2012) to examine whether cut scores have changed or, in some cases, not being conducted at all (Tannenbaum, 2013).

Recruiting panellists for a standard setting workshop places a heavy financial burden on the awarding body commissioning the cut score study. The external costs associated with conducting such a study usually entail hiring a suitable venue, offering panellists a financial incentive for participating in the study (per

diem or lump sum) and when panellists are associated with a university, the university also receives a sum for contracting their lecturers. Furthermore, when an awarding body has limited human resources, it may need to hire temporary staff to help with the amount of preparation needed to conduct the workshop. For example, a large volume of photocopies needs to be made so that all panellists have their own sets of materials (i.e., training materials, the test instrument, ratings forms, etc.) that will be used during the study. In the cases where the awarding body cannot conduct the cut score study themselves, standard setting practitioners need to be contracted for the study. There are also internal costs associated with standard setting meetings such as internal meetings held amongst staff to organise the cut score studies, the follow-​up meetings to discuss the recommended cut scores and their implications, and even the write-​up of the cut score study itself. In some studies, qualified internal staff may participate as panellists in the standard setting sessions to reduce the external costs. The time that internal staff devote to performing the activities is time reduced from their everyday activities, duties, and responsibilities, which usually equates to there being a backlog of work to be done. Some standard setting practitioners (Harvey and Way, 1999; Katz, Tannenbaum, & Kannan, 2009; Schnipke & Becker, 2007) have started exploring the feasibility of setting cut scores in virtual environments to offset the external costs associated with F2F standard setting. Virtual environments here are defined as artificial environments in which geographically isolated participants engage in computer-​ mediated conversation with one another through e-​ communication tools (i.e., emails, audio-​conferencing, and videoconferencing). The very few empirical virtual standard setting research studies that have been published (Harvey & Way, 1999; Katz & Tannenbaum, 2014; Katz, Tannenbaum, & Kannan, 2009) to date have confirmed that it is feasible to conduct a standard setting workshop in (1) an asynchronous virtual environment –​one in which panellists are not necessarily in the virtual environment at the same time or in (2) a combined synchronous and asynchronous environment, in which one or more parts of a cut score study are conducted in real time, while other parts are conducted offline. These studies have also revealed that virtual standard setting can be conducted through different e-​communication media such as emails, audio-​conferencing and/​or call conferencing and even through a combination of audio-​conferencing and videoconferencing. While such findings paint a positive picture of virtual standard setting, it is an area of standard setting that still remains under-​investigated. The empirical virtual standard setting studies published to date have been conducted in a series of smaller sessions. However, in a F2F setting the duration

of a cut score study on a language examination may range from approximately 1 to 1.5 days, when a cut score is to be set on one single instrument measuring a single skill (e.g., listening, reading, writing, or speaking) to even eight days when multiple cut scores need to be set on multiple instruments. The feasibility of the length of the virtual sessions has yet to be investigated. The demands placed on both the panellists’ equipment (i.e., computers, cameras, microphones, bandwidth requirements, etc.) and on the panellists themselves (e.g., fatigue, motivation, distractions, etc.) may be too great, resulting in some of the participants withdrawing from the study or the study itself not being completed. Little is known about whether an appropriate e-​communication medium to conduct a virtual standard setting study exists, and if so, how a standard setting workshop might best be conducted within that medium. None of the published virtual standard setting studies have compared two different e-​ communication media (i.e., audio and video) to explore whether using different e-​communication media (i.e., audio-​conferencing, videoconferencing) results in comparable and/​or equally reliable cut scores. What is also not clear is to what degree the virtual medium can affect panellists’ decision-​making processes and/​ or their perceptions and evaluations of the virtual environment. A related issue is how such perceptions are to be evaluated. In the literature on standard setting, specific guidance for conducting and evaluating cut scores is provided (Cizek & Earnest, 2016; Council of Europe, 2009; Kaftandjieva, 2004; Kane, 2001; Pitoniak, 2003; Zieky, Perie, & Livingston, 2008); however, the translation of this guidance to the virtual environment requires further exploration.

1.2 Scope of the study

This study seeks to address the gap that exists in the virtual standard setting literature. The aim of this study was threefold. The first aim was to investigate whether a particular e-communication medium (audio or video) was more appropriate than the other when replicating a F2F standard setting workshop. The aim was addressed through (1) selecting a web-conferencing platform for the study which could be used for both audio-conferencing and videoconferencing and (2) recruiting four groups of panellists to participate in two synchronous virtual sessions lasting approximately six hours (with breaks) each. The second aim was to investigate whether the cut scores set via the two e-communication media (audio and video) were reliable and comparable, and as such would allow valid inferences to be drawn for cut score interpretations, and whether the virtual cut scores were comparable with previously set F2F cut scores. This aim was addressed through employing an embedded mixed methods,

counterbalanced research design. To explore the comparability of the virtual cut scores between and across panels and media, two similar test instruments previously equated through the Rasch model were used. The reliability and the internal validity of the virtual cut scores were investigated by applying Kane's framework (Kane, 2001). The virtual cut scores were also compared with cut scores previously set on the same test instruments in a F2F environment. The third aim was to explore whether either of the e-communication media (audio and video) affected the panellists' decision-making processes as well as the panellists' perceptions and evaluations of how well they communicated in each medium. This aim was investigated quantitatively through an analysis of survey data and qualitatively through an analysis of open-ended survey questions and focus group transcripts. The quantitative and qualitative findings were integrated and discussed with reference to media naturalness theory (MNT) (Kock, 2004, 2005, 2010) to gain new insights into virtual standard setting.

The study sought to contribute to the limited research in virtual standard setting in three ways: (1) theoretical; (2) practical; and (3) methodological. The first contribution of the study was (i) to provide evidence of the theoretical feasibility of conducting a synchronous virtual standard setting study, simulating F2F conditions, and (ii) to test a theoretical framework for evaluating qualitative data collected from virtual standard setting panellists by drawing from the principles of MNT. The next contribution was a practical framework for conducting virtual standard setting, offering guidance to standard setting practitioners. The final contribution of the study was to provide a methodological framework for analysing multiple panel cut scores through equating and anchoring test instruments to their respective difficulty levels. It also added to the scarce literature on evaluating cut score data through MFRM (Eckes, 2009; Eckes, 2011/2015; Hsieh, 2013; Kaliski et al., 2012).
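As a concrete illustration of how panel judgements become a cut score, the sketch below shows how Yes/No Angoff ratings (the judgement method adopted in this study, described in Chapter 3) are typically aggregated into a recommended raw cut score: each judge decides, for every item, whether a borderline (just-passing) candidate would answer it correctly; a judge's cut score is the number of "Yes" ratings, and the panel's recommendation is an average of the judge cut scores. The ratings, panel size, and use of the mean here are illustrative assumptions; the study's modified procedure, its rounds of feedback, and the Rasch-based cut score measures are described in Chapters 3 and 4.

```python
from statistics import mean, median

# Illustrative Yes/No Angoff ratings: for each judge, one True/False ("Yes"/"No")
# judgement per item, indicating whether a borderline candidate would answer
# that item correctly. Panels in the study had 9 to 13 judges and many more items.
ratings = {
    "Judge 1": [True, True, False, True, False, True, True, False, True, True],
    "Judge 2": [True, False, False, True, True, True, False, False, True, True],
    "Judge 3": [True, True, True, True, False, False, True, False, False, True],
}

# A judge's cut score is simply the number of items judged "Yes".
judge_cut_scores = {judge: sum(r) for judge, r in ratings.items()}

# The panel's recommended raw cut score aggregates the judge cut scores;
# the mean is shown here, with the median as a common alternative.
panel_cut_mean = mean(judge_cut_scores.values())
panel_cut_median = median(judge_cut_scores.values())

for judge, cut in judge_cut_scores.items():
    print(f"{judge}: {cut}/10")
print(f"Panel cut score (mean): {panel_cut_mean:.1f}")
print(f"Panel cut score (median): {panel_cut_median}")
```

In the study itself, judges revisited such ratings in Round 2 after normative feedback and gave an overall judgement in Round 3, so the final recommended cut score is not a single-pass calculation like the one above.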

1.3 Outline of the chapters

This study is presented in eight chapters. Chapter 1 provides the introduction, while Chapter 2 provides a review of the literature, with a particular focus on conducting standard setting in virtual environments. First, standard setting is defined in relation to norm-referenced and criterion-referenced test score interpretations and then defined for the purpose of this study as a decision-making activity. Second, the importance of standard setting is described, key elements in its evaluation are discussed, and examples of standard setting methods are presented. Third, the role of standard setting in the field of language

testing and assessment (LTA) is discussed and current standard setting research is presented. Fourth, associated challenges of conducting F2F standard setting are discussed. Next, the few virtual standard setting studies reported to date are critically evaluated and associated challenges of conducting virtual standard setting are presented. Finally, MNT is presented, and the virtual standard setting studies are re-evaluated through its principles to identify the gap in the research literature.

Chapter 3 presents the research methodology employed in the study. First, the research questions are introduced and the methodology to answer the questions is presented and discussed. Second, the standard setting methodology is described in chronological order, ranging from the selection of the virtual platform and recruitment of the participants to conducting the cut score study. Finally, the data analysis methods and frameworks are presented.

Chapter 4 provides the results of the quantitative analysis of the cut scores. First, the results of each panel's judgements are presented and discussed in accordance with standard setting validation strategies conducted through classical test theory (CTT) and Rasch measurement theory (RMT). Next, the comparability of the cut scores between and/or across media, test forms, panels, and environments is discussed.

Chapter 5 provides the results of the survey data. First, the survey instrument is discussed together with its psychometric characteristics. Next, a quantitative and qualitative analysis of a subset of survey items relating to the panellists' perceptions of each medium is presented, discussed, and compared. Finally, a brief quantitative analysis of the remaining items collecting procedural evidence is presented.

Chapter 6 provides the results of the focus group interviews. The findings are presented and discussed according to the themes that emerged from constant comparison analysis. The perceived advantages and disadvantages of each virtual medium are presented and discussed.

Chapter 7 integrates the findings of the study and discusses them in connection with the main research questions. Next, the research questions and their findings are discussed in reference to MNT. Finally, the limitations of the study are presented and discussed.

Chapter 8 presents the contributions of the study. Next, a practical guide for conducting virtual standard setting cut score studies is presented. Finally, directions for future research are discussed.

Chapter 2: Literature review

The aim of this chapter is to review the literature on standard setting and virtual standard setting. The chapter is divided into five main sections, with the first section defining standard setting and its importance. The second section describes the impact standard setting has had on language testing and assessment (LTA) and summarises current standard setting studies in LTA. The third section presents the challenges associated with conducting F2F standard setting studies, while the fourth section reviews the limited literature on virtual standard setting and presents challenges associated with conducting virtual cut score studies. The final section re-evaluates the virtual standard setting studies according to the principles of MNT.

2.1 Background to standard setting

The concept of there being an 'external criterion' or 'standard' against which teachers could measure student performance and communicate to their students whether performance expectations had been met can be attributed to Mager (1962). Mager defined such a criterion as a "standard or test by which terminal behaviour is evaluated" (1962, p. 2). He viewed terminal behaviour as teachers' perceptions of what learners needed to have accomplished by the end of an instruction period. This need for measuring the knowledge, skills, and abilities (KSAs) learners accumulated through an instructional programme may have contributed to the rise of the criterion-referenced testing (CRT) movement, a movement that was initiated as an answer to the problems associated with norm-referenced testing (NRT).

Glaser (1963) coined the terms "norm-referenced measures" and "criterion-referenced measures", the former implying a relative standard against which learners would be compared in relation to one another and the latter referring to the existence of an external criterion against which the quality of learning would be defined – an absolute standard. The terms norm-referenced (NR) and criterion-referenced (CR) are traditionally used to refer to methods of test score interpretations (Linn & Gronlund, 1995). While NR and CR approaches to interpreting test scores are both meaningful, the performance against which the scores are evaluated differs considerably. In NRT, test-taker performance is compared against a reference group (i.e., the norm group), while in CRT, test-taker performance is measured in terms of the extent to which KSAs have been mastered (Hambleton
& Eignor, 1978; Hambleton, Zenisky, & Popham, 2016; Sireci & Hambleton, 1997). In other words, in CRT, successful test-​taker performance is determined by a cut score, a certain point on a test scale used for classifying test takers into at least two different categories such as mastery or non-​mastery. Learners who achieve a score at or above a certain cut score are deemed to be masters, while those who score lower are regarded as nonmasters (Hambleton & Eignor, 1978; Kane & Wilson, 1984). The process of establishing a cut score is called standard setting, and such a process can be applied to both NRT and CRT. In norm-​referencing standard setting procedures, cut scores are set according to previously established levels of success, and/​or are set by those who have the power or authority to do so. NR has been used in situations where there are limited resources such as a limited number of places in an institutional programme or where educational authorities believe that students should get at least a certain percentage of items correct on a test (Alderson, Clapham, & Wall, 2005; Popham & Husek, 1969). NR test score interpretations, especially on standardised tests, are used by many universities for admitting students into their programmes. For example, some universities boast that the undergraduates accepted into their Master of Business Administration (MBA) programmes have an average student score on the Graduate Management Admission Test (GMAT), a standardised NR test, of at least 700 (89th percentile) out of a maximum of 800 (Lodenback, Stranger, & Martin, 2015), implying that students wishing to be admitted into those schools may need to score at least 90th percentile. Consequently, such NR test scores are only meaningful when directly compared with the test scores of the average group (Ornstein & Gilman, 1991). NR standard setting procedures have two main disadvantages. The first disadvantage is that students are classified as passing or failing a certain standard based on how well the norm group had performed in a previous examination. For example, a norm group that had performed exceptionally well would result in a standard of achievement that may be an impossible task for subsequent test takers to achieve; however, if the norm group had a low performance, such a standard would probably be easily achieved by the majority of subsequent test takers. The second disadvantage is that there are no scientific methods to establish the cut scores used, nor is the error of such a decision questioned or taken into consideration (Kaftandjieva, 2010; Sireci & Hambleton, 1997; Zieky, 1995). In contrast to NR standard setting procedures, CR standard setting approaches set a standard through a decision-​making process, which usually entails recruiting a group of panellists, also referred to as subject matter experts
(SMEs), to take part in a workshop and to complete a variety of tasks in order to recommend a cut score for a certain examination (Cizek & Earnest, 2016; Hambleton, Pitoniak, & Copella, 2012; Kaftandjieva, 2010; Pitoniak & Cizek, 2016). The latter procedure for CR standard setting is what has today come to be regarded as standard setting. In CR standard setting, cut scores can be dichotomous decisions such as pass/fail or award/deny certification, especially in cases of medical certification (e.g., doctors, dentists, psychologists). In educational contexts, however, multiple cut scores may be needed to place students into different performance levels/categories such as Below Basic, Basic, Proficient, and Advanced (Pitoniak & Cizek, 2016). To set a recommended cut score, SMEs consider the KSAs, or the minimum competence (Kane, 2006), that test takers need to possess. Such a cut score is established without directly comparing the performance of the test takers to one another, but by comparing test-taker performance to established criteria (standards) for success. Such criteria may be operationalised through certain sets of descriptors, usually located within a framework of competency.

2.2 The importance of setting valid cut scores

Standard setting has an impact on test takers, stakeholders, and society in general, as cut scores can determine whether test takers graduate from high school, enter university, or receive a promotion or licensure to practise a profession (Pitoniak & Cizek, 2016; Sireci, Randall, & Zenisky, 2012). Consequently, it is imperative that when cut scores are set, they serve and protect public interest, especially in the case of licensure. To ensure this, the standard setting process should allow a representative group of experts in the field to be involved and to express their opinions freely and without bias. Only if expert judgements reflect the judges' own principles and experiences will their ratings be reliable enough and therefore replicable if the standard setting procedure were to be repeated at a later point in time (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). It is only when cut scores are rigorously and empirically set that test score interpretations and test-taker classification based on such scores are valid. Thus, standard setting also functions as a safeguard when certain standards are adhered to, as it can determine (1) the extent to which test instruments measure test-taker ability; (2) the amount of KSAs exhibited by test takers; and (3) the extent to which score interpretations are valid (Cizek, 2012a; Kane, 2006; Sireci, Randall, & Zenisky, 2012; Wang, 2009).

2.2.1 Standard setting methods

Jaeger (1989) divided standard setting methods into two categories: test-centred and examinee-centred. In test-centred standard setting methods, panellists review the test tasks/items and decide whether test takers would be able to answer the tasks/items correctly. By contrast, in examinee-centred methods, panellists place test-taker performance into ordered categories and/or evaluate test-taker performances (Hambleton & Pitoniak, 2006). In 2010, Kaftandjieva identified at least 60 different types of standard setting methods; however, as advances in psychometric analysis have allowed standard setting practitioners to modify pre-existing standard setting methods, some of the modified methods may not neatly fit into Jaeger's classification system.

2.2.1.1 Examples of test-centred methods

Variants of the Angoff method

The Angoff method and its variations are the most widely used and thoroughly researched test-centred standard setting methods (Brandon, 2004; Cizek & Bunch, 2007; Cohen, Kane & Crooks, 1999; Council of Europe, 2009; Kane, 1998; Plake, Impara, & Irwin, 1999; Irwin, Plake, & Impara, 2000; Plake & Cizek, 2012). In its original form, panellists review items and decide whether a minimally competent test taker, one who possesses the minimum competence to be classified into a performance group (Kane, 2006), would be able to answer an item correctly. This method is now considered a variant of the Angoff method and is referred to as the Yes/No Angoff method (Impara & Plake, 1997). However, what has come to be known as the Angoff method (aka the modified Angoff method) is one that was originally proposed in a footnote and has been attributed to Ledyard Tucker (Angoff, 1971). In this method, panellists review each MC item and think of how many out of 100 minimally competent test takers would answer the item correctly. Each panellist's probabilities are summed across items, and the average of all panellists' sums is used to set the cut score. The method has been criticised for the 'impossible' cognitive task of asking panellists to predict how minimally competent test takers would perform on a particular item (Bejar, 1983; Impara & Plake, 1998; Stone, 2009a). To combat some of the criticism aimed at the Angoff method, Impara and Plake (1997) suggested two variations of the method; the first variation aimed at making the notion of a borderline test taker more tangible to the subject matter experts (SMEs), while the second offered an alternative to the calculation of the recommended cut score. What Impara and Plake proposed was, in essence,
similar to the one originally proposed by Angoff (1971), which is now referred to as the Yes/No Angoff method. SMEs, in this variation, are asked to think of a real test taker, one who meets the minimum competence requirements to pass the examination, and to make a dichotomous Yes/No decision (Yes = 1; No = 0) as to whether that test taker would be able to answer each item correctly or not. It is purported that this method does not pose a very cognitively challenging task for SMEs, who are usually teachers, as they are more familiar with the process of classifying individual students, rather than students collectively, into different performance levels, and are more accurate in doing so (Impara & Plake, 1997; Impara & Plake, 1998). Additionally, not all SMEs may be equally accurate in classifying students, but in the Yes/No Angoff method, SMEs review each item and the sum of their raw scores is then averaged. Consequently, since the average of all panellists' scores across all appraised items is used to set the cut score, such a cut score results in a better classification of test-taker performance. The method has several advantages, such as not placing a cognitive demand on SMEs to estimate the probability of a group of minimally competent test takers answering each item correctly (Cizek & Bunch, 2007), which in turn also speeds up the process of entering ratings (Plake & Cizek, 2012). This method also contains iterative rounds but is not without disadvantages, such as biasing participant judgements (Cizek & Bunch, 2007; Plake & Cizek, 2012). For example, when SMEs are asked to estimate the probability of a minimally competent test taker getting an item correct, their responses can theoretically range from 0.0 to 1.0 inclusive, offering SMEs the opportunity to fine-tune their judgements. In the Yes/No Angoff method, however, only two judgements are possible: "0" = No; "1" = Yes. Consequently, the range of SMEs' judgements is restricted to "0" and "100" percent, which, in turn, may bias the final cut scores. However, in order to ensure that panellists are not faced with such an 'impossible' task, recent variants of the Angoff method have employed both rigorous training and various rounds at which panellists are presented with empirical information to help them with their judgements. For example, after Round 1 judgements have been made, SMEs discuss the rationale for their judgements, and prior to making Round 2 judgements, SMEs are provided with the empirical difficulty of the items (normative feedback). After Round 2 judgements, a discussion may follow and SMEs are then presented with consequences feedback (Reckase & Chen, 2012), i.e., data indicating the number of test takers classified in each performance level (e.g., pass/fail), prior to making their Round 3 judgements (if one exists).
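To make the arithmetic described above concrete, the short sketch below works through a Yes/No Angoff cut score for a hypothetical panel; the judge labels and ratings are invented for illustration, and in the probability-based modified Angoff variant the 0/1 entries would simply be replaced by probability estimates between 0.0 and 1.0.

```python
# Minimal sketch of a Yes/No Angoff cut score calculation (hypothetical data).
# Each judge records 1 ("Yes, a minimally competent test taker would answer this
# item correctly") or 0 ("No") for every item on the test; in the probability-based
# modified Angoff, these entries would instead be probabilities between 0.0 and 1.0.

yes_no_ratings = {
    "Judge A": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "Judge B": [1, 0, 0, 1, 1, 1, 0, 0, 1, 1],
    "Judge C": [1, 1, 1, 1, 0, 1, 1, 0, 0, 1],
}

# A judge's provisional cut score is the sum of their "Yes" judgements: the number
# of items a minimally competent test taker is expected to answer correctly.
judge_cut_scores = {judge: sum(ratings) for judge, ratings in yes_no_ratings.items()}

# The panel's recommended cut score is the mean of the judges' cut scores.
panel_cut_score = sum(judge_cut_scores.values()) / len(judge_cut_scores)

print(judge_cut_scores)           # {'Judge A': 7, 'Judge B': 6, 'Judge C': 7}
print(round(panel_cut_score, 2))  # 6.67 raw-score points on a 10-item test
```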

The Bookmark method

The Bookmark method (Lewis, Mitzel, & Green, 1996) is another test-centred approach in which SMEs review test items. The difference between this method and variations of the Angoff method is that panellists are not asked to estimate the difficulty of each item, as the test items are presented to them in a specially designed booklet in order of difficulty (from easy to difficult items). The difficulty of each item is estimated through either item response theory (IRT) or Rasch measurement theory (RMT)¹ (Cizek & Bunch, 2007; Zieky, Perie, & Livingston, 2008). This booklet is referred to as an ordered item booklet (OIB). SMEs appraise items in terms of the KSAs needed by a minimally competent test taker to be able to answer an item correctly, while at the same time they evaluate how each item progressively becomes more difficult than the previous one. SMEs are asked to place a marker in the booklet on the last item that they believe a minimally competent test taker would have a .50 (RP50) or .67 (RP67) probability of answering correctly. The marker (bookmark) becomes the cut score for each panellist as it separates the items into two categories: items that would be answered correctly or incorrectly either half of the time (.50) or two thirds of the time (.67) by minimally competent test takers. Cut scores are usually derived by taking the mean or median of the SMEs' bookmark placements. Unlike the cognitive demands placed on SMEs in the modified Angoff methods, the Bookmark method simplifies the panellists' judgement, as SMEs are more familiar with reviewing items with regard to the KSAs the items aim to measure and, in this method, SMEs make fewer judgements. This method is deemed to be suitable for examinations that need to set different cut scores simultaneously. However, one of its main shortcomings is that panellists may have problems applying the RP50 or RP67 probability rule, and the decision to use one rule over the other is arbitrary, one that, nonetheless, may result in different cut scores being set (Karantonis & Sireci, 2006). Moreover, some panellists may even misunderstand the place at which they set their bookmark, believing that a minimally competent test taker must be able to answer correctly the total number of items placed before the bookmark. Another shortcoming can arise when SMEs' perceptions regarding the difficulty of an item differ from its actual empirical difficulty (Cizek & Bunch, 2007; Skaggs & Tessema, 2001; Zieky, Perie, & Livingston, 2008). Similar to the modified Angoff methods, this method typically employs several rounds in which discussion amongst panellists is essential, though reaching a consensus is not. For example, after Round 1, SMEs' discussion focuses on establishing the minimum KSAs that test takers should have to be classified into a performance level. Round 1 discussion may occur in small groups, while Round 2 discussion may occur as a panel discussion (Mitzel, Lewis, Patz, & Ross, 2001). At the end of Round 2, consequences feedback is presented to SMEs before they make their final judgements.

¹ While the one-parameter Logistic (1PL) IRT model appears similar to the Rasch Dichotomous Model, the conceptual differences between these two models imply that they are not part of the same family of mathematical models (Linacre, 2005).
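As a purely illustrative sketch of how a panel cut score might be read off the bookmark placements just described, the following uses invented item difficulties and bookmark positions; an operational Bookmark study would additionally translate the chosen scale location back onto the raw or reported score scale.

```python
# Minimal sketch of deriving a Bookmark cut score (hypothetical data).
import statistics

# Item difficulties (in logits), already sorted from easy to difficult as they
# would appear in the ordered item booklet (OIB).
oib_difficulties = [-1.8, -1.2, -0.9, -0.4, 0.0, 0.3, 0.7, 1.1, 1.6, 2.2]

# Each judge places a bookmark on the last item a minimally competent test taker
# would answer correctly with the agreed response probability (RP50 or RP67);
# bookmarks are recorded here as 1-based positions in the OIB.
bookmarks = {"Judge A": 6, "Judge B": 7, "Judge C": 6, "Judge D": 5}

# A judge's cut score is the scale location of the bookmarked item.
judge_cuts = {judge: oib_difficulties[pos - 1] for judge, pos in bookmarks.items()}

# The panel cut score is usually the median (sometimes the mean) of the judges'
# cut scores, later translated back onto the test's reporting scale.
panel_cut = statistics.median(judge_cuts.values())

print(judge_cuts)  # {'Judge A': 0.3, 'Judge B': 0.7, 'Judge C': 0.3, 'Judge D': 0.0}
print(panel_cut)   # 0.3 (logits)
```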

The Objective Standard Setting (OSS) method

Rejecting the Angoff method on the premise that it is fundamentally flawed due to the cognitive demands placed on panellists, Stone (2009a) proposed the OSS method, a method that incorporates the difficulty of the test items, the judgement of the panellists, and the measurement error of the test to arrive at a cut score. In stage one, after defining the characteristics of the minimally competent test taker, SMEs are asked to review each test item and decide whether the item should be considered 'essential' for the minimally competent test taker to know. A panellist's first cut score (criterion point) is calculated by averaging only the item difficulties, estimated through Rasch, of those items that were selected as essential. In stage two, panellists are asked to reach a consensus on how much mastery a minimally competent test taker needs to obtain to pass (mastery level), i.e., what percentage of the selected items need to be answered correctly. This decision is originally expressed as a percentage ranging from 0 % to 100 %, which is then converted onto the Rasch scale. In the third stage, panellists are asked to take into consideration the standard error of measurement (SEM) associated with the test instrument. Panellists must reach consensus on how the SEM will be incorporated in their cut scores: they need to indicate whether to add the SEM to their cut scores, subtract it from their cut scores, or not allow it to affect their cut scores at all. The SEM is then logistically transformed into a confidence level. The final recommended cut score for each judge is the sum of the criterion point, mastery level, and confidence (± SEM), and the final cut score is the average of all judges' cut scores. The main advantage of using the OSS method is that the panellists' cognitive load is reduced, as they do not decide on a probability estimate for each item, which, in turn, allegedly reduces the time needed for the specific task. However, in stage one no discussion is generated amongst the panellists, as they are not required to provide a rationale for their selection of items considered to be necessary for a minimally competent test taker to know. Such a discussion may
allow panellists to reconsider their selected pool of items, leading to a potential change in the final cut score. Furthermore, stages two and three require panellists to reach consensus on the mastery level and on how the SEM should be incorporated in the final cut score. Reaching consensus in two stages may prove to be a tiring and lengthy process, especially when some panellists may refuse to change their opinions. Consequently, conflict and competition may arise amongst panellists in their attempt to persuade the rest of the group to accept their views. Another disadvantage of the OSS method is that it may not be appropriate for a test that has been designed to target a specific level. The reasoning behind such a premise is that tests aiming at a particular level usually include many items that are essential for the minimally competent test taker to know. This may be especially true of tests that have been aligned to the Common European Framework of Reference for Languages (CEFR). In such a case, there is a high risk of producing a cut score that will be too high. For example, a test instrument analysed through Rasch will usually have an item mean measure of 0 logits (logits being Rasch-scale units), and if panellists decide that the majority of items are essential for a minimally competent test taker to know, then the average of the item difficulties would be close to 0 logits for all judges. Assuming further that judges select a hypothetical mastery level of 60 % (e.g., 0.41 logits) and choose to add the SEM (e.g., 0.19 logits) to the cut score, the following cut score may result for each judge: 0.00 logits (criterion point) + 0.41 logits (mastery level) + 0.19 logits (SEM) = 0.60 logits, which may correspond to a minimally competent test taker needing to answer approximately two thirds of all the items on the test correctly to pass. Despite its name – Objective Standard Setting – there seems to be a high level of subjectivity involved, primarily because the initial item selection is mostly an individual judgement based on the background and idiosyncrasies of the panellists. In retrospect, this method is not so different from the Angoff and the Bookmark methods described earlier in that, in all these methods, panellist judgements play a pivotal role in setting cut scores and such judgements are confined to the panellists' beliefs, teaching experience, and educational background. In the Angoff method, panellists need to imagine how "a mythical, minimally competent individual" (Stone, 2009b, p. 143) will perform on a particular item. However, asking panellists to assess an item in terms of its being essential or not does not necessarily free the panellists from visualising the minimally competent individual, as the panellists need to have a frame of reference to make that judgement.
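The arithmetic of the hypothetical example above can be laid out as a short sketch; the item difficulties, the 60 % mastery level, and the 0.19-logit SEM are the illustrative values from the text, and the log-odds conversion of the mastery percentage is assumed from the 60 % ≈ 0.41 logits figure given above.

```python
# Minimal sketch of the OSS cut score arithmetic, mirroring the hypothetical
# example above (all values are illustrative and expressed in logits).
import math

# Stage 1: a judge's criterion point is the mean difficulty of the items they
# flagged as 'essential' (here the essential items happen to average 0 logits).
essential_item_difficulties = [-0.6, -0.3, 0.0, 0.2, 0.7]
criterion_point = sum(essential_item_difficulties) / len(essential_item_difficulties)

# Stage 2: the agreed mastery level (e.g. 60 % of the essential items correct),
# converted from a proportion to the logit scale via a log-odds transformation
# (assumed here from the 60 % ≈ 0.41 logits figure in the text).
mastery_proportion = 0.60
mastery_level = math.log(mastery_proportion / (1 - mastery_proportion))  # ≈ 0.41

# Stage 3: the panel's consensus on the SEM: add it (+1), subtract it (-1), or ignore it (0).
sem = 0.19
sem_direction = +1

judge_cut_score = criterion_point + mastery_level + sem_direction * sem
print(round(judge_cut_score, 2))  # 0.6, i.e. roughly the 0.60 logits of the worked example
```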

2.2.1.2 Examples of examinee-centred methods

The Borderline Group (BG) method and the Contrasting Group (CG) method

In the BG method (Livingston & Zieky, 1982), panellists are familiar with the actual test takers and the KSAs that the test takers possess. Prior to reviewing the test takers' test scores, panellists are asked to separate the test takers, based on their knowledge of them, into three categories: (1) well below the borderline performance level; (2) at the borderline performance level; or (3) well above the borderline performance level. The median test score of the test takers placed in the borderline group is usually used to set the cut score. Similar to the BG method, the CG method (Livingston & Zieky, 1982) entails panellists having first-hand knowledge of test takers. However, instead of asking panellists to separate test takers into low performers, borderline performers, and high performers, panellists place each test taker into a corresponding performance level. For example, if there are three performance levels for cut scores to be derived, such as Basic, Proficient, and Advanced, the judgement process entails placing test takers in all three performance levels. The cut scores for each performance level are set at the median score for each group or at the scores where 50 % of the test takers are classified in either one of the two performance levels (i.e., Proficient and Advanced) (Livingston & Zieky, 1982; Pitoniak & Cizek, 2016; Zieky, Perie, & Livingston, 2008). The advantages of both methods are that cut scores are (1) derived based on actual test-taker performance, and (2) made by panellists who have direct knowledge of the actual test takers' KSAs and are familiar with making such judgements. However, in both methods, the panellists may be biased, as their personal opinions of the test takers may cloud their judgements. Additionally, in both methods, cut scores may be unstable, as the sizes of the performance groups may differ considerably (Pitoniak & Cizek, 2016). Moreover, it is not possible for all panellists to know all test takers, which is why it is recommended that each test taker be known by at least one panellist (Council of Europe, 2009).

The Body of Work (BoW) method

The BoW method (Kahl, Crockett, DePascale, & Rindfleish, 1994; Kingston, Kahl, Sweeney, & Bay, 2001) is in essence a CG approach, the difference being that test takers' work is classified into performance groups, not the test takers themselves. Consequently, panellists do not need to be familiar with the actual test takers. The BoW method is especially suitable for constructed-response items such as written and spoken performances. Test takers' responses to all the test items are presented to panellists for evaluation.
The method comprises several rounds: the first round is regarded as a training round; the second round is a range-finding round, in which the score range of the boundaries between performance levels is identified; and in the third round, cut scores are pinpointed. The advantages of this method are similar to the ones identified in the CG method, namely, the task is familiar to the SMEs (teachers) and the cut scores are the natural outcome of the SMEs' evaluation of the entire set of test takers' responses. Notwithstanding these advantages, the method is time-consuming, both for the facilitator, who must select responses covering the whole scale range of performance, and for the panellists, who must evaluate all the performances. Discussion amongst panellists between rounds is an essential part of the method, as it facilitates the range-finding stage as well as the pinpointing stage.
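A minimal sketch of the Borderline Group arithmetic described above is given below; the test-taker identifiers, classifications, and scores are invented, and an operational study would of course involve many more test takers and checks on group sizes.

```python
# Minimal sketch of a Borderline Group cut score (hypothetical data).
import statistics

# Panellists classify test takers they know into three groups before any
# test scores are considered.
classifications = {
    "T01": "well below", "T02": "borderline", "T03": "well above",
    "T04": "borderline", "T05": "borderline", "T06": "well above",
    "T07": "well below", "T08": "borderline",
}

# Observed test scores for the same test takers.
test_scores = {"T01": 12, "T02": 21, "T03": 34, "T04": 23,
               "T05": 19, "T06": 31, "T07": 15, "T08": 22}

# The cut score is typically the median observed score of the borderline group.
borderline_scores = [test_scores[t] for t, c in classifications.items() if c == "borderline"]
cut_score = statistics.median(borderline_scores)

print(sorted(borderline_scores))  # [19, 21, 22, 23]
print(cut_score)                  # 21.5
```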

2.2.2 Evaluating and validating standard setting methods

Kane (2001) proposed a framework for evaluating standard setting cut score studies in which three types of validity evidence need to be collected to make a claim that the inferences made about performance standards – cut scores – are valid. In 2003, Pitoniak listed the elements of a standard setting cut score study that need to be evaluated. Cizek and Earnest (2016) have adapted the list to include another element that should also be evaluated, that of 'decision consistency'. Table 2.1 summarises the three types of validity evidence that need to be presented to the awarding body requesting a cut score study.

Table 2.1  Summary of elements for evaluating standard setting
Procedural: Explicitness; Practicability; Implementation of procedures; Panellist feedback; Documentation
Internal: Consistency within method; Intraparticipant consistency; Interparticipant consistency; Decision consistency; Other measures
External: Comparisons to other standard setting methods; Comparisons to other sources of information; Reasonableness of cut scores
(Cizek & Earnest, 2016, p. 218)

The first type of validity evidence to be evaluated is procedural evidence and refers to why the particular method was chosen (explicitness), how easy it was to implement the standard setting method (practicability), what training
the panellists received (implementation of procedures), whether the panellists felt comfortable using the method (panellist feedback), and how precisely the procedure was documented (documentation). The next type of validity evidence is internal and relates to (1) whether the recommended cut score would be the same if the method were repeated (consistency within the method); (2) how consistent each panellist's ratings are with the empirical difficulty of the items and to what extent panellists changed ratings across rounds (intraparticipant consistency); (3) the degree to which cut scores and item judgements are consistent across participants (interparticipant consistency); and (4) the degree to which test takers would be classified in the same performance levels were the cut score study repeated (decision consistency). The final type of validity evidence is external and refers to (1) whether the cut score would be replicated if another standard setting method were used (comparisons to other standard setting methods); (2) whether the cut score would be in line with other criteria used to assess the test takers, such as teacher grades or test-taker scores on another similar test (comparison to other sources of information); and (3) whether the cut scores resulted in reasonable pass rates (reasonableness of cut scores). The facts that procedural validity evidence includes panellists' feedback on the procedure of setting cut scores and that internal validity evidence entails intraparticipant consistency and interparticipant consistency underscore the acceptance of the subjective nature of judgements. In other words, researchers do not deny that standard setting judgements are subjective; rather, training procedures (Council of Europe, 2009) and empirical checks for the evaluation of the panellists' raw data have been established to determine whether such subjective judgements are acceptable in terms of decision consistency and accuracy, as well as intraparticipant and interparticipant reliability (Hambleton & Pitoniak, 2006). In order to build a validity argument into the standard setting process, Sireci, Randall, and Zenisky (2012) have recommended the following seven actions to ensure that valid standards are set regardless of the method used to set cut scores: (1) selecting a sufficient number of qualified panellists; (2) providing adequate training of panellists; (3) facilitating panellists' discussion of ratings; (4) providing quantitative feedback on provisional judgements; (5) facilitating a discussion on the final cut score for consensus; (6) conducting a comprehensive survey of panellists' impressions; and (7) deriving cut score confidence intervals. All seven actions deal with the panellists directly or indirectly, as far as their selection, their training, the feedback they receive, and their views on the standard setting workshop procedures and recommended cut scores are concerned. Additionally, what has come to be accepted as the norm in cut score
studies is that panellist discussion is a prerequisite of any cut score study. In short, standard setting is “the establishment of a subjective policy, albeit one that is informed by data” (Sireci, Randall, & Zenisky, 2012, p. 19).

2.3 Standard setting in language assessment

Standard setting has become more prominent in LTA in the last 15 years, mainly due to the impact that behavioural scales of language proficiency – scales defining the KSAs test takers are expected to possess at different proficiency levels – have had on test development. Language proficiency scales such as the ones found in the Interagency Language Roundtable (ILR) (Interagency Language Roundtable, n.d.), the Canadian Language Benchmarks (CLB) (Centre for Canadian Language Benchmarks, 2012), the ACTFL Proficiency Guidelines (American Council on the Teaching of Foreign Languages, 2012), and the CEFR (Council of Europe, 2001) have provided test developers and policy makers with established and accepted descriptors of what language learners 'can do' at certain proficiency levels. By incorporating such descriptors in the test development cycle, language testers can develop examinations aiming at measuring different performance levels, while at the same time the descriptors provide meaningful explanations of what successful test takers can do (Kantarcioglu & Papageorgiou, 2011; Papageorgiou, 2016). What is more, policy makers (i.e., immigration, educational and employment agencies) have started requesting that testing agencies align their examinations to such proficiency scales, as such an alignment provides evidence for score interpretation (Kenyon & Römhild, 2014; LaMarca, 2001). To support such endeavours, the Council of Europe commissioned The Manual for Relating Examinations to the Common European Framework of Reference for Languages (Council of Europe, 2009) (henceforth, "The Manual"). The Manual (2009) provided the standard setting community with guidance and a framework for conducting CEFR alignment studies and as such became an influential standard setting document. The framework in the Manual helps standardise to some degree the alignment process and advocates the importance of setting and evaluating cut scores on language examinations, a vital step in the alignment process. In 2020, the Council of Europe released the final version of the companion volume to the CEFR, which contained updated and extended descriptors. As there was no update to the Manual (2009) to accompany the new CEFR companion volume (Council of Europe, 2020), a new handbook, Aligning Language Education with the CEFR: A Handbook, was jointly produced by the British Council, the UK Association for Language Testing and Assessment (UKALTA), the European Association for Language Testing and Assessment (EALTA), and the Association for Language Testers in Europe (ALTE) in 2022.
The handbook (British Council et al., 2022) provides an in-​depth updated description of the CEFR alignment process, one filled with illustrative examples, resources, and suggestions.

2.3.1 Current LTA standard setting research

Standard setting methods have evolved and multiplied immensely over the last 15 years, both due to advances in psychometric theory, which have provided the tools to evaluate elements of a cut score study, and due to the vast volume of research conducted to investigate such elements and/or to address problems associated with some of the methods. In the field of LTA, researchers have conducted standard setting workshops to align their examinations to the CEFR, and the findings of those studies have instigated further research into various aspects of standard setting.

2.3.1.1 The first publicly available CEFR alignment studies

In one of the first reported CEFR alignment studies, Tannenbaum and Wylie (2008) aligned three English language tests (the TOEFL® iBT, the TOEIC® assessment, and the TOEIC Bridge™ test) to several CEFR levels. Even though the researchers had initially hoped to be able to set CEFR C2 level cut scores for the TOEFL® iBT and the TOEIC® assessment, the panellists felt that a CEFR C2 cut score could not be set on those instruments. The difficulty of setting C2 cut scores may have been due to the nature of the test instruments used in the study. Both the TOEFL® iBT and the TOEIC® assessment measure test takers' English language proficiency and as such comprise items at varying degrees of language proficiency. Panellists may not have been able to set CEFR C2 cut scores as such proficiency tests usually contain too few items measuring test takers' advanced language proficiency, thus making it extremely difficult or impossible to set a minimum C2 cut score. This study stressed the difficulties associated with CEFR alignment studies, especially when attempting to align a test instrument to all six CEFR levels. The researchers underscored that difficulties associated with CEFR alignment studies may be fewer when test instruments have been designed with the CEFR in mind, and suggested that external validity evidence can be collected by comparing test takers' scores to ratings, in terms of CEFR level, assigned by the test takers' teachers. In another study, O'Sullivan (2008) conducted a CEFR alignment study on a B2 English language test. While the researcher also noted the difficulties of aligning test instruments to the CEFR based on the procedure recommended at the time in Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment: Preliminary Pilot
Manual (Council of Europe, 2003) (henceforth, "The Preliminary Pilot Manual"), O'Sullivan made several recommendations as far as CEFR alignment studies were concerned. O'Sullivan questioned the claim made in The Preliminary Pilot Manual that a CEFR alignment process was a linear process which began at the familiarisation stage, moved on to the specification and standardisation stages, and ended in the validation stage. Instead, the researcher viewed the process of CEFR alignment as a non-linear one, one in which validation should occur from the outset of the alignment project and continue throughout it. O'Sullivan argued that the specification stage as described in The Preliminary Pilot Manual entailed only completing a set of open-ended questions, ones which do not request information on the psychometric characteristics of the test instrument used in the alignment study; as such, should an instrument later be found to be unreliable, the whole alignment process would be invalidated. Moreover, he advocated that evidence used to support CEFR alignment should not be limited to internal and external validity evidence, but that test developers needed to provide evidence that their tests are "of sufficient quality and also that the linking is embedded in the test" (p. 89). As for the CEFR descriptors, O'Sullivan further argued that the descriptors for the receptive skills were not as explicit as the ones for the productive skills, especially at the higher CEFR levels. These two studies were amongst the first publicly available CEFR alignment reports and presented some of the problems that researchers encountered when aligning their own examinations with the CEFR. A few years later, two separate collections of CEFR alignment studies (Figueras & Noijons, 2009; Martyniuk, 2010) were published. Many of the researchers in those collections also expressed similar problems when reflecting on the alignment process recommended in The Preliminary Pilot Manual. It should be noted that in 2009, a revised version of the CEFR linking framework was recommended in the Manual (Council of Europe, 2009), which addressed some of the comments made by O'Sullivan (2008) and Tannenbaum and Wylie (2008). Since then, several researchers have explored (1) panellists' understanding of the CEFR descriptors; (2) different ways to add external validity evidence to their own cut score studies; and (3) different variations of a method and/or new methods to reduce the cognitive burden associated with setting cut scores. The following studies are examples of standard setting research conducted in the field of LTA.

2.3.1.2 Studies investigating understanding of method or CEFR

Papageorgiou (2009) investigated the thought processes of SMEs during a CEFR alignment study to explore the difficulties that SMEs had in understanding
and operationalising the CEFR descriptors throughout the study. Analysis of group discussion transcripts revealed that SMEs faced problems with the CEFR descriptors mainly due to the context-​free nature of the CEFR, the descriptions of real-​life language use, vague wording of some of the descriptors or the limited range of the descriptors as some of the descriptors did not cover many KSAs of test takers at certain CEFR levels and/​or were not suitable for assessing young learners. These above-​mentioned findings highlight the challenges of conducting CEFR alignment studies, especially when the CEFR descriptors may not adequately describe the range of KSAs that a test instrument has been developed to measure. Additionally, Papageorgiou also discovered that SMEs that belonged to the awarding body which created the test instrument had a different interpretation of the CEFR levels as the panellists were biased by having participated in an earlier stage (specification stage) of the alignment study. In another alignment study, Brunfaut & Harding (2014) used a twin panellist approach when setting CEFR listening cut scores on a suite of examinations. The two panels consisted of external and internal panellists, where the external panellists were not familiar with the test instruments while the internal panellists were item writers, examiners, and/​or test developers. The researchers employed group discussions at the end of each panel session to gain insight into each panels’ interpretations of the modified basket method and their CEFR-​based judgements. Nonetheless, similar to Papageorgiou’s (2009) study, qualitative analysis of the group discussions revealed that the internal panel had problems with applying the CEFR descriptors due to the descriptors being too restrictive or not localised enough for what the specific test instruments were aiming to measure, while the external panel had problems with the validity of the test items and whether such items could be adequately aligned to the CEFR. The internal panel’s difficulty in applying CEFR descriptors effectively may not be solely due to the wording or the lack of wording in the CEFR descriptors but may be due to “bias of insiders” (Papageorgiou, 2009, p. 181), where internal panellists expect a certain CEFR level test taker to exhibit certain characteristics based on the claims that were made earlier in the alignment process’s specification stage (Papageorgiou, 2009). However, “bias of insiders” may also be expanded to include general bias towards the CEFR descriptors that exists amongst internal panellists –​panellists that have directly participated in the design of a test instrument and/​or the production of the items that are to be aligned to a CEFR level. For example, when item writers have not been thoroughly familiarised with the CEFR descriptors or have not fully understood the CEFR descriptors and/​or levels, prior to creating items, they may resort to trying to justify their
items at a particular level when participating as panellists in a standard setting workshop by expressing criticism towards the CEFR descriptors.

2.3.1.3 Studies investigating external validity evidence

Dunlea and Figueras (2012) replicated a CEFR cut score study to gather external validity evidence for the original CEFR alignment study. Even though the two judge panels came from different parts of the world (Japan and Europe), had differing interpretations of the CEFR descriptors, and used two different standard setting methods (the modified Angoff and the CG method), the cut scores for the replicated study were within one raw point of the original cut scores, adding external validity evidence to the main study. The researchers exemplified one way of gathering external validation evidence for a CEFR alignment study and also emphasised the importance of replicating CEFR cut score studies, a rarely undertaken task. Other awarding bodies such as Pearson (2012) and ETS (2010) have attempted to gather external validity evidence for their own CEFR alignment studies by conducting comparability studies between their own examinations and the International English Language Testing System (IELTS) examination. However, Lim, Geranpayeh, Khalifa, and Buckendahl (2013) questioned the claims made by these two awarding bodies (Pearson & ETS) in relation to the recommended IELTS-CEFR cut scores. The researchers highlighted that there was a discrepancy between the IELTS-CEFR cut scores reported by the two awarding bodies and their own IELTS-CEFR cut scores. They concluded that the discrepancies may be attributed to an inappropriate research design (Pearson, 2012), lack of test-taker motivation, as test takers may not have taken one of the two examinations seriously (Pearson, 2012), or problematic linking of ETS' own examination (Tannenbaum & Wylie, 2008). Lim et al. (2013) externally validated their own IELTS-CEFR results by comparing how test takers also performed on the Cambridge English Advanced (CAE) examination. They concluded that the overall correlation (.87) between the test takers' performance on the two examinations (IELTS & CAE) added external validity to their recommended IELTS-CEFR cut scores. However, the researchers did not provide any internal validity evidence for cut scores set on either examination. This study is significant as it addresses the importance of collecting appropriate and/or reliable external validity evidence when conducting a CEFR alignment study. Another way of gathering external validity evidence was pursued by Hsieh (2013). The researcher investigated whether cut scores derived from the Yes/No Angoff and the Bookmark methods were comparable. The researcher asked
panellists to recommend multiple cut scores on a 6th Grade English listening and reading test by first using the Yes/​No Angoff method and then using the Bookmark method. The findings revealed that the two methods did not produce comparable cut scores as cut scores set using the Yes/​No Angoff method were lower than those set using the Bookmark method for the two lower levels, while the opposite was true for the highest level. The researcher provided several explanations why differences between cut scores occurred: (1) the Yes/​ No method yielding cut scores that move towards the extremes; (2) both methods each consisting of three rounds were conducted on the same day; and (3) the multidimensional nature of the test instrument, containing both listening and reading items, made the Bookmark method confusing as both reading and listening items were presented in one ordered item booklet (OIB). While the explanations seem plausible for the incomparable cut score results, an equally plausible reason may be how the Yes/​No standard setting method was conducted. At the beginning of Round One, the researcher presented panellists with information about the items such as their facility values (proportion of correct responses) and their calibrated difficulty estimates based on a three-​parameter IRT model, one in which the calibrated difficulty of the item has been impacted by a hypothetical guessing factor. These indices may have biased the judges when making their Round 1 judgements. Despite this study’s possible limitations, it does add to the LTA standard setting literature of attempting to gather external validity evidence by comparing two different methods. Undoubtedly, many awarding bodies may not be able to collect external validity evidence for their own cut score studies as such evidence entails (1) repeating the study by using another standard setting method, (2) comparing test takers on two different examinations, or (3) asking the test takers’ teachers to classify the test takers. To begin with, time constraints may not allow two methods to be used simultaneously as the amount of time to complete the workshop increases when a second method is used. Additionally, comparing how test takers perform on two different examinations poses its own set of problems as the two examinations may not be comparable as far as the KSAs measured are concerned. Unless an anchor test is used, i.e., a short test containing common items employed in both examinations, placing both examinations on the same latent scale may be somewhat arbitrary. Furthermore, rarely will an awarding body freely release an examination with or without test specifications to another awarding body so that the latter can gather external validity evidence. As for teachers assessing their own students in terms of CEFR levels, this too may pose problems, as teachers will need to be thoroughly familiarised with the CEFR descriptors and levels.

2.3.1.4 Studies proposing new methods/modifications

Brunfaut and Harding's study (2014) is one example of a modification made to an existing standard setting method to reduce the cognitive burden of the panellists' rating task. The researchers used a modified basket method to set multiple CEFR cut scores. In this modification, panellists first assign each item to a particular CEFR level and then decide whether "a just qualified, a mid, or a high test taker" (p. 8) at the assigned CEFR item level would be able to answer the item correctly. While no statistically significant differences were observed between panel means at the elementary level, statistically significant differences were found between panel means at the other three levels (intermediate, high intermediate, and advanced). The researchers concluded that the modified basket method was suitable for setting CEFR cut scores as the cut scores exhibited adequate internal validity and procedural validity, with panellists reporting being comfortable with applying the new method. In another study, Feskens, Keuning, van Til, and Verheyen (2014) used a new standard setting method, the Data-Driven Direct Consensus (3DC) method, to set CEFR cut scores on 24 tests measuring the German, French, or English language reading and listening skills of Dutch secondary students. The 3DC method contains elements of the Bookmark method, the Angoff method, and the Direct Consensus method. In the 3DC method, panellists review items presented in clusters, where each cluster consists of items associated with the same reading text or same speakers, and choose how many items in each cluster a borderline test taker would get correct. The findings revealed that the method yielded adequate internal validity evidence, implying that the 3DC method can be considered a sound standard setting method. The main advantage of this method is that it drastically reduces the cognitive burden placed on judges, as they are not directly assessing each item, but rather a group of items at the same time, which, in turn, reduces the time needed to set cut scores. Nonetheless, the 3DC method as it was implemented in the study had its own disadvantage, as panellists were not provided with any empirical data on the difficulty of the items, but were given information on how a test-taker's performance in each cluster related to a test-taker's expected overall performance according to item response theory (IRT) principles. By not providing empirical data on the items, however, the researchers may have added to the cognitive burden of the rating process, as panellists were applying a new method while at the same time trying to comprehend how to interpret the IRT principles behind the rating process.
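As a rough illustration of the cluster-level judgement arithmetic implied by the description of the 3DC method above, the sketch below uses invented clusters and judgements; it deliberately leaves out the IRT-based expected-score information and the consensus discussion that the actual 3DC procedure relies on.

```python
# Simplified sketch of cluster-level judgements in the spirit of the 3DC method
# described above (hypothetical data; the operational 3DC procedure also uses
# IRT expected-score information and a consensus stage that are not modelled here).

# Items are grouped into clusters, e.g. all items belonging to one reading text.
cluster_sizes = {"Text 1": 5, "Text 2": 6, "Text 3": 4}

# Each judge states how many items in each cluster a borderline test taker
# would answer correctly.
cluster_judgements = {
    "Judge A": {"Text 1": 3, "Text 2": 4, "Text 3": 2},
    "Judge B": {"Text 1": 4, "Text 2": 3, "Text 3": 2},
    "Judge C": {"Text 1": 3, "Text 2": 4, "Text 3": 3},
}

# A judge's provisional cut score is their total expected score across clusters.
judge_cuts = {judge: sum(per_cluster.values()) for judge, per_cluster in cluster_judgements.items()}

# Averaging across judges gives a panel-level provisional cut score.
total_items = sum(cluster_sizes.values())
panel_cut = sum(judge_cuts.values()) / len(judge_cuts)

print(judge_cuts)                                     # {'Judge A': 9, 'Judge B': 9, 'Judge C': 10}
print(f"{panel_cut:.2f} out of {total_items} items")  # 9.33 out of 15 items
```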

The aforementioned recent LTA studies have tried (1) to gain insight into the panellists’ understanding of the CEFR descriptors and/​or the standard setting method employed, (2) to explore methods of collecting external validity evidence, and (3) to propose modifications to existing standard setting methods or to propose new methods or techniques aiming at reducing the cognitive burden associated with the task of setting cut scores. However, such studies have yet to address (i) how panellists can be trained to have a better understanding of the CEFR levels and/​or descriptors despite the problems the descriptors have in their wording or lack of skills defined; (ii) how external validity evidence can be collected reliably and effectively; and (iii) how the cognitive demands of the rating task can be significantly reduced. Nevertheless, these studies have explored F2F standard setting and highlighted its associated challenges.

2.4 Challenges associated with standard setting

The challenges that arise when conducting standard setting are associated with either the nature of the study itself or the logistics involved in the study. As standard setting is a complex procedure, both a theoretical and a practical one, it requires careful planning and execution, as "the process of determining cut scores is not only technical, social, and political, but obviously consequential as well" (Pitoniak & Cizek, 2016, p. 39).

2.4.1 Theoretical and practical challenges

The various methods used to set cut scores are not always easy for participants to follow and impose, in one way or another, cognitive demands on the panellists. Panellists may enter a cut score study with predefined ideas of where the cut score should be placed or of what a minimally competent test taker 'can do' at a certain CEFR level. Panellists may also find it difficult to internally operationalise CEFR descriptors (Brunfaut & Harding, 2014; Dunlea & Figueras, 2012; Harsch & Hartig, 2015; O'Sullivan, 2008; Papageorgiou, 2009) or performance level descriptors (PLDs) (Skorupski & Hambleton, 2005). Other challenges include security concerns, as the items/tasks used in the study may come from an administration that is still live (Pitoniak & Cizek, 2016), and the selection of panel, method, and items, which also affects the outcome of a cut score study (Brunfaut & Harding, 2014; Buckendahl, Ferdous, & Gerrow, 2010; Cetin & Gelbal, 2013; Dunlea & Figueras, 2012; Kannan, Sgammato, Tannenbaum, & Katz, 2015; Tiffin-Richards & Pant, 2013).

2.4.2 Logistics

Standard setting is a time-consuming, expensive, and complicated endeavour. Panellists need to be recruited, PLDs need to be created if they do not exist, materials need to be compiled (i.e., test booklets, data), and flights, accommodation, and a venue for the study need to be booked. Facilitators, statisticians, and assistants need to be hired when internal staff of the examination board are not available. Consequently, standard setting is expensive, and cut score studies place an enormous financial burden on any examination board. For example, in a 2016 cut score study conducted in the UK by an international examination board wishing to remain anonymous, an eight-day standard setting study cost approximately £53,750, not including internal costs such as internal panellists, implementation, subsidiary meetings, follow-up meetings, report writing, etc. Seventeen panellists participated in the study, and each session comprised no fewer than 12 judges. Table 2.2 illustrates the external costs (figures rounded) associated with the study.

Table 2.2  Summary of standard setting expenses
Venue: £3,100
Air fare and train fare: £1,750
Accommodation: £5,100
Meals & per diems: £1,850
Consultation fees (including university additional fees for contracting lecturers): £41,950
Total: £53,750

The high costs associated with conducting F2F standard setting workshops can act as a deterrent factor for many awarding bodies and therefore cut score studies are not replicated regularly (Dunlea & Figueras, 2012). Consequently, standard setting practitioners and researchers have started investigating alternative ways and techniques to lower the costs associated with standard setting either by reducing the number of items used in a cut score study (Buckendahl, Ferdous, & Gerrow, 2010; Kannan, Sgammato, Tannenbaum, & Katz, 2015, Tiffin-​Richards and Pant, 2013) or by suggesting or exploring technical advancements which have made standard setting possible in virtual environments (Buckendahl, 2011; Cizek & Earnest, 2016; Harvey and Way, 1999; Gelin & Michaels, 2011; Katz & Tannenbaum, 2014; Katz, Tannenbaum, & Kannan, 2009; Lorie, 2011; Schnipke & Becker, 2007; Way & McClarty, 2012; Zieky, 2012).

2.5 Virtual standard setting

Due to advances in technology and web-conferencing software, and the high costs associated with conducting F2F standard setting, such as expenses for panellist travel, accommodation, food, and the like, researchers have started to investigate the prospect of conducting standard setting studies in an online environment. Considering that virtual standard setting, also referred to as web-based standard setting or online standard setting, is a relatively new area of research, few empirical studies have been published on it so far, and it remains an under-researched area that needs to be explored more thoroughly (Harvey & Way, 1999; Katz & Tannenbaum, 2014; Katz, Tannenbaum, & Kannan, 2009; Schnipke & Becker, 2007).

2.5.1 Virtual standard setting: Empirical studies

In what follows, the virtual standard setting empirical studies conducted to date are reviewed to gain an awareness of current practice. These studies are then re-evaluated from the perspective of MNT, an e-communication medium theory, to provide insight into communication aspects that may have influenced panellist judgements in virtual environments. Harvey and Way (1999) conducted a study comparing a virtual standard setting with a F2F standard setting. The virtual standard setting was conducted in an asynchronous computer-mediated environment, one in which panellists and facilitators communicated through emails. In both settings, the modified Angoff and the Benchmark methods were used, and the instrument used for the study was a test created to measure the reading, writing, and mathematical skills of students entering or completing a teacher preparation course. The study revealed that the modified Angoff method yielded cut scores for both groups that were not statistically different, whereas the difference between the cut scores produced through the Benchmark method in the two environments reached statistical significance – the virtual group's cut score being significantly higher. Another significant finding was that the panellists (teachers) in the virtual environment felt less comfortable exchanging ideas than those participating in the F2F environment. One of the reasons that might have affected the virtual group's interaction patterns could lie in the asynchronous nature of the virtual environment, as it delayed the exchange of ideas and opinions of the panellists. The virtual study took approximately three weeks to complete, something that may have exhausted the panellists, compared with the F2F study that was conducted in a two-day span. Another reason for panellists feeling uncomfortable in the virtual environment may have been the channel through which they communicated, as the panellists and the facilitator
exchanged information in writing. Nonetheless, the researchers concluded that a virtual standard setting system appeared to be a promising endeavour worth exploring. The significance of Harvey and Way's study is that it demonstrated that a virtual standard setting, albeit an asynchronous one, was feasible, and their study thus paved the way for subsequent attempts to explore this uncharted territory. Schnipke and Becker (2007) briefly described their experiences of using videoconferencing technology, which included both audio and visual channels, for item editing and standard setting. They underscored that panellists need training to use the web-based platform effectively and that the speed of the panellists' Internet connection needs to be considered when planning and conducting a virtual meeting. Even though the researchers concluded that a virtual environment may not be as effective for conducting meetings or standard setting workshops as a F2F environment, they nonetheless highlighted that conducting such meetings and workshops was feasible through video technology. Katz, Tannenbaum, and Kannan (2009) conducted a virtual standard setting on a mathematics test with six mathematics teachers. The study was the first documented standard setting workshop to use audio (conferencing) and audio-video technology (web-conferencing) for the panellists to communicate with the facilitator and/or with each other. Unlike a F2F standard setting workshop, which is usually conducted over a few days, each day lasting approximately seven to eight hours (with breaks), the researchers conducted the virtual workshop over several sessions. Apart from an introductory 30-minute session in which participants familiarised themselves with the virtual platform and ensured that they could access the web links, the remaining three sessions lasted approximately two hours each. In this virtual standard setting approach, the facilitators assigned activities for the panellists to complete in their own time. Prior to the second session, panellists reviewed two webcasts on their own in which the purpose of the standard setting workshop and the modified Angoff method were explained. Panellists were then requested to review the test items by taking the test. During the second session, panellists reviewed the webcasts again and engaged in a discussion to fine-tune the PLDs. They were then trained in the method and, after they had practised the method, discussed their practice ratings, and indicated their understanding of the method, panellists began their Round 1 ratings (10 items), which were entered into a custom website accessed through the web-conferencing platform. Once all ratings had been completed, the facilitator displayed the panellists' ratings, and discrepancies in the ratings of individual items were discussed. Session 3 was a continuation of the Round 1 ratings for the remaining 39 items of the test, and when all panellists had finished their ratings, the
same procedure as in Session 2 was followed to facilitate a discussion on the items. The final session (Session 4) consisted of judges making their Round 2 judgements, and after the final cut score was revealed and discussed, judges were asked whether they would be willing to accept the final recommended cut score. A focus group discussion on the virtual standard setting meeting was conducted to facilitate the design of future virtual standard setting workshops. In the discussion of their findings, the researchers focused only on the panellists' perceptions of the virtual environment, the efficiency of the training method, and the panellists' understanding of the PLDs. They concluded that virtual standard setting is feasible and can be economically implemented. Such findings add procedural validity evidence to the cut score study, but no internal validity evidence was provided (i.e., consistency within the method, intraparticipant consistency, interparticipant consistency, decision consistency). This may have been due to the fact that only six panellists participated in the study, making it difficult to investigate the reliability of the cut score and the validity of the inferences drawn from it. Moreover, what was surprising was the researchers' restricted use of the videoconferencing capability of the web-conferencing platform used in their study, Microsoft Office Live Meeting 2007. The web camera was not used during independent ratings but was used by the facilitator when providing instructions or background information to the panellists. It was also used during the discussions, but only the active speaker was broadcast. The researchers reported that some panellists claimed that the use of video helped them to remain engaged during the workshop while others stated that the use of such technology was distracting or even superfluous at times. The researchers rationalised their restricted use of the visual display based on initial trial runs of their virtual standard setting approach with another group of panellists. They also noted that the results of previous studies investigating the use of video during virtual meetings were inconclusive. They concluded by acknowledging that further research was needed to assess the advantages of using video technology during virtual standard setting workshops. This study underscored the feasibility of conducting a virtual standard setting employing two types of e-communication media, audio and video. Nonetheless, the researchers did not examine whether the shifting between the two different media (audio and video) or the blending of the two media at the same time (i.e., during the discussion stages) had a direct or indirect impact on the panellists' behaviour and judgements, which may, in turn, have affected their cut scores. They also did not explore whether conducting a workshop in so many sessions had an impact on the panellists in terms of fatigue, cognitive demands, and/or motivation to continue participating in the study. For example, in the beginning
of Session 3, panellists continued with their Round 1 estimates on the remaining items. However, the researchers provided no indication of how much time had elapsed between the sessions. If there was a considerable amount of time between sessions, then panellists may no longer have been able to relate to the borderline performance level descriptors (BLPDs) that they had created for the minimally competent test taker, or their enthusiasm for participating in the study may have dwindled. Katz and Tannenbaum (2014) investigated whether the results of a virtual standard setting cut score study were comparable to those of a F2F panel. The researchers conducted two studies, and for each study they convened one virtual panel and two F2F panels. The virtual standard setting approach was similar to their earlier standard setting study (Katz, Tannenbaum, and Kannan, 2009) described above, with the only difference being that these two virtual studies did not employ web cameras: the web-conferencing platform (Microsoft Office Live Meeting, 2007) was used to provide panellists with access to links and materials, and call conferencing was used for the panellists' discussions. In the first study, cut scores were set on a digital literacy test, and a total of 33 panellists participated. The two F2F panels comprised 12 and 11 panellists respectively, while the virtual panel had ten panellists. In this study, the researchers presented a more thorough analysis of the procedural validity evidence collected through survey items measuring the panellists' perceptions of the workshop. The descriptive results presented for both the F2F panels and the virtual panel revealed that all panels provided similar ratings for their perceptions of the standard setting process and outcomes. The researchers reported that only one difference involving the virtual cut scores was statistically significant, namely in Round 1, when one of the F2F cut scores was compared with the virtual cut score. The researchers concluded that the virtual panel consistently recommended lower cut scores. In the second study, cut scores were set on a French test fulfilling the language requirements of the certification process for beginning K-12 French teachers. The researchers again used two F2F panels and one virtual panel. The F2F panels consisted of 23 and 24 panellists respectively, while the virtual panel consisted of only seven panellists. Unlike the digital literacy test, the French test consisted of four sections (listening, reading, writing, and speaking), and panellists performed only two rounds of ratings. Similarly to the digital literacy cut score study, the procedural evidence was positive across panels. The researchers concluded that the differences in cut scores between the virtual panel and a F2F panel were no larger than the differences between cut scores derived from two different F2F panels.

Katz and Tannenbaum (2014) found for a digital literacy test that cut scores set in a virtual audio environment were "significantly lower" than cut scores set in a F2F environment. At the same time, they discovered that the differences between cut scores set in two different environments, virtual and F2F, were comparable to the differences between cut scores set by two different panels in the same F2F environment. The authors acknowledge that a more focused investigation into panellist behaviour in a virtual environment is needed before any firm conclusions about comparability between F2F and virtual environments can be made. Nonetheless, the findings indicate that cut score studies spanning multiple sessions in an audio virtual medium are feasible and can yield reliable cut scores. To date, no virtual studies have replicated F2F conditions faithfully, nor have two different e-communication media (e.g., audio and video) been compared in either a synchronous or an asynchronous environment. Virtual studies so far have consisted of a series of smaller sessions (two-hour sessions) spread over a period of time. Due to the session time limitations, panellists are required to do much of what would normally be done in a F2F setting in an asynchronous environment. For example, in virtual studies, panellists are asked to take the test instrument under timed conditions and to record their judgements (e.g., Round 1, Round 2) in between the sessions. Moreover, an in-depth investigation into the panellists' perceptions of participating in a virtual standard setting workshop, either synchronous or asynchronous, has yet to be conducted.

2.5.2 Challenges associated with virtual standard setting Virtual standard setting workshops have their own unique challenges apart from the theoretical and practical challenges that any standard setting workshop has (see section 2.4). Katz and Tannenbaum (2014) listed the main validity threats "likely to affect virtual environments [as] not being able to access the web sites, not being able to hear or see others during group discussions and other technological failures" (p. 3), while a further challenge inferred from the literature (Katz & Tannenbaum, 2014; Katz, Tannenbaum, & Kannan, 2009) is that of security. However, as Katz and Tannenbaum (2014) did not provide any examples of the main validity threats, the researcher felt that some of the threats may have been worded too restrictively. For example, "not being able to access the web sites" (Katz & Tannenbaum, 2014, p. 3) might not only apply to panellists, as suggested in their article, but may also apply to the facilitator. While "not being able to hear others during group discussions" (Katz & Tannenbaum, 2014, p. 3) poses a validity threat, the threat is not confined to group discussions but extends to all stages of a virtual workshop. Additionally, whether not
being able to "see" one another during group discussions poses a validity threat has yet to be determined. Finally, "other technological failures" does not capture specific technological problems such as "slow bandwidth". Consequently, the researcher adapted the above-listed validity threats as follows: (1) inability to access the virtual platforms; (2) inability to hear one another; (3) other technical problems; and (4) security issues. A description of each challenge associated with virtual standard setting workshops follows.

Inability to access the virtual platforms

When panellists are not able to access the virtual platform(s), they cannot participate in a virtual standard setting study. This, in turn, could result in cut scores being unstable, and it could even invalidate the results should there be an insufficient number of panellists participating in the study. Panellists may not be able to access the platform due to a lack of the technological know-how needed to do so and/or due to web server failure, the former possibly restricting the panellist selection pool. When the facilitator cannot access the virtual platforms, however, the scheduled virtual standard setting workshop will need to be cancelled.

Inability to hear one another

Standard setting is a complex, technical, and highly interactive task in which communication plays a pivotal role. Regardless of whether the standard setting method is test-centred or examinee-centred, all methods rely on communication between facilitators and panellists and/or on discussions between panellists. The success of a standard setting relies heavily, in most cases, on panellists exchanging ideas and expertise with one another and on the facilitator communicating the objectives of a cut score study and training panellists efficiently in the standard setting method employed in the study. For example, in the standard setting methods previously reviewed (see section 2.2.1), except for the BG and CG methods, panellists are engaged in extensive discussion either to define the KSAs a minimally competent test taker needs to possess or to agree on the items that are essential for a test taker to know. Most methods nowadays incorporate iterative rounds allowing panellists the opportunity to rationalise their ratings and to exchange academic views on what items/tasks are measuring, such as linguistic features, cognitive skills and processes, syntactical complexity, etc. Additionally, facilitators need to communicate complex and technical information to the panellists either during the method training stage or between rounds. Consequently, if panellists cannot hear the facilitator, they may misunderstand the task at hand, and if they cannot hear other panellists' views,
they may not come to the same understanding of what an item is measuring in terms of KSAs or of what a minimally competent test taker can do.

Other technical problems

Virtual standard setting is only feasible when panellists and facilitators are connected to the Internet and have the appropriate hardware and software. If panellists have slow Internet connections, this may cause problems such as audio or video delays, which may result in panellists becoming frustrated. Other problems that might be experienced by the panellists and the facilitator are temporary power failures and equipment breakdown. Such problems may also occur in a F2F setting, but solutions, such as changing equipment or moving into another room with natural light, can be found much faster in a F2F environment. In a virtual environment, working around a power outage or having a panellist change equipment may not always be feasible.

Security issues

It is extremely difficult for facilitators to ensure that the materials shared with panellists in a virtual environment remain secure. Security measures such as having participants sign a non-disclosure agreement (NDA) and sending files or using platforms that do not allow the printing, copying, and/or pasting of materials may help. However, in our digital age, a mobile phone can easily record test items in any environment (virtual or F2F), and software programs exist that allow users to record anything on their monitor or screen. Nonetheless, despite the problems that a virtual standard setting may pose, there seems to be agreement amongst researchers that there are advantages to conducting standard setting in a virtual environment. The main advantages are the following: (1) the awarding body requesting the study can save money on panellists' travel expenses and accommodation; (2) the total time needed for panellists to participate is reduced, as no travelling to and from a meeting centre is involved; and (3) virtual standard setting workshops allow the facilitator to recruit from a larger pool of panellists, as geographic location is no longer an issue (Buckendahl & Davis-Becker, 2012; Katz & Tannenbaum, 2014; Katz, Tannenbaum, & Kannan, 2009; Zieky, 2012). The fact that Buckendahl and Davis-Becker (2012) recommended reviewing the literature on web-based learning to inform practices of virtual standard setting pinpoints the present gap in the literature concerning the quality and quantity aspects of a virtual standard setting. The above-mentioned advantages and challenges of conducting a virtual standard setting should prompt researchers to investigate further how to overcome mostly technical or practical problems such
as participant engagement or gaining access to the platform. The impact of the medium selected for communication during a virtual standard setting study, regardless of the standard setting method used, has yet to be fully explored. An area from which insight can be drawn and advice can be sought is the field of media communication theories, in particular, media naturalness theory (MNT).

2.6 Media naturalness theory Kock's MNT (Kock, 2004, 2005) is an evolutionary theory that is based on Darwinism. According to Kock (2004, 2005), humans have relied on synchronous forms of communication, either through body language, facial expressions, gestures, and sound or through speech, for thousands of years. Therefore, he defined media naturalness as any type of synchronous communication that can support facial expressions, body language, and speech (Kock, 2005). To this end, F2F communication is viewed as the most natural type of communication, and whatever type of communication deviates from F2F communication is deemed unnatural. Kock (2005) claims there are five media naturalness elements that characterise natural communication amongst participants. These are the participants' ability (1) to share the same environment, one that allows them to see and hear one another (co-location); (2) to exchange verbal cues quickly (synchronicity); (3) to employ and detect facial expressions; (4) to employ and detect body language; and (5) to express themselves through speech and listen to others. MNT predicts that, all other factors being equal (i.e., topic of discussion, task at hand), when the degree of naturalness of a collaborative task is decreased, as is the case in e-communication, then it is likely that the following effects will be observed: "(a) an increase in cognitive effort, (b) an increase in communication ambiguity, and (c) a decrease in physiological arousal" (Kock, 2010, p. 23). Kock (2004) rationalises that since the human brain has developed through evolution to process all five media naturalness elements that naturally occur in F2F communication, when one of the five elements is suppressed, an additional cognitive burden is placed on the participants in their attempt to decode the information received (increase in cognitive effort). Additionally, when participants cannot see each other's facial expressions and gestures or hear each other's tone of voice, communication ambiguity can occur in the participants' attempt to fill in the missing information, information usually derived from verbal or non-verbal cues, through their background schemata (increase in communication ambiguity). Kock (2005) emphasises that this is especially true when participants do not share the same cultural background. When participants are engaged in
communication that may not resemble their everyday type of communication, not being able to see or employ facial expressions and gestures may result in their experiencing a lack of motivation to communicate and may provide them with a less than fulfilling experience due to a decrease in physiological arousal. MNT postulates that an e-communication medium that increases or decreases the "communicative stimuli per unit of time" (Kock, 2010, p. 24) that participants would receive in F2F communication will add a cognitive burden to communication. Thus, MNT acknowledges that certain media may be richer in communication stimuli than the F2F medium. Kock (2010) explains that virtual reality media that allow for multiple participants to interact with one another simultaneously and also allow for participants to engage in private interactions, without the rest of the participants being aware of such interactions, can be classified as super-rich media. In short, on a one-dimensional scale of naturalness, the F2F medium is placed in the middle, and media that offer fewer or more communicative stimuli deviate to the left or right respectively, with deviation in either direction representing a decrease in naturalness (see Figure 2.1).

[Figure: e-mail, internet chat, and videoconferencing lie to the left of the face-to-face medium and super-rich virtual reality media to the right; movement away from the face-to-face medium in either direction marks a decrease in naturalness.]

Figure 2.1  The media naturalness scale (Kock, 2004, p. 340)

MNT appears to be a suitable framework for (1) evaluating the e-communication media and platforms to be selected for a virtual standard setting study; (2) informing the collection of procedural evidence; and (3) analysing and evaluating the findings of such a study. Consequently, according to its principles, a synchronous e-communication medium resembling a F2F environment, such as audio-conferencing or videoconferencing, is more likely to result in more natural communication, as both media incorporate many of the media naturalness elements inherent in F2F settings. These elements can be included in questions collecting procedural evidence to investigate panellists' perceptions regarding the appropriateness of an e-communication medium for conducting a virtual study.

2.6.1 Re-evaluating virtual standard setting studies through MNT In the light of MNT, Harvey and Way (1999) chose an e-communication medium (emails) that suppressed to a certain extent all five of the communication
media naturalness elements, in particular speech and sound. This, in turn, explains why many panellists felt less inclined to share ideas in an e-communication medium employing email. Katz, Tannenbaum, and Kannan (2009) used a combination of video and audio in their first virtual standard setting. In line with MNT, the added value of having the video camera on is that it would make the medium of communication more natural. Unfortunately, because of the restrictions placed on video usage during the study, this hypothesis can be neither confirmed nor rejected, as the camera was on only when the facilitator was providing instructions or background information or when the active speaker was broadcast during the discussions. While the researchers stated that some panellists found the use of the video distracting, it is not clear whether this was due to the added communication media naturalness element (sight) or due to the camera being switched on and off. What is more, the cognitive demands placed on panellists when switching from one medium to the other in the same study were not investigated. Thus, due to the restrictions placed on the use of the visual display, this study cannot be evaluated through the lens of MNT. In the follow-up studies, Katz and Tannenbaum (2014) compared the audio medium with the F2F environment. While the researchers presented the procedural evidence that was collected during the cut score study, there was no indication of whether they explored how comfortable the virtual panellists felt participating in a virtual cut score study through the audio medium. Consequently, whether the virtual medium resulted in an increase in cognitive effort and/or communication ambiguity, or a decrease in physiological arousal, was not investigated. Thus, the conclusion that "a virtual standard setting approach appears to be both viable and appropriate" (Katz & Tannenbaum, 2014, p. 15) may be premature.

2.7 Summary In this chapter, relevant literature pertaining to the meaning of cut scores, standard setting, methods to derive cut scores, a framework for evaluating cut score studies, current LTA cut score studies, virtual standard setting studies, and frameworks for evaluating e-communication media was examined. The review of the literature emphasised the theoretical and practical challenges associated with standard setting as well as the logistics involved, which have paved the way for virtual standard setting. The scarce literature on virtual standard setting has established that it is feasible in asynchronous and synchronous environments when the study has been divided into small sessions; however, the appropriateness of an e-communication medium for such a study has yet to be fully explored, especially through the lens of an e-communication media theory such as MNT, nor have the findings of such studies been thoroughly analysed through Kane's framework (Kane, 2001) for evaluating standard setting. It also remains to be explored whether a virtual medium, or a longer synchronous virtual cut score session whose duration reflects that of a F2F session, affects panellists' ability to set reliable and valid cut scores. The next chapter describes the study conducted to investigate such issues.

Chapter 3:  Methodology The purpose of this chapter is to provide a description of the study's methodology. The chapter is divided into four main sections: the first provides a rationale for conducting a mixed methods research study, and the second describes the instruments that were adopted, adapted, or created to collect data. The third section describes the standard setting methodology in chronological order, while the fourth describes the frameworks that were used to analyse the data.

3.1 Research aim and questions This study was designed to investigate whether cut scores set in two synchronous e-communication media (audio and video) were reliable, which medium was most appropriate for a virtual cut score study, and whether cut scores set in virtual environments (henceforth called "virtual cut scores") were comparable with cut scores set previously in a F2F environment (henceforth called "F2F cut scores"). The research study also sought to explore participants' perceptions of the influence of the two e-communication media on their decision-making processes and their communication patterns. To this end, the research questions (RQs) in this study focused on (1) the reliability and validity of virtual cut scores, (2) the comparison of virtual cut scores within and across panels and between e-communication media, (3) the comparison of virtual cut scores with existing F2F cut scores, (4) the panellists' perceptions of the virtual standard setting process in each medium, and (5) the comparison of the two e-communication media. The specific questions which drove the research were:
RQ1. Can reliable and valid cut scores be set in synchronous e-communication media (audio and video)? If so, to what extent?
RQ1.1 How reliable and valid are the recommended virtual cut score measures?
RQ1.2 How reliable and valid are the judgements made in each e-communication medium?
RQ2. How comparable are virtual cut score measures within and across virtual panels and different environments (virtual and F2F)?
RQ2.1 How comparable are virtual cut score measures within and across virtual panels?
RQ2.2 How comparable are virtual cut score measures with F2F cut scores?
RQ3. Do judges exercise differential severity when setting cut scores in different e-communication media (audio and video)? If so, to what extent?
RQ3.1 Do judges exhibit differential severity towards either of the two e-communication media?
RQ3.2 Do any of the virtual panels exhibit differential severity towards either of the two e-communication media?
RQ4. What are the judges' perceptions of setting cut scores in each e-communication medium (audio and video)?
RQ4.1 Do either of the e-communication media affect the judges' perceptions and evaluations of how well they communicated? If so, to what extent?
RQ4.2 Do either of the e-communication media influence the judges' decision-making processes? If so, to what extent?
RQ4.3 What do judges claim are the advantages and disadvantages of each e-communication medium?
RQ4.4 How do judges compare their virtual standard setting experience with a similar face-to-face experience?
RQ5. Which e-communication medium (audio or video) is more appropriate for setting a cut score on a receptive language test in a synchronous virtual environment?

3.2 Methods The study employed a mixed methods research (MMR) design to answer the RQs. MMR, the third research paradigm (Johnson & Onwuegbuzie, 2004), employs both quantitative and qualitative approaches (1) to collect and analyse the data; (2) to integrate the findings; and (3) to draw inferences (Ivankova & Creswell, 2009). In the field of LTA, MMR is still "relatively new" (Guetterman & Salamoura, 2016, p. 157), but is steadily pervading LTA research (Turner, 2013). MMR has been used in LTA to explore various aspects of validity such as construct validity, consequential validity, and scoring validity (Jang, Wagner, & Park, 2014), as well as rater effects and classroom-based and large-scale assessments (Moeller, 2016). As MMR entails collecting data from both the quantitative and qualitative paradigms, it provides the researcher with the opportunity to collect more data about the problem being investigated and to examine it from two different perspectives (quantitative and qualitative) to obtain a more comprehensive
understanding of the issue (Creswell, 2015). Additionally, the shortcomings of one particular data collection method are balanced by the advantages of another method. For example, survey methodology allows for a large amount of data to be collected from a large number of respondents at the same time, though clarification on participants’ responses is not always possible, especially when surveys have been filled in anonymously. However, by requesting that survey respondents participate in a focus group interview, greater insight may be achieved on survey responses to questions measuring participants’ perceptions. Creswell (2014) identified the following six different MMR designs: (1) convergent parallel; (2) explanatory sequential; (3) exploratory sequential; (4) embedded; (5) transformative; and (6) multiphase. Creswell added that the first three designs are the basic designs of MMR, while the last three are the more advanced designs. In the convergent parallel design, both quantitative data and qualitative data are analysed separately, and the findings of each analysis are compared to see whether they support each other. In the explanatory sequential design, quantitative data is collected and analysed, and the findings are used for a follow-​up qualitative study, while in the exploratory sequential design, the order is reversed as the qualitative findings are used to design the quantitative study that is to follow. The embedded design entails nesting “one or more forms of data (quantitative or qualitative or both) within a larger design (e.g., a narrative study, an ethnography, and experiment)” (Creswell, 2014, p. 228). The embedded design is explained in detail below as it was adopted for the current research project.

3.2.1 Embedded MMR design The embedded MMR design is most appropriate when secondary RQs cannot be answered through one traditional research design, be it purely quantitative or qualitative (Ivankova & Creswell, 2009). For this study, the feasibility of setting reliable and valid cut scores in two synchronous virtual environments was addressed through a quantitative study analysing cut scores derived in each environment, while an investigation into the panellists' perceptions of their experience of participating in a virtual standard setting workshop was addressed through an embedded qualitative design and analysis. More specifically, in this study's embedded mixed methods design, the quantitative research was selected as the overarching design, while the qualitative research was secondary, playing a "supportive role" to the quantitative data (Creswell, 2012). Consequently, it was expected that by using this design, the quantitative results would be triangulated through the qualitative aspect of the research. Figure 3.1 illustrates the study's embedded MMR design (adapted from Creswell, 2014, p. 221).

[Figure: quantitative data collection and analysis, with qualitative data collection and analysis embedded within it, both feeding into interpretation]

Figure 3.1  The study's embedded MMR design

3.2.2 Counterbalanced workshop design Four synchronous virtual standard setting workshops with four different panels of judges were designed to control for order effects and test form effects and to counterbalance the audio and video conditions (Kim, 2010; Ying, 2010). Each workshop consisted of two sessions conducted through the two e-communication media (audio and video), with a different equated test form used in each session. The two sessions totalled approximately 13 hours and were two weeks apart, except for Workshop 4, where the sessions were one week apart at the request of the judges. Figure 3.2 illustrates the counterbalanced design used in the study.

Workshop 1 (N=9):  Session A1 (Audio, Test Form A); Session A2 (Video, Test Form B)
Workshop 2 (N=13): Session B1 (Video, Test Form A); Session B2 (Audio, Test Form B)
Workshop 3 (N=12): Session C1 (Audio, Test Form B); Session C2 (Video, Test Form A)
Workshop 4 (N=11): Session D1 (Video, Test Form B); Session D2 (Audio, Test Form A)
Participants (N=45)

Figure 3.2  Overview of counterbalanced virtual workshop design
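As a minimal sketch (not part of the study's own materials), the counterbalanced assignment in Figure 3.2 can be encoded as a simple data structure and checked programmatically; the dictionary layout and variable names below are illustrative, while the session labels, media, test forms, and panel sizes are those reported in the figure.

```python
# A minimal sketch of the counterbalanced design in Figure 3.2; the workshop
# sizes and session assignments are taken from the figure, everything else is
# illustrative scaffolding.
from collections import Counter

design = {
    "Workshop 1": {"n_judges": 9,  "sessions": [("A1", "Audio", "Form A"), ("A2", "Video", "Form B")]},
    "Workshop 2": {"n_judges": 13, "sessions": [("B1", "Video", "Form A"), ("B2", "Audio", "Form B")]},
    "Workshop 3": {"n_judges": 12, "sessions": [("C1", "Audio", "Form B"), ("C2", "Video", "Form A")]},
    "Workshop 4": {"n_judges": 11, "sessions": [("D1", "Video", "Form B"), ("D2", "Audio", "Form A")]},
}

# Every workshop uses both media and both test forms exactly once.
for name, workshop in design.items():
    media = {session[1] for session in workshop["sessions"]}
    forms = {session[2] for session in workshop["sessions"]}
    assert media == {"Audio", "Video"} and forms == {"Form A", "Form B"}, name

# Across workshops, each medium opens two workshops and each (medium, form)
# pairing occurs twice, which is what counterbalancing requires here.
first_media = Counter(workshop["sessions"][0][1] for workshop in design.values())
pairings = Counter((s[1], s[2]) for workshop in design.values() for s in workshop["sessions"])
print(first_media)                                    # Counter({'Audio': 2, 'Video': 2})
print(pairings)                                       # every (medium, form) pair appears twice
print(sum(w["n_judges"] for w in design.values()))    # 45 panellists in total
```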


3.2.3 Instruments To collect data for the main study, both quantitative and qualitative data collection instruments were adopted, adapted and/​or created. The following section describes the instruments used in the main study.

3.2.3.1 Web-conferencing platform and data collection platform Following the argument of MNT, F2F communication is the most natural medium of communication because human evolution has geared the human brain and body to employ and process all of the five media naturalness elements (co-location, synchronicity, and the ability to transmit facial expressions, gestures, and speech). Given that none of the e-communication media existing to date allow the close simulation of F2F interaction, two e-communication media (audio and video) were selected to carry out the virtual cut score study. The criteria for the selection of these two media were in line with the selection criteria given to managers in the corporate environment under MNT: (1) that an e-communication medium should fulfil the requirements of at least one of the five media naturalness elements, and (2) that an e-communication medium integrating one of the five media naturalness elements to a greater degree displays a higher degree of naturalness. The web-conferencing platform for the main study was selected according to a list of features that the researcher felt were necessary for conducting a virtual standard setting, and the platform also had to comply with MNT principles. At a minimum, the platform needed to have the capability of (1) presenting PowerPoint presentations; (2) allowing the participants the opportunity to chat with one another as well as with the facilitator; (3) allowing the researcher to control participants' ability to use their microphones and web cameras; (4) digitally recording the sessions; and (5) restricting unauthorised access to online sessions and uploaded materials. The researcher consulted online reviews of videoconferencing software in search of a platform meeting all of the above specified features. The next criterion for selection was whether the company that created or distributed the platform provided a free thirty-day trial period to test the platform. The two web-conferencing platforms that met all the aforementioned criteria were GoToMeeting (version 6.0) and Adobe® ConnectTM (version 9.3); hence both of them were trialled in the piloting stage. Two pilot studies were conducted in February 2014, in which the researcher trialled the OSS method (Stone, 2004) as well as the audio medium for conducting a virtual cut score study. The OSS method seemed a promising alternative to the Angoff standard setting method at the time, and since the researcher had no
prior experience of conducting a research study solely in an audio environment, the audio medium was used so that any potential problems associated with the medium would be flagged and dealt with prior to conducting the main study. The first pilot study took place on February 8th, 2014, and trialled the GoToMeeting (version 6.0) online platform. The platform was trialled with 11 PhD students studying in the US, who were interested in learning about standard setting. The students were offered a three-hour workshop in which they would learn about standard setting and gain first-hand experience of participating as a judge in a mock virtual standard setting cut score study. Of the 11 students who had registered for the workshop, only one had a problem with his microphone and had to withdraw. At the end of the workshop, the students were asked to switch on their cameras and provide feedback on their experience with the platform. The participants who remained for the feedback session reported that they were pleased with the online platform and noted only that delays in audio transmission caused problems at times. At the time, the GoToMeeting virtual platform seemed a sensible choice for the main study; however, upon downloading the recording of the session, the researcher discovered that only the audio file could be downloaded. This meant that the visual display of the platform was not recorded in the audio medium. Because the recording did not contain what the participants and the facilitator saw during the workshop, such as the presentation, the chat boxes, and the names of the participants highlighted when they were speaking, it was not possible to identify the speaker during the transcription of the follow-up focus group interviews. The second pilot study took place on February 23rd, 2014, and trialled the Adobe® ConnectTM (version 9.3) videoconferencing platform. The platform was trialled with seven English language teachers invited to take part in a one-day standard setting workshop. To be able to compare and contrast both platforms, the same materials, standard setting method, and medium (audio) were used. The Adobe® ConnectTM platform proved slightly more complicated for the participants than the GoToMeeting platform, and since there was no scheduled platform training session prior to the workshop, two of the participants could not be heard despite all attempts to resolve the problem. One of the participants finally withdrew, while the other communicated her opinions through synchronous text-based interaction (chat). It was later discovered that the participant who withdrew did not have a working microphone. Such problems might have been avoided had the participants checked their system requirements for the platform (audio setup, missing updates, drivers, etc.) on their own prior to the workshop, as requested. After the completion of the workshop, each participant was individually interviewed so that the researcher could trial
the videoconferencing capabilities of the platform and could receive feedback on participants' experiences of participating in an audio workshop. Setting aside some of the technical problems encountered, the participants reported that they were pleased with the platform as they found it easy to use. The recordings of the workshop and interviews revealed that, in both media, the platform interface had been recorded, making it easy for the researcher to identify the speaker. Consequently, the Adobe® ConnectTM platform was selected as the web-conferencing platform for the main study. In terms of MNT, the platform's videoconferencing capabilities, combined with the option of allowing personal chats amongst panellists, can be viewed as a super-rich medium, as they offer more elements of communication media naturalness than the platform's audio-conferencing capabilities. Figure 3.3 places the e-platform's audio and videoconferencing capabilities on the media naturalness scale (Kock, 2004, p. 340).

[Figure: the Adobe® ConnectTM platform's audio-conferencing capabilities are placed to the left of the face-to-face medium and its videoconferencing capabilities to the right, with movement away from the face-to-face medium in either direction marking a decrease in naturalness.]

Figure 3.3  The e-platform placed on the media naturalness scale

To prevent some of the problems encountered during the pilot phase, for the main study all participants were required to have a direct cable connection to the Internet and to attend a scheduled virtual platform training session prior to the workshop. Additionally, an Information Technology (IT) expert was commissioned for the duration of the study to help resolve remotely any technical problems experienced by the participants by accessing their computers through the software programme TeamViewer (TeamViewer GmbH, version 9.0, 2014). To ensure that the workshop and the focus group data were digitally recorded, apart from using the recording capability of the videoconferencing platform, another computer was registered as a participant in order to record all the workshops and focus group interviews through Snagit (TechSmith, version 12.1), a software programme that records a computer's visual display and audio output. To be able to monitor in real time whether participants had completed an individual task, the researcher examined all the features available in the Adobe® ConnectTM platform and concluded that e-polls asking judges to indicate when
they had finished a certain activity would suffice. To facilitate the data collection process, all instruments used to collect judge data during the workshop were uploaded onto SurveyMonkey, an online platform used to create and administer online surveys.

3.2.3.2 Test instrument The test instrument used for the main study consisted of two shortened versions of the grammar, vocabulary, and reading (GVR) section of the Basic Communication Certificate in English (BCCETM) examination, an examination developed by the Hellenic American University. The BCCETM examination is mapped to the CEFR B1 level and consists of four sections: (1) Listening, (2) Grammar, Vocabulary, and Reading (GVR), (3) Writing, and (4) Speaking. For the main study, two shortened versions of the BCCETM GVR section were used, referred to as Test Form A and Test Form B. The full GVR section of the BCCETM examination contains 75 multiple-choice items: 25 discrete grammar items, 25 discrete vocabulary items, and 25 reading comprehension items. Each item is worth one raw point, resulting in a total of 75 raw points. In contrast, each shortened GVR section (Test Form A and Test Form B) contained 45 items: 15 discrete grammar items, 15 discrete vocabulary items, and 15 reading comprehension items (3 passages × 5 items each). The main reason for using shortened versions of the GVR section was to ensure that cut scores for each test form could be set by the end of the second session of each workshop. Table 3.1 illustrates the number of items and total raw points for the BCCETM GVR section and each shortened version (Test Form A and Test Form B).

Table 3.1  BCCETM GVR section: Original vs. shortened versions

                          BCCETM GVR Section   Test Form A   Test Form B
No. of grammar items      25                   15            15
No. of vocabulary items   25                   15            15
No. of reading items      25                   15            15
Total items               75                   45            45
Total score               75 points            45 points     45 points

Test Form A (Hellenic American University, n.d.) and Test Form B were equated through common item concurrent equating (Linacre, 2020c). Concurrent equating allows different test forms to be placed on the same latent scale provided that the test forms share common items, i.e., items that are administered in more than one version of a test instrument. Equating was performed through the WINSTEPS® computer programme (Linacre, 2014, version 3.81.0). To conduct the concurrent equating analysis, the researcher used six different versions of the BCCETM GVR section, all of which shared common items. The first version of the GVR section contained 75 anchor items (items with predefined difficulty estimates) on which the rest of the items were calibrated. Several analyses were conducted until no anchor item exhibited large displacement [large displacement ≥ 2 standard errors (S.E.)]. The final analysis resulted in 32 (43 %) of the 75 anchored items being unanchored (removal of predefined difficulty level), implying that the remaining 43 anchor items were used to place all the other items on the same latent scale. Such an equating process is necessary as it is extremely difficult to create multiple versions of a test instrument exhibiting exactly the same psychometric difficulty. Through the process of equating, a predefined cut score can be applied to different versions of a test ensuring that a cut score is set at the same difficulty level each time, regardless of a difference in raw scores. For example, on one version of a test instrument a cut score set at a raw score of 26 out of 45 raw points may correspond to a cut score of 27.8 out of 45 raw points on another version of a test instrument when the second version is slightly easier. Equating also made it possible for the researcher to directly compare the cut scores set on the two different test forms (Test Form A and Test Form B). The psychometric characteristics of each test form are described in detail in section 4.2.
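To make the raw-score correspondence described above concrete, the following minimal sketch (not the study's WINSTEPS analysis) applies the Rasch test characteristic function to a cut score fixed on the common logit scale; the item difficulty values and the 0.15-logit shift between forms are invented purely for illustration.

```python
# A minimal sketch of how a cut score fixed on the common latent (logit) scale
# translates into different raw cut scores on two equated forms. The item
# difficulties below are illustrative only, not the study's WINSTEPS calibrations.
import numpy as np

rng = np.random.default_rng(1)
difficulties_a = rng.normal(0.0, 1.0, 45)   # Test Form A item difficulties (logits)
difficulties_b = difficulties_a - 0.15      # Form B assumed slightly easier overall

def expected_raw_score(theta, difficulties):
    """Rasch test characteristic function: expected raw score at ability theta."""
    return float(np.sum(1.0 / (1.0 + np.exp(-(theta - difficulties)))))

theta_cut = 0.4                              # cut score expressed on the latent scale
print(round(expected_raw_score(theta_cut, difficulties_a), 1))  # raw cut on Form A (e.g. around 26)
print(round(expected_raw_score(theta_cut, difficulties_b), 1))  # a slightly higher raw cut on the easier Form B
```

Running a sketch like this shows how the same latent cut yields a slightly higher expected raw score on the easier form, which is the sense in which a raw cut of 26 on one form can correspond to roughly 27.8 on another.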

3.2.3.3 CEFR familiarisation verification activities At the beginning of a standard setting workshop, panellists are usually engaged in activities and discussions designed to allow them to become familiar with the performance levels and PLDs, descriptors defining the KSAs expected of test takers (Egan, Schneider, & Ferrara, 2012). During workshops in which cut scores are to be set in relation to one or more CEFR levels, CEFR descriptors are used as the PLDs. Consequently, it is of critical importance that panellists are familiar with the CEFR levels and descriptors, especially the descriptors relevant to the cut score study. The Manual (Council of Europe, 2009) recommends that facilitators create activities to familiarise panellists with the CEFR descriptors before the judging process begins. Such activities help panellists come to a common understanding of the CEFR levels and descriptors, especially when the facilitator provides feedback to panellists on their assessment of the descriptors. Otherwise, panellists may have their own interpretations of the CEFR levels and descriptors, which may pose a validity threat to the cut scores set (Harsch & Hartig, 2015; Papageorgiou, 2009). However, as the majority of judges reported in their background questionnaires that they had familiarity with the CEFR levels and/or descriptors (see section 3.2.3.4, Table 3.2 for the judges' reported CEFR familiarity), two CEFR familiarisation verification activities were created to verify this familiarity; these contained mainly descriptors describing the grammar, vocabulary, and reading ability of test takers, as such abilities are measured in the GVR section of the BCCETM. In each workshop, activity A, which contained 58 discrete descriptors, was administered during the first session, while activity B, which contained 24 of the 58 descriptors from activity A, was administered during the second session. As there are no explicit grammar and vocabulary descriptors in the CEFR scales, the DIALANG grammar and vocabulary descriptors were used instead. In both activities, participants were asked to match the descriptors with their corresponding CEFR levels. Figure 3.4 illustrates the descriptors used for the verification activities (see Appendix A for CEFR Verification Activity A descriptors with their corresponding key).


CEFR Verification Activity A (N=58): CEFR Global Scale descriptors (N=10), DIALANG Grammar descriptors (N=14), DIALANG Vocabulary descriptors (N=13), CEFR Reading descriptors (N=21)
CEFR Verification Activity B (N=24): CEFR Global Scale descriptors (N=6), DIALANG Grammar descriptors (N=6), DIALANG Vocabulary descriptors (N=6), CEFR Reading descriptors (N=6)

Figure 3.4  CEFR familiarisation verification activities
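As an illustration of how responses to such a matching activity can be checked against the key, the sketch below computes exact and adjacent (within one CEFR level) agreement for one hypothetical judge; the descriptor IDs, key, and responses are invented, and the study's actual analysis of the verification activities is described in section 3.4.1.

```python
# A minimal sketch of scoring a CEFR level-matching verification activity against
# its key. The descriptor IDs, key, and judge responses are invented for
# illustration only.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

key = {"d01": "A2", "d02": "B1", "d03": "B1", "d04": "B2", "d05": "C1"}
judge_responses = {"d01": "A2", "d02": "C1", "d03": "B1", "d04": "B2", "d05": "B2"}

def agreement(responses, key, levels=LEVELS):
    """Return the proportion of exact matches and of matches within one CEFR level."""
    exact = adjacent = 0
    for descriptor, true_level in key.items():
        chosen = responses[descriptor]
        distance = abs(levels.index(chosen) - levels.index(true_level))
        exact += distance == 0
        adjacent += distance <= 1
    n = len(key)
    return exact / n, adjacent / n

exact_rate, adjacent_rate = agreement(judge_responses, key)
print(f"exact: {exact_rate:.2f}, within one level: {adjacent_rate:.2f}")  # exact: 0.60, within one level: 0.80
```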

3.2.3.4 Recruiting participants To recruit participants for the main study, the researcher created an electronic project information sheet and consent form (see Appendix B for the electronic consent form) following Creswell's recommendations for seeking informed consent from participants. The project information sheet and consent form informed participants of (1) the purpose of the study; (2) the data collection and storage procedures; (3) their right to withdraw from the study at any time without providing reasons; (4) the absence of any risks associated with participating in the study; (5) the procedures regarding confidentiality; (6) their payment for participation in the study; (7) the requirements to be considered for the study; and (8) their right to ask questions pertaining to the study (Creswell, 2012). Considering that the workshop was taking place in a virtual environment and that the researcher was based in Greece at the time of the study, participants were recruited from Greece and neighbouring European countries sharing approximately the same time zone (± 2 hours) as Greece. Participants were recruited by employing a mixture of a purposive and a snowball sampling procedure. The minimum criteria for panellist consideration were the following:
• five years of experience teaching learners at CEFR B1 level and/or familiarity with the CEFR levels and descriptors;
• an undergraduate degree in Teaching English as a Foreign/Second Language or a related area (e.g., Applied Linguistics, Language Testing, etc.);
• private access to a personal computer, a microphone, and a web camera;
• direct (cable) private access to the Internet.
The first two criteria guaranteed the participant's familiarity with the test population and the intended CEFR level of the test instruments to be used in the standard setting workshop (Raymond & Reid, 2001), while the latter two ensured that all participants would have the appropriate equipment and direct private access to the Internet to participate in the study. To recruit participants, the examination departments of the Hellenic American Union and the PanHellenic Federation of Language School Owners were contacted to enquire whether they would be willing to send, on the researcher's behalf, an email to their examiners inviting them to participate in the study. Colleagues were also contacted and asked to forward the invitation to examiners and/or teachers whom they believed would be interested in participating in the study and who fulfilled the minimum selection criteria. The invitation to participate in the study also provided a link for participants to complete an online background questionnaire mounted on SurveyMonkey in order to be considered for the study. The online background survey also contained the project information sheet and an electronic consent form. Participants who provided their electronic consent to take part in the study were allowed access to the background survey (see Appendix C for the judge background questionnaire). In the last section of the background questionnaire, judges were asked to indicate which workshop(s) they were able to attend. No attempt was made at the time by the researcher to assign participants to a specific group in order to create homogeneous panels,
as only a few participants had expressed their availability for more than one possible workshop date. A total of 50 background questionnaires had been collected, and only two participants were excluded from the study on the basis of not meeting the minimum requirements set in the project information sheet, namely not possessing a Bachelor's degree in Teaching English as a Foreign/Second Language or a related field. An email was sent to the remaining 48 participants notifying them that they had been selected to participate in the study and requesting that they confirm their participation and interest in the workshop. Once confirmation of attendance had been received, an email was sent to participants providing them with additional information about the study. Participants were assigned two pre-workshop tasks: (1) to review the CEFR Reading descriptors and (2) to do an online CEFR Reading task. Of the 48 participants who had been selected, two withdrew from the study for personal reasons before the first session of the workshop. In one of the groups, a participant logged into a workshop session when it was already halfway through. Because the participant could not have been adequately familiar with the test instrument and had not been present during the discussion that had taken place in Round 1, the researcher decided to eliminate the judge's data from the study. Consequently, the remaining 45 panellists constituted the four virtual standard setting panels in the study. For the purpose of this research project, adapting Rheingold's (1993) definition of a virtual community, a virtual standard setting panel was defined as follows: an ad hoc group of professionals and experts, who may not be acquainted with one another, but have gathered in a synchronous virtual environment to participate in two cut score sessions in which they exchange ideas, beliefs, and expert judgement through a multimedia web-based platform on two separate occasions through audio and video e-communication media. Table 3.2 provides a summary of the background information of the 45 panellists.

Table 3.2  Summary of workshop participants

No. of participants: 45
Gender: Female (38); Male (7)
Country residing in: Greece (36); Albania (8); Spain (1)
Highest degree attained: Undergraduate (13); Graduate (15); Postgraduate (17)
Current Position: Private institute English teacher (23); State school English teacher (5); University professor (3); Teacher trainer (3); Administrative position in education (3); Academic coordinator (2); Private English tutor (2); Item writer/materials developer (2); Exam coordinator (1); Director of studies (1)
Teaching English experience (in years): 2–4 years (4); 5–8 years (11); 9–15 years (10); 15+ years (20)
Familiarity with CEFR levels (A1–C2): Slightly familiar (2); Familiar (19); Very familiar (24)
Familiarity with CEFR descriptors: Slightly familiar (5); Familiar (24); Very familiar (16)
CEFR level most familiar with (maximum of 3): A1 (5); A2 (11); B1 (31); B2 (41); C1 (16); C2 (25)
Teaching English experience with B1 level students (in years): 2–4 years (8); 5–8 years (19); 9–15 years (8); Over 15 years (8); Not applicable (2)
Experience with online meetings: No (19); Yes (26)
Experience with online course: No (22); Yes (23)
Experience with online workshops: No (25); Yes (20)
Face-to-face standard setting/benchmarking experience: No (18); Yes (27)
Online standard setting/benchmarking experience: No (34); Yes (11)

Of the 45 participants, 38 (approximately 84 %) were female and the remaining seven were male. The participants were residing in three countries when the study was conducted, the majority (80 %) coming from Greece, approximately eighteen percent (18 %) from Albania, and one participant (2 %) from Spain. All participants had at least a Bachelor's degree in Teaching English as a Foreign/Second Language or in a related field (e.g., English Language and Literature, Applied Linguistics), and one third of the participants (33.33 %) had a postgraduate degree. Two thirds of the participants (66.66 %) had direct contact with English language learners, as they were either English language teachers working at private language institutes or state schools or were offering private English language instruction, while one third (33.33 %) held academic or administrative positions such as university professor, academic or exam coordinator, or teacher trainer. Two of the participants were item writers and/or materials developers. Around ninety percent (90 %) of the participants had at least five years of experience teaching English, and nearly all of them (96 %) had at least two years of experience teaching English to CEFR B1 level learners. Only two participants reported that they were only slightly familiar with the CEFR levels, and approximately eleven percent (11 %) reported that they were only slightly familiar with the CEFR descriptors. Approximately seventy percent (69 %) stated that one of the CEFR levels they were most familiar with was the B1 level, while the remaining thirty-one percent (31 %) of the participants reported that one of the levels they were most familiar with was a level adjacent to B1 (A2 or B2). Approximately forty percent (42 %) reported having taken part in an online meeting, and nearly half (49 %) of the participants had taken an online course at some time in the past. Over half (55 %) of the participants had experience in online training sessions and/or workshops, and forty percent (40 %) had participated in a F2F standard setting workshop. Surprisingly, around a quarter (24.44 %) reported having taken part in an online cut score study and/or benchmarking session. All participants indicated that they had (1) the specified equipment and (2) direct private Internet access, so that they fulfilled the technical requirements to take part in the study.

3.2.3.5 Workshop surveys Throughout the study, surveys were employed to collect procedural evidence as well as judge perceptions of the e-communication medium used during a particular session. The survey items were mainly adopted (used as originally designed) or adapted from Cizek (2012b) for items collecting procedural evidence, and from Dennis and Kinney (1998) for items collecting judge perceptions. There were three types of adaptations made to the survey items: (1) minor, (2) more substantial, and (3) insertion. Minor adaptations include one-word changes to original items and/or changing negatively worded items to positively worded items, while more substantial adaptations include several-word changes, splitting double-barrelled items with or without one-word changes, or using slightly different concepts. New items are referred to as insertions. Table 3.3 provides examples of the types of adaptations that were made to the original items. The first column illustrates the original survey item, the second column describes the type of adaptation, while the third column presents how the item was presented to the judges.

Table 3.3  Examples of survey adaptations

Original item: "I have a good understanding of the [test name] performance level descriptors (PLDs)" (Cizek, 2012b, p. 174)
Type of adaptation: Minor
Survey item: "I have a good understanding of the CEFR Reading Descriptors."

Original item: "The communication condition under which we communicated helped us to better understand each other" (Dennis & Kinney, 1998)
Type of adaptation: Minor
Survey item: "The medium through which we communicated helped us to better understand each other."

Original item: "When we disagreed, the communication conditions made it more difficult for us to come to agreement" (Dennis & Kinney, 1998)
Type of adaptation: Minor
Survey item: "When we disagreed, the medium through which we communicated made it easier for us to come to an agreement."

Original item: "The technologies were helpful and functioned well" (Cizek, 2012b, p. 175)
Type of adaptation: More substantial
Survey items: "The e-platforms were easy to use." / "The e-platforms functioned well."

Original item: "I felt comfortable contributing in the small discussions" (Cizek, 2012b, p. 174)
Type of adaptation: Minor
Survey item: "I felt comfortable contributing in the virtual small group discussions."

Original item: –
Type of adaptation: Insertion
Survey item: "I could relate to other panellists' ideas/beliefs in the medium through which we communicated."

Original item: –
Type of adaptation: Insertion
Survey item: "The medium through which we communicated created a positive working environment."


In each session, a total of six surveys containing both closed-ended and open-ended items were used. For the closed-ended items, participants were requested to express their level of agreement with each survey item using a six-point Likert scale with the following categories: “1” (Strongly Disagree); “2” (Disagree); “3” (Slightly Disagree); “4” (Slightly Agree); “5” (Agree); “6” (Strongly Agree). In the open-ended items, participants were prompted to provide comments explaining their choice on the items that measured their perceptions of the medium, especially when judges used the bottom end of the rating scale [“1” (Strongly Disagree); “2” (Disagree); “3” (Slightly Disagree)]. All six surveys contained at least two final open-ended items in which participants could request further explanation and/or make any comments. All surveys had previously been mounted onto SurveyMonkey, and participants were provided with a link to access the surveys through the online virtual platform. According to Cizek (2012b), procedural evidence should be collected at specific “junctures” (p. 171) during a standard setting workshop. Consequently, the six surveys were administered during each session as suggested: (1) following Introduction/Orientation activities (Survey 1: End of Orientation); (2) following training in the selected method (Survey 2: End of Method Training Session); (3) following completion of Round One ratings/judgements (Survey 3: End of Round 1); (4) following Round One feedback and completion of Round Two ratings/judgements (Survey 4: End of Round 2); (5) following Round Two feedback and completion of Round Three ratings/judgements (Survey 5: End of Round 3); and (6) final evaluation at the conclusion of the standard setting workshop (Survey 6: Final Survey). Throughout the six surveys, there were repeated items measuring participants’ perceptions of (1) their confidence in their ratings; (2) the timing and pacing of each stage of the workshop; (3) the medium used in each session; and (4) the usability and functioning of the web-conferencing platform. Figure 3.5 illustrates the six surveys administered during each session of the workshop.


Figure 3.5  Surveys administered to each panel during each workshop [flowchart: in both Session 1 and Session 2, Survey 1 was administered at the end of the Orientation stage, Survey 2 at the end of the Method Training stage, Surveys 3–5 at the end of Rounds 1–3, and Survey 6 at the end of the session/workshop]

3.2.3.6 Focus group interviews

Focus group methodology allows researchers to collect data economically and efficiently from multiple participants who have shared similar experiences; such shared experiences may promote group cohesiveness and encourage participants to share their personal opinions, attitudes, and beliefs in a more spontaneous way. Focus group data may also provide an in-depth understanding of the participants’ opinions and experiences (Finch & Lewis, 2003; Kleiber, 2014; Krueger & Casey, 2015; Onwuegbuzie, Dickinson, Leech, & Zoran, 2009). Nonetheless, focus groups are not as common as other data collection methods in LTA research (Harding, 2017);


however, they have been used to investigate perceptions of learners (Zhu & Flaitz, 2005), perceptions of teachers (Faez, Majhanovich, Taylor, Smith, & Crowley, 2011; Piccardo, 2013), perceptions of raters (Harding, 2017), and perceptions of standard setting judges (Tschirner, Bärenfänger, & Wanner, 2012). One reason for the limited use of focus group methodology to investigate judges’ perceptions in the field of standard setting may be that there is no reference to this type of data collection method in the Reference Supplement, Section D, to the Preliminary Pilot version of the Manual for Relating Language Examinations to the Common European Framework of Reference for Languages: learning, teaching, assessment (Banerjee, 2004), as focus groups were not common in LTA research when the supplement was published. Another reason why focus groups are not as popular in LTA research as they are in other fields of applied linguistics is that they are not appropriate in cases where the frequency of a phenomenon is being investigated or the issue under investigation is highly sensitive. Focus groups are also challenging in cases where participants lack the ability to express themselves clearly and articulately or are unwilling to share their experiences. Another disadvantage arises when some of the participants dominate the discussion (Bryman, 2004; Creswell, 2014; Krueger & Casey, 2015; Savin-Baden & Major, 2013). Despite these shortcomings, focus groups remain a valid way to collect a large amount of data in a short time, allowing researchers to investigate participants’ perceptions (Bryman, 2004). Consequently, this type of research method was deemed the most appropriate for investigating panellists’ perceptions of the e-communication medium used in each session, as interviewing each panellist individually would have been time-consuming to organise and conduct, which might have resulted in panellists withdrawing from this part of the study or being unable to recall their perceptions accurately due to the passage of time. Thus, all focus groups occurred within two weeks of the end of the second session of the workshops. The final stage of the study comprised online focus group interviews with the workshop participants. The researcher asked participants to provide feedback on their virtual standard setting experiences. Consequently, prior to conducting the focus group interviews, a protocol with questions and probes was created. The protocol was guided by the RQs, the literature on virtual meetings, observations made by the researcher during the piloting phase, and comments made by the judges in the open-ended questions in the surveys. The questions ranged from (1) whether judges felt comfortable interacting in each e-communication medium; (2) whether their decision-making process was affected by either medium; (3) what similarities and differences judges perceived in their interaction patterns when comparing the audio medium to the video medium or the virtual environment to the F2F environment; to (4) whether judges perceived any advantages and disadvantages to participating in a virtual cut score study


(see Appendix D for focus group protocol). Apart from the first focus group interview in which panellists requested to have it immediately after the workshop ended, all other focus groups were comprised of panellists from different groups. Figure 3.6 illustrates the online focus groups conducted and in which workshop each panellist participated in the main study. For example, Focus Group Session 1 had five participants (N=​5) and all participants came from Group 3 [G3 (N=​5)], while Focus Group Session 2 had five participants (N=​5), where one participant came from Group 1 [G1 (N=​1)], two participants came from Group 2 [G2 (N=​2)], and the remaining participants came from Group 4 [G4 (N=​2)].

Figure 3.6  Focus group sessions [Focus Group Session 1 (N=5): G3 (n=5); Focus Group Session 2 (N=5): G1 (n=1), G2 (n=2), G4 (n=2); Focus Group Session 3 (N=9): G1 (n=1), G2 (n=3), G4 (n=5); Focus Group Session 4 (N=7): G1 (n=1), G2 (n=3), G3 (n=2), G4 (n=1); Focus Group Session 5 (N=7): G1 (n=3), G2 (n=1), G3 (n=3)]


3.2.3.7 Ethical considerations

Prior to commencing the study, ethical clearance for the project information sheet and consent form, surveys, and focus group questions was obtained from Lancaster University’s Research Ethics Committee (UREC).

3.3 Standard setting methodology

This section provides a description of the standard setting method chosen for the main study along with its rationale and presents how the method was implemented in a virtual environment.

3.3.1 Rationale for the Yes/No Angoff method

The selection of the standard setting method for the main study was informed by (1) a review of the standard setting literature; (2) the cognitive demands of the method; (3) the performance standard to be set (pass/fail); (4) the nature of the test instrument; (5) the test-taker data available; and (6) the facilitator’s experience with the method (Kane, 2001; Zieky, Perie, & Livingston, 2008). The Yes/No Angoff method was deemed appropriate for this study as it places a lower cognitive demand on the panellists. The method entails making a Yes/No judgement for each item, which in turn reduces the time needed for the judging process. The method is also suitable for instruments containing multiple-choice items, such as the instruments (Form A and Form B) used in the study. Furthermore, it does not require panellists to have direct knowledge of the actual test takers’ KSAs, as is necessary for other standard setting methods (e.g., the CG method). It is also easier for the facilitator to explain the method, train judges in the method, and analyse judge data between rounds (Impara & Plake, 1997; Pitoniak & Cizek, 2016; Plake & Cizek, 2012; Zieky, Perie, & Livingston, 2008). What is more, the researcher had extensive experience in using modified Angoff methods (Downey & Kollias, 2009; Downey & Kollias, 2010; Kollias, 2012; Kollias, 2013).

3.3.2 Pre-workshop platform training

A few days prior to the first session of their scheduled workshop, all participants were asked to undergo an online training/equipment check session. An email was sent to participants providing them with the times at which the training sessions would take place. Prior to participating in the online training session, participants were asked to read a “Visual Quick Start Guide” to the Adobe® Connect™ platform and to perform an online computer diagnostic test. The computer diagnostic test would evaluate whether the participants had a clear connection to the platform and whether they needed software add-ons and/or updates so that all features of


the platform would function appropriately. Participants were also encouraged to download TeamViewer 9, a free online programme that would allow the researcher and/or the IT specialist to try to resolve any computer problems a participant was facing. During the training session, participants were trained to (1) connect their microphones and cameras; (2) share their videos; (3) switch their cameras on and off; and (4) use the platform function buttons. The Adobe® Connect™ platform offers meeting participants the ability to indicate their status through various buttons in the platform. For example, participants can indicate (1) their agreement or disagreement by clicking on the “Agree” or “Disagree” buttons; (2) that they have a question by clicking on the “Raise Hand” button; and (3) that they need to leave the meeting for a short while by clicking on the “Step Away” button. The purpose of requiring participants to take part in a training session was to allow them to become familiar with the online platform and to ensure that their equipment was functioning appropriately. The virtual communication etiquette (netiquette) was also introduced for the first time. Judges were instructed to “raise their hands” virtually if they wished to be given the floor and to use the various buttons to indicate their status. Figures 3.7 to 3.9 illustrate the Adobe® Connect™ platform interface.

Figure 3.7  Example of e-​platform: equipment check session


Figure 3.8  Example of e-​platform: audio medium session

Figure 3.9  Example of e-​platform: video medium session


3.3.3 In preparation for the virtual workshop

Prior to the commencement of the workshop, all materials needed for the virtual cut score study were uploaded onto the Adobe® Connect™ platform and/or onto SurveyMonkey. In this way, access to the materials was given to the participants through file sharing, screen sharing, and/or web links. Table 3.4 illustrates which materials were uploaded and onto which platform.

Table 3.4  Materials uploaded onto virtual platforms

Introduction Presentation: Adobe® Connect™ (screen sharing)
CEFR activities: Adobe® Connect™ (web link); SurveyMonkey
CEFR activity descriptors with keys: Adobe® Connect™ (file transfer)
Test Form A & Test Form B: Adobe® Connect™ (web link); SurveyMonkey
Test form and rating form (hard copy): Adobe® Connect™ (file transfer)
Round 1–3 ratings forms: Adobe® Connect™ (web links); SurveyMonkey
Survey 1–6 forms: Adobe® Connect™ (web links); SurveyMonkey

To ensure that participants were provided access to the materials in the correct order and at the right time, the researcher created a facilitator’s virtual standard setting protocol, which was a step-​by-​step guide for conducting the virtual workshop (see Appendix E for facilitator’s virtual standard setting protocol). The protocol facilitated the smooth flow of the workshop and ensured that all the data for the study was collected.

3.3.4 Description of the workshop stages

The Yes/No Angoff method was organised in iterative rounds (multiple rounds of ratings), with panellists receiving additional information between each round to help them with their ratings (Plake & Cizek, 2012). Each session of the workshop was divided into six stages: (1) Introduction; (2) Orientation; (3) Training in the standard setting method; (4) Round 1; (5) Round 2; and (6) Round 3. Figure 3.10 provides an overview of each session.


Introduction stage (Session 1 only)
• Welcoming
• Introductions

Orientation stage
• Standard setting presentation
• CEFR familiarisation check activity
• Familiarisation with test instrument
• Survey 1: Orientation stage

Training in the method stage
• Yes/No training & practice
• Yes/No practice discussion
• Survey 2: Method training stage

Round 1 stage
• Round 1 judgments
• Round 1 normative information (feedback) & discussion
• Survey 3: Round 1 stage

Round 2 stage
• Empirical data feedback
• Round 2 judgments
• Survey 4: Round 2 stage

Round 3 stage
• Round 2 normative information (feedback) & consequences feedback
• Round 3 judgments
• Survey 5: Round 3

End of session
• Round 3 normative information (feedback) & consequences feedback
• Survey 6: Final survey & farewell greetings

Figure 3.10  Overview of the workshop stages for each session

3.3.4.1 Introduction stage

At the beginning of this stage, each participant was asked to introduce themselves to one another so that they hopefully would feel more comfortable interacting during the discussion stages. Even though the introductions were


brief, they provided the participants with a chance to learn something about the other panellists and form their first impressions of them without relying solely on a visual image and/or tone of voice. The participants were also introduced to two other members of the researcher’s team. The first was an IT expert who provided support to participants experiencing technical problems with their hardware and/or software during the workshop, whereas the second was the co-facilitator, who aided with the downloading of the participants’ data and with preparing the feedback files in order to avoid a time lag between the stages of the workshop, as well as with other administrative tasks such as muting noisy participants and providing assistance through private chats. The Introduction stage was not repeated during the second session as participants were already familiar with one another.

3.3.4.2 Orientation stage

The purpose of the Orientation stage was threefold: (1) to provide judges with an overview of the standard setting workshop; (2) to calibrate judges in their interpretation of the CEFR levels; and (3) to allow judges to become familiar with the content of the test instrument. At the beginning of this stage, participants followed a short PowerPoint presentation on what standard setting entailed and what was expected of them during the workshop. On completion of the presentation, participants completed an online CEFR familiarisation verification activity, in which they matched each CEFR descriptor to its appropriate CEFR level. Two versions of the CEFR familiarisation verification activities were created, one to be used during each session of the workshop (see section 3.2.3.3). Activity A was used in the first session of the workshop and Activity B was used in the second session.

3.3.4.2.1  CEFR familiarisation verification activity A

The CEFR familiarisation verification activity A comprised of 58 discrete descriptors: 10 CEFR Global descriptors; 14 DIALANG Grammar Descriptors; 13 DIALANG Vocabulary Descriptors (see Alderson, 2006 for all CEFR level DIALANG grammar and vocabulary descriptors); and 21 CEFR Reading Descriptors. For the first 10 CEFR global descriptors, participants were asked to match a descriptor to one of the six CEFR levels A1, A2, B1, B2, C1, and C2, while for the remaining 48 descriptors, which focused on grammar, vocabulary, and reading, participants were asked to match the descriptors to one of the following three CEFR levels: A2 (the level below the target level of the test instruments), B1 (the target level of the test instruments), or B2 (the level above the target level of


the test instrument). This allowed the activity to be completed within a suitable time frame and was less cognitively demanding for the participants. Figure 3.11 is an example of CEFR familiarisation activity A.

Figure 3.11  Example of CEFR familiarisation verification activity A

3.3.4.2.2  CEFR familiarisation verification activity B

The second CEFR verification activity comprised of 24 descriptors out of the 58 descriptors used in activity A: six CEFR global descriptors; six DIALANG Grammar descriptors; six DIALANG Vocabulary descriptors; and six Reading descriptors. Participants were asked to match each set of descriptors to one of the six CEFR levels. Even though the participants had completed a CEFR verification activity in the first session of the workshop, it was felt that it would be best if they completed an additional activity before setting a cut score on the second test form, considering that the second session was scheduled either one week (Workshop 4) or two weeks (Workshops 1–3) after the first session. Figure 3.12 is an example of CEFR familiarisation activity B.


Figure 3.12  Example of CEFR familiarisation verification activity B

Once all participants had completed the familiarisation activity, a discussion took place on the descriptors. To facilitate the discussion, participants were provided with visual feedback on their groups’ ratings in the form of graphs and percentages. Figure 3.13 is an example of the feedback projected to participants on CEFR verification activity A. During the discussion, the researcher emphasised parts of the descriptors that he believed would help the participants understand why a particular descriptor was set at a particular CEFR level.

Figure 3.13  Example of CEFR familiarisation verification activity feedback 1


At the end of the CEFR familiarisation activity feedback discussion, the complete set of descriptors, i.e. the key (see Appendix A), with the parts that were emphasised during feedback shown in bold, was uploaded onto the platform so that all participants could download the descriptors and refer to them during the next stages of the workshop. Figure 3.14 is an example of the descriptors uploaded onto the platform.

Figure 3.14  Example of CEFR familiarisation verification activity feedback 2

3.3.4.2.3  Familiarisation with the test instrument

In this part of the Orientation stage, panellists were asked to complete each test form (Test Form A or Test Form B), containing 45 items, in approximately 50 minutes. The time allocated to complete each test form was roughly proportional to the time test takers are given to complete the BCCE™ GVR section in a live administration. To complete the BCCE™ GVR section, test takers are given 70 minutes to complete 75 items; consequently, judges were advised to review and answer all 45 items in each test form in 50 minutes when they were first given access to the test and instructed to review it (see section 3.2.3.2, Table 3.1 for a comparison of the items of the official BCCE™ GVR section and each shortened test form). The rationale behind having panellists undertake the test instrument under timed conditions was so that they would have a similar experience to that of the test takers and become familiar with the items and their corresponding difficulty (Loomis, 2012). Panellists were also provided (through file sharing) with a form to keep a record of their answers so that they could compare their answers with the key that was provided to them once they had completed the test.


Through the online platform, panellists were given access to three separate links to the test instrument, each link containing a subsection of the test instrument. Panellists were then asked to complete each subsection of the instrument in the following order: grammar, vocabulary, and reading. Once all participants had completed all three subsections of the test, the key was provided and participants were asked whether they had any comments or questions on the items. At the end of the Orientation stage, panellists completed their first online evaluation, which was reviewed before moving on to the next stage. Figure 3.15 is an example of the Test Form A grammar subsection.

Figure 3.15  Example of grammar subsection familiarisation

3.3.4.3 Method training stage

The purpose of the method training stage was threefold: (1) to define the concept of a “Just Qualified B1 Candidate”; (2) to train judges in the standard setting method to be used in Round 1; and (3) to provide panellists with practice in using the standard setting method. A “Just Qualified B1 Candidate” was defined as a test taker who (1) had just crossed over the CEFR threshold between level A2 and B1; (2) had just become an “independent” user of the language; and (3) was not expected to be able to handle the higher end of the B1 scale (e.g., B1+). Panellists were provided with three example items and instructed to decide whether a “Just Qualified B1 Candidate” would be able to answer the items correctly. Each item was presented one at a time and, for each item, participants entered either “Yes” or “No” into an e-poll. Once all panellists had answered the poll, a discussion


took place on the groups’ rating behaviour. Panellists were invited to discuss their rationale for their own rating. To ensure that the panellists had understood the standard setting method and that any possible confusion had been addressed, panellists completed their second online evaluation form. Upon reviewing the responses to the evaluation, the researcher concluded that the group was ready to proceed to the next stage.

3.3.4.4 Judgement stage

Round 1 Stage: The purpose of this stage was threefold: to allow panellists to (1) “trial” the standard setting method; (2) set a preliminary cut score; and (3) share their rationale for their judgements. During this stage, panellists were asked to evaluate each item and apply the standard setting method explained in the training stage. Similar to the test instrument familiarisation step, panellists were provided with three online links, one for each section of the test, to enter their ratings. Panellists were also instructed to keep records of their ratings in the rating form so that they would have immediate access to their ratings for the discussion step of this stage. Figure 3.16 is an example of the grammar section Round 1 virtual rating form.

Figure 3.16  Example of Round 1 virtual rating form

Panellists were instructed to complete their ratings in the following order: grammar section; vocabulary section; and reading section. Polls were used asking panellists to indicate whether they had finished each section. Once a


section had been completed by all panellists, the researcher and the co-​facilitator downloaded the panellists’ ratings and started creating the feedback file to be used in the discussion step of this stage. After all panellists had finished all three subsections, a discussion on the panellists’ ratings took place. To facilitate Round 1 discussion, panellists were presented with normative information illustrating their groups’ ratings on an item. Figure 3.17 is an example of the feedback projected to panellists during Round 1 discussion.

Figure 3.17  Example of panellist normative information feedback

As it was impractical to discuss every item (Reckase & Chen, 2012), especially since a specific amount of time was allocated to each session of the workshop, all Round 1 discussions were kept to approximately one hour, in which panellists discussed items on which panellist disagreement was evident (i.e., at least a third of the panellists had a different view on the difficulty of an item). Approximately half the items in each section were discussed, and panellists with differing opinions were invited to express their rationale for their judgement on each one of the selected items, though the aim of the discussion was not to impose consensus. At the end of the discussion, panellists completed their third online evaluation, which was reviewed before moving on to the next stage.

Round 2 Stage: The purpose of this stage was to allow panellists to make any changes to their initial ratings and to set a second cut score. At the beginning of this stage, panellists were trained in how to interpret the empirical data that would be provided in this round to inform their Round 2 judgements. The empirical data used were logits, difficulty measures yielded from a Rasch analysis ranging from negative values


to positive values. Negative values indicate that an item is easy, while positive values indicate more difficult items. Items of average difficulty have a logit value of “0.00”. Logits were used instead of facility values so as not to influence panellists in their next round of judgements (Round 2). A discussion took place on how to interpret the logit values. The researcher explained that when an item had a difficulty logit value of “0” and a test taker had an ability level of “0”, the test taker had a fifty-percent chance of getting the item correct; however, when a test taker had an ability level of “0” logits and an item had a difficulty of “1.50” logits, the test taker had a much lower probability of getting the item correct. On the other hand, when a test taker had an ability level of “0” logits and an item had a difficulty measure of “-1.50” logits, the test taker had a much higher probability of getting the item correct. Panellists were then shown the same three items used in the method training stage; however, each item also contained empirical data. Judges were then asked to re-evaluate the items in terms of their empirical data and to decide whether a “Just Qualified B1 Candidate” would answer the question correctly. As in the initial method training stage, a discussion followed each item in which judges rationalised their rating. After all questions on how to interpret the empirical data had been answered and judges felt confident in their interpretation of the empirical data, panellists were informed that for each item they would be provided with the following information: (1) the most difficult logit value for the subsection; (2) the easiest logit value for the subsection; and (3) each item’s difficulty logit value (see section 4.2 for the psychometric properties of the test forms). Panellists were instructed to re-evaluate each item and apply the same standard setting method used for Round 1, taking into consideration what was discussed during the Round 1 discussion and the empirical data provided in this stage. Panellists were informed that they were not obliged to change their Round 1 rating if they wished to leave it the same. The same procedure for completing their ratings was followed as in Round 1, with the only difference being that panellists also had the empirical data (in logits) for each item. Figure 3.18 is an example of the Round 2 online rating form. At the end of this stage, panellists completed their fourth online evaluation, which was reviewed before moving on to the next stage.


Figure 3.18  Example of Round 2 virtual rating form

Round 3 Stage: The purpose of the Round 3 Stage was to provide judges with a final opportunity to make any changes to their recommended cut score. At the beginning of Round 3, normative information in the form of each judge’s Round 2 cut score and the group’s overall Round 2 cut score, together with consequences feedback (the pass rate based on the group’s Round 2 recommended cut score), was presented. Figure 3.19 shows an example of the normative information and the consequences feedback projected to participants. A brief discussion followed on how the data should be interpreted.


Figure 3.19  Group 1 normative information and consequences feedback

Panellists were instructed to re-evaluate their cut score and to enter a final individual cut score. Panellists were reminded that they did not have to change their Round 2 cut score. Figure 3.20 is an example of the Round 3 rating form. At the end of this stage, panellists completed their fifth online evaluation, which was reviewed before moving on to the final evaluation.

Figure 3.20  Round 3 virtual rating form


Before concluding the session, panellists were shown once again their own Round 3 judgements, the overall cut score, and the consequences feedback for that cut score so that they could complete their final evaluation form. At the end of the final evaluation, all participants were thanked and reminded of the next session that was going to take place and/or the upcoming focus group session. Each session lasted approximately 6–6.5 hours with breaks. Table 3.5 illustrates the approximate duration of each session.

Table 3.5  Virtual session duration (timing in minutes)

Orientation Stage: 90–95
Break: 15
Method Training Stage: 25–30
Round 1 Stage: 45
Lunch Break: 60
Round 1 Stage cont.: 70–75
Round 2 Stage: 30–35
Break: 5
Round 3 Stage: 15–20
Wrap up: 5
Total: 360–385

3.4 Data analysis methods and frameworks

This section presents the types of analyses performed on the quantitative and qualitative data gathered throughout the workshop. The data collected were a mixture of quantitative and qualitative data from three main sources: (1) panellist elicited numeric judgements (raw data), (2) online surveys, and (3) focus group interviews. Quantitative data were collected through panellist ratings and survey responses, while the qualitative data were collected through the open-ended items in the surveys as well as from the focus group interviews. Figure 3.21 illustrates the types of data collected.


Figure 3.21  Overview of the quantitative and qualitative data collected [diagram: quantitative data comprised the CEFR verification activities (Activity A, N=48; Activity B, N=24) and the panellists’ ratings from Session 1 and Session 2 (Round 1 ratings, N=45; Round 2 ratings, N=45; Round 3 ratings, N=1), together with six survey evaluation forms per session; qualitative data comprised the transcripts of the five focus group interviews and the open-ended survey questions]

3.4.1 CEFR verification activities analysis

For each of the 45 participants, two sets of data were collected during the CEFR training, one for each session. The first activity consisted of 58 items, while the second activity consisted of 24 items. To measure the degree of consistency between the panellists’ ranking of the CEFR descriptors and the descriptors’ actual levels, Kaftandjieva’s (2010) misplacement index (MPI) was used (see Appendix F).

3.4.2 Internal validity of cut scores

Classical test theory (CTT)

Internal validity relates to the degree to which (1) the cut score would be attained if the same method were repeated (consistency within the method);


(2) panellists are consistent in their own judgements across rounds and groups (intraparticipant consistency); (3) panellists are consistent in their ratings when compared to one another (interparticipant consistency); and (4) test takers are accurately classified (decision consistency) (Cizek & Earnest, 2016; Hambleton & Pitoniak, 2006). Figure 3.22 illustrates the types of analysis conducted to investigate the internal validity of the cut scores.

Figure 3.22  Data analysis for internal validity: CTT [diagram: consistency within the method: SEc/SEM < .05 (internal check); intraparticipant consistency: MPI (Kaftandjieva, 2010), correlations; interparticipant consistency: ICC, Cronbach’s alpha; decision consistency: Livingston & Lewis method (1995), Standard Error method (2009)]

Consistency within the method was measured by comparing the standard error of the cut score measure (SEc) with the standard error of measurement of the test form (SEM). Intraparticipant consistency was measured by examining (1) the degree to which ratings correlated with empirical item difficulty measures; (2) the degree to which ratings correlated across rounds; and (3) the extent to which ratings changed across rounds. Interparticipant consistency was measured through the intraclass correlation coefficient (ICC) (McGraw & Wong, 1996; Shrout & Fleiss, 1979) and Cronbach’s alpha. Decision consistency and accuracy (the probability of correct classifications, false positive and false negative classifications, and the overall percentage of consistent classifications) were measured through the Livingston and Lewis method (Livingston & Lewis, 1995) and the Standard Error method (Stearns & Smith, 2009).
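To make the interparticipant consistency indices concrete, the sketch below computes Cronbach’s alpha from a small matrix of hypothetical Yes/No judgements (one row per test item, one column per judge). It is a minimal illustration only, not the author’s analysis code or the study’s data.

```python
# Minimal sketch of Cronbach's alpha for judge consistency, assuming ratings
# are arranged with one row per test item and one column per judge
# (hypothetical data, not the study's ratings).
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of judge variances / variance of item totals)."""
    k = ratings.shape[1]                          # number of judges
    judge_vars = ratings.var(axis=0, ddof=1)      # variance of each judge's ratings
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of the item totals
    return (k / (k - 1)) * (1 - judge_vars.sum() / total_var)

# Five items rated Yes (1) / No (0) by four hypothetical judges.
ratings = np.array([[1, 1, 1, 1],
                    [1, 1, 0, 1],
                    [0, 0, 0, 1],
                    [1, 1, 1, 1],
                    [0, 1, 0, 0]])
print(round(cronbach_alpha(ratings), 3))  # ~0.754
```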

Rasch measurement theory (RMT)

To further evaluate the internal validity of the cut scores, RMT was employed. By investigating the validity evidence through RMT, findings can move beyond the group level to the individual panellist level, and CTT findings can be corroborated. What follows is a brief introduction to RMT, in which the MFRM model is located.


The basic Rasch model was developed by Georg Rasch (Rasch, 1960/1980) and is a probabilistic model in which the probability of a test taker answering an item correctly is a function of the difference between the test taker’s ability (Bn) and the item’s difficulty (Di). Thus, the more ability test takers have, the greater the probability of getting an item correct, while the more difficult an item is, the less likely test takers are to get that item correct (Bond & Fox, 2015; Bachman, 2004; McNamara, 1996). The Rasch model states that test-taker ability is independent of the number of items in a test instrument and that item difficulty is independent of the sample of test takers (Bond & Fox, 2015; Snyder & Sheeham, 1992), a prerequisite for any objective measurement model (Engelhard, 2013). The Rasch model assumes that when test-taker ability and item difficulty can be estimated and placed on a natural log scale, the probability of observing a specific score is obtained from the difference of the two estimates. The natural log scale, referred to as a logit scale, is an interval scale that rank orders (1) test takers in terms of their ability and (2) items in terms of their difficulty (Bond & Fox, 2015). By placing test takers and items on the same logit scale, a logit being a “simple logarithmic expression of the odds of success” (Knoch & McNamara, 2015, p. 277), researchers can see by how much a test taker is “more” able or an item is more “difficult” (McNamara, 1996). The basic Rasch model is the following:

\log\left(\frac{P_{ni1}}{P_{ni0}}\right) \equiv B_n - D_i \qquad (1)

where B_n is the ability of subject n (n = 1, ..., N), D_i is the difficulty of item i (i = 1, ..., L), P_{ni1} is the probability that subject n will succeed on item i, and P_{ni0} = 1 - P_{ni1} is the probability of failure. P_{ni1} can be expressed as (Linacre, 2004, p. 27):

P_{ni1} = \frac{e^{(B_n - D_i)}}{1 + e^{(B_n - D_i)}}

According to equation 1, the log ratio of the probability of success on item i and the probability of failure on item i is equal to the ability of a test taker Bn minus


the difficulty of item Di. It follows that when a test taker n has the same ability level as the item difficulty level i, the probability of test taker n answering the item correctly is .50. For example, if a test taker has an ability of 1.50 logits and an item has a difficulty of 1.50 logits, inserting those estimates into the equation will result in a fifty-percent chance of getting that particular item correct:

P_{ni1} = \frac{e^{(B_n - D_i)}}{1 + e^{(B_n - D_i)}} = \frac{e^{(1.5 - 1.5)}}{1 + e^{(1.5 - 1.5)}} = \frac{e^{0}}{1 + e^{0}} = \frac{1}{1 + 1} = \frac{1}{2}

The basic Rasch model is a two-​ facet model (test-​ taker ability and item difficulty), which uses dichotomous data for analysis. A facet is any variable such as a task, item, or rater that can affect scores assigned to a particular test taker in a systematic way (Eckes, 2009). However, if the data for analysis is polytomous and/​or there are more facets interacting in the measurement process, then several other models from the Rasch family exist to select from.
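As a concrete illustration of the dichotomous model above, the short sketch below converts an ability-difficulty difference in logits into a probability of success; the numbers are the hypothetical values used in the text, not estimates from the study.

```python
# Minimal sketch of the dichotomous Rasch probability:
# P(correct) = exp(B - D) / (1 + exp(B - D)), with B and D in logits.
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    diff = ability - difficulty
    return math.exp(diff) / (1 + math.exp(diff))

print(round(rasch_probability(1.5, 1.5), 2))   # 0.5  (ability equals difficulty)
print(round(rasch_probability(0.0, 1.5), 2))   # 0.18 (item 1.5 logits harder)
print(round(rasch_probability(0.0, -1.5), 2))  # 0.82 (item 1.5 logits easier)
```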

The many-facet Rasch measurement (MFRM) model

The MFRM model was developed by Linacre (1989/1994) as an expansion of the basic Rasch model to include more than two facets in the data analysis. Facets such as judges may introduce systematic measurement error into any judging situation as judges will vary in terms of their leniency/severity, and the MFRM model allows for a probabilistic estimation of such severity on the same logit scale (Weigle, 1998; Wolfe & Dobria, 2008). This, in turn, allows for the estimation of objective measures of judge ratings (Linacre, 2020a). McNamara and Knoch acknowledge that the use of the MFRM model by LTA researchers has increased significantly in the past decade (Knoch & McNamara, 2015; McNamara & Knoch, 2012), which is equally true for its use by standard setting researchers to analyse judge data. Several standard setting researchers have used MFRM to evaluate judge ratings on dichotomous and polytomous data (Eckes & Kecker, 2010; Hsieh, 2013; O’Sullivan, 2010; Papageorgiou, 2009; Kantarcioglu & Papageorgiou, 2011; Stone, Belyokova, & Fox, 2008; Wu & Tan, 2016), and it is a procedure recommended by the Language Policy Division of the Council of Europe for relating examinations to the CEFR (Eckes, 2009). MFRM analysis allows for the investigation of differences in judge severity as no amount of training can change idiosyncrasies that judges may have developed throughout their life (Lunz & Stahl, 2007). From a Rasch measurement perspective, the internal validity of judge ratings can be evaluated at the individual level and at the group level. Figure 3.23 displays the indices that were consulted to evaluate the consistency of standard setting judgements.


Figure 3.23  Data analysis for internal validity: RMT [diagram: individual level: fit statistics (Infit MnSq (Zstd)), point-biserial correlations > .20, observed vs. expected agreement %; group level: separation indices (G), (H), (R), chi-square statistic (χ2), observed vs. expected agreement %]

On the individual level, the internal validity of each judge’s ratings was investigated through fit statistics, point-biserial correlations, and observed and expected agreements amongst judges. On the group level, the internal validity of each group’s recommended cut score was examined through the separation index, strata, the reliability of the separation index, and the chi-square statistic, and by comparing the exact agreement of all judges with the model’s expected agreement. To investigate whether the medium influenced cut scores, pairwise interaction analyses were conducted.

3.4.3 Comparability of virtual cut score measures

To investigate whether virtual cut score measures (i.e., cut scores expressed in logits) were comparable between media, within and across groups, and across rounds, the researcher conducted a series of Welch’s t-tests (Welch, 1947). The Welch t-test was selected instead of Student’s t-test as the former does not assume that both populations have the same variance (Lu & Yuan, 2010). Cut scores are considered comparable when cut score measure differences within groups and between/across media or across groups and media are not statistically significant. However, when more than one t-test is conducted or more than one comparison is made on the same data, a multiple comparison adjustment is needed. The adjustment is made (1) to reduce the probability of observing a false positive (finding significance when none exists), that is, to control the family-wise error rate (FWER), or (2) to control the expected proportion of falsely rejected hypotheses, that is, the false discovery rate (FDR; Benjamini & Hochberg, 1995). Both approaches control for making an incorrect inference due to sampling


error by adjusting the “statistical confidence measures” in direct proportion to the number of t-tests conducted (Rouam, 2013). However, FWER approaches such as Holm’s sequential Bonferroni procedure (Abdi, 2010) or the Dunn-Bonferroni adjustment method (Dunn, 1961) have been criticised as being too conservative and failing to identify statistically significant differences. Opponents of the FWER approach advocate using an FDR approach to adjust p-values when performing multiple t-tests. “A false discovery is an incorrect rejection of a hypothesis, and the FDR is the likelihood such a rejection occurs” (Haynes, 2013). In the original FDR approach (Benjamini & Hochberg, 1995), all p-values from the multiple comparison tests are arranged in descending order and an FDR threshold value is then applied to each t-test. For example, when eight multiple comparison tests have been conducted, the FDR adjusted p-value (threshold value) for the highest p-value would be .05, the threshold for the second highest p-value would be .04375, the third highest threshold value would be .0375 (.04375 - .00625 = .0375), and so on, each time subtracting .00625 (= .05/8) from the preceding threshold value. In the FDR method, the smallest p-value is compared with the smallest adjusted p-value, known as a q-value (Storey, 2002), while the highest p-value is compared with the highest adjusted p-value. The advantage of this method is that it yields more power. Consequently, this study utilises an FDR approach for multiple comparisons, specifically the Benjamini, Krieger, and Yekutieli (2006) two-stage linear step-up procedure to control the FDR. This adaptive method was preferred over the original method (Benjamini & Hochberg, 1995) as it yields greater power because it uses larger thresholds (see Benjamini, Krieger, & Yekutieli, 2006 for more details). The multiple comparison q-values (adjusted p-values) were retrieved from GraphPad Prism (version 9.3.1, 2021).
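A minimal sketch of this comparison strategy is given below, using hypothetical cut score measures rather than the study’s data: Welch’s t-tests are run for several panel-level comparisons and the resulting p-values are then adjusted for the false discovery rate. It assumes SciPy and statsmodels are available; to my understanding, statsmodels’ 'fdr_tsbky' option implements the two-stage Benjamini-Krieger-Yekutieli step-up procedure, while 'fdr_bh' gives the original Benjamini-Hochberg adjustment.

```python
# Minimal sketch: Welch's t-tests on hypothetical cut score measures (logits)
# for three panels, followed by an FDR adjustment of the family of p-values.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
panels = {f"Panel {i}": (rng.normal(0.0, 0.4, 12),   # audio-medium cut score measures
                         rng.normal(0.1, 0.5, 12))   # video-medium cut score measures
          for i in range(1, 4)}

p_values = []
for name, (audio, video) in panels.items():
    t, p = stats.ttest_ind(audio, video, equal_var=False)  # Welch's t-test
    p_values.append(p)
    print(f"{name}: t = {t:.2f}, unadjusted p = {p:.3f}")

# q-values: p-values adjusted with the two-stage FDR step-up procedure.
reject, q_values, _, _ = multipletests(p_values, alpha=0.05, method="fdr_tsbky")
print("q-values:", np.round(q_values, 3), "reject:", reject)
```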

3.4.4 Differential severity

To investigate whether judges and/or virtual panels exhibited differential severity towards a particular e-communication medium, differential analysis was conducted. The Benjamini, Krieger, and Yekutieli (2006) two-stage linear step-up procedure to control the FDR was used when multiple comparisons were conducted. The estimates were retrieved through the FACETS computer program (Linacre, 2022, version 3.84), and pairwise interactions with their corresponding multiple comparison q-values (adjusted p-values) were retrieved from GraphPad Prism (version 9.3.1, 2021).


3.4.5 Survey analysis

As the survey data collected did not conform to the Rasch model assumption of unidimensionality, implying that not all items were measuring the same construct (Iramaneerat, Smith, & Smith, 2008), the quantitative survey data were analysed through CTT. The Wilcoxon signed-rank test was used to investigate whether there was a median difference between the observations made in the audio medium and the video medium. However, when the symmetry assumption was violated, the Sign test was used in its place as it is more robust to violations of a symmetrical distribution (Kinnear & Gray, 2008). The Benjamini, Krieger, and Yekutieli (2006) two-stage linear step-up procedure to control the FDR was used when multiple comparisons were conducted. The qualitative data from the surveys were interpreted in reference to the results of the statistical analyses.
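The sketch below illustrates this pairing of tests on hypothetical paired Likert responses (coded 1-6) for the same judges in the audio and video media; it is not the study’s survey data, and the sign test is implemented here through an exact binomial test rather than a dedicated library routine.

```python
# Minimal sketch: Wilcoxon signed-rank test on paired Likert ratings, with a
# sign test (exact binomial test on the direction of the differences) as the
# fallback when the symmetry assumption looks untenable. Hypothetical data.
import numpy as np
from scipy import stats

audio = np.array([5, 4, 6, 5, 3, 4, 5, 6, 4, 5, 4, 5])
video = np.array([5, 5, 6, 6, 4, 4, 5, 6, 5, 5, 5, 6])

w_stat, w_p = stats.wilcoxon(audio, video, zero_method="wilcox")
print(f"Wilcoxon signed-rank: W = {w_stat}, p = {w_p:.3f}")

diffs = video - audio
n_pos, n_neg = int((diffs > 0).sum()), int((diffs < 0).sum())
sign_p = stats.binomtest(n_pos, n_pos + n_neg, p=0.5).pvalue  # requires SciPy >= 1.7
print(f"Sign test: {n_pos} positive vs {n_neg} negative differences, p = {sign_p:.3f}")
```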

3.4.6 Focus group interview analysis

To analyse the focus group data, the constant comparative method (CCM), a method first developed by Glaser (1965), was used. The process involved the following steps: (1) reading through the entire set of data (or a subset of the data); (2) chunking the data into smaller meaningful parts; (3) assigning codes to each chunk; (4) comparing each new chunk of data with previously assigned codes; (5) grouping codes by similarity; and (6) identifying categories (themes) based on the grouped codes. CCM is especially suitable for analysing data from multiple focus group interviews as the themes emerging from one focus group can be directly compared and contrasted with the themes emerging from another focus group. Such a comparison allows the researcher to verify whether data and/or theoretical saturation has occurred (Onwuegbuzie, Dickinson, Leech, & Zoran, 2009). Figure 3.24 illustrates how the focus group data were analysed.


Figure 3.24  CCM process for analysing focus group transcripts [cyclical diagram: 1. reading through the entire set of data; 2. chunking the data into smaller meaningful parts; 3. assigning codes to each chunk; 4. comparing each new chunk of data to previously assigned codes; 5. grouping codes by similarity; 6. identifying themes based on the grouped codes]

The coding process first entailed reading through the first focus group transcripts (step 1), and then separating the data into smaller units (step 2) and assigning initial codes (step 3) by employing a range of coding approaches (Miles, Huberman, & Saldana, 2020; Saldana, 2013). Next, First Cycle coding methods such as evaluation coding, descriptive coding, in vivo coding, and emotion coding were used. Evaluation coding was used as it reflected the evaluative nature of the focus group questions and allowed for data to be separated into segments following the researcher’s evaluation of the responses provided by the participants. Descriptive coding was used as such coding allowed for an initial word or phrase to summarise the main topic of discussion, while in vivo coding was used when the same wording was used by multiple participants when addressing the same focus question. Both descriptive coding and in vivo coding supplement evaluation coding. Emotion coding was employed when emotions were either explicitly stated by the participants or inferred by the researcher. Next, each new chunk of data (coming from the subsequent focus group) was compared with First Cycle codes (step 4) which were revised accordingly. During the Second Cycle coding, pattern coding was used to group codes by similarity


(step 5) into smaller categories, a precursor for analysis across focus groups. The final step (step 6) entailed theming the data by collapsing the grouped categories that emerged from pattern coding (step 5). Figure 3.25 illustrates the coding process within the CCM.

Figure 3.25  Coding process within CCM [diagram: Steps 1–3: initial coding and First Cycle coding (evaluation coding, descriptive coding, in vivo coding, emotion coding); Step 4: revision of codes; Step 5: Second Cycle coding (pattern coding); Step 6: final cycle, theming the data]

3.6 Summary

In this chapter, the argument for an embedded, mixed methods, counterbalanced research design was presented. Next, the instruments used to collect the data and the rationale for their use were discussed. Following on from this, the standard setting methodology employed in the main study and how the virtual standard setting workshop was conducted were described in detail. Finally, the frameworks for the data analyses were presented and discussed. Table 3.6 presents an overview of the RQs, the instruments used, the data collected, and the analysis applied.


Table 3.6  Overview of RQs, instruments, data collected, and analysis

RQ1. Can reliable and valid cut scores be set in synchronous e-communication media (audio and video)? If so, to what extent?
Instruments: Forms A & B; surveys; focus group interviews. Data collected: judge ratings; survey responses; focus group transcripts. Data analysis: CTT & RMT; Wilcoxon signed-rank tests/Sign tests; CCM.

RQ1.1 How reliable and valid are the recommended virtual cut score measures?
Instruments: Forms A & B. Data collected: judge ratings. Data analysis: CTT & RMT.

RQ1.2 How reliable and valid are the judgements made in each e-communication medium?
Instruments: Forms A & B. Data collected: judge ratings. Data analysis: CTT & RMT.

RQ2. How comparable are virtual cut score measures within and across virtual panels and different environments (virtual and F2F)?
Instruments: Forms A & B. Data collected: judge ratings. Data analysis: RMT & Welch t-tests.

RQ2.1 How comparable are virtual cut score measures within and across virtual panels?
Instruments: Forms A & B. Data collected: judge ratings. Data analysis: Welch t-tests.

RQ2.2 How comparable are virtual cut score measures with F2F cut scores?
Instruments: Forms A & B; original Form A. Data collected: judge ratings. Data analysis: Welch t-tests.

RQ3. Do judges exercise differential severity when setting cut scores in different e-communication media (audio and video)? If so, to what extent?
Instruments: Forms A & B. Data collected: judge ratings. Data analysis: RMT: MFRM.

RQ3.1 Do judges exhibit differential severity towards either of the two e-communication media?
Instruments: Forms A & B. Data collected: judge ratings. Data analysis: RMT: MFRM.

RQ3.2 Do any of the virtual panels exhibit differential severity towards either of the two e-communication media?
Instruments: Forms A & B. Data collected: judge ratings. Data analysis: RMT: MFRM.

RQ4. What are the judges’ perceptions of setting cut scores in each e-communication medium (audio and video)?
Instruments: focus group interviews; surveys. Data collected: focus group transcripts; survey responses. Data analysis: CCM; Wilcoxon signed-rank tests/Sign tests & RMT.

RQ4.1 Do either of the e-communication media affect the judges’ perceptions and evaluations of how well they communicated? If so, to what extent?
Instruments: focus group interviews; surveys. Data collected: focus group transcripts; survey responses. Data analysis: CCM; Wilcoxon signed-rank tests/Sign tests & RMT.

RQ4.2 Do either of the e-communication media influence the judges’ decision-making processes? If so, to what extent?
Instruments: focus group interviews. Data collected: focus group transcripts. Data analysis: CCM.

RQ4.3 What do judges claim are the advantages and disadvantages of each e-communication medium?
Instruments: focus group interviews. Data collected: focus group transcripts. Data analysis: CCM.

RQ4.4 How do judges compare their virtual standard setting experience with a similar face-to-face experience?
Instruments: focus group interviews. Data collected: focus group transcripts. Data analysis: CCM.

RQ5. Which e-communication medium (audio or video) is more appropriate for setting a cut score on a receptive language test in a synchronous virtual environment?
Instruments: all above-mentioned sources. Data collected: all above-mentioned data. Data analysis: all above-mentioned analyses.

Chapter 4:  Cut score data analysis

The aim of this chapter is (1) to evaluate the virtual cut scores in terms of reliability and validity; (2) to present and compare the judge panels’ cut score recommendations; and (3) to investigate whether judge panel ratings are influenced by either e-communication medium. The chapter is divided into four sections: sections 4.1 and 4.2 evaluate the reliability of each virtual panel’s cut scores and the validity of the judgements from an MFRM and a CTT perspective, respectively; section 4.3 compares virtual panel cut score measures between media (audio and video), test forms (Test Form A and Test Form B), and environments (virtual and F2F); and section 4.4 investigates whether any panel exhibited differential severity towards either e-communication medium.

4.1 Cut score internal validation: MFRM analysis

To investigate research question 1 [Can reliable and valid cut scores be set in different synchronous e-communication media (audio and video)? If so, to what extent?], the cut scores were first analysed through MFRM. The data used in the cut score analysis comprised 8190 responses [(45 × 45 × 2 × 2) + (45 × 1 × 2) = 8190]: 45 panellists rendering judgements on 45 items for two equated tests (Test Form A and Test Form B) over two rounds (Round 1 and Round 2), and 45 panellists rendering one cut score (Round 3) for each of the two tests. The data were analysed through the FACETS computer programme (Linacre, 2022, version 3.84.0). A seven-facet MFRM model was selected to analyse the data, in which the facets were: (1) judges; (2) group; (3) medium; (4) order; (5) test form; (6) round; and (7) items. However, out of the seven facets, only two facets were active (judges and items). In other words, the estimates produced by the FACETS programme were based on an interaction between judges and items only. By making the other five facets (group, medium, order, test form, and round) inactive, the estimates used for evaluating judge panels’ cut score measures would not be influenced by any interactions between the active facets (judges and items) and the inactive facets, but at the same time pairwise interactions (bias) could be explored. For the cut score measures to be comparable between test forms and across rounds, and for judgements to be comparable across groups and judges, both judges and items needed to be placed on the same measurement scale. To accomplish this, items were anchored to their empirical difficulty derived from the concurrent equating analysis (see section 3.2.3.2 for a description of the


concurrent equating analysis) to analyse Round 1 and Round 2 judgements. In Round 3, judges were asked to make one judgement per test form, recommending an overall cut score out of 45. However, to analyse the Round 3 data, each test form’s Round 3 judgement was treated as two separate items in the analysis, one being assigned a raw score below the original judgement and the other being assigned a raw score above the judgement. Each of those two items was then weighted by 0.5. For example, a Round 3 raw score judgement of 26 was treated as 25 and 27, and when both items were weighted by 0.5, the overall raw score judgement for Round 3 was once again 26. This allowed the MFRM rating scale model to be used for the Round 3 analysis, as the two test form Round 3 items were also anchored to the overall difficulty of each test form as well as to the thresholds of each test form. As a result, Round 3 judgements were also placed on the same latent scale as Round 1 and Round 2 judgements. As the standard setting task entailed judgements over three rounds, two different measurement scales were used for judges to evaluate the items. For Rounds 1 and 2, judges assigned a “Yes” (1) or a “No” (0) to each item, indicating whether the Just Qualified B1 Candidate would be able to answer the item correctly. However, as judges assigned an overall score for Round 3, ranging from 0 to 45, indicating how many items a Just Qualified B1 Candidate would be able to answer, another measurement model was required for that analysis. The two measurement models are illustrated in equations 4.1 and 4.2 (see Appendix G for an edited version of the FACETS specification file used for the analyses). The MFRM model used for analysing Round 1 and Round 2 data can be written as follows:

\log\left(\frac{P_{nijk1}}{P_{nijk0}}\right) \equiv B_n - D_i - G_m - M_i - O_i - F_t - R_j - D_y \qquad (4.1)

where P_nijk1 is the probability of a "Yes" being awarded on item i by judge n, P_nijk0 is the probability of a "No" being awarded on item i by judge n, B_n is the leniency of judge n, D_i is the difficulty of item i, G_m is the severity of group m, M_i is the difficulty of medium i, O_i is the difficulty of order i, F_t is the difficulty of test form t, R_j is the judgement of the performance standard for round j, and D_y is the difficulty of rating a "Yes" relative to a "No". Thus, from equation (4.1), the log odds of judge n assigning a "Yes" rather than a "No" on item i in group m, medium i, order i, test form t, and round j equal the leniency of judge n, minus the difficulty of item i, minus the severity of group m, minus the difficulty of medium i, minus the difficulty of order i, minus the difficulty of test form t, minus the judgement of the performance standard for round j, minus the difficulty of assigning a "Yes" relative to a "No" (D_y). The MFRM model used for analysing Round 3 data can be written as follows:

log(P_nijk / P_nijk−1) = B_n − D_i − G_m − M_i − O_i − F_t − R_j − T_ik    (4.2)

where P_nijk is the probability of category k being awarded on item i by judge n, P_nijk−1 is the probability of category k−1 being awarded on item i by judge n, B_n is the leniency of judge n, D_i is the difficulty of item i, G_m is the severity of group m, M_i is the difficulty of medium i, O_i is the difficulty of order i, F_t is the difficulty of test form t, R_j is the judgement of the performance standard for round j, and T_ik is the difficulty of assigning category k relative to category k−1. Thus, from equation (4.2), the log ratio of the probability of judge n assigning k on item i to the probability of judge n assigning k−1 on item i equals the leniency of judge n, minus the difficulty of item i, minus the severity of group m, minus the difficulty of medium i, minus the difficulty of order i, minus the difficulty of test form t, minus the judgement of the performance standard for round j, minus the difficulty of assigning k relative to k−1 (T_ik).

4.1.1 Rasch group level indices

To investigate research question 1.1 [How reliable and valid are the recommended virtual cut score measures?], five group level Rasch separation indices were analysed: (1) the separation ratio (G); (2) the separation (strata) index (H); (3) the separation reliability index (R); (4) the fixed (all same) chi-square statistic; and (5) exact and expected agreements. In contrast to inter-rater reliability indices, which measure the degree to which judges are similar in their ratings, the separation ratio (G) reports the spread of the judge severity measures relative to the precision of those measures (Fisher, 1992). The separation (strata) index (H) reports how many statistically distinct levels of severity exist among the judges. For example, a separation index of 4 indicates that four statistically distinct levels of judge severity can be distinguished (Myford & Wolfe, 2004). A separation (strata) index (H) near 1.0 implies that the judges can be viewed as forming a homogeneous group. The separation reliability (R) index indicates "how different the severity measures" (Eckes, 2015, p. 66) are. Low judge separation reliability is desired, as it indicates that judges are interchangeable; values less than 0.50 denote that differences between judge measures are mainly attributable to measurement error (Fisher, 1992). The fixed (all same) chi-square tests the null hypothesis that the judges exhibited the same severity/leniency level when evaluating the difficulty of the items, after accounting for measurement error (Eckes, 2015). When the differences between the observed results and those expected by the Rasch model are statistically significant, the null hypothesis is rejected (Andrich & Marais, 2019). The final indices, exact (observed) and expected agreements, report the agreement achieved between all the judges on their ratings and the amount of agreement expected by the model. Linacre (2020a) suggests using a summary statistic, the Rasch-Kappa index, which is similar to Cohen's kappa, to investigate whether judges are rating as independent experts. The Rasch-Kappa index is calculated by subtracting the group's expected agreement based on the Rasch model from the group's observed agreement and dividing that by 100 minus the expected agreement: Rasch-Kappa = (Obs% − Exp%) / (100 − Exp%). Values close to 0 are desirable; values much greater than 0 suggest that judges are not rating independently, a violation of the Rasch model.
Tables 4.1 to 4.4 report the group indices for each group in both media (test forms nested) across all three rounds. Row one presents the medium the judges convened in and the test form they recommended cut scores for, while row two shows the round. Rows three to six display the cut score measure, its corresponding raw score, its standard deviation (S.D.), and the standard error of the cut score measure (SEc), respectively, while rows seven to nine report the three separation indices [the separation ratio (G), the separation (strata) index (H), and the separation reliability (R)]. Row ten reports the fixed (all same) chi-square significance probability (prob), while rows 11 to 13 report the observed agreement, the expected agreement, and the Rasch-Kappa values. A brief illustrative sketch of these indices is given below, after which Table 4.1 presents the group indices for Group 1.
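The sketch below is only a minimal Python illustration of how the group-level indices relate to one another, not the FACETS computation itself: the judge severity measures and standard errors are hypothetical, the separation formulas follow the standard Rasch definitions (G as the ratio of the "true" spread to the average measurement error, H = (4G + 1)/3, R = G²/(1 + G²)), and the Rasch-Kappa function implements the formula given above.

```python
"""Minimal sketch of the group-level Rasch indices discussed above.
The judge severity measures and standard errors are hypothetical;
FACETS reports these indices directly in its group-level output."""
import numpy as np

def separation_indices(measures, standard_errors):
    """Separation ratio (G), strata index (H), and separation reliability (R)."""
    measures = np.asarray(measures, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    observed_var = measures.var()              # population variance of judge measures
    error_var = float(np.mean(se ** 2))        # mean-square measurement error
    true_var = max(observed_var - error_var, 0.0)
    g = np.sqrt(true_var / error_var)          # "true" spread relative to error (G)
    h = (4 * g + 1) / 3                        # statistically distinct severity strata (H)
    r = true_var / observed_var if observed_var > 0 else 0.0  # separation reliability (R)
    return g, h, r

def rasch_kappa(observed_pct, expected_pct):
    """Rasch-Kappa = (Obs% - Exp%) / (100 - Exp%); values near 0 are desirable."""
    return (observed_pct - expected_pct) / (100 - expected_pct)

# Hypothetical panel of nine judges (severity measures in logits, with standard errors)
severities = [0.21, 0.35, 0.48, 0.52, 0.55, 0.60, 0.66, 0.72, 0.95]
errors = [0.15] * 9

g, h, r = separation_indices(severities, errors)
print(f"G = {g:.2f}, H = {h:.2f}, R = {r:.2f}")
print(f"Rasch-Kappa = {rasch_kappa(56.7, 55.2):.2f}")  # observed/expected agreement from Table 4.1
```

With these hypothetical inputs the indices fall in the same general range as those reported for the panels in Tables 4.1 to 4.4.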

Table 4.1  Group 1 group level Rasch indices

                                            Audio (Test Form A)              Video (Test Form B)
                                            Round 1   Round 2   Round 3      Round 1   Round 2   Round 3
Cut score measure                           0.36      0.57      0.57         0.27      0.45      0.49
Mean raw score                              24.56     26.56     26.56        25.44     27.11     27.56
S.D. (population) mean measure              0.43      0.42      0.39         0.55      0.51      0.43
Standard error of cut score measure (SEc)   0.15      0.15      0.14         0.19      0.18      0.15
Separation ratio (G)                        0.87      0.80      0.67         1.34      1.16      0.83
Separation (strata) index (H)               1.49      1.40      1.22         2.12      1.88      1.44
Separation reliability (R)                  .43       .39       .31          .64       .57       .41
Fixed (all) chi-square (prob)               .07       .12       .22          .00       .06       .21
Observed agreement (%)                      56.7      70.6      16.7         59.0      63.3      13.9
Expected agreement (%)                      55.2      56.3      5.9          55.0      56.3      6.2
Rasch-Kappa                                 0.03      0.33      0.11         0.09      0.16      0.08

For Group 1, the separation ratio (G) illustrated that the spread of judge severity measures was not very large as the G values ranged from 0.67 to 1.34. The separation (strata) index (H) verified that the judges were mainly homogenous as H values ranged from 1.22 to 2.12. The H value of 2.12 is not surprising considering that it is a Round 1 estimate (video medium –​Test Form B). The separation reliability (R) ranged from 0.31 to 0.64 and the fixed (all same) chi-​ square probability statistic ranged from 0.00 to 0.22. The Rasch-​Kappa index ranged from 0.03 (Round 1 –​audio medium) to 0.33 (Round 2 –​audio medium), indicating that judges were acting as independent experts. When examining Group 1 measures in both media, all measures were acceptable. Table 4.2 illustrates the group indices for Group 2.

Table 4.2  Group 2 group level Rasch indices

                                            Video (Test Form A)              Audio (Test Form B)
                                            Round 1   Round 2   Round 3      Round 1   Round 2   Round 3
Cut score measure                           0.33      0.67      0.55         0.35      0.50      0.50
Mean raw score                              24.31     27.54     26.46        26.31     27.85     27.85
S.D. (population) mean measure              0.52      0.49      0.21         0.42      0.34      0.29
Standard error of cut score measure (SEc)   0.15      0.14      0.06         0.12      0.10      0.08
Separation ratio (G)                        1.26      1.09      0.00         0.84      0.27      0.00
Separation (strata) index (H)               2.01      1.79      0.33         1.46      0.69      0.33
Separation reliability (R)                  .61       .54       .00          .42       .07       .00
Fixed (all) chi-square (prob)               .00       .00       .94          .05       .39       .61
Observed agreement (%)                      59.5      66.5      12.8         58.9      67.7      20.5
Expected agreement (%)                      54.9      56.9      7.5          55.8      57.2      6.7
Rasch-Kappa                                 0.10      0.22      0.06         0.07      0.25      0.15

For Group 2, the separation ratio (G) ranged from 0.00 to 1.26, the separation (strata) index (H) ranged from 0.33 to 2.01, and the separation reliability (R) ranged from .00 in Round 3 (both media) to .61 (Round 1, video medium). These indices provided evidence that the judges were expressing similar levels of severity in their ratings and, as such, could be considered interchangeable. Group homogeneity was also supported by the non-significant Round 3 fixed (all same) chi-square tests; across rounds, the chi-square probabilities ranged from .00 to .94. What was surprising was that in Round 2 (video medium) the chi-square probability did not increase from Round 1 (p = .00), implying that there were still statistically significant differences between the most lenient judge and the most severe judge. Nonetheless, by Round 3 the chi-square probability was very high (p = .94), indicating that the differences between the most lenient and most severe judge were not statistically significant. The Rasch-Kappa index ranged from 0.06 (Round 3 – video medium) to 0.25 (Round 2 – audio medium), adding evidence that judges were behaving independently. Overall, the group's indices were acceptable and in accordance with the expectations of the Rasch model. Table 4.3 illustrates the group indices for Group 3.

Table 4.3  Group 3 group level Rasch indices

                                            Video (Test Form A)              Audio (Test Form B)
                                            Round 1   Round 2   Round 3      Round 1   Round 2   Round 3
Cut score measure                           0.55      0.75      0.86         0.43      0.61      0.59
Mean raw score                              26.42     28.17     29.33        27.00     28.83     28.67
S.D. (population) mean measure              0.39      0.52      0.30         0.52      0.37      0.32
Standard error of cut score measure (SEc)   0.12      0.16      0.09         0.16      0.11      0.10
Separation ratio (G)                        0.68      1.18      0.00         1.20      0.47      0.00
Separation (strata) index (H)               1.24      1.90      .33          1.94      .96       0.33
Separation reliability (R)                  .32       .58       .00          .59       .18       .00
Fixed (all) chi-square (prob)               .11       .01       .55          .01       .19       .41
Observed agreement (%)                      59.0      66.9      7.6          60.4      63.4      31.8
Expected agreement (%)                      56.3      57.5      6.7          56.1      58.0      6.6
Rasch-Kappa                                 0.06      0.22      0.01         0.10      0.13      0.27

For Group 3, the separation ratio (G) ranged from 0.00 (Round 3 – video medium) to 1.20 (Round 1 – audio medium), the separation (strata) index (H) ranged from 0.33 to 1.94, and the separation reliability (R) ranged from .00 (Round 3 – audio and video media) to .59 (Round 1 – audio medium), adding evidence that the judges' ratings were reliable and that the judges could be considered interchangeable. The fact that most of the group's R values were less than .50 implied that any differences in judge severity could be attributed to measurement error. By Round 3, the fixed (all same) chi-square probabilities confirmed that there were no statistically significant differences between the judges (video medium, p = .55; audio medium, p = .41). The Rasch-Kappa index ranged from 0.01 (Round 3 – video medium) to 0.27 (Round 3 – audio medium), suggesting that, overall, judges were acting as independent experts. Overall, Group 3's indices revealed that the measures were reliable and fitted the expectations of the Rasch model. Table 4.4 illustrates the group indices for Group 4.

Table 4.4  Group 4 group level Rasch indices

                                            Audio (Test Form A)              Video (Test Form B)
                                            Round 1   Round 2   Round 3      Round 1   Round 2   Round 3
Cut score measure                           0.73      0.50      0.49         0.28      0.56      0.43
Mean raw score                              28.09     26.00     25.91        25.64     28.09     27.18
S.D. (population) mean measure              0.43      0.42      0.36         0.45      0.58      0.35
Standard error of cut score measure (SEc)   0.14      0.13      0.11         0.14      0.18      0.11
Separation ratio (G)                        0.84      0.81      0.48         0.95      1.40      0.38
Separation (strata) index (H)               1.46      1.41      .97          1.60      2.20      .85
Separation reliability (R)                  .42       .39       .18          .48       .66       .13
Fixed (all) chi-square (prob)               .04       .06       .20          .03       .00       .23
Observed agreement (%)                      61.9      68.8      12.7         57.7      67.0      18.2
Expected agreement (%)                      57.5      55.9      5.9          55.4      56.9      5.9
Rasch-Kappa                                 0.10      0.29      0.07         0.05      0.23      0.13

For Group 4, the separation ratio (G) ranged from .48 to 1.40, the separation (strata) index (H) ranged from .85 to 2.20, and the separation reliability (R) ranged from .13 to .66. Such low separation indices added evidence that the judge ratings were acceptable and reliable. The fixed (all same) chi-square probability ranged from .00 (Round 2 – video medium) to .23 (Round 3 – video medium); nevertheless, the fact that the fixed (all same) chi-square was non-significant by Round 3 indicated that all judges were rating in a similar manner. The Rasch-Kappa index ranged from 0.05 (Round 1 – video medium) to 0.29 (Round 2 – audio medium), implying that, overall, judges were acting as independent experts. Overall, Group 4's indices were within acceptable ranges. On the whole, at the group level, all group indices were within acceptable ranges, especially the Round 3 measures, adding evidence that group estimates in both e-communication media conformed to the expectations of the Rasch model.

4.1.2 Judge level indices

To investigate research question 1.2 [How reliable and valid are the judgements made in each e-communication medium?], three sets of indices were analysed: (1) fit statistics; (2) the point-biserial correlation (Corr. PtBis); and (3) the Rasch-Kappa index. FACETS (Linacre, 2022, version 3.84.0) reports four types of fit statistics: two mean-square indices (the infit mean-square index and the outfit mean-square index) along with their corresponding z-score (ZStd) statistics. Mean-square indices close to one indicate that ratings match the predictions of the MFRM model. When a judge has a mean-square index much greater than 1, the judge is considered to be "misfitting", whereas when a judge has a mean-square value much less than 1, the judge is considered to be "overfitting". Several rules of thumb have been proposed for acceptable judge mean-square index (MnSq) ranges, such as 0.40 to 1.20 (Bond & Fox, 2015), 0.80 to 1.30 (Knoch & McNamara, 2015), 0.50 to 1.50 (Linacre, 2020a), and the mean measure ± 2 × standard deviation (Pollitt & Hutchinson, 1987). Applying Linacre's proposed range for acceptable judge infit mean-square indices means that MnSq measures below 0.50 would be deemed "overfitting", while measures above 1.50 would be deemed "misfitting". The infit mean-square index (Infit MnSq) is sensitive to unexpected ratings, while the outfit mean-square index (Outfit MnSq) is sensitive to extreme scores, even single outliers. Consequently, Infit MnSq is regarded as more informative than Outfit MnSq (McNamara, 1996; Weigle, 1998), as the former has "higher estimation precision" (Eckes, 2011, p. 77). Because misfitting judges are a source of concern, since they may invalidate the inferences made from the estimates in the analysis, the examination of fit statistics in this section is confined to the infit mean-square index (Infit MnSq), using Linacre's recommended MnSq range (Linacre, 2020a). To evaluate the significance of the Infit MnSq statistics, the z-scores (ZStd) also need to be examined: ZStd statistics greater than ± 2 indicate that the fit statistics are statistically significant (Linacre, 2020a).
The second set of indices, the point-biserial correlation (Corr. PtBis), reports the extent to which a particular judge's ratings are consistent with the ratings provided by the other judges. While correlations greater than .30 are desired, Myford and Wolfe (2004) noted that correlations as low as .20 may be observed when judges are making dichotomous ratings. Near-zero judge correlations indicate that a particular judge rank orders items differently from the other judges. Because correlations are not on a linear interval scale, Fisher's Z transformation was used to normalise their sampling distribution before averaging the judges' correlations.
The final set of individual indices analysed was the Rasch-Kappa index (see section 4.1.1 for a full description), which compares the percentage of agreement between a judge's ratings and the other judges' ratings to the agreement percentage the MFRM model would expect, taking the judge's rating behaviour into consideration. When a judge's observed agreement is much higher than the expected agreement estimated by the model, the judge is said to be acting like a "scoring machine" (Linacre, 2020a).
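The averaging marked with an asterisk in Tables 4.5 to 4.8 ("Average using Fisher's Z-transformations") can be sketched as follows; this is only an illustration, and the point-biserial values used here are hypothetical.

```python
"""Minimal sketch of averaging correlations on the Fisher-Z scale, as noted
in the footnotes to Tables 4.5-4.8. The correlations below are hypothetical."""
import numpy as np

def mean_correlation_fisher_z(correlations):
    """Transform to Fisher's Z (arctanh), average, and back-transform (tanh)."""
    z = np.arctanh(np.asarray(correlations, dtype=float))
    return float(np.tanh(z.mean()))

ptbis = [0.11, 0.25, 0.29, 0.33, 0.41]   # hypothetical judge point-biserial correlations
print(f"Mean Corr. PtBis (Fisher-Z averaged) = {mean_correlation_fisher_z(ptbis):.2f}")
```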

Table 4.5 to Table 4.8 present summary statistics at the judge level for each group. No Round 3 fit indices or point-biserial correlations were reported in the FACETS output, as Round 3 entailed a single overall observation per judge; consequently, only Round 1 and Round 2 estimates are presented.

Table 4.5  Group 1 individual level Rasch indices

                               Test Form A (Audio)          Test Form B (Video)
                               Round 1       Round 2        Round 1       Round 2
Min. Infit MnSq (ZStd)         0.89 (-1.1)   0.70 (-1.5)    0.89 (-1.1)   0.79 (-2.0)
Max. Infit MnSq (ZStd)         1.26 (2.4)    .96 (-0.3)     1.19 (1.8)    1.45 (4.0)
Mean Infit MnSq (ZStd)         1.09 (0.8)    .85 (-1.3)     1.01 (0.2)    .98 (-0.2)
S.D. (population) Infit MnSq   0.11          0.08           0.10          0.21
Min. Corr. PtBis               .11           .46            .20           -.01
Max. Corr. PtBis               .41           .76            .49           .73
Mean Corr. PtBis*              .29           .69            .36           .50
Min. Rasch-Kappa               -0.05         0.24           0.02          -0.11
Max. Rasch-Kappa               0.10          0.45           0.16          0.33

* Average using Fisher's Z-transformations

For Group 1, no judge exhibited misfit (Misfit: Infit MnSq ≥ 1.5, ZStd ≥ ± 2) as Infit MnSq ranged from 0.70 to 1.45. The point-​biserial correlation (Corr. PtBis) ranged from -​.01 (Round 2 –​video medium) to .76 (Round 2 –​audio medium). Judge J03 had a negative point-​biserial (-​.01) in Test Form B Round 2, which suggests that the judge was rating items in a different way to the rest of the judges. Such behaviour would have warranted further investigation had the judge been misfitting. The judges’ Rasch-​Kappa ranged from -​0.11 to 0.45, implying that the judges were acting as independent experts. Table 4.6 reports the individual level indices for Group 2.

Table 4.6  Group 2 individual level Rasch indices

                               Test Form A (Video)          Test Form B (Audio)
                               Round 1       Round 2        Round 1       Round 2
Min. Infit MnSq (ZStd)         0.86 (-1.4)   0.73 (-2.4)    0.77 (-2.4)   0.69 (-3.1)
Max. Infit MnSq (ZStd)         1.28 (2.5)    1.30 (2.8)     1.20 (1.1)    1.24 (2.1)
Mean Infit MnSq (ZStd)         1.04 (0.3)    .87 (-1.0)     1.03 (0.1)    0.87 (-1.3)
S.D. (population) Infit MnSq   0.13          0.16           0.13          0.17
Min. Corr. PtBis               .24           .02            .08           .04
Max. Corr. PtBis               .62           .73            .62           .85
Mean Corr. PtBis*              .42           .60            .36           .53
Min. Rasch-Kappa               0.03          -0.09          -0.05         -0.08
Max. Rasch-Kappa               0.22          0.35           0.20          0.44

* Average using Fisher's Z-transformations

For Group 2, no judge exhibited misfit (Infit MnSq ≥ 1.5, ZStd ≥ ± 2), as Infit MnSq ranged from 0.69 to 1.30. The point-biserial correlation (Corr. PtBis) ranged from .02 (Round 2 – video medium) to .93 (Round 2 – video medium). Judge J18 had a low point-biserial (.02) in Test Form A Round 2, and Judge J22 had a low point-biserial (.04) in Test Form B Round 2. However, as these judges were not misfitting, no further analysis was deemed necessary at the time. The judges' Rasch-Kappa indices ranged from -0.09 to 0.44, implying that Group 2 judges were rating as independent experts. Table 4.7 reports the individual level indices for Group 3.

Table 4.7  Group 3 individual level Rasch indices

                               Test Form A (Video)          Test Form B (Audio)
                               Round 1       Round 2        Round 1       Round 2
Min. Infit MnSq (ZStd)         0.91 (-.8)    0.71 (-2.5)    0.90 (-.5)    0.76 (-2.5)
Max. Infit MnSq (ZStd)         1.22 (1.3)    1.14 (1.4)     1.15 (1.4)    1.15 (1.4)
Mean Infit MnSq (ZStd)         1.04 (0.4)    .89 (-0.8)     1.02 (0.3)    .94 (-0.4)
S.D. (population) Infit MnSq   0.08          0.12           0.07          0.12
Min. Corr. PtBis               .13           .26            .22           .21
Max. Corr. PtBis               .58           .77            .52           .69
Mean Corr. PtBis*              .36           .58            .39           .40
Min. Rasch-Kappa               -0.04         0.07           0.02          0.02
Max. Rasch-Kappa               0.18          0.38           0.17          0.28

* Average using Fisher's Z-transformations

For Group 3, no judge exhibited misfit (Misfit: Infit MnSq ≥ 1.5, ZStd ≥ ± 2), as Infit MnSq ranged from 0.71 to 1.22. The point-biserial correlation (Corr. PtBis) ranged from .13 to .77, implying that the judges were judging the items in a similar fashion, albeit as independent experts, since the Rasch-Kappa index ranged from -0.04 to 0.38. Table 4.8 reports the individual level indices for Group 4.

Table 4.8  Group 4 individual level Rasch indices

                               Test Form A (Audio)          Test Form B (Video)
                               Round 1       Round 2        Round 1       Round 2
Min. Infit MnSq (ZStd)         0.96 (-0.2)   0.73 (-2.6)    0.86 (-1.4)   0.72 (-3.2)
Max. Infit MnSq (ZStd)         1.24 (1.4)    1.04 (0.3)     1.20 (1.6)    1.18 (1.7)
Mean Infit MnSq (ZStd)         1.12 (1.0)    .88 (-1.1)     1.05 (0.4)    .90 (-0.8)
S.D. (population) Infit MnSq   .10           .09            .09           .13
Min. Corr. PtBis               .18           .46            .13           -.01
Max. Corr. PtBis               .63           .67            .58           .79
Mean Corr. PtBis*              .40           .65            .32           .63
Min. Rasch-Kappa               0.00          0.22           -0.04         -0.10
Max. Rasch-Kappa               0.23          0.36           0.18          0.40

* Average using Fisher's Z-transformations

For Group 4, no judge exhibited misfit (Misfit: Infit MnSq ≥ 1.5, ZStd ≥ ± 2), as Infit MnSq ranged from 0.72 to 1.24. The point-biserial correlation (Corr. PtBis) ranged from -.01 to .79. Judge J37 had a negative point-biserial (-.01) and had assigned the lowest raw score (21). However, as with the judges in the other groups who had low point-biserial correlations but were not misfitting, no further analysis was deemed necessary at the time. The Rasch-Kappa index ranged from -0.10 to 0.40, providing evidence that this group was also providing independent judgements. From a Rasch perspective, none of the cut scores were unreliable, as all judge measurements were productive for measurement purposes. The cut scores were further analysed from a CTT perspective to evaluate whether the same findings would be observed.


4.2 Cut score internal validation: CTT analysis

This subsection re-evaluates the cut scores using criteria from a CTT perspective. Hambleton and Pitoniak (2006) provide a framework for evaluating the internal validity of standard setting judgements using CTT indices. Within this framework, cut scores are also examined in relation to the test instrument. The psychometric properties of both virtual test instruments are reported in Table 4.9. Rasch measures are expressed in logits, while raw scores and CTT-equivalent measures are provided in parentheses where applicable. The Rasch person separation index illustrates how many distinct levels of test takers a measurement instrument can distinguish, while the item separation index reports how well the items are separated in terms of difficulty. Both test instruments had a person separation of at least 2.0 and an item separation of at least 4.0 (Boone & Staver, 2020), implying that the instruments were sensitive enough to separate test takers into at least two distinct levels. The Rasch person reliability, which is analogous to test reliability in CTT, was at least .80 for both instruments, ranging from .85 to .88 (Cronbach's alpha ranging from .87 to .90). The Rasch item reliability was at least .90 for both instruments, implying that the sample size for each instrument was large enough to confirm the construct validity of the instrument. Consequently, both instruments exhibited acceptable separation and reliability indices, so cut scores could be set on them and valid inferences could be drawn from the resulting cut score measures.

Table 4.9  Psychometric characteristics of Test Form A and Test Form B

                                     Form A           Form B
Number of test takers                2562             649
Number of items                      45               45
Maximum number of points             45               45
Mean measure (Mean score)            0.78 (27.83)     0.75 (29.0)
S.D. population (S.D. score)         1.10 (8.9)       1.04 (8.1)
Min. measure (Min. score)            -2.11 (5)        -1.89 (7)
Max. measure (Max. score)            5.42 (45)        5.19 (45)
Person separation (real)             2.66             2.41
Person reliability [alpha (rxx)]     .88 (.90)        .85 (.87)
Model RMSE (SEM)                     0.39 (2.88)      0.40 (2.88)
S.E. of person mean                  0.02             0.04
Item mean measure                    0.15             -0.05
Item min. measure                    -1.17            -1.19
Item max. measure                    1.68             1.44
Item separation (real)               14.93            7.13
Item reliability (real)              1.00             0.98


4.2.1 Consistency within the method

Consistency within the method, also referred to as across-panel consistency, was evaluated by examining the standard error of the cut score measure (SEc), one of the classical indices used to determine whether the cut score obtained is replicable across different groups of judges (Hambleton & Pitoniak, 2006; Kaftandjieva, 2010). The equation used to calculate the standard error of the cut score measure (SEc) was the following:

SEc = S.D. / √(n − 1)    (4.1)

where S.D. is the population standard deviation of the group's cut score and n is the number of judges in the group (Linacre, 2020a). The SEc is then evaluated in terms of the standard error of measurement (SEM) of the instrument. The equation used to calculate the SEM is the following:

SEM = S.D. × √(1 − rxx)    (4.2)

where S.D. is the standard deviation of the test instrument and rxx is its reliability. Several criteria have been recommended by standard setting researchers when comparing the SEc to the SEM. The most stringent criterion was proposed by Jaeger (1991), who suggested that the SEc should not be greater than one quarter of the SEM. On the other hand, Cohen, Kane, and Crooks (1999) recommended a more lenient criterion, stating that the SEc should be less than or equal to half the SEM so that the impact on test-taker misclassification rates is minimal, while Kaftandjieva (2010) suggested a compromise between the previous two criteria and recommended that the SEc should be less than or equal to a third of the SEM. For this study, Cohen, Kane, and Crooks's criterion was chosen as the internal check of the recommended cut scores, implying that the SEc divided by the SEM should be equal to or less than 0.50 (SEc/SEM ≤ .50). This criterion is generally used in the standard setting literature, as there is minimal impact on test-taker misclassification rates when the SEc is less than half of the SEM (Cohen, Kane, & Crooks, 1999). The consistency of the method was calculated using Rasch measures, which required that the model root-mean-square error (RMSE) be used in place of the SEM (T. Eckes, personal communication, February 11, 2016). A brief sketch of this check is given below; Table 4.10 then displays the internal consistency within method check for each group across rounds and test forms.
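The sketch below is only an illustration of the internal check, not the study's computation: the judge cut score measures are hypothetical, SEc follows equation (4.1), the RMSE is the Test Form A value from Table 4.9, and the SEM helper implements equation (4.2) with hypothetical raw-score inputs.

```python
"""Minimal sketch of the consistency-within-method check (SEc / RMSE <= 0.50).
The judge cut score measures are hypothetical; the RMSE is the Test Form A
value reported in Table 4.9."""
import math
import statistics

def sec(cut_scores):
    """Standard error of the group cut score: population S.D. / sqrt(n - 1)."""
    return statistics.pstdev(cut_scores) / math.sqrt(len(cut_scores) - 1)

def sem(sd, reliability):
    """Classical standard error of measurement: S.D. * sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - reliability)

judge_cuts = [0.21, 0.35, 0.48, 0.52, 0.55, 0.60, 0.66, 0.72, 0.95]  # logits, hypothetical
rmse = 0.39                                   # model RMSE of Test Form A (Table 4.9)

check = sec(judge_cuts) / rmse
print(f"SEc = {sec(judge_cuts):.2f}, SEc/RMSE = {check:.2f}, criterion met: {check <= 0.50}")
print(f"SEM for a hypothetical raw-score S.D. of 8.0 and rxx of .85: {sem(8.0, 0.85):.2f}")
```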


Table 4.10  All groups internal consistency within method check

             Test Form A                      Test Form B
             Round 1   Round 2   Round 3      Round 1   Round 2   Round 3
Group 1      0.39      0.38      0.36         0.48      0.45      0.38
Group 2      0.39      0.36      0.16         0.30      0.24      0.21
Group 3      0.31      0.40      0.24         0.39      0.27      0.24
Group 4      0.35      0.34      0.29         0.35      0.46      0.27

An examination of the internal check estimate (SEc/​RMSE) revealed that the criterion set (SEc/​RMSE ≤ 0.50) was met by all groups across all rounds. The internal check index (SEc/​RMSE) ranged from 0.16 (Test A, Group 2 –​ Round 3) to 0.48 (Test B, Group 1 –​Round 1) and generally decreased across rounds within groups, providing further support for across-​panel consistency (see Table 4.1 to Table 4.4 for SEc measures and Table 4.9 for RMSE measures).

4.2.2 Intraparticipant consistency

Intraparticipant consistency refers to (1) the extent to which judge ratings are consistent with the empirical difficulties of the items and (2) the extent to which such ratings change across rounds (Hambleton & Pitoniak, 2006); the former provides evidence of the judges' content knowledge (Hambleton, Pitoniak, & Copella, 2012), while the latter provides evidence that the judges considered the feedback provided between rounds. To evaluate the consistency between judgements and empirical values, both misplacement index (MPI) analyses and correlational analyses were conducted. For the correlational analysis, the item measures (logits) of each group were correlated with their corresponding empirical item difficulties (logits). The item measures used in the correlational analysis were obtained by anchoring the judges at their severity measures and unanchoring the items. A minimal sketch of the correlational check appears below, followed by Table 4.11, which illustrates the mean MPI indices and Spearman correlation indices per group and test form.
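The following sketch simply correlates a set of judged item measures with the empirical item difficulties; all logit values are hypothetical, and the study's correlations were computed per group as described above.

```python
"""Minimal sketch of the correlational intraparticipant consistency check:
item measures derived from a panel's judgements are correlated with the
empirical (anchored) item difficulties. All logit values are hypothetical."""
from scipy.stats import spearmanr

empirical_difficulty = [-1.1, -0.8, -0.4, -0.2, 0.0, 0.3, 0.5, 0.9, 1.2, 1.6]
judged_item_measures = [-0.9, -1.0, -0.3, 0.1, -0.1, 0.4, 0.2, 1.0, 0.8, 1.5]

rho, p_value = spearmanr(empirical_difficulty, judged_item_measures)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```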


Table 4.11  Intraparticipant consistency indices per round and test form

             Test Form A                                     Test Form B
             MPI                  Spearman correlation       MPI                  Spearman correlation
             Round 1   Round 2    Round 1   Round 2          Round 1   Round 2    Round 1   Round 2
Group 1      .62       .81        .44*      .72*             .67       .71        .34*      .58*
Group 2      .64       .79        .47*      .75*             .66       .79        .60*      .77*
Group 3      .65       .78        .54*      .72*             .66       .74        .54*      .73*
Group 4      .59       .78        .30*      .70*             .64       .77        .49*      .72*

* All correlations significant at the .05 level (2-tailed)

For both test forms, the average group MPI intraparticipant consistency indices for Round 1 were at least .50, the minimum acceptable criterion (Kaftandjieva, 2010), and increased to at least .70, the suggested minimum criterion, by Round 2. Similarly, for all groups, the average Spearman correlations between the judged item measures and the empirical item difficulty measures (logits) increased across rounds and were reasonably strong. Consequently, the judges' estimates can be considered valid (Brandon, 2004), adding further evidence of the adequacy of the cut scores (Kaftandjieva, 2010) and the standard setting method (Chang, 1999).
To investigate the degree to which judges changed their ratings across Round 1 and Round 2, each judge's Round 1 ratings were correlated with her/his Round 2 ratings. Such a correlation allows identifying which judges did not change their estimates between rounds and which judges made a considerable number of changes. A perfect positive correlation of 1 indicates that no change occurred between rounds on any item. Similarly, very high positive correlations (≥ .90) indicate that very few changes occurred, while very weak correlations (± .01 to .30) signal that a large number of changes occurred. A small sketch of this check is given below, followed by Table 4.12, which summarises the degree of change that each group made across Round 1 and Round 2 on both test forms. For each group, the minimum (Min.) and maximum (Max.) correlations are reported (see Appendix H, Table H3 for each judge's correlations).
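The round-to-round check reduces to correlating two rating vectors, as sketched below; the Yes/No vectors are hypothetical, and a correlation of 1.00 would indicate that a judge changed no ratings between rounds.

```python
"""Minimal sketch of the round-to-round change check: one judge's Round 1
Yes/No ratings correlated with her/his Round 2 ratings. The vectors are
hypothetical."""
import numpy as np

round_1 = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1])
round_2 = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1])  # two ratings changed

r = np.corrcoef(round_1, round_2)[0, 1]
print(f"Round 1 vs Round 2 correlation = {r:.2f}")
```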

Table 4.12  Changes in ratings across Round 1 and Round 2

             Form A               Form B
             Min.      Max.       Min.      Max.
Group 1      .06       .96        .06       .78
Group 2      -.05      .78        .10       .91
Group 3      .09       1.00       -.35      1.00
Group 4      .17       .64        .15       .81

Correlations ranged from -.35 to 1.00, with the majority falling between .30 and .80, indicating that changes between rounds occurred and that they were not random. Two judges, J30 and J34 (see Appendix H, Table H3), did not make any changes across Round 1 and Round 2 in Test Form A and Test Form B, respectively. This does not necessarily imply that these judges did not consider the feedback provided between rounds or what was discussed during the discussion stage of Round 1. To investigate the extent to which changes were made across Round 2 and Round 3, the differences between Round 2 and Round 3 measures were examined. Positive differences indicate that cut scores increased across rounds, while negative differences signal the opposite. Differences of approximately ± 0.05 logits indicate that no changes were made across rounds, while each difference of approximately ± 0.10 logits corresponds to a difference of about one raw score point. Table 4.13 reports the minimum (Min.) and maximum (Max.) logit differences observed across Round 2 and Round 3.

Table 4.13  Logit changes in ratings across Round 2 and Round 3

             Form A               Form B
             Min.      Max.       Min.      Max.
Group 1      -0.33     0.31       -0.17     0.20
Group 2      -0.76     0.93       -0.27     0.32
Group 3      -0.70     0.62       -0.47     0.42
Group 4      -0.23     0.20       -1.38     0.42

Logit differences across Round 2 and Round 3 ranged from -0.76 to 0.93 logits for Test Form A and from -1.38 to 0.42 logits for Test Form B, implying that judges were making changes across rounds. When examining changes across all three rounds (see Appendix H), only one judge, J34, made no changes across rounds in Test Form B. The judge may simply have been confident in her/his own ratings, which is why no changes occurred. Nonetheless, changes for judge J34 across rounds were observed in Test Form A, implying that the judge was considering what was discussed during Round 1 and was taking the feedback given between rounds into account. Changes across rounds further support the judges' intraparticipant consistency.

4.2.3 Interparticipant consistency

Interparticipant consistency refers to the degree to which item ratings and performance standards are consistent across panellists (Hambleton & Pitoniak, 2006). Tables 4.14 and 4.15 display the inter-rater agreement indices estimated on the judges' raw score ratings for both test forms, across rounds. The first column displays the group label, and the second column displays the group's Cronbach's alpha, an internal consistency estimate; a high alpha signals that the judges' ratings measure a common dimension. The third column shows the intraclass correlation coefficient (ICC); ICC values close to 1 imply excellent inter-rater reliability (Stemler & Tsai, 2008). The ICC was calculated with a two-way mixed model, average measures, absolute agreement; a 95 % confidence interval is reported in brackets. Both the ICC and Cronbach's alpha are frequently reported indicators of inter-rater consistency (Kaftandjieva, 2010). Hoyt (2010) noted that reliability indices of .80 are desirable, as such high indices "reflect good dependability of scores" (p. 152). Tables 4.14 and 4.15 display the interparticipant indices for Form A and Form B, respectively.

Table 4.14  Interparticipant indices: Form A

             Round 1                              Round 2
             Alpha   ICC                          Alpha   ICC
Group 1      .58     .58, 95 % CI [.37, .74]      .86     .86, 95 % CI [.79, .91]
Group 2      .76     .76, 95 % CI [.64, .85]      .86     .85, 95 % CI [.78, .91]
Group 3      .72     .70, 95 % CI [.55, .82]      .84     .84, 95 % CI [.76, .90]
Group 4      .73     .73, 95 % CI [.59, .83]      .87     .87, 95 % CI [.80, .92]


Table 4.15  Interparticipant indices: Form B

             Round 1                              Round 2
             Alpha   ICC                          Alpha   ICC
Group 1      .67     .66, 95 % CI [.48, .79]      .75     .74, 95 % CI [.61, .84]
Group 2      .72     .71, 95 % CI [.57, .82]      .86     .86, 95 % CI [.79, .91]
Group 3      .73     .73, 95 % CI [.59, .83]      .76     .76, 95 % CI [.64, .85]
Group 4      .65     .65, 95 % CI [.47, .78]      .84     .83, 95 % CI [.75, .90]

On both forms, there was an increase in reliability between Round 1 and Round 2. However, on Test Form B (see Table 4.15), two groups (Group 1 and Group 3) had reliability estimates lower than .80 in Round 2. Nonetheless, the relatively high inter-rater reliability observed on both forms indicates that the judges were homogeneous in their ratings (Kaftandjieva & Takala, 2002).
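For illustration, Cronbach's alpha for a panel can be computed from a judges-by-items matrix of dichotomous ratings as sketched below; the rating matrix is simulated, and the values in Tables 4.14 and 4.15 were not produced with this code.

```python
"""Minimal sketch of Cronbach's alpha for a standard setting panel, treating
judges as the 'items' of the scale and the 45 test items as cases. The rating
matrix is simulated."""
import numpy as np

def cronbach_alpha(ratings):
    """ratings: 2-D array with rows = test items and columns = judges."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                              # number of judges
    judge_var = ratings.var(axis=0, ddof=1).sum()     # sum of judge variances
    total_var = ratings.sum(axis=1).var(ddof=1)       # variance of item totals
    return (k / (k - 1)) * (1 - judge_var / total_var)

rng = np.random.default_rng(1)
latent = rng.random(45)                               # shared item 'easiness' signal
panel = np.column_stack([(latent + 0.25 * rng.random(45)) > 0.5 for _ in range(9)]).astype(int)
print(f"Cronbach's alpha = {cronbach_alpha(panel):.2f}")
```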

4.2.4 Decision consistency and accuracy

Decision consistency refers to the extent to which test takers would be classified in the same way on two different occasions when applying the recommended cut score (Cizek & Earnest, 2016; Kaftandjieva, 2004). Calculating such coefficients directly would require test takers to take the same examination twice. Fortunately, several researchers have proposed methods for estimating decision consistency and accuracy from a single administration (Brennan & Wan, 2004; Hanson & Brennan, 1990; Livingston & Lewis, 1995; Stearns & Smith, 2009; Subkoviak, 1988). Such methods provide the likelihood of test takers being classified similarly – as passing or failing – on another administration (Cizek & Bunch, 2007). To evaluate the decision consistency of each group's cut score, two methods were used: (1) the Livingston and Lewis method (Livingston & Lewis, 1995) and (2) the Standard Error method (Stearns & Smith, 2009).

The Livingston and Lewis method

The Livingston and Lewis decision consistency and accuracy estimates were obtained through the BB-Class software (Brennan, 2001, version 1.1). Each group's average raw cut score was rounded to the nearest integer (see Table 4.1 to Table 4.4 for the actual raw cut scores). Tables 4.16 and 4.17 report the accuracy and consistency of the recommended raw cut scores for each test form. The first row presents the recommended cut score, rounded to the nearest integer, and the second row presents the probability of correct classification, while rows three and four report the probabilities of false positive (Type I) and false negative (Type II) errors, respectively (Buckendahl & Davis-Becker, 2012). Type I (false positive) errors occur when test takers pass an examination when in fact they should fail, while Type II (false negative) errors occur when test takers fail an examination when in fact they should pass (Hambleton & Novick, 1973). The next three rows report decision consistency by comparing classification decisions based on expected and observed scores: row five indicates the percentage of consistent classifications (pc), row six reports the proportion of consistent classifications expected by chance (pchance), and row seven reports the kappa (k) coefficient, the consistency of classifications beyond what would be expected by chance. Kappa values should be within the approximate range of .60 to .70 or higher (Subkoviak, 1988). Row eight reports the probability of misclassification, while the last row indicates which group(s) recommended the cut score in the top row. Table 4.16 presents the accuracy and consistency estimates for Form A cut scores.

Table 4.16  Accuracy and consistency estimates for Form A raw cut scores

Raw cut score                                          26                 27         29
Probability of correct classification                  .91                .91        .91
False positive rate                                    .04                .05        .05
False negative rate                                    .05                .04        .04
Overall percentage of consistent classification (pc)   .88                .88        .87
pchance                                                .52                .51        .50
Kappa (k)                                              .74                .74        .73
Probability of misclassification                       .13                .13        .13
Group                                                  Group 2, Group 4   Group 1    Group 3

The greatest difference between the recommended raw cut scores on Test Form A was three raw points while the smallest was one raw point. For all four groups, the probability of correct classification was high, .91. The false positive and false negative rates were approximately the same. All groups had a reasonably high overall percentage of consistent classification (pc) of at least .87 and a kappa coefficient of at least .73. The probability of misclassification (.13) was the same for all groups. On the whole, all four groups’ recommended cut scores on Test Form A exhibited high decision accuracy and consistency.
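The relation between pc, pchance, and the kappa coefficient reported in these tables can be made explicit with a one-line helper; this is only an illustrative check against the first Form A column, and the published values themselves come from BB-Class.

```python
"""Kappa in Tables 4.16 and 4.17 expresses consistency beyond chance:
kappa = (pc - p_chance) / (1 - p_chance)."""
def decision_kappa(pc, p_chance):
    return (pc - p_chance) / (1 - p_chance)

# First Form A column (Table 4.16 reports .74; the small difference reflects
# rounding of the reported pc and p_chance)
print(f"kappa = {decision_kappa(0.88, 0.52):.2f}")
```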


Table 4.17 presents the accuracy and consistency estimates for Form B raw cut scores.

Table 4.17  Accuracy and consistency estimates for Form B raw cut scores

Raw cut score                                          27         28                 29
Probability of correct classification                  .90        .90                .89
False positive rate                                    .05        .05                .06
False negative rate                                    .05        .05                .05
Overall percentage of consistent classification (pc)   .86        .85                .85
pchance                                                .54        .52                .51
Kappa (k)                                              .69        .69                .69
Probability of misclassification                       .14        .15                .15
Group                                                  Group 4    Group 1, Group 2   Group 3

In contrast to Test Form A, the greatest difference between the recommended cut scores for Test Form B was two raw points, while the smallest was one raw point. For all groups, the probability of correct classification remained approximately the same, ranging from .89 to .90. Similarly, the Type I (false positive) and Type II (false negative) rates were approximately the same for all groups, ranging from .05 to .06. For Test Form B, the overall percentage of consistent classification (pc) was at least .85 for all four groups, while the kappa coefficient was .69 for all groups. The probability of misclassification was slightly lower (.14) for Group 4. While all recommended cut scores for both forms exhibited high decision accuracy and classification consistency, a closer examination of the consequences of each cut score revealed that differences in cut score measures impacted 4 % (Test Form B) to 14 % (Test Form A) of the test-taker population. Table 4.18 presents the overall pass/fail rates based on the recommended cut score measures (in logits) for each test form.

Table 4.18  Form A and Form B pass/fail rates

                     Form A (N = 2562)                      Form B (N = 649)
Cut score measure    0.49      0.55      0.57      0.86     0.43      0.49      0.50      0.59
Overall pass rate    .60       .60       .57       .57      .60       .60       .60       .56
Overall fail rate    .40       .40       .43       .43      .40       .40       .40       .44
Group                Group 4   Group 2   Group 1   Group 3  Group 4   Group 1   Group 2   Group 3


For Test Form A, both Group 2 and Group 4 had the same rounded raw cut score, but their logit cut scores differed. Group 2 had a recommended raw cut score of 26.46 (0.55 logits), while Group 4 had a recommended raw cut score of 25.91 (0.49 logits). Consequently, the very small difference in cut score measures did not change the overall pass rate. However, the difference between the cut score measures recommended by Group 3 and Group 4 for Test Form A was 0.37 logits, which, in reality, translated into a difference of 14 % in the pass rate. A cut score of 0.49 logits (Group 4) implies that 60 % of the test takers would pass the test, while a cut score of 0.86 logits (Group 3) indicates that 46 % of the test takers would pass. Test Form A had a test-taker population of 2,562, so the difference in cut scores between Group 3 and Group 4 affected approximately 359 test takers (14 %) either positively or negatively. Similarly, for Test Form B, Group 1 and Group 2 had the same rounded raw cut score (28), but their logit cut score measures differed: Group 1 had a recommended raw cut score of 27.56 (0.49 logits), while Group 2 had a recommended raw cut score of 27.85 (0.50 logits). The difference in the overall pass rate between the lowest cut score measure (Group 4, 0.43 logits) and the highest cut score measure (Group 3, 0.59 logits) was four percent (4 %), a difference affecting approximately 26 test takers (649 test takers × .04 ≈ 26).

The Standard Error method

Stearns and Smith (2009) proposed the Standard Error method for calculating decision consistency from a single administration within the framework of Rasch measurement. The method was applied using the following formula:

z = (βn − LC) / SEβn    (4.3)

where z is the test-taker's z-score, βn is the ability estimate of the test taker, LC is the logit cut score, and SEβn is the standard error of the test-taker's ability estimate. For example, a logit cut score (LC) of 0.49 (e.g., the Group 1 Form B cut score, see Table 4.18), a test-taker ability measure (βn) of 0.49, and a test-taker standard error (SEβn) of .32 yield a z-score of approximately 0.00. Using a z-table to convert the z-score to a tail probability, a z-score of 0 corresponds to a probability of 0.50, so the probability of that test taker receiving the same classification on retesting would be .50 (1 − .50). Considering that the test taker has an ability estimate (0.49) equal to the cut score (0.49), a probability of .50 is expected. However, a logit cut score (LC) of 0.49, a test-taker ability measure (βn) of 1.27, and a test-taker standard error (SEβn) of 0.35 yield a z-score of 2.22. Using a z-table, a z-score of 2.22 corresponds to a tail probability of 0.01, so the probability of that test taker receiving the same classification on retesting would be .99 (1 − .01). Considering that the test taker's ability estimate (1.27) lies well above the logit cut score (0.49), an extremely high probability of .99 is expected. To calculate the total classification rate for all test takers, the number of test takers receiving each raw score is multiplied by its corresponding proportion of consistent classification; those weighted proportions are then summed and divided by the total number of test takers. The estimates used to calculate the z-scores were retrieved from the WINSTEPS® Rasch measurement software program (Linacre, 2020c, version 4.7.0.0). Table 4.19 reports the consistency indices for each group and test form.

Table 4.19  Percentage of correct classifications per group and test form

             Test Form A    Test Form B
Group 1      .91            .90
Group 2      .91            .90
Group 3      .91            .90
Group 4      .91            .91

Both the Livingston and Lewis method (1995) and the Standard Error method (Stearns & Smith, 2009) yielded similar correct classification estimates for all groups. From a CTT perspective, the internal validity analyses of the recommended cut scores revealed that the cut scores set in both virtual environments by all four groups appear to be reliable, implying that valid inferences can be drawn from them.
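The Standard Error method calculation described above can be sketched in a few lines. The function below reproduces the two worked examples from the text, and the total classification rate helper mirrors the weighting procedure; ability measures, standard errors, and raw-score counts are assumed to come from the WINSTEPS person output.

```python
"""Minimal sketch of the Standard Error method (Stearns & Smith, 2009) as
described above. The two calls reproduce the worked examples in the text."""
from scipy.stats import norm

def consistent_classification_probability(ability, cut_score, se):
    """Probability of receiving the same pass/fail classification on retesting."""
    z = (ability - cut_score) / se
    return norm.cdf(abs(z))

print(f"{consistent_classification_probability(0.49, 0.49, 0.32):.2f}")   # -> 0.50
print(f"{consistent_classification_probability(1.27, 0.49, 0.35):.2f}")   # -> 0.99

def total_classification_rate(abilities, ses, counts, cut_score):
    """Weight each raw score's consistency probability by its number of test takers."""
    weighted = sum(n * consistent_classification_probability(b, cut_score, s)
                   for b, s, n in zip(abilities, ses, counts))
    return weighted / sum(counts)
```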

4.3 Comparability of cut scores between media and environments

Research question 2 [How comparable are virtual cut scores within and across virtual panels and different environments (virtual and F2F)?] and its sub-questions aimed to investigate whether cut score measures were comparable between media, test forms, rounds, and environments. To investigate research question 2.1 [How comparable are virtual cut scores within and across virtual panels?] and research question 2.2 [How comparable are virtual cut scores with F2F cut scores?], a series of Welch t-tests was performed through GraphPad Prism (version 9.3.1, 2021). The analysis entailed comparing virtual cut score measures (1) within and across groups per round and (2) with equivalent face-to-face (F2F) cut score measures, data collected from a previous workshop (Kollias, 2012). The Benjamini, Krieger, and Yekutieli (2006) two-stage linear step-up procedure for control of the false discovery rate (FDR) was applied to all the comparisons to assess the significance of the Welch t-tests, using an FDR level of 5 %. Adjusted p-values (q-values) (see section 3.4.3 for a description of q-values) are reported in the last column of each table. Tables 4.20 to 4.22 report the virtual cut score comparisons within and across groups per round, while Tables 4.23 to 4.25 report the virtual cut score comparisons with the face-to-face (F2F) cut scores per round.
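A minimal sketch of this procedure is given below, assuming hypothetical judge-level cut score measures for two panels; the Welch test comes from SciPy, and the two-stage Benjamini-Krieger-Yekutieli correction is applied with statsmodels (method "fdr_tsbky").

```python
"""Minimal sketch of the comparison procedure: Welch's t-test on two panels'
judge-level cut score measures, followed by the two-stage Benjamini-Krieger-
Yekutieli FDR correction across a set of comparisons. All values are hypothetical."""
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

panel_a = [0.21, 0.35, 0.48, 0.52, 0.55, 0.60, 0.66, 0.72, 0.95]            # logits
panel_b = [0.10, 0.22, 0.30, 0.36, 0.41, 0.44, 0.52, 0.58, 0.61, 0.70, 0.74, 0.80, 0.83]

t_stat, p_value = ttest_ind(panel_a, panel_b, equal_var=False)              # Welch's t-test
p_values = [p_value, 0.34, 0.72, 0.09]       # p-values from other (hypothetical) contrasts

reject, q_values, _, _ = multipletests(p_values, alpha=0.05, method="fdr_tsbky")
print(f"Welch t = {t_stat:.2f}, q-values = {[round(float(q), 2) for q in q_values]}")
```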

4.3.1 Comparability of virtual cut score measures

This section presents the cut score measure comparisons within and across virtual groups per round. Column one (Group (1)) displays the first group and the medium in which the virtual cut score measure was recommended, where (A) = audio and (V) = video, while column two (Cut score measure (S.E.)) presents the group's recommended cut score measure with its associated standard error (S.E.) in parentheses. Columns three and four report the same information for the second group. Column five (Cut score contrast (Joint S.E.)) reports the difference between the two cut score measures, with its associated joint standard error in parentheses. Due to rounding, the difference between the two measures may not always match the reported contrast exactly. Column six (Welch t (d.f.)) reports the Welch t-test statistic, with its approximate degrees of freedom (d.f.) in parentheses. Column seven (Prob.) reports the probability of the Welch t-statistic, while the last column reports the FDR-adjusted p-value (q-value). Table 4.20 reports the Round 1 virtual cut score measure comparisons.


Table 4.20  Round 1 virtual cut score measure comparisons

Group (1) | Cut score measure (S.E.) | Group (2) | Cut score measure (S.E.) | Cut score contrast (Joint S.E.) | Welch t (d.f.) | Prob. (p) | FDR (q)
G1 (A) | 0.36 (0.15) | G1 (V) | 0.27 (0.19) | 0.09 (0.25) | 0.37 (15.11) | .72 | .95
G1 (A) | 0.36 (0.15) | G2 (A) | 0.35 (0.12) | 0.01 (0.19) | 0.06 (17.06) | .95 | .99
G1 (A) | 0.36 (0.15) | G2 (V) | 0.33 (0.15) | 0.03 (0.21) | 0.12 (19.17) | .90 | .99
G1 (A) | 0.36 (0.15) | G3 (A) | 0.43 (0.16) | -0.08 (0.22) | 0.35 (18.69) | .73 | .95
G1 (A) | 0.36 (0.15) | G3 (V) | 0.55 (0.12) | -0.19 (0.19) | 0.99 (16.38) | .34 | .81
G1 (A) | 0.36 (0.15) | G4 (A) | 0.73 (0.14) | -0.37 (0.20) | 1.81 (17.25) | .09 | .49
G1 (A) | 0.36 (0.15) | G4 (V) | 0.28 (0.14) | 0.08 (0.21) | 0.39 (17.44) | .70 | .95
G1 (V) | 0.27 (0.19) | G2 (A) | 0.35 (0.12) | -0.08 (0.23) | 0.35 (14.18) | .73 | .95
G1 (V) | 0.27 (0.19) | G2 (V) | 0.33 (0.15) | -0.07 (0.25) | 0.26 (16.56) | .79 | .95
G1 (V) | 0.27 (0.19) | G3 (A) | 0.43 (0.16) | -0.17 (0.25) | 0.67 (16.65) | .51 | .95
G1 (V) | 0.27 (0.19) | G3 (V) | 0.55 (0.12) | -0.28 (0.23) | 1.24 (13.74) | .24 | .76
G1 (V) | 0.27 (0.19) | G4 (A) | 0.73 (0.14) | -0.46 (0.24) | 1.94 (15.03) | .07 | .48
G1 (V) | 0.27 (0.19) | G4 (V) | 0.28 (0.14) | -0.01 (0.24) | 0.04 (15.32) | .97 | .99
G2 (A) | 0.35 (0.12) | G2 (V) | 0.33 (0.15) | 0.01 (0.19) | 0.08 (23.05) | .94 | .99
G2 (A) | 0.35 (0.12) | G3 (A) | 0.43 (0.16) | -0.09 (0.20) | 0.44 (21.33) | .67 | .95
G2 (A) | 0.35 (0.12) | G3 (V) | 0.55 (0.12) | -0.2 (0.17) | 1.18 (23.00) | .25 | .76
G2 (A) | 0.35 (0.12) | G4 (A) | 0.73 (0.14) | -0.38 (0.18) | 2.07 (21.11) | .05 | .48
G2 (A) | 0.35 (0.12) | G4 (V) | 0.28 (0.14) | 0.07 (0.19) | 0.37 (20.86) | .71 | .95
G2 (V) | 0.33 (0.15) | G3 (A) | 0.43 (0.16) | -0.1 (0.22) | 0.47 (22.86) | .64 | .95
G2 (V) | 0.33 (0.15) | G3 (V) | 0.55 (0.12) | -0.22 (0.19) | 1.13 (22.20) | .27 | .76
G2 (V) | 0.33 (0.15) | G4 (A) | 0.73 (0.14) | -0.40 (0.20) | 1.94 (22.00) | .07 | .48
G2 (V) | 0.33 (0.15) | G4 (V) | 0.28 (0.14) | 0.06 (0.21) | 0.27 (21.98) | .79 | .95
G3 (A) | 0.43 (0.16) | G3 (V) | 0.55 (0.12) | -0.12 (0.20) | 0.59 (20.55) | .56 | .95
G3 (A) | 0.43 (0.16) | G4 (A) | 0.73 (0.14) | -0.29 (0.21) | 1.42 (20.87) | .17 | .64
G3 (A) | 0.43 (0.16) | G4 (V) | 0.28 (0.14) | 0.16 (0.21) | 0.74 (20.95) | .47 | .95
G3 (V) | 0.55 (0.12) | G4 (A) | 0.73 (0.14) | -0.18 (0.18) | 0.99 (20.25) | .34 | .81
G3 (V) | 0.55 (0.12) | G4 (V) | 0.28 (0.14) | 0.27 (0.18) | 1.47 (20.03) | .16 | .64
G4 (A) | 0.73 (0.14) | G4 (V) | 0.28 (0.14) | 0.45 (0.20) | 2.29 (19.98) | .03 | .48

In Round 1, no statistically significant differences in mean measures were observed, implying that no discoveries were made as adjusted p-​values (q-​ values) ranged from .48 to .99. Consequently, Round 1 cut score measures were comparable across virtual groups and between media. Table 4.21 reports Round 2 cut score mean measure comparisons.


Table 4.21  Round 2 virtual cut score measure comparisons

Group (1) | Cut score measure (S.E.) | Group (2) | Cut score measure (S.E.) | Cut score contrast (Joint S.E.) | Welch t (d.f.) | Prob. (p) | FDR (q)
G1 (A) | 0.57 (0.15) | G1 (V) | 0.45 (0.18) | 0.12 (0.23) | 0.52 (15.43) | .61 | .92
G1 (A) | 0.57 (0.15) | G2 (A) | 0.50 (0.10) | 0.06 (0.18) | 0.36 (14.64) | .72 | .92
G1 (A) | 0.57 (0.15) | G2 (V) | 0.67 (0.14) | -0.10 (0.20) | 0.50 (18.82) | .62 | .92
G1 (A) | 0.57 (0.15) | G3 (A) | 0.61 (0.11) | -0.04 (0.18) | 0.22 (15.83) | .83 | .92
G1 (A) | 0.57 (0.15) | G3 (V) | 0.75 (0.16) | -0.18 (0.22) | 0.85 (18.80) | .40 | .92
G1 (A) | 0.57 (0.15) | G4 (A) | 0.50 (0.13) | 0.06 (0.20) | 0.32 (17.12) | .75 | .92
G1 (A) | 0.57 (0.15) | G4 (V) | 0.56 (0.18) | 0.01 (0.24) | 0.04 (17.80) | .97 | .99
G1 (V) | 0.45 (0.18) | G2 (A) | 0.50 (0.10) | -0.06 (0.20) | 0.27 (12.69) | .79 | .92
G1 (V) | 0.45 (0.18) | G2 (V) | 0.67 (0.14) | -0.22 (0.23) | 0.98 (16.71) | .34 | .92
G1 (V) | 0.45 (0.18) | G3 (A) | 0.61 (0.11) | -0.16 (0.21) | 0.77 (13.76) | .46 | .92
G1 (V) | 0.45 (0.18) | G3 (V) | 0.75 (0.16) | -0.30 (0.24) | 1.28 (17.45) | .22 | .92
G1 (V) | 0.45 (0.18) | G4 (A) | 0.50 (0.13) | -0.06 (0.22) | 0.25 (15.37) | .81 | .92
G1 (V) | 0.45 (0.18) | G4 (V) | 0.56 (0.18) | -0.11 (0.26) | 0.43 (17.86) | .67 | .92
G2 (A) | 0.50 (0.10) | G2 (V) | 0.67 (0.14) | -0.17 (0.17) | 0.97 (21.33) | .34 | .92
G2 (A) | 0.50 (0.10) | G3 (A) | 0.61 (0.11) | -0.11 (0.15) | 0.72 (22.38) | .48 | .92
G2 (A) | 0.50 (0.10) | G3 (V) | 0.75 (0.16) | -0.25 (0.18) | 1.35 (18.64) | .19 | .92
G2 (A) | 0.50 (0.10) | G4 (A) | 0.50 (0.13) | 0.00 (0.16) | 0.00 (19.18) | .99 | .99
G2 (A) | 0.50 (0.10) | G4 (V) | 0.56 (0.18) | -0.05 (0.21) | 0.26 (15.41) | .80 | .92
G2 (V) | 0.67 (0.14) | G3 (A) | 0.61 (0.11) | 0.06 (0.18) | 0.34 (22.12) | .74 | .92
G2 (V) | 0.67 (0.14) | G3 (V) | 0.75 (0.16) | -0.08 (0.21) | 0.38 (22.52) | .70 | .92
G2 (V) | 0.67 (0.14) | G4 (A) | 0.50 (0.13) | 0.17 (0.19) | 0.86 (21.99) | .40 | .92
G2 (V) | 0.67 (0.14) | G4 (V) | 0.56 (0.18) | 0.11 (0.23) | 0.49 (19.60) | .63 | .92
G3 (A) | 0.61 (0.11) | G3 (V) | 0.75 (0.16) | -0.14 (0.19) | 0.74 (19.77) | .47 | .92
G3 (A) | 0.61 (0.11) | G4 (A) | 0.50 (0.13) | 0.11 (0.17) | 0.61 (20.01) | .55 | .92
G3 (A) | 0.61 (0.11) | G4 (V) | 0.56 (0.18) | 0.05 (0.21) | 0.24 (16.55) | .81 | .92
G3 (V) | 0.75 (0.16) | G4 (A) | 0.50 (0.13) | 0.25 (0.21) | 1.21 (20.69) | .24 | .92
G3 (V) | 0.75 (0.16) | G4 (V) | 0.56 (0.18) | 0.19 (0.24) | 0.80 (20.13) | .43 | .92
G4 (A) | 0.50 (0.13) | G4 (V) | 0.56 (0.18) | -0.05 (0.23) | 0.24 (18.12) | .81 | .92

Similarly, in Round 2, no differences in cut score measures were statistically significant as q-​values ranged from .92 to .99. Thus, Round 2 virtual cut scores were comparable across virtual panels and between media. Table 4.22 reports Round 3 virtual cut score measure comparisons.


Table 4.22  Round 3 virtual cut score measure comparisons

Group (1) | Cut score measure (S.E.) | Group (2) | Cut score measure (S.E.) | Cut score contrast (Joint S.E.) | Welch t (d.f.) | Prob. (p) | FDR (q)
G1 (A) | 0.57 (0.14) | G1 (V) | 0.49 (0.15) | 0.08 (0.21) | 0.40 (15.87) | .70 | .91
G1 (A) | 0.57 (0.14) | G2 (A) | 0.50 (0.08) | 0.07 (0.16) | 0.42 (13.58) | .68 | .91
G1 (A) | 0.57 (0.14) | G2 (V) | 0.55 (0.06) | 0.02 (0.15) | 0.15 (11.07) | .89 | .99
G1 (A) | 0.57 (0.14) | G3 (A) | 0.59 (0.10) | -0.02 (0.17) | 0.12 (14.90) | .90 | .99
G1 (A) | 0.57 (0.14) | G3 (V) | 0.86 (0.09) | -0.29 (0.17) | 1.73 (14.48) | .10 | .38
G1 (A) | 0.57 (0.14) | G4 (A) | 0.49 (0.11) | 0.08 (0.18) | 0.42 (16.34) | .68 | .91
G1 (A) | 0.57 (0.14) | G4 (V) | 0.43 (0.11) | 0.14 (0.18) | 0.76 (16.10) | .46 | .91
G1 (V) | 0.49 (0.15) | G2 (A) | 0.50 (0.08) | -0.01 (0.17) | 0.08 (12.71) | .93 | .99
G1 (V) | 0.49 (0.15) | G2 (V) | 0.55 (0.06) | -0.06 (0.16) | 0.36 (10.56) | .72 | .91
G1 (V) | 0.49 (0.15) | G3 (A) | 0.59 (0.10) | -0.10 (0.18) | 0.57 (13.94) | .58 | .91
G1 (V) | 0.49 (0.15) | G3 (V) | 0.86 (0.09) | -0.37 (0.18) | 2.08 (13.55) | .06 | .26
G1 (V) | 0.49 (0.15) | G4 (A) | 0.49 (0.11) | -0.01 (0.19) | 0.04 (15.46) | .97 | .99
G1 (V) | 0.49 (0.15) | G4 (V) | 0.43 (0.11) | 0.05 (0.19) | 0.28 (15.21) | .78 | .94
G2 (A) | 0.50 (0.08) | G2 (V) | 0.55 (0.06) | -0.05 (0.10) | 0.44 (21.96) | .67 | .91
G2 (A) | 0.50 (0.08) | G3 (A) | 0.59 (0.10) | -0.09 (0.13) | 0.69 (22.27) | .49 | .91
G2 (A) | 0.50 (0.08) | G3 (V) | 0.86 (0.09) | -0.36 (0.12) | 2.88 (22.55) | .01 | .09
G2 (A) | 0.50 (0.08) | G4 (A) | 0.49 (0.11) | 0.01 (0.14) | 0.06 (19.11) | .96 | .99
G2 (A) | 0.50 (0.08) | G4 (V) | 0.43 (0.11) | 0.07 (0.14) | 0.49 (19.40) | .63 | .91
G2 (V) | 0.55 (0.06) | G3 (A) | 0.59 (0.10) | -0.04 (0.11) | 0.38 (18.87) | .71 | .91
G2 (V) | 0.55 (0.06) | G3 (V) | 0.86 (0.09) | -0.31 (0.11) | 2.83 (19.34) | .01 | .09
G2 (V) | 0.55 (0.06) | G4 (A) | 0.49 (0.11) | 0.05 (0.13) | 0.41 (15.53) | .69 | .91
G2 (V) | 0.55 (0.06) | G4 (V) | 0.43 (0.11) | 0.11 (0.13) | 0.90 (15.79) | .38 | .91
G3 (A) | 0.59 (0.10) | G3 (V) | 0.86 (0.09) | -0.27 (0.13) | 2.03 (21.97) | .06 | .26
G3 (A) | 0.59 (0.10) | G4 (A) | 0.49 (0.11) | 0.10 (0.15) | 0.65 (20.08) | .52 | .91
G3 (A) | 0.59 (0.10) | G4 (V) | 0.43 (0.11) | 0.16 (0.15) | 1.07 (20.27) | .30 | .81
G3 (V) | 0.86 (0.09) | G4 (A) | 0.49 (0.11) | 0.36 (0.15) | 2.50 (19.74) | .02 | .14
G3 (V) | 0.86 (0.09) | G4 (V) | 0.43 (0.11) | 0.42 (0.14) | 2.96 (19.96) | .01 | .09
G4 (A) | 0.49 (0.11) | G4 (V) | 0.43 (0.11) | 0.06 (0.16) | 0.38 (19.99) | .71 | .91

In Round 3, no statistically significant differences were observed between cut score measures as q-​values ranged from .09 to .99. Consequently, Round 3 virtual cut scores were also comparable across groups and between media.


4.3.2 Comparability of virtual and F2F cut score measures

To investigate research question 2.2 [How comparable are virtual cut scores with F2F cut scores?], data collected in an earlier standard setting workshop that took place in 2011 (see Kollias, 2012) were used. The F2F panel consisted of 15 judges who used the modified percentage Angoff method (Angoff, 1971; see section 2.2.1.1 for a description) to set a B1 cut score on the original instrument (N = 75 items, see section 3.2.3.2, Table 3.1). The F2F judge data were also re-analysed through MFRM and placed on the same latent scale as the virtual judge data. Although the F2F panel had recommended a cut score on all 75 items, the Group 5 data used here comprise only the 45 items (Test Form A) on which the virtual panels recommended their cut scores. As the F2F cut score study did not include a Round 3, the Round 2 F2F cut score measures were compared with the virtual Round 3 cut score measures (see Appendix I for Group 5 individual level Rasch indices). For the F2F panel, both the Round 1 (cut score measure = 0.258) and Round 2 (cut score measure = 0.264) cut score measures were 0.26 logits when rounded to two decimal places. Table 4.23 reports the Round 1 virtual cut score measures and Round 1 F2F cut score measure comparisons.

Table 4.23  Round 1 virtual and F2F cut score measure comparisons

Group (1) | Cut score measure (S.E.) | Group (2) | Cut score measure (S.E.) | Cut score contrast (Joint S.E.) | Welch t (d.f.) | Prob. (p) | FDR (q)
G1 (A) | 0.36 (0.15) | G5 (F2F) | 0.26 (0.13) | 0.10 (0.20) | 0.50 (18.27) | .62 | .95
G2 (V) | 0.33 (0.15) | G5 (F2F) | 0.26 (0.13) | 0.07 (0.20) | 0.37 (24.63) | .72 | .95
G3 (V) | 0.55 (0.12) | G5 (F2F) | 0.26 (0.13) | 0.29 (0.17) | 1.66 (24.95) | .11 | .53
G4 (A) | 0.73 (0.14) | G5 (F2F) | 0.26 (0.13) | 0.47 (0.19) | 2.49 (22.72) | .02 | .48

When comparing Round 1 virtual cut score measures with the F2F cut score measure, no statistically significant differences were observed, as q-values ranged from .48 to .95. Table 4.24 reports the Round 2 virtual cut score measures and Round 2 F2F cut score measure comparisons.

Table 4.24  Round 2 virtual and F2F cut score measure comparisons
Group (1)  Cut score measure (S.E.)  Group (2)  Cut score measure (S.E.)  Cut score contrast (Joint S.E.)  Welch t (d.f.)  Prob. p  FDR q
G1 (A)     0.57 (0.15)               G5 (F2F)   0.26 (0.13)               0.30 (0.20)                      1.53 (19.04)    .14      .92
G2 (V)     0.67 (0.14)               G5 (F2F)   0.26 (0.13)               0.41 (0.19)                      2.10 (25.51)    .05      .77
G3 (V)     0.75 (0.16)               G5 (F2F)   0.26 (0.13)               0.49 (0.21)                      2.38 (23.13)    .03      .77
G4 (A)     0.50 (0.13)               G5 (F2F)   0.26 (0.13)               0.24 (0.19)                      1.28 (23.40)    .21      .92

Similarly, when comparing Round 2 virtual cut score measures with F2F cut score measures, no statistically significant differences were observed as q-values ranged from .77 to .92. Table 4.25 reports Round 3 virtual cut score measures and Round 2 F2F cut score measure comparisons.

Table 4.25  Round 3 virtual & Round 2 F2F cut score measure comparisons
Group (1)  Cut score measure (S.E.)  Group (2)  Cut score measure (S.E.)  Cut score contrast (Joint S.E.)  Welch t (d.f.)  Prob. p  FDR q
G1 (A)     0.57 (0.14)               G5 (F2F)   0.26 (0.13)               0.30 (0.19)                      1.58 (19.82)    .13      .43
G2 (V)     0.55 (0.06)               G5 (F2F)   0.26 (0.13)               0.28 (0.15)                      1.92 (19.46)    .07      .28
G3 (V)     0.86 (0.09)               G5 (F2F)   0.26 (0.13)               0.59 (0.16)                      3.67 (23.68)    .01      .04*
G4 (A)     0.49 (0.11)               G5 (F2F)   0.26 (0.13)               0.23 (0.17)                      1.31 (24.00)    .20      .60

* Statistically significant at q ≤ .05

When comparing Round 3 virtual cut score measures with Round 2 F2F cut score measures, one discovery was made: a statistically significant difference was observed in one of the four comparisons. The difference between the Group 3 cut score measure (M = .86) and the Group 5 F2F measure (M = .26) was statistically significant (t (23.68) = 3.67, q = .04). Group 3 had also set the highest Round 3 cut score on Test Form A amongst all the virtual panels. Further investigation revealed that two judges (J25 and J33) set cut scores that were at least twice as high as the next highest cut score measure (Group 1 Round 3 cut score measure = 0.57 logits). J25 set a cut score of 1.27, corresponding to a raw score of 33 (see Appendix J for the Form A score table), while J33 set the highest cut score (1.40, raw score = 34). Removing these two judges from the analysis yielded a cut score measure of 0.76, and rerunning the analysis produced a non-significant q-value of .12. Thus, the difference between the Group 3 Round 3 virtual cut score measure and the Group 5 Round 2 F2F cut score measure was most likely due to the idiosyncrasies of the judges in Group 3. Nonetheless, comparable cut score measures within rounds do not necessarily imply that judges did not exhibit differential severity in either e-communication medium. Consequently, the next round of analysis entailed conducting differential analysis.

4.4 Differential severity between medium, judges, and panels

To investigate research question 3 [Do judges exercise differential severity when setting cut scores in different e-communication media (audio and video)? If so, to what extent?] and its sub-questions 3.1 [Do judges exhibit differential severity towards either of the two e-communication media?] and 3.2 [Do any of the virtual panels exhibit differential severity towards either of the two e-communication media?], three different types of differential functioning (aka bias) analyses were conducted. The bias sizes and their corresponding standard errors were retrieved through the FACETS programme version 3.84.0 (Linacre, 2022), while the t-test results and q-values were retrieved through GraphPad Prism (version 9.3.1, 2021). For systematic bias to be claimed, a difference in bias sizes needs to be both statistically significant and substantial, that is, at least |0.50| logits (Linacre, 2020a).
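As an illustration of this dual criterion, the sketch below (hypothetical numbers and standard scientific Python; it is not the FACETS or GraphPad Prism workflow used in the study) computes a Welch t-statistic for each audio-video bias contrast from measure/standard-error pairs, adjusts the p-values with the Benjamini-Hochberg (FDR) procedure, and flags a contrast only when it is both at least |0.50| logits and significant at q ≤ .05.

    # A minimal sketch with hypothetical inputs; not the FACETS/GraphPad output reported here.
    import math
    from scipy import stats

    def welch_contrast(m1, se1, df1, m2, se2, df2):
        """Welch t-test for the difference between two measures with known S.E.s."""
        contrast = m1 - m2
        joint_se = math.sqrt(se1 ** 2 + se2 ** 2)
        t = contrast / joint_se
        # Welch-Satterthwaite approximation to the degrees of freedom
        df = joint_se ** 4 / (se1 ** 4 / df1 + se2 ** 4 / df2)
        p = 2 * stats.t.sf(abs(t), df)
        return contrast, t, df, p

    def bh_qvalues(pvals):
        """Benjamini-Hochberg (FDR) adjusted p-values, returned in input order."""
        n = len(pvals)
        order = sorted(range(n), key=lambda i: pvals[i])
        q, running_min = [0.0] * n, 1.0
        for back_rank, i in enumerate(reversed(order)):
            rank = n - back_rank
            running_min = min(running_min, pvals[i] * n / rank)
            q[i] = running_min
        return q

    # Hypothetical audio-vs-video bias contrasts: (measure, S.E., d.f.) per medium
    pairs = [((0.05, 0.11, 404), (-0.05, 0.11, 404)),
             ((0.23, 0.10, 494), (-0.22, 0.10, 494))]
    results = [welch_contrast(*audio, *video) for audio, video in pairs]
    qvals = bh_qvalues([r[3] for r in results])
    for (contrast, t, df, p), q in zip(results, qvals):
        systematic = abs(contrast) >= 0.50 and q <= 0.05   # dual criterion
        print(f"contrast={contrast:+.2f}  t={t:.2f} (d.f.={df:.1f})  "
              f"p={p:.3f}  q={q:.3f}  systematic bias: {systematic}")

Under this rule, a statistically significant contrast that is smaller than 0.50 logits is still not treated as systematic bias, which is the situation that arises for some of the group-level contrasts reported below.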

4.4.1 Differential judge functioning (DJF)

The first series of analyses, differential judge functioning (DJF), was conducted to explore whether any judge exhibited systematic bias towards a particular e-communication medium. DJF analyses were conducted on all three Rounds; however, the Round 3 analyses did not produce any degrees of freedom (d.f.) or statistical significance probabilities (prob.) as judges recorded only one rating in each virtual medium (see Appendix K for the DJF analysis). The DJF analysis revealed that no judge exhibited systematic bias in either virtual medium in Round 1 and Round 2, as the differences in cut score measures between e-communication media for each judge were smaller than |0.50| logits and/or the q-values were greater than .05. It should be noted that since the judge facet was active, the reported measures in the Appendix K tables indicate each judge's actual cut score measure (measure + bias). The next series of analyses entailed treating all judges (N = 45) as forming one group.


4.4.2 Differential medium functioning (DMF)

Differential medium functioning (DMF) was conducted to explore whether the judges, when treated as one group, exhibited systematic bias towards a particular e-communication medium. The first analysis involved comparing the measures of all judgements made in both media, regardless of the test form and judge group. Table 4.26 reports the DMF analysis of all judgements [(45 judges × 45 items (Round 1) + 45 judges × 45 items (Round 2) + 45 judges × 1 item (Round 3)) × 2 test forms = 8190; 4095 judgements per test form] per test form. Column one displays the measure (measure + bias size) for the audio medium and its associated standard error (S.E.) reported in parentheses. It should be noted that when a facet is a dummy facet (anchored to zero), the measure + bias size equals zero when no bias is detected, while any departure from 0 indicates positive or negative bias, respectively. Column two displays the same information as column one for the video medium. Column three (Target contrast) reports the difference between the two media measures, with its associated joint standard error (Joint S.E.) reported in parentheses. Due to rounding, differences between the two target measures may not always match the reported target contrast. Column four (Welch t) reports the Welch t-test statistic, with its (approximate) degrees of freedom (d.f.) in parentheses, while column five (Prob.) reports the probability of the Welch statistic. When there is more than one pairwise interaction, the adjusted p-values (q-values) are reported as well.

Table 4.26  DMF analysis of all judgements per medium
Audio medium measure (S.E.)  Video medium measure (S.E.)  Target contrast (Joint S.E.)  Welch t (d.f.)  Prob. p
0.00 (.03)                   0.00 (.03)                   0.00 (.04)                    -0.08 (8187)    .94

The first DMF analysis of all judgments (3 rounds × 2 test Forms) made in both media revealed that no systematic bias was detected towards or against either of the two e-​communication media. The bias size in each medium was approximately 0.00 and the perceived difference in bias sizes was also 0.00, which was not statistically significant, t (8187) =​-​0.08, p =​.94. The next DMF analysis entailed comparing all judgments (3 Rounds, N =​4095) made in each medium within each test form (see Table 4.27).

Table 4.27  DMF analysis of all judgements per medium, within test form
Test form  Audio measure (S.E.)  Video measure (S.E.)  Target contrast (Joint S.E.)  Welch t (d.f.)  Prob. p  FDR q
Form A     -0.07 (.04)           -0.06 (.04)           0.01 (.06)                    0.19 (3904)     .85      .90
Form B     0.06 (.04)            0.07 (.04)            0.01 (.06)                    0.18 (3907)     .85      .90

When all judgements were compared across all rounds per test form, no systematic bias was detected for or against either of the two e-​communication media. The final series of differential analysis was conducted by comparing the different panel groups.

4.4.3 Differential group functioning (DGF)

The final series of differential analyses entailed examining whether any systematic bias was exhibited within the four virtual panels. The first DGF analysis investigated whether panels exhibited bias towards a specific medium when all rounds of judgements were combined. Table 4.28 reports the pairwise interactions of all judgements between media and test forms per group [((No. of judges × 45 items × 2 rounds; Rounds 1 and 2) + (No. of judges × 1 item; Round 3)) × 2 test forms], yielding a total of 1638 judgements for Group 1, 2366 judgements for Group 2, 2184 judgements for Group 3, and 2002 judgements for Group 4.

Table 4.28  DGF analysis across all judgements between media per group
Group  Audio measure (S.E.)  Video measure (S.E.)  Target contrast* (Joint S.E.)  Welch t (d.f.)  Prob. p  FDR q
1      0.05 (0.06)           -0.05 (0.06)          0.11 (0.09)                    1.22 (1636)     .22      .31
2      -0.03 (0.05)          0.03 (0.05)           -0.06 (0.07)                   -0.88 (2364)    .38      .40
3      -0.09 (0.05)          0.09 (0.05)           -0.17 (0.08)                   -2.24 (2182)    .03      .09
4      0.08 (0.06)           -0.08 (0.06)          0.16 (0.08)                    2.02 (2000)     .04      .09

* Differences between measures may not add up because of rounding
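As a quick check on the judgement totals above (working backwards from the reported figures, so the panel sizes are inferred rather than restated from the design chapter), a panel of n judges contributes (n × 45 items × 2 rounds + n × 1 item) × 2 test forms = 182n judgements. For example, 182 × 9 = 1638 (Group 1), 182 × 13 = 2366 (Group 2), 182 × 12 = 2184 (Group 3), and 182 × 11 = 2002 (Group 4), which together account for the 45 judges.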

When comparing all judgements (3 Rounds) made within groups and between media, no group exhibited systematic bias, as no group showed both a Target Contrast of at least ±0.50 logits and a q-value ≤ .05.


The final analysis entailed conducting differential group functioning (DGF) within and across panels. The within-panel pairwise interactions report whether a particular group exhibited systematic bias towards a specific e-communication medium, while the across-panel interactions report whether systematic bias can be attributed to a particular panel. For each Round, only the within-group interactions are reported below, while the full analyses can be found in Appendix L. Table 4.29 presents the Round 1 DGF within-group pairwise interactions.

Table 4.29  Round 1 DGF pairwise interactions within groups
Group (1)  Target measure (S.E.)  Group (2)  Target measure (S.E.)  Target contrast* (Joint S.E.)  Welch t (d.f.)   Prob. p  FDR q
G1 (A)     0.05 (0.11)            G1 (V)     -0.05 (0.11)           0.10 (0.15)                    0.65 (808.00)    .51      .84
G2 (A)     0.01 (0.09)            G2 (V)     -0.01 (0.09)           0.02 (0.13)                    0.13 (1168.00)   .90      .92
G3 (A)     -0.07 (0.09)           G3 (V)     0.07 (0.09)            -0.13 (0.13)                   -0.99 (1078.00)  .32      .66
G4 (A)     0.23 (0.10)            G4 (V)     -0.22 (0.10)           0.45 (0.14)                    3.26 (987.40)    .00      .03

* Differences between measures may not add up because of rounding

In Round 1, no systematic bias was detected when comparing each group's cut score measures between media. While Group 4 showed a statistically significant difference in cut score measures, the difference was not greater than or equal to 0.50 logits, implying that it was not substantial enough to indicate bias towards either medium. The FACETS programme version 3.84.0 (Linacre, 2022) also produces another type of interaction report, a table in which each group and medium interaction is analysed separately. The information presented in that report is the same as the information presented in Table 4.29 apart from the t-test statistic, degrees of freedom, and significance, as that t-test is based on the significance of the bias within each e-communication medium. The report also produces a chi-square statistic, which tests the global hypothesis that, apart from measurement error, all sets of measures are the same. For Round 1, the fixed effect hypothesis that the interactions between group and medium share the same measure, after taking measurement error into consideration, cannot be rejected (χ2 = 12, d.f. = 10, p = .21). Table 4.30 reports the Round 2 DGF pairwise analysis.


Table 4.30  Round 2 DGF pairwise interactions
Group (1)  Target measure (S.E.)  Group (2)  Target measure (S.E.)  Target contrast* (Joint S.E.)  Welch t (d.f.)    Prob. p  FDR q
G1 (A)     .07 (.11)              G1 (V)     -.07 (.11)             .13 (.15)                      .87 (808.00)      .38      .99
G2 (A)     -.08 (.09)             G2 (V)     .08 (.09)              -.16 (.13)                     -1.25 (1168.00)   .21      .99
G3 (A)     -.06 (.10)             G3 (V)     .06 (.10)              -.12 (.13)                     -.90 (1077.98)    .37      .99
G4 (A)     -.01 (.10)             G4 (V)     .01 (.10)              -.03 (.14)                     -.21 (987.80)     .83      .99

* Differences between measures may not add up because of rounding

Similarly, the Round 2 DGF analysis revealed that no group exhibited systematic bias towards either medium, a finding corroborated by the fixed effect hypothesis, which could not be rejected (χ2 = 3.2, d.f. = 9, p = .96). Table 4.31 reports the Round 3 DGF cut score analysis within groups and between media.

Table 4.31  Round 3 DGF pairwise interactions
Group (1)  Target measure (S.E.)  Group (2)  Target measure (S.E.)  Target contrast* (Joint S.E.)  Welch t (d.f.)   Prob. p  FDR q
G1 (A)     .04 (.11)              G1 (V)     -.04 (.11)             .09 (.15)                      .57 (16.00)      .58      .94
G2 (A)     -.02 (.09)             G2 (V)     .02 (.09)              -.05 (.13)                     -.38 (24.00)     .71      .94
G3 (A)     -.13 (.09)             G3 (V)     .13 (.10)              -.27 (.13)                     -1.98 (22.00)    .06      .94
G4 (A)     .03 (.10)              G4 (V)     -.03 (.10)             .06 (.14)                      .44 (20.00)      .66      .94

* Differences between measures may not add up because of rounding

The Round 3 DGF cut score analysis revealed that no group exhibited systematic bias towards either medium when mean cut score measures were compared within groups between media. This finding was also corroborated by the non-significant chi-square test (χ2 = 4.6, d.f. = 8, p = .80), implying no statistically significant interactions between group and medium. Overall, the global hypothesis that all interactions within groups and between media share the same measure, apart from measurement error, cannot be rejected, as corroborated by all chi-square statistics. In all three rounds, no group exhibited systematic bias towards a particular e-communication medium when bias size differences within groups were compared, suggesting that mean cut score measures within groups and between media were comparable. This finding is graphically corroborated in the item maps derived from the FACETS analysis (see Appendix M for a graphical overview of all judgements).

4.5 Summary

So far, the analyses revealed that the mean cut score measures set in the two virtual environments (audio and video) were comparable and valid, as all measures derived from the FACETS analysis were within the acceptable ranges and in accordance with the expectations of the Rasch model. From a CTT perspective, the same conclusions were reached as far as the reliability and validity of each group's recommended cut score were concerned. In other words, the two virtual environments did not seem to have a statistically significant impact on any of the groups' recommended cut scores. Further comparison of the virtually set cut score measures with the F2F cut score measures revealed that cut score measures were comparable across all three media, with the exception of the Group 3 Round 3 (video) cut score measure. While the Group 3 Round 3 video cut score measure was not in line with the other virtual groups' Round 3 measures or the F2F group's Round 2 measure, it is likely that the statistically significant difference arose from the idiosyncrasies of two judges who assigned cut score measures that were at least twice as high as the next highest cut score measure (Group 1). Nonetheless, Group 3 was flagged so that its judging behaviour could be further investigated through the qualitative analysis that followed. The survey data collected during each session of the workshops were then analysed to explore whether one of the two e-communication media was perceived to be a more appropriate virtual medium for the purposes and nature of a virtual standard setting workshop.

Chapter 5:  Survey data analysis

The purpose of this chapter is (1) to present and compare the judges' perceptions of each virtual medium based on their quantitative and qualitative survey responses in both media, (2) to investigate whether either of the two e-communication media hindered their communication and/or impacted their ratings and judgements, and (3) to explore whether judges exhibited a preference towards either of the two media. The chapter is divided into two sections: the first presents the survey items used to collect the judges' perceptions in both e-communication media and evaluates the judges' quantitative and qualitative responses to those items. The second briefly presents a descriptive analysis of the quantitative items used to collect procedural validity evidence, as an in-depth analysis of the procedural data was deemed to add little to answering the RQs.

5.1 Survey instruments

Research sub-question 4.1 [Do either of the e-communication media affect the judges' perceptions and evaluations of how well they communicated? If so, to what extent?] aims to investigate whether judges felt that their ability to exchange ideas successfully was affected by either of the two media in the virtual environment. To collect (1) judges' perceptions of the medium and platform and (2) procedural evidence, six surveys were administered during each session of the workshop (see section 3.2.3.5 for a description of the surveys). In each session, a total of 129 closed-ended survey items were administered, resulting in a total of 11,610 responses (129 items × 45 judges × 2 virtual sessions = 11,610). Out of the 129 closed-ended survey items, 77 items (henceforth the "perception survey instrument") gathered insight into the judges' perceptions of the medium and platform used, while the remaining 52 items (henceforth the "procedural survey instrument") gathered procedural evidence. For the closed-ended items, a six-point Likert scale was used, from which the participants were asked to choose one of the following categories: "1" (Strongly Disagree); "2" (Disagree); "3" (Slightly Disagree); "4" (Slightly Agree); "5" (Agree); and "6" (Strongly Agree).

5.2 Perception survey instrument

The perception survey instrument contained 13 items that were repeated across the six surveys to explore (1) whether one e-communication medium was favoured over the other during a particular stage of the session as far as communication and interaction were concerned and (2) whether familiarity with the platform changed panellists' perceptions regarding the ease of use and the well-functioning of the platform. Eleven of the 13 repeated items aimed at collecting communication and interaction effectiveness data (henceforth "communication items"). Of the 11 communication items, ten were repeated across all six surveys and one was repeated across the first five surveys, rendering a total of 65 [(10 items × 6 surveys) + (1 item × 5 surveys)] items. Qualitative data for the first nine of the 11 items were also collected in all six surveys, probing judges to explain their rationale for assigning scores at the lower end of the scale, especially scores of "1" (Strongly Disagree), "2" (Disagree), and "3" (Slightly Disagree). The last two of the 13 items (henceforth "platform items") aimed at collecting platform efficiency data, rendering a total of 12 items (2 items × 6 surveys). Consequently, the 13 repeated items accounted for 77 perception survey items (65 communication items + 12 platform items) out of the 129 survey items, while the remaining 52 items collected procedural evidence. The in-depth analysis that follows focuses on the 77 items pertinent to this study.
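As a compact restatement of the accounting above (no new figures, just the arithmetic): 10 items × 6 surveys + 1 item × 5 surveys = 65 communication items; 2 items × 6 surveys = 12 platform items; 65 + 12 = 77 perception items; and 129 − 77 = 52 procedural items. The 77 perception items therefore yield 77 items × 45 judges = 3,465 ratings per medium for the analyses that follow.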

5.2.1 Evaluating the perception survey instruments

The first analysis conducted was to examine the reliability of each perception survey instrument. Table 5.1 reports the descriptive statistics for each instrument.

Table 5.1  Psychometric characteristics of perception survey instruments
                               Audio     Video
No. of judges                  45        45
No. of items                   77        77
Maximum no. of points          462       462
Mean score                     398.13    398.38
S.D.                           28.97     26.87
Median score                   393       395
Mode score                     383       397
Minimum score                  329       335
Maximum score                  462       461
Reliability [alpha (rxx)]      .97       .97
SEM                            5.02      4.65
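The SEM values in Table 5.1 are presumably the classical standard errors of measurement obtained from the reported reliability and spread, SEM = S.D. × √(1 − rxx); as a check, 28.97 × √(1 − .97) ≈ 5.02 for the audio instrument and 26.87 × √(1 − .97) ≈ 4.65 for the video instrument, matching the tabled values.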

Both instruments exhibited high internal consistency, as they both had a Cronbach alpha of .97. Further analysis revealed that judges mainly used the high end of the survey scale. Approximately 90 % and 91 % of all ratings were either a "5" (Agree) or a "6" (Strongly Agree) in the audio and video media, respectively, implying that overall, as a group, the judges found the survey items easy to endorse (agree with), regardless of the e-communication medium. Table 5.2 reports the frequency data of the judges' ratings for each instrument.

Table 5.2  Frequency data of the perception survey instruments
                        Audio             Video
1 (Strongly Disagree)   10 (0.29 %)       2 (0.06 %)
2 (Disagree)            11 (0.32 %)       15 (0.43 %)
3 (Slightly Disagree)   71 (2.05 %)       58 (1.67 %)
4 (Slightly Agree)      253 (7.30 %)      229 (6.61 %)
5 (Agree)               2061 (59.48 %)    2149 (62.02 %)
6 (Strongly Agree)      1059 (30.56 %)    1010 (29.15 %)
Missing                 0 (0.00 %)        2 (0.06 %)
Total                   3465 (100 %)      3465 (100 %)

5.2.2 Analysis of perception survey items

To analyse the 77 items, a series of Wilcoxon signed-rank tests was performed to investigate whether judges preferred either of the two media. However, when the Wilcoxon signed-rank test's symmetry assumption (that the distribution of the differences between the two related groups is symmetrical) was violated, a Sign test was used in its place. Both tests were performed through the SPSS computer software (version 22). The quantitative and qualitative analysis of the 77 items is presented according to the thematic grouping of the items: communication items and platform items. The judges' qualitative comments on each item were analysed and are presented in reference to the e-communication medium. For the purpose of the qualitative analysis only, judgements ranging from "1" (Strongly Disagree) to "4" (Slightly Agree) are regarded as not endorsing (not agreeing with) an item, while judgements of "5" (Agree) and "6" (Strongly Agree) are regarded as endorsing an item.
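The decision rule just described can be sketched as follows. This is a minimal illustration with made-up ratings rather than the study's SPSS analysis: it pairs each judge's audio and video rating on one item, runs the Wilcoxon signed-rank test, and, as a fallback for when the symmetry assumption is in doubt, a Sign test on the non-zero paired differences.

    # A minimal sketch with hypothetical ratings; not the study's SPSS workflow.
    from scipy import stats

    # Hypothetical 6-point Likert ratings from the same judges in the two media
    audio = [5, 5, 6, 4, 5, 6, 5, 4, 5, 6, 5, 5]
    video = [5, 6, 6, 5, 5, 6, 4, 5, 5, 6, 6, 5]

    # Wilcoxon signed-rank test (zero differences are dropped by default)
    w_stat, w_p = stats.wilcoxon(audio, video)

    # Sign test: binomial test on the share of positive non-zero differences
    diffs = [a - v for a, v in zip(audio, video) if a != v]
    n_pos = sum(d > 0 for d in diffs)
    sign_p = stats.binomtest(n_pos, n=len(diffs), p=0.5).pvalue

    print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {w_p:.3f}")
    print(f"Sign test: {n_pos}/{len(diffs)} positive differences, p = {sign_p:.3f}")

In the study, the resulting probabilities were additionally adjusted into FDR q-values across comparisons, as reported in the tables that follow.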


Tables 5.3 through 5.15 report the frequency data for the items. Column 1 displays the survey, while column 2 displays the e-communication medium in which the survey was administered. Columns 3 through 8 report the frequency data of the judges' responses. Column 9 (W-S-R Test) reports the Wilcoxon signed-rank test Z-statistic and its corresponding two-sided probability (Prob.), and, in the event that its assumptions are not met, column 10 reports the Sign test two-sided probability (Prob.) and the Z-statistic when applicable (i.e. when the sum of negative and positive rankings is at least 25). The last column reports the adjusted p-values (q-values) (see section 3.4.3 for a description of q-values). When a q-value is less than or equal to .05, judges are expressing a preference towards one of the two e-communication media.

Communication item 1: The medium through which we communicated helped us to better understand each other.

Table 5.3  Wilcoxon signed-rank test/Sign test communication item 1
(SD = Strongly Disagree; D = Disagree; SlD = Slightly Disagree; SlA = Slightly Agree; A = Agree; SA = Strongly Agree)
Survey  Medium  SD  D   SlD  SlA  A   SA  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1       Audio   -   1   1    7    26  10  -.58 (.56)            -                    .88
        Video   -   1   2    3    27  12
2       Audio   1   -   1    3    30  10  -.30 (.77)            -                    .88
        Video   -   -   1    4    32  8
3       Audio   -   -   1    5    26  13  -.89 (.37)            -                    .88
        Video   -   1   1    7    24  12
4       Audio   -   -   1    5    28  11  -.21 (.84)            -                    .88
        Video   -   -   1    4    29  11
5       Audio   -   -   -    5    26  14  -.22 (.83)            -                    .88
        Video   -   -   -    5    27  13
6       Audio   -   -   -    2    34  9   -                     (.27)                .88
        Video   -   -   1    2    27  15

For communication item 1, there were no statistically significant differences in perceptions on whether a particular medium helped judges communicate better in any of the six surveys. The fact that no statistically significant differences were observed does not necessarily imply that there were no qualitative differences between the media during any stage of a session. The qualitative findings on this item revealed that judges were beginning to express a preference towards a particular medium.

Qualitative comments for communication item 1: The qualitative comments on whether a particular medium through which judges communicated helped them better understand one another corroborated the quantitative findings in that both media were associated with certain challenges such as technical issues and lack of resemblance with F2F settings.

Audio medium

Judges who found it difficult to agree with communication item 1 cited three main reasons: (1) technical issues; (2) unnaturalness; and (3) lack of resemblance with F2F environments. The technical issues experienced had to do with their microphones (J07, Disagree; J28, Slightly Agree – survey 1) or with their sound quality (J39, Slightly Disagree – survey 2; J35, Slightly Agree – survey 3). Judges also felt a sense of “unnaturalness” in the audio medium as they found it strange not being able to see who was speaking (J37, Slightly Disagree – survey 1; J44, Slightly Agree – survey 1), something which led to instances of confusion about the speaker's identity (J09, Agree – survey 2). The final reason provided for not being able to endorse the item was that their audio medium experience was unlike that of a F2F experience. Some judges felt that not much discussion was taking place (J28, Slightly Agree – survey 1; J29, Slightly Agree – survey 1), while another felt that the session was going too fast (J07, Strongly Disagree – survey 2). Judge J39 summarised the audio medium experience by stating that:

… the medium had some limitations compared with face-to-face interaction. Besides, there were some technical problems that affected the discussions but on the whole the workshops were quite efficient and productive and the purpose was accomplished (J39, Slightly Agree).

On the other hand, judges who found the item easier to endorse in the audio environment remarked that (1) their experiences resembled those in a F2F environment and (2) that the medium facilitated the exchange of ideas. Judge J01 commented that “the online platform was the next best thing to being there” (Agree – survey 1). Judge J09 remarked that since they could hear the other judges and see them raising their hands through the “Raise Hand” function, the medium facilitated communication (Agree – survey 1). Other judges commented that they could “understand people's way of thinking from what they say and how they say it” (J02, Agree – survey 3), that they found that the medium lent itself to exchanging views, opinions, and expertise effectively (J05, Strongly Agree – survey 3; J02, Agree – survey 6; J07, Strongly Agree – survey 6), and that there were “no delays or problems, and that everyone had a chance to talk and explain their position” (J40, Agree – survey 5). Nonetheless, while some judges endorsed the item in the audio medium, they still expressed their preference for the video medium. Judge J20 remarked that “having the video on would be more helpful” (J20, Agree – survey 6), while Judge J36 explicitly stated s/he “preferred the video” (J36, Agree – survey 6). Judge J11 claimed that after having experienced the video medium first, the audio medium “felt a bit strange and impersonal” (Agree – survey 1).

Video medium

In the video medium, judges who found the item difficult to endorse reported three main reasons: (1) technical issues, (2) limited benefits, and (3) lack of resemblance to F2F settings. Similar to the audio medium, judges experienced technical issues. Some judges referred to such issues as experiencing repetitive technical problems (J15, Slightly Disagree – survey 1), some technical problems being observed (J05, Slightly Agree – survey 4; J39, Slightly Agree – survey 4; J05, Agree – survey 6) and/or minor connection problems (J39, Slightly Agree – survey 1). Problems hearing certain judges were observed once again, especially after the Round 1 discussion (J38, Disagree – survey 3; J39, Slightly Disagree – survey 3; J40, Slightly Agree – survey 3). The problem stemmed from the judges' hardware (microphones) and not from the platform and/or the e-communication medium, as the same judges had problems being clearly heard in both media, despite the technical support received from the IT specialist at the time. As survey 3 was administered after the Round 1 discussion, a discussion lasting approximately an hour in both media, more judges referred to technical sound problems being observed in the video medium. Such problems may be attributed to the video medium demanding more bandwidth; consequently, an hour-long conversation in the video medium is likely to result in some judges experiencing technical issues such as occasional loss of sound and/or picture. This is especially true when judges do not have the appropriate hardware, such as headphones with microphones, up-to-date computers, cables to connect their computers directly to their routers, and fast Internet connections.


Other judges finding the item difficult to agree with remarked that the video medium did not bring any associated benefits, as they saw no real difference between the two media (J01, Slightly Disagree – survey 1), nor did the ability to see the other judges “help at all” (J29, Slightly Disagree – survey 2) or add much to their communication (J29, Slightly Disagree – survey 4). Judge J09 reported relying less on video and more on audio (Agree – survey 2). The final reason for not endorsing the item (also reported in the audio medium) was the perceived difference between the video and F2F environments. Judge J41 stated that the video medium did not allow one to see “a complete picture” of the person one is speaking with compared with F2F interaction (J41, Disagree – survey 1). Judges endorsing the item in the video medium reported reasons such as the added value the e-communication medium brings and the sense of occupying the same physical space. Judge J02 (Agree – survey 1) claimed throughout the surveys that “seeing one another is better than just listening to people's voices” as it allows one to “see facial expressions, see people nodding”, all of which “supported [their] understanding” (J02, Strongly Agree – survey 3). Judge J38 added that one can “communicate better with someone when [looking] at him, since body language helps [the judge] understand others better” (J37, Agree – survey 6). Judge J13 reported that in the video medium the judge felt “as if [they] were physically present in one common room sharing and exchanging ideas and views” (J13, Agree – survey 6). This feeling was also shared by Judge J36, who felt “as though [they] were in a room together so there was almost no difference and I think that was good” (J36, Strongly Agree – survey 6). In contrast, only one judge provided comments claiming that the sound was much better (J39 – survey 5). It should be noted that some judges endorsing the item experienced technical issues due to having their cameras on (J20, Agree – survey 2) or minor sound problems. Overall, no quantitative difference between the media was observed as far as helping judges to better understand each other was concerned. However, the qualitative data begin to suggest that judges were expressing a preference towards the video medium, as they were able to see one another.

Communication item 2: When we disagreed, the medium through which we communicated made it easier for us to come to an agreement.

Table 5.4  Wilcoxon signed-rank test/Sign test communication item 2
(SD = Strongly Disagree; D = Disagree; SlD = Slightly Disagree; SlA = Slightly Agree; A = Agree; SA = Strongly Agree)
Survey  Medium  SD  D   SlD  SlA  A   SA  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1       Audio   -   -   4    4    29  8   -.38 (.71)            -                    1.00
        Video   -   3   1    7    25  9
2       Audio   -   -   3    2    28  12  -                     (.41)                .86
        Video   -   -   3    2    32  8
3       Audio   -   1   2    5    26  11  -.03 (.98)            -                    1.00
        Video   -   -   2    6    27  10
4       Audio   -   -   3    4    28  10  -.25 (.81)            -                    1.00
        Video   -   -   2    4    33  6
5       Audio   -   -   2    6    27  10  -                     (.24)                .86
        Video   -   -   -    5    26  14
6       Audio   -   1   2    4    32  6   -                     (.29)                .86
        Video   -   -   1    2    27  15

For communication item 2, from a quantitative perspective, judges did not express a preference towards a particular medium when coming to an agreement after a disagreement had occurred. Consequently, no discoveries were made, as q-values ranged from .86 to 1.00.

Qualitative comments for communication item 2: The qualitative comments on whether a particular medium through which judges communicated made it easier for them to come to an agreement when they disagreed confirmed the quantitative findings. The main reason for not endorsing the item in either medium was not directly associated with the medium, but with the question whether there were any disagreements or a need for agreement in a particular stage of the virtual session.

Audio medium

Judges finding it difficult to endorse communication item 2 reported two main reasons: (1) unnaturalness; and (2) the relevance or occurrence of disagreements. Judges commented on the unnaturalness of the medium by claiming that “not seeing people was not nice” (J11, Slightly Disagree – survey 2) and that “it was weird to listen to a voice while looking at a black screen” (J37, Slightly Disagree – survey 1). The most frequent comment recorded by judges finding it difficult to endorse the item was that the need for ‘agreement’ was not relevant during a particular stage (J20, Slightly Disagree – survey 2; J11, Slightly Disagree – survey 2; J07, Disagree – survey 3; J20, Slightly Disagree – survey 3) or that disagreements had yet to occur (J19, Slightly Disagree – survey 1; J29, Slightly Disagree – survey 1). Judge J11 (Slightly Disagree – survey 1) added that it was not a matter of disagreement, but that the content of what was being discussed mattered. Judge J29 (Slightly Agree – survey 3) questioned the amount of agreement reached despite recalling that a discussion had taken place. Judge J07 (Strongly Disagree – survey 2) offered an explanation as to why the judge also felt that not much disagreement was taking place by stating that “questions were not dealt with because teachers [were] afraid to ask questions because [they] might sound ignorant”. Judge J01 (Slightly Disagree – survey 4, audio medium) emphasised that judges “agreed to disagree”, while Judge J11 noted that the medium was not important for agreements to occur; only “valid arguments” were needed to reach agreement (J11, Slightly Disagree – survey 3). It should be noted that only Judge J39 (Slightly Agree – survey 2) cited any technical issues, claiming the “sound quality was not always good”.


Judges finding the item easy to endorse reported that it was easy to come to an agreement (J09, Agree –​survey 1) which may have been due to judges’ “filtering what they [were] saying and talk[ing] when they [had] something to say” (J02, Strongly Agree –​survey 3). Judge J02 added that it was more “difficult to disagree in an online workshop as people [were] trying to be polite” (Agree -​survey 6).

Video medium

In the video medium, judges finding the item difficult to endorse used similar arguments to the ones used by judges in the audio medium, namely, the relevance or occurrence of disagreements. Judges felt that agreeing or disagreeing was not an issue (J07, Disagree – survey 1; J29, Slightly Disagree – survey 1) or relevant during a particular stage (J19, Slightly Disagree – survey 2), “as [they] were only discussing and justifying answers” (J39, Slightly Disagree – survey 3). Only one judge reported that minor sound quality technical problems prevented the particular judge from agreeing or disagreeing at times (J38, Slightly Disagree – survey 3). The reasons provided by judges supporting their decision to endorse the item may help explain why several judges felt that disagreements did not take place. Judge J15 originally had stated in survey 1 (end of the Orientation stage) that “it would have been helpful if disagreements had occurred” (Agree – survey 1). Judge J15 later remarked that once “you are given the floor and the listeners are good, you can get your message across” (Agree – survey 2). Judge J38 (Agree, video – survey 6) added that people are “more polite and agreeable when [they] can see each other”. It was not surprising that several judges in both e-communication media felt that the medium was irrelevant to their ability to come to an agreement, as there were only a few places within either virtual session in which disagreements could have taken place. During the Orientation stage, the only place where judges could express their disagreement was when feedback was given to judges on their ranking of the CEFR descriptors. However, when misplaced descriptors were discussed and rationalised, most judges came to an agreement. Similarly, stage 2 (method training) did not offer the judges much of an opportunity to disagree. The only place that any disagreement could have occurred was when the judges were practising with the method and were asked to discuss their rationale for stating that a Just Qualified B1 Candidate would be able to answer the item correctly. The main place in which disagreements could have taken place was at the end of stage 3 (Round 1), where judges participated in a one-hour discussion in which they were asked to rationalise their judgements. In stages 4 (Round 2) and 5 (Round 3) little discussion amongst the judges took place, so disagreements were once again not an issue. The qualitative comments throughout the six surveys were at times made by judges in both media; thus no qualitative indication of whether a particular medium helped judges come to an agreement when disagreements occurred could be established. The specific item stem was slightly altered from “come to an agreement” to “come to a common position” (see communication item 4) and presented to judges again so that responses to both items could be compared. In principle, drastically different responses to very similar items would suggest that judges were not taking the surveys seriously or that they were interpreting the items in different ways. Analysis of communication item 4 revealed that judges endorsed the item in a similar way and that their recorded responses were also similar to those of communication item 2.

Communication item 3: The medium through which we communicated sped up our communications.

Table 5.5  Wilcoxon signed-rank test/Sign test communication item 3
(SD = Strongly Disagree; D = Disagree; SlD = Slightly Disagree; SlA = Slightly Agree; A = Agree; SA = Strongly Agree)
Survey  Medium  SD  D   SlD  SlA  A   SA  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1       Audio   1   -   1    8    30  5   -.59 (.56)            -                    .62
        Video   -   1   2    8    24  10
2       Audio   1   -   1    4    25  14  -                     -.77 (.44)           .62
        Video   -   -   2    5    29  9
3       Audio   -   -   1    5    29  10  -                     (.38)                .62
        Video   1   -   1    8    27  8
4       Audio   -   -   -    4    29  12  -.54 (.59)            -                    .62
        Video   -   -   1    2    32  10
5       Audio   -   -   2    3    27  13  -                     (.39)                .62
        Video   -   -   -    4    26  15
6       Audio   -   -   1    4    30  10  -                     (.27)                .62
        Video   -   -   -    4    28  13

For communication item 3, from a quantitative perspective, judges did not favour one medium over another as far as speeding up their communications was concerned. No statistically significant differences could be found as q-​values were all .62.

Qualitative comments for communication item 3: The qualitative comments on whether a particular medium through which judges communicated sped up their communications corroborated the quantitative findings in that both media were associated with certain challenges such as lack of resemblance with F2F settings. However, in the video medium, judges felt that the netiquette established in the beginning of the workshop slowed down their communication.

Audio medium

The main reason some judges failed to endorse communication item 3 was that their audio medium experience did not resemble a F2F experience. Judge J07 (Strongly Disagree – survey 1) emphasised that the orientation session took too long, and that had it occurred in a F2F situation it would have taken “45 minutes”. Such a comment is surprising considering that the Orientation stage is usually the longest stage in any standard setting study, especially a language assessment cut score study. In this stage, judges are exposed to the test instrument, are trained in the CEFR, and need to come to a common interpretation of the performance level the cut score is set on. However, at the same time, it may have been that the audio environment made the whole experience seem longer. Judge J28 (Slightly Disagree – survey 1) claimed that “real-life communication can be faster … and no misunderstandings are involved”. The point that communication in the audio medium appeared slower than in a F2F environment was also reported by Judge J07 (Strongly Disagree – survey 2). Judge J44 was not sure whether to begin talking as the judge “wasn't sure who was on the other side” (Slightly Agree – survey 1). Judges endorsing the item reported that the audio medium was more efficient for discussing issues in real time than chats and/or emails (J01, Agree – survey 1) and that the lack of visuals made it “easier to focus on instructions and tasks being completed” (J29, Agree – survey 1), as they were not distracted by anything on the screen (J09, Strongly Agree – survey 3; J06, Agree – survey 3) and there was no redundancy in their communication (J05, Strongly Agree – survey 3). Other judges supported their agreement with the item by stating that the “nature of [the] workshop [did not leave] much room for taking [one's time because] if this was an onsite workshop, people might have taken longer” (J02, Agree – survey 6), or by adding that in the virtual environments the tasks had to be carried out “within a specific time frame” (J05, Strongly Agree – survey 6). Judge J09 (Agree – survey 2) found that platform features such as “Agree”, “Disagree”, and “Raise Hand” were helpful as they provided “visual support”.

Video medium

Judges having difficulty endorsing item 3 in the video medium cited two main reasons: (1) netiquette and (2) technical problems. The netiquette established, requiring judges to use the web platform's “Raise Hand” function in order to express their views, was reported by some judges to have slowed down their communication. Judge 42 reported that more time was needed for communication in the video medium as judges were asked to raise hands through the “Raise Hand” function in the platform to be next in line to talk (J41, Disagree – survey 1). Judge J29 added that since the platform dictated rules of communication, a “physical response [wasn't] required to indicate turn-taking” (J29, Slightly Disagree – survey 1). Technical problems also made it difficult to endorse the item (J38, Strongly Disagree – survey 3; J39, Slightly Disagree – survey 3). One judge also reported experiencing hardware problems (Judge J41, Slightly Disagree – survey 2). Surprisingly, only one judge remarked that F2F communication would have been “more effective” (Judge J39, Slightly Disagree, video medium – survey 4). Judges who agreed with this item expressed that their communication in the video medium felt natural as they “could see those who raised their hands, (literally!)” (J09, Strongly Agree – survey 3) and that it was not really a matter of which medium was being used but how it was used (J02, Agree – survey 3). Other judges justified their endorsement by stating that there was a “quick exchange of ideas” (J05, Agree – survey 6) and that “questions [were] answered on the spot” (J11, Agree – survey 6). Judge J36 found the netiquette “time sav[ing]” as judges were not talking over one another (J36, Strongly Agree – survey 6). The quantitative analysis and qualitative comments made throughout the six surveys suggest that both media were equally effective regarding the speed with which the judges communicated with one another, but neither of the two e-communication media was as fast as F2F communication. Nonetheless, there seems to be a running theme through the comments so far, regardless of the communication item: the audio medium has less distraction, while the video medium fosters a more life-like environment, one resembling a F2F setting.

Communication item 4: When we disagreed, the medium through which we communicated helped us to come to a common position.

Table 5.6  Wilcoxon signed-rank test/Sign test communication item 4
(SD = Strongly Disagree; D = Disagree; SlD = Slightly Disagree; SlA = Slightly Agree; A = Agree; SA = Strongly Agree)
Survey  Medium  SD  D   SlD  SlA  A   SA  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1       Audio   -   -   3    6    30  6   -.25 (.80)            -                    .98
        Video   -   1   3    6    28  7
2       Audio   -   -   3    4    31  7   -.90 (.93)            -                    .98
        Video   -   -   2    5    33  5
3       Audio   -   1   3    5    29  7   -.39 (.70)            -                    .98
        Video   -   -   2    7    29  7
4       Audio   -   -   3    6    30  6   -.54 (.59)            -                    .98
        Video   -   -   3    4    31  7
5       Audio   -   -   3    4    31  7   -.76 (.44)            -                    .98
        Video   -   -   2    6    27  10
6       Audio   -   -   3    5    29  8   -.71 (.48)            -                    .98
        Video   -   -   3    5    27  10

For communication item 4, from a statistical perspective, judges expressed no preference towards a particular medium when trying to come to a common position or an agreement. No discoveries were made, as q-values were all .98.

Qualitative comments for communication item 4: The comments made by judges throughout the surveys reflected the comments made for communication item 2. The fact that the same findings were observed, namely, no statistically significant differences observed, very similar trends in endorsement behaviour, and similar comments recorded suggests that the judges were taking the survey seriously and that they interpreted the items the same way. Communication Item 5: I felt that I could easily explain things in the medium through which we communicated.

Table 5.7  Wilcoxon signed-rank test/Sign test communication item 5
(SD = Strongly Disagree; D = Disagree; SlD = Slightly Disagree; SlA = Slightly Agree; A = Agree; SA = Strongly Agree)
Survey  Medium  SD  D   SlD  SlA  A   SA  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1       Audio   1   -   1    2    33  8   -.51 (.61)            -                    .91
        Video   -   1   0    6    24  14
2       Audio   1   -   1    2    27  14  -.26 (.79)            -                    .91
        Video   -   -   -    4    31  10
3       Audio   -   -   1    3    26  15  -.17 (.86)            -                    .91
        Video   1   -   2    2    27  15
4       Audio   -   -   1    3    26  15  -.26 (.80)            -                    .91
        Video   -   -   -    1    32  12
5       Audio   -   -   -    5    29  11  -                     (.06)                .19
        Video   -   -   -    1    29  15
6       Audio   -   -   -    5    31  9   -                     (.06)                .19
        Video   -   -   1    1    27  16

For communication item 5, no statistically significant preference towards a particular medium was found when explaining things, implying that no discoveries were made.

Qualitative comments for communication item 5: The qualitative comments on whether judges felt that they could easily explain things in the medium through which they communicated supported the quantitative findings that no preference was indicated. Judges felt that both media presented different challenges resulting from a lack of visual stimuli in the audio medium to a heightened awareness of own behaviour in the video medium.

Audio medium

The main reason judges cited for believing that they could not easily explain things in the audio medium was a lack of visual stimuli. Not having cameras on during the virtual session resulted in some judges feeling “confused without the camera for some reason [and not feeling] like talking most of the time” (J37, Slightly Disagree – survey 3) or feeling “shy without the camera” despite not expecting to feel so (J37, Slightly Disagree – survey 4). Judge J38 admitted to experiencing difficulty in explaining things as “at times it was difficult to articulate” oneself (J38, Slightly Agree – survey 6). However, due to the brevity of Judge J38's comment, it is not clear whether the problem of not being able to articulate oneself was due to the medium or to the judge's own inability to do so. One judge also remarked that it was difficult to endorse the item since it was their first online workshop (J23, Slightly Agree – survey 1). Several reasons were provided by judges explaining why they were able to explain things in the audio medium. Some judges referred to finding it easy to ask and reply to questions (J29, Agree), while others found themselves “quite eager to participate” because other judges could not see them (J09, Strongly Agree – survey 3). Judges might have found it easy to communicate in the audio medium due to their own experience with online conference calls (J01, Strongly Agree – survey 1) or their ability to explain things easily “either by raising hand or just commenting orally” (Judge J09, Agree – survey 1). Other judges reported that the audio medium allowed them to be “given the floor to voice [their] opinion” (J05, Strongly Agree – survey 3), “knowing that [their] input was valued” (J05, Strongly Agree – survey 6). The audio medium also allowed them to “explain ideas [while] not being criticised” (J07, Strongly Agree – survey 6), and Judge J19 added that the medium presented “no particular restraints” (Agree). Nonetheless, two of the judges endorsing the item still showed their preference towards the video medium or a F2F setting, explaining that they preferred “to see people when [they] speak to them” (Judge J39, Agree – survey 1) or believed that in a F2F workshop “everyone would be chattier” (Judge J15, Agree – survey 4). Judge J15 further added that in online workshops “you respect other speakers' time more and you rarely butt in” (Agree – survey 4).

Video medium

The main reason that a few judges found it difficult to communicate in the video medium seems to have been a psychological one. Some judges felt that having their camera on heightened their awareness of their own behaviour. Judge J29 reported using “gestures more” and being conscious of being watched, something which resulted in changing how things were explained (Slightly Agree – survey 3). It should be noted that judges endorsing this item also referred to this issue, as Judge J09 reported becoming self-conscious of their own fidgeting until such behaviour was observed in other judges as well (Agree – survey 3). Such comments suggest that the video medium made the judges aware that they could be seen by others. Another judge who did not endorse the item reported technical problems during two stages of the session (J38, Strongly Disagree – survey 3, Slightly Disagree – survey 6), while Judge J29 expressed that the video added no additional benefit, “the video was not needed for this” (J29, Slightly Agree – survey 1). It seems that more judges were starting to report that in certain stages of the virtual session the video camera may not have been adding value to their communication. Judges endorsing communication item 5 referred to the advantages of the video medium, such as (1) their own “facial expressions adding to overall intended meaning” (J05, Strongly Agree – survey 6) and (2) having “direct communication, resembling real F2F communication, [which] made explanations easy” (J11, Strongly Agree – survey 6). Judge J36 expressed a preference for the video medium by contrasting it directly to the audio medium:

It was as natural – after the initial 5 mins – as being physically in a room as we could see everyone and it was better than just a phone line as I think it is important to actually see the person you are communicating with, not a dismembered voice unless you already know that person. (J36, Strongly Agree – survey 6)

Judge J36 seems to be suggesting that the video medium gives one the impression of sharing the same physical space with the other judges. But at the same time, the judge is implying that at the beginning of the session the communication amongst judges may have felt artificial until the judges became accustomed to having their cameras on. Overall, the quantitative and the qualitative data suggest that both media through which judges communicated allowed them to easily explain things. The conflicting views expressed by judges J37 and J09 suggest that the audio medium may have both a negative effect on participants, as it may make them reluctant to participate in discussions, and a positive effect, as it might make them more willing to communicate. Nonetheless, the judges' preference for the video medium was becoming more prevalent.

Communication item 6: The medium through which we communicated helped us exchange ideas/beliefs quickly.

Table 5.8  Wilcoxon signed-rank test/Sign test communication item 6
(SD = Strongly Disagree; D = Disagree; SlD = Slightly Disagree; SlA = Slightly Agree; A = Agree; SA = Strongly Agree)
Survey  Medium  SD  D   SlD  SlA  A   SA  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1       Audio   -   -   1    5    27  12  -.54 (.59)            -                    .74
        Video   -   2   -    5    25  13
2       Audio   -   1   -    4    25  15  -                     (.54)                .74
        Video   -   -   -    5    32  8
3       Audio   -   -   2    3    25  15  -.35 (.73)            -                    .77
        Video   -   -   1    3    25  16
4       Audio   -   -   -    6    26  13  -.73 (.47)            -                    .74
        Video   -   -   1    2    34  8
5       Audio   -   -   -    4    25  16  -                     (.58)                .74
        Video   -   -   -    3    29  13
6       Audio   -   -   1    3    30  11  -                     (.34)                .74
        Video   -   -   2    2    27  14

For communication item 6, no preference towards a particular medium was expressed when exchanging ideas quickly from a quantitative perspective. The analysis did not yield any discoveries as q-​values ranged from .74 to .77.

Qualitative comments for communication item 6: The qualitative comments for communication 6 revealed that the netiquette used in both e-​communication media both hindered and fostered the quick exchange of ideas/​beliefs. The overall qualitative analysis confirmed the quantitative results as a preference for one particular medium was not evident.

Audio medium

The two main reasons that some judges felt that the audio medium did not allow them to exchange ideas/beliefs quickly were (1) a lack of interaction and (2) platform features and/or netiquette. Some judges who did not endorse the item claimed that it was because no interaction was actually taking place during a particular stage (J28, Slightly Disagree – survey 1), not because the medium was preventing judges from exchanging ideas quickly. Judge J29 (Slightly Agree – survey 5) reported that “no real ideas were discussed by the group, but the medium allowed the host to exchange ideas quickly”, a comment supported by the fact that Round 3 was a fast round, one in which judges were presented with consequences feedback and not much discussion took place between the judges themselves. Other judges justified their low ratings by referring to features of the platform and/or the virtual communication etiquette. For example, Judge J01 referred to the way the Round 1 feedback was given to judges on the reading items. As it was impossible to display (1) the whole reading text with items and answer choices and (2) the judges' descriptive statistics (i.e. individual ratings per item) on the same screen, the researcher showed the descriptive statistics first, followed by the item under discussion. For judges to provide a rationale for their choice based on the text, the researcher needed to scroll, resulting in J01 stating that “the need to wait for moderator to scroll between options and texts in the RC [Reading comprehension portion] slowed things slightly” (J01, Slightly Disagree – survey 3). Judge J41 (Slightly Agree – survey 3) referred to the fact that judges had to unmute their microphones prior to speaking, and that whenever any judge faced technical issues, they were asked to quickly exit the platform and return to see whether the problem had been resolved (Agree – survey 4). The issue of judges beginning to talk without unmuting their microphones was also raised by Judge J36:


People forgot to turn on their mikes so this slowed thing down. Sometimes we didn’t hear all of what people said and some people didn’t speak clearly into their mikes. When we looked at each other I think people were much more aware (J36, Slightly Agree).

It should be noted that judges were asked to mute their microphones and/​or speakers when they were involved in rating items so that they would not disturb the other judges. Judges who muted their speakers would unmute their speakers when they finished their individual tasks. When all ratings had been collected, judges were then asked to unmute their microphones so that any questions could be answered and that a discussion could take place. Judges who agreed that the medium helped them exchange ideas/​beliefs quickly mainly supported their judgements by referring to the platform’s features and/​or the netiquette established. Judge J01 remarked that the audio medium helped them to exchange ideas and stated that “it [was] like ‘being there’ except for delays caused by unfamiliarity with platform/​equipment and line speed” (J01, Agree –​survey 1) [quotes in the original]. Judge J09 claimed that “it was easy to signal what we wanted very quickly” as the platform allowed judges to use the chat function, their microphones, and the “Raise Hand” function (J09, Strongly Agree –​survey 1). Judge J40 (Agree –​survey 5 -​audio) added that the platform features with “all [its] functions allowed everyone to have a fair turn without overlapping, miscommunication or dominating any talks”. Other comments ranged from “the audio helped us communicate quickly by listening to each other’s thoughts and explanations” (J38, Agree –​survey 6), to “fast exchange of ideas in real time” (J07, Strongly Agree –​survey 6). Judge J05 explained that a quick exchange of ideas occurred “through turn-​taking” (J05, Strongly Agree –​ survey 6).

Video medium Judges who did not agree that they could exchange ideas/beliefs quickly in the video medium cited three reasons: (1) no additional value added, (2) platform features and/or netiquette, and (3) technical problems. Judge J29 questioned whether the medium actually contributed to the exchange of ideas/beliefs, stating that the judge was "not sure that it made a difference having the video on as the platform itself helped order the communication" (J29, Slightly Disagree – survey 4). Judge J41 stated that they "need more time to talk, and raise hands, etc." (J41, Disagree – survey 1). As Judge J41 was in Group four (G4), whose first session was in the video medium, the judge was comparing the video medium with a F2F setting, implying that more time is needed in the virtual environment to exchange beliefs and/or ideas because judges had to wait in line to be given the floor to say something. Judge J38 (Slightly Disagree – survey 3) remarked that "some people exchanged ideas, but others couldn't". Unfortunately, it is not clear whether those judges could not exchange ideas because of technical issues or for other reasons. Some judges also noted that there were technical problems, experienced either directly or indirectly, that affected their ability to exchange ideas and/or beliefs quickly. For example, Judge J05 stated that a swift exchange of beliefs "was not always the case due to technical problems some participants faced" (J05, Slightly Agree – survey 6). Judge J38 added that the speed of exchange was hindered because the "video at times wasn't clear as it could be" (J38, Slightly Disagree – survey 6). Judges who found it easy to endorse communication item 6 in the video medium reported that the experience reminded them of a F2F setting or that the virtual communication etiquette helped to facilitate a quick exchange of ideas/beliefs. Judge J09 reported that the visual display made it unnecessary for the researcher to remind the next judge to begin speaking, as was the case in the audio medium: "This time the facilitator did not have to say our names to assign turns all the time. When there was a pause, the visual support from the video prompted people to speak" (J09, Strongly Agree – survey 3). Judge J37 commented on the turn-taking system by claiming that they "could easily have a discussion without interrupting each other" (J37, Agree – survey 6). Judge J36 expanded on there being no interruptions by stating that the facilitator "moderated the conversations so we didn't talk over one another. Everyone waited their turn even though they could have spoken over each other. The put your hand up thing works well!" (J36, Strongly Agree – survey 6). Overall, the quantitative and qualitative analyses suggest that both media equally and effectively helped judges exchange their ideas and/or beliefs quickly. In both media, the netiquette was deemed both positive and negative by some judges.

Communication item 7: I could relate to other panellists' ideas/beliefs in the medium through which we communicated.

Table 5.9  Wilcoxon signed-rank test/Sign test communication item 7

Survey (ID)  Strongly Disagree  Disagree  Slightly Disagree  Slightly Agree  Agree  Strongly Agree  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1  -/-  -/1  2/1  2/6  34/26  7/11  -.84 (.86)  -  1.00
2  -/-  1/-  -/1  4/2  29/31  11/11  -.38 (.71)  -  1.00
3  -/-  -/-  -/-  3/2  26/31  16/12  -.66 (.51)  -  1.00
4  -/-  -/-  1/1  4/1  29/34  11/9  .00 (1.00)  -  1.00
5  -/-  -/-  -/-  2/1  32/28  11/16  -  (.21)  1.00
6  -/-  -/-  -/-  3/2  31/32  11/11  -.28 (.78)  -  1.00
Note: Frequencies are shown as Audio/Video.


For communication item 7, from a quantitative perspective, both media allowed judges to relate to one another's ideas and beliefs. No discoveries were made, as the q-values were all 1.00.
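To illustrate the kind of computation behind these comparisons, the sketch below pairs each judge's audio and video rating for one item, runs a Wilcoxon signed-rank test and a sign test, and applies a false discovery rate adjustment across surveys. It is a minimal illustration only, not the study's own analysis script: the rating vectors and all but the first p-value are invented, and the Benjamini-Hochberg procedure is used as one common way of obtaining q-values, which may differ from the exact q-value method used in the study.

# Minimal sketch: paired audio vs. video ratings for one communication item.
# All data below are hypothetical placeholders.
from scipy.stats import wilcoxon, binomtest
from statsmodels.stats.multitest import multipletests

audio = [5, 6, 5, 4, 5, 6, 5, 5, 6, 4]   # hypothetical 6-point ratings, one per judge
video = [5, 5, 6, 4, 6, 6, 5, 4, 6, 5]

# Wilcoxon signed-rank test on the paired ratings (zero differences discarded).
stat, p_wilcoxon = wilcoxon(audio, video, zero_method="wilcox")

# Sign test: a binomial test on the direction of the non-zero differences,
# the fallback reported in the tables where the Wilcoxon test was not used.
diffs = [a - v for a, v in zip(audio, video) if a != v]
p_sign = binomtest(sum(d > 0 for d in diffs), n=len(diffs), p=0.5).pvalue

# FDR adjustment across the six surveys (the remaining p-values are invented).
p_values = [p_wilcoxon, 0.21, 0.78, 0.51, 0.71, 0.86]
reject, q_values, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(p_wilcoxon, p_sign, q_values.round(2))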

Qualitative comments for communication item 7: The qualitative comments for communication item 7 revealed that, in both media, judges were able to relate to other judges' ideas/beliefs. The overall qualitative analysis confirmed the quantitative finding that there was no preference towards a particular medium.

Audio medium Judges who found it difficult to agree with communication item 7 in the audio medium stated either that there was no need to relate to other judges' ideas/beliefs during a particular stage (i.e., Orientation stage – survey 1) or that they faced technical issues, such as their microphone not functioning at the time (Judge J07, Disagree – survey 2). On the other hand, judges who found the item easy to endorse provided examples of times when they could relate to ideas/beliefs being expressed. Judge J09 agreed that she could relate to other judges' opinions through the medium and provided an example of doing so: "for example when another panellist commented on an obscure item" (J09, Agree – survey 2). Judge J01 added that relating to others' ideas "might also have been influenced by the other participants' ability to explain themselves. Someone less articulate might not have been so easily understood" (J01, Strongly Agree – survey 2). Judge J02 stated, "I could relate to some panellists' ideas, but not to all of them" (J02, Agree, audio medium – survey 5). However, the brief comment made by Judge J02 did not make it explicit whether not being able to relate to the other judges' opinions was due to the medium or to the other judges' lines of argument. Other reasons mentioned ranged from "the facilitator did not allow the participants to dominate discussions" (J05, Agree – survey 6) to being able to relate "to what some people said before me as we were in agreement" (J02, Agree – survey 6). Judge J07 (Strongly Agree – survey 6) also observed judges sharing the same idea and added that judges "took one idea and further expanded or analysed it".

Video medium In the video medium, only one judge expressed unease with endorsing the item, on the grounds that the experience did not fully resemble a F2F setting. Judge J39 reported that it was obvious that "the medium [had] a few constraints compared with a F2F setting but it worked quite well" (Slightly Disagree – survey 2). Judges agreeing with the item believed that the video medium allowed for an "effective [ex]change of opinion and turn-taking" (J05, Agree – survey 6) or for "common experiences" (J11, Agree – survey 6). One judge also stated that the medium allowed the judge to respect conflicting beliefs: "…naturally some people's ideas didn't cut it, but I could respect them" (J36, Strongly Agree – survey 6). Overall, it appears that both media were effective insofar as they provided a suitable environment for judges to relate to each other's beliefs and ideas. As with the quantitative results, the positive comments in both media show no preference towards a particular virtual environment.

Communication item 8: The medium through which we communicated created a positive working environment.

Table 5.10  Wilcoxon signed-rank test/Sign test communication item 8

Survey (ID)  Strongly Disagree  Disagree  Slightly Disagree  Slightly Agree  Agree  Strongly Agree  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1  -/-  -/-  2/-  3/1  21/33  19/11  -  (.31)  .66
2  -/-  -/-  1/-  4/3  23/27  17/15  -.17 (.86)  -  .91
3  -/-  -/-  1/1  2/2  24/30  18/12  -  (.26)  .66
4  -/-  -/-  -/1  4/1  24/30  17/13  -.38 (.71)  -  .91
5  -/-  -/-  1/-  3/-  27/27  14/18  -  (.06)  .38
6  -/-  -/-  -/1  2/1  26/27  17/16  -.24 (.81)  -  .91
Note: Frequencies are shown as Audio/Video.


For communication item 8, from a quantitative perspective, no preference was observed as to whether one medium created a more positive working environment than the other. No discoveries were made, as the q-values ranged from .38 to .91.

Qualitative comments for communication item 8: The qualitative comments for communication item 8 revealed that both media fostered a positive working environment by creating a sense of team building among the group members. While both media presented some minor challenges, and unlike the quantitative results, the overall qualitative results revealed that judges expressed a clear preference towards the video medium.

Audio medium Judges who found it difficult to endorse communication item 8 in the audio medium referred to two main reasons why they felt the medium did not create a positive working environment: (1) technical issues, and (2) lack of visual support. While two judges had previously referred to the technical problems they experienced, it was the first time that they expressed "feeling disappointed" (J07, Slightly Disagree, audio medium – survey 2) and "frustrated" (J38, Slightly Disagree – survey 3). The other main reason why judges could not agree that a positive environment was created in the audio medium was not being able to see the other judges. Judge J11 stated that "having no picture of who[m] I was working with was not very nice" (J11, Slightly Disagree – survey 1), while Judge J44 stated a preference for the video medium by reporting that as "a visual person I would prefer to see the person speaking, especially in a setting of multiple participants" (J44, Slightly Agree – survey 1). Judge J37 added that s/he "felt better when [s/he] could watch everybody else" (J37, Slightly Disagree – survey 3). Judge J11 claimed that s/he missed the camera being on, which may have been due to not understanding "at some point who was talking? [as the] video makes it easier" to identify the speaker (Slightly Agree – survey 3). Other reasons mentioned by judges ranged from expressing frustration at other judges, as "some people find it difficult to understand instructions and this can be tiring and annoying" (J02, Slightly Disagree – survey 5), to claiming that other judges in their group "were not so active, so the medium may have affected the participation" (J39, Slightly Agree – survey 5). The point made by Judge J39 may have to do with the fact that, in Round 3, not much participation is required from the judges.


The judges who agreed with communication item 8 offered several reasons, ranging from a respectful atmosphere and a sense of team spirit to being able to express oneself. Judge J02 reported that "the truth is that everybody was respectful of each other and trying to do what was expected" (J02, Slightly Agree – survey 6). However, it is not clear from this brief comment why the item was not rated higher. Judge J02's comment suggests that the audio medium presented judges with an environment in which they respected their fellow colleagues, and thus that the medium created a positive working environment. Judge J01 endorsed the item by stating that: … the desire to communicate was supported and encouraged/increased. This probably has something to do with wanting to become and remain part of the group that had been established from the beginning of Round 1 (J01, Agree – survey 5).

Judge J01's comment suggests that the judge felt that their group was functioning as a team from the time they started discussing the items in Round 1. This sense of belonging was also shared by Judge J05, who felt a sense of team membership in both media, stating that "we worked as [a] team and felt we had a joint responsibility for the outcome" (J05, Strongly Agree, audio/video medium – survey 6). Other judges felt that the audio medium gave them the feeling of being able to "express and agree or disagree [with] no criticism" (J07, Strongly Agree – survey 6), while having "equal chances [of] getting the floor … [as] all [judges] expressed ideas/beliefs quickly, [and] effectively, … [making] discussion [was] easy" (J19, Agree – survey 6). One judge also claimed that the audio medium created a positive working environment by acknowledging that "… because it was the first session … it helped us get familiarised with the process more easily" (J28, Strongly Agree – survey 1). Surprisingly, two judges who did endorse the item in the audio medium still expressed their preference for the video medium. Judge J07 stated "I would like to have seen the participants since I'm Mediterranean" (J07, Agree – survey 1), while Judge J02 maintained that "it is more pleasant to see each other" (Agree – survey 3).

Video medium Only three judges commented on why they found the item difficult to endorse in the video medium, each presenting a different reason why they felt the video environment did not create much of a positive working environment: (1) lack of resemblance to a F2F setting, (2) a feeling of being intimidated, and (3) being distracted by other judges. Judge J15 stated that s/he still believed "that physical proximity would have made it more positive [and that] there is much more to meaningful interaction, such as nodding, gesturing, fillers, real time reactions to someone's response" (J15, Slightly Agree – survey 2). Judge J29 reported that s/he "actually found it more intimidating trying to communicate through the video" (J29, Slightly Agree – survey 2). Judge J07 (Slightly Disagree, video medium – survey 4) expressed annoyance at being able to hear other judges seeking clarification while the judge was filling in the evaluation surveys. In the video environment, judges who endorsed communication item 8 referred to the medium creating a sense of team building. Judge J09 reported that "seeing the other participants perhaps contributed to 'team building' [quotes in original]" (J09, Strongly Agree – survey 6). Judge J09's comment suggests that the judge felt like a member of the team. Considering that two judges in the same group (J05, who endorsed the item in both media, and J09, in the video medium) mentioned that they felt part of a team when discussing the audio and video media may suggest that the sense of membership within the group is attributable to the group's dynamics rather than to the virtual environment per se. Other judges remarked that they found the session relaxing (J37, Agree – survey 6) or that "the environment created by the medium was warm, friendly, and thus much more productive than a cold, impersonal seminar in a real auditorium with 200 people" (J11, Strongly Agree – survey 1). Judge J36's comments about the video medium nicely summarise remarks about this medium made in the previous communication items: You see each other, you can speak to each other, no one talks over each other and so a friendly and positive working environment was created easily and so I think no one felt that their point of view was invalid and that what they said brought the discussion further forward and help[ed] develop it to reach a conclusion (J36, Strongly Agree).

Both the quantitative and qualitative data suggest that both media presented a positive working environment. Nonetheless, the comments made throughout the six surveys strongly imply that judges overall preferred the video medium.

Communication item 9: The medium through which we communicated during this stage was appropriate.

Table 5.11  Wilcoxon signed-rank test/Sign test communication item 9

Survey (ID)  Strongly Disagree  Disagree  Slightly Disagree  Slightly Agree  Agree  Strongly Agree  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1  -/-  -/-  -/1  2/2  30/29  13/13  -.36 (.71)  -  1.00
2  1/-  -/-  1/1  2/1  29/31  12/12  -.70 (.49)  -  1.00
3  -/-  -/-  1/1  3/-  28/31  13/13  -.56 (.57)  -  1.00
4  -/-  -/-  1/1  4/1  29/34  11/9  .00 (1.00)  -  1.00
5  -/-  -/-  -/-  3/3  29/25  13/17  -  (.42)  1.00
6  -/-  -/-  -/-  1/2  30/28  14/15  .00 (1.00)  -  1.00
Note: Frequencies are shown as Audio/Video.


For communication item 9, no statistically significant differences in perceptions were found, implying that from a quantitative point of view, both media were appropriate for communicating during all stages of a session. No discoveries were made as q-​values were all 1.00.

Qualitative comments for communication item 9: Unlike the previous communication items, for communication item 9 judges were asked to reflect on each stage of the virtual session and comment on the medium through which they communicated (surveys 1–​5). Survey six (the final survey) asked judges to provide an overall comment on the appropriateness of the medium for the whole virtual session.

Audio medium Judges who found it difficult to agree that the e-communication medium was appropriate for each stage of the virtual session listed various reasons. At the end of the Method Training Stage (survey 2), Judge J07 expressed annoyance at one of the judges for not allowing much discussion to take place, as "J01 was talking all the time and was too domineering" (J07, Strongly Disagree – survey 2). At the same time, Judge J39 remarked that "if there had been more [discussion] we would have had more problems to understand who is talking at each moment since it is difficult to identify the speaker just through the voice" (J39, Slightly Agree – survey 2). Judge J20 seemed to have experienced the same problem, as the judge found it difficult "to figure out who was talking every time since we couldn't see each other" (J20, Slightly Agree – survey 3). No other reasons were provided by judges who had difficulty endorsing the item. Judges who found the medium to be appropriate for the various stages of the virtual session also listed various arguments, ranging from feeling that "nothing was missing" (Judge J09, Agree – survey 1) during the Orientation stage and believing that "the audio medium was appropriate for the exchange of views during this round [since] opinions were heard and respected" (J09, Strongly Agree, audio medium – survey 3) during Round 1 to claiming that "everyone's ideas and input were respected" (J02, Agree, audio medium – survey 4) during Round 2. The two judges who made comments in the final survey summed up their views by stating that "the pace, timing, and content of the workshop were appropriate, making participants feel that the whole process was a worthwhile and rewarding experience" (J05, Strongly Agree – survey 6) and that it was "the right medium for this workshop. We could all be heard which was the point at this workshop" (J07, Strongly Agree – survey 6).


Video medium The few judges who found it difficult to endorse the item in the video medium reported that (1) the visual channel was not necessary at a particular stage, (2) they experienced technical issues, or (3) they felt uncomfortable with the camera on. In both the Orientation stage (survey 1) and the end of the Method training stage (survey 2), Judge J29 remarked that the video was not needed, as the session "could have been done in the same way through audio-only" (J29, Slightly Agree, video medium – survey 1) and as such "was unnecessary" (J29, Slightly Disagree – survey 2). Judge J38 reported technical problems during this stage: "sound wasn't working properly" (J38, Slightly Agree – survey 2). For the next stage (Round 1 stage), Judge J29 also expressed unease with the video medium, stating that the judge did not feel "as comfortable communicating" in the video medium (J29, Agree – survey 3), despite endorsing the item. Judge J29 had repeatedly reported throughout the surveys that the video camera was not necessary, and the judge's comment at this stage may suggest a bias against the video medium. Judge J15, who also endorsed the item, felt that "turn-taking is much easier" in F2F settings and added that s/he "would have contributed more input and reaction responses if [the judge] was sitting in the same room with [the] panel" (J15, Agree, video medium – survey 3). Judges who found the item easy to endorse only made positive comments in survey 6, in which they provided their overall judgement on the appropriateness of the video medium for the virtual session. Judge comments ranged from it resembling F2F "meetings" (J05, Agree – survey 6) and not having "problems with it" (J28, Agree – survey 6) to being "able to get adequate information for what we were to do and do it in a timely manner" (J38, Agree – survey 6). Judge J11 reinforced the idea of being in a situation similar to a F2F meeting by reporting that "the facilitator, cameras, microphones, speakers, test items that popped up and everything else, like the chat platform were all there to help us communicate in real face-to-face form" (J11, Strongly Agree – survey 6). The comments made in all surveys suggest that both media are appropriate for setting cut scores, as both have their distinct advantages and disadvantages. Judge J09's comments summarise this finding as follows: I think the standard setting workshop was effective both via "audio" and "audio-video". The advantage of the audio was that the facilitator allocated who spoke, so it was hard to interrupt other people, we stayed on track and opinions were heard in a defined order. However, the fact that we could comment more freely and see other people during the "audio-video" also facilitated discussion and gave it a more natural feel. I feel that both methods can facilitate a standard setting workshop. (J09, Strongly Agree, video medium)

Communication item 10: I felt comfortable contributing in the virtual small group discussions.

Table 5.12  Wilcoxon signed-rank test/Sign test communication item 10

Survey (ID)  Strongly Disagree  Disagree  Slightly Disagree  Slightly Agree  Agree  Strongly Agree  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1  2/0  1/1  0/0  1/4  28/24  13/16  -.66 (.51)  -  .64
2  1/-  -/-  1/-  3/1  27/28  13/16  -  (.21)  .56
3  -/-  -/-  1/1  1/1  21/28  22/15  -  (.14)  .56
4  -/-  -/-  1/-  1/-  28/29  15/16  -  (.61)  .64
5  -/-  -/-  -/-  3/1  29/31  13/13  -.54 (.59)  -  .64
6*  -  -  -  -  -  -  -  -  -
Note: Frequencies are shown as Audio/Video.
* Communication item 10 was not included in survey 6


For communication item 10, no statistically significant preference towards a particular medium was found. No discoveries were made, as the q-values ranged from .56 to .64. This item was not included in survey 6, and no qualitative data were collected on it, as no related open-ended questions were included in the surveys.

Communication item 11: I am satisfied with my contributions during this stage of the workshop.

Table 5.13  Wilcoxon signed-rank test/Sign test communication item 11

Survey (ID)  Strongly Disagree  Disagree  Slightly Disagree  Slightly Agree  Agree  Strongly Agree  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1  -/-  1/-  -/-  2/2  29/31  13/12  .00 (1.00)  -  1.00
2  1/-  1/-  2/1  2/4  26/33  13/7  -  (.41)  .99
3  -/-  -/1  1/-  2/3  30/30  12/11  -.52 (.60)  -  .99
4  -/-  -/-  1/-  2/-  27/35  15/10  -.19 (.85)  -  1.00
5  -/-  -/1  -/-  5/1  26/28  14/15  -  (.63)  .99
6*  -/-  -/-  -/-  3/2  27/32  15/10  -  (.55)  .99
Note: Frequencies are shown as Audio/Video.
* Not all judges responded to the item (missing data)


For communication item 11, in line with the quantitative analysis of the previous ten communication items, no preference towards a specific medium was detected, as no discoveries were made. No qualitative data were collected on this item, as no related open-ended questions were included in the surveys.

Platform item 1: The e-platforms were easy to use.

Table 5.14  Wilcoxon signed-rank test/Sign test platform item 1

Survey (ID)  Strongly Disagree  Disagree  Slightly Disagree  Slightly Agree  Agree  Strongly Agree  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1  -/-  -/-  -/-  4/1  15/16  26/28  -1.10 (.27)  -  .73
2  -/-  -/-  -/-  -/1  22/21  23/23  -.24 (.81)  -  .85
3  -/-  -/-  -/-  1/3  18/19  26/23  -  (.42)  .73
4  -/-  -/-  -/-  -/-  21/24  24/21  -.58 (.56)  -  .73
5  -/-  -/-  -/-  1/1  22/24  22/20  -.58 (.56)  -  .73
6*  -/-  -/-  -/-  -/1  20/21  25/23  -  (.58)  .73
Note: Frequencies are shown as Audio/Video.
* Not all judges responded to the item (missing data)


For platform item 1, there were no statistically significant differences in perceptions regarding the ease of using each e-platform. No discoveries were made as q-values ranged from .73 to .85. Such a finding was expected, as the only difference between the two e-platforms was that the video capability was disabled for all judges and the facilitator in the audio medium. No qualitative data were collected on this item, as no open-ended questions were included in the surveys.

Platform item 2: The e-platforms functioned well.

Table 5.15  Wilcoxon signed-rank test/Sign test platform item 2

Survey (ID)  Strongly Disagree  Disagree  Slightly Disagree  Slightly Agree  Agree  Strongly Agree  W-S-R Test Z (Prob.)  Sign Test Z (Prob.)  FDR q
1  -/-  1/-  -/1  3/-  24/22  17/22  -  (.21)  .37
2  -/-  1/1  1/-  2/3  26/23  15/18  -.38 (.70)  -  .74
3  -/-  -/-  1/3  1/6  21/23  22/13  -  (.00)  .02**
4  -/-  -/-  -/-  3/3  19/28  23/14  -  (.04)  .11
5  -/-  -/-  -/-  3/3  20/25  22/17  -  (.30)  .39
6*  -/-  -/-  -/-  4/1  18/24  23/19  .00 (1.00)  -  .88
Note: Frequencies are shown as Audio/Video.
* Not all judges responded to the item (missing data)
** Statistically significant at q ≤ .05


For platform item 2, judges indicated in survey 3 that the e-platform functioned better in the audio medium; that is, one discovery was made. That the discovery occurred in survey 3 comes as no surprise, as it may be attributed to the technical problems some judges reported in relation to the video medium throughout the perception survey items. These problems, however, were not insurmountable, as the judges felt that, on the whole (survey 6), they were able to communicate efficiently and effectively. No qualitative data were collected on this item, as no open-ended questions were included in the surveys at the time of the study.

5.3 Procedural survey items

A brief investigation into the 52-item procedural survey instruments collecting procedural evidence (see Cizek, 2012b) was warranted as items difficult to endorse may pose a validity threat to the cut score results. The 52 items rendered a total of 2,340 responses (52 items × 45 judges = 2,340) for each instrument. Table 5.16 reports the descriptive statistics for each instrument, while Table 5.17 reports the frequency data for each of the scale categories.

5.3.1 Evaluating the procedural survey instruments

The first analysis served to examine the reliability of each procedural survey instrument. Table 5.16 reports the descriptive statistics for each instrument.

Table 5.16  Psychometric characteristics of procedural survey instruments
Medium  No. of judges  No. of items  Maximum no. of points  Mean score  S.D.  Median score  Mode score  Minimum score  Maximum score  Reliability [alpha (rxx)]  SEM score
Audio  45  52  312  271.38  21.64  270  274  212  312  .96  4.33
Video  45  52  312  273.18  16.46  270  267  233  310  .94  4.03


Table 5.17  Frequency data of procedural survey instruments
Medium  1 (Strongly Disagree)  2 (Disagree)  3 (Slightly Disagree)  4 (Slightly Agree)  5 (Agree)  6 (Strongly Agree)  Missing  Total
Audio  6 (0.26 %)  11 (0.47 %)  33 (1.41 %)  188 (8.03 %)  1237 (52.86 %)  858 (36.67 %)  7 (0.30 %)  2340 (100 %)
Video  8 (0.34 %)  5 (0.21 %)  14 (0.60 %)  132 (6.22 %)  1333 (56.97 %)  840 (35.90 %)  8 (0.34 %)  2340 (100 %)

Both instruments exhibited high internal consistency, with Cronbach's alpha values of .96 and .94 for the audio and video medium, respectively. The frequency data revealed that, in both instruments, approximately 90 % (audio) and 92 % (video) of all ratings were either a "5" (Agree) or a "6" (Strongly Agree). Such findings imply that the judges, overall, found the procedural survey items easy to endorse (agree with), regardless of the medium. Further examination of the item responses in both media revealed that only five items in the audio medium [1–6 (62.22 %), 1–7 (68.89 %), 2–4 (77.78 %), 5–4 (75.56 %), 6–20 (73.33 %)] and four items in the video medium [1–6 (75.56 %), 1–7 (71.11 %), 1–9 (77.78 %), and 6–20 (71.11 %)] were not easily endorsed, as the percentage of judges assigning either a "5" (Agree) or a "6" (Strongly Agree) was below 80 %. However, when examining item responses in terms of judges endorsing the item to some extent [assigning a "4" (Slightly Agree), a "5" (Agree), or a "6" (Strongly Agree)], all 52 items in both surveys were endorsed by at least 88 % of all judges, and as such did not warrant any further investigation. Consequently, the descriptive analyses of the procedural survey instruments add to the procedural validity of the cut scores.
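As a rough analogue of the reliability figures reported above, the sketch below computes Cronbach's alpha, the classical standard error of measurement (SD × √(1 − alpha), which reproduces the SEM values in Table 5.16, e.g. 21.64 × √(1 − .96) ≈ 4.33), and the percentage of "Agree"/"Strongly Agree" ratings per item. It is a minimal illustration on simulated ratings, not the study's own analysis script.

# Minimal sketch on simulated data: Cronbach's alpha, classical SEM, and
# per-item endorsement rates for a 45-judge x 52-item procedural survey.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(1, 7, size=(45, 52))   # hypothetical ratings on a 6-point scale

def cronbach_alpha(data: np.ndarray) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1)
    total_var = data.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

alpha = cronbach_alpha(ratings)
sem = ratings.sum(axis=1).std(ddof=1) * np.sqrt(1 - alpha)   # SEM = SD * sqrt(1 - alpha)

# Percentage of judges assigning a 5 (Agree) or 6 (Strongly Agree) to each item.
endorsement = (ratings >= 5).mean(axis=0) * 100
print(round(alpha, 2), round(sem, 2), endorsement[:5].round(1))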

5.4 Summary

Overall, the analysis of the quantitative survey data revealed that judges, as a group, did not believe that their ability to communicate effectively was impaired by either e-communication medium. Only when judges were asked whether the e-platforms functioned well was one discovery made (survey 3), with the audio medium rated more favourably. The overall quantitative findings were not surprising, as the recommended virtual cut score measures across groups and between media did not differ statistically either. The platform used for the virtual sessions did not seem to have introduced any construct-irrelevant factors (i.e., the platform not functioning well or being too difficult to use) that could have had a direct impact on the validity of the recommended cut scores. However, analyses of the qualitative survey data suggested that judges, as a group, may have preferred the video e-communication medium, either for certain stages of a virtual session, such as the discussion stage(s), or for a virtual cut score study as a whole. The judges' preference towards the video medium was also noticed by the facilitator, especially during the Round 1 discussions in the audio medium, when some of the judges had difficulty identifying who was sharing information, an observation that was further explored in the focus group interviews.

Chapter 6:  Focus group interview data analysis

The aim of the chapter is threefold: to (1) present and compare the judges' perceptions of each medium in the virtual environment, (2) investigate whether either of the two e-communication media provided a more appropriate discussion environment for standard setting purposes, and (3) explore whether either e-communication medium influenced the judges' decision-making processes. The chapter is divided into two sections: the first outlines the analysis and coding scheme adopted, while the second presents the findings of the focus group interviews, which were conducted online. During the course of each focus group, the full range of the judges' virtual experiences during the two virtual sessions was explored in order to answer research question 4 [What are the judges' perceptions of setting cut scores in each e-communication medium (audio and video)?] and its sub-questions.

6.1 Analysis of transcripts

All five focus group interviews were recorded in the same platform (Adobe® Connect™) in which the virtual standard setting workshops took place. Out of the 45 judges who had taken part in the study, 33 (73.33 %) attended one of the five focus group sessions. Judges were given a choice of which session to attend and, apart from the first focus group session, the remaining four sessions comprised judges from different workshops. In this way, judges would be able to compare and share their experiences of the virtual standard setting sessions across groups. The interviews were then transcribed verbatim for content using the f4transkript software programme (version 1.0, Audiotranskription, 2014b), a programme which accepts video recordings and offers the user a variety of features such as replay speed, a predefined rewind interval, and time stamps. To code the transcripts, the f4analyse software programme (version 1.0, Audiotranskription, 2014a) was used, as the programme allowed the researcher to upload all five transcripts into one project, to assign codes and subcodes to segments, and to retrieve the same coded data from all five transcripts simultaneously. To analyse the focus group data, the constant comparative method (CCM) (Corbin & Strauss, 2015; Glaser, 1965; Glaser & Strauss, 1967) was employed, a process which includes (in summary):

1. Reading through the entire set of data (or subset of data);
2. chunking data into smaller meaningful parts;
3. assigning codes to each chunk;
4. comparing each new chunk of data with previously assigned codes;
5. grouping codes by similarity; and
6. identifying categories (themes) based on grouping codes.

The CCM steps were embedded within three cycles of coding. The First Cycle of coding consisted of evaluation coding, descriptive coding, in vivo coding, and emotion coding. The codes were revised prior to the Second Cycle of coding, in which pattern coding was applied to group the data. The Final Cycle of coding entailed theming the data (see section 3.4.6 for the coding framework and Table 6.1 for the coding scheme). Five broad themes emerged from the focus group analyses in relation to the judges' overall experiences of the two virtual standard setting sessions:

1. Psychological aspects;
2. Interaction;
3. Technical aspects;
4. Convenience;
5. Decision-making process.

Table 6.1 illustrates the themes and their associated codes (sub-themes), the number of focus groups each code was referenced in, and the number of times each code was referenced.
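The two count columns in Table 6.1 (the number of focus groups in which a code appears and the number of times it is referenced) can be derived from the coded segments in a straightforward way. The sketch below shows one possible way of doing so; the coded segments and their labels are hypothetical, and this is only an illustration of the tallying step, not the f4analyse workflow itself.

# Minimal sketch with hypothetical coded segments: tally, for each code, the
# number of focus groups it appears in and the total number of references.
from collections import defaultdict

segments = [  # (focus_group_id, theme, code)
    (1, "Psychological aspects", "Distraction in the video medium"),
    (2, "Psychological aspects", "Distraction in the video medium"),
    (2, "Technical aspects", "Turn-taking system"),
    (3, "Interaction", "Lack of small talk in virtual environments"),
    (3, "Technical aspects", "Turn-taking system"),
]

groups_per_code = defaultdict(set)
references_per_code = defaultdict(int)
for group_id, theme, code in segments:
    groups_per_code[(theme, code)].add(group_id)
    references_per_code[(theme, code)] += 1

for (theme, code), groups in sorted(groups_per_code.items()):
    print(theme, "|", code, "| groups:", len(groups),
          "| references:", references_per_code[(theme, code)])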

Table 6.1  Coding scheme

Theme / Code  No. of focus groups  No. of references
Psychological aspects  5  42
  Cognitive strain in the audio medium  2  3
  Distraction in the video medium  3  14
  Inability to discern who was paying attention in audio medium  1  1
  Inability to distinguish speaker in audio medium  2  4
  Lack of non-verbal feedback in the audio medium  5  15
  Self-consciousness in the video medium  4  5
Interaction  5  17
  Differences in amounts of discussion between virtual and F2F settings  5  12
  Lack of small talk in virtual environments  2  4
  No digression from the topic in virtual environments  1  1
Technical aspects  5  39
  Technical problems in virtual environments  5  15
  Turn-taking system  5  24
Convenience  5  28
  Freedom to multi-task in virtual environments  1  1
  Less fatigue in virtual environments  5  11
  Time saved in virtual environments  5  16
Decision-making process  5  7
  Decision-making in virtual environments  5  7


6.2 Findings

This section discusses each of the main themes and their corresponding sub-themes, with illustrative examples from all five sets of focus group data.

6.2.1 Psychological aspects

This theme captures comments related to the judges' preference towards each medium that are not related to any technical problems encountered. Judges in favour of the audio medium referred to two main problems encountered in the video medium: (1) distraction and (2) self-consciousness.

Distraction in the video medium Judges in different focus groups remarked that having the video camera on continuously throughout the workshop resulted in their being distracted at times and they reported feeling more concentrated in the audio medium. In Excerpts 1 and 2, judges were providing their preference for which medium would be most appropriate for stage 1 (Orientation stage) of the workshop. Excerpt 1: J04: … but when we used audio we were not distracted so much, we were more concentrated on what we were supposed to do. I feel that was a good point. Excerpt 2: J14: I think that I just prefer the audio, because I would be more concentrated on reading and listening to you.

In both excerpts, judges emphasised that the video camera added distraction and made them less concentrated on the task at hand. The fact that these judges felt distracted in the video medium appeared to be related to the particular stage of the workshop and/​or the task that the judges were engaged in. For example, in stage 1 (Orientation stage) after the judges had introduced themselves to one another, they did the CEFR familiarisation activities and then reviewed the test instrument. During this part of stage 1, the camera may have distracted them as judges were not communicating with anyone but were working on individual activities. Similarly, in Round 3, when their final recommended cut scores were presented, the use of video appeared to have been a distraction. Judge J13 explained why the video medium may have added distraction (see Excerpt 3). Excerpt 3: J13: Since it was to do with numbers and percentages, perhaps audio was better. Video might distract us [Facilitator: … can you elaborate on that?] … The power of image is always, can act as a distractor.


In Excerpt 3, Judge J13 was highlighting that in Round 3, in which judges enter their final recommended cut score after having seen their Round 2 data and consequences feedback, the video camera was a distraction.

Self-​consciousness in the video medium A second negative comment made by judges about the video medium was that the use of video made them self-​conscious about their appearance or their physical setting. Excerpt 4 illustrates an occasion when a judge felt self-​conscious of her/​his own behaviour: Excerpt 4: J09: … but having the camera on later, while exactly like that, while we were doing nothing, I kept scratching my nose, yawning, scratching my hair and I felt self-​confident during those moments.

In Excerpt 4, judge J09 was referring to the times during the workshop that judges were waiting for other members of the group to finish entering their ratings or their survey responses. The fact that the cameras were on and that the judges could see themselves and other judges may have heightened this sense of self-​consciousness. In Excerpt 5, judge J38 raises the issue of other judges being able to see her/​his physical surroundings. Excerpt 5: J38: … I’m self-​conscious that you can see everything that is happening, that you can see behind me, that I don’t have the freedom to do what I want to as if we were only on audio.

While this may not pose a problem to judges participating in a video-​conference in an office or at a desk with a wall behind them, it may be a problem when a judge is working in a living room and the judge’s kitchen is in plain sight for all other judges to see and comment on. In contrast, judges favouring the video medium reported four main problems associated with the audio medium: (1) lack of non-​verbal feedback; (2) inability to distinguish the speaker; (3) inability to discern who was paying attention; and (4) cognitive strain.

Lack of non-​verbal feedback in the audio medium One of the most frequent criticisms of the audio medium was that judges could not see each other, and as such, could not receive any non-​verbal communication, resulting in a less interactive discussion.


Excerpt 6: J11: … If it is audio-​visual [video], for me, it is as close as it can be to real communication. Just audio isn’t. I find it very practical and helpful if both things are there … J14: I agree with J11, and I find it more, you know, uh interactive … I mean, it’s like being in a real situation. Um, whereas when you are using the audio, you know, it’s kind of a different situation … J15: … For me, even a facial gesture might give me motive to say something, encourage me to speak more, or um share ideas more freely. J13: I’d like to add, body language, body posture also contributes to understanding …

This exchange suggests that judges found the audio medium a less interactive and less natural environment as they could not read each other’s non-​verbal communication. Judge J04 explicitly referred to not being able to distinguish whether judges agreed or disagreed with the judge’s own comments: Excerpt 7: J04: … But video helps as well, because you can see the expressions on other people’s face if they agree, disagree if they want to say something.

Judge J04’s comment highlights the fact that non-​verbal communication plays a key role in evaluating other participants’ reactions to, understanding of or lack of understanding of reasons used to support own beliefs, which, in turn may contribute to further communication, especially in the cases of not sharing the same beliefs. For example, when a judge can see and read other’s non-​verbal communication and realises that the other participants do not share the same belief system, the judge may provide more reasons to support the claim that they are making.

Inability to distinguish speaker in the audio medium Another criticism of the audio medium had to do with judges being unable to distinguish who was speaking, despite the fact that the speaker’s name was highlighted in one of the windows in the platform. Excerpts 8 and 9 illustrate the problem faced by judges: Excerpt 8: J38: It also helped if you know who you are talking to, earlier in my example with J35, he confused me with somebody else, but if we had video, he would have known who was talking at that time.

The difficulty of not being able to clearly recognise who was speaking was also echoed by judge J11 (see Excerpt 9).


Excerpt 9: J11: … sometimes I couldn’t understand who was speaking and I think that is more natural, more friendly to see who I’m talking to.

It was surprising that the judges could not distinguish who was speaking as the speaker was highlighted in the platform and was always the first name in the list of the participants. Moreover, each time that a speaker took the floor, the speaker was introduced by name. However, the comment may have been more related to the fact that the judges could not associate the voice with an image as they could in the video medium or may not have heard the researcher introducing the next speaker.

Inability to discern who was paying attention in audio medium Another problem related to the audio medium was that it was difficult to understand whether other judges were paying attention to what was being said. While the problem was only mentioned by one of the judges, it deserves attention as it is easier for judges to become disengaged in audio virtual environments than in video virtual environments (see Excerpt 10). Excerpt 10: J43: Also, when you have the visual you can see who is out there and who is listening or not. Whereas when it was just the audio, we didn’t know who was there. I meant, you said their names, like you know, to say something, um, I would have to keep referring to a nameless and people who were in and out and um, is someone going to say something, if not, you know, we didn’t know who was out there. It was hard to keep track of who was in and out of the conversation. So, I think that created a lot of gap time, too.

Cognitive strain in the audio medium Another problem reported by judges using the audio medium was that it added a cognitive burden as a lack of visual stimuli made it difficult to concentrate on what was being discussed as illustrated in Excerpt 11: Excerpt 11: J37: Actually, I don’t understand exactly, it was difficult for me to concentrate on just the voice without seeing anything on the screen. It felt like I had to concentrate twice in order to understand what was going on, who was to speak next, I had difficulty concentrating. I didn’t like it. J16: I agree with J37, it’s much better when you see the other person’s face.

This excerpt suggests that not being able to see the other participants, especially during discussions, makes it more cognitively demanding to follow their line of reasoning. This, in turn, may result in judges becoming disengaged during long discussions.

6.2.2 Interaction The theme of “Interaction” captures the judges’ perceptions on the type of interaction encountered in the virtual environments compared with F2F interactions in meetings and/​or standard setting workshops. Judges referred to three main differences between virtual and F2F environments in relation to interaction: (1) lack of small talk; (2) no real digression from the topic; and (3) differing amounts of discussion.

Lack of small talk in virtual environments The first main difference between virtual standard setting workshops and F2F workshops is that there was not much small talk taking place in the virtual workshops. In a F2F workshop, it is common for panellists to engage in such small talk, usually before the beginning of the workshop, during the breaks, and even after the workshop. However, in the virtual environment judges commented that no such small talk had taken place (see Excerpt 12). Excerpt 12: J12: Just to add something, when you meet someone if there is a meeting, face-​to-​face, the (name of institution) for example, then probably you have some time to get to know one another other. You have three, four, five minutes that you see someone in person and you say a few things, so it becomes a bit more personal. This way, it becomes more professional, you are here for a very specific period of time to do something, a very specific task. It is not the luxury of time, before or after to exchange some ideas, it’s over when it’s over. This system, the positive thing, the positive aspect of this system is that it is more professional, on the other hand, it is less personal. You don’t have time to say something personal to some of the comments about what you do in your life. You only introduce yourself and that is it. No questions, nothing. J01: … As J12 said, we didn’t have a lot of time to do anything more than the stiff introduction. We didn’t do that much personal sharing and joking because of this need to not step on each other’s language to let the recording record …

This extract suggests that judges may have expected more personal conversations to take place throughout the virtual workshop. The fact that not much small talk took place may have been a positive aspect of the workshop, as observed in judge J12's remarks. Nonetheless, the amount of small talk that may have actually taken place amongst the judges in the form of chat cannot be confirmed, as the facilitator only had access to the chat that everyone was participating in. Individual chats amongst the judges could not be accessed, and judges were informed that all of their individual chats were private. However, the facilitator was aware that some participants were having personal chats with the co-facilitator during the sessions. Other judges had had personal chats with the IT expert earlier in the workshop so as to resolve technical problems.

No digression from the topic in virtual environments Another difference between virtual and F2F environments was that no real digressions were observed in the virtual environment as illustrated in Excerpt 13: Excerpt 13: J09: I found first the audio-​only very efficient because we had to wait for each other to finish speaking, and maybe it was the medium. I don’t know, I didn’t felt, I didn’t feel that at any point our discussion went off topic whereas this may happen in face-​to-​face situations. We were always on topic and very focused on what we were discussing … [I]‌n the video we only went off topic while something was uploaded I think, um, and because we could see other so distinctly, saw that half of someone, we went off topic there, but we didn’t stop what we were doing to discuss something else. It was at a time when, ah, things were being uploaded. We were very focused on the workshop, nothing external you know coming in from the discussion. I think this, this, was the medium compared to face-​to-​face.

Despite the fact that only one judge commented on this issue, it may provide additional support for why some of the judges previously remarked that they were more “concentrated” in the audio medium.

Differences in amounts of discussion between virtual and F2F settings Amount of discussion refers to the length of discussion encountered within each virtual environment, especially when compared with the F2F environment. Divergent and often conflicting discourses emerged as illustrated in Excerpt 14. Excerpt 14: J43: No, no I don’t think the discussion was enough … I think if they were face-​to-​face, people are more forth [sic. forthcoming] to express their opinion, um, whereas especially if it is just audio people can just hide and not express … There was less discussion when the visual was off. I think if you are face-​to-​face, you have to, you know you are there, you are on a spot, and no one is raising their hand, no one is saying something, so you know you have to kind of get out of that awkward situation and say something whereas you can hide if you are behind the computer, you can hide, and, um, there wasn’t as much discussion.


J09: In our group, I think it was the opposite. We discussed a lot more in the audio-​only, I think, ah, not in the audio-​video. Perhaps it was a time constraint as well, but for the items we talked about, we did talk about them, I mean, I think, until everyone had said something. Even if it was I agreed, I think. J30: Well there are some factors we have to consider for example, I felt that the video session had more discussion than the audio, but then perhaps it is not because of the medium but because of the fact that in the second workshop, familiarisation was much higher which allowed participants to participate more …

The issue of time constraints may also have played a role in the amount of discussion generated in the virtual environments compared with F2F environments. Judges on various occasions brought up feeling a sense of time pressure when discussing several issues, as evident in Excerpts 15 and 16. Excerpt 15: J01: … we were aware of the present time and the amount of work we had to do. I think we all could have talked a lot about certain of the issues for sure and may be more about everything. But we were, we knew we had to move along because we were in this major time frame. Excerpt 16: J11: … we don't have the luxury of talking a lot, we have to concentrate on our answers and you concentrate on questions because there is a limited time to do it.

It is not clear why judges felt more pressed for time in the virtual environment. It may have been their awareness that the workshops were scheduled for a particular length of time and that, should the workshops run over time without any advance warning, many of the judges might have had to withdraw because of other engagements. The sense of time pressure may not be so prevalent in an onsite workshop, as such workshops usually have additional time budgeted in for their completion and usually end ahead of schedule on the last day. Moreover, when judges participate in international standard setting workshops, especially ones which entail travelling, they usually have no other commitments to attend to for the duration of the workshop. Thus, such judges do not feel as pressed for time as some of the virtual judges did.

6.2.3 Technical aspects The third theme of “Technical Aspects” captures the judges’ perceptions of each virtual medium in relation to problems encountered and the way the workshop was designed in the virtual environment. Judges referred to (1) the technical problems encountered and to (2) the turn-​taking system used to express their opinion.


Technical problems in virtual environments One of the most frequent criticisms of the virtual sessions was the technical problems encountered by the judges directly or indirectly. The main problems were sound problems and video cameras freezing. Excerpt 17 highlights that such technical problems may not have existed in a F2F environment. Excerpt 17: J39: … but I have to say that technically there were some problems that we would not have had if we had been, you know, on a face-​to-​face interaction, but I think on the whole it was OK, taking into account some technical constraints, some technical limitations …

The technical problems encountered were not directly related to the virtual platform used to conduct the workshop, but to the judges’ own equipment, something also observed by other judges who experienced no problems with the virtual sessions (see Excerpt 18). Excerpt 18: J36: … and there were in our group some problems with some people and I don’t think it’s down to you and I don’t think it’s down to the platform, I think it’s their equipment that didn’t work very well.

The most common problem within the virtual platform was that some judges set their microphone or speaker volume too high, resulting in an echo in the platform. When either of the two was lowered, the problem was resolved (see Excerpt 19). Excerpt 19: J25: In my opinion, I think that everything was ok but even some people that have a louder microphone, I just got very, not very good for me but everything was ok even with the camera even with audio that was our perception, it was OK.

By saying "I just got very, not very good for me", Judge J25 might have been suggesting that loud microphones caused the judge some frustration. The same problem with loud speakers was also observed during one of the focus group sessions (see Excerpt 20). Excerpt 20: Facilitator: … I'm getting an echo; someone's speakers are really loud … Someone's got the speakers really loud.

Sound problems were also caused by judges entering the platform twice; the resulting echo was easily rectified, as it was easy to identify which judge had a duplicate connection. The judge who faced the most technical problems was Judge J45. The IT expert later explained that the judge was using a dial-up connection and was not as computer literate as the other judges. In Excerpt 21, Judge J45 commented on the problems encountered during the workshop.


Excerpt 21: J45: I also want to thank you. I enjoyed it, but I was the extreme. I had many problems and I want to thank [the IT expert] who was working very hard and I’m sorry but probably the problem I had was with my router and I had to call technical service. I think that I would have also felt more comfortable in a face-​to-​face session.

Technical problems were more prevalent in the video medium as cameras were freezing and/​or not functioning well. One judge also experienced her/​his camera freezing during a focus group session (see Excerpt 22). Excerpt 22: J09: OK. I think that my video screen has frozen but I can hear you perfectly. Can you hear me? …

During the workshop, when cameras froze, judges were asked either to switch their camera off and on or to exit and re-enter the platform; when that failed, the IT expert tried to resolve the problem remotely. The fact that more technical problems were observed in the video medium than in the audio medium gave some of the judges the impression that the audio medium workshop was faster than the video medium workshop (see Excerpts 23 and 24). Excerpt 23: J19: Although, I enjoyed the video more, um the audio just eliminated some things like problems, with the cameras on off so it got faster. [Facilitator: I missed the last part!] Although I like the video best, the audio keeps faster because we avoid some technical problems. J15: OK, I have to agree with J19, for me at least, the audio was easier, faster … Excerpt 24: J03: … Well, I always prefer audio because I am always under the impression that audio is faster. Audio, I have, I am always under the impression that it's a faster type of interaction with the audio. Video lags.

Fortunately, nearly all of the technical problems were resolved quite quickly by the IT expert, who remotely accessed the judges’ computers to make any necessary changes to their software and/or hardware. In the rare situations in which a judge’s microphone was not working, the judge communicated through chat until the problem was attended to by the IT expert or, if it could not be resolved, for the remainder of the session. According to the IT expert, the judges who experienced most of the technical problems were those with slow Internet connections. It is therefore not surprising that some judges felt that the audio sessions ran faster, as the audio medium did not require as much bandwidth as the video medium.

Turn-​taking system As part of the netiquette established during the platform training sessions at the beginning of the first virtual session, which served to facilitate the interaction in the virtual environments, judges were asked to use the “Raise Hand” function in the virtual platform so as to be given the floor. The order in which each judge raised their hand was shown in the window under the list of participants. In this way, the judges could see in which order they would be responding and no judge would be speaking over another judge. When a judge finished speaking, they would lower their hand and their name would be automatically placed at the bottom of the list, while the next judge in line would be placed at the top of the list. The facilitator sometimes would need to remind the next judge in line to begin speaking when the previous judge finished speaking. In relation to the turn-​taking system implemented, judge comments ranged from the system being effective to requiring more self-​discipline. Excerpts 25, 26, and 27 illustrate the judges’ perceptions of the turn-​taking system. Excerpt 25: J21: No, in fact I felt really comfortable when I wanted to make a question. I really liked also the option of raising our hand, reply, and express the opinion … Excerpt 26: J18: Um, I didn’t have any problems, and actually felt that the raising hand symbol was very convenient because it enabled us to speak whenever we wanted to in order to express our opinion, etc. And, I felt that actually that it helped the whole process so it worked very well. Excerpt 27: J09: … like I said before, we did use the agree button to agree, and we didn’t have to speak after that like we would nod our heads or make noises in a face-​to-​face discussion too, but I think it was not a disadvantage I think this contributed to the efficiency of the workshop.

However, such a turn-​taking system may not have been favoured by all judges as it required a lot of discipline on the judges’ part (see Excerpt 28). Excerpt 28: J03: … I think that it has to do with the personality of each person like I am a very, I can’t wait my turn, sometimes I just want to say, you know, what I want to say and I have my personal techniques to avoid doing this like I raise my finger so that I can count before I say it. It took a lot of discipline for me.

While the turn-taking system implemented was found effective by most judges, the fact that judge J03 expressed that it required “discipline” is not surprising, because this form of turn-taking may not occur to such an extent in day-to-day interactions. For example, when participating in F2F meetings, one uses other techniques to take the floor, such as “jumping in” during a pause to make a comment or to answer a question. Waiting patiently in a queue to share views, or to make a comment on a point that may already have been addressed by the time one’s turn arrives, is not part of natural verbal interaction.
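The turn-taking mechanism described above is essentially a first-in-first-out queue. The sketch below is a minimal illustration of that behaviour rather than the actual implementation of the platform used in the study; the class and method names are hypothetical.

```python
from collections import deque
from typing import Optional


class RaiseHandQueue:
    """Minimal sketch of a 'Raise Hand' turn-taking list (illustrative only)."""

    def __init__(self) -> None:
        self._queue = deque()  # judges in the order they raised their hands

    def raise_hand(self, judge_id: str) -> None:
        # A judge joins the end of the list when they raise their hand.
        if judge_id not in self._queue:
            self._queue.append(judge_id)

    def current_speaker(self) -> Optional[str]:
        # The judge at the top of the list holds the floor.
        return self._queue[0] if self._queue else None

    def lower_hand(self) -> None:
        # When the current speaker lowers their hand, their name drops to the
        # bottom of the display and the next judge in line moves to the top.
        if self._queue:
            self._queue.rotate(-1)


# Example: three judges raise their hands in turn.
panel = RaiseHandQueue()
for judge in ("J21", "J18", "J09"):
    panel.raise_hand(judge)
print(panel.current_speaker())  # J21
panel.lower_hand()
print(panel.current_speaker())  # J18
```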

6.2.4 Convenience The fourth theme “Convenience” refers to the judges’ perceptions of the advantages of a virtual standard setting workshop when compared to a F2F standard setting workshop or a F2F training workshop. Judges referred to three main differences: (1) time saved; (2) freedom to multi-task; and (3) less fatigue.

Time saved in virtual environments The subcategory of “time saved” refers to judges saving time by not having to travel to a venue to attend a F2F workshop. Excerpt 29 is an example of the time saved in virtual environments. Excerpt 29: J09: … we don’t have to travel to a place or come back home or wait for buses and other means of transport …

Freedom to multi-task in virtual environments The ability to multi-task was reported while discussing the advantages and disadvantages of each virtual environment. Judge J38 found the freedom to do other things while waiting for the other judges to finish the task at hand to be an advantage (see Excerpt 30). Excerpt 30: J38: … I had the freedom to, as I was waiting for the others to finish, I had the freedom of being here in the house to do some other things, while I was waiting for other people, where if I had to travel some place I wouldn’t have had that freedom.

However, since it is not possible for a facilitator to know exactly what judges are actually doing in a virtual environment, being able to multi-task may not necessarily be a positive aspect, as it may cause judges to be less focused when moving into another stage of the session. For example, if a judge decides to grade her/his own students’ work during a session, this may have an impact on the judge’s appraisal of the difficulty of a test instrument.

Less fatigue in virtual environments Most of the judges agreed that virtual workshops were less tiring than F2F workshops as judges were in the comfort of their own homes and were able to take advantage of the facilities their home offered them, especially when it came to consuming snacks and beverages (see Excerpts 31 and 32). Excerpt 31 J18: Well, I won’t say that it wasn’t tiring, but it wasn’t too tiring and the fact that I was at home and I could do the whole thing in the comfort of my home was very convenient for me. I mean, I would have been exhausted if there were an equivalent workshop face-​to face. So yes, I was tired, but not too tired. Excerpt 32 J21: I think that online training was a little bit less tiring because you have the possibility of following everyone, listening to everyone, but drink something, eat something very fast, but still hear everything whereas in a face-​to-​face, um, workshop you have to be in a certain place -​behave well. So, it was, um, a little bit more comfortable than in a face-​ to-​face conversation, for me.

Judge J34 also added another advantage to the comfort of being at home: that of wearing clothes to one’s own liking (see Excerpt 33). Excerpt 33 J34: … I was in my space, in my area, at home and doing this so it was much more comfortable, wearing your own clothes, wearing your pyjamas or you know your tracksuit. Not wearing makeup as women, you know we all did that, didn’t we? Oh yeah, that’s it. J02: … whereas at home you can break, no makeup as J34 pointed out, coffee, cigarettes. So much more relaxing.

However, there were a few judges who found the virtual workshop more tiring than a F2F workshop (see Excerpt 34). Excerpt 34 J05: … comparing these two, the online workshop was more tiring because in the face-​ to-​face you could get up and stretch. I mean, this wasn’t possible, only if you would step away. But you would feel like you were missing something. …

By “step away”, judge J05 was referring to another function button (Step Away) in the virtual platform which was used by judges to indicate that they were no longer in front of their computer. In the audio medium, judges were instructed to use the function button whenever it was necessary for them to leave the virtual environment for any reason. This was not necessary in the video environment as it was easy to see which judge was not in front of their computer.

When given a choice between participating in a F2F standard setting workshop and a virtual one, the majority of the judges expressed their preference for attending a virtual one for all of the aforementioned advantages, which were best summarised by judge J38 (see Excerpt 35). Excerpt 35 J38: … I would prefer online primarily for the reasons I mentioned earlier. There’s no travelling involved, I have the freedom here to do other things while waiting. I am not wasting time back and forth and waiting for others being stuck in a room, having coffee, … Whereas face-to-face, if you’re stuck in a room for 5 hours? Let’s give ourselves an hour for lunch, but if you’re stuck in a room for five hours with the same people it’s mentally exhausting. Those are my thoughts.

6.2.5 Decision-making in virtual environments One of the questions during the focus group interviews aimed at eliciting whether there were any factors that could have affected the judges’ decision-making processes during a virtual standard setting workshop. Consequently, the final theme “decision-making” refers to the judges’ perceptions of whether either of the two e-communication media affected their decision-making ability. No judge reported that either medium affected and/or influenced their decision-making process. Judges mainly reported that they were influenced by: (1) the data presented at various stages of the workshop; (2) the discussions that took place; (3) their own teaching experience; and (4) the CEFR descriptors (see Excerpt 36). Excerpt 36 J42: Um, personally, for me, it didn’t. It was the empirical data that you showed us that influenced my original opinions. Everyone actually differs, everyone has their own angle as the way they look at things according to their own opinions. I think no one changed their opinion because they were able to look at someone actually saying something else. So no. J02: I don’t think I was affected by either medium. I think I would’ve made the same decisions if this would have been a live conference, what I was influenced by was the opinion of the other people, the statistics that you gave us. I don’t think that the medium affected me in any ways. I think I would have made the same decision otherwise. J34: Um, I wasn’t affected by any of the means and the media. No, the decision-making was upon my experience, … descriptors, how we always mark our students. It was mainly upon all that. …

Another judge also included the CEFR descriptors as playing a role in her decision-​making process (see Excerpt 37).

Excerpt 37 J36: I don’t think either media or medium affected, influenced my decision-​making, … My main influence was me, my experience, the CEFR, OK the CEFR going by their guidelines, then my experience, then the data that you gave us and then the impact data [consequences feedback] and OK the discussions as well.

Both excerpts suggest that the virtual environment in which the standard setting workshops were conducted did not adversely impact the judges’ ability to set cut scores and that the same decisions would have been made in a F2F environment. The reasons offered by judges to support their decision-making process provide further evidence of intraparticipant consistency and add internal validity to the workshop.

6.3 Summary Overall, the focus groups revealed that synchronous virtual standard setting workshops can be conducted effectively despite the various technical problems that may arise. Compared with a F2F environment, the virtual environment and the technical aspects that came with it did not introduce construct-irrelevant factors that would have posed a validity threat to the recommended cut scores. The focus groups also revealed that the majority of judges preferred the video medium, though they acknowledged that the e-communication medium itself did not affect their decision-making ability to set cut scores. Such findings were in line with the comparable cut scores set in both e-communication media, across all groups, as well as with the absence of bias towards either medium indicated by the analysis of the survey data. At the same time, the judges’ criticism of the video medium suggested that a combination of both media and a comprehensive practical guide for conducting virtual standard setting workshops would improve the efficiency of such a workshop.

Chapter 7:  Integration and discussion of findings The aim of this chapter is to integrate and discuss the main findings of the research project and its limitations. First, the findings are presented in order of the RQs and discussed through the lens of MNT. Then the limitations are presented and discussed. This research study sought to explore whether reliable and valid cut scores could be set in two synchronous e-​communication media (audio and video), whether either of the two media was more appropriate for setting performance standards, and whether cut scores generated in virtual environments were comparable with cut scores set in a F2F environment. The findings of the RQs governing this study are discussed, in turn.

7.1 Research questions 7.1.1 Research questions 1, 2, and 3 The aim of any standard setting study, whether it be conducted in a F2F or virtual environment, is to establish cut scores that are reliable so that valid inferences can be drawn from test scores. To this end, RQs 1 and 2 aimed at exploring whether the recommended cut scores set in the two virtual environments (audio and video) in this research project were reliable, valid, and comparable. In particular, research question 1 [Can reliable and valid cut scores be set in synchronous e-communication media (audio and video)? If so, to what extent?] along with its sub-questions 1.1 [How reliable and valid are the recommended virtual cut score measures?] and 1.2 [How reliable and valid are the judgements made in each e-communication medium?] aimed at investigating whether the overall cut score results set in either of the two virtual standard setting sessions were reliable and valid. The quantitative analysis of the cut scores presented in Chapter 4 revealed that cut scores set in both virtual environments exhibited internal validity. Research question 2 [How comparable are virtual cut score measures within and across virtual panels and different environments (virtual and F2F)?] along with its sub-question 2.1 [How comparable are virtual cut score measures within and across virtual panels?] aimed at investigating whether cut scores set in the two e-communication media using the same standard setting method were comparable. The quantitative analyses presented in Chapter 4 revealed that cut scores set in the virtual environment were comparable across the two test forms (Test Form A and Test Form B) and across the two e-communication media (audio and video). Research sub-question 2.2 [How comparable are virtual cut score measures with F2F cut scores?] aimed to investigate whether cut scores were also comparable between environments (virtual and F2F). The quantitative analysis revealed that virtual cut scores set in the audio medium were comparable with cut scores set in a F2F environment. In the video medium, virtual cut scores were comparable for one of the two groups. The cut score measures between Group 3 and the F2F groups differed in a statistically significant way. This statistical significance is unlikely to be attributable to any real differences in environment; rather, it reflects the idiosyncrasies of certain judges in the group who assigned cut scores at least twice as high as the next highest cut score measure (Group 1 Round 3). Research question 3 [Do judges exercise differential severity when setting cut scores in e-communication media (audio and video)? If so, to what extent?] and its sub-questions 3.1 [Do judges exhibit differential severity towards either of the two e-communication media?] and 3.2 [Do any of the virtual panels exhibit differential severity towards either of the two e-communication media?] aimed to investigate whether judges or panels exhibited any differential treatment towards either of the two e-communication media. The differential analyses at both the judge level and the panel level revealed that no judge or panel exhibited systematic bias when making judgements in either of the two e-communication media. Both findings are important as they provide validity evidence (Davis-Becker & Buckendahl, 2013; Hambleton & Pitoniak, 2006; Plake & Cizek, 2012) for cut scores set in both e-communication media. That no statistically significant bias was observed towards either medium reinforces the notion that cut scores can be set defensibly in either of the two e-communication media, despite the differences in naturalness between them.

7.1.2 Research question 4 Research question 4 [What are the judges’ perceptions of setting cut scores in each e-​communication medium (audio and video)?] aimed at investigating the judges’ attitudes towards participating in a virtual standard setting workshop. Research sub-​question 4.1 [Do either of the e-​communication media affect the judges’ perceptions and evaluations of how well they communicated? If so, to what extent?] and sub-​question 4.2 [Do either of the e-​communication media influence the judges’ decision-​making processes? If so, to what extent?] aimed at investigating whether either of the two e-​communication media presented a suitable online environment for panellists to interact with one another and
whether either of the media impacted (1) the judges’ decision-​making process directly and (2) the cut scores indirectly. The quantitative analysis of the survey data presented in Chapter 5 was in line with the findings of the cut score analysis in Chapter 4 and revealed that there was no statistically significant differential treatment towards either e-​communication medium. In other words, judges indicated that neither medium hindered their ability to engage in discussion while appraising a test item in terms of (1) its difficulty or (2) the skills and subskills that such an item aims at measuring. Nor did either medium hinder the judges’ ability to come to a shared understanding of what KSAs as well as cognitive processes a “Just Qualified B1 Candidate” needs to employ to be able to answer an item correctly. Such a finding is pertinent to any cut score study, especially one in which interaction amongst panellists is expected and encouraged. The quantitative survey data analysis also revealed that the virtual platform was easy to use and functioned well overall in both media. Had judges faced issues with the platform either because they found it difficult to use or because it did not function appropriately, the validity of the recommended cut scores would have been threatened. In Round 1, judges found that the platform did not function in the video medium as well as it did in the audio medium. However, the qualitative data collected from the open-​ended responses suggested that the item was more difficult to endorse in the video medium because of the technical problems experienced by judges. The judges’ comments in the surveys revealed that although judges felt comfortable contributing in the virtual discussions occurring in both media, they perceived overall that the video virtual environment was a more natural, life-​like environment for such a task. This comes as no surprise considering that it is much more difficult to start and sustain a professional discussion with colleagues you have never seen and/​or met before, especially when communication has to occur through an audio e-​ communication medium. The most common tendency in a professional environment would be to resort to a non-​visual communication style during the exploratory phase of a project through a telephone conversation or via email with the intent of arranging either a F2F meeting or a virtual video meeting to further discuss the project. Similarly, a standard setting workshop can be viewed as a professional meeting, one in which colleagues share their experiences and expertise with the intent of coming to a common understanding of where the boundaries of performance levels should be placed. However, the difference between a cut score meeting and any other work-​related meeting is that in the former case an ad hoc-​ team is convened, usually by the awarding body and/​or facilitator and those
participating in the meeting do not usually know the other members attending the workshop. Moreover, a standard setting meeting usually spans at least a day or two, unlike most other business meetings. Thus, it is necessary for natural communication amongst panellists to be established as quickly and efficiently as possible. Such a process seems to occur more naturally when judges can see each other. This is in line with MNT, as are the judges’ comments indicating their preference towards the video medium, since such a medium conveys more naturalness elements than the audio medium. The video medium fosters (1) co-location, albeit virtual; (2) synchronicity; (3) the ability to employ and detect facial expressions; (4) the ability to employ and detect body language, albeit limited; and (5) the ability to express oneself through speech and listen to others (Kock, 2005). As for the judges’ decision-making process of setting cut scores, the focus group data analysis presented in Chapter 6 revealed that it was not affected by the e-communication media. Judges rationalised their decision-making process by explaining that it was a by-product of their virtual group discussions, the consequences feedback presented, their personal experience, and the CEFR descriptors. One of the judges, who also had experience with F2F cut score sessions, even emphasised that her recommended cut score would have been the same had the study been conducted in the same physical environment. The reasons judges provided for their cut score decisions are in line with the standard setting judge decision-making processes described in the current literature (Cizek, 2012a; Skorupski, 2012; Zieky, Perie, & Livingston, 2008). These findings also corroborate MNT predictions that when e-communication media incorporate, even to varying degrees, as many of the media naturalness elements as possible, there is a decrease in the cognitive burden imposed on panellists as well as a decrease in communication ambiguity, and an increase in physiological arousal. Research sub-question 4.3 [What do judges claim are the advantages and disadvantages of each e-communication medium (audio and video)?] aimed at investigating what judges conceived were the advantages or disadvantages of setting performance standards through each medium. The findings of the open-ended survey questions and the focus group interviews revealed that the reported disadvantages of one medium were deemed the advantages of the other medium, and vice versa. The lack of visual stimuli in the audio medium caused confusion amongst judges, as it was difficult for them to discern who was speaking or who was actually engaged in what the speaker was discussing. The issue of disengagement has been raised in the literature on virtual environments
(Mackay, 2007) and has been highlighted as something for facilitators to monitor in virtual standard setting environments (Katz & Tannenbaum, 2014). Understandably, it is difficult for panellists to assess whether other panellists have disengaged from a conversation taking place in the audio e-communication medium. This, in turn, also placed an extra cognitive strain on the judges, making it difficult to concentrate when one can see only a black screen. Such a finding suggests that since participants are used to engaging both their senses (sight and sound) in their everyday lives, when visual stimuli are missing it is more difficult for a verbal message transmitted in a synchronous environment to be quickly and efficiently decoded, especially when the speaker and listener have never conversed before. According to MNT, when many elements of F2F communication such as facial expressions, gestures, and body language are missing, the likelihood of communication ambiguity increases, as participants will resort to filling in such ‘gaps’ (Kock, 2005) with their own notions, preconceptions, and beliefs. Thus, it comes as no surprise that the majority of the panellists expressed their preference towards the video medium, as the added visual display offered panellists the opportunity to receive and evaluate other panellists’ non-verbal communication cues. Non-verbal communication cues such as facial gestures and posture made the exchange of information between the panellists easier, as that type of communication seemed to have created a sense of being physically present in the same meeting room (co-location), thus fostering understanding and even encouraging some of the panellists to interact more (physiological arousal). Nonetheless, the added value of judges being able to see one another during interaction must be evaluated against the negative aspects associated with having a visual display throughout a standard setting workshop. During times when there was no interaction amongst participants, the visual display caused distraction and made it harder for some panellists to concentrate, a finding also observed by Katz, Tannenbaum, and Kannan (2009). This finding is important as it suggests that having participants keep their cameras on throughout a virtual workshop may not always be beneficial, as distraction can also lead to a heightened sense of self-consciousness. For example, some panellists reported that the visual display made them aware of their own physical surroundings and their own non-verbal communication, as the panellists could also see themselves in a video box as it was projected to other panellists. Such findings were expected as the video medium is a rich medium, and the video box added an element of unnaturalness since in F2F communication speakers do not see themselves while interacting. Hence, when panellists became aware of their virtual presence, greater cognitive strain was imposed to process the additional communication stimuli, leading to an
information overload (Kock, 2010), which may have been perceived by panellists as a distraction, one that may have decreased their physiological arousal and impeded their willingness to participate more. Research sub-question 4.4 [How do judges compare their virtual standard setting experience with a similar face-to-face experience?] aimed at eliciting from experienced judges, who had participated in either a F2F standard setting workshop or a F2F training programme, their perceptions regarding similarities and differences between the virtual environment and the F2F environment. The judges reported both advantages and disadvantages to participating in a synchronous virtual cut score study when comparing their experience with a similar F2F experience. In comparison to a F2F meeting, judges reported saving time in the virtual environment, as there was no need to travel to a particular venue. The time needed to travel to a standard setting venue can range from a few hours to even half a day if a panellist is attending an international workshop and needs to travel by air. Such a gain in time may make otherwise reluctant participants more willing to participate in a virtual standard setting workshop, as no travel time is involved and at the end of each session they can return to the comforts of their own homes. Additionally, attending a synchronous virtual standard setting workshop was deemed less tiring than attending a F2F workshop and/or a seminar lasting approximately the same duration, which may be attributed to the fact that the judges were in the comfort of their own homes. Being at home may predispose someone to feel and/or believe something is less tiring than it may actually be, especially when one is able to wear clothes that may not be appropriate for a F2F setting. What is more, virtual sessions provide participants with more freedom than that provided in F2F settings. Judges can easily step away from their computers and grab a snack or a beverage, or even stretch their legs, at any time they are engaged in an individual activity. On the other hand, in F2F workshops, standing up and leaving the meeting room to get a cup of coffee may be frowned upon by facilitators, especially as it may encourage other participants to do the same and, thus, cause a delay in the whole process. In F2F workshops, beverages are usually provided during scheduled breaks and/or are accessible to participants when they have finished their individual tasks. However, the fact that some judges felt that they could not or should not get up from in front of their computers emphasises the importance of (1) setting ground rules for judge behaviour in a virtual environment and (2) implementing a system for allowing judges to monitor how other judges are progressing with the task at hand (e.g. e-polls). Nonetheless, the virtual workshop did not provide panellists with an opportunity to engage in social interaction beyond the cut score activity. Apart
from each participant briefly introducing oneself at the beginning of the first session, judges were not allocated a private meeting space in the platform where they could converse with one another and engage in small talk. No breakout rooms were set up in the platform at the time of the virtual study (2014) as the concept of engaging in such virtual rooms was relatively new to judges. In a F2F standard setting, judges usually engage in small talk during scheduled breaks, especially long ones such as lunch and dinner when the workshop spans one or several days. Katz, Tannenbaum, & Kannan (2009) claim that research is needed to assess any benefits of incorporating such opportunities in a virtual workshop by creating breakout rooms for judges to exchange ideas and information or to meet during breaks. However, breakout rooms may need to be monitored, adding complexity and additional resources. At the same time, the lack of such social interaction was deemed by one of the judges as an advantage, as no digressions occurred throughout the workshop, making the virtual meeting very professional. Nowadays, newer platforms provide a virtual space for judges to meet up in. For example, the Wonder™ virtual platform (https://www.wonder.me) offers a ‘gathering space’ which judges can use during breaks to engage in small talk. The platform allows judges to converse with each other at an individual level (one-to-one) or at the group level. Interestingly, judges expressed conflicting views when discussing the amount of interaction generated in both e-communication media, especially when compared with what may have been generated in a F2F environment. Many judges felt that less discussion took place in the audio medium compared to the video medium, despite the same amount of discussion being generated in both e-communication media. Some judges were even surprised to learn that in both media discussion time was kept to approximately one hour, and stated that more discussion would have occurred in a F2F environment. There are three possible explanations why the amount of discussion generated in a virtual environment might differ from that in a F2F environment: (1) social loafing; (2) time constraints; and (3) familiarity with the task. In a virtual environment, more participants may resort to social loafing, a reduction in individual effort in a group activity (Karau & Williams, 1993; Williams & Karau, 1991; Piezon & Feree, 2008). The fact that panellists are not physically present in the same room may increase the number of social loafers in the group, especially in the audio medium, as it is difficult for judges and facilitators to discern whether a judge actually has nothing to add to a discussion or whether the judge has consciously decided not to participate in the discussion. Such loafing is harder in a F2F environment, as a quick look in a participant’s direction from either another panellist and/or the facilitator may prompt an otherwise reluctant judge to
participate in the discussion. The video medium may have acted as a deterrent to social loafing, giving the judges the impression that more discussion was taking place in the video medium, which in fact was not the case. However, there could be another reason why judges felt that more interaction was taking place in the video medium. It seems that the existence of visual stimuli increased the excitement amongst participants and made them feel more engaged (physiological arousal) in the video medium. This probably made them believe that more discussion was generated in the video medium, despite the fact that both Round 1 discussions were restricted to one hour in both e-​communication media. Perhaps one of the most unexpected findings was that the virtual environment added a heightened awareness of time constraints amongst panellists. They were aware that by a specific time the workshop was to conclude, which may have contributed to their perceptions that more discussion would have occurred in a F2F environment. This perception may also be attributed to the fact that the panellists were looking at a screen on which the clock was situated in the bottom left-​hand corner. In a F2F setting, it would be considered impolite for panellists to be looking at their watch every five minutes as that type of behaviour would suggest that they wished to leave. Finally, the amount of discussion generated in the virtual environment may also be attributed to the judges’ familiarity with the standard setting task and/​or the virtual platform. Judges felt that such familiarity made them more prone to participating in the discussion in their second session. This may be due to the fact that judges already knew the format of the session and may have felt more comfortable with the virtual environment. In the virtual environment, judges reported encountering different types of technical problems that would not have occurred in a F2F environment. However, most, if not all, of the problems that occurred during the virtual workshops were resolved quickly, and many were caused by the judges rather than by the platform. During the platform training sessions, which were conducted prior to the beginning of the virtual workshops, any technical problem that a judge experienced at that time was resolved. The main problems encountered were related to issues with the judges’ computer hardware and/​or software. For example, some judges had multiple sound and image devices that needed to be set up so that they functioned appropriately in the virtual platform and, in some instances, image devices were being shared with other programmes such as Skype. However, none of these problems were insurmountable as the IT expert was able to remotely access the judges’ computers and resolve any technical issues. On the day of the workshop, the most frequently occurring technical problem faced by judges had to do with poor sound quality experienced at times. In the
majority of the cases, this was due to judges entering the virtual platform twice or setting their microphone and/or speaker volumes too high, resulting in everybody hearing an echo in the platform. Some judges also had background noises, such as a radio playing or a pet barking, being detected by their microphones. Other problems were similar to the ones observed during the platform training sessions, but reappeared because judges changed computers during a virtual session or across the virtual sessions, making it necessary to set up their computers again. The IT expert also reported that some judges may not have been as computer literate as they claimed in their background questionnaire. To make matters worse, the judges who changed computers during the study did not always have administrative rights on the other computers, making it impossible to run the remote access programme. In these cases, the IT expert resorted to individual chat or a telephone call to resolve the problem. Nonetheless, problems such as cameras freezing or a delay in sound transmission were mostly attributed to slow Internet connections and/or old computers. The fact that some judges reported that they did not experience any technical problems suggests that a synchronous virtual standard setting workshop is technologically feasible and can run with a minimum of technical issues as long as the participants have up-to-date equipment, fast and stable Internet connections (fibre optic broadband), and adequate computer literacy.

7.1.3 Research question 5 Research question 5 [Which e-communication medium (audio or audio-visual) is more appropriate for setting a cut score on a receptive language test in a synchronous virtual environment?] aimed to investigate whether one particular e-communication medium was more suitable for setting cut scores. On the media naturalness scale (Kock, 2004), the two e-communication media selected for the study were on either side of the F2F environment (see section 3.2.3.1, Figure 3.3). In the video medium, the platform with the visual display enabled can be considered a super-rich medium, as participants were also able to communicate with one another through private chats and could also see their own visual displays. Such communication stimuli exceed what is received during F2F communication and are thus, according to MNT, expected to produce a cognitive overload. Similarly, the audio medium imposed a cognitive burden on the participants because it suppressed their ability to see one another. Despite the deviations from the F2F environment, the overall findings suggest that reliable cut scores can be set and that valid inferences concerning test score interpretations can be drawn in both e-communication media. Such findings are very encouraging as they enable facilitators to select the virtual medium that best (1) suits the needs of a cut score study, (2) accommodates the technical and pragmatic limitations of their geographical location and/or that of the panellists’, and (3) caters for panellists’ personal reservations with regard to broadcasting video.

7.2 Limitations The primary limitation of the study is that the judges, and the judge panels for that matter, were homogeneous in relation to their educational background and experience. While such homogeneity was sought from the outset of the study, as a purposive sampling procedure was employed to recruit participants, the results of the study may have been different had some of the panels been more heterogeneous. As it is becoming more common for standard setting panels to comprise both educators and non-educators so that all stakeholders’ interests are represented in a cut score study (Loomis, 2012), less homogeneous panels may have resulted in different cut scores. A further limitation of the study is related to the nature and level of the test instruments used in the study. The test instruments employed an MC format to measure test takers’ grammar, vocabulary, and reading receptive skills. As no other standard setting method was used in the study, the findings cannot be generalised to other standard setting methods or to other test instruments employing other task formats. What is more, setting a minimum CEFR B1 level cut score may be deemed an easier task for panellists to perform than setting a C2 level cut score. The B1 level contains a narrower range of KSAs that a Just Qualified B1 Candidate needs to possess, compared to the KSAs a Just Qualified C2 Candidate needs to display. The minimum range of knowledge and skills required of a Just Qualified C2 test taker has yet to be agreed on. Another limitation is that the costs associated with conducting a F2F panel made it unfeasible to conduct F2F sessions using the same panels, even though doing so may have provided further insight into the comparability of the cut scores and panellists’ perceptions in all three media.

7.3 Summary The findings revealed that both e-communication media were appropriate for conducting a synchronous virtual standard setting cut score study, as cut scores were reliable and comparable within and across panels and media. Even though no statistically significant differences were observed between the means of the cut score measures, the qualitative data suggest that the panellists as a group preferred the video medium. This finding is not surprising when examined through MNT, as the video medium incorporated more communication naturalness elements despite the e-communication medium being super-rich. Super-rich media allow participants to engage in both group discussions and private discussions. From the panellists’ perspective, the video e-communication medium may be closer to the F2F environment than the audio e-communication medium.

Chapter 8: Implications, future research, and conclusion This chapter discusses the implications of the current study, outlining its significance and contribution to the standard setting field. It then offers guidance for conducting a virtual standard setting cut score study, provides a virtual standard setting platform framework combining both e-communication media, and concludes by recommending future research.

8.1 Significance and contribution to the field This study has addressed a topic not previously investigated in the field of standard setting. This book features the first in-​depth examination and/​or comparison of two e-​communication media (audio and video) in a synchronous virtual environment used in a cut score study. The study provides novel insights for standard setting practitioners, namely that reliable, valid, and comparable cut scores can be set in synchronous virtual environments through both audio and video media. The study has generated insights into the feasibility of conducting synchronous virtual standard setting workshops that resemble F2F standard setting ones and has explored panellists’ attitudes and preference towards both e-​communication media in the virtual environment. As the virtual workshops replicated a F2F study conducted in 2011 (Kollias, 2012), the recommended cut scores in the virtual environment were also compared to the F2F one. The analyses suggest that the cut scores set in the two environments (F2F and virtual) were also comparable. However, idiosyncrasies attributed to panel composition could not be negated conclusively as one of the two virtual groups yielded statistically significant different cut score measures in the video medium. The study employed Rasch measurement theory (RMT) to place two equated shortened versions of the GVR section of the Basic Communication Certificate in English (BCCETM) examination, measuring grammar, vocabulary and reading abilities of test takers at a CEFR B1 level, on the same latent scale. This allowed for direct comparisons of cut scores to be made within and across panels. It also employed both classical test theory (CTT) analysis and RMT analysis to investigate the internal validity of the recommended cut scores, as well as pairwise interaction analysis to investigate the interactions between the media and cut scores. Analysing virtual cut scores through the many-​faceted Rasch measurement (MFRM) model allows for a more in-​depth quantitative investigation into whether an e-​ communication medium affects panellists’
cut scores. The thorough analysis of cut score validity presented by this study, employing both RMT and CTT to evaluate internal validity, has methodological implications for evaluating virtual standard setting studies. The fact that both RMT and CTT yielded the same results suggests that both types of measurement theory can be used effectively to investigate internal validity evidence. Using Kane’s framework (Kane, 2001) for analysing standard setting studies as a validation framework, this study shows that cut score measures set in both virtual environments (audio and video) exhibited high internal and procedural validity, thus supporting the view that Kane’s framework (see section 2.2.2) can be applied to virtual cut score studies as well. Furthermore, the results of the project support the selection of a virtual communication medium, which is particularly relevant for the field of language testing and assessment (LTA), where performance standards need to be re-evaluated at regular intervals. Therefore, the finding that virtual cut scores are comparable with F2F ones is of great importance, as it provides international and local awarding bodies with the opportunity to conduct (more) standard setting cut score studies without incurring most of the logistic costs (i.e., travel, accommodation, catering) associated with F2F standard setting workshops. The study also yields insights into the selection of an appropriate e-communication medium. The facilitator needs to bear in mind that an inappropriate choice of medium can have an impact on the cognitive strain imposed on the panellists, can cause miscommunication, and can decrease panellists’ motivation to fully engage in discussions. The medium to be used during virtual standard setting needs to simulate the F2F environment as closely as possible, though with current technology this is not yet fully achievable. Therefore, the e-communication medium should incorporate not just one of the media naturalness elements, as MNT suggests, but at least three of the five elements, namely synchronicity, the ability to convey facial expressions, and speech, at least in some stages of the workshop. This study has illustrated that although the video medium imposed cognitive strain, coupled with a sense of self-consciousness, on the participants, it also promoted physiological arousal, making panellists feel more content with their contributions to the discussions. Generally speaking, panellists perceived the video medium as being more conducive to a virtual standard setting workshop.
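As a concrete illustration of the kind of judgement data underlying these analyses, the sketch below shows one simple, CTT-style way of aggregating Yes/No Angoff ratings into a panel cut score: each judge’s cut score is the number of items a just-qualified candidate is expected to answer correctly, and the panel cut score is the mean across judges, reported with its standard error. The ratings shown are invented for illustration, and the sketch is not a substitute for the MFRM analysis reported in Chapter 4.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical Yes/No Angoff ratings for a 10-item test: 1 means the judge
# expects a just-qualified candidate to answer the item correctly, 0 means not.
ratings = {
    "J01": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "J02": [1, 0, 0, 1, 1, 1, 0, 0, 1, 1],
    "J03": [1, 1, 1, 1, 0, 1, 1, 0, 0, 1],
}

# Each judge's cut score is the number of items endorsed with "Yes".
judge_cut_scores = {judge: sum(items) for judge, items in ratings.items()}

panel_cut_score = mean(judge_cut_scores.values())
se_cut_score = stdev(judge_cut_scores.values()) / sqrt(len(judge_cut_scores))

print(judge_cut_scores)            # {'J01': 7, 'J02': 6, 'J03': 7}
print(round(panel_cut_score, 2))   # 6.67
print(round(se_cut_score, 2))      # 0.33
```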

8.2 Guidance for conducting synchronous virtual cut score studies The following section outlines practical guidelines that may assist standard setting practitioners in their selection of an e-communication medium and in their administration of a synchronous virtual standard setting workshop. The guidelines address the main aspects of a virtual standard setting workshop.

Demands for facilitators and/or co-facilitators Facilitating a virtual standard setting workshop places certain demands on facilitators and co-facilitators. They need to (1) establish netiquette, (2) multi-task, (3) engage judges throughout the workshop, (4) be familiar with the platform and its tools, and (5) understand the nature of technical issues. Facilitators and co-facilitators should remind judges to use the established netiquette at all times; otherwise, judges will be speaking over one another. Multi-tasking is a given for (co-)facilitators as they will need to watch the platform interface to see whether any judge has a question, and when judges do have questions, they will either respond in a private chat or provide the answer to the group. Throughout the workshop, facilitators will be sharing screens and/or pods with the judges, either providing them with information for the next task or showing them their ratings. Next, (co-)facilitators also need to be able to engage judges to ensure that they have not disengaged at any point during the workshop. It is important that both facilitators and co-facilitators are familiar with the platform and its tools so that they are able to assist and/or train judges on how to use them appropriately. Finally, (co-)facilitators should be able to resolve and/or understand some of the technical issues that a judge may face. For example, when judges do not have sound, their speakers might not be active, or they may simply need to re-enter the platform. Resolving major technical issues should be left to the IT expert, who should be online for the duration of the workshop.

Establishing a virtual standard setting netiquette To ensure that an appropriate virtual environment is created to facilitate a cut score study, especially one that will allow productive discussions to occur, the facilitator will need to establish certain guidelines that should be followed by all participants. For example, participants should be instructed to use the virtual platform’s status bar to indicate whether they would like to ask a question and/​or make a comment. In this way, participants will not be talking over one another, which may occur when there is a delay in the transmission of sound.

Selecting a suitable virtual platform The most appropriate platform for a virtual standard setting workshop is one that allows facilitators to share screens, upload materials and links, and present PowerPoint presentations and, at the same time, allows participants to download materials, access links, and communicate through audio, video, and chat. The platform should also have a status bar with functions such as “Agree”, “Disagree”, “Raise Hand”, and “Step Away” to encourage engagement amongst participants and foster a conducive collaborative environment. The platform should be trialled prior to the study, especially with the IT expert, who must be present (remotely) for the whole duration of the study. If the platform allows participants to check their equipment and/or Internet connection (diagnostic check), participants should send the results to the facilitator. The platform should also have a recording capability, allowing (1) the facilitator to revisit certain parts of a session should the need arise, and (2) the discussion to be transcribed at a later stage. To ensure that the General Data Protection Regulation (GDPR) enforced in a particular country and/or region is adhered to, facilitators must investigate where the data and/or recordings uploaded onto the platform are stored before making their final selection.
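Where a platform offers no built-in diagnostic check, the facilitator could circulate a very simple pre-workshop script of the kind sketched below. It uses only the Python standard library to time a few TCP handshakes to a placeholder server name (an assumption; the platform’s real host would be substituted) and is only a rough proxy for the fuller bandwidth and device checks a commercial platform would perform.

```python
import socket
import time

HOST = "meeting.example.org"  # placeholder: replace with the platform's server
PORT = 443
ATTEMPTS = 5


def tcp_round_trip_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Time a single TCP connection to the given host and port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000


if __name__ == "__main__":
    samples = []
    for _ in range(ATTEMPTS):
        try:
            samples.append(tcp_round_trip_ms(HOST, PORT))
        except OSError as exc:
            print(f"Connection attempt failed: {exc}")
    if samples:
        average = sum(samples) / len(samples)
        print(f"Average round-trip time over {len(samples)} attempts: {average:.0f} ms")
        # Rough rule of thumb: sustained latency well above ~150 ms is likely to be
        # noticeable in synchronous audio/video discussion.
```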

Selecting an appropriate medium for the workshop MNT provides further insights into the selection of the e-communication medium. A super-rich medium provides more elements of communication media naturalness than would be found in a F2F environment. For example, in the virtual environment judges could also have private chats with other participants, and in the video medium they could also see themselves. Such additional elements can distract judges from the task at hand and place a burden on the facilitator, especially when the facilitator is trying to share important technical information with the panellists. Given the limitations that currently exist in e-communication media, it may be best if a combination of a super-rich medium and a less rich medium is used at different stages of a virtual standard setting workshop. Both media should incorporate at least two media naturalness elements (synchronicity and speech), whereby the video medium may be used during the training and discussion stages, while the audio medium could be used for the other stages (see Table 8.1). The qualitative findings from this study suggest that integrating both media (audio and video) when conducting a virtual standard setting workshop may be an effective way to reduce the number of technical problems that might arise from the increased bandwidth needed in the video medium. Participants can either pause their cameras (platform permitting) or switch them off when there is no group discussion underway. In this way, participants may not become overloaded with information when working on individual activities. More importantly, participants need to mute their microphones and speakers when working on individual tasks so that they do not disturb the rest of the participants. As for the facilitator, it may be best to keep the video camera on during the whole workshop and only mute the microphone when participants are working on their individual tasks.
It is important for the participants to see the facilitator throughout the workshop and to know that they can ask questions at any time. Table 8.1 illustrates which medium may be more appropriate for participants during the different stages of a virtual standard setting workshop.

Table 8.1 Virtual standard setting platform framework

Stage | Description of stage | Medium for participants | Platform requirements (participants) | Medium for facilitator | Platform requirements (facilitator)
Orientation | Introductions | Video | — | Video | —
Orientation | Familiarisation activities | Audio | Speakers & microphone muted (video paused) | Video | Microphone muted
Orientation | Feedback & discussion on familiarisation activities | Video | — | Video | —
Method training | Training in the method | Video | — | Video | —
Method training | Discussion on training items | Video | — | Video | —
Round 1 | Round 1 ratings | Audio | Speakers & microphone muted (video paused) | Video | Microphone muted
Round 1 | Round 1 discussion | Video | — | Video | —
Round 2 | Round 2 ratings | Audio | Speakers & microphone muted (video paused) | Video | Microphone muted
Round 2 | Round 2 discussion | Video | — | Video | —
Round 3 (if applicable) | Round 3 ratings | Audio | Speakers & microphone muted (video paused) | Video | Microphone muted
Round 3 (if applicable) | Round 3 discussion | Video | — | Video | —
Wrap up | — | Video | — | Video | —
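For facilitators who script their session plans, the framework in Table 8.1 can also be captured in a simple, machine-readable form. The sketch below is one possible encoding; the structure and function name are assumptions for illustration rather than features of any particular platform.

```python
# Participant medium per workshop stage, following Table 8.1. The "audio" stages
# are individual tasks: cameras paused, speakers and microphones muted.
STAGE_MEDIUM = {
    "Introductions": "video",
    "Familiarisation activities": "audio",
    "Feedback & discussion on familiarisation activities": "video",
    "Training in the method": "video",
    "Discussion on training items": "video",
    "Round 1 ratings": "audio",
    "Round 1 discussion": "video",
    "Round 2 ratings": "audio",
    "Round 2 discussion": "video",
    "Round 3 ratings": "audio",
    "Round 3 discussion": "video",
    "Wrap up": "video",
}


def participant_settings(stage: str) -> dict:
    """Return suggested participant device settings for a given stage."""
    medium = STAGE_MEDIUM[stage]
    return {
        "camera_on": medium == "video",
        "microphone_muted": medium == "audio",
        "speakers_muted": medium == "audio",
    }


print(participant_settings("Round 1 ratings"))
# {'camera_on': False, 'microphone_muted': True, 'speakers_muted': True}
```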

Recruiting online participants During the participant recruitment process, the invitation letter/email inviting participants to the study should emphasise the minimum hardware and software requirements needed for the selected virtual platform as well as the minimum equipment needed to participate in the study, such as a pair of headphones with a microphone (unless provided by the facilitator) and direct Internet access (a cable directly connecting the computer to the router). Participants should be aware that their personal computers (PCs) may need to be accessed by an IT expert in cases where a participant cannot log into the platform and/or cannot access certain materials. Consequently, participants must also have administrative rights to the PC that they will be using on the day of the workshop so that remote assistance can be provided. Such requirements may imply that participants use their own PCs during the workshop rather than one belonging to the organisation they work for. It is important to collect information on the participants’ computer hardware and software as well as on their own computer skills. In this way, the IT expert can anticipate specific problems prior to the platform training. If the workshop is being recorded, facilitators must request permission from the participants prior to the workshop, and when the material to be shared with participants is secure, participants should sign NDAs before joining a workshop.

Training in the virtual platform Prior to participating in a virtual standard setting workshop, participants need to be trained on how to use the features of the virtual platform. This provides the participants with the opportunity to become accustomed to the platform features and it allows the IT expert to resolve any technical problems that may occur. The training session also allows the facilitator and/​or IT expert to evaluate whether the participant has the appropriate equipment to participate in the study. It should be emphasised to participants that the training session should occur on the computer that they will be using during the workshop. Participants should also have the necessary equipment during the training session such as a pair of headphones with a microphone so that the devices can be set up properly as well as direct access (i.e. through an ethernet cord) to the Internet as that may solve bandwidth issues.

Uploading materials All materials and links should be uploaded onto the platform prior to the study so that on the day of the workshop the facilitator and/or the co-facilitator will only need to access and/or share the material. A schedule of when materials are to be accessed, uploaded, and/or shared at each stage is helpful for reminding the facilitator of what needs to be done and when (see Appendix E).

Monitoring progress and engaging judges Online e-polls should be used within the platform, asking participants to indicate when they have finished a task, so that both the facilitator and the participants are aware of whether and/or how many other participants have finished an individual activity. Consequently, everyone will have a sense of how much time is still needed for everyone to finish an assigned activity. E-polls also prompt judges to respond, allowing the facilitator and the co-facilitator to identify who may have disengaged from the activity.
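A minimal sketch of how a co-facilitator might tally e-poll responses to see at a glance who has finished an individual task and who has not yet responded; the judge identifiers and poll structure are invented for illustration.

```python
# Hypothetical e-poll responses: True = "I have finished", False = "still working";
# judges with no entry have not responded and may have disengaged.
poll_responses = {"J03": True, "J09": True, "J15": False, "J18": True}
panel = ["J03", "J05", "J09", "J15", "J18", "J21"]

finished = [j for j in panel if poll_responses.get(j) is True]
still_working = [j for j in panel if poll_responses.get(j) is False]
no_response = [j for j in panel if j not in poll_responses]

print(f"Finished: {finished}")                   # ['J03', 'J09', 'J18']
print(f"Still working: {still_working}")         # ['J15']
print(f"No response (check in): {no_response}")  # ['J05', 'J21']
```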

8.3 Recommendations for future research

This study, along with the previous studies (see section 2.5), has contributed to opening the “black box” of virtual standard setting. However, three main areas remain under-researched: test security, panellists’ judgemental processes, and the breadth and depth of panellist discussion. In this study, test security was not an issue because the instruments used were retired test versions. When test security is an issue, however, standard setting practitioners can reduce the possibility of test items being exposed by implementing best practices from the fields of online testing and online streaming. For example, password-protected platforms and surveys help prevent unauthorised persons from accessing or viewing sensitive material. Apart from being password protected, PDF files can also be programmed to have an expiration date (Katz, Tannenbaum, & Kannan, 2009) or to restrict functions such as printing. Further, screen recording can be prevented when Digital Rights Management (DRM) technology is used to transmit encrypted video content. DRM coupled with anti-capture software, which blocks recording software from running, as well as forensic watermarking, an invisible or small digital watermark, may deter end users from recording either through software and/or other recording devices. Currently, it is impossible to fully prevent screen recording with devices such as mobile phones and cameras. IT researchers may wish to investigate whether it is feasible to manipulate (1) the amount of background lighting in video content or (2) the frame rate of video content to render screen recording futile.
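To make the password-protection point concrete, the sketch below encrypts a PDF with the third-party pypdf library; the file names and passwords are placeholders, and features such as expiration dates, print restrictions, watermarking, or DRM-protected streaming would require additional tooling beyond this minimal example.

from pypdf import PdfReader, PdfWriter

def password_protect(src: str, dst: str, user_pw: str, owner_pw: str) -> None:
    """Copy a PDF and encrypt it so that a password is needed to open it."""
    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    # Judges open the file with user_pw; owner_pw retains full permissions.
    writer.encrypt(user_password=user_pw, owner_password=owner_pw)
    with open(dst, "wb") as f:
        writer.write(f)

# Hypothetical usage: protect a retired test form before uploading it to the platform
password_protect("form_a.pdf", "form_a_protected.pdf", "judge-access", "facilitator-only")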


Furthermore, conducting standard setting in a virtual environment facilitates research that cannot be conducted in a F2F environment. For example, in a F2F setting, it is impossible for a researcher to collect concurrent verbal reports on panellists’ cognitive processes and experiences while they are rating items in the same room with other panellists. However, in a virtual standard setting workshop, the facilitator can assign each panellist to a unique virtual room and record their cognitive processes (virtual platform allowing) via think-​aloud protocols. Such data may provide new insight into how panellists perform their rating task. Finally, research into the amount of discussion generated in a virtual environment compared with that in a F2F environment is needed to fully understand the impact that a virtual environment may have on panellist discussion. A comparability study using the same panellists in both environments and both e-​communication media could be conducted in which the discussion time would not be restricted. Such a project would lend itself to exploring the impact that an environment and/​or medium can have on the breadth and depth of a standard setting discussion.

8.4 Concluding remarks

The purpose of the research in this book was to investigate whether it was feasible to conduct a standard setting workshop, one replicating a F2F workshop lasting approximately 6 to 6½ hours, in a synchronous virtual environment. At the time the research was conducted, no empirical synchronous virtual standard setting studies had been reported in the literature, and to date no empirical standard setting study has compared two different e-communication media (audio and video). As research in virtual standard setting is still in its infancy, despite having become more widespread amongst practitioners due to Covid-19, this research attempts to address some of the many gaps in the current literature. The main gaps are (1) whether virtual cut scores are reliable, valid, and comparable within and across panels and e-communication media, (2) whether either of the e-communication media (audio or video) affected the judges’ decision-making processes and/or their perceptions and evaluations of each medium, and (3) how to interpret the judges’ perceptions of each e-communication medium. To this end, this research proposes several frameworks that may assist practitioners in conducting virtual standard setting workshops. First, two methodological frameworks have been proposed for analysing and evaluating multiple-panel virtual cut scores by equating and anchoring test instruments to their respective difficulties and by using both classical test theory (CTT) and Rasch measurement theory (RMT). This, in turn, adds to the limited standard setting literature using the many-facet Rasch measurement (MFRM) model to analyse and evaluate cut scores. Next, a practical framework has been suggested for conducting virtual cut score studies, providing guidance ranging from choosing an appropriate platform, recruiting online participants, and establishing a virtual standard setting etiquette (netiquette), to a virtual standard setting platform framework detailing which e-communication medium should be used at each stage of the study. Finally, a theoretical framework for interpreting qualitative data collected from judges has been proposed, one that draws on the principles of media naturalness theory (MNT). It is hoped that this book will be a valuable tool for experienced and novice standard setting practitioners, policy makers, and educators interested in conducting and/or organising virtual standard setting workshops. This book is probably the first of many research studies, as virtual standard setting is likely to become common practice due to the many advantages it has to offer.

Appendices

Appendix A CEFR verification activity A (Key)

Table A1  CEFR global descriptors

Level  Descriptor
B2  GL1. Can understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.
B1  GL2. Can describe experiences and events, dreams, hopes & ambitions and briefly give reasons and explanations for opinions and plans.
A2  GL3. Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g., very basic personal and family information, shopping, local geography, employment).
B1  GL4. Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc.
B2  GL5. Can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible without strain for either party.
A2  GL6. Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters.
B2  GL7. Can produce clear, detailed text on a wide range of subjects and explain a viewpoint on a topical issue giving the advantages and disadvantages of various options.
B1  GL8. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken.
A2  GL9. Can describe in simple terms aspects of his/her background, immediate environment and matters in areas of immediate need.
B1  GL10. Can produce simple connected text on topics, which are familiar, or of personal interest.


Table A2  DIALANG grammar descriptors

Level  Descriptor
A2  G1. Learners can use the most frequent verbs in the basic tenses (present and simple past).
B1  G2. Learners know the basic word order well.
B2  G3. Learners know the comparison of adverbs.
B1  G4. Learners know the comparison of adjectives.
B1  G5. Learners can use e.g., some conjunctions, auxiliary verbs and frequent adverbs systematically.
B2  G6. Learners have a good command of most adverbs.
A2  G7. Learners can use verb forms in other than the basic tenses only in fixed phrases, e.g. I would like to have coffee.
B2  G8. Learners have a good control of all basic grammatical structures.
A2  G9. Learners use some simple structures correctly.
B2  G10. Learners can also use pronouns quite systematically.
B1  G11. Learners know the basic principles of constructing passives.
B2  G12. Learners can use articles for the most part appropriately.
B2  G13. Learners show some ability to vary their usage through paraphrases.
B1  G14. Learners can formulate the most frequent expressions and clause types accurately.


Table A3  DIALANG vocabulary descriptors

Level  Descriptor
B1  V1. Learners have a good command of vocabulary related to everyday situations.
B2  V2. Learners can express meanings by adding prefixes and affixes to familiar words, e.g. to re-send a message; to have an after-dinner nap.
B1  V3. Learners know a number of principles of word formation, e.g. to agree - agreeable; a verification - to verify.
B1  V4. Learners can use the synonyms of some very common words, e.g. nice - kind (person).
A2  V5. Learners know and can use everyday vocabulary related to a range of basic personal and familiar situations.
B2  V6. Learners can recognise words with more than one meaning, e.g. back a car / back a proposal.
B1  V7. Learners can use some frequent collocations, e.g. a tall girl - a high mountain.
A2  V8. Learners know the opposites of very frequent words and can recognise synonyms for some words.
B2  V9. Learners recognise and know how to use the basic word formation principles, e.g. note - notify; fur - furry; accident - accidental; paint - painting.
B1  V10. Learners can use a range of prefixes to produce opposites to basic words.
B2  V11. Learners know also a number of frequently used idioms.
A2  V12. Learners recognise some basic principles of word formation, and can also apply some of them, e.g. to drive - a driver.
B2  V13. Learners can use the synonyms and opposites of many common words in different contexts.


Table A4  CEFR reading descriptors

Level  Descriptor
B2  R1. Can read with a large degree of independence, adapting style and speed of reading to different texts and purposes, and using appropriate reference sources selectively.
B1  R2. Can identify the main conclusions in clearly signalled argumentative texts.
A2  R3. Can understand short, simple texts on familiar matters of a concrete type which consist of high frequency every day or job-related language.
B2  R4. Can read correspondence relating to his/her field of interest and readily grasp the essential meaning.
B1  R5. Can understand the description of events, feelings and wishes in personal letters well enough to correspond regularly with a pen friend.
A2  R6. Can understand basic types of standard routine letters and faxes (enquiries, orders, letters of confirmation etc.) on familiar topics.
A2  R7. Can understand short simple personal letters.
B2  R8. Can scan quickly through long and complex texts, locating relevant details.
A2  R9. Can understand everyday signs and notices: in public places, such as streets, restaurants, railway stations; in workplaces, such as directions, instructions, hazard warnings.
B2  R10. Can quickly identify the content and relevance of news items, articles and reports on a wide range of professional topics, deciding whether closer study is worthwhile.
B1  R11. Can find and understand relevant information in everyday material, such as letters, brochures and short official documents.
A2  R12. Can locate specific information in lists and isolate the information required (e.g. use the “Yellow Pages” to find a service or tradesman).
B2  R13. Has a broad active reading vocabulary but may experience some difficulty with low frequency idioms.
B1  R14. Can scan longer texts in order to locate desired information, and gather information from different parts of a text, or from different texts in order to fulfil a specific task.
B2  R15. Can understand articles and reports concerned with contemporary problems in which the writers adopt particular stances or viewpoints.
B1  R16. Can read straightforward factual texts on subjects related to his/her field and interest with a satisfactory level of comprehension.
A2  R17. Can understand short, simple texts containing the highest frequency vocabulary, including a proportion of shared international vocabulary items.
B1  R18. Can recognise the line of argument in the treatment of the issue presented, though not necessarily in detail.
A2  R19. Can find specific, predictable information in simple everyday material such as advertisements, prospectuses, menus, reference lists and timetables.


Table A4  Continued

Level  Descriptor
B1  R20. Can recognise significant points in straightforward newspaper articles on familiar subjects.
A2  R21. Can identify specific information in simpler written material he/she encounters such as letters, brochures and short newspaper articles describing events.

Appendix B Electronic consent form

Title of Project: Investigating a suitable electronic communication medium for a virtual standard setting environment
Name of Researcher: Charalambos Kollias

By clicking on the “I agree to take part in the study” button below, you confirm that:
• you have read and understood the information about the project, as provided in the Information Sheet;
• you have understood what will be required of you if you decide to participate in the project;
• you have been given the opportunity to ask questions about the project and your participation via email and/or telephone, and any questions you may have had have been answered to your satisfaction;
• you voluntarily agree to participate in the project;
• you understand you can withdraw at any time without giving reasons;
• you understand that if you withdraw within the first two months after the data has been collected, your data will not be used in the study;
• you understand that if you withdraw after two months have elapsed from the time the data was collected, your data will be used in the study;
• you consent to the standard setting virtual workshop and focus group interview being digitally video-recorded;
• you understand the procedures regarding confidentiality (e.g., use of names, pseudonyms, anonymisation of data, etc.) as explained in the Information Sheet; and
• you have received a copy of this Consent Form and accompanying Information Sheet.

If you do not wish to participate in the research, please decline participation by clicking on the “I do not agree to take part in the study” button.


Please indicate whether you wish to participate in the project by selecting the appropriate button.
“I agree to take part in the study”. [button]
“I do not agree to take part in the study”. [button]

Appendix C Judge background questionnaire

1. What is your name? ____________________
2. What is your gender?
   Female   Male
3. What category below includes your age?
   21–29   30–39   40–49   50–59   60 or older
4. What is the highest level of education you have received?
   BA/BSc graduate   MA/MSc graduate   PhD/EdD graduate
5. What is your current position? (Choose as many as apply.)
   Director of studies   Teacher Trainer   Private Language School Teacher   State School Teacher   University Professor   Researcher   Other (please specify): ____________________
6. About how long have you been teaching English as a Foreign Language?
   0–1 years   2–4 years   5–8 years   9–15 years   over 15 years   not applicable
7. About how familiar are you with the CEFR levels (A1–C2)?
   Not at all familiar   Slightly Familiar   Familiar   Very Familiar
8. About how familiar are you with the CEFR descriptors?
   Not at all familiar   Slightly Familiar   Familiar   Very Familiar
9. Which CEFR level(s) is/are you most familiar with? Choose a maximum of 3.
   A1   A2   B1   B2   C1   C2
10. Which CEFR levels did you teach during the last academic year (2013–2014)? (Choose as many as apply.)
   A1   A2   B1   B2   C1   C2   N/A


11. About how many years of experience do you have teaching CEFR B1 level?
   0–1 year   2–4 years   5–8 years   9–15 years   over 15 years   not applicable
12. About how often do you access the Internet?
   Less than once a month   Monthly   Weekly   Daily
13. Do you have private access to a computer?
   No   Yes
14. Do you have private access to the Internet?
   No   Yes
15. Do you have private access to a web camera?
   No   Yes
16. Do you have private access to a microphone?
   No   Yes
17. Do you know how to download software on your computer?
   No   Yes
18. Do you know how to install software on your computer?
   No   Yes
19. I am good at using a computer.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
20. I am comfortable using new technologies.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
21. I prefer reading text off screen rather than off printed text.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
22. Have you ever taken part in an online meeting?
   No   Yes
23. Have you ever taken an online course?
   No   Yes
24. Have you ever taken part in an online workshop?
   No   Yes


25. Have you ever taken part in a face-to-face standard setting workshop?
   No   Yes
26. If you have taken part in one, how many workshops have you participated in?
   0   1–2   3–5   6–10   11–15   Over 15
27. Have you ever taken part in a standard setting workshop conducted online?
   No   Yes
28. If you have taken part in one, how many online workshops have you taken part in?
   0   1–2   3–5   6–10   11–15   Over 15
29. I feel comfortable working in an online environment.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
30. I feel comfortable speaking in an online environment.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
31. I feel comfortable taking part in an online workshop.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
32. I feel comfortable expressing my opinion in an online environment.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
33. I feel comfortable responding to other people’s ideas.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
34. I prefer working independently rather than in a group.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree
35. I feel that constructive discussions can take place in an online environment.
   Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree


Appendix D Focus group protocol

Introductory statement

Good evening and welcome to our online session today. Thank you for taking the time to share with us your perceptions of the two workshops you have participated in. Tonight, we will be discussing setting cut scores in two different media: audio-only and video-audio. There are no right or wrong answers but rather differing points of view. Please feel free to share your point of view even if it differs from what others have said. Before we begin, I would like to share a few ground rules that will help our discussion. Please use the status bar to raise your hand when you wish to speak. When the previous person has finished and you are next in line, feel free to begin. I will be digitally recording the session because I don’t want to miss any of your comments. However, if we all speak at the same time, some of your comments may not be audible during playback. We will be on a first name basis tonight, and, in my study, there will not be any names attached to comments. You may be assured of complete confidentiality. Please keep in mind that we’re just as interested in negative comments as positive comments, and at times the negative comments are the most helpful. Our session will last about an hour and we will not be taking a formal break. Well, let’s begin. (adapted from Krueger & Casey, 2015)


Focus group interview questions

1. Did you feel comfortable interacting through videoconferencing/audio-conferencing?
   Probe 1a: Which one was easier/harder to communicate through?
2. How does communicating through videoconferencing/audio-conferencing technology compare with communicating face-to-face?
   Probe 2a: Please describe any similarities between communicating through videoconferencing/audio-conferencing technology and communicating face-to-face.
   Probe 2b: Please describe any differences between communicating through videoconferencing/audio-conferencing technology and communicating face-to-face.
3. Did you feel that the final cut score was the natural outcome of group discussions or the outcome of individual decision-making?
4. Did the videoconferencing/audio-conferencing technology influence or affect your decision-making process in any way?
5. If you have taken part in a face-to-face standard setting workshop, how does setting cut scores in a synchronous virtual environment compare with setting cut scores in a face-to-face environment?
   Probe 5a: Please describe any perceived benefits of using videoconferencing technology in a virtual standard setting workshop.
   Probe 5b: Please describe any disadvantages of using videoconferencing technology in a virtual standard setting workshop.
6. Overall, how would you describe your synchronous virtual standard setting experience?
7. Is there anything else you would like to add?


Appendix E Facilitator’s virtual standard setting protocol

Table E1  Step-by-step directions for facilitator

Stage 1: Orientation
Directions:
0a: Introduce (1) IT expert; (2) co-facilitator; (3) facilitator
0b: Have participants briefly introduce themselves
1a: Add “Stage 1: Orientation” - Note pod
1b: Project “Introduction PowerPoint”
1c: Add “CEFR Familiarisation” (Weblinks)
1d: Share “Judge Recording Sheet or Excel file”
1e: Add Test instrument (Weblinks) & Poll
1f: Share “Judge Recording Sheet or Excel file” for test instrument
1g: Share test instrument key
1h: Add “End of Orientation Survey” (Weblinks) & Poll
Note pod (visible to judges) - Purpose:
• To provide judges with an overview of the purpose of the standard setting workshop
• To calibrate judges with the CEFR level(s)
• To allow judges to be familiar with the content of the test instrument
Description of stage:
• Brief description of standard setting
• Familiarisation with grammar descriptors (DIALANG)
• Familiarisation with vocabulary descriptors (DIALANG)
• Familiarisation with CEFR global descriptors
• Familiarisation with CEFR reading descriptors
• Familiarisation with test instrument
• End of Orientation Survey (Evaluation_Survey_1)
Judge role:
• To complete matching activities with CEFR and DIALANG scales
• To review the test items (under timed conditions)

Stage 2: Standard Setting Method Training
Directions:
2a: Add “Stage 2: Method Training” - Note pod
2b: Add “Definition of ‘Just Qualified B1 Test taker’” - Note pod
2c: Train participants in the method
2d: Have participants practice the method & Poll
2e: Add “End of Method Training” Survey (Weblink) & Poll
Note pod (visible to judges) - Purpose:
• To train judges in the standard setting method to be used in Round 1
Description of stage:
• Explanation of standard setting method
• Brief practice with standard setting method (2–3 items)
• End of Method Training Survey (Evaluation_Survey_2)
Judge role:
• To learn how to apply the standard setting method
• To practice using the standard setting method

Stage 3: Round 1
Directions:
3a: Share “Stage 3: Round 1” - Note pod
3b: Share Test Instrument (Weblinks) & Poll
3c: Download Round 1 individual results
3d: Download Round 1 summary results
3e: Project Round 1 individual results
3f: Project summary results & items
3g: Discuss Round 1 items
3h: Add “End of Round 1” Survey (Weblink) & Poll
Note pod (visible to judges) - Purpose:
• To “trial” the standard setting method used to set the cut score
• To set a preliminary cut score
• To discuss rationale for individual judgement
Description of stage:
• Round 1 ratings of reading test
• Discussion on Round 1
• End of Round 1 Survey (Evaluation_Survey_3)
Judge role:
• To evaluate each item and apply the standard setting method (“Would a ‘Just Qualified B1 Test taker’ be able to answer this item correctly?”)
• To share rationale for judgement

Stage 4: Round 2
Directions:
4a: Share “Stage 4: Round 2” - Note pod
4b: Train participants with examples of empirical data
4c: Share Round 2 rating (Weblinks) & Poll
4d: Share “End of Round 2” Survey (Weblink) & Poll
4e: Download Round 2 individual results
4f: Calculate Round 3 results & consequences feedback
Note pod (visible to judges) - Purpose:
• To provide judges with an opportunity to make any changes to their initial ratings
• To set a second cut score
• To discuss rationale for individual judgement
Description of stage:
• Empirical test-taker data provided
• Guidance on how to apply the empirical data
• Round 2 ratings of test instrument
• Discussion on Round 2
• End of Round 2 Survey (Evaluation_Survey_4)
Judge role:
• To re-evaluate each item and apply the standard setting method again, taking into consideration the Round 1 discussion and the empirical data
• To share rationale for judgement

Stage 5: Round 3
Directions:
5a: Share “Stage 5: Round 3” - Note pod
5b: Project Round 2 results & consequences feedback
5c: Share Round 3 (Weblinks) & Poll
5d: Share “End of Round 3” Survey (Weblinks) & Poll
5e: Download Round 3 results & calculate consequences feedback
Note pod (visible to judges) - Purpose:
• To provide judges with a final opportunity to make any changes to their final recommended cut score
Description of stage:
• Consequences data provided
• Guidance on how to evaluate consequences feedback
• Round 3 ratings
• End of Round 3 Survey (Evaluation_Survey_5)

Stage 6: Wrap Up
Directions:
6a: Project final cut score & consequences feedback
6b: Share “Final Survey” (Weblink) & Poll
6c: Thank judges and end workshop
Purpose:
• To provide judges with the recommended final cut scores (Round 3 ratings)
• Final survey
• To bring the workshop to an end
Description of stage:
• Final cut scores presented
• Final survey
• Thank judges


Appendix F CEFR familiarisation verification activity results

The responses to the two CEFR verification activities were analysed using the MPI (Kaftandjieva, 2010). According to Kaftandjieva (2010), the minimal acceptable MPI for the degree of consistency between ratings and empirical judgements of items should be greater than .50 (MPI > .50). Kaftandjieva (2010) also claimed that a “more exacting criterion (MPI > .70)” (p. 59) should be set to evaluate individual judge consistency when setting cut scores. Tables F1 and F2 present the MPI indices for each judge in the two activities, respectively. For activity A, the MPI index ranged from .56 (J26) to .98 (J39), implying that nearly all of the judges exhibited an adequate degree of consistency when ranking the CEFR descriptors. For activity B, the MPI index ranged from .82 (J08) to 1.00 (J01), implying that the judges individually exhibited a very high degree of consistency when ranking the CEFR descriptors. Consequently, it may be assumed that the judges exhibited a high level of CEFR familiarity, implying that they had the expertise to set a CEFR cut score on a test instrument.

Table F1  CEFR familiarisation verification activity (A) results

Judge  MPI    Judge  MPI    Judge  MPI    Judge  MPI
J01    .94    J10    .91    J23    .89    J35    .90
J02    .95    J11    .71    J24    .85    J36    .94
J03    .90    J12    .92    J25    .82    J37    .84
J04    .88    J13    .90    J26    .56    J38    .87
J05    .88    J14    .89    J27    .95    J39    .98
J06    .92    J15    .89    J28    .73    J40    .74
J07    .86    J16    .89    J29    .90    J41    .69
J08    .88    J17    .79    J30    .82    J42    .94
J09    .96    J18    .91    J31    .85    J43    .89
              J19    .74    J32    .90    J44    .74
              J20    .91    J33    .76    J45    .93
              J21    .80    J34    .80
              J22    .78


Table F2  CEFR familiarisation verification activity (B) results

Judge  MPI    Judge  MPI    Judge  MPI    Judge  MPI
J01    1.00   J10    .99    J23    .94    J35    .97
J02    1.00   J11    .94    J24    .94    J36    1.00
J03    .97    J12    .98    J25    .94    J37    .96
J04    .99    J13    .99    J26    .89    J38    .97
J05    .96    J14    .93    J27    1.00   J39    1.00
J06    1.00   J15    1.00   J28    .98    J40    1.00
J07    .96    J16    .99    J29    1.00   J41    1.00
J08    .82    J17    .99    J30    1.00   J42    1.00
J09    1.00   J18    1.00   J31    .99    J43    .91
              J19    .97    J32    .94    J44    .93
              J20    1.00   J33    .93    J45    1.00
              J21    .93    J34    1.00
              J22    .96
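For readers wishing to automate this screening, a minimal Python sketch of how tabulated MPI values could be checked against Kaftandjieva’s (2010) two criteria is given below; only the two thresholds come from the text, the few judge values are taken from Tables F1 and F2, and the code is purely illustrative (it assumes the MPI values have already been computed).

# Hypothetical subset of the Table F1/F2 values; keys are judge codes, values are MPI indices.
activity_a = {"J26": 0.56, "J39": 0.98, "J41": 0.69, "J01": 0.94}
activity_b = {"J08": 0.82, "J01": 1.00}

def flag_judges(mpi: dict, minimal: float = 0.50, exacting: float = 0.70) -> dict:
    """Classify each judge against the minimal (> .50) and exacting (> .70) MPI criteria."""
    labels = {}
    for judge, value in mpi.items():
        if value <= minimal:
            labels[judge] = "below minimal criterion"
        elif value <= exacting:
            labels[judge] = "meets minimal but not exacting criterion"
        else:
            labels[judge] = "meets exacting criterion"
    return labels

print(flag_judges(activity_a))
print(flag_judges(activity_b))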


Appendix G: Facets specification file Table G  FACETS specification file (edited version) Specification TITLE =​7 FACETS MASTER Facets =​7

Delements =​LN Positive =​1,2; Non-​centered =​ 1 Interrrater =​1 Totalscore =​Yes Pt-​biserial =​ Yes Vertical =​(1L, 7L) Iterations =​0 Convergence =​.001, .00001 Models =​ ?,?,?,?,?,?,1–​90, D; ?,?,?,?,?,?,91–​92, TestA,0.5; ?,?,?,?,?,?,93–​94,TestB,0.5; ?,?,?,?,?,?,95–​139,R45; * Rating scale =​TestA,R45; 0 =​ 1 =​, -​4.39, A 2 =​, -​3.63, A … … …

Explanation Title of Facets analysis The analysis has seven facets: (1) Judges (active), (2) Group (dummy), (3) Medium (dummy), (4) Order (dummy), (5) Test Form (dummy), (6) Round (dummy), and (7) Items (active) Element identifiers are both elements numbers and labels The first two facets are positive, implying that higher Rasch measures indicate judges are more demanding. Facet 1 (judges) to float; not to be centred to 0. Facet 1 (judges) is the rater facet. Report rater agreements for facet 1. Report scores including all responses (extreme scores) Report the point-​biserial correlation Show graphically facets in the order specified (1,7) and shows each element’s label Unlimited number of iterations Tight convergence applied; exaggerated accuracy, .001 score points and .00001 logits Measurement model for analysis =​ For items 1–​90, use dichotomous model (1,0) For items 91–​92, use TestA rating scale and apply weight of 0.5 to each item. For items 93–​94, use TestB rating scale and apply weight of 0.5 to each item. For items 95–​139 apply a rating scale of 45. End of models specifications The TestA rating scale is out of 45 A raw score of 1 has an anchored threshold of -​4.39 A raw score of 2 has an anchored threshold of -​3.63


Table G Continued Specification 45 =​, 4.42, A * Rating scale =​TestB,R45; 0 =​, 1 =​, -​4.28, A 2 =​, -​3.62, A … … … 45 =​, 4.39, A * Glabel =​; 1,1,Round 1AG1-​Audio

… … … 1,26,Round 2AG5-​FTF * Label =​ 1, Judges; 1 =​PA01G1R1,,01 … … … 300 =​  PA60G5R2,,26 * 2, Group, D;

Explanation A raw score of 45 has an anchored threshold of 4.42 End of rating scale specifications The TestB rating scale is out of 45 A raw score of 1 has an anchored threshold of -​4.28 A raw score of 2 has an anchored threshold of -​3.62

A raw score of 45 has an anchored threshold of 4.39 End of rating scale specifications Group names in Facet 1 according to groups specified in Facet 2 AG1-​Audio =​Test Form A, Group 1, Audio medium. The first group of judgments will contain Round 1, Test Form A, Group 1 judgments made in audio medium.

The last group of judgments will contain Round 2, Test Form A, Group 5 judgments made in F2F environment. End of Glabel specifications Facet labels and elements per facet Facet 1 =​Judges The first element is PA01G1R1. PA01G1R1 is Test Form A, Judge code PA01, Group 1, Round 1. Judge PA01 belongs to group 1 (,,01).

The last element is PA60G5R2,,26 . PA60G6R2,,26 is Test Form A, Judge code PA60, Group 5, Round 2. Judge PA 60 belongs to group 26. End of Facet 1 specifications Facet 2 =​Group; D =​Dummy (inactive facet) (Continued)


Table G Continued Specification 1, G1 … … … 5, G5 * 3, Medium, D; 1, Audio 2, Video 3, F2F * 4, Order, D; 1, Audio 2, Video 3, F2F * 5, Test Form, D; 1, Form-​A 2, Form-​B * 6, Round, D; 1, Round-​1 2, Round-​2 3, Round-​3 * 7, Items, A 1, AG01, -​1.17

… … … 139, FR15, -0.14

Explanation First element (1) =​G1 (Group 1)

Last element (5) =​G5 (Group 5) End of Facet 2 specifications Facet 3 =​Medium of communication; D =​Dummy (inactive facet) First element (1) =​Audio Second element (2) =​Video Last element (3) =​F2F End of Facet 3 specifications Facet 4 =​Medium in which first session took place. D =​Dummy (inactive facet) First element (1) =​Audio Second element (2) =​Video Last element (3) =​F2F End of Facet 4 specifications Facet 5 =​Test Form; D =​Dummy (inactive facet) First element (1) =​Form-​A (Test Form A) Last element (2) =​Form-​B (Test Form B) End of Facet 5 specifications Facet 6 =​Round; D =​Dummy (inactive facet) First element (1) =​Round-​1 Second element (2) Round-​2 Last element (3), Round-​3 End of Facet specifications Facet 7 =​Items; A =​Anchored Facet 7 =​Items; A =​Anchored First element (1) is item AG01. AG01 =​Test Form A, Grammar item number 01. The item is anchored to -​1.17 logits.

This element (139) is item FR15. Item FR15 - F2F reading item 15 (the last item). The item is anchored to -0.14 logits.


Table G Continued Specification * Data =​ PA01G1R1,G1,Audio,Audio,Form-​ A,Round-​1,1–​45,1,1,1 . . . 1,0,1

Explanation End of Facet 7 specifications Beginning of data specifications The first string of data is from Judge PA01G1R1, who is assigned to the first element of Facet 2 (Group 1), the first element of Facet 3 (audio medium), the first element of Facet 4 (audio), the first element of Facet 5 (test form A), the first element of Facet 6 (Round 1). For items 1–​45, the data is 1,1,1, followed by another 39 responses, followed by the last three responses 1,0,1.

… … … R0.45,P46G5R1,G5,FTF,FTF,Form-​ This string of data is from Judge P46G5R1 who is A,Round-​1,95,1 assigned to the fifth element of Facet 2 (Group 5), the third element of Facet 3 (F2F medium), the third element of Facet 4 (F2F), the first element of Facet 5 (test for A), the first element of Facet 6 (Round 1), for item 95 the probability of getting the item correct (1) is 45 % (R0.45). Note: In the face-​to-​face environment, the modified Angoff percentage method was used for setting cut scores. In order to analyse the data, each judgement needs to be entered twice: (1) the probability of getting the item correct (1) and (2) the probability of getting the item wrong (0). R0.45 indicates 45 % probability of getting item 95 (at the end of string) correct (1). … … … R0.40,P60G5R2,G5,FTF,FTF,Form-​ This string of data is from Judge P60G5R2 who is A,Round-​2,139,0 assigned to the fifth element of Facet 2 (Group 5), the third element of Facet 3 (F2F medium), the third element of Facet 4 (F2F), the first element of Facet 5 (test for A), the second element of Facet 6 (Round 2), for item 139 the probability of getting the item wrong (0) is 40 % (R0.40). * End of data specifications


Appendix H: Intraparticipant consistency indices

Table H1: Round 1 MPI indices

J01 J02 J03 J04 J05 J06 J07 J08 J09

Group 1 Test Test A B .66 .75 .61 .79 .57 .53 .71 .74 .62 .63 .59 .72 .56 .63 .50 .68 .77 .59

Min. Max. Mean

.50 .77 .62

ID

.53 .79 .67

ID J10 J11 J12 J13 J14 J15 J16 J17 J18 J19 J20 J21 J22

Group 2 Test Test A B .78 .64 .79 .59 .73 .82 .70 .48 .56 .78 .76 .82 .47 .63 .58 .51 .66 .56 .60 .88 .72 .60 .52 .68 .52 .63 .47 .79 .64

.48 .88 .66

ID J23 J24 J25 J26 J27 J28 J29 J30 J31 J32 J33 J34

Group 3 Test Test A B .67 .65 .57 .70 .58 .78 .46 .69 .69 .75 .75 .66 .61 .64 .64 .60 .71 .67 .67 .64 .72 .66 .72 .54

.46 .75 .65

.54 .78 .66

ID J35 J36 J37 J38 J39 J40 J41 J42 J43 J44 J45

Group 4 Test Test A B .52 .67 .64 .54 .56 .63 .61 .61 .57 .80 .49 .58 .45 .58 .68 .67 .51 .62 .70 .75 .74 .64

.45 .74 .59

.54 .80 .64

Table H2: Round 2 MPI indices

J01 J02 J03 J04 J05 J06 J07 J08 J09

Group 1 Test Test A B .92 .79 .82 .87 .69 .32 .77 .75 .81 .72 .75 .80 .73 .44 .96 .84 .81 .84

Min. Max. Mean

.69 .96 .81

ID

.32 .87 .71

ID J10 J11 J12 J13 J14 J15 J16 J17 J18 J19 J20 J21 J22

Group 2 Test Test A B .90 .84 .79 .60 .84 .95 .72 .58 .91 .91 .91 .88 .87 .83 .87 .96 .44 .75 .79 .87 .59 .74 .71 .94 .92 .45 .44 .92 .79

.45 .96 .79

ID J23 J24 J25 J26 J27 J28 J29 J30 J31 J32 J33 J34

Group 3 Test Test A B .73 .71 .88 .87 .66 .83 .84 .69 .85 .72 .78 .61 .94 .89 .64 .74 .81 .80 .60 .66 .89 .84 .78 .54

.60 .94 .78

.54 .89 .74

ID J35 J36 J37 J38 J39 J40 J41 J42 J43 J44 J45

Group 4 Test Test A B .71 .91 .84 .88 .85 .54 .91 .80 .72 .85 .73 .76 .62 .60 .81 .89 .81 .74 .78 .71 .76 .82

.62 .91 .78

.54 .91 .77


Table H3  Changes in ratings across Round 1 and Round 2 ID J01 J02 J03 J04 J05 J06 J07 J08 J09

Group 1 Test Test A B .39* .72* .60* .60* .33* .29 .71* .45* .65* .78* .44* .60* .57* .53* .06 .06 .96* .64*

ID J10 J11 J12 J13 J14 J15 J16 J17 J18 J19 J20 J21 J22

Group 2 Test Test A B .51* .10 .75* .81* .71* .74* .78* .52* .26 .63* .71* .78* .13 .34* .36* .14 .35* .62* .65* .91* .56* .68* .30* .44* -​.05 .18

ID J23 J24 J25 J26 J27 J28 J29 J30 J31 J32 J33 J34

Group 3 Test Test A B .59* -​.35 .38* .30* .28 .44* .09 .41* .60* .31* .73* .66* .27 .58* 1.00* .53* .73* .39* .47* .15 .50* .26 .69* 1.00*

ID J35 J36 J37 J38 J39 J40 J41 J42 J43 J44 J45

Group 4 Test Test A B .17 .15 .32* .52* .48* .73* .26 .29 .64* .78* .59* .44* .59* .81* .61* .30* .42* .53* .64* .78* .44* .36*

*correlations significant at the .05 level (2-​tailed)

Table H4  Logit changes in ratings across Round 2 and Round 3 ID J01 J02 J03 J04 J05 J06 J07 J08 J09

Group 1 Test Test A B .00 .00 .00 .00 -​.20 .00 .31 .20 .00 .00 .21 .11 -​.33 .00 .00 -​.17 .00 .20

ID J10 J11 J12 J13 J14 J15 J16 J17 J18 J19 J20 J21 J22

Group 2 Test Test A B .93 .00 -​.33 .00 .10 .00 -​.11 -​.26 -​.31 -​.21 -​.57 .10 .70 .00 -​.67 .00 .00 .00 -​.22 .00 .10 .00 -​.31 .32 -​.76 .00

ID J23 J24 J25 J26 J27 J28 J29 J30 J31 J32 J33 J34

Group 3 Test Test A B .32 -​.34 .31 -​.11 .35 .12 -​.69 .00 -​.11 -​.11 .21 .00 .00 .42 .51 .11 .22 .32 -​.20 -​.20 -​.28 -​.48 .62 .00

ID J35 J36 J37 J38 J39 J40 J41 J42 J43 J44 J45

Group 4 Test Test A B .01 .00 -​.23 .00 .00 .00 .00 .00 .00 -​.04 -​.11 -​.11 .00 .00 .20 .32 .00 .00 .00 .42 -​.22 -​1.38


Appendix I: Group 5 group level and individual level Rasch indices

Table I2  Group 5 individual level Rasch indices

Min. Infit MnSq (ZStd) Max. Infit MnSq (ZStd) Mean. Infit MnSq (ZStd) S.D. Population Infit MnSq Min. Corr. PtBis Max. Corr. PtBis Mean. Corr. PtBis* Min. Rasch-​Kappa Max. Rasch-​Kappa

Group 5 (45 items) Round 1 Round 2 1.12 (.7) 1.05 (0.4) 1.25 (2.1) 1.24 (1.7) 1.20 (1.8) 1.15 (1.4) 0.03 0.05 -​.03 -​.20 -​.16 .12 .08 -​.06 -​0.12 -​0.12 -​0.08 -​0.07

* Average using Fisher’s Z-​transformations

Table I1  Group 5 group level Rasch indices

                                              Group 5 (45 items)
                                              Round 1    Round 2
Cut score measure                             0.26       0.26
Mean raw score                                23.59      23.66
Standard error of cut score measures (SEc)    0.13       0.13
S.D. (population) mean measure                4.66       4.82
Separation ratio (G)                          1.10       1.17
Separation (strata) index (H)                 1.80       1.90
Separation reliability (R)                    0.55       0.58
Fixed (all) chi-square (prob.)                0.00       0.00
Observed agreement (%)                        50.0       50.0
Expected agreement (%)                        54.8       54.8
Rasch-Kappa                                   -0.11      -0.11


Appendix J: Form A & Form B score tables

Table J1: Form A raw score to logit score table

Raw  Measure (S.E.)     Raw  Measure (S.E.)     Raw  Measure (S.E.)
0    -5.08 (1.83)       16   -0.52 (0.33)       32   1.15 (0.35)
1    -3.86 (1.02)       17   -0.41 (0.32)       33   1.27 (0.35)
2    -3.13 (0.73)       18   -0.31 (0.32)       34   1.40 (0.36)
3    -2.69 (0.61)       19   -0.21 (0.32)       35   1.54 (0.38)
4    -2.37 (0.53)       20   -0.11 (0.32)       36   1.68 (0.39)
5    -2.11 (0.48)       21   -0.01 (0.32)       37   1.84 (0.41)
6    -1.89 (0.45)       22   0.09 (0.32)        38   2.01 (0.43)
7    -1.70 (0.42)       23   0.19 (0.32)        39   2.21 (0.45)
8    -1.53 (0.40)       24   0.29 (0.32)        40   2.43 (0.49)
9    -1.38 (0.39)       25   0.39 (0.32)        41   2.69 (0.54)
10   -1.24 (0.37)       26   0.49 (0.32)        42   3.02 (0.61)
11   -1.10 (0.36)       27   0.60 (0.32)        43   3.46 (0.73)
12   -0.98 (0.35)       28   0.70 (0.33)        44   4.19 (1.02)
13   -0.86 (0.34)       29   0.81 (0.33)        45   5.42 (1.84)
14   -0.74 (0.34)       30   0.92 (0.33)
15   -0.63 (0.33)       31   1.03 (0.34)

Table J2: Form B raw score to logit score table

Raw  Measure (S.E.)     Raw  Measure (S.E.)     Raw  Measure (S.E.)
0    -5.27 (1.83)       16   -0.71 (0.33)       32   0.95 (0.34)
1    -4.04 (1.02)       17   -0.60 (0.32)       33   1.07 (0.35)
2    -3.32 (0.73)       18   -0.50 (0.32)       34   1.20 (0.36)
3    -2.88 (0.61)       19   -0.40 (0.32)       35   1.33 (0.37)
4    -2.56 (0.53)       20   -0.30 (0.32)       36   1.48 (0.39)
5    -2.30 (0.48)       21   -0.20 (0.32)       37   1.63 (0.40)
6    -2.08 (0.45)       22   -0.10 (0.31)       38   1.80 (0.42)
7    -1.89 (0.42)       23   0.00 (0.31)        39   1.99 (0.45)
8    -1.72 (0.40)       24   0.10 (0.32)        40   2.21 (0.49)
9    -1.57 (0.39)       25   0.20 (0.32)        41   2.47 (0.53)
10   -1.42 (0.37)       26   0.30 (0.32)        42   2.79 (0.61)
11   -1.29 (0.36)       27   0.40 (0.32)        43   3.23 (0.73)
12   -1.16 (0.35)       28   0.51 (0.32)        44   3.96 (1.02)
13   -1.04 (0.34)       29   0.61 (0.33)        45   5.19 (1.83)
14   -0.93 (0.34)       30   0.72 (0.33)
15   -0.82 (0.33)       31   0.83 (0.34)
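A minimal Python sketch of how such a conversion table can be used programmatically is given below; only a few Form A entries are reproduced for illustration, and rounding a raw cut score to the nearest tabulated value is an assumption of the example rather than the procedure used in the study.

# A few (raw score: (measure, SE)) pairs transcribed from Table J1 (Form A).
FORM_A = {22: (0.09, 0.32), 23: (0.19, 0.32), 24: (0.29, 0.32), 25: (0.39, 0.32)}

def raw_to_logit(raw_cut: float, table: dict) -> tuple:
    """Look up the logit measure and SE for a raw cut score, using the nearest tabulated score."""
    nearest = min(table, key=lambda raw: abs(raw - raw_cut))
    return table[nearest]

measure, se = raw_to_logit(23, FORM_A)   # e.g. a panel's recommended raw cut score
print(f"Cut score measure: {measure} logits (SE = {se})")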


Appendix K: DJF pairwise interactions Table K1  Round 1 DJF pairwise interactions Judge J01 J02 J03 J04 J05 J06 J07 J08 J09 J10 J11 J12 J13 J14 J15 J16 J17 J18 J19 J20 J21 J22 J23 J24 J25 J26 J27 J28 J29 J30 J31

Audio Measure * (S.E.) 0.70 (0.33) 0.49 (0.32) 0.29 (0.32) -​0.21 (0.32) -​0.01 (0.32) -​0.11 (0.32) 1.27 (0.35) 0.39 (0.32) 0.39 (0.32) 0.72 (0.33) 0.51 (0.32) 0.00 (0.31) 1.07 (0.35) 0.00 (0.31) 0.20 (0.32) 0.20 (0.32) 0.83 (0.34) -​0.40 (0.32) 0.20 (0.32) 0.30 (0.32) -​0.10 (0.31) 0.95 (0.34) 0.00 (0.31) 0.00 (0.31) 1.20 (0.36) 1.63 (0.4) 0.00 (0.31) 0.20 (0.32) -​0.10 (0.31) 0.40 (0.32) 0.72 (0.33)

Video Measure * (S.E.) 0.95 (0.34) 0.10 (0.32) -​0.10 (0.31) -​0.60 (0.32) 0.00 (0.31) 0.10 (0.32) 0.51 (0.32) 1.33 (0.37) 0.10 (0.32) 0.09 (0.32) 0.39 (0.32) 0.60 (0.32) 0.49 (0.32) -​0.63 (0.33) 0.81 (0.33) -​0.21 (0.32) 0.70 (0.33) -​0.21 (0.32) 0.92 (0.33) 0.29 (0.32) -​0.21 (0.32) 1.27 (0.35) 0.81 (0.33) 0.19 (0.32) 1.15 (0.35) 1.15 (0.35) -​0.11 (0.32) 0.60 (0.32) 0.39 (0.32) 0.19 (0.32) 0.49 (0.32)

Medium Contrast (Joint S.E.) -​0.25 (0.47) 0.39 (0.45) 0.39 (0.45) 0.39 (0.45) -​0.01 (0.45) -​0.21 (0.45) 0.76 (0.48) -​0.94 (0.49) 0.29 (0.45) 0.63 (0.46) 0.12 (0.45) -​0.59 (0.45) 0.58 (0.48) 0.63 (0.46) -​0.61 (0.46) 0.41 (0.45) 0.13 (0.47) -​0.19 (0.45) -​0.72 (0.46) 0.01 (0.45) 0.11 (0.45) -​0.32 (0.49) -​0.81 (0.46) -​0.19 (0.45) 0.05 (0.5) 0.48 (0.53) 0.11 (0.45) -​0.39 (0.45) -​0.49 (0.45) 0.21 (0.45) 0.23 (0.46)

Welch t (d.f.)

Prob p

FDR q

-​.053 (87.71) 0.87 (87.99) 0.87 (88.00) 0.87 (87.98) -​0.02 (88.00) -​0.47 (88.00) 1.59 (87.28) -​1.92 (85.82) 0.65 (88.00) 1.38 (87.77) .026 (87.97) -​1.32 (87.96) 1.22 (87.18) 1.38 (87.76) -​1.33 (87.87) 0.91 (88.00) 0.29 (87.88) -​0.42 (88.00) -​1.56 (87.75) 0.02 (88.00) 0.25 (87.99) -​0.65 (87.92) -​1.77 (87.82) -​0.43 (88.00) 0.10 (87.84) 0.91 (86.05) 0.24 (88.00) -​0.87 (87.97) -​1.09 (87.99) 0.47 (87.98) 0.50 (87.87)

.60 .39 .39 .39 .98 .64 .12 .06 .52 .17 .80 .19 .23 .17 .19 .36 .78 .68 .12 .98 .80 .52 .08 .67 .92 .36 .81 .38 .28 .64 .62

.94 .83 .83 .83 .99 .94 .69 .69 .94 .69 .96 .69 .77 .69 .69 .83 .96 .94 .69 .99 .96 .94 .69 .94 .99 .83 .96 .83 .83 .94 .94 (Continued)


Table K1 Continued Judge J32 J33 J34 J35 J36 J37 J38 J39 J40 J41 J42 J43 J44 J45

Audio Measure * (S.E.) 0.10 (0.32) 0.72 (0.33) 0.30 (0.32) 0.70 (0.33) 1.4 (0.36) 0.92 (0.33) 1.03 (0.34) 0.39 (0.32) 0.81 (0.33) 1.15 (0.35) 0.70 (0.33) -​0.21 (0.32) 0.19 (0.32) 0.92 (0.33)

Video Measure * (S.E.) 0.19 (0.32) 1.03 (0.34) 0.49 (0.32) -​0.40 (0.32) 0.51 (0.32) -​0.40 (0.32) 0.40 (0.32) 0.00 (0.31) 0.72 (0.33) 0.51 (0.32) 0.61 (0.33) 0.00 (0.31) 0.00 (0.31) 1.07 (0.35)

Medium Contrast (Joint S.E.) -​0.09 (0.45) -​0.31 (0.48) -​0.19 (0.45) 1.10 (0.45) 0.89 (0.49) 1.31 (0.46) 0.63 (0.47) 0.39 (0.45) 0.09 (0.47) 0.64 (0.47) 0.09 (0.46) -​0.21 (0.45) 0.19 (0.45) -​0.15 (0.49)

Welch t (d.f.)

Prob p

FDR q

-​0.20 (88.00) -​0.65 (87.95) -​0.42 (88.00) 2.41 (87.96) 1.83 (86.80) 2.85 (87.79) 1.34 (87.71) 0.87 (87.99) 0.18 (87.99) 1.35 (87.60) 0.19 (88.00) -​0.47 (87.99) 0.43 (88.00) -​0.32 (87.75)

.84 .52 .67 .02 .07 .01 .18 .39 .86 .18 .85 .64 .67 .75

.96 .94 .94 .42 .69 .26 .69 .83 .96 .69 .96 .94 .94 .96

* Measures are judges’ Round 1 measures Fixed (all =​0) chi-​squared: 51.9 d.f.: 90 significance (probability): 1.00
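For readers who wish to reproduce the structure of these columns, the Python sketch below shows one way the medium contrast, its joint standard error, an approximate two-sided p-value, and Benjamini-Hochberg FDR q-values could be computed from judges’ audio and video measures; the example judges’ values are taken from Table K1, but the fixed degrees of freedom (88) and the use of scipy/statsmodels are assumptions of this illustration, not a description of the Facets output.

import math
from scipy import stats
from statsmodels.stats.multitest import multipletests

# (audio measure, audio SE, video measure, video SE) for a few judges, taken from Table K1.
judges = {
    "J01": (0.70, 0.33, 0.95, 0.34),
    "J35": (0.70, 0.33, -0.40, 0.32),
    "J37": (0.92, 0.33, -0.40, 0.32),
}

DF = 88  # assumed degrees of freedom for this illustration

rows, pvals = [], []
for judge, (a, se_a, v, se_v) in judges.items():
    contrast = a - v                              # differential severity between media
    joint_se = math.sqrt(se_a ** 2 + se_v ** 2)   # joint SE of the two measures
    t = contrast / joint_se
    p = 2 * stats.t.sf(abs(t), DF)                # two-sided probability
    rows.append((judge, contrast, joint_se, t, p))
    pvals.append(p)

# Benjamini-Hochberg correction across the family of judge-level contrasts
_, qvals, _, _ = multipletests(pvals, method="fdr_bh")

for (judge, contrast, joint_se, t, p), q in zip(rows, qvals):
    print(f"{judge}: contrast={contrast:+.2f} (SE {joint_se:.2f}), t={t:.2f}, p={p:.2f}, q={q:.2f}")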


Table K2  Round 2 DJF pairwise interactions Judge J01 J02 J03 J04 J05 J06 J07 J08 J09 J10 J11 J12 J13 J14 J15 J16 J17 J18 J19 J20 J21 J22 J23 J24 J25 J26 J27 J28 J29 J30 J31 J32 J33

Audio Measure * (S.E.) 0.60 (0.32) 0.19 (0.32) 0.39 (0.32) 0.29 (0.32) 0.19 (0.32) 0.39 (0.32) 1.03 (0.34) 1.54 (0.38) 0.49 (0.32) 0.72 (0.33) 0.72 (0.33) 0.40 (0.32) 1.33 (0.37) 0.72 (0.33) 0.30 (0.32) -​0.10 (0.31) 0.72 (0.33) 0.10 (0.32) 0.40 (0.32) 0.40 (0.32) 0.40 (0.32) 0.40 (0.32) 0.95 (0.34) 0.83 (0.34) 0.95 (0.34) 0.72 (0.33) 0.83 (0.34) -​0.20 (0.32) 0.30 (0.32) 0.61 (0.33) 0.40 (0.32) 0.40 (0.32) 1.20 (0.36)

Video Measure * (S.E.) 0.51 (0.32) 0.40 (0.32) 0.10 (0.32) 0.10 (0.32) 0.10 (0.32) 0.40 (0.32) 0.51 (0.32) 1.8 (0.42) 0.10 (0.32) -​0.01 (0.32) 1.03 (0.34) 0.09 (0.32) 0.81 (0.33) 0.70 (0.33) 1.27 (0.35) -​0.31 (0.32) 1.27 (0.35) 0.29 (0.32) 1.03 (0.34) 0.70 (0.33) 0.70 (0.33) 1.15 (0.35) 0.49 (0.32) 0.39 (0.32) 0.92 (0.33) 1.84 (0.41) 1.03 (0.34) 0.39 (0.32) 0.81 (0.33) 0.19 (0.32) 0.70 (0.33) 0.39 (0.32) 1.68 (0.39)

Medium Contrast (Joint S.E.) 0.09 (0.46) -​0.21 (0.45) 0.29 (0.45) 0.19 (0.45) 0.09 (0.45) -​0.01 (0.45) 0.52 (0.47) -​0.27 (0.57) 0.39 (0.45) 0.73 (0.46) -​0.31 (0.48) 0.31 (0.45) 0.53 (0.50) 0.02 (0.46) -​0.97 (0.48) 0.21 (0.45) -​0.55 (0.49) -​0.19 (0.45) -​0.63 (0.47) -​0.30 (0.46) -​0.30 (0.46) -​0.74 (0.47) 0.46 (0.47) 0.44 (0.46) 0.03 (0.48) -​1.12 (0.52) -​0.20 (0.48) -​0.59 (0.45) -​0.51 (0.46) 0.42 (0.45) -​0.30 (0.46) 0.01 (0.45) -​0.48 (0.53)

Welch t (d.f.)

Prob p

FDR q

0.19 (88.00) -​0.47 (87.98) 0.65 (88.00) 0.43 (88.00) 0.20 (88.00) -​0.03 (87.99) 1.12 (87.80) -​0.47 (86.73) 0.87 (87.99) 1.59 (87.78) -​0.65 (87.96) 0.69 (87.98) 1.06 (86.67) 0.05 (87.96) -​2.03 (86.99) 0.47 (87.97) -​1.13 (87.62) -​0.43 (88.00) -​1.34 (87.71) -​0.65 (87.98) -​0.65 (87.98) -​1.58 (87.48) 0.97 (87.51) 0.95 (87.67) 0.07 (87.92) -​2.13 (84.65) -​0.41 (88.00) -​1.31 (88.00) -​1.10 (87.90) 0.93 (87.88) -​0.65 (87.98) 0.03 (87.99) -​0.91 (87.52)

.85 .64 .52 .67 .84 .98 .27 .64 .39 .12 .52 .49 .29 .96 .04 .64 .26 .67 .18 .52 .52 .12 .33 .34 .95 .04 .68 .19 .27 .36 .52 .98 .36

.98 .87 .84 .87 .98 .99 .84 .87 .84 .83 .84 .84 .84 .99 .83 .87 .84 .87 .83 .84 .84 .83 .84 .84 .99 .83 .87 .83 .84 .84 .84 .99 .84 (Continued)


Table K2 Continued Judge J34 J35 J36 J37 J38 J39 J40 J41 J42 J43 J44 J45

Audio Measure * (S.E.) 0.30 (0.32) -​0.31 (0.32) 1.15 (0.35) 0.60 (0.32) 0.60 (0.32) 0.60 (0.32) 0.49 (0.32) 0.92 (0.33) 0.19 (0.32) 0.09 (0.32) 0.19 (0.32) 1.03 (0.34)

Video Measure * (S.E.) 0.19 (0.32) -​0.10 (0.31) 0.72 (0.33) -​0.20 (0.32) 0.00 (0.31) 0.30 (0.32) 0.83 (0.34) 0.72 (0.33) 0.83 (0.34) 0.72 (0.33) 0.30 (0.32) 1.99 (0.45)

Medium Contrast (Joint S.E.) 0.11 (0.45) -​0.21 (0.45) 0.43 (0.48) 0.79 (0.45) 0.59 (0.45) 0.29 (0.45) -​0.34 (0.46) 0.2 (0.47) -​0.64 (0.46) -​0.63 (0.46) -​0.11 (0.45) -​0.96 (0.56)

Welch t (d.f.)

Prob p

FDR q

0.25 (87.99) -​0.47 (87.97) 0.89 (87.84) 1.76 (87.96) 1.32 (87.96) 0.65 (87.99) -​0.73 (87.73) 0.41 (88.00) -​1.39 (87.60) -​1.38 (87.77) -​0.25 (87.99) -​1.71 (81.80)

.81 .64 .38 .08 .19 .52 .47 .68 .17 .17 .81 .09

.98 .87 .84 .83 .83 .84 .84 .87 .83 .83 .98 .83

* Measures are judges’ Round 2 measures Fixed (all =​0) chi-​squared: 44.3 d.f.: 90 significance (probability): 1.00


Table K3  Round 3 DJF pairwise interactions Judge J01 J02 J03 J04 J05 J06 J07 J08 J09 J10 J11 J12 J13 J14 J15 J16 J17 J18 J19 J20 J21 J22 J23 J24 J25 J26 J27 J28 J29 J30 J31 J32 J33

Audio Measure (S.E.) 0.60 (0.32) 0.19 (0.31) 0.19 (0.31) 0.60 (0.32) 0.19 (0.31) 0.60 (0.32) 0.70 (0.33) 1.54 (0.39) 0.49 (0.32) 0.72 (0.33) 0.72 (0.33) 0.40 (0.32) 1.07 (0.35) 0.51 (0.32) 0.40 (0.32) -​0.10 (0.32) 0.72 (0.33) 0.10 (0.31) 0.40 (0.32) 0.40 (0.32) 0.72 (0.33) 0.40 (0.32) 0.61 (0.33) 0.72 (0.33) 1.07 (0.35) 0.72 (0.33) 0.72 (0.33) -​0.20 (0.32) 0.72 (0.33) 0.72 (0.33) 0.72 (0.33) 0.20 (0.32) 0.72 (0.33)

Video Measure (S.E.) 0.51 (0.32) 0.40 (0.32) 0.10 (0.31) 0.30 (0.32) 0.10 (0.31) 0.51 (0.32) 0.51 (0.32) 1.63 (0.42) 0.30 (0.32) 0.92 (0.33) 0.70 (0.33) 0.19 (0.31) 0.70 (0.33) 0.39 (0.32) 0.70 (0.33) 0.39 (0.32) 0.60 (0.32) 0.29 (0.32) 0.81 (0.33) 0.60 (0.32) 0.39 (0.32) 0.39 (0.32) 0.81 (0.33) 0.70 (0.33) 1.27 (0.35) 1.15 (0.34) 0.92 (0.33) 0.60 (0.32) 0.81 (0.33) 0.70 (0.33) 0.92 (0.33) 0.19 (0.31) 1.40 (0.37)

Medium Contrast (Joint S.E.) 0.09 (0.46) -​0.21 (0.45) 0.09 (0.44) 0.30 (0.45) 0.09 (0.44) 0.09 (0.46) 0.20 (0.46) -​0.09 (0.57) 0.19 (0.45) -​0.20 (0.47) 0.02 (0.47) 0.21 (0.45) 0.36 (0.48) 0.12 (0.46) -​0.30 (0.46) -​0.49 (0.45) 0.13 (0.46) -​0.19 (0.45) -​0.41 (0.46) -​0.19 (0.46) 0.33 (0.46) 0.01 (0.45) -​0.20 (0.46) 0.02 (0.47) -​0.20 (0.50) -​0.42 (0.48) -​0.20 (0.47) -​0.79 (0.45) -​0.09 (0.47) 0.02 (0.47) -​0.20 (0.47) 0.01 (0.45) -​0.67 (0.50)

Welch t (d.f.)

Prob FDR p q

0.20 (-​) -​0.47 (-​) 0.20 (-​) 0.65 (-​) 0.20 (-​) 0.20 (-​) 0.43 (-​) -​0.16 (-​) 0.42 (-​) -​0.42 (-​) 0.04 (-​) 0.47 (-​) 0.77 (-​) 0.26 (-​) -​0.66 (-​) -​1.09 (-​) 0.27 (-​) -​0.42 (-​) -​0.89 (-​) -​0.43 (-​) 0.72 (-​) 0.03 (-​) -​0.42 (-​) 0.04 (-​) -​0.40 (-​) -​0.88 (-​) -​0.42 (-​) -​1.75 (-​) -​0.19 (-​) 0.04 (-​) -​0.42 (-​) 0.02 (-​) -​1.36 (-​)

-​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​

-​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​

(Continued)


Table K3 Continued Judge J34 J35 J36 J37 J38 J39 J40 J41 J42 J43 J44 J45

Audio Measure (S.E.) 0.30 (0.32) -​0.30 (0.32) 0.92 (0.33) 0.60 (0.32) 0.60 (0.32) 0.60 (0.32) 0.60 (0.32) 0.92 (0.33) 0.39 (0.32) 0.09 (0.31) 0.19 (0.31) 0.81 (0.33)

Video Measure (S.E.) 0.81 (0.33) -​0.10 (0.32) 0.72 (0.33) -​0.20 (0.32) 0.00 (0.31) 0.30 (0.32) 0.72 (0.33) 0.72 (0.33) 0.51 (0.32) 0.72 (0.33) 0.72 (0.33) 0.61 (0.33)

Medium Contrast (Joint S.E.) -​0.51 (0.46) -​0.21 (0.45) 0.20 (0.47) 0.79 (0.45) 0.60 (0.45) 0.30 (0.45) -​0.13 (0.46) 0.20 (0.47) -​0.12 (0.46) -​0.63 (0.46) -​0.53 (0.46) 0.20 (0.46)

* Measures are judges final Round 3 measures Fixed (all =​0) chi-​squared: 44.3 d.f.: 90 significance (probability): 1.00

Welch t (d.f.)

Prob FDR p q

-​1.11 (-​) -​0.46 (-​) 0.42 (-​) 1.75 (-​) 1.32 (-​) 0.65 (-​) -​0.27 (-​) 0.42 (-​) -​0.26 (-​) -​1.38 (-​) -​1.16 (-​) 0.42 (-​)

-​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​

-​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​ -​


Appendix L: DGF pairwise interactions Table L1  Round 1 DGF pairwise interactions Group (1) G1 (A) G1 (A) G1 (A) G1 (A) G1 (A) G1 (A) G1 (A) G1 (V) G1 (V) G1 (V) G1 (V) G1 (V) G1 (V) G2 (A) G2 (A) G2 (A) G2 (A) G2 (A) G2 (V) G2 (V) G2 (V) G2 (V) G3 (A) G3 (A) G3 (A) G3 (V) G3 (V) G4 (A)

Target measure (S.E.) 0.05 (0.11) 0.05 (0.11) 0.05 (0.11) 0.05 (0.11) 0.05 (0.11) 0.05 (0.11) 0.05 (0.11) -​0.05 (0.11) -​0.05 (0.11) -​0.05 (0.11) -​0.05 (0.11) -​0.05 (0.11) -​0.05 (0.11) 0.01 (0.09) 0.01 (0.09) 0.01 (0.09) 0.01 (0.09) 0.01 (0.09) -​0.01 (0.09) -​0.01 (0.09) -​0.01 (0.09) -​0.01 (0.09) -​0.07 (0.09) -​0.07 (0.09) -​0.07 (0.09) 0.07 (0.09) 0.07 (0.09) 0.23 (0.10)

Group (2) G1 (V) G2 (A) G2 (V) G3 (A) G3 (V) G4 (A) G4 (V) G2 (A) G2 (V) G3 (A) G3 (V) G4 (A) G4 (V) G2 (V) G3 (A) G3 (V) G4 (A) G4 (V) G3 (A) G3 (V) G4 (A) G4 (V) G3 (V) G4 (A) G4 (V) G4 (A) G4 (V) G4 (V)

Target measure (S.E.) -​0.05 (0.11) 0.01 (0.09) -​0.01 (0.09) -​0.07 (0.09) 0.07 (0.09) 0.23 (0.10) -​0.22 (0.10) 0.01 (0.09) -​0.01 (0.09) -​0.07 (0.09) 0.07 (0.09) 0.23 (0.10) -​0.22 (0.10) -​0.01 (0.09) -​0.07 (0.09) 0.07 (0.09) 0.23 (0.10) -​0.22 (0.10) -​0.07 (0.09) 0.07 (0.09) 0.23 (0.10) -​0.22 (0.10) 0.07 (0.09) 0.23 (0.10) -​0.22 (0.10) 0.23 (0.10) -​0.22 (0.10) -​0.22 (0.10)

Target contrast (Joint S.E.) 0.10 (0.15) 0.04 (0.14) 0.06 (0.14) 0.12 (0.14) -​0.02 (0.14) -​0.18 (0.15) 0.27 (0.14) -​0.06 (0.14) -​0.04 (0.14) 0.02 (0.14) -​0.11 (0.14) -​0.28 (0.15) 0.17 (0.14) 0.02 (0.13) 0.07 (0.13) -​0.06 (0.13) -​0.22 (0.13) 0.23 (0.13) 0.06 (0.13) -​0.07 (0.13) -​0.24 (0.13) 0.22 (0.13) -​0.13 (0.13) -​0.29 (0.14) 0.16 (0.13) -​0.16 (0.14) 0.29 (0.13) 0.45 (0.14)

Welch t (d.f.)

Prob. FDR p q

0.65 (808.00) 0.30 (872.10) 0.41 (868.80) 0.81 (875.10) -​0.11 (874.60) -​1.22 (870.30) -​1.89 (862.50) -​0.41 (871.60) -​0.30 (867.30) 0.11 (874.60) -​0.81 (874.20) -​1.89 (870.00) 1.21 (862.20) 0.13 (1168.00) 0.57 (1115.00) -​0.44 (1115.00) -​1.64 (1043.00) 1.76 (1051.00) 0.45 (1113.00) -​0.56 (1114.00) -​1.76 (1040.00) 1.64 (1048.00) -​0.99 (1078.00) -​2.14 (1023.00) 1.18 (1027.00) -​1.19 (1023.00) 2.15 (1027.00) 3.26 (987.40)

.51 .77 .69 .42 .91 .22 .06 .68 .77 .91 .42 .06 .23 .90 .57 .66 .10 .08 .66 .57 .08 .10 .32 .03 .24 .24 .03 .00

.84 .87 .84 .75 .92 .52 .32 .84 .87 .92 .75 .32 .52 .92 .84 .84 .32 .32 .84 .84 .32 .32 .66 .30 .52 .52 .30 .03


Table L2  Round 2 DGF pairwise interactions Group (1) G1 (A) G1 (A) G1 (A) G1 (A) G1 (A) G1 (A) G1 (A) G1 (V) G1 (V) G1 (V) G1 (V) G1 (V) G1 (V) G2 (A) G2 (A) G2 (A) G2 (A) G2 (A) G2 (V) G2 (V) G2 (V) G2 (V) G3 (A) G3 (A) G3 (A) G3 (V) G3 (V) G4 (A)

Target measure (S.E.) 0.07 (0.11) 0.07 (0.11) 0.07 (0.11) 0.07 (0.11) 0.07 (0.11) 0.07 (0.11) 0.07 (0.11) -​0.07 (0.11) -​0.07 (0.11) -​0.07 (0.11) -​0.07 (0.11) -​0.07 (0.11) -​0.07 (0.11) -​0.08 (0.09) -​0.08 (0.09) -​0.08 (0.09) -​0.08 (0.09) -​0.08 (0.09) 0.08 (0.09) 0.08 (0.09) 0.08 (0.09) 0.08 (0.09) -​0.06 (0.10) -​0.06 (0.10) -​0.06 (0.10) 0.06 (0.10) 0.06 (0.10) -​0.01 (0.10)

Group (2) G1 (V) G2 (A) G2 (V) G3 (A) G3 (V) G4 (A) G4 (V) G2 (A) G2 (V) G3 (A) G3 (V) G4 (A) G4 (V) G2 (V) G3 (A) G3 (V) G4 (A) G4 (V) G3 (A) G3 (V) G4 (A) G4 (V) G3 (V) G4 (A) G4 (V) G4 (A) G4 (V) G4 (V)

Target measure (S.E.) -​0.07 (0.11) -​0.08 (0.09) 0.08 (0.09) -​0.06 (0.10) 0.06 (0.10) -​0.01 (0.10) 0.01 (0.10) -​08 (0.09) 0.08 (0.09) -​0.06 (0.10) 0.06 (0.10) -​0.01 (0.10) 0.01 (0.10) 0.08 (0.09) -​0.06 (0.10) 0.06 (0.10) -​0.01 (0.10) 0.01 (0.10) -​0.06 (0.10) 0.06 (0.10) -​0.01 (0.10) 0.01 (0.10) 0.06 (0.10) -​0.01 (0.10) 0.01 (0.10) -​0.01 (0.10) 0.01 (0.10) 0.01 (0.10)

Target contrast (Joint S.E.) 0.13 (0.15) 0.15 (0.14) -​0.01 (0.14) 0.13 (0.14) 0.01 (0.14) 0.08 (0.15) 0.05 (0.15) 0.01 (0.14) -​0.15 (0.14) -​0.01 (0.14) -​0.13 (0.14) -​0.05 (0.15) -​0.08 (0.15) -​0.16 (0.13) -​0.02 (0.13) -​0.14 (0.13) -​0.07 (0.13) -​0.10 (0.13) 0.14 (0.13) 0.02 (0.13) 0.09 (0.13) 0.06 (0.13) -​0.12 (0.13) -​0.05 (0.14) -​0.08 (0.14) 0.07 (0.14) 0.05 (0.14) -​0.03 (0.14)

Welch t (d.f.)

Prob. FDR p q

0.87 (808.00) 1.04 (871.41) -​0.09 (871) 0.88 (877) 0.04 (876) 0.56 (861) 0.35 (866) 0.10 (.870) -​1.04 (871) -​0.04 (877) -​0.88 (875) 0.36 (861) -​0.56 (865) -​1.25 (1167) -​0.15 (1113) 1.07 (1114) -​0.50 (1051) -​0.71 (1046) 1.07 (1113) 0.15 (1114) 0.71 (1051) 0.48 (1046) -​0.90 (1077) -​0.34 (1028) -​0.55 (1026) 0.55 (1027) 0.33 (1025) -​0.21 (987)

.38 .30 .93 .38 .96 .58 .73 .92 .30 .97 .38 .72 .58 .21 .88 .28 .62 .48 .28 .88 .48 .63 .37 .73 .58 .58 .74 .83

.99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99 .99


Table L3  Round 3 DGF pairwise interactions

Group (1)   Target measure (S.E.)   Group (2)   Target measure (S.E.)   Target contrast (Joint S.E.)   Welch t (d.f.)   Prob. p   FDR q
G1 (A)    0.04 (0.11)   G1 (V)   -0.04 (0.11)    0.09 (0.15)    0.57 (16.00)   .58   .94
G1 (A)    0.04 (0.11)   G2 (A)   -0.02 (0.09)    0.07 (0.14)    0.48 (17.28)   .63   .94
G1 (A)    0.04 (0.11)   G2 (V)    0.02 (0.09)    0.02 (0.14)    0.14 (17.22)   .89   .99
G1 (A)    0.04 (0.11)   G3 (A)   -0.13 (0.09)    0.18 (0.14)    1.23 (17.44)   .23   .94
G1 (A)    0.04 (0.11)   G3 (V)    0.13 (0.10)   -0.09 (0.14)   -0.61 (17.53)   .55   .94
G1 (A)    0.04 (0.11)   G4 (A)    0.03 (0.10)    0.01 (0.15)    0.09 (17.12)   .93   .99
G1 (A)    0.04 (0.11)   G4 (V)   -0.03 (0.10)    0.07 (0.15)    0.51 (17.16)   .62   .94
G1 (V)   -0.04 (0.11)   G2 (A)   -0.02 (0.09)   -0.02 (0.14)   -0.14 (17.27)   .89   .99
G1 (V)   -0.04 (0.11)   G2 (V)    0.02 (0.09)   -0.07 (0.14)   -0.48 (17.21)   .63   .94
G1 (V)   -0.04 (0.11)   G3 (A)   -0.13 (0.09)    0.09 (0.14)    0.62 (17.43)   .54   .94
G1 (V)   -0.04 (0.11)   G3 (V)    0.13 (0.10)   -0.18 (0.14)   -1.22 (17.53)   .24   .94
G1 (V)   -0.04 (0.11)   G4 (A)    0.03 (0.10)   -0.07 (0.15)   -0.51 (17.11)   .62   .94
G1 (V)   -0.04 (0.11)   G4 (V)   -0.03 (0.10)   -0.01 (0.15)   -0.09 (17.16)   .93   .99
G2 (A)   -0.02 (0.09)   G2 (V)    0.02 (0.09)   -0.05 (0.13)   -0.38 (24.00)   .71   .94
G2 (A)   -0.02 (0.09)   G3 (A)   -0.13 (0.09)    0.11 (0.13)    0.84 (22.79)   .41   .94
G2 (A)   -0.02 (0.09)   G3 (V)    0.13 (0.10)   -0.16 (0.13)   -1.20 (22.74)   .24   .94
G2 (A)   -0.02 (0.09)   G4 (A)    0.03 (0.10)   -0.05 (0.13)   -0.41 (21.40)   .68   .94
G2 (A)   -0.02 (0.09)   G4 (V)   -0.03 (0.10)    0.01 (0.13)    0.05 (21.36)   .96   .99
G2 (V)    0.02 (0.09)   G3 (A)   -0.13 (0.09)    0.16 (0.13)    1.21 (22.77)   .24   .94
G2 (V)    0.02 (0.09)   G3 (V)    0.13 (0.10)   -0.11 (0.13)   -0.83 (22.72)   .41   .94
G2 (V)    0.02 (0.09)   G4 (A)    0.03 (0.10)   -0.01 (0.13)   -0.05 (21.36)   .96   .99
G2 (V)    0.02 (0.09)   G4 (V)   -0.03 (0.10)    0.05 (0.13)    0.41 (21.32)   .68   .94
G3 (A)   -0.13 (0.09)   G3 (V)    0.13 (0.10)   -0.27 (0.13)   -1.98 (22.00)   .06   .94
G3 (A)   -0.13 (0.09)   G4 (A)    0.03 (0.10)   -0.16 (0.14)   -1.21 (20.89)   .24   .94
G3 (A)   -0.13 (0.09)   G4 (V)   -0.03 (0.10)   -0.10 (0.14)   -0.76 (20.87)   .46   .94
G3 (V)    0.13 (0.10)   G4 (A)    0.03 (0.10)    0.10 (0.14)    0.75 (20.92)   .46   .94
G3 (V)    0.13 (0.10)   G4 (V)   -0.03 (0.10)    0.16 (0.14)    1.20 (20.91)   .25   .94
G4 (A)    0.03 (0.10)   G4 (V)   -0.03 (0.10)    0.06 (0.14)    0.44 (20.00)   .66   .94
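Note. The statistics in the DGF tables above follow the usual pairwise-contrast logic for Rasch measures: the target contrast is the difference between the two group measures, the joint standard error pools the two measure standard errors, and the Welch t statistic is their ratio, evaluated against approximate degrees of freedom. The following is an illustrative sketch rather than the verbatim FACETS computation; n_1 and n_2 stand for the numbers of ratings underlying each group measure, which are not reported in the tables.

\[
\text{Contrast} = \hat{\theta}_{1} - \hat{\theta}_{2},
\qquad
SE_{\text{joint}} = \sqrt{SE_{1}^{2} + SE_{2}^{2}},
\qquad
t = \frac{\hat{\theta}_{1} - \hat{\theta}_{2}}{SE_{\text{joint}}},
\]
\[
\nu \approx \frac{\left(SE_{1}^{2} + SE_{2}^{2}\right)^{2}}
{\dfrac{SE_{1}^{4}}{n_{1}-1} + \dfrac{SE_{2}^{4}}{n_{2}-1}}
\qquad \text{(Welch--Satterthwaite approximation).}
\]

For example, in the first row of Table L3 the contrast is 0.04 - (-0.04) = 0.08 (reported as 0.09 from the unrounded measures), the joint standard error is the square root of (0.11^2 + 0.11^2), about 0.15, and the reported t of 0.57 on 16.00 degrees of freedom is consistent with these values within rounding; the q column gives the corresponding false discovery rate adjusted probabilities.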


Appendix M: Wright maps

Figure M1 Form A Round 1 Ratings


Figure M2 Form B Round 1 Ratings


Figure M3 Form A Round 2 Ratings

Figure M4 Form B Round 2 Ratings


Figure M5 Form A Round 3 Ratings

Figure M6 Form B Round 3 Ratings


References Abdi, H. (2010). Holm’s Sequential Bonferroni Procedure. In N. J. Salkind (Ed.), Encyclopedia of research design (pp. 575–576). doi:10.4135/9781412961288 Thousand Oaks: SAGE Publications, Inc. Alderson, J. C. (2006). Diagnosing Foreign Language Proficiency: The Interface between Learning and Assessment. London: Continuum. Alderson, J. C., Clapham, C., & Wall, D. (2005). Language test construction and evaluation. Cambridge: Cambridge University Press. American Council on the Teaching of Foreign Languages. (2012). ACTFL proficiency guidelines 2012. Alexandria: American Council on the Teaching of Foreign Languages. Retrieved from https://​www.actfl.org/​resour​ces/​actfl-​prof​ icie​ncy-​gui​deli​nes-​2012 American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington: American Educational Research Association. Andrich, D., & Marais, I. (2019). A Course in Rasch Measurement Theory: Measuring in the Educational, Social, and Health Sciences. Singapore: Springer Nature Singapore Pte Ltd. Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–​600). Washington: American Council of Education. Audiotranskription.de. (2014a). F4analyse (Version 1.0)[Computer Software]. Retrieved from https://www.audiotranskription.de. Audiotranskription.de. (2014b). F4transkript (Version 1.0) [Computer Software]. Retrieved from https://www.audiotranskription.de. Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press. Banerjee, J. (2004). Qualitative analysis methods. In S. Takala (Ed.), Reference supplement to the manual for relating examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment (Section D) . Strasbourg: Language Policy Division, Council of Europe. Retrieved from https://​www.coe.int/​en/​web/​com​mon-​europ​ean-​framew​ork-​refere​nce-​ langua​ges/​add​itio​nal-​mater​ial Bejar, I. I. (1983). Subject matter experts’ assessment of item statistics. Applied Psychological Measurement, 7, 303–​310. doi:10.1177/​014662168300700306


Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 298–​300. Retrieved from https://​www.jstor.org/​sta​ble/​2346​101 Benjamini, Y., Krieger, A. M., & Yekutiele, D. (2006). Adaptive linear step-​up procedures that control the false discover rate. Biometrika, 93(3), 491–​507. doi:10.1093/​biomet/​93.3.491 Bond, T. G., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). New York: Routledge. Boone, W. J., & Staver, J. R. (2020). Advances in Rasch Analyses in the Human Sciences. Cham: Springer Nature Switzerland AG. Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff standard-​setting topics. Applied Measurement in Education, 17(1), 59–​88. doi:10.1207/​s15324818ame1701_​4 Brennan, R. L. (2001). BB-​CLASS (version 1.1) [Computer software]. Iowa: Center for Advanced Studies in Measurement, University of Iowa. Retrieved from https://​educat​ion.uiowa.edu/​casma/​compu​ter-​progr​ams Brennan, R. L., & Wan, L. (2004, June). A bootstrap procedure for estimating decision consistency for single administration complex assessments. Casma Research Report, 7, 1–​23. Retrieved from https://​educat​ion.uiowa.edu/​sites/​ educat​ion.uiowa.edu/​files/​2021-​11/​casma-​resea​rch-​rep​ort-​7.pdf British Council, UKALTA, EALTA, & ALTE. (2022). Aligning Language Education with the CEFR: A Handbook. British Council, UKALTA, EALTA, ALTE. Retrieved from https://​www.bri​tish​coun​cil.org/​sites/​defa​ult/​files/​cef​r_​ al​ignm​ent_​hand​book​_​lay​out.pdf Brunfaut, T., & Harding, L. (2014). Linking the GEPT listening test to the Common European Framework of Reference. Taipei: The Language Training and Testing Center. Retrieved from https://​www.lttc.ntu.edu.tw/​lttc-​gept-​gra​nts/​RRep​ ort/​RG05.pdf Bryman, A. (2004). Social research methods (2nd ed.). Oxford: Oxford University Press. Buckendahl, C. W. (2011). Additional blended development and validation activities. Orlando: Paper presented at the CCSSO National Conference on Student Assessment. Buckendahl, C. W., & Davis-​Becker, S. L. (2012). Setting passing standards for credentialing programs. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (Second ed., pp. 485–​501). New York: Routledge.


Buckendahl, C. W., Ferdous, A. A., & Gerrow, J. (2010). Recommending cut scores with a subset of items: An empirical illustration. Practical Assessment Research & Evaluation, 15, 1–​10. doi:10.7275/​tv3s-​cz67 Centre for Canadian Language Benchmarks. (2012). Canadian Language Benchmarks: English as a second language for adults (October 2012 ed.). Ottawa. Retrieved from http://​www.cic.gc.ca/​engl​ish/​pdf/​pub/​langu​age-​ben​ chma​rks.pdf Cetin, S., & Gelbal, S. (2013). A comparison of Bookmark and Angoff standard setting methods. Educational Sciences: Theory & Practice, 13(4), 2169–​2175. doi:10.12738/​estp.2013.4.1829 Chang, L. (1999). Judgmental item analysis of the Nedelsky and Angoff standard setting methods. Applied Measurement in Education, 12(2), 151–165. doi:10.1207/s15324818ame1202_3 Cizek, G. J. (2012a). An introduction to contemporary standard setting: Concepts, characteristics, and contexts. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 3–​ 14). New York: Routledge. Cizek, G. J. (2012b). The forms and functions of evaluations in the standard setting process. In Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 165–​178). New York: Routledge. Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. London: Sage Publications. Cizek, G. J., & Earnest, D. S. (2016). Setting performance standards on tests. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 212–​237). New York: Routledge. Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31–​50. doi:10.1111/​j.1745-​3992.2004.tb00166.x Cohen, A. S., Kane, M. T., & Crooks, T. J. (1999). A generalized examinee-​centered method for setting standards on achievement tests. Applied Measurement in Education, 12(4), 343–​366. doi:10.1207/​S15324818AME1204_​2 Corbin, J., & Strauss, A. (2015). Basics of qualitative research: Techniques and procedures for developing grounded theory. California: Sage Publications, Inc. Council of Europe. (2001). Common European Framework of References for Languages: Learning, teaching, assessment. Cambridge: Cambridge University Press. Council of Europe. (2003). Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment: Preliminary Pilot Manual. Strasbourg: Language Policy Division.


Council of Europe. (2009). Relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment (CEFR). A Manual. Strasbourg: Language Policy Division. Council of Europe. (2020). Common European Framework of Reference for Languages: Learning, teaching, assessment: Companion volume. Strasburg. Retrieved from http://​www.coe.int/​lang-​cefr Creswell, J. W. (2012). Educational research: Planning, conducting, and evaluating quantitative and qualitative research (4th ed.). Boston: Pearson. Creswell, J. W. (2014). Research design (4th ed.). Los Angeles: SAGE Publications. Creswell, J. W. (2015). A concise introduction to mixed methods research. London: Sage Publications Ltd. Davis-​Becker, S. L., & Buckendahl, C. W. (2013). A proposed framework for evaluating alignment studies. Educational Measurement: Issues and Practice, 32(1), 23–​33. doi:10.1111/​emip.12002 Dennis, A. R., & Kinney, S. T. (1998, September). Testing Media Richness Theory in the New Media: The effects of cues, feedback, and task equivocality. Information Systems Research, 9(3), 256–​274. doi:10.1287/​isre.9.3.256 Downey, N., & Kollias, C. (2009). Standard setting for listening, grammar, vocabulary and reading sections of the Advanced Level Certificate in English (ALCE). In N. Figueras, & J. Noijons (Eds.), Linking to the CEFR levels: Research Perspectives (pp. 125–​130). Arnhem: Cito, Council of Europe, European Association for Language Testing and Assessment (EALTA). Retrieved from http://​www.coe.int/​t/​dg4/​lin​guis​tic/​Proc​eedi​ngs_​CITO​_​EN.pdf Downey, N., & Kollias, C. (2010). Mapping the Advanced Level Certificate in English (ALCE) examination onto the CEFR. In W. Martyniuk (Ed.), Aligning tests with the CEFR: Reflections on using the Council of Europe’s draft manual (pp. 119–​129). Cambridge: Cambridge University Press. Dunlea, J., & Figueras, N. (2012). Replicating results from a CEFR test comparison project across continents. In D. Tsagari, & C. Ildikó (Eds.), Collaboration in language testing and assessment (Vol. 26, pp. 31–​45). Frankfurt: Peter Lang. Dunn, O. J. (1961, March). Multiple Comparisons Among Means. Journal of the American Statistical Association, 56(293), 52–​64. doi:10.2307/​2282330 Eckes, T. (2009). Many-​facet Rasch Measurement. In S. Takala (Ed.), Reference supplement to the manual for relating examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment (Section H). Strasbourg: Council of Europe/​Language Policy Division. Retrieved from https://​www.coe.int/​en/​web/​com​mon-​europ​ean-​framew​ork-​refere​nce-​ langua​ges/​add​itio​nal-​mater​ial Eckes, T. (2011). Introduction to Many-​Facet Rasch measurement: Analyzing and evaluating rater-​mediated assessments (2nd ed.). Frankfurt: Peter Lang.


Eckes, T. (2015). Introduction to Many-​Facet Rasch Measurement: Analyzing and evaluating rater-​mediated assessments (2nd revised and updated ed.). Frankfurt: Peter Lang. Eckes, T., & Kecker, G. (2010). Putting the Manual to the test: The TestDaf -​CEFR linking project. In W. Martyniuk (Ed.), Aligning tests with the CEFR: Reflections on using the Council of Europe’s draft Manual (pp. 50–​79). Cambridge: Cambridge University Press. Egan, K. L., Schneider, M. C., & Ferrara, S. (2012). Performance level descriptors. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 79–​106). New York: Routledge. Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. New York: Routledge. Faez, F., Majhanovich, S., Taylor, S., Smith, M., & Crowley, K. (2011). The power of “can do” statements: Teacher’s perceptions of CEFR-​informed instruction in French as a second language classrooms in Ontario. Canadian Journal of Applied Linguistics, 14(2), 1–​19. Retrieved from https://​journ​als.lib.unb.ca/​ index.php/​CJAL/​arti​cle/​view/​19855/​21652 Feskens, R., Keuning, J., van Til, A., & Verheyen, R. (2014). Performance standards for the CEFR in Dutch secondary education: An international standard setting study. Cito: Arnem. Retrieved from https://www.cito.nl/-/­media/files/ kennisbank/citolab/2014_feskens-et-al_performance-standards-for-the-cefrin-dutch-secondary-education.pdf?la=nl-nl. Figueras, N, & Noijons, J. (Eds.). (2009). Linking to the CEFR levels: Research perspectives. Arnhem: Cito, EALTA. Finch, H., & Lewis, J. (2003). Focus groups. In J. Ritchie, & J. Lewis (Eds.), Qualitative research practice: A guide for social science students and researchers (pp. 170–​198). London: SAGE Publications. Fisher, W. R. (1992). Reliability, separation, and strata statistics. Rasch Measurement Transactions, 6, 238. Retrieved from http://​www.rasch.org/​rmt/​ rmt​63i.htm Gelin, M., & Michaels, H. (2011). Virtual and face-​to-​face standard setting: A blended model. Orlando: Paper presented at the CCSSO National Conference on Student Assessment. Glaser, B. G. (1965). The constant comparative method of qualitative analysis. Social Problems, 12(4), 436–​445. doi:10.2307/​798843 Glaser, B. G., & Struass, A. (1967). The discovery of grounded theory. Chicago: Aldine. Glaser, R. (1963). Instructional technology and the measurement of learning outcomes. American Psychologist, 18, 519–​521. doi:10.1037/​h0049294


GraphPad Prism (Version 9.3.1) for Windows. [Computer Software]. (2021). San Diego, California: GraphPad Software. Retrieved from www.graph​pad.com Guetterman, T. C., & Salamoura, A. (2016). Enhancing test validation through rigorous mixed methods components. In a. J. Moeller, J. W. Creswell, & N. Saville (Eds.), Second language assessment and mixed methods research (pp. 153–​176). Cambridge: Cambridge University Press. Hambleton, R. K., & Novick, M. R. (1973). Toward an integration of theory and method for criterion-​referenced tests. Journal of Educational Measurement, 10(3), 159–​170. doi:10.1111/​j.1745-​3984.1973.tb00793.x Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–​470). Westport: Praeger Publishers. Hambleton, R. K., Pitoniak, M. J., & Copella, J. M. (2012). Essential steps in setting performance standards on educational tests and strategies for assessing the reliability of results. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 47–​ 76). New York: Routledge. Hambleton, R. K., Zenisky, A. L., & Popham, W. J. (2016). Criterion-​referenced testing: Advances over 40 years. In C. S. Wells, & M. Faulkner-​ Bond (Eds.), Educational measurement: From foundations to future (pp. 23–​37). New York: The Guilford Press. Hambleton, R., & Eignor, D. R. (1978). Competency test development, validation, and standard-​setting. Paper presented at the Minimum Competency Testing Conference of the American Education Research Association, (pp. 1–​ 50). Washington, DC. Retrieved from http://​files.eric.ed.gov/​fullt​ext/​ED206​ 725.pdf Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27(4), 345–​ 359. doi:10.1111/​j.1745-​ 3984.1990.tb00753.x Harding, L. (2017). What do raters need in a pronunciation scale?: The user’s view. In T. Isaacs, & P. Trofimovich (Eds.), Second language pronunciation assessment: Interdisciplinary perspectives (Vol. 107, pp. 12–​ 34). Bristol: Multilingual Matters/​Channel View Publications. Retrieved from https://​www.jstor.org/​sta​ble/​10.21832/​j.ctt​1xp3​wcc Harsch, C., & Hartig, J. (2015). What are we aligning tests to when we report test alignment to the CEFR? Language Assessment Quarterly, 12(4), 333–​362. doi:10.1080/​15434303.2015.1092545 Harvey, A. L., & Way, W. D. (1999). A comparison of web-​based standard setting and monitored standard setting. Montreal: Paper presented at the annual


meeting of the National Council of Measurement in Education. Retrieved from https://​files.eric.ed.gov/​fullt​ext/​ED463​747.pdf Haynes, W. (2013). Benjamini–​ Hochberg Method. In W. Dubitzky, O. Wolkenhauer, H. Yokota, & K.-​H. Cho (Eds.), Encyclopedia of Systems Biology. New York: Springer. doi:10.1007/​978-​1-​4419-​9863-​7_​1215 Hellenic American University. (n.d.). Basic Communication Certificate in English (BCCE™): Official past examination Form A test booklet. Retrieved from Hellenic American University: https://hauniv.edu/images/pdfs/bcce_past_­ paper_form_a_test_booklet2.pdf Hoyt, W. T. (2010). Interrater reliability and agreement. In G. R. Hancock, & R. O. Mueller (Eds.), The reviewer’s guide to quantitative methods in social sciences (pp. 141–​154). New York: Routledge. Hsieh, M. (2013). An application of Multifaceted Rasch measurement in the Yes/​No Angoff standard setting procedure. Language Testing, 30(4), 491–​512. doi:10.1177/​0265532213476259 Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of educational measurement, 34(4), 353–​ 366. doi:10.1111/​j.1745-​ 3984.1997.tb00523.x Impara, J. C., & Plake, B. S. (1998). Teacher’s ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69–​81. doi:10.1111/​ j.1745-​3984.1998.­tb00528.x Interagency Language Roundtable. (n.d.). Descriptions of proficiency Levels. Retrieved from Interagency Language Round Table: https://​www.govt​ilr.org/​ index.htm Iramaneerat, C., Smith, E. V., & Smith, R. (2008). An introduction to Rasch measurement. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 50–​70). Los Angeles: Sage Publications. Irwin, P. M., Plake, B. S., & Impara, J. C. (2000). Validity of item performance estimates from an Angoff standard setting. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans. Retrieved from http://​files.eric.ed.gov/​fullt​ext/​ED443​875.pdf Ivankova, N. V., & Creswell, J. W. (2009). Mixed methods. In J. Heigham, & R. Croker (Eds.), Qualitative research in applied linguistics (pp. 135–​163). Palgrave Macmillan. Jaeger, R. M. (1989). The certification of student competence. In R. L. Linn (Ed.), Educational Measurement (3rd ed.). New York: Macmillan. Jaeger, R. M. (1991). Selection of judges for standard-​ setting. Educational Measurement: Issues and Practices, 10(2), 3–​ 14. doi:10.1111/​ j.1745-​3992.1991.­tb00185.x


Jang, E. E., Wagner, M., & Park, G. (2014). Mixed methods research in language testing and assessment. Annual Review of Applied Linguistics, 34, 123–​153. doi:10.1017/​S0267190514000063 Johnson, R. B., & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher, 33(7), 14–​26. Retrieved from http://​www.jstor.org/​sta​ble/​3700​093 Kaftandjieva, F. (2004). Standard setting. In S. Takala (Ed.), Reference supplement to the manual for relating examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment (Section B). Strasbourg: Language Policy Division, Council of Europe. Retrieved from https:// ​ w ww.coe.int/​ e n/​ web/​ c om​ mon-​ e urop​ e an- ​ f ramew​ ork- ​ refere ​ nce-​ langua​ges/​add​itio​nal-​mater​ial Kaftandjieva, F. (2010). Methods for setting cut scores in criterion-​referenced achievement tests: A comparative analysis of six recent methods with an application to tests of reading in EFL. Arnhem: Cito. Retrieved from http://​ www.ealta.eu.org/​docume​nts/​resour​ces/​FK_​s​econ​d_​do​ctor​ate.pdf Kaftandjieva, F., & Takala, S. (2002). Council of Europe scales of language proficiency: A validation study. In C. J. Alderson (Ed.), Common European Framework of References for languages: Learning, teaching, assessment: Case studies (pp. 106–​129). Strasburg: Council of Europe. Kahl, S. R., Crockett, T. J., DePascale, C. A., & Rindfleish, S. L. (1994). Using actual student work to determine cut scores for proficiency levels: New methods for new tests. Paper presented at the National Conference on Large Scale Assessment, Albuquerque. Retrieved from http://​files.eric.ed.gov/​fullt​ext/​ED380​479.pdf Kaliski, P. K., Wind, S. A., Engelhard, G. J., Morgan, D. L., Plake, B. S., & Reshetar, R. A. (2012). Using the Many-​Faceted Rasch Model to Evaluate Standard Setting Judgments: An illustration with the advanced placement environmental science exam. Educational and Psychological Measurement, 73(3), 386–​341. doi:10.1177/​0013164412468448 Kane, M. T. (1998). Choosing between examinee-centered and test-centered standard-setting methods. Educational Assessment, 5(3), 129–145. doi:10.1207/ s15326977ea0503_1 . Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 53–​88). Mahwah: Lawrence Erlbaum Associates. Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1–​16). West Port: Praeger Publishers.


Kane, M. T., & Wilson, J. (1984). Errors of measurement and standard setting in mastery testing. Applied Psychological Measurement, 8(1), 107–​115. doi:10.1177/​014662168400800111 Kannan, P., Sgammato, R., Tannenbaum, R. J., & Katz, I. (2015). Evaluating the consistency of Angoff-​Based cut scores using subsets of items within a generalizability theory framework. Applied Measurement in Education, 28(3), 169–​186. doi:10.1080/​08957347.2015.1042156 Kantarcioglu, E., & Papageorgiou, S. (2011). Benchmarking and standards in language tests. In B. O’Sullivan (Ed.), Language testing: Theories and practices (pp. 94–​110). Basingstoke: Palgrave MacMillan. Karantonis, A., & Sireci, S. G. (2006). The bookmark standard-​setting method: A literature review. Educational Measurement: Issues and Practice, 25(1), 4–​12. doi:10.1111/​j.1745-​3992.2006.00047.x Karau, S. J., & Williams, K. D. (1993). Social loafing: A meta-​analytic review and theoretical integration. Journal of Personality and Social Psychology, 65(4), 681–​706. doi:10.1037/​0022-​3514.65.4.681 Katz, I. R., & Tannenbaum, R. J. (2014). Comparison of web-​based and face-​ to-​face standard setting using the Angoff method. Journal of Applied Testing Technology, 15(1), 1–​17. Katz, I. R., Tannenbaum, R. J., & Kannan, P. (2009). Virtual standard setting. Clear Exam Review, 20(2), 19–​27. Kenyon, D. M., & Römhild, A. (2014). Standard setting in language testing. In A. J. Kunnan (Ed.), The companion to language assessment (First ed., Vol. 2, pp. 1–​18). John Wiley & Sons, Inc. doi:10.1002/​9781118411360.wbcla145 Kim, J. (2010). Within-​subjects design. In N. J. Salkind (Ed.), Encyclopedia of research design (pp. 1639–​1644). Thousand Oaks: SAGE Publications, Inc. doi:10.4135/​9781412961288.n503 Kingston, N. M., Kahl, S. R., Sweeney, K. P., & Bay, L. (2001). Setting performance standards using the body of work method. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 219–​248). Mahwah: Lawrence Erlbaum Associates. Kinnera, P. R., & Gray, C. D. (2008). SPSS 15 made simple. Hove: Psychology Press. Kleiber, P. B. (2014). Focus groups: More than a method of qualitative inquiry. In K. deMarrais, & S. D. Lapan (Eds.), Foundations for research: Methods of inquiry in education and the social sciences (3rd ed., pp. 87–​102). New York: Routledge. Knoch, U., & McNamara, T. (2015). Rasch Analysis. In L. Plonsky (Ed.), Advancing quantitative methods in second language research (pp. 275–​304). New York: Routledge.


Kock, N. (2004). The psychobiological model: Towards a new theory of computer-​ mediated communication based on Darwinism evolution. Organization Science, 15(3), 327–​348. doi:10.1287/​orsc.1040.0071 Kock, N. (2005). Media richness or media naturalness? The evolution of our biological communication apparatus and its influence on our behavior toward e-​communication tools. IEEE Transactions on Professional Communication, 48(2), 117–​130. doi:10.1109/​TPC.2005.849649 Kock, N. (2010). Evolutionary psychology and information systems theorizing. In N. Kock (Ed.), Evolutionary psychology and information systems research (pp. 3–​38). New York: Springer. Kollias, C. (2012). Standard setting of the Basic Communication Certificate in English (BCCE™) examination: Setting a Common European Framework of Reference (CEFR) B1 cut score. Technical Report. Hellenic American University. Retrieved from https://​hau​niv.edu/​ima​ges/​pdfs/​bcce_​s​tand​ard-​sett​ing-​repor​ t_​v2​002.pdf Kollias, C. (2013). Collecting procedural evidence through comprehensive evaluation survey forms of panelists’ impressions. Paper presented at BAAL Testing, Evaluation, and Assessment SIG. Bedfordshire. Retrieved from https://​ www.baa​ltea​sig.co.uk/​_​fi​les/​ugd/​92ac44_​e88ca​8604​b114​541b​c201​aa8e​ddfd​ b39.pdf Krueger, R. A., & Casey, M. A. (2015). Focus groups: A practical guide for applied research (5th ed.). Los Angeles: SAGE Publications. La Marca, P. M. (2001). Alignment of standards and assessments as an accountability criterion. Practical Assessment, Research & Evaluation, 7(21). doi:10.7275/​ahcr-​wg84 Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996). Standard setting: A bookmark approach. In R.D. Green (Chair). Symposium conducted at the Council of Chief State Officers National Conference on Large-​Scale Assessment, Phoenix. Lim, G. S., Geranpayeh, A., Khalifa, H., & Buckendahl, C. W. (2013). Standard setting to an international reference framework: Implications for theory and practice. International Journal of Testing, 13(1), 32–49. doi: 10.1080/ 15305058.2012.678526. Linacre, J. M. (1989/​1994). Many-​Facet Rasch measurement. Chicago: Mesa Press. Linacre, J. M. (2004). In E. V. Smith, & S. Richard (Eds.), Introduction to Rasch measurement: Theory, models and applications (pp. 258–​278). Maple Grove: JAM Press. Linacre, J. M. (2005). Rasch dichotomous model vs. One-​parameter Logistic Model. Rasch Measurement Transactions, 19(3), 1032. Retrieved from https://​ www.rasch.org/​rmt/​rmt1​93h.htm


Linacre, J. M. (2014). Winsteps® (Version 3.81.0) [Computer Software]. Beaverton. Retrieved from www.winst​eps.com Linacre, J. M. (2020a). A user’s guide to FACETS Rasch-​ model computer programs (Program manual 3.83.4). Retrieved from http://​www.winst​ eps.­com/​manu​als.htm Linacre, J. M. (2020b). Winsteps® Rasch measurement computer program: User’s Guide. Oregon: Winsteps.com. Retrieved from http://​www.winst​eps.com/​ manu​als.htm Linacre, J. M. (2020c). Winsteps® (Version 4.7.0.0) [Computer Software]. Beaverton. Retrieved from www.winst​eps.com Linacre, J. M. (2022). Facets (Many-​ Facet Rasch Measurement) computer program (Version 3.84.0) [Computer software]. Retrieved from www.winst​ eps.com Linn, R. L., & Gronlund, N. E. (1995). Measurement and assessment in teaching (7th ed.). New Jersy: Prentice Hall. Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classification based on test scores. Journal of Educational Measurement, 32(2), 179–​197. doi:10.1111/​j.1745-​3984.1995.tb00462.x Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards on performance on educational and occupational tests. Princeton: Educational Testing Service. Lodenback, T., Stranger, M., & Martin, E. (2015, December 17). The top 10 business schools with the highest GMAT scores. Retrieved from Business Insider UK: http://​uk.busi​ness​insi​der.com/​busin​ess-​scho​ols-​with-​high​est-​ gmat-​sco​res-​2015-​12 Loomis, S. C. (2012). Selecting and training standard setting participants: State of the art policies and procedures. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 107–​134). New York: Routledge. Lorie, W. (2011). Setting standard remotely. Orlando: Paper presented at the National Conference on Student Assessment. Lu, Z., & Yuan, K. (2010). Welch’s t test. In Encyclopedia of research design (pp. 1621–​ 1623). Thousand Oaks: SAGE Publications, Inc. doi:10.4135/​ 9781412961288.n497 Lunz, M. E., & Stahl, J. (2017). Ben Wright: A Multi-​facet Analysis. In M. Wilson, & W. P. Fisher Jr. (Eds.), Psychological and Social Measurement: The Career and Contributions of Benjamin D. Wright (pp. 33–​43). Switzerland: Springer International publishing AG.


Mackay, A. (2007). Motivation, ability and confidence building in people. London: Routledge. Mager, R. F. (1962). Preparing instructional objectives. Palo Alto: Fearon Publishers. Martyniuk, W. (Ed.). (2010). Aligning tests with the CEFR: Reflections on using the Council of Europe’s draft Manual. Cambridge: Cambridge University Press. McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–​ 46. doi:10.1037/​ 1082-​989X.1.1.30 McNamara, T. F. (1996). Measuring second language performance. London: Longman. McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4), 555–​576. doi:10.1177/​0265532211430367 Miles, M. B., Huberman, A. M., & Saldana, J. (2020). Qualitative Data Analysis: A Methods Sourcebook (4th ed.). Thousand Oaks: SAGE Publications Inc. Mitzel, H. C., Lewis, D. M., Patz, R. J., & Ross, G. D. (2001). The bookmark procedure: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 249–​ 281). Mahwah: Lawrence Erlbaum Associates. Moeller, A. J. (2016). The confluence of language assessment and mixed methods. In A. J. Moeller, J. W. Creswell, & N. Saville (Eds.), Second language assessment and mixed methods (pp. 3–​16). Cambridge: Cambridge University Press. Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using Many-​Facet Rasch measurement: Part 1. In E. V. Smith, & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 460–​517). Maple Grove: JAM Press. Onwuegbuzie, A. J., Dickinson, W. B., Leech, N. L., & Zoran, A. G. (2009). A qualitative framework for collecting and analyzing data in focus group research. International Journal of Qualitative Methods, 8(3), 1–​21. doi:10.1177/​ 160940690900800301 Ornstein, A., & Gilman, D. A. (1991). The striking contrasts between norm-​ referenced and criterion-​referenced tests. Contemporary Education, 62(4), 287–​293. O’Sullivan, B. (2008). City & Guilds Communicator IESOL Examination (B2) CEFR Linking Project: Case study report. City & Guilds Research Report. O’Sullivan, B. (2010). The City & Guilds Communicator examination linking project: A brief overview with reflections on the process. In W. Martyniuk (Ed.), Aligning test with the CEFR: Reflections on using the Council of Europe’s draft manual (pp. 33–​49). Cambridge: Cambridge University Press.


Papageorgiou, S. (2009). Setting performance standards in Europe: The judge’s contribution to relating language examinations to the Common European Framework of Reference. Frankfurt: Peter Lang. Papageorgiou, S. (2016). Aligning language assessment to standards and frameworks. In D. Tsagari, & J. Banerjee (Eds.), Handbook of Second Language Assessment (pp. 327–​340). Boston: De Gruyter. Pearson (2012). PTE Academic score guide. (November 2012, version 4). London: Pearson Education Ltd. Retrieved from: https://pearson.com.cn/file/PTEA_­ Score_Guide.pdf. Piccardo, E. (2013). The ECEP project and the key concepts of the CEFR. In E. D. Galaczi, & C. J. Weir (Eds.), Exploring language frameworks: Proceedings of the ALTE Krakow conference, July 2011 (pp. 187–​204). Cambridge: Cambridge University Press. Piezon, S. L., & Feree, W. D. (2008). Perceptions of social loafing in online learning groups: A study of public university and U.S. naval war college students. International Review of Research in Open and Distance Learning, 9(2), 1–​17. doi:10.19173/​irrodl.v9i2.484 Pitoniak, M. J. (2003). Standard setting methods for complex licensure (Unpublished doctoral dissertation). Amherst: University of Massachusetts. Pitoniak, M. J., & Cizek, G. J. (2016). Standard setting. In C. S. Wells, & M. Faulkner-​Bond (Eds.), Educational measurement: From foundations to future (pp. 38–​61). New York: The Guildford Press. Plake, B. S., & Cizek, G. J. (2012). Variations on a theme: The modified Angoff, the extended Angoff, and yes/​no standard setting methods. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 181–​199). New York: Routledge. Plake, B. S., Impara, J. C., & Irwin, P. (1999). Validation of Angoff-​based predictions of item performance. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal. Retrieved from http://​files. eric.­ed.gov/​fullt​ext/​ED430​004.pdf Pollitt, A., & Hutchinson, C. (1987). Calibrated graded assessments: Rasch partial credit analysis of performance in writing. Language Testing, 4(1), 72–​ 92. doi:10.1177/​026553228700400107 Popham, W. J., & Husek, T. R. (1969). Implications of criterion-​referenced measurement. Journal of Educational Measurement, 6(1), 1–​9. Retrieved from http://​www.jstor.org/​sta​ble/​1433​917 Rasch, G. (1960/​1980). Probabilistic models in some intelligence and attainment test. Chicago: The University of Chicago Press.


Raymond, M. R., & Reid, J. B. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 119–​157). Mahwah: Lawrence Erlbaum Associates. Reckase, M. D., & Chen, J. (2012). The role, format, and impact of feedback to standard setting panelists. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 149–164). New York: Routledge. Rheingold, H. (1993). A slice of life in my virtual community. In L. M. Harasim (Ed.), Global networks: Computers and international communication (pp. 57–​ 80). Cambridge: MIT Press. Rouam, S. (2013). False Discovery Rate (FDR). In W. O. Dubitzky W (Ed.), Encyclopedia of Systems Biology. New York: Springer. doi:10.1007/​ 978-​1-​4419-​9863-​7 Saldana, J. (2013). The Coding Manual for Qualitative Researchers (2nd ed.). London: SAGE Publications Ltd. Savin-​Baden, M., & Major, C. H. (2013). Qualitative research: The essential guide to theory and practice. London: Routledge. Schnipke, D. L., & Becker, K. A. (2007). Making the test development process more efficient using web-​ based virtual meeting. CLEAR Exam Review, 18(1), 13–​17. Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–​ 428. doi:10.1037/​0033-​2909. 86.2.420 Sireci, S. G., & Hambleton, R. K. (1997). Future directions for norm-​referenced and criterion-​ referenced achievement testing. International Journal of Educational Research, 27(5), 379–​393. Sireci, S. G., Randall, J., & Zenisky, A. (2012). Setting valid performance standards on educational tests. CLEAR Exam Review, 23(2), 18–​27. Skaggs, G., & Tessema, A. (2001). Item disordinality with the bookmark standard setting procedure. Paper presented at the annual meeting of the National Council of Measurement in Education, Seattle. Retrieved from http://​files.­eric. ed.gov/​fullt​ext/​ED453​275.pdf Skorupski, W. P. (2012). Understanding the cognitive processes of standard setting panelists. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 135-147). New York: Routledge. Skorupski, W. P., & Hambleton, R. K. (2005). What are panelists thinking when they participate in standard-​setting studies? Applied Measurement in Education, 18(3), 233–​256. doi:10.1207/​s15324818ame1803_​3


Snyder, S., & Sheeham, R. (1992). The Rasch measurement model: An introduction. Journal of Early Intervention, 16(1), 87–​95. doi:10.1177/​105381519201600108 Stearns, M., & Smith, R. M. (2009). Estimation of decision consistency indices for complex assessments: Model based approaches. In E. V. Smith, & G. Stone (Eds.), Criterion referenced testing: Practice analysis to score reporting using Rasch measurement models (pp. 513–​527). Maple Grove: JAM Press. Stemler, S. E., & Tsai, J. (2008). Best practices in interrater reliability: Three common approaches. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 29–​49). California: Sage Publications, Inc. Stone, G. E. (2004). Objective standard setting (or truth in advertising). In E. V. Smith, & S. Richard (Eds.), Introduction to Rasch measurement: Theory, models and applications (pp. 445–459). Maple Grove: JAM Press. Stone, G. E. (2009a). Objective standard setting for judge-​mediated examinations. In E. V. Smith, & G. E. Stone (Eds.), Criterion referenced testing: Practice analysis to score reporting using Rasch measurement models (pp. 294–​311). Maple Grove: JAM Press. Stone, G. E. (2009b). Introduction to the Rasch family of standard setting methods. In E. V. Smith, & G. E. Stone (Eds.), Criterion referenced testing: Practice analysis to score reporting using Rasch measurement models. (pp. 138–147) Maple Grove: JAM Press. Stone, G. E., Belyokova, S., & Fox, C. M. (2008). Objective standard setting for judge-​mediated examinations. International Journal of Testing, 8(2), 180–​196. doi:10.1080/​15305050802007083 Storey, J. D. (2002). A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 479–​498. Retrieved from https://​www.jstor.org/​sta​ble/​3088​784 Subkoviak, M. J. (1988). A practitioner’s guide to computation of interpretation of reliability indices for mastery tests. Journal of Educational Measurement, 25(1), 47–​55. doi:10.1111/​j.1745-​3984.1988.tb00290.x Tannenbaum, R. J. (2013). Setting standards on the TOEIC(R) listening and reading test and the TOEIC(R) speaking and writing tests: A recommended procedure. In The research foundation for the TOEIC tests: A compendium of studies (Vol. II, pp. 8.1–​8.12). Princeton: Educational Testing Service. Tannenbaum, R. J., & Wylie, E. C. (2008). Linking English-​Language test scores onto the Common European Framework of Reference: An application of standard-​setting methodology (ETS Research Rep. No. RR-​ 08-​ 34). Princeton: ETS. Retrieved from https://​www.ets.org/​Media/​Resea​rch/​pdf/​ RR-​08-​34.pdf


TeamViewer 9 (Version 9.0) [Computer software]. (2014). TeamViewer GmbH. Retrieved from https://​www.tea​mvie​wer.com Tiffin-​Richards, S. P., & Pant, H. A. (2013). Setting standards for English foreign language assessment: Methodology, validation, and degree of arbitrariness. Educational Measurement: Issues and Practice, 32(2), 15–​25. doi:10.1111/​ emip.12008 Tschirner, E., Bärenfänger, O., & Wanner, I. (2012). Assessing evidence of validity of assigning CEFR ratings to the ACTFL oral proficiency interview (OPI) and the oral proficiency interview by computer (OPIc): Technical report 2012-​US-​ PUB-​1. Leipzig: Institute for Test Research and Test Development. Turner, C. E. (2013). Mixed methods research. In The companion to language assessment (Vol. III, pp. 1403–​ 1407). John Wiley & Sons. doi:10.1002/​ 9781118411360.wbcla142 Wang, N. (2009). Setting passing standards for licensure and certification examinations: An item mapping procedure. In E. V. Smith, & G. E. Stone (Eds.), Criterion referenced testing: Practice analysis to score reporting using Rasch measurement models (pp. 236–​250). Maple Grove: JAM Press. Way, W. D., & McClarty, K. L. (2012). Standard setting for computer-​based assessment: A summary of mode comparability research and considerations. In G. C. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 451–​466). New York: Routledge. Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–​287. doi: 10.1177/​026553229801500205 Welch, B. L. (1947). The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved. Biometrika, 34(1/​2), 28–​35. Retrieved from http://​www.jstor.com/​sta​ble/​2332​510 Williams, K. D., & Karau, S. J. (1991). Social loafing and social compensation: The effects of expectations of co-​worker performance. Journal of Personality and Social Psychology, 61(4), 570–​581. doi:10.1037/​0022-​3514.61.4.570 Wolfe, E. W., & Dobria, L. (2008). Applications of the Multifaceted Rasch model. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 71–​85). California: Sage Publications, Inc. Wu, S. M., & Tan, S. (2016). Managing rater effects through the use of FACETS analysis: the case of a university placement test. Higher Education Research & Development, 35(2), 380–​394. doi:10.1080/​07294360.2015.1087381 Ying, L. (2010). Crossover design. In N. J. Salkind (Ed.), Encyclopedia of research design (pp. 310–​314). Thousand Oaks: SAGE Publications, Inc. doi:10.4135/​ 9781412961288.n95


Zhu, W., & Flaitz, J. (2005). Using focus group methodology to understand international students’ academic language needs: A comparison of perspectives. TESL-​EJ, 8(4), 1–​11. Retrieved from http://​www.tesl-​ej.org/​ wordpr​ess/​iss​ues/​volu​me8/​ej32/​ej3​2a3/​ Zieky, M. J. (1995). A historical perspective on setting standards. Proceedings of Joint Conference on Standard Setting for Large-​Scale Assessments of the National Assessment Governing Body (NAGB) and the National Center for Education Statistics (NCES), (pp. 30–​67). Washington, DC. Retrieved from http://​files.eric.ed.gov/​fullt​ext/​ED403​326.pdf Zieky, M. J. (2012). So much has changed: An historical overview of setting cut scores. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 33–​46). New York: Routledge. Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards on performance on educational and occupational tests. Educational Testing Service.

Author index Abdi, H.,  104 Alderson, J. C.,  32, 88 American Council on the Teaching of Foreign Languages (ACTFL),  42 American Educational Research Association (AERA),  33 American Psychological Association (APA),  33 Andrich, D.,  114 Angoff, W. H.,  9, 34, 35, 138 Association for Language Testers in Europe (ALTE),  42 Bachman, L. F.,  101 Banerjee, J.,  81 Bärenfänger, O.,  81 Bay, L.,  39 Becker, K. A., 26,  50–​52 Bejar, I. I.,  34 Belyokova, S.,  102 Benjamini, Y., 103–​105,  134 Bond, T. G.,  101, 119 Boone, W. J.,  123 Brandon, P. R.,  34, 126 Brennan, R. L.,  129 Brunfaut, T., 45,  48–​49 Bryman, A.,  81 Buckendahl, C. W.,  46 Bunch, M. B.,  25, 34–​36, 129 Casey, M. A.,  80–​81, 241 Centre for Canadian Language Benchmarks,  42 Cetin, S.,  49 Chang, L.,  126 Chen, J.,  35, 94

Cizek, G. J.,  25, 27, 33–​36, 39, 40, 49, 50, 77–​79, 83, 86, 100, 129, 189, 212, 214 Clapham, C.,  32 Cohen, A. S.,  34, 114, 124 Copella, J. M.,  33, 125 Corbin, J.,  193 Council of Europe,  27, 34, 39, 41, 42, 44, 71, 102 Creswell, J. W.,  64–​65, 73–​74, 81 Crooks, T. J.,  34, 124 Crowley, K.,  81 Davis-​Becker, S. L.,  57, 130, 212 Dennis, A. R.,  78 Dickinson, W. B.,  80, 105 Dobria, L.,  102 Downey, N.,  83 Dunlea, J.,  25, 46, 49, 50 Dunn, O. J.,  104 Earnest, D. S.,  27, 33, 40, 50, 100, 129 Eckes, T.,  9, 28, 102, 114, 119, 124 Egan, K. L.,  71 Eignor, D. R.,  25, 32 European Association for Language Testing (EALTA),  42 Faez, F.,  81 Ferdous, A. A.,  49–​50 Feree, W. D.,  217 Ferrara, S.,  71 Feskens, R.,  48 Figueras, N.,  44 Finch, H.,  80 Fisher, W. R.,  114


Flaitz, J.,  81 Fleiss, J. L.,  100 Fox, C. M.,  101–​102, 119 Gelbal, S.,  49 Gelin, M.,  50 Geranpayeh, A.,  46 Gerrow, J.,  49–​50 Gilman, D. A.,  32 Glaser, B. G.,  105, 193 Glaser, R.,  31 Gray, C. D.,  105 Green, D. R.,  36 Gronlund, N. E.,  31 Guetterman, T. C.,  64 Hambleton, R. K.,  25, 31–​34, 41, 49, 100, 123–​125, 128, 130, 212 Harding, L., 9, 45, 48–​49, 80–​81 Harsch, C.,  9, 49, 72 Hartig, J.,  49, 72 Harvey, A. L.,  26, 50–​52, 59 Haynes, W.,  104 Hellenic American University,  70 Hochberg, Y.,  103–​104 Hoyt, W. T.,  128 Hsieh, M.,  28, 46, 102 Huberman, A. M.,  106 Husek, T. R.,  32 Hutchinson, C.,  119 Impara, J. C., 34–​35, 83 Interagency Language Roundtable (ILR),  42 Iramaneerat, C.,  105 Irwin, P.,  34 Ivankova, N. V.,  64–​65 Jaeger, R. M.,  34, 124 Jang, E. E.,  64 Johnson, R. B.,  64

Kaftandjieva, F.,  25, 27, 32–​34, 99–​ 100, 124, 126, 128–​129, 246 Kahl, S. R.,  39 Kaliski, P. K.,  28 Kane, M. T., 27–​28, 32–​34, 40, 61, 83, 124, 224 Kannan, P.,  26, 49–​52, 54–55, 57, 60, 215, 217, 229 Kantarcioglu, E.,  42, 102 Karantonis, A.,  36 Karau, S. J.,  217 Katz, I. R., 26, 49–​52, 54–​55, 57, 60, 215, 217, 229 Kecker, G.,  102 Kenyon, D. M.,  42 Keuning, J.,  48 Khalifa, H.,  46 Kim, J.,  66 Kingston, N. M.,  39 Kinnera, P. R.,  105 Kleiber, P. B.,  80 Knoch, U., 101–​102, 119 Kock, N., 28, 58–​59, 69, 214–​ 216, 219 Kollias, C.,  83, 134, 138, 223, 237 Koons, H.,  25 Krieger, A. M., 104–​105, 134 Krueger, R. A., 80–​81, 241 LaMarca, P. M.,  42 Leech, N. L.,  80, 105 Lewis, C.,  100, 129, 133 Lewis, D. M.,  36, 37 Lewis, J.,  80 Lim, G. S.,  46 Linacre, J. M., 9, 36, 70, 71, 101–​ 102, 104, 111, 114, 118–​119, 124, 133, 140, 143 Linn, R. L.,  31 Livingston, S. A., 27, 36–​37, 39, 83, 100, 129, 133, 214 Lodenback, T.,  32


Loomis, S. C.,  91, 220 Lorie, W.,  50 Lu, Z.,  103 Lunz, M. E.,  102 Mackay, A.,  215 Mager, R. F.,  31 Majhanovich, S.,  81 Major, C. H.,  81 Marais, I.,  114 Martin, E.,  32 Martyniuk, W.,  44 McClarty, K. L.,  50 McNamara, T., 101–​102,  119 Michaels, H.,  50 Miles, M. B.,  106 Mitzel, H. C.,  36–​37 Moeller, A. J.,  64 Myford, C. M.,  114, 119 National Council on Measurement in Education (NCME),  33 Noijons, J.,  44 Novick, M. R.,  130 Onwuegbuzie, A. J.,  64, 80, 105 Ornstein, A.,  32 O’Sullivan, B., 43–​44, 49, 102 Pant, H. A.,  49–​50 Papageorgiou, S., 9, 42, 44–​45, 49, 72, 102 Park, G.,  64 Patz, R. J.,  37 Pearson,  46 Perie, M., 27, 36–​37, 39, 83, 214 Piccardo, E.,  81 Piezon, S. L.,  217 Pitoniak, M. J., 27, 33–​34, 39–​41, 49, 83, 100, 123–​125, 128, 212

Plake, B. S.,  34–​35, 83, 86, 212 Pollitt, A.,  119 Popham, W. J.,  32 Randall, J., 33, 41–​42 Rasch, G.,  101 Raymond, M. R.,  74 Reckase, M. D.,  35, 94 Reid, J. B.,  74 Reshetar, R. A.,  28 Rheingold, H.,  75 Römhild, A.,  42 Ross, G. D.,  37 Rouam, S.,  104 Salamoura, A.,  64 Saldana, J.,  106 Savin-​Baden, M.,  81 Schneider, M. C.,  71 Schnipke, D. L.,  26, 50–​52 Sgammato, R.,  49–​50 Sheeham, R.,  101 Shrout, P. E.,  100 Sireci, C. G., 32–​33, 36, 41–​42 Skaggs, G.,  36 Skorupski, W. P.,  49, 214 Smith, E. V.,  105 Smith, M.,  81 Smith, R. M., 100,  129, 132–​133 Snyder, S.,  101 Stahl, J.,  102 Stearns, M., 100, 129, 132–​133 Stemler, S. E.,  128 Stone, G. E., 34, 37–​38, 67, 102 Stranger, M.,  32 Strauss, A.,  193 Subkoviak, M. J., 129–​130 Sweeney, K. P.,  39 Takala, S.,  9, 129 Tan, S.,  102


Tannenbaum, R. J., 9, 25–26, 43–44, 46, 49–52, 54–55, 57, 60, 215, 217, 229 Taylor, S.,  81 Tessema, A.,  37 Tiffin-Richards, S. P.,  49–50 Tsai, J.,  128 Tschirner, E.,  81 Turner, C. E.,  64

Van Til, A.,  48 Verheyen, R.,  48

Wagner, M.,  64 Wall, D.,  32 Wan, L.,  129 Wang, N.,  33 Wanner, I.,  81 Way, W. D., 26, 50–52, 59 Weigle, S. C.,  102, 119 Welch, B. L.,  103 Williams, K. D.,  217 Wilson, J.,  32 Wind, S. A.,  28 Wolfe, E. W.,  102, 114, 119 Wong, S. P.,  100 Wu, S. M.,  102 Wylie, E. C.,  43

Yekutiele, D., 104–105, 134 Ying, L.,  66 Yuan, K.,  103

Zenisky, A., 32–33, 41–42 Zhu, W.,  81 Zieky, M. J., 27, 32, 36–37, 39, 50, 57, 83, 214 Zoran, A. G.,  80, 105

Subject index A Adjusted p-​values, 104, 134–​135, 141, 150 Adobe® ConnectTM platform, 67–​ 69,  83–​84, 86, 193 Angoff method, 11, 34–​38, 47–​48 Angoff method, Yes/​No Angoff method, 12, 34–​35, 47, 83, 86 B Bandwidth,  27, 56, 153, 204, 226, 228 BB-​CLASS (software),  129 Behavioural scales,  42 Bias.  See also differential analysis, 111, 140–​144, 209, 212 Body of Work (BoW) method,  11, 23, 39 Bookmark method,  11, 36, 38, 46–​48 marker,  36 Borderline Group (BG) method,  11, 23, 39 Breakout rooms,  217 C Chi-​square statistic,  103, 114 Classical test theory (CTT),  7, 13, 23, 29, 99, 223, 230 perspective, 122–​123, 133, 145 Cognitive 15,  34–​37, 44, 48–​49, 53, 56, 58–​60, 83, 195, 197, 199, 213–​ 215, 219, 224, 230 burden, 44, 48–​49, 58–​59, 199, 214, 219[ demands,  36–​37, 49, 53, 60, 83 overload,  219

strain,  15, 195, 197, 199, 215, 224, 233 Common European Framework of Reference (CEFR), 23, 38, 42–​ 43, 81 alignment process,  42–​45 alignment studies,  11, 42–​46 companion volume,  42 descriptors,  44–​47, 49, 71, 77, 99, 157, 208, 214, 246 global descriptors, 88–​89, 233 reading descriptors,  73, 75, 78, 88, 236 familiarisation verification activities, 12–​13, 17, 19, 71–​73, 89–​91, 246 handbook,  42–​43 levels,  43–​45, 47, 49, 71–​72, 74, 77, 88–​89 manual,  42, 44, 71 preliminary pilot manual,  44 Common items,  47, 71 Concurrent equating, 70–​71, 111–​112 Concurrent verbal reports,  230 Consistency. See Internal validity. 13, 17, 21, 40–​41, 53, 99–​100, 102, 124–​126, 128–​133, 149, 190, 209, 246, 252 Constant comparison method (CCM),  23, 105, 107 coding scheme,  193–​195 Contrasting Group (CG) method, 11, 23, 39–​40 Counterbalance workshop design,  12, 19, 66 Criterion-​referenced (CR),  23, 28, 31


standard setting,  32–​33 testing (CRT),  31 Cut scores 7, 11, 13–​1 4, 21, 25–​2 9, 32–​3 3, 35–​4 9, 51, 53–​5 6, 60–​6 1, 63–​6 5, 70–​7 2, 99–​1 00, 103, 111, 114, 122–123, 126–​1 27, 129–​1 40, 145, 181, 190–​1 91, 193, 196, 209, 211–​ 214, 219–​2 20, 223–​2 24, 230–​ 231, 246 definition of,  32 evaluating,  27–​28, 41, 60 D Data-​Driven Direct Consensus (3DC) method,  23, 48 Decision consistency and accuracy,  13, 41, 100, 129 Livingston and Lewis method, 100, 129–​130, 133 Standard error method, 100, 129, 132–​133   See also Internal validity. Decision-​making process, 25, 27–​ 28, 32, 63, 81, 193–​195, 208–​209, 212–​214, 230 in virtual environments, 26–​28, 51, 63, 195, 208–209 DIALANG scales,  72–​73, 88–​89, 234–​235 Differential analysis.  See also DGF, DJF, DMF, 104, 140, 142 Differential group functioning (DGF),  14, 23, 142, 143 Differential judge functioning (DJF),  14, 23, 140 Differential medium functioning (DMF),  14, 23, 141 Digital Rights Management (DRM),  23, 229 Direct Consensus method,  48

E E-​communication medium, 7,  27, 51, 59–​61, 67, 81, 104, 111, 118, 140–​141, 143–144, 147, 149–150, 153–154, 180, 190–​191, 193, 209, 211–​215, 219, 221, 223–​224, 226, 230–​231 Electronic consent form, 73–​74, 237 E-​platform, 69, 78, 84–​85, 187, 189 audio medium session,  85 equipment check session,  83–​84 function buttons,  84 video medium session,  85   See also Adobe® ConnectTM platform. E-​polls,  69, 216, 229 Equating, 28,  70–​71, 111–​112, 230 displacement,  71 Etiquette.  See also netiquette, 84, 169, 171, 231 Examinee-​centred methods,  34, 39–40 External validity, evidence, 43–​44, 46–​47, 49 F Facet, definition of,  102 Face-​to-​face (F2F),  23, 59, 134 experience,  152, 160, 216 interaction,  67, 152, 154, 201, 203 FACETS (computer programme),  104, 111, 118, 140, 143 edited specification file,  122, 248 False discovery rate (FDR),  103 Family-​wise error rate (FWER),  103 Feedback, 15, 19, 35, 37, 40–​41, 79, 87, 90–​91, 94, 96–​98, 195, 197 consequences, 19, 35, 37, 87, 96–​ 98, 169, 197, 209, 214, 244–​245 normative, 35, 87, 94, 96–​97 visual,  90


Fisher’s Z transformations,  119 Fit statistics  118–​119  infit,  118–​119 misfitting,  119 outfit,  118–​119 Fixed effect hypothesis,  143, 144 Focus groups,  80–​82, 99, 107, 194–​ 196, 209 protocol,  82, 241 sessions, 19, 82, 98, 193, 203–​204 G General Data Protection Regulation (GDPR),  23, 226 GRAPHPAD PRISM (Computer programme),  104, 134, 140 I Internal check,  100, 124, 125 Internal validity, 13, 19, 28, 41, 99–​ 100, 102–​103, 123, 133 consistency within the method,  13, 41, 53, 99–100, 124 decision consistency, 13, 40–41, 53, 100, 129–​130, 132 group level indices,  113 interparticipant consistency, 40–​ 41, 53, 100, 126, 128–129 intraparticipant consistency, 40–​ 41, 53, 100, 125, 126, 209, 252 judge level indices,  118 reasonableness of cut scores,  40–​41 Intraclass correlation coefficient (ICC),  23, 100 Item response theory (IRT),  23, 36, 48 Iterative rounds,  35, 56, 86 K Kappa coefficient,  130–​131 Knowledge, skills, and abilities (KSAs), 23, 31, 33, 36–​37, 39, 42, 45, 47, 56–​57, 71, 83, 213, 220


L Language testing and assessment (LTA), 11, 23, 29, 31, 42–​44, 47, 49, 64, 81, 102 reseachers,  102 Logits,  38, 94–95, 103 scale,  101–​102 score table,  139, 256 M Many-​facet Rasch measurement (MFRM) model,  7, 13, 23, 102 Mastery, 32, 37–​38 level,  37–​38 Media naturalness theory (MNT),  7, 12, 23, 28, 58 naturalness scale,  19, 59, 69, 219 e-​platform, 69, 84–​85, 185, 187 super-​rich medium,  69, 219, 226 Minimum competence,  33–​34 Misplacement index (MPI),  23, 99, 125, 246 Mixed methods research (MMR),  64–​66 embedded,  65–​66 Modified Angoff method.  See also Angoff method,  34, 51 Multiple comparison adjustment,  103 Benjamini, Krieger, & Yekutiele adaptive method, 104–​105, 134 Dunn-​Bonferroni adjustment method,  104 N Netiquette.  See also etiquette, 84, 160–​161, 169–​171, 205, 225, 231 Non-​disclosure agreement (NDA),  24, 57 Norm-​referenced (NR),  24, 31 test scores,  32 testing (NRT),  24, 31


O Objective standard setting (OSS) method, 24,  37–​38, 67 Ordered item booklet (OIB),  24, 36, 47 P Performance levels, 33, 35, 39–​42, 71, 213 performance level descriptors (PLDs),  24, 49, 71 Performance standards,  40, 128, 211, 214, 224 Platforms 21, 56–​57, 59, 67, 86, 217, 229 access,  52, 56, 58, 79, 92, 219, 225, 228 appropriacy,  225, 231 MNT principles,  67 netiquette, 16,  84, 160–​161, 169–​ 171, 205, 225, 231 security,  49, 55–​57, 229 password protected,  229 training, 12, 16, 52, 68–​69, 83–​84, 205, 218–​219, 228 Procedural validity evidence,  41 survey instruments, 147–​148, 189, 190 P-​values.  See also adjusted p-​values, 134, 141, 150 Q Q-​values.  See also adjusted p-​values, 104, 134, 140, 150 R Rasch measurement theory (RMT),  13, 24, 100 basic model,  101, 102 model,  101, 114

Rasch person,  123 reliability,  123 separation index,  113, 123   See also internal validity, judge level indices. Rasch-​Kappa index,  114, 119, 122 Response probability  24 RP50,  24, 36 RP67,  24, 36 Root mean square standard error (RMSE),  24, 124 S Sign test, 22, 105, 149–​150 SNAGGIT (recording software),  69 Social loafing,  217–​218 Standard error of measurement (SEM),  24, 37, 100, 124 Standard error of the cut score (SEc),  24, 100, 124 Standard setting 11–​12, 21, 25–​27, 31–​34, 40–​43, 49, 83–​98 challenges,  49–​50 definition of,  32, 243 logistics,  49–50, 60 Subject matter experts (SMEs), 32–​33, 40 T TEAMVIEWER (computer access software),  69, 84 Test-​centred methods, 11, 34–​38 V Virtual community,  75 Virtual standard setting 51–​59  challenges,  55–​58 equipment, 19, 27, 57, 74, 77, 83–​84, 153, 170, 203, 219, 226, 228 sessions, 27, 52–​55, 60, 66–​67 studies, 26, 51–​55


virtual cut scores,  63, 103, 111, 133, 138 W Web-​conferencing platform.  See also E-​platforms, 27, 52-​54, 67–​ 69, 79 Welch t-​test,  103, 134 Wilcoxon signed-​rank test,  105, 149 symmetry assumption,  105, 149 WINSTEPS (computer programme),  71, 133 WONDER (virtual workspace),  217


Workshop stages, 12,  19, 86–​87 introduction stage,  87–​88 judgement stage, 13, 93–​98 method training stage, 13, 92–​93 orientation stage, 13, 88–​92 Y Yes/​No Angoff method.  See also Angoff method,  34–​35, 46–​47, 83, 86 Z Z-​score, 119, 132–​133 ZStd,  103, 119

Language Testing and Evaluation Series editors: Claudia Harsch and Günther Sigott

Vol. 1 Günther Sigott: Towards Identifying the C-Test Construct. 2004.
Vol. 2 Carsten Röver: Testing ESL Pragmatics. Development and Validation of a Web-Based Assessment Battery. 2005.
Vol. 3 Tom Lumley: Assessing Second Language Writing. The Rater’s Perspective. 2005.
Vol. 4 Annie Brown: Interviewer Variability in Oral Proficiency Interviews. 2005.
Vol. 5 Jianda Liu: Measuring Interlanguage Pragmatic Knowledge of EFL Learners. 2006.
Vol. 6 Rüdiger Grotjahn (Hrsg. / ed.): Der C-Test: Theorie, Empirie, Anwendungen/The C-Test: Theory, Empirical Research, Applications. 2006.
Vol. 7 Vivien Berry: Personality Differences and Oral Test Performance. 2007.
Vol. 8 John O’Dwyer: Formative Evaluation for Organisational Learning. A Case Study of the Management of a Process of Curriculum Development. 2008.
Vol. 9 Aek Phakiti: Strategic Competence and EFL Reading Test Performance. A Structural Equation Modeling Approach. 2007.
Vol. 10 Gábor Szabó: Applying Item Response Theory in Language Test Item Bank Building. 2008.
Vol. 11 John M. Norris: Validity Evaluation in Language Assessment. 2008.
Vol. 12 Barry O’Sullivan: Modelling Performance in Tests of Spoken Language. 2008.
Vol. 13 Annie Brown / Kathryn Hill (eds.): Tasks and Criteria in Performance Assessment. Proceedings of the 28th Language Testing Research Colloquium. 2009.
Vol. 14 Ildikó Csépes: Measuring Oral Proficiency through Paired-Task Performance. 2009.
Vol. 15 Dina Tsagari: The Complexity of Test Washback. An Empirical Study. 2009.
Vol. 16 Spiros Papageorgiou: Setting Performance Standards in Europe. The Judges’ Contribution to Relating Language Examinations to the Common European Framework of Reference. 2009.
Vol. 17 Ute Knoch: Diagnostic Writing Assessment. The Development and Validation of a Rating Scale. 2009.
Vol. 18 Rüdiger Grotjahn (Hrsg. / ed.): Der C-Test: Beiträge aus der aktuellen Forschung/The C-Test: Contributions from Current Research. 2010.
Vol. 19 Fred Dervin / Eija Suomela-Salmi (eds. / éds): New Approaches to Assessing Language and (Inter-)Cultural Competences in Higher Education / Nouvelles approches de l'évaluation des compétences langagières et (inter-)culturelles dans l'enseignement supérieur. 2010.
Vol. 20 Ana Maria Ducasse: Interaction in Paired Oral Proficiency Assessment in Spanish. Rater and Candidate Input into Evidence Based Scale Development and Construct Definition. 2010.
Vol. 21 Luke Harding: Accent and Listening Assessment. A Validation Study of the Use of Speakers with L2 Accents on an Academic English Listening Test. 2011.
Vol. 22 Thomas Eckes: Introduction to Many-Facet Rasch Measurement. Analyzing and Evaluating Rater-Mediated Assessments. 2011. 2nd Revised and Updated Edition. 2015.
Vol. 23 Gabriele Kecker: Validierung von Sprachprüfungen. Die Zuordnung des TestDaF zum Gemeinsamen europäischen Referenzrahmen für Sprachen. 2011.
Vol. 24 Lyn May: Interaction in a Paired Speaking Test. The Rater’s Perspective. 2011.
Vol. 25 Dina Tsagari / Ildikó Csépes (eds.): Classroom-Based Language Assessment. 2011.
Vol. 26 Dina Tsagari / Ildikó Csépes (eds.): Collaboration in Language Testing and Assessment. 2012.
Vol. 27 Kathryn Hill: Classroom-Based Assessment in the School Foreign Language Classroom. 2012.
Vol. 28 Dina Tsagari / Salomi Papadima-Sophocleous / Sophie Ioannou-Georgiou (eds.): International Experiences in Language Testing and Assessment. Selected Papers in Memory of Pavlos Pavlou. 2013.
Vol. 29 Dina Tsagari / Roelof van Deemter (eds.): Assessment Issues in Language Translation and Interpreting. 2013.
Vol. 30 Fumiyo Nakatsuhara: The Co-construction of Conversation in Group Oral Tests. 2013.
Vol. 31 Veronika Timpe: Assessing Intercultural Language Learning. The Dependence of Receptive Sociopragmatic Competence and Discourse Competence on Learning Opportunities and Input. 2013.
Vol. 32 Florian Kağan Meyer: Language Proficiency Testing for Chinese as a Foreign Language. An Argument-Based Approach for Validating the Hanyu Shuiping Kaoshi (HSK). 2014.
Vol. 33 Katrin Wisniewski: Die Validität der Skalen des Gemeinsamen europäischen Referenzrahmens für Sprachen. Eine empirische Untersuchung der Flüssigkeits- und Wortschatzskalen des GeRS am Beispiel des Italienischen und des Deutschen. 2014.
Vol. 34 Rüdiger Grotjahn (Hrsg./ed.): Der C-Test: Aktuelle Tendenzen/The C-Test: Current Trends. 2014.
Vol. 35 Carsten Roever / Catriona Fraser / Catherine Elder: Testing ESL Sociopragmatics. Development and Validation of a Web-based Test Battery. 2014.
Vol. 36 Trisevgeni Liontou: Computational Text Analysis and Reading Comprehension Exam Complexity. Towards Automatic Text Classification. 2015.
Vol. 37 Armin Berger: Validating Analytic Rating Scales. A Multi-Method Approach to Scaling Descriptors for Assessing Academic Speaking. 2015.
Vol. 38 Anastasia Drackert: Validating Language Proficiency Assessments in Second Language Acquisition Research. Applying an Argument-Based Approach. 2015.
Vol. 39 John Norris: Developing C-tests for estimating proficiency in foreign language research. 2018.
Vol. 40 Günther Sigott (Ed./Hrsg.): Language Testing in Austria: Taking Stock/Sprachtesten in Österreich: Eine Bestandsaufnahme. 2018.
Vol. 41 Carsten Roever / Gillian Wigglesworth (Eds.): Social perspectives on language testing. Papers in honour of Tim McNamara. 2019.
Vol. 42 Klaus Siller: Predicting Item Difficulty in a Reading Test. A Construct Identification Study of the Austrian 2009 Baseline English Reading Test. 2020.
Vol. 43 Anastasia Drackert / Mirka Mainzer-Murrenhoff / Anna Soltyska / Anna Timukova (Hrsg.): Testen bildungssprachlicher Kompetenzen und akademischer Sprachkompetenzen. Zugänge für Schule und Hochschule. 2020.
Vol. 44 Khaled Barkaoui: Evaluating Tests of Second Language Development. A Framework and an Empirical Study. 2021.
Vol. 45 Theresa Weiler: Testing Lexicogrammar. An Investigation into the Construct Tested in the Language in Use Section of the Austrian Matura in English. 2022.
Vol. 46 Charalambos Kollias: Virtual Standard Setting: Setting Cut Scores. 2023.

www.peterlang.com