A Teacher’s Guide to Educational Assessment
CONTENTS
PREFACE TO THE SECOND EDITION
CHAPTER 1
INTRODUCTION TO ASSESSMENT
Background to educational assessment
History of educational assessment
Purposes of assessment
Ethical issues associated with assessment
Summary
Review Questions
Exercises
CHAPTER 2
THE VARYING ROLE AND NATURE OF ASSESSMENT
Valid and reliable assessments are difficult to develop
The diverse nature of assessment
A final word of caution: The ‘assessment for learning’ versus ‘assessment of learning’ debate
Summary
Review Questions
Exercises
CHAPTER 3
FUNDAMENTAL CONCEPTS OF MEASUREMENT
Score distributions
Agreement index: the correlation
Summary
Review Questions
Exercises
CHAPTER 4
VALIDITY AND VALIDATION OF ASSESSMENTS
Validity issues
Factors which reduce validity
Invalid uses of assessment results
Summary
Review Questions
Exercises
CHAPTER 5
THE RELIABILITY OF ASSESSMENT RESULTS
Reliability
Internal consistency methods
Reliability of criterion-referenced assessments
How high should reliability be?
Standard error of measurement
How can I increase reliability?
Effect of practice and coaching on reliability
How to purchase valid and reliable commercial assessments
Summary
Review Questions
Exercises
CHAPTER 6
ANALYSING TASKS AND QUESTIONS
Item difficulty
Further analysis of criterion-referenced results
Norm-referenced item analysis
Differential Item Functioning
Summary
Review Questions
Exercises
CHAPTER 7
OBJECTIVE MEASUREMENT USING THE RASCH MODEL (FOR NON-MATHEMATICIANS)
Analysis of test results put into a context
Introduction to the Rasch model
Analysis of test results using the Rasch model
Have you kept your model fit?
The identification of mismeasured individuals
How do we treat the omitted responses?
Test development using the Rasch model
The assumptions of the Rasch model
Can the test be split into two educationally meaningful sub-scales?
Review Questions
Exercises
CHAPTER 8
THE PARTIAL CREDIT RASCH MODEL
Analysis of test results using the Partial Credit model
How to build a test from scratch using the Rasch model
What to bear in mind before analysing a dataset with the Rasch model
Further Reading
Review Questions
Exercises
CHAPTER 9
FURTHER APPLICATIONS OF THE RASCH MODEL
The Rating Scale Rasch model
Analysis using the Rating Scale model
The multi-dimensional model
Computerised Adaptive Testing
Summary
Review Questions
CHAPTER 10
PLANNING, PREPARATION AND ADMINISTRATION OF ASSESSMENTS
Developing assessment specifications
The preparation and administration of an assessment
Cheating and assessment
Special arrangements
Summary
Review Questions
Exercises
CHAPTER 11
ASSESSMENT OF KNOWLEDGE: CONSTRUCTED RESPONSE QUESTIONS
Types of test questions
How to assess knowledge using essays
How to assess knowledge using short answer questions
Summary
Review Questions
Exercises
CHAPTER 12
ASSESSMENT OF KNOWLEDGE: SELECTED RESPONSE QUESTIONS
True-false, alternate choice and matching questions
Matching Questions
Alternate Choice
Corrections for guessing
Multiple-choice questions
Summary
Review Questions
Exercises
CHAPTER 13
ASSESSMENT OF PERFORMANCE AND PRACTICAL SKILLS
Phases in the acquisition of a psychomotor skill
Stages in the development of expertise
Forms of assessment
Direct observation
Skills tests
Simulation techniques
Questioning techniques
Evidence of prior learning
Assessment types in performance-based assessments
Features of performance-based assessments: process and product
Judgment in practical tests
Setting standards for holistic assessment tasks
Using checklists in performance-based assessments
Using rating scales in performance-based assessments
Summary
Review Questions
Exercises
CHAPTER 14
ASSESSMENT OF ATTITUDE AND BEHAVIOUR
Formative or summative assessment of attitudes
Using questionnaires to survey students
Using questionnaires to evaluate courses, teachers and instructors
Attitude scales
Observational forms of assessing attitudes
Information from supervisors
Assessing interests in formative assessments
Summary
Review Questions
Exercises
CHAPTER 15
GRADING PERFORMANCE AND RESULTS
The role of grading
Establishing cut-off points using the Angoff method
Mark conversion
Types of grading systems
Grading on the normal curve
Some practical guidelines for grading students
The analysis of evidence
Summary
Review Questions
Exercises
CHAPTER 16
TEST EQUATING
Data collection designs for test equating
The technical details of the Anchor-Test-Nonequivalent-Groups equating
How to use equipercentile equating in your school
Summary
Concluding remarks
Review Questions
Exercises
APPENDIX A: CODE OF FAIR TESTING PRACTICES IN EDUCATION
APPENDIX B: ASSESSMENT TOPICS AND RESOURCES
APPENDIX C: PERCENTILE RANKS AND STANDARD SCORES
Percentile ranks
Standard scores
APPENDIX D: AN INFORMAL DERIVATION OF THE RASCH MODEL
The Partial Credit Rasch model
APPENDIX E: ARITHMETIC TEST
APPENDIX F: STUDENT FEEDBACK ON TEACHING AND SUBJECTS
APPENDIX G: HOW TO USE MS EXCEL© TO ANALYSE RESULTS
APPENDIX H: ANSWERS TO REVIEW QUESTIONS
INDEX
GLOSSARY OF TERMS USED
ABOUT THE AUTHORS
REFERENCES AND NOTES

A Teacher’s Guide to Educational Assessment

Iasonas Lamprianou
European University Cyprus
University of Manchester, UK

James A. Athanasou
University of Technology, Sydney, Australia

SENSE PUBLISHERS ROTTERDAM/BOSTON/TAIPEI

A C.I.P. record for this book is available from the Library of Congress.

ISBN 978-90-8790-912-3 (paperback) ISBN 978-90-8790-913-0 (hardback) ISBN 978-90-8790-914-7 (e-book)

Published by: Sense Publishers, P.O. Box 21858, 3001 AW Rotterdam, The Netherlands http://www.sensepublishers.com

Printed on acid-free paper

All Rights Reserved © 2009 Sense Publishers No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

To international friendship and collaboration


PREFACE TO THE SECOND EDITION

This book is a natural step beyond our earlier text A Teacher’s Guide to Assessment, which was published almost six years ago. The purpose of this book is to offer a straightforward guide to educational assessment for teachers at all levels of education, including trainers and instructors. The scope of this book is wider, however, and the targeted audience is broader than the first edition. It is designed to address the needs not only of those taking a first course in educational assessment and measurement, but it can also usefully serve students at the post-graduate level, as well as experienced teachers, trainers and instructors who would like to update their knowledge and acquire practical skills using relevant quantitative methods. The book is appropriate for an international audience since it includes material and examples from Australia, the United States and Europe.

In this revised edition we have added new and important material which covers the assessment arrangements necessary for people with special needs and the use of technology for assessment purposes (such as e-assessment and computerised assessment systems); we have elaborated on the dangers of differential item functioning; we have extended the Rasch measurement material; and we have enriched the book with practical examples using Microsoft Excel.

In some cases, we wanted to offer more examples, more information or additional material on issues that could be of interest to some readers. In order to offer those who are interested access to this material without increasing the length of the book too much, we created a dedicated web companion to this book, which we called ‘WebResources’. Whenever additional material is available on the WebResources companion, you will see a sign with some descriptive text (for example, ‘Visit WebResources for additional material’). You may access this material at http://www.relabs.org/assessbook.htm.

The main message of the book is that assessment is not based on common sense but on a huge body of international research and application over many years. Testing is a powerful, vital and large part of a teacher’s assessment arsenal because it can be practical, structured and very informative. The correct use of testing, either in its traditional paper-and-pencil form or in its modern technology-based style, can be a formidable ally for every teacher who aspires to practise evidence-based teaching and learning.

I am really grateful that my colleague James Athanasou from the University of Technology, Sydney, offered me the opportunity to take the lead on this new endeavour. I was responsible for re-writing much of the existing material, adding new chapters and removing some of the material that was not deemed necessary today (most of it was moved to the WebResources). James kindly guided and advised me throughout the book, proof-read the manuscript and made sure that we produced a readable and scientifically acceptable book.


Wherever possible we have tried to provide references, and every effort has been made to acknowledge or trace sources, but if this has not always been possible we apologise for any error or omission. All acknowledgments have been placed directly in the text or in the endnotes. We trust that this book is useful to you in your study of educational assessment and measurement. It is a fascinating field with rapid innovations and intense research output.

Iasonas Lamprianou
European University Cyprus
University of Manchester, UK
September 2009

If I may add a few words to those of my colleague and associate Dr Iasonas Lamprianou. Readers may be surprised to learn that we have only ever met on one occasion, in faraway Penang in Malaysia, at a conference on educational measurement. For more than 8 years we have corresponded and collaborated without ever meeting face-to-face. And so this text is a tribute to international collaboration and friendship.

I am very grateful that Sense Publishers agreed to publish the second edition of A Teacher’s Guide to Assessment and even more grateful that Iasonas agreed to become the first author. This handing over of the baton ensures that the tradition of this text, which started in 1997 with Introduction to Educational Testing (Social Science Press), will continue. Equally, we have decided to continue the tradition of donating the royalties to a charity for children with disabilities – the Estia Foundation in Australia or Radio Marathon in Cyprus.

The changes made to this edition are especially pleasing to me. The emphasis on criterion-referenced assessment and Rasch measurement has been strengthened. This perspective is – if I may say – unique amongst introductory texts on educational assessment. Rasch measurement is a point of view that we both passionately share.

Instructors can now also contact the senior author for a free copy of an instructor workbook to accompany the subject. This contains 12 lectures for a basic course in educational assessment together with student exercises and review questions. It covers the essence of the text for an introductory class on assessment. There is also a free set of PowerPoint slides to accompany the lectures, available on request.

In editing the various chapters I had the opportunity in my retirement to reflect on the content. With each page I concluded that this text really expresses what I know and believe to be true about testing. I would say it also summarises most of what I tried to teach generations of students about this wonderful field. Very few realise that assessment is the cornerstone of educational evaluation and research. It deals with fundamental concepts such as validity. Accordingly it has been a privilege to have been involved in educational and vocational assessment over the last 30 years of my career. I have now retired from university teaching but I am very proud to be associated with Iasonas Lamprianou and this effort. I look forward to hearing of future editions. I hope that this will remain an applied handbook for teachers involved in assessment and a useful reference for those who are new to testing.

James A. Athanasou
Maroubra, Sydney, Australia
September 2009

CHAPTER 1

INTRODUCTION TO ASSESSMENT

Assessment has grown into a multi-billion economic sector which shapes the educational and vocational future of millions of people worldwide. In recent years and throughout the world, assessment has been a fundamental component of everyday life in primary, secondary and tertiary education as well as in industrial and commercial training. The frequent exposure of people to assessment has cultivated very different attitudes: some laypersons have unbounded faith in the outcomes of assessment while others are dismissive of their value.

Given the significant role that assessments play in modern education, as well as their impact on career prospects, we have always considered that there is a need for teachers to be skilled and informed in this area, but sadly it has not yet featured in all teacher-preparation or trainer-preparation courses. Many teachers and trainers are still left to acquire their assessment expertise through in-service courses, the assistance of colleagues or trial-and-error.

It is helpful for you to be knowledgeable about educational assessments because the results may have a significant impact on other people’s lives: they are widely used for selection, certification, diagnosis, special instruction or placement. Furthermore, tests, exams, quizzes, projects, assignments and portfolios are part and parcel of your own teaching, and it is valuable for you to have some knowledge of their development and use. The ability to develop worthwhile assessments does not come naturally; it is a skill that can be acquired, and it needs knowledge as well as experience.

Moreover, assessments demand your attention because they are a sizable chunk of your professional workload. It has been estimated that teachers spend as much as one-third of their time in assessment-related activities1 and there are indications that this workload is not decreasing. Indeed, some countries are scrapping external examinations so as to give more focus to teacher assessment results – the latest example is England, which removed the external examinations for 14-year-olds in October 2008 and relied upon teacher assessments. It is possibly a sign of the times in which we now live that society is prepared to put more faith in your professional opinion.

Consciously or otherwise, assessments play a role in your teaching. For instance, your assessments may occur informally during evaluation of your instruction and act as a catalyst for helping the learner; or, at the other extreme, a formal assessment at the end of a course may determine someone’s career prospects. This means that you need to be informed about the effects of different forms and methods of assessment. During the course of reading this book you may wish to ponder the extent to which you want assessments to become a seamless component of your teaching or the extent to which you might want them to become a discrete part of the process. Different people use assessment in different ways, depending on the situation; there are no fixed rules.

Both experienced and novice teachers acknowledge the importance of assessment in learning and instruction for various reasons. They usually focus on assessment as a vehicle for giving some feedback on teaching, or as a means of providing evidence of learning, or as meeting the requirements of the administration. At the same time many teachers do worry about the quality and the quantity of the testing that they are required to undertake. They worry about issues like the fairness of marking, the accuracy of the questions asked, the suitability of tests for students and the usefulness of the information eventually obtained. In part, this book seeks to address some of these concerns, and in the remaining sections of this chapter we shall provide some background on the nature and role of assessment and a brief outline of its development. The discussion concludes with a look at some ethical issues associated with assessment. Throughout this text, you should feel free to skip those sections that are not of direct interest to you.

BACKGROUND TO EDUCATIONAL ASSESSMENT

Assessments are dominant aspects of our culture and are encountered at many points in our life. If you are under 60 years old then you may have undertaken your first assessment within one minute of being born. This is the Apgar test (see Figure 1), an assessment (devised by Dr. Virginia Apgar in 1952) that is based on separate tests and so meets the formal definition of an assessment as we accept it in this book. It gives the general overall condition of a newborn within minutes of birth. The assessment is based on a score of 0-2 for five modalities: heart rate, respiratory effort, muscle tone, reflex irritability and colour.

Heart rate: absent (0); below 100 (1); 100 or more (2)
Respiratory effort: absent (0); slow, irregular (1); good (2)
Muscle tone: limp (0); some flexion (1); good motion (2)
Reflex irritability (nasal catheter): no response (0); grimace (1); cough or sneeze (2)
Colour: blue or pale (0); body pink, extremities blue (1); completely pink (2)

A score of 10 is perfect; 1-3 is severe depression; 4-7 moderate depression; and 8-10 no depression.

Figure 1. Apgar Test.
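To make the arithmetic of the scale concrete, the short sketch below (in Python, which is not a tool used in the book) totals five hypothetical 0-2 ratings and reports the interpretation band given in Figure 1. The function names and the example ratings are our own inventions, purely for illustration.

```python
def apgar_total(heart_rate, respiratory_effort, muscle_tone, reflex_irritability, colour):
    """Sum five ratings, each scored 0, 1 or 2, into an overall Apgar score."""
    ratings = [heart_rate, respiratory_effort, muscle_tone, reflex_irritability, colour]
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("each modality must be rated 0, 1 or 2")
    return sum(ratings)

def apgar_band(score):
    """Interpretation bands as listed in Figure 1 (8-10 no depression, 4-7 moderate, 1-3 severe)."""
    if score >= 8:
        return "no depression"
    if score >= 4:
        return "moderate depression"
    return "severe depression"

# Example: a newborn rated 2 on every modality except colour (rated 1).
score = apgar_total(2, 2, 2, 2, 1)      # -> 9
print(score, apgar_band(score))         # -> 9 no depression
```

The point of the example is simply that an assessment combines several separate observations, each recorded against a fixed scale, into one reported judgement.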

In education, the word ‘assessment’ is used in a special way that is derived from, but different to, its ordinary, everyday meaning. The word ‘assessment’ comes originally from the Latin assessare, meaning to impose a tax or to set a rate, and modern dictionary definitions refer to the valuation or financial meaning of assessment. This word has crept into the field of educational testing largely through psychology. For instance, the term assessment was used during World War II to describe a program of the Office of Strategic Services which involved the selection of personnel for secret service assignments. This program involved situational tests, and staff rated candidates on many traits. In these assessments, unlike the results of a single test, the assessor usually combined data from different sources.

In the last decade, the word ‘assessment’ has taken over from terms such as ‘testing’. Firstly, it was seen as a broader term than ‘test’. It now encompasses many different educational practices, such as portfolios, case studies, presentations, simulations or computer-based activities. Secondly, it also took into account divergent processes of assessment such as teacher assessment, self-assessment and peer-assessment. Thirdly, it gave some expression to more liberal views in education that were opposed to the oppressive, mechanical and unthinking use of tests. For many people the word test may conjure up images of three-hour examinations or endless multiple-choice questions that bring back recollections of exam anxiety. Using the word assessment avoided many of the negative connotations of the word ‘test’. Even children in kindergarten are now familiar with the term assessment, and one wonders whether in the future it will eventually share some of the same negative associations as its predecessor.

At the outset, we have tried to give you a broad working definition of assessment, one that typifies its use in modern educational circles:

Assessment is the process of collecting and organising information from purposeful activities (e.g., tests on performance or learning) with a view to drawing inferences about teaching and learning, as well as about persons, often making comparisons against established criteria.

Clearly, you are expected to collect some information from purposeful (not random) activities. This implies that assessment should be part of your planning and cannot be left to happen in a haphazard way. It also implies that this information has to be organised – just collecting and storing information is not enough. And all this must serve a specific purpose. This definition also means that you are free to include many different types of activities (assignments, exercises, projects, quizzes, simulations) under the umbrella of assessment. It also means that assessment is principally a professional process of collation, comparison and judgement, and that inferences are drawn as a result of the process.

Although this is one aspect that we would like to stress, it is recognised that ‘assessment’ is also being used daily by students and teachers to refer, in a shorthand way, to a particular task that may have been assigned. As an example, the Department of Education and Training in New South Wales (Australia) referred to the State-wide English Language and Literacy Assessment, which is administered to some 150,000 Year 7 and 8 students, as a test of reading, writing and language.2 Similarly the Victorian Curriculum and Assessment Authority defined an assessment as: ‘A task set by the teacher to assess students’ achievements of unit outcomes.’3


This everyday use of the word to refer to tasks or tests is not a problem for us as long as we recognise that underlying all of this is the systematic process of collecting information. A key part of this process is the various assessment tasks or events.

HISTORY OF EDUCATIONAL ASSESSMENT

Assessments have not sprung from some historical vacuum but have evolved from experience over many thousands of years. In fact, some of the earliest classrooms were those of the Sumerian civilisation, from which the clay tablets on which students practised their writing have been recovered. As you will see shortly, current testing practices can be traced back over some 4,000 years and provide a substantial knowledge base about assessment.

The triposes are the formal examinations at the University of Cambridge, in which undergraduates are required to obtain honours in order to qualify for the degree of Bachelor of Arts. The word tripos comes from the fact that the examiner, the ‘Ould Bachilour’ of the University, sat on a three-legged stool. The examination took the form of a debate or wrangle and concentrated on Grammar, Logic and Rhetoric.

The Mathematical Tripos is probably the most well-known of the triposes. At the time of Sir Isaac Newton’s discoveries, mathematics dominated other studies and ‘tripos’ came to mean the examination in mathematics. The tradition became that one had to pass the mathematical tripos before being able to specialise in classics or other studies. This tradition continued until around 1850. The mathematical tripos is now a written examination but some of the traditions surrounding the exam have been preserved. The final year results continue to be announced from the balcony of the Senate House and the top students are still called Wranglers.

Sources: Faculty of Mathematics, Faculty of Engineering, University of Cambridge

Figure 2. The tripos examination.

A long history of educational assessment can be dated from at least 2200 BC, when the Mandarins set up a civil-service testing program. For the most part oral examinations (viva voce) were used until the late 1800s to evaluate achievement. A famous example of a rigorous traditional oral examination was the tripos (see Figure 2). (Visit WebResources for a description of additional relevant examples.)


In practical fields, there were examples of competency assessments such as those in the Middle Ages, where an apprentice was required to complete a master piece before acceptance into a guild.4 The master piece is described in Figure 3.

In medieval times parents paid a fee to place their seven-year-old son as an apprentice with a master of the guild. Both parties signed a contract called an indenture: the boy would work for the master for seven years as an apprentice and the master promised to train him. He became a journeyman after the period of apprenticeship. The term comes from the French ‘journée’ (day), and meant that the journeyman was paid a daily rate for his work. After several years as a journeyman the craftsman would submit a piece of his best work to the guild for approval. The example that he created was known as his ‘master-piece’ and, if it was considered good enough, he was granted the title of ‘Master’.

Figure 3. The master piece.

Formal written testing, as you know it today, dates from around the 1860s. An early example of formal written testing survives from a 1904 college entrance examination (see Figure 4).5

Test-takers were assigned to write a two-page essay on one of the four statements listed below. The test did not identify the works of literature.
1. Describe the character of Mr. Burchell, and compare or contrast him with Dr. Primrose. How far does he influence the course of events in the story?
2. Locksley shoots for the prize.
3. The elements of greatness in Shylock’s character.
4. Describe, from the point at which the Albatross “begins to be avenged”, the events that precede the Mariner’s being left “alone, on the wide, wide sea”.
Source: Education Week, June 16, 1999

Figure 4. 1904 College Entrance Examination Board Test.

The advent of computers has given the opportunity to develop new tests with very desirable characteristics: they are cheaper to administer, they can be more realistic (they may use multimedia and simulations), they are often scored automatically, and it is much easier to report the results in many different and informative ways. The second generation of computer-based tests is adaptive, in the sense that the tests automatically adapt their difficulty or content to the needs and the characteristics of the person who takes them. Computer-based and computer-adaptive tests have achieved widespread use in education and are now a useful tool for both teachers and students. Computer-adaptive testing via the Internet has also been adopted by large testing organisations around the world, and today millions of students have the opportunity to take affordable, state-of-the-art tests for high-stakes assessment purposes.
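To make the idea of adaptivity concrete, here is a minimal sketch (in Python, which is not a tool used in the book) of one way an adaptive test can select questions: after each response the program re-estimates the test-taker's ability and then administers the unanswered item whose difficulty is closest to that estimate. The item bank, the simple ability update and the fixed test length are illustrative assumptions rather than the procedure of any particular testing program.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def adaptive_test(item_bank, answer_item, max_items=5):
    """Minimal illustrative adaptive loop.

    item_bank   : dict mapping item id -> difficulty (in logits)
    answer_item : callback that administers an item and returns True/False
    """
    ability = 0.0                      # start from an average ability estimate
    remaining = dict(item_bank)
    for _ in range(max_items):
        if not remaining:
            break
        # Select the unanswered item whose difficulty is closest to the current estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - ability))
        difficulty = remaining.pop(item)
        correct = answer_item(item)
        # Nudge the estimate up after a correct answer and down after an incorrect one;
        # the step is smaller when the item was well targeted (a crude, illustrative update).
        p = rasch_probability(ability, difficulty)
        ability += (1.0 if correct else 0.0) - p
    return ability

# Example: a small hypothetical item bank and a simulated test-taker who answers
# correctly any item easier than 0.5 logits.
bank = {"Q1": -1.5, "Q2": -0.5, "Q3": 0.0, "Q4": 0.8, "Q5": 1.6}
print(adaptive_test(bank, answer_item=lambda item: bank[item] < 0.5))
```

The design choice illustrated here is simply that the next question depends on the answers already given, which is what distinguishes an adaptive test from a fixed paper-and-pencil form.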

The many landmarks in the history of educational testing indicate the continual development of this field and show that assessment practices have reflected the prevailing educational policies. We have summarised some of these developments in Table 1. It is not important to recall the details of these events, but you should at least be aware that educational assessment is an ongoing endeavour with a significant history. The gist of these few background details is that some form of assessment has been a regular feature of Western education, but that educational tests of a written nature are a relatively recent invention. Much of our past assessment was informal, oral and practical in nature.

Table 1. Selected events in the history of educational assessment

2200 BC   Mandarin civil service testing program
1219 AD   University of Bologna holds oral examinations in law
1636      Oral exams for awarding of degrees at Oxford
1845      Printed examinations first used in Boston
1864      Fisher (UK) develops sample questions and answers to grade essays
1897      Rice surveys spelling abilities of US schoolchildren
1908      Objective arithmetic tests
1926      Scholastic Aptitude Test first used (now a common basis for university entry in the US)
1930      Australian Council for Educational Research established (premier educational test distributor in Australia as well as the major research and development agency)
1947      Educational Testing Service (major testing agency) founded in US
1952      Development of Rasch model of measurement
1967      US Supreme Court ruling against group ability tests to stream students
1971      US Supreme Court rules that selection tests must have a direct relationship to job performance
1980      Development of item response theory (a new approach to analysing test scores and responses)
1985      Publication of the Standards for Educational and Psychological Testing
1986      First computerised adaptive tests are given to students in Portland (OR) public schools
1988      Code of Fair Testing Practices in Education
2001      Fourteenth Edition of The Mental Measurements Yearbook published (major listing and review of all tests; first published in 1938 by Oscar Buros)

There is now a widespread desire to assess individuals for purposes of grading and evaluating educational outcomes. The advent of large classes and organised education has meant that much of our assessment has changed dramatically. It has become more formal, involved greater standardisation and become more quantitative in nature. It has come to meet other demands (administrative, policy, accountability) as well as meeting the needs of the learner or the teacher. (Visit WebResources for more information on the widespread use of standardised assessment.)

Before discussing the purposes of assessment it might be useful to show you some of the extent of standardised assessment in education in Australia, Europe and the USA. We have included examples in the WebResources companion from the New South Wales Department of Education, the Qualifications and Curriculum Authority (UK) and the American College Testing Program to indicate the widespread use of formal educational assessment in primary and secondary education. The increase in formal testing and assessment procedures has been linked with large-scale education systems.6 Some of the developments in testing are of recent origin and the pace of change has certainly quickened. This reflects (a) the growth of mass schooling and the expansion of tertiary education, with its need for large-scale written rather than oral assessments; (b) increasing access to technology leading to newer forms of assessment, such as printed test papers originally and, most recently, computer-based testing; and (c) other developments in fields such as psychometrics (i.e., the field of psychological testing and measurement), statistics and educational research.

PURPOSES OF ASSESSMENT

Assessment can also be linked to the following educational purposes: diagnosis, prediction, placement, evaluation, selection, grading, guidance or administration. In all fields of education, assessment results are used to decide about students (i.e., student progression), to decide about teaching and learning (i.e., curriculum decisions), and increasingly assessments are linked with certification of competence and the validation of performance on job-related tasks. Leaving aside these policy and administrative considerations, you can make many decisions that are dependent upon some form of assessment. For instance you can use assessments to give answers to questions such as the following.
– How realistic are my teaching plans for this group?
– Are my students ready for the next unit?
– What learning difficulties are students facing?
– Which students are underachieving?
– How effective was my teaching?
– Which learners are advanced?
– Which learners are gifted or talented?
– Which learners require special assistance?

Assessments are a natural accompaniment to instruction. The assessment process is an integral aspect of education that helps you make judgements about students’ current levels, about the most appropriate method for teaching, or about when to conclude teaching a topic. Table 2 shows a few instances where there is scope for educational assessment at many points in the teaching process. Assessments can be a valuable component of your teaching and few formal courses would be complete without an assessment component. Even the best assessment plan, however, can test only part of an individual’s educational achievement.7 Your own experiences would already have shown you that a person’s development and growth, his/her interests and values may not be assessed accurately or consistently. As teachers our aim is to use assessments fairly, to analyse the results carefully and to combine the results with other evidence of progress in order to provide the best possible evaluation of our students’ achievement and development.

Table 2. Assessment and the teaching process

BEFORE TEACHING
– To determine the level of skills/knowledge prior to instruction
– To diagnose learning difficulties or advanced achievement
– To plan instruction

DURING TEACHING
– To make on-going changes and improve teaching and learning
– To focus on a small segment of instruction
– To identify learning errors and misconceptions and take remedial action

AFTER TEACHING
– To certify the attainment of outcomes at the end of learning
– To self-evaluate your teaching effectiveness and improve teaching plans
– To assign grades and communicate results

When assessment is conducted appropriately it will help people to learn in a way that is meaningful and encourage their motivation to learn. Assessment can be used to improve the quality of learning and instruction. It provides important feedback on progress and helps people to realise their strengths and weaknesses without being judgemental. From your viewpoint, effective assessments give you the necessary information to decide how a person is progressing with their learning. Appropriate assessments can encourage what we call deep rather than shallow approaches to learning in a subject or occupation. What does this mean?
– Deep approaches: learners focus their attention on the overall meaning or message of teaching. Ideas are processed and interests developed in the topics. Where possible the content is related to experiences to make it meaningful. Modern approaches to assessment such as self-assessment and peer-assessment may help achieve these goals.
– Surface approaches: with this type of learning the focus is on acquiring skills, test-taking techniques or knowledge that is necessary to do well on assessments. There is less focus on understanding and on being able to transfer knowledge to other situations. There is a concern for the details that need to be remembered for assessment purposes. The quality of the learning outcomes is much lower. Assessment approaches that focus too much on past papers and test-taking techniques are of this kind.

If you are not careful, then the types of assessments that you use may inadvertently develop only surface or shallow approaches to learning. You need to remember that the assessments you use can be a valuable tool that encourages learning but, if not used wisely, they may also deter learning and make your work more difficult.

ETHICAL ISSUES ASSOCIATED WITH ASSESSMENT

We would like to complete this chapter by indicating to you that there are some important ethical issues associated with assessments. Some of these issues relate to ensuring that people are assessed appropriately, that everyone is marked on an equal basis, that they are not disadvantaged by assessment results, that any assessment is for the benefit of the student and that their confidentiality or privacy is respected. In particular, assessment results are the property of the person and should only be released to other colleagues on a need to know basis and only if the person will benefit directly (the next year’s teacher may want to know past performance of his/her students in order to plan teaching) or indirectly (the teachers of the school may use the results to evaluate the effectiveness of different teaching approaches in their school). A fundamental question is: ‘Will my pupil, student or trainee benefit from this assessment?’ While some form of assessment is helpful in teaching, we may need to reduce the inappropriate use of tests, especially long tests that may cause anxiety and may tire the students. Even a cursory glance at most curricula should show you that most students are being over-assessed or that assessments are being used in a less than optimal way. Assessments have the potential to be very helpful but the application of assessment also needs to be well-organised. Some of the critics of modern-day testing deserve attention because assessment has not always been applied appropriately. Most of us are far removed from the educational decision-making that affects our everyday work but if you are at the coalface then you have to put these policies into practice. And this raises a large number of ethical issues, many of which centre around assessments. We can aim for an ideal assessment system, however, and it might be one that encourages meaningful learning, that has fair and equitable procedures and that produces results that are both valid and reliable. Some possible steps for you to consider are to: – use tasks that encourage learning; – use tasks that encourage interest; – ensure that tasks are linked to the learning outcomes; – offer tasks that foster self-direction; – use a variety of tasks as part of your assessment; – give grades that reflect true levels of achievement; 9


Some possible steps for you to consider are to:
– use tasks that encourage learning;
– use tasks that encourage interest;
– ensure that tasks are linked to the learning outcomes;
– offer tasks that foster self-direction;
– use a variety of tasks as part of your assessment;
– give grades that reflect true levels of achievement;
– treat every learner fairly and equitably;
– give timely and appropriate feedback;
– provide alternative assessments for special groups; or
– adhere to a code of ethics for test users.
We have taken the liberty of reproducing some relevant sections for educational test users of the Code of Fair Testing Practices in Education8 (see Appendix A). The Code was not intended to cover the use of classroom tests made by individual teachers, but it is helpful because it outlines some major obligations that we have to test takers in education. It is directed more towards those assessments used in formal testing programs. There are also influential professional associations involved in testing, such as the International Test Commission, the International Association for Educational Assessment and the National Council on Measurement in Education and, at the time of writing, regional professional associations for assessment are being formed in the Asia-Pacific region (visit the WebResources for examples of professional assessment bodies). As well, there are published Standards for Educational and Psychological Testing jointly prepared by the American Educational Research Association, American Psychological Association and National Council on Measurement in Education. Some links to professional associations, helpful listservs and assessment resources are provided in the WebResources companion. Appendix B provides references to additional assessment topics.

SUMMARY

In this chapter we have tried to give you some idea of the importance of assessment for teaching, but also of its importance for the lives and futures of your students. Assessment is not fundamentally about pieces of paper, exams, marks, grades or intricate scoring systems; it is really about how we use certain tasks or events to establish that learning has occurred, that someone is able to do something or knows something, or to provide information with which to evaluate our own teaching effectiveness and improve our teaching methods. The process of assessment and its key components address the important question of how we know that something is the case – you may regard this as the evidence-based practice of the teaching profession. Assessments have assumed social importance because of their links with formal qualifications and because more and more people pay serious attention to education and its outcomes. Modern educational systems are now centred on various forms of assessment and it is important for teachers to be familiar with educational assessment. In part, this trend reflects the community's desire for validity, fairness and objectivity. The need to consider the appropriate and ethical use of assessments in teaching has been stressed.


The valid concerns that some critics have identified – such as the use of test results to grade pupils, teachers and schools, the value of testing as an indicator of competence, or the role of testing in teaching and learning – may not really be assessment problems. They are mainly issues of educational policy and administration, of how assessment results ought to be used rather than misused. In the next chapter, specific issues associated with the nature of assessment are considered in further detail. References to the sources cited are listed in the Notes. We have also provided a glossary of the terms used throughout the text. Now take a moment to review the key ideas and perhaps undertake some of the exercises or activities at the end of this chapter.

-oOo-

REVIEW QUESTIONS

Try these review questions to help you reinforce some of the key ideas in this section. These are all true-false questions to make it easier and quicker for you to complete. You will find a set of similar questions at the end of each chapter. First, think whether each statement is mainly true or false. Then just circle the T (True) or F (False). If you are not sure, just guess.

T F  In education, the word ‘assessment’ is used in a special way that is different from its ordinary, everyday meaning
T F  In education, the term ‘assessment’ has taken over from terms such as ‘testing’
T F  The everyday use of the term ‘assessment’ refers to a process of collection of information, judgement and comparison
T F  Educational assessment has developed over a period of some 2000 years
T F  Viva voce refers to an oral exam
T F  Written formal testing dates from around the 1500s
T F  Psychometrics refers to the field of psychological testing and measurement
T F  Assessment results are used to decide about students
T F  The main scope for assessment is after the teaching process
T F  The Code of Fair Testing Practices in Education outlines obligations that we have to test takers
T F  Consent of a test taker is required before providing results to any outside person or organisation
T F  Educational test results are a privileged communication to other teachers
T F  It is helpful to explain how passing test scores are set



EXERCISES

Here are some review exercises for you to answer, or they can be used as the basis for discussion.
1. What is one stated purpose for assessment in the education or training context in which you work?
2. Do you consider that assessment is essential in education or training?
3. Indicate three classroom teaching decisions that can be made by using the results of assessments.
4. How can you use assessments to influence your students’ learning?
5. What valid reasons would a teacher have for assessing a group on the first day of instruction?
6. Reflect on and write down the reasons you have for testing in your organisation.
7. Which document sets out the assessment and testing policy for your organisation?
8. Comment on the following approach to assessment in mathematics.
– Having assessment be an integral part of teaching
– Focusing on a broad range of mathematical tasks and taking a holistic view of mathematics
– Developing problem situations that require the application of a number of mathematical ideas
– Using multiple assessment techniques, including written, oral, and demonstration formats
– Using calculators, computers, and manipulatives in assessment.
Source: Curriculum and Evaluation Standards for School Mathematics (National Council of Teachers of Mathematics, 1989, p. 191).



9. Read the following excerpt from the FairTest Principles produced by the National Center for Fair and Open Testing. Comment critically on these principles and highlight any ethical issues.
Assessment of student learning is undergoing profound change at the same time reforms are taking place in learning goals and content standards, curriculum, instruction, the education of teachers, and the relationships among parents, communities, schools, government, and business. These Principles provide a vision of how to transform assessment systems and practices as part of wider school reform, with a particular focus on improving classroom assessment while ensuring large-scale assessment also supports learning. To best serve learning, assessment must be integrated with curriculum and instruction. High quality assessment must rest on strong educational foundations. These foundations include organizing schools to meet the learning needs of all their students, understanding how students learn, establishing high standards for student learning, and providing equitable and adequate opportunity to learn. The Principles reflect an ‘ideal’ – what the National Forum on Assessment believes is the best that assessment can be and do. We understand that they will not be implemented immediately or with great ease. We do firmly hold, however, that education systems must move toward meeting these principles if assessment is to play a positive role in improving education for all students.
Principle 1: The Primary Purpose of Assessment is to Improve Student Learning
Principle 2: Assessment for Other Purposes Supports Student Learning
Principle 3: Assessment Systems Are Fair to All Students
Principle 4: Professional Collaboration and Development Support Assessment
Principle 5: The Broad Community Participates in Assessment Development
Principle 6: Communication about Assessment is Regular and Clear
Principle 7: Assessment Systems Are Regularly Reviewed and Improved
Source: FairTest, The National Center for Fair & Open Testing, http://www.fairtest.org/princind.htm



10. Read this brief excerpt from the National Council on Measurement in Education (December 2000 NCME Newsletter). What ethical issues are involved?
Opponents of High-Stakes Tests Seek To Breach Exam Security
Some opponents of high-stakes tests have turned into would-be saboteurs. In at least three separate instances over the past six months, people apparently have sought to undermine the purpose and validity of state or district exams by sending copies of them to newspapers in the hope of having the questions published. Two leading daily newspapers, The Atlanta Journal-Constitution and The Los Angeles Times, and Substance, a monthly teacher-run publication in Chicago, have been the recipients. Only the latter actually published questions from a test.

Note: You can visit the WebResources for more Exercises


CHAPTER 2

THE VARYING ROLE AND NATURE OF ASSESSMENT

Every lecturer has had this experience at one time or another: You’re explaining some especially intricate and fascinating aspect of your discipline when you see a hand shoot up in the back row. “Yes?” you ask, eager to engage on a favourite topic with a bright, inquisitive mind. “Um, do we have to know this? Will it be on the test?” As far as students are concerned, there is nothing more central to the learning experience than assessment. Some learning researchers call this the backwash effect. The type of assessment students know will be coming determines when they “tune in” to a lecture and when they “tune out.” Evidence from student diaries indicates that students spend less than 10 percent of their time on non-assessed academic work1.

Whether you are a primary, secondary or higher education teacher, the above extract may remind you of one or more similar events in your own career. It shows how important the role of assessment can be and how it may help to shape many aspects of any education system. It shows that assessment can either be the driving force behind improved teaching and learning or a catalyst that de-motivates learners, destroys the climate of the class and has a negative impact on learning. Whether assessment has a positive or negative impact is determined by its aims and the way it is implemented. In the previous chapter we mentioned briefly some of the aims of assessment: diagnosis, motivation, prediction, placement, evaluation, selection, grading, guidance and administration. Each of these aims has its own importance for any education system. In this chapter, we look at the different roles of assessment a little more closely and investigate how these roles may be served more efficiently when different assessment approaches (such as tests, portfolios and performance assessments) are utilised where and when needed. It is possible to split assessments into two broad categories based on their perceived importance: high stakes and low stakes assessments. High stakes assessments are those which are perceived by one or more groups of stakeholders as very important for some reason. Examples include end-of-year examinations used for school graduation, university entrance examinations, assessments that certify eligibility to practise a specific profession, and assessments that evaluate the effectiveness of schools or teachers. In effect, the high stakes nature of an assessment is determined by nothing other than the consequences of the intended use of its results.


High stakes assessments

A high stakes assessment such as an examination may have a huge impact on the lives of various stakeholders. To mention just a few examples, students may proceed to the next level of education (e.g., access to tertiary education); graduates may be licensed to practise a profession (e.g., medicine); teachers may get a promotion or lose their jobs and schools may be closed down depending on the performance of their students; students may get a PhD if they submit their thesis and succeed in the oral defence (examination). There are cases, however, where high stakes assessment benefits other people as well. For example, strict and thorough assessments for licensing people to work as pilots or doctors benefit the whole of society because they reassure us that only the most competent individuals will practise the most demanding professions. To offer another example, university entrance assessments seek to ensure that only the most academically oriented students will proceed to further education, so that scarce public resources are used more efficiently. As shown above, high stakes assessments most often serve important social purposes which may also have significant financial, political and educational benefits. They are frequently used as instruments to monitor the education output of schools as well as countries. International bodies, as well as governments, use large scale assessments to compare their education output to that of other countries. Leading financial journals like The Economist frequently devote considerable space to discussing the results of international assessments. These international comparisons of student achievement involve assessing the knowledge of elementary and secondary school students in a number of subjects, usually including mathematics, science, reading and writing, and technology. The main medium of assessment is the written test because it is very practical and easy to standardise, using high quality test items that have been agreed upon by the participating countries. Complex comparability studies have been carried out since 1959 and have reached a level of maturity. One typical example of such studies is the Trends in International Mathematics and Science Study (TIMSS), which included different types of items such as multiple-choice, short answer, extended response and performance items. TIMSS was administered by personnel from the participating countries after they were given extensive training to assure the high quality and comparability of the TIMSS data. Quality-control monitors were also hired from different countries and were trained to visit the national research centres and to review their procedures (visit the WebResources for more information on international comparability studies). Although it might be theoretically desirable to enrich international comparability studies with other means of assessment such as laboratory tasks, this would lead to huge practical and logistical problems as well as increased costs; therefore, tests remain the main assessment medium. High stakes assessments are also used frequently for selection, placement, grading or evaluation purposes and are becoming more and more important in more countries around the world.


Although there are some countries with excellent education systems, like Finland, that still tend to avoid competitive, high stakes tests, today many countries are shifting towards more formal, paper-and-pencil, high stakes assessments. For example, Sweden, a country with a very weak high-stakes assessment culture, is now reforming its educational system, firming up the assessment process, emphasising the role of teachers as assessors, investing money in training teachers to build their own tests and introducing more external tests. England is piloting the new concept of single-level tests: these are intended to be short (but focused and more accurate) tests assessing the curriculum at a particular level in order to verify teachers' judgements (students will take a test matching their ability). Ukraine, Georgia and Lithuania, as well as Russia and other former Soviet republics, are renewing their interest in high stakes, formal assessments, emphasising fairness and quality for access and equity reasons. The recent huge investment of resources in the Anglo-Saxon world (USA, Australia, New Zealand and UK) in issues directly related to high-stakes assessment (and especially tests) is also very impressive. All over the world, formal, large scale and high stakes assessments seem to be perceived as one of the main driving forces not only for selection, reporting and administration but also for evaluation and prediction. Written tests are the preferred option in almost all countries around the world when it comes to large scale, high stakes assessment. In England, 600,000 pupils at each of the first three Key Stages of mandatory education took end-of-year tests in order to assess attainment – in other words, to determine the output of the education system. It would be rather extreme to suggest that other means of assessment such as an oral examination or portfolio evaluation would be more practical than paper-and-pencil tests. For example, it would take a breath-taking amount of resources (money, time, personnel and infrastructure) to assess those students with personal interviews, and it would also take a surprising amount of time to assess them through portfolio or project evaluation. In the context of high stakes assessment, however, practicality is not the only virtue of written tests. Of course, an assessment may not be perceived by everybody as high stakes. For example, consider the situation where university students who wish to skip an introductory level course are given a screening test to determine whether they already know the content and can proceed to the next stage. Succeeding on the test will reduce the time spent at university. Students who need to graduate and get a job as soon as possible will therefore consider the test to be high stakes, but those who do not worry too much about the extra cost of graduating a bit later may consider it to be low stakes.

Low stakes assessments

Low stakes assessments are also gaining ground all over the world. It is widely recognised today that more resources need to be invested in order to improve informal, classroom based, low stakes assessment.



The reason behind this is that educationalists around the world have realised that classroom assessment, if used wisely, can be a powerful tool for diagnostic, placement and motivation purposes, as well as for grading and prediction. But the most useful purpose of classroom assessment is arguably diagnosis. For example, picture a situation where a primary school teacher needs to teach 11-year-olds some basic principles of probabilistic thinking, which is one of the goals of the mathematics curriculum in many countries around the world. Probability is a topic which is taught in the later grades of primary schooling, but all children come to school with some relevant intuitive concepts. From everyday life while playing with friends, or through observation or trial and error, many children already know that if they have a bag with two red and four blue pencils and they close their eyes, put their hand in the bag and randomly pick one pencil, they are more likely to pick a blue rather than a red one. In a recent experiment, 426 primary school pupils in a European country were asked a question similar to the one above and 85% gave a correct response, saying that there were more blue than red pencils in the bag so they would expect to get a blue pencil by chance. Then the same pupils were asked a similar question but in a different context. They were put in the context of a zoo, where there were two elephants and four monkeys. They were told that the staff of the zoo would like to wash all the animals but did not mind with which animal they started, so they decided to pick one animal purely at random. The children were asked which animal was more likely to be selected by chance alone: an elephant or a monkey. A staggering 20% of the children gave responses based on non-probabilistic processes. Some said that the staff of the zoo would pick an elephant first because an elephant is usually dirty whereas monkeys are cleaner. Other children said that the elephant can neither run nor climb trees; therefore, an elephant was the animal most likely to be picked. When challenged further with an ‘explain why’ question, it was obvious that these children would frequently switch to non-mathematical, subjective answers when they found the context of a question interesting; in other situations they would revert to more scientific thinking, based on their observations of everyday life. It would obviously be a bit more difficult for a teacher to teach those children their first formal concepts of probability, and special teaching and preparation would probably be required to motivate them to comprehend the meaning of chance. Thankfully, in any context like the above, a primary school teacher could use ready-made tests with tasks and questions especially designed to pinpoint the cognitive errors and misconceptions of the pupils. Such tests are called diagnostic because their aim is to identify errors and misconceptions early on, before the teacher spends one or two weeks trying to introduce the children to a new concept. Diagnostic tests are excellent tools because they are easy for the teacher to administer and score, and can also be very easy for the pupils to complete. In some cases, the pupils read a question, pick one out of five multiple-choice options (or provide a short response) and then explain very briefly why and how they reached this answer (visit the WebResources for more information on diagnostic tests and performance assessment). The tests are deliberately very short, and they do not aim to measure achievement – they focus only on a very specific and narrow sub-domain, trying to find errors and misconceptions.
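The arithmetic behind the pencil and zoo questions above is the same simple ratio in both contexts. Written out as a brief worked restatement of the numbers given above (this is our illustration, not a result from the study itself):

P(\text{blue pencil}) = \tfrac{4}{2+4} = \tfrac{2}{3}, \qquad P(\text{red pencil}) = \tfrac{2}{6} = \tfrac{1}{3}

P(\text{monkey}) = \tfrac{4}{2+4} = \tfrac{2}{3}, \qquad P(\text{elephant}) = \tfrac{2}{6} = \tfrac{1}{3}

Only the surface story changes between the two questions, which is exactly the kind of switch in reasoning that such diagnostic items are designed to expose.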


In one or two minutes, a teacher can scan a pupil's responses and prepare for the next class in the best possible way, taking into account the intuitive errors and misconceptions the pupils bring to the class.
The above paragraphs have demonstrated some of the important social, financial and educational roles of assessment. Tests have traditionally dominated the arena of high stakes as well as low stakes assessment, although other methods such as performance assessments and portfolios have their own place, especially when skills are assessed. There are growing numbers of teachers and schools that would like to see more frequent use of alternative methods of assessment, even for high stakes purposes such as selection for access to tertiary education. Such initiatives, although impractical and expensive, have appeared in recent years in various places: one example is the New York based Performance Standards Consortium, a group of 28 high schools currently promoting the concept of fewer high stakes tests and more performance assessments. In all cases, however, the development of high quality assessment instruments and activities demands not only in-depth subject matter knowledge but also a good deal of experience and theoretical knowledge. Teacher-made assessments can be very efficient and informative if they are tailored to the needs and the characteristics of the students in the class, but to be able to build high quality assessments teachers need considerable initial experience and in-service training. This has been widely recognised and more and more countries are investing resources to improve the assessment skills of their teachers. For example, the body of school inspectors in Wales (UK) recently published guidelines asking all schools to provide their teachers with opportunities for adequate in-service training to update their assessment skills. Sweden, as mentioned above, is investing money in training teachers to build their own tests and introducing more external tests. Across Europe, more and more countries are spending resources to help their teachers improve the quality of assessments. The next section explains the complexity of building high quality assessments, both for high and low stakes purposes.

VALID AND RELIABLE ASSESSMENTS ARE DIFFICULT TO DEVELOP

Assessment results need to be reliable as well as valid. We will deal with the concepts of validity and reliability in much more depth later, but it suffices here to say that an assessment is valid to the degree that its results accurately determine or measure what they were supposed to determine or measure, and reliable to the degree that its results are consistent. So what does a low-stakes, teacher-made assessment involve? First, the teachers need to study the curriculum very carefully and comprehend and unpack the standards (i.e., understand the level at which their students should be working). This is not as easy as it may look at first glance and it is not something a teacher should wait to build up by experience – it does not come naturally; it needs training. Tables of specifications need to be created for each separate sub-domain within each subject; this is necessary if teachers are to know exactly what they should assess.
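To make the idea of a table of specifications a little more concrete, here is a minimal illustrative sketch in Python. The subject, content strands, cognitive levels and item counts are entirely hypothetical and are not taken from the text; the point is only that a blueprint pins down what is to be assessed and with what weight.

# A miniature table of specifications (test blueprint) for a hypothetical
# 20-item fractions test. Keys are content strands, inner keys are cognitive
# levels, and values are the number of items planned for each cell.
blueprint = {
    "Equivalent fractions":     {"Knowledge": 2, "Application": 2, "Reasoning": 1},
    "Addition and subtraction": {"Knowledge": 2, "Application": 3, "Reasoning": 2},
    "Fractions of a quantity":  {"Knowledge": 2, "Application": 3, "Reasoning": 3},
}

# Check that the planned items add up to the intended test length and show
# the relative weight given to each content strand.
total_items = sum(sum(cells.values()) for cells in blueprint.values())
print(f"Planned test length: {total_items} items")
for strand, cells in blueprint.items():
    n = sum(cells.values())
    print(f"{strand:25s} {n:2d} items ({n / total_items:.0%} of the test)")

In practice such a blueprint is simply a small table drawn up on paper; the code form is used here only to show the kind of bookkeeping involved in matching an assessment to the curriculum.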


Then, a teacher needs to decide how, when and where assessment will take place and for what purpose: for example, a diagnostic test at the beginning of the week may be shorter and narrower in scope, but the questions need to be well prepared to address specific possible errors and misconceptions. A grading test, on the other hand, needs to cover as much of the taught curriculum as possible. The methods of assessment need to be determined (e.g., a project or a test) depending on the available time and the scope of the assessment. The actual assessment needs to be prepared (i.e., the questions of a test need to be written) and a scoring guide (i.e., a rubric) needs to be developed and trialled. Then the assessment needs to be administered (i.e., conducted) and scored, and the results need to be analysed to draw conclusions. Finally, grading and reporting may (or may not) happen, depending on the purpose of the assessment exercise. Throughout all of this, a teacher needs to keep in mind that the assessment must be fair to everyone and should not be biased against groups of students. The teacher should also focus on gathering evidence from different sources, so, where useful, different media of assessment should be employed: written tests, computer-based assessments, portfolios, group projects or oral questioning. How to combine evidence from different (and frequently incompatible) sources of information is another difficult task.
High-stakes, external assessments may be even harder to develop. They have to follow a very strict development process that involves a long list of quality control procedures, field trials, evaluation and improvement of the assessment under construction. The development of high stakes assessments is so complex and costly that it requires the collaboration of a whole array of specialists under the umbrella of dedicated organisations. A typical assessment construction cycle usually starts with drafting a very detailed document setting out the aims of the assessment and the intended audience. Then expert question writers and teachers come together to build the table of specifications, which is a very detailed document explaining what is to be assessed in order to fulfil the aims of the assessment. The next stage is to start drafting questions according to the table of specifications. The authors of the questions may generate new questions or may get ideas from past assessments. The draft instrument is then evaluated again (sometimes by an independent group of experts) to confirm that the questions comply with the table of specifications. The next stage includes the piloting of the assessment, where data is collected and analysed to identify whether the targeted population perceives it to be a fair and valid instrument. Statistical analysis is used to build the profile of each question or task of the assessment. Questions or tasks of inferior quality will be deleted, replaced or improved. A new cycle of piloting takes place and the assessment is finalised. During this cyclical procedure, the scoring rubrics are also developed, evaluated and finalised. The assessment is ready when a document is prepared explaining how it should be administered and scored and what the valid uses of its results are.
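As a rough illustration of what 'building the profile of each question' can involve, the sketch below computes two classical indices from pilot data: the facility (proportion of candidates answering correctly) and a simple top-half versus bottom-half discrimination index. The pilot responses and the flagging thresholds are hypothetical, used only to show the kind of screening applied to draft questions; real programmes use more sophisticated models, as noted in the next paragraph.

# Classical item analysis on hypothetical pilot data: each row is one
# candidate, each column one dichotomously scored question (1 = correct).
pilot_data = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

n_candidates = len(pilot_data)
n_items = len(pilot_data[0])
totals = [sum(row) for row in pilot_data]
# Candidates ranked from highest to lowest total score.
ranked = sorted(range(n_candidates), key=lambda i: totals[i], reverse=True)
half = n_candidates // 2

for item in range(n_items):
    scores = [row[item] for row in pilot_data]
    # Facility (item difficulty): proportion of candidates answering correctly.
    facility = sum(scores) / n_candidates
    # Discrimination: facility in the top half minus facility in the bottom half.
    top = sum(scores[i] for i in ranked[:half]) / half
    bottom = sum(scores[i] for i in ranked[-half:]) / half
    discrimination = top - bottom
    flag = "review" if facility < 0.2 or facility > 0.9 or discrimination < 0.2 else "ok"
    print(f"Q{item + 1}: facility={facility:.2f} discrimination={discrimination:+.2f} [{flag}]")

Questions flagged for review in such an analysis are the candidates for deletion, replacement or improvement in the next piloting cycle.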



The development of high stakes assessments usually entails much work from psychometricians, who use advanced statistical models to evaluate the quality of the assessment and to link the assessment results to those of other assessments (visit the WebResources for more information on teacher-made assessments and the development of high stakes assessments). This shows that the development of assessments – whether for low stakes or for high stakes purposes – is a very demanding procedure which requires both resources and expertise. There are many steps involved in either a low-stakes classroom assessment or a high-stakes formal assessment process. The next section delves deeper into the diverse nature of assessment.

THE DIVERSE NATURE OF ASSESSMENT

Assessment can successfully achieve its complex roles in modern society because it is flexible and has a very diverse nature. It may be regarded as a process varying in frequency, duration, content, form, formality and intention.

Assessment as a process

In thinking about assessment as a process, we were searching for some way in which we could communicate quite simply the key parameters (i.e., the constituent variables or qualities of assessment). Five parameters are set out in Table 3. These are amongst the most observable parameters and there is no reason why they cannot be supplemented by other factors that are relevant to you. No claim is made that this framework is comprehensive, but it may offer you a starting point. How would you use this framework to describe an assessment process? In thinking about an assessment for a group you could indicate, for instance, that there will be two assessment events, one focusing on knowledge and one focusing on skills; the two assessment events may involve questioning (e.g., a mid-term exam) and a project (e.g., a practical assignment at the end of semester) respectively; the grading of the assessments is criterion-referenced using a rating scale provided to the students; the results for both assessment events contribute to the final grade (i.e., summative); and they are internally set, internally marked and graded pass-fail. You are correct in thinking that this may seem a little academic at present. The main purpose, however, is to provide you with a reasonably descriptive and conceptual framework with which you can meaningfully discuss, describe and plan assessments. Now, if your assessments have been specified for you in advance by the curriculum, syllabus or training plan, then you can sit back and relax because all the thinking has been done for you. (You may not agree with what has been set out for you but that is not an assessment issue.) Let us describe these parameters in a little more detail for you.

Assessment can successfully achieve its complex roles in modern society, because it is flexible and has a very diverse nature. It may be regarded as a process varying in frequency, duration, content, form, formality and intention. Assessment as a process In thinking about assessment as a process, we were searching for some way in which we could communicate quite simply the key parameters (i.e., the constituent variables or qualities of assessment). Five parameters are set out in Table 3. These are amongst the most observable parameters and there is no reason why they cannot be supplemented by other factors that are relevant to you. No claim is made that this framework is comprehensive but it may offer you a starting point. How would you use this framework to describe an assessment process? In thinking about an assessment for a group you could indicate, for instance, that there will be two assessment events, one focusing on knowledge and one focusing on skills; the two assessment events may involve questioning (e.g., a mid-term exam) and a project (e.g., a practical assignment at the end of semester) respectively; the grading of the assessments is criterion-referenced using a rating scale provided to the students; the results for both assessment events contribute to the final grade (i.e., summative); they are internally set and internally marked, and graded passfail. You are correct in thinking that this may seem a little academic at present. The main purpose, however, is to provide you with a reasonably descriptive and conceptual framework with which you can meaningfully discuss, describe and plan assessments. Now, if your assessments have been specified for you in advance by the curriculum, syllabus or training plan, then you can sit back and relax because all the thinking has been done for you. (You may not agree with what has been set out for you but that is not an assessment issue.) Let us describe these parameters in a little more detail for you.

21

CHAPTER 2

Table 3. Key parameters of an assessment process Parameter Number Content Form Intention Formality

Dimension frequency of assessment events; number of assessments; duration of assessment. knowledge, skills and/or attitudinal content. form(s) of assessment; specific method(s) of assessment. formative and/or summative; criterion-referenced assessment. standardisation; objective or subjective scoring; internal or external assessment.

Assessments may vary in number, frequency and duration Assessments may involve one, two or more different components or assessment events. As an example, an engineering class has short weekly tests, a major practical assignment lasting all semester and a final 2-hour exam, while a marketing diploma class has one major assignment, a class presentation and an external exam. Some such as a welfare subject have two assignments (essays) and others, like an outreach program have many informal but no formal assessments. The assessment issue here is the amount and frequency of assessment. In most formal courses, higher educational institutions prescribe a minimum of two assessment events2, such as an assignment plus an exam or maybe a combination of theory and practical assessments. Moreover, it has been recognised that few courses should allocate more than 10% of the total teaching hours to formal assessments.3 So if you are teaching a module that involves 36 hours you should allow for around 3-4 hours of formal, summative and face-to-face (i.e., direct) assessment time. This assessment time would not include formative activities such as class exercises that do not contribute to a final grade. The assessment time would not include the time spent by students outside class (but it would include any time spent in class introducing the topic, describing the assignment, dealing with enquiries or explaining procedures). These guidelines may assist you in not allocating too much time for assessment and thereby reducing the time available for instruction, which after all, is your principal function. As a general rule, the greater the number of assessments you conduct then the greater will be the reliability of your final decision about a student’s competence. As mentioned before, reliability refers to the reproducibility and consistency of your results. And a final practical point – as a general rule, the greater the number of assessments you conduct then the greater will be your assessment workload!

22

THE VARYING ROLE AND NATURE OF ASSESSMENT

Assessments may vary in content Assessments can be distinguished in terms of their content, especially the extent to which they are assessing knowledge, skills or attitudinal areas of learning. The main focus of assessment and testing in formal education settings is knowledge and skills but there are many courses where attitudinal factors are also assessed (e.g., standard of service in hospitality; depth of rapport in child care; quality of patient contact in nurse aides; sensitivity of client services in welfare; or appropriateness of customer service in sales). Remember that this is an artificial classification. It is doubtful that human behaviour can really be grouped into three such neat categories but these categories do provide us with a common terminology. The issue that is at stake here is the relevance of the assessment process for the learning outcomes. This affects the validity or accuracy of your results. If the learning outcomes for your subject or group are mainly attitudinal then clearly the assessment must focus on attitudes and values. In the same way, the content of the assessment methods that you use (e.g., a role play, an observation of someone’s performance) also need to reflect the attitudinal content area. (An exception to this rule may be those cases where an indirect assessment might be more economical in the first instance, say as a means of screening out those who really have no chance of passing on a more expensive and time consuming assessment.) Holistic assessment. One direction in which you may wish to proceed with your assessments is towards holistic assessments of performance. The term ‘holistic’ derives from holism, a philosophic view that the important factors in nature are entities that cannot be reduced to the sum of their parts. Holistic assessments are ways of integrating the assessment of knowledge, skills and attitudes into one assessment event. In other words, you are asking yourself: ‘Is there some way in which I can assess everything I need through the one assessment task or event?’ This approach merges the intellectual demands rather than assessing them separately or in a piecemeal fashion. For instance, in vocational education a single project may be used to assess different aspects of a technician’s ability to design cooling systems for a furnace. In primary education, a project on reptiles may form part of a portfolio assessment of both language and science. Holistic approaches to assessment have become popular because they promise economy of effort from the assessment side but they are not easy to design or to grade. (Note that holistic assessments are different from holistic scoring – this is a process in which a result comes from an overall impression of a finished product that is compared to a standard for a task). One form of holistic assessment that may merit discussion at this point is authentic assessment. Authentic assessment. In authentic assessment learners complete particular assessment tasks as part of their instruction. The tasks are designed to be meaningful, real-life, adult-like and of long-lasting value. They involve multiple skills, diverse knowledge and attitudinal components. Although they tend to involve individual assessment, students may also be involved in completing a product over a period of time in a group. While this might seem oriented to vocational or professional education, there is increasing reference to the use of authentic assessment in primary and 23

CHAPTER 2

secondary education. An advantage of authentic assessment is that it allows observation of problem solving and can be used to indicate a student’s level of functioning across a range of situations. Assessments may vary in form How can you ascertain whether a student knows some facts or can perform a skill or how they feel about an issue? In education we are restricted largely to what a student says or does. As mentioned in a previous section, your assessment processes can basically comprise a number of information sources such as observations, skill tests, simulations, oral questions etc. We have listed the five major forms of assessment that you are likely to encounter in education and training settings. It is the evidence from these forms of assessment that will be used by you to make your judgements. Each form of assessment could comprise a number of methods. Table 4. Assessment methods and forms of assessment Method of assessment assignments attendance/participation measures class questioning drawing essays exams learning contracts practical tests presentations projects reports short quizzes speed tests take-home exams

Form of assessment questioning observation questioning skills tests questioning questioning questioning observation, simulations, skills tests observation, questioning questioning, skills tests questioning questioning skills tests questioning

You will find that there are many different methods or types of assessment tasks or events which are used in education and training. In Table 4 we have listed the most common methods of assessment mentioned by teachers in our discussions with them and we have related these to the five major forms of assessment that we outlined earlier. The most popular methods include (in order): practical work, classroom questioning, class exercises, class tests, end of topic quizzes, exams, projects and attendance/participation measures. The fact that attendance or participation is used for assessment shows the breadth of assessment approaches but it should be reflected in the learning outcomes for the course. Do not be too alarmed about the dominance of questioning in assessment – it reflects the fact that most of our learning outcomes in education and training are cognitive in nature. 24

THE VARYING ROLE AND NATURE OF ASSESSMENT

Whether or not they should be is not an assessment issue; it is an educational policy and curriculum issue! Also, keep in mind that much of this questioning can today be carried out with very user-friendly and efficient software, so that teachers do not have to engage in questioning all the time or marking written responses to open-ended questions. The assessment issue that is at stake here with the forms and methods of assessment is the principle that wherever possible the form of assessment should be consistent with the modality of the learning outcomes for the course being taught. If the learning outcomes involve performance then observation, skills tests or even simulation are preferable to questioning. If the learning outcomes are cognitive or knowledge-based, then questioning is the obvious first choice. The aim is to use assessments that make the process as authentic (i.e., real-life) as possible. This enhances the meaningfulness of your Visit WebResources for an assessment tasks. example of authentic We have not said much about the prior assessment criteria evidence form of assessment. This can take many forms including exemption on the basis of previous achievements or the recognition of prior learning. The aim is to avoid unnecessary assessment. You should use assessments to tell you things that you do not already know about a learner (or could not reasonably be expected to know). For instance, if you already knew that people who completed a particular course of training invariably passed an external assessment then there is little point in continuing with the assessment. It does not add anything to your existing knowledge about the learners. If everyone in your class or course passes and is promoted then what was the point of the assessment? Is it just a rite of passage? Wouldn’t more formative assessment or additional time for instruction be preferable? Two additional topics that merit attention are those of portfolios and performance-based assessment. Portfolio. We are not certain whether we should classify a portfolio as a subcategory of prior evidence but it certainly involves a multi-faceted collection of evidence collected during and following instruction. Portfolios have become popular methods of assessment, originating in the creative areas where there has been a tradition of developing a portfolio of products for evaluation. The use of portfolios has now spread to other areas of educational assessment (e.g., reading). It is viewed as a means of assessment that encourages learning and student participation. It is also viewed as a contrast to formal, written, summative assessments. A portfolio is a deliberate collection of materials (e.g., student work, journals, teacher notes, audiotapes, videotapes, other evidence) that relate to major learning outcomes. The portfolio is developed actively by the learner and serves as a record of achievement and development. They offer students the freedom to construct their own assessment evidence and enhance critical analysis of what is relevant, appropriate and acceptable. Portfolio development may accompany instruction and offer multiple evidence of achievement. Portfolios may cover various forms and methods of assessment. They may include a wide range of work that is criterion-referenced but not standardised in the 25

CHAPTER 2

strict sense of the word (see the later sections in this chapter for a formal description of these terms). Of course, there are limits on what one might include in a portfolio, so the parameters of standardisation are much broader. Portfolios still serve as form of summative assessment, even though the component parts might have been developed during the course of instruction or had the benefit of instructor input. They may offer a welcome break from other types of assessments and can be constructed in interesting and diverse ways but some student guidance may be needed in the first instance. Some limitations of portfolios are concerns about how time consuming they are for the student and how well they sample the learning outcomes of a curriculum but this problem can be overcome with guidance. One should also evaluate their merits and worth for a particular individual, subject area and cohort of students. A substantial argument in support of portfolios centres upon the beneficial consequences for the learner. It supports an approach to teaching that promotes interest and involvement. The portfolio is seen as a way of showing that the learner has been able to achieve across a variety of contexts and is given the freedom to show the extensive nature of his/her thinking and responding. Furthermore, it is argued that it does not disadvantage students who may not do as well under the controlled conditions of standardised written or other formal assessments. In addition, the contents of the portfolio are meaningful to the learner and considered worthwhile exhibiting and retaining. While it may be thought that portfolios are less efficient in terms of time and cost than alternative assessments we do not think that this is a major disadvantage given their other benefits and the fact many other methods of assessment are onerous for the teacher. Once again, you are reminded that the promised benefits of portfolios need to be evaluated in a fair comparison with other methods so that you can determine whether they are valuable for you and useful in your teaching context. One of the recent achievements of technology is the evolution of e-Portfolios which are the digital equivalents of the traditional portfolios. You may store examples of your students’ achievements (documents, photos, graphics, spreadsheets, web pages, speeches, music, video, three-dimensional models of physical objects). This type of portfolio brings the merits of technology into the equation: teachers and students save space and they do not have to carry around physical objects; nobody needs to worry that the content of the portfolio may be physically damaged, one can recall and compare the portfolios of different students quickly. Scoring the quality of the portfolio and Visit WebResources for more storing the assessment results on a file for information on Portfolio and further use is also straightforward. It is also e-Portfolio possible to make the work of the students anonymous and it is easy for teachers to organise common projects for their students. Of course, digital portfolios need the availability of infrastructure such as computers, networks and specialised software but these are becoming more and more widely available. The next aspect of assessment that is related to forms and methods of assessment is the area of performance-based assessment. We were unclear about whether to 26

THE VARYING ROLE AND NATURE OF ASSESSMENT

include it under this category (i.e., forms and methods) or whether to include it under the category of the content of assessment, especially skills assessment. Either way we do not think that much is lost by including it at this juncture. Performance-based Assessment. The use of the term ‘performance-based assessment’ in education and training has increased in recent years throughout the world. As a broad concept it is closely linked to the assessment of a practical activity. There are many definitions of performance-based assessment that you will encounter. Performance-based assessment has been used to refer to: – a general term for an assessment activity in which students construct responses, create products, or perform demonstrations to provide evidence of their knowledge and skills;4 – assessment tasks that require students to perform an activity (e.g., laboratory experiments in science) or construct a response. Extended periods of time, ranging from several minutes to several weeks, may be needed to perform a task. Often the tasks are simulations of, or representations of criterion activities valued in their own right. Evaluations of performance depend heavily on professional judgement;5 – requiring students to perform hands-on tasks, such as writing an essay or conducting a science experiment. Such assessments are becoming increasingly common as alternatives to multiple-choice, machine-scored tests. Performancebased assessments are also known as authentic assessments;6 – a type of testing that calls for demonstration of understanding and skill in applied, procedural, or open-ended settings.7 The key features of performance-based assessments typically centre on authentic activities that involve mixing knowledge, skills and attitudes in a realistic context. The advantage of a performance assessment is its realism and in some fields it is essential to be able to demonstrate both knowledge and skills through performance. Some examples demonstrating performance-based assessment are provided in Figure 5. There are many reasons for advocating performance-based assessments. The most important of these is their relevance to the curriculum. A second reason is their attraction for learners and teachers. They can add interest to instruction. They can be used for both formative and summative assessments. They can be criterionor standards-referenced. They can test the underlying knowledge as well as the skills involved in a field. Performance-based assessments, however, are not a panacea for all assessment problems. Most of what has been written about them is largely theoretical and there have been relatively few evaluations of performance-based assessments in practical settings. Some people claim that performance assessments may suffer from problems of content or predictive validity. The problem of content validity is related to the fact that it is difficult to sample representatively in a single task all the activities required in a curriculum or for workplace performance. Because of time limitations, no curriculum can cover everything and something has to be omitted. Similarly, teachers may need to settle for limited samples of activities to assess performance because assessment is usually a time-consuming activity. It is also difficult to be 27

CHAPTER 2

certain that correct performance on one task will be sufficient for the correct performance on related tasks or even on the same task in other contexts. Again, you can only hope that the task you have designed is an adequate indicator. In these cases you can only do your best learning from experience and also learning from what has worked for other teachers. (Upper Level) Middle or High School (Provide the students with a copy of a speeding ticket that shows how the fine is determined.) Say to students: “How is the fine for speeding in our state determined? Make a graph that shows teenagers in our town how much it will cost them if they are caught speeding. Excellent graphs will be displayed in the Driver’s Education classroom.” Secondary School (At several specified times during the school day, students observe and count, for a set length of time, the number of cars and other vehicles going through an intersection near the school.) Say to students: “The police department is considering a traffic light or a crossing guard at the intersection near your school. Your help is needed to make graphs that show how many vehicles go through that intersection at certain times of the day. Excellent graphs will be sent to the Chief of Police.” Primary School (In view of the class, place 10 caterpillars in a box. Place a flashlight at one end, while darkening the other by folding over the box top.) Say to students: “Do caterpillars move more to the light or more to the dark? Make a graph that shows how many caterpillars move to the light and how many move to the dark part of the box. Your graphs will be displayed at Open House. Source: A Teacher’s Guide to Performance-Based Learning and Assessment by Educators in Connecticut’s Pomperaug Regional School District 15. To reach the material: http://www.ascd.org/readingroom/books/hibbard96book.html#chapter1 Figure 5. Examples of performance-based assessments in school settings.

Although it is claimed that performance-based assessments are less discriminatory, there is the clear possibility that these tasks may even further discriminate against minority and disadvantaged groups. This series of arguments against performancebased assessments are important not only for this method of assessment but also for any forms of assessment. On the basis of the information available it would appear that there is a good case for using performance-based assessments but that they should be evaluated against a range of criteria. Assessments may vary in intention: Formative and summative assessment Two basic purposes of assessment – formative and summative – are mentioned frequently. The importance of these two approaches to assessment is how they affect 28

THE VARYING ROLE AND NATURE OF ASSESSMENT

your teaching and the way students in your class identify what is important to learn.8 It may be helpful to define these terms for you because they are used so widely in education. It was Scriven9, an educational philosopher and evaluation expert, who made the distinction between what he called ‘formative’ and ‘summative’ evaluation. Let us look at formative assessments first of all. Formative assessments are conducted during teaching or instruction with the intention to improve the learning process. The information that you gain from formative assessments may force you to re-think your teaching plans for your group. The power of this assessment intention is that it is done with a view to making on-going changes or to improve learning before it is too late. This is a classic instance of using assessment information for the benefit of the learner. Tasks that are used for a summative assessment contribute to the ultimate grade. Typically, they occur at the end of instruction and may provide information to the students but also to someone else (e.g., parents, administrators). Mid-semester assessments that contribute to a final mark might be considered as summative in nature (e.g., the assessment tasks which contribute to the school estimate for the Higher School Certificate in the final year). However, if the teacher uses them to evaluate the effectiveness of his/her teaching and to improve his/her teaching plan for the second half of the term then the assessment results are used formatively. Scriven10 described the distinction between formative and summative evaluation like this: ‘When the cook tastes the soup that’s formative; when the guests taste the soup, that’s summative’. One final point of clarification – it is not the assessment that is formative or summative but how you intend to use the results that make it so. As mentioned in the example above, mid-term assessment results may be used formatively as well as summatively. Furthermore the same assessment results in other cases (e.g., weekly tests, projects or homework) may be used both in a formative and summative fashion. That is, a result can contribute to the final grade as well as changing your teaching and instruction. An example of formative assessment in science and technology learning is provided in Figure 6. Objective Knowledge about…

Activities to achieve this Inquire, library search, reading

Assessment procedure Short answer test about…

Investigate factors...

Follow instructions, devise experiments

Report investigation of...

Awareness of...

Group discussion, debate, role play

Comment on...

Communicate about...

Design handout, poster, drama

Effectiveness of output…

Source: UNESCO11 Figure 6. Formative assessment in science and technology learning.

Our observation is that most of our assessment in education and training is summative (that is, we use it for grading) although this should not be the case. Teachers and trainers are hard put to find enough time for formative assessments 29

CHAPTER 2

(that is, we tend to use the assessment results mainly in a summative way). Without high quality teaching, there is also the issue of the extent to which some students might take such formative assessments seriously. Probably the best type of teaching that we have observed is when the teaching and the assessment were intertwined. For example in a carpentry class, students were given a free choice of construction as their assigned task (i.e., summative assessment) and worked on it during the entire course. This was the basis for their assessment but also the source of the teaching and instruction that occurred. The teacher moved around the group and she offered suggestions, guidance and mentoring whenever it was required. Norm-referenced and criterion-referenced assessment. Whenever we compare an individual’s performance with a pre-determined standard or we search for results that are directly interpretable in terms of specified performance, we are focusing on a criterion-referenced assessment. 12 This approach to assessment is consistent with mastery learning, competency-based training and outcomes-based assessment because of its practical emphasis on performance. It contrasts with a norm-referenced approach, which is used to find out how a person performs compared to others. An example of norm-referenced assessment would be when we compare a student to the overall performance of the class. Another common example of norm-referencing is the Universities Admissions Index (a type of tertiary entrance rank or percentile ranking) following the Higher School Certificate in Year 12 and used for university admissions purposes. Normative results are useful when performance is age-related or when a student’s results should be compared with a specific group (e.g., assessment of performance with students versus experienced workers; linguistic performance of non-English speaking students). Any statement indicating above or below average performance is normative and class rankings are also normative in nature. It is not always possible to tell whether an assessment is criterion-referenced or norm-referenced from the assessment itself. The difference lies in the way the results are used. The criterion-referenced assessment describes performance, whereas the norm-referenced assessment distinguishes amongst individuals. The same results may be used in both a criterion- and norm-referenced fashion. It was as recently as 1963, that Glaser, an educational psychologist, put forward the notion of criterion-referenced testing. Criterion-referenced assessment described performance in terms of the nature and order of the tasks performed. It represented a fundamental shift in perspective for educational assessment. The emphasis is on what the person can do rather than on comparisons with others. In criterion-referenced assessment there is a clearly defined domain of learning tasks. ‘Standards-referenced testing’ or ‘standards based assessment’ are terms that are now related to criterion-referenced assessment. Do not think that criterion-referenced means that there is a cut-off point or criterion for passing; it means that the assessment is referenced to a criterion (i.e., a specific content area). Criterion-referenced tests grew out of mastery learning and approaches to learning that were meant to be specific and observable. The criterion that was set for mastery learning varied but it was defined as a success-rate of around 80% or more. Criterion-referenced assessments are used for various purposes, including: 30


– classification (e.g., placement, screening, certification, selection, recognition of prior learning);
– diagnosis (e.g., to identify education and training needs);
– instruction or training (e.g., to provide feedback on the learner's current performance and progress, that is, formative assessment; or summative assessment, which records performance up to or at a point in time);
– self-knowledge;
– program evaluation; and
– research.

Criterion-referenced types of assessment are useful when we want to check whether people have gained sufficient knowledge or skills to go on to the next stage. This is because the content of a criterion-referenced assessment should be designed to match curriculum learning outcomes closely. Questions in these assessments are directly related to what has been taught. Furthermore, the difficulty level of the questions is linked to learning, so that easy tasks are not necessarily omitted. Criterion-referenced assessments are therefore useful for mastery or competency testing, not only because they focus on a special domain but also because they describe exactly what a person can do (e.g., can type at 60 words a minute for three minutes with 98% accuracy). Criterion-referenced tests are suited for workplace assessment; for determining the mastery of basic skills; and when grouping learners for instruction.

Assessments may vary in formality

Another way in which your assessments can vary is in terms of how they are conducted. We have used a simple dimension of formal versus informal to describe assessment. Formal assessment is aimed at obtaining information in a public, structured and prescribed manner. Frequently, formal assessments are also of a high stakes nature. Informal assessment, on the other hand, can be unobtrusive, less structured and private; its results tend to be used formatively and diagnostically. Most usually, the results of informal assessment are of low stakes. The assessments that involve public end-of-year examinations are highly structured and clearly formal in nature; so are the national basic skills tests used in primary schools. Other assessments, like a centrally set but locally marked class project in surveying, would be formal, public, summative and quasi-structured. Similarly, a welfare teacher might visit a student on a locally set and locally marked work-experience placement in order to make some observations that contribute to his/her final grading of this student's ethical behaviour. Do not be misled by the informality of the situation, as this is still a public and formal assessment. In another context, a teacher may use questions directed to every member of the class in order to obtain a general idea about the level of knowledge in the group. This is an informal evaluation rather than a formal assessment. When these questions are carefully directed and varied in difficulty, they can provide a barometer of classroom achievement.



The assessment issue addressed by the formal versus informal distinction relates mainly to the conduct of the assessment. Certain assessments (e.g., high stakes exams) will demand a degree of formality and openness to scrutiny. Public assessments are accountable and have substantial consequences for stakeholders (students, parents, teachers, schools, education systems). Without wishing to tire you with all this background detail, we would like to mention some further parameters of formal assessment. The first of these relates to standardisation.

Standardisation. A key feature of an assessment task or event is the extent to which it is standardised. Standardisation refers to the extent to which the procedure is uniform in its administration, answering and scoring. Failure to standardise means that results are not comparable across individuals or situations. We have always considered that standardisation could well be the chief attraction of assessment for the community. Rightly or wrongly, the standardised assessment is perceived by the layperson as intrinsically fair and equitable because it is meant to be uniform from one teacher and situation to another. An example of a standardised assessment is one of the trade recognition tests in building that uses the same task to determine the value of an overseas qualification. A diagnostic reading test administered by an infants teacher who is required to read instructions word for word from a manual would also be a standardised assessment. On the other hand, the use of uniform assessment conditions may not be as important where a teacher initially suspects a learning difficulty or thinks that a student may be gifted in a particular way. An ad hoc assessment might be developed in the first instance before making a referral to a specialist service.

Standardisation can vary. A multiple-choice examination is uniform in the questions it asks and in its scoring; a take-home examination is uniform in questioning, varied in completion, but may have an explicit marking guide; a design brief (i.e., an assignment or project for a class in design) is uniform in the task it sets for the students, but the completion of the assignment can occur under varying conditions and the marking criteria may vary from one panel member to another. The degree of standardisation in your assessment may be dictated to you by the syllabus or by the need to make comparisons between students on the same basis. The less emphasis that is placed on examinations and the more emphasis that you give to learning, the greater will be the scope for varied and more personally designed situations. While standardisation is important in psychological testing, it may be sacrificed in educational settings for the benefit of the student. For instance, assignments, projects and classroom tests might be varied to suit the interests or rate of learning of students. Usually, however, standardised tests undergo thorough statistical scrutiny, and various measures regarding the test (such as the average score or the spread of scores for a specific population) are published. Using these statistics, one might use the results from standardised assessments for norm-referenced purposes.

Individual versus group assessment. A second way in which your assessment can vary is whether it is an individual or a group assessment. There are some forms of assessment that are restricted largely to personal testing (e.g., clinical skills, trade skills). Reading tests taken on a one-to-one basis and oral examinations are examples of individual assessment. The advantage of individual assessment is that


it gives you an opportunity to observe the person in action and to come into contact with the student as a person, whereas performance in a group assessment is less direct and may be limited to responses on a sheet of paper. The obvious disadvantage of individual assessment is the additional time and effort it requires, so teachers may need to strive for some balance between the two approaches.

Timed versus un-timed tests. A third way in which your assessment can vary is in terms of time limits. Some tasks are timed or speeded, whereas others are untimed or power tests. Power tests have ample time conditions. In some tasks (e.g., keyboarding) speed of response is important and it is legitimate to measure the time taken. Furthermore, speed of response is a powerful indicator of the extent to which skilled performance has become completely automatic, and time taken to learn is also considered a useful indicator of aptitude for a subject. Time limits are recommended where speed is a factor in the occupational performance; otherwise generous time allowances should be provided so that you are assessing only ability in your subject and not a mixture of ability and time-pressured performance. While some of your students are able to cope with the demands of time limits, others may find it difficult to demonstrate their best ability when they are rushed, and for others performance under fixed conditions with strict time limits can be anxiety-provoking. If assessments have prescribed time limits then you should help students to prepare through frequent practice. If you are required to set a time limit, then design your assessments so that around 95% of people can complete them in the time assigned.

Practical implications

Well, now that you know more about the variable roles and nature of assessment, you might well ask, 'So what?' The real advantage of this framework is that you can now describe the assessment for your group or for the curriculum that you teach. The framework can be used as a mental checklist for you to plan and record the outline of your assessment at the commencement of every semester or when you wish to revise the assessment for a program. In particular, if you are a college teacher, you should ensure that each person receives a copy of the course outline or a written copy of subject outlines or assessment documents. In some educational systems, a college student is required to sign these documents. It is not clear to us that students always understand what they are signing, so it is important that you explain the details to students. Ensure that your assessment conforms with the subject or course outline. Do not hesitate to give your students as much information as they need to understand all the practicalities of assessment, such as critical deadlines. Appeals against assessments are now commonplace – especially in high stakes education – and it is becoming increasingly difficult for teachers to defend themselves. There are fewer problems, however, when assessment processes and details are documented. This is a consequence of the modern overemphasis on


competition and on using education to gain vocational or academic qualifications. Secondly, this level of formality is a by-product of the bureaucracy associated with commercial training, mass schooling and further education systems. Nevertheless, within the scope of the assessment requirements for your subject, there should always be considerable leeway for teacher sensitivity to cases of genuine hardship, disadvantage, serious misadventure and difficulty. In particular, special consideration should be given to persons with disabilities.

Remember that the subject outlines are essentially a guideline within which you operate. They indicate the parameters and are there for the benefit of the learner. Very often, when a course is actually taught, it may bear only minor resemblance to the content specified because: (a) it has never been taught before, or (b) it is out-of-date, or (c) the person who wrote the course is not teaching it, or (d) the teacher is experienced enough to provide additional material, or (e) the nature of the learners and their needs are not satisfied by the subject as it is written. So, provided your learners are not disadvantaged, you should feel free to be flexible and to lighten assessment loads when you deem it appropriate – for a start, it does wonders for your popularity amongst learners. You need to make these professional assessment decisions in the light of what is going on around you (in other locations, with other colleagues, in other organisations, in your profession). This needs to be balanced against your responsibility to the learners and the need to keep faith with the essence of the outline. If in doubt, stick to the subject outline. Certainly, you cannot increase assessment loads once an outline of the assessment has been distributed. In some high stakes courses there will be little scope for deviation from the curriculum, and the expert teacher has an obligation to deliver the content and assessment in accordance with the prescribed requirements.

A FINAL WORD OF CAUTION: THE 'ASSESSMENT FOR LEARNING' VERSUS 'ASSESSMENT OF LEARNING' DEBATE

A number of teachers and academics believe that tests cause increased anxiety to students and may have negative effects on the learning process. They also claim that tests lead students to infer that the sole aim of the teaching is simply to perform well on the assessment. They avoid tests altogether and often promote 'assessment for learning', in the sense that alternative assessments (other than tests) may be a vehicle and a motivation to promote learning, not only to measure it (assessment of learning). It is possible that they miss the point: assessment (and especially testing) does not have any inherently evil properties. It is rather how people use – and frequently misuse – assessments that may sometimes cause negative feelings among students and parents. Having said that, it is also true that different applications of assessment could enrich the assessment toolbox of teachers. For example, self-assessment has long been known to have beneficial effects on learning and on the internal motivation of students. Provided it is well focused and designed, self-assessment can involve students in the teaching-assessment-teaching cycle in a very constructive way.



Particularly when connected or attached to projects and portfolios, self-assessment can have especially beneficial and positive results. The idea of self-assessment is not new, but it is not frequently used, for several reasons. Firstly, it is usually claimed that self-assessment is a time-consuming process which shrinks the actual teaching time. It is also claimed that it is a rather complex process which demands cooperation (and maturity) on behalf of the students. The advocates of self-assessment, on the other hand, suggest that self-assessment is not much more time-consuming when it is designed properly. They also claim that the students will need some time in the beginning to get used to the new method of assessment, but that the positive results will be more valuable than the initial investment in time. It is usually better, they claim, to reduce the time spent on paper-and-pencil testing and increase the time spent on this type of alternative assessment. One of the secrets of the success of self-assessment is that the students are continuously exposed to a flow of feedback about their work when they have the scoring criteria handy. Feedback is known to have very positive effects on learning when it is focused, detailed, clear and timely.

The same could also happen with peer-assessment. As a method, peer-assessment means that other students may use scoring rubrics to express their opinion about (i.e., assess and evaluate) the work of their classmates. Peer-assessment has been tried in various contexts with very positive results, especially in tertiary education. Although self-assessment and peer-assessment may have very good results in one context, they may have very undesirable consequences in other settings. At the first stages of your career you may prefer to use more traditional assessment methods until you get more experience and become more confident. If, however, you feel that you would like to try self-assessment and peer-assessment methods in your class, be warned that you will need to be very well organised to succeed. In any case, do not feel that you need to be involved in the assessment for learning versus assessment of learning debate. You now know that you need to design your assessment approach according to the needs of your students. Over-reliance on any single approach is not wise, and this chapter has given you a broad selection of approaches from which to choose.

SUMMARY

We have covered a considerable amount of detail in this chapter and you can be reassured that you do not need to recall it all. If we had to say what was most important about the role and nature of assessment, then we would focus upon (a) the extent to which the results of assessment are used in a formative or summative way; (b) whether assessment is criterion-referenced in its interpretation; and (c) the extent to which it is standardised. A case has been made for using formative, holistic and authentic assessments that are situated in reality. They foster the application of knowledge and skills and bring about important changes in your teaching. Studying and learning in your area will be affected by the forms and methods of assessment that are used. More importantly, studying, learning and assessment in your area will be affected by the


approaches to instruction that are commonly used. These may depend upon the traditions in your field. A minimum of two assessments per subject and no more than 10% of teaching time for assessments has been suggested. It has been recommended that students should be given at least two weeks' notice of assessments and that teachers should provide learners with course outlines and assessment details early in the semester.

This chapter has shown you that there are many approaches to assessment. Assessment is not a unitary concept but is composed of many dimensions, so it is important to be clear about the type of assessment we are discussing. Sometimes one form or method of assessment is contrasted with another, but these comparisons are not always valid. You need to compare two assessment methods that are equivalent in terms of time, effort, teaching implications, predictive validity, consistency of results, learning consequences, economy and so forth. Sometimes the limitations of a particular assessment may be a function of the developer and not an inherent fault of the method itself. (For example, one may develop an excellent authentic assessment or a less than satisfactory authentic assessment; in the same way, one might have an excellent assessment approach that is implemented in a less than optimal manner, or a poorly developed assessment that is interpreted and implemented in a positive manner. There are so many permutations and combinations that comparisons are fraught with difficulty.) There is considerable hype concerning the benefits of some types of assessment; our view is that all the forms and methods of assessment have their place and utility. The issues here can be evaluated for your context and circumstances. Maybe the framework and the parameters we have outlined will ensure that we can be clearer in our discussions about assessment. In the next chapter we shall focus on some fundamental concepts of educational measurement. Now it is time to take a break. When you are ready, you might look at the true-false questions to review your reading of the chapter and maybe undertake some of the exercises individually or as a group.

-oOo-

REVIEW QUESTIONS

Try these review questions to help you reinforce some of the key ideas in this section. These are all true-false questions to make it easier and quicker for you to complete. Think whether each statement is mainly true or false. Then just circle the T (True) or F (False). If you are not sure, just guess.

T F  Assessment is a comprehensive generic term
T F  Assessments may vary in number, frequency and duration
T F  At least three assessment events are recommended for each subject, unit or module
T F  As a general rule, the greater the number of assessments you conduct then the higher will be the reliability of your results
T F  Human behaviour can be grouped into the categories of knowledge, skills or assessments
T F  Holistic assessment integrates the assessment of knowledge, skills and attitudes
T F  Holistic assessment involves holistic scoring
T F  The five major forms of assessment are: observation, simulations, skills tests, questioning and the use of prior evidence
T F  Summative assessments seek to improve the learning process
T F  It is how the results will be used that makes an assessment summative or formative
T F  Classroom questioning is a formative public assessment
T F  Teachers should provide students with course outlines and assessment details early in the semester
T F  Around 15% of a subject's teaching time should be given over to assessments
T F  At least two weeks' notice should be given to students for a class test
T F  Teachers can increase the assessment load slightly even though an outline of the assessment has been distributed
T F  Teachers have an obligation to deliver content and assessment in accordance with the prescribed requirements
T F  A standardised assessment has instructions for administration and scoring
T F  A group test can be used to observe individual performance
T F  A criterion-referenced test can be distinguished by its format and questions
T F  A keyboarding test is likely to be a power test
T F  A formal essay examination is likely to be an objective test
T F  Competency-based assessments are criterion-referenced
T F  Norm-referenced assessments are designed to give descriptions of performance

EXERCISES

Here are some review exercises for you to answer or they can be used as the basis for discussion.
1. Give three examples of formative and summative assessment from your teaching context.
2. What are the assessment requirements for students/trainees in the subject or course that you teach or plan to teach?
3. Describe a form of holistic assessment that could be used in your teaching area.



4. Identify three distinctions between (a) criterion-referenced and (b) norm-referenced approaches to testing and assessment.
5. Read the following definitions of criterion-referenced assessment and comment critically on these views of criterion-referenced testing.
A. Criterion-referenced tests: These tests measure what a student knows and can do in relation to specific objectives or criteria that all students are expected to meet. The tests are designed to reflect the knowledge and skills that a state or community has identified as important for an academic subject. The focus is on whether the student can meet the criteria and not on how he or she performs relative to others. (Source: Education Week, http://www.edweek.org/sreports/qc97/misc/#cr)
B. Criterion-referenced test: A test in which scores are evaluated, not in terms of comparative rankings, but rather, in terms of the percentage of mastery of a predetermined standard. Examples include behind-the-wheel driving tests, tests of typing speed and accuracy, tests in the military for strength, and tests measuring the effects of alcohol on muscular coordination. Most manual skills tests are criterion-referenced. Criterion-referenced tests tend to focus on minimum thresholds, such as the threshold needed to pass a driving test, or to pass for secretarial or military service — or, in the new education system, to graduate from high school. The test focuses — not on the best, the median, or the average students — but on the worst students, those near the minimum threshold. The new education system mandates the use of criterion-referenced tests (and the de-emphasis or elimination of the traditional norm-referenced tests such as ACT, SAT, and Iowa Basic tests), thereby redefining 'success'. The new system intends to 'hold schools and teachers accountable' (by various threats and punishments from the government) for failure to meet its peculiar measure of success. By this means the system compels teachers to forsake students who are average or better, and focus instead on those students near the minimum threshold, for that is how teachers and schools are to be judged. This furthers the twin goals of: (1) educating mostly just for minimum competencies in specific job skills, and (2) 'equalizing' educational outcomes (not educational opportunities) — while turning a blind eye to the development and recognition of academic excellence and the broad-based knowledge needed to keep people free. (Source: Maple River Education Coalition, http://www.mredcopac.org/glossary.htm)
6. Indicate the characteristics of assessments in the subject area in which you teach or plan to teach, in terms of the characteristics used to describe assessments.
7. What is your opinion of the assessment processes used in your area of education or training? How could they be improved?



8. Read this brief excerpt from the American Psychological Association's Monitor (October 1999, p. 10). In your opinion, does frequent assessment lead to greater achievement?

FREQUENT TESTING MEANS BETTER GRADES, STUDIES FIND
College students who were given a quiz on reading material every week outperformed students who were given comparable homework or who had neither … students taking "spot-quizzes" were compared to students of comparable aptitude who were assigned homework on the same material, and to those who neither took spot-quizzes nor completed homework assignments. On final achievement tests, the spot-quiz group outperformed the homework group by 16 percent and the control group by 24 percent. S. Kass


CHAPTER 3

FUNDAMENTAL CONCEPTS OF MEASUREMENT

In previous chapters we discussed the diverse nature of assessment and its multiple and important roles at different levels of the education system. Why has educational assessment managed to reach such a prestigious position in education systems around the world? Generally speaking, society puts its trust in the results of educational assessment because they are believed to be both meaningful and useful. When assessment is carefully designed and applied, its results are meaningful because they can present a detailed picture of what students have learned. The results of high stakes assessment need to be accepted by people as fair and valid:

High stakes public examinations are usually established under the assumption of trust and support from the society. In many countries the citizens are involved in a 'social contract' where examinations are generally accepted by the society as the filter that controls the distribution of scarce educational resources (and as a result affects social mobility). In these countries, persons' moral and/or political obligation to accept the output of high-stakes public examinations depends upon a perceived 'contract' or 'agreement' between the citizens.... It is therefore reasonable that the perceived fairness of high stakes public examination systems around the world frequently receives its fair share of political and educational discussion and debate.1

So it makes sense to pay attention to the results of assessment because they have something important to say: teachers, parents, students and politicians use assessment to make sense out of the complex world of schooling. Beyond being meaningful, assessment results are also useful because they are usually expressed in numbers or scales, or they indicate levels of achievement or degrees of success. Therefore, they can usually be processed in numerical (mathematical) ways; they can be aggregated and compared, and inferences (or even predictions) may be drawn. Assessment results can be presented graphically and convey rich messages with a simplicity that is easily understood by the layperson. People today take it for granted, but educationalists were not always able to monitor the output of the educational system in a numerical manner. They were not able to conduct large scale assessments (at least not as we perceive them today), nor to evaluate how accurate or unbiased their results were. The engine behind this powerful nature of assessment is educational measurement. The fundamental concepts of educational measurement will be presented in this chapter, and several practical examples will be demonstrated.



Educational measurement may be loosely defined as the scientific field which deals with the numerical determination of education-related attributes or dimensions. In simpler words, educational measurement deals with expressing numerically the quantity of knowledge, skills and mental capacity of people. As such, educational measurement is founded on the assumption that it is possible to measure latent (i.e., unobserved) psychological constructs (e.g., mathematical ability, linguistic ability or intelligence) through observing the behaviour of a person (e.g., his/her responses to oral or written questions or instructions). Educational measurement is therefore closely related to the scientific field of psychological measurement, but further discussion of this relationship is beyond the scope of this book. Of course, to measure something, you first need to be able to describe it fully and to know what it is. Many psychologists do not feel comfortable with the concept of measuring unobservable psychological constructs (such as linguistic ability or intelligence), but we do not need to delve into this aspect. Leaving theoretical discussions aside, you, as a teacher, should be able to explain to other people (e.g., colleagues, students or parents) what your assessments intend to measure and why you think they are useful. If you use commercial assessments you should be able to explain the philosophy and the theory behind the assessment, as well as the validity and the reliability of the results in the context of your own class (we will discuss validity and reliability extensively later). You also need to be able to demonstrate that the assessment is appropriate for your students.

The developers of educational assessments rely on statistical models from the field of educational (as well as psychological) measurement in order to develop and evaluate their assessments. They also rely on statistical models to draw inferences and make predictions based on assessment results. For this reason, it is of paramount importance to be familiar with some fundamental concepts of educational measurement. Only then will you be able to fully appreciate the power, the usefulness and the psychometric properties of assessments. The next sections will introduce you to some fundamental concepts of educational measurement. You will familiarise yourself with useful statistics using real data from recent educational assessments in primary, secondary and higher education.

Setting the scene of a peer-assessment exercise

Let us put you in the context of an undergraduate university class. Ten undergraduate students at a European university enrolled in the subject 'Educational Assessment and Evaluation' at the School of Education Sciences in the Spring semester of 2008. They were studying for a Bachelor of Education degree, which is a formal qualification to become a teacher. The requirements for successful completion of the subject were the following:
– 10 marks for class attendance;
– 40 marks for a personal or group project (to be presented in class);
– 20 marks for a written mid-term examination; and
– 30 marks for a computer-based practical examination.



As shown above, each student should carry out a project either alone or as a member of a group. For the project, each student should choose one out of many possible primary school subjects (such as mathematics, science, language or biology) and then choose one specific teaching unit from that subject. For example, a student might choose to carry out a project in biology and in particular the teaching unit or module on the 'circulation system'. The aim of the project would be to prepare their own assessment plan for their hypothetical class to assess whether the learning goals of the teaching unit were met. A scoring rubric was agreed between the undergraduate students and their professor; therefore, the students were aware of the requirements and the virtues of a good project. This is presented in Table 5.

Table 5. Scoring rubric

A. On a scale from 1-5 (1: less successful … 5: more successful), the assessment plan you produce must
1  Be realistic and feasible (e.g., think of time, cost, security, ethical restrictions)   1 2 3 4 5
2  Help to determine the degree to which the learning goals were achieved   1 2 3 4 5
3  Have diagnostic elements   1 2 3 4 5
4  Facilitate planning for remedial teaching (if proved necessary)   1 2 3 4 5
5  Help the students be in charge of their learning through group projects or self-assessment   1 2 3 4 5
6  Motivate students   1 2 3 4 5
7  Be compatible or have provisions for learners in need of special administration arrangements (e.g., Braille, magnification, use of computer, extra time etc)   1 2 3 4 5
8  Be original/innovative   1 2 3 4 5

B. On a scale from 1-5 (1: less successful … 5: more successful), the presentation of the assessment must
1  Be interesting   1 2 3 4 5
2  Be comprehensible   1 2 3 4 5
3  Make good use of the available presentation time   1 2 3 4 5

Generally speaking, most students decided to work in groups, so five projects were undertaken (one student worked alone and the rest of the students worked in four different groups). Each project was presented in class and was rated by everybody (including the students who owned the project and the teacher). In all, every project received 11 ratings according to the rubric (the eleventh rater was the professor). The results were collected anonymously, so it was not possible to know how the students rated (i.e., self-assessed) their own projects. In real life, we would like to know who



rated what and by how many marks in order to differentiate the self-assessment from the peer-assessment, but for the purposes of this chapter this is not so important. All the assessments are presented in a Microsoft Excel file which may be downloaded from the WebResources companion to this book. The next sections use the data collected from the peer-assessment exercise in order to introduce the reader to various fundamental measurement concepts.

SCORE DISTRIBUTIONS

Minimum, maximum and the range of scores

Table 6 presents the grades on each criterion (A1 to A8 and B1 to B3 in the scoring rubric of Table 5). It shows the scores awarded to Project A by all 11 raters (the 10 students and their professor) on each criterion during the peer-assessment process.

Table 6. Scores for Project A

Student     A1  A2  A3  A4  A5  A6  A7  A8  B1  B2  B3
Mary         5   5   4   5   5   5   3   5   5   5   5
Nick         4   4   3   3   4   4   3   4   5   5   4
Jason        5   5   4   4   4   5   3   5   4   5   4
Jim          5   5   5   5   5   5   1   5   5   5   5
Thekla       5   4   5   4   4   5   3   5   4   4   4
Lambros      5   4   5   4   4   5   1   5   4   4   4
George       5   5   4   4   5   5   2   5   5   5   5
Iphigenia    4   4   4   4   4   4   1   3   4   4   4
Jo           4   4   3   3   4   4   1   4   4   5   4
Luke         4   3   3   3   5   4   3   5   5   5   4
Teacher      5   5   4   4   3   5   3   4   5   5   5

Visit WebResources to download the MS Excel file "peer-assessment.xls" containing the data displayed in Table 6 and other tables. For Table 6, open the spreadsheet "ProjectA": www.relabs.org/assessbook_files/content.htm or www.relabs.org/assessbook_files/datasets.htm.

One indication of the spread of the scores granted to Project A would be given by identifying the minimum and the maximum score on each of the 11 criteria. The minimum score for criterion A1 (Realistic and feasible assessment plan) is 4 and the maximum is 5. The same holds for the criteria A6 and B1 to B3. However, the minimum score on criterion A2 (The assessment plan helps us determine the degree to which


the learning goals were achieved) is 3 and the maximum is 5. The same holds for A3, A4, A5 and A8. The minimum score on criterion A7 ('The assessment plan is compatible or has provisions for learners in need of special administration arrangements') was 1 and the maximum was 3. It seems that Project A did not satisfy the raters on this criterion.

It is sometimes desirable to compute the range of the scores so as to get an idea of their variability. Although this is not the best measure of the variability of assessment scores (we will later discuss better ones), it is useful and simple. We compute the range by subtracting the minimum from the maximum score. For example, the range of scores for the criteria A1, A6 and B1 to B3 is 1 (i.e., 5-4=1), which shows that there was not much variability between the scores: the eleven raters were largely in agreement. On the other hand, there is more variability of scores on criteria A2 to A5 and A8, where the range is 2 (i.e., 5-3=2). The same range can also be computed for criterion A7 (3-1=2). Notice, however, that there is a fundamental difference between, say, the distribution of scores on criteria A7 and A2: although both criteria have the same score range, the scores on criterion A7 are much lower. The next section will present an easy way to summarise and describe the scores on any assessment.

The mean

In total, Project A received low scores on criterion A7. Let us add up all the scores awarded on each criterion in order to compare which one received the highest scores. From now on, when we want to show summation, we will use the Greek letter Σ. For example, if X represents the scores awarded by the raters on criterion A7, the summation sign Σ directs us to sum (add up) whatever comes after it:

\sum X = 3 + 3 + 3 + 1 + 3 + 1 + 2 + 1 + 1 + 3 + 3 = 24    (1)

Similarly, if the scores of Project A on criterion A1 are represented by the letter Y, the sum of all scores would be

\sum Y = 5 + 4 + 5 + 5 + 5 + 5 + 5 + 4 + 4 + 4 + 5 = 51    (2)

It is obvious from the sums that A1 received much higher scores than A7 – so the A1 criterion was easier to satisfy. However, it might be even more useful if we could express this easiness on the original scale shown in the scoring rubric (i.e., from 1 to 5). We would call this 'the mean score' on each criterion and it would be a very handy way to describe the distribution of the scores.


The mean is the statistic most often used to describe the arithmetic average, which you may remember from your high school days, perhaps when you were calculating your grade point average (GPA – the average of the scores from each of your school subjects). You obtain the mean by adding up all the scores and dividing by the number of scores. Although some textbooks use different notation, we will use a bar over the letter X symbolizing the variable to represent the mean. So, the mean score on criterion A7 would be

\bar{X} = \frac{\sum X}{N} = \frac{3 + 3 + 3 + 1 + 3 + 1 + 2 + 1 + 1 + 3 + 3}{11} = \frac{24}{11} = 2.18    (3)

and the mean score on criterion A1 (which we will call \bar{Y}) would be

\bar{Y} = \frac{\sum Y}{N} = \frac{5 + 4 + 5 + 5 + 5 + 5 + 5 + 4 + 4 + 4 + 5}{11} = \frac{51}{11} = 4.64    (4)
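If you like, you can reproduce these calculations with a few lines of code. The sketch below is written in Python purely as an illustration (the book's own worked files use Microsoft Excel); it computes the minimum, maximum, range and mean for criteria A7 and A1 using the scores from Table 6.

    # Scores awarded to Project A by the 11 raters (Table 6)
    a7 = [3, 3, 3, 1, 3, 1, 2, 1, 1, 3, 3]   # criterion A7
    a1 = [5, 4, 5, 5, 5, 5, 5, 4, 4, 4, 5]   # criterion A1

    def describe(scores):
        """Return the minimum, maximum, range and mean of a list of scores."""
        low, high = min(scores), max(scores)
        mean = sum(scores) / len(scores)
        return low, high, high - low, round(mean, 2)

    print(describe(a7))   # (1, 3, 2, 2.18)
    print(describe(a1))   # (4, 5, 1, 4.64)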

On a scale from 1-5, the eleven raters awarded a mean score of 2.18 to criterion A7 but a mean score twice as large to criterion A1. Combining the information we gathered from the range and the mean, we can describe the distribution of scores on the criteria A1 and A7 very efficiently. For example, we could say that on A7 the scores ranged from 1 to 3 with a mean score of 2.18, but the scores on A1 ranged from 4 to 5 with a mean score of 4.64. We do not need to return to Table 6 to view all the scores; just the range and the mean are informative enough. This is the power of statistics: they allow us to summarise and communicate huge amounts of information easily, quickly and with great precision. It would be difficult to describe the distribution of scores in words alone; we would normally need to read Table 6, which would take time, and we would surely not remember all the scores at once anyway. Two statistics (the range and the mean) can do a much better job. However, there is another measure of variability which is also very handy when we want to describe score distributions: the mean absolute deviation. We will see this in the next section.

The mean absolute deviation

We have seen that the mean is a very handy way to describe how easy or difficult it was for Project A to satisfy the raters on a specific criterion. We have also seen that there is an easy way, using the range, to judge whether there was much variability between the ratings of different raters. However, there is another, even better way to judge variability. Take for example the scores on criterion A1. After computing the mean score, it is relatively easy to determine who was a lenient and who was a severe rater. For example, look at Table 7.


Table 7. The deviation and absolute deviation from the mean

                  Criterion A1                     Criterion A7
Rater        X     X̄ − X    |X̄ − X|          Y     Ȳ − Y    |Ȳ − Y|
Mary         5     -0.64      0.64            3     -0.82      0.82
Nick         4      0.36      0.36            3     -0.82      0.82
Jason        5     -0.64      0.64            3     -0.82      0.82
Jim          5     -0.64      0.64            1      1.18      1.18
Thekla       4      0.36      0.36            3     -0.82      0.82
Lambros      4      0.36      0.36            1      1.18      1.18
George       5     -0.64      0.64            2      0.18      0.18
Iphigenia    4      0.36      0.36            1      1.18      1.18
Jo           4      0.36      0.36            1      1.18      1.18
Luke         3      1.36      1.36            3     -0.82      0.82
Teacher      5     -0.64      0.64            3     -0.82      0.82
Mean         4.36   0.00      0.58            2.18   0.00      0.89

The first column shows the name of the rater, the second column shows the score awarded on criterion A1, and the third column shows the mean score for the group minus the score awarded by the rater. In other words, the third column shows how much the score awarded by each rater on criterion A1 deviates from the mean score for the group. For example, the first rater (Mary) awarded 5 marks but the mean score was 4.36, so Mary was 0.64 marks more lenient. On the other hand, Luke was 1.36 marks more severe because he awarded 3 marks whereas the average score was 4.36. It is easy to figure out who was severe and who was lenient by checking the sign of the deviation of his/her score from the mean score. For example, George is lenient but Iphigenia is severe. (For Table 7, open the spreadsheet "ProjectADev" in the file "peer-assessment.xls".) It is important to mention that in this context we determine whether somebody is severe or lenient only in comparison to the rest of the group. We could also judge the leniency or the severity of a rater in comparison to the score awarded by the teacher (their professor), who is supposed to be the expert and who is therefore supposed to be 'carrying the standard'. We will say more on this issue when we discuss the reliability and validity of marking.

If we compute the mean of the deviations, however, we find its value to be always zero. Some algebra can show us that the mean of the deviations will always be zero because this is exactly the formal definition of the mean: it is the point from which the aggregated deviation of all scores is zero (but we do not need to delve into this issue right now).
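For readers who want the algebra that the text skips over, a one-line sketch (using only the definition of the mean given above) shows why the deviations always sum, and therefore average, to zero; the same holds for the deviations written as X̄ − X, which simply flips every sign:

\sum (X - \bar{X}) = \sum X - N\bar{X} = \sum X - N \cdot \frac{\sum X}{N} = \sum X - \sum X = 0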


However, we would like to exploit the useful information conveyed by the concept of deviation so as to compare the variability of scores. To do this, we can use the concept of absolute deviation. In mathematics, the absolute value is always a positive number. For example, the absolute value of 2 is 2. The absolute value of a negative number is again a positive number. For example, the absolute value of -2 is 2. The fourth column of Table 7 shows the absolute deviation of the scores awarded by the raters. You can see from Table 7 that all deviations, whether positive or negative, have a positive absolute value.

Let us compute the value of the mean absolute deviation for criterion A1. Do not be intimidated by the formula. By now, you already know how to compute the mean. Go back to the formula of the mean and see that the only difference between the formulae of the mean and the mean absolute deviation is that we replaced X with the absolute deviation |X̄ − X|:

\text{Mean absolute deviation} = \frac{\sum |\bar{X} - X|}{N}    (5)

= \frac{0.64 + 0.36 + 0.64 + 0.64 + 0.36 + 0.36 + 0.64 + 0.36 + 0.36 + 1.36 + 0.64}{11} = \frac{6.364}{11} = 0.58
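If you would rather check such a calculation with code than with a calculator, here is a short Python sketch (offered only as an optional alternative to the Excel file mentioned in the margin notes). It uses the A1 and A7 columns exactly as they are listed in Table 7.

    def mean_absolute_deviation(scores):
        """Average absolute distance of each score from the mean of the scores."""
        mean = sum(scores) / len(scores)
        return sum(abs(mean - x) for x in scores) / len(scores)

    a1 = [5, 4, 5, 5, 4, 4, 5, 4, 4, 3, 5]   # criterion A1 (Table 7)
    a7 = [3, 3, 3, 1, 3, 1, 2, 1, 1, 3, 3]   # criterion A7 (Table 7)
    print(round(mean_absolute_deviation(a1), 2))   # 0.58
    print(round(mean_absolute_deviation(a7), 2))   # 0.89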

It is important to compute the absolute deviation because we can then aggregate all absolute deviations in order to arrive at the mean absolute deviation of the scores. In this case, the mean absolute deviation is 0.58 for criterion A1 and 0.89 for criterion A7. You can see that although the two criteria have the same score range, they have very different mean absolute deviations; that is, the scores awarded by the raters on criterion A7 are more scattered around their mean compared to the scores awarded on criterion A1. There is, however, a measure of variability which is even more efficient and more popular than the mean absolute deviation. It is called the standard deviation. In the next section, we will first study the variance statistic and then the standard deviation, which can be derived directly from the formula of the variance.

The variance

The use of the absolute values is somewhat cumbersome. Mathematically speaking, it is always more difficult to work with absolute values of numbers. Therefore, another statistic has been developed to help us make sense out of the variability of the scores about the mean, without using the mean absolute deviation. This is the variance. It is computed roughly


in the same way as the mean absolute deviation, but instead of using the absolute values, we use a little trick: we multiply the deviation by itself, and this makes sure that we will always get a positive number. We will show you how.

Table 8. The deviation and the squared deviation

                  Criterion A1                        Criterion A7
Rater        X     X̄ − X    (X̄ − X)²           Y     Ȳ − Y    (Ȳ − Y)²
Mary         5     -0.636     0.405              3     -0.818     0.669
Nick         4      0.364     0.132              3     -0.818     0.669
Jason        5     -0.636     0.405              3     -0.818     0.669
Jim          5     -0.636     0.405              1      1.182     1.397
Thekla       4      0.364     0.132              3     -0.818     0.669
Lambros      4      0.364     0.132              1      1.182     1.397
George       5     -0.636     0.405              2      0.182     0.033
Iphigenia    4      0.364     0.132              1      1.182     1.397
Jo           4      0.364     0.132              1      1.182     1.397
Luke         3      1.364     1.860              3     -0.818     0.669
Teacher      5     -0.636     0.405              3     -0.818     0.669
Mean         4.36    0.00     0.41               2.18    0.00     0.88

Let us take the example of Mary. On criterion A1, she awarded a score of 5 but the mean score was 4.36. Therefore, her score deviated from the mean by -0.64 (it was 0.64 higher than the mean). If we multiply -0.64 by itself, we get 0.4096, which is a positive number. If you are in doubt, you may try it on your calculator. The same holds for Thekla: multiplying 0.36 by itself also gives a positive number: 0.36 by 0.36 gives 0.1296. (For Table 8, open the spreadsheet "ProjectAStDev" in the file "peer-assessment.xls".) Check out Table 8 and you will see the deviations and the squared deviations of the scores for both criteria A1 and A7. We have now computed – almost – the variance of the scores. Do you see that the mean of the squared deviations (as shown in Table 8) is much larger for criterion A7 than for criterion A1? So the scores of the raters are, indeed, more scattered about the mean on criterion A7 than on criterion A1. However, the mean squared deviation is not exactly the preferred measure of variance. Instead of dividing by N (which is the number of scores), we prefer to divide by (N-1) for reasons that are beyond the scope of this book; basically, it gives a less biased estimate. Therefore, the formula giving the variance is



s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}    (6)

So, the variance for criterion A1 may be computed as shown below:

s^2 = \frac{0.405 + 0.132 + 0.405 + 0.405 + 0.132 + 0.132 + 0.405 + 0.132 + 0.132 + 1.860 + 0.405}{11 - 1} = \frac{4.545}{10} = 0.45
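As a quick check on the arithmetic, the short Python sketch below (again an optional alternative to the spreadsheet, not part of the book's own materials) applies formula (6) to the A1 and A7 columns of Table 8.

    def variance(scores):
        """Sample variance: sum of squared deviations from the mean, divided by N - 1."""
        n = len(scores)
        mean = sum(scores) / n
        return sum((x - mean) ** 2 for x in scores) / (n - 1)

    a1 = [5, 4, 5, 5, 4, 4, 5, 4, 4, 3, 5]   # criterion A1 (Table 8)
    a7 = [3, 3, 3, 1, 3, 1, 2, 1, 1, 3, 3]   # criterion A7 (Table 8)
    print(round(variance(a1), 2))   # 0.45
    print(round(variance(a7), 2))   # 0.96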

In order to differentiate between the variances of criteria A1 and A7, we may use a subscript, which is a smaller letter, number or short word at the bottom right of the variance symbol. Here, s²_A1 = 0.45 shows that the variance of the scores on criterion A1 was 0.45 (compare this to the mean absolute deviation, which was 0.58). For criterion A7, the variance is s²_A7 = 0.96 (the mean absolute deviation was 0.89). The variance is one of the most popular measures of the variability of scores about the mean, and it has very desirable statistical properties. You may observe that both the mean absolute deviation and the variance agree, at least in this case, in that the scores on criterion A7 are more variable than the scores on criterion A1. It seems that the raters were more in agreement on criterion A1.

Remember not to be intimidated by the formulae. Formulae are great ways to compress information and meaning, so they may look difficult, but remember that they convey a lot of information. Do not hurry when reading formulae; take your time. There is an old rule of thumb which says that any line of formula should take you about as long to read as a page of text. So, do not think that there is something wrong with you because you cannot read a formula in a split second. Take as much time as you need and remember to break it into more manageable bits that you understand best. For example, you may break the formula of the variance into two parts. The first is the sum of squares, Σ(X − X̄)². Remember that you know what the Σ does; it asks you to sum whatever comes next. You know the squared deviation (you used the deviation to compute the mean absolute deviation and you found it easy). Then, all you need to do is to divide by the number of raters (minus 1). It would be a good thing to remember the formula, but if you cannot, you do not need to worry too much. Any package, like Microsoft Excel, will happily carry out the calculations for you and present the results in meaningful ways.


The standard deviation

After you compute the variance for a score distribution, it is very easy to compute another statistic, the standard deviation, which is designated by the letter s (compare this to the s² which is used to designate the variance). The standard deviation is simply the square root of the variance, and it can very easily be computed using your pocket calculator. The formula of the standard deviation is given by

s = \sqrt{s^2}    (7)

Therefore, the standard deviation of the scores on criterion A1 is s_A1 = 0.67 (the mean absolute deviation was 0.58). For criterion A7, the standard deviation is s_A7 = 0.98 (the mean absolute deviation was 0.89). The standard deviation has many uses beyond showing us the variability of scores, and this is something that will be addressed in later chapters.

AGREEMENT INDEX: THE CORRELATION

The example used in the previous sections has a potential flaw: it mixes the scores awarded to Project A by the teacher with the scores awarded to the same project by the students (we collectively called them raters). One might suggest that the scores awarded by the students in the peer-assessment exercise should be reported separately from the ones awarded by the teacher. That way, we could compare the mean score awarded to each project by the students to the score awarded by the teacher; if the two are similar, then we might assume that the students applied the scoring rubric correctly (i.e., in agreement with the teacher) when assessing the five projects.

Table 9 shows the teacher's and the students' scores. We can see that as the students' average score increases, so does the teacher's score. There seems to be a relationship between the two columns of the table: bigger values of the students' scores correspond to bigger values of the teacher's score, and smaller values of the students' scores correspond to smaller values of the teacher's score. This relationship, however, does not appear to be perfectly consistent. For example, on Project C the teacher awarded 2 marks fewer than the average score awarded by the students, whereas on Project B the teacher awarded almost 3 marks more than the average score awarded by the students. Although the relationship is not perfectly consistent, it seems to be good enough. Notice that if two variables are correlated (as in this case), knowing a value for one variable helps us predict with some precision the value of the other variable. This is one of the most important motivations for researchers to compute correlation indices.



It would help a lot to plot the scores on a dedicated graph (called a scatterplot) so as to get a visual idea of the relationship between the two variables (this is what we will now call the columns of Table 9). Remember that it is always easier for the human eye to read (i.e., to see) patterns from graphs than to unscramble mathematical relationships between numbers in tables.

Figure 7. A scatterplot between the scores awarded by the teacher and by the students.

Table 9. A comparison of the scores awarded by the students and the teacher on each project

Project    Students' Mean Score    Teacher's score
C                 41                     39
E                 42.1                   42
D                 43.8                   45
A                 46                     48
B                 48.2                   51
Mean              44.22                  45

Try to draw an imaginary straight line connecting as many points on the plot as possible. You will see that it is possible for a straight line to pass over (or very close to) all of the points on the graph. This shows that there is a very strong relationship between the two variables.
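If you would like to reproduce a plot similar to Figure 7 yourself, the following Python/matplotlib sketch (our own illustrative choice; the book's companion files use Microsoft Excel charts instead) draws the five projects from Table 9.

    import matplotlib.pyplot as plt

    # Mean student score and teacher score for each project (Table 9)
    students = [41, 42.1, 43.8, 46, 48.2]   # projects C, E, D, A, B
    teacher = [39, 42, 45, 48, 51]

    plt.scatter(students, teacher)
    plt.xlabel("Students' mean score")
    plt.ylabel("Teacher's score")
    plt.title("Teacher's score against students' mean score")
    plt.show()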



Table 10. How to compute the (Pearson product-moment) correlation index

Project    X (students' mean score)    X²         Y (teacher's score)    Y²        X*Y
A                  46                 2116               48              2304      2208
B                  48.2               2323.24            51              2601      2458.2
C                  41                 1681               39              1521      1599
D                  43.8               1918.44            45              2025      1971
E                  42.1               1772.41            42              1764      1768.2
Sum               221.1               9811.09           225             10215     10004.4

We will not ask you to memorise the formula for the correlation because it is a bit cumbersome. Neither will we ask you to routinely carry out the necessary calculations to compute a correlation index: Microsoft Excel, any other commonly used spreadsheet or statistical software, or some calculators will do it for you easily. (For Table 10, open the spreadsheet "ProjectACorrel" in the file "peer-assessment.xls".) We will, however, show you a numerical example just to give you a hint of how the correlation index is computed. Please note that there are several different ways to represent the relationship between two variables: in our case we will just use the Pearson product-moment correlation index, or just the Pearson correlation for simplicity. The formula to compute the Pearson correlation is given below:

r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[ N \sum X^2 - (\sum X)^2 \right] \left[ N \sum Y^2 - (\sum Y)^2 \right]}}    (8)

It is important to help you find your way through this complex formula. The only column in the table that you may not recognise is X*Y: it just represents the product of X and Y. For example, when the mean student score on Project C was 41 marks, the teacher's score was 39, so the product was 41*39=1599 (the value in the last column). Now, ΣX is the sum of all five mean scores of the students on the five projects (i.e., the sum of the second column). Similarly, ΣY is the sum of the scores of the teacher on the five projects. ΣXY is the sum of the products, that is, the sum of the last column in the table (i.e., the sum of the sixth column). Finally, ΣX² is the sum of the squares of the mean scores of the students on each project (i.e., the sum of the third column), ΣY² is the sum of the squares of the scores of the teacher on each project (i.e., the sum of the fifth column), and N=5 (the number of projects). Now you know all the necessary information to carry out the computation of the correlation index.

r=

5 × 10004.4 − 221.1× 225

[5 × 9811.09 − 221.1× 221.1] × [5 ×10215 − 225 × 225] 50022 − 49747.5

[ 49055.45 − 48885.21] × [51075 − 50625] 274.5 170.24 × 450

=

274.5 76608

=

274.5 = 0.99 276.78
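The same value can be confirmed with a few lines of Python (offered only as an illustration; the spreadsheet mentioned above performs the identical calculation). The function below implements formula (8) directly.

    from math import sqrt

    def pearson_r(x, y):
        """Pearson product-moment correlation between two equally long lists of scores."""
        n = len(x)
        sum_x, sum_y = sum(x), sum(y)
        sum_x2, sum_y2 = sum(v * v for v in x), sum(v * v for v in y)
        sum_xy = sum(a * b for a, b in zip(x, y))
        numerator = n * sum_xy - sum_x * sum_y
        denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
        return numerator / denominator

    students = [46, 48.2, 41, 43.8, 42.1]   # projects A to E (Table 10)
    teacher = [48, 51, 39, 45, 42]
    print(round(pearson_r(students, teacher), 2))   # 0.99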

Therefore, the correlation between the scores awarded by the teacher and the mean scores awarded by the students is surprisingly high – higher than anyone would expect in a real setting. We can also verify this from the straight line one could fit through the points on the graph. Notice, however, that the correlation index measures the linear relationship between the students' and the teacher's scores (i.e., the straight line). It is possible for two variables to be related very closely but not in a linear (i.e., straight-line) way. To return to our example, it suffices to say that the correlation index is a statistic which may take values from -1 to +1 through zero, where -1 means a perfect negative linear relationship (when one goes up, the other goes down) and +1 a perfect positive linear relationship (when one goes up, the other goes up as well). A correlation of zero means no linear relationship at all. The relationship between the teacher's scores and the students' mean scores is almost a perfect positive correlation. In real life, we rarely find perfect correlations in our educational measurements. Remember, however, that a correlation between two test scores does not indicate that one causes the other, only the extent to which scores on one test might be related to scores on another test.

SUMMARY

This chapter introduced you to some fundamental concepts of educational measurement. Firstly, we aimed to give you some hints about the variability of scores. In a real-life assessment situation, not everybody gets the same score, so it is important to recognise variability when you see it. We discussed various ways of expressing variability and we also discussed the use of the mean in order to summarise score distributions. Variability is the single most important concept, and it is related to everything else that will follow in this book. Our aim is to identify variability, measure it, express it and then explain it. The correlation is the second most important concept (which is of course based on variability itself): it will be useful in the following chapters when we discuss the validity and the reliability of assessment results.


Figure 8. An indication of the perceived strength of a correlation.

Note that there are also many more statistics that are of interest, but this is meant to be just the beginning. Try the exercises that follow. You will gain valuable experience and they will help you digest all those new concepts.

-oOo-

REVIEW QUESTIONS

Try these review questions to help you reinforce some of the key ideas in this section. These are all true-false questions to make it easier and quicker for you to complete. Think whether each statement is mainly true or false. Then just circle the T (True) or F (False). If you are not sure, just guess.

T F  Range is an index of variability which is computed by subtracting the minimum from the maximum score
T F  The mean absolute deviation is the most popular index of variability
T F  The standard deviation is synonymous with the absolute deviation
T F  A correlation can take values from 0 to +1, where 0 means no correlation
T F  If scores on test A are positively correlated to scores on test B (say, with r=+0.8), then your high score on test A causes your score on test B to go up as well
T F  To calculate the standard deviation, you must divide the variance by two

EXERCISES

Here are some review exercises for you to answer; they can also be used as the basis for discussion.


Note that you may visit the WebResources companion to the book and download the file peer-assessment.xls. Go through the spreadsheets of the file to study the examples there and to get information on how to use Microsoft Excel to carry out the statistics discussed in this chapter.

1. Give at least three examples of indices which can be used to calculate variability in assessment scores. What are their main differences?

2. Use the table below to calculate the range, the mean absolute deviation, the standard deviation and the variance.

Student     Score on mathematics test
Mary        5
Nick        4
Jason       5
Jim         5
Thekla      4
Lambros     4
Jane        0

3. Compare the indices and reflect on their usefulness.


CHAPTER 4

VALIDITY AND VALIDATION OF ASSESSMENTS

Validity refers to the appropriateness, truthfulness, accuracy and consequences of your assessment results. The definition of validity used in the Standards for Educational and Psychological Testing is: ‘the degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test’.1 Although many people are prepared to accept the validity of results from an educational test at face value, the degree of validity is an inference that requires several lines of evidence. ‘Validity’ is a noun, but ‘validate’ is a verb and refers to the process by which we accumulate evidence about the validity of our assessment results. Validation is the process of validating. It almost always involves two different actions: (a) the generation of an interpretive argument (which sets out the inferences and assumptions intrinsic to the proposed interpretations and uses of assessment results) and (b) the construction of a validity argument (an organised and purposeful set of analyses and empirical studies triangulating different kinds of evidence from different sources). However, developing the scientifically sound argument needed to defend the validity of our assessment results is very demanding. Especially in high-stakes situations, the validation of assessment results usually requires the allocation of resources for the organisation and monitoring of a relevant research project. If the research results suggest that some interpretations or uses of our assessment results are not appropriate, then we may need to rethink our interpretive argument. In this chapter we will describe the different approaches to determining the validity of results. We would not expect you to implement these approaches, since the validation of results from informal classroom assessment is a more relaxed process, but we think it is helpful for you to be familiar with the nature of this term and the ways in which validity is determined. Validity is by far the most important aspect of assessment.

VALIDITY ISSUES

Validity is a comprehensive view of the meaningfulness of your test findings and it applies to the assessment results rather than the test itself. This is an important


distinction. Some other aspects of validity, which are not always recognised, are listed below:2
– there are different types of validity that affect assessment results;
– results are valid for a given group of individuals (e.g., primary school pupils, persons from non-English speaking backgrounds, third year apprentices, tradespersons);
– validity can be said to vary in its extent from high through moderate to low;
– validity is specific to some use or interpretation (e.g., selection, placement, diagnosis) – no result is valid for all purposes;
– validity is inferred from available evidence and is not measured (for instance, in the way that the reliability of results can be determined); and,
– if the assessment results are valid, then they must – by definition – also be reliable (the converse is not true, that is, results may be reliable but not valid).

Types of validity

Three types of validity are commonly described: content, construct and criterion validity, although an argument-based approach to validity has gained much ground recently. We will deal with the argument-based approach to validity later; at the moment we will start with the more traditional conceptions. The strongest case for validity can be made when all three (i.e., content, construct and criterion validity) are present. For many everyday uses of a classroom test, it is not practical or necessary to have evidence that would be classified in all three categories. Each type of validity seeks to answer a separate question, such as:
– Content validity (and the closely related face validity) – does the assessment match the content and learning outcomes of the subject?
– Criterion validity – does the assessment provide evidence of future achievements or performance in other related subjects?
– Construct validity – does the assessment really involve the particular behaviours, thought processes or talents that are said to be assessed?

Content (and Face) Validity

Your primary concern for a classroom assessment is with content validity. Content-related evidence helps you determine how well results reflect the content and learning outcomes of a subject. Content validity is sometimes confused with what is called ‘face’ validity. Face validity is the appearance of the assessment and its relevance to the person taking the assessment. For instance, you could ask the question ‘Would the students think that the content appears to match the subject area?’ Face validity mostly deals with how much the assessment is perceived by the students to be relevant: it is a matter of degree. It is really important to make sure that your classroom assessments enjoy a high degree of face validity. Returning to content validity, classroom tests must reflect the instruction, the learning outcomes and the content of the course. Firstly, you need to ensure that all


questions are asked in a way that is familiar for the learner and consistent with the subject. If the subject has been almost entirely practical in nature and emphasising learning through doing, then it would be inappropriate to have a test of theoretical knowledge. The second essential aspect of content validity is for you to judge whether the assessment adequately samples the topics and learning outcomes of the subject. (Visit WebResources for more information on validity and validation of assessments.) You should settle for the largest number of topic areas that it is feasible to assess using your resources and time. This is important because you need to control the danger of leaving any important aspect of your curriculum or teaching out of the assessment. The table of specifications can provide some assistance in guiding your sample, so that you focus on the most important topic areas. Overall, content validation is inferential and logical in nature rather than statistical. It is mainly a matter of preparing detailed assessment specifications covering the learning outcomes and the topics in a course. If assessment items are prepared in accordance with a table of specifications (more information about the table of specifications is given later in the book) then the students’ answers should be a reasonable indicator of achievement. The validity of criterion-referenced and classroom assessments can best be assured through such careful test preparation. The content validity approach has, however, been criticised by some academics and researchers because it tends to rely on human (subjective) judgements about the relevance and representativeness of the tasks or questions included in an assessment. It is possible that a teacher may consider a task or question as relevant to the purposes of the assessment but somebody else might have a slightly different opinion. Moreover, in the case of cognitive processes and theoretical (psychological) constructs such as intelligence, it is not always easy to show that every task or every question is directly relevant or necessary for the assessment. The late Lee Cronbach, perceived by many as a patriarch of educational measurement, suggested that:

Judgments about content validity should be restricted to the operational, externally observable side of [assessment]. Judgments about the [students’] internal processes state hypotheses, and these require empirical construct validation (Cronbach, 1971, p. 452; italics in original)3

When we consider the argument-based approach to validity later, you will see that the content validity argument is very important and mainly has to do with the domain relevance and representativeness of the assessment instrument. Note that the words ‘assessment instrument’ refer to the test or the method of assessment that is being used (e.g., portfolio, interview, practical skills task). For classroom assessment purposes content validity is usually easier to investigate and support, especially in academic achievement contexts, compared to other views of validity which will be presented next.


Criterion validity

The second aspect of validity that is relevant for education and training is that of criterion validity. Criterion validity is subdivided into predictive and concurrent validity. It is a very important view of validity, especially in cases where the assessment results are used for selection, admission, placement and employment purposes. The predictive validity of results indicates how well the assessment results can predict future performance. The extent to which results estimate present performance on some other task or assessment is concurrent validity (see Figure 9 for an example). In education, there is little evidence of the predictive validity of tests since this involves long-term follow-up studies of pupils who have completed schooling, students who have graduated or people who have completed courses. Information on the concurrent validity of assessments is easier to obtain since it involves examining the performance of students on other subjects in a course, or comparison with work placements or on-the-job training.

Figure 9. Comparison of concurrent and predictive validity.

An example of predictive validity would be if a teacher wants to determine whether the results of a computer programming selection test predict success in his/her class. The end-of-year test results for a certificate course in computing are used as the predictive criterion. The question is: do students who have high aptitude selection test scores also do well in the end-of-year tests? Visual inspection of the scores is a first step but it does not provide the most satisfactory way of comparison. It is better to estimate the degree of relationship between the


aptitude and end-of-year test by using a correlation coefficient, which is a quantitative summary of the relationship. Criterion validity is commonly indicated by correlation coefficients. We dealt with the calculation of the correlation coefficient in the last chapter. It is easy to overestimate the concurrent validity of our assessment methods or to assume that some methods have interchangeable validities. The correlations between different testing methods vary considerably and the findings are that there are only moderate correlations between different educational tests.4 For instance, from unpublished data on final year practical nursing examinations, the median correlation between eight clinical tasks is around 0.125 (ranging from -0.05 to 0.43). Teachers should be aware that the correlations between different methods of assessment are only moderate, and probably less than expected. Table 11 indicates some concurrent validities that we have collected from ten small technical and further education classes (N = 9 to 65). These are moderate in nature and probably much less than teachers would expect.5 An example of predictive validity from a professional context also substantiates the low to moderate correlations between educational assessments. The Part I examination of the National Board of Medical Examiners (USA) has been compared with various criteria. Results indicated correlations of 0.58 with quality of care, 0.29 with residency ratings, 0.23 with supervisor ratings, 0.45 with other examinations, and -0.66 with a specialty exam.6 This provides some indication of how one examination or assessment may not always provide an all-purpose predictor of future performance.

Table 11. Concurrent validity of some classroom assessments

Comparisons                              Correlations
Final exam and class quiz                0.56
Final exam and assignment                0.20
True-false quiz and multiple-choice      0.31
Essay and true-false                     0.48
Multiple-choice and essays               0.29 – 0.38
Theory and practical                     0.35
Four essays                              0.13 – 0.62
Exam and case study                      0.41
Essay and case study                     0.61
Essay and assignment                     0.54
Essay and class participation            0.10
Class participation and case study       0.72

An example of predictive validity from vocational education is the use of Tertiary Entrance Ranks (TER) to predict educational achievement in technical and further education (see Table 12). The correlation coefficients for criterion validity in vocational education will rarely be larger than 0.6 and for most tests they will usually be around 0.3. For another example, consider the case of GPA as a


predictor of success on a high-stakes university entrance examination in a European country: correlations between the GPA and the examination scores have been in the area of 0.6 for the last three years. If you square the correlation coefficient (and multiply it by 100), you will obtain an indication of just how much of the total variance the two assessment scores share in common. So a correlation of 0.6 would mean that the assessments share 36% of the variance in common. That leaves a considerable part of the variance in an assessment unexplained (i.e., it needs to be accounted for by other factors such as chance, linguistic ability, preparation, test-taking strategies, etc.).

Table 12. Predictive validity for first year subjects of high school results7

Subjects                          Tertiary Entrance Rank
Programming concepts              0.26
COBOL I                           0.12
Computing I                       0.03
Systems analysis and design I     0.09

You can also describe predictive and concurrent relationships with an expectancy table. An expectancy table shows the relationship between scores on two assessments. For example, the assessment results are placed down the left-hand side and the measure of success or final grades or criterion scores are placed across the top. The scores on the assessments are broken into categories and percentages are calculated. Table 13 provides an example of an expectancy table showing the relationship between ability test scores and performance in a final exam. From this expectancy table, predictions can now be made in standard terms (that is, chances out of 100) for all score levels.

Table 13. Expectancy table

ABILITY RESULTS    FINAL EXAM
                   35-50    51-65    66-75    76-85    90-100
Above average                        10%      20%      70%
Average                     15%      25%      55%      5%
Below average      45%      35%      15%      5%

We can say that in this group, students with above average ability scores had a 100% chance of passing the final exam (they were all classified in the 66-75 or better groups); those with average abilities had a 100% chance of passing (they were all classified in the 51-65 or better groups); but those with lower than average abilities had only a 55% chance of passing (45% were classified in the 35-50 group). An expectancy table (see Table 14) can also be calculated to show the relation between pre-test scores and the number of students attaining mastery at the end of the course. From this table you can see that 30% of the students had a pre-test score above 60 and went on to demonstrate mastery. Alternatively, it can also be said


that 27 out of the 30 students with a pre-test score of 61-100 were able to demonstrate mastery. Such expectancy tables are useful in providing evidence of the success of educational and training programs but any interpretations based on small groups should be regarded as highly tentative.

Table 14. An expectancy table for mastery and non-mastery

Pre-test scores     Number of students
                    Non-mastery     Mastery
81-100              0 (0%)          9 (10%)
61-80               3 (3%)          18 (20%)
41-60               6 (7%)          24 (27%)
21-40               21 (23%)        3 (3%)
1-20                6 (7%)          0 (0%)
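For readers who like to automate this kind of tabulation, here is a small, hypothetical sketch (ours, not the book's) of how an expectancy table along the lines of Table 13 could be built from raw records. The ability labels, score bands and data below are invented for illustration, and the last band is widened to 86-100 so that no scores fall between bands:

```python
from collections import Counter

# Hypothetical (ability_band, final_exam_score) records -- illustrative only.
records = [
    ("above average", 92), ("above average", 78), ("above average", 70),
    ("average", 80), ("average", 64), ("average", 72), ("average", 55),
    ("below average", 40), ("below average", 58), ("below average", 67),
]

# Final-exam score bands, roughly following Table 13 (last band widened to 86-100).
BANDS = [(35, 50), (51, 65), (66, 75), (76, 85), (86, 100)]

def exam_band(score):
    """Return the label of the band a final-exam score falls into."""
    for low, high in BANDS:
        if low <= score <= high:
            return f"{low}-{high}"
    return "out of range"

# Count students per (ability, exam-band) cell, then convert each row to percentages
# ("chances out of 100").
counts = Counter((ability, exam_band(score)) for ability, score in records)
for ability in ("above average", "average", "below average"):
    row_total = sum(n for (a, _), n in counts.items() if a == ability)
    row = {band: round(100 * n / row_total)
           for (a, band), n in counts.items() if a == ability}
    print(ability, row)
```

A spreadsheet pivot table achieves the same thing; the point is simply that the percentages are calculated within each ability row.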

To summarise, criterion validity has two important advantages: (a) it is obvious even to the layperson that criterion-related evidence is directly relevant to the plausibility of the interpretation of assessment results: for example, if the assessment results are highly correlated with an external criterion (such as success at a work-related task) then it is useful and appropriate to use the assessment results for selection or licensure purposes; and (b) criterion-related validity is based on measurable indicators (e.g., correlations), so it gives a sense of objectivity which may not be shared by other forms of validation (e.g., content validity). However, having said the above, we should accept that sometimes it is difficult to find appropriate criteria. But even if we find an appropriate criterion, how can we be sure that this criterion is valid itself? For example, let us assume that we want to be able to advise our secondary education students whether they are ready to take an expensive and time-consuming course which will prepare them for some high-stakes examinations. We could use a commercial test for this purpose, but we do not want to do so because it is expensive. Instead, we develop our own teacher-made assessment. How could we validate our assessment? One way would be to investigate the correlation between the results of our assessment and the results of the commercial test (i.e., a classic case of criterion-related evidence). However, we then need to validate the commercial test itself: normally we would need to see some evidence that the commercial test can classify our students successfully and advise us whether they are ready or not to take on this expensive and time-consuming course. In effect, we may accumulate criterion-related evidence for our teacher-made assessment by comparing its results with those of an external criterion, but then we would need to validate the criterion itself. The commercial test could be validated by comparing its results with those of another criterion (e.g., an older commercial test), but then we would need to validate that criterion. At some point, we will always need to validate the last criterion in the chain in another way.


Construct validity

The final type of validity is construct-related validity, which refers to the theoretical evidence for what we are measuring. For instance, how can you be sure that a final exam in marketing is really assessing ‘marketingness’ and not scholastic aptitude or examination ability? Construct validation is the process of determining the extent to which results can really be interpreted in terms of what you are claiming to test. Construct validity studies can involve:
– internal consistency analysis of the questions in an assessment to see whether a single aspect is being assessed;
– analysis of results over time to trace changes in student development of knowledge, skills or attitudes;
– checks to see whether graduates or workers in an occupation perform better than novices or students;
– factor analysis of the items; and
– the correlation between assessment results and other related assessments.
It is asking a good deal of a busy teacher to provide such evidence for classroom assessments. In construct validation you have to identify and describe the general characteristic you are assessing (e.g., numerical skills, literacy, communication competence, mechanical competence, word processing skills, design creativity). Then you gather evidence to support the claim that the assessment is directed to what is intended. It may be a process of gathering supporting evidence for the fact that an assessment is related to a specific set of knowledge, skills and/or attitudes. Comparing known groups who have the characteristic you are assessing will also help you to check whether the test has some construct validity (e.g., engineers on tests of numerical reasoning, programmers on tests of logic). Comparing scores before and after learning will also assist because some test scores can be expected to change as a result of instruction. The final approach is to correlate your assessment with related assessments to see whether results are consistent. It is frequently assumed that construct validity is a broad concept that encompasses all evidence for the validation of assessment results, such as the criterion-related evidence, the content-related evidence, the reliability of assessment scores and relevant theory and research. It is important to mention that under this broad conceptualisation of construct validity, we pay a great deal of attention to theoretical explanations: the mere correlation of our assessment results with the results of a criterion is not enough; we need to show why our results are correlated and we need to tap into a theory-based explanation.

Argument-based approach to validity

The most recent conception of validity is the ‘argument-based’ approach. The argument-based validation process may be envisaged at two different levels: (a) we need to build an interpretative argument where we will demonstrate how we interpret the assessment results. We actually need to present the case for all proposed interpretations and uses of test results. To fulfil our task successfully, we


need to present all the inferences and assumptions we have adopted in order to make the connection between the assessment results and the conclusions and decisions based on those results. In other words, we need to explain what makes us sure that the assessment results have the meaning we assign to them. Also, (b) we need to build a validity argument in order to evaluate our interpretative argument. (Visit WebResources for more information on the argument-based approach to validity.) The validity argument aims to accumulate and present a full-scale and formal evaluation of the intended interpretations and uses of assessment scores. For this purpose the developer of an assessment would need to provide a solid analysis of all the available evidence in favour of and against the proposed interpretations and uses of assessment results. However, this is not all. The validity argument should also present the case (to the extent possible) for any plausible alternative interpretations and uses of test scores for decision making.

FACTORS WHICH REDUCE VALIDITY

The validation of your test results should be a continuing process. In high stakes contexts you may consider it as a full-scale project but for classroom assessment purposes it can be more informal and less demanding. It helps you review and analyse your assessments. The results of validity studies can help guide your testing of a subject. They will help you evaluate whether the assignments or tests you are using really do provide an indication of the knowledge, skills or attitudes in your subject. It is all based on collecting different pieces of evidence about the accuracy of the assessment results. The most important factor for you as a classroom teacher is to ensure that your assessment has content validity. You need to ask questions such as:
– does the content sample the topics in my syllabus?
– is every learning outcome covered by a question?
– is the relative importance of a topic and learning outcome reflected in the number and types of questions?
– is the format appropriate for the students?
– is the reading level of the questions appropriate for the group?
– are the directions clear to the students?
– are the questions clear and unambiguous?
– is the scoring accurate? or
– is the grading of the results appropriate to the learning outcomes?
From this list you might well have concluded that many factors can prevent us from assessing the knowledge, skill or attitude intended. Some of these items may seem obvious but it is not uncommon for major errors to occur in examinations. Recent examples have included: asking for a 100 word essay when 1,000 was intended; causing confusion by stating that two parts of a test are worth 100% of the marks; asking questions on topics which were not in the syllabus; some students noticing that correct answers in a multiple-choice test had been inadvertently signified by a dot on the reproduced question sheet; or providing wrong answers in marking guides.


One source of errors in testing which affects the validity of results is the existence of clerical errors in marking. For example, in technical and further education an inspection of 15,000 exam papers found that around 10% contained some form of clerical error.8 These marking errors have occurred even in commercial tests which have detailed administration instructions. For instance, IQ score differences of 10 IQ points have been reported from practising psychologists and graduate students.9

INVALID USES OF ASSESSMENT RESULTS

After spending so much space discussing issues of validity, it is worthwhile to inform you about some classic invalid uses of assessment results. Unfortunately, politicians, teachers and parents around the world frequently use assessment results in obviously invalid ways. One classic example of the invalid use of assessment results is the aggregation of SAT (national curriculum test) scores in England in order to rank order schools for value-added purposes. The SAT tests were never designed with the aggregation of students’ scores in mind in order to rank order the effectiveness of their schools. The same holds for the misuse of American College Test (ACT) and SAT scores in the United States to rank order the States. Many academics and researchers have frequently warned against this type of misuse of test results, but with limited results. As a teacher you should be very careful when using assessment results. Especially when you buy or use commercial assessments, make sure that you have carefully read all the accompanying material as well as the manual. Make sure that you have read the small print as well; sometimes the developers of commercial assessments try to underplay the drawbacks or weak points of their products. Confirm that the assessment was developed, evaluated and improved by the developer using a sample of students with similar characteristics to your students. If the assessment is standardised, confirm that it was standardised on a population with similar characteristics to your students. On the statistical side of a commercial assessment, read the technical section carefully and try to identify the statistics concerning the reliability of the instrument and the error of measurement (we will see these concepts in later sections). This is important because it will let you know how much trust you can put in the assessment results. Do not forget that the score of a student on an assessment is just a point estimate of his/her ability or skill which is not perfectly accurate. Finally, make sure that you know how to score and interpret the results of the assessment. Thus, if you have to make decisions, never use the result of a single assessment; instead, try to accumulate evidence from various sources (e.g., use a second assessment) in order to minimise the chance of errors. Remember that you should always be prepared to explain the scoring and the accuracy of the assessment results to the parents, to the students or even to the principal of the school. The use of a commercial assessment does not automatically validate any use of the assessment results.


As a consumer of assessment material and assessment results, make sure that your needs are addressed by the products you use. If you are not satisfied with a commercial assessment, contact the developer and explain the reasons that make you unhappy. In some instances other people might have asked the same questions before and the assessment developer will be able to give you an explanation. In some cases, the assessment developer might be able to suggest alternative products that may be more appropriate for your intended uses.

SUMMARY

The topic of validity may seem a little esoteric but it is the key characteristic that should be sought for any assessment results. Validity is based on different kinds of evidence, so if a colleague recommended that you use an assessment because it was valid, you might want to see: previous results, the table of specifications, the predictive nature of the scores, correlations with other subjects and/or some details of the construction of the test. The actions which may result from invalid test results can have serious consequences for someone’s future. Accordingly, it pays to take some time and trouble to first ensure that, above all, your assessments have content validity. Whenever scores are used to predict performance or to estimate performance on some other task then you are concerned with criterion-related validity. Criterion-related validation may be defined as the process of determining the extent to which test performance is related to another result. The second measure may be obtained at a future date or concurrently. Finally, construct validity provides an over-arching framework within which to accumulate evidence that your test represents an assessment of a particular educational achievement. There are some refinements to the concept of validity to include consequential validity, that is, the meaningfulness and implications of the results. We have not dealt with this in detail but have encompassed it in most of our discussion. The strongest case for validity can be made when there is evidence of content, criterion and construct validity of assessment results. The next chapter takes up the issue of reliability of results. This is also a key aspect of assessment results but one often overemphasised at the expense of validity.

-oOo-

REVIEW QUESTIONS

T F  Validity refers to the truthfulness and accuracy of results
T F  Validity is the degree to which a certain inference from a test is appropriate
T F  Results are valid for a given group of individuals
T F  Validity is a measurable concept
T F  Validity is inferred
T F  Reliability is a necessary condition for validity
T F  Content validity is determined by comparing the questions to the syllabus topics and learning outcomes
T F  Criterion validity includes face validity
T F  Construct validity considers the predictive potential of test scores
T F  Content validity is important for criterion-referenced tests
T F  Criterion validity includes predictive and concurrent validity
T F  Comparing the results of two assessments is a form of predictive validity
T F  The degree of relationship between two sets of test results is determined by visual inspection
T F  An expectancy table is used to calculate correlations between results
T F  Construct validity refers to the theoretical evidence for what we are assessing
T F  The correlation is a statistical index ranging from –1 through 0 to +1
T F  In an expectancy table predictions are made in terms of chances out of 100
T F  An item analysis would improve construct validity of results

EXERCISES

1. How should a classroom teacher go about establishing the validity of the results from a test that he/she has developed?
2. What criteria should be used for the predictive validity of tests in education?
3. Here are students’ scores on two tests.

THEORY  PRACTICAL    THEORY  PRACTICAL
5       9            6       6
1       3            2       3
4       6            9       8
5       9            3       7
8       4            10      8
1       3            2       6
8       4            2       6
4       6            9       8
10      8            2       3
6       6            3       7

– Produce an expectancy table which shows the relationship between successes on the two tests.
– Construct a scattergram from this information (see Appendix G for a description of scattergrams).
– If you have access to a scientific calculator, determine the correlation between the theory and the practical test results.


CHAPTER 5

THE RELIABILITY OF ASSESSMENT RESULTS

In the last chapter, we described aspects of validity and stressed that it relates to the accuracy of results. Reliability is the next most important characteristic of assessment results after validity; actually the two characteristics are closely intertwined. If you are certain that your results have a high degree of validity, then the results must also be reliable. Despite our best efforts, however, we can never be sure that our assessment is even close to being 100 per cent accurate. Unfortunately, in educational assessment there will always be a margin of error. The major source of error or unreliability in assessment is people. This is because people are naturally inconsistent in their responses, answers and reactions to different types of tasks or questions and to different situations. This variation in responding brings about the major source of error in educational assessment. The topic of reliability is about determining this margin of error. The Standards for Educational and Psychological Testing define reliability as: ‘the degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable, and repeatable for an individual test taker’.1 Reliability relates to questions about the stability and consistency of your results for a group. It seeks to answer questions such as the following:
– Would the ranking of students’ results on an assessment be similar when it is repeated?
– Do two versions of an assessment produce the same results?
– Can I increase the reliability of results by lengthening a test?
– Are all the responses to the items or tasks homogeneous and consistent with each other?
Bear in mind that reliability must be conceived relative to the intended assessment uses and interpretations – much like validity. Before making an evaluative judgement about the degree of the reliability of your assessment results, make sure that you take into consideration the context, the purposes of the assessment and the possible consequences of the use of your assessment results. These are some of the issues that are addressed in this chapter. Do not worry if you find some aspects of this chapter a little quantitative – you should feel free to skip those sections that seem less relevant or more difficult to you on your first reading. Remember that one of the purposes of this book is to familiarise you with useful technical concepts of educational measurement so that you will be able to evaluate and choose the most appropriate commercial assessments for your class (if you are given the resources to purchase them).


We begin with a discussion of the concept of reliability and then show you how to establish the reliability of your assessment results.

RELIABILITY

Unless assessment results can be shown to be reasonably consistent, you cannot have any confidence in your findings. It is like measuring the length of an object and coming up with a different answer each time, or taking someone’s blood pressure and noting large variations in the readings, or assessing word recognition and finding inconsistencies from one occasion to another. These variations, or unreliability, lead you to question the accuracy of your results. Reliability, or the stability of results, is necessary but not sufficient for validity. Like validity, reliability refers to the results and not necessarily to a particular assessment. In other words, we cannot say that a particular test is reliable but we can say that a set of results is reliable. In fact, an assessment may have a number of different reliabilities. Unlike validity, which is based on evidence or inference, reliability is largely a statistical concept and is reported mainly as a correlation coefficient. The correlation coefficient is a useful statistical indicator and, if you remember, we discussed it briefly in a previous chapter. We would recommend that you refamiliarise yourself with some basic descriptive statistics such as the mean, variance, standard deviation and correlation presented in the earlier chapter. If you like, you may refer to some standard textbooks and also use spreadsheets such as Microsoft Excel that have statistical functions to do the computations for you. These take all the drudgery out of statistical calculations. If in doubt, always ask a friendly statistician or mathematics teacher to help.

Factors which influence reliability

The major influence on the reliability of assessment results is individual differences. Sometimes there are transient or temporary influences such as fatigue, rapport, guessing, or practice effects. These factors affect performance on the day and mask the real level of knowledge, skills or attitudes. The real level of knowledge, skills or attitudes is best assessed by lengthier or more regular assessments that cancel out the effect of these temporary influences. Reliability really seeks to answer the hypothetical question: ‘if it were theoretically possible to give the same assessment to the same person under the same conditions, would you then get the same result?’ We can never really answer this question in education but we can come reasonably close and, over the years, a number of different ways of estimating reliability have been established.

Ways of estimating reliability

There are four main methods of assessing the reliability of results. These methods are: the test-retest method; the use of parallel or equivalent forms of an assessment; the


split-half method; and the method of internal consistency. These approaches are depicted in Figure 10.

Figure 10. Methods of estimating reliability.

All the methods of estimating reliability involve correlation-type statistics. Correlations are statistical indicators of a relationship. To remind you, a perfect positive relationship is indicated by a correlation of 1 and a zero relationship by 0.

Test-retest method

The test-retest approach establishes the reliability of results for a group. You would give the same test twice to a group, with a time interval in between. You then correlate the scores from the first occasion with those from the second occasion (see the example in Table 15, but note that with small numbers of students the statistics may be unstable). This is a measure of stability over time. Usually it results in moderate (0.5 to 0.6) to larger (0.7 to 0.8) reliability coefficients. The coefficients are higher if the time interval is short, probably because some students may remember the answers to some questions. Some people


might argue that students will naturally increase their scores on re-testing, but what you are interested in is whether they retain the same ranking in the group. For instance, do the students who are the top scorers still stay on top and are the lowest scoring students still likely to remain low scorers? A test-retest correlation cannot be determined for an individual who takes a test twice. The test-retest correlation is calculated for a group. We would assume that if the reported test-retest correlations are high for a particular assessment then we could have some confidence in the stability of results over time for most people who take this assessment. Notice that many assumptions and inferences are being made. We assume that if assessment results are reliable for a group then they are likely to be reliable for an individual.

Table 15. Data for a test-retest correlation

Student    Time I    Time II
A          61        72
B          69        77
C          62        72
D          63        75
E          70        78
F          68        76
G          65        77
H          62        73
I          66        75
J          63        75

Test-retest correlation = 0.86

The test-retest method is not used for most classroom tests because it is an inconvenient approach. It is mainly of technical interest and useful for commercially produced assessments. The rationale for conducting a test-retest study would be difficult to explain to many students. We would use a test-retest study only for high stakes tests. How could one then determine the reliability of results for an individual student? Well, the only method is repeated assessments and determining the variation in the results. Of course, you would need to take into account developmental changes and any practice effects. Most likely you would find some consistency in the range of results, but determining test-retest reliability for a particular student is not straightforward and is rarely, if ever, practised.

Parallel forms

The second approach to evaluating the reliability of results is called parallel or equivalent forms. In the parallel forms approach you might develop two equivalent forms of an assessment and then give the two versions to a group. A comparison of the results for the group determines equivalence and this should provide you with moderate (0.5 to 0.6) to large (0.7 to 0.8) reliability coefficients.


This approach is much easier to implement for classroom tests and the format for the collection of data is shown in Figure 11. In this example you would correlate the results for Form A and Form B. It can be done easily in most spreadsheet programs.

STUDENT    Form A    Form B
A          71        62
B          79        77
C          62        72
...

Figure 11. Data format for parallel forms reliability.

INTERNAL CONSISTENCY METHODS

Split-half method

The split-half approach to reliability is a form of internal consistency. It is based on giving the test once to a group; scoring two equivalent halves of the test (e.g., odd and even numbered questions); correlating the scores on the two halves; and finally correcting for the shortened test length. Do not be alarmed if the explanation seems complex. An example of the data format for calculating the split-half reliability is shown in Figure 12. The split-half approach evaluates the internal consistency (reliability) of an assessment since, in theory, one half of an assessment, such as a multiple-choice, true-false, short answer or practical test, should be equivalent to the other half. Keep in mind that the two halves should be equivalent in format (e.g., same proportion of essay and multiple-choice questions), content and difficulty. It is also necessary to divide the items so that there is no particular bias and this is why odd-even arrangements are usually chosen. Also, remember that in the case where you have questions referring to some common ‘stimulus’ (e.g., two reading comprehension questions referring to the same passage), you should generally keep them together. This is because if we split them across the two parts, we will artificially increase the correlation between the parts; we should treat them as one bit of information, and keep them together. The procedure of splitting the assessment is therefore always a trade-off between creating parallel parts and not artificially inflating the correlation between the scores on them.


STUDENT    Score from even numbered questions    Score from odd numbered questions
A          71                                    62
B          79                                    77
C          62                                    72
...

Figure 12. Data format for correlating split-half reliability.

The split-half approach tends to provide the largest reliability coefficients and produces quite high estimates for tests with time limits. However, if you do not split the assessment into two equivalent parts, then the split-half reliability estimates will be biased downward. (Visit WebResources for more information on the Spearman-Brown formula and other measures of reliability.) Since you are correlating only half the test, it is necessary to correct the correlation obtained to the original test length and you can use the Spearman-Brown formula:

Reliability of full test = \frac{\text{two times the correlation between the halves}}{\text{one plus the correlation between the halves}}

In mathematical notation, if we split an assessment into two parts (say Part 1 and Part 2), then the total scores of a student on each of the parts will be X_1 and X_2 correspondingly. Then, the Spearman-Brown formula may take the following formal form:

\rho_{SpearmanBrown} = \frac{2 \times \rho_{x_1 x_2}}{1 + \rho_{x_1 x_2}}     (9)

where \rho_{x_1 x_2} refers to the correlation between the two halves of the assessment and \rho_{SpearmanBrown} refers to the reliability index of the whole assessment.

Remember that when you use the split-half method, after you calculate the correlation between the two halves, you may use the formula above in order to compute a more realistic reliability index for your assessment.
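As an illustration only (the item scores below are invented and the correlation function comes from Python's standard library, not from the book's materials), the odd-even split and the Spearman-Brown correction of formula (9) might be carried out like this:

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical item scores (rows = students, columns = 8 dichotomous questions).
scores = [
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
]

# Total each student's score on the odd-numbered and even-numbered questions.
odd_totals = [sum(row[0::2]) for row in scores]
even_totals = [sum(row[1::2]) for row in scores]

# Correlate the two halves, then correct to full test length with formula (9).
r_halves = correlation(odd_totals, even_totals)
full_test_reliability = (2 * r_halves) / (1 + r_halves)
print(round(r_halves, 2), round(full_test_reliability, 2))
```

With real classroom data you would, of course, want far more students than questions before trusting these figures.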


After carrying out the above procedure, if you still think that the reliability of your assessment is too low for the intended interpretations, uses or decisions to be taken, you may wish to calculate how the reliability would be improved if you added some more questions or tasks to the assessment. Again, as before, we assume that the questions or tasks to be added will be similar in nature and will cover the same domain as the existing ones. Although the reliability of an assessment usually increases when we increase the number of questions or tasks, keep in mind that for classroom assessment purposes very long assessments may not be desirable because they consume too much time which could otherwise be used for teaching purposes. Very long assessments may also be discouraging for the less able learners. In any case, the reliability of a longer assessment is calculated using (again) a Spearman-Brown formula (often called the ‘Spearman-Brown Prophecy formula’):

New reliability = \frac{\text{present reliability} \times \frac{\text{number of new items}}{\text{number of old items}}}{1 + \text{present reliability} \times \left( \frac{\text{number of new items}}{\text{number of old items}} - 1 \right)}

In mathematical notation, the formula may take the following formal form:

\rho_{new} = \frac{\rho_{x_1 x_2} \times \frac{n_{new}}{n_{old}}}{1 + \rho_{x_1 x_2} \times \left( \frac{n_{new}}{n_{old}} - 1 \right)}     (10)

where n is the number of questions in the test and \rho refers to the reliability index of an assessment. You can easily see that the reliability of the assessment will increase as you increase the number of questions or tasks. As mentioned before, do not be intimidated by the perceived complexity of the formula because a numerical example (see below) will show you how easy it is to implement.

Example: Let us assume that we computed the reliability for a 10-item test and it was 0.7. (a) What would the reliability be if we doubled the number of questions to 20? (b) What would the reliability be if we halved the number of questions to 5?

In the case where we double the number of questions:

\rho_{20\ questions} = \frac{0.7 \times \frac{20}{10}}{1 + 0.7 \times \left( \frac{20}{10} - 1 \right)} = \frac{1.4}{1 + 0.7} = \frac{1.4}{1.7} = 0.82

In the case where we would like to halve the number of questions:

\rho_{5\ questions} = \frac{0.7 \times \frac{5}{10}}{1 + 0.7 \times \left( \frac{5}{10} - 1 \right)} = \frac{0.35}{1 + 0.7 \times (-0.5)} = \frac{0.35}{0.65} = 0.54

Beware, however, because when you add or remove questions or tasks to or from an assessment, you always run the danger of altering the nature of the assessment itself. The new assessment should resemble the old one: for example, if you have an assessment with 20 questions in total (say, 10 multiple-choice and 10 short-essay questions), then you cannot use the formula above to estimate the new reliability if you remove all ten multiple-choice questions, because the remaining assessment will not resemble the old one. Keep in mind that the validity of the assessment is always more important than the reliability: if you keep adding tasks or questions you may end up with a more reliable assessment which – unfortunately – may or may not measure what you originally intended to assess. We shall not trouble you further with details of other relevant formulae, but the interested reader can refer to the WebResources for additional details. Using the split-half approach may be time consuming in adding up scores for each half but it is not difficult and it could conceivably be used for classroom tests. We would not expect an instructor or classroom teacher to determine the test-retest or split-half coefficients for their assessments but it is helpful for you to have some familiarity with these indicators of reliability, especially when high-stakes assessments are involved. This information might also help you when you choose commercial assessments for your school or institution to purchase, or when you have the luxury to choose among several available assessments.
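Returning to the worked example, the prophecy formula (10) is easy to wrap in a small helper; the sketch below is our own illustration and the function name is arbitrary:

```python
def prophesied_reliability(current_reliability, n_old, n_new):
    """Spearman-Brown prophecy formula (10): reliability after changing the test length."""
    factor = n_new / n_old
    return (current_reliability * factor) / (1 + current_reliability * (factor - 1))

# The worked example above: a 10-question test with reliability 0.7.
print(round(prophesied_reliability(0.7, 10, 20), 2))  # doubling -> 0.82
print(round(prophesied_reliability(0.7, 10, 5), 2))   # halving  -> 0.54
```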

Kuder-Richardson and Cronbach’s coefficient alpha formulae

There are also measures of internal consistency such as the Kuder-Richardson formulae or Cronbach’s coefficient alpha which can be used to give an indication of reliability. In this approach, you do not need to split the assessment in half: instead, you give the test once, score the total test and apply the formula to the questions and the total score. Cronbach’s coefficient alpha determines the homogeneity of responses. It is by far the most frequently used index of internal consistency and it has been shown that it is equal to the average of all possible split-half reliability estimates (as described above). Therefore, its use will protect you from errors when splitting the assessment into two parts (it also saves you from actually having to decide how to split it). Cronbach’s alpha gives you an idea of whether the assessment is internally consistent (i.e., whether the questions in your assessment measure a single, unidimensional latent construct) and may be computed with the following formula:

Reliability_\alpha = \left( \frac{\text{questions}}{\text{questions} - 1} \right) \times \left( 1 - \frac{\text{sum of question variances}}{\text{variance of total score}} \right)


Or more formally,

\rho_\alpha = \left( \frac{n}{n-1} \right) \times \left( 1 - \frac{\sum Var_i}{\sigma_x^2} \right)     (11)

where n is the number of questions in the test, Var_i is the variance of the scores of the students on question i, and \sigma_x^2 is the variance of the total scores on the whole assessment.

Example: Let us assume that we administered a 12-question arithmetic test to our students. Each question is of a similar nature and marked on the same scale (e.g., 0 marks for an incorrect answer, 1 mark for a partly correct answer and 2 marks for a fully correct answer). Here is an example of the calculation of the Cronbach’s alpha measure of the internal consistency of our test results. In this case, we use the data illustrated in Table 16.

Table 16. The variances of the responses of students on a 12-question test

Question/item    Variance of each of the questions
1                .319
2                .412
3                .335
4                .506
5                .452
6                .429
7                .332
8                .557
9                .453
10               .339
11               .331
12               .455
Total            5.01

The Cronbach’s alpha reliability for this test, which has 12 questions that are each marked from 0 to 2 and a total-score variance of 27.312, is:

\rho_\alpha = \left( \frac{12}{11} \right) \times \left( 1 - \frac{5.01}{27.312} \right) = 0.89
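As a quick check, the same calculation can be scripted directly from the figures in Table 16 (this is our own sketch, not part of the book's spreadsheet). Note that the twelve item variances as printed sum to slightly less than the 5.01 shown in the table, presumably because of rounding, but alpha still comes out at 0.89:

```python
# Item variances from Table 16 and the variance of the total scores (27.312).
item_variances = [0.319, 0.412, 0.335, 0.506, 0.452, 0.429,
                  0.332, 0.557, 0.453, 0.339, 0.331, 0.455]
total_score_variance = 27.312

# Cronbach's alpha, formula (11).
n = len(item_variances)
alpha = (n / (n - 1)) * (1 - sum(item_variances) / total_score_variance)
print(round(alpha, 2))  # 0.89
```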

In cases where the responses of the students to an assessment are scored dichotomously (i.e., 1 mark for a correct response and 0 marks for an incorrect response), the Kuder-Richardson 20 formula can be applied as well. There are several Kuder-Richardson formulae such as the Kuder-Richardson 20 (hence,


KR20) and the Kuder-Richardson 21 (hence, KR21) formulae. These are considered by some people as simplifications of the Cronbach’s alpha formula, although mathematically they are distinct. In any case, we will not delve into them because computational simplicity is no longer a concern: software such as Microsoft Excel can compute Cronbach’s alpha for you, and this is appropriate for most purposes (instead of using different formulae). As a word of caution, you can expect problems when you calculate these statistics using small class sizes because the results can be very unstable. It is often recommended that you should have a ratio of around 5-10 students for each question when you analyse test results, so this test with 12 questions (see Table 16) would need at least 60 students. Cronbach’s alpha will generally increase when the correlations between the items increase. It may theoretically take values from -1 to 1 but only positive values make any sense. Practically, in your classroom settings, you may see Cronbach’s alpha indices from near-zero up to 0.8. In high stakes situations, such as the SAT National tests in England, the tests typically have a Cronbach’s alpha reliability index above 0.8. Internal consistency reliability is useful for assessments of knowledge and attitudes as well as practical skills. It should be, and normally is, calculated routinely for large-scale and high-stakes assessments. Keep in mind, however, that all internal consistency estimates overestimate the true reliability of your assessment simply because they do not take the fluctuations of students’ performance over time into account.

RELIABILITY OF CRITERION-REFERENCED ASSESSMENTS

You will need a different approach to assessing reliability for criterion-referenced assessments. This happens because the reliability indices for norm-referenced assessments require the scores to vary across the students; remember that we use correlations to compute most of the reliability indices, for example, test-retest reliability or split-half reliability. If most of the students get roughly the same scores on an assessment (as may be the case in a criterion-referenced context) there will be little scope to compute a correlation index. The focus in the case of criterion-referenced assessments is on the consistency of mastery/non-mastery decisions. One way in which you can do this is to use two forms of an assessment or two tasks to decide competency. You then calculate the percentage of consistent decisions on the two forms of an assessment.2 The general formula is set out below and it requires you to determine whether someone was competent or not yet competent on one or both occasions.

\text{Percentage Consistency} = \frac{\text{Masters (both tests)} + \text{Non-masters (both tests)}}{\text{Total number in the group}} \times 100     (12)

The following example (see Table 17) is based on 30 students:

Table 17. Consistency of Mastery

                               ASSESSMENT A
                               Masters    Non-masters
ASSESSMENT B    Masters        2          20
                Non-masters    7          1

\text{Percentage Consistency} = \frac{2 + 1}{30} \times 100 = 10\%
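A minimal sketch of formula (12), using the cell counts of Table 17 (our own illustration, with variable names of our choosing):

```python
# Cell counts from Table 17 (rows = Assessment B, columns = Assessment A).
masters_both = 2        # classified as masters on both assessments
master_b_only = 20      # masters on B, non-masters on A
master_a_only = 7       # masters on A, non-masters on B
non_masters_both = 1    # non-masters on both assessments

total = masters_both + master_b_only + master_a_only + non_masters_both
# Percentage of consistent mastery/non-mastery decisions, formula (12).
consistency = (masters_both + non_masters_both) / total * 100
print(f"{consistency:.0f}%")  # 10%
```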

Where only one assessment or test is available, it is possible to use around ten items or questions for each objective or set of tasks. You would then check whether the two groups of tasks or questions led to the same mastery decisions.

HOW HIGH SHOULD RELIABILITY BE?

Teacher-made assessments commonly have reliabilities somewhere between 0.60 and 0.85, and for the most part this is a satisfactory level for the instructional decisions made by teachers. High reliability is demanded when a decision is important, final and irreversible, concerns individuals, or has lasting consequences. Examples of the need for high reliability would be selecting or rejecting applicants for a course, the Higher School Certificate and trade recognition tests. Lower reliability is tolerable where: it is a minor decision; you are dealing with the early stages of a decision; a decision is reversible; achievement can be confirmed by other information; the decisions that are made concern a class rather than an individual; and/or the effects of any decision are temporary. Examples of acceptable lower reliability would be deciding whether to review a lesson unit based on test results, using an initial screening test prior to an expensive practical test, or using test results for a course evaluation.

Remember that reliability coefficients indicate the likely stability or consistency of results. They are just a guide to the potential error in a set of results. The next section focuses in a little more detail on the topic of error and its measurement. Once again, do not worry if the details are not clear to you on first reading. You might also want to consult additional measurement or statistical texts for different explanations of the same topic.

STANDARD ERROR OF MEASUREMENT

Every time you measure something in the physical world (e.g., the length of your arm or the distance between two objects) you know your measurement is not perfect: there is always some error due to chance or carelessness. The same holds true in education, where we try to measure unobservable quantities such as knowledge, skills or attitudes.


The notion that every test result has a range of error around it has some technical problems and makes quite a few assumptions, but it is a useful concept for teachers who have to interpret test scores. What it means is that, ideally, assessment results should be interpreted as a range of scores rather than as a specific score. The standard error of measurement gives you an idea of the potential error surrounding any student's score and tends to make judgements about performance more conservative and less absolute in nature. A confidence band of one standard error of measurement is commonly applied by professional test users to results. Accordingly, students whose performance is borderline (pass, credit, etc.) can be given the benefit of the doubt.

Technically, the standard error of measurement is an estimate of the standard deviation of the errors associated with test scores. You will need to know the test-retest reliability for the results and also the standard deviation of the set of results on which you are working. The standard error of measurement can be calculated using this formula:

standard error of measurement = standard deviation × √(1 − reliability of the test)    (13)

Here is a worked example. Assume that the standard deviation of the test scores is 10 and the reliability of the test scores is 0.65. What is the standard error of measurement?

standard error of measurement = 10 × √(1 − 0.65) ≈ 5.9

This means that if a student received a score of 50 on the test (this is usually called his/her 'observed score'), then we could say: 'Given the student's obtained score of 50, there is a 68% probability that the individual's true score would fall between 44.1 (= 50 − 5.9) and 55.9 (= 50 + 5.9)'. It is possible that we might want to be more precise, so we could use a 95% probability statement: 'Given the student's obtained score of 50, there is a 95% probability that the individual's true score would fall between 38.2 (= 50 − 2 × 5.9) and 61.8 (= 50 + 2 × 5.9)'.

Standard error of measurement for criterion-referenced tests

Sometimes we can only assess learners on a small fraction of the potential range of tasks, and the result that we obtain from our testing is only an estimate of their performance on the entire set of tasks. As you know, every test result or performance has a margin of error around it and this can be calculated. We can estimate the range within which a person's true competence falls. With criterion-referenced tests, we also determine an interval around the proportion of tasks in an assessment that are correct. This is calculated using the following formula4:

standard error of measurement = √[proportion correct × proportion incorrect / (number of questions − 1)]    (14)


The standard error of measurement for a criterion-referenced assessment indicates the range within which a person's results might have fallen. You only need to know the proportion correct and the number of items or tasks in the test. Here is an example. On a 20-question test, someone scores 12 correct; therefore, the proportion correct is 12 out of 20 or 0.60 and the proportion incorrect is 0.40 (1 − 0.60 = 0.40). We may compute the standard error for this criterion-referenced test as

standard error = √[(0.6 × 0.4) / (20 − 1)] = √(0.24 / 19) ≈ 0.11

As we have seen in a previous paragraph, one standard error of measurement would include the real level of ability 68 times out of 100. So, using a confidence interval of 68%, this person could (in repeated testings) score anywhere from 49% to 71% correct (0.6 ± 0.11). In other words, 32 times out of 100 the true score might not even be in the range 49-71%. If you wanted to be 95% confident then you could take a margin of two standard errors (i.e., 38-82% correct, but note how large the range of scores is – it is almost entirely uninformative for practical purposes).

Let us use another example. If a person completed 6 out of 10 tasks correctly, then the standard error would be 0.16. This means that 68 times out of 100, the person's level of ability would be in the range of 0.6 correct plus or minus 0.16 (i.e., 0.44 to 0.76). The ten tasks that were chosen represented a sample from an infinitely large number of tasks, and the standard error indicates the potential range of scores from repeated and equivalent tests. If you increased the number of tasks to 20 and still obtained 0.6 correct, then this interval would reduce to 0.11; for 30 tasks it would be 0.09 and for 40 tasks it would be 0.07. This demonstrates how the reliability of, and confidence in, a set of results increases with the number of items – in other words, how the precision of measurement improves as you increase the length of your assessments.
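For readers who like to check such calculations by computer, the following minimal Python sketch covers both versions of the standard error of measurement described above, using the numbers from the worked examples.

import math

# Standard error of measurement for a norm-referenced test (formula 13)
def sem_norm(sd, reliability):
    return sd * math.sqrt(1 - reliability)

# Standard error for a criterion-referenced test (formula 14)
def sem_criterion(proportion_correct, n_questions):
    p = proportion_correct
    return math.sqrt(p * (1 - p) / (n_questions - 1))

print(round(sem_norm(10, 0.65), 1))        # 5.9, as in the worked example
print(round(sem_criterion(0.6, 20), 2))    # 0.11
print(round(sem_criterion(0.6, 10), 2))    # 0.16

score = 50
sem = sem_norm(10, 0.65)
print(f"68% band: {score - sem:.1f} to {score + sem:.1f}")    # about 44.1 to 55.9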

HOW CAN I INCREASE RELIABILITY?

Reliability can be increased in two easy ways. Firstly, reliability is related to the length of the assessment. As a general rule, increasing the length of an assessment will increase the reliability of results. This is because a longer assessment lessens the influence of chance factors (e.g., individual differences, guessing on a multiple-choice test). If short assessments are necessary (e.g., for younger students who may have shorter concentration spans), then you may need to use more frequent assessments. In simple words, if you want to increase the reliability of your measurement, you will need more sources of information: imagine every question or task (or assessment, in general) to be such a source of information.

Secondly, reliability is also linked with the variation or range of scores on an assessment. The larger the spread of scores, the higher the estimate of reliability will be. (With only a narrow spread of a few points between the highest and lowest scores, it is probable that individuals will change position.) To obtain a wider spread of scores, you need to set more difficult assessments and use questions which measure more complex learning outcomes. However, you need to be very careful because a very difficult assessment will also give you results with small variation. You need to use assessments of appropriate difficulty for your students: neither too difficult, nor too easy. Nevertheless, we are not sure that reliability will really be your major problem in assessment; we think that validity is a much more important concern.

EFFECT OF PRACTICE AND COACHING ON RELIABILITY

One further issue affecting the reliability (and validity) of results is the effect of practice or coaching on assessment results. When assessment material or content is varied, such as in studies5 on the effects of coaching on the Scholastic Aptitude Test, the consequences of coaching are variable. They depend upon the amount of time spent, the similarity of the coaching material to the final assessment, and motivation. Repeated practice on the same assessment, however, will certainly affect subsequent performance and this is a serious problem for occupational certification and licensing. Significant gains can be demonstrated on repeated testing and these are often sufficient to make people who would have failed or were near passing eventually achieve a passing grade (or competence). Figure 14 shows an estimate of how repeated testing can affect the overall proportions of people passing.

Figure 14. Estimated effects of the number of attempts on the proportion passing (Source: Hager, Athanasou & Gonczi, 1994, p. 155)

HOW TO PURCHASE VALID AND RELIABLE COMMERCIAL ASSESSMENTS

Commercially-available assessments are usually sold for many subjects, for different age groups and in many languages. If the policy of your school is to devote funds to the purchase of commercially-available assessments, you need to take into account some information regarding the reliability and the validity of the assessment results. We have already addressed some of those issues in the chapter on Validity, so this section is rather short.

First of all, as we have already emphasised, confirm that the assessment was developed, evaluated and improved by the test developer using a sample of students with similar characteristics to your students. Make sure that you inspect the domain covered by the assessment and that it agrees with your curriculum. Check that the 'intended uses' of the test constructor cover your own aims. Before making an investment, if in doubt, ask the advice of a more experienced colleague.

It is good practice to ask for sample material before buying a commercially-available assessment. Ask to review the scoring/marking guides: are they clear enough? Do you think the marking scheme covers the responses you expect to be most frequent from your students? Are there any guidelines to explain to you how to interpret the results of the assessment?

Concerning the statistical aspects of commercial assessments, ensure that you read the technical section carefully and try to identify the statistics relating to the reliability of the instrument and the error of measurement. You now know how important those issues are. If you intend to use the test for selection purposes, do you think that the precision of measurement is good enough? If you intend to use the instrument to measure learning, can you find any indications about its sensitivity to teaching in the 'Technical Manual' (there should be a technical manual for a commercially-available assessment)? These details are important because they will let you know how much trust you can put in the assessment results. Remember that you should always be prepared to defend your decisions to the students, the parents, and the management team of the school. The use of a commercial assessment does not automatically validate any use of the assessment results.

Visit WebResources for relevant links to web sites about the reliability of assessment results.

If you have two or three alternatives to choose from, you need to use some clear criteria. The first criterion should be validity: the domain covered by the assessment and the face validity (the students and the parents should judge the assessment to be appropriate) are of paramount importance. Then, use the statistical features (e.g., the reliability of the assessment) as a second criterion. If two assessments have the same Cronbach's alpha (this is the most usual statistic reported), make sure to check the number of questions or tasks in each assessment: scales of different length may not have comparable reliability statistics. If one of the two assessments has a much higher reliability estimate than the other, make sure it is not too long for your students to complete.

As a consumer of assessment material and assessment results, make sure that your needs are addressed by the products you use. Remember that this is a tool to be used by you and your colleagues: if you are not satisfied with a commercial assessment, contact the developer and explain the reasons that make you unhappy. In some cases, the assessment developer might be able to suggest alternative products that may be more appropriate for your intended uses.


SUMMARY

By and large, the results from most of the tests likely to be used by a teacher will have reasonable reliability. The classroom tests that we have evaluated usually produce internal consistency reliability coefficients of around 0.6 and sometimes as high as 0.8; there have been some exceptions with much lower reliabilities but these are rare. In fact, as a teacher you would have to go out of your way to produce an assessment which will give you very unreliable results. For the most part, your tests compare favourably with the reliability values of published standardised commercial tests, especially when you contrast the investment of time involved on a pro rata basis. The available evidence, however, points towards much longer assessment times than is usual.6 If you want to achieve minimum reliabilities of 0.8 or more, you may need around 150 multiple-choice questions, 60 short answer questions, 5-7 practical tests or around 6-8 hours of observation. The question for you as a teacher is a trade-off between the utility of what you are doing now versus the cost, time and effort of longer assessments with much higher reliabilities. For the most part longer assessments are not necessary in classroom situations but they are important for high-stakes testing. The use of reliability coefficients as a check on the quality of your test results is recommended, together with the use of the standard error of measurement in criterion- and norm-referenced tests to give a range of results. No matter how competently and carefully produced our tests are, the major source of unreliability will usually be inconsistencies in student performance. Someone once wrote that the test is perfectly reliable because it never changes; it is people's reactions to assessment situations that change.

-oOo-

REVIEW QUESTIONS

T F   If you are certain that your assessment has a high degree of reliability then the results from it must be valid
T F   Reliability is the degree to which test results are consistent, dependable or repeatable
T F   The major influence on reliability of assessment results is individual differences
T F   Methods of estimating reliability involve correlation-type statistics
T F   A moderate correlation is 0.4
T F   The test-retest method is widely used by teachers to determine reliability
T F   Parallel forms involves giving the same test twice to a group and comparing the results
T F   The split-half method involves comparison of the results from two equivalent halves of an assessment
T F   The split-half is automatically corrected for test length by the Kuder-Richardson formula
T F   It is easier to estimate reliability using an internal consistency formula than using test-retest methods


T F   A procedure for assessing the stability of test scores is the parallel forms method of reliability
T F   The coefficient alpha is a criterion-referenced estimate of reliability
T F   The percentage of consistent decisions on two forms of an assessment is a criterion-referenced estimate of reliability
T F   Teacher-made assessments have reliabilities of around 0.5
T F   Two standard errors includes 68% of all mistakes on a test
T F   Reliability coefficients vary up to +1
T F   As the number of questions increases, the reliability will generally increase
T F   As error increases reliability decreases
T F   The standard error is the likely range of results around a given score
T F   You can be 95% certain that the true score is within plus or minus two standard errors

EXERCISES

1. Distinguish between the different types of reliabilities.
2. In the context within which you are teaching or plan to teach, which estimate of reliability would provide you with the most useful information?
3. Which methods of estimating reliability would provide you with the most useful information for tests of (a) knowledge, (b) skills and (c) attitudes?
4. Indicate the effect on reliability of: (a) increasing the number of questions; (b) removing ambiguous questions; (c) adding harder questions; (d) adding very difficult questions; and (e) adding very easy questions.
5. For what purpose is the standard error of measurement useful?
6. Determine the standard error of measurement for the values in this table.

Test score    SD    Reliability
85            10    0.9
85            10    0.8
85            10    0.7
85            10    0.5
60            10    0.9
60            10    0.8
60            10    0.7
60            10    0.5
40            10    0.9
40            10    0.8
40            10    0.7
40            10    0.5


7. A teacher noted the following reliability coefficients:
   Test-retest one month = 0.7
   Correlation with an equivalent test = 0.5
   Split-half correlation = 0.8
   How would you account for these differences in reliability?
8. What would be the potential variation in ability if a student was given a criterion-referenced test of knowledge in biology and answered 20 out of the 30 questions correctly on one occasion?
9. Determine the percentage consistency or reliability of a mastery decision for these results. The one (1) represents competence or mastery on a task and the zero (0) represents someone who is not-yet-competent or non-mastery.

Person    Test 1    Test 2
A           1         1
B           1         0
C           1         1
D           1         1
E           0         1
F           0         0
G           1         1
H           0         1
I           0         1
J           1         0


CHAPTER 6

ANALYSING TASKS AND QUESTIONS

Once you have used an assessment with your students, you will then have some information about the particular tasks or questions. The tasks or questions in the assessment are usually called items and the performance of a group on the items can be analysed to give you useful information about learning and achievement. The marked (or scored) responses of students to the assessment are called item-level data; this is distinct from the information regarding the total scores of the students on the assessment. Therefore, item analysis (or in some cases 'item-level analysis') refers to methods for obtaining information about the performance of a group on an item. In this chapter, you will see some different ways of analysing tasks and questions. This is frequently useful when we would like to evaluate our assessment and decide whether we need to drop or alter any of the questions or tasks. In cases where we would like to reduce the length of the assessment, the item analysis will help us so that only the best items are retained. The procedure that we are describing in this chapter is one small aspect of item analysis. It is also dealt with in greater detail later in this book (see the chapter on Rasch analysis).

In a nutshell, one of the main aims of item analysis is to improve the quality of your assessment for future use. Item analysis guides you when you wish to replace, revise or eliminate items from an assessment and when you wish to shorten or lengthen aspects of an assessment in order to achieve your purposes. Items (i.e., questions, tasks) in norm-referenced assessments (i.e., when we want to compare the performance of one student to that of the others) can be analysed using a range of statistical indices such as item difficulty, item discrimination and point-biserial correlation. There are also special techniques of item analysis for criterion-referenced assessments (i.e., when we want to check whether students have mastered a body of knowledge). The criterion-referenced approaches consider the difficulty of questions and whether questions are sensitive to the effects of instruction or are able to discriminate competent from not yet competent performance. Both criterion-referenced and norm-referenced assessments use information about the difficulty of items and it is with this concept that we shall begin the introduction to item analysis.

Visit WebResources for relevant links to web sites about the analysis of assessment results.



ITEM DIFFICULTY

It is always important to see how difficult an item is for a group of test-takers, in order to provide some diagnosis of learning difficulties as well as to improve the assessment for future use. Sometimes you are surprised to find items that are very easy or, in other cases, quite difficult for a group. Like many issues in assessment, item difficulty can only be calculated 'after the fact'. It is true, however, that researchers have spent a lot of time trying to figure out ways to predict the difficulty of an item or task from its characteristics, for example, the length and the number of words used, the complexity of cognitive demands or the perceived clarity of pictures, graphs or artwork (if any). Unfortunately, even the most experienced teachers cannot always predict the difficulty of a task or question when it is administered to a specific group of students. In the case of an item which is marked dichotomously (i.e., 0 for an incorrect response and 1 for a correct response), item difficulty is the proportion of people in a group who give a correct response. Item difficulty is really an indicator of how easy a question is for the group who took the test. Some people call it item facility (i.e., easiness) since they find the expression 'item difficulty' to be misleading. The difficulty of a question or task is a natural measure for the analysis of any item. At a minimum, the item difficulty of a task should be calculated.

Item Difficulty = Number who get the correct answer / Total number in the group or class    (15)

Using this simple calculation, the smaller the value (or percentage figure), the more difficult the question is. The item difficulty may be expressed as a decimal (e.g., difficulty = 0.6) or as a percentage (60% correct responses). An example is shown in Table 18. This shows the results for ten learners on eight different tasks. The students are shown as rows from A to J. The tasks are listed from Q1 to Q8. (Can we convince you to set out your results in this format from now on – that is, rows for people and columns for items? Over the years we find that most of our students use columns for people and rows for questions. The advantage of rows for people and columns for items is that the data can be analysed easily using spreadsheets or statistical software.) One more comment about the ones and zeros in the table: the ones represent tasks or items that were answered correctly and the zeros represent items that were failed. In this case the one is not intended as a quantity but represents a category or an ordering from right to wrong.

From this example it is clear that at least one item (Q1) was far too easy and another item (Q7) was far too difficult. One might argue that these two items are not giving you much information and you might consider dropping them from a future assessment. However, this is true only if one assumes that you desire to use only items that have at least some discriminatory function, in other words, if you want to use items that can discriminate between more and less able students. If, however, the test is purely criterion-referenced and each item targets a different teaching goal, then Q1 and Q7 should not be removed from the test. They can provide you with valuable information about the students who have mastered (or failed) your teaching aims. If everybody succeeds in Q1, that means that the corresponding teaching aim was achieved and you can focus on other aims. If nobody succeeds in Q7, then this means that you have failed in teaching the corresponding aim and more teaching is needed. It could also be the case that the teaching aim represented by Q1 was too easy and the teaching aim represented by Q7 was too hard to master. This can help you revise your teaching plans according to the ability of your students.

Table 18. Calculation of item difficulty for a group of ten trainees on an eight item assessment

Example: Here are the results for 10 learners on an eight item test. The item difficulty (or item facility) is the total correct on each task divided by the number of people in the group. A score of 1 in the table means that the task was completed correctly and a 0 means that the answer was wrong.

Learner           Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8
A                  1     1     0     0     1     0     0     0
B                  1     1     0     1     1     1     0     1
C                  1     0     0     0     1     0     0     1
D                  1     1     0     1     1     1     0     1
E                  1     1     0     0     1     0     0     1
F                  1     0     0     0     1     1     0     1
G                  1     0     0     0     1     1     0     1
H                  1     0     0     0     1     1     0     1
I                  1     0     0     0     0     0     0     0
J                  1     0     1     1     1     1     0     1
Total correct     10     4     1     3     9     6     0     8
Item difficulty  1.0   0.4   0.1   0.3   0.9   0.6   0.0   0.8
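If your results are already laid out with rows for people and columns for items, the calculation of item difficulty can be automated in a few lines. The following is a minimal Python sketch using the data of Table 18; a spreadsheet formula would do the same job.

# Item difficulty = proportion of the group answering each question correctly.
# Rows are learners A to J, columns are Q1 to Q8 (the data of Table 18).
results = [
    [1, 1, 0, 0, 1, 0, 0, 0],   # A
    [1, 1, 0, 1, 1, 1, 0, 1],   # B
    [1, 0, 0, 0, 1, 0, 0, 1],   # C
    [1, 1, 0, 1, 1, 1, 0, 1],   # D
    [1, 1, 0, 0, 1, 0, 0, 1],   # E
    [1, 0, 0, 0, 1, 1, 0, 1],   # F
    [1, 0, 0, 0, 1, 1, 0, 1],   # G
    [1, 0, 0, 0, 1, 1, 0, 1],   # H
    [1, 0, 0, 0, 0, 0, 0, 0],   # I
    [1, 0, 1, 1, 1, 1, 0, 1],   # J
]
n_items = len(results[0])
difficulties = [sum(row[i] for row in results) / len(results) for i in range(n_items)]
for q, d in enumerate(difficulties, start=1):
    print(f"Q{q}: {d:.1f}")     # 1.0, 0.4, 0.1, 0.3, 0.9, 0.6, 0.0, 0.8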

If you perform no other form of item analysis, then at least calculate the item difficulty from time to time, so that you can obtain a clear picture of a group's response to your questions or tasks. You may then seek explanations for such variations in performance in the curriculum, your teaching or the understanding of the students. Item difficulty is easy to calculate but it is difficult to estimate its true value because it will depend largely on the ability of the students undertaking the test. Of course, the difficulty of a question can also be affected by factors such as the extent of prior instruction.

Try to group your items according to the sub-domains of the curriculum or according to the skills they test. Average the difficulties of the items in each group and compare them. This will give you a rough but useful indication of the success of your teaching of different parts of the curriculum. If, for example, we assume that Q1, Q5, Q6 and Q8 represent sub-domain 'A' and the rest of the questions represent sub-domain 'B', then the average success rate for sub-domain 'A' is 83% (0.83) and the average success rate for sub-domain 'B' is 20% (0.2). This indicates that more teaching is needed on sub-domain 'B'. It can also make you revise your teaching strategy on this sub-domain.

Knowing the item difficulty can help you in a number of other ways. It is acceptable and good practice to place easier questions (item difficulty > 0.7) at the start of an assessment. Another way that item difficulty can help is that questions passed by everyone or failed by everyone are not helpful in making up the total score on an assessment or in distinguishing between students. You might as well add one to a final score rather than have an item which everyone passes. Note that we said items that everyone passes or fails; the emphasis is on everyone and does not mean nearly everyone or almost everyone.

Let us remind you of the key point of this chapter, namely, that item difficulty should always be examined. Ideally, it would be hoped that we could develop assessments that match the ability levels of our group and which provide us with information. We aim to produce assessments in which the items ideally might fall into a pattern of responses like those shown in Table 19. This is an imaginary four item assessment in which the tasks or questions are either correct (1) or incorrect (0). The results are called a Guttman pattern and you can see that each score tells you clearly about the pattern of items that a person can complete correctly. (Louis Guttman was a psychometrician and researcher who gave his name to this type of pattern.) The person with a score of one can complete only the easiest item; the person with a score of two can complete only the first two easiest items; then a person with a score of three can complete the three easiest items and finally the person with a score of four can complete all four of the tasks correctly. In this case the score tells us about the pattern of right and wrong responses. It is an ideal and is rarely achieved, but it depends upon knowing the item difficulty.

Table 19. A Guttman pattern of responses

Person            Q1     Q2     Q3     Q4     Person's score
A                  1      0      0      0          1
B                  1      1      0      0          2
C                  1      1      1      0          3
D                  1      1      1      1          4
Item difficulty   1.0   0.75    0.5   0.25

We recommend to our students that they arrange the results for their groups in columns for items and rows for people. Then determine the difficulty of each item. Once you have done this, rearrange the items or columns from easiest on the left to hardest on the right. You will not yet see a pattern of ones and zeros as two more steps are required. Now add across each row to produce a total like that in Table 19. Once you have done this, sort the people from the lowest scorer at the top to the highest scorer at the bottom. You might then start to see the outlines of a Guttman pattern. Now you are in a position to see which items might usefully be deleted. You can also inspect this table to see if any people are answering in a way that is not expected – maybe they are answering difficult questions correctly and making errors on much easier tasks. As an example of this process we have rearranged the responses from Table 18. You can see the beginnings of a pretty reasonable pattern in Table 20. You will not always get such a pattern. [Item 2 and the responses of persons A, E and J in Table 20 need closer examination.] We realise that many teachers will not have the time to engage in such analyses, but it can be helpful for important assessments and especially where instructors will use the same assessments time and again.

Table 20. Rearranging items and persons to see if there is a Guttman pattern of responses

Learner           Q1    Q5    Q8    Q6    Q2    Q4    Q3    Q7    Score
I                  1     0     0     0     0     0     0     0      1
A                  1     1     0     0     1     0     0     0      3
C                  1     1     1     0     0     0     0     0      3
E                  1     1     1     0     1     0     0     0      4
F                  1     1     1     1     0     0     0     0      4
G                  1     1     1     1     0     0     0     0      4
H                  1     1     1     1     0     0     0     0      4
B                  1     1     1     1     1     1     0     0      6
D                  1     1     1     1     1     1     0     0      6
J                  1     1     1     1     0     1     1     0      6
Item difficulty   1.0   0.9   0.8   0.6   0.4   0.3   0.1   0.0
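The rearrangement itself is easy to automate. The following minimal Python sketch sorts the items of Table 18 from easiest to hardest and the learners from lowest to highest score, and prints the rearranged matrix so you can look for a Guttman-like pattern.

# Sort items from easiest to hardest and people from lowest to highest score,
# then print the rearranged matrix to look for a Guttman-like pattern.
learners = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
results = [
    [1, 1, 0, 0, 1, 0, 0, 0], [1, 1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 1], [1, 1, 0, 0, 1, 0, 0, 1], [1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 0, 1], [1, 0, 0, 0, 1, 1, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
]
n = len(results)
difficulty = [sum(row[i] for row in results) / n for i in range(len(results[0]))]
item_order = sorted(range(len(difficulty)), key=lambda i: -difficulty[i])   # easiest first
person_order = sorted(range(n), key=lambda p: sum(results[p]))              # lowest scorer first

print("   " + "  ".join(f"Q{i + 1}" for i in item_order))
for p in person_order:
    row = "   ".join(str(results[p][i]) for i in item_order)
    print(f"{learners[p]}  {row}   score {sum(results[p])}")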

Unless item difficulty is calculated, you will not be able to identify tasks that may have some technical flaw and which, as a result, become either too difficult or too easy for a group. Finally, the items may be too difficult or too easy for learners to demonstrate their mastery of a subject. For instance, if you have set a mastery level of 85% and the assessment you are using has an average item difficulty of 0.5, then few students will pass. In competency-based testing the actual difficulty level is not pre-set but depends on the relevance of the tasks to the standards. For most criterion-referenced tests you would be looking at 80-95% mastery of knowledge or skills.

FURTHER ANALYSIS OF CRITERION-REFERENCED RESULTS

Item discrimination is a concept that is related to the purpose of an assessment. We use assessments to tell us things that we do not already know and we want the results to describe and discriminate in some way. This is a positive side of discrimination. For instance, we might want to discriminate those who have learnt from those who have not yet learnt, to separate out those who are competent from those who are not yet competent, or to discriminate those who have achieved at various standards of performance. In this section, we shall cover only sensitivity and the Brennan discrimination index. Both of these are discrimination indices. They are time consuming to calculate if you have many items in an assessment and we would not really expect a teacher or trainer to undertake these analyses unless it was a high stakes assessment. In criterion-referenced tests the basis for discrimination is an external criterion.

Sensitivity

The first type of discrimination is the sensitivity of the item to instruction, that is, 'To what extent did the question indicate an effect of instruction?' This discrimination between pre- and post-instruction can be calculated using the following formula:

Sensitivity = (Number correct after instruction − Number correct before instruction) / Number of students who attempted the question both times    (16)

An example of the calculation of the sensitivity criterion-referenced item discrimination index is shown in Table 21.

Table 21. Calculation of the sensitivity index

Person          Item #1 answered correctly     Item #1 answered correctly
                pre-course                     post-course
A                        0                              0
B                        1                              1
C                        0                              1
D                        1                              1
E                        1                              1
F                        0                              1
G                        0                              1
H                        0                              1
I                        0                              0
J                        1                              1
Total correct            4                              8

Sensitivity = (8 − 4) / 10 = 0.4

This index can vary from -1 to +1 and indicates the direction of change. An index of zero would indicate no change, which might occur if all students were able to correctly answer a question both prior to and following instruction.
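For completeness, a minimal Python sketch of this calculation (using the pre- and post-course responses from Table 21) is given below; any spreadsheet would serve equally well.

# Sensitivity to instruction for one item (formula 16), using the Table 21 data.
pre  = [0, 1, 0, 1, 1, 0, 0, 0, 0, 1]   # item answered correctly before the course
post = [0, 1, 1, 1, 1, 1, 1, 1, 0, 1]   # item answered correctly after the course

sensitivity = (sum(post) - sum(pre)) / len(pre)
print(sensitivity)   # 0.4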


Competency discrimination

A discrimination index1 was developed by Brennan for criterion-referenced assessments. This is used to determine whether a question or task can discriminate competency from non-competency. It is especially helpful when you establish cutoff points for performance on a test. For example, if a mastery level of 85% overall has been set, then it is possible for you to see whether a particular item is able to discriminate between those who passed and those who did not achieve mastery. To determine this index you will need to divide the group into those who passed overall and those who failed; then see whether each person passed or failed a particular item. It helps to set this out in a two-way table like the one shown in Table 22.

Table 22. Discrimination index

                        Pass or mastery on an assessment                 Fail or non-mastery on an assessment
Question correct        Number of people with the item correct           Number of people with the item correct
                        and who passed                                   and who failed
Question incorrect      Number of people with the item incorrect         Number of people with the item incorrect
                        and who passed                                   and who failed
Totals                  Total number who passed                          Total number who failed

Discrimination = (N who passed and answered correctly / N of people who passed) − (N who failed and answered correctly / N of people who failed)    (17)

Example: In this example we look at the discrimination of an item and we compare it with the overall result on the assessment. Note that a one (1) on the item represents a correct response and a zero (0) represents a wrong response. The total scores represent the overall grading on the assessment. On this assessment a pass or mastery level of five was set. The trainees who passed are shown in the lower rows of Table 23. This index ranges from -1 to +1. In the example from Table 23 we obtained a discrimination index of 0.25. This is a reasonable value, a little toward the lower range of acceptable values for discrimination. It is possible to obtain both positive and negative item discriminations. An item with a negative discrimination index would not be working in the same direction as the other component items of an assessment. This is usually an undesirable situation in classroom settings. As we mentioned previously, we would not expect you to rush out and calculate these item discrimination indices unless you had nothing better to do.


Table 23. Test results for a pass/fail criterion test

Trainee     Item score     Total score     Pass/Fail decision
I                0              1               Fail
A                0              3               Fail
C                0              3               Fail
F                1              4               Fail
G                1              4               Fail
H                1              4               Fail
E                0              5               Pass
B                1              6               Pass
D                1              6               Pass
J                1              6               Pass
Pass/mastery level = a score of 5

Table 24. Aggregated pass/fail criterion test results

                  Number of students with 'Pass' or          Number of students with 'Fail' or
                  'Mastery' outcome on the assessment        'Non-mastery' outcome on the assessment
Item correct                     3                                            3
Item incorrect                   1                                            3
Totals                           4                                            6

Discrimination = 3/4 − 3/6 = 0.25
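Should you ever need to automate it for a high-stakes assessment, the following minimal Python sketch reproduces the Table 23 and Table 24 calculation; the data are those of the example above.

# Brennan's discrimination index for one item (formula 17), Table 23 data.
# Each tuple is (item score, total score) for a trainee; mastery level = 5.
trainees = [(0, 1), (0, 3), (0, 3), (1, 4), (1, 4), (1, 4),
            (0, 5), (1, 6), (1, 6), (1, 6)]
cut_off = 5

passed = [item for item, total in trainees if total >= cut_off]
failed = [item for item, total in trainees if total < cut_off]
discrimination = sum(passed) / len(passed) - sum(failed) / len(failed)
print(round(discrimination, 2))   # 0.25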

You would need to be working on an important assessment or a high stakes assessment in order to go to all this trouble. We would recommend, however, that you routinely determine the item difficulty for your assessments and monitor the discrimination index of the items you most frequently use to inform your teaching effectiveness.

NORM-REFERENCED ITEM ANALYSIS

This section outlines some aspects of norm-referenced item analysis using indices of item discrimination, such as the point-biserial correlation.

Item difficulty

Item difficulty is the starting point for any item analysis in a norm-referenced environment. If the purpose of an assessment is to distinguish between students on the basis of their ability, then the aim is to have a test with items which have a difficulty level around 50%. Item difficulty values of around 40-60% are usually sought and items with difficulties less than 20% or greater than 80% are omitted as they usually add zero and one respectively to a person's score. Please remember that in this section we are talking about norm-referenced assessments. A different approach applies to criterion-referenced assessments. The reason for seeking an item difficulty of about 0.5 is that the closer the difficulty level approaches 0.5, the more discrimination between assessment takers is possible. At item difficulty equal to 0 or 1, the discrimination is zero (i.e., everyone passes or everyone fails). For example, for a test taken by 100 examinees, an item with a difficulty of 0.5 is passed by 50 and failed by 50, giving 50 times 50 discriminations. For an item difficulty of 0.7 there will be 30 times 70 discriminations. The aim, therefore, is to construct a test with an average item difficulty of around 0.5. Item difficulty gives you valuable information about the difficulty of the question for your group of students and should be calculated routinely for every question that you use in a test.

Item discrimination

Item discrimination is the second measure used for analysing questions. It looks at the ability of a question to distinguish high scorers from low scorers and is an indicator of the sensitivity of a question. Item discrimination answers the question of whether the high scorers on a test were better able to answer the question than the lower scorers. Item discrimination can also be determined by a discrimination index and the simplest measure is to look at the difference in the number of persons in the highest (top 27%) and lowest (bottom 27%) groups on a test who answered a question correctly. Dividing this by the number of persons in a group reduces it to a proportion between -1 and +1. (The reason for choosing 27% is to maximise the difference in ability between the two groups while at the same time having large enough samples in the extreme groups.)

Discrimination = (Students in top group giving the right response − Students in bottom group giving the right response) / (Total number of students in top and bottom groups ÷ 2)    (18)

Steps used in calculating norm-referenced item discrimination
– To calculate the item discrimination you must first separate the scores into groups.
– Take the top and bottom 25%. Use equal numbers in both groups. The ideal proportion is 27% but this may be hard to calculate by hand, so 25% or even 30% can be used in a classroom context.
– You then look at how many in the top and bottom groups answered the question correctly. The idea is that the question should distinguish the high scorers from the lower scorers on the test.
– Subtract the number of people who answered a question correctly in the lower group from the number who answered the question correctly in the top group.
– Then divide by half the number of students in the upper and lower groups combined. (Remember it is easier if the groups are of equal size.)

Determine the item discrimination for the questions from our earlier example. To make it easy, divide the group into the top three, middle four and bottom three. In the top group we have used those trainees with scores of six and in the lower group those trainees with scores of two or three (see Table 25 for an example).

Table 25. How to compute the item discrimination

Trainee                          Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8    TOTAL
I                                 1      1      0      0      0      0      0      0      2
A                                 1      1      0      0      1      0      0      0      3
C                                 1      0      0      0      1      0      0      1      3
F                                 1      1      0      0      1      1      0      1      4
G                                 1      1      0      0      1      1      0      1      4
H                                 1      0      0      0      1      1      0      1      4
E                                 1      0      0      1      1      0      0      1      5
B                                 1      0      0      1      1      1      0      1      6
D                                 1      0      0      1      1      1      0      1      6
J                                 1      0      1      1      1      1      0      1      6
Total correct in lower group      3      2      0      0      2      0      0      1
Total correct in upper group      3      0      1      3      3      3      0      3
Item discrimination               0   -0.6    0.3      1    0.3      1      0    0.6
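The same steps can be scripted. The sketch below is a minimal Python version of the upper-lower calculation using the response data of Table 25; it assumes the rows are already sorted by total score.

# Upper-lower discrimination index (formula 18) for each question in Table 25.
# Rows are trainees I, A, C, F, G, H, E, B, D, J ordered by total score;
# the first three rows form the lower group and the last three the upper group.
responses = [
    [1, 1, 0, 0, 0, 0, 0, 0],   # I
    [1, 1, 0, 0, 1, 0, 0, 0],   # A
    [1, 0, 0, 0, 1, 0, 0, 1],   # C
    [1, 1, 0, 0, 1, 1, 0, 1],   # F
    [1, 1, 0, 0, 1, 1, 0, 1],   # G
    [1, 0, 0, 0, 1, 1, 0, 1],   # H
    [1, 0, 0, 1, 1, 0, 0, 1],   # E
    [1, 0, 0, 1, 1, 1, 0, 1],   # B
    [1, 0, 0, 1, 1, 1, 0, 1],   # D
    [1, 0, 1, 1, 1, 1, 0, 1],   # J
]
lower, upper = responses[:3], responses[-3:]
for q in range(len(responses[0])):
    upper_correct = sum(row[q] for row in upper)
    lower_correct = sum(row[q] for row in lower)
    d = (upper_correct - lower_correct) / len(upper)
    print(f"Q{q + 1}: {d:+.1f}")   # Q2 comes out negative, Q4 and Q6 at +1.0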

The effectiveness of the item will be rated from 0 to 1, with the ideal value at 1.0. Table 26 indicates the action required for different levels of item discrimination. From the above example, it is clear that for some reason item 2 was discriminating negatively and that items 1 and 7 also require revision. Whenever you revise an item, or even re-word it slightly, you have to re-calculate the item statistics because for all intents and purposes it is a different question. A low discriminating power does not necessarily indicate a defective question. The discrimination index is also sensitive to the ability of the group of test takers. For instance, it will generally be greater for a representative group of test takers than for a sample of students who have just completed a course of instruction. Also, item analyses from small samples produce only tentative results. The next section looks at a more sophisticated way of determining item discrimination using correlation coefficients. If you do not have access to a computer to calculate statistics for you, then you can omit this section or come back to it at a later stage.


Table 26. Action required for values of the item discrimination

Discrimination index    Action required
> 0.4                   retain the question
0.3 – 0.39              reasonable question with minor adjustments
0.2 – 0.29              marginal question which needs revision
< 0.2                   poor question which must be completely revised

Point biserial correlation

A more appropriate measure of item discrimination used in test theory is the point biserial correlation. This correlation is calculated between the score on a single test item and scores on the total test. It tells you how well the group answered a test question compared with their overall scores on the test. Again the general assumption is that each question should be passed by those with the highest scores on the test. The point biserial correlation is an index of the relationship between a score on a test and a dichotomous value (i.e., 1 or 0), such as passing or failing a particular question. The formula for and details about the point biserial correlation are available in most standard texts on statistics in psychology. The formula requires you to calculate the standard deviation, which is an index of the dispersion or variation of scores around the average score. Most scientific calculators can be used to determine the standard deviation. An example of the calculation of the point biserial correlation is shown in Table 27.

Table 27. Calculation of the point-biserial correlation

Trainee                              Q3       Q8     Total score on all 8 questions
A                                     0        0        3
B                                     0        1        6
C                                     0        1        3
D                                     0        1        6
E                                     0        1        5
F                                     0        1        4
G                                     0        1        4
H                                     0        1        4
I                                     0        0        1
J                                     1        1        6
Proportion who passed item           0.1      0.8
Proportion who failed item           0.9      0.2
Average score for the pass group      6       4.55
Average score for the fail group      4        2
Standard deviation of test                               1.6
Point-biserial correlation          0.375    0.625
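For readers with access to a computer, here is a minimal Python sketch of the point-biserial calculation applied to question 3 of Table 27. It uses the sample standard deviation computed from the raw totals, so the result (about 0.37) differs slightly from the 0.375 shown in the table, which was obtained with the rounded standard deviation of 1.6.

import math

# Point-biserial correlation between one dichotomous item and the total score.
# Data for question 3 of Table 27 (trainees A to J).
item   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
totals = [3, 6, 3, 6, 5, 4, 4, 4, 1, 6]

n = len(totals)
p = sum(item) / n                                   # proportion passing the item
q = 1 - p
mean_pass = sum(t for t, i in zip(totals, item) if i == 1) / sum(item)
mean_fail = sum(t for t, i in zip(totals, item) if i == 0) / (n - sum(item))
mean_all = sum(totals) / n
sd = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / (n - 1))   # sample SD, about 1.6

r_pb = (mean_pass - mean_fail) / sd * math.sqrt(p * q)
print(round(r_pb, 2))   # about 0.37 (Table 27 shows 0.375, using the rounded SD of 1.6)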


This coefficient is typically calculated by a test analysis program. The index provides a positive or negative number which varies from -1 (a very high, opposite and negative relationship) through 0 (no relationship at all) to 1 (a very strong, direct and positive relationship). The point biserial correlation will be positive if the group that passed the question has a higher average test score overall. It will be negative if the group failing the question has the higher overall test score. The significance of the value produced can be compared against a table of values to determine whether it would occur by chance in, say, five out of 100 cases. For samples of 15 a correlation of about 0.48 is required for significance, for samples of 30 a correlation of around 0.34 is required and for samples of 100 a correlation of around 0.19 is required for statistical significance. You should regard point biserial correlations of around 0.3 or more as being of some use to you and indicating that the question is worth using again.

Biserial correlation

The biserial correlation is another index of the relationship between a score on a test and a dichotomous value (i.e., 1 or 0). In this index it is assumed that the ability underlying the value of 1 (pass) or 0 (fail) is normally distributed and continuous in its range. [The point biserial correlation is preferred because it makes no assumptions about the underlying ability being normally distributed. If the ability being measured by the question is not normally distributed, then the biserial correlation can produce coefficients larger than one.] The formula for the biserial correlation and further details about this index are also readily available in standard texts on psychometrics.

DIFFERENTIAL ITEM FUNCTIONING

Differential Item Functioning (DIF) is present when students from different groups have a different probability or likelihood of answering an item correctly after controlling for overall ability2. In other words, DIF exists when two equally able persons who belong to two different groups (e.g., because of gender, race or age) have a different probability of getting an item correct. Sometimes DIF reflects actual differences in knowledge or skills or experiences. Other times, however, DIF is the result of a systematic error and it is called item bias. Item bias can have detrimental effects on assessment and can invalidate any decisions based on the assessment results. For example, it has been observed that equally able students who have English as an additional language may have a significantly lower probability of answering a mathematics or a science question correctly (compared to students who have English as a first language) because of language difficulties. Note that the important point in the previous statement is the term 'equally able students'. That means that students with English as an additional language are less likely to answer the item correctly not because they are less knowledgeable in mathematics or science but because they are disadvantaged by the linguistic load of the question, a parameter that should have nothing to do with the aims of the assessment.


Item bias is very dangerous in multicultural countries like Australia or England. Language, culture and gender have been identified in the past as sources of item bias. This is why it is very important to spend some time analysing your students' assessment results in order to verify that no significant item bias exists. To ensure that assessments are fair to all examinees, the agencies and organisations responsible for high-stakes exams have formal review processes, which are usually part of the test development procedures. However, in the case of low-stakes classroom assessment, such methods cannot be routinely applied, either because of lack of expertise or because of lack of time. You can, however, as a teacher, be cautious and base the design of your assessments on your experience and common sense. Avoid, for example, difficult language. Research in the context of high-stakes tests in England has demonstrated that language or cultural differences may affect the observed difficulty of the questions3. In other words, language difficulties may affect the way pupils respond to the tests and, as a result, the 'typicality' of their response patterns. The items may artificially appear to be more difficult for them not because they are less knowledgeable but because of language difficulties. It has also been observed that differences in the layout of items could increase or decrease the perceived difficulty of items. Even a slight variation in the presentation of the item may have a significant effect on the achievement of certain students. For example, some students may be miscued by the artwork and therefore respond incorrectly to an item. Although it is necessary to construct items carefully in order to avoid the negative effects of item bias, there is no guarantee that a carefully constructed item will not be biased.

The following method will help you identify whether DIF affects your assessments. It is up to you, then, to interpret the results of the analysis and decide if DIF reflects real differences in knowledge or if it is an artefact of extraneous factors (e.g., language) that intervene in the assessment process. In order to investigate DIF, you must first identify the socio-demographic variables of interest (e.g., gender). Then classify your students into groups according to their raw score. If, for example, the maximum possible score in an assessment is 20, you can classify your students into four groups: 0-5, 6-10, 11-15 and 16-20 marks. There is no rule on the number of groups; it depends on the number of students. Make sure that the number of students per group is not very small. Then, identify the proportion of males and females in every group who answered each question correctly. The following table is an example.

Table 28. Percentage correct per ability group per gender

Group (range of marks)    Male (% of correct responses)    Female (% of correct responses)
0-5                                  17%                              25%
6-10                                 29%                              38%
11-15                                35%                              46%
16-20                                44%                              58%


Table 28 indicates that, for males and females with the same total score (this is the indication of equal ability on the assessed subject matter), the females were much more likely to give a correct response than the males. It is up to you now to interpret this result and identify whether it reflects a real difference in knowledge or whether it is just a spuriously high performance on the part of the females for another reason (irrelevant to the aims of the assessment).

The above technique, however, hides a danger. It may be claimed that if there are items that are biased against a group of students, then the assessment itself may be biased, because the assessment consists of those items. In such a case, the total score cannot be trusted to group the students. Although this sounds reasonable, it is possible that a few items of the test are biased against males while other items are biased against females. Overall, there may be biased items but the assessment result may not be biased. In the case where you suspect that the total score may also be biased against a group, you can try another external criterion (e.g., an external indicator of their ability on the same sub-domain or attribute or skill) to cluster your students into groups of equal ability, as long as you have very good reasons to support such a decision.

Although other, more formal and powerful methods exist for the identification of DIF, they are not presented here because they are technically more laborious and may need the use of appropriate software. Remember that the above method is just a crude approximation of more formal methods and should be used with care. Readers who are interested in reading more about item bias can also consult a very interesting and informative article by Berk.4
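If you keep your results electronically, the crude DIF check described above is easy to script. The following minimal Python sketch assumes that, for each student, you have recorded a group code (here gender), a total raw score and a 0/1 score on the item of interest; the small data set is hypothetical and real use would need far larger groups.

# Crude DIF check for one item: percentage correct per score band per group.
# Each record is (gender, total score out of 20, item score); hypothetical data.
students = [
    ("M", 4, 0), ("F", 5, 1), ("M", 8, 0), ("F", 9, 1), ("M", 12, 1),
    ("F", 13, 1), ("M", 14, 0), ("F", 15, 1), ("M", 18, 1), ("F", 19, 1),
]
bands = [(0, 5), (6, 10), (11, 15), (16, 20)]

for low, high in bands:
    for gender in ("M", "F"):
        group = [item for g, total, item in students
                 if g == gender and low <= total <= high]
        if group:
            pct = 100 * sum(group) / len(group)
            print(f"{low}-{high}  {gender}: {pct:.0f}% correct ({len(group)} students)")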

SUMMARY

Educational assessment offers you a number of decision-making tools for determining the quality of the questions that you use in tests. Item difficulty and item discrimination are the most common measures used in item analysis. These are calculated fairly easily and there are now computer programs that undertake the details of such an analysis for an entire assessment. A number of free shareware assessment analysis programs are also available on the world-wide web. Standard spreadsheet programs such as Excel can be adapted as useful assessment analysis workbooks. Although item analysis is recommended for analysing each assessment that you use, there will be significant sampling error in small groups of less than 30 students. Ideally item analysis should be based on large (>500) representative samples but this is not feasible in classroom contexts. The calculation of item difficulty is a reasonable expectation for most classroom assessments but additional analyses are required for high-stakes testing, especially where certification of performance and competence is expected.


While some people may consider that educational assessment is a highly quantitative exercise, you should have realised that for the most part it is largely a descriptive and qualitative analysis of performance. One exception was the determination of difficulty and discrimination. Item difficulty is just a natural approach to analysis that focuses on the consequences of each task. In the next chapters we shall consider the overall grading of performance and results. These will contain some quantitative aspects but we have tried to minimise them. If you want to delve into more detail about total-test scores, please visit the material at Appendix C.

-oOo-

REVIEW QUESTIONS

T F   Tasks or questions in an assessment are called items
T F   Item analysis refers to the methods for obtaining information about norm-referenced performance
T F   Item analysis guides you when you wish to shorten an assessment
T F   Items in criterion-referenced tests can be analysed using item difficulty
T F   Item difficulty is an index which shows the proportion of students failing a question
T F   The formula for difficulty is: the number of correct responses divided by the number of persons answering
T F   Easy questions have a higher item difficulty value
T F   Difficulty values can range from -1 to 1
T F   If the difficulty is zero then the content of the question was covered in class
T F   Item difficulty is the same as item facility
T F   The score in a Guttman pattern tells you which items were answered correctly
T F   Point-biserial correlation is used for norm-referenced item discrimination
T F   Criterion-referenced item discrimination means that a task separates out high scorers from low scorers
T F   A sensitivity index of 0.3 and greater means that it is a useful item
T F   If the sensitivity is negative then more people answered an item correctly before rather than after instruction
T F   Very easy items have low discrimination for competency

EXERCISES

1. Item analysis for a test showed that a question was answered correctly by 7 out of 10 students in the top group, and 3 out of 10 students in the bottom group.
   – What is the index of item difficulty for this question?
   – Do you consider this question to be effective?


2. Here are the results for 24 students on a 25-question test of knowledge. Only the answers (a, b, c, d) to the first four questions (I, II, III, IV) are listed. At the end of each row is the overall total score on the 25 item test.

Student    I    II    III    IV    Total
1          a    b     a      d     17
2          c    d     a      d     16
3          c    b     a      d     16
4          c    d     a      d     15
5          c    b     a      d     21
6          c    b     a      d     15
7          a    c     d      d     12
8          c    b     a      d     16
9          a    a     d      a     12
10         c    b     a      a     15
11         c    b     a      d     15
12         a    a     a      d     14
13         a    b     a      d     12
14         a    c     b      d     13
15         d    b     b      d      8
16         c    d     d      d     15
17         c    a     b      d      8
18         c    b     a      d     19
19         c    b     c      a      7
20         c    d     a      d     16
21         c    b     a      d     16
22         c    b     a      d     16
23         c    b     a      a     10
24         c    b     c      a      7

Determine the item difficulty for each multiple-choice question (i.e., based on the number who passed the item). The correct responses for questions I, II, III and IV are C, B, A and D.

Visit WebResources where you can find the data for this question in a table. You may select the data, copy and then paste them into any spreadsheet software such as MS Excel if you want to experiment with software analysis. The same holds for the data in Exercise 4 below.


3. Use the data from the previous question. Take the pattern of right and wrong answers for items I to IV and for persons 1 to 24 and set them out as a spreadsheet. Use columns for the questions and rows for the students. Now rearrange the items from easiest to hardest and also sort the students from lowest score to highest score. Comment on the pattern of the results.
4. Calculate the discrimination index of competence for these seven tasks. Decide which tasks are worth retaining as indicators of competence. The minimal cut-off score for competence is 4.

Person    Task 1    Task 2    Task 3    Task 4    Task 5    Task 6    Task 7    Total score
A            0         1         1         1         0         0         1          4
B            0         0         1         0         0         0         1          2
C            1         0         1         1         0         1         0          4
D            1         0         1         1         1         0         1          5
E            0         1         0         1         1         0         1          4
F            1         0         0         0         1         1         1          4
G            0         1         1         1         0         0         1          4
H            1         0         0         1         0         1         0          3
I            1         0         0         1         1         0         0          3
J            0         1         1         0         0         1         0          3
Difficulty =
Brennan discrimination =

Note: minimal competence = 4


CHAPTER 7

OBJECTIVE MEASUREMENT USING THE RASCH MODEL (FOR NON-MATHEMATICIANS)

Teachers and instructors all over the world use tests routinely to monitor the efficiency of their teaching and to draw inferences about learners' knowledge. They use tests to form an idea of, and to draw the best possible picture of, something they cannot directly see, touch or measure. The outcome of a test, however, is usually a mere set of numbers that indicates the success of a person on a group of purposefully built questions. These numbers are usually aggregated and the sum, the raw score, is used as a trustworthy measure of a person's knowledge. Notwithstanding this practice, is the raw score on a test the best descriptor of a person's knowledge?

Many teachers have used raw scores in the same way people use centimetres. If two objects are 10 and 20 centimetres long, then the one has half as much length as the other. If one pupil achieved a raw score of 10 and another pupil achieved a raw score of 20, then it is assumed that the one pupil achieved half the marks of the other. This does not mean, however, that the first pupil has half the knowledge of the second pupil; otherwise one could also assume that a person who scores zero does not know anything and that a person who scores, say, 20 out of 20, knows everything! This certainly cannot be the case.

Georg Rasch, a Danish mathematician, developed a statistical model in the 1960s that allowed us to replace the raw scores of the pupils on a test with a different measure. This measure was found to have much more desirable characteristics than the mere use of the raw scores and since then this Rasch model has gained worldwide acceptance. During the last decades the family of Rasch models has grown considerably and different models are used in education and other disciplines to make sense out of test results.

This chapter will not get into much detail of the technical or philosophical issues of the Rasch models; it will maintain a clearly non-mathematical style. The assumptions of the model and other information and details will unfold gradually and in a non-technical way through the rest of the chapter. The chapter will follow an instrumental approach, aspiring to give the reader the potential to use the Rasch model and interpret its results. The chapter uses an example from teachers and pupils but the ideas contained in the examples are relevant to all areas of educational assessment. The concepts apply equally to secondary education settings, higher education contexts, technical and further education, assessment in commercial and industrial training, as well as other adult or vocational educational assessments.

Visit WebResources where you can find a lot of information about the Rasch model and links to web pages with related material.

ANALYSIS OF TEST RESULTS PUT INTO A CONTEXT

Meaningful educational measurement cannot exist in a vacuum. In order to make this chapter both useful and interesting to teachers and others, it needs to be put into the context of everyday teaching practice. Therefore, the Rasch model will be presented as a useful tool for a group of teachers seeking to develop an arithmetic test and make sense of pupils' test results. Consider the case where the assessment coordinator of a primary school wants to design a test to evaluate the knowledge of 7-year-old pupils in addition (sums up to 100). The test could be piloted in one school, but the final version would be shared with neighbouring schools to avoid duplication of the test development procedure. In a meeting of the assessment coordinators of the participating schools it was proposed that the test should be short in order to save valuable teaching time. After some consideration, a test was designed covering a specific range of the addition curriculum for 7-year-olds (see Appendix E for a copy of the test). The test consisted of 12 questions covering the following sub-domains of the domain 'addition' (see Table 29).

Table 29. Sub-domains tested by the questions

Sub-domain: Problems using the keyword 'more'
  Example: Tommy caught 15 fish. His friend caught 13 more fish. How many fish did they catch altogether?
  Questions: Q2 and Q7 (2 marks)

Sub-domain: Problems using the keyword 'double'
  Example: Jack and Teddy are two good friends. Jack has 7 cookies to eat. Teddy has double the number of cookies that Jack has. How many cookies does Teddy have?
  Questions: Q4 and Q6 (2 marks)

Sub-domain: 'Simple' problems
  Example: Tommy can play 12 songs. Rex knows 19 songs. How many songs can they play altogether?
  Questions: Q1, Q3 and Q5 (3 marks)

Sub-domain: Horizontal and vertical sums
  Questions: Q8 and Q9a,b,c,d (5 marks)

Total: 12 marks

The test was administered to all 7-year-old pupils of the school just after they had covered the relevant curriculum. In all, 80 pupils took the test. Their teachers marked the scripts according to the pre-specified marking scheme.


They entered the data into a spreadsheet and computed several question analysis statistics. They found that the average or mean score was approximately 6.6 out of 12 possible marks (55% of the maximum possible score). The minimum score was 1 mark and the maximum score achieved was 11 marks. No pupil scored zero or full marks. The easiness of each of the questions of the test was computed. Since all the questions were dichotomously scored (dichotomous usually means 0 marks for an incorrect response and 1 mark for a correct response), the facility (or easiness) of each question was represented by the percentage of the pupils who gave a correct answer. For example, 3 out of 80 pupils gave a correct response to question 7, resulting in a facility index (or item difficulty, as we called it in a previous chapter) of 3/80 = 0.04. The following table (Table 30) indicates the facility of the questions.

Table 30. Questions' facility (item difficulty)

Question:  Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10   Q11   Q12
Facility:  0.56  0.83  0.86  0.74  0.53  0.38  0.04  0.60  0.73  0.40  0.64  0.26
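As an illustration only (not something the teachers used), here is a minimal sketch of how facility indices like those in Table 30 can be computed from a spreadsheet of dichotomously scored responses, assuming one row per pupil and one column per question; the small response matrix is hypothetical.

```python
# Facility (item difficulty) = proportion of pupils answering each question correctly.
responses = [
    [1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],   # hypothetical pupil 1: 0 = incorrect, 1 = correct
    [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0],   # hypothetical pupil 2
    [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1],   # hypothetical pupil 3
]

n_pupils = len(responses)
n_questions = len(responses[0])

for q in range(n_questions):
    n_correct = sum(pupil[q] for pupil in responses)
    facility = n_correct / n_pupils
    print(f"Q{q + 1}: facility = {facility:.2f}")
```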

Question 7 (see Figure 15) appeared to be extremely difficult for the pupils: only 4% of them gave a correct response.

Question 7
Tommy caught 15 fish. His friend caught 13 more fish. How many fish did they catch altogether?
Write your answer on this line _____

Figure 15. Question 7.

On the other hand, Questions 2 and 3 were found to be very easy and almost everybody gave a correct response (observe their very high facility indices in Table 30). Question 3 especially did not contribute much information about the pupils because it could not differentiate between them: almost all of the pupils gave a correct response. This might be a good sign if the aim of the test was just to identify mastery or non-mastery of the specific skills. However, if we also aimed to sort our students so that we could identify the less able among them (say, to provide corrective/additional teaching), then Question 3 does not help us much to achieve this goal.


Although this preliminary analysis was useful, the assessment coordinator suggested asking for professional help, and a workshop was arranged with a Rasch model practitioner (RP). The practitioner would explain the nature and the philosophy of the Rasch model to the teachers in simple terms and would train them in the use of relevant software.

INTRODUCTION TO THE RASCH MODEL

The RP began:

RP: Whenever a pupil attempts to respond to a test question, we usually expect two possible outcomes: either a correct or an incorrect response. However, you must have observed that not everybody gets the same number of questions correct. You cannot expect everybody to have the same chance of giving a correct response to a specific question, can you?

One teacher said:

T: No, we expect that more competent pupils will get more questions correct… I mean… a more knowledgeable pupil has a better chance of getting a difficult question correct than a less knowledgeable pupil…

RP: So you have raised two issues here. You talked about 'competent' and 'less competent' pupils. You then repeated your statement by replacing competence with knowledge. You considered that it is the knowledge of the pupil that has to do with the probability of giving a correct response….

T: Well, I assumed that each pupil had some quantity of an 'ability to do the sums correctly', and when we, the teachers, designed the test we … errr … we thought we could measure this ability using some questions …

RP: You mentioned that a more knowledgeable/able pupil would have a higher chance of getting a question correct. Yes, this is correct, but is there a way to find the probability that somebody gives a correct response to a question? Assume that knowledge/ability is infinite (it probably is, right?) and assume that it extends from minus infinity (for less knowledgeable persons) through zero (for persons of 'average' knowledge) to plus infinity (for the most knowledgeable persons). Visualise ability as a very long ruler that has no beginning and no end, and you can stand anywhere on this ruler. The questions of the test measure your ability and locate you on the ruler in the same way a physical ruler is used to measure your height…

T: But what do the questions have to do with this ruler?

RP: Every question has a degree of difficulty, right? Some questions are more difficult than others, and some are very difficult. If we want to be able to find how likely you are to answer a question correctly, we must be able to compare your ability with the difficulty of the question. Visualise this as two people standing shoulder to shoulder and comparing their height. If the difficulty of the question is lower than your ability, then you are likely to give a correct response. However, to do this, we need to put questions' difficulties and your ability on a common ruler – on a common scale.

Figure 16. A common scale for person abilities and question difficulties.

RP revealed Figure 16 and commented that the pupils’ ability and the questions’ difficulty could be aligned on the same ruler for purposes of comparison. He went on saying: RP: Look at Figure 16. It is reasonable to assume that a correct response is more likely when the ability of a pupil is larger than the difficulty of the question. Imagine this as the effort of an athlete to jump over a height. If the ability of the athlete was higher than the height then we would expect him/her to succeed nearly all the time. If the ability of the athlete was significantly lower than the height, we would expect him/her to fail most of the time. In the case where the height was on the border of his/her ability, we would expect the athlete to have approximately 50% success. The first person in Figure 16 has an ability of –3 units on the ruler and the difficulty of the second question is 0 units. It is reasonable to assume that a person with ability –3 units has a very small probability of giving a correct response to a question with a difficulty of 0 units because the difficulty of the question is far larger than the ability of the person. In the case where a question is just right for a person (when the difficulty of the question is the same as the ability of the person) there is a probability of 50% for a correct response. For example, the second pupil on the ruler has 50% probability of giving a correct response to question number 2 because they both lie on the same location on the scale.



The teachers found the concept of a measurement 'unit' a bit vague and asked for clarification. The Rasch practitioner went on:

RP: It should be noted, at this stage, that the unit of measurement in the context of the Rasch model is called a logit. The logit is the unit of measurement that results when the Rasch model mathematically transforms the raw scores of the pupils and locates them on the ability scale. The same happens with questions: the Rasch model transforms the number of pupils who gave a correct response to a question into a difficulty estimate, which is measured in logits and locates the question on the same scale as the pupils. Imagine logits to be similar to centimetres, but instead of being used as the measurement unit of length, they are used to measure ability. In the case of the sums test, the logit is used as the unit of measurement on the scale that measures ability to do the sums up to 100, as defined operationally by the questions of the test.

One of the teachers asked whether the Rasch logit scale has a specific theoretical or practical range of values.

RP: The logit scale does not have a specific range of values but theoretically ranges from minus infinity to plus infinity. Large negative values usually mean low ability/difficulty and large positive values usually mean high ability/difficulty. Ability or difficulty of zero logits does not mean no ability or no difficulty at all. Consider zero to be a point on the scale with no special meaning. It is like any other ordinary number. A person with ability of zero logits is more able than a person of ability –1 logit but is less able than a person with ability of 1 logit. Numerical comparisons can also be made between logits. For example, the difference in difficulty between questions 1 and 2 (3 logits) is half the difference in difficulty between questions 1 and 3 (6 logits). Both persons' and questions' measures (ability and difficulty respectively) are expressed in logits, and this is the term to be used hereafter.

Figure 17 shows the probability for a correct response on a question with difficulty δ = 0 logits and on a question with difficulty δ = 1 logit. Notice that the two lines have the same shape, the only difference being that the second one is shifted by 1 logit towards the right. It is apparent from the graph that the larger the ability of a person, the larger the probability of a correct response. It can be observed that if a person has a large positive ability measure (e.g., +3 logits) then the probability of a correct response approaches 100% for both questions. On the other hand, if a person has a large negative ability measure (e.g., –4 logits) then the probability of a correct response approximates zero for both questions. It is interesting to note that for an ability of 0 logits, the probability of a correct response is 50% for the question with difficulty 0 logits and just below 30% for the question with difficulty 1 logit. On the other hand, a person with ability 1 logit has a 50% probability of giving a correct response to the question with difficulty 1 logit, but the probability of a correct response to the question with difficulty 0 logits is over 70%.



Figure 17. A comparison of the probability for a correct response on two questions with different difficulties.
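The probabilities quoted above follow from the usual mathematical form of the dichotomous Rasch model, which the Appendices derive informally. A minimal sketch for readers who want to check the numbers for themselves:

```python
import math

def p_correct(ability, difficulty):
    """Rasch model probability of a correct response; both measures in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(round(p_correct(0.0, 0.0), 2))   # 0.50 -> 50% when ability equals difficulty
print(round(p_correct(0.0, 1.0), 2))   # 0.27 -> just below 30%
print(round(p_correct(1.0, 0.0), 2))   # 0.73 -> over 70%
```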

In all cases the probability of a correct response to the difficult question is smaller than the probability of a correct answer to the easy one. Table 31 illustrates some basic relationships between the ability of a person, the difficulty of a question and the probability of a correct response. The ability of the Rasch model to predict the outcome of the interaction between a person and a question is probably one of its strongest points and has made the Rasch model extensively used around the world. It gives the user of the Rasch model the potential to predict the outcome of a test for a person with some accuracy even before the administration of the test (provided the ability of the person and the difficulty of the questions are known or may be hypothesised accurately).

Table 31. The relationship between ability, difficulty and the probability of a correct response

If ability is larger than the difficulty       Then the probability of a correct response is larger than 50%
If ability is smaller than the difficulty      Then the probability of a correct response is smaller than 50%
If ability equals the difficulty               Then the probability of a correct response equals 50%
As ability gets bigger than the difficulty     The probability of a correct response increases
As ability gets smaller than the difficulty    The probability of a correct response decreases

A mathematical formulation of the Rasch model and an informal derivation of the relevant equations are attempted in the Appendices. Although extensive knowledge of the mathematics hidden behind the model is not essential, you may find that the relevant section contributes considerably to a fuller understanding of the Rasch model.

ANALYSIS OF TEST RESULTS USING THE RASCH MODEL

The RP arranged for a hands-on experience believing that the best way for the teachers to make sense out of the Rasch world would be to run their own analyses using their own data. For this purpose, they would use the pupils’ results from the administration of the sums or arithmetic test. The aim of this instruction was to give information about some of the most basic characteristics of the Rasch model that one usually reads in any standard Rasch analysis output.


Figure 18. A common scale with abilities and difficulties.

The data were copied and pasted from a spreadsheet and a first analysis was run. The first output demonstrated the scale (the ruler) on which the pupils and the questions were located (see Figure 18). The teachers saw that the estimates of the pupils (their abilities) ranged from approximately –2.5 logits to 3.5 logits. The estimates were well spread, and it seemed that they had an approximately normal-looking distribution (more pupils in the middle and fewer in the two tails). For example, approximately 30 pupils had ability measures around zero. The question estimates (difficulties) were also spread out. They started from around –2 logits (questions 2 and 3) and extended to +4.5 logits (question 7). The question at the upper part of the scale was clearly an extremely difficult one: very few pupils could get it correct. Table 32 displays information about the questions.



Table 32. Questions’ Rasch statistics

Question number   Score (# correct)   Estimated difficulty   Error of estimate
1                 45                  -0.17                  0.26
2                 67                  -1.77                  0.32
3                 70                  -2.09                  0.36
4                 59                  -1.09                  0.28
5                 42                   0.02                  0.26
6                 31                   0.73                  0.27
7                  3                   4.45                  0.64
8                 48                  -0.43                  0.26
9                 59                  -1.21                  0.29
10                32                   0.63                  0.27
11                51                  -0.63                  0.27
12                21                   1.50                  0.31
Mean              44                   0.00

RP explained Table 32 to the teachers.

RP: Let us all have a look at the table. The first column indicates the physical order in which the questions appeared in the test. The second column indicates the raw score of the question. This is the number of pupils who gave a correct response to that question. For example, it can be observed that only three pupils gave a correct response to question 7, which appears to be the most difficult. The third column is the difficulty of the questions in logits. This is the number that is used to compute the probability that a pupil gives a correct response to a question.

T: But what is the meaning of the term 'error of estimate' in the fourth column?

RP explained that this number gave an indication of the precision with which the difficulty of a question was computed.

RP: Whenever we give a test to a group of pupils, the question estimates are simply an approximation of the truth – of the true question difficulty. The error of estimate is used to remind us that the true difficulty of a question is never known. For example, using the data we have collected, we estimated the difficulty of question 12 to be 1.5 logits. What we know in reality is that the true difficulty of question 12 should be around 1.5 logits.

T: And is it necessary to know the value of the error? I mean … is this really important?

RP: It is very important because it allows us to compare the difficulties of two questions and decide whether one is truly more difficult than the other. It is possible that the difference we have found between the difficulties of two questions may be a characteristic of the specific sample of pupils we used; that is, the observed differences may be a result of chance. Sometimes one question looks more difficult than another, but if we select another sample of pupils' responses, say, from another school, we may find slightly different results.

T: And how do we use the error of estimate to compare the difficulties of two questions?

RP: Figure 19 illustrates the estimated difficulties of the questions with their associated errors. The dot in the middle of each vertical line represents the estimated difficulty. The plausible values of the 'true' difficulty of a question lie between the upper and the lower bounds of the vertical lines. It can be seen that the estimated difficulties of questions 2 and 3, for example, are not statistically distinguishable. Although question 2 seems at first glance to be more difficult than question 3, the ranges of their plausible values indicate that the two questions could, in fact, have the same difficulty.

T: How did you draw the ranges of the plausible values for the difficulties of the questions?

RP: Well, it is really very simple. To find the lower bound of the plausible values of a difficulty, we subtract two times the error of estimate from the difficulty. To find the upper bound, we add two times the error of estimate. For example, the true difficulty of question 12 is somewhere between 1.5 – 2 × 0.31 = 0.88 and 1.5 + 2 × 0.31 = 2.12 logits (where 0.31 is the corresponding error of estimate). This makes us 95% confident that the difficulty of question 12 should lie somewhere between 0.88 and 2.12 logits.

T: Apparently, the smaller the error of estimate, the greater the precision of the estimation. But how can we improve our precision?

RP: What we always desire is to minimise those errors of estimate in order to get more precise estimates. This can usually be done by administering the questions to a larger number of pupils. You see, the larger the number of pupils who attempt to respond to a question, the richer the information and, therefore, the more precise the estimate of the difficulty of this question.

One of the teachers stepped forward.

T: Hold on a second! All the questions were administered to the same number of pupils – 80 pupils in total. However, not all the questions have the same magnitude of error of estimate. For example, question 7 has a really large error but the other questions have much smaller ranges of plausible values. Why is that?

RP: Well, there is another factor that affects the precision of the estimate of a question's difficulty: the matching between question difficulty and pupil ability. What do we mean by that? Look again at the logit scale with the pupil abilities and question difficulties (Figure 18). A few of the questions are located at the lower part of the scale and one is located at the upper part of the scale. These questions are not well targeted. That means that the difficulty of those questions is far from the abilities of the majority of the pupils.

Figure 19. The estimated difficulties and the range of the ‘true’ difficulties.
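A minimal sketch of the '± two errors of estimate' rule described above, using values from Table 32 (Rasch software reports these ranges itself; this is only an illustration):

```python
# Approximate 95% plausible range for a difficulty: estimate ± 2 * error of estimate.
difficulties = {            # question: (estimate in logits, error of estimate), from Table 32
    "Q12": (1.50, 0.31),
    "Q7": (4.45, 0.64),
    "Q2": (-1.77, 0.32),
    "Q3": (-2.09, 0.36),
}

for question, (estimate, se) in difficulties.items():
    lower, upper = estimate - 2 * se, estimate + 2 * se
    print(f"{question}: {lower:.2f} to {upper:.2f} logits")

# The ranges for Q2 (-2.41 to -1.13) and Q3 (-2.81 to -1.37) overlap substantially,
# so their difficulties are not statistically distinguishable.
```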

T: So, do we need only questions of medium difficulty, because we get more precise measures?

RP: No, of course not! We need a few difficult and a few easy questions, but we always try to use at least a few questions with difficulties that match the abilities of the pupils in whom we are most interested. If we are generally interested in getting precise measures of ability for the majority of the pupils, then we need more questions of medium difficulty. If we are mostly interested in measuring the more able pupils with better precision, then we need more difficult questions. Finally, if we want to focus our measurement on the less able pupils, we need more easy questions. Now look at Figure 20. The questions with the extreme difficulties (either extremely negative: very easy, or extremely positive: very difficult) also have larger errors of estimate.

RP: You should realise, however, that the questions of average difficulty have smaller errors because most of the pupils have abilities that match their difficulties. If most of the pupils were of high ability, then the more difficult items (the ones at the right) would be the items with the smaller standard errors.




Figure 20. Question difficulty vs error of measurement.

HAVE YOU KEPT YOUR MODEL FIT?

The next session started with the RP trying to introduce the teachers to some additional, but fundamental, aspects of the Rasch models.

RP: We've learned enough for the time being about the precision of measurement. Let us now focus on the quality of measurement. One of the most important statistics in the context of the Rasch model is the Infit Mean Square. This statistic is based on the residuals (the discrepancies) between the expected responses of people (predicted by the Rasch model) and their observed responses1. If a very able pupil attempts a very easy question then we would expect the pupil to give a correct response. If, however, the pupil gives an incorrect response, then we can identify a large discrepancy between the predicted response of the pupil and the actual response. Large and frequent discrepancies between pupils' responses and their expected responses on a question indicate that the question may not work as intended by those who wrote it. When do you think this might happen?

T: Well, when we get unexpected responses from pupils to a specific question, then we are always suspicious. The question's wording may be confusing, its artwork may be complex or misleading, or the question may be assessing some knowledge irrelevant to the taught curriculum. Maybe the question should not be included in the test in the first place. Does Table 33 display the questions of the test sorted according to their Infit Mean Square statistic?

RP: Exactly! What this fit statistic (as we may also call it) does is measure the average mismatch between the responses of the pupils and the Rasch model. In effect, the larger the infit mean square, the larger the discrepancies between the model and the responses. When a question has a large fit statistic, it means that the question does not behave according to the Rasch model (i.e., as expected by the Rasch model). It means that, frequently, the model may expect a correct response from the pupils but the pupils give an incorrect response, and vice versa.

T: Is this so bad? I mean, in real life this happens every now and then.

Table 33. Information on the questions of the test sorted by their fit statistic

Question   Score   Estimate   Error of estimate   Infit mean square
1          45      -0.17      0.26                1.25
6          31       0.73      0.27                1.22
5          42       0.02      0.26                1.13
3          70      -2.09      0.36                1.08
11         51      -0.63      0.27                1.07
4          59      -1.09      0.28                0.92
2          67      -1.77      0.32                0.90
12         21       1.50      0.31                0.88
10         32       0.63      0.27                0.87
9          59      -1.21      0.29                0.83
7           3       4.45      0.64                0.82
8          48      -0.43      0.26                0.80
Mean       44       0.00
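For the curious, a minimal sketch of how an infit mean square for a single question might be computed, using one common information-weighted formulation. Real Rasch software estimates abilities and difficulties jointly and may apply further refinements, so treat this purely as an illustration of the idea:

```python
import math

def p_correct(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def infit_mean_square(item_difficulty, abilities, responses):
    """Information-weighted mean square residual for one item (illustrative)."""
    expected = [p_correct(theta, item_difficulty) for theta in abilities]
    variances = [p * (1 - p) for p in expected]
    squared_residuals = [(x - p) ** 2 for x, p in zip(responses, expected)]
    return sum(squared_residuals) / sum(variances)

# Hypothetical mini-example: three pupils attempting an item of difficulty 0.5 logits.
print(round(infit_mean_square(0.5, [-1.0, 0.5, 2.0], [0, 1, 1]), 2))
```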

RP: Oh, yes, this indeed happens sometimes. And that's why the Rasch model is also called a 'stochastic model': it gives you the probability that a person will give a correct or an incorrect response, but it cannot tell you for sure what will happen. When the predictions of the Rasch model are too unreliable, however, this is very bad: it implies that a misfitting question does not work in the same way the other questions in the test do. It is likely that this question may not be measuring the same thing as the other questions. For example, a mathematics question with a heavy linguistic load (i.e., with long or confusing wording) may be misfitting because it measures not only ability in mathematics but also linguistic ability. This problem is usually called a violation of unidimensionality.

T: Unidimensionality? What is this?

RP: Well, unidimensionality means that all the questions of the test measure a single ability. We all know, of course, that there is no test that measures only a single ability. Many other factors can intervene in the process of measurement, such as excessive linguistic demand in the questions; the format of the questions may invite cheating or guessing (e.g., multiple-choice questions); a person may be distracted by noise or by fatigue; and so on. However, a test is usually called one-dimensional or unidimensional if it behaves as if it measures mainly one single ability (hopefully the one for which the test was built).

T: Can you give us a more concrete example of unidimensionality?

RP: When we want to measure the height, for example, of a person, we do not care about their weight. We simply focus on measuring the height correctly. Therefore, we do not use a weighing scale; rather, we use a ruler. We select the measurement instrument very carefully; we do not want any disturbances and noise in our measurement procedure. Whenever we want to measure an ability, such as the 'ability to answer sums' (like the test we have already seen), we focus only on the administration of questions that really relate to doing sums. We try to avoid difficult words and complex problems that confuse the pupils unnecessarily. This is called unidimensionality: we only measure one ability, so we only need questions that focus, as much as humanly possible, on this ability. That is why we start worrying when we identify a misfitting question: we know that this question may not contribute to the concept of measurement; rather, it disturbs our effort to measure the ability. It is like trying to measure height with a weighing scale; you are mistaken and you will get incorrect results.

T: This brings to my mind an old discussion we had at the school about the validity of the assessments we use. We concluded that we should be careful to include in the test only questions relevant to the curriculum (or the skill) we wanted to assess. I think that this fit statistic is very useful. But how large is a large misfit?

RP: Well, it depends on the intended use of the test. 'Reasonable' question mean square ranges have been proposed for various types of tests by Wright2 and Linacre in 1985 (see Table 34).

T: How seriously should we take those values?

RP: Although researchers have proposed various cut-off scores, we should make clear that these are just rules of thumb. One should always check the data carefully and possibly apply different rules and cut-off scores. Are there any questions in the test that appear to be misfitting according to the rules of thumb mentioned in Table 34?

T: Question 1 has a fit statistic which approaches the cut-off score of 1.3. Do you think that the infit mean square for question 1 is big enough for it to be a misfitting question?

RP: Well, the fit statistic of question 1 is between 1.2 and 1.3. Should we regard this question as misfitting? I am afraid that there is no definite answer. My advice to you, however, is to apply the suggested cut-off scores very judiciously. It has been shown that those fit statistics may not be appropriate all the time. Think of misfit as a continuum, as something all questions have to some extent, not as a property that a question either has or has not. Then, try to decide for yourself, using the suggested cut-off scores as a guide, whether a question is misfitting according to the type of the test, its significance, the intended use and the like.

Table 34. Reasonable question infit and outfit mean-square ranges

Type of test                        Range
High stakes                         0.8 – 1.2
Run of the mill                     0.7 – 1.3
Rating scale – survey               0.6 – 1.4
Clinical observation                0.5 – 1.7
Judged – agreement encouraged       0.4 – 1.2

T: Do you really expect us to make personal judgements on something in which we are not experienced enough, and without the use of clear pre-defined cut-off scores?

RP: Well, I am afraid that you must use your professional judgement and your intuition as well. Needless to say, you will need some experience too, but you will build this gradually. There is, fortunately, one golden rule to aid your quest for the identification of 'misfitting questions'. Sort all the questions using the infit mean square as the key, in descending order. Examine the questions one by one, starting from the top of the table and progressing towards the bottom. Ask yourself every time: 'Do I consider the fit statistic of this question excessive? Why? What may be the reasons that made this question have such a large (if it is large) fit statistic?' Go back to the test and inspect the question. Try to see if there are any wording problems, check whether the artwork is misleading, and investigate whether the question is assessing something not in the curriculum covered in the class. When you reach a point where the fit statistics are not worrying, you can stop your investigation.

T: So, even if a question is not clearly misfitting according to the cut-off scores, do we still need to investigate the possible sources of the misfit?

RP: Even if the most misfitting question of a test has a relatively small fit statistic (e.g., Question 1 has an infit mean square of 1.25) you still need to do some investigation. Don't forget that ideally (in a perfect world) your questions should have an infit mean square of 1.0. An infit mean square of 1.25 may not be too bad, but it still needs some explanation. When trying to explain why a question is not behaving as expected, we first check the question's difficulty. It is sometimes possible that very easy or very difficult questions are misfitting merely for statistical but not for substantive theoretical reasons. Try to check your marking scheme. You may need to re-mark a few tests. Try to spot confusing words or phrases. Think of the fit statistic as a friend, not as a foe: you can use it as a red flag to warn you of problems with your questions.

T: When we find that a question is misfitting, a problem arises: should the question be kept, transformed, or removed and replaced?

RP: Well, it seems that there is no definite answer to this question either. When you include a question in a test you do so for very good reasons. It is certainly not wise to keep deleting questions from a test simply because of large fit statistics. Do not forget that in this way you may end up deleting all questions with a special characteristic, and this may mean that you have actually changed the nature of the test. Think of the validity of your tests: not only questions relevant to the tested curriculum but also questions that are representative of the curriculum should be included in a test. All the important parts of the curriculum should be covered. Bearing this in mind, the rejection of a question only because of negative statistical data is not a desirable solution. However, a misfit in a Rasch analysis can be an indication that the nature of the question should be reviewed once again. If the numbers say that a question is severely misfitting but you cannot identify any problems to fix, you may decide to remove it, especially if you have other similar questions which test the same sub-domain and, probably, have approximately the same difficulty.

T: So, what do we do with question 1 in our sums test?

RP: In this test, question 1 is the most misfitting question (Table 33). Our task now is to find out why this happens. Check whether this question measures something different from the rest of the questions in the test. If you cannot identify something profoundly wrong, and if you think that this is an important question that has a role to play in the test, then do not remove it simply for a fit statistic of 1.25!

T: I realise that we can have questions with large fit statistics but we can also have questions with very low fit statistics. What happens when a question has a very low fit statistic?

RP: Well, in the first place, having a very low fit statistic is called overfit. We should make it clear that overfit is not as bad as misfit. Overfit does not disturb the meaning of measurement. Overfit for a question simply means that when the pupils have higher ability than the difficulty of the question, they may give a correct response more frequently than expected by the Rasch model. It also means that pupils with lower ability than the difficulty of the question tend to give an incorrect response more frequently than expected by the model. Convention suggests that questions with an infit mean square smaller than 0.7 show significant overfit. No question in this test had an infit mean square as low as 0.7. However, if one did, we should not be tempted to remove it from the test without further substantive theoretical reasons. Remember that overfit does not mean that a question measures a different ability from the rest of the questions. It simply means that the responses of the persons to this question are too predictable.

Figure 21. Question 1.

T: Too predictable? But the whole point is that we want to predict the responses of the persons. Don't we expect a person to give a correct response whenever his/her ability is larger than the difficulty of a question?

RP: It doesn't mean that a person who has a larger ability than the question's difficulty will always give a correct response. It merely means that he/she is more likely to give a correct rather than an incorrect response. Overfit is the case where (a) pupils give correct responses more frequently than expected (by the Rasch model) when their ability is larger than the difficulty and (b) pupils give incorrect responses more frequently than expected (by the Rasch model) when their ability is smaller than the difficulty of the question. However, Rasch practitioners often do not worry a lot about overfit and many of them are happy to turn a blind eye to overfitting questions.

THE IDENTIFICATION OF MISMEASURED INDIVIDUALS

The quality of educational measurement has always concerned teachers. Factors like increased test anxiety, cheating, copying or a sudden illness can invalidate measurement, making the results of a test unusable or, even worse, misleading. The Rasch model, however, provides the necessary tools to evaluate the quality of measurement for individuals. It is possible to use statistics to identify those persons for whom the test score is not a valid indicator of their true ability. The output of the Rasch analysis provided information about pupils' achievement on the test. Table 35 demonstrates several statistics for a selected sub-sample of the pupils.

Table 35. Pupils' Rasch statistics

Pupil     Score (# correct)   Ability   Error of estimate   Infit mean square
…         …                   …         …                   …
John      3                   -1.56     0.74                1.02
Mary      7                    0.24     0.70                0.85
Nicky     7                    0.24     0.70                0.85
Susan     5                   -0.63     0.68                0.78
Jason     5                   -0.63     0.68                1.12
Mike      5                   -0.63     0.68                0.94
Bill      6                   -0.20     0.68                0.64
Pamela    11                   3.51     1.48                0.18
Anna      11                   3.51     1.48                0.18
…         …                   …         …                   …
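As an aside (not part of the workshop), the Ability column can be read as the point on the logit scale where a pupil's expected score, given the question difficulties, equals his or her raw score. The following sketch uses the difficulties from Table 32 and a simple bisection search; real Rasch software estimates persons and questions jointly, so the numbers only approximately reproduce Table 35:

```python
import math

difficulties = [-0.17, -1.77, -2.09, -1.09, 0.02, 0.73, 4.45,
                -0.43, -1.21, 0.63, -0.63, 1.50]   # Table 32, Q1 to Q12

def expected_score(theta):
    """Sum of Rasch probabilities over all questions at ability theta."""
    return sum(1.0 / (1.0 + math.exp(d - theta)) for d in difficulties)

def ability_for_raw_score(raw, lo=-6.0, hi=6.0):
    """Find theta whose expected score equals the raw score (bisection sketch)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_score(mid) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(ability_for_raw_score(7), 2))   # close to the 0.24 reported for Mary
print(round(ability_for_raw_score(5), 2))   # close to the -0.63 reported for Susan
```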

The column with the heading ‘Score’ indicates the number of correct responses. A score of 7 for Mary, for example, indicates that she answered correctly 7 out of the 12 questions. The column with the title ‘Ability’ indicates the ability of the pupils measured in logits. The error of estimate shows the precision of measurement. The last column is the fit statistic for the response pattern of each pupil. A teacher asked: T: We have said previously that the fit statistic is a red flag raised by the Rasch model whenever a question does not fit very well with the rest of the questions. What is the role of the fit statistic for the pupils? RP: A large fit statistic, for example, larger than 1.3 according to Karabatsos3 may mean that the ability of the pupil was probably mismeasured and that the ability awarded may not be a valid indicator of his/her true ability. Large fit statistics for a pupil may mean that he/she gave incorrect responses to easy questions and correct responses to difficult questions. As we have said before, the Rasch model is a probabilistic model and as such, it accepts that we humans, are likely to behave unexpectedly and get easy questions incorrect and difficult questions correct. If, however, a pupil gets very difficult questions correct but fails on very easy questions, the Rasch model concludes that the response pattern of the pupil is aberrant or invalid and flags it with a large fit statistic…



RP: … On the other hand, very small fit statistics (e.g., smaller than 0.7) may indicate that a pupil gave correct responses more frequently than expected (by the Rasch model) when his/her ability was larger than the difficulty of the question. It may also mean that he/she gave incorrect responses more frequently than expected when his/her ability was smaller than the question's difficulty.

T: Hold on a second! Isn't this logical? A pupil is expected to give incorrect responses to the difficult questions and correct responses to the easy questions. What is wrong with that?

RP: The answer is that nothing is wrong; the pupil has a very good response pattern. Actually, the response pattern of the pupil is too good to be true according to the Rasch model. Look at Table 36. The questions are sorted from the easiest to the most difficult. Nick has a very reasonable fit statistic. He failed the easiest question and the third easiest question, but he succeeded on the rest of the easy questions. He also failed all the very difficult questions. Although one might be surprised that the pupil failed the easiest question, the Rasch model accepts some inconsistency because of our imperfect human nature. The Rasch model found Nick's response pattern natural.

RP: On the other hand, Pam was awarded a very low fit statistic of 0.53. The reason is that she got a continuous string of correct and a continuous string of incorrect responses. The Rasch model considered this response pattern to be too nice: no accidentally correct responses on the most difficult questions (because of selective knowledge, for example) and no unexpected incorrect responses on the easiest questions (because of carelessness, for example).

Table 36. Pupils' response patterns (questions ordered from easiest to most difficult)

Pupil   Q3  Q2  Q9  Q4  Q11  Q8  Q1  Q5  Q10  Q6  Q12  Q7   Infit mean square
Nick    0   1   0   1   1    0   0   0   0    0   0    0    1.02
Pam     1   1   1   1   1    1   1   0   0    0   0    0    0.53

RP: The Rasch model is not saying that this response pattern is invalid. It is just saying that it is too perfect and that if we administer the same test to the same pupil again (assuming that no questions are memorised), she may give us a slightly different response pattern (without perfectly continuous strings of correct and incorrect responses). For example, next time, it is likely that she may get one of the easy items incorrect because of boredom, fatigue, carelessness or lack of concentration.



T: How much should we worry about overfit?

RP: Don't worry too much about it. A low fit statistic does not indicate a mismeasured pupil. Let us spend more time on misfit rather than on overfit. Look at Table 37. It shows a few of the most misfitting pupils.

The Rasch practitioner asked the trainees:

RP: Would anyone like to interpret Table 37?

Table 37. A few of the most misfitting pupils

Pupil     Score   Ability   Error of estimate   Infit mean square
Luke      11       3.51     1.48                3.40
Anne      11       3.51     1.48                3.03
Susan     11       3.51     1.48                3.03
Mark       4       0.08     0.96                1.67
Bill       1      -2.55     1.16                1.57
Thomas     8       0.72     0.75                1.51

One of the teachers decided to attempt an interpretation of the table. T: Let me see… according to the table, Luke is the most misfitting person. With an infit mean square of 3.4, his response pattern surely qualifies for further inspection. The same holds for Anne and Susan. Anne, Luke and Susan are very able; they all have a score of 11 and their ability is 3.51 logits. Mark and Bill are not very able but they have given somewhat aberrant response patterns. They are not as misfitting, though, as the first three pupils and you could easily turn a blind eye to their fit statistic. RP: Well done, very good comments! Could you also tell us what makes the response patterns given by those pupils misfitting? T: I am afraid, I’ll need to see their actual responses to the questions. RP: OK, you are right. Table 38 illustrates the actual response patterns to each one of the questions of the test for those five pupils. T: Question 1 has an estimate of -0.17 and question 12 has an estimate of 1.5. The ability estimate of Anne, Luke and Susan was 3.51. Because the ability of Anne, Luke and Susan is much larger than the difficulty of the questions, we would expect the pupils to give a correct response on both. However, Anne and Susan failed question 12 and Luke failed question 1. This comes as a surprise to us. This is one reason why the three pupils appear to be misfitting.



Table 38. Response patterns of pupils misfitting the Rasch model

Pupil   Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11  Q12
Luke    0   1   1   1   1   1   1   1   1   1    1    1
Anne    1   1   1   1   1   1   1   1   1   1    1    0
Susan   1   1   1   1   1   1   1   1   1   1    1    0
Mark    1   1   0   1   0   1   0   NA  NA  NA   NA   NA
Bill    0   0   0   0   1   0   0   NA  NA  NA   NA   NA

Key: 0 and 1 indicate an incorrect and a correct response respectively; NA indicates that the pupil did not attempt to answer that question.

RP: Yes, however this may only be half the story. Look at question 7: it has a difficulty of 4.45. Do you think that the responses of Luke, Anne and Susan to this question were expected? After all, the pupils' ability is approximately 1 logit less than the difficulty of question 7. Their probability of a correct response must be relatively small. What do you think?

T: … yes, you are probably correct ….

RP: What about Bill?

T: Well, this pupil has given only one correct answer – question 5. This is a question of medium difficulty, but Bill has a very low ability, which was estimated to be –2.55 logits. I would expect this pupil to give an incorrect response, but apparently this was his only correct response! He did not attempt the last five questions.

RP: Table 39 illustrates the residuals for each of the questions of the test for those pupils. You can spend some time checking for yourself whether your predictions about the sources of misfit were correct.

RP repeated that the residuals are the discrepancies between the actual response of the pupil and the expected response. For example, the probability of a pupil giving a correct response to a question may be estimated by the Rasch model to be, say, 0.15. In other words, the pupil may have a 15% probability of giving a correct response. If the pupil finally gives a correct response, then the discrepancy between the expected and the actual response is given as:

Residual = observed response – expected response

In numbers, the residual in the above example is 1 – 0.15 = 0.85.

RP: Who would like to comment on Table 39?

T: Well, it is apparent from the table that the misfit for Luke is generated by a highly unexpected incorrect response on the first question and an unexpected correct response on question 7.



RP: Correct. Using the table of residuals can help to diagnose the source of the high fit statistics. On the other hand, Bill had a different source of misfit. He gave a highly unexpected correct response on question 5. It is, however, evident that the actual reasons that caused the misfit cannot be known without in-depth study such as interviews.

Table 39. Table of discrepancies between expected and observed responses to questions

Question difficulty (in logits): Q1 = -0.17, Q2 = -1.77, Q3 = -2.09, Q4 = -1.09, Q5 = 0.02, Q6 = 0.73, Q7 = 4.45, Q8 = -0.43, Q9 = -1.21, Q10 = 0.63, Q11 = -0.63, Q12 = 1.5

Residuals:

Pupil   Ability   Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8    Q9    Q10   Q11   Q12
Luke    3.83      -1     0      0      0      0.01   0.03   0.73   0     0     0.03  0     0.09
Anne    3.83       0     0      0      0      0.01   0.03   0.73   0     0     0.03  0    -0.92
Susan   3.83       0     0      0      0      0.01   0.03   0.73   0     0     0.03  0    -0.92
Mark    0.09       0.43  0.1   -0.93   0.2   -0.52   0.67  -0.01   N/A   N/A   N/A   N/A   N/A
Bill   -2.78      -0.08 -0.31  -0.38  -0.18   0.93  -0.03  -0.01   N/A   N/A   N/A   N/A   N/A
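A minimal sketch of this residual calculation for Luke's two surprising responses, using the Rasch probability introduced earlier and the estimates from Tables 32 and 37; rounding means the numbers only approximate Table 39:

```python
import math

def p_correct(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Luke: ability about 3.51 logits (Table 37); Q1 and Q7 difficulties from Table 32.
luke = 3.51
for question, difficulty, observed in [("Q1", -0.17, 0), ("Q7", 4.45, 1)]:
    expected = p_correct(luke, difficulty)
    residual = observed - expected
    print(f"{question}: expected {expected:.2f}, observed {observed}, residual {residual:.2f}")
# Q1 gives a residual near -1 (a very unexpected failure),
# Q7 a residual near 0.7 (an unexpected success).
```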

T: So, a few of the pupils gave a couple of highly unexpected responses to questions. Is this enough to flag their response patterns as 'misfitting'?

RP: It is true that only two aberrant responses cannot convince us to discard the response pattern of a pupil as totally misleading and unusable. We would be more confident in our judgement if the test were longer (i.e., consisted of more questions), because as it is, two unexpected responses can affect the fit statistic considerably. However, the length of the test in this case is given, it cannot change, and we must do our best with the existing information. The fit statistic of the pupils is just an indication that something may be wrong with their response pattern.

HOW DO WE TREAT THE OMITTED RESPONSES?

Omitted responses to questions are a very frequent phenomenon in classroom testing. People who run out of time, people who suffer from test anxiety, or simply people who are tired or bored by the test are likely not to provide responses to one or more questions. Teachers have traditionally treated those omitted questions as incorrect, in the sense that no marks were awarded for them. The impact on the person's estimated ability of treating the omitted responses as incorrect can be huge. Put simply, if the missing responses are treated as incorrect, the score of Bill in Table 38 is one out of twelve and the percentage correct is 1/12 = 8.3%. If the omitted responses are treated as 'not attempted' instead of incorrect, then Bill's score remains one, but this time out of seven questions. The percentage correct for Bill then becomes 1/7 = 14.3%. Although the score in absolute values remains the same, the percentage correct indicates a more knowledgeable person in the second case, where the omitted questions are not treated as incorrect.

The Rasch model can estimate pupils' abilities in both ways. The Rasch analysis discussed in the previous sections treated the omitted responses as not attempted instead of incorrect. A second Rasch analysis was run after converting the omitted responses to incorrect. The following example (see Table 40) shows the ability estimates for two pupils whose omitted responses were treated the first time as not attempted and the second time as incorrect.

Visit WebResources, where you can find more information about the use of the Rasch model for test development purposes.

Table 40. The effect of treating the omitted responses as incorrect

Pupil   Score (# correct)   Estimated ability (omitted treated as 'not attempted')   Estimated ability (omitted treated as incorrect)
Andy    3                   -0.73                                                    -1.56
Mark    4                    0.08                                                    -1.07
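A minimal sketch of the two coding choices, using Bill's responses from Table 38 (illustrative only):

```python
# Two ways of coding Bill's responses before a Rasch analysis.
# 1 = correct, 0 = incorrect, None = question not attempted.
raw = [0, 0, 0, 0, 1, 0, 0, None, None, None, None, None]

as_not_attempted = raw                                   # missing stays missing
as_incorrect = [0 if r is None else r for r in raw]      # missing recoded to 0

attempted = [r for r in as_not_attempted if r is not None]
print(sum(attempted) / len(attempted))        # 1/7  = about 14.3% correct
print(sum(as_incorrect) / len(as_incorrect))  # 1/12 = about 8.3% correct
```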

It is obvious that the ability estimates of the pupils can change considerably depending on whether the omitted questions are treated as incorrect or as not attempted. The ability of the Rasch model to deal efficiently with omitted responses in both ways is one of the characteristics that make it so useful and desirable among practitioners. However, the issue of treating omitted responses as not attempted or as incorrect is mainly a methodological and philosophical one. If the assessor considers that the test-takers should attempt all questions, and if this is made clear to them, then omitted responses cannot be treated as not attempted but should rather be treated as incorrect. On the other hand, whenever it is believed that the test-taker did not skip the questions because they were too difficult, but simply did not attempt to answer them, omitted responses may be treated as not attempted instead of incorrect. As these are mainly methodological and philosophical issues, they cannot be solved by purely statistical methods. Table 40 indicates that our decisions can affect pupils' estimates heavily, and our preferences must be based on a thorough theoretical rationale. It is encouraging that the Rasch model offers a very simple and straightforward method for dealing with omitted responses.

TEST DEVELOPMENT USING RASCH MODEL

After the administration of the test and the collection of the data, it was decided that a reduction in the time needed to complete the test was necessary. Obviously, the easiest way to reduce this time was to reduce the number of questions. However, it was not necessary for the teachers to completely remove questions from the test, since the Rasch model can accommodate the administration of different subtests to different groups of pupils. A linked design administers different subtests to groups of different ability; the goal is to administer subtests that better match the ability of the pupils. The subtests should have a few questions in common, so that the Rasch model can link the estimates from the subtests and put all questions and people on a common metric scale. Such a solution gives the opportunity to keep all the questions, so that information about learning can be acquired from all of them, but each person only has to complete a subtest. Therefore, the time spent on testing can shrink considerably.

Table 41. Test equating using the Rasch model

Group                              4 easiest questions   4 questions of medium difficulty (common subtest)   4 most difficult questions
Group of less competent pupils     Administered          Administered                                        Not administered
Group of more competent pupils     Not administered      Administered                                        Administered

In our example, the existing sums test could be split into two subtests of eight questions each, with four common questions between the two: subtest E may consist of the eight Easiest questions and subtest D may consist of the eight most Difficult questions. The pupils can also be split into two groups: group L may consist of the Less competent pupils and group M may consist of the More competent pupils. The final design is shown in Table 41.

The pupils in our example were clustered into two groups, named the 'More able' and 'Less able' pupils, according to their raw score. The test was also split into two subtests with four common questions. The pupils that belonged to the 'Less able' group were administered the easier test. The pupils of the 'More able' group were administered the more difficult test. All the pupils with a raw score of six or less were considered to make up the 'Less able' group.

A new Rasch analysis was run using the linked design of Table 41. This analysis included only a part of the original dataset. For each pupil only eight responses were used instead of 12. For the low-ability pupils only their responses to the eight easiest questions were included in the analysis. For the high-ability pupils, only their responses to the eight most difficult questions were used. The abilities of the pupils and the difficulties of the questions derived from the new analysis ('linked design' analysis) were compared with the estimates of the first analysis where all the data were used ('original' analysis).

According to Figure 22, the question difficulties estimated from the linked design analysis did not change significantly from the original analysis, although one third of the questions were completed only by the most competent pupils and another third only by the less competent pupils. Even in the few cases where the estimated difficulties changed a bit more, none of the changes was statistically significant, meaning that they might have happened purely by chance. (The fact that the difficulties did not change considerably can be seen from the fact that the points of the graph lie on the straight line or very close to it.)


Figure 22. Comparison of question estimated difficulties (Linked design vs Original analyses).

Figure 23 illustrates a comparison between the pupils' estimated abilities using all their responses and their estimated abilities using only their responses on the appropriate subtest.


Figure 23. Comparison of pupil estimated abilities (Linked design vs Original analyses).

According to the graph above, the pupils' estimated abilities did not change considerably between the two analyses. This indicates the power and the flexibility of the Rasch model in dealing with complex designs and missing data. Although in the linked design analysis only eight responses were used for each pupil, the estimated abilities were statistically the same as the estimated abilities of the original analysis where all the data were used. Summarising, thanks to the Rasch model, the sums test could be reduced to two subtests: an easy one for the less competent pupils and a difficult one for the more competent pupils. No significant loss of information would result, and the pupils would spend only two thirds of the time needed to complete the full test.
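A minimal sketch of how the linked-design data file could be prepared before re-running the analysis, masking the questions that each group did not take. The index sets follow the difficulty ordering of Table 32, and the grouping rule (raw score of six or less) follows the text; the function name is only illustrative:

```python
# Prepare one pupil's record for the linked design of Table 41.
# None marks a question that is not part of the pupil's subtest.
EASIEST = [2, 1, 8, 3]     # Q3, Q2, Q9, Q4  (0-based indices, easiest per Table 32)
MEDIUM = [10, 7, 0, 4]     # Q11, Q8, Q1, Q5 (the common subtest)
HARDEST = [9, 5, 11, 6]    # Q10, Q6, Q12, Q7

def to_subtest(responses, raw_score):
    """Keep only the eight questions belonging to the pupil's subtest."""
    keep = set(EASIEST + MEDIUM) if raw_score <= 6 else set(MEDIUM + HARDEST)
    return [r if i in keep else None for i, r in enumerate(responses)]

responses = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0]   # a hypothetical pupil, raw score 7
print(to_subtest(responses, raw_score=7))
```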


Figure 24. The ability estimates and their error of estimates.

The smaller number of questions used to determine the ability of each pupil in the linked design affected the precision of measurement. Usually, the larger the number of questions administered to the pupils, the better the precision of their estimated ability. However, one other factor affects the precision of measurement of the pupils' abilities in the Rasch model: the matching between the pupils' abilities and the questions' difficulties. For example, when a person is administered off-target questions, that is, questions that are too difficult or too easy compared to his/her ability, then not much information is gained about this person. The person will probably get most of the very easy questions correct and most of the very difficult questions incorrect. In such a case, we do not have enough information to determine the ability of this person with much precision: we know that the ability of the person lies on the scale somewhere between the very difficult and the very easy questions, but we cannot estimate the exact location. On the other hand, if this person is administered questions that match his/her ability, we can identify his/her ability in the area where he/she gets half of the questions correct and half of the questions incorrect. The larger the number of questions attempted around a person's ability, the more confident we are that we can identify that area. In that case, the information is much richer and the precision is greater.

Figure 24 indicates the relation between the pupils' ability estimates and their errors of estimate. The dependence of the error of estimate on the ability estimate is demonstrated by the graph. Extreme ability estimates have large errors of estimate because those pupils are more likely to have been administered off-target questions (too easy or too difficult for them) that do not contribute much information to the estimation of their ability. If, however, the test included many difficult questions, then the ability estimates of the most able pupils would be more accurate, because they would be administered many questions with difficulties that matched their ability. Therefore, their errors of estimate (at the right of the figure) would be relatively small. So, it is really the matching between item difficulty and the distribution of abilities that reduces the errors of estimate.
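For completeness, a minimal sketch of the usual Rasch standard error of an ability estimate, one over the square root of the test information at that ability, which is what Figure 24 displays. It uses the question difficulties from Table 32; because real software estimates everything jointly, the values only roughly match Table 35:

```python
import math

difficulties = [-0.17, -1.77, -2.09, -1.09, 0.02, 0.73, 4.45,
                -0.43, -1.21, 0.63, -0.63, 1.50]   # Table 32

def standard_error(theta, difficulties):
    """SE of an ability estimate = 1 / sqrt(test information at that ability)."""
    information = 0.0
    for d in difficulties:
        p = 1.0 / (1.0 + math.exp(d - theta))
        information += p * (1 - p)          # well-targeted questions contribute most
    return 1.0 / math.sqrt(information)

print(round(standard_error(0.24, difficulties), 2))   # about 0.70, as for Mary in Table 35
print(round(standard_error(3.51, difficulties), 2))   # noticeably larger for an extreme ability
```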

THE ASSUMPTIONS OF THE RASCH MODEL

Statistical models usually base the validity of their results on specific assumptions. Failure to take the assumptions of a model into consideration can cause the model to fail, invalidating the results of the analysis. This section deals with the assumptions of the Rasch model and its robustness to violations of those assumptions. The first assumption of the Rasch model discussed by the teachers was unidimensionality. The term was familiar from a previous discussion.

RP: In a previous session we talked about unidimensionality when we were testing the fit of the questions. It is enough to remind us at this moment that unidimensionality is a very important concept for the Rasch model. Unidimensionality has to do with the abilities we measure using a specific test. In the case of our sums test, the Rasch model assumes that all 12 questions measure the same ability: the ability to do sums up to 100.

T: But is this realistic? If you go back to the blueprint of the test you’ll see that we have clustered the questions of the test into various categories (Table 29). For example, questions 2 and 7 formed their own sub-domain … err … I mean that those questions test whether the pupils understand the meaning of the keyword ‘more’. Other questions, for example 4 and 6, are based on the comprehension of the word ‘double’. If we have categorised the questions of the test into sub-domains from the very beginning, doesn’t this mean that we accept that these questions test a slightly different thing … and doesn’t this violate the assumption of unidimensionality?

Another teacher defended the same position:

T: A striking example of a violation of unidimensionality in our test is the fact that the first seven questions are verbal problems (they have a relatively heavy linguistic load) but the last five questions are simply vertical and horizontal sums with only a minor linguistic demand. My experience tells me that some children may be very able in vertical and horizontal sums but they may not be very capable in solving problems like the ones presented in questions 1 to 7.

RP: Yes, I find it reasonable for you to worry, but you may worry unnecessarily. It is possible to fit a Rasch model on the results of a test that actually measures a few highly related abilities. (For example, in high-profile studies such as TIMSS, a number of researchers have run Rasch analyses on datasets including data from both science and mathematics questions and no severe model-data misfit problems were found.) In the case of our test, we may measure slightly different abilities, but they may be so closely related that the Rasch model may be happy to assume that we measure only a single overall ability. Whether the Rasch model actually recognises the results of the test as coming from a unidimensional test or not is something that can only be determined very carefully by inspecting the output of the Rasch analysis. If we identify badly fitting questions then this might be a strong indication that the questions of the test do not fit together to create a unidimensional scale.

T: So the Rasch model may accept as unidimensional a test that measures several highly related abilities?

RP: Only if the combined effect of those very similar abilities acts in the same, overall, way on all of the items of the test. Bejar⁴ suggested that unidimensionality did not necessarily mean that the performance on the questions was due to a single cognitive process. Instead, he proposed that a variety of cognitive processes could be involved as long as they functioned in unity, that is, as long as each question in the test was affected by the same processes and in the same form.

T: So, in our case, we could assume that our sums test measures a single overall mathematical ability which may be called ‘doing the sums’ and is defined operationally by the questions of the test. Are there any other assumptions we should learn about?

RP: Yes! Another main assumption of the Rasch model is the assumption of local independence. This assumption says that the response of a person to a question should not affect his/her responses to other questions. For example, previous questions should not give hints or insights for the solution of the next questions. This is a necessity that is highly related to the mathematical foundations of the model. You understand that this assumption can be easily violated when we have …

T: … Questions with sub-questions? a teacher interrupted him.

RP: Well, yes, this could easily be the case. When you have one question with many sub-questions, you cannot treat them as being different and independent. For example, consider the following hypothetical question (Figure 25). It is the same as question 2 in our sums test but I simply added a second sub-question which is linked to the first sub-question (hence the two are not independent). The pupils are now asked to double their first answer. But if a pupil is not in a position to find the first answer how can he/she find the second answer? The pupils might be able to double a number correctly but they will give an incorrect response to the second sub-question if they do not get the first correct. Do you agree?

Figure 25. The modified version of question 2 with two sub-questions.

The teachers agreed. RP’s arguments sounded convincing and they realised that this type of question was very frequent in their tests at school. ‘But how can we analyse this type of data?’ asked somebody.

RP: Well, in order to analyse such data we need a slightly more complex version of our model. This is called the ‘Partial Credit Rasch’ model and we can award zero marks for an incorrect response to both of the questions, 1 mark for a correct response to the first sub-question and a second mark for a correct response to the second sub-question as well. But don’t rush – we still have a few assumptions of the Rasch model to review.

RP: The administration of a test is expected to be non-speeded or untimed, that is, it is expected to be a power test. The pupils should have enough time to attempt all the questions in the test. We do not want to have situations where the number of persons reaching the last questions shrinks suddenly. This will simply make the questions look harder, especially if we plan to treat the unreached responses as incorrect. The assumption of non-speeded test administration can be checked in many ways. One easy way is to compare the number of examinees who did not attempt some easy questions (usually at the beginning of the test) and the number of examinees who didn’t attempt some difficult ones (usually at the end of the test). If the numbers are comparable then this is an indication of non-speeded test administration, but it’s only an indication, not a proof.

RP: The next assumption of the Rasch model has to do with guessing. Can somebody guess what this assumption says?

T: Hey, I think I can guess what this assumption says. According to this assumption, the pupils cannot guess the correct answer on the questions.

RP: Correct! Congratulations – I guess this was an easy one.

RP: Yes, minimal guessing is one factor that should always be checked before the use of the Rasch model. However, guessing is usually suspected for multiple-choice questions where even the less competent pupils can get a few questions correct. If low-ability examinees appear to have performance levels close to zero on the most difficult questions then the assumption of minimal guessing holds. However, in the case of your sums test it is very difficult for the pupils to guess the correct answer. Guessing is usually a problem with multiple-choice questions.

T: So, have we finished with the assumptions?

RP: Well, not quite! The next assumption talks about the power of the questions to discriminate between the more and less able pupils.

T: What do you mean when you say that a question discriminates between pupils?

RP: Well, we have built the test because we primarily want to assess pupils’ knowledge and distinguish between more and less able ones. There is no point in using questions that everybody will fail or everybody will get correct. The point is that we use the questions because we want to identify the more able pupils. So it is important to differentiate between high and low achievers. In other words, we want to discriminate between high and low achievers because we expect more low achievers to fail the questions and more high achievers to answer the questions correctly. Therefore …

T: …we needed questions that would discriminate between the pupils.

RP: Exactly. Now, the Rasch model says that the questions of the test should have the same power to discriminate between the pupils. For example, imagine the case where half of the most able pupils and half of the less able pupils give a correct response to a question. What happens is that this question cannot discriminate between the pupils and may not be in agreement with the majority of the questions of the test that actually discriminate between the pupils properly. After all, we expect more able pupils to answer the questions correctly … and the Rasch model demands that the questions discriminate between the pupils in a similar way.
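One rough, classical way to look at this last point is sketched below. The corrected item-total correlation used here is not a Rasch fit statistic, only a quick screening device for whether the questions discriminate in a broadly similar way, and the tiny response matrix is invented for illustration.

```python
# A hedged sketch (invented data): a quick, classical check of whether questions
# discriminate in a broadly similar way, using corrected item-total correlations.
import numpy as np

responses = np.array([   # rows = pupils, columns = questions (1 = correct, 0 = incorrect)
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])

totals = responses.sum(axis=1)
for q in range(responses.shape[1]):
    rest = totals - responses[:, q]                 # score on the remaining questions
    r = np.corrcoef(responses[:, q], rest)[0, 1]
    print(f"Question {q + 1}: corrected item-total correlation = {r:.2f}")
```

Questions whose correlations stand far apart from the others are candidates for closer inspection; in a proper Rasch analysis the same pattern would usually surface through the fit statistics discussed earlier.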


CAN THE TEST BE SPLIT INTO TWO EDUCATIONALLY MEANINGFUL SUB-SCALES?

The RP asked the teachers how they could investigate whether the test might consist of two different sub-scales. In other words, he asked them how one might be able to identify whether the test was successful in measuring two slightly different abilities (one for the verbal problems and one for the vertical and horizontal sums) instead of one single ability (addition up to 100). One teacher suggested:

T: It seems that the test works well as a single scale. We did not have any worrying fit statistics when we ran our first analysis (Table 33) except for the fact that question 1 had a slightly large infit mean square. However, in order to investigate whether the test consists of two different sub-scales (verbal problems vs horizontal and vertical sums) I would be tempted to run separate analyses on the two potential subscales. Let’s call the first scale ‘Verbal sums problems’ (questions 1-7) and the second ‘Horizontal-Vertical sums’ (questions 8-12).

RP: OK. This is a very good suggestion. Let’s run the analyses. Table 42 indicates the results of the Rasch analysis using only the first seven questions, which can be described as the ‘Verbal sums problems’ scale. The columns with the title ‘Original analysis’ demonstrate the questions’ statistics that were derived from the Rasch analysis using all 12 questions of the test in a single analysis. The table thus compares the statistics of the first seven questions when they are analysed as a stand-alone test (only questions 1-7) or as part of the whole test (Original analysis).

Table 42. The ‘Verbal sums problems’ sub-test

                 Only questions 1-7                          Original analysis*
Question   Score   Estimate   Infit Mean Square      Estimate   Infit Mean Square
1           43      -0.30          1.33               -0.17          1.25
2           65      -1.84          0.88               -1.77          0.90
3           68      -2.16          1.12               -2.09          1.08
4           57      -1.19          0.97               -1.09          0.92
5           40      -0.12          0.86                0.02          1.13
6           29       0.56          0.79                0.73          1.22
7            1       5.02          0.95                4.45          0.82
Mean        43.29    0.00          0.99                0.01          1.05

* Using all the questions of the test in a single analysis.

RP: Can you please comment on the results of these analyses?

T: It seems that the first question has a slightly worse fit… I mean it was almost misfitting in the original analysis when the whole test was used as a single scale … but in fact the fit of this question is worse when we use only the questions of the scale ‘Verbal sums problems’. I don’t understand why this happens… according to our theory the fit of the questions should improve because questions 1-7 should form a very consistent ‘Verbal sums problems’ scale.

[Figure: the question estimates from the ‘Verbal problems’ analysis (horizontal axis) plotted against the estimates from the original analysis (vertical axis).]

Figure 26. Question estimates using two different subtests.

RP: Well, let’s compare the estimates of the questions … if the Rasch model holds, the estimates of the difficulty of questions 1-7 should not change significantly between the two analyses … correct?

T: I guess so…

RP: OK, this is the graph (see Figure 26) which compares the question estimates for the two analyses … the estimates do not seem to change considerably from one analysis to the other.

T: OK… Let us summarise all the findings: the fit of question 1 gets worse, which means that this question was ‘happier’ when we used all the questions of the test to form a single scale, but the fit of the question was not excellent before either.

RP: Well, it seems that question 1 really does not match the rest of the questions very well… it seems that this question measures a slightly different aspect of sums … and it’s not really a problem-type question, is it? I mean, look … as an experienced teacher, are you happy to assume that this question is a problem-solving situation? It looks to me as if this question is simply a horizontal addition ... or maybe subtraction …? What do the others think?


T: Hmmm … yes… you may be right. It is not a real problem-solving question … but still the results show that, overall, the ‘Verbal sums problems’ scale is a good one.

RP: OK, let’s now check the table with the statistics of the second sub-test (Table 43), the ‘Horizontal-Vertical sums’ scale.

Table 43. The ‘Horizontal-Vertical sums’ scale

                            Questions 8-12                                     Original analysis*
Question   Score   Estimate   Error of Estimate   Infit Mean Square     Estimate   Infit Mean Square
8           34      -0.50          0.30                0.78              -0.43          0.80
9           45      -1.44          0.35                0.86              -1.21          0.83
10          18       0.73          0.32                0.92               0.63          0.87
11          37      -0.73          0.31                1.39              -0.63          1.07
12           7       1.92          0.43                0.82               1.50          0.88
Mean        28.20    0.00                              0.95              -0.03          0.89

* Using all the questions of the test in a single analysis.

RP: What do we understand from this table?

T: Apparently, the fit of question 11 got significantly worse. I mean, it was OK in the first analysis but now it seems that this question turned out to be misfitting. How is this possible? If we inspect question 11 we’ll see that there is nothing that makes this question different from the other questions… check it out for yourself in Figure 27.

T: The only difference I can see is that question 11 is the only one that asks the pupils to add a number to zero, but I am not sure if this is a reason for misfit; I can’t tell. My experience tells me that this question should fit nicely with the rest of the scale. I am reluctant to say that this question should be removed from the scale…

RP: Your professional judgement is important. If I were you, I would trust my professional judgement but I would also try to figure out what is going wrong with the question. We should remember that this sample only has 80 pupils. This sample size, although perfectly legitimate for an initial (exploratory) Rasch analysis, may sometimes flag questions as misfitting for no apparent reason. You see, a few unexpected responses become more important if the sample size is small. If we had data from, say, 200 pupils, this question might not be so severely misfitting.

RP: So, what is the conclusion of all these analyses? If my memory serves me well, we were trying to see whether two different sub-scales exist. We split the test into two parts and tried to see if we could formulate two legitimate scales.


Figure 27. ‘Horizontal-vertical sums’ scale.

T: Well, it seems that the two separate scales might not be that bad, but the single scale seems to be fine as well. In fact, I might be tempted to keep all the questions in one scale. The statistics are not bad for the practical intents and purposes of a low-stakes informal assessment instrument, and a single scale serves our initial aim, which was the generation of a unidimensional test to measure the ability of the pupils to do the sums.
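A short numerical companion to Figure 26 (a sketch, not the analysis actually run in the book): if the two sets of difficulty estimates for questions 1-7 are typed in from Table 42, their agreement can be summarised with a correlation and the size of the shifts.

```python
# A minimal sketch: comparing the question 1-7 estimates from the
# 'Verbal sums problems' analysis with those from the original 12-question
# analysis (values copied from Table 42).
import numpy as np

subtest  = np.array([-0.30, -1.84, -2.16, -1.19, -0.12, 0.56, 5.02])
original = np.array([-0.17, -1.77, -2.09, -1.09,  0.02, 0.73, 4.45])

shift = subtest - original
print("correlation:", round(float(np.corrcoef(subtest, original)[0, 1]), 3))
print("mean shift (logits):", round(float(shift.mean()), 2))
print("largest absolute shift:", round(float(np.abs(shift).max()), 2))  # question 7
```

The largest shift belongs to question 7, which only one pupil answered correctly, so its estimate is in any case very imprecise; the overall picture supports the teacher's conclusion that the estimates are essentially unchanged.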


-oOo-

REVIEW QUESTIONS

T F  In the simple Rasch model, if the ability of an examinee equals the difficulty of an item then the examinee has more than 50% chance for a correct response.

T F  Overfit is more dangerous for a question than misfit because it indicates that the question may not measure the same ability as the other questions.

T F  Overfit is the case where the infit mean square of a question is larger than 1.3.

T F  The assumption of unidimensionality demands that all questions test material from exactly the same sub-domain.

T F  Misfitting persons must be identified because their ability estimate may not be a valid indicator of their true ability.

T F  Omitted responses may be scored either as missing or as incorrect without significant effects on the ability estimate of the persons because Rasch is a robust measurement model.


T F  Unexpectedly correct responses may be identified by comparing the ability of a person with the difficulty of a question.

T F  Questions that should not be part of a test because they test a different ability or trait than the rest of the questions may be identified because they have a large fit statistic.

EXERCISES

1. In which ways may the Rasch model be useful when developing and evaluating a teacher-made test?
2. What are the factors that affect the size of the error of measurement for the item and person Rasch estimates?
3. Which are the main assumptions of the Rasch model?


CHAPTER 8

THE PARTIAL CREDIT RASCH MODEL

When a person encounters a question, the outcome is not always either a correct or an incorrect answer. Often it is desirable to identify the intermediate shades that lie in-between a completely correct and a completely incorrect answer in order to award credit for partial knowledge. In educational measurement this is usually done by awarding ‘partial credit’ for responses that are neither totally correct nor totally incorrect. Figure 28 gives an example of a question that is suitable for partial credit.

Question: Find the sum 1/4 + 2/3 = ? Show how you worked to find the answer.

Possible answers, ordered from the one indicating less knowledge to the one indicating more knowledge, are illustrated below:

(a) 1/4 + 2/3 = 3. No demonstration of the procedural knowledge needed to solve this type of problem. Incorrect response. Marks awarded: 0

(b) 1/4 + 2/3 = 3/7. The pupil demonstrates only a vague understanding of fractions. Marks awarded: 1

(c) 1/4 + 2/3 = 1/12 + 2/12 = 3/12. The pupil demonstrates understanding of the need for a common denominator. Marks awarded: 2

(d) 1/4 + 2/3 = 3/12 + 8/12 = 11/12. The pupil gives a correct response. Marks awarded: 3

Figure 28. An example of awarding partial credit for incomplete responses.

The possible responses are ordered according to the amount of competence they illustrate. Although it would be much easier to award 0 marks to categories (a), (b) and (c) and 1 mark to category (d), this would lead to a loss of information. It might be, for example, diagnostically important to distinguish between totally ignorant pupils and pupils who have almost mastered the skill. Another type of question that invites partial credit is demonstrated in Figure 29. It illustrates a modified version of question 2 on the sums test studied in the previous sections.

Figure 29. The modified version of question 2 with two sub-questions.

There are a number of possible outcomes when a pupil attempts to answer the question of the previous figure. A less competent pupil may give an incorrect response to both the first and the second parts of the question. A more competent pupil may give a correct response to the first part of the question but may fail the second part. On the other hand, a very competent pupil may succeed in both the first and the second parts of the question.

Table 44. The possible outcomes when a pupil attempts the question of Figure 29

Outcome   First part   Second part   Total marks   Comments
1             ×            ×              0        A less competent pupil
2             √            ×              1        A competent pupil
3             ×            √              1        Impossible outcome
4             √            √              2        A more competent pupil

Note: × denotes an incorrect response; √ denotes a correct response; each correct response is awarded a mark.

It is impossible for somebody, however, to fail the first part and succeed in the second part because this would mean that by doubling the incorrect number you can find the correct result! (It might happen, of course, if somebody copied the response to the second part from a more able neighbour.) Table 44 illustrates the possible outcomes of the ‘confrontation’ between a pupil and the above question. According to the table, the possible outcomes are either 0 or 1 or 2 marks. However, the Rasch model described in the previous chapter (usually called the Simple Rasch model because it can only accommodate incorrect/correct marking) is not suitable for test results where partial credit is awarded. When the responses of the pupils are ordered in more than two categories, for example, totally incorrect (0 marks), partly correct (1 mark) and completely correct (2 marks), then another model, the Partial Credit Rasch model, is the model of choice to analyse the test results. The Partial Credit model is conceptually very much the same as the simple Rasch model in the sense that it uses measures for the difficulty of a question and measures for the ability of the pupils in order to compute the probability of a specific outcome. The probability that a person is awarded each of the three possible marks is demonstrated in the following figure as a function of the ability of the person and some difficulty measures of the question.


Figure 30. The score probability lines for a question.

The y-axis demonstrates the probability of each of the possible outcomes (0, 1 or 2 marks), which ranges between 0 and 1. The x-axis demonstrates the ability of a hypothetical pupil. The three curves in the figure illustrate how the probability of the pupil obtaining each one of the three scores changes as ability increases. For very low abilities, the probability of getting 0 marks is almost 100% while the probability of getting either 1 or 2 marks is almost zero. As the ability grows, the probability of a score of 0 marks reduces until the line of ‘0 marks’ meets the line of ‘1 mark’. This happens at an ability of approximately –1 logit. At this point, the probability of getting 0 marks is the same as the probability of getting 1 mark, so it is a 50-50 chance whether the pupil will get 0 or 1 marks. At this point, the probability of 2 marks is still small. From this point up to an ability of approximately 1 logit, the probability of 1 mark is the largest. At an ability of approximately 1 logit the line of ‘1 mark’ meets the line of ‘2 marks’. At this point the pupil has a 50-50 chance to be awarded 1 or 2 marks. Pupils with higher ability than that are more likely to be awarded 2 marks.

In order to draw the above graph we need to know the difficulty of achieving each of the score categories. In the case of the simple Rasch model the formula included only one parameter for the people (the ability) and one parameter for the questions (the difficulty of getting 1 instead of 0). In the case of the Partial Credit model we need to introduce additional parameters because a single δ (difficulty) parameter cannot describe the question fully. We need to give information about the difficulty of achieving each one of the categories ‘1 mark’ and ‘2 marks’. The previous figure is repeated below but this time more information is provided.

Figure 31. The score probability lines for a question.

The point where the probability lines of the categories ‘0 marks awarded’ and ‘1 mark awarded’ meet is called the step measure and it is denoted by δ1 = −1 logit, meaning that ‘the measure of the first step is −1 logits’. This indicates that a person with ability −1 logit has a 50% chance of being awarded 1 mark instead of 0 marks. An ability larger than the step measure means that the person has more than a 50% chance to be awarded the larger mark instead of the lower. An ability smaller than −1 logit, for example, may indicate that the person has less than a 50% chance to get the first mark and therefore will probably get 0 marks.


The measure of the second step is denoted by δ2 = 1 logit and means that if a person has an ability of 1 logit and he/she succeeded in getting the first mark, then he/she has a probability of 50% to be awarded the second mark as well. Notice that this is a conditional probability in the sense that it gives the probability for a person with a given ability to get the second mark provided he/she managed to get the first mark. If the question had a third mark available for an even more complete answer, then the step measure of the ‘3 marks’ would be the transition point between the 2 marks and the 3 marks. If that step had a measure of 2 logits, then a pupil with ability 2 logits would have a 50% chance to get the third mark provided he/she succeeded in the second mark. In other words, the step measure is the transition point from one step to the next, that is, from the score below to the next score on a question. At the point of transition the pupil has a 50-50 chance to be awarded either the lower score or the next. From now on, for the sake of convenience and to be consistent with the previous sections, we will call the ‘step measures’ step difficulties.

Assume that a person has an ability of 0 logits. What are the three probabilities that this person will get 0 marks, 1 mark and 2 marks on the question demonstrated on the previous graph? The probabilities for 0, 1 and 2 marks to be awarded to a pupil of ability 0 logits can be found in five steps by using the graph:
– Locate the point where the ability of the pupil (on the x-axis) is zero logits.
– Draw a vertical line starting from ‘ability = 0 logits’ on the x-axis and extend it to meet the highest category line, as in the example below. At the point of ability = 0 the highest category line is the ‘1 mark to be awarded’ line.

Figure 32. The score probability lines.


– Find the points where the vertical line meets the category lines of ‘0 marks’, ‘1 mark’ and ‘2 marks’. In this case the point where the vertical line meets the category lines of ‘0 marks’ and ‘2 marks’ is the same.
– At the points where the vertical line meets the category lines, draw a line to the left until you meet the y-axis.
– Read the probability on the y-axis for each score. For the category lines of ‘0 marks’ and ‘2 marks’ the probability is approximately 20%. This means that for a person with ability 0 logits there is a 20% probability to be awarded 0 marks and a 20% probability to be awarded 2 marks. However, the most likely outcome is to be awarded 1 mark (approximately 60%). It follows that if a person with ability 0 logits attempts this question, he/she is expected to give a partly correct answer and be awarded 1 mark.
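The graphical reading can also be checked with a few lines of code. The following is a minimal sketch (not a full Rasch program) that applies the Partial Credit model formula with the step difficulties assumed in the example above, −1 and +1 logits, for a pupil of ability 0 logits.

```python
# A minimal sketch of the Partial Credit probabilities for one question,
# using the step difficulties assumed in the example above (-1 and +1 logits).
import math

def pcm_probabilities(ability, step_difficulties):
    """Return P(0 marks), P(1 mark), ... under the Partial Credit Rasch model."""
    numerators = [1.0]          # the numerator for a score of 0 is 1
    cumulative = 0.0
    for step in step_difficulties:
        cumulative += ability - step      # add (ability - step) for each step taken
        numerators.append(math.exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]

p0, p1, p2 = pcm_probabilities(0.0, [-1.0, 1.0])
print(round(p0, 2), round(p1, 2), round(p2, 2))   # about 0.21, 0.58, 0.21
```

The exact figures (about 21%, 58% and 21%) agree with the rough 20%–60%–20% reading taken from the graph.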

ANALYSIS OF TEST RESULTS USING THE PARTIAL CREDIT MODEL

Although the above theoretical description of the Partial Credit model gives a rough idea of how it works, only hands-on experience with real datasets will help to fully conceptualise all the interesting features and the potential of the model. Let us imagine again that the group of teachers of the previous sections is ready to run a first Partial Credit Rasch analysis. The RP will present the data and then the output will be discussed.

Question 1. A fair coin is flipped four times, each time landing with heads up: HHHH. What is the most likely outcome if the coin is flipped a fifth time? Please circle only one of the answers.
(a) A Head.
(b) A Tail.
(c) A Head and a Tail are equally likely.
Explain why: _______________________________________________

Question 2. In a family four boys were born: BBBB. What is the most likely outcome if a fifth child is born in this family? Please circle only one of the answers.
(a) A Boy.
(b) A Girl.
(c) A Boy and a Girl are equally likely.
Explain why: _______________________________________________

Figure 33. The first two questions of the probabilities test.

RP: We cannot use the data from the sums test to run a Partial Credit Rasch model because they are dichotomous data, that is, the pupils’ responses are scored as incorrect/correct (0/1 marks). For this reason, data gathered from a probabilities test that was administered to fifth and sixth grade pupils will be used. Figure 33 illustrates two of the questions of the probabilities test.

RP: The pupils were awarded one mark for a correct response to the multiple-choice questions of the test and then, provided they got the first mark, they were awarded a second mark for a correct explanation of how they reached their answer. There were 13 questions in the test which were split into two sub-tests. The easy subtest consisted of eight questions and was administered only to the fifth year pupils, whereas the more difficult subtest consisted of ten questions and was administered only to the sixth year pupils.

T: Did they have any common questions?

RP: Yes, the two tests had five questions in common. If they did not have any common questions we would need to have pupils completing both subtests. The linked design is demonstrated in Table 45.

Table 45. Test equating using the Rasch model

                              3 easy            5 questions of medium           5 difficult
                              questions         difficulty (common subtest)     questions
Group of fifth year pupils    Administered      Administered                    Not administered
Group of sixth year pupils    Not administered  Administered                    Administered
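The following sketch (an invented layout, not the book's actual dataset) shows one way such a linked design might be coded for analysis: questions that were not administered to a group are stored as missing values rather than as incorrect responses.

```python
# A hedged sketch of a data matrix for the linked design of Table 45:
# not-administered questions are coded as missing (NaN), not as wrong.
import numpy as np

n_easy, n_common, n_hard = 3, 5, 5          # 13 questions in total

def pupil_row(group, scores):
    """Place a pupil's 0/1/2 scores into a 13-column row; NaN elsewhere."""
    row = np.full(n_easy + n_common + n_hard, np.nan)
    if group == "fifth":                    # fifth year: easy + common questions
        row[:n_easy + n_common] = scores
    else:                                   # sixth year: common + hard questions
        row[n_easy:] = scores
    return row

fifth_year_pupil = pupil_row("fifth", [2, 1, 2, 0, 1, 2, 1, 0])        # 8 scores
sixth_year_pupil = pupil_row("sixth", [2, 2, 1, 0, 1, 0, 0, 1, 0, 2])  # 10 scores
print(fifth_year_pupil)
print(sixth_year_pupil)
```

Because the two groups share the five common questions, the Rasch analysis can place all pupils and all questions on the same logit scale despite the structural missing data.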

RP: In total, 116 pupils were tested. Of them, 52 were administered the easy subtest (they were fifth year pupils) and 64 were administered the more difficult subtest (they were sixth year pupils).

RP: All 116 pupils completed five of the questions. Figure 34 illustrates the results of the analysis. It is the common logit scale where the pupils and the questions are located. Would someone like to comment on this figure?

T: I would like to try … well, first of all, this is very similar to the one we generated for the simple Rasch model. At the right hand side we have the abilities of the pupils and at the left hand side we have the questions. We have 13 questions in the test and two steps per question, therefore we have 26 steps at the left hand side of the graph. Each one of the blue lines at the left hand side of the graph illustrates the number of item steps with that difficulty, e.g. the first blue bar at the bottom left part of the figure says that there were three steps with difficulties around −3 logits.

RP: Exactly. This figure shows the difficulty of each step, but this time expressed as a threshold, not as a step measure. What is the threshold of a step? The threshold of a step is the point on the scale where a person has a 50% chance to be awarded that step or a larger one. In other words, if the first step (1 mark) of a question has a threshold of 2 logits, then a pupil with ability 2 logits has a 50% chance to be awarded a score equal to or larger than 1 mark on that question. I’ll give you another example in Figure 35.

Figure 34. The logit scale for the partial credit analysis.

RP: Look at pupil 74 at the bottom of the figure. The ability of pupil 74 is just below the threshold of the first mark of question 1 (Q1.1). This means that pupil 74 has a probability of just below 50% to get one or more marks on question 1. Because question 1 has only two marks available, the pupil has roughly a 50% chance to get zero marks and a 50% chance to get either 1 or 2 marks. If the ability of pupil 74 were equal to the threshold of the second mark (Q1.2), then this pupil would have a 50% chance to get 2 marks and a 50% chance to get 0 or 1 mark. Would someone like to explain what is going on with pupil 91?

T: The ability of pupil 91 is approximately equal to the threshold of step 2 of question 13 (Q13.2). This means that pupil 91 has a 50% chance to be awarded the second mark on question 13, or any larger mark if there was one available. I noticed that the threshold of the second mark is always larger than the threshold of the first mark, giving the message that it is always more difficult to score 2 or more than to score 1 or more.

RP: Well done! Now, let us talk about pupils 43 and 57. Can someone predict the outcome when those two pupils attempt to answer questions 1 and 13?

T: I think that this is easy. Both pupils are located very high on the logit scale as far as question 1 is concerned. I would expect them to be awarded both the available marks on question 1. However, both steps (1 and 2 marks) of question 13 are too difficult for them. I would expect them to achieve 0 marks on this question.

Figure 35. The logit scale of the partial credit analysis.

RP: Table 46 indicates both the thresholds and the difficulties for each one of the possible steps (to score 1 or 2 marks) of the questions of the test. Remember that the threshold of a score on a question is the ability needed by a pupil to have a 50% chance of achieving that score or a larger one. A larger ability means a larger than 50% probability of achieving that score or a larger one on the specific question instead of one of the lower scores. The difficulty of the score is the ability needed by a pupil to have a 50% probability of achieving that score instead of the previous one.

T: Which one of the two measures is used most, the deltas or the thresholds?

RP: Look, the threshold is not a different characteristic of an item; it is merely the other side of the same coin. It is a bit like converting your money into two different currencies. When you are in Europe, for example, you use Euros, but if you visit Australia you have to convert your money into dollars. When you go to the bank, your 100 Euros will become, say, 200 dollars in a split second. This does not mean that you are richer; you merely express the same value in a different way. Converting from difficulties to thresholds is a parallel concept: (a) you use difficulties to find the probability of a pupil being awarded the next score instead of the previous one; (b) you convert to the threshold of a specific score on a question to determine the probability of a pupil being awarded that score or a larger one.

T: How can we convert from the difficulty of a step to the threshold?


Table 46. The outcome of the partial credit Rasch analysis

            Delta (step difficulty)                      Threshold
Question   First mark   SE    Second mark   SE     First mark   SE    Second mark   SE
1            -2.16     0.67     -2.75      0.50      -2.81     0.81     -2.09      0.76
2            -1.59     0.54     -2.39      0.44      -2.31     0.69     -1.67      0.64
3            -1.26     0.42     -3.08      0.35      -2.38     0.53     -1.97      0.51
4            -2.11     0.32      0.34      0.28      -2.19     0.44      0.44      0.36
5            -1.75     0.43      0.16      0.42      -1.88     0.59      0.29      0.57
6             1.04     0.35      3.49      1.07       0.97     0.53      3.55      1.25
7             3.04     0.62      0.38      0.77       1.58     0.92      1.83      0.97
8             1.29     0.47      3.02      1.13       1.16     0.72      3.16      1.35
9             2.26     0.41      0.87      0.56       1.31     0.64      1.81      0.69
10           -1.03     0.60     -4.64      0.53      -2.91     0.75     -2.76      0.76
11            0.10     0.17      N/A       N/A        0.10     0.17      0.10      0.17
12            0.88     0.21      N/A       N/A        0.88     0.21      0.88      0.21
13            3.16     0.81      1.74      1.27       2.22     1.31      2.69      1.47

Key: SE stands for error of estimate.

RP: This is indeed a very easy task. Figure 36 demonstrates the score probability lines for question 1. The step measures of the question are given in Table 46. Would someone like to comment on the figure?

Figure 36. The score probability lines for question 1.

T: It seems that the probability for a pupil of any ability to be awarded 1 mark on question 1 is always very low. From very low abilities up to approximately −2.5 logits the most likely outcome is 0 marks. From approximately −2.5 logits and up, the most likely outcome is 2 marks. The thresholds can be found by identifying the 50% probability on the y-axis and drawing a horizontal line to the right. The place where the line intersects the line of ‘0 marks’ indicates the ability where a pupil has a 50% chance to get 0 marks and a 50% chance to get 1 mark or more. Extending the line to the right, we intersect the ‘2 marks’ line. This point indicates the ability where the pupil has a 50% chance to be awarded 2 marks.

T: The output of the Partial Credit Rasch analysis is a little bit different from the output of the simple Rasch analysis as far as the questions are concerned… I mean that in the simple analysis we only have one measure for every question but in the partial credit analysis we have one measure for every possible mark. Is there any difference between the simple Rasch analysis and the partial credit analysis as far as the pupils are concerned?

RP: No, there is no difference between the two models. In both cases we only need to report the raw score, the ability of the pupils, the precision with which we have measured the ability (error of estimate) and the fit statistic. The following table indicates the output of the Partial Credit Rasch analysis for a few pupils.

Table 47. An extract from the partial credit analysis output for the pupils

Pupil No   Score   Ability   Error of Estimate   Infit Mean Square
1           11      0.12          0.82               0.47
2            9     -1.04          0.68               1.34
3            9     -1.04          0.68               0.27
4            6     -2.08          0.53               0.83
5           10     -0.52          0.77               3.21
6           11      0.12          0.82               0.47

RP: I assume that Table 47 is familiar. We’ve seen it several times in the sections where we were talking about the simple Rasch model. You can tell that pupil 5 must have given very aberrant responses by the large fit statistic.

T: Talking about fit statistics, I just remembered that you didn’t talk to us about the fit statistics of the questions for the Partial Credit model. Does this mean that we do not use the fit statistics for the questions in the Partial Credit model?

RP: Of course we do! The fit statistics are very important in all Rasch models. They are the flag which tells us if there is something wrong with a question. Again, the same cut-off score of 1.3 may be applied. This is the table with the fit statistics of the questions (Table 48).

RP: As usual, a large fit statistic indicates that the question is in trouble. You must have noticed that although we have a separate estimate for the difficulty of every possible step on the questions, we only report one overall fit statistic for each question. This is because, in this case, we are mostly interested in the overall quality of the question.

Table 48. An extract of the partial credit analysis output for the questions

              Threshold
Question No   First mark   Second mark   Infit Mean Square
1               -2.81        -2.09            0.85
2               -2.31        -1.67            0.85
3               -2.38        -1.97            0.80
4               -2.19         0.44            1.07
5               -1.88         0.29            0.98
6                0.97         3.55            1.23
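Going back to the teacher's earlier question about converting deltas to thresholds, the conversion can also be done numerically rather than graphically. The sketch below (not from any Rasch package) searches for the ability at which the chance of reaching a given score or higher is 50%, using the question 1 step difficulties from Table 46 (−2.16 and −2.75); the results land close to the printed thresholds, with small differences due to rounding of the deltas.

```python
# A minimal sketch: converting Partial Credit step difficulties (deltas)
# into thresholds by bisection. Deltas are the question 1 values of Table 46.
import math

def pcm_probabilities(ability, steps):
    numerators, cumulative = [1.0], 0.0
    for step in steps:
        cumulative += ability - step
        numerators.append(math.exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]

def threshold(score, steps, lo=-10.0, hi=10.0):
    """Ability at which P(that score or a higher one) equals 50%."""
    for _ in range(60):                         # bisection search
        mid = (lo + hi) / 2.0
        if sum(pcm_probabilities(mid, steps)[score:]) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

steps_q1 = [-2.16, -2.75]
print(round(threshold(1, steps_q1), 2))   # about -2.82 (Table 46 prints -2.81)
print(round(threshold(2, steps_q1), 2))   # about -2.09, as in Table 46
```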

HOW TO BUILD A TEST FROM SCRATCH USING THE RASCH MODEL

Up to this point we explained how the Rasch model works and how to interpret the results of a Rasch analysis. What is still missing is a brief guide on how to develop a test for classroom assessment from scratch using the Rasch model.

A basic requirement for the development of a test is to make sure that the objective of measurement is clear and that the ability to be measured is, at least roughly, defined. If you are developing a test you should be able to explain in simple terms what you want to measure, for example, ‘I want to measure my pupils’ ability in doing vertical and horizontal sums up to 100’.

The second step is to identify questions that are not only relevant to the ability as defined previously but also representative (as much as possible) of the domain to be tested. It is important to make sure that all aspects of the domain are tested, otherwise the validity of the test may be challenged. The test could include newly written questions but also existing questions, provided you are reasonably sure that pupils have not memorised the answers. At this stage, it is helpful to take on good habits like avoiding questions that encourage guessing or cheating (e.g., multiple-choice questions) and making sure that previous questions do not give insights for the solution of the next questions, and the like. This stage of the test development may be regarded as the stage of the operational definition of the ability you want to measure. Make sure that the questions you are including in the test materialise fully your concept of the ability you want to measure. Also, make sure that you pilot more questions than you need so you’ll have the flexibility to discard some of them and keep only the best. That way, if a question fails, you’ll be able to replace it with another one that tests the same sub-domain so that you can keep the representativeness of the curriculum to a high level.

(Visit WebResources where you can find a short paper by Goldstein and Blinkhorn who criticise the use of the Rasch model. We consider it important to expose you to both sides of an argument.)

Prepare a marking scheme. This is a detailed guide of what makes up a correct response to every question. If you are prepared to award partial credit for partly correct responses, make sure that you can follow very clear guidelines for the award of marks. Use your previous experience to predict what types of answers should not be accepted so that no ambiguities will arise while you are marking the scripts.

The next stage is to identify a suitable sample of pupils to administer the test to – in other words, to pilot the test. The golden rule is that more pupils usually give more informative results. You do not need hundreds of pupils though – just a few tens of them. Make sure that there will be no sharing of answers – you do not really want the pupils to circulate the actual test between them to practise! Try to get a sample of pupils that is representative of the ability of the targeted population of pupils. Try to include pupils of all abilities in the sample. If the targeted population includes pupils with special characteristics (e.g., ethnic minorities having English as an additional language) think whether you need to include a few of them in the pilot sample.

Administer the tests to the pilot sample, verify that no cheating or copying is encouraged during the completion of the test by the pupils, collect the scripts and mark the results. This is the time when you’ll give yourself some feedback about the test. Try to identify the obvious problems (for example many pupils scribbling hopelessly on a too confusing and ambiguous question). Draw conclusions from the types of mistakes the pupils make. Do you see any evidence that pupils may have misunderstood the wording of a question? Is there any evidence that the diagram of a question is misleading the pupils?

If you have enough people taking the test (a few tens) then you can run a Rasch analysis and use the results to evaluate the questions.

(a) First use very simple statistics like the percentage of the pupils that answered each question correctly. If you identify questions that are too difficult or too easy for the pupils you may be tempted to remove them from the test. You do not need many questions that nobody answered correctly, for example! They do not provide rich information and they may discourage some pupils. You also do not need many questions that everybody answered correctly.

(b) Sort the questions according to their fit statistics. Inspect their fit statistics and identify the most misfitting ones. Try to figure out why they are so badly misfitting. Do you really need them? If you have other questions that cover the same sub-domain you may decide to remove them. If they have a special characteristic and you think that they should remain in the test because they have something unique to contribute, try to identify if there is something that makes them ‘misbehave’. Are there any awkward words or confusing diagrams? Is there something you can do to make sure that the questions do not load heavily on another ability (e.g., language) instead of the ability you want to measure (e.g., geometry)?

[Figure 37 is a flowchart: Collect & mark scripts → Pre-Rasch analysis: remove questions with full/zero score → Run Rasch analysis → Sort questions according to fit statistics → Remove overfitting questions which test the same sub-domain as other questions → Removed questions? (if yes, rerun the analysis) → Sort pupils according to fit statistic → Removed pupils? (if yes, rerun the analysis) → Still want to reduce test length? (if yes, repeat; if no, END).]

Figure 37. The Rasch analysis ‘path’ for test development.

Identify the most overfitting questions. If the test is still too long and if you are looking for questions to discard, then you may safely remove a few overfitting ones, especially if you have other questions that measure exactly the same sub-domain.

(c) Generate the logit scale, identify the sub-domain tested by each question and try to see if questions testing the same sub-domain cluster together. You do not need many questions that test the same sub-domain and have approximately the same difficulty. You can remove a few of them to shorten the test. Keep the less misfitting questions and remove the most overfitting ones.


(d) If you remove a few misfitting questions you may want to rerun the Rasch analysis and see what happens. The removal of a few misfitting questions may mean that the rest of the questions have very good fit and that they form a very nice scale. Other times the removal of misfitting questions may lead to the appearance of other misfitting questions.

(e) Build a table with information about the pupils. Sort the pupils from the most misfitting to the least misfitting. Start asking questions about the most misfitting pupils. Do they have a common characteristic? They may be mostly of one gender or they may all face language problems. Try to make sense out of this and identify why a group of pupils with common characteristics may have a large percentage of misfitting pupils. Try to identify which questions in the response patterns of the misfitting pupils cause the misfit. Identify patterns of unexpectedly correct or incorrect responses. Do you see something suspicious? Have you identified pupils who give unexpectedly correct or incorrect responses mainly on questions of a specific sub-domain? Do you suspect cheating, copying, a sudden illness or other reasons that suggest a specific response pattern should be removed from the analysis? If you decide to remove pupils’ response patterns from the sample you may want to rerun the Rasch analysis and see what the impact was.

(f) Taking all the previous steps one at a time you may modify your test and improve the quality of the measurement. Always use common sense and your professional judgement when taking decisions. Figure 37 illustrates a path that may help you systematise your efforts for the construction of a test using the Rasch model.
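The 'pre-Rasch' screening box in Figure 37 can be automated in a few lines. The sketch below uses invented data and deliberately simple rules (it only removes questions that nobody or everybody answered correctly and reports facility values); it is not a replacement for the fit-based decisions described in steps (a) to (f).

```python
# A hedged sketch (invented data) of the pre-Rasch screening step in Figure 37:
# drop questions with zero or full score before running the Rasch analysis.
import numpy as np

responses = np.array([   # rows = pupils, columns = questions (1 = correct)
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
])

facility = responses.mean(axis=0)              # proportion correct per question
keep = (facility > 0.0) & (facility < 1.0)     # remove zero-score and full-score questions

for q, (p, kept) in enumerate(zip(facility, keep), start=1):
    status = "keep" if kept else "remove (provides no information)"
    print(f"Question {q}: facility {p:.2f} -> {status}")

screened = responses[:, keep]                  # matrix passed on to the Rasch analysis
print("questions kept:", screened.shape[1])
```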

WHAT TO HAVE IN MIND BEFORE ANALYSING A DATASET WITH THE RASCH MODEL

The following figure summarises the most important things you should always have in mind when you attempt to fit a Rasch model to a dataset. The figure stresses a few of the most common mistakes or questions of people when they use the Rasch models for the first time.

– If theory does not support unidimensionality, the results of the analysis may be very difficult to interpret even if no misfitting questions appear in the data. For example, if a test includes an equal number of mathematics and science questions, it is likely that the Rasch model will have a relatively good fit because the two abilities (science/maths) are usually highly correlated. However, we know that the test involves questions on two different subjects and many people would prefer to measure the ability of individuals on two clearly distinct Rasch scales.

– Look out for questions that are strongly related. Do not treat the different sub-questions as stand-alone questions in the analysis. Instead, try to use the Partial Credit Rasch model if possible. If you have questions that are strongly related (e.g., language questions referring to the same passage) you may wish to sum the marks on each individual question and treat the sums as the total score on a large partial credit question. This technique has been successfully used by other researchers and for more information please refer to Bond and Fox.¹ If this does not work and there are still clear indications of strong dependence between your questions (e.g., one question gives hints to answer the next question), it is likely that a Rasch model will not have a good fit.

– Speeded tests may pose problems in the sense that questions at the end of the test are more likely to appear more difficult than they are. Try to investigate the context of the administration of the test.

– If you have strong indications that guessing is possible or even encouraged by the format of the questions, then check your results very carefully. It is likely that you will not have any serious problems since the Rasch model has been applied successfully many times on data generated by multiple-choice tests. However, questions with high perceived difficulty may encourage frequent guessing and may have particularly bad fit.

Figure 38 is a summarised guide for the application of the Rasch model.

[Figure 38 is a decision flowchart: Before running the analysis, think… Does your theory support unidimensionality? (no → results difficult to interpret) → Are there any dependent questions? (yes → can Partial Credit help? if yes, apply a Partial Credit model; if no, difficult to build a scale) → Is the test speeded? (yes → possible problems with a ‘speed’ dimension) → Is guessing possible/encouraged? (yes → possible problems; check your results with caution) → Proceed.]

Figure 38. A guideline for the application of the Rasch model on a dataset.


FURTHER READING

This chapter aspired to serve as a simplified introduction to the Rasch model for the novice. If you are interested in learning more about the world of the Rasch models you are encouraged to spend more time reading relevant material such as the work of Wright and Stone (1979), which is considered a classic. After you take this step you can proceed to Wright and Masters (1982). Check out the following list of references. They have been categorised for your convenience. You can download many of them from the web free of charge.

Basic Reading – do not miss

– Wright, B. D. & Mok, M. (2000). Understanding Rasch Measurement: Rasch Models Overview. Journal of Applied Measurement, 1(3), 83-106.
  [Audience: Beginner. Content: Presents the models. Maths: Not very demanding. Rating: Must-read. Length: 23 pages.]
– Smith, E. V. Jr. (2001). Understanding Rasch Measurement: Metric Development and Score Reporting in Rasch Measurement. Journal of Applied Measurement, 1(3), 303-326.
  [Audience: Beginner. Content: Using the models. Maths: Not very demanding. Rating: Must-read. Length: 23 pages.]
– Smith, E. V. Jr. (2001). Evidence for the reliability of measures and validity of measure interpretation: A Rasch measurement perspective. Journal of Applied Measurement, 2(3), 281-311.
  [Audience: Beginner. Content: Validity & interpretation of measures. Maths: Not very demanding. Rating: Must-read. Length: 20 pages.]
– Wright, B. D. (1967). Sample-free test calibration and person measurement. ETS Invitational Conference on Testing Problems, MESA Research Memorandum Number 1, MESA Psychometric Laboratory, The University of Chicago. http://www.rasch.org/memo1.htm
  [Audience: Beginner. Content: Illustration of Rasch model properties. Maths: No maths. Rating: Very Useful. Length: 12 pages.]
– Wright, B. D. (1999). Common Sense for Measurement. Rasch Measurement Transactions, 13:3, p. 704. http://www.rasch.org/rmt/rmt133h.htm
  [Audience: Beginner. Content: General introduction. Maths: Not very demanding. Rating: Must-read. Length: 3 pages.]
– Ludlow, L. H., & Haley, S. M. Rasch model logits: interpretation, use, and transformation. Educational and Psychological Measurement, 55(6), 967-975.
  [Audience: Beginner. Content: The meaning of the logit. Maths: No maths. Rating: Useful. Length: 9 pages.]
– Wright, B. D. (2001). Counts or Measures? Which Communicate Best? Rasch Measurement Transactions, 14:4, p. 784. http://www.rasch.org/rmt/rmt144g.htm
  [Audience: Beginner. Content: Rasch measures vs raw scores. Maths: No maths. Rating: Must-read. Length: 1 page.]


– Fisher, W. P. Jr. (1998). Do Bad Data Refute Good Theory? Rasch Measurement Transactions, 11:4, p. 600. http://www.rasch.org/rmt/rmt114h.htm#Bad
  [Audience: Beginner. Content: Reasoning for using Rasch models. Maths: No maths. Rating: Very Useful. Length: 2 pages.]
– Wright, B. (1993). Equitable Test Equating. Rasch Measurement Transactions, 7:2, p. 298. http://www.rasch.org/rmt/rmt72.htm
  [Audience: Beginner. Content: Reasoning for using Rasch models in test equating. Maths: No maths. Rating: Very Useful. Length: 2 pages.]
– Wright, B. (1993). Thinking with Raw Scores. Rasch Measurement Transactions, 7:2, p. 299. http://www.rasch.org/rmt/rmt72.htm
  [Audience: Beginner. Content: Reasoning for using Rasch measures instead of raw scores. Maths: No maths. Rating: Very Useful. Length: 1 page.]

Reading for discussion with more experienced practitioners (… after you do some basic reading)

– Smith, E. V. Jr., & Smith, R. M. (2008). Introduction to Rasch measurement. JAM Press. Please visit www.jampress.org
  [Audience: Beginner to Advanced. Content: An all-round approach. Maths: Not very demanding. Rating: Highly recommended. Length: 689 pages.]
– Embretson, S. E., commented by Linacre, J. M. (1999). The New Rules for Measurement. Rasch Measurement Transactions, 13:2, p. 692. http://www.rasch.org/rmt/rmt132e.htm
  [Audience: Beginner. Content: General concepts. Maths: No maths. Rating: Very Interesting. Length: 3 pages.]
– Linacre, J. M. (2000). Historic Misunderstandings of the Rasch Model. Rasch Measurement Transactions, 14:2, pp. 748-9. http://www.rasch.org/rmt/rmt142f.htm
  [Audience: Intermediate. Content: Discusses assumptions of the models. Maths: No maths. Rating: Very Interesting. Length: 3 pages.]
– Ludlow, L., & O’Leary, M. (2000). What to Do about Missing Data? Rasch Measurement Transactions, 14:2, p. 751. http://www.rasch.org/rmt/contents.htm
  [Audience: Beginner. Content: Very comprehensive introduction. Maths: No maths. Rating: Must-read. Length: 1 page.]
– Wright, B. D., & Linacre, J. M. (1985). Reasonable Mean-square fit values. Rasch Measurement Transactions, 8:3, pp. 370-371. http://www.rasch.org/rmt/
  [Audience: Beginner. Content: Discussion on fit measures. Maths: No maths. Rating: Must-read. Length: 1 page.]


Special Topics – further reading for specialisation

– Test equating (using multiple linked tests): Wolfe, W. (2000). Understanding Rasch Measurement: Equating and Item Banking with the Rasch Model. Journal of Applied Measurement, 1(4), 409-434.
  [Audience: Non-beginners. Content: Very comprehensive introduction. Maths: Not very demanding. Rating: Must-read. Length: 25 pages.]
– Test equating (using multiple linked tests): Suanthong, S., Schumacker, R. E., & Beyerlein, M. M. (2000). An Investigation of Factors Affecting Test Equating in Latent Trait Theory. Journal of Applied Measurement, 1(1), 25-43.
  [Audience: Non-beginners. Content: Focused on details. Maths: Intermediate (you can ignore them). Rating: Very Useful. Length: 18 pages.]
– Test equating (item banking): Masters, G. N. (1984). Constructing an item bank using partial credit scoring. Journal of Educational Measurement, 21(1), 19-32.
  [Audience: Experienced. Content: Focused on details. Maths: Demanding. Rating: Not first priority. Length: 13 pages.]
– Fit statistics: Smith, R. M. (2001). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.
  [Audience: Non-beginners. Content: Overview of fit statistics and information on residual analysis. Maths: Demanding. Rating: Very Useful. Length: 19 pages.]
– Fit statistics: Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied Measurement, 1(2), 152-176.
  [Audience: Non-beginners. Content: Misfit, sources of misfit, properties of fit statistics. Maths: Demanding. Rating: Useful. Length: 19 pages.]

Other Interesting Books and Articles
– Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: fundamental measurement in the human sciences. Lawrence Erlbaum: New Jersey. (Very good for people who prefer to avoid hard mathematics.)
– Goldstein, H. & Wood, R. (1989). Five decades of item response modelling. British Journal of Mathematical and Statistical Psychology, 42(2), 139-167.
– Hambleton, R. K., Swaminathan, H. & Rogers, K. J. (1991). Fundamentals of item response theory. Sage: California.
– Weiss, D. J. & Yoes, M. E. (1991). Item response theory. In R. K. Hambleton & J. N. Zaal (Eds.), Advances in educational and psychological testing. Kluwer: Boston.
– Wright, B. D. & Masters, G. N. (1982). Rating Scale Analysis. Chicago: MESA Press.
– Wright, B. D. & Stone, M. H. (1979). Best Test Design. Chicago: MESA Press.


The richest and probably the most widely used source of material about Rasch measurement can be found at www.rasch.org. Do not miss this excellent site. You can also become a member of the listserv at www.rasch.org/rmt/index#Listserv, where you will have the opportunity to ask questions and interact with a few of the most established Rasch practitioners in the world.

-oOo-

REVIEW QUESTIONS

T F  The partial credit Rasch model may be used instead of the simple Rasch model when a number of questions test a partially different ability than the rest of the questions.
T F  If one desires to reduce the length of a test, one can preferably remove overfitting questions which have the same difficulty as other questions in the test.
T F  The above statement is true because similar questions with the same difficulty in the same test tend to have small fit statistics.
T F  Questions that have fit statistics larger than 1.3 must definitely be removed from the test because they do not test the same ability as the rest of the questions in the test.
T F  The above statement is true because an infit mean square larger than 1.3 means that the examinees were able to guess the correct answer to the question.
T F  According to the Rasch model, two persons with the same ability will definitely get the same marks on the same question.
T F  The assumption of Local Independence demands that each examinee works independently from the other examinees while completing the test.
T F  Data generated by two or more different tests which measure the same ability may be analysed by the Rasch model provided all people completed at least a common group of questions.
T F  A person of ability θ = 3 logits will definitely get a score of 2 on the following question because this is the most likely score for his ability according to the figure below.


T F  According to the following figure, the most likely score for an examinee with ability -1 logit is 2 marks.

[Figure: score probability curves showing the probability of scoring 0 marks, 1 mark and 2 marks as a function of ability (in logits), from -5 to +5.]

EXERCISES

1. Draw the score probability line(s) of a question with the following characteristics:

             First mark   Second mark
   Delta:      -2.16        -2.75
   SE:          0.67         0.50

2. According to the following figure, what is the most likely score for an examinee with ability 0.75 logits?

[Figure: score probability curves showing the probability of scoring 0 marks, 1 mark and 2 marks as a function of ability (in logits), from -5 to +5.]


3. According to the figure of exercise 2, what is the ability of an examinee if he has the same likelihood of being awarded 1 or 2 marks (approximately, within 0.3 logits)?

4. Draw the score probability line(s) of a question with the following characteristics: Difficulty: -2.16; SE: 0.67; Infit mean square: 2.16.
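If you want to check your sketches for Exercises 1 and 4, a few lines of code can plot the curves for you. The script below is not part of the original exercises; it assumes that Python with numpy and matplotlib is available and implements the standard Partial Credit category-probability formula, using the delta values quoted in Exercise 1 (for Exercise 4, pass the single difficulty -2.16; note that the infit mean square does not change the shape of the model curve).

```python
import numpy as np
import matplotlib.pyplot as plt

def pcm_probabilities(theta, deltas):
    """Category probabilities of the Partial Credit model.

    theta  : array of ability values (logits)
    deltas : step difficulties (delta_1, ..., delta_m) in logits
    Returns an array of shape (m + 1, len(theta)) with the probability
    of scoring 0, 1, ..., m marks at each ability value.
    """
    theta = np.asarray(theta, dtype=float)
    # Cumulative sums of (theta - delta_k), with the empty sum equal to 0 for a score of 0.
    numerators = [np.zeros_like(theta)]
    for x in range(1, len(deltas) + 1):
        numerators.append(numerators[-1] + (theta - deltas[x - 1]))
    numerators = np.exp(np.vstack(numerators))
    return numerators / numerators.sum(axis=0)

# Delta estimates quoted in Exercise 1 (first and second mark).
deltas = [-2.16, -2.75]
ability = np.linspace(-5, 5, 201)
probs = pcm_probabilities(ability, deltas)

for score, curve in enumerate(probs):
    plt.plot(ability, curve, label=f"{score} mark(s)")
plt.xlabel("Ability (in logits)")
plt.ylabel("Probability")
plt.legend()
plt.show()
```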


CHAPTER 9

FURTHER APPLICATIONS OF THE RASCH MODEL

The previous chapters introduced the simple and the Partial Credit Rasch models in a non-technical way. They were introduced through practical examples with primary school mathematics, but you may easily generalise the examples to your own context. This chapter aspires to go one step further and present other members of the family of Rasch models. The informal and non-mathematical style of the previous chapters will be maintained here as well. Some issues to be covered in this chapter may be slightly more advanced, so they will be presented in a simplified (but not simplistic) way. Further reading will be needed if you want to acquire a more complete understanding of the most advanced concepts.

Firstly, we will go through the Rating Scale model, which is a more restricted version of the Partial Credit model described in the previous chapter. The Rating Scale Rasch model is frequently used with rating scales such as the Likert scale. The second section briefly presents two multi-dimensional models, which are very useful when items of a test tap two slightly different abilities. For example, a science test may consist of questions that not only test knowledge in science but also require extensive calculations and high mathematical ability to solve. Finally, this chapter offers a brief introduction to one of the most promising areas of educational and psychological measurement: the use of Rasch models in computerised adaptive testing.

THE RATING SCALE RASCH MODEL

Rating scales are extensively used in education and psychology to measure attitudes, traits and other psychological dimensions of interest. Rating scales frequently take the form of a Likert scale (e.g., Strongly Agree, Agree, Neutral or Undecided, Disagree, Strongly Disagree). Such questionnaires may be assumed to yield ordinal data, which need to be transformed to an interval scale, for example through Andrich's (1978) Rating Scale model, in order to be most useful.

Earlier we described two other Rasch models: the simple and the Partial Credit Rasch models. What are the similarities and the differences between those models and the Rating Scale Rasch model? The three models are very similar in the sense that both the Partial Credit and the Rating Scale models are extensions of the simple Rasch model. All three models share most of their assumptions. They share, for example, the assumption of unidimensionality, the assumption of local independence and the


assumption of minimal guessing. They are all used to make sense out of ordinal data (that is, the raw scores) by building interval measures (the person and the item estimates) that are much more useful. Secondly, all three models share roughly the same statistical indicators (e.g., ability and difficulty estimates, errors of estimates and fit statistics). All models produce estimates for the persons and the items and use the error of estimate to indicate how uncertain we are about those estimates. Finally, the infit mean square is used in all models to evaluate the quality of the measurement. Since all these terms have been extensively discussed in previous sections, there is no need to elaborate on them again.

Visit WebResources, where you can find various links to Rasch papers: do not miss the paper by Bradley, Sampson and Royal (2006), where the Rating Scale Rasch model is used to develop a measure of students' conceptualisation of Quality Mathematics Instruction.

However, there are also a few differences between the Rating Scale and the other two models. Firstly, the simple and the Partial Credit Rasch models are usually used with achievement tests, whereas the Rating Scale Rasch model is usually used with questionnaires and other rating scales. However, all models can work in any context as long as they are mathematically appropriate. For example, the simple Rasch model cannot be used to analyse data from a Likert scale unless the items are transformed to dichotomous ones (e.g., by recoding the categories 'Strongly Disagree', 'Disagree' and 'Neutral' to zero and the categories 'Agree' and 'Strongly Agree' to 1).

A major difference between the Partial Credit model and the Rating Scale model is that each question in the Partial Credit model may have a different number of steps/marks/categories, and each step is free to have a different difficulty estimate from question to question. For example, the difficulties of getting two marks on the first and the sixth questions of the probabilities test (see Table 46) were different: the threshold for two marks on question 1 was –2.09 logits and the corresponding threshold on question 6 was +3.55 logits. Therefore, the same step on two different questions can have, and usually does have, a different difficulty estimate. On the other hand, the Rating Scale Rasch model assumes that the same category, say in a questionnaire, has exactly the same meaning for the pupils across the questions. For this reason, it is assumed that the five, say, categories form a scale which is the same across all items, and each category has the same estimate across all the items. In other words, the category 'Rarely', for example, will have the same estimate for all items. If you feel a bit confused at this stage, don't worry: more explanations are given in the next paragraphs.
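For readers who do not mind a little notation, the difference can also be written down compactly. The formulation below is a standard one from the wider Rasch literature rather than something given in this book, and the symbols (person ability θ, item difficulty δ, step parameters δ_ik, category thresholds τ_k) are introduced here only for illustration.

```latex
% Probability that a person of ability \theta responds in category x (x = 0, ..., m_i) of item i.
% Partial Credit model: each item i has its own step difficulties \delta_{ik}.
P_{ix}(\theta) =
  \frac{\exp\!\Big(\sum_{k=1}^{x}\big(\theta-\delta_{ik}\big)\Big)}
       {\sum_{j=0}^{m_i}\exp\!\Big(\sum_{k=1}^{j}\big(\theta-\delta_{ik}\big)\Big)},
  \qquad \text{with } \textstyle\sum_{k=1}^{0}(\cdot)\equiv 0 .

% Rating Scale model: one overall item difficulty \delta_i plus a single set of
% category thresholds \tau_k shared by every item.
P_{ix}(\theta) =
  \frac{\exp\!\Big(\sum_{k=1}^{x}\big(\theta-\delta_i-\tau_k\big)\Big)}
       {\sum_{j=0}^{m}\exp\!\Big(\sum_{k=1}^{j}\big(\theta-\delta_i-\tau_k\big)\Big)} .
```

In the Partial Credit model the step difficulties δ_ik carry both the item and the step information, whereas in the Rating Scale model the item difficulty δ_i simply shifts one set of thresholds τ_1, ..., τ_m that is common to all items.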


Let us now consider a practical example: a questionnaire administered to 52 pupils of the Hellenic (Greek) school of Manchester, England. The aim of the questionnaire was to investigate the reasons that made the pupils attend the Greek school. The authors of the questionnaire assumed that pupils with a generally more positive attitude towards the Greek school would also have more positive attitudes towards different aspects of the school, like learning Greek, meeting Greek friends and so on.

In order to analyse this dataset with the Rasch model, it is necessary to use specialised software. The software for the statistical analysis, however, does not understand the meaning of the five categories of the Likert scale, 'Strongly Disagree' to 'Strongly Agree', so the categories had to be assigned numerical values. The categories of the Likert scale were recoded numerically according to Table 50 (the codes for the categories of the Likert scale).

Table 49. The questionnaire for attitudes of the pupils to the Greek School

1. I come to the Greek school because I like learning Greek.
   • Strongly Disagree   • Disagree   • Neutral/Don't Care   • Agree   • Strongly Agree
2. I come to the Greek school because I like meeting my Greek friends.
   • Strongly Disagree   • Disagree   • Neutral/Don't Care   • Agree   • Strongly Agree
3. I come to the Greek school because I like meeting my teachers.
   • Strongly Disagree   • Disagree   • Neutral/Don't Care   • Agree   • Strongly Agree
4. I come to the Greek school because I want to learn more about the Christian Orthodox religion.
   • Strongly Disagree   • Disagree   • Neutral/Don't Care   • Agree   • Strongly Agree
5. I come to the Greek school because I want to learn more about the Greek culture.
   • Strongly Disagree   • Disagree   • Neutral/Don't Care   • Agree   • Strongly Agree
6. I come to the Greek school because I have fun.
   • Strongly Disagree   • Disagree   • Neutral/Don't Care   • Agree   • Strongly Agree
7. I come to the Greek school because I want to learn more about Greece and Cyprus.
   • Strongly Disagree   • Disagree   • Neutral/Don't Care   • Agree   • Strongly Agree
8. I come to the Greek school because I feel nice.
   • Strongly Disagree   • Disagree   • Neutral/Don't Care   • Agree   • Strongly Agree

As Table 50 shows, the categories were recoded in such a way that more positive attitudes are awarded a higher numerical score. Therefore, a larger score means a more positive attitude towards coming to the Greek school.


If the above data were analysed using the Partial Credit model, then each one of the categories would have a different estimate for each item. The Rating Scale model, on the other hand, will give one overall estimate for every item, which indicates how easy it was for the pupils to agree or disagree with the item. The model will also award a difficulty level to each category, and each category will have this estimate for all items.

Table 50. The codes for the categories of the Likert scale

Category               Score
Strongly Disagree        1
Disagree                 2
Neutral/Don't Care       3
Agree                    4
Strongly Agree           5
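In practice this recoding step is a one-line lookup in whichever package you use. The sketch below shows it in Python; only the category-to-score mapping comes from Table 50, while the function name and the example responses are hypothetical and included purely for illustration.

```python
# Mapping from Table 50: higher scores indicate a more positive attitude.
LIKERT_SCORES = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral/Don't Care": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

def recode(responses):
    """Turn a list of Likert category labels into the numeric codes of Table 50."""
    return [LIKERT_SCORES[r] for r in responses]

# Hypothetical responses of one pupil to the eight questionnaire items.
pupil = ["Agree", "Strongly Agree", "Disagree", "Neutral/Don't Care",
         "Agree", "Agree", "Strongly Agree", "Agree"]
print(recode(pupil))   # [4, 5, 2, 3, 4, 4, 5, 4]
```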

We expect the difficulty of the steps to increase from 'Strongly Disagree' to 'Strongly Agree'. For example, we expect the category 'Strongly Agree' to be more difficult to endorse than the category 'Strongly Disagree', in the sense that you need to like the Greek school more in order to 'Strongly Agree' rather than 'Strongly Disagree' with a positive statement. Do not worry if you feel a bit confused at the moment; everything will become clearer after the presentation of the results of the analysis.

ANALYSIS USING THE RATING SCALE MODEL

The data were keyed into the computer and analysed with the Rating Scale Rasch model. The output of the software informed us that 52 pupils were entered in the analysis, but 12 of them had extreme scores (they answered 'Strongly Disagree' or 'Strongly Agree' to all items) and were removed from the analysis. Table 51 presents the results for the 40 remaining pupils.

Table 51. Results of the Rating Scale analysis for pupils

            Estimate    Infit Mean Square
Mean          1.18            1.09
Maximum       2.87            4.00
Minimum      -1.38            0.27

According to Table 51, the pupils had an average attitude estimate of 1.18, with the maximum estimate being 2.87 and the minimum being -1.38. The infit mean square column, however, gives a somewhat worrying result in the sense that the maximum infit mean square appears to be 4, which is relatively large. It also appears that the minimum infit statistic is 0.27. These results indicate that a more detailed investigation is needed to identify the pupils who gave aberrant or


too predictable response patterns. Table 52 provides the statistics of the items of the questionnaire for the Rating Scale model. You may have noticed that all questions are described by only one estimate. In the Partial Credit model we called this estimate 'difficulty', and we will keep the same term for the sake of consistency. When we say that an item has a large 'difficulty' we mean that the item was more difficult for the pupils to endorse (e.g., it was difficult for the pupils to agree that they attended the school in order to meet their teachers, but it was easier for them to agree that they attended the school to learn more about Greece and Cyprus). In other words, an item is more difficult when the pupils select 'Disagree' and 'Strongly Disagree' more often, and easier when they select 'Agree' and 'Strongly Agree' more often.

Table 52. Item statistics

Item   Estimate   Error of Estimate   Infit Mean Square   Items' content
1       -0.25          0.21                 1.58           Learn Greek
2       -0.17          0.21                 1.01           Friends
3        0.91          0.15                 0.83           Teacher
4       -0.08          0.20                 0.32           Religion
5       -0.17          0.21                 1.03           Culture
6        0.07          0.19                 0.58           Have fun
7       -0.45          0.23                 1.00           Greece/Cyprus
8        0.14          0.19                 1.48           Feel nice
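A table like this can be screened with a few lines of code. The sketch below uses the infit values of Table 52; the upper bound of 1.4 is the rule of thumb for rating scale questions mentioned in the text, while the lower bound of 0.6 is a common convention for flagging overfit that is assumed here, not a rule given by the book.

```python
# Item statistics copied from Table 52 (item number -> (content, infit mean square)).
items = {
    1: ("Learn Greek",   1.58),
    2: ("Friends",       1.01),
    3: ("Teacher",       0.83),
    4: ("Religion",      0.32),
    5: ("Culture",       1.03),
    6: ("Have fun",      0.58),
    7: ("Greece/Cyprus", 1.00),
    8: ("Feel nice",     1.48),
}

def screen_fit(items, lower=0.6, upper=1.4):
    """Flag items whose infit mean square falls outside the chosen band."""
    for number, (content, infit) in items.items():
        if infit > upper:
            print(f"Item {number} ({content}): infit {infit:.2f} -> possible misfit")
        elif infit < lower:
            print(f"Item {number} ({content}): infit {infit:.2f} -> possible overfit")

screen_fit(items)
# Item 1 (Learn Greek): infit 1.58 -> possible misfit
# Item 4 (Religion): infit 0.32 -> possible overfit
# Item 6 (Have fun): infit 0.58 -> possible overfit
# Item 8 (Feel nice): infit 1.48 -> possible misfit
```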

The results of the analysis are also illustrated in Figure 39. The left side of the figure shows the distribution of the pupils' estimates and the right side shows the location of the items on the logit scale. You can see that it was more difficult for the pupils to agree that they attended the Greek school to meet their teachers. On the other hand, it was much easier for them to agree that they attended the Greek school to learn about Greece and Cyprus. It is likely that the pupils are motivated to learn more about Greece and Cyprus because they visit the two countries very frequently for holidays and have friends and relatives there. It was also relatively easy for the pupils to agree that they attended the school to learn Greek, to learn more about the Greek culture and to meet their Greek friends. We can see from the figure that the spread of the items is not very large. Table 53 also indicates that the spread of the item estimates is not very large, ranging only from 0.91 to -0.45. The infit mean square column again gives results that need more investigation: the maximum infit mean square is 1.58, which indicates a degree of misfit, and the minimum is 0.32, which indicates extensive overfit.


Figure 39. The person and item estimates.

Table 53. Results of the Rating Scale analysis for items

            Estimate    Infit Mean Square
Mean          0.00            0.98
Maximum       0.91            1.58
Minimum      -0.45            0.32

Table 52 indicates that the first and the last items are slightly misfitting. According to convention (see the discussion in the previous chapter), the fit statistic of item 8 ('I come to the Greek school because I feel nice') is just above the rule of thumb which suggests that an infit mean square of up to 1.40 is acceptable for a rating scale question. Item 1 ('I come to the Greek school to learn Greek') has an even larger infit mean square of approximately 1.6. This may be because pupils who were generally positive on the other items disagreed with the statement that they attended the Greek school to learn Greek. More details on how to tackle this type of problem were discussed in previous sections.

One might ask, however, how difficult it was for the pupils to select each one of the five categories of the items. For example, how much more difficult was it for the pupils to 'Strongly Agree' rather than 'Agree'? This question is answered by Table 54.


Table 54. The category difficulties

Category               Score    Estimate (in logits)
Strongly Disagree        1            -0.49
Disagree                 2             0.27
Neutral/Don't Care       3             0.50
Agree                    4             1.14
Strongly Agree           5             1.62

Table 54 informs us that the estimates for the categories 'Agree' and 'Strongly Agree' are much larger than the estimates for the previous categories. This shows that the 'positive' and 'negative' categories are well spread along the scale. Although the Rating Scale model is very useful for the interpretation of the results of Likert and other rating scales, many researchers prefer to use the Partial Credit model because it usually has a better fit. It is good practice to try both models, compare the results and decide which one is more appropriate for your data. If the Rating Scale Rasch model has satisfactory fit for the intended purpose, do not hesitate to prefer it over the Partial Credit model.

THE MULTI-DIMENSIONAL MODEL

The models that have been described up to now share the assumption of unidimensionality. That means that a single ability is measured by the test and that the probability of a correct response to the questions is affected only by this ability. Researchers, however, have been worried about this assumption because, frequently, our tests do not test only a single ability, no matter how carefully we construct them. Therefore, new models have been developed during the last two decades to accommodate multi-dimensional tests.

Multidimensional models were developed in order to help researchers analyse results from assessments that test two or more abilities in the same item. For example, a test may consist of science items that demand (a) science knowledge, and (b) mathematical ability for the computations. In such a case, the multidimensional model assumes two abilities, say ability A and ability B, where both affect the probability of a correct response to the items. Multidimensional models may be classified in two categories. The first category assumes that a low ability in the first dimension cannot be compensated for by an increased ability in the second dimension. For example, if the same item taps both science and computations, then the lack of one ability alone can prevent a person from answering the question correctly. The second category (the so-called compensatory models) allows a high ability on one dimension to make up, at least in part, for a low ability on the other.

Multidimensional models are not in wide use today because of their complexity. After all, statistical models are only useful if you have a robust theory to support them. However, they are becoming more and more popular among researchers. Although it is unlikely that you would use one of those models as a teacher, it is useful to be aware of their existence since we are all consumers of educational or psychological research.


Multidimensional models have been implemented in popular Australian software created by the Australian Council for Educational Research (ACER). In the near future we will come across research that uses those models more and more frequently.

COMPUTERISED ADAPTIVE TESTING

An educational or psychological test is adaptive when the questions/items to be administered to an examinee are selected based on information gathered from the examinee's responses to previously administered items. Therefore, a computer-adaptive test, or CAT for short, is an adaptive test which is administered on a computer. In that sense, CATs may be regarded as a development of, and improvement over, sequential tests (usually traditional paper-and-pencil tests). Sequential testing may also be administered on a computer, but it generally demands that items are either administered in a fixed order or are randomly drawn from a large item bank. Moreover, a computerised sequential test usually has different termination rules than an adaptive test.

Visit WebResources, where you can find links to computerised adaptive assessment packages and other related material.

The development of computerised adaptive tests (CAT) began about 20 years ago, following the development of modern and faster personal computers. Today, CATs are used in the United States in many cases, such as the Graduate Record Examinations, the Armed Services Vocational Aptitude Battery, the General Aptitude Test Battery, the Differential Aptitude Test, certification exams for the Board of Registry of the American Society for Clinical Pathologists, placement courses for universities and a very large number of elementary and high schools for survey and diagnostic purposes.

Advantages and characteristics of CAT

One of the major characteristics of a CAT is that individuals take a different set of items, so that each person is administered a psychometrically optimal test. Individuals may also be administered tests of different length. Although this might initially cause different reactions among policy-makers, it is one of the most appealing psychometric characteristics of CAT. Examinees are administered the minimum number of items necessary in order to achieve a pre-specified precision of measurement. This leads to considerable savings of time. As a matter of fact, research has found CAT to be up to 2.5 times shorter while keeping the precision of measurement constant. Another desirable characteristic of a CAT is that tests are marked automatically. This means that the time which would have been spent by the teachers on this activity can now be used to interpret and digest the results or feedback. This is also related to the scope for automated reporting.


Another very desirable characteristic of a CAT is that testing on demand becomes feasible. This accords with the new 'testing when ready' strategy which is emerging from the assessment and research community. The selection of the first item of the test is one of the issues which has attracted research and attention. Usually the first items assume an examinee of average ability. Other strategies include the use of prior information (e.g., using the results of previous tests). The selection of 'next' items takes an adaptive form by selecting, for example, a more difficult item if the previous item was answered correctly. Finally, different termination rules exist. One strategy is to administer a test of the same length to all the examinees. This is easier to explain to the public and may be fair because all the examinees have the same opportunity to demonstrate their knowledge. On the other hand, a very popular strategy is to stop the test when a pre-specified precision is reached.

Problems of adaptive testing

One of the drawbacks of a CAT is that each item carries more 'weight' for the estimation of the ability of the examinees because the total length of the test is generally shorter. Therefore, under certain circumstances a CAT may not be very robust to one or two unexpected wrong answers (e.g., carelessness). However, this problem may be tackled by allowing the computer to decide when to stop testing based on various statistics. Moreover, because of the shorter tests, it is not easy to cover all the relevant sub-domains. Therefore, if an examinee has increased or reduced knowledge of a sub-domain (e.g., because of absence), then the possibility of a biased ability estimate is larger. Item overexposure is another severe problem. When items are overexposed, they are more likely to be memorised and shared among examinees. Various techniques and algorithms have been developed by researchers to tackle this problem in a practical fashion. Finally, because of the adaptive nature of a CAT, each examinee is expected to get around 50% of the items correct. This may encourage some examinees, especially the less able, because on a CAT they would get more correct responses than they are used to. On the other hand, this characteristic may frustrate the most able students, who are used to getting most of the items in paper-and-pencil tests correct.

Components for an adaptive test

To build an adaptive test we need the following 'ingredients':
– A calibrated item bank: This is a large number of items which are jointly calibrated through the Rasch model or other similar models. The item difficulties and other item characteristics are computed and are used for the selection of the most appropriate items to be administered to each examinee.


– A delivery system: This is software with two major components: the first is the user interface and the second is responsible for the administration of the items.
– The statistical algorithms: This component is responsible for the estimation of the ability of the examinees and for the selection of the most appropriate items to be administered (a minimal sketch of this component is given at the end of this section).

Computer-adapted and self-adapted tests

Two different philosophies are referred to in the relevant literature. The first philosophy allows the software to select the difficulty of the next item based on the previous responses. This is the usual CAT system (computer-adapted). The second philosophy allows the examinees to determine the difficulty of the next item to be delivered. For example, an examinee may ask for easy items because of a lack of self-confidence. This is self-adapted testing. The computer-adapted test is statistically optimal but it is likely to generate increased anxiety: since most pupils get around 50% of the items correct, it can distress the most able pupils but encourage the less competent ones. The self-adapted test strategy has been reported to reduce test anxiety, generate a feeling of control and increase challenge for some pupils, but it is statistically suboptimal. Overall, the self-adapted test strategy has been found either to promote higher proficiency estimates or to generate proficiency estimates that are less affected by test anxiety. This philosophy has not yet found its way into large-scale commercial packages and is not very popular because it is not psychometrically optimal.

Benefits for school teachers

Computerised adaptive tests are becoming more and more necessary, especially when considering the application of a 'test-when-ready' strategy. CAT can potentially offer invaluable assistance to teachers, especially for the generation of diagnostic feedback and for deciding when pupils are ready to take external high-stakes tests (the testing-when-ready strategy). Teaching time dedicated to testing can be reduced considerably while maintaining the precision of measurement, and the time saved may be used for teaching. Time dedicated to teacher-made tests can instead be used to interpret and digest the feedback from the adaptive test and for remedial teaching. Computerised adaptive tests are usually of high quality and can yield valuable cognitive information for the pupils if such a facility is available in the software package used. The test results may be used to predict achievement on external high-stakes tests. Finally, the tests are readily available at any time with minimal cost in time and money.
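To make the 'statistical algorithms' component more concrete, here is a minimal sketch of one adaptive cycle under the simple (dichotomous) Rasch model. It is an illustration only, not code from this book or from any particular CAT package: the item bank, the starting ability of 0 logits and the stopping rule (a standard error below 0.5 logits, or an exhausted bank) are assumptions made purely for the example.

```python
import math
import random

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the simple Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def update_ability(theta, administered, responses, iterations=20):
    """Maximum likelihood ability estimate via Newton-Raphson."""
    info = 0.0
    for _ in range(iterations):
        p = [rasch_probability(theta, d) for d in administered]
        info = sum(q * (1 - q) for q in p)                 # test information
        score_residual = sum(r - q for r, q in zip(responses, p))
        theta += score_residual / info
    return theta, 1.0 / math.sqrt(info)

def adaptive_test(item_bank, answer_item, start_theta=0.0, target_se=0.5):
    """Administer items one at a time until the ability estimate is precise enough."""
    remaining = dict(item_bank)                            # item id -> calibrated difficulty
    administered, responses = [], []
    theta, se = start_theta, float("inf")
    while remaining and se > target_se:
        # Select the item whose difficulty is closest to the current ability estimate.
        item_id = min(remaining, key=lambda i: abs(remaining[i] - theta))
        difficulty = remaining.pop(item_id)
        correct = answer_item(item_id, difficulty)         # 1 if answered correctly, else 0
        administered.append(difficulty)
        responses.append(correct)
        if 0 < sum(responses) < len(responses):            # ML needs a mixed response pattern
            theta, se = update_ability(theta, administered, responses)
    return theta, se, len(responses)

# Hypothetical calibrated bank and a simulated examinee of true ability 1.0 logit.
bank = {i: d for i, d in enumerate([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])}

def simulated(item_id, difficulty):
    return int(random.random() < rasch_probability(1.0, difficulty))

print(adaptive_test(bank, simulated))
```

The sketch mirrors the three 'ingredients' listed above: the dictionary plays the role of the calibrated item bank, the loop is a bare-bones delivery system, and the estimation and item-selection lines stand in for the statistical algorithms.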


SUMMARY

Further applications of the Rasch model have been described in this chapter. The Rating Scale model was briefly presented; it can be very useful for the measurement of attitudes and other psychological dimensions. Many practitioners are tempted to use the Partial Credit model instead of the Rating Scale model, especially when they analyse data from questionnaires. Although the Partial Credit model may have better fit, use the Rating Scale model whenever it is appropriate. Multidimensional models were also presented as a solution to the problem of multidimensionality. Although the use of those models is still restricted, it is anticipated that they will be used more frequently in the near future. Teachers, as consumers of educational and psychological research, should be aware of their existence. Possibly the most promising application of the Rasch model was also presented: computerised adaptive testing. CATs can be an invaluable aid to the teacher if they are used carefully. Do not assume that they will do the teaching for you; they can, however, save you much time and provide you with valuable diagnostic feedback. Finally, you are reminded that this chapter did not aspire to give you in-depth and detailed information. More reading is certainly needed in order to help you learn more about those issues.

For more information about the Rating Scale model see: Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573. Also try: Wright, B. D., & Masters, G. N. (1982). Rating Scale Analysis. Chicago: MESA Press.

-oOo-

REVIEW QUESTIONS

T F  The Rating Scale model is an extension of the simple Rasch model.
T F  The Multidimensional models do not share the assumptions of Unidimensionality and Local Independence.
T F  The Multidimensional models were developed for the cases where we want to use multiple models (e.g., the Partial Credit and the Rating Scale models) in the same analysis.
T F  The Rating Scale model should not be used in tests where different items have different numbers of steps/categories.
T F  The Computerised Adaptive Tests select the most appropriate items for a person based on his/her previous responses.
T F  When a Computerised Adaptive Test is used, the examinees are expected to get 50% of the responses correct.
T F  The Partial Credit model is a more general case of the Rating Scale model, where the same category/step on different questions can have a different estimate.


CHAPTER 10

PLANNING, PREPARATION AND ADMINISTRATION OF ASSESSMENTS

In the first chapters we defined the nature and purpose of assessment. We also discussed the concepts of reliability and validity and we visited various technical methods for the analysis and evaluation of assessment results. It is now time to focus on planning an assessment for a subject, unit or module. This chapter will emphasise the planning of an assessment through what is called a table of specifications. We will also deal with issues of assessment administration and we will focus on the special arrangements necessary when administering assessments to people with special needs. For any assessment to be accurate and relevant it must be based on the learning outcomes (e.g., the competencies or objectives of instruction) and the content covered in the course. In this chapter you will be shown how to prepare the specifications for an assessment. The main point to be emphasised is that assessment is linked to the syllabus. Fortunately, many curriculum and syllabus documents address the issue of assessment for the teacher. They outline a general assessment program for the subject that is being taught and describe particular assessments for the topics that are being studied. Situations also exist, however, where teachers or trainers are required to develop their own program or curriculum or have considerable latitude in assessment.

DEVELOPING ASSESSMENT SPECIFICATIONS

The blueprint for an assessment is called the Table of Specifications. This gives you the assurance that your assessment is based on the content and learning outcomes. Preparing the table of specifications involves some initial steps: (a) listing the main topics; (b) identifying the general learning outcomes; and (c) preparing a two-way table of content and learning outcomes. The following sections will explain these steps in some detail for you.

Step 1: Identify the subject content, competencies or learning outcomes

The first step is a one-way table of specifications based on the topics or competencies. In effect, you will need to produce an outline of the assessment program and the weighting of topics or competencies. You will also need to indicate the assessment weighting attached to each method (if you feel that you need to have differential weights for different methods). Then, you will determine


the forms and methods of assessment, produce a one- or two-way table of specifications and, finally, identify the content, competencies and/or learning outcomes.

Table 55. A one-way table of specifications for Computers A First Course

TOPICS                       WEIGHTS
File management                20%
Editing                        20%
Formatting                     30%
Blocking (cut and paste)       20%
Page layout                    10%

Source: Cecelia Cilesio

The topic areas listed in a table of specifications indicate the instructional content that will be sampled. The topic outline should be done in sufficient detail to ensure that no major topics are omitted. It does not need to include every sub-topic covered in class. The ultimate aim is to ensure that the assessment does not give too much emphasis to minor points. Around seven major topic areas should suffice. The values under the heading WEIGHTS are the teacher's judgment as to how important the topic was in a course. Down the right-hand column you can enter the weights; these represent the value of the topic for assessment purposes. A useful way to decide on the value of a topic is to ask yourself: 'How much time did I spend teaching this?' The number of hours spent on a topic can act as a broad indicator of the topic's value. Sometimes, however, time is not a good indicator. One reason for this is that the subject may centre on a common topic or that some early sessions are really only introductory; in that case, you should feel free to vary the time allocations. It is up to you as a professional to decide the values. Another reason is that a critical topic may only require a small amount of teaching time; in this case you should also feel free to vary the time allocation. By and large, time allocations are broad indicators of importance.

Using these weights, you could now design an assessment for Computers A First Course. If it were going to be a test with 10 items, then you would allocate two items to File Management, two to Editing, three to Formatting, two to Blocking and one to Page Layout, each one carrying the same amount of marks (say, they are marked dichotomously: one mark for a correct response and no marks for an incorrect response). Alternatively, the number of questions may not be proportionate to the weights, but the amount of available marks may be larger for those topics with larger weight: for example, you may have only one question on File Management, but it may carry twice as many marks as any of the other questions and it may demand twice the amount of time to complete as well.
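If you prefer to let the computer do this arithmetic, the sketch below turns the weights of Table 55 into item counts for a test of any length. The function name and the largest-remainder rounding rule are choices made for this example rather than anything prescribed by the book; the rounding simply guarantees that the counts add up to the planned number of items.

```python
def allocate_items(weights, total_items):
    """Split a test of `total_items` questions across topics in proportion to their weights."""
    exact = {topic: w * total_items / 100 for topic, w in weights.items()}
    counts = {topic: int(share) for topic, share in exact.items()}
    # Hand out any remaining items to the topics with the largest fractional parts.
    leftover = total_items - sum(counts.values())
    for topic in sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)[:leftover]:
        counts[topic] += 1
    return counts

weights = {"File management": 20, "Editing": 20, "Formatting": 30,
           "Blocking (cut and paste)": 20, "Page layout": 10}
print(allocate_items(weights, 10))
# {'File management': 2, 'Editing': 2, 'Formatting': 3, 'Blocking (cut and paste)': 2, 'Page layout': 1}
```

For a 10-item test this reproduces the 2-2-3-2-1 allocation described above.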


Some assessments, however, do not have questions or cannot be subdivided. This includes practical exercises, case studies and skills assessments. In this case, the weights can represent the allocation of assessment emphasis across the content of a subject rather than the number of questions on a topic. So if you decided to set a single practical exercise for the assessment of Computers A First Course, then 20% of your assessment judgment would be allocated to File Management, 20% to Editing, 30% to Formatting, 20% to Blocking and 10% to Page Layout. The weights can therefore represent the number of items in an assessment or the assessment emphasis. The aim is to ensure that the assessment accurately represents the emphasis in the curriculum. You may recall from a previous chapter that a technical way of expressing this is to say that it has content validity. Remember that you are free to vary the weights according to your professional opinion.

When you are preparing a short assessment on a limited topic, it may be sufficient to stop at this stage. You would indicate the key tasks in the topic and the assessment emphasis relevant to each task. The one-way classification of topics may also suffice when the area you are assessing is limited to a number of clearly defined topics. You may not need to proceed further to a two-way table of specifications and can skip the next stage. If you are assessing an entire subject, then it is also worthwhile to take into account the learning outcomes for your subject area.

Assessing competence

If you have a syllabus that is competency-based then you would focus only on the learning outcomes or competencies. We recommend that you work at the level of units of competence. (Note that not everyone agrees on this point.) Now, back to competence: a unit of competence is a component of a competency standard and is a statement of a key function or role in a job or occupation. Units of competency are the components against which assessment and reporting occur for the purpose of gaining credit towards a qualification. The teacher would list all the learning outcomes or competencies and decide how much assessment emphasis is required. This is similar to a one-way table of specifications. If a course had four competencies or learning outcomes, then you might rate them in order of importance. A sample table of specifications for a competency-based syllabus is provided in Table 56. The important step is to decide how much assessment emphasis should be given to each learning outcome. One reason why we do not need topics as much as we do in the examples above is that the topics and outcomes in a competency-based course tend to be directly related. Some competency-based curricula (especially those relating to vocational education) are specified for you in advance and you do not need to go through these detailed steps. An example of competencies from a secondary curriculum might assist. This is from the draft Business Services Curriculum Framework for the Higher School Certificate.1


Table 56. Table of specifications for Microcomputer Hardware

Unit of competence                                                                     Assessment emphasis
Select from supplied components those suitable for assembly into a functioning unit          15%
Assemble a microcomputer using safe assembly procedures                                      35%
Assess the finished unit                                                                     20%
Document the procedures followed during installation and testing                             10%
Provide an up-to-date report of a microcomputer's configuration                              20%

Here the assessment is prescribed and detailed. It is complicated because it has been designed to include a range of assessments. Firstly, here are some units of competence (see Table 57) and their indicative hours (250 hours) for one of the business service strands.

Table 57. Units of competence for a Business Services Strand

Units of competency                                    Hours
Participate in workplace safety procedures               15
Work effectively in a business environment               15
Organise and complete daily work activities              15
Communicate in the workplace                             15
Work effectively with others                             15
Use business technology                                  20
Process and maintain workplace information               20
Prepare and process financial/business documents         25
Provide information to clients                           15
Handle mail                                              15
Produce simple word-processed documents                  25
Create and use simple spreadsheets                       20
Maintain business resources                              15
Create and use databases                                 20

Source: Board of Studies, Business Services Curriculum

In this subject, the units of competency comprise the table of specifications. Only a part of the planning process that we have outlined is required. For instance, the forms and methods of assessment would feature a student log book and on-the-job training. Competence is decided against the performance criteria set out under each element of competency (each unit of competence is composed of elements and performance criteria). These are progressively noted in the student log book. A qualified assessor conducts the assessment using an evidence guide for each unit of competence. In many respects the assessment is prescribed because all elements of competency must be achieved in order to demonstrate the achievement of a unit of competence.


Listing the learning outcomes

A table of specifications can also include learning outcomes. In this step, we need to list all the learning outcomes for a subject; Table 58 provides a continuation of the earlier example.

Table 58. Learning outcomes for Computers A First Course

Learning outcomes
I. Identify different components of the computer system and MS-DOS
II. Create a document using the formatting, blocking features; spell checker; save, retrieve and print
III. Create a spreadsheet; calculate totals using built-in functions; save and print
IV. Create a database; extract specific information; produce lists and reports
V. Analyse programming problem; draw flowchart; write code

Source: Cecelia Cilesio

Problems that may arise with some syllabus documents are that the learning outcomes (a) may not be stated, (b) may be poorly phrased, (c) may not reflect what is actually being taught or (d) may be too numerous to compile into useful groups. Let us look at these problems one at a time. If there are no learning outcomes in the syllabus, then consider (a) whether the aims for the subject can be converted into learning outcomes, or (b) whether to rely just on the topics set out in Step 1. Usually the aims are broad and general in nature but they may still serve your purpose. Sometimes the learning outcomes are poorly phrased. Learning outcomes need to be stated in specific terms such as 'can apply communication principles to case work with people of non-English speaking background'. An objective such as 'The student should appreciate the importance of communication in welfare work' is not helpful here. If the learning outcomes are not stated in terms of performance then they need to be re-phrased. In some instances the written learning outcomes may not reflect what is actually taught. The aim is to work with learning outcomes that can be used to guide assessment. The questions to ask are: What do I want my students to achieve by the end of this subject? What should they know that they do not already know? What should they be able to do that they couldn't do at the outset? Or, in what way might they demonstrate that their attitudes have changed? Hopefully, you will be able to produce learning outcomes that are stated in direct, observable terms and which can be assessed. If the learning outcomes do not reflect what is actually taught, then you may wish to add learning outcomes to the syllabus, if this is possible (in many cases, however, syllabi are centrally constructed and enforced by the Ministry of Education and other related bodies). If your syllabus document is one of those that has numerous learning outcomes for each topic or provides a long list of competencies, it is unrealistic to expect you to focus on all of these when you have to plan an assessment. You need to ask yourself whether there is some way in which these diverse learning outcomes can be summarised or grouped together. Can they be reduced in number?


Step 2: Producing a two-way table of specifications

It is now possible to put together the topics and the learning outcomes to produce what is called a two-way table of specifications. (If you use competencies then you can omit this stage.) The purpose of this is to determine what proportion of the assessment should be devoted to each topic and outcome. In Table 59, we have set out an example of the steps involved in preparing a table of specifications. This example is from the editing strand in the Advanced Certificate in Film and Television Production. This table is particularly important because it indicates to the teacher how the subject is structured and where the emphasis lies in both instruction and assessment. It has been rightly called a 'blueprint'. You may notice that there are some gaps in this table; these indicate that some topics and learning outcomes did not overlap. This is to be expected.

Let us repeat that the most important feature of the two-way table of specifications is that it shows you the emphasis in a subject at a glance. It summarises content and learning outcomes. If you look carefully at the two-way table of specifications you will see that the four topics (A, B, C, D) and the four learning outcomes (1, 2, 3, 4) overlap to a great extent. This is not unusual in many syllabus documents. If the overlap is considerable then just use the learning outcomes. From the two-way table you can see that the key areas to assess are topic C and learning outcome 3, as well as topic D and learning outcome 4; they account for almost half the assessment emphasis. Do not overlook the other areas. Now you could use either one assessment method for this subject or a number of assessment methods to get at the particular aspects. Our guess is that you would use multiple assessments that get at those areas of the two-way table where there are substantial weights. As for the numbers in the table, let us assure you that these are only an approximate guide. Assessment planning is not rocket science, so the numbers do not need to be exact, but they do help to formulate your decisions. They act as a guide to relative importance. The two-way table of specifications is useful when there are multiple topics and learning outcomes.

Tables of specifications can come in many forms. In the WebResources, you will see a table of specifications for a Year 9 Mathematics unit test. This has not identified the learning outcomes but grouped them under general headings based on a taxonomy or classification of educational objectives. In simpler language, it has grouped them under knowledge, understanding and higher mental processes. So far we have considered the table of specifications and applied it to specifying the assessment emphasis. If you have topics and learning outcomes in your syllabus then aim for a two-way table of specifications; if you have only topics or competencies or learning outcomes then aim for a one-way table of specifications. (Visit WebResources for more examples of tables of specifications.) This analysis of a curriculum into content and learning outcomes forms the basis for a criterion-referenced assessment approach. This planning process provides a powerful foundation for the next steps in the assessment process.


Table 59. Steps involved in preparing a table of specifications

ADVANCED CERTIFICATE IN FILM & TELEVISION PRODUCTION – EDITING STRAND

One-way table of specifications

Topic                  Weight
A Picture editing       18%   (36 hrs)
B Video editing         18%   (36 hrs)
C Film production       36%   (72 hrs)
D Track laying          28%   (56 hrs)
                       100%

Learning outcomes
1 Edit mute action scenes and sync sound dialogue scenes on film
2 Set up and operate video editing equipment to perform complex editing procedures
3 Participate in various roles in all phases of pre-production and post-production of a simple sync sound film
4 Lay sound tracks to an edited motion picture image and produce a sound mix

Two-way table of specifications

TOPIC       LEARNING OUTCOMES              WEIGHTING
              1     2     3     4
A            12           6                    18
B                  18                          18
C             6          24     6              36
D                         6    22              28
TOTAL        18    18    36    28             100

Source: Denise Hunter
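One simple consistency check is that the row and column totals of the two-way table should reproduce the one-way weights. The few lines below confirm this for Table 59; the nested-dictionary layout and the cell values follow the table as reconstructed above and are shown only as an illustration.

```python
# Cell weights of the two-way table in Table 59 (topic -> learning outcome -> weight).
table = {
    "A": {1: 12, 3: 6},
    "B": {2: 18},
    "C": {1: 6, 3: 24, 4: 6},
    "D": {3: 6, 4: 22},
}
topic_weights = {"A": 18, "B": 18, "C": 36, "D": 28}

row_totals = {topic: sum(cells.values()) for topic, cells in table.items()}
column_totals = {}
for cells in table.values():
    for outcome, weight in cells.items():
        column_totals[outcome] = column_totals.get(outcome, 0) + weight

print(row_totals == topic_weights)        # True: each topic total matches the one-way weight
print(sorted(column_totals.items()))      # [(1, 18), (2, 18), (3, 36), (4, 28)]
print(sum(row_totals.values()))           # 100
```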

The remaining stages focus on determining the methods of assessment and how to weight them. There are a number of checks and balances in this process that will ensure some consistency in your decision-making. (Visit WebResources for more examples of tables of specifications.) In this section we shall use another example of an assessment program developed for a technical and further education subject, Land Information Systems. This subject provides for an understanding of operating systems so as to achieve efficient system administration, as well as the strategy and methods involved in formulating tender specifications and the evaluation of software. It is offered as an elective for those students who require training in the more advanced areas of computer use and development. (Note that we slightly modified the learning outcomes for the purposes of this example.) The first few steps are familiar to you and have been completed. Since there are too many tables, please refer to the


WebResources for a complete list of all the relevant information, tables and figures.

Step 3: Determine the forms and methods of assessment

This is a new step for you. At this stage you decide what forms and methods of assessment will be used for the various learning outcomes. These are then set out in a table. Some comments may be provided to describe the methods and to provide some justification. In this example, the teacher resolved to have two class assessments, three minor assignments, two class practicals and a minor assignment in the form of a case study. The details are set out in the WebResources under the heading of Forms and Methods of Assessment. The teacher has indicated the number of assessments, their content (mainly knowledge), the forms and methods of assessment (mainly questioning, simulation and skills tests), the summative intention and criterion-referenced nature of the assessments, as well as the indication that they are formal assessments that are standardised, with marking criteria, but locally set and locally marked. The advantage of this step is that someone should be able to come along, study this and then have a fairly good idea of what assessments to develop using these descriptive criteria in Step 3.

Step 4: Indicate the weights/marks for each form of assessment

The next step in the planning process relates to giving some assessment weight to the methods of assessment. In the example from the WebResources, the teacher decided to give 60% emphasis to class assessments, 20% to class practicals and 20% to assignments. This means that there is greater emphasis on theoretical knowledge in this subject. It may or may not be the case that this overestimates the role of knowledge, retention and recall. Some might think that greater emphasis should have been given to practical or performance tasks that incorporate a knowledge component, but it is not for others to decide: it is the teacher's prerogative to decide upon the assessment emphasis of their subject (unless, of course, all this thinking has been done for you in the curriculum materials). At this point we might also allocate the marks for each assessment across topics. The right-hand values are those from the very first steps; the values in the final row are those from the methods of assessment. This process of planning uses a number of such built-in checks to ensure that you weight the various sections consistently.

Step 5: Detail the marks for the full assessment program and produce a breakdown of marks for each topic in the unit

The final step in this process is to outline the assessment program for the students. This gives them an idea of the methods of assessment, the mark allocation and how


much emphasis is placed on each topic. It would form part of a larger document, such as a course outline or assessment program, handed out to students.

THE PREPARATION AND ADMINISTRATION OF AN ASSESSMENT

There are many ways in which you can prepare assessments that will streamline the professional aspects of teaching and learning. In this section, the emphasis will be on the effective development and administration of assessments within a classroom context. Administration of assessments is a term that means giving the test or assessment to a person or group; it has nothing to do with the paperwork associated with assessment systems. Commercial test publishers, who concentrate their efforts on one test, are able to provide manuals (i.e., a user's guide) for their standardised tests. These indicate every feature of the administration, scoring and interpretation of their test. The manual enables test users to become acquainted with all the materials and procedures in order to ensure some uniformity in administration. This level of preparation is not suited to one-time, one-group classroom assessments, nor is it feasible even for many one-time, large-scale educational assessments. It may not even be feasible (time-wise) when you want to prepare an assessment which you can reuse every year. Your assessments have to be largely self-contained and self-explanatory (i.e., not needing a manual for use). The ultimate aim is to ensure that the assessment is prepared thoroughly and administered fairly. It is difficult to conceive of a situation in which educational assessment is completely unplanned, so the focus of what follows is on more effective preparation for such assessment. In this section we shall outline some basic steps for preparing classroom tests. We emphasise paper-and-pencil tests, but the principles that we outline also apply to practical assessments and attitudinal questionnaires.

Some general aspects of preparation for assessment

Preparation for an activity like assessment is important in order to avoid errors in conducting the assessment. Some of these errors can lead to appeals and grievance complaints that are not pleasant for a teacher or instructor to encounter. An obvious precursor for assessment is a thorough knowledge of the subject being assessed. This involves being familiar with the relevant criteria for performance or standards for competency. This is important so that you know how to determine standards for scoring. Your assessment should be in line with the expectations of other professionals and the practices in your industry. The preparation for assessment also assumes that people are informed appropriately of the nature and scope of the assessment and assessed under suitable conditions. This is part of being fair to learners. For too long assessments have been seen as some sort of punishment or torture, and this is not consistent with education or training. Learners need to know how they are going to be assessed and on what they are going to be assessed. We find it difficult to justify the idea that the contents of


(occupational or educational) assessments should always be a closely-guarded secret revealed only on the day of the assessment. For instance, people of all ages need to know the reasons for the assessment, the way in which the assessment is to be conducted and whether further attempts can be made. Inform learners (according to their level of understanding) of any formal policies or guidelines that are relevant to assessment. Indicate how judgments will be made and how the results of the assessment will be handled (e.g., confidentiality, access to results). It may seem strange to you, but a person should really have the right to postpone or decline an assessment if it is not in their best interest. In most cases the responsibility for arranging and conducting assessments will lie with you, as a teacher or instructor. In commerce and industry there may be a requirement to determine assessment techniques and conditions with the relevant industrial parties. In educational settings the formal aspects of assessment may be clearly set out in your syllabus or curriculum documents from your institution or external authorities (e.g., Board of Studies, accreditation documents). Having established these preliminary considerations you can then turn to the task of preparing the assessment.

Preparing a test

There are many ways in which you can prepare classroom tests in education and training. The following sections outline some general steps. These are directed mainly at written tests, but many of the principles also apply to other forms and methods of assessment.

Step 1: Writing the questions

Part of the first step in preparing a test (i.e., written or practical) is to develop a table of specifications for the subject or topic in order to ensure that the test has content validity. The quality of a classroom assessment depends mainly on giving the proper weight to topics and learning outcomes rather than on worrying about which item format to use. We covered this step in the previous sections. Next, you need to decide on the proposed length of the test. This is determined mainly by the time available. If there is going to be a time limit, then aim for around 95% of your students being able to complete the test within the time required. You can allow around one minute for each multiple-choice question, around half a minute for each true-false question and about 60 minutes for a 1,000-word essay. Be careful to accommodate the needs of all of your students: if you have a mixed-ability class (where students have very different capabilities) you may need to spend more time organising and thinking about the assessment. At this point, you need to select items, questions or tasks from existing sources (e.g., an item bank) or write new questions. If there are no item banks available (i.e., an organised collection of questions or tasks) from which you can select material for the assessment, then you will need to build new questions/tasks from scratch.
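The timing guidance above can be turned into a quick planning check. The small function below simply applies the per-item rules of thumb quoted in this paragraph (one minute per multiple-choice question, half a minute per true-false question, about 60 minutes per 1,000 essay words); it is an illustrative sketch, and the example figures are hypothetical.

```python
def estimated_minutes(multiple_choice=0, true_false=0, essay_words=0):
    """Rough working time implied by the rules of thumb quoted in the text."""
    return multiple_choice * 1.0 + true_false * 0.5 + essay_words * 60.0 / 1000.0

# A hypothetical 40-minute period: 20 multiple-choice, 10 true-false and one 250-word answer.
print(estimated_minutes(multiple_choice=20, true_false=10, essay_words=250))  # 40.0
```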


Once the questions have been chosen (or constructed) then marks can be allocated to the various questions. Items do not have to be given equal weights, especially when some are much more difficult or time consuming than others. The allocation of marks can be made on the basis of (a) the difficulty of the items, (b) the proportion of the subject matter that is covered (i.e., their weighting in the table of specifications), or (c) the time a student needs to complete the task or answer the question.

Step 2: Preparing the test

The next step involves preparing the test. This involves the setting out of the questions and preparing any instructions for other users. Some factors to consider in setting out the questions are:
– indicate the title of the test, course, subject, date and time;
– specify instructions, such as whether to guess in multiple-choice questions;
– set out details of any time limits;
– offer information about choice of questions or options;
– indicate the number of questions to be attempted;
– list any aids or materials that can be used;
– group similar types of questions together (e.g., all true-false, all short answer);
– if time is limited then present the easier questions first;
– number each question;
– ensure that a question is not split between pages;
– indicate where answers are to be recorded;
– place directions at the bottom of each page advising students to continue;
– provide space for writing answers if required; and
– indicate the marks for each question.

There are many options available to you in the format of the tests or tasks that you are preparing. You may wish to consider whether the test should be an open-book, take-home or closed test. Another option that is used regularly is to provide students with the test questions in advance. An open-book test is one in which students can bring with them any specified texts, materials, notes etc. Open-book tests are used widely in higher education where memorisation is not required and are recommended because they reduce test anxiety. Take-home tests have some of the advantages of open-book tests and also remove some of the time pressure. They have the disadvantage of all assignments, namely, that there is no guarantee that the work was completed by the student. The differences between these various methods need to be investigated for your class. Often the effect on the rank ordering of performance will be minimal, but the positive benefits on class morale can be quite substantial, especially for those whose experience of tests and exams has been negative.


Step 3: Preparing a marking guide

The third step involves the development of a marking guide or scoring key for the test. A scoring key lists all the correct answers. The marking guide provides an indication of the key elements for scoring and may require you to prepare ideal answers for each question.

Step 4: Proof-reading and review

It is recommended that a copy of the final test be proof-read by a colleague or someone else to alert you to any errors or omissions. At this stage it may also be helpful to review once again the length of the test or the number of tasks required to be completed, because length is a major factor in the reliability of test results.

Step 5: Administering the test

Ideally tests should have standard instructions. Instructions for assignments, projects etc. can include: deadlines for submission; marking criteria; and any guidelines on referencing, style and format. Where verbal instructions are used, these may need to cover issues such as:
– whether guessing is permitted;
– how to change an answer or correct an error;
– how much reading time is provided;
– whether the use of calculators, aids, notes, texts etc. is permitted;
– whether there is additional help for persons with disabilities; and
– what assistance is given to persons of non-English speaking background.

The teacher should always bear in mind that the instructions preceding each test are there to ensure that every person taking the test has, as far as is possible, an equal initial understanding. This is why it is important to read any prescribed directions and not to vary instructions across groups, as this can influence results. No one should underestimate the importance of careful administration of tests, especially as it is often necessary to carry them out under conditions that are far from ideal. Concessions to standard assessment may be required for students who are disadvantaged. For instance, in the case of persons with disabilities there should be scope for: additional time, amended tests, alternative assessments, readers, an amanuensis, interpreters or alternative venues. Standard translation dictionaries may be permitted for students from a non-English speaking background. We will come back to this issue later.

Step 6: Marking the test

The process of marking has a direct impact upon the validity of a test. There are many potential suggestions for improving marking and these are aimed at increasing the objectivity or accuracy of the process. Some recommendations include:
– where possible preserve anonymity by using student numbers rather than names;
– ideally have someone independent mark assessments;
– mark the same item in all tests before going on to the next question; and
– award marks rather than deduct marks.

Modern assessment procedures depend on numerous situations that really turn on a person's honesty, and the next section considers some aspects of cheating that may be of interest or concern to educators and trainers who are involved in the preparation of assessments. It outlines some broad aspects of the nature and extent of cheating in education.

CHEATING AND ASSESSMENT

Cheating is an important area for educational assessment not only because it reduces the validity of results but also because it is anathema to widely-held public principles of equity and truthfulness (see Cizek3 for a comprehensive review of the topic). The essence of cheating is deliberate fraud or deception, and it involves a wide range of behaviours. These can vary in their seriousness, execution, purpose and social dimensions4.

There is no consensus in estimates of the extent of cheating, but it has been viewed as a major problem, with the majority of students indicating that they have been dishonest. There have been at least 21 self-report studies published from 1964 through to 1999. The overall proportion of female students cheating varied from a low of 0.05 to a high of 0.97 (median = 0.569) and for men the proportion varied from 0.16 to 0.91 (median = 0.612). Accumulating the findings across the studies showed that 21% of females (22,334 out of 112,328) and 26% of males (31,743 out of 120,188) had cheated. If a study with an extremely large sample is excluded from this analysis, then the proportions increase dramatically to 60% for both males and females.

Cheating by taking, giving or receiving information from others
– allowing own coursework to be copied by another student;
– copying another student's coursework with their knowledge;
– submitting a piece of coursework as an individual piece of work when it has actually been written jointly with another student;
– doing another student's coursework for them;
– copying from a neighbour during an examination without them realising;
– copying another student's coursework without their knowledge;
– submitting coursework from an outside source (e.g., a former student offers to sell pre-prepared essays, 'essay banks');
– premeditated collusion between two or more students to communicate answers to each other during an examination;
– obtaining test information from other students.


Cheating through the use of forbidden materials or information
– paraphrasing material from another source without acknowledging the original author;
– inventing data (i.e., nonexistent results);
– fabricating references or a bibliography;
– copying material for coursework from a book or other publication without acknowledging the source;
– altering data (e.g., adjusting data to obtain a significant result);
– taking unauthorised material into an examination.

Cheating by circumventing the process of assessment
– taking an examination for someone else or having someone else take an examination;
– attempting to obtain special consideration by offering or receiving favours through bribery, seduction or corruption;
– lying about medical or other circumstances to get special consideration by examiners (e.g., to get a more lenient view of results, extra time to complete the exam, an extended deadline, or an exemption);
– deliberately mis-shelving books or journal articles in the library so that other students cannot find them, or cutting out the relevant article or chapter;
– coming to an agreement with another student or colleague to mark each other's work more generously or harshly than it merits;
– illicitly gaining advance information about the contents of an examination paper;
– concealing teacher or professor errors;
– threats, blackmail or extortion.

Sources: Baird, 1980; Cizek, 1999, p. 39; Newstead, Franklyn-Stokes & Armstead, 1996, p. 2325

Issues surrounding cheating are largely a matter for educational policy and administration. Sometimes there seems to be an over-concern for cheating, with elaborate precautions, and at other times there seems to be a naïve view that the results are genuine. We do not have any answers to these policy issues. One important issue, however, is that educational certification has gained its value through the fact that it seeks to assess the level of skills, knowledge and attitudes in a truthful and reasonably fair manner. We would imagine that many people in the community would be surprised at the extent to which the results in some courses cannot be verified. All of this detracts from the validity of our results. For summative assessments that lead to a public certification, the best that we can do is to use multiple sources of evidence and to ensure that most of these can be verified as the person's own work. If you are interested, there are whole web pages designed to offer 'valuable' information to students on how to cheat on tests. For an example, visit http://www.wikihow.com/Cheat-On-a-Test.


SPECIAL ARRANGEMENTS

Special arrangements (during assessments) are also sometimes called 'access arrangements', especially in England. The intention is, as far as possible, that all students should have an equal opportunity when taking an assessment, and should not be placed at a disadvantage because of the means used to examine them (e.g., a student with some hearing loss should not be penalised because he or she cannot hear the teacher's instructions accurately during a test). On the other hand, it is worth mentioning that any special arrangements should not give any student an unfair advantage over others. There are many legal and moral issues when compensating a student for a disability – you need to be very careful not to cross the line by giving an unfair advantage over others. It is customary in many countries for the following arrangements to be in place, depending on the disability or the conditions of a student:
– Extended testing time (almost all tests are timed)
– Additional rest breaks
– Writer/recorder of answers
– Reader
– Sign language interpreter (for spoken or oral directions only)
– Braille
– Large print
– Large-print answer sheet
– Audio recording
– Audio recording with large-print figure supplement
– Audio recording with raised-line (tactile) figure supplement
– Use of a computer

It is natural that different institutions may have different policies in place (or may not have a formal policy at all), depending on the formality of the assessment. In the case of high-stakes examinations, for example, you will need to obtain the formal policy of your organisation in writing before offering any special arrangements to students. It is also customary, especially in the case of high-stakes examinations, that students eligible for special arrangements are given a formal letter explaining to the director of the examination centre the exact special arrangements that must be offered to the candidate.

SUMMARY

We hope that these examples from practising teachers provided a useful illustration of how to plan an assessment program for an entire subject. It is applicable to traditional curricula as well as competency-based curricula. For competency-based curricula we would produce a one-way table of specifications, then focus on the forms and methods of assessment (Step 3), allocate the weights for the methods of assessment (Step 4), omit the allocation of weights across topics and then summarise the assessment program for students (Step 5).


We consider that it is a very useful exercise because it forces you to think about the aims and content of your teaching. It also highlights the many decisions that need to be made and the subjectivity involved in deciding on an assessment. The foundations of the planning of an assessment are the topics and learning outcomes of the subject. This is followed by specific decisions about the types of assessments to use and the appropriate allocation of assessment emphasis. Throughout this process there are a number of checks and balances to ensure that your specifications are consistent with earlier decisions. Overall, the content validity of the results obtained is enhanced by focusing the assessment on the topics and learning outcomes. At the very least, it is recommended that you prepare a one-way table of specifications and think about the forms and methods of assessment if you are ever required to plan an assessment program. Preparing for any classroom assessment is a significant responsibility for a teacher or instructor. Some of these guidelines may be of assistance to you and you should feel free to experiment with whatever approach is suitable and fair for your students. The aim is to improve the quality of educational assessment.

-oOo-

REVIEW QUESTIONS

T F  Planning an assessment usually commences with considering the learning outcomes
T F  Learning outcomes are the topics in a course
T F  The blueprint for an assessment is called the Table of Specifications
T F  The values under the heading WEIGHTS in a one-way table of specifications are the teacher's judgments of importance
T F  The weights in a table of specifications represent only the number of questions in an assessment
T F  When you are preparing a short assessment on a limited topic you can stop with a one-way table of specifications
T F  If you have a syllabus that is competency-based then you would focus only on the topics
T F  The row and column headings for a table of specifications consist of topics and learning outcomes
T F  The table of specifications gives the specific procedure for developing an assessment
T F  You can develop a table of specifications for a topic, a unit, a module, an entire subject or even a whole course
T F  A table of specifications is a three-dimensional classification for preparing an assessment
T F  A table of specifications can be used for competency-based assessments
T F  The learning outcomes determine which forms and methods of assessment will be used
T F  In planning an assessment you allocate weights to the methods of assessment
T F  This planning of an assessment program is designed to increase the content validity of your assessments


EXERCISES

1. Take the instructional learning outcomes for a syllabus or curriculum that you teach or plan to teach and estimate how much assessment emphasis should be given to them.
2. Why should a teacher develop a table of specifications for an assessment?
3. Which of the following is likely to come about from not preparing an outline of topics prior to writing a classroom assessment?
   a. too many questions on a relatively minor topic
   b. questions which measure relatively unimportant outcomes
   c. using the wrong type of questions
4. Four topics are assigned the weights 10%, 20%, 40% and 30% respectively. How many questions would come from Topic 3 if an assessment contained 80 questions?
5. A course has five learning outcomes with the following weights: 10%, 20%, 20%, 30% and 20%. If an assessment is to have 60 questions, how many would be based on the fourth objective?
6. Suppose you were preparing a 100-question assessment based on the table of specifications given below. How many questions would you allocate to the particular topics and learning outcomes? Use the table below to complete your answer.

                                    LEARNING OUTCOMES
   TOPICS                           I         II        III       WEIGHT
   A                                                               10%
   B                                                               30%
   C                                                               10%
   D                                                               40%
   E                                                               10%
   WEIGHTING OF
   LEARNING OUTCOMES                70%       20%       10%        100%
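The arithmetic asked for in Exercises 4-6 follows one rule: each topic (or learning outcome) receives its percentage of the total number of questions. The sketch below, with purely hypothetical topics and weights, shows the calculation; rounding may require a small manual adjustment so that the allocations sum exactly.

# Allocate questions to topics in proportion to their weights in a table of
# specifications. The topics and weights below are hypothetical.

def allocate_questions(weights, total_questions):
    """weights: dict mapping topic to its percentage weight (summing to 100)."""
    return {topic: round(total_questions * pct / 100.0)
            for topic, pct in weights.items()}

topic_weights = {"Topic A": 25, "Topic B": 35, "Topic C": 40}
print(allocate_questions(topic_weights, 60))
# {'Topic A': 15, 'Topic B': 21, 'Topic C': 24}
# For a two-way table, repeat the split within each topic using the
# learning-outcome weights.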

7. Plan a comprehensive assessment program for a curriculum unit. The plan should include:
   – a table which outlines the subject/topic matter and related weights for the unit;
   – an analysis of the learning outcomes;
   – a table of specifications;
   – an outline of the forms of assessment and assessment strategies you would use to assess the learning outcomes identified;
   – a table which indicates the allocation of weights for the forms of assessment chosen;
   – a table which indicates the marks for each topic and relates these to the forms of assessment; and
   – a full assessment program sheet which can be given to students informing them of the assessment process in this unit.
   Outline how this method increases the reliability and validity of your assessment.
8. Examine the following questions for assessing knowledge and comment upon them in terms of the criteria and guidelines for evaluating questions6. Consider the appropriate directions and the conditions under which these assessments might be used. Before you comment upon the questions, undertake this as a test. Do not worry about your familiarity with the subject matter. The correct answers are provided at the end of this section. The learning outcomes to be assessed are set out below.
   1. Understanding the importance of pharmaceutical representatives to general practitioners.
   2. Demonstrated knowledge of Foradile, the Aerolizer and the class that it belongs to.
   3. Demonstrated knowledge of Asthma and the treatment options.
   4. The ability to match drugs to their delivery devices.
   5. Understand the pharmacology of bronchodilators.

Samples of essay questions

Spend approximately 15 minutes answering the following: In less than one page outline 5 reasons why a general practitioner (GP) would speak with a pharmaceutical representative. You will be assessed on the clarity of your ideas and, more importantly, the degree to which you relate those ideas to the general practitioner's point of view. 1 point is awarded for each clearly presented idea. An additional point is awarded for each idea that is related to the GP. (Total 10 points)

Samples of short answer questions

What is the name of the asthma delivery device shown below? (1 Point)
As quoted in MIMS, Foradile belongs to which class of drug? (1 Point)
List the generic names of all long acting B2 Agonists? (1 Point for each correct response)


Outline the indications for use of Foradile? (1 Point)
What are the common side effects associated with Foradile? (1 Point)
Complete the following sentence: Foradile is indicated in patients ______ year(s) and over.

Samples of True/False questions

Please answer the following True/False questions by circling the desired response.
Example: On Earth one day is known as being 24 hours long.   True   False

True  False   The clinical characteristics of asthma include cough, wheeze, shortness of breath and chest tightness.
True  False   Pollens and dust mites may trigger the symptoms of asthma.
True  False   Exercise and cold air are non-specific triggers for asthma symptoms.
True  False   Asthma medications are safe to use in pregnancy and during breast feeding.
True  False   In patients with poorly controlled asthma during pregnancy, the condition may represent a significant risk to the developing foetus.
True  False   Australia has one of the lowest asthma mortality rates in the world.
True  False   It is widely accepted that patients with moderate to severe asthma can be adequately controlled with bronchodilator medication alone.

Samples of matching questions

Match the drug trade name with the delivery device that it is commonly prescribed with in Australia by drawing a line from one to the other. Here is an example of how to do this. NB: Each option can only be used once.

FRUIT             COLOUR
Banana            Orange
Strawberry        Yellow
Orange            Red


Samples of multiple-choice questions

Please indicate your desired response by placing a circle around it, as shown in the example below.
Example: Chocolate is a desirable food because of its
   colour   smell   taste   feel   all of the above

The onset time and duration of action for Foradile, Serevent and Ventolin respectively are
– 1-3 minutes and 12 hours, 30 minutes and 12 hours, 1-3 minutes and 4-6 hours
– 1-3 minutes and 12 hours, 1-3 minutes and 4-6 hours, 30 minutes and 12 hours
– 30 minutes and 12 hours, 1-3 minutes and 4-6 hours, 30 minutes and 12 hours
– 1-3 minutes and 4-6 hours, 30 minutes and 4-6 hours, 1-3 minutes and 12 hours

The indicated and approved ages for use of Foradile, Oxis and Serevent respectively are
– 5 years, 8 years and 5 years
– 5 years, 8 years and 4 years
– 4 years, 12 years and 5 years
– 5 years, 12 years and 4 years
– 4 years, 8 years and 5 years

The half lives of Foradile, Serevent and Ventolin respectively are
– 67 hours, 67 hours and 4-6 hours
– 12 hours, 12 hours and 2-7 hours
– 4 hours, 67 hours and 2-7 hours
– 2-7 hours, 12 hours and 4-6 hours

The mechanism of action for Beta agonists is best represented by which statement below
– Beta agonists increase the parasympathetic nervous system activity which inhibits the sympathetic nervous system resulting in relaxed smooth muscle
– Beta agonists increase the sympathetic nervous system activity which inhibits the parasympathetic nervous system resulting in relaxed smooth muscle
– Beta agonists decrease the parasympathetic nervous system activity which inhibits the sympathetic nervous system resulting in relaxed smooth muscle
– Beta agonists decrease the sympathetic nervous system activity which inhibits the parasympathetic nervous system resulting in relaxed smooth muscle

Indicate the response below that contains only inflammatory mediators
– Histamine, leukotrienes, platelet activating factor
– Mast cells, histamine, leukotrienes
– Eosinophils, neutrophils, mast cells
– Eosinophils, neutrophils, platelet activating factor

Answers to the questions above

Essay question: 5 ideas are clearly conveyed, 1 point for each; if those points are discussed from the GP viewpoint the student gains an additional point for each.
Short answer:
1. Aerolizer
2. From top right in a clockwise direction the labels should read: aperture, mouth piece, chamber for capsule, buttons for piercing the capsule, air inlet.
3. Bronchodilators
4. Eformoterol fumarate dihydrate and salmeterol
5. Reversible airway obstruction including nocturnal and exercise induced asthma
6. Headache, tremor (muscle shakes), palpitations (fast heart rate)
7. 5 years
True/False: True, True, True, True, True, False, False
Multiple choice: a, d, c, b, a


CHAPTER 11

ASSESSMENT OF KNOWLEDGE: CONSTRUCTED RESPONSE QUESTIONS

The study of human knowledge falls within the field of cognitive psychology and it is a research area in which there have been significant developments in recent years. Neuroscience and cognitive psychology have changed the way in which we consider people's knowledge, memory and learning processes. Knowledge in cognitive psychology is represented by a range of viewpoints and all of these have different implications for the ways in which we might assess learning. At a basic level knowledge is a series of neural networks, but this is too fine a level for educational assessment, so we need to move to a higher level of analysis. There are cognitive theories in which knowledge is merely a mental representation and ideas are linked in hierarchies. Others see knowledge as sets of propositions that are connected in networks, while others favour the idea of schemata, which are more general organisations than concepts or propositions. Schemata are knowledge structures that people develop in order to help them actively engage in comprehension and to guide the execution of processing. A useful distinction from psychology is that between declarative knowledge (knowing that something is the case) and procedural knowledge (knowing how to perform a task). These two types of knowledge are related. The gist of all these views is the complex storage, retrieval and interrelation of ideas, knowledge and facts in the human brain.

Visit WebResources where you can find more information about the content of this chapter.

Your assessment of knowledge in the classroom or the workplace is at a macro level and probably focuses on the key cognitive factors of concepts, propositions and schemata, as well as procedural or declarative knowledge. You will probably have an interest in whether people are able to recall, recognise, apply, synthesise or evaluate knowledge, data, facts or ideas. In education and training, you are seeking some way of describing the cognitive development of a person at a macro-level. At present, our best methods for doing this are based almost exclusively on questioning or performance. If you would like to read more about these issues then we would refer you to any recent, standard textbook in educational psychology that deals with the topic of cognition. To find out whether your students have learned the required facts, data or information you will probably examine their performance on some type of written (paper and pencil) test. The types of questions that have been found most useful in classroom settings are discussed in the sections that follow.


Table 60. Characteristics of objectively and subjectively scored tests
(Each feature is marked Yes or No for objectively scored questions and for subjectively scored questions.)

FEATURES
– Capable of standardised scoring
– Scoring involves judgment
– Used for formative or summative assessment
– Used for criterion- or norm-referenced results
– Useful for individual or group assessment
– Suited to both timed and untimed tests
– Applicable to aptitude and achievement tests
– Found in both teacher-made as well as commercial tests
– Suited for higher cognitive levels
– Useful for small numbers of students
– Economic for large numbers
– Tends to sample content widely
– Provides freedom of response
– May discriminate against some students
– Easier to score

HOW TO ASSESS KNOWLEDGE USING ESSAYS

The essay question is normally a prose composition or a short treatise. It permits a student to select, frame and exhibit an answer and can be used to measure higher level outcomes where the ability to create, to organise or to express ideas is important. It is a pity if we limit essay questions just to the traditional idea of a written response because we can widen the scope to include oral exams or vivas or some performance-based assessments. Under the heading of an essay we could also include any task or question which requires a constructed response and for which there is no single, objectively agreed upon, correct answer. The essay task is useful in meeting outcomes such as those requiring a student to: defend, generalise, give examples, demonstrate, predict, outline, relate, create, devise, design, explain, rewrite, summarise, tell, write, appraise, compare, conclude, contrast, criticise, interpret, justify or support. Essays have become popular in recent years as a reaction against the exclusive use of multiple-choice examinations. The extended response is seen as a
performance-based assessment that allows students to demonstrate a wider range of skills. Despite the time-consuming aspects of essays, many teachers and instructors prefer constructed-response assessments and persist with them because they believe that this approach offers authentic evidence of knowledge, skills and attitudes. In the WebResources you can find an example of a Year 5 essay topic (Summer Holidays) and four categories of responses from a basic skills task.

How do students respond to such exams? James Traub wrote about the differing reactions of schools to mandated assessments as part of the standards-based reform of US education. He described the New York State English Language Assessment (ELA) that requires students inter alia to listen to a passage and then write an essay. He indicated that some schools found it a meaningless and irrelevant task. He also reported an example of the Longfellow School where only 13 per cent of grade 4 students had passed the exam. The school changed its teaching to reflect the content of the test since there was general agreement that the assessment represented essential skills. Teaching was directed towards literacy with increased daily reading at home, writing tasks across all subject areas and the use of exemplars for essays that were considered superior.

… in 2000, Longfellow's pass rate on the English test made a staggering leap to 82 from 13… the increase in math scores was almost as steep. And the improvement persisted… Fourth graders in Mount Vernon outperformed many of the state's middle-class suburbanites, thus severing the link between socioeconomic status and academic outcome.1

In thinking about the use of the essay in this context, it is important to distinguish between the exam (in this case the ELA), the standards-based educational reforms, the nature of educational administration in the US, the purpose of the assessment and the content of the assessment. For our purposes, we wish to focus on the content of the exam: we want to show that it is possible to help students; in particular, we want to show that it is possible to train students to perform well on constructed-response items; and finally we wanted you to note that reactions to constructed-response (i.e., essay) questions, as well as to multiple-choice questions, are often equally negative.

Visit WebResources where you can find an example of a Year 5 essay topic and four categories of responses from a basic skills task.

Essay tasks allow students to present or organise ideas in their own way. This broadens the concept of an essay and makes it useful from primary education through to vocational and professional levels. Essay tasks are useful as final or progressive assessments. Essays are used widely in the humanities where sustained expression of ideas is important for assessment. Essays, in particular, are useful because they provide the freedom to respond to complex tasks. The typical argument for using an essay is that the teacher is seeking a special type of performance, usually a sample of scholarly performance. The setting of an essay task reflects the decision of a teacher that there are few alternatives for assessing complex skills or understandings. It could also be argued
that the limited essay task is not typical of work settings. Our response is that this is not the fault of the essay task; rather it reflects other restrictions. Certainly, the essay task is inefficient as a means of assessing information recall or recognition.

Essay tasks can be classed as restricted or extended-response essays. In the restricted essay question the student is asked for specific information or the answer is limited in some fashion. For instance, a question such as 'Outline three causes of inflation during the period 1980-1990' would be a restricted essay question. Extended essay questions permit greater freedom of response and could include questions such as, 'Indicate the major causes of inflation during the period 1980-1990. Justify your response'. This second type of question provides a greater freedom of response and permits the student to organise and structure the answer in his or her own way. Some examples of essay questions and instructions on how to write essay questions are listed in the WebResources.

Visit WebResources for more information on essay questions.

Scoring of essay questions

The essay answer is many times longer than the question itself and this leads to the well-known limitation of essay questions relating to scoring. Point scoring methods and analytic and holistic approaches are considered in this section. Point scoring involves the allocation of marks to an essay using a predetermined scale or range. When you are scoring an essay, the use of too fine a scale (e.g., 0-100) only gives the appearance of accuracy. The reliability of a shorter scale (e.g., a 10-point scale) should be acceptable and a recommended scale is up to 15 units. In cases where longer scales are used, research has shown low reliability of marking. There are other approaches to the scoring of essays ranging from analysis of particular aspects to overall holistic judgments. Table 61 gives an example of an analytic approach to grading an essay question. The WebResources also provide an example of a scoring rubric for essays based on knowledge, critical discussion, use of sources, arguments, and structure and expression. While we have focused on the scoring of written responses, there is no reason why such frameworks cannot be transported to other areas such as fine arts or creative expression and not limited only to written work. The method of analytic scoring can also be applied on a point scale, as in this shortened example: (a) 5 marks – discusses at least four points, describes each accurately and gives acceptable reasons for each, or lists at least seven important points; (b) 1 mark – lists at least two points. Holistic scoring of an essay, however, has also been considered to be worthwhile and reliable2. This is based on the '...commonsense belief that assessing the total effect of written discourse can be more important than simply
adding up sub-scores on such facts as clarity, syntax, and organisation; instead, the work is judged as a whole'.3

Table 61. A sample set of criteria for marking essays

FAILURE
Essay has one or more of the following shortcomings not compensated by other strengths:
[ ] fails to address the issue/s or addresses it/them only marginally
[ ] is considerably shorter than the required minimum length
[ ] lacks coherence or structure and has serious deficiencies in the quality of the writing
[ ] shows misunderstanding or little understanding of the basic theoretical issues and/or their implications for professional practice
[ ] relates to the subject on a simple, essentially anecdotal level, demonstrating little reading or little capacity to apply concepts to practice or experience and to draw conclusions from that practice
[ ] substandard through a lack of appropriate content, poverty of argument, poor presentation, inadequate length or a combination of these

PASS
[ ] basically addresses the title and is of appropriate length according to course regulations
[ ] includes a bibliography (where applicable)
[ ] is coherent and structured and of an acceptable standard of literacy
[ ] demonstrates a basic understanding of the issues and a capacity to relate them to practice/experience/context
[ ] shows evidence of basic reading relevant to the topic

CREDIT
In addition:
[ ] demonstrates a sound understanding of the issues and a capacity to relate them or apply them to experience and practice
[ ] shows evidence of wider reading and some independent selection of sources
[ ] shows evidence of a capacity to be critical, evaluative or to make judgments

DISTINCTION
In addition:
[ ] demonstrates a comprehensive understanding of the issues and a capacity to relate them to a wider context
[ ] shows evidence of wide independent reading and/or investigation

HIGH DISTINCTION
Furthermore:
[ ] shows evidence of initiative and some originality or ingenuity in the approach to or execution of the essay

The problem with overall judgments is that they may entail poor decision-making policies in terms of consistency and accuracy. The development of a set of clear reference points may be better because these give you an idea of the characteristics of answers falling at each point on the scale.


The scoring of written essays by computer

A number of proprietary automated writing assessment programs are now available for large-scale assessments.4 These grade essays on the basis of (a) writing quality (average word length, number of semi-colons, word rarity); or (b) relevant content (weighted terms – words, sentences, paragraphs); or (c) syntax, discourse and content. Reviews of these programs found that their grades correlated positively with those of human judges, and about as well as the ratings of human judges correlate amongst themselves. It is possible that the scores on some automated assessments may be manipulated through well-written prose on an irrelevant topic or through appropriate content without any sentence structure. These problems may be overcome through syntax, discourse and content measures but the solutions are complex.
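As a purely illustrative sketch (not the method of any of the proprietary programs referred to above), the following Python fragment computes a few of the 'writing quality' surface features mentioned, such as average word length and the use of semi-colons, with word rarity crudely approximated by the share of words outside a small common-word list.

# Illustrative surface features of the kind used by automated scorers that
# grade on 'writing quality'. Word rarity is approximated very roughly by the
# share of words outside a small list of very common words; a real system
# would use many more features calibrated against human-scored essays.

def surface_features(essay, common_words=None):
    if common_words is None:
        common_words = {"the", "a", "an", "and", "of", "to", "in",
                        "is", "it", "that"}
    words = [w.strip(".,;:!?\"'()") for w in essay.lower().split()]
    words = [w for w in words if w]
    avg_word_length = sum(len(w) for w in words) / len(words)
    semicolons_per_100_words = 100.0 * essay.count(";") / len(words)
    less_common_share = sum(1 for w in words if w not in common_words) / len(words)
    return {"avg_word_length": round(avg_word_length, 2),
            "semicolons_per_100_words": round(semicolons_per_100_words, 2),
            "less_common_word_share": round(less_common_share, 2)}

sample = "Assessment practices vary; nevertheless, consistency of marking matters."
print(surface_features(sample))
# {'avg_word_length': 7.75, 'semicolons_per_100_words': 12.5, 'less_common_word_share': 0.88}

Features like these illustrate why such scores can be gamed by well-written but irrelevant prose: none of them measures content.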

Consistency in the grading of essays

A major concern with essays has always been the difficulty in achieving similarity in grading. This comes about through errors in rating. Typical rating errors are due to:
– Halo effects – a teacher's impression may be formed on the basis of other characteristics (e.g., legibility or standard of writing might influence an overall rating on the quality of content in a written task);
– Stereotyping – a teacher's impression of a learner might be influenced by his or her impressions about other members of the same group (e.g., males, females, ethnic backgrounds, social characteristics, occupational stereotypes, or streaming of classes);
– Tolerance factors – some teachers systematically rate lower or higher through a lack of experience in a field or knowledge of a topic; and
– Regression towards the mean – some teachers fail to use the extremes of a scale in their judgments.5

Figure 41. Typical rating errors

These errors are reflected in different teachers assigning different grades to the same paper (i.e., inter-teacher variability) and a single teacher assigning varying grades to similar papers at different times (i.e., intra-teacher variability).6 Firstly, there is inter-teacher variability. Teachers may differ in how strict they are with their marking, how they distribute grades throughout the range or how they go about marking different papers. The correlation between human judges on essay-type responses is around 0.70 to 0.75.7 How do we know if there is an effect of rating by a teacher? We can look at the difference between the teacher's rating of a task and the average of all other teachers for the same task. Rating errors are not a problem if all teachers rate everyone being assessed, because the leniency or strictness across teachers is shared by all the persons being assessed. The problem arises when multiple assessors rate different groups of essays. In these cases, someone may encounter an assessor who is excessively strict or
lenient. The typical solution is to have two markers and to award the first mark unless there is a substantial difference between assessors. The difference is agreed upon at the outset. The work goes to a third or senior assessor when the difference exceeds the agreed-upon margin or the assessors cannot resolve their differences. This approach provides a workable solution but it is not perfect. Sometimes the margins for examiners are quite large and students may be disadvantaged. There are some statistical solutions to these issues but they are complex. They may take into account the fact that each person's grading is a function of their ability, the effect of the rater or the consistency of the rater, and any error.8

There is also variability within each teacher. Sometimes you may have a different standard for different students; you may vary in the overall grades; or you might use a different range of grades.9 The professional teacher will take steps to find out just how much his or her own ratings may be expected to vary from time to time. A simple way to test your rating reliability is to rate some papers of your students who are identified by number only. Record these grades separately. Then several weeks later repeat the procedure. To evaluate your reliability, count the number of students who are placed in a different grade. An example for one teacher is shown in Table 62. This shows that three fails on the first occasion translated into two passes and one fail on the second occasion; five passes translated into one fail, three passes and one credit. Despite these variations, this teacher proved relatively reliable.10

Table 62. A test of rating reliability

Grade on first occasion        Grade on second occasion
                               F     P     CR    D     HD
Fail                           1     2     -     -     -
Pass                           1     3     1     -     -
Credit                         -     3     7     1     1
Distinction                    -     -     2     4     1
High Distinction               -     -     -     2     1
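The self-check described above can be tallied with a few lines of code. The sketch below is illustrative only: the grades are hypothetical, and exact agreement is simply the proportion of students who receive the same grade on both occasions.

# A tally of the rating-reliability self-check: grades awarded by the same
# teacher to the same (number-identified) papers on two occasions. The grades
# below are hypothetical. Grade order used: F < P < CR < D < HD.

GRADE_ORDER = {"F": 0, "P": 1, "CR": 2, "D": 3, "HD": 4}

def rating_consistency(first, second):
    """first, second: dicts mapping student number to the grade awarded."""
    same = moved_one = moved_more = 0
    for student, grade1 in first.items():
        gap = abs(GRADE_ORDER[grade1] - GRADE_ORDER[second[student]])
        if gap == 0:
            same += 1
        elif gap == 1:
            moved_one += 1
        else:
            moved_more += 1
    return {"same_grade": same,
            "moved_one_grade": moved_one,
            "moved_two_or_more": moved_more,
            "exact_agreement": round(same / len(first), 2)}

first = {1: "P", 2: "CR", 3: "CR", 4: "F", 5: "D", 6: "P", 7: "HD", 8: "CR"}
second = {1: "P", 2: "CR", 3: "P", 4: "P", 5: "D", 6: "F", 7: "D", 8: "CR"}
print(rating_consistency(first, second))
# {'same_grade': 4, 'moved_one_grade': 4, 'moved_two_or_more': 0, 'exact_agreement': 0.5}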

The nature of 'marking errors' (another name for the inconsistency that humans exhibit when scoring essays) has been researched thoroughly. Recent studies have shown that the characteristics of markers (such as leniency and consistency) are not very stable, but actually change across time11. There is, however, a greater source of error in essays than intra-individual rating variability: the error attributable to the sampling of questions. For instance, three essays in 120 minutes are probably less valid and reliable in the overall result than six shorter essays over the same time. The greater the number of different questions included in the exam, the higher will be the overall reliability of the results.12 One factor that adds to the unreliability of essays is the practice of offering students the freedom to select questions to answer from a range of topics. In effect, this means that students can be answering completely different examinations and
that comparability of performance is difficult to establish. It penalises students who are most capable and those who have covered the entire subject. In Table 63, we have summarised some suggestions for the overall scoring of essay tasks.

Table 63. Suggestions for scoring answers to essays
1. Score by the point method using a model answer
2. Grade long answers by the rating method
3. Evaluate all answers to one question at a time
4. Evaluate all answers without knowing the writer's identity
5. Use a sufficiently fine scale for recording the ratings
6. Develop clear reference points

HOW TO ASSESS KNOWLEDGE USING SHORT ANSWER QUESTIONS

The short answer question is an all-purpose form of question that has wide application in education because if you want to find out whether someone knows something, the simplest way is to just ask. This section focuses on the short answer, identification and completion questions, which are all examples of questions in which someone provides rather than selects the answer. Short answer questions require a constructed response. They involve a special cognitive process based on recollection and complex processing to provide a correct answer. Three types of short answer questions are dealt with in this chapter: the short answer question, the completion question and the identification question. They constitute the most fundamental form of knowledge assessment because they are useful for eliciting necessary facts, information, and declarative or procedural knowledge. They find application in a wide variety of contexts and form a substantial component of many educational and vocational assessments.

Visit WebResources for more information on short answer questions.

Short answer questions

An assessment that uses short answer questions restricts the answer to a paragraph, a sentence or even a phrase. Unlike the essay, the short answer question has a clearly identified, correct answer(s) and is scored objectively. Short answer questions are used widely in education and training to determine if someone has the required knowledge or understanding. One reason why they are popular is that they are easy questions to write and grade. Furthermore, such questions are less influenced by guessing than true-false or multiple-choice questions. Some examples of short answer questions are indicated in the WebResources and an additional example is illustrated in Figure 42. In the short answer question, the student must produce answers which can be clearly identified as correct. This is why it is considered to be an objective question. Before considering further details about the short answer, it is worthwhile stressing again the point that this type of question has an important influence on the type
of learning that takes place. Some early research into testing13 established that students obtain their ideas about what they are expected to know from the kinds of assessment they face.

YEAR 5 MATHEMATICS
Which number has been left out of this addition? [a column addition with one missing digit is shown] Write your answer here: [ ]

Figure 42. Examples of short answer questions

Short answer questions are appropriate for some learning outcomes more than others. For instance, short answer questions are useful for recall of facts, analysis of data, and solving mathematical, economic, scientific or engineering problems. In Table 64 we have provided a tentative classification of outcomes against question types and in Table 65 we have set out some guidelines for writing short answer questions.

Table 64. Types of questions appropriate for learning outcomes
Short answer: define, describe, identify, label, list, match, name, outline, reproduce, select, state, convert, estimate, explain, paraphrase, compute, solve, illustrate, compile.
Completion: identify, label, diagram, illustrate.
True-false: distinguishes, differentiates, discriminates, identifies, categorises.
Matching: match, diagram, categorise.
Multiple choice: convert, explain, infer, predict, summarise, change, compute, predicts, solve, diagram.

Table 65. Guidelines for writing short answer questions
1. Ensure that there is a single definite correct answer
2. Ask for specific information
3. Word the question precisely and concisely
4. Indicate the criteria for answering, if several options are available
5. For calculations specify the degree of accuracy and the units of measurement indicated
6. Avoid clues to the answer
7. Indicate the marks for the item/question


Completion questions

Completion questions are another variation of the short answer question. The question consists of a true statement from which one or two important words have been omitted. In the completion question there is very little scope for guessing the correct answer, but it is not easy to write questions that have only one correct answer. This type of question is useful for assessing recall, and typical completion questions are provided in Figure 44.

SPREADSHEETS
______________ charts are most useful when you want to compare individual parts to the whole. (1 mark) Source: Steven Yuen

HAIRDRESSING
Pigment in the ___________________ gives hair its natural colour. (1 mark)

ACCOUNTING
The Group Employer's Reconciliation Statement should be completed and sent to the Australian Taxation Office by the ______________. (1 mark)

Figure 44. Typical completion questions.

A precaution in writing completion questions is to avoid the use of too many blank spaces, leading to what is called a 'Swiss cheese' effect. Some guidelines for the writing of completion questions are summarised in Table 66.

Table 66. Guidelines for writing completion questions
1. Use short straightforward sentences
2. Omit only a relevant key word
3. Check that students can infer the meaning even with the deleted word
4. If more than one word makes sense then give credit for all potential answers
5. Make all blanks the same length
6. Do not use 'a' or 'an' to provide a clue
7. Number each blank space, if required
8. Use around 5-10 completion questions for each learning outcome


Preparing a short answer assessment

An item bank is a file of questions that makes it possible to develop an assessment with minimal time and effort. All available and useable test questions can be sorted by topic. Sets of instructions or directions are added and the file can be updated easily on a word processor. When the time comes to prepare a test, the first step is to make up a table of specifications. This can be a one-way or two-way table. Short answer questions are then selected which match the topics and/or learning outcomes. Some general guidelines for preparing your test are:
– include easier as well as difficult questions;
– prepare clear and concise directions;
– arrange questions in order of difficulty;
– use separate answer sheets to retain the questions for future use and to ensure that scoring is easier, simpler and more accurate; and
– analyse your test questions in terms of item difficulty.

Ideally, an item bank should include information regarding the 'identity' of each item (e.g., who wrote it, when it was included in the item bank), its psychometric statistics (e.g., its difficulty and discrimination indices), and information related to the reactions of the students (e.g., comments of the students who attempted the item).
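As a sketch of how such an item bank might be organised (the field names and example items are assumptions for illustration, not a prescribed format), each question can carry its identity, topic, learning outcome and item statistics, and questions can then be drawn to match the numbers allocated in the table of specifications.

# A minimal item-bank sketch: each question records its identity, content
# classification and item statistics, and questions are drawn to match the
# number allocated to each topic. Field names and items are illustrative only.

item_bank = [
    {"id": "Q001", "topic": "Fractions", "outcome": "compute",
     "difficulty": 0.62, "discrimination": 0.41, "author": "AB", "year": 2007},
    {"id": "Q002", "topic": "Fractions", "outcome": "explain",
     "difficulty": 0.48, "discrimination": 0.35, "author": "AB", "year": 2008},
    {"id": "Q003", "topic": "Decimals", "outcome": "compute",
     "difficulty": 0.70, "discrimination": 0.30, "author": "CD", "year": 2008},
]

def draw_items(bank, allocation):
    """allocation: dict mapping topic to the number of questions required."""
    test = []
    for topic, needed in allocation.items():
        candidates = [item for item in bank if item["topic"] == topic]
        # Here the better-discriminating items are preferred; a teacher might
        # instead balance difficulty, or rotate items from year to year.
        candidates.sort(key=lambda item: item["discrimination"], reverse=True)
        test.extend(candidates[:needed])
    return test

selected = draw_items(item_bank, {"Fractions": 1, "Decimals": 1})
print([item["id"] for item in selected])   # ['Q001', 'Q003']

Even a simple spreadsheet or word-processor file with these fields serves the same purpose; the point is that item history and statistics travel with the question.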

SUMMARY

The essay task or question is a method of assessment that can be used for either norm- or criterion-referenced purposes. It is generally the case that markers tend to mark in a normative fashion. The consistency of marking is not good, but it can be increased. While essay questions are easy to write, they are difficult to score, and this constitutes a disadvantage. It does not mean that essays produce results which are necessarily less reliable than other types of assessments. The choice of an essay should be related to the traditions of assessment in your field, what is acceptable to students and what you can achieve with the time and resources available to you. The advantage of an essay is that it offers students greater freedom to show their ability to express and evaluate ideas. The greatest disadvantage of essay questions is their restricted coverage of topics and learning outcomes.

We also looked at short answer questions. These are another form of question that requires a constructed response but one that is scored objectively. They are similar to the essay but have an agreed-upon, correct answer. Short answer questions are recommended for one-time and one-class assessments. The short answer is probably the best all-purpose form of questioning for classroom use. As with many other types of questions, the responses to the short answer, the identification and the completion question make some initial assumptions about people's answering of questions14, such as:
– students understand the question as intended by the teacher;
– students had access to the information required for an answer;
– students might reasonably be expected to be able to recall or retrieve such information to provide answers;
– students can change the information and knowledge they have into the required form for an answer; and
– students are prepared to give accurate answers.

Ideally individual questions should be pre-tested to identify those that are difficult for students to understand. If this is not possible, then teachers may wish to prepare a file of test questions over time and to keep records of the difficulty of particular questions and exclude those which, in their professional opinion, had problems. The next chapter considers another group of questions that are useful for the assessment of knowledge. These are questions in which the response is selected rather than constructed. Of course, these require different cognitive processes but they are also useful for the assessment of learning.

-oOo-

REVIEW QUESTIONS

T F  Schemata are knowledge structures
T F  Declarative knowledge is knowing that something is the case
T F  Objective questions refer to specific learning outcomes
T F  Test questions can be categorised as supply versus selection
T F  Supply questions require a student to recognise an answer
T F  Supplying an answer requires similar cognitive processes as selecting an answer
T F  The essay question involves a lengthy prose composition or treatise
T F  Under the heading of an essay we have included any task for which there is no objectively agreed upon single correct response
T F  Essays can be classed as extended response or general response
T F  Cognitive outcomes can be assessed only by essays
T F  An essay question is easier to write than a multiple-choice question
T F  Point scoring involves the allocation of marks to an essay using a predetermined scale or range
T F  Holistic scoring assesses the total effect of a written discourse
T F  Inconsistency in scoring is a problem with essay questions
T F  Providing options for an essay exam is desirable
T F  Constructed-response questions include true-false questions
T F  Short answer questions are an all-purpose form of question
T F  Short answer questions require a word for an answer
T F  Short answer questions are influenced more by guessing than essay questions
T F  The short answer includes sentence completion questions
T F  The short answer question is subjectively scored because it is marked by a teacher or instructor
T F  Short answer questions are useful for analysis and evaluation of ideas
T F  Short answer questions can be norm-referenced
T F  Essay questions provide better coverage of a subject than completion questions
T F  Completion questions are a variation of the short answer question
T F  In the completion question there is little scope for guessing
T F  Spaces for completion questions should be the same size as the missing word
T F  A Swiss cheese completion question has too many blanks
T F  An item bank is a commercially available short answer test
T F  Short answer questions assume that students repeat the information presented by the teacher

EXERCISES

Essay questions

1. When should teachers use essay-type questions?
2. Write an essay question for an objective from your syllabus or curriculum.
3. In an area in which you are teaching or plan to teach, find three essay questions and examine them in terms of the criteria below.
   [ ] Related questions directly to the learning outcomes
   [ ] Set questions on the more important theories, facts and procedures
   [ ] Assessed the ability to use and apply facts rather than memorise
   [ ] Set questions with clear tasks
   [ ] Provided ample time limits or suggested times
   [ ] Asked questions where better answers can be determined
   [ ] Specified the examiner's task
   [ ] Gave preference to specific questions that can be answered briefly
   [ ] Avoided options
   [ ] Wrote an ideal answer to the test question
4. What factors should be considered in deciding whether essay questions should be used in education or training?
5. Comment critically on the following excerpt that deals with the value of essays. The excerpt comes from an assessment discussion list.

   Washington state requires a direct writing assessment as a part of their Washington Assessment of Student Learning test (WASL). Each student writes to two prompts. For several years I have run some analysis of the correlation between a student's 1st and 2nd writing samples and found a very weak relationship. Also, our district conducts a yearly writing assessment that is scored in the same manner as the WASL. Again, poor predictability between a student's various writing samples. My conclusion is there is considerable source of variance contained in the topic (prompt) and mode (type of writing)…

   Source: Pat Cummings, Director of Assessment, Federal Way School District, Federal Way, WA. Online discussion on changes in the SAT, [email protected] Accessed June 2002

Short answer questions

1. When should teachers use short answer questions?
2. Write a short answer and a completion question for an objective from your syllabus or curriculum.


3. In an area in which you are teaching or plan to teach, find three short answer questions and examine them in terms of the criteria below.
   [ ] Ensured that there was a single definite correct answer
   [ ] Asked for specific information
   [ ] Worded the question precisely and concisely
   [ ] Stated the basis for the answer, especially where the student is asked to 'discuss'
   [ ] For calculations, specified the degree of accuracy and the units of measurement indicated
   [ ] Avoided clues to the answer
   [ ] Indicated the marks for the item/question
4. When should teachers use identification questions?
5. Write an identification question for an objective from your syllabus or curriculum.
6. In an area in which you are teaching or plan to teach, find one identification question and examine it in terms of the criteria below.
   [ ] Ensured that there was a single definite correct answer
   [ ] Asked for specific information
   [ ] Provided unambiguous pictorial clues
   [ ] Worded the question precisely and concisely
   [ ] Indicated the marks for the item/question


CHAPTER 12

ASSESSMENT OF KNOWLEDGE: SELECTED RESPONSE QUESTIONS

TRUE-FALSE, ALTERNATE CHOICE AND MATCHING QUESTIONS

You might be surprised at the detailed way in which we are describing the different methods by which knowledge may be assessed. One purpose is to let you know that you have many choices at your disposal when it comes time to assess the knowledge of your learners. You can choose objectively or subjectively scored responses. You can focus on constructed answers or answers that require recognition. This chapter outlines some questions that are based on recognition and the objective scoring of answers. Three types of objectively-scored questions are considered in this chapter: matching questions, true-false questions and alternate choice questions. Matching and alternate choice questions are variants of multiple-choice questions but different enough in character to be considered in a separate section.

MATCHING QUESTIONS

The matching question is a type of multiple-choice question that is helpful for assessing knowledge of related facts, events, ideas, terms, definitions, rules or symbols. It consists of two lists or columns of related information from which the student is required to match appropriate items. (Visit WebResources, where you can find many examples of selected response questions.) The major advantages of the matching question are that you can cover a great deal of content in one question and you can also obtain a large number of answers using the one question format. It provides a concise question format in which the options do not have to be repeated, so it is easier to develop than the multiple-choice question. Other advantages are that it is a useful variation in questioning and useful for classroom testing programs. It is more efficient than the multiple-choice question because more matching questions can be answered in the time it takes to answer one multiple-choice question. It is objectively scored and easy to grade but it is more difficult to prepare than the short answer or completion question. Some other disadvantages are that it is not always easy to adapt this question to all subject areas. You also need a large number of items in the matching question to ensure coverage of a topic. Even though they look simple to write, matching



questions are difficult to prepare because it is not always easy to find items which provide plausible options for more than one answer. An example of a matching question is provided in Figure 45. In this example the directions offer the student a guide for matching one column with another. It can be helpful to provide students with directions for answering questions, even if you think that they might all be familiar with the procedure. In this case, directions about guessing, how many times answers could be used and where to place answers were provided. In other cases, you might want to provide a worked example if people are unfamiliar with a particular format. Table 67 indicates some guidelines for writing matching questions.

Directions. In the square to the left of each neurological process or system in Column A, write the number of the answer which best defines it. Each answer in Column B is used only once. If you are not sure, just guess.

COLUMN A
[ ] A. Area of brain linked with reading
[ ] B. Brain structure which regulates hunger
[ ] C. Central nervous system
[ ] D. Events which affect circadian rhythms
[ ] E. Problems understanding what is said
[ ] F. Surgical removal of part of the brain
[ ] G. Severing connections in the brain
[ ] H. Steady physiological state
[ ] I. Secretes melatonin which produces drowsiness

COLUMN B
1. ablation
2. angular gyrus
3. homeostasis
4. hypothalamus
5. leucotomy
6. neuroplasms
7. pineal gland
8. spinal cord
9. synaptic clefts
10. Wernicke's area
11. Wernicke's syndrome
12. zeitgebers

Figure 45. An example of a matching question format.

Table 67. Guidelines for writing matching questions
1. Questions must relate to a common topic area
2. Use around 7 – 10 items for matching
3. Include more alternatives than questions (see explanation below)
4. Put numbers, dates, items in order (e.g., alphabetical)
5. Check that there is only one correct answer
6. Put all questions on the one page

The reason for having more alternatives or options than questions is that the number of options reduces quickly after each choice. If there are seven questions with seven alternatives then the probability of being correct by guessing is 1/7, then 1/6, then 1/5, then 1/4, then 1/3, then 1/2, and the last choice is pre-determined. If there were ten options or alternatives then the probability of a correct choice would be 1/10, 1/9, 1/8, 1/7, 1/6, 1/5 and the final choice would be 1/4.
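To see how much the extra distractors reduce the scope for blind guessing, here is a minimal Python sketch that simply multiplies out the successive probabilities described above. It is an illustration only; the function name and the two printed scenarios are ours, not part of the text.

from math import prod

def p_all_correct_by_guessing(n_questions: int, n_options: int) -> float:
    """Chance of matching every item correctly by blind guessing,
    assuming each option may be used only once (as in Figure 45)."""
    # Successive correct guesses have probability 1/n_options, 1/(n_options - 1), ...
    return prod(1 / (n_options - i) for i in range(n_questions))

# Seven questions with seven options: 1/7 x 1/6 x ... x 1/2 x 1 (the last choice is forced)
print(p_all_correct_by_guessing(7, 7))   # about 0.0002, i.e. 1 chance in 5040
# Seven questions with ten options: 1/10 x 1/9 x ... x 1/4
print(p_all_correct_by_guessing(7, 10))  # about 1 chance in 604800

The extra, unused alternatives make a complete lucky match far less likely, which is the point of guideline 3 above.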



As an alternative to other question types, consideration should be given to greater use and trial of the matching question where understanding of facts needs to be assessed in an efficient manner. It provides teachers with an item format that can be used to vary the content mix of any assessment. Matching questions are also useful for short formative tests, which provide students with a different item format that is quick to answer and less threatening than short answer questions.

True-false question

The true-false question comes in a variety of formats and is basically a statement which has to be identified as being correct or incorrect. True-false questions are of greater use in education and training than is currently realised because these questions can be directed to the essential structure of a subject's knowledge. They are helpful where there is a need to assess knowledge of the basic facts or ideas in a subject area. Many subjects incorporate an integrated hierarchical network of true and false propositions and many important decisions reduce to two options. Even multiple-choice questions can be reduced to true-false questions where one multiple-choice option is fundamentally correct and the remainder are false.

Although the true-false question can be used in education, some people consider that it tests minor or trivial details and is prone to the effects of guessing. Some see it as a very easy type of format for an exam question. They overlook a number of basic facts about the assessment of knowledge. Firstly, any question format can test trivial information. The question format (e.g., true-false) cannot be blamed for the content of the question. True-false questions are not limited to simple recall (see Figure 46 for a sample set of true-false questions from occupational areas). The main requirements for any question should be that the knowledge, skills or attitudes are worth assessing and that comprehension as well as memory is being used in forming a response. It means that this type of question-writing is a real skill. Moreover, wrong answers should be written so that they represent misconceptions, or so that a person without knowledge of the subject would find either alternative acceptable. The true-false question is not less valid because of the 50% chance of guessing correctly. Questions should be written so that someone who does have the knowledge would immediately recognise one alternative as obviously correct. The true-false question can cover a large amount of subject information in a very short time, and the more questions there are in a test, the less effect there will be for factors such as guessing. It has been recommended that a true-false test should contain at least 50 questions.1 Any susceptibility to guessing can be corrected easily by adjusting the pass mark on a test.

The true-false question comes into its own when a wide range of content must be covered quickly. It enables you to assess a broad sample of the subject topics and learning outcomes and is certainly efficient in terms of the number of questions that can be answered in a given amount of time. For instance, around two true-false


questions can be answered in the time it takes to answer one multiple-choice question. It has the benefit of being relatively easy to prepare and very easy to score. With this type of question there is the problem of being heavily dependent upon reading, such as deciphering the meaning of the negative version of a statement.

The functional group which contributes to the instability of aspirin is ester.  A True  B False
Wheat flour should be excluded on a gluten-free diet.  A True  B False
The term for false interpretation of sensory stimulus in psychiatry is clouding of consciousness.  A True  B False
A lesion of the ulnar nerve at the wrist will produce a monkey hand (main de singe).  A True  B False
Source: Adapted from National Office of Overseas Skills Recognition

A section head wants to get workers to join a workplace committee. The best way is to say that they are expected to join.  A True  B False
Most workers in an office find the job very frustrating. The best thing a supervisor can do is to give them some advice about their problem.  A True  B False
Source: Athanasou, 1996b

An aircraft observed flying a right hand triangular pattern in two minute legs indicates a radio failure.  A True  B False
The existence of thunderstorms is the subject of a sigmet.  A True  B False
Pilots must comply with speed restrictions within 20 seconds of acknowledging the instruction.  A True  B False
Source: Diana Dickens

Figure 46. Examples of true-false questions from occupational areas.

Table 68. Guidelines for writing true-false questions
1. Ensure the statement is entirely true or false
2. Include only one idea in each question
3. Place the true and false answers in a random order
4. Do not be overly concerned about the proportion of true and false answers
5. Use false answers that reflect common misconceptions and true answers for correct ideas
6. Use straightforward language and avoid double negatives
7. Avoid trick questions
8. Avoid specific determiners such as 'usually', 'none', 'always'
9. Do not make true statements consistently longer than false statements



A disadvantage of true-false questions is that they are not suited to all subjects, especially those with few generalisations. It is sometimes difficult to write statements that are absolutely true or false. You will also need a large number of items to produce acceptable consistency in results. Typical guidelines for the construction of true-false items are also summarised for you in Table 68.

ALTERNATE CHOICE

The final type of objectively scored question that we shall examine is the alternate choice question. The alternate choice question is a multiple-choice question with two options.2 There are only two choices in an alternate choice question but this is not the same as a true-false question. There are advantages for this type of question that make it a preferred form for classroom tests:
– These are easier questions to write than multiple-choice questions since only two plausible alternatives are required;
– More questions can be asked within a testing period than with multiple-choice questions;
– It enables you to assess a wide sample of the subject topics and learning outcomes;
– The larger number of test questions means greater coverage of learning outcomes;
– It suits students who would have eliminated the most unlikely options anyway in a multiple-choice question;
– It is helpful for assessing factual knowledge; and
– Alternate choice questions are easy to score.

A shop owner calculated sales for October and for November this year and for October and November last year. What more calculations are needed to find out how much sales increased in this year?
A take the total of the two months last year from this year
B take the total of the two months this year from last year

In 1994-95, Telstra ran more than 9.2 million fixed phone services and 2.1 million mobiles, which were used to make around 35 million calls a day. About how many calls are made per day on telephones throughout Australia?
A 3.09 calls per day
B 3.50 calls per day

There have been some problems with the quality and quantity of work in a small accounts section. The accountant is busy but at least he/she should:
A Write to all staff and tell them that there is a problem
B Organise a better work program for staff

Source: Athanasou, 1996b

Figure 47. Examples of alternate-choice questions.



CORRECTIONS FOR GUESSING

It is true that in true-false, alternate choice and multiple-choice questions some students might 'blind guess' their way to a pass. That is why some teachers deduct marks for wrong answers. Alternatively, a correction for guessing formula is sometimes used. The usual formula for this purpose is:

Corrected score = number right - (number wrong / (number of alternatives - 1))     (19)

For a true-false item, the correction is Right - Wrong; for three alternatives, the correction is Right - (Wrong/2); for four alternatives, the correction is Right - (Wrong/3), and so on. The formula assumes that correct and wrong answers were obtained randomly, but it over-corrects in that it assumes the test constructor did not succeed in writing questions with plausible distracters. It under-corrects if examinees were able to eliminate some distracters; in that case the denominator of the fraction should be smaller. A correction for guessing can be applied in speeded tests and in those unspeeded tests where it is thought that some students will not have enough time to finish.3 The most appropriate application of a correction for guessing is when the number of items omitted varies appreciably from one student to another or from one question to another. Otherwise the correction for guessing is not recommended, as most test scores that have been corrected for guessing will rank students in approximately the same relative position as uncorrected scores.

Your test directions should be clear as to whether there are any penalties for guessing. For example, if you will not apply any scoring correction formula, you may decide to inform the students that 'Your score will equal the number of items you answer correctly. The marker will not subtract points for wrong answers; therefore, you should answer every question, even if your answer must be based on a guess'. If you do not clearly inform all the students about your intentions, it is likely that you will disadvantage those with the least test-taking experience. Finally, corrections for guessing may not be needed where questions are written skilfully, so that the wrong answers really distract those students who have less knowledge.

If you wish to make allowance for guessing then it is more appropriate to adjust the pass mark for a multiple-choice test. The pass marks recommended for a nominal 50% standard on true-false and multiple-choice questions are shown in Table 69.

Table 69. Adjusting the pass mark to allow for guessing

Question type     Score expected by chance     50% difficulty level after adjusting for chance
Two-choice        50%                          75%
Three-choice      33%                          67%
Four-choice       25%                          63%
Five-choice       20%                          60%
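As a rough check on these figures, here is a minimal Python sketch. The function names are ours; the pass-mark adjustment shown is one standard way of allowing for chance (chance score plus half the remaining range) and it reproduces the Table 69 values to within rounding. The last few lines estimate how often a blind guesser would reach the adjusted pass mark on a 50-item true-false test.

from math import comb

def corrected_score(n_right: int, n_wrong: int, n_alternatives: int) -> float:
    """Equation (19): corrected score = right - wrong / (alternatives - 1)."""
    return n_right - n_wrong / (n_alternatives - 1)

def adjusted_pass_mark(target: float, n_alternatives: int) -> float:
    """Raise a raw pass mark to allow for chance-level guessing:
    chance + target * (1 - chance)."""
    chance = 1 / n_alternatives
    return chance + target * (1 - chance)

for k in (2, 3, 4, 5):
    print(k, round(100 * adjusted_pass_mark(0.5, k), 1))
# prints 75.0, 66.7, 62.5 and 60.0, which Table 69 reports (rounded) as 75%, 67%, 63% and 60%

# Chance that a blind guesser reaches the adjusted pass mark on a 50-item true-false test
n, p = 50, 0.5
cutoff = round(n * adjusted_pass_mark(0.5, 2))          # 38 items out of 50
p_guess_pass = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(cutoff, n + 1))
print(p_guess_pass)                                      # well under 0.1%

This illustrates the earlier point that on a longer test (50 or more items) random guessing is very unlikely to carry a student past an adjusted pass mark.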


MULTIPLE-CHOICE QUESTIONS

Direct questions or incomplete statements with a number of alternative answers make up the familiar multiple-choice question. These widely applicable forms of questioning were developed around World War I as an economical means of administering tests to large groups. They aim to assess a person's knowledge as well as his/her ability to discriminate among several possible alternatives. Multiple-choice questions can test judgment as well as memory and have become very popular in formal assessment because they are versatile questions which are adaptable to most subject areas. Many standardised and commercial tests use only multiple-choice questions. Most large-scale public examinations, such as the Scholastic Aptitude Test, the international mathematics and science assessments used in primary and high schools, many well-known educational and psychological tests, university and technical college examinations, and licensing examinations, now use some component of multiple-choice questions. It is important to have some knowledge of their use and construction because there are many misconceptions about this form of questioning. In Europe, however, the multiple-choice culture is much more restricted than in the USA and Australia.

In the opinion of many test specialists, the multiple-choice format is the preferred question format for large-scale examinations unless there are compelling reasons for other approaches. Please note the last part of this sentence ('unless there are compelling reasons for other approaches') and also note that we are referring to large-scale examinations. Some people have memories of simplistic multiple-choice questions but this is a powerful format that can also assess performance on complex cognitive tasks. Along with the alternate-choice question, it is more efficient in terms of the examinee and scorer time per unit of information involved. For a fixed time limit, the multiple-choice question provides a better sample of the topic being examined than essay formats, and the objectivity in scoring reduces any inter-teacher and intra-teacher variability. The latter (inter-teacher and intra-teacher subjectivity) is a far greater source of unfairness than the multiple-choice format.

The use of the multiple-choice format is contested especially in the USA, where it has been used extensively. It is fair to say that the multiple-choice question has been overused for summative assessment in the US but this is not the case world-wide. Much of the criticism is also linked with educational policy and the purposes of assessment. We would agree that much assessment is not required and that multiple-choice questions are not helpful in some contexts. They have to be used judiciously. All we are saying is that you have available to you a wide technology or range of question formats with which to assess knowledge. Our recommendation is that you consider all these methods.



Characteristics of multiple-choice questions

A multiple-choice question consists of a direct question or an incomplete statement, in which the main part of the question is called the stem and the suggested solutions are called alternatives (i.e., choices or options). Typically, the student is requested to read the stem and consider the alternatives in order to select the best option. The incorrect alternatives are called distracters. The purpose of the incorrect alternatives, or distracters, is to provide plausible answers which 'distract' those candidates who may not know the correct answer.

The direct question form (see Figure 48) is easier to write than the incomplete statement. It is probably better for younger and less able candidates because it is more likely to present a clearer problem. The incomplete statement is a concise form of questioning. It may help to start with the direct question and only move to an incomplete statement if the problem can remain clearly stated. In both cases, the wrong answers should be designed to reflect the most common misconceptions or errors.

EXAMPLE

Direct question form
Which play was written by Shakespeare?
A* Coriolanus
B Shadowlands
C Death on the Nile
D Sacred Heart

Incomplete statement form
The English navigator who charted the east coast of Australia in 1770 was
A* James Cook
B Horatio Nelson
C Arthur Phillip
D Matthew Flinders

Best-answer type
Which word means most nearly the same as safe?
A* secure
B free
C locked
D peaceful

Figure 48. Direct question and incomplete statement form.

The previous examples illustrated a type of question with only one correct answer, but it is also possible to develop a question which asks for the best answer or the answer that is most correct. (Visit WebResources for more information on multiple-choice questions.) The best-answer question is useful for those areas of knowledge which go beyond simpler direct questions and where a variety of answers is possible or


acceptable. These might include questions of the how and why variety and learning outcomes which involve 'higher' order cognition. Some of the ways in which multiple-choice questions can be used are shown in the WebResources.

Context-dependent or interpretive multiple-choice questions

The last two questions in the WebResources provide examples of what are called context-dependent multiple-choice questions. In some textbooks they are referred to as 'interpretive exercises'. These questions provide text, graphs, cartoons, charts or diagrams which must be analysed and interpreted. They provide a higher level of questioning and offer the opportunity to use the multiple-choice question as a basis for assessing the ability to make inferences, generalisations or apply knowledge. More complex learning outcomes and aims can be assessed with this multiple-choice format than with single independent questions, but it places greater reliance on reading skills and also involves more time for the construction of useful questions.

Complex multiple-choice questions

Complex multiple-choice questions set out combinations of right and wrong answers. An example of a complex multiple-choice question is shown in Figure 49. This question requires more effort to write and is probably better replaced by a series of multiple true-false questions.

Complex multiple-choice: What characterises sarcoidosis?
1. It is characterised by widespread lesions
2. It is like tuberculosis
3. There may be a disturbance of calcium metabolism
A 1 and 2
B 2 and 3
C 1 and 3
D* 1, 2 and 3

Alternative: multiple true-false questions:
1. Sarcoidosis is characterised by widespread lesions  True False
2. Sarcoidosis is like tuberculosis  True False
3. In sarcoidosis there may be a disturbance of calcium metabolism  True False

Figure 49. Complex multiple-choice questions.

Advantages and limitations of multiple-choice questions

Multiple-choice questions are certainly useful for a wide variety of subjects but they have also been subject to wide-ranging criticism. Much of this criticism is correct where multiple-choice tests have been applied to inappropriate subject


areas or there have been technical problems associated with the questions used. Nevertheless, multiple-choice questions offer significant economies. A practical advantage of the multiple-choice format for teachers is that the scoring is uniform and standardised in the sense that there is a designated correct answer. It reduces the anxiety of subjective scoring for a teacher and reduces the potential for any bias in scoring that may discriminate against a particular student. The latter is often evident when class marks and external marks are compared for some individuals. Do not underestimate the inherent unfairness of some subjective scoring approaches and the potential for bias in grading. This is one of the best justifications for standardised, objectively-scored assessments. Other advantages of the multiple-choice question include:
– it is more difficult to remember exam content than with other questions;
– items can be re-used with less concern for security of the exam;
– multiple choice is not as physically exhausting as a written exam;
– item responses are easily analysed;
– the potential for guessing in high scores is minimal;
– around one multiple-choice item can be answered per minute;
– the results are more reliable than with a comparable (i.e., length of time) essay;
– the sampling of subject content is greater than with essays; and
– higher level thinking can also be measured by the multiple-choice item.

A limitation of the multiple-choice format is that it is quite difficult to write effective questions. This is not always recognised. In some sense, item writing is an art and a skill that needs to be developed. Secondly, multiple-choice formats are restricted to abstracted or verbally presented content rather than real situations in context. Thirdly, they rely largely upon the recognition of the right answer and this is a very specific cognitive process. It is not always clear what specific difficulty a student experienced with a question. This means that the production of alternative answers or distracters is again important for the quality of the question. Multiple-choice tests also need to be quite long to achieve the validity and reliability of results that are desired. An obvious weakness of the multiple-choice format is that a single correct answer can sometimes be obtained without any prior knowledge of the subject or instruction in the subject. It does not mean, however, that those who passed were able to do so quite easily, especially when you have very long tests.

Overall it is more likely that you will develop high-quality large-scale examinations using multiple-choice questions than in the case of most other types of items (other things being equal). Unlike the short answer question it forces the student to distinguish not only what is correct but also what is incorrect. There are also more alternatives than the true-false question and this leads to more challenging tasks. Compared to the matching question, the multiple-choice question does not rely on the need for a consistent list of items. The multiple-choice question lends itself to analysis of faults in thinking and responding by analysing choices. This gives clues to misunderstandings.

Nevertheless, we do not recommend multiple-choice questions for most classroom tests. Firstly, there are other methods of assessment which may relate equally well to knowledge and which require less preparation from a teacher. Secondly, the real economies of scale for multiple-choice questions can only be achieved with large


numbers of students. Thirdly, there are obvious limitations of multiple-choice questions with some types of learning outcomes that require performance, expression or presentation. Finally, the average preparation and production of one multiple-choice question for commercial use in testing is around six hours. We have set out this preparation time in Table 70.

Table 70. Estimated preparation time for a high-stakes 100-item multiple-choice test

Writing the initial 100 questions            165 hours
Reviewing the 100 questions                   50 hours
Item analysis and production of the test      55 hours
Clerical input                               340 hours
TOTAL                                        610 hours

Guidelines for writing multiple-choice questions

One point that we would like to emphasise is that question writing is a real skill and it takes time to acquire. Many teachers think that preparing an assessment is straightforward, but the number of major errors in formal high-stakes exams testifies to the complexity of the task. Quite frankly, we are amazed at how errors can creep into even the most carefully-edited assessments. As an example, one book with practice test questions has been through several editions; it was edited professionally and checked by editors, as well as benefiting from invited reader input. Even after all these checks, a very bright 11-year-old contacted the publisher and enquired about a question, saying that all four alternatives were correct! Despite all these checks it is easy for errors to creep into any assessment.

Common criticisms of multiple-choice questions often reflect poor test construction and inappropriate use rather than a defect in this form of questioning.4 There are over 40 item-writing rules which have been identified and which can be applied to multiple-choice questions.5 Some of the most important guidelines which apply to multiple-choice questions are summarised below. Many of these guidelines6 also reflect common aspects of writing other test questions.
– The question should be meaningful and represent a specific problem in the stem of the question. Where the stem does not make sense until all the alternatives have been read, they become in effect a collection of true-false statements in multiple-choice form. The better multiple-choice question presents a definite problem that makes sense without the options. Use examples from your own experience and background to make questions more meaningful.
– The stem of the multiple-choice question should be free from irrelevant material. An exception to this rule might be when the question aims to determine a candidate's ability to identify relevant details.
– Most questions should be stated in positive rather than negative terms. The use of negatives can confuse students and produce sentences which are difficult to interpret. An exception would be where the learning outcomes indicate that negative circumstances may have serious consequences in a subject area.



– Omit responses that are obviously wrong. The alternatives should be plausible. A useful way to undertake this is to administer the questions as a short answer test in the first place and use the most common errors as alternatives. The aim is to ensure that each alternative is selected by some students, otherwise it is not contributing to the functioning of the question.7 The alternatives may be plausible for one specific group but not for others (e.g., students at a higher level of learning). For instance, some alternatives may be clearly eliminated by someone with even a little topic knowledge.
– The answer should be agreed upon by experts in the field and there should be only one correct answer.
– Select questions and situations from the candidates' learning and everyday experiences. Some students may be confused when questions are framed using unusual examples or novel circumstances. These may be persons who otherwise would have been familiar with the knowledge being assessed.
– Provide a space for the answers to be written and easily scored.
– Avoid trivial items.
– The length of the alternatives should be controlled. Ensure the correct alternatives are not consistently longer as this may provide a clue for students.
– Place the correct alternatives across each of the positions. There should be an approximately equal distribution of answers at a, b, c, d etc.
– Use diagrams, drawings and pictures to make the questions more practical and meaningful.
– Avoid the use of 'a' or 'an' as the final word, or words that qualify the response (e.g., is, are, this, these). Clues can be provided to correct answers by the grammatical consistency between the stem and the options.
– There are relatively few situations in which it is recommended that 'none of the above' is used routinely as an option.
– Use about 10 questions to measure each learning objective or lower this to 5 if the task is of a limited nature.
– Try to prepare around 20% more questions than you need for the final version of the test. You may find that some questions are not useful and will have to be eliminated from the final test. Always add a quota of extra questions to existing tests in order to see if they are useful for future assessments.
– Try to include about four responses for each question. It is not necessary that every question has an equal number of alternatives. Sometimes it is wiser to have fewer but better options than to seek to maintain questions with equal numbers of options throughout a test.

The next section considers the analysis of the options in a multiple-choice question. You may skip this section if you are not interested in the analysis of assessment items.

Distracter attractiveness

When you are using multiple-choice, matching, alternate-choice and true-false questions you may be interested in seeing how many people chose a particular


option. We shall focus only upon the options in multiple-choice questions but what we have to say can be transferred easily to true-false and alternate-choice questions. In a multiple-choice question, the distracters are not only wrong but they should also be plausible because they include mistaken ideas and thinking. If it is an effective question, then weaker students or less competent persons are equally likely to select any option.

Table 71. Actual item analysis data

(i) An item where most pass
Option chosen    Proportion in low scoring group    Proportion in high scoring group
A*               .82                                .95
B                .03                                .01
C                .11                                .02
D                .01                                .00
If you want to rank students in some order of ability then this item would not be helpful.

(ii) An item where there is a possible alternative answer
Option chosen    Proportion in low scoring group    Proportion in high scoring group
A                .21                                .08
B*               .16                                .35
C                .32                                .00
D                .29                                .55
B and D appear to be possible alternative answers.

(iii) An item where all but two distracters are being used
Option chosen    Proportion in low scoring group    Proportion in high scoring group
A                .31                                .24
B                .12                                .00
C*               .42                                .75
D                .10                                .00

(iv) The typical pattern for an effective item
Option chosen    Proportion in low scoring group    Proportion in high scoring group
A                .21                                .06
B                .59                                .08
C*               .09                                .82
D                .09                                .02



Distracter attractiveness is determined as the percentage of people who chose an option. It tells you how useful the distracters in a multiple-choice question were. An analysis of distracters will help you to develop test-writing skills and it is the basis of good multiple-choice testing. The steps involved in determining distracter attractiveness are quite simple:
– count the number of people who chose each incorrect option;
– divide this by the number of people who answered that question (include those who chose the correct option in your total); and
– multiply by 100 to give you a percentage.
These indices will help you to revise each question and, over time, to refine the power of each question in your multiple-choice assessment. If a distracter attracts more people from the top scorers on an assessment, or if it fails to discriminate between those who are competent and not-yet-competent, then it needs to be revised. Similarly, distracters that do not attract anyone are not working for you and should be discarded. A better distracter will attract those who are lower in ability or achievement.

Table 71 provides some examples of item analysis results of student responses to tests that we have administered. The results come from a commercially available test as well as a teacher-constructed test. In this table the asterisks indicate the correct answer. We have also contrasted the different pattern of answers between students who scored highly on the assessment and those who scored poorly.
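As an illustration of the steps just described, here is a minimal Python sketch with hypothetical data (the function names, option letters and scores are ours). It computes the percentage choosing each option, and it also works out the same proportions within score groups, which are the figures behind the trace lines discussed next.

from collections import Counter

def distracter_attractiveness(responses):
    """Percentage of examinees choosing each option on one question.
    `responses` is a list of chosen options, e.g. ['A', 'C', 'A', 'B', ...]."""
    counts = Counter(responses)
    total = len(responses)
    return {option: 100 * n / total for option, n in counts.items()}

def trace_table(responses, scores, n_groups=4):
    """Proportion choosing each option within score groups, ordered from low to high scorers."""
    ranked = sorted(zip(scores, responses))                  # order examinees by total score
    size = len(ranked) // n_groups
    table = []
    for g in range(n_groups):
        chunk = ranked[g * size:(g + 1) * size] if g < n_groups - 1 else ranked[g * size:]
        options = [r for _, r in chunk]
        table.append(distracter_attractiveness(options))
    return table                                             # one dict per group, low scorers first

# Hypothetical data: option chosen on one item and each examinee's total test score
choices = ['C', 'A', 'C', 'B', 'C', 'A', 'C', 'B', 'C', 'C', 'A', 'C']
totals  = [ 28,  11,  30,   9,  25,  14,  27,  12,  22,  29,  10,  26]
print(distracter_attractiveness(choices))
print(trace_table(choices, totals, n_groups=3))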

Trace lines

A trace line is a graphical way of describing group performance on a question. The group performance is obtained by dividing the performance on an assessment into four or five groups.

[Trace line plot: the proportion choosing each option (0% to 100%) is plotted against score group, from very low scorers to very high scorers; Option C rises with ability while Options A and B decline.]

Figure 50. Trace lines for a three option multiple-choice question for different groups.
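If you want to draw a chart like Figure 50 yourself, the following sketch shows one way to do it. It uses the third-party matplotlib library, and the proportions are made-up numbers shaped like the pattern the figure describes; they are not data from the text.

import matplotlib.pyplot as plt

# Hypothetical proportions choosing each option in four score groups (low to high)
groups = ["Very low", "Low", "High", "Very high"]
proportions = {
    "Option A": [0.45, 0.35, 0.20, 0.05],
    "Option B": [0.40, 0.30, 0.15, 0.05],
    "Option C": [0.15, 0.35, 0.65, 0.90],   # the keyed (correct) option
}

for option, values in proportions.items():
    plt.plot(groups, values, marker="o", label=option)

plt.ylim(0, 1)
plt.xlabel("Score group")
plt.ylabel("Proportion choosing the option")
plt.title("Trace lines for a three-option question")
plt.legend()
plt.show()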



The trace line (see Figure 50) then shows the proportion of people in each group who chose one of the options. You are looking for a pattern in the trace lines. Presumably, the more able performers will choose the correct option (in this case Option C). We would also expect less able performers to choose the distracters (Options A and B).

SUMMARY

Matching, true-false and alternate choice questions involve the selection rather than the supply of an answer. They require quite different cognitive processes than the essay and the short answer questions. All of these question formats are subject to the abuse of item writers who produce questions which are not valid, or which assess the recall of the most trivial details and facts. This is a limitation, however, of all questioning procedures. It is possible with these questions that students can correctly guess an answer, and the remedy for this has been to provide a large number of questions as well as adjusting the pass mark to account for the potential level of guessing. The concern about guessing of answers is not important where one asks some 50 to 100 questions. In these cases it is unlikely that random guessing will provide scores greater than 60%. Some of these issues were dealt with in greater detail in the preceding section on multiple-choice formats.

The different types of questions should be seen as tools for assessing knowledge (and to a lesser extent skills and attitudes) in your classroom. They offer complementary approaches which can be adapted for your students' specific needs. For the assessment of knowledge, the short answer (including completion) and essay questions are recommended for one-time, one-class tests. The true-false, alternate-choice and multiple-choice questions are recommended for large-scale, multiple-group and repeated testing. While essay and short answer questions are easiest to write, the true-false, alternate-choice and multiple-choice questions are easiest to score and analyse. Where the emphasis is on recognition then true-false, alternate-choice, matching and multiple-choice questions are appropriate, but the short answer is most appropriate for recall and the essay for organisation of responses. Remember also that practical or performance assessments may include a component of knowledge. This means that you have a wide variety of assessment tools at your disposal for the testing of knowledge.

We spent several chapters on assessing knowledge and it is important to note that we did this because most of our education system is geared towards the delivery of intellectual skills. Look at your syllabus documents: by far, the majority of educational outcomes are cognitive in nature. They require some assessment of knowledge, and questioning has been the traditional approach. One of the difficulties with questioning as a form of assessing knowledge is that it may not be as holistic in its approach or authentic in its context as we would like it to be. Questioning is a way of abstracting what is most essential. It may not meet all your needs.

This discussion on assessing knowledge is almost complete. If you have doubts about which method of assessment is best for your students then our advice is to use multiple methods and to compare the results. You may find that the results


from different methods are positively correlated. You may note that one particular method is preferred by students or is better for your context. Remember that knowledge assessments do not have to be secret. There is a professor in the US who gives his students a copy of their final exam at their very first class. It is possible to do something similar in other educational and training contexts. People should have the right to know how and on what they are going to be assessed. For instance, we can give people a list of the potential questions to be asked or indicate that variations of particular questions will be used. At the very least we can use open-book exams to take away some of the unnecessary anxiety. Of course all the preceding comments apply to formal summative assessments and you may disregard this if you just wish to use questioning for formative assessments.

Sometimes we are asked which method of assessing knowledge we would recommend. Firstly, they are all useful: do not overlook any because of a particular bias or prejudice. The essay is useful and the multiple-choice question is useful; it is like asking whether we prefer our hand or our eye. Each has a particular purpose and we want both. But to return to the question of which method of assessment we would recommend, our answer today is that it depends on the content and the aims of the assessment. However, it seems that the short answer question can be used in most cases. To our mind this is the all-purpose method of questioning. It avoids the problem of guessing or recognition. It does not have the technical problem of finding the best distracters as in multiple-choice questions. It is easy to score because it has an identified correct answer. It is easy to write. It can be used in classrooms throughout the world. It can be used as an oral/verbal or written question. You can cover a fair amount of content with this question. Of course the short answer question has limitations, such as eliminating creativity and reducing the scope for individual expression.

The short answer question comes straight to the point, and for students there is nothing quite like the feeling of opening an exam paper and then seeing a range of exquisitely relevant questions about which they have no earthly clue. Incidentally, that is probably the reason most people dread exams: there is the potential for failure. We can now leave behind the assessment of knowledge and move on to the next section. This focuses on practical and performance assessments.

-oOo-

REVIEW QUESTIONS

Part A

T F  Matching questions are objectively-scored questions
T F  True-false questions are classed as selection items
T F  Alternate-choice questions are multiple-choice questions
T F  Matching questions cover only limited subject content
T F  You need a large number of items in a matching question
T F  The number of options remains constant after each choice in a matching question
T F  A matching question can be displayed in two columns
T F  Matching questions should be heterogeneous in content
T F  An optimum number of matches is around 20
T F  True-false questions are not helpful for assessing basic facts or ideas
T F  Some subjects incorporate a hierarchical network of true and false propositions
T F  Any question format can assess trivial information
T F  The true-false question is less valid because of the guessing factor
T F  The more questions then the less effect there will be for factors such as guessing
T F  A true-false assessment should contain at least 20 questions
T F  True-false questions can cover a wide range of content
T F  Three true-false questions can be answered in the time it takes to answer one multiple-choice question
T F  The alternate choice and true-false question are identical
T F  Alternate choice questions are easier to write than multiple-choice questions
T F  More alternate choice questions can be asked within a testing period than with multiple-choice questions
T F  For a question with three alternatives the correction for guessing is Right – (Wrong/2)
T F  The correction for guessing assumes that right and wrong answers were obtained systematically
T F  The use of a correction for guessing is recommended
T F  Test directions should be clear as to any penalties for guessing
T F  The 50% difficulty level after adjusting for chance for a two-choice question is 75%

Part B: multiple-choice questions

T F  Multiple-choice questions were developed around the time of World War II
T F  Multiple-choice questions can assess judgment as well as memory
T F  A multiple-choice question consists of an incomplete statement or direct question
T F  The suggested solutions to a multiple-choice question are called stems
T F  The purpose of the incorrect alternatives in a multiple-choice question is to provide plausible answers
T F  Context-dependent multiple-choice questions have more than one correct answer
T F  Some of the criticism of the use of multiple-choice questions is correct
T F  Multiple-choice questions provide a better sample of subject knowledge than essay questions
T F  Complex learning outcomes can be assessed using multiple-choice questions
T F  An average student can be expected to answer about 50-60 multiple-choice questions in an hour
T F  Around 10% more questions are needed before developing the final version of a test
T F  Multiple-choice questions in a test are norm-referenced
T F  Multiple-choice questions in a test are objectively scored
T F  Scores on multiple-choice tests usually rank performance comparably with other tests
T F  In multiple-choice questions we include options that are obviously wrong
T F  We need around 2-3 questions for each learning outcome
T F  About four responses are adequate for each question
T F  It is not necessary that every question has an equal number of alternatives


EXERCISES

Part A
1. When should teachers use true-false questions?
2. Prepare a true-false test for an objective from your syllabus or curriculum.
3. In an area in which you are teaching or plan to teach, find five true-false items and examine them in terms of the criteria below
   [ ] Ensured the statement is entirely true or false
   [ ] Randomly mixed the true and false answers
   [ ] Avoided double negatives
   [ ] Used straightforward language and short questions
   [ ] Avoided trick questions
   [ ] Avoided general terms such as 'usually', 'none', 'always'
   [ ] Did not make true statements consistently longer than false statements
4. When should teachers use matching questions?
5. Write a matching question for an objective from your syllabus or curriculum.
6. In an area in which you are teaching or plan to teach, find one matching question and examine it in terms of the criteria below
   [ ] Used at least four but no more than 12 items in each matching question
   [ ] Included more alternatives than questions
   [ ] Included only information related to each matching item
   [ ] Put numbers, dates, items in order (e.g., alphabetical)
   [ ] Used each item only once
   [ ] The entire question was on the one page

Part B: multiple-choice questions
1. List three advantages and disadvantages of the multiple-choice item in education and training contexts.
2. Prepare a multiple-choice test for an objective from your syllabus or curriculum.
3. Comment critically on the following excerpt that deals with the value of multiple-choice questions and essays. The excerpt comes from an assessment discussion list.

   At Appalachian State University we did a bunch of experiments in assessing writing. It turns out that objective tests (like CAAP) gave us the same results as both a homegrown and CAAP essay exam. We used double blind essay reviews, had some scored by ACT and others by in-house folks. The bottom line is that the knowledge tested in multiple choice writing exams is very highly correlated with the skills tested by a writing sample. They may be 'different skills' but one stands for the other in testing. It cost me $18.00 per essay when I bought out faculty time, $17.00 per essay for outside readers, and $9.00 per multiple choice test. All produced the same result. If there are two highly correlated issues, measure the one that costs the least!


   Point One: Multiple choice writing exams are accurate and less expensive than essays. Point Two: Bob's message could have been written by a thousand other faculty members who do not take multiple choice writing tests seriously. It is reasonable to believe that the best measure of writing is.... well writing. One reaction is to simply attack the results rather than use the results for change. It can be a false economy if the cheap method isn't believed to be valid.

   Source: Randy L. Swing, Co-Director, Policy Center for the First Year of College, located at Brevard College in Brevard, North Carolina. Online discussion on changes in the SAT, [email protected]. Accessed June 2002.

4. In an area in which you are teaching or plan to teach, find ten multiple-choice questions and examine them in terms of the criteria below
   [ ] Is the question meaningful?
   [ ] Does it represent a specific problem in the stem?
   [ ] Is the stem free from irrelevant material?
   [ ] Is the question stated in positive rather than negative terms?
   [ ] Do experts agree upon the answers?
   [ ] Is the question unusual or trivial?
   [ ] Are the alternatives of equal length?
   [ ] Have options been allocated equally across each position?
   [ ] Can a diagram, drawing or picture be used?
   [ ] Are there words which offer a clue (a, an, is, are, this, these)?
   [ ] Is the use of 'none of the above' justified?
   [ ] Are there enough questions to measure this learning outcome?
5. A test was given to 1500 students. For four of the items the following response patterns appeared. They are multiple-choice items with four options. Compare the response patterns for the items. In what ways, if any, are revisions indicated for a test that is designed to distinguish high from low scorers?

   Item 1
   Group     a      b*     c      d
   Upper     0      500    0      0
   Middle    100    200    200    0
   Lower     100    0      200    200

   Item 2
   Group     a      b      c*     d
   Upper     0      0      500    0
   Middle    100    200    200    0
   Lower     200    0      300    0

   Item 3
   Group     a      b      c      d*
   Upper     0      0      0      5
   Middle    1      2      2      0
   Lower     0      0      1      4


   Item 4
   Group     a      b*     c      d
   Upper     0      500    0      0
   Middle    0      500    0      0
   Lower     0      500    0      0

CHAPTER 13

ASSESSMENT OF PERFORMANCE AND PRACTICAL SKILLS

Assessment is a process of obtaining information about learning and achievement. This process is multidimensional in its nature and is not a homogeneous entity. A key aspect of this process is the content of assessments and we emphasised that we can consider these as falling under the broad headings of knowledge, skills or attitudes. In this chapter we want to move on and cover different forms of skills assessment in the area of performance and practical skills. This means that we shall also be considering various forms and methods of assessment.

Practical assessments are important where education or training involves performance. Some abilities require mastery, for example, conducting a chemical analysis, creating a sculpture, using a word processor, assembling a computer, preparing a meal in a hospitality course, completing a woodwork project, giving a hypodermic injection, operating a forklift or piloting a plane. In such cases, where people must reach a given level of performance, it is imperative that it is evaluated by practical tasks. Practical assessments are designed to measure a student's competence on some phase or operation. Although most practical assessments in education fall within the skills area, they really incorporate learning from all three domains (i.e., knowledge, skill, attitude).

In this chapter, we shall outline firstly some of the background to skill development and the formation of expertise. (Visit WebResources, where you can find more information about the assessment of performance and practical skills.) The reason for this is that you need to know exactly what it is you are assessing and at what stage in the development of expertise you are operating. Then we shall outline the different forms of assessment that apply to practical skills. The last sections of the chapter deal with detailed aspects, such as the need for assessments based on the use of a total job, a work sample or skill sample. These will be described and the methods of rating a product, a checklist of processes, or a combination of product and process assessments will be outlined. These are some of the many different aspects in this chapter that you will need to combine in your own thinking; may we suggest that you take the sections slowly and think about the key issues and the extent to which they apply to you.



Psychomotor skills

A common misunderstanding in dealing with the assessment of practical skills is to class them all as 'psychomotor skills'. Psychomotor skills are those skills that involve mental control of manual or motor processes. Many performance-based and practical assessments do not involve psychomotor skills. They may have a small motor component but this is not the essential aspect being assessed. For instance, some students classify word processing as a psychomotor skill but it is essentially cognitive in nature. You are not concerned with the beauty or skill of the keystrokes in word processing but you are concerned about whether someone is familiar with the sequence of commands that are necessary to perform a specific function.

Skilled performance

Skilled performance is important in all aspects of life. It is used in tasks such as wrapping, using a screwdriver, handwriting, keyboarding, sewing machine operation or equipment repairs. This type of performance involves the ability to perform a task to some standard and results from prolonged training or experience. In many cases this activity is of a complex nature involving underlying knowledge and experience.1 These have a high level of organisation and make extensive use of feedback.2 The three major types of skilled performance that have been identified, namely motor skills, perceptual skills and cognitive/language skills, are classified for you in Table 72. A manual task such as operating a lathe comprises a chain of motor responses, the co-ordination of hand-eye movements and the organisation of complex response patterns.3

Table 72. Categories of skilled performance

COGNITIVE: reading, planning, problem solving, estimating
PERCEPTUAL: recognition, speed estimation, accuracy, angle
MOTOR (gross): lifting, movement, sewing
MOTOR (fine): typing, filing, threading

Most skills include all three elements but it is their relative importance that varies. Although the motor movements are readily observed, the knowledge of how to perform and the mental image or strategy are not seen. For example, the fine manipulative or precision skills that underlie a craft occur as larger units of procedures. Even apparently simple psychomotor performances depend on complex sequences of motor activity.4 If you are assessing speed of performance then you need to realise that the total response time in a work situation is a function of perceptual delays plus decision-making delays plus movement times.



PHASES IN THE ACQUISITION OF A PSYCHOMOTOR SKILL

Three phases have been proposed5 for the learning of a complex skill: an early cognitive phase, a practice-fixation stage and an autonomous stage. These are not necessarily distinct but overlap in a continuous fashion.6

Early cognitive phase
– relatively short duration;
– the person attempts to understand the basic aspects of the skill;
– a succession of individual component operations has been identified;
– similarities with other skills can facilitate learning and differences may impede learning.
In the early cognitive phase the plan of skilled performance is voluntary, flexible and able to be communicated. Initial performance depends more on mental factors including the ability to understand the task instructions, to concentrate one's attention on the task and to perceive important task details. Most of our learners are at this early cognitive phase of skills acquisition.

Practice-fixation stage
– this phase is relatively long;
– the correct performance is gradually shaped;
– errors are gradually eliminated;
– the learner comes to know the content and nature of each component;
– continued practice with supervision gradually eliminates errors;
– correct performance is shaped (through attention to wrong cues, responses out of sequence);
– the time required to eliminate errors is usually spread over several days.
In this phase complex skills become the result of learning subordinate responses. Correct behaviour patterns are practised until the chance of making incorrect responses is reduced to zero. The learner links together responses to form a chain and organises the chains into a pattern. At this phase the psychomotor skills increasingly account for performance and slow instances of a psychomotor skill are progressively eliminated.

Final-autonomous stage
– the pattern of activity becomes practically automatic;
– the learner can perform required actions without concentrating on them;
– increased facility, speed and accuracy, proper timing, anticipation, knowledge of finer points of skill;
– there is a capacity to perform the skill in the face of distractions or while attending to other matters.
This phase is characterised by increasing speed of performance in which errors are unlikely to occur. The performance is usually locked in as a response pattern.


This is why it is described as automatic and autonomous. Gradual improvement may continue to occur. As the operator becomes more skilful the specific psychomotor abilities required for task performance will change. An example of skill development phases is in keyboarding, although this is a special skill in that it consists of a number of separate actions rather than an entire block of related responses:
– First stage – letter association stage;
– Second stage – syllable and word association stage (looks at words and syllables slightly ahead); and
– Third stage – expert stage (reads copy a number of words ahead of the movement of the hands, keeps eye continuously on the copy).

The phases of skill acquisition apply across all areas, from learning to write with a pencil in the early grades right through to operating a word processor or technical skills in vocational education. The relevance of these phases for teachers is that learners will vary in their progress along the continuum of skill development, and the extent of skill development expected at that stage of learning will affect the standards by which performance is judged. By and large teachers will have a feeling for what is the appropriate or minimally acceptable level in their group. The phases of skill development also relate to the concept of expertise discussed in the next section.

STAGES IN THE DEVELOPMENT OF EXPERTISE

In examining experts across a wide variety of fields we have come to notice that experts: (a) have their own specialised area of knowledge; (b) are quicker in their ability to solve problems successfully; (c) understand the structure of their field and how areas and individual pieces of information interrelate; (d) have specific memories with the ability to recall complex details from past instances, especially the atypical or error situations; (e) understand the complexities of a situation; and (f) are able to apply specific judgment rules to each case.

Expertise is practical, informal in nature and only rarely, if ever, taught. Furthermore, it is based upon case and episodic knowledge accumulated over extensive periods of time and involves both positive and negative instances (i.e., correct and incorrect skill application and problem solving). We define case knowledge as the knowledge that develops from the experience of dealing with a particularly difficult or interesting problem situation, especially the solutions or outcomes. Episodic knowledge is defined as isolated pieces of knowledge or incidents which, when accumulated by the alert individual, build up a more coherent picture, as with the pieces of a jigsaw puzzle. However, with episodic knowledge, unlike with the jigsaw puzzle, there may not be a model or overall picture that the individual has been taught or is aware of to facilitate the piecing together of fragments to make sense.

How then does expertise develop? There appears to be a series of distinct, identifiable stages that commence with being a novice, then moving to being an advanced beginner, then on through stages of competence, to proficiency and ultimately to expertise, as knowledge and skill increase and change both quantitatively and qualitatively. Novices are students or beginning workers; advanced beginners are in the second or third year of their career; around the third or fourth year they may become competent. The majority of skill learners will probably reach the stage of being competent; a smaller number will become proficient, while a still smaller number of those who are proficient will develop into experts. These stages, which are developed from examining real-life expert functioning and work in the field of artificial intelligence, have some useful explanatory, identifying characteristics. It should be noted that in these stages the qualitative changes over time are as important, if not more important, than the volume of information acquired.
– Novice: The beginner seeks all-purpose rules to guide his/her behaviour. These rules are logical, fairly consistent and the beginner typically is locked into them, unable to deal with situations that require more than the application of rules.
– Advanced Beginner: At this point experience starts to be important. As knowledge of different situations is accumulated, the individual realises that the rules, which are of necessity generalisations, do not adequately cover all situations.
– Competent: The competent worker exercises greater authority by setting priorities and making plans. At this stage they have come to determine what is important and that the order of priority may change with the circumstances.
– Proficient: In the proficient worker, intuition or know-how becomes important. They may no longer consciously think about adjustments. They notice similarities between events. There is more analysis and decision making with more flexible observance of rules.
– Expert: The expert has an intuitive grasp of situations. Performance is fluid and qualitatively different. The knowledge of experts contains fewer rigid classifications of areas of data, together with a mastery of the interrelationships and links between the different areas of knowledge.

These stages in the development of expertise can be valuably interpreted in conjunction with the phases of skill learning. Three phases of skill learning were proposed. The first of these is the cognitive stage, where the learner comes to grapple with the basic factual understandings, the broad outline, the essential nature of each of the steps and the order in which these must be performed. The second phase in skill learning is the practice-fixation stage, where the repetition of the skill and involvement with its reality increase the depth of understanding and also establish the steps and sequences of skill performance clearly in permanent memory. The third phase is the stage of automatisation, where the skill is performed automatically without any need for the performer to consciously monitor the steps and sequences in the skill, as this is done subconsciously in accordance with the mental model that has been constructed through practice.

The autonomous stage of skill learning is of tremendous practical importance. Attainment of the autonomous stage frees up the conscious mind to concentrate on the identification of potential problems and their solution while there is still a subconscious monitoring of ongoing performance.


Taken together, the novice-expert stages in development and the stages in skill learning would have the novice situated very much at the cognitive stage of skill learning, with probably some movement into the practice-fixation stage evident. The advanced beginner is at the cognitive stage but also very much advanced into the practice-fixation stage. The competent performer would appear to be fully into the practice-fixation stage, while the proficient individual is at the practice-fixation stage and also partially into the process of developing automatic skill performance. The expert will have achieved the autonomous level that produces the characteristic intuitive solutions and reactions to problems.

It is recognised that the stages in both the novice-expert continuum and in skill development are artificial divisions. Some overlap between the stages would be expected, particularly where a number of skills or sub-skills are being learned at the same time. Given the nature of individual differences, which become very evident in skill learning, the amount of time that individuals spend at a particular stage will vary greatly from person to person. A level of expertise in highly skilled professions will generally not be attained before a minimum of five years in a specialty, and there is ample evidence that ten years may typically be the norm for intricate occupations.

Figure 51. Specific vocational preparation required for occupations (a bar chart showing the proportion of occupations, from 0% to about 25%, that require each category of specific vocational preparation: short demonstration only, 1 month, 1-3 months, 3-6 months, 6-12 months, 1-2 years, 2-4 years, 4-10 years, and 10+ years).

For instance, radiologists are expected to have seen 100,000 x-rays, soldiers are expected to have marched 800,000 steps in basic training, and 3-year-old violinists are expected to have played 2.5 million notes. The chart in Figure 51 from the 1992 Revision of the Classification of Jobs7 indicates the specific amount of vocational preparation required for some occupations. One thing that is not often realised is how long it takes for someone to be minimally competent at a task. For around one-third of jobs, competence is not achieved until after two years' on-the-job experience. Yet, when many trainers and educators talk about competence, it is often competence on a specific task or a range of skills. The type of competence we have outlined takes time to develop and is difficult to achieve.

The word competence has been used in this section and there are many ideological positions with respect to competency-based approaches in Australia. For our part, we do not wish to engage in this debate but shall merely use the word 'competent' in two of its ordinary meanings, namely: (a) someone being properly qualified or capable; and (b) someone being fit, suitable, sufficient or adequate for a purpose.

In this section we have only sketched an outline of skill acquisition leading to the development of expertise. The message for teachers is that the development of learner expertise in practical areas requires a significant time span of instruction and practice. A one-off adequate performance may not be sufficient for an inference of competent or expert performance. It would be unlikely that mastery of complex skills will be achieved in less than 5,000 hours. A second implication is that instructors need to be intricately familiar with the content domain of the practical skills that they are teaching. Thirdly, teachers need to be aware of the appropriate types and forms of assessment that will provide the necessary and sufficient evidence for their judgments at the particular stage of interest. Finally, instructors may benefit from repeated assessment that produces a learning curve for the individual.

A typical skill acquisition curve from a performance task is shown in Figure 52. Performance at various stages of skill learning varies greatly. Within a few days an apprentice joiner can file the laminated edges in around half the time that it took at first. Typical learning curves that we have plotted for performance on complex tasks show massive gains across as few as 5-6 trials, so performance standards need to take account of the overall stage of skill development.

Figure 52. Dynamic assessment of repeated performance on a task (a learning curve on a block design task from the Wechsler Adult Intelligence Scale, plotting accuracy, 30% to 90%, against learning trials 2 to 10).

This graph also highlights the distinction between static and dynamic assessments. A static assessment occurs at a point in time and may not indicate the true potential of the learner. A dynamic assessment, on the other hand, can trace the individual's learning potential. Very often people can be trained to perform at levels well above their current standards of achievement, and a dynamic assessment allows you to map their progress. It is particularly important because most people fail to achieve their true potential as learners.

FORMS OF ASSESSMENT

Practical forms of assessment can be used in a variety of teaching contexts. They can be used to assess management, laboratory, human relations, technical and manual skills. They are suitable as final or progressive tests and lend themselves to a criterion-referenced approach. They are also useful for the affective domain, in areas like safe work habits and attitudes. There can also be a positive impact on learning through requiring a performance-based assessment: when learning is geared towards a realistic or tangible goal, students become more active in their learning through performance.

Practical tests are not always formal summative assessment events. They can be undertaken while you are teaching, through your observation of selected aspects of your students' behaviours and performances. This is sometimes called 'spotlighting'. Spotlighting is a way of focusing assessment. It may be used to describe a student's progress at a particular time in a learning activity. In order to make the judgment, a teacher would use learning outcomes or competency criteria. When these observations are recorded on a checklist or rated in some way they become more formal assessments.

The general disadvantages of practical assessments are, firstly, that they are time consuming to perform and, secondly, that they often only sample a restricted range of performance. Thirdly, the logistics of conducting valid practical assessments should not be under-estimated, especially in the case of high-stakes assessments (e.g., important assessments such as professional certification or occupational registration that have significant personal consequences). Fourthly, when the performance standard is unclear, they may be just as prone to subjective marking as other forms of assessment.

For the sake of this discussion on assessing practical skills, we have divided the topic into forms, methods, types and features. The main forms of assessment are already known to you and within these forms there is a range of methods that can be used. Some aspects of the range of choices available to you are indicated in Figure 53, with its artificial breakdown of the different aspects of practical testing. This shows that there are many permutations and combinations of testing processes for practical skills. Of course, each form of assessment has particular advantages and disadvantages.

DIRECT OBSERVATION

Advantages:
– Can focus on total job, work sample or skill sample
– Can provide direct evidence of demonstrated performance
– Opportunity to observe specific elements of competence
– Can assess some interpersonal and problem solving skills
– Moderate correlation with written exams
– Indirect evidence of knowledge/understanding
– Realistic activities enhance acceptance by the community
– Standardisation of tasks increases validity

Disadvantages:
– Performance of one skill may not permit inference of overall competence
– Skills may not permit generalisation to varied circumstances
– May require lengthy and costly assessments for adequate reliability

SIMULATION TECHNIQUES

Advantages:
– Valid when related directly to content and outcomes
– Can focus on products and/or processes
– Can focus on total job, work sample or skill sample
– Can provide direct evidence of demonstrated performance
– Opportunity to observe specific elements of competence
– Can assess some interpersonal and problem solving skills
– Moderate correlation with written exams
– Indirect evidence of knowledge/understanding
– Permits specialised complex assessments
– Opportunity to observe specific aspects of performance
– Provides for simulation of costly activities prior to formal assessment
– Standardisation of tasks increases validity

Disadvantages:
– Tasks may not offer the most realistic evidence of performance
– May not generate sufficient evidence to prove competence
– Inferences may not generalise to other circumstances
– May require lengthy and costly assessments for adequate reliability

QUESTIONING TECHNIQUES

Advantages:
– Fidelity for essential knowledge
– Valid when related directly to content and outcomes
– Can focus on products and/or processes
– Can focus on comprehension or problem solving
– Can assess potential performance across a range of circumstances
– May offer evidence to demonstrate transferability to other contexts
– Oral questioning can be in conjunction with skills testing
– Written tests can assess knowledge of workplace procedures
– Indirect evidence of performance skills
– Permits screening of candidates prior to practical tests
– Standardisation enhances validity
– Direct evidence of knowledge

Disadvantages:
– Some feelings and attitudes are not able to be assessed
– Cannot assess interpersonal performance directly
– Cannot assess psychomotor skills directly
– Cannot assess technical performance directly
– Low correlations between test scores and professional competency
– May not detect serious deficits in understanding

EVIDENCE OF PRIOR LEARNING

Advantages:
– Valid when related directly to content and outcomes
– Can focus on products and/or processes
– Can focus on total job, work sample or skill sample
– Can provide direct evidence of demonstrated performance
– Permits economies in assessments
– Guidelines can permit consistent judgments
– Provides for flexibility in assessment
– Ensures that individuals are not disadvantaged

Disadvantages:
– Comparability of performances may be difficult to establish
– Need to infer the ability to perform in other circumstances
– Quality of evidence may be difficult to determine
– Timeliness of past evidence might be questioned

ASSESSMENT TYPES IN PERFORMANCE-BASED ASSESSMENTS

The two key issues in planning a practical test are to ensure that the performance is relevant and that sufficient evidence of performance is obtained from the test to enable you to make a judgment. There are three main types of practical assessments, comprising (a) performance on the totality of a job; (b) a work sample; or (c) a skill sample. The distinctions between these three types of assessment are straightforward, but they do have consequences for the way you arrange your testing.
– TOTAL JOB – learners are assessed in carrying out a real job without assistance;
– WORK SAMPLE – learners are assessed in carrying out a section of a job; and
– SKILL SAMPLE – learners are assessed on a sample task related to an occupation.

Once again, the type of assessment you choose will have an impact upon your teaching and the ways in which your students learn. Integration is greatest when assessment is focused on the total job, but such teaching and assessment are rare. In education, an occupational area is divided into subjects and modules and we typically focus on only one aspect. So, much of our assessment is already directed towards work and skill samples. We then assume that someone, somewhere will be able to integrate these disparate skills (there are exceptions to this scenario). If you are teaching a subject rather than an entire occupation, it means that the focus of your assessment is already limited in scope to work or skill samples. In fact, this probably has some advantages over assessing a total job, because it is difficult to imagine how one might go about testing performance in some occupations, not only because of their complexity but also because of their variability.

FEATURES OF PERFORMANCE-BASED ASSESSMENTS: PROCESS AND PRODUCT

One of the best ways to know whether someone can do something is to watch him/her and judge the performance. You may look at the process the student followed, the final product or both. Aspects of process and product can include:
– PROCESS – a receptionist handling a client;
– PRODUCT – a pressed garment; or
– PROCESS AND PRODUCT – taking a dental x-ray.

The distinction between these three aspects is important for a number of reasons. In some fields, the procedures to be followed are essential and they dictate the quality of the service offered. In other instances, the final product is of interest and the procedures followed to obtain it could vary markedly. A second reason why process and/or product are important is the teaching emphasis involved. For instance, a different teaching approach is required to assist students making a product to meet the requirements of a design brief than to assist students learning the procedures associated with a word processing software package.

JUDGMENT IN PRACTICAL TESTS

Skilled performances are multivariate in nature and various forms of assessment (mainly descriptive) are required. All forms of assessment produce evidence that must be judged, and the two main approaches in making judgments of practical tests are to focus on:
– an integrated (or holistic) judgment based on the teacher's overall impression; or
– an analysis of the performance based on specific criteria.


In analytic judgments you might list a set of criteria (e.g., in a checklist or log book) and, when all of these are satisfied, you might judge someone as competent. In some cases you might want to specify that some criteria are essential while others are only desirable; this would still be analytic judgment. Any overall judgment, however, can be shown to be determined by a particular decision policy by the judge. What do we mean by this? We mean that any study of your repeated decisions would probably show that your judgments are likely to be based on particular criteria. These criteria are emphasised regularly in repeated holistic assessments. So, when you think that you are making an overall decision, you are really making a judgment that is analytic (i.e., it can be decomposed). The only difference between analytic and holistic judgments is that in holistic judgments you covertly give different emphasis or different weights to some criteria than to others. This means that the two approaches can be specified as:
– an integrated (or holistic) judgment where the teacher's overall impression is based on criteria which are given subjectively differing levels of emphasis; or
– an analysis of the performance based on specific criteria which are all given an equal or a pre-determined weight.

The importance of this distinction for practical tests is that using analytic or holistic approaches does have an impact where student grades are based on marks or scores. Where performances or practical tests have specific criteria, you may wish to determine a marking guide. You will need to decide whether every criterion needs to be satisfied and/or is of equal weight. If you are restricted to a holistic approach (e.g., in some of the fine arts areas) then you will need to put in place some system for ensuring that all students are judged uniformly. An excellent solution is to set standards of performance for assessment tasks. This is an approach to practical testing of complex performances which is still criterion-referenced and which is called standards-referenced assessment9.

SETTING STANDARDS FOR HOLISTIC ASSESSMENT TASKS

Standards-referenced assessment is a descriptive assessment of complex skills and products. It is based on setting benchmarks for performances and for the products of learning. Standards-referenced tests call for an overall professional judgment of performance where the analytic criteria are built in to the standards. The benchmarks are provided by examples or descriptors which serve as agreed-upon stable reference points. You then describe a student's performance in terms of the benchmarks or compare their product with the examples provided. The two approaches that are useful for vocational education are specific exemplars or detailed verbal descriptions of performance quality.
– EXEMPLARS '... key examples chosen so as to be typical of designated levels of quality or competence'
– VERBAL DESCRIPTION '... a statement setting down the properties that characterise something of the designated level of quality'10

The advantage of this standards-referenced approach is that it relies on defined benchmarks that obviate the need for convoluted scoring arrangements. There is no need for complicated scoring schemes as the only basis for your judgments.


Although this approach sounds ideal, we should warn you that there will always be some exceptional cases that are hard to classify, and also that some training in judgment will be required to develop agreed-upon assessment criteria amongst markers.

USING CHECKLISTS IN PERFORMANCE-BASED ASSESSMENTS

Checklists in performance assessments are especially useful as a way of making standard observations about someone's performance. They would be familiar to most readers and at first appearance they would seem to be straightforward, but they do incorporate a complex view of the process being assessed.

CHECKLIST: A checklist is taken here to be a list of factors, properties, aspects, components, criteria, tasks, or dimensions, the presence or amount of which is to be separately considered, in order to perform a certain task.11

Checklists have been classified as ranging from a type of list where the order of the items is not critical although the grouping of related items can be helpful (like a shopping list); to ordered listings where the sequence of items is important (e.g., engine overhaul); to checklists that have a hierarchy of items. Some rating scales are called checklists, but the difference is that rating scales encompass a range of values such as always, sometimes or never. Just because a form uses squares that are ticked does not mean that it is a checklist. The most important characteristics of checklists are:
– a response indicating presence or absence of a criterion (i.e., yes/no);
– the independence of the judgment on each criterion of the checklist; and
– the standardised format.

Checklists are useful where a process has to be assessed. They provide a convenient recording form for communicating performance to learners. They can be used to summarise observations. Checklists can be used where a teacher is interested in whether specific procedures were followed. Checklists are easily understood by other stakeholders because they set out a public set of criteria. They minimise haphazardness in judgment by ensuring that only the stated criteria are counted towards an assessment. This reduces the effect of other characteristics on the rating of each learner and ensures that they are judged only on the set criteria. Checklists find their greatest application where skills can be divided into a series of steps and easy-to-make 'yes-no' decisions (see the training checklist in Figure 54; the classroom checklist in Figure 55).

The trainer                                              YES   NO
1. set out the learning outcomes for the session         [ ]   [ ]
2. explained new concepts                                [ ]   [ ]
3. prepared for this session                             [ ]   [ ]
4. tried to make the subject interesting                 [ ]   [ ]
5. demonstrated the relevance of the subject             [ ]   [ ]
6. made opportunities to ask questions                   [ ]   [ ]
7. reinforced the material to be learnt                  [ ]   [ ]

Figure 54. A checklist for training and instruction.
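Where checklist results are recorded electronically, the yes/no judgments can be tallied automatically. The short Python sketch below is our own illustration (not part of the original figure): it records one observed session against the criteria of the training checklist in Figure 54 and reports which criteria were not met; the observation data are invented for the example.

    # A minimal sketch for tallying a yes/no checklist such as Figure 54.
    # The observation recorded below is invented for illustration.

    criteria = [
        "set out the learning outcomes for the session",
        "explained new concepts",
        "prepared for this session",
        "tried to make the subject interesting",
        "demonstrated the relevance of the subject",
        "made opportunities to ask questions",
        "reinforced the material to be learnt",
    ]

    # One observed session, recorded as True (yes) or False (no) per criterion.
    observation = [True, True, False, True, True, False, True]

    not_met = [c for c, ok in zip(criteria, observation) if not ok]
    proportion_met = sum(observation) / len(criteria)

    print(f"Criteria met: {sum(observation)} of {len(criteria)} "
          f"({proportion_met:.0%})")
    for c in not_met:
        print("Not yet demonstrated:", c)

Because each criterion is judged independently as present or absent, the same record can be kept for repeated observations and used to show a learner which specific criteria still need attention.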


Class ____                                   Date(s) ____

Student:                                     √     Notes
– arrives on time
– has books, pens, etc.
– has shop coat/apron
– stays on task
– follows directions
– is polite
– works independently

Figure 55. A checklist for classroom use in an autobody class12.

The details in the checklist must be related to the content and purpose of the instruction, as well as being observable without too much disagreement on the part of observers. To be of assistance, the list of criteria must be mutually exclusive, the list must be as complete as possible and the criteria must be observable. Although the preparation of a checklist might appear to be relatively easy, there are a number of considerations in its development. These have been set out in the WebResources. If the checklist (or rating scale) is important for educational or occupational purposes then it may be advisable to have more than one assessor rate the student's performance. Although a checklist is easy to use for the analysis of procedures and for some indication of quality, the rating scale is most often used where the quality of performance is being rated.

(Visit the WebResources for more information on checklists.)

USING RATING SCALES IN PERFORMANCE-BASED ASSESSMENTS

Rating scales are useful as a way of making standard observations about the quality of a performance or the quality of a product that has been produced by a student. Once again, they ensure that only the listed criteria are counted towards an assessment and they provide a convenient format for communicating with learners. Many different forms of rating scales have been developed over the years and only a few general types are described here. Although most people think that constructing scales is relatively straightforward, there are many technical traps and you are advised to consult a textbook on attitude assessment for technical details.

Descriptive graphic rating scales allow you to express a student's performance in terms of the responses that they made and/or the behaviour observed. Points along the scale are described with short phrases and each point is usually allocated a nominal score (see Figure 56). Very often these scores are added to give an overall rating of performance. The reliability of such ratings needs to be checked so that consistent results will be obtained. For instance, some common errors in rating have included (a) the tendency for some people to rate at around the same level (e.g., very high or very low); and (b) a tendency to rate performance on two separate items similarly. You are referred to the chapter on attitude assessment for further details of rating scales.

To what extent does the sales representative address customer needs?
– rarely sells product benefits
– focuses on features and some benefits
– diagnoses customer needs and sells benefits

Figure 56. A descriptive graphic rating scale for sales skills.

Rating scales can also be used as the basis for comparison against a set of standards (see Figure 57 for an example). Around five levels of quality can suffice for most evaluations of performance and there should be descriptions for each level. It is also possible to use rating scales as a means of comparing student and instructor ratings, and this is helpful in formative assessments of performance (see Figure 58).

– excellent, meets all the quality criteria
– very good, meets more than 75% of the quality criteria
– average, meets 50-75% of the quality criteria
– below average, meets less than half the quality criteria
– poor, meets less than 10% of the quality criteria

Figure 57. A description of performance based on quality standards.
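Where the analytic criteria have already been counted, the proportion met can be translated into one of the descriptive levels of Figure 57. The small Python function below is a hedged sketch of that translation, not a prescribed scoring scheme; the bands follow the figure, and because the published bands for 'below average' and 'poor' overlap at very low proportions, the levels are simply checked from highest to lowest.

    # A minimal sketch mapping the proportion of quality criteria met to the
    # descriptive levels of Figure 57. Where the published bands overlap,
    # the levels are checked from highest to lowest.

    def quality_level(criteria_met: int, criteria_total: int) -> str:
        p = criteria_met / criteria_total
        if p == 1.0:
            return "excellent"        # meets all the quality criteria
        if p > 0.75:
            return "very good"        # meets more than 75%
        if p >= 0.50:
            return "average"          # meets 50-75%
        if p >= 0.10:
            return "below average"    # meets less than half
        return "poor"                 # meets less than 10%

    print(quality_level(9, 12))   # 0.75 of the criteria met -> "average"

The resulting descriptor still rests on the analytic criteria; it simply reports the judgment at the level of the standards rather than as a raw mark.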

Student's self-rating:  1  2  3  4  5        Instructor's rating:  1  2  3  4  5

Figure 58. An example of teacher and self-ratings.


SUMMARY

This chapter has shown you some aspects of practical testing that can be varied, namely the forms of assessment; the emphasis on the total job, a work sample or a skill sample; a focus on the product, the process or both; the use of checklists and rating scales; the trade-off between analytic and holistic judgments; and finally, the potential use of standards-referenced assessment criteria. We have tried to summarise these as guidelines for you in Table 73.

If you feel somewhat overwhelmed by all the information and the figures shown in this chapter, you shouldn't be. Allow yourself some time to digest most of the information and then revisit the chapter little by little (maybe section by section) and go through the details and the nitty-gritty. Also, do not forget to visit the WebResources because you will find much more material there that it was not possible to include in the book because of its large volume. This material will be updated and enriched from time to time, so do come back when you feel that you would like to gather some additional material.

Table 73. Guidelines for constructing practical tasks

1. Select the most relevant procedures on which students have received instruction
   • use only situations or problems which are based on the learning outcomes and topics
   • ensure that students understand their relevance
   • check that these are as interesting as possible for students
2. Decide what forms of assessment are suited for practical tests in your area. Use the traditions of assessment in your field as a guide.
3. Decide whether you will assess the total job, a work sample or a skill sample.
4. Consider, based on your experience and expertise, just how many assessments are sufficient for you to make your judgment.
5. Decide whether the assessments should be conducted in a range of contexts and situations.
6. Determine the benchmarks or standards or criteria against which the student's performance will be judged (e.g., accuracy, speed, quality, creativity, safety). If it is a process, then list the steps required and note any key points. It should include 10 to 25 steps.
7. Develop a student task sheet and explain the task to be completed
   • write out the instructions so that students will know exactly what they will have to do and what resources, tools or materials are required. Specify any time limit.
8. Develop the checklist and/or rating scale.
9. Inform students of the type of test, time requirements and the criteria for assessing performance.
10. Observe the points that you have listed in your checklist or rating scale while the student is performing the test. Mark the degree of satisfactory performance. Do not interrupt except where safety is an issue.

Source: Adapted from Miller & Rose, pp. 239–240

-oOo-


REVIEW QUESTIONS

T F  Practical skills are psychomotor skills
T F  Skilled performance involves cognitive, perceptual and motor skills
T F  The first phase in the acquisition of a psychomotor skill is the early cognitive phase
T F  The phases of skill acquisition apply across all the areas of learning
T F  Case knowledge involves information from isolated incidents
T F  A novice learner realizes that generalizations do not cover all situations
T F  A one-off adequate performance may not be sufficient for inference of competent or expert performance
T F  A static assessment traces the individual's learning potential
T F  Spotlighting is for use with a high stakes assessment procedure
T F  Questioning techniques about practical skills provide direct evidence of knowledge
T F  In an assessment using a total job, students are assessed in carrying out a section of a job
T F  Checklists are used where a process has to be assessed
T F  Rating scales are useful as a way of making observations of quality
T F  Standards-referenced assessment uses benchmarks as indicators of quality
T F  Exemplars are key examples which are typical of the level of quality
T F  Descriptors are the properties that might characterise the level of quality

EXERCISES

1. Examine a syllabus document that deals with skills and identify up to three psychomotor learning outcomes. Prepare a performance or practical test to assess these learning outcomes.
2. When should teachers use performance-based tests?
3. Explain how a checklist could be used in the assessment of psychomotor learning outcomes.
4. What special procedures should be followed in administering performance tests?
5. Are there any limitations of practical tests for use in classroom contexts?
6. In an area in which you are teaching or plan to teach, find a checklist assessment of practical skills and examine whether it:
   – identified each of the actions desired in the performance;
   – arranged desired actions in the expected order; and
   – provided a simple procedure for checking each action.


CHAPTER 14

ASSESSMENT OF ATTITUDE AND BEHAVIOUR

One of the most rewarding things about teaching is that you are developing people and attitudes. Through your teaching you try not only to bring about changes in knowledge or skills but also to instil particular attitudes. It could be attitudes to your class, to a subject, to instruction, to an occupation or profession, to clients or customers, to other students, or to principles such as safety, equality, confidentiality, integrity or ethics. Sometimes we deal with attitudes directly but at many other times we do this indirectly.

In education, attitudes form part of the affective domain1 and include your students' interests, opinions, beliefs, feelings or values. Although this domain may encompass important instructional outcomes, it does not always feature as part of subject outlines or syllabus documents. We may teach attitudes but we rarely focus on assessing attitudes because we have developed a tradition of reporting mainly on the achievement of knowledge and skills. We often leave the attitudinal achievements as part of a tacit curriculum or agenda.

FORMATIVE OR SUMMATIVE ASSESSMENT OF ATTITUDES

The first issue for you is whether you wish to assess such attitudes in a formal manner. If this assessment does occur, it needs to be within the constraints of your curriculum (e.g., the stated learning outcomes); the traditions of your field; the nature of your discipline; and the policy of your teaching organisation. You may resolve to teach attitudes but not to assess them and this is quite justifiable. If attitudes are not part of your curriculum then attitudinal techniques must not be used for summative assessment but only in a formative way. This will help teachers to determine a group's beliefs, interests, opinions and values. A strict requirement is that there must not be any adverse repercussions for individuals when they express their attitude or opinion. If the responses are collected for formative assessment from a class or group then it is helpful to ensure privacy. Anonymous responses without any identifying details (such as age, sex) help to ensure confidentiality. Again there must not be any repercussions for the class or group.

Sometimes you may use attitudinal assessments to determine the quality of your teaching or the subject, or to invite comments on aspects of the course. It may be argued that in such cases a negative reaction from students may conceivably have implications for the way they were graded. A useful precaution is for the information (e.g., surveys, instructor ratings) to be collected independently of the teacher, for surveys to be processed externally, and for any survey findings not to be released to you until the semester is completed and your group or class results submitted.

The second issue is the appropriate methods of assessment when your syllabus does have attitudinal learning outcomes. All the forms of assessment are relevant for attitudes. The approaches to assessing attitudes that are covered in this chapter are questionnaires and observations. Both can provide useful forms of assessment to assist in managing instruction in vocational education. Prior to dealing with these, we would like to clarify some aspects of the attitudinal domain for you.

The nature of attitudes

Attitudes are harder to define and less clear-cut than knowledge or skills. Underlying the importance of attitudes is the view that a person's attitudes will be reflected in his/her behaviour; sometimes this does not appear to be the case and it often seems that a person's behaviour is more a function of habit and the situation or context in which they are found. Nevertheless, a reasonable case may be made that beliefs, opinions, attitudes and interests may predispose someone to act in a particular fashion (if all other factors were controlled). By attitude we mean a system of beliefs, values or tendencies that predispose a person to act in certain ways. An attitude is a theoretical construct or notion that is inferred. It describes a relationship between a person and their behaviour. Attitudes were originally conceived as the degree of positive or negative feeling associated with some object. The 'object' can include a person, an idea, a thing or a fact.

Attitudes should be distinguished from interests, opinions, beliefs and values, but for the purpose of our discussion we would like to consider all of them under the general heading of attitudes.
– Interest – a preference for an activity or object, which – other things being equal – may also be reflected in the amount of knowledge, involvement with and value for the object;
– Opinion – specific thoughts on a topic, issue or activity;
– Belief – thoughts about an object, person, idea, fact or event that are regarded as true, real or credible;
– Value – the personal importance or worth of an object (i.e., a person, an idea, a thing or a fact).

From your own experience you would be aware of the complex inter-relationship of these concepts. In this chapter, we would like to continue to use attitude in a general sense to cover interests, beliefs, opinions and values.

Questioning as a form of assessing attitudes

When you ask someone about his/her attitude to something, you are interested in knowing how he/she feels about it, whether they like or dislike something, whether they believe something to be true or how important it is to them. Questioning is a direct approach for assessing attitudes and in some cases may reveal feelings which are otherwise masked.



For instance, someone may conform socially but deep down they might have different attitudes, beliefs, values or opinions. In educational contexts, there are many forms of questioning that you can use to assess attitudinal learning outcomes.
– Essays can be used for students to indicate their formal commitment to an ideology or belief about an issue;
– Case studies and problem solving can be used to gauge student values in ethical and moral dilemmas;
– Viva voce or oral examinations can be used to assess complex opinions on wide-ranging topics and learning outcomes;
– Self-reports can be used to complement observation of students and provide a fuller picture of attitudes;
– References from student placements can be used to assess performance and behaviour; and
– Questionnaires which consist of a standard set of questions can be used to assess learning interests.

USING QUESTIONNAIRES TO SURVEY STUDENTS

Questionnaires are amongst the most common forms of determining students' attitudes or opinions. They are used frequently in research and in course or teaching evaluations. Questionnaires tend to be overused with older students and adults and in many instances have become an intrusion. They are not widely used in classrooms, possibly because of the close interaction and opportunities for feedback between teacher and learner. The questionnaire, however, can provide standardised data at a group level that can be useful for instruction. It allows you to take an anonymous sample of student opinions on issues that may be of importance to you or your students.

A distinction needs to be made, however, between an opinion questionnaire and an attitude scale. The latter aims to indicate the degree of attitude. Not all questionnaires are attitude scales. An example of an opinion questionnaire developed by a classroom teacher to assess apprentice responses to estimating in a printing course is shown in Figure 59. There was a general view that students failed to approach the task of estimating costs appropriately and did not value the need to assess costs for different printing processes. The questionnaire was given to students to answer anonymously and the results were used to sample opinions about the task of estimating.

In analysing the results of this questionnaire you would be interested in knowing how many students strongly agreed, agreed, were undecided, disagreed or strongly disagreed with each statement. This would be a valid use of questionnaires in assessing specific attitudes and opinions. It would not be an attitude scale. (An attitude scale groups questions in some way and adds together the responses to give a total score that locates someone along a scale.)



You are to circle how you feel about the following aspects of estimating in your work-life. If you strongly agree with the statement circle SA. If you agree with the statement circle A. If you are undecided circle U. If you disagree with the statement circle D. If you strongly disagree with the statement circle SD.

Estimating classes are interesting                                          SD D U A SA
Working maths problems is fun, like solving a puzzle                        SD D U A SA
Knowledge of estimating will be useful to my future employment              SD D U A SA
Estimating is doing the same thing over and over again                      SD D U A SA
Two hours of estimating is not long enough for me to learn estimating       SD D U A SA
Estimating is important to the printing industry                            SD D U A SA
My boss has taught me more about estimating skills than I have learnt at TAFE   SD D U A SA
Class practical exercises are good                                          SD D U A SA
Reading the text book is a waste of time                                    SD D U A SA
Estimating can be a boring subject                                          SD D U A SA
Time spent in estimating classes can be well spent                          SD D U A SA
Estimating class is too short                                               SD D U A SA
Being able to add, subtract, multiply and divide is all the estimator needs to know   SD D U A SA
Estimating is easy                                                          SD D U A SA
I wish work would give me more on-the-job training                          SD D U A SA

Source: Fred Glynn, NSW TAFE.

Figure 59. A questionnaire to assess opinions about printing estimating.
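Tallying the responses to an opinion questionnaire such as the one in Figure 59 is simply a matter of counting how many students chose each category for each statement. The sketch below, using only the Python standard library, is our own illustration; the two statements are taken from the figure but the responses are invented.

    # A minimal sketch for tallying opinion questionnaire responses per statement.
    # The responses are invented for illustration.
    from collections import Counter

    categories = ["SD", "D", "U", "A", "SA"]

    # One list of responses per statement, one entry per (anonymous) student.
    responses = {
        "Estimating classes are interesting": ["A", "A", "U", "SA", "D", "A"],
        "Estimating is important to the printing industry": ["SA", "SA", "A", "A", "U", "SA"],
    }

    for statement, answers in responses.items():
        counts = Counter(answers)
        summary = "  ".join(f"{c}:{counts.get(c, 0)}" for c in categories)
        print(f"{statement}\n  {summary}")

No total score is computed, which is consistent with treating this as an opinion questionnaire rather than an attitude scale.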

In this example, you would need to assume that students are willing and able to give you honest answers to your questions and that they understood the questions. You would assume that terms like 'strongly agree', 'agree', 'disagree' or 'strongly disagree' have the same clearly defined meaning for everybody. Note, however, some good features of this questionnaire: it has some clear directions; it is short and easy to complete; it is confidential and anonymous; and there are some questions that are worded differently (e.g., My boss has taught me more about estimating skills than I have learnt at TAFE) so that you do not just get a pattern of strongly agree or agree throughout.

Some helpful criteria for writing such attitude or opinion statements are summarised2 below:
– avoid statements that may be interpreted in more than one way;
– avoid statements likely to be endorsed by almost everyone or no-one;
– select statements believed to cover the entire range of feeling;
– statements should be short, rarely exceeding 20 words;
– each statement should contain only one complete thought; and
– terms such as 'all, always, none, never ...' should be avoided.

Experience with questionnaires quickly indicates that people are tired of completing surveys. People may settle for generating any response that is merely reasonable. This can take many forms:
– selecting any option that seems to be satisfactory;
– agreeing with complex statements;
– responding in a way that ensures privacy;
– answering conservatively;
– saying 'don't know' etc. when the question is complex;
– making some random choices; and
– making indiscriminate choices.

If you are considering the use of a questionnaire then we would recommend that you keep the questionnaire as short as possible and preferably to one page. Check that only the most essential questions are asked. Do not include questions just because you might want to know something. Also ask yourself how you are going to process the data once they are collected. This could alert you to any unnecessary questions. Some other things that we take into consideration are to use coloured rather than white paper to improve the questionnaire's appearance and to ensure that it is professionally typeset as well as attractive in its layout or presentation.

Where possible we try to ensure that the questionnaires are anonymous and certainly confidential. We have noted that it is important not to ask adult students for both age and gender because this can identify them, especially in a small group. Typically, age was one of the questions most often omitted in surveys of technical and further education students until we started to use broader age groups; but there can still be problems.

We recommend that you offer people a reward for completing the questionnaire – sometimes we have used movie tickets for adults, and stickers or a pencil and eraser for primary school pupils – just to ensure that there is some reward for participating, giving up their time and volunteering, and to increase attention to the task. If rewards are not possible then at least try to offer people a summary of the results or some feedback. We also advise you to distribute and collect questionnaires yourself rather than rely upon others, because it gives you a feel for how people responded to your questionnaire, the conditions under which it was completed, whether any particular questions were misunderstood, whether they took it seriously and whether the data may be valid. Certainly there are researchers who have never seen the people that completed their surveys and sometimes did not even see the questionnaires but only a statistical summary of the results.

USING QUESTIONNAIRES TO EVALUATE COURSES, TEACHERS AND INSTRUCTORS

At the very least, each teacher should use a questionnaire to evaluate his or her instruction throughout the year. Students or participants are ideally placed to provide feedback about teaching. Their opinions may be collected during or at the end of a course. This data may be obtained at all educational levels. To save you the trouble of developing a questionnaire, we have recommended a standard questionnaire for adult education settings and a copy is provided in Appendix F. A standard questionnaire for primary school settings (School Quality Survey) is also included there. The School Quality Survey is a Hawaii Department of Education survey that seeks information from teachers, students and parents about school quality. The questionnaire contains approximately 45 questions and is intended to be administered to schools biennially.3

The results from these surveys are invaluable. For example, since 1992-1993, the Montgomery County Public Schools4 have surveyed students and their parents. Around one-third of primary schools are surveyed each year. The results indicate the level of student satisfaction, and some examples of the typical results that can be produced from such surveys are provided in the WebResources. When standard questionnaires are used then comparisons can also be made across schools. This information is helpful but interpretations need to be made with some caution, especially when the class numbers involved are small (less than 30).

Customer Satisfaction Surveys are used at the Y E Smith Elementary School.5 These were administered in February 2001 to parents, teachers and grade 4 students. The 35-item student surveys were anonymous and confidential. Some comparative results are highlighted in Figure 60.

Themes                                               This School    All Similar Schools
This school is clean.                                    64%               62%
This school is safe.                                     80%               81%
Parents are involved at this school.                     96%               89%
This school has a positive climate.                      65%               70%
This school has high expectations for students.          95%               95%
This school has a strong instructional program.          85%               83%
The teaching in this school is effective.                89%               92%
This school has good student discipline.                 83%               82%

Note: Student surveys were administered in each classroom on a selected day, but not by the regular teacher. The table shows the aggregated percentage of students who agreed with the stated themes.
Source: http://www.dpsnc.com/dps/schools/CustSat/YESmithCustSat.html

Figure 60. Student satisfaction survey results.
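The caution about small class numbers can be illustrated with a rough calculation. The sketch below is our own addition (the normal-approximation margin of error for a proportion is a standard rule of thumb, not a method prescribed in this book) and shows how uncertain an agreement percentage such as the 64% in Figure 60 would be if it came from a small group.

    # A minimal sketch illustrating why agreement percentages from small groups
    # should be read cautiously. The approximate 95% margin of error for a
    # proportion (normal approximation) is our own illustration.
    import math

    def margin_of_error(p: float, n: int) -> float:
        """Approximate 95% margin of error for an agreement proportion p from n students."""
        return 1.96 * math.sqrt(p * (1 - p) / n)

    for n in (25, 100, 400):
        moe = margin_of_error(0.64, n)   # e.g., 64% agreed "This school is clean."
        print(f"n = {n:3d}: 64% plus or minus {moe:.0%}")

If only 25 students had responded, the two-point difference between this school and similar schools (64% versus 62%) would be well inside the uncertainty of the estimate, which is why comparisons based on small classes should be read cautiously.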

The ratings and information provided by students and course participants should be considered because students are in a unique position to report on teaching and learning. Their ratings and comments provide a valuable basis for formative evaluation of instruction. There is substantial evidence in support of the validity and reliability of student ratings of teachers; for instance, they tend to be consistent across different cohorts and also stable across time. Some additional references on the value of ratings are provided for you in the notes.6 There are a number of reasons, however, why ratings cannot be used for summative comparisons between teachers or as the sole criterion for the continued employment of instructors:
– Questionnaires for the evaluation of instruction vary widely in format and content;
– The context for different instructors is not standardised;
– The conditions of data collection are not controlled (e.g., independent data collection);
– Ratings are influenced by situational factors (e.g., subject and course factors, teacher popularity, previous experience);
– There is no agreed-upon benchmark or criterion for student ratings;
– Inferences based on small numbers of participants may be biased; and
– Numerical comparisons of differences, such as those based solely on averages, are misleading, especially when the spread of responses is not considered.

Our experience is that student ratings of instructors tend to be overwhelmingly positive and that negative comments are isolated.7 The issue of student ratings of teaching and learning is a major topic and we may not have done full justice to it, but we hope that at least we have provided you with an introduction and prompted you to consider using some formative assessment as a means of evaluating your instruction. The next section looks at how to construct or develop an attitude scale. This section can be skipped if you are not involved in trying to assess the extent of an attitude.

ATTITUDE SCALES

Whenever you wish to use a questionnaire to classify students along a dimension of attitude, or whenever you want to give students a numerical index of attitude, then you are using an attitude scale. In this section, we focus only on the most popular type of attitude scale, the Likert scale, developed by Rensis Likert.8 This was developed as a shortcut means of assessing attitudes. Likert found that simple numbering of responses, such as Strongly Disagree to Strongly Agree, can achieve the same sort of results as more complex expert judgment and ratings. Since that time, rating scales with simple scoring have become extremely popular.

Answers to each question are categorised and then given a value, for example: Strongly agree = 5; Agree = 4; Undecided = 3; Disagree = 2; Strongly disagree = 1. The numerical values for each question are then added to give an overall attitude score that summarises an individual's responses or a group's responses. Sometimes the scoring of questions is reversed, when a negative answer to a statement indicates a positive attitude. Examples of questions rated from 3 (Agree) to 1 (Disagree) for customer service are listed below. An example of reversed scoring is the rating of the answer to the second question.

Directions: Read the sentences below. Put a circle around the letters next to each sentence to show how much you agree or disagree.

                                             Agree   Neutral   Disagree
I agree that the customer is always right      A        N         D
I tell clients when they are wrong             A        N         D
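The scoring just described is easy to automate. The Python sketch below is a minimal illustration of Likert-style scoring with a reverse-keyed item, using the two customer-service statements above; the response values and the decision to reverse the second item follow the text, while the example responses are invented.

    # A minimal sketch of Likert-style scoring with reverse-keyed items.
    # The example responses are invented for illustration.

    values = {"SD": 1, "D": 2, "U": 3, "A": 4, "SA": 5}

    # (statement, reverse_scored)
    items = [
        ("I agree that the customer is always right", False),
        ("I tell clients when they are wrong", True),   # negative wording, so reversed
    ]

    def score(responses):
        """Sum the item values for one respondent, reversing negatively worded items."""
        total = 0
        for (statement, reverse), answer in zip(items, responses):
            v = values[answer]
            total += (6 - v) if reverse else v   # 5 becomes 1, 4 becomes 2, and so on
        return total

    print(score(["A", "D"]))   # 4 + (6 - 2) = 8

Whether such totals can meaningfully be added at all is a question taken up below.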



Attitude scales are meant to consist of carefully edited statements that are selected in accordance with some strict statistical criteria. They can be used to assess some aspects of the direction of an individual's attitude or those of a group. As long as there are not too many errors, you can obtain a general group estimate of overall attitude to assist you in your work. The use of such questionnaires also makes a number of assumptions, and we have started to change our mind about their usefulness. Firstly, it assumes that attitudes can be measured; secondly, that there is an underlying attitudinal dimension that is indicated by the range of scores; and thirdly, it assumes that the ratings for each question can be meaningfully added together – this is doubtful. Here is a complementary view that also expresses similar doubts:

…any data from a question open to multiple interpretations is itself uninterpretable. The second is that all values or points on a rating scale should describe the same dimension, say, 'goodness' or 'importance' but not a mixture of the two9

The valid assessment of attitudes is elusive. Responses to rating scales can be affected by many different factors. These include: the number of scale categories used; the wording of questions; and the connotations of category labels.10 The construction (i.e., development) of attitude scales is not as straightforward as writing a list of questions, then merely asking people to rate their feelings and finally adding scores. In Table 74, we have summarised some steps in preparing an attitude scale. We have stopped short of the quantitative summary because we no longer believe that the responses from a questionnaire should always be added or that the ratings represent real units. The Rasch scaling that we discuss in this book may provide a potential solution to this problem.

(Visit the WebResources where you can find more information about the assessment of attitudes and behaviour.)

The reliability and validity of any scale that is constructed should also be determined, and there are well established methods for determining these characteristics. In terms of validity, we assess whether results are related to other attitude scales (i.e., concurrent validity) or whether they are able to predict behaviours (i.e., predictive validity). We can enhance the content validity by using the familiar table of specifications for writing questions. Reliability, however, is easier to quantify and determine. In terms of reliability, you need to consider the stability of the answers over time (i.e., test-retest reliability) and the consistency of the results you obtain (i.e., internal consistency). Some of these issues are dealt with in the chapters on reliability and validity.

While the use of opinion questionnaires and attitude questionnaires is warranted in evaluating classroom instruction and for formative assessments, the use of attitude scales is not recommended for individual assessments. An attitude scale requires considerable expertise and time to develop.
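For readers who do want a quick check of internal consistency before relying on a total score, the sketch below computes coefficient alpha, one common internal consistency index, for a short set of items; the response data are invented and the calculation is our illustration rather than a procedure recommended in this chapter.

    # A minimal sketch of an internal consistency check (coefficient alpha) for a
    # short attitude scale: alpha = k/(k-1) * (1 - sum of item variances / variance
    # of total scores). The response data are invented for illustration.
    from statistics import pvariance

    # Rows are respondents, columns are items scored 1-5.
    scores = [
        [4, 5, 4, 4],
        [2, 2, 3, 2],
        [5, 4, 5, 5],
        [3, 3, 2, 3],
        [4, 4, 4, 5],
    ]

    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(f"Coefficient alpha = {alpha:.2f}")   # about 0.94 for these invented data

Note that computing such an index already assumes that the item ratings can be added, which is precisely the assumption questioned above, so we would treat it only as a rough screening check.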



Table 74. Steps in Constructing an Attitude Questionnaire

1. Write or select a large number of statements that are clearly either favourable or unfavourable. Think clearly about what attitudinal information is to be gathered by the questions.
   – Think whether people understand your question;
   – Check the way the question is worded;
   – Do not use 'and' as it can give double-barrelled questions;
   – Avoid adjectives, double negatives, slang and words like 'always' or 'never';
   – Ask yourself whether people will want to answer the question, especially if it is something private;
   – Think about the order of questions (e.g., beginning with the least sensitive questions).
2. Have some independent judges react to the statements. For example, ask them to indicate whether they agree, disagree or are undecided, or to rate their agreement from strongly disagree through to strongly agree. The original Likert scale had five alternatives.
3. If you are focusing on a dimension of attitudes then retain only the statements that are classified as positive or negative. Neutral statements are not helpful.
4. Try to have equal numbers of positively and negatively worded questions. Use at least 20% more questions than you will finally need.
5. Prepare the questionnaire. Include directions.
   – Identify the author or the organisation;
   – Explain the purpose of the questionnaire;
   – Tell people how to complete the questionnaire; and
   – Indicate the conditions of confidentiality.
   The directions should indicate that people should show how they feel about each statement by marking SA if they strongly agree, A if they agree, U if they are undecided or not sure, D if they disagree and SD if they strongly disagree. Thank people for their involvement.
6. Set out the questionnaire in a professional manner with a clear heading and useful directions. Indicate what needs to be done. Check that the form can be completed easily.
7. Administer the questionnaire to a group and then tabulate the responses using a statistical package or even Excel. Base any conclusions you have on the number of people responding to a category on each question. Also use cross-tabulations between questions to see how people responded to different questions.
8. Do not assume that just because response options can be numbered they can meaningfully be added.
9. If you believe that the questions represent a dimension or scale of attitudes then try to order them from the lowest level on that dimension to the highest level. See if the pattern of agreement or disagreement is consistent with your ordering of the questions.
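Step 7 of Table 74 suggests tabulating responses and cross-tabulating pairs of questions. The sketch below is our own minimal illustration of a cross-tabulation using only the Python standard library; the responses are invented, and a spreadsheet pivot table or a statistical package would do the same job.

    # A minimal sketch of the cross-tabulation suggested in step 7 of Table 74:
    # counting how responses to one question pair up with responses to another.
    # The responses are invented for illustration.
    from collections import Counter

    q1 = ["A", "SA", "U", "D", "A", "SA", "A", "D"]   # responses to question 1
    q2 = ["A", "A", "U", "SD", "U", "A", "A", "D"]    # responses to question 2 (same respondents)

    crosstab = Counter(zip(q1, q2))
    for (r1, r2), count in sorted(crosstab.items()):
        print(f"Q1={r1:2s}  Q2={r2:2s}  n={count}")

The point is simply to look at patterns of responses across questions rather than at single-question totals.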

Moreover, the concept of attitude is difficult to define and accordingly its use for judging an individual may be challenged. Where the assessment of attitudes is important in a subject (for example, in vocational education), observational indicators of performance are easier to justify. The greatest advantage of attitude scales is their use as group indicators in formative assessments. One exception that may be permissible is a single question with a dimension of ratings. An example would be the question used in some teacher evaluations, 'Overall, how would you rate the quality of this teacher?', with a rating scale ranging, say, from unsatisfactory to highly satisfactory. We would be prepared to accept this as a rating scale, but we would not agree that the categories could be numbered, that they are equidistant and/or that they can be averaged without making many assumptions.

OBSERVATIONAL FORMS OF ASSESSING ATTITUDES

Direct observation of students at work or on placement can provide an important assessment of attitudinal achievement that is not available in any other way. Certain attitudes and work habits can best be assessed by observing the student at work in typical situations. Observations also overcome a limitation of questionnaires, which involve interpreting people's perceptions of situations and circumstances. The strength of observational methods lies in the natural and firsthand quality of the data provided. A key aspect of observation is the potential to describe the context within which activities occur. Observations can be recorded in checklists, such as the following (see Table 75).

Table 75. Checklist for observation of student behaviour

Behaviour                                                                  Achieved    Comments
Took down dictation in shorthand and transcribed it for word processing
Proof-read work for typographical and grammatical errors
Opened and distributed incoming mail
Maintained office file records
Answered telephone calls
Received clients
Photocopied documents
Filed client records

There are a number of choices available to you in the way you conduct observations. Any covert observation raises questions of ethics and these issues should be clarified. We are aware that people will behave differently when they know that they are being observed, but in free and democratic education systems we do not think that there is any scope for covert assessments. Our recommendation is for the observations to involve you as onlooker, to be overt and public, with a full explanation to students of the way in which the assessment is to be conducted, including multiple observations over as long a duration as possible and a focus on the specific learning outcomes you want your students to achieve. To summarise, the two basic guidelines for observational assessment of attitudes are:
– that both the student and the observer should know the characteristics being observed; and
– that standards (e.g., observational checklists) are used as an aid to consistent evaluation.


INFORMATION FROM SUPERVISORS

Another source of information about students on which you might want to rely for attitude assessments is supervisor reports from student placements or other forms of on-the-job training. These reports, letters or comments are best incorporated into a student logbook or portfolio.

ASSESSING INTERESTS IN FORMATIVE ASSESSMENTS

We would like to emphasise that there is a major role for the assessment of interests as a guide for teaching and learning. Formative assessments of interests can be used to direct teaching and learning to those aspects of a subject that students find relevant and rewarding. Knowledge of interests, for example, can influence the nature and extent of learning in a subject. There are many approaches that can be used here and one of the easiest is merely to ask students to rank their preferences (see Figure 61). Ranking is preferable to rating scales as a means of establishing preferences and the personal order of importance. The number of items to be ranked, however, should be limited, otherwise the task becomes onerous. Sometimes we only ask people to list their three most important preferences as a shortcut means of assessment.

YOUR INTEREST IN TOPICS        RANK (1 to 6)
Word processing
Spreadsheets
Databases
Communications
Graphics
Programming

Figure 61. A simple ranking of topic interests.

A second basis is to use paired comparisons, where students are instructed to mark the object that they most prefer in each pair. A tally of the preferences for each item will indicate a student's relative preference (see Figure 62; a small tallying sketch follows the figure).

Security and loss prevention    or    Merchandising and display
Stock movement and control      or    Selling skills
Security and loss prevention    or    Stock movement and control
Merchandising and display       or    Selling skills
Stock movement and control      or    Merchandising and display
Security and loss prevention    or    Selling skills

Figure 62. A paired-comparison analysis of topic interests.
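Here is a small sketch (ours, in Python) of how such a tally might be recorded and counted. The topic names follow Figure 62; the individual choices shown are invented for illustration.

    from collections import Counter

    # One student's choices from the six paired comparisons in Figure 62 (invented data).
    choices = [
        "Merchandising and display",
        "Selling skills",
        "Security and loss prevention",
        "Selling skills",
        "Merchandising and display",
        "Selling skills",
    ]

    # Count how often each topic was preferred; a higher tally means a stronger preference.
    tally = Counter(choices)
    for topic, wins in tally.most_common():
        print(topic, wins)
    # Selling skills 3
    # Merchandising and display 2
    # Security and loss prevention 1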

Attitudes can also be used formatively by assessing the same attitudes at different points in a semester. For instance, behaviour changes can be observed and recorded in order to chart a student's progress in a subject.


SUMMARY

In all forms of education there is considerable scope for the formative assessment of attitudes and lesser scope for summative assessment (depending upon the syllabus). This aspect of assessment is often overlooked but it is important because of social interactions in the classroom and workplace. Furthermore, human values and interests are key factors that moderate educational achievement. Attitudes and values are taught directly and indirectly in classrooms and are contained in some of our learning outcomes. Some teachers shy away from the formative assessment of attitudes and focus mainly on the areas of skills and knowledge. This is understandable but it might also reduce the effectiveness of their instruction. In any event, where attitudes are designated as specific learning outcomes then some form of assessment (especially through structured observation) needs to be developed and then reviewed at regular intervals.

Even when attitudes form part of the stated learning outcomes, teachers may decide not to formally grade students separately on these criteria. They might be concerned about their ability to defend judgments of attitudes. While our community accepts that students can be failed for not having the requisite knowledge or skills, it is not clear that there is a consensus about failing people on the basis of 'attitudes' alone. The assessment of attitudes, therefore, might best be integrated into the performance-based assessments which demonstrate skills and knowledge, and only used where there is clear agreement amongst teaching staff on the attitudinal criteria which form part of the performance-based assessment.

Observational checklists have been recommended as a basis for attitude assessments. Questioning of opinions and interests has been suggested as a basis for formative assessment of attitudes. The use of classroom-constructed Likert scales for assessing the extent of an attitude has not been recommended.

-oOo-

REVIEW QUESTIONS

T F  Attitudes form part of the cognitive domain
T F  An attitude is a real phenomenon
T F  Attitudes were originally conceived as the degree of positive or negative feeling
T F  The attitudinal domain includes emotions
T F  Exams can be used to assess attitudinal learning outcomes
T F  An opinion questionnaire indicates the degree of attitude
T F  An attitude scale groups the questions in some way and adds together the responses to give a total score
T F  Each teacher should use a standard questionnaire to evaluate his/her instruction
T F  The steps in the construction of an attitude scale are: writing questions, asking people to rate their feelings, and adding the scores
T F  The validity of attitudes is seen in the consistency of the answers
T F  The reliability of attitudes is seen in their ability to predict behaviour
T F  The use of an attitude scale to judge an individual may be challenged
T F  An opinion checklist is a Likert scale
T F  In writing opinion statements, you should avoid statements that are endorsed by everybody
T F  Direct observation of students can provide an assessment of attitudes
T F  Observation of students should be covert
T F  The strength of observational assessments of attitudes is that they provide perceptions of situations and circumstances
T F  A major role for the assessment of attitudes is to direct teaching and learning

EXERCISES

1. Why do you think student attitudes should be assessed?
2. List three advantages and disadvantages of using observational techniques for assessing student attitudes.
3. Describe three advantages of questionnaires for assessing attitudes.
4. Choose a syllabus in your area which contains a number of attitudinal learning outcomes. Develop an assessment for these learning outcomes.
5. Design your own Likert-type scale for assessing attitudes in an area of teaching in which you are working or plan to work.


CHAPTER 15

GRADING PERFORMANCE AND RESULTS

The way teachers grade students and their work is always a contentious topic. It is directly relevant to all instructors and teachers, especially those who are required to grade within a formal system. Many teachers indicate that they find the assigning of grades to be one of their most difficult tasks. We agree. We have never liked the fact that in education we are frequently called upon to be the instructor as well as the assessor (i.e., the person who might set summative assessments, mark them and then decide who is going to pass or fail). A radical approach for you to consider is that of John Holt in What Do I Do Monday?: '... if you must grade, grade as seldom as possible, as privately as possible, and as easily as possible'.1

THE ROLE OF GRADING

Grading is the most formal part of the process of student assessment (see Figure 63 for an outline of this process). It is associated with the judgments or decisions that you make on the basis of the assessment results. You might grade to indicate whether students have the required knowledge, or you might grade to evaluate your own instruction. You grade whenever you compare one student with others. You grade to indicate whether someone will graduate or be certified as competent. You use grades in your reports to students. Educational administrators, the community, commerce, industry and students demand grades.

The type of grading with which we are concerned in this chapter is the summative grading of results in formal educational systems. (Visit WebResources where you can find more information about grading of performance and results.) The main reason why we undertake such grading is for bureaucratic purposes. A single grade is preferred by examination systems, so that decisions can be made easily about the future progress of a student. A frustrating aspect of such grading is that the student's entire learning is reported in a very brief form, usually as a number or a letter grade (e.g., A, B, C, F). It has always astounded us that we seek to summarise a whole semester's learning in a single letter of the alphabet!

The three main options for grading and reporting are: letter grades, pass-fail, and marks or test scores. This chapter aims to provide some general guidelines for teachers and instructors who have to evaluate student performance. Some aspects of scoring tests are outlined first, and then we consider issues associated with grading. We shall begin with test scores as they often underlie other forms of grading.


… have raised these issues because some people have a naive belief in the value of all test scores.

Natural units of performance are the most satisfactory form of scores. The most natural units describe the frequency, duration, latency or correctness of a response. Physical units such as speed of reading, words typed per minute, production errors or time taken for a correct fault diagnosis possess a clearer meaning than numerical scores. Where possible this is the preferred method of reporting because it is specific and descriptive. One difficulty with this approach is that our administrative and examination systems are not geared for processing such descriptive statements.

Having decided upon the units of performance (i.e., scores or natural units), the next step is to transfer the results or scores from testing into grades. In most cases, the original scores from a test are changed into other scores (e.g., percentages) and then grouped into grades or categories of performance. We will not have anything further to say about natural units of performance but will start with scores that are converted into percentage correct, as these are used widely in education and training settings.

Scores which indicate the percentage correct

Many teachers convert scores or marks to the percentage correct in order to compare performances across tests or even to add results from different tests (e.g., adding the results of theory and practical tests). Percentages are popular because they are easily comprehended by both students and the community. They base all scores on a common scale up to 100%, and the level of passing performance in most formal educational systems has been 'defined' culturally in Australia as around 50%. As a result, many laypersons often assume that percentage scores for various subjects are useful indicators of performance and can be compared. Remember that most educational scores are not measurements in the sense that centimetres are measures of length. However, scores do provide us with a type of ordering of ability and are useful where there is some underlying dimension of ability or competence that is being assessed. You will have already realised that any worthwhile comparison depends on the content as well as the difficulty level of the various tests. If you are going to use 50% as the pass mark then you need to ensure that this is the lowest level of performance you are prepared to accept as indicating a pass, and you will need to examine carefully the difficulty of each question and the marks attached to it.

Scores which indicate the class ranking

Another process is to take raw scores and convert them into ranks, normally class ranks. This is a norm-referenced approach because it compares an individual to a group. Many people in the community see class rankings as an indicator of ability, but there is little justification for the use of rankings in education settings, except for selection purposes.


At the present time, there is no readily available way to transform ranks to a common range in applied settings such as classrooms.2 The difficulty arises when ranks are based on different sized groups. For instance, in examining a student's position in class on different examinations, the student may be ranked out of 15 students on one occasion, out of 30 students on a second examination and out of 11 on a third. There is no easy method of comparing third position out of 15 with, say, third position out of 30 or 11.

A transformation of ranks3 can be used for small groups based on the following formula: New rank = (Original class rank – 1) / (Number of students in the class – 1). This gives you a value from zero (the highest rank) to one (the lowest rank). The calculations for groups of up to 15 students have been done for you in Table 76 (a short worked sketch of the formula follows the table). From Table 76, a rank of second out of 15 would be converted to a ranking of 0.08; a rank of sixth out of eight would be converted to 0.71. The transformation is only an approximation, but it makes it possible to consider and relate rankings and positions based on uneven samples in terms of a common metric with fixed endpoints.4

Both percentage scores and ranks make many technical assumptions. The scores are rarely true measurements; the original scores might best be regarded as a type of counting of the number of questions correct. Therefore arithmetical operations, such as percentages, that are applied to the scores may not be valid measures of ability for everyone in a group. Rankings have some merit but they are often based on scores. At best, ranks give you only a general sense of the direction or extent of performance, but it is hard to convince people that there are limitations in these numbers. (The percentile rank is commonly used as an indicator of educational achievement for university admissions indices and is described in Appendix C.)

Table 76. Transformation of ranks for different class sizes
(each row lists the converted rankings for original ranks 1, 2, 3, … in a class of that size)

Class size 2:  0, 1
Class size 3:  0, .50, 1
Class size 4:  0, .34, .66, 1
Class size 5:  0, .25, .50, .75, 1
Class size 6:  0, .20, .40, .60, .80, 1
Class size 7:  0, .17, .34, .50, .66, .83, 1
Class size 8:  0, .15, .29, .43, .57, .71, .85, 1
Class size 9:  0, .13, .25, .38, .50, .62, .75, .87, 1
Class size 10: 0, .12, .23, .34, .45, .55, .66, .77, .88, 1
Class size 11: 0, .10, .20, .30, .40, .50, .60, .70, .80, .90, 1
Class size 12: 0, .10, .19, .28, .37, .46, .54, .63, .72, .81, .90, 1
Class size 13: 0, .09, .17, .25, .34, .42, .50, .58, .66, .75, .83, .91, 1
Class size 14: 0, .08, .16, .24, .31, .39, .47, .53, .61, .69, .76, .84, .92, 1
Class size 15: 0, .08, .15, .22, .29, .36, .43, .50, .57, .64, .71, .78, .85, .92, 1
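For readers who prefer to compute rather than look up Table 76, here is a minimal sketch (ours, in Python; the function name is our own) of the transformation above. The tabled values are rounded, so the raw formula can differ from the tabled figure in the second decimal place.

    def transform_rank(rank, class_size):
        """Transformation of ranks: 0 is the highest rank, 1 the lowest."""
        if class_size < 2 or not 1 <= rank <= class_size:
            raise ValueError("rank must lie between 1 and class_size, and class_size must be at least 2")
        return (rank - 1) / (class_size - 1)

    print(transform_rank(6, 8))   # 0.714... (Table 76 lists .71)
    print(transform_rank(2, 15))  # 0.071... (Table 76 lists .08)
    print(transform_rank(1, 12))  # 0.0 - first in the class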


How to scale test scores using graphical methods

One way of making sure that the range of scores on each assessment is similar is to scale the scores on each test so that they are equated. Sophisticated scaling methods are used in high-stakes examinations, such as the various Year 12 final examinations and university admissions indices, but discussion of these is beyond the scope of this book. These scaling methods attempt to equate subjects; it is not a straightforward task and they make many technical assumptions.

There are some simple approaches to scaling that can be applied for classroom use. A method that has been used5 for scaling marks from zero to 100 is to give the lowest mark zero and the highest mark 100. The disadvantage of this approach is that the lowest score on a test does not really justify zero out of 100. Furthermore, it is difficult for a student to understand how his/her score on a test was scaled down to zero.

A more reasonable approach is to take the original test scores and set what you consider to be the minimum pass mark. This is then set equal to 50 out of 100. You can then take the highest test score and consider what would be its equivalent out of 100. All the original marks are then converted, using a line graph, to marks out of 100. An example of this approach is shown in Figure 64, which shows the conversion of a test with 85 questions to a mark out of 100. The minimum passing level on the original test was 35 and this was made equal to 50; the highest score of 85 on the original test was scaled up to 100. The steps in this procedure for scaling test scores into marks out of 100 are summarised in Table 77 (an arithmetic sketch follows the table). In this example a score of 30 (original marks) would be scaled up to about 40 out of 100 and a score of 40 would be scaled up to around 60, and so on.

[Figure 64 chart: original marks (10 to 100) on the vertical axis, with the lowest passing score and the highest score marked by dotted lines; final grades out of 100 on the horizontal axis, with the bands F, P (at 50), Cr, D and HD.]

Figure 64. A graphical conversion of test scores to marks out of 100.



Table 77. Procedure for scaling test scores to a mark out of 100
1. Place marks out of 100 along the lower axis (i.e., final grades).
2. On the other axis, you place the marks for the original scores on your test(s).
3. Draw a dotted upright line for the pass mark out of 100. This will usually be around 50.
4. Draw a dotted upright line to show what you expect would be the highest mark out of 100. This could be 100 or you could make it less.
5. Show the highest score made on the test and the lowest passing score and draw these two dotted lines across.
6. Now draw a solid line which passes through the intersections of highest and lowest points.
7. Use this last line to read off each test score and to translate this into a mark out of 100 or a grade.
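For those who would rather compute the conversion than read it off a chart, the sketch below (ours, not from the book) reproduces the arithmetic behind the solid line: the lowest passing raw score is anchored to 50 and the highest raw score to 100, with raw scores below the pass level mapped proportionally onto 0 to 50. Values read off a hand-drawn chart, such as the 30 to about 40 and 40 to around 60 readings quoted above, will only approximately agree with the exact arithmetic.

    def scale_to_100(raw, lowest_pass, highest, pass_mark=50, top_mark=100):
        """Arithmetic version of the graphical scaling in Figure 64 / Table 77."""
        if raw >= lowest_pass:
            # Solid line segment: lowest passing score -> pass_mark, highest score -> top_mark
            fraction = (raw - lowest_pass) / (highest - lowest_pass)
            return pass_mark + fraction * (top_mark - pass_mark)
        # Below the pass level: map 0..lowest_pass proportionally onto 0..pass_mark
        return raw / lowest_pass * pass_mark

    # Worked example from the text: minimum pass of 35 becomes 50, top score of 85 becomes 100
    for raw in (35, 60, 85):
        print(raw, "->", scale_to_100(raw, lowest_pass=35, highest=85))
    # 35 -> 50.0, 60 -> 75.0, 85 -> 100.0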

While these standardisation and scaling procedures are helpful, they also involve considerable effort as well as explanation to others. The easiest way to avoid the need to standardise scores is to produce tests which have the same notional range (e.g., marks out of 100), are of moderate difficulty (around 0.5) and have a similar distribution of scores across subjects.

ESTABLISHING CUT-OFF POINTS USING THE ANGOFF METHOD

The previous section raised the question of setting a minimum passing score. The determination of cut-off points for effective performance is of fundamental importance to educational testing.6 We shall deal here with the Angoff method.7 This requires a panel of experienced teachers or instructors to consider each task or item and to judge the probability that a minimally competent person would have of answering it correctly. This probability is the teacher's estimate of the proportion of students or learners that would normally be expected to pass such a task. The process is subjective, but it is based on one's experience of teaching and learning. These probabilities or proportions are then summed for each judge and averaged across the teachers; the range of the judges' totals defines a region of minimal competence. An example of the calculation for a test in Non-Residential Construction is provided in Table 78 (a short computational sketch follows the table). On Task 1 in Table 78, the teachers expected that on average 0.8 of students would pass; on Task 2, they expected, on the basis of their experience and judgment, that around 0.4 of students would pass.

There is unpublished research into setting numeracy standards for riggers8 which indicated that teachers were able to provide excellent estimates of the difficulty levels of questions, but there was wide divergence when it came to establishing the prerequisite cut-off point. It would appear that a common basis for setting standards is required, and that for quantitative assessments an approach similar to that developed by Angoff may find application in many areas of education.



Table 78. Application of Angoff method

Task       Teachers' average estimates of difficulty
1          0.8
2          0.4
3          0.1
4          0.6
5          0.8
6          1.0
7          1.0
8          0.5
9          0.5
10         0.7
Cut-off    7 out of 10
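A small sketch (ours) of the Angoff arithmetic follows. The per-judge figures are invented for illustration; only their task averages match Table 78. Note that the averaged estimates sum to 6.4, while Table 78 reports a cut-off of 7 out of 10; how such a sum is rounded to a whole mark is a local decision.

    # Each inner list: one judge's estimates of the probability that a minimally
    # competent student answers each of the ten tasks correctly (invented figures;
    # only the task averages match Table 78).
    judge_estimates = [
        [0.8, 0.5, 0.1, 0.6, 0.8, 1.0, 1.0, 0.5, 0.4, 0.7],
        [0.8, 0.3, 0.1, 0.6, 0.8, 1.0, 1.0, 0.5, 0.6, 0.7],
    ]

    n_tasks = len(judge_estimates[0])
    task_averages = [
        sum(judge[t] for judge in judge_estimates) / len(judge_estimates)
        for t in range(n_tasks)
    ]
    cut_off = sum(task_averages)  # expected test score of a minimally competent student

    print([round(a, 1) for a in task_averages])  # [0.8, 0.4, 0.1, 0.6, 0.8, 1.0, 1.0, 0.5, 0.5, 0.7]
    print(round(cut_off, 1))                     # 6.4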

MARK CONVERSION

Sometimes teachers wish to adjust their marks to those of other teachers. One way to do this is to standardise the scores for each teacher. An easier way is to draw a simple graph based on the average scores. This procedure is illustrated in Figure 65, and Table 79 gives an example using the average marks of four teachers.

Table 79. Average grades of four teachers

Teacher    Average mark
A          50
B          65
C          61
D          60

If teacher A wishes to adjust the marks given to those of the other teachers (i.e., B, C, D) then, assuming that there is no difference in the achievement level between the groups, he/she should:
Step 1. Calculate the average of teachers B, C and D (65 + 61 + 60 = 186; average = 62).
Step 2. Draw a graph with two axes from zero to 100.
Step 3. Plot the point where the average score for teacher A (50) meets the average score of the other teachers (62).
Step 4. Draw a line from the zero point to the point where the two average scores intersect.
Step 5. Read off scores from teacher A against their equivalents for the other teachers.
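The same adjustment can be done arithmetically rather than graphically. The sketch below (ours) assumes, as the text does, that there is no real difference in achievement between the groups, so teacher A's marks are simply stretched so that A's average lines up with the combined average of teachers B, C and D.

    average_a = 50                       # teacher A's average mark (Table 79)
    average_others = (65 + 61 + 60) / 3  # average of teachers B, C and D = 62.0

    def adjust_mark(mark_from_a):
        """Read a mark off the conversion line through (0, 0) and (50, 62)."""
        return mark_from_a * average_others / average_a

    print(adjust_mark(50))  # 62.0 - A's average lines up with the others' average
    print(adjust_mark(40))  # 49.6
    print(adjust_mark(75))  # 93.0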


[Figure 65 chart: Teacher A's marks (0 to 100) on the vertical axis, marks for Teachers B, C and D (marked at 20, 40, 60, 80, 100) on the horizontal axis, with the conversion line drawn from the origin through the point where Teacher A's average (50) meets the others' average (62).]

Figure 65. A graphical method for converting test scores.

Converting marks to the same range

Sometimes you may wish to convert marks on one test to the same range of marks as another assessment. You can use a similar graphical process to make sure that marks are on the same range. For instance, if you want to combine scores from a theory paper (out of 100) and a practical test (out of 50) and you want to give them equal weight, then you can convert the marks to the same range. Do not simply multiply the practical paper by two to make the scores out of 100; we shall explain why below.

Suppose the range of scores on the theory paper is from 80 to 100 and the range of scores on the practical test is from 10 to 50. Let us convert the practical scores onto the theory scale. The worst thing you can do is to multiply the practical score by two. You need to make the range of scores the same; remember, the range on the theory paper is 20 and the range on the practical is 40. The easiest way to do this is to use a nomograph or chart like the one shown in Figure 66. To replicate that work, follow these steps:
Step 1. Draw two parallel lines. Make them the same length.
Step 2. Mark off the scores on the theory exam (from 80 to 100).
Step 3. Mark off the scores on the practical test (from 10 to 50). Mark these in the opposite direction to the exam scores.
Step 4. Join the ends of the two lines. Note where they intersect.
Step 5. Now draw a line from any practical score through the intersection and read off the corresponding theory score.

So a score of 40 on the practical is now equal to 95 on the theory paper. You can now see that it would have been unfair to just double the practical marks and then add them to the theory paper: a score of 40 is really equal to 95, whereas doubling it would have meant the student received only 80. Now here is a further hint – always scale scores upwards before averaging them. Never reduce a score of, say, 95 on theory (as in the above example) to an equivalent of 40 out of 50, because it will cause howls of protest and you do not need any more problems as a teacher.
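The nomograph is simply a graphical way of matching the two ranges. A short sketch (ours) of the equivalent arithmetic:

    theory_low, theory_high = 80, 100       # observed range on the theory paper
    practical_low, practical_high = 10, 50  # observed range on the practical test

    def practical_to_theory_scale(practical):
        """Map a practical score onto the theory range so the two ranges line up."""
        fraction = (practical - practical_low) / (practical_high - practical_low)
        return theory_low + fraction * (theory_high - theory_low)

    print(practical_to_theory_scale(40))  # 95.0 - the worked example in the text
    print(practical_to_theory_scale(10))  # 80.0 - lowest practical maps onto lowest theory
    print(practical_to_theory_scale(50))  # 100.0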

[Figure 66 chart: two parallel scales – the theory paper marked 80, 90, 95, 100 and the practical test marked 50, 40, 30, 10 in the opposite direction – joined so that corresponding scores (e.g., 40 on the practical and 95 on the theory paper) can be read off through the intersection.]

Figure 66. A nomograph for converting test scores to a common scale.

TYPES OF GRADING SYSTEMS

Three forms of grading systems are in common use: marks or test scores, letter grades, and pass-fail. Test scores have been discussed in some detail already. Letter grades assign a single letter (such as A, B, C, etc.) or number to a subject's results. The advantages are that this is concise and convenient and provides an easy basis for categorising marks on a test. A shortcoming of the approach is that a great deal of information is buried within the letter grade. A problem for teachers is how to describe achievement, behaviour and effort in a single grade. The overall conclusion is that grades are most useful when they represent achievement only.

A two-category grade that has been used is 'pass-fail'. This is preferred by some teachers but it offers significantly less information about achievement than letter grades. (In the WebReferences we show how each grading system tends to reduce the information available about the group's performance.) The pass-fail description is useful for competency-based assessments, for mastery learning and for learning outcomes which are difficult to grade. Some teachers have an ideological preference for the pass-fail description while others prefer grades as a means of indicating excellence; some students also indicate that they like the decreased emphasis on competitiveness in the pass-fail system, while others think it is unfair that there is no distinction between students who just pass and those whose performance is excellent. What is our verdict? We really do not know. It depends on a range of circumstances and contexts and on what is fairest and most feasible. (Visit WebResources where you can find more information about grading performance and assessment results.)

How to translate scores into grades

The most direct way to translate test scores into grades is to use a percentage score. You can then use a table, such as the one shown in Table 80, to transfer marks or percentages into grades. There are different arrangements that you could use, including pass-fail or A, B, C and Fail, etc. (A short sketch of this lookup follows Table 80.)

Table 80. A table of grades

Grade              Percentage
High distinction   85-100
Distinction        75-84
Credit             65-74
Pass               50-64
Fail               less than 50
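Where percentages are available, reading a grade off Table 80 is a simple lookup, as in the sketch below (ours, assuming whole-number percentages and the band boundaries shown in the table). In practice, local rules such as failing an essential component override a straight lookup, as discussed next.

    def grade_from_percentage(percentage):
        """Return the Table 80 grade band for a percentage between 0 and 100."""
        if percentage >= 85:
            return "High distinction"
        if percentage >= 75:
            return "Distinction"
        if percentage >= 65:
            return "Credit"
        if percentage >= 50:
            return "Pass"
        return "Fail"

    print(grade_from_percentage(68))  # Credit
    print(grade_from_percentage(49))  # Fail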

Failures must represent unsatisfactory performance in essential aspects of the subject (e.g., the final examination and assignments). Failures may sometimes be given where students have a mark greater than 50 but have failed an essential component such as a final examination. There may also be arrangements for conceded passes, which are given where students show satisfactory achievement overall except for unsatisfactory, but close to satisfactory, achievement in one aspect of a subject. In any event, all these procedures should be set out formally in an assessment outline given to students.

In cases where scores or percentages cannot be calculated, a verbal description of performance may be used as the basis for grades. Table 81 outlines a table of grades for subjects in which percentage marks are not used.

Table 81. Graded descriptions of achievement

Grade              Description of performance
High distinction   Outstanding quality of achievement on all the learning outcomes. Grade given in recognition of originality
Distinction        Superior quality of achievement on all the learning outcomes
Credit             Good quality of achievement on all the learning outcomes and/or superior quality on some learning outcomes
Pass               Satisfactory achievement on all learning outcomes
Fail               Unsatisfactory achievement in a compulsory learning outcome or in one or more components of the subject


The use of the terms 'outstanding', 'superior', 'good' and 'satisfactory' retains an underlying sense of norm-referencing and still leaves considerable room for subjectivity. As a result, any comparison of grades on different tests or subjects continues to present some problems: it can depend upon the standards used by markers, the average level of difficulty of the test and/or the range of scores. This is why some teachers prefer the use of pass-fail grades. On the other hand, some teachers do not consider pass-fail grades to provide sufficient incentive for superior achievement. We have written a good deal about the technical problems associated with grades, but they do indicate the extent of achievement; and one aspect of grades which is worthy of further consideration is that (other things being equal) they may reflect the quality of our teaching and instruction.

GRADING ON THE NORMAL CURVE

The idea of grading students based on the normal curve is not supported for classroom contexts. This idea was introduced in the 1930s because the normal curve represents the distribution of many human characteristics (e.g., biological characteristics such as height, weight and physical capacity) and was thought to be applicable to educational performances in large groups. This distribution is bell-shaped, with most people scoring in the middle and a few at either extreme. The normal curve also has mathematical properties and characteristics which make it useful for estimation.

It is not defensible to assign grades on the basis of the normal curve. Firstly, it is unlikely that the class sizes that are common in education or training would produce a normal distribution of scores. In addition, the students at the later stages of education or training represent a select sample. Furthermore, the assessments that teachers design for use with their classes may not yield a normal distribution: most teachers design tests that reflect the mastery of instruction and yield distributions of scores in which the majority of students pass. A further problem with measurement based on the normal curve is that it places half the group below average (by definition in a normal distribution). It is not possible to condone any practice of grading according to a normal curve in a classroom, or any practice of assigning only a limited number of As, etc.

SOME PRACTICAL GUIDELINES FOR GRADING STUDENTS

Grades will be most meaningful for you (a) when they are based on the stated assessment criteria; (b) when they reflect the achievement of the learning outcomes; and (c) when they are criterion-referenced (that is, when a grade does not depend on the performance of other students in a subject). We would like to suggest some practical guidelines for scoring and grading in education and training (see Table 82). Feel free to vary these according to your circumstances and your organisation's teaching policies.



Table 82. Some guidelines for grading
1. Describe the procedures for testing and grading to students at the outset (preferably within three weeks of commencement);
2. Base student grades only on the achievement of learning outcomes;
3. Decide how to report separately on effort, attendance etc.;
4. Retain all examination papers for one semester;
5. Return and review all other work as soon as possible (about 7-10 days); …

APPENDIX G

… a large number of scores (say, 200 learners) or many variables (>3), but mostly you will be working with small groups (around 30). What do I mean by a relationship? Basically, we are asking whether people who ranked high on one measure also ranked high on another measure, and whether those who ranked low on one measure tended to rank low on the second. Or, putting it another way, 'do people who do well on one question or task do well on another question or task?' The correlation coefficient provides an index to determine whether there is a relationship. As we mentioned previously, the correlation varies from –1 through 0 to +1. You can visualise a correlation in a scatter plot: this chart shows the scores on the two variables as a series of points.

The words 'positive' and 'negative' are used in conjunction with correlation and they refer to the direction of the relation. Positive means that high scores on one variable are related to high scores on the other (and low scores on one are related to low scores on the other). Negative means that high scores on one variable are related to low scores on the other. An example of a negative correlation is between increasing age and motor skills such as dexterity and coordination: as age increases, manual dexterity generally decreases (there are exceptions, and this is why correlations of +1 or –1 are extremely rare). In most behavioural research you end up with correlations of about 0.3.

To compute a correlation the scores do not have to be in the same units or even on the same scale. Part of the calculation (which you do not have to perform) standardises the scores: it converts them to standard scores or z-scores and then proceeds to calculate the index. Calculating the correlation is easier in Excel than on some pocket calculators or even some statistical packages.
Step 1. Go to Tools then Data Analysis on your menu. Select Correlation from the dialog box then click OK.

Figure 91. Options for Descriptive statistics



Step 2. Then highlight the input range and complete the remainder of the dialog box. For instance, make sure that you click on data arranged in Columns and remember to tick Labels in first row.
Step 3. Then click OK. The results are set out below.

Table 89. Correlation index between variables

              Score on X    Score on Y
Score on X    1
Score on Y    0.887         1

There is a correlation of about 0.89 between performance on X and performance on Y. How do you read this table? It seems confusing at first but it is rather simple. It is a triangular matrix: the bottom triangle is the same as the top triangle, so we only produce the lower (or upper) triangle. There are always 1s in the diagonal, because X correlates perfectly with itself and Y correlates perfectly with itself. The value we are interested in is the correlation between X and Y, which in this case is 0.887, or about 0.89. That is all there is to calculating correlations. Experiment with these functions to become familiar with them and use some examples to check your results. You can also use the correlation function in Excel to provide correlations between items and the total score, much like the point-biserial and biserial correlation.
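If Excel is not available, the same index can be computed in a few lines of code. The sketch below (ours) uses invented scores; with your own data you would simply substitute the two lists.

    import statistics

    def pearson_r(x, y):
        """Pearson correlation between two equal-length lists of scores."""
        mean_x, mean_y = statistics.mean(x), statistics.mean(y)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sxx = sum((a - mean_x) ** 2 for a in x)
        syy = sum((b - mean_y) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5

    score_x = [12, 15, 9, 18, 11, 16, 14, 10]  # invented scores for illustration
    score_y = [10, 14, 8, 17, 12, 15, 13, 9]
    print(round(pearson_r(score_x, score_y), 2))  # about 0.96 for these made-up scores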

Interpreting correlations

The first step is to examine the size of the correlation. In this case a correlation of about 0.89 is positive and represents a very strong relation between X and Y. The second step is to square the correlation: 0.89 squared is about 0.79, which means that around 79% of the variation in X is accounted for by (shared with) Y, and around 21% is not. This is a large proportion; you need a very large correlation before one variable accounts for most of the variation in another. Note that correlation does not imply causation. We are not saying that X causes Y or Y causes X; we are only saying that they are related. There could be an unknown third factor that affects both X and Y. One further consideration is that the correlation is designed for linear relationships. If the relationship is not linear – for example, if scores are similar at the top and bottom of the range but differ in the middle – then the correlation is not an appropriate index. You may also wish to determine whether the result is statistically significant for your sample size, and for this we would refer you to statistical tables as well as basic texts on descriptive statistics.


APPENDIX H

ANSWERS TO REVIEW QUESTIONS

Chapter 1 – Introduction to assessment T T F F T F T T F T T T T

In education, the word ‘assessment’ is used in a special way that is different from its ordinary, everyday meaning In education, the term ‘assessment’ has taken over from terms such as ‘testing’ The everyday use of the term ‘assessment’ refers to a process of collection of information, judgment and comparison Educational assessment has developed over a period of some 2000 years Viva voce refers to an oral exam Written formal testing dates from around the 1500s Psychometrics refers to the field of psychological testing and measurement Assessment results are used to decide about students The main scope for assessment is after the teaching process The Code of Fair Testing Practices in Education outlines obligations that we have to test takers Consent of a test taker is required before providing results to any outside person or organisation Educational test results are a privileged communication to other teachers It is helpful to explain how passing test scores are set

Chapter 2 – The varying role and nature of assessment T T F T F T F T F T F T

Assessment is a comprehensive generic term Assessments may vary in number, frequency and duration At least three assessment events are recommended for each subject, unit or module As a general rule, the greater the number of assessments you conduct then the higher will be the reliability of your results Human behaviour can be grouped into the categories of knowledge, skills or assessments Holistic assessment integrates the assessment of knowledge, skills and attitudes Holistic assessment involves holistic scoring The five major forms of assessment are: observation, simulations, skills tests, questioning and the use of prior evidence Summative assessments seek to improve the learning process It is how the results will be used that makes an assessment summative or formative Classroom questioning is a formative public assessment Teachers should provide students with course outlines and assessment details early in the semester 331


F T F T T F F F F T F

Around 15% of a subject’s teaching time should be given over to assessments At least two week’s notice should be given to students for a class test Teachers can increase the assessment load slightly even though an outline of the assessment has been distributed Teachers have an obligation to deliver content and assessment in accordance with the prescribed requirements A standardised assessment has instructions for administration and scoring A group test can be used to observe individual performance A criterion-referenced test can be distinguished by its format and questions A keyboarding test is likely to be a power test A formal essay examination is likely to be an objective test. Competency-based assessments are criterion-referenced Norm-referenced assessments are designed to give descriptions of performance

Chapter 3 – Fundamental concepts of measurement T F F F F F

Range is an index of variability which is computed by subtracting the minimum from the maximum score The mean absolute deviation is the most popular index of variability The standard deviation is synonymous to absolute deviation A correlation can take values from 0 to +1 where 0 means no correlation If scores on test A are positively correlated to scores on test B (say, with r=+0.8), then your high score on test A causes your score on test B to go up as well. To calculate the standard deviation, you must divide the variance by two.

Chapter 4 – Validity and validation of assessments T T T F T F T F F T T F F F T T T F


Validity refers to the truthfulness and accuracy of results Validity is the degree to which a certain inference from a test is appropriate Results are valid for a given group of individuals Validity is a measurable concept Validity is inferred Reliability is a necessary condition for validity Content validity is determined by comparing the questions to the syllabus topics and learning outcomes Criterion validity includes face validity Construct validity considers the predictive potential of test scores Content validity is important for criterion-referenced tests Criterion validity includes predictive and concurrent validity Comparing the results of two assessments is a form of predictive validity The degree of relationship between two sets of test results is determined by visual inspection An expectancy table is used to calculate correlations between results Construct validity refers to the theoretical evidence for what we are assessing The correlation is a statistical index ranging from –1 through 0 to +1 In an expectancy table predictions are made in terms of chances out of 100 An item analysis would improve construct validity of results


Chapter 5 – The reliability of assessment results F T T T F F F T F T T F T F F T T T T T

If you are certain that your assessment has a high degree of reliability then the results from it must be valid Reliability is the degree to which test results are consistent, dependable or repeatable The major influence on reliability of assessment results is individual differences Methods of estimating involve correlation-type statistics A moderate correlation is 0.4 The test-retest method is widely used by teachers to determine reliability Parallel forms involves giving the same test twice to a group and comparing the results The split-half method involves comparison of the results from two equivalent halves of an assessment The split-half is automatically corrected for test length by the Kuder-Richardson formula It is easier to estimate reliability using an internal consistency formula than using test-retest methods A procedure for assessing the stability of test scores is the parallel forms method of reliability The coefficient alpha is a criterion-referenced estimate of reliability The percentage of consistent decisions on two forms of an assessment is a criterion-referenced estimate of reliability Teacher-made assessments have reliabilities of around 0.5 Two standard errors includes 68% of all mistakes on a test Reliability coefficients vary up to +1 As the number of questions increases, the reliability will generally increase As error increases reliability decreases The standard error is the likely range of results around a given score You can be 95% certain that the true score is within plus or minus two standard errors

Chapter 6 – Analysing tasks and questions T F T T F T T F F T T F

Tasks or questions in an assessment are called items Item analysis refers to the methods for obtaining information about normreferenced performance Item analysis guides you when you wish to shorten an assessment Items in criterion-referenced tests can be analysed using item difficulty Item difficulty is an index which shows the proportion of students failing a question The formula for difficulty is: the number of correct responses divided by the number of persons answering Easy questions have a higher item difficulty value Difficulty values can range from -1 to 1 If the difficulty is zero then the content of the question was covered in class Item difficulty is the same as item facility The score in a Guttman pattern tells you which items were answered correctly Point-biserial correlation is used for norm-referenced item discrimination 333


F T T F

Criterion-referenced item discrimination means that a task separates out high scorers from low scorers A sensitivity index of 0.3 and greater means that it is a useful item If the sensitivity is negative then more people answered an item correctly before rather than after instruction Very easy items have low discrimination for competency

Chapter 7 – Objective measurement using the Rasch model F F F F T F T T

In the simple Rasch model, if the ability of an examinee equals the difficulty of an item then the examinee has more than 50% chance for a correct response. Overfit is more dangerous than misfit for a question because it indicates that the question may not measure the same ability as the other questions. Overfit is the case where the infit mean square of a question is larger than 1.3 The assumption of unidimensionality demands that all questions test material from exactly the same sub-domain. Misfitting persons must be identified because their ability estimate may not be a valid indicator of their true ability. Omitted responses may be scored either as missing or as incorrect without significant effects on the ability estimate of the persons because Rasch is a robust measurement model. Unexpectedly correct responses may be identified by comparing the ability of a person with the difficulty of a question. Questions that should not be part of a test because they test a different ability or trait than the rest of the questions, may be identified because they have a large fit statistic.

Chapter 8 – The Partial Credit Rasch model F T T F F F F F

334

The partial credit Rasch model may be used instead of the simple Rasch model when a number of questions test a partially different ability than the rest of the questions. If one desires to reduce the length of a test, one can preferably remove overfitting questions which have the same difficulty as other questions in the test The above statement is true because similar questions with the same difficulty in the same test tend to have small fit statistics. Questions that have fit statistics larger than 1.3 must be definitely removed from the test because they do not test the same ability as the rest of the questions in the test. The above statement is true because infit mean square larger than 1.3 means that the examinees were able to guess the correct answer to the question. According to the Rasch model two persons with the same ability will definitely get the same marks on the same question. The assumption of Local Independence demands that each examinee works independently from the other examinees while completing the test. Data generated by two different tests which measure the same ability may be analysed by the Rasch model provided all people completed all of the questions of both tests.


F

A person of ability θ = 3 logits will definitely get a score of 2 on the following question because this is the most likely score for his ability according to the figure below.

F

According to the following figure, the most likely score for an examinee with ability -1 logit is 2 marks.

[Figure: category probability curves for a question scored 0, 1 or 2 marks, plotting the probability of each score against ability from -5 to 4 logits.]

Chapter 9 – Further applications of the Rasch model T F T F T F T F

The Rating Scale model is an extension of the simple Rasch model. The multidimensional models do not share the assumptions of Unidimensionality and Local Independence. The multidimensional models were developed for the cases where we want to use multiple models e.g. the Partial Credit and the Rating Scale models in the same analysis. The Rating Scale model should not be used in tests where different items have different numbers of steps/categories.



T F T F T F

The Computerized Adaptive Tests select the most appropriate items for a person based on his/her previous responses. When a Computerized Adaptive Test is used, the examinees are expected to get 50% of the responses correct. The Partial Credit model is a more general case of the Rating Scale model where the same category/step on different questions can have a different estimate.

Chapter 10 – Planning, preparation and administration of assessments F F T T F T F T F T F T T T T

Planning a assessment usually commences with considering the learning outcomes Learning outcomes are the topics in a course The blueprint for an assessment is called the Table of Specifications The values under the heading WEIGHTS in a one-way table of specifications are the teacher’s judgments of importance The weights in a table of specifications represent only the number of questions in an assessment When you are preparing a short assessment on a limited topic you can stop with a one-way table of specifications If you have a syllabus that is competency-based then you would focus only on the topics The row and column headings for a table of specifications consist of topics and learning outcomes The table of specifications gives the specific procedure for developing a assessment You can develop a table of specifications for a topic, a unit, a module, an entire subject or even a whole course A table of specifications is a three dimensional classification for preparing a assessment A table of specifications can be used for competency-based assessments The learning outcomes determine which forms and methods of assessment will be used In planning an assessment you allocate weights to the methods of assessment This planning of an assessment program is designed to increase the content validity of your assessments

Chapter 11 – Assessment of knowledge: constructed response questions Part A T F F T F F F T


Schemata are knowledge structures Declarative knowledge is knowing that something is the case Objective questions refer to specific learning outcomes Test questions can be categorised as supply versus selection Supply questions require a student to recognise an answer Supplying an answer requires similar cognitive processes as selecting an answer The essay question involves a lengthy prose composition or treatise Under the heading of an essay we have included any task for which there is no objectively agreed upon single correct response


F F T T T T

Essays can be classed as extended response or general response Cognitive outcomes can be assessed only by essays An essay question is easier to write than a multiple-choice question Point scoring involves the allocation of marks to an essay using a predetermined scale or range Holistic scoring assesses the total effect of a written discourse Inconsistency in scoring is a problem with essay questions

Part B F F T F F T F F T F T T T T F F

Providing options for an essay exam is desirable Constructed-response questions include true-false questions Short answer questions are an all-purpose form of question Short answer questions require a word for an answer Short answer questions are influenced more by guessing than essay questions The short answer includes sentence completion questions The short answer question is subjectively scored because it is marked by a teacher or instructor Short answer questions are useful for analysis and evaluation of ideas Short answer questions can be norm-referenced Essay questions provide better coverage of a subject than completion questions Completion questions are a variation of the short answer question In the completion question there is little scope for guessing Spaces for completion questions should be the same size as the missing word A Swiss cheese completion question has too many blanks An item bank is a commercially available short answer test Short answer questions assume that students repeat the information presented by the teacher

Chapter 12 – Assessment of knowledge: selected response questions Part A T T T F T F T F F F T T F T

Matching questions are objectively-scored questions True-false questions are classed as selection items Alternate-choice are multiple choice questions Matching questions cover only limited subject content You need a large number of items in a matching question The number of options remains constant after each choice in a matching question A matching question can be displayed in two columns Matching questions should be heterogeneous in content An optimum number of matches is around 20 True-false questions are not helpful for assessing basic facts or ideas Some subjects incorporate a hierarchical network of true and false propositions Any question format can assess trivial information The true-false question is less valid because of the guessing factor The more questions then the less effect there will be for factors such as guessing 337


T T F F T T T F F T F

A true-false assessment should contain at least 20 questions True-false questions can cover a wide range of content Three true-false questions can be answered in the time it takes to answer one multiple-choice question The alternate choice and true-false question are identical Alternate choice questions are easier to write than multiple-choice questions More alternate choice questions can be asked within a testing period than with multiple-choice questions For a question with three alternatives the correction for guessing is Right – (Wrong/2) The correction for guessing assumes that right and wrong answers were obtained systematically The use of a correction for guessing is recommended Test directions should be clear as to any penalties for guessing The 50% difficulty level after adjusting for chance for a two-choice question is 75%

Part B F T T F T F T T T T F F T T F F T T

Multiple-choice questions were developed around the time of World War II Multiple-choice questions can assess judgment as well as memory A multiple-choice question consists of an incomplete statement or direct question The suggested solutions to a multiple-choice question are called stems The purpose of the incorrect alternatives in a multiple-choice question is to provide plausible answers Context-dependent multiple-choice questions have more than one correct answer Some of the criticism of the use of multiple-choice questions is correct Multiple-choice questions provide a better sample of subject knowledge than essay questions Complex learning outcomes can be assessed using multiple-choice questions An average student can be expected to answer about 50-60 multiple-choice questions in an hour Around 10% more questions are needed before developing the final version of a test Multiple-choice questions in a test are norm-referenced Multiple-choice questions in a test are objectively scored Scores on multiple-choice tests usually rank performance comparably with other tests In multiple-choice questions we include options that are obviously wrong We need around 2-3 questions for each learning outcome About four responses are adequate for each question It is not necessary that every question has an equal number of alternatives

Chapter 13 – Assessment of performance and practical skills F T


Practical skills are psychomotor skills Skilled performance involves cognitive, perceptual and motor skills


T T F F T F F T F T T T T T

The first phase in the acquisition of a psychomotor skill is the early cognitive phase The phases of skill acquisition apply across all the areas of learning Case knowledge involves information from isolated incidents A novice learner realizes that generalizations do not cover all situations A one-off adequate performance may not be sufficient for inference of competent or expert performance A static assessment traces the individual’s learning potential Spotlighting is for use with a high stakes assessment procedure Questioning techniques about practical skills provide direct evidence of knowledge In an assessment using a total job, students are assessed in carrying out a key section of a job Checklists are used where a process has to be assessed Rating scales are useful as a way of making observations of quality Standards-referenced assessment uses benchmarks as indicators of quality Exemplars are key examples which are typical of the level of quality Descriptors are the properties that might characterise the level of quality

Chapter 14 – Assessment of attitude and behaviour
F – Attitudes form part of the cognitive domain
F – An attitude is a real phenomenon
T – Attitudes were originally conceived as the degree of positive or negative feeling
T – The attitudinal domain includes emotions
T – Exams can be used to assess attitudinal learning outcomes
F – An opinion questionnaire indicates the degree of attitude
T – An attitude scale groups the questions in some way and adds together the responses to give a total score
T – Each teacher should use a standard questionnaire to evaluate his/her instruction
F – The steps in the construction of an attitude scale are: writing questions, asking people to rate their feelings, and adding the scores
F – The validity of attitudes is seen in the consistency of the answers
F – The reliability of attitudes is seen in their ability to predict behaviour
T – The use of an attitude scale to judge an individual may be challenged
F – An opinion checklist is a Likert scale
T – In writing opinion statements, you should avoid statements that are endorsed by everybody
T – Direct observation of students can provide an assessment of attitudes
F – Observation of students should be covert
T – The strength of observational assessments of attitudes is that they provide perceptions of situations and circumstances
T – A major role for the assessment of attitudes is to direct teaching and learning
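The attitude-scale items above turn on the idea of grouping related opinion statements and adding the responses into a single total score. A minimal sketch of that summing follows; the five-point coding, the reverse-scoring of negatively worded statements and the sample responses are all assumptions for illustration.

def likert_total(responses, negatively_worded, points=5):
    # Sum the ratings, reverse-scoring any negatively worded statements
    total = 0
    for i, rating in enumerate(responses):
        total += (points + 1 - rating) if i in negatively_worded else rating
    return total

# Five statements rated 1-5; the statements at positions 2 and 4 are negatively worded
print(likert_total([4, 5, 2, 3, 1], negatively_worded={2, 4}))  # 21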

Chapter 15 – Grading performance and results
T – Grading is part of the process of student assessment
T – There are three options for grading and reporting of results

T – Grading follows the judgments or decisions that are made on the basis of the test results
F – Raw scores are natural units of performance
T – Natural units of performance include response times
T – In education most scores represent percentage correct
F – Percentages are scaled scores
T – In training contexts most results represent rankings
F – Percentages are the same as percentile ranks
F – Standard scores are criterion-referenced
T – Pass rates on educational assessments are culturally determined
T – Grading is consistent with a competency-based system of instruction
T – The Angoff method is used for cut-off scores
F – The Angoff method is the average item discrimination
T – To equate scores, you need to make the range of scores equal
F – To equate scores you need to make the maximum scores equal
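The Angoff procedure mentioned above sets a cut-off score from judges’ estimates of the probability that a minimally competent candidate would answer each question correctly; the average estimate per question is summed to give the cut-off. A minimal sketch with invented ratings:

def angoff_cutoff(judge_ratings):
    # judge_ratings: one list per judge; each entry is the judged probability
    # that a borderline candidate answers that question correctly.
    n_judges = len(judge_ratings)
    n_items = len(judge_ratings[0])
    item_means = [sum(judge[i] for judge in judge_ratings) / n_judges
                  for i in range(n_items)]
    return sum(item_means)  # cut-off in raw-score units

# Three judges rating a five-question test
ratings = [
    [0.8, 0.6, 0.7, 0.5, 0.9],
    [0.7, 0.6, 0.6, 0.4, 0.8],
    [0.9, 0.5, 0.7, 0.6, 0.8],
]
print(round(angoff_cutoff(ratings), 2))  # about 3.37 out of 5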

Chapter 16 – Test Equating
F – Scores on two tests are equivalent if the tests have similar item format, equal test length and the same content
T – Large sample sizes (usually more than 250) are needed for a small equating error
F – The Anchor-Test-Nonequivalent-Groups Design demands that the two papers to be equated must be administered to the same group of people
F – The anchor paper must be at most 20% as long as the test to be equated
T – A major advantage of equipercentile equating between two tests X and Y is the opportunity to set cutting scores (for groups similar to that used for the equating) on both tests and still be sure that the same percentage of examinees will succeed at each test
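The final statement above rests on the defining property of equipercentile equating: a score on test X is mapped to the score on test Y that has the same percentile rank. The sketch below is a crude discrete approximation (operational equating interpolates and smooths); the two score lists are invented.

def percentile_rank(scores, x):
    # Percentage of scores in the group falling at or below x
    return 100.0 * sum(1 for s in scores if s <= x) / len(scores)

def equipercentile_equivalent(x, scores_x, scores_y):
    # Map x to the lowest score on test Y whose percentile rank
    # reaches the percentile rank of x on test X
    target = percentile_rank(scores_x, x)
    for y in sorted(set(scores_y)):
        if percentile_rank(scores_y, y) >= target:
            return y
    return max(scores_y)

scores_x = [3, 5, 6, 7, 8, 9, 10, 12, 13, 15]
scores_y = [10, 12, 14, 15, 16, 18, 19, 20, 22, 25]
print(equipercentile_equivalent(8, scores_x, scores_y))  # 16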

INDEX

Apgar, 2 standard, 84
affective domain, 259
alternate choice question, 223
American Educational Research Association, 10
American Psychological Association, 10
analytic scoring, 206
Angoff method, 278
argument for using an essay, 206
aspects of validity, 59
assessment, 2
assessment program, 189
Athanasou, 213
attitude, 259
attitude scale, 265
attitudinal domain, 259
Belief, 260
biserial correlation, 102
Case studies, 261
clerical errors in marking, 68
Code of Fair Testing Practices in Education, 305, 307
completion questions, 214
Completion questions, 213
Complex multiple-choice questions, 227
concurrent validity, 63
consistent decisions, 80
construct validation, 66
Construct validity, 60
construct-related validity, 66
content, 22
content validation, 61
content validity, 67, 183
Content validity, 60
context-dependent multiple-choice questions, 227
Converted scores, 312
correction for guessing formula, 224
Correlation, 55
correlations between different testing methods, 63
Criterion validity, 60
criterion validity, 62
criterion-referenced, 30
criterion-referenced testing, 30
Cronbach’s coefficient alpha, 78
cut-off points, 278
discrimination index, 99
Distracter attractiveness, 232
distracters, 226
effect of practice or coaching on test results, 85
Essay, 261
essay question, 204
essay questions, 203
ethical issues, 9
expectancy table, 64
Extended essay questions, 206
Face validity, 60
formative, 28
formative assessments, 29
Glaser, 30
Grading, 273
grading students based on the normal curve, 283
graphic rating scales, 254
High reliability, 81
history of educational assessment, 4
history of educational testing, 6
holistic, 251
holistic assessments, 23
Holistic scoring of an essay, 207
identification question, 212
Interest, 260
internal consistency, 75, 78
International Association for Educational Assessment, 10
International Testing Commission, 10
inter-teacher variability, 208
intra-individual rating variability, 209
Item analysis, 91
item bank, 214
item difficulty, 92, 93
item discrimination, 100
Item discrimination, 95, 99
item writing rules, 229
judgment, 251
Kuder-Richardson 20, 80
Kuder-Richardson 20 reliability, 79
Kuder-Richardson formulae, 78
learning outcomes, 185
Letter grades, 281
Likert, 265
limitation of the multiple choice, 228
logbook, 269
Lower reliability, 81
marking, 193
marking guide, 192
marks for each assessment, 188
mastery learning, 31
matching question, 219
multiple-choice question, 225
multiple-choice questions, 225
National Council on Measurement in Education, 10
Natural units of performance, 275
nomograph for converting test scores, 281
normalisation, 311
normative, 30
norm-referenced, 30
objective or subjective scoring, 203
observation, 268
open-book test, 191
Opinion, 260
opinion questionnaire, 261
pair comparisons, 269
parallel or equivalent forms, 74
pass-fail, 281
percentage correct, 275
percentile rank, 310
point biserial correlation, 101
Point scoring, 206
power tests, 33
Practical tests, 246
predictive validity, 62
Questioning, 260
Questionnaires, 261
Ranking, 269
ranks, 275
rating scales, 254
Rating scales, 254
References, 261
reliability, 22, 210, 266
Reliability, 72
reliability for criterion-referenced assessments, 80
restricted essay question, 206
restricted or extended-response essay, 206
Scaling, 311
scope for educational assessment, 8
scoring short answer questions, 212
Scriven, 29
Self-reports, 261
short answer question, 211
Spearman-Brown formula, 76
split-half approach to reliability, 75
spotlighting, 246
standard error of measurement, 82, 83
standard scores, 311
Standardisation, 313
Standardised, 311
Standards for Educational and Psychological Testing, 10, 59, 71
stem, 226
subject outline, 49
summative, 28
summative assessment, 29
Supplying or selecting an answer, 203
susceptibility to guessing, 221
table of specifications, 190
Take-home tests, 191
test-retest, 73
time limits, 33
Time limits, 33
trace line, 233
transformation of ranks, 276
translate test scores into grades, 282
true-false question, 221
types of validity, 60
validity, 59, 248, 266
Values, 260
viva voce or oral examinations, 261
weights, 183
z scores, 312
z-score, 311

GLOSSARY OF TERMS USED

ABILITY ESTIMATE The estimated location of a person on the common person-question scale. The estimated ability of a person is based on the observed raw score on the attempted questions. Larger values indicate more of the ‘measured quantity’.
ACHIEVEMENT TEST A test which seeks to assess current performance and ability.
ADAPTIVE TEST In an adaptive test, the level of difficulty of the questions asked is varied to suit the ability level of the student.
ADVANCED BEGINNER The second stage in the development of expertise, in which a learner (around the second or third year of a career) has accumulated some knowledge.
AFFECTIVE The attitudinal or feeling dimension of human behaviour and learning.
ALTERNATE-CHOICE This is a question with two answers as options or choices.
APTITUDE TEST A test which aims to assess a person’s potential achievement.
ASSESSMENT The process of collecting and combining information from tests (e.g., on performance, learning or quality) with a view to making a judgment about a person or making a comparison against an established criterion.
ATTITUDE The degree of positive or negative feeling associated with some object.
AUTONOMOUS The third and final phase in the process of skill acquisition in which performance becomes practically automatic.
BELIEF Thoughts about an object, person or event that are regarded as true, real or credible.
BISERIAL CORRELATION The biserial correlation is an index of the relationship between a score on a test and a dichotomous value (i.e., 1 or 0). It is assumed that the ability underlying the value of 1 (pass) or 0 (fail) is normally distributed and continuous in its range.
CASE KNOWLEDGE The knowledge that develops from the experience of dealing with a particularly difficult or interesting problem situation, especially the solutions or outcomes.
CHECKLIST A checklist is taken here to be a list of factors, properties, aspects, components, criteria, tasks, or dimensions, the presence or amount of which is to be separately considered, in order to perform a certain task.
COEFFICIENT ALPHA This coefficient is a measure of the internal consistency or homogeneity of the scores on the test. It is a reliability coefficient which can range in value from 0 to 1.
COGNITIVE The thinking or rational dimensions of human behaviour and learning.
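To make the COEFFICIENT ALPHA entry concrete, here is a brief illustrative sketch of the computation from item variances and the variance of total scores; the dichotomous response matrix is invented and is not a worked example from the text.

def coefficient_alpha(item_scores):
    # item_scores: one list per person, one score per question
    k = len(item_scores[0])  # number of questions
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    item_vars = [variance([person[i] for person in item_scores]) for i in range(k)]
    total_var = variance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Five persons answering four dichotomously scored questions
data = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(round(coefficient_alpha(data), 2))  # 0.8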


COMPARABLE SCORES Comparable scores are results from different tests which are based on a common scale.
COMPETENT (a) someone being properly qualified or capable; and (b) someone being fit, suitable, sufficient, adequate for a purpose. As the third stage in the development of expertise, the competent worker is experienced (around the third year of their career) and is able to set priorities and make plans.
COMPLETION A short answer question. Completion questions require the supply of a missing word and its insertion into a blank space.
CONCURRENT VALIDITY The extent to which test results are related to another criterion, set of scores or other results.
CONSTRUCT VALIDITY The extent to which test scores are related to some other theoretical criterion, measure or behaviour which should be linked to the test scores.
CONTENT VALIDITY The extent to which a test adequately covers all the topics and objectives of a course, syllabus, unit or module.
CONTEXT-DEPENDENT MULTIPLE-CHOICE QUESTIONS Context-dependent multiple-choice questions provide text, graphs, cartoons, charts or diagrams which must be analysed and interpreted.
CONTINGENCY COEFFICIENT A statistical index of relationship for tabular data.
CONVERSION METHOD A conversion method is a means of transforming the score on one test to the range of scores on another test so that they are scored in equal units.
CORRELATION A statistical measure or index of relationship between two variables. Ranges from +1 to -1. Can be calculated by a variety of methods – product-moment, point biserial, phi, or rank correlation.
CRITERION-REFERENCED TEST A test which assesses a defined content area and with standards for performance, mastery or passing.
CUT-OFF SCORE The cut-off score is a critical score level at or above which students pass and below which students fail.
DEEP APPROACHES Learners focus their attention on the overall meaning or message of the subject. Ideas are processed and interests developed in the topics. Where possible the content is related to experiences to make it meaningful.
DIAGNOSTIC TEST An assessment that identifies specific learning needs, strengths or deficits of an individual.
DIFFICULTY ESTIMATE The estimated location of a question on the common person-question scale. The difficulty estimate is based on the number of persons who gave a correct response to the question compared to the number of persons who attempted the question.
DISTRACTER ATTRACTIVENESS Distracter attractiveness is determined as the percentage of people who chose an option.
DISTRACTERS Distracters are the incorrect alternatives in a multiple-choice question.
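The CORRELATION entry can be illustrated with the product-moment calculation, one of the methods the entry names; the two small score lists below are invented.

def product_moment_correlation(x, y):
    # Pearson product-moment correlation between two sets of scores
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Scores of six students on two assessments
print(round(product_moment_correlation([3, 5, 7, 8, 9, 10], [4, 4, 6, 7, 9, 9]), 2))  # 0.95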


DYNAMIC ASSESSMENT A dynamic assessment occurs over time and can trace the individual’s learning potential.
EARLY COGNITIVE PHASE The first phase in the acquisition of a skill in which the person attempts to understand the basic components of the skill.
EPISODIC KNOWLEDGE Involves isolated pieces of knowledge or incidents which, when accumulated together by the alert individual, build up a more coherent picture.
ERROR OF MEASUREMENT Person abilities and question difficulties are estimated by the Rasch model through a sample of data. The error of measurement is an indication of the uncertainty about the true ability of a person or the true difficulty of a question. The error of measurement can be used to define a range of plausible values for the true ability or difficulty.
ESSAY A prose composition or short treatise.
EVALUATION The systematic process of judging value, merit or worth (i.e., how good something is).
EXEMPLAR Key example chosen so as to be typical of designated levels of quality or competence.
EXPERT The final stage in the development of expertise in which a learner has an intuitive grasp of situations.
FACE VALIDITY The appearance that a test is valid from its contents.
FORMATIVE ASSESSMENT Formative assessments are conducted during teaching and instruction with the intention to improve the learning process.
GROUP TEST A test which can be administered to a number of people simultaneously, such as a class test.
HIGH STAKES ASSESSMENT Important assessments such as professional certification or occupational registration with significant consequences.
HOLISTIC ASSESSMENTS Holistic assessment is an approach to assessment that tries to integrate knowledge, skills and attitudes in assessment tasks.
IDENTIFICATION The identification question requires the student to name an object from a picture or drawing.
INDIVIDUAL TEST A test administered on a one-to-one basis, compared with a group test which can be given to a number of people simultaneously.
INFIT MEAN SQUARE A measure of the degree to which a person or a question ‘behaves’ as expected by the Rasch model. Large values of this measure (usually larger than 1.3) are highly undesirable. Also see Misfitting persons/questions.
INTEREST A preference, which is reflected in the amount of knowledge, involvement with and value for an object.
INTERNAL CONSISTENCY The extent to which test questions are homogeneous.
ITEM ANALYSIS Item analysis refers to methods for obtaining information about the performance of a group on a question (also called an item).
ITEM BANK A systematic collection of potential questions for use in a test.
ITEM DIFFICULTY Item difficulty is the number of people who answer a question correctly. The smaller the value (or percentage figure), the more difficult the question.
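The ITEM DIFFICULTY entry is easily shown as a percentage-correct calculation; the response column below is invented. A smaller percentage means a harder question.

def item_difficulty(responses):
    # responses: 1 for a correct answer, 0 for an incorrect answer
    return 100.0 * sum(responses) / len(responses)

# Eight students attempt a question and five answer it correctly
print(item_difficulty([1, 1, 0, 1, 0, 1, 0, 1]))  # 62.5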


ITEM DISCRIMINATION Item discrimination is an index used for analysing questions. It looks at the capacity of a question to distinguish high scorers from low scorers.
KEY The key is the correct answer to a multiple-choice question.
KUDER-RICHARDSON FORMULA 20 A formula for determining the internal consistency reliability of a test. It is based on the average of all the possible split-halves in a test.
LINEAR TEST A linear test asks a student to answer every question in a fixed order. It can be contrasted with an adaptive test.
LOCAL INDEPENDENCE The probability that a person gives a correct response to one question should not be affected by the correctness of that person’s response to the previous question.
LOGIT The unit of measurement which results from the mathematical application of the Rasch model. Both abilities and difficulties are expressed in logits.
MATCHING QUESTION This question consists of two lists or columns of related information. Related items are required to be matched.
MEAN The average value.
MEASUREMENT A numerical description or quantification or categorisation of outcomes or performance on a scale, according to a set of rules.
MISFITTING PERSONS/QUESTIONS Persons or questions that do not ‘behave’ as expected by the Rasch model. For example, misfitting persons may give correct responses to difficult questions and miss easy questions. Misfitting questions may be answered correctly by less able persons and incorrectly by more able persons.
MULTIPLE-CHOICE QUESTION A multiple-choice question consists of a direct question or an incomplete statement usually with three or more alternative solutions.
NORMS Tables of scores or test results for different groups, and against which an individual’s performance may be compared.
NOVICE A beginning student or learner who seeks all-purpose rules to guide his/her behaviour.
OBJECTIVE TEST A test with a fixed method of scoring or one correct answer and/or with complete scorer agreement.
ONE-WAY TABLE OF SPECIFICATIONS A one-way table of specifications indicates the weights (i.e., the number of marks or questions) for topics in a course, subject or module.
OPINION Specific thoughts on a topic, issue or activity.
PARALLEL FORMS Two equivalent versions of a test.
PARTIAL CREDIT Questions are not marked as correct/incorrect (1/0 marks) but more levels of ‘correctness’ are identified. Thus, an examinee who gives a ‘partly correct’ response is also awarded a partial credit.
PARTIAL CREDIT RASCH MODEL A Rasch model which can analyse data where partial credit is awarded for partially correct responses.
PERCENTILE RANK The percentile rank is the percentage of scores in a group that fall at or below a score.
PERCENTILES A transformation of test scores to indicate numerically the number of people who scored less.
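One common form of the index described under ITEM DISCRIMINATION is the upper-lower difference: the proportion of high scorers answering a question correctly minus the proportion of low scorers doing so. A minimal sketch with invented response groups:

def discrimination_index(upper_group, lower_group):
    # Difference between the proportion correct in the top-scoring group
    # and the proportion correct in the bottom-scoring group
    p_upper = sum(upper_group) / len(upper_group)
    p_lower = sum(lower_group) / len(lower_group)
    return p_upper - p_lower

# Responses (1 = correct) to one question from the top and bottom groups
print(round(discrimination_index([1, 1, 1, 0, 1], [0, 1, 0, 0, 0]), 2))  # 0.6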


POINT BISERIAL CORRELATION The point biserial correlation is an index of the relationship between a score on a test and a dichotomous value (i.e., 1 or 0). It is not assumed that the ability underlying the value of 1 (pass) or 0 (fail) is normally distributed and continuous in its range.
POWER TEST A test without a time limit. It is presumed to assess achievements without the pressure of speed of responding.
PRACTICE-FIXATION The second phase in the process of skills acquisition where correct performance is gradually shaped.
PREDICTIVE VALIDITY The extent to which test scores predict future performance.
PROFICIENT The fourth stage in the development of expertise, in which intuition becomes important for the learner, who no longer consciously thinks about adjustments.
PSYCHOMOTOR The domain of skilled behaviour which relates to the motor effects of cognitive processes (e.g., manual skills, eye-hand coordination).
RAW SCORE A raw score is an original or unadjusted score from a test.
RELIABILITY The stability, consistency and dependability of a test score or measurement. Assessed using the split-half method, internal consistency, and parallel form and test-retest methods.
SCALED SCORE A scaled score is a raw score which is adjusted or transformed, such as a percentile rank or a standard score.
SENSITIVITY Sensitivity of an item refers to the discrimination between pre- and post-instruction effects. It is the change in the proportion of correct answers before and after instruction.
SHORT ANSWER The short answer question restricts the answer to be provided to a paragraph, sentence or word that can be clearly identified as correct.
SPEARMAN-BROWN FORMULA A formula for correcting the reliability based on the split-half method and for estimating the reliability if a test is lengthened.
SPLIT-HALF A method of assessing reliability based on correlating the two halves of a test (usually correlating the odd and even numbered questions).
SPOTLIGHTING Spotlighting is a way of focusing assessment. It may be used to describe a student’s progress at a particular time in a learning activity.
STANDARD DEVIATION A statistical index of variation or dispersion of scores from the mean or average score.
STANDARD ERROR OF MEASUREMENT Standard error of measurement is the dispersion of errors in the measurements of the test. Each score has its own standard error of measurement. This can vary depending on where it is on the overall distribution of scores (e.g., at the extremes or in the middle).
STANDARDISATION A test is standardised if there are uniform and specified procedures for administration, responding or scoring. Standardisation can also mean administering a specially developed test to a large, representative sample of people under uniform circumstances with the aim of obtaining norms (i.e., a summary of the performance of the group).
STANDARDISED SCORE A standardised score is expressed in terms of the mean and standard deviation.
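The SPLIT-HALF, SPEARMAN-BROWN FORMULA and STANDARD ERROR OF MEASUREMENT entries fit together: the correlation between two half-tests is stepped up to estimate full-test reliability, and the standard error then follows from the reliability and the standard deviation (the usual formula SD x square root of (1 - reliability)). The figures below are invented for illustration.

def spearman_brown(half_test_correlation):
    # Step up a split-half correlation to full-test reliability
    r = half_test_correlation
    return 2 * r / (1 + r)

def standard_error_of_measurement(sd, reliability):
    # SEM = SD * sqrt(1 - reliability)
    return sd * (1 - reliability) ** 0.5

r_full = spearman_brown(0.70)
print(round(r_full, 2))                                       # 0.82
print(round(standard_error_of_measurement(10.0, r_full), 2))  # about 4.2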


STANDARDS-REFERENCED ASSESSMENT An assessment approach that is related to criterion-referenced assessment and which uses agreed-upon outcomes as the reference point for decisions about learning and achievement.
STATIC ASSESSMENT A static assessment occurs at a point in time.
STEM The stem is the main part of the multiple-choice question (the direct question or incomplete statement).
SUMMATIVE ASSESSMENT Tests which are used for a summative assessment occur at the end of instruction and contribute to a final grade. Usually, they provide information to someone other than the students.
SURFACE APPROACHES With this type of learning the focus is on passing rather than understanding. There is a concern for the details that need to be remembered for assessment purposes. The quality of the learning outcomes is much lower.
TEST A systematic procedure, based on a set of questions, exercises or problems, for sampling and verifying desired behaviours such as learning, performance, knowledge, abilities, aptitudes, qualifications, skills and/or attitudes.
TEST-RETEST A method of estimating reliability involving the repeated administration of a test to a group.
THRESHOLD The threshold of a response category (e.g. ‘2 marks’ on a question) is the amount of ability needed by an examinee to have a 50% probability of being awarded that or a higher category.
TRACE LINE A trace line is a graphical way of describing group performance on a question.
TRUE-FALSE This is a two-option question. Statements have to be identified as being correct or incorrect, true or false.
TWO-WAY TABLE OF SPECIFICATIONS A two-way table of specifications indicates the weights (i.e., the number of marks or questions) for combinations of topics and objectives in a course, subject or module.
UNIDIMENSIONALITY One of the most basic assumptions of the Rasch model. Unidimensionality requires that only a single ability is measured by the questions of the instrument.
VALIDITY The accuracy and relevance of a test score. Different types of validity include: concurrent validity, predictive validity, construct validity, content validity.
VALUES The personal importance or worth of an object.
Z-SCORE The z-score is a type of standard score.
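The Z-SCORE and STANDARDISED SCORE entries can be shown in two lines: a raw score is expressed as its distance from the mean in standard-deviation units, and can then be re-expressed on any convenient scale. The figures used are invented for illustration.

def z_score(raw, mean, sd):
    # Distance of a raw score from the mean in standard-deviation units
    return (raw - mean) / sd

def standardised_score(raw, mean, sd, new_mean=50, new_sd=10):
    # Re-express a raw score on a scale with a chosen mean and standard deviation
    return new_mean + new_sd * z_score(raw, mean, sd)

print(z_score(38, 30, 4))             # 2.0
print(standardised_score(38, 30, 4))  # 70.0 on a mean-50, SD-10 scale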


ABOUT THE AUTHORS

Iasonas Lamprianou was born in Cyprus in 1973 where he currently lives. Iasonas obtained his PhD from the University of Manchester and joined the Centre for Formative Assessment Studies (CFAS; University of Manchester) in 1999 as an analyst. He left Manchester in 2003 and worked at the Cyprus Testing Service (Ministry of Education and Culture) until 2008. Iasonas is now an Assistant Professor of Educational Research and Evaluation at the European University (Cyprus) and an Honorary Research Fellow of the University of Manchester. He has participated in research projects involving school effectiveness, educational assessment and measurement, test equating, Computerized Adaptive Systems and Item Banking. His research interests include psychometrics and especially test-equating, Item Response Models, Multilevel Rasch Models, Computerized Adaptive Systems and person-fit statistics. Iasonas currently teaches “Educational Research Methods” and “Educational Measurement and Evaluation” courses at various Universities and Institutions, at both undergraduate and post-graduate levels. Iasonas is married and his interests include music, reading and football. http://www.relabs.org/personal/iasonas.html


James A Athanasou was born in Perth in 1948 and came to the Maroubra area of Sydney in 1953, where he still lives. Jim obtained his PhD from the Centre for Behavioural Studies in Education at the University of New England. He spent most of his career in the New South Wales Public Service and came to the University of Technology, Sydney in 1991 as a lecturer in Measurement and Evaluation. He was Associate Professor in the Faculty of Education and retired in November 2008. Jim is the author of Adult Educational Psychology (Sense Publishers), Adult Education and Training (David Barlow Publishing), Evaluating Career Education and Guidance (ACER Press), co-editor of the International Handbook of Career Guidance (Springer) and other texts, including Selective Schools and Scholarships Tests (Pascal Press), Basic Skills Tests Year 3/Year 5 (Pascal Press), Opportunity Class Tests (Pascal Press). He was editor of the Australian Journal of Career Development, published by the Australian Council for Educational Research, from 2000 to 2008 and has been editor since 1995 of PHRONEMA, the Annual Review of St Andrew’s Greek Orthodox Theological College. He has been a visiting fellow at the Universitat der Bundeswehr Muenchen, the University of Illinois at Urbana-Champaign, the University of Hong Kong and Vrije Universiteit Brussel. Jim is a registered psychologist and member of the Australian Psychological Society. His research centres on vocational assessment and career development. He is married with four children and his interests include religion, reading, walking and gardening. http://www.geocities.com/athanasou


REFERENCES AND NOTES

CHAPTER 1 – INTRODUCTION TO ASSESSMENT
1. Stiggins, R. J., & Conklin, N. F. (1988). Teacher training in assessment. Portland, OR: Northwest Regional Educational Laboratory.
2. Source: http://www.schools.nsw.edu.au/, accessed March 2002.
3. Source: http://www.vcaa.vic.edu.au/glossary.htm, accessed March 2002.
4. Source: http://www.ku.edu/kansas/medieval/108/lectures/capitalism.html, accessed March 2002.
5. Source: http://www.edweek.com/ew/vol-18/40tests.h18, accessed March 2002.
6. Broadfoot, P. (1979). Assessment, schools and society. London: Methuen.
7. Eggleston, J. (1991). Teaching teachers to assess. European Journal of Education, 26, 231–237, p. 236.
8. Joint Committee on Testing Practices. (1988). Code of fair testing practices in education. Washington, DC: Author.

CHAPTER 2 – THE VARYING ROLE AND NATURE OF ASSESSMENT

1. Lombardi, M. M. (2008). Making the grade: The role of assessment in authentic learning. EDUCAUSE Learning Initiative, ELI Paper 1:2008.
2. For example, the Assessment Procedures Manual (University of Technology, Sydney, 1994, p. 1) indicates that no component should account for more than 65% of the total assessment. University of Technology, Sydney. (1994). Assessment procedures manual. Sydney: University of Technology, Sydney.
3. See Supplement to TAFE Gazette No. 41 of 1995: ‘No more than 10% of Module duration is given over to assessment - or 15% where outcomes are mainly practical’ (p. 4).
4. A Teacher’s Guide to Performance-Based Learning and Assessment by Educators in Connecticut’s Pomperaug Regional School District 15. Source: http://www.ascd.org/readingroom/books/hibbard96book.html#chapter1, accessed March 2002.
5. Linn, R. L. (1993). Educational assessment: Expanded expectations and challenges. Educational Evaluation and Policy Analysis, 15, 1–16, p. 9.
6. Education Week on the Web. Source: http://www.teachermagazine.org/context/glossary/perfbase.htm, accessed March 2000.
7. Baker, E. L., O’Neil, H. F., & Linn, R. L. (1993). Policy and validity prospects for performance based assessment. American Psychologist, 48, 1210–1218, p. 1210.
8. Cox (1990, p. 540) has noted that ‘the examination focus becomes the structure’. Cox, K. (1990). No Oscar for OSCA. Medical Education, 24, 540–545.
9. Scriven, M. (1991). Evaluation thesaurus (4th ed.). Newbury Park, CA: Sage.
10. Scriven, M. (1991). Evaluation thesaurus (4th ed.). Newbury Park, CA: Sage.
11. Source: http://www.unesco.org/education/educprog/ste/projects/2000/formative.htm
12. Glaser, R., & Nitko, A. J. (1971). Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 625–670). Washington, DC: American Council on Education.

CHAPTER 3 – FUNDAMENTAL CONCEPTS OF MEASUREMENT

1. Lamprianou, I. (in press). Comparability of examination standards between subjects: An international perspective. Oxford Review of Education. Accepted, scheduled for June 2009.


CHAPTER 4 – VALIDITY AND VALIDATION OF ASSESSMENTS
1. American Educational Research Association, American Psychological Association and National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
2. Gronlund, N. E. (1982). Constructing achievement tests (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
3. Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.
4. Newble, D. I., & Swanson, D. B. (1988). Psychometric characteristics of the objective structured clinical examination. Medical Education, 22, 325–334, p. 331.
5. Note that small class sizes will result in unstable correlation coefficients.
6. Nungester, R. J., Dawson-Saunders, B., Kelley, P. R., & Volle, R. L. (1990). Score reporting on NBME Examinations. Academic Medicine, 65, 723–729, p. 725.
7. Athanasou, J. A. (1996c). A report on the academic progression 1992-1995 of students selected for commercial data processing at the Sydney Institute of Technology. Sydney: NSW TAFE Commission.
8. Assessment Research & Development Unit. (1988b). Manual for examiners and assessors (Revised May 1988). Sydney: NSW Department of Technical and Further Education.
9. Ryan, J. J., Prifitera, A., & Powers, L. (1983). Scoring reliability on the WAIS-R. Journal of Consulting and Clinical Psychology, 51, 460.

CHAPTER 5 – THE RELIABILITY OF ASSESSMENT RESULTS

1. American Educational Research Association, American Psychological Association and National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
2. This index was proposed by Hambleton, R. K., & Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159–170.
3. The true score is the average of all the scores that a student would gain on an infinite number of tests.
4. Berk, R. A. (1980). A consumer’s guide to criterion-referenced test reliability. Journal of Educational Measurement, 17, 323–349.
5. Linn, R. L. (1982). Admissions testing on trial. American Psychologist, 37, 279–291.
6. Newble, D. I., & Swanson, D. B. (1988). Psychometric characteristics of the objective structured clinical examination. Medical Education, 22, 325–334.

CHAPTER 6 – OBJECTIVE MEASUREMENT USING RASCH MODELS

1. Brennan, R. L. (1972). A generalized upper-lower item discrimination index. Educational and Psychological Measurement, 32, 289–303.
2. Shepard, L. A., Camilli, G., & Averill, M. (1981). Comparison of six procedures for detecting test item bias using both internal and external ability criteria. Journal of Educational Statistics, 6, 317–375.
3. Pollitt, A., Hutchinson, C., Entwistle, N., & De Luca, C. (1985). What makes exam questions difficult? Edinburgh: Scottish Academic Press.
4. Berk, R. A. (Ed.). (1982). Handbook of methods for detecting item bias. Baltimore: The Johns Hopkins University Press.

CHAPTER 7 – THE PARTIAL CREDIT RASCH MODEL

1. Infit Mean Square stands for ‘Information weighted mean square residual goodness of fit statistic’. Wright, B. D., & Mok, M. (2000). Rasch models overview. Journal of Applied Measurement, 1, 83–106, p. 96.
2. Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370–371.
3. Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied Measurement, 1, 152–176.
4. Bejar, I. I. (1983). Achievement testing: Recent advances. Beverly Hills, CA: Sage.

CHAPTER 8 – FURTHER APPLICATIONS OF THE RASCH MODEL

1. Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Lawrence Erlbaum.

CHAPTER 9 – PLANNING, PREPARATION AND ADMINISTRATION OF ASSESSMENTS

1. Board of Studies, Business Services Curriculum Framework Stage 6 Syllabus, http://www.boardofstudies.nsw.edu.au/surveys/bus_serv_part_a_frame_1.html, accessed March 2002.
2. An earlier version of this material appeared in Athanasou, J. A., & Olasehinde, O. (2002). Male and female differences in self-report cheating. Practical Assessment, Research & Evaluation, 8(5). Retrieved March 13, 2009 from http://PAREonline.net/getvn.asp?v=8&n=5
3. Cizek, G. J. (1999). Cheating on tests: How to do it, detect it and prevent it. Mahwah, NJ: Lawrence Erlbaum.
4. Newstead, S. E., Franklyn-Stokes, A., & Armstead, P. (1996). Individual differences in student cheating. Journal of Educational Psychology, 88, 229–241.
5. Baird, J. S. (1980). Current trends in college cheating. Psychology in the Schools, 17, 515–522; Cizek, G. J. (1999). Cheating on tests: How to do it, detect it and prevent it. Mahwah, NJ: Lawrence Erlbaum, p. 39; Newstead, S. E., Franklyn-Stokes, A., & Armstead, P. (1996). Individual differences in student cheating. Journal of Educational Psychology, 88, 229–241, p. 232.
6. These questions were kindly provided by a former student (Alison Campos GDVET, 2000).

CHAPTER 10 – ASSESSMENT OF KNOWLEDGE: CONSTRUCTED RESPONSE QUESTIONS

1. Traub, J. (2002, April 7). The test mess. The New York Times. Source: http://www.nytimes.com/2002/04/07/…/07TESTING.html
2. For example Conlan, G. (1978). How the essay in the College Board English Composition Test is scored. Princeton, NJ: Educational Testing Service.
3. Wergin, J. F. (1988). New directions for teaching and learning. Assessing Students’ Learning, 34, 5–17, p. 13.
4. Project Essay Grade, Intelligent Essay Assessor, E-rater, Bayesian Essay Test Scoring System.
5. This discussion is adapted from Rudner, L. L. (1992). Reducing errors due to the use of judges. Practical Assessment, Research and Evaluation, 3(3). Available online http://ericae.net/pare/getvn.asp?v=3&n=3
6. Vernon, P. E., & Millican, G. D. (1954). A further study on the reliability of English essays. British Journal of Statistical Psychology, 7, 65–74; Finlayson, D. S. (1951). The reliability of the marking of essays. British Journal of Educational Psychology, 21, 126–134.
7. Rudner, L., & Gagne, P. (2001). An overview of three approaches to scoring written essays by computer. Practical Assessment Research and Evaluation, 7(26). Available online http://ericae.net/pare/getvn.asp?v=7&n=26
8. Houston, W. M., Raymond, M. R., & Svec, J. C. (1991). Adjustments for rater effects. Applied Psychological Measurement, 15(4), 409–421.
9. Stanley, J. C. (1962). Analysis of variance principles applied to the grading of essay tests. Journal of Experimental Education, 30, 279–283.
10. Contingency coefficient = 0.69
11. Lamprianou, I. (2006). The stability of marker characteristics across tests of the same subject and across subjects. Journal of Applied Measurement, 7(2), 192–200.
12. See Godshalk, F. I., Swineford, F., & Coffman, W. E. (1966). The measurement of writing ability. NY: College Entrance Examination Board, for an analysis of five independent markers on 20 and 40 minute essays [reliability coefficients for five independent markers: one 20 minute essay, 0.48; two essays (40 minutes), 0.65; one 40 minute essay, 0.59].
13. Meyer, G. (1934). An experimental study of the old and new types of examination: I. The effect of examination set on memory. Journal of Educational Psychology, 25, 641–661. Meyer, G. (1935). An experimental study of the old and new types of examination: II. Methods of study. Journal of Educational Psychology, 26, 30–40.
14. See Fowler, F. J., & Cannell, C. F. (1996). Using behavioral coding to identify cognitive problems with survey questions. In N. Schwarz & S. Sudman (Eds.), Answering questions (p. 15). San Francisco: Jossey-Bass.

CHAPTER 11 – ASSESSMENT OF KNOWLEDGE: SELECTED RESPONSE QUESTIONS

1. Miller, W. R., & Rose, H. C. (1975). Instructors and their jobs. Chicago: American Technical Society.
2. The utility of this question format has been supported in research by Downing (1992) in a review of true-false and alternate choice questions. Downing, S. M. (1992). True-false and alternate choice item formats: A review of research. Educational Measurement: Issues and Practices, 11, 27–30.
3. Rowley, G. R., & Traub, R. (1977). Formula scoring, number-right scoring and test-taking strategy. Journal of Educational Measurement, 14, 15–22.
4. As Dunstan (1970, p. 1) noted: ‘... tests can be written in which such flaws can be reduced to negligible proportions’. Dunstan, M. (1970). A reply to some criticisms of objective tests. Bulletin No. 2. Tertiary Education Research Centre, University of New South Wales.
5. Haladyna, T. M. (1994). Developing and validating multiple-choice test items. Hillsdale, NJ: Lawrence Erlbaum Associates.
6. Adapted from Miller, W. R., & Rose, H. C. (1975). Instructors and their jobs. Chicago: American Technical Society.
7. It may be argued that in a competency-based assessment a question might be correctly answered by all candidates and that this principle might not be valid. The issue here is that in a larger general sample, the alternatives would all be plausible options, but that with this specific group it does not matter that the correct option was chosen by most candidates.

CHAPTER 12 – ASSESSMENT OF PERFORMANCE AND PRACTICAL SKILLS

1. An earlier version of this material appeared in Cornford, I. R., & Athanasou, J. A. (1995). Developing expertise through practical training. Industrial & Commercial Training, 27, 10–18.
2. Jones, A., & Whittaker, P. (1975). Testing industrial skills. Essex: Gower Press.
3. De Cecco, J. P., & Crawford, W. R. (1974). The psychology of learning and instruction (2nd ed.). Englewood-Cliffs, NJ: Prentice-Hall.
4. Crossman, E. R. F. W., & Seymour, W. D. (1957). The nature and acquisition of industrial skills. London: Department of Scientific and Industrial Research.
5. Fitts, P. M. (1964). Perceptual skill learning. In A. W. Melton (Ed.), Categories of skill learning (pp. 65). NY: Academic Press; Fitts, P. M. (1968). Factors in complex training. In G. Kuhlen (Ed.), Studies in educational psychology (pp. 390–404). Waltham, MA: Blaisdell Publishing Company.
6. Fitts, P. M. (1968). Factors in complex training. In G. Kuhlen (Ed.), Studies in educational psychology (p. 399). Waltham, MA: Blaisdell Publishing Company.
7. Field, J. E., & Field, T. F. (1992). Classification of jobs. Athens, GA: Elliott & Fitzpatrick.
8. Adapted from Gonczi, A., Hager, P., & Athanasou, J. (1993). A guide to the development of competency-based assessment strategies for professions. National Office of Overseas Skills Recognition Research Paper, DEET. Canberra: Australian Government Publishing Service; see also Hager, P., Athanasou, J. A., & Gonczi, A. (1994). Assessment technical manual. Canberra: Australian Government Publishing Service.
9. Sadler, D. R. (1987). Specifying and promulgating achievement standards. Oxford Review of Education, 13, 191–209.
10. Sadler, D. R. (1987). Specifying and promulgating achievement standards. Oxford Review of Education, 13, 191–209, pp. 200–201.
11. Scriven, M. (2000). The logic and methodology of checklists. Retrieved March 2002, from http://www.wmich.edu/evalctr/checklists/logic_methodology.htm. The following paragraphs are based on Scriven’s outline of checklists.
12. Saskatchewan Education. (1999). Autobody 10, A20, B20, A30, B30, Curriculum Guide. Retrieved March 2002, from http://www.sasked.gov.sk.ca/docs/paa/autobody/appendixd.html

CHAPTER 13 – ASSESSMENT OF ATTITUDES AND BEHAVIOUR

1. Some aspects of this are taken up by Krathwohl, D. R., et al. (1964). Taxonomy of educational objectives: Handbook II, Affective domain. NY: D. McKay and by Bloom, B. S., Hastings, J. T., & Madaus, G. F. (1971). Handbook on formative and summative evaluation of student learning. NY: McGraw-Hill.
2. Edwards, A. L. (1955). Techniques of attitude scale construction. New York: Appleton Century Crofts.
3. Source: Accessed March 2002, from http://arch.k12.hi.us/school/sqs/.
4. Department of Educational Accountability. (February, 1996). Parent and student satisfaction with elementary schools in Montgomery County. Rockville, MD: Montgomery County Public Schools.
5. Source: Accessed March 2002, from http://www.dpsnc.com/dps/schools/CustSat/YESmithCustSat.html
6. Abrami, P. C. (1989). How should we use student ratings to evaluate teaching? Research in Higher Education, 30, 221–227; Abrami, P. C., d’Apollonia, S., & Cohen, P. A. (1990). Validity of student ratings of instruction: What we know and what we do not know. Journal of Educational Psychology, 82, 219–231; L’Hommedieu, R., Menges, R. J., & Brinko, K. T. (1990). Methodological explanations for the modest effects of feedback from student ratings. Journal of Educational Psychology, 82, 232–241.
7. Athanasou, J. A. (1994). Some effects of career interests, subject preferences and quality of teaching on the educational achievement of Australian technical and further education students. Journal of Vocational Education Research, 19, 23–38; Athanasou, J. A., & Petoumenos, K. (1998). Which components of instruction influence student interest? Australian Journal of Teacher Education, 23, 51–57.
8. Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140.
9. Low, G. D. (1988). The semantics of questionnaire rating scales. Evaluation and Research in Education, 2, 69–70.
10. Amoo, T., & Friedman, H. H. (2001). Do numeric values influence subjects’ responses to rating scales? Journal of International Marketing and Marketing Research, 26, 41–46.

CHAPTER 14 – GRADING, PERFORMANCE AND RESULTS

1. Holt, J. (1971). What do I do Monday? London: Pitman Publishing.
2. A normalized rank approach which assumes that rank orders come from a normally distributed population is well known and has been outlined previously (see Gage, N. L., & Berliner, D. C. (1975). Educational psychology. Chicago: Rand McNally; Guilford, J. P. (1954). Psychometric methods (2nd ed.). NY: McGraw-Hill), but this is cumbersome for practical use.
3. The assistance of Dr Yap Sook Fwe and Dr Poh Sui Hoi of the Nanyang Technological University in suggesting alterations to the original formula is gratefully acknowledged.
4. The transformed ranks that are produced correlate perfectly with the original ranks and correlate (product-moment correlation) almost perfectly with the z scores of ranks (0.97 for ranks up to 250).
5. University of Technology, Sydney, Assessment Procedures Manual, 1994, p. 34.
6. Some methods that are based on the performance of successful, unsuccessful and borderline groups have been advocated for determining competency. Quantitative techniques for setting minimally acceptable performance levels on multiple-choice examinations were developed by Nedelsky (1954) and applied to the setting of educational standards. This method relies on judges’ estimates of the performance of minimally competent students on each question. Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3–19.
7. Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement. Washington, DC: American Council on Education.
8. Korbula, G. (1992). Setting cut-off points for numeracy skills using the Angoff method. Unpublished B.Ed project. University of Technology, Sydney.
9. An earlier version of this material appeared in Hager, P., Athanasou, J. A., & Gonczi, A. (1994). Assessment technical manual. Canberra: Australian Government Publishing Service.

CHAPTER 15 – TEST EQUATING

1. Angoff, W. H. (1984/1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.
2. Lord, F. M. (1982). The standard error of equipercentile equating. Journal of Educational Statistics, 7, 165–174.
3. Marco, G. L. (1981). Equating tests in an era of test disclosure. In B. F. Green (Ed.), Issues in testing: Coaching, disclosure, and ethnic bias. San Francisco: Jossey-Bass.
4. Petersen, N. C., Kolen, M. J., & Hoover, H. D. (1993). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed.). Phoenix, AZ: Oryx Press.
5. Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with nonrandom groups. Journal of Educational Measurement, 22(3), 197–206.
6. Braun, H. I., & Holland, P. W. (1982). Observed score test equating: A mathematical analysis of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 9–49). New York: Academic Press.
7. Kiek, L. A. (1998). Data analysis of the key stage 2&3 pilot mental mathematics tests. Report for ACCAC. University of Cambridge.
8. Angoff, W. H. (1984b). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.
9. Kolen, M. J. (1984). Effectiveness of analytic smoothing in equipercentile equating. Journal of Educational Statistics, 9(1), 25–44.
10. Braun, H. I., & Holland, P. W. (1982). Observed score test equating: A mathematical analysis of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 9–49). New York: Academic Press; Petersen, N. C., Kolen, M. J., & Hoover, H. D. (1993). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., p. 247). Phoenix, AZ: Oryx Press; Mislevy, R. J., Sheehan, K. M., & Wingersky, M. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30(1), 55–78.
