International Large-Scale Assessments in Education

Also available from Bloomsbury
Digital Governance of Education, Paolo Landri
Educational Assessment on Trial, Andrew Davis and Christopher Winch, edited by Gerard Lum
Global Education Policy and International Development, edited by Antoni Verger, Mario Novelli and Hülya Kosar Altinyelken

International Large-Scale Assessments in Education
Insider Research Perspectives
Edited by Bryan Maddox

BLOOMSBURY ACADEMIC
Bloomsbury Publishing Plc
50 Bedford Square, London, WC1B 3DP, UK
1385 Broadway, New York, NY 10018, USA

BLOOMSBURY, BLOOMSBURY ACADEMIC and the Diana logo are trademarks of Bloomsbury Publishing Plc

First published in Great Britain 2019
Paperback edition published 2020

Copyright © Bryan Maddox and Contributors, 2019

Bryan Maddox and Contributors have asserted their right under the Copyright, Designs and Patents Act, 1988, to be identified as Authors of this work.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers.

Bloomsbury Publishing Plc does not have any control over, or responsibility for, any third-party websites referred to or in this book. All internet addresses given in this book were correct at the time of going to press. The author and publisher regret any inconvenience caused if addresses have changed or sites have ceased to exist, but can accept no responsibility for any such changes.

A catalogue record for this book is available from the British Library.
A catalog record for this book is available from the Library of Congress.

ISBN: HB: 978-1-3500-2360-4
PB: 978-1-3501-6488-8
ePDF: 978-1-3500-2362-8
eBook: 978-1-3500-2361-1

Typeset by Integra Software Services Pvt. Ltd. To find out more about our authors and books visit www.bloomsbury.com and sign up for our newsletters.

Contents

Notes on contributors
Foreword, Bruno D. Zumbo
Introduction, Bryan Maddox

Part One: Theory and method
1. Researching inside the international testing machine: PISA parties, midnight emails and red shoes, Camilla Addey
2. Assessment imaginaries: methodological challenges of future assessment machines, Ben Williamson
3. The infrastructures of objectivity in standardized testing, Nelli Piattoeva and Antti Saari
4. Detecting student performance in large-scale assessments, Margareta Serder

Part Two: Observing data production
5. Starting strong: on the genesis of the new OECD survey on early childhood education and care, Simone Bloem
6. The situation(s) of text in PISA reading literacy, Jeanne Marie Ryan
7. Self-reported effort and motivation in the PISA test, Hanna Eklöf and Therese N. Hopfenbeck
8. Student preparation for large-scale assessments: a comparative analysis, Sam Sellar, Bob Lingard, David Rutkowski and Keita Takayama
9. Investigating testing situations, Bryan Maddox, Francois Keslair and Petra Javrh

Part Three: Reception and public opinion
10. Managing public reception of assessment results, Mary Hamilton
11. The public and international assessments, Oren Pizmony-Levy, Linh Doan, Jonathan Carmona and Erika Kessler
12. Post script: has critique begun to gather steam again? Beyond 'critical barbarism' in studying ILSAs, Radhika Gorur

Index

Notes on contributors

Camilla Addey is Lecturer in International and Comparative Education at Teachers College, Columbia University, New York City. Previously, Camilla was a researcher at Humboldt University in Berlin. Her research interests include international large-scale assessments, global education policy and international education privatization. Her work has focused on the OECD and UNESCO and been carried out in Laos, Mongolia, Ecuador and Paraguay. Camilla co-directs the Laboratory of International Assessment Studies. She obtained her PhD in Education at the University of East Anglia in the UK and her Ed.M at La Sapienza University in Italy. Before her PhD studies, Camilla worked at UNESCO headquarters on adult literacy and non-formal education.

Simone Bloem is a research associate at the German Youth Institute (DJI, Munich) in the International Centre Early Childhood Education and Care (ICEC). As National Project Manager of the TALIS Starting Strong Survey she is responsible for the administration in Germany of this first international comparative staff survey under the lead of the OECD.

Jonathan Carmona is a master's student in International Educational Development at Teachers College, Columbia University. His professional and academic focus is on issues of accountability, effectiveness and responsiveness in the educational non-profit sector. His current projects explore such issues by looking at frontline stakeholders and the experience of the people they serve.

Linh Doan is a PhD candidate in the Department of International and Transcultural Studies at Teachers College, Columbia University (with a concentration in sociology of education). Her research interests include academic dishonesty in the context of testing, international large-scale assessments, and public opinion towards education and schooling.

Hanna Eklöf is Associate Professor at the Department of Applied Educational Science, Umeå University, Sweden. She has a PhD in educational measurement, and her research interests include the psychology of test-taking, large-scale testing and measurement quality issues.

Radhika Gorur is Senior Lecturer and DECRA Fellow at Deakin University, Australia, and a director of the Laboratory for International Assessment Studies. Her research seeks to understand how some policy ideas cohere, stabilize, gain momentum, and make their way in the world. Using material-semiotic approaches, she has been developing and contributing to the sociology of numbers that makes explicit the instrumental and constitutive work of quantification, calculation and comparison in policy. She is currently studying contemporary initiatives in assessment and accountability in the Indo-Pacific.

Mary Hamilton is Professor Emerita of Adult Learning and Literacy in the Department of Educational Research at Lancaster University, UK. She is Associate Director of the Lancaster Literacy Research Centre and Director of the Laboratory for International Assessment Studies. She has a long-standing interest in informal, vernacular learning and how communicative and learning resources are built across the life span. She has become increasingly involved with historical and interpretative policy analysis exploring how international influences reach into local practice and the implications of this for tutor and student agency in literacy education. Her current research is in literacy policy and governance, socio-material theory, media, digital technologies and change.

Therese N. Hopfenbeck is Associate Professor and Director of the Oxford University Centre for Educational Assessment. Her research interests include large-scale comparative assessments and how international testing has shaped public policy across education systems. In addition, she is interested in different models of classroom assessment and self-regulation. Therese was the Research Manager of PIRLS 2016 in England and a member of the PISA 2018 and PISA 2021 Questionnaire Expert Group. She is Lead Editor of the international research journal Assessment in Education: Principles, Policy and Practice.

Petra Javrh has a PhD in pedagogical science from the University of Ljubljana. She is currently working as a research fellow at the Slovenian Institute of Adult Education. During the last twenty years she has been involved in different research projects based on qualitative and biographical research methods. She has been a researcher and collaborator in several research institutions, including the University of Ljubljana, the national Educational Research Institute and others. She has published several articles and monograph publications concerning career development, adult literacy, key competencies, sustainable development, education of vulnerable groups and others.

Francois Keslair is a statistician at the OECD, working in the team in charge of the Programme for the International Assessment of Adult Competencies (PIAAC). He holds a PhD in economics from the Paris School of Economics.

Erika Kessler is a master's student in the Department of International and Transcultural Studies at Teachers College, Columbia University, USA. Her research focuses on the sociological phenomena of educational and social movements and organizational change in schools. She is interested in the examination of teacher, student and public engagement in environmental and sustainability education, internationalization and international large-scale assessments. Recently, her research has focused on NYC schools' engagement with environmental and sustainability education and on fostering student activism.

Bob Lingard is Emeritus Professor of Education in the School of Education at The University of Queensland. His most recent books include Globalizing Educational Accountabilities (2016) and Politics, Policies and Pedagogies in Education (2014).

Bryan Maddox is Senior Lecturer in Education and International Development at the University of East Anglia, UK. He is a director of the Laboratory of International Assessment Studies (with Camilla Addey, Mary Hamilton, Radhika Gorur and Sam Sellar), and was principal investigator on the ESRC International Seminar Series on 'The Politics, Potentials and Practices of International Educational Assessments'. He is co-editor of Literacy as Numbers (with Mary Hamilton and Camilla Addey, 2015).

Nelli Piattoeva is Associate Professor in the New Social Research programme, University of Tampere. She is currently interested in post-Soviet educational transformations, particularly audit culture and its manifestations in Russian schools, as well as in the production of numerical data on education and the political work done with numbers.

Oren Pizmony-Levy is Assistant Professor in the Department of International and Transcultural Studies at Teachers College, Columbia University. He holds a PhD in sociology and comparative and international education from Indiana University Bloomington. He investigates the intersection between education and social movements using three cases: international large-scale assessments (the accountability movement), and the environmental and LGBT rights movements. He is broadly interested in change/stability within education systems worldwide. His recent projects have focused on the impact of PISA results on public opinion in thirty countries, the use of PISA as evidence in state education agencies in the USA and the Opt Out movement.

David Rutkowski is Associate Professor with a joint appointment in Educational Policy and Educational Inquiry at Indiana University. Prior to IU, David was Professor of Education at the Center for Educational Measurement (CEMO) at the University of Oslo, Norway. David's research is focused in the area of educational policy and educational measurement, with specific emphasis on international large-scale assessment and programme evaluation. David has collaborated with or consulted for national and international organizations and led evaluations and projects in over twenty countries. He is currently co-editor of the IEA policy brief series.

Jeanne Marie Ryan is a doctoral candidate at the Oxford University Centre for Educational Assessment (OUCEA), University of Oxford, UK, having previously studied linguistics at Oxford. Her current research focuses on the assessment of reading, variation among the constructs of reading in international and national assessments, and the application of corpus linguistics techniques to language learning and assessment materials. Working with other members of OUCEA, she has previously contributed to a systematic review of articles concerning PISA: Hopfenbeck et al. (2017), 'Lessons Learned from PISA: A Systematic Review of Peer-Reviewed Articles on the Programme for International Student Assessment', Scandinavian Journal of Educational Research.

Antti Saari is University Researcher in the Faculty of Education at the University of Tampere, Finland. His research interests include curriculum studies and the sociology and history of expert knowledge in education. His work has focused on the practices of translation between educational research and governing education systems. His current work deals with the multiple interfaces between instructional technology and education policies.

Sam Sellar is Reader in Education Studies at Manchester Metropolitan University. His current research focuses on large-scale assessments, data infrastructures, commercialization and new accountabilities in schooling. Sam works closely with teacher associations around the world to explore the effects of datafication for educators, schools and communities. He has published widely on the influence of large-scale assessments in school systems, including the education work of the Organisation for Economic Co-operation and Development (OECD) and its Programme for International Student Assessment (PISA). His recent books include The Global Education Race: Taking the Measure of PISA and International Testing (2017) and Globalizing Educational Accountabilities (2016).

Margareta Serder has a PhD in Science and Mathematics Education from Malmö University, Sweden. Her thesis, Encounters with PISA, was published in 2015 and explores the standardized assessment of students' scientific literacy by studying test items, frameworks and result reports from the international comparative study Programme for International Student Assessment (PISA). Serder is Lecturer at the Faculty of Education and Society at Malmö University, and Affiliated Researcher at Jönköping University. Her most recent international publications are to be found in the journals Science Education and Discourse: Studies in the Cultural Politics of Education.

Keita Takayama is Associate Professor in the School of Education, University of New England, Australia, where he leads the Equity and Diversity Education Research Network. A number of his recent writings have examined various aspects of the OECD's PISA, including its item development processes, cultural bias issues, media discourse and national mediation of PISA data. Recently, he co-edited two special journal issues in Comparative Education Review and Postcolonial Directions in Education. He was the 2010 recipient of the George Bereday Award from the Comparative and International Education Society.

Ben Williamson is a Chancellor's Fellow at the Edinburgh Futures Institute and Moray House School of Education at the University of Edinburgh. His research interests are in digital technology in education, with a particular focus on data systems, commercial software providers and the role of scientific expertise in policy processes. His latest book is Big Data in Education: The Digital Future of Learning, Policy and Practice (2017).

Foreword
Tensions, Intersectionality and What Is on the Horizon for International Large-Scale Assessments in Education
Bruno D. Zumbo
Professor and Distinguished University Scholar; Paragon UBC Professor of Psychometrics and Measurement, University of British Columbia

There are two strands of contemporary international large-scale assessments (henceforth ILSAs) in education which sit in tension. On the one hand, there is the desire of developers and purveyors of such assessments, those employed and profiting from the ILSA industrial complex, to ensure that their assessment tools and delivery systems are grounded in our most successful psychometric and statistical theories. Their aim is to do social good whilst serving their economic and financial imperatives. There is nothing necessarily untoward or ignoble in this goal; what I am describing is just a social and economic phenomenon reflecting financial globalization and international competitiveness. On the other hand, there is the increasing desire of those of us outside of the ILSA industrial complex to ensure that the philosophical, economic, sociological and international comparative commitments in ILSA are grounded in a critical analysis that flushes out their intended and unintended personal and social consequences. These two strands are not necessarily disjointed, and are connected by a common body and goal. I believe that this tension is important and healthy as it unites both strands in working toward a common goal of increasing the quality of life of our citizens.

Historically, the emergence of ILSAs saw antagonism between the large-scale national and international testing industrial complexes and the international, comparative and sociology of education traditions. Maddox's International Large-Scale Assessments in Education signals a change in that relation – it is not the first such sign of the change, but it is an important one. Certainly, the staunch critics on both sides still exist, and they play an important role, but there is also recognition of disciplinary and personal intersectionalities that give the contemporary work a new feel and a new purpose. By 'intersectionalities' I mean the interconnected nature of social categorizations, be they disciplinary/scholarly categories, such as comparative education and educational assessment, or personal categories, such as gender, race, ethnicity and social class. At the core of that intersectionality are overlapping and interdependent systems of advantage/disadvantage, including the potential for discrimination. In the way I am using intersectionality, these advantages/disadvantages as well as discrimination apply to both disciplinary/scholarly and personal intersectionalities.

From a disciplinary point of view, Maddox's International Large-Scale Assessments in Education reflects the intersectionality of methodological and theoretical challenges in researching ILSAs; of data production and detailed 'insider' accounts of ILSA practice; and of the down-stream role of the media in presenting ILSA findings, with a keen eye on their impact on public opinion and the views of teachers. From my vantage point, this book does not just reflect a bland (and, frankly, overused) notion of 'interdisciplinarity' or 'transdisciplinarity' but rather a much more interesting focus on the interconnectedness that is reflected in a disciplinary intersectionality. Likewise, in ILSA studies the intersectional advantage of statistical league tables and impenetrable idiosyncratic statistical terminology is fundamentally challenged and the orthodoxy up-ended. The early tensions in the history of ILSA studies between (i) scholars of international and comparative education and scholars of the sociology of education and (ii) psychometricians and educational assessment specialists came from some psychometricians telling us that intersectionality is for science-denying postmodernists. The current book shows, to the contrary, that many psychometric-minded scholars use social and personal intersectionality to do good work. As this collection shows, to understand a complex social phenomenon you need sufficiently fine-grained and deeply interwoven categories to discover differential effects. My hope is that the narrative of disciplinary traditions in collision is a thing of the past and that this new view leads to the overall improvement of the quality of life of students and citizens of our world.

From a more philosophical lens, this collection highlights that the use and interpretation of ILSAs makes particular kinds of statement about an educational phenomenon of interest (e.g. mathematics or reading) whose definitions rely on normative standards. These normative statements make claims about how things should or ought to be, how to value them, which things are good or bad, and which actions are right or wrong. Empirical generalizations about them thus present a special kind of value-ladenness. Philosophers of science have already reconciled values with objectivity in several ways (see, for example, Douglas, 2004, 2011). None of the existing proposals are suitable for the sort of claims made in the use of ILSAs, which I would describe as a blending of normative and empirical claims. Some would argue that these blended claims should be eliminated from science – in part, this reflects the early psychometric stance in the disciplinary tension I described earlier. I argue that we should not seek to eliminate these blended claims from the use of ILSAs or in ILSA studies. Rather, we need to develop principles for their legitimate use.

What is next on the horizon was signaled by Sam Messick in his work on validity theories and is nudged along by Maddox and his contributing authors; that is, to find or discover the hidden value propositions in the use and interpretation of ILSAs. This needs to be systematic and documented as part of the process of validation and interpretation of ILSAs. It needs, in part, to focus on disagreements about the empirical claims from ILSAs. Finally, one needs to check whether value presuppositions are invariant or robust to these disagreements and, if not, to conduct an inclusive deliberation that focuses on disagreements about the empirical claims from the test. Elsewhere (e.g. Zumbo, 2017; Zumbo & Hubley, 2017) I have argued that this sort of value-ladenness is already part of the science of measurement and testing – and, I would argue, of science more generally. Pretending that measurement and testing can be reformulated into value-free claims devalues perfectly good practices and stakes the authority of the science of measurement and testing on its separation from the community that it needs and enables. This community of students, parents and policy-makers needs to be viewed from an intersectional stance.

References

Douglas, H. (2004), 'The irreducible complexity of objectivity', Synthese, 138: 453–73.
Douglas, H. (2011), 'Facts, values, and objectivity', in I. Jarvie and J. Zamora Bonilla (eds), The SAGE Handbook of Philosophy of Social Science, 513–29, London: SAGE.
Zumbo, B. D. (2017), 'Trending away from routine procedures, towards an ecologically informed "in vivo" view of validation practices', Measurement: Interdisciplinary Research and Perspectives, 15 (3–4): 137–9.
Zumbo, B. D. & Hubley, A. M. (eds) (2017), Understanding and Investigating Response Processes in Validation Research, New York: Springer.

Introduction
Bryan Maddox

International Large-Scale Assessments in Education, or ILSAs as they are known, are one of the most influential phenomena in contemporary education (Kirsch et al., 2013). ILSAs use large-scale, standardized tests to measure and compare educational achievements within and across nations. The rise in influence of ILSAs can be attributed to their rapid global expansion, and the policy 'shock' and controversy that has been generated from international league tables (Meyer & Benavot, 2013).1 However, ILSAs such as the OECD's Programme for International Student Assessment (PISA) and the IEA Trends in International Mathematics and Science Study (TIMSS) are not just important as projects of international comparison. Their emphasis on measurable data has had profound domestic influences in education – whether that is on national educational policy, school level assessment, curriculum change, or systems of accountability (Smith, 2016). While ILSAs are 'low stakes' for test respondents, they are often 'high-stakes' for educational systems. ILSAs therefore demand to be studied and understood.

But understanding ILSAs is not easy. They have formal technical characteristics, and promote ideas of standardization and objectivity. However, in practice ILSAs are also diverse, global projects, and their characteristics, meanings and effects have to be considered in relation to the different locations, institutions, countries and contexts in which they take place (Maddox, 2014; Addey et al., 2017). Reductionist and deterministic attempts to simplify our understanding of ILSAs rarely do justice to the complexity of their character and their influence. Secondly, ILSAs are complex socio-technical projects (Gorur, 2017). Understanding how they work means recognizing and researching the social and technological processes of assessment practice.

1. For a list and database of ILSAs and their underlying frameworks see the IEA 'ILSA Gateway' (www.ilsa-gateway.org).

International assessment studies

The chapters in this book contribute significantly to the field of 'International Assessment Studies' (Addey, 2014; Hamilton, Maddox, & Addey, 2015; Gorur, 2017).2 In addition to being a descriptor for a rapidly growing literature, International Assessment Studies have distinctive characteristics and concerns. In the context of this book, three of these characteristics can be identified:

Inside the Assessment Machine: The chapters in this book are not satisfied with understanding ILSAs through the study of their reports and technical papers. Instead, the contributors aim to understand ILSAs through systematic observation of ILSAs as they occur in practice. That involves studying the inner-workings of assessment programmes, and it reveals 'polyvalent' processes (Steiner-Khamsi, 2017) across different places and moments that are rarely acknowledged in ILSA technical reports. To that end, the book brings together contributors who make use of ethnography, sociology, Actor Network Theory and Science and Technology Studies.

Expanding the Assessment Cycle: The chapters in the book describe an expanded notion of the assessment cycle. That moves beyond formal models of assessment as planning, implementation, data interpretation and review – what can become truncated in ILSA speak as 'rounds' of assessment. The book expands the assessment cycle in two ways. It extends the tails through its interest in 'up-stream' institutional processes of assessment planning and the enrolment of actors and countries (e.g. Addey & Sellar, 2017), and through 'down-stream' interests in public opinion, the media reception of ILSAs and their consequences (e.g. Chapters 9 and 10). We also deepen and expand understanding of the 'stages' of assessment by using a practice-orientated account to capture and reveal back stage processes and activities (e.g. Chapters 7 and 8).

An International Perspective: With their orientation to ILSA practice, the chapters in the book capture an international diversity of ILSA experience. That helps us to understand the diverse ways that ILSAs are understood and re-contextualized across different cultures and settings. This book contains chapters that discuss ILSA experience in multiple national contexts, including Ecuador and Paraguay; Russia and the United States; Sweden and Norway; UK and France; Singapore and Greece; Israel; Slovenia; Canada, Japan, and Scotland. As the international reach of ILSA expands, those international perspectives become especially important.3

2. This book was one of the outcomes of an international seminar series on 'The Potentials, Politics and Practices of International Educational Assessments' funded by the UK Economic and Social Research Council. Other publications from the series include Gorur (2017).

3. The 2015 PISA survey included seventy-two countries. The OECD predicts that the number will reach 150 by 2030 (Addey, 2017). The 'PISA for Development' initiative illustrates efforts being made to incorporate low-income countries.

About this book

The book is in three parts. Part One discusses methodological and theoretical challenges in researching ILSAs. In Chapter 1, Addey offers a candid and provocative account of some of the challenges for the ILSA researcher in conducting insider-research. She frames her chapter as a problem of researching technical, institutional and policy elites. Her chapter describes a gendered process of negotiating research access to ILSA programmes that are protected by procedures of confidentiality and non-disclosure, and that involve sensitive networks of actors including government agencies, commercial testing agencies and transnational organizations. Her account of going 'back stage' (Goffman, 1959) into processes of 'PISA for Development' decision-making highlights one of the key characteristics of how ILSAs operate. That is, while ILSA techniques aim to reveal and make transparent educational performance as a basis for public debate and accountability, ILSAs are themselves less accessible for researchers who want to understand how they operate as educational and political institutions (see Steiner-Khamsi, 2017; Verger, 2017). Addey describes one of several transgressive methodological moves contained in this book, as she discusses how she accessed the unfamiliar research contexts of an advisory group meeting, a cocktail party, and a dinner.

Williamson's chapter (Chapter 2) continues this provocative methodological discussion as he describes the challenges of researching in a field that is being rapidly transformed by new digital systems and practices. He describes how digital objects, as well as social actors, have to be integrated into our understanding of the ILSA machine. In a radical turning over of our received view, Williamson argues that digital data and their systems are themselves researching us; 'participating in people's lives, observing and listening to them, collecting and analysing the various documents and materials they encounter, and then producing "records" and "facts" of the "reality" of their lives which might be used for subsequent forms of diagnosis and decision-making' (Williamson, Chapter 2). His unsettling dystopian image of digital ethnographers illustrates how ILSA research must consider the social lives of non-human actors if we are to properly understand how they function and influence educational decision-making.

Piattoeva and Saari (Chapter 3) focus their discussion on assessment infrastructures with illuminating case-studies of national level assessments in Russia and the United States. The chapter describes how testing programmes involve a tireless and unpredictable effort to maintain their objectivity and to protect themselves from accusations of political interference and corruption. In Chapter 4, Serder adds to the theoretical discussion by considering PISA as a 'detector' of student performance. She draws on Actor Network Theory (ANT) and Science and Technology Studies (STS) to tease out the way that the PISA test captures certain notions of performance. Her chapter is informed by field observations of how Swedish teenagers tackle PISA science items.

Part Two of the book, on observing data production, provides detailed 'insider' accounts of ILSA practice. Bloem (Chapter 5) provides an insightful account of the planning and development of the OECD's TALIS survey of Early Childhood Education and Care. She shows how the aspirations of different actors and institutions were integrated, and describes the pragmatic decisions that shaped the programme. Bloem's description, like the other chapters in Part Two of the book, captures the practical work that takes place behind the scenes in ILSAs as 'scientific' projects. It highlights a methodological insight, noted by Latour and Woolgar (1979), that real-life observations of scientific work can provide a more insightful account about the conduct of scientific work than those presented in the formal methodological publications of those projects (e.g. in this case the 'technical reports' of ILSA programmes). However, in this case, her account might be described as 'auto-ethnography' as Bloem is herself one of the 'scientists' working within the TALIS programme.

In Chapter 6, Ryan provides further insights into the workings of the assessment machine with a description of how PISA constructs and articulates the idea of 'Reading Literacy' – that is, how the theoretical construct is made real in PISA test items. Her chapter illustrates the subtle challenge of producing 'decontextualized' items that capture, or perhaps 'detect', global skills, while framing reading literacy assessment within real-life texts. Ryan's chapter, and indeed several of the chapters that follow, highlights the difficulties involved in the pursuit of 'decontextualized' standardization while at the same time recognizing the significance of place and space, as assessment practices are re-contextualized within national contexts.


Eklöf and Hopfenbeck (Chapter 7) discuss a survey of self-reported effort and motivation in the Swedish and Norwegian PISA assessment, and ask if the ‘decline’ in student performance over the years might in part be explained by a lack of motivation and effort in a low stakes assessment. They use data from the PISA ‘effort thermometer’ (another detector) and additional survey instruments to answer that question. In discussing questions about motivation and effort with the students, they received feedback on the way that schools and teachers prepared the students for the test, and how the students felt about the test items. Eklöf and Hopfenbeck consider how those various factors may influence student performance. The question of how students in different countries are prepared to take the PISA test is discussed further in Chapter 8 by Sellar, Lingard, Rutkowski and Takayama, in their comparison of the experience of Canada, Japan, Norway and Scotland. They describe a ‘continuum of preparation’, with some countries preparing students minimally, others investing in the coaching and motivation of students, and some profoundly re-aligning their curriculum toward the PISA test. As they note, much of the preparation and curriculum alignment deviates from the technical standards and expectations of PISA, and involves preparation activities that are ‘neither officially recognized or proscribed’. Their chapter nicely illustrates the theme of diversity, as it shows how different countries in a standardized assessment introduce different responses in attempts to influence the outcome. In the final chapter of Part Two, Maddox, Javrh and Keslair (Chapter 9) discuss further scope for contextual variation in testing practices, with a detailed discussion of interviewer behaviour in testing situations. They describe testing procedures in the OECD’s Programme for the International Assessment of Adult Competencies (PIAAC) as it takes place in Slovenia. They use videoethnographic data, interviewer questionnaires and computer-generated log files to investigate how something as apparently mundane as seating arrangements can influence respondent behaviour and performance. Part Three of the book looks down-stream at the role of the media in presenting ILSA findings, their impact on public opinion, and how they influence the views of teachers. These themes concern a paradox. Despite the careful and systematic work of test producers, and the information rich character of ILSA reports, the public reception of ILSA results can involve poorly informed and partial understanding of ILSA results, and that can introduce risks and instability into processes of educational policy making (Gorur & Wu, 2014; Waldow, 2016; Pizmony-Levy & Woolsey, 2017; Sellar, Thompson, & Rutkowski, 2017).


Discussions of ILSA ‘reception’ typically concern how people access and use assessment data, and include some debate about the valid use of data (e.g. Sellar, Thompson, & Rutkowski, 2017). However, ILSA reception studies are expanding to consider further questions about how various actors employ ILSA data as a resource to promote and contest political and ideological agendas. That raises questions about the kinds of theoretical models of ILSAs, of public opinion (Chapter 10) and of the State (Steiner-Khamsi, 2017) that we might apply in considering evidence for such processes, and the kinds of information that would be required to support or refute such arguments. In Chapter 10, Hamilton discusses the media reception of rounds one and two of the OECD PIAAC assessment in thirty-three countries. That includes case-study comparisons of the experience of Japan, the UK and France in the first round, and Singapore and Greece in the second round. Hamilton argues that public reception of ILSA results is subject to a great deal of variation between and within countries, including a variety of actors in media and government, and the wider political and economic location of reception. Those might therefore be described as imperfect conditions to support rational and transparent debate about the significance and consequences of ILSA data. Pizmony-Levy, Doan, Carmona and Kessler (Chapter 11) elaborate on that theme in their discussion of ILSA data and public opinion. They present findings from a survey on public opinion on ILSAs in twenty-one countries. They find that while the people surveyed express opinions about the importance and benefits of ILSA participation, they were much less well informed about the content of ILSA results. The discrepancy between public belief in ILSAs and knowledge about ILSA results highlights the potential for partial or inaccurate information to inform public views about educational performance. That perhaps explains why ILSA results are such a potent resource for policy contestation and debate.

Conclusion

What then does it mean to view International Large-Scale Assessments as socio-technical practice? In answering that question the chapters in this book appear to scandalize ILSAs. By observing the way that technical processes are infused with the social, the chapters appear to undermine the idea of scientific and objective ILSA data. However, as Latour has consistently argued, the discovery of the social in technical projects does not necessarily diminish their value and credibility (Latour, 2004; Gorur, 2017). It does, however, prompt new understandings about how ILSAs function, why the various networks of actors participate (Addey & Sellar, 2017), and the different kinds of consequences we might expect them to produce (Zumbo & Hubley, 2016). Gorur elaborates eloquently on this theme, and on the insights of Latour's perspective on critique, in the post-script of the book.

The book chapters identify an additional source of scandal by demonstrating how 'social' practices (e.g. decision-making, policy making) are mediated and influenced by technical and material processes (e.g. Chapter 2). That hints at a dystopian view of assessment technologies that generate their own momentum, and reduce our ability to promote good assessment and to make our own educational decisions (e.g. Thompson, 2017). As educational systems (including teachers and policy makers) are made accountable to ILSA 'data', those concerns and affects become particularly telling (Sellar, 2014).

Should these departures from 'official' portrayals of ILSA process really be considered a source of scandal, and their revelation such a taboo (Steiner-Khamsi, 2017)? The chapters in this book suggest not. Instead, by carefully researching and documenting ILSA practice, and by highlighting the connections between the social and technical, they provide valuable insights into how and why ILSAs function, and that helps to explain their influences on educational policy and practice.

References

Addey, C. (2014), 'Why do countries join international literacy assessments? An actor-network theory analysis with case-studies from Lao PDR and Mongolia', PhD thesis: University of East Anglia.
Addey, C. (2017), 'Golden relics & historical standards: How the OECD is expanding global education governance through PISA for Development', Critical Studies in Education, 58 (3): 311–325.
Addey, C. & Sellar, S. (2017), 'A framework for analysing the multiple rationales for participating in large-scale assessments. Compare Forum sub-section', in C. Addey, S. Sellar, G. Steiner-Khamsi, B. Lingard, & A. Verger (2017), 'The rise of international large-scale assessments and rationales for participation', Compare: A Journal of Comparative and International Education, 47 (3): 434–452.
Addey, C., Sellar, S., Steiner-Khamsi, G., Lingard, B., & Verger, A. (2017), 'The rise of international large-scale assessments and rationales for participation', Compare: A Journal of Comparative and International Education, 47 (3): 434–452.
Goffman, E. (1959), The Presentation of Self in Everyday Life, New York, NY: Doubleday Anchor.
Gorur, R. (2017), 'Towards productive critique of large-scale comparisons in education', Critical Studies in Education, 58 (3): 341–355.
Gorur, R., & Wu, M. (2014), 'Learning too far? PISA, policy and Australia's "top five" ambitions', Discourse, 36 (5): 647–664.
Hamilton, M., Maddox, B., & Addey, C. (2015), Literacy as Numbers: Researching the Politics and Practices of International Literacy Assessment, Cambridge: Cambridge University Press.
Kirsch, I., Lennon, M., Von Davier, M., Gonzalez, E., & Yamamoto, K. (2013), 'On the growing importance of international large-scale assessments', in M. Von Davier, E. Gonzalez, I. Kirsch, & K. Yamamoto (eds), The Role of International Large-Scale Assessments: Perspectives from Technology, Economy and Educational Research, Oxford: Springer.
Latour, B. (1999), Pandora's Hope: Essays on the Reality of Science Studies, Cambridge, MA and London: Harvard University Press.
Latour, B. (2004), 'Has critique run out of steam? From matters of fact to matters of concern', Critical Inquiry, 30 (2): 225–248.
Latour, B., & Woolgar, S. (1979), Laboratory Life: The Construction of Scientific Facts, Princeton, NJ: Princeton University Press.
Maddox, B. (2014), 'Globalising assessment: An ethnography of literacy assessment, camels and fast food in the Mongolian Gobi', Comparative Education, 50: 474–489.
Meyer, H. D., & Benavot, A. (2013), PISA, Power, and Policy: The Emergence of Global Educational Governance, Oxford: Symposium Books.
Pizmony-Levy, O., & Woolsey, A. (2017), 'Politics of education and teachers' support for high-stakes teacher accountability policies', Education Policy Analysis Archives, 25 (89): 1–26.
Sellar, S. (2014), 'A feel for numbers: Affect, data and education policy', Critical Studies in Education, 56 (1): 131–146.
Sellar, S., Thompson, G., & Rutkowski, D. (2017), The Global Education Race: Taking the Measure of PISA and International Testing, Edmonton, Canada: Brush Education Inc.
Smith, W. (ed.) (2016), The Global Testing Culture: Shaping Education Policy, Perceptions and Practice, Oxford: Symposium Books.
Steiner-Khamsi, G. (2017), 'Focusing on the local to understand why the global resonates and how governments appropriate ILSAs for national agenda setting. Compare Forum sub-section', in C. Addey, S. Sellar, G. Steiner-Khamsi, B. Lingard, & A. Verger (2017), 'The rise of international large-scale assessments and rationales for participation', Compare: A Journal of Comparative and International Education, 47 (3): 434–452.
Thompson, G. (2017), 'Computer adaptive testing, big data and algorithmic approaches to education', British Journal of Sociology of Education, 38 (6): 827–840.
Verger, A. (2017), 'Theorizing ILSA participation. Compare Forum sub-section', in C. Addey, S. Sellar, G. Steiner-Khamsi, B. Lingard, & A. Verger (2017), 'The rise of international large-scale assessments and rationales for participation', Compare: A Journal of Comparative and International Education, 47 (3): 434–452.
Waldow, F. (2016), 'Projecting images of the "good" and "bad" school: Top scorers in educational large-scale assessments as reference societies', Compare: A Journal of International and Comparative Education, 45 (5): 1–18.
Zumbo, B. D., & Hubley, A. M. (2016), 'Bringing consequences and side effects of testing and assessment to the foreground', Assessment in Education: Principles, Policy & Practice, 23: 299–303.

Part One

Theory and method

1

Researching inside the international testing machine: PISA parties, midnight emails and red shoes
Camilla Addey

Introduction

This chapter discusses the methodological and ethical challenges of researching international large-scale assessments (ILSAs) from inside international organizations, government bodies and international businesses involved in ILSA decision-making processes. In particular, the chapter focuses on access and its constant negotiation, on carrying out interviews and participant observations, and on the ethics of data gathering when investigating ILSAs. The interviewees and observed actors are described as ILSA elites, given the significant role they play in shaping the development of ILSAs. The aim of the chapter is to contribute to methodological scholarship on ILSA research from the inside, an approach which is made challenging by the secretive and poorly accessible nature of ILSAs (this is in great part due to the confidentiality of the test items and the sophisticated psychometric methodologies). The chapter describes the methodological and ethical challenges of gaining access to the elite ILSA community and contributes to scholarship on ethical and methodological questions faced by young, female researchers carrying out research with predominantly male, elite education policy actors (Ozga & Gewirtz, 1995). This is a controversial dimension of empirical scholarship which is rarely discussed in the literature.

This title and paper were inspired by the work of Ozga and Gewirtz: Ozga, J. and S. Gewirtz (1995), 'Sex, Lies and Audiotape: Interviewing the Education Policy Elite'. The author is deeply grateful to Radhika Gorur for excellent feedback on earlier versions of this chapter.


The chapter draws on the data-gathering experience carried out for the independent research project 'PISA for Development for Policy'1 (PISA4D4Policy), which focuses on the ILSA known as PISA-D. PISA-D is the redevelopment of the main PISA, which the OECD claimed was poorly policy-relevant for lower and middle income countries since the main PISA background questionnaires do not sufficiently capture their context variables and the test items cluster students in the lower levels of the PISA metric (Bloem, 2013;2 OECD, 2013). The PISA4D4Policy project looks into how the development of PISA-D is negotiated amongst high level staff at the OECD, in the private sector, and in the participating countries involved in its development (see Addey, 2016). It also looks into how PISA-D is made to resonate in Ecuador and Paraguay (two of the nine participating countries),3 and how the OECD is redefining processes of global education governance as it moves into contexts where it had previously not worked in education (see Addey, 2017). The research project adopts a critical policy analysis approach and follows in the tradition of a qualitative research design (Diem et al., 2014).

To investigate the above-outlined research questions, I carried out interviews with staff at the OECD and The Learning Bar (a private contractor developing the PISA-D questionnaires) and with high level policy actors in Ecuador and Paraguay, in 2015 and 2016.4 I describe these interviewees as ILSA elites. Interviews were carried out in English and Spanish, transcribed verbatim and then checked by each interviewee (who was given the opportunity to revise his/her interview). The interviews were all coded by institution, year and interviewee number or name (i.e. OECD2015#31). I also had the opportunity to carry out participant observations of a PISA-D International Advisory Group (IAG) meeting in Paraguay, and to attend a PISA-D cocktail party organized by the OECD and a PISA-D dinner party organized by the Ministry of Education of Paraguay. Data was also gathered through analysis of key documents (all the PISA-D meeting


1. This is an independent research project funded by the Fritz Thyssen Foundation. The project was not funded by the OECD.
2. This is an OECD working paper published by the OECD.
3. Ecuador, Paraguay, Honduras, Guatemala, Senegal, Zambia, Cambodia, Bhutan (which joined in 2017) and Panama (which participated in the main PISA and is now taking part in the PISA-D section for out-of-school children).
4. At this stage, PISA-D was in the process of being developed and had not yet been piloted in the participating countries.


power points and working documents made available online after each PISA-D meeting),5 and other key OECD publications.6

Getting inside the ILSA machine and staying inside

It was the second of July 2014 when the PISA-D doors were opened to me. From 2013, I had been emailing and meeting up with high level OECD staff to discuss carrying out research on PISA-D (which was still at a very early stage). One such meeting had gone very well: after an hour talking about my PhD research findings on rationales for ILSA participation in lower and middle income countries and my OECD interlocutor substantiating my research findings with his stories about PISA participation, I was handed some PISA-D documents and asked if I would consider participating in an OECD-commissioned study on countries' PISA experience. I was excited about this opportunity but above all I was sure the doors to do research on PISA-D had been opened. It was not to be. The study was carried out by researchers with quantitative research training and I was told that it would be very difficult for me to do research on PISA-D: representatives from all the PISA-D partners7 and PISA-D countries would have to formally agree first. The doors were politely closed.

About a year later, on 2 July, after Andreas Schleicher had given the opening keynote at the International Testing Commission annual conference, all conference participants gathered for the welcome reception. I found myself standing with a glass of white wine on the terrace of the Teatro Victoria Eugenia overlooking the Urumea river in San Sebastian, Spain, next to the only person I knew (by email, as I had corresponded a couple of times with him when I worked at UNESCO years before): Andreas Schleicher, the Director of Education and Skills at the OECD (also widely recognized as the father of PISA). Like me, he was standing alone. I decided he was my best chance to open the OECD



5. PISA-D working documents are all available at www.oecd.org/pisa/aboutpisa/pisa-for-developmentmeetings.htm
6. These are: OECD Strategy on Development (OECD, 2012), Report on the Implementation of the OECD Strategy on Development (OECD, 2014a), PISA-D Call for Tender (OECD, 2014b), Beyond PISA 2015: A Longer-Term Strategy for PISA (OECD, 2013) and Using PISA to Internationally Benchmark State Performance Standards (OECD, 2011).
7. The OECD's PISA-D technical partners are UNESCO, the UNESCO Institute for Statistics (UIS), the Education For All Global Monitoring Report (EFA GMR) team, UNICEF, WEI-SPS, Education International, PISA SDG, the PIAAC team, and the assessment programmes ASER, EGRA, EGMA, SACMEQ, PASEC, Pre-PIRLS and PIRLS, TIMSS, LLECE, STEP, LAMP and UWEZO. Its aid partners are France, the Inter-American Development Bank (IADB), Korea, the World Bank, the Global Partnership for Education (GPE), Norway (Norad), the UK (DFID), Germany (BMZ/GIZ), Japan (JICA) and Ireland (Irish Aid); and its contractors are Educational Testing Service, The Learning Bar, cApStAn, Westat, Pearson, Microsoft and Positivo BGH.


doors to carry out research on PISA-D from the inside. After a brief chat, Andreas Schleicher asked me to email him so that he could put me in touch with the right person at the OECD. Within the next hour, I had run back to my hotel to email him, and within a week, I was in Paris meeting OECD staff again.

As discussed by scholars writing about interviewing, access depends greatly on one's 'sponsor' or access point, whose endorsement grants access and cooperation whilst also acting as one's professional credential and standing (Welch et al., 2002; O'Reilly, 2009; Busby, 2011). As Welch et al. (2002) suggest, one's sponsor affects the kind of data one gathers, especially when it is a person or an institution which plays an influential role in the network of research participants. Busby (2011) adds that the power relationships and authority structures in this network also shape the data one gathers. Andreas Schleicher briefly became the 'sponsor' of my project, before this role was transferred to OECD staff. Throughout this project, I have been aware of how my point of access influenced who I have met and not met, how people have reacted (and not reacted) to me, the relationships I have built (and not built), and the data I have gathered (and not gathered).

My access to OECD staff was built over many years. It was an opportunity to build trusting relationships that would lead to the OECD formally introducing me and putting me in touch with all the PISA-D partners identified in the PISA4D4Policy research design. Andreas Schleicher's support within the OECD, and then the OECD staff's support within the PISA-D network, meant that I accessed everyone the PISA4D4Policy research design considered crucial informants. Being introduced by the OECD had implications for my study: on the one hand, my new contacts were willing to support my research needs to please the OECD; on the other hand, I risked being perceived as an OECD researcher carrying out an undercover evaluation (or that I might be reporting to the OECD).

Initially, my new contacts (The Learning Bar staff and representatives of the government in Ecuador and Paraguay) were very willing to support my research. I developed research project outlines (in English and Spanish) which explained what kind of support I was seeking (interviews) and what the implications for my research participants were. This correspondence was followed by Skype chats. I approached all Skype meetings as an opportunity to build trust, but the OECD's introduction had already built it on my behalf.


My contacts were keen to help me in any way I requested, and insider information was shared with me from the first contacts (as these were not formally arranged interviews and I had not informed my future interviewees of their rights, I did not use these exchanges as interview data). My interviewees were so supportive that the organizers of the next IAG meeting invited me to participate and present my research. The foundation funding my research understood the value of this opportunity and increased my funds for me to travel to Latin America, and stay in an expensive hotel where the meeting would take place. I was excited about this opportunity, given that no independent researcher had yet been able to carry out participant observations of the OECD's PISA advisory meetings. I also knew the OECD would not have invited me, and that this invitation was the mistake of someone who was not accustomed to the fact that PISA meetings happen behind closed doors. Rather than creating a diplomatic accident and being turned back once I arrived in Latin America, I decided to inform the OECD staff.

Twenty-four hours later, walking down the busy central streets of Zurich with my grandmother, my smart phone suddenly went crazy. People I had only ever emailed and Skyped with were urgently calling, whatsapping, emailing and texting me all at the same time. Very politely I was told that all PISA-D partners had been consulted, and had agreed that an external participant would make country representatives feel uncomfortable when expressing their concerns. Worried this crisis would close the PISA-D doors forever, I made clear that I knew the rules of the game and had no intention of attending without everyone's acknowledgement and permission. It was worrying, especially since the next emails I needed to send required actual commitment to an interview appointment. I soon discovered that some doors were still open, but others had been firmly closed.

The OECD staff remained supportive. Indeed, upon insistence and increasingly shorter emails, I even obtained an interview with Andreas Schleicher: after not getting a reply to my many emails, I made one last attempt from the Berlin–Paris night train. Two minutes later, at midnight, I received an email from Andreas Schleicher suggesting I make an interview appointment via his secretary for the following week. Trying my luck further, I asked OECD staff if I could attend the next PISA-D IAG meeting (not the one I had been invited to by mistake). I was told that if I were to find myself in Paraguay on 31 March the following year, I could attend one day of the three-day meeting. I made sure I would be there.

At the same time, my requests for interview appointments with other contacts I had established through the OECD were rejected. Upon my insistence, further clarifications on my research project were requested. Given that ILSAs tend

to polarize (Gorur, 2017), it would be difficult to gain access if one outlined the research project as being pro- or anti- ILSAs. Indeed, Busby (2011) shows that interviewees provide access based on how they classify the research and researcher. I thus tried to position myself and the project as neutrally as possible, neither pro-ILSAs nor ILSA-sceptic. I re-wrote the outline of my research in such a way that my research participants would see it as a mutually beneficial research project. Once again, the doors that had been closed were opened, so widely that my research participants suggested they would like to discuss a consultancy opportunity. These consultancy offers from my research participants put me in an unexpected position, challenging me to think about my research focus from my research participants’ perspectives. Busby (2011) suggests this requires the researcher to switch roles and become an active participant in the process under investigation. As with the OECD consultancy offer, this one did not take place either. I was relieved. Although the support I received to carry out my research from the inside required extensive and diplomatic negotiating and trust building, my research participants were motivated by more than the relationship I built with them. In the case of the OECD, the organization committed from the beginning of PISA-D, to make the entire process as transparent as possible. For this reason, the OECD has published online all the documents and Power Points presented at the PISA-D meetings. Allowing and facilitating my research project can be understood as making the PISA-D process more transparent, but also a reflection of the OECD’s commitment to ILSA improvement and innovation through research. Allowing research from the inside is also a way for the OECD staff to tell their version of the story and influence scholarship on PISA-D. For The Learning Bar, supporting my research needs became an opportunity to draw on my expertise (in particular knowledge gained from fieldwork in Ecuador and Paraguay to understand how to respond to the PISA-D countries’ needs) but also to position themselves favourably with the OECD’s request, as was also the case for my contacts in Ecuador and Paraguay. In Ecuador, my presence was used to further the international legitimization which is sought through PISA: beyond the formal dinners and a busy agenda of high level meetings they arranged for me, I was asked to give lectures and photos were taken of me shaking hands with ministers and directors. I was also video-interviewed for video news items to be edited and disseminated. Articles were published claiming an international expert of international education policy approved of and commended the work being carried out with PISA in Ecuador. In the case of Paraguay, participating in PISA clearly fits with the global ritual of belonging (Addey, 2014; Addey &

Sellar, 2018) which interviewees suggested also includes attracting international scholars from prestigious universities (being based at Humboldt University in Berlin was considered very highly) to carry out research on Paraguay (being put on the global data map, includes the scholarly research map). These newly established contacts led to more contacts (known as the ‘snowball approach’ to access), thus allowing me to develop an extensive network of research participants and contacts. Within this network of research participants, I experienced different levels of access. Through the OECD I experienced supportive but guarded access: during the IAG meeting, OECD staff ensured all PISA-D participants at the meeting were informed by email about who I was and why I would be attending the meeting.8 Through some newly established contacts, I experienced unlimited access: during the IAG meeting some contacts went out of their way to make sure I was introduced to everyone and informed about everything that could be informative for my research project. Although my network of research participants and trust was expanding constantly, I was also aware that this access and trust were not stable and could easily be broken (Barbour & Schostak, 2005; Busby, 2011), requiring constant renegotiation and demonstrations of trustworthiness before, during and after9 the data gathering (Maxwell, 2005; Gains, 2011). This will be discussed in the ethics section in relation to sharing research findings with research participants and keeping doors open for further research. Finally, it is worth expanding on the relationship between the researcher and the researched as dynamics are often reversed with research participants who occupy high level positions and have significant decision-making power – as was the case in the PISA4D4Policy research project. Apart from constantly negotiating access and building delicate trust relationships, I was also taking decisions about how to position and present myself, evaluating the impact of relationships on the data I gathered, and reflecting on how all the issues related to access were impacting on the data I was gathering (as described by Busby, 2011). Young female scholars carrying out research in elite policy settings are faced with further methodological and ethical challenges. The data-gathering process requires continuously making decisions about how to position oneself, how to use that positioning and their gendered implications. Although some may criticize this as a form of manipulation of the data-gathering process, to deny

8 During the meeting, participants would say (to my surprise the first time), 'Oh, so you are the person the OECD emailed us about!'.
9 And still am with the writing of this methodology chapter.
these dimensions would be to deny the complex reality which scholars face in fieldwork. Being self-reflective and honest about the way we take these gendered decisions (i.e. dressing down to not appear threatening) can only contribute to methodological and ethical scholarship and the quality of empirical research. Both Grek (2011) and Ozga and Gewirtz (1995) suggest that junior, female researchers have eased access to elite interviewees (often a predominantly male group, as is the case in my research project) as they are perceived as non-threatening and harmless. Welch et al. refer to this as the seniority and gender gap (2002: 622), whilst Busby (2011) and Gurney (1985) discuss how these can be reinforced by the researcher’s dress code. Not naively, and on several occasions, I chose to reinforce these gaps by dressing young (choosing red sporty shoes rather than elegant black shoes) and feminine (skirts and dresses) with the intention of coming across as unthreatening, informal and relaxed (and if possible countering the impression I was evaluating the implementation of PISA-D for the OECD). Ozga and Gewirtz (1995) guess that presenting oneself as rather innocent and harmless and reinforcing these gaps when researching education policy is rarely discussed but likely an approach used by many. It is also worth mentioning here that I was also aware during this process that I could equally be categorized as an elite researcher, given my high level of education, my prestigious academic affiliation, professional background, abundant research funds, my social capital that allows me to easily socialize with influential ILSA actors, Western dress code, and ease with multiple languages and in particular native English. These elite researcher attributes ease the negotiating and maintaining of high levels of ILSA access.

Interviews and participant observations inside PISA-D

Scholars who have written about researching elites acknowledge the subjectivity of what elite means (Smith, 2006) and therefore focus on defining what is intended by their use of the word when identifying research participants. At the heart of this definition is scholars' understanding of power. Scholars appear to see power as either held or exercised: on the one side, scholars see power as being something structural that certain individuals and institutions have, thus identifying elites and institutions as those who possess power; on the other side, scholars take a more post-structural stance towards power, seeing it as something fluid that is exercised and that cannot be directly translated into the area under investigation nor into the relationships between the researcher and
the researched. Allen (2003) has argued that these understandings (amongst the many theorizations of power) see power as a thing rather than a relational effect of social interaction. Acknowledging these differences, it is important to define why and how I use the term ILSA elites to refer to the research participants in the PISA4D4Policy research project. In line with the methodological approach applied in the research project, Actor-Network Theory, power is understood as distributed and exercised rather than held. Although it may be tricky to define who exercises power before fieldwork, the research design identified the individuals who were involved in PISA-D as influential decision-makers. The research participants were part of three elite groups: high level decision-makers at the OECD, corporate elites in international firms, and political elites (either political appointments or high level policy actors working for the government in Ecuador and Paraguay). For confidentiality reasons, more information about who was interviewed and observed cannot be shared as further details would make them easily identifiable. These actors were in the position to influence key decisions in relation to what PISA-D would become, thus exerting significant power on the process under investigation. Many of these actors also held privileged and exposed positions in society which went beyond their influence on PISA-D, often visible individuals both inside and outside their organizations. The interviews I carried out were all semi-structured interviews which ranged from more personal accounts (on one occasion, I received what was more like a confession of all the problems and personal issues related to the interview questions) to more institutional accounts (in particular the interviewees who asked to not be anonymized). Most interviewees were willing to be voice recorded and a significant number of interviewees wanted to be interviewed together with other research participants (on one occasion one interviewee invited seven colleagues to assist her throughout the interview). This impacted on what was shared with me. For example, hierarchy amongst my research participants impacted on whether research participants were more open to critical reflections and deviating from the institutional narrative. On some occasions, formal photos were taken before and during the interviews, personal assistants sat through the interviews to take notes and assist my interviewee’s requests, and some of my interviewees also recorded the interview to document what they had said. I experienced different levels of gender and seniority gaps, and although there was a clear hierarchy (interviewees all held much more influential positions than I and were much older than I), these gaps were generally not emphasized by my interviewees. However, on a couple of occasions, interviewees did dominate the interview (by talking about self-elected topics and not taking into consideration

the questions being asked) and repeatedly challenged the questions I asked and my assumptions (i.e. ‘So what is your policy framework?’). Scholars (Richards, 1996; McDowell, 1998; Welch et al., 2002) argue that in elite interviews, this is a result of the power imbalances between researcher and researched, and that these asymmetries can be further exploited when there are gender and seniority gaps. Although the research data-gathering process was not designed to include participant observations (because PISA meetings are carried out behind closed doors and access by independent researchers is unheard of), three kinds of participant observation opportunities emerged during fieldwork. Firstly, information was shared with third parties (whilst I was carrying out interviews) and every day activities that related to my research project could be observed before, during and after the interviews (which mostly took place in the offices of my interviewees). Participant observations of this kind were so revealing that it entirely changed how the interview data could be understood (i.e. although I was told how participative the PISA-D processes were, I could observe to what extent participation was engaged with). Although it would be unethical to use this data without consent from interviewees, the depth these observations provided on the interviewees putting PISA-D into practice, certainly informed my understanding of the processes under investigation and the questions I went on to ask. Secondly, I was invited to attend one day of the IAG meeting in Paraguay. This occasion allowed me to observe the dynamics my interviewees had described, often shedding further light on what had been taken-for-granted (i.e. the way the OECD tries to hand over ownership of the PISA-D process to the PISA-D countries). Thirdly, attending the IAG meeting allowed me to network widely (leading to informative conversations), observe the socializing dynamics that happen around the IAG meeting in the lobby, breakfast room or during the breaks (i.e. who has lunch with whom, what is discussed at breakfast before the meetings, who is anxious or content about the meeting and why), and participate in the PISA-D cocktail parties and formal dinner parties. In order to access and make the most of these opportunities, a substantial part of my research grant was spent on accommodation in the five-star hotel where the IAG meeting was taking place. Even when participant observations cannot be used as data as it is too sensitive or because it is not cleared by the actors observed, the opportunities to talk to all those involved in PISA-D as a ‘native’ and listen to conversations, and observe how PISA-D actors interact, conveyed information on the process under investigation in a way that no other method could provide.

Immersion in the PISA-D group allowed me to observe the taken-for-granted, the non-verbal, the unquestioned and unarticulated attitudes and assumptions, and situated day-to-day practices of all PISA-D actors in their natural settings, thus nuancing the data gathered through interviews (Bevir & Rhodes, 2006; Forsey, 2004; Gains, 2011; Rhodes, 2005; Rhodes et al., 2007; Schatzberg, 2008). Complementing my interview data with participant observations of PISA-D processes from the inside, enriched what I could understand from the interview accounts. For example, some interviewees told me that English was a serious problem for them (it is worth noting that there is a problem in many countries of finding assessment experts who are also fluent in English) and that after PISA-D meetings, PISA-D country representatives sent each other whatsapp messages because they did not understand their assignments, only to find out that no one had understood because English acts as a barrier. Having observed the meeting (as a native English speaker and having taught English as a foreign language for many years), I can claim that some presentations were given in an English that only a native speaker could understand. It was therefore not surprising that only one question followed the above-mentioned presentation and that no one objected when the PISA-D secretariat asked if the decisions taken during the meeting could be agreed upon. Another example is the PISA-D cocktail party and the PISA-D dinner party in Paraguay. On this occasion I observed which groups of actors networked and socialized, how different actors took the opportunity to ask questions they had not asked during the meeting, how some actors’ concerns were silenced by more dominating actors, how those who had made critical comments during the meeting apologized to more influential PISA-D actors, how participation in PISA-D was made into a very prestigious international engagement, how the future of ILSAs was presented as key to global economic competitivity, how interviewees reflected on how PISA-D processes had to be changed given what they had observed, and the concerns, level of (dis)engagement, and (dis)contents of all those involved. I also observed how the different groups (aid partners, contractors, countries, technical partners) involved in PISA-D were rather blinded by their own priorities and concerns (i.e. tracking progress with Sharepoint software as opposed to discussing progress with participants; obtaining consensus on documents rather than discussing concerns). It was also informative to observe how the scandal (about the costs the Ministry of Education of Paraguay had covered for the PISA-D dinner and the coffee and water bottles during the meeting coffee breaks) that was on the front page of the main daily papers during the days of the meeting, was hardly discussed amongst participants. The Ministry of Education of Paraguay was accused of

corruption, and growing anger and student protests about how education funds were being spent exploded when a school roof fell in one week later, leading to the minister’s removal from office. Observing how PISA-D actors coped with the media scandal whilst they carried out the meetings was informative in terms of how ILSAs travel into different contexts, whilst maintaining expensive practices and remaining isolated from the public’s concerns. Thirdly, I gathered a deeper understanding of the process under investigation by chatting with research participants and hanging out with actors involved in PISA-D during brunches, drinks at bars, dinners, going for walks, chatting on WhatsApp and by Skype, and becoming friends with people whom I have met through the PISA4D4Policy research project. On these occasions, I got to know my interviewees in a way that revealed more about their approach, assumptions about and experience of PISA-D, the OECD, the private companies involved, and the participating countries, than they were willing to share during interviews (i.e. statements that had been made about Western neo-colonialism were dismissed during interviews). Interviewees also shared information which they did not consider relevant but which was greatly informative (i.e. when interviewees told me about offers they had been provided through the PISA-D network). Most of these occasions ended with my interlocutors making sure (indeed swearing over it when the information was particularly confidential) that what had been shared was off-record and they trusted me not to use it as data (demonstrating that access and ethics are a continuously negotiated process and concern). Although this information cannot be used, what was observed cannot be unseen and informs my understanding of the processes studied in PISA4D4Policy.

Elite data-gathering ethics

Although ethical guidelines10 are set out to ensure all research is carried out under the highest ethical standards, ethical dilemmas pepper all research processes. Throughout the process of researching PISA-D from the inside, I dealt with numerous ethical challenges. Already acquainted from previous research with the potential procedures interviewees faced in international organizations (i.e. the organization's lawyers had to revise my informed consent forms before UN staff could sign them) and in countries where research participants are

10 For example, the European Code of Conduct for Research Integrity, and the ESRC Framework for Research Ethics.

wary of signing documents (i.e. Laos and Mongolia), I gave my interviewees informed consent forms (outlining their rights) and the project outline, but did not insist on them signing the forms, as long as I could voice record their consent. Although I sought informed consent from my interviewees, the ESRC Framework for Research Ethics states that in the case of elite interviews (defined as senior people who have a public role – like ministers, or who represent the views of their general positions – like judges), 'formal written consent is not necessary because by consenting to see the researcher, the participant is in fact giving consent' (ESRC, 2015: 39). This, however, does not mean that the research participant should not be informed of her rights and about the research project aims, design and outcomes. Two research participants requested their interviews not be presented as anonymous, whilst others were concerned that I would not take any risks, seeking reassurance by asking how I would ensure their confidentiality. Interviewees with exposed positions in society are particularly concerned about what they share in an interview given their public role and image, and in part because they are more easily identifiable (Busby, 2011). Busby states that amongst elite interviewees, 'concerns about anonymity can be very real' (2011: 621), as it can lead to a direct impact on the rest of their career. As may be imagined, those who were not concerned with anonymity stuck closely to the institutional line, whilst those who were very concerned about their anonymity gave rather intimate accounts. Others claimed that anonymity would allow them to speak more freely. In order to take no risks, I avoided sharing any non-general information about my interviewees, changed details that may allow their colleagues to identify them, and mixed up genders. No interview tracks are linked to names anywhere, and interviews are coded only by affiliation or country, interview year and randomly attributed interviewee numbers. Although this is not common practice, I gave all my interviewees the opportunity to revise their interview transcripts in order to further protect them (those who were shocked by what they had shared with me did not change the transcript but sought confidentiality reassurances). Any mistake in confidentiality could, for example, have led to interviewees losing jobs or future opportunities, publicly being accused of making statements, or leading their institution into a crisis. This explains why on a number of occasions my interviewees also recorded the interviews to protect themselves in case I were to make false claims which could damage them or their institution. Scholars have discussed the ethical challenges involved in providing research participants with feedback (in the form of research findings) and maintaining access. On one occasion, towards the end of my data-gathering process, one
research participant (key to my fieldwork access) asked me about preliminary findings. Although I claimed it was very early, I managed to share my initial thoughts in a diplomatic way (i.e. all PISA-D actors seem too concerned by their own priorities and goals to manage to fully consider others’ priorities), only to find that my interviewee entirely agreed and proceeded to tell me more stories supporting my findings (as also experienced by Thomas, 1993; Busby, 2011). In order to maintain good relations and access (especially when the object of research is ongoing), the researcher may feel pressure to withhold information deemed too sensitive for the public domain or which may create distrust towards future research and researchers. Ozga and Gewirtz (1995) discuss the challenge of self-censorship and suggest researchers can come under pressure to meet the expectations of research participants in order to reciprocate the effort and support of the interviewee but also to retain access. I would add that researchers are also under pressure to present research findings in non-threatening ways in order to continue the dialogue with research participants, in the hope of leading to impact in the area under investigation. Finally, researchers of elite actors are also concerned about influential interviewees interfering and censoring research findings (Ostrander, 1993; Busby, 2011). All this requires careful ethical considerations whilst also ensuring an ethical academic commitment, which includes putting forward critical perspectives and challenging unjust social policies and practices. A final issue which is linked to feedback and maintaining access, is the relationships which the researcher develops with research participants (i.e. sharing the same networks or friendships). On some occasions, research participants wanted to make me feel welcome and our relationship went beyond the PISA-D interviews (i.e. introducing me to friends and family), thus breaking down the researcher–researched boundaries (Ozga & Gewirtz, 1995). This proximity often led to sharing information that was then cleared as ‘off-record’. Like, Ozga and Gewirtz (1995: 136), these naturally developing relationships left me feeling caught by the help and trust I have been given, thus further complicating the ethical considerations that challenge the researcher.

Concluding remarks on researching ILSAs from the inside

This chapter has sought to offer an insight into the methodological and ethical challenges of carrying out research inside the international assessment machine,
an area of investigation which is by ILSA nature both secretive and poorly accessible. The chapter started out telling the story of how the researcher gained access to carry out qualitative data gathering inside PISA for Development, which included the diplomatic management of two access crises but also consultancy offers to work on PISA-D. The chapter offers a reflection on how the researcher was perceived given the point of access, the reasons why research participants were willing to support the researcher, and how researcher– researched relationships were influenced by gender and seniority gaps. The chapter then drew on scholarship on elite research participants to define ILSA elites as influential decision-makers in the PISA-D process who often have exposed positions in society. The chapter told of interviewees who wanted to be interviewed with colleagues and how the hierarchies influenced what was shared, how interviewees recorded the interviews to protect themselves, and how some interviewees dominated interviews and challenged the questions being asked. The chapter also told of three kinds of participant observation opportunities which arose (participant observations of PISA-D meetings, PISA-D work in progress happening during the interview, and informal and off-record chats and shared experiences with interviewees) and deeply informed the ‘on record’ data. Finally, the chapter discussed the main ethical issues that emerged. These included confidentiality concerns of interviewees, the challenges of providing research participants with feedback, maintaining access, and the breaking down of the researcher–researched boundaries. The chapter has shown that the stories of ILSAs and the data and knowledge they generate present methodological and ethical challenges. The chapter also makes the case for further research from the inside to contribute to the innovation of global datafication projects through research but also to put forward critical perspectives and challenge unjust practices.

References

Addey, C. (2014), 'Why do countries join international literacy assessments? An actor-network theory analysis with case studies from Lao PDR and Mongolia', PhD thesis, School of Education and Lifelong Learning, Norwich: University of East Anglia.
Addey, C. (2016), 'PISA for development and the sacrifice of policy-relevant data', Educação & Sociedade, 37 (136): 685–706.
Addey, C. (2017), 'Golden relics & historical standards: How the OECD is expanding global education governance through PISA for development', Critical Studies in Education, DOI: 10.1080/17508487.2017.1352006.
Addey, C., & Sellar, S. (2018), 'Why do countries participate in PISA? Understanding the role of international large-scale assessments in global education policy', in A. Verger, M. Novelli, & H. K. Altinyelken (eds), Global Education Policy and International Development, London: Bloomsbury.
Allen, J. (2003), Lost geographies of power, Oxford: Blackwell.
Barbour, R. & Schostak, J. (2005), 'Interviewing and focus groups', in B. Somekh & C. Lewin (eds), Research methods in the social sciences, London: SAGE.
Bevir, M. & Rhodes, R. A. W. (2006), Governance stories, Abingdon: Routledge.
Bloem, S. (2013), PISA in low and middle income countries, Paris: OECD Publishing.
Busby, A. (2011), '"You're not going to write about that are you?" What methodological issues arise when doing ethnography in an elite political setting?', Sussex European Institute, Working Paper 125: 1–37.
Diem, S., Young, M. D., Welton, A. D., Mansfield, K. C., & Lee, P. (2014), 'The intellectual landscape of critical policy analysis', International Journal of Qualitative Studies in Education, 27 (9): 1068–1090.
ESRC (2015), ESRC framework for research ethics, UK: ESRC.
Forsey, M. (2004), '"He's not a spy; he's one of us": Ethnographic positioning in a middle class setting', in L. Hume & J. Mulcock (eds), Anthropologists in the field: Cases in participant observation, New York: Columbia University Press.
Gains, F. (2011), 'Elite ethnographies: Potential pitfalls and prospects for getting "up close and personal"', Public Administration, 89 (1): 156–166.
Gorur, R. (2017), 'Towards productive critique of large-scale comparisons in education', Critical Studies in Education, 58 (3): 341–355.
Grek, S. (2011), 'Interviewing the education policy elite in Scotland: A changing picture?', European Educational Research Journal, 10 (2): 233–241.
Gurney, J. N. (1985), 'Not one of the guys: The female researcher in a male-dominated setting', Qualitative Sociology, 8 (1): 42–62.
Maxwell, J. (2005), Qualitative research design: An interactive approach (2nd edition), London: SAGE.
McDowell, L. (1998), 'Elites in the city of London: Some methodological considerations', Environment and Planning A, 30 (12): 2133–2146.
OECD (2011), Using PISA to internationally benchmark state performance standards, Paris: OECD.
OECD (2012), OECD strategy on development, Paris: OECD.
OECD (2013), Beyond PISA 2015: A longer-term strategy for PISA, Paris: OECD.
OECD (2014a), 2014 report on the implementation of the OECD strategy on development, Paris: OECD.
OECD (2014b), Call for tender PISA for development strand A and B, Paris: OECD.
O'Reilly, J. (2009), Key concepts in ethnography, London: SAGE.
Ostrander, S. A. (1993), '"Surely you're not in this just to be helpful": Access, rapport, and interviews in three studies of elites', Journal of Contemporary Ethnography, 22 (1): 7–27.
Ozga, J. & Gewirtz, S. (1995), 'Sex, lies and audiotape: Interviewing the education policy elite', in D. Halpin & B. Troyna (eds), Researching education policy: Ethical and methodological issues, London: Falmer Press.
Rhodes, R. A. W. (2005), 'Everyday life in a ministry: Public administration as anthropology', American Review of Public Administration, 35 (1): 3–25.
Rhodes, R. A. W., 't Hart, P., & Noordegraaf, M. (eds) (2007), Observing government elites: Up close and personal, Houndmills, Basingstoke: Palgrave-Macmillan.
Richards, D. (1996), 'Elite interviewing: Approaches and pitfalls', Politics, 16 (3): 199–204.
Schatzberg, M. (2008), 'Seeing the invisible, hearing silence, thinking the unthinkable: The advantages of ethnographic immersion', in Political methodology: Committee on concepts and methods: Working Paper Series 18.
Smith, K. (2006), 'Problematising power relations in "elite" interviews', Geoforum, 37: 643–653.
Thomas, R. J. (1993), 'Interviewing important people in big companies', Journal of Contemporary Ethnography, 22 (1): 80–96.
Welch, C., Marschan-Piekkari, R., Penttinen, H., & Tahvanainen, M. (2002), 'Corporate elites as informants in qualitative international business research', International Business Review, 11: 611–628.

2

Assessment imaginaries: methodological challenges of future assessment machines

Ben Williamson

Introduction

The current digitization and datafication of educational practices raises significant methodological challenges for studies that are designed to generate insider perspectives into the 'machinery' of assessment. These challenges are exacerbated by current ongoing attempts to reimagine assessment for the future. In particular, new approaches to such empirical methodological traditions as ethnography may be required to understand how digital systems that generate data about people also penetrate into their lives with potentially significant consequences. Ethnographic styles of research usually involve the researcher participating in people's daily lives, observing what happens, listening to what is said, and collecting documents and artefacts, in order to investigate the lives of the people being studied (Hammersley & Atkinson, 2007). Textual documents are important as they construct 'facts', 'records', 'diagnoses', 'decisions', and 'rules' which are involved in social activities – as 'documentary constructions of reality' – while other material artefacts are important as much social activity involves the creation, use and circulation of objects (Hammersley & Atkinson, 2007: 121). Increasingly, researchers are now required to take into account another kind of artefact – digital data. The study of digital data and its impacts on people's lives raises significant challenges for empirical social science methodologies because, increasingly, systems of datafication may be understood to be playing their own ethnographic role by participating in people's lives, observing and listening to them, collecting and analysing the various documents and materials they encounter, and then
producing ‘records’ and ‘facts’ of the ‘reality’ of their lives which might be used for subsequent forms of diagnosis and decision-making. A focus on digital data in this sense highlights how human experiences are always imbricated in nonseparable networks of human and nonhuman, which each shape each other in mutually constitutive relationships (Latour et al., 2012). Systems of digital data collection, analysis and circulation are becoming significant to the social scientific gaze for three reasons. (1) Data collection is now a routine occurrence in people’s daily lives and many ‘ordinary’ contexts (Kennedy, 2016). (2) Data are compiled into new digital records, hosted on servers and processed with software, as ‘profiles’ that contain the ‘facts’ of people’s lives and also lead to diverse kinds of classifications and diagnoses, and even to automated decision-making that might change how people lead their lives (van Dijck & Poell, 2013). (3) New ‘digital data objects’, or digitally stored information, now have their own active ‘social lives’ as socially and technically produced objects that emerge, travel, do things, have effects, and change as they simultaneously represent people and things even as they come to intervene in their lives (Lupton, 2015). The concept of digital data object draws attention to how digitally produced data – such as a data visualization, a social media post, or an online user profile – are the coming-together of a network of coordinated actions, devices, documents, humans and materials (Rogers, 2013). Recognizing that digital data objects have their own social lives draws attention to data as combinations of human and nonhuman networks of activity and relations (Marres, 2012). In this sense, the concept of digital data object usefully signifies the composite and relational characteristics of digital data, the past lives that produced them and the productive lives they lead as they weave into people’s lives and experiences. But it also raises methodological challenges for study since data objects may be difficult to collect, gather and store; they are often the products of sociotechnical processes that are hidden behind interfaces, proprietary software and intellectual property; they are mutable as data are added, combined, calculated, and analysed to produce new outcomes; and they owe their existence to highly opaque code, algorithms and analytics methods that are the expert domain of computer scientists, statisticians and data analysts rather than qualitative social science researchers. Understood as having social lives that interpenetrate the lives of humans and other non-digital material life, digital data objects ought clearly to be subjects of empirical social scientific study. This is the case particularly when digital data objects are produced that constitute a profile of a person’s actions, as is the focus of this chapter. To be clear, the chapter explores the ways that data profiles are

created when people interact with assessment technologies, and understands these user-data profiles to be the mutable and combinatorial product of multiple interpenetrating digital data objects. Profiles themselves are data objects, but also the composite and aggregated product of other data objects as they are combined, added together, analysed and acted upon (Ruppert, 2012). But how can we study the social lives of the digital data objects produced through assessment technologies as they combine to produce proxy profiles of people that might then be mobilized for decision-making and intervention? What challenges emerge when the lives of the people being studied appear as data profiles composed from agglomerations of digital data objects that have their own lively existence and actively intervene in the lives of those people? Though much social scientific research is concerned with the lives of the people being studied, with new data-driven forms of assessment the object of study might also be the digital data objects that are produced and aggregated into profiles about people, and the underlying methodological processes involved in producing those data as records of people’s lives. The chapter combines insights from science and technology studies (STS) and emerging studies of digital methods and data to examine the recent emergence of a new specialized field of ‘education data science’, focusing on the methodological challenges of gaining ‘insider’ perspectives on emerging digital assessment practices, the digital data objects they produce, and their consequences. Within the emerging field of education data science, new forms of real-time computer-adaptive assessment, learning analytics, and other forms of ‘data-enriched assessment’ have been proposed as future alternatives to conventional testing instruments and practices. These systems create digital data objects that combine into learner profiles, create ‘actionable intelligence’ about their progress, and make them amenable to future decision-making and intervention. Ultimately, the commitment of education data science to forms of data mining and analytics makes new forms of assessment into a constant presence within educational courses and classrooms, where assessment is reconceived not as a temporally fixed event but as a real-time, adaptive and semi-automated feedback process within the pedagogic apparatus itself (Thompson, 2016; Williamson, 2016). For education data science, the idealized future of assessment is one that is analytic, at least partly automated, predictive, adaptive, and ‘personalized’ (Bulger, 2016). The chapter will document and map some of the actors and organizations, technologies and discourses that are coming together in the reimagining of assessment. It will interrogate how a particularly ‘desirable future’ for assessment

is being constructed, stabilized and disseminated by powerful actors associated with the educational data science field. The chapter identifies specific sites and also their wider networks for future insider studies of emerging assessment practices. Finally, the chapter summarizes some of the methodological opportunities and challenges for researchers of assessment practices of new approaches to social scientific analyses of methods and digital data.

Education data science

Big data and the data science methods required to collect, store and analyse it have become a significant interest in education in the last few years (see Williamson, 2017). In particular, the new field of 'education data science' has emerged as a professional and technical infrastructure for big data collection and analysis in education (Pea, 2014). For education data scientists, the main task is the production and analysis of digital data objects from learning activities and environments, and particularly the aggregation of these diverse data objects into models or profiles of learners that might be used to inform decision-making. Understood as a system of interrelated technologies, people, and social practices, education data science consists of:

Professional expertise: computer engineering, data science, statistics, cognitive psychology, neuroscience, learning science, psychometrics, bioinformatics, artificial intelligence
Techniques and methods: data mining, text mining, learner modelling, machine learning, predictive and prescriptive analytics, network analysis, natural language processing
Applications: recommendation engines for learning, learning analytics, adaptive learning platforms, wearable biometric sensors, data-enriched assessment, computer-adaptive testing
Normative aspirations: personalized learning, social networked learning, optimizing learning, actionable intelligence

The dominant focus of much education data science in its early years has been on measuring and predicting student progress and attainment, and then on ‘optimizing’ learning and the environments in which it takes place (Baker &

Siemens, 2013). Its disciplinary and methodological origins are in a mixture of computer science and psychological approaches to learning, or ‘learning sciences’, and it has become seen as the R&D community dealing with big data in education (Piety, Hickey, & Bishop, 2014; Cope & Kalantzis, 2015, 2016). The ‘social life’ of education data science as a field can be found in a mixture of academic departments in universities and commercial educational technology companies. The Lytics Lab (Learning Analytics Laboratory) at Stanford University has been a key site for education data science development and advocacy (Pea, 2014), as has the Society of Learning Analytics Research (SoLAR) (Siemens, 2016). The ‘edu-business’ Pearson (Hogan et al., 2015) has been an especially enthusiastic commercial supporter of education data science, which has included the establishment of research centres and commercial partnerships with technology companies dedicated to digital data analytics, adaptive learning and new assessment technologies (Williamson, 2016). Education data science must therefore be viewed as a methodological field of innovation that crisscrosses academic and commercial R&D settings. It is actively producing new imaginaries of learning and assessment, while also seeking to operationalize its vision through specific technical innovations.

Imaginaries, methods and critical data studies

Studying the new and emerging assessment technologies associated with education data science requires a combination of methods and conceptual orientations. Recent social scientific conceptualizations around 'sociotechnical imaginaries', 'social lives of methods', and 'critical data studies' are important to the study of new assessment technologies.

Sociotechnical imaginaries

Sociotechnical imaginaries have been defined as 'collectively held, institutionally stabilized, and publicly performed visions of desirable futures, animated by shared understandings of forms of social life and social order attainable through, and supportive of, advances in science and technology' (Jasanoff, 2015: 4). Sociotechnical imaginaries constitute the visions and values that catalyse the design of technological projects. The capacity to imagine the future is becoming a powerful constitutive element in social and political life, particularly as it infuses the technological visions and projects of global media companies (Mager, 2016).

A specific emerging sociotechnical imaginary has been identified in relation to digital data analytics technologies. Rieder and Simon (2016: 4) have detailed four specific characteristics of a sociotechnical ‘big data imaginary’: (1) extending the reach of automation, from data collection to storage, curation, analysis and decision-making process; (2) capturing massive amounts of data and focusing on correlations rather than causes, thus reducing the need for theory, models, and human expertise; (3) expanding the realm of what can be measured, in order to trace and gauge movements, actions, and behaviours in ways that were previously unimaginable; and (4) aspiring to calculate what is yet to come, using smart, fast, and cheap predictive techniques to support decisionmaking and optimize resource allocation. Through this big data imaginary of ‘mechanical objectivity’, they suggest, advocates of big data are ‘applying a mechanical mindset to the colonization of the future’ (Rieder & Simon, 2016: 4). Beyond being visions of the future, sociotechnical imaginaries animate and instantiate sociotechnical materialities. Thus, while sociotechnical imaginaries ‘can originate in the visions of single individuals or small collectives’, they can gather momentum ‘through blatant exercises of power or sustained acts of coalition building’ to enter into ‘the assemblages of materiality, meaning and morality that constitute robust forms of social life’ (Jasanoff, 2015: 4). The term ‘assessment imaginary’, then, designates the seemingly ‘desirable’ future for datadriven assessment that its growing coalition of supporters in the education data science community believe to be attainable, and which they are seeking to realize through specific technical projects. Although the emerging assessment imaginary detailed below remains in large part a set of future visions and projections of ‘desirable’ developments, many of the technical and methodological aspects of emerging assessment technologies are already in development. It is essential, therefore, not just to trace the imaginary of assessment but the methods already being developed to materialize it in practice.

Social lives of methods

Assessment technologies are thoroughly methodological insofar as they are designed as ways of collecting and analysing learner data and communicating findings and insights into their processes of learning or their learning outcomes. Previous ethnographic studies from a science and technology studies (STS) perspective have detailed the complex methodological work involved in large-scale assessments such as the OECD's Programme for International Student Assessment (PISA), where 'PISA scientists' working in the 'PISA laboratory'
must sort, classify, categorize and calculate assessment data mathematically to produce ‘PISA facts’ (Gorur, 2011: 78). Many of the psychometric methods underpinning existing testing technologies such as ‘e-assessment’ have also been the subject of detailed empirical study (e.g. O’Keeffe, 2016). New, emerging and imaginary future assessment technologies, however, have received less critical scrutiny specifically in relation to the underlying methods that produce, store, sort, and analyse the assessment data in order to produce evidence and insights for intervention (Thompson, 2016). In sociology, attention has been turned recently on the data analytics methods that enact many social media platforms and websites (Marres, 2012). For some, new kinds of data analytics pose a challenge to existing empirical methods of the discipline itself, as they have been built with the capacity to collect and analyse social data at scales and speeds unimaginable to sociologists (Burrows & Savage, 2014). As a consequence, the new methods of data mining, analytics and other ‘digital methods’ enacted by various media have become the subject of concerted critical scrutiny, with questions raised about how they track and report on social trends, events, and individual actions, and how they then turn these things into digital data objects that might be used as ‘actionable intelligence’ to inform future intervention (Rogers, 2013). Building on previous research from STS and the sociology of measurement which have detailed the scientific and mathematical processes through which scientific knowledge and calculable forms of measurement have been made possible and enacted (Latour, 1986; Woolgar, 1991), a particular line of sociological inquiry has begun to focus on the ‘social life of methods’ in relation to digital data (Savage, 2013). Such studies focus on the ways in which different methods have been invented and deployed by particular social groups and organizations, querying the intentions and purposes to which they have been put and the consequences of the findings that are produced as a result (Ruppert, Law, & Savage, 2013). The focus on methodological devices is not simply concerned with their accuracy or ethical appropriateness, ‘but rather their potentialities, capacities and limitations, how they configure the objects they are attempting to study and measure and how they serve political purposes’ (Lupton, 2015: 49). As such, methods are seen as socially configured, but also socially consequential insofar as methods ‘do things’ to make their objects of study visible, intelligible, and even possible to act upon or intervene-able – that is, methods are performative (Desrosieres, 2001). As with previous STS studies of assessment focused on the performativity of the statistical methods that enable education to be measured, the social lives of assessment technologies and the

digital data objects they produce therefore need to be scrutinized for how they might act to reconfigure the very things they purport to assess, since ‘as soon as the measurement exercise begins, it acts upon the world, changing priorities and influencing behaviours, policies and practices’ (Gorur, 2015: 582). The performativity of the methods is especially critical in relation to some of the ‘adaptive’ analytics systems described shortly, which not only perform real-time assessment of their users but also actively adapt in response to analysis of their activity datastreams to then intervene in, pre-empt and change behaviours.

Critical data studies

Most of the new assessment technologies described below fundamentally depend on some form of algorithmic data analytics. Understanding such systems therefore requires research that is critically attentive to the ways in which data analytics function. 'Critical data studies' have opened up ways of approaching data as an object of attention in social sciences, philosophy and geography, again building on existing social scientific studies of measurement technologies and expanding to digital data, analytics and algorithms as the specific objects of study (for a recent overview of studies of metrics and measurement see Beer, 2016). Loukissas (2016), for example, highlights the importance of 'taking big data apart', by which he means focusing ethnographic attention on the local contexts in which data are produced and used, and seeking to trace how locally produced data from distributed origins become agglomerated into big datasets. Focusing 'attention to the local offers an unprecedented opportunity to learn about varied cultures of data collection brought together in Big Data … for wherever data travel, local communities of producers, users, and non-users are affected' (Loukissas, 2016: 2). By taking big data apart and tracing it back to its local origins, the myriad digital data objects that constitute big data sets may become visible for ethnographic study. The agglomeration of distributed data objects into large datasets is also the focus of Kitchin and Lauriault (2014), whose proposals for critical data studies seek to provoke researchers to interrogate the complex 'assemblages' that produce, circulate, and utilize data in diverse ways. Data assemblages, as they define them, consist of technical systems of data collection, processing and analysis, but also the diverse social, economic, cultural and political apparatuses that frame how they work. Their attention, therefore, is not so much on the local origins of big data, but on the wider contexts in which data are produced, circulated and used.

How to adapt such concerns for critical data studies to the study of data-driven assessment technologies? In the following sections, I attempt to outline and illustrate some ways forward. The aim in providing this overview is to open up new kinds of assessment technologies for critical study, focusing on the challenges of investigating digital data objects and the ways they form into student data profiles that may be used for forms of diagnosis, decision-making and intervention.

Sociotechnical assessment imaginaries

A productive way to approach the study of new assessment technologies is to interrogate the vision of the future that has catalysed their development. Textual documents are ideal sources for locating and identifying imaginaries. In 2014, the global 'edu-business' Pearson published a visionary report detailing its vision of a 'renaissance in assessment' (Hill & Barber, 2014). At the core of its vision is a commitment to the use of data management and analysis techniques within 'next-generation learning systems':

Next-generation learning systems will create an explosion in data because they track learning and teaching at the individual student and lesson level every day in order to personalise and thus optimise learning. Moreover, they will incorporate algorithms that interrogate assessment data on an ongoing basis and provide instant and detailed feedback into the learning and teaching process. (Hill & Barber, 2014: 7)

The kinds of algorithmic assessment technologies promoted in the report include 'adaptive testing' that can 'generate more accurate estimates of student abilities across the full range of achievements'; online environments that can facilitate 'the collection and analysis in real-time of a wide range of information on multiple aspects of behaviour and proficiency'; and application of 'data analytics and the adoption of new metrics to generate deeper insights and richer information on learning and teaching' (Hill & Barber, 2014: 8). Underpinning the vision of the report is the aspiration to embed assessment systems within the administrative and pedagogic infrastructure of the school, rather than to see assessment in terms of temporally discrete, subject-based and obtrusive testing. From the perspective of trying to get inside the assessment machine, how should we approach Pearson's visionary 'renaissance in assessment'? For starters, subjecting its report to documentary discourse analysis is a valuable method for understanding the sociotechnical imaginary it projects. It is through the
production of seductive and glossy documentary materials that powerful individual organizations (such as Pearson) are able to gather supportive coalitions around seemingly 'desirable' futures that their promoters believe ought to be attained. The Lytics Lab at Stanford University has produced its own vision of 'the future of data-enriched assessment':

Big data, in the context of assessment, is learner data that is deep as well as broad. Large amounts of data can occur not only across many learners (broad between-learner data), but also within individual learners (deep within-learner data). … [N]ew forms of data-enriched assessment require collecting deeper and broader data in order to gain insight into the new object of assessment. (Thille et al., 2014: 6)

In the Lytics Lab vision of the future of data-enriched assessment, the emphasis is on assessment as a 'continuous' task where an individual's learning process can be observed continually and there is 'no need to distinguish between learning activities and moments of assessment' (Thille et al., 2014: 7). In addition, data-enriched assessment is envisaged as being 'feedback-oriented' so that information can be provided to either the learner or the teacher 'to make a choice about the appropriate next action', or so those choices can be delegated to a system such as an adaptive test or an 'intelligent tutor' to undertake (Thille et al., 2014: 7). As such, it is assumed that 'large-scale data collection allows researchers to more effectively use modern statistical and machine learning tools to identify and refine complex patterns of performance', but also that 'big data allows educators to build and refine model-driven feedback systems that can match and surpass human tutors' (Thille et al., 2014: 13). The kinds of feedback-driven computer-adaptive testing systems described by both Pearson and Lytics Lab work by conducting a continuous, automated and real-time analysis of an individual's responses to a particular test or exam, using the data generated under test conditions to make predictions about future progress through it. The system measures whether responses to questions are correct, how long the test-taker takes to respond, and then automatically adapts the course of the test in response. As such, 'one of the somewhat utopian promises of CAT measures is that they can operate as "teaching machines" because they can respond far more quickly to student patterns than an individual teacher or a conventional test' (Thompson, 2016: 4). Data-enriched assessment analytics and computer-adaptive testing are therefore being put to work to produce digital data objects from testing situations which can be analysed by the system in order to produce student data profiles for which real-time adaptive responses can be automatically generated.
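The feedback loop being described here (score a response, update an estimate of the test-taker, select the next item accordingly) is simple enough to be sketched in a few lines of code. The sketch below is illustrative only and assumes a basic Rasch-style model: the item bank, the ability-update rule and every name in it are invented for this example rather than taken from Pearson's, Knewton's or any other provider's systems.

```python
import math
import random

# Illustrative item bank: difficulties on a logit scale (invented values).
ITEM_BANK = {
    "item_01": -2.0, "item_02": -1.5, "item_03": -1.0, "item_04": -0.5,
    "item_05": 0.0, "item_06": 0.5, "item_07": 1.0, "item_08": 1.5,
    "item_09": 2.0,
}

def p_correct(theta, difficulty):
    """Rasch (1PL) model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def next_item(theta, administered):
    """Choose the unused item whose difficulty is closest to the current
    ability estimate (the most informative item under the 1PL model)."""
    remaining = {k: d for k, d in ITEM_BANK.items() if k not in administered}
    return min(remaining, key=lambda k: abs(remaining[k] - theta))

def run_adaptive_test(respond, n_items=5):
    """respond(item_id) must return (correct: bool, seconds: float)."""
    theta = 0.0   # start from a neutral ability estimate
    profile = []  # the accumulating data profile for this test-taker
    for _ in range(n_items):
        item = next_item(theta, {e["item"] for e in profile})
        correct, seconds = respond(item)
        # One Newton-style update of the Rasch likelihood for this response
        # (a deliberate simplification, not any vendor's actual algorithm).
        p = p_correct(theta, ITEM_BANK[item])
        theta += ((1.0 if correct else 0.0) - p) / max(p * (1.0 - p), 1e-3)
        profile.append({"item": item, "correct": correct,
                        "seconds": seconds, "theta_after": round(theta, 2)})
    return theta, profile

def simulated_student(item_id, true_ability=0.8):
    """Stand-in for a real test-taker, used only to make the sketch runnable."""
    correct = random.random() < p_correct(true_ability, ITEM_BANK[item_id])
    return correct, round(random.uniform(5.0, 40.0), 1)

if __name__ == "__main__":
    estimate, log = run_adaptive_test(simulated_student)
    print(f"final ability estimate: {estimate:.2f}")
    for event in log:
        print(event)
```

Even a toy loop like this makes the wider point tangible: the test is simultaneously a data-production device, since each pass through the loop appends another record to the test-taker's profile.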

As this brief survey of key texts on assessment technologies indicates, the assessment imaginary underpinning education data science is one that emphasizes data analytics, adaptive software driven by machine learning, forms of algorithmic automation, and ‘predictive’ pedagogic recommender systems, all of which are intended to support ‘personalized learning’ by subjecting students to a constant process of real-time assessment, analysis, and automated feedback. However, while mining imaginaries is useful for understanding the catalysts and drivers of technical development, we also need ways of approaching the sociotechnical methods of assessment technologies themselves.

Social life of assessment methods

As STS and studies of ‘the social lives of methods’ have shown, interrogating the origins and intentions ascribed to methods can help to illuminate some of the methodological developments that underpin the assessment technologies that produce student profiles from the constant collection of digital data. The real-time feedback mechanisms of big data-driven assessment are sometimes captured in the terms ‘learning analytics’ and ‘adaptive learning platforms’. The domain of learning analytics extends forms of adaptive assessment to become synchronous with the pedagogic routines and rhythms of the course and the classroom. Whereas CAT constructs digital data objects from temporally discrete test situations, learning analytics construct digital data objects from the many thousands of individual data points that are generated as students engage with courseware, e-learning software and online materials and platforms. The learning analytics and adaptive learning platform provider Knewton claims it collects ‘data that reflects cognition’ (Knewton, 2013: 9). Real-time and predictive learning analytics therefore generate digital data objects synchronously with pedagogic tasks and activities. Each task produces micro-objects, such as a record of a keystroke, a mouse-click, a response to a question, and so on. Those micro-objects can then be aggregated into individual profiles, or the digital data objects that represent a person within a database. From there, digital data objects can become the subject of particular methodological techniques which then produce outcomes that act back on the individual being tracked, in the shape of being assigned to an appropriate group or routed down a ‘personalized’ learning pathway. Two such methods integral to learning analytics and adaptive learning platforms are ‘cluster analysis’ and ‘knowledge graphing’. Cluster analysis, in basic

terms, consists of mathematical or algorithmic techniques for organizing data into groups of other similar objects. It is often used in quite exploratory stages of data analysis and research, as a way of seeking patterns in masses of disorganized information. The fundamental rule of cluster analysis is that it groups together objects that appear to have a high degree of similarity and association. It is concerned with finding structures between data points, but, crucially, cannot proceed without at least some degree of prior organization, such as setting parameters – for instance, number of clusters and their density – and establishing criteria for how data will be clustered together. As Perrotta and Williamson (2016) have noted, prior hypotheses about how the data might be grouped are explicitly built in to the mathematical processes that partition and group the data in cluster analysis. Therefore, cluster analysis does not necessarily ‘discover’ patterns in data, but can actively construct a structure as it calculates distance between data points according to pre-established criteria and parameters. This matters, because learning analytics platforms routinely assign individual students to particular groups or clusters based on an analysis of their data profiles in relation to large groups of student data. In a technical white paper on the methods underpinning its platform, Knewton (2013: 5), for example, describes how it uses ‘hierarchical agglomerative clustering’ as a method of analysis from data mining to detect ‘structures within large groups’ and then ‘build algorithms that determine how students should be grouped and what features they should be grouped by’. Claims to mathematical objectivity surround the use of these systems. Yet the technical-mathematical mechanism of clustering relies to some extent on subjective judgements and decisions about how the data should be organized and grouped. This might affect whether students are assigned to groups considered ‘at-risk’ or ‘high-achieving’, for example, and could affect how educators treat them as a consequence of those statistical determinations. ‘Knowledge graphing’ is another learning analytics method. Knewton’s technical paper describes ‘inference on probabilistic graphical models’ and ‘knowledge graphing’ methods that involve the organization of course content into discrete nodes or ‘knowledge chunks’ that can then be algorithmically sequenced into ‘personalized playlists’ that have been calculated to be most appropriate for each individual. ‘Knewton’s continuously adaptive learning system’, it claims, ‘constantly mines student performance data, responding in real-time to a student’s activity on the system’ (Knewton, 2013: 5). It achieves its adaptivity again through the deployment of methods with a long prior social life in scientific settings. In particular, Knewton is ‘inspired by Hermann Ebbinghaus’s work on memory retention and learning curves’, describing how

‘Knewton data scientists have used exponential growth and decay curves to model changes in student ability while learning and forgetting’ (Knewton, 2013: 7). Ebbinghaus was a psychologist working in the late 1800s on memory and forgetting, and influentially conceptualized both the ‘learning curve’ – how fast one learns information – and the ‘forgetting curve’, or the exponential loss of information one has learned. The recommendation engine that powers Knewton’s adaptive platform is therefore based on a mathematical formalization of a nineteenth-century psychological theory of memory retention and decay which it has adapted into its knowledge graphing system. It expresses this theory of memory as a calculable formula and visualizes it in graphical learning and forgetting curves, thus refracting and reproducing a psychological conception of how memory underpins learning first formulated over a century ago through the infrastructure of twenty-first century data analytics. The examples of cluster analysis and knowledge graphing indicate how approaching assessment methods as having past social lives can help to illuminate how digital data objects are produced. However, data objects are not just produced; they are also productive in the performative sense of changing how the things they represent are treated and changed as a consequence. The objective of learning analytics is that individuals’ learning experiences might be ‘personalized’ as their profiles are compared with whole data sets to produce ‘norm-averaged’ inferences about individual student progress. Again, Knewton (2013) has described how: inferred student data are the most difficult type of data to generate – and the kind Knewton is focused on producing at scale. Doing so requires low-cost algorithmic assessment norming at scale. Without normed items, you don’t have inferred student data.
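The ‘exponential growth and decay curves’ invoked here can be written down in a few lines. The sketch below is a generic rendering of an Ebbinghaus-style forgetting curve, not Knewton’s proprietary implementation: retention is assumed to decay as exp(-t/S), where t is the time since study and S is an invented ‘memory strength’ parameter, and a review of a ‘knowledge chunk’ is scheduled once predicted retention falls below a chosen threshold.

```python
import math

def retention(t_days, strength):
    """Ebbinghaus-style forgetting curve: predicted recall probability t days after study."""
    return math.exp(-t_days / strength)

def days_until_review(strength, threshold=0.7):
    """Solve retention(t) = threshold for t: the point at which a review would be scheduled."""
    return -strength * math.log(threshold)

# A stronger memory trace (e.g. after repeated successful practice) decays more slowly,
# so the next review of that 'knowledge chunk' is pushed further into the future.
for strength in (1.0, 3.0, 10.0):
    print(strength, round(days_until_review(strength), 2))
```

Scheduling review in this way is the move described above: a nineteenth-century psychological formula is re-expressed as a calculable rule that sequences ‘personalized playlists’ of content.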

In this sense, systems such as learning analytics are norm-based systems that compare individual student data profiles with norms inferred from massive global datasets, stripped of context. The inferred and norm-referenced knowledge about students can then be used by assessment technologies to produce probabilistic predictions and ‘actionable intelligence’ in terms of interventions to pre-emptively change individuals’ learning pathways. These changes can be made automatically, as in the case of clustering and knowledge graph techniques which have the capacity to assign students to groups or assign personalized pathways through specific content based on students’ digital traces. Within the assessment approaches of Knewton and similar organizations, then, learners are approached as collections of data points which have been

aggregated, via a wide range of methods and techniques, to produce digital data objects that cohere into temporarily stabilized profiles. The profile produced by data mining and analytics has sometimes been described as a ‘data double’ or a ‘digital shadow’ (Raley, 2013), though these terms perhaps occlude how a data profile is only ever a temporary proxy for a person, as data can be recalculated and recombined to produce different configurations with different results (Ruppert, 2012). As a result, the methods used to create the profile are consequential for how the people they represent are known and intervened-upon, and how their future actions may be shaped and modified. For these reasons, finding ethnographic ways to approach the mutable digital data objects produced out of learner activity, and which are used as proxy profiles by which to understand and act upon them, is an important methodological priority and challenge.
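One way to appreciate how much construction goes into such profiles is to sketch the pipeline in code. The example below is hypothetical throughout: the event records, the two features (accuracy and mean response time) and the labelled cluster centres are all invented. It simply illustrates, in schematic form, the general technique discussed in this section, in which micro-objects are aggregated into a temporary profile and the profile is then assigned to a pre-defined group; note how much prior structure (which features, which groups, what counts as ‘at risk’) is decided before any ‘pattern’ is found.

```python
from collections import defaultdict

# Invented micro-objects: one record per answered item in some courseware.
events = [
    {"student": "s1", "type": "answer", "correct": True,  "seconds": 12},
    {"student": "s1", "type": "answer", "correct": False, "seconds": 55},
    {"student": "s2", "type": "answer", "correct": True,  "seconds": 8},
    {"student": "s2", "type": "answer", "correct": True,  "seconds": 10},
]

def build_profiles(events):
    """Aggregate event micro-objects into per-student feature vectors (accuracy, mean time)."""
    grouped = defaultdict(list)
    for e in events:
        if e["type"] == "answer":
            grouped[e["student"]].append(e)
    return {s: (sum(a["correct"] for a in answers) / len(answers),
                sum(a["seconds"] for a in answers) / len(answers))
            for s, answers in grouped.items()}

# The cluster centres and their labels are fixed in advance (the 'prior organization'
# discussed earlier), so assignment imposes a structure rather than discovering one.
CENTRES = {"at risk": (0.4, 45.0), "on track": (0.9, 12.0)}

def assign(profile):
    """Assign a profile to the nearest labelled centre (squared Euclidean distance)."""
    return min(CENTRES, key=lambda label: sum((p - c) ** 2
                                              for p, c in zip(profile, CENTRES[label])))

for student, profile in build_profiles(events).items():
    print(student, profile, "->", assign(profile))
```

Re-running the same pipeline with different features or differently placed centres would yield different groupings from identical events, which is one concrete sense in which the resulting data profile is only ever a temporary proxy for the person.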

Critical assessment data studies

One way of exploring how the digital data objects that constitute student profiles are produced is to interrogate the specific political and economic contexts in which they are generated. Working from a critical data studies perspective, van Dijck (2013) has shown how any individual system or platform of data collection is part of a much wider ‘ecosystem’ of norms, social relations and institutions. From a critical data studies perspective, then, we should also study the wider political economy framing of data-enriched assessment and learning analytics. For example, organizations such as Lytics Lab, Pearson and Knewton have become influential actors in the field of education data science, using their financial power and social influence to build systems that are based on specific underlying theories of learning and memory in ways which are intended to shape how processes of learning are defined, measured and understood (Williamson, 2017). Their methodological capacity to undertake big data-driven forms of assessment, therefore, lends them a significant degree of power in relation to the production of knowledge in the field of education more broadly, especially as education data science is increasingly treated as an authoritative source of educational insight (Piety, Hickey, & Bishop, 2014). Pearson is a particularly instructive example of how systems of data collection and analysis are framed by much wider data assemblages or ecosystems that include political economy dimensions. In recent years Pearson has become not just an ‘edu-business’ with commercial ambitions but an important ‘policy actor’ with a ‘network of interests and objectives’ that stretch its role

‘across all aspects of the education policy cycle, from agenda setting, through policy production and implementation to evaluation’ (Hogan et al., 2015: 62). Moreover, it is able to identify apparent policy problems in education for which it has ready-made solutions to sell: Pearson is involved both in seeking to influence the education policy environment, the way that policy ‘solutions’ are conceived, and, at the same time, creating new market niches that its constantly adapting and transforming business can then address and respond to with new ‘products’. (Ball & Junemann, 2015: 7)

Within the political economy context, Pearson’s assessment imaginary report described above can be seen as a glossy commercial brochure and a policy prospectus combined. It simultaneously constructs a seductive vision of a datadriven and algorithmic assessment future, while problematizing the existing infrastructure of testing. By positioning itself as a solutions-provider for an alleged problem that concerns both practitioners and policymakers, Pearson can also be seen to be assigning itself a particularly powerful position in relation to the control over knowledge and expertise required to conduct next-generation assessment. Anagnostopoulos, Rutledge and Jacobsen (2013) have noted that educational power has been concentrated in recent years in the hands of organizations that have control of the ‘infrastructure of test-based accountability’. For decades, education systems, schools, teachers and learners alike have been the subjects of national and international testing, and the comparisons and judgements which follow from the ways that test data are compiled into performance measures, ratings and rankings by government agencies and non-governmental international organizations. ‘The data that fuel test-based accountability are … the products of complex assemblages of technology, people, and policies that stretch across and beyond the boundaries of our formal education system’, argue Anagnostopoulos et al. (2013: 2). As such, they define a form of ‘informatic power’ which depends on the knowledge, use, production of, and control over measurement and computing technologies … to produce performance measures that appear as transparent and accurate representations of the complex processes of teaching, learning, and schooling. As they define who counts as ‘good’ teachers, students, and schools, these performance metrics shape how we practice, value and think about education. (2013: 11)

Pearson has sought to locate itself in opposition to this infrastructure of test-based accountability. Indeed, by emphasizing assessment of ‘the full range of

achievement’ and ‘multiple aspects of behaviour and proficiency’, Pearson has articulated its vision of ‘intelligent accountability systems that utilise multiple indicators of performance’ (Hill & Barber, 2014: 8). By seeking to attract consensus for its vision, Pearson can be seen as seeking its share of informatic power, by shifting the focus of both practitioners and policymakers on to the kind of knowledge and technologies of adaptive testing that it already possesses and that it can sell to schools. To these ends, Pearson has not only established its own in-house efforts to collect and analyse assessment data, but has begun building a network of partnerships with third parties that can undertake educational data science tasks. One key partner is Knewton, which provides adaptive software for many of Pearson’s online learning products: The Knewton Adaptive Learning Platform™ uses proprietary algorithms to deliver a personalized learning path for each student … ‘Knewton adaptive learning platform, as powerful as it is, would just be lines of code without Pearson’, said Jose Ferreira, founder and CEO of Knewton. ‘You’ll soon see Pearson products that diagnose each student’s proficiency at every concept, and precisely deliver the needed content in the optimal learning style for each. These products will use the combined data power of millions of students to provide uniquely personalized learning.’ (http://www.knewton.com/press-releases/ pearson-partnership/)

Even more recently, in 2016, Pearson announced a partnership with IBM to embed IBM’s ‘cognitive computing’ system Watson in its courseware. Watson has been marketed by as ‘a cognitive technology that can think like a human’, and which has the capacity to amplify human cognitive capacity (Williamson, 2017). One of the key applications IBM has developed is a data-based performance tracking tool for schools and colleges called IBM Watson Element for Educators: Watson Element is designed to transform the classroom by providing critical insights about each student – demographics, strengths, challenges, optimal learning styles, and more – which the educator can use to create targeted instructional plans, in real-time. Gone are the days of paper-based performance tracking, which means educators have more face time with students, and immediate feedback to guide instructional decisions. (IBM Watson Education, 2016)

Designed for use on an iPad so it can be employed directly in the classroom, Element can capture conventional performance information, but also student interests and other contextual information, which it can feed into detailed student profiles. This is student data mining that goes beyond test performance to social context (demographics) and psychological classification (learning styles). It

can also be used to track whole classes, and automatically generates alerts and notifications if any students are off-track and need further intervention. Ultimately, Pearson’s efforts to build partnership networks with organizations including Knewton and IBM demonstrate how it is seeking to assume informatic power in relation to the control of assessment technologies and expertise. The informatic power of Pearson highlights how social, political and economic contexts are deeply intertwined with the apparently ‘objective’ assessments and the assessment tools used to generate the digital data objects that constitute individual student profiles. The consequences are not just greater commercialization of assessment, but a redefinition of how teaching and learning are conceptualized and assessed. One of Pearson’s reports on the use of data in education, for example, presents the case that digital data are generating new generalizable models of learning that reveal a ‘theory gap’ with existing disciplinary conceptualizations from the sciences of learning (Behrens, 2013). In other words, Pearson is proposing to build new theoretical explanations about processes of learning based on the analysis of the digital data objects that constitute the profiles of millions of students. As such, understanding Pearson’s aspirations for the future of assessment requires methodological approaches that can engage simultaneously with social, political and economic contexts, as well as with the social lives of the methods Pearson supports to produce knowledge, to generate student profiles, and the underlying imaginary that animates its ongoing efforts in this direction (Williamson, 2016).

Methods for future assessment studies

New assessment technologies are the material instantiation of a big data imaginary in education, one that projects a vision of mechanical objectivity in relation to the capture and analysis of data from students that can be used for diverse purposes and techniques of assessment, feedback and pedagogic prescription. In this imaginary, new forms of assessment are envisaged as being increasingly automated, predictive and pre-emptive. New assessment technologies such as adaptive testing, learning analytics and similar data-enriched assessment systems therefore require studies that engage with the liveliness of the assessment data as it is produced behind the interface itself, as it is processed and analysed and then utilized to produce automated responses. Likewise, studies of the ‘infrastructure of test-based accountability’ (Anagnostopoulos et al., 2013) have vividly articulated the webs of technologies,

institutions and practices that enable the flow of information from classrooms into performance measures and rankings. As yet, however, the infrastructure of education data science remains to be adequately mapped in full, let alone subjected to close empirical scrutiny. In this chapter I have outlined some of the key ways in which new assessment imaginaries are being generated and operationalized. As these initial attempts to make sense of assessment technologies indicate, future studies of this new and emerging machinery of assessment might: (1) Interrogate the sociotechnical imaginaries of the desirable future of assessment that animate the development of new technological projects in digitized assessment. (2) Examine the ‘social lives’ of their analytics methods and interrogate the performativity of the methods that generate the digital data objects that constitute student profiles. (3) Examine the organizational spaces and political contexts within which new assessment technologies are being imagined and produced. Moreover, further studies will need to go beyond the imaginaries, methods and contexts of assessment technologies to understand their enactment and material consequences. As such, research might: (4) Provide embedded studies of the contingent enactment of such technologies in diverse social contexts. (5) Generate thick ethnographic accounts of how students’ multiple data profiles are generated, modified, and used to drive automated decision-making and recommendations, and of how such processes of automation act to shape and modulate the embodied behaviours of students within parameters of action defined by specific platforms. The methodological challenge of undertaking such empirical studies of the generation and performativity of student data profiles is considerable. How a student is known through the digital data objects that constitute a student profile is the product of highly opaque technologies, often under the ownership and intellectual property rights of commercial companies with their own proprietorial algorithms and analytics methods, which produce the ‘facts’ and ‘records’ that constitute a student in a database and that are the basis of the ‘diagnoses’ used by analytics methods to generate decision-making and other interventions.

Conclusion

New forms of data-driven assessment are proceeding from imaginary to the material reality of assessment situations. To date such systems and the ways that

they partake in the everyday lives of students have been the subject of little study, despite how they are coming to participate in institutional practices, human actions, and how learners ‘view the situations they face, how they regard one another, and also how they see themselves’ (Hammersley & Atkinson, 2007: 3). As such systems enrol educational institutions and learners of all ages into the logics of digital data, and actively intervene pre-emptively to shape their future actions, they need to be subjected to studies that get as up-close to the assessment machinery as possible.

References

Anagnostopoulos, D., Rutledge, S. A., & Jacobsen, R. (2013), ‘Mapping the information infrastructure of accountability’, in D. Anagnostopoulos, S. A. Rutledge, & R. Jacobsen (eds), The infrastructure of accountability: Data use and the transformation of American education, 1–20, Cambridge, MA: Harvard Education Press. Baker, S. J. & Siemens, G. (2013), Educational data mining and learning analytics. Available online: www.columbia.edu/~rsb2162/BakerSiemensHandbook2013.pdf Ball, S. J. & Junemann, C. (2015), Pearson and PALF: The mutating giant, Brussels: Education International. Behrens, J. (2013), ‘Harnessing the currents of the digital ocean’. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA, April. Beer, D. (2016), Metric power, London: Palgrave Macmillan. Bulger, M. (2016), ‘Personalized learning: The conversations we’re not having’, Data and Society, 22 July. Available online: www.datasociety.net/pubs/ecl/PersonalizedLearning_primer_2016.pdf Burrows, R. & Savage, M. (2014), ‘After the crisis? Big data and the methodological challenges of empirical sociology’, Big Data & Society, 1 (1). Available online: http://bds.sagepub.com/content/1/1/2053951714540280 Cope, B. & Kalantzis, M. (2015), ‘Interpreting evidence-of-learning: Educational research in the era of big data’, Open Review of Educational Research, 2 (1): 218–239. Cope, B. & Kalantzis, M. (2016), ‘Big data comes to school: Implications for learning, assessment and research’, AERA Open, 2 (2): 1–19. Desrosieres, A. (2001), ‘How real are statistics? Four possible attitudes’, Social Research, 68 (2): 339–355. Gorur, R. (2011), ‘ANT on the PISA trail: Following the statistical pursuit of certainty’, Educational Philosophy and Theory, 43 (S1): 76–93. Gorur, R. (2015), ‘Producing calculable worlds: Education at a glance’, Discourse: Studies in the Cultural Politics of Education, 36 (4): 578–595.

Hammersley, M. & Atkinson, P. (2007), Ethnography: Principles in practice, 3rd ed., London: Routledge. Hill, P. & Barber, M. (2014), Preparing for a renaissance in assessment, London: Pearson. Hogan, A., Sellar, S., & Lingard, B. (2015), ‘Network restructuring of global edubusiness: The case of Pearson’s Efficacy Framework’, in W. Au & J. J. Ferrare (eds), Mapping corporate education reform: Power and policy networks in the neoliberal state, 43–64, London: Routledge. IBM Watson Education (2016), Transform education with Watson. IBM Watson. Available online: www.ibm.com/watson/education/ Jasanoff, S. (2015), ‘Future imperfect: Science, technology, and the imaginations of modernity’, in S. Jasanoff & S. H. Kim (eds), Dreamscapes of modernity: Sociotechnical imaginaries and the fabrication of power, 1–33, Chicago, IL: University of Chicago Press. Kennedy, H. (2016), Post. Mine. Repeat. Social media data mining becomes ordinary, London: Palgrave Macmillan. Kitchin, R. & Lauriault, T. (2014), ‘Towards critical data studies: Charting and unpacking data assemblages and their work’, The Programmable City Working Paper 2. Available online: http://ssrn.com/abstract=2474112 Knewton (2013), Knewton Adaptive Learning. Available online: https://www.knewton. com/wp-content/uploads/knewton-adaptive-learning-whitepaper.pdf Latour, B. (1986), ‘Visualization and cognition: Thinking with eyes and hands’, Knowledge and Society, 6: 1–40. Latour, B., Jensen, P., Venturini, T., Grauwin, S., & Boullier, D. (2012), ‘The whole is always smaller than its parts: A digital test of Gabriel Tarde’s monads’, British Journal of Sociology, 63 (4): 590–615. Loukissas, Y. A. (2016), ‘Taking big data apart: Local readings of composite media collections’, Information, Communication & Society. Available online: http://dx.doi.or g/10.1080/1369118X.2016.1211722 Lupton, D. (2015), Digital sociology, London: Routledge. Mager, A. (2016), ‘Search engine imaginary: Visions and values in the co-production of search technology and Europe’, Social Studies of Science. Available online: http://sss. sagepub.com/content/early/2016/10/26/0306312716671433 Marres, N. (2012), ‘The redistribution of methods: On intervention in digital social research, broadly conceived’, Sociological Review, 60 (S1): 139–165. O’Keeffe, C. (2016), ‘Producing data through e-assessment: A trace ethnographic investigation into e-assessment events’, European Educational Research Journal, 15 (1): 99–116. Pea, R. (2014), A report on building the field of learning analytics for personalized learning at scale, Stanford: Stanford University. Perrotta, C. & Williamson, B. (2016), ‘The social life of learning analytics: Cluster analysis and the performance of algorithmic education’, Learning, Media and Technology. Available online: http://dx.doi.org/10.1080/17439884.2016.1182927

Piety, P. J., Hickey, D. T., & Bishop, M. J. (2014), ‘Educational data sciences – framing emergent practices for analytics of learning, organizations and systems’, LAK ’14, 24–28 March, Indianapolis. Raley, R. (2013), ‘Dataveillance and counterveillance’, in L. Gitelman (ed.), ‘Raw data’ is an Oxymoron, 121–146, London: MIT Press. Rieder, G. & Simon, J. (2016), ‘Datatrust: Or, the political quest for numerical evidence and the epistemologies of big data’, Big Data and Society, 3 (1). Available online: http://dx.doi.org/10.1177/2053951716649398. Rogers, R. (2013), Digital methods, London: MIT Press. Ruppert, E. (2012), ‘The governmental topologies of database devices’, Theory, Culture & Society, 29 (4–5): 116–136. Ruppert, E., Law, J., & Savage, M. (2013), ‘Reassembling social science methods: The challenge of digital devices’, Theory, Culture & Society, 30 (4): 22–46. Savage, M. (2013), ‘The “social life of methods”: A critical introduction’, Theory, Culture & Society, 30 (4): 3–21. Siemens, G. (2016), ‘Reflecting on learning analytics and SoLAR’, Elearnspace, 28 April. Available online: www.elearnspace.org/blog/2016/04/28/reflecting-on-learninganalytics-and-solar/ Thille, C., Schneider, E., Kizilcec, R. F., Piech, C., Halawa, S. A., & Greene, D. K. (2014), ‘The future of data–enriched assessment’, Research and Practice in Assessment, 9 (Winter). Available online: http://www.rpajournal.com/dev/wp-content/ uploads/2014/10/A1.pdf Thompson, G. (2016), ‘Computer adaptive testing, big data and algorithmic approaches to education’, British Journal of Sociology of Education. Available online: http://dx.doi. org/10.1080/01425692.2016.1158640 Van Dijck, J. (2013), The culture of connectivity: A critical history of social media, Oxford: Oxford University Press. Van Dijck, J. & Poell, T. (2013), ‘Understanding social media logic’, Media and Communication, 1 (1): 2–14. Williamson, B. (2016), ‘Digital methodologies of education governance: Pearson plc and the remediation of methods’, European Educational Research Journal, 15 (1): 34–53. Williamson, B. (2017), Big data in education: The digital future of learning, policy and practice, London: SAGE. Woolgar S. (1991), ‘Beyond the citation debate: Towards a sociology of measurement technologies and their use in science policy’, Science and Public Policy, 18 (5): 319–326.

3
The infrastructures of objectivity in standardized testing
Nelli Piattoeva and Antti Saari

Introduction

In 2014 the Russian President Vladimir Putin signed a decree that removed the indicator that captures the percentage of students failing the national unified school graduation (USE) exam from the list of indicators used to evaluate the efficiency of local governors. The USE is a standardized national school graduation exam administered in the last year of high school. Passing this exam is also crucial for entry into higher education. As one proponent of this decision – a renowned school principal and member of the Committee for Educational Development at the Civic Chamber of the Russian Federation – reacted: ‘Thank God! This step will undoubtedly make the procedure of the USE more honest. Some governors tried to influence the results of the exam in order to improve their score on the league table; they came up with schemes to circumvent the law’ (see https://www.oprf.ru/press/news/2014/newsitem/24872). A year later, the Russian Federal Service for Supervision in the Sphere of Science and Education introduced a ranking list on ‘the objectivity of the administration of the USE’. The agency defined objectivity through indicators such as the number of exam rooms equipped with online video surveillance cameras and the number of examination buildings where mobile signals are muted with jammers. Upon regular evaluation, the regions are categorized into three zones, green, yellow and red, with those in the red zone singled out for targeted surveillance as the regions with the least transparent examination procedures (Uchitelskaia Gazeta, 2015). In 2016, the head of the Russian Federal Service for Supervision in the Sphere of Science and Education (Rosobrnadzor) was asked to reflect on

these decisions and their implementation in the regions. He agreed that the exam had become more objective due to stricter regulation and the systems of surveillance introduced (Uchitelskaia Gazeta, 2016). At a meeting with regional educational authorities he announced that the improved objectivity of the exam led to better results in the latest international large-scale assessments, and called for the regional authorities to move in the direction of using objective testing data for decision-making. These case examples highlight the involvement of the notion of ‘objectivity’ in the political debates and decisions concerning standardized testing in general and high-stakes testing in particular. Yet it is not at all evident how objectivity is defined in the various discourses that capitalize on the term. Our chapter unpacks some of the registers of objectivity which delimit how educational testing is thought about and acted on. It draws together ideas and observations from different data sources, mainly secondary literature, policy documents, reports and media materials on the development of national testing in the US and Russian contexts. We utilize the knowledge that spans the two distant but surprisingly overlapping country cases, limiting our examination to the contemporary period, to illustrate the argument that we are building. We focus on national testing technologies, as these have become an important measure of national education quality and a source of information on education ‘outputs’ for actors located on different scales of governance. The mounting efforts to govern education through testing and test data are surprisingly fragile and ambivalent, as they cannot simply feed off the irrefutable authority and objectivity of numerical information. Historically, reference to objectivity provided a strong argument for the adoption and dissemination of standardized testing at national and international levels. The term objectivity may assume different meanings in statistical testing. On the one hand, objectivity may refer to impartial representation of external reality. On the other, it may indicate meticulous standardization of testing procedures. The latter has been the case especially when the uses of testing have been expanded to cover ever larger populations. Importantly, since the 1980s, the use of testing data has assumed a new significance in the form of performativity, entailing increased emphasis on measured outputs, strategic planning, performance indicators, quality assurance measures and academic audits (Olssen & Peters, 2005; Lingard, 2011). Numerical data sets standards against which performance is evaluated. In this manner it communicates the expected form of conduct to those who are being measured (Davis, Kingsbury, & Merry, 2012). Thus the numbers produced for such purposes are increasingly

concerned with the modification of behaviour in the interests of control (Power, 2004: 776), but to be legitimate, they still need to appear neutral and natural, and to pass themselves off as being all about an independent reality. Based on this we claim that the numbers produced in examinations, national standardized testing or international large-scale assessments and used for political intervention face the same issue. On the one hand, they are expected or even claimed to represent the phenomenon of learning ‘objectively’. In this, they accentuate uniform, impartial forms of testing and data management as manifestations of objectivity. On the other hand, test data often enter the systems of performance measurement at national and global levels to direct attention and induce behavioural and organizational change. The situation illustrated by the opening vignettes and references to Michael Power’s work is not new to scholars of quantification. Charles Goodhart’s law states that ‘when a measure becomes a target it ceases to be a good measure’. Alain Desrosières (2015) attributed this feedback effect to any measurement practice, but just like Power, he stressed its increasing occurrence in contemporary contexts characterized by the neoliberal culture of performativity. In this chapter we explore the paradoxes of producing legitimate numerical data on education in the context of neoliberal performativity. We look at how measures are made ‘good’ again in a system that capitalizes on their cultural acceptance while simultaneously jeopardizing their proclaimed impartiality.

Quantification and objectivity

Before a sphere of government, such as an education system, can be governed, it needs to be territorialized, i.e. its external and internal borders identified and its basic functions rendered knowable (Rose, 1999). Numbers, as Porter (1994) and others (e.g. Rose & Miller, 1992) have maintained, constitute an essential feature in territorializing social phenomena due to their authority, stability, transferability and combinatorial potentials. It is often taken for granted that numbers and the practices of auditing, standardization and testing they inhabit, manifest ideals of trustworthiness and impartiality. In this, they hark back to the discourses on objectivity in scientific research. In discussions on the trustworthiness of scientific knowledge, the concept of objectivity has often invoked a somewhat monolithic and unproblematic meaning, as being ‘neutral’ or ‘true to reality’. Moreover, objectivity is often seen as detached from political concerns. Yet, as Daston (1992) notes, the concept

of objectivity has historical layers that entail moral, methodological as well as political aspects, all of which resonate with the discourses of numbers and quantification. Furthermore, these registers of objectivity fuel tensions and paradoxes when they are applied in processes of quantification and the use of numbers in governance. We refer here to roughly two layers of the term objectivity used in discourses of quantification. First, absolute objectivity refers to ‘representing things as they really are’ (Megill, 1994: 2). In other words, objective knowledge is a faithful reflection of ‘external reality’ (cf. Daston, 1992). Here, the dominant metaphor of knowledge is decidedly ocular (Megill, 1994). This may manifest in a rather simplistic idea of numbers ‘representing’ human abilities such as intelligence or learning outcomes. Yet quantification always entails reducing the individual and context-specific complexity of phenomena into abstract forms that produce a homogeneous plane of comparison. Scientific research often constructs ‘black boxes’ (Latour, 1987: 23) in which the processes of quantification are obscured and obfuscated, thereby directing the focus only to the solid numbers and statistics. This is how numbers can assume an air of representing objective reality, so that the problematic relationship between quantification and its object is easily forgotten (Power, 2004: 767–768). Second, claims to objectivity can also focus on just those practices that produce numbers. We can call this ‘procedural objectivity’ that accentuates standardized processes of testing. Procedural objectivity tries to ‘maintain the letter of absolute objectivity, while denying its spirit – using its means, but turning agnostic with regard to its end, the attainment of truth’ (Megill, 1994: 11; cf. Porter, 1994: 207, 211). Here, the root metaphor is ‘tactile, in the negative sense of “hands off !” Its motto might well be “untouched by human hands”’ (Megill, 1994: 10). The problematic human quality refers to the idiosyncratic action of individual investigators or research subjects, and the problem of observation and measurement from a limited perspective. The answer to this problem is both the multiplication of measurements and points of view across different sites and the meticulous standardization and stabilization of the socio-material processes and environments of quantification (cf. Daston, 1992). We can already witness this standardized multiplication in early twentieth century international examination systems of educational research. These systems constructed coherent criteria for pupil examination which would then enable scholarly communication and systematic comparison of pupil achievements across countries and continents (Lawn, 2008).

Procedural objectivity is a regulative idea not only of scientific research but also of modern bureaucratic organizations in open, democratic societies. Bureaucracy, as Max Weber (2015) characterized it, is driven by the ideals of formal rationality and impartiality in public administration, often executed by trained specialists. According to Porter (1994), procedural objectivity and the appeal to impersonal numbers in bureaucratic organizations has been fuelled by a mistrust of traditional administrative and political elites who claimed authority solely on the basis of their own personal experience and superior ability to make good judgements. If civil servants are governed by formal rules not susceptible to personal whims, this creates trust in the fairness and transparency of governance. Moreover, mistrust and the call for more objective quantification may emerge even in bureaucratic organizations. These organizations may entail complex hierarchies with a highly developed division of labour, or cover wide geographical distances, so that the centres of calculation cannot exercise direct surveillance over grass-roots administration. This may create doubts about whether, for example, teachers are doing their grading job properly, and this in turn may lead to rigorous standardization of the processes of producing numerical data (cf. Porter, 1994).

The emergence of standardized testing in governing the education system

Educational testing is a socio-material technology that has manifested both ‘registers’ of objectivity. On the one hand, tests are supposed to represent the variations in achievements, intelligence, etc. among a population of pupils. On the other hand, this has been accompanied by an insistence on the standardization of procedures so that they are not subject to individual or context-specific variations. Such has been the case especially when the uses of testing have been expanded to cover ever larger populations. One significant site of the emergence of statistical testing procedures in the school system was the coterminous development of statistically driven educational research and the increasing size and bureaucratization of education systems in Europe and the United States. In early twentieth-century France, Alfred Binet and Théodore Simon were commissioned by the state to invent techniques of intelligence testing for children. These would aid in identifying forms of mental defects among children and in assigning them to special institutions (Danziger, 1990; Privateer, 2006).

Tapping into the first register of objectivity, American pioneers of intelligence testing established that it would reveal the variations of individual abilities among pupils, and thereby serve to reduce ‘waste’ in the form of training pupils intellectually unfit for education (Privateer, 2006: 172–176). Unlike Binet and Simon, whose small-scale testing technology was reminiscent of a medical examination, American testing technology came to be designed to examine large populations efficiently. Intelligence testing standardized the forms of producing knowledge by rigorously controlling the time, space and action in the test situation. This would make the socio-material testing apparatus portable and replicable (Privateer, 2006). Thus testing technology assumed a form that accentuated procedural objectivity. A significant innovation in cost-effective and scalable testing technology in the 1920s was the introduction of multiple-choice items. These enabled fast, uniform grading even by low-skilled personnel. In the 1950s, the speed, ease and affordability of grading test scores were further increased with the introduction of optical grading machines. The vast spread of testing technologies was further fuelled by the emergence of private companies such as the Psychological Corporation (later renamed Pearson Education), which provided statistical expertise, standardized tests, manuals and prep material (Clarke et al., 2000). When applied to statewide, and subsequently national, assessment programmes, these enabled the formation of a scalable plane of comparison. Yet there was nothing intrinsically appealing or ‘objective’ about numbers themselves to render them immediately suitable for the purposes of educational governance. As Latour (1983) notes, for scientific research to be valid and applicable beyond the space of investigation, the discursive and socio-material elements constituting the research environment must be moved. In the case of emerging statistical testing in education, it was the procedures of bringing about, circulating and combining those numbers that were easily translated into the discourses and practices of governing rapidly expanding mass schooling that made it relevant and applicable. In other words, numbers constituted a ‘currency’ (Porter, 1994: 226) or a ‘boundary object’ (Star & Griesemer, 1989) only within a wider system of investigative and governmental practices that were standardized to a high degree. We can indicate some of these isomorphisms and points of translation between the procedures of governing the education system and those of testing. First of all, the space of a classroom was a natural laboratory for testing practices: it was already disciplined and homogenized in terms of age and classified in forms of time, subject content and grades. As a result it constituted an apt

plane of statistical comparison, yielding quantitative individual variation on standardized variables. Moreover, the class enabled constant surveillance and disciplining of behaviour (Danziger, 1990; Privateer, 2006). Thus statistical testing procedures could elaborate on already existing socio-material practices.

The rise of performativity and high-stakes testing

Since the 1980s, the use of testing data has taken on a new significance in the form of performativity, entailing increased emphasis on measured outputs, strategic planning, performance indicators, quality assurance measures and academic audits (Olssen & Peters, 2005; Lingard, 2011). Performativity ‘employs judgments, comparisons and displays as a means of incentive, control, attrition and change – based on rewards and sanctions (both material and symbolic)’ (Ball, 2003: 216). Performativity is linked to the new contractualism of actor relations legitimated and fostered by state legislation. The essence of contractual models is that they replace professional autonomy and trust with rigid hierarchies, and make contract extension, sanctions and rewards contingent on measurable performance (Olssen & Peters, 2005). Trust, in other words, is exchanged for monitored performance. It is interesting for our argument that performativity is embedded in the assumption that individuals are rational utility maximizers and, because of this, the interests of principals and agents inevitably diverge, and the agents always have an incentive to exploit their situation to their own advantage (Olssen & Peters, 2005). Performativity agendas thus ‘see the professions as self-interested groups who indulge in rent-seeking behaviour’ (Olssen & Peters, 2005: 325). The language of indicators and standards may narrow education policy debates to merely reasoning about the phenomena and aims that can be standardized and measured. As Graham and Neu put it, Increasing adoption of standardized testing begins to draw dissent away from the fundamental topics of the purpose and effectiveness of education, to derivative debates about administrative aspects of testing and about the validity of the results. This shift of focus from the fundamental to the superficial or consequential is a characteristic effect of tools like standardized testing that translate populations into numbers. Debate about such numericizing tools tends to centre on the correctness of the numerical value and the methods of its derivation, rather than on the appropriateness of translating people into numbers at all. (Graham & Neu, 2004: 302)

A candid case in point of the performative aspects of governing education systems is the phenomenon of teaching to the test. This means that high-stakes testing technologies may streamline school practices to yield better results in tests (Popham, 2001). Thereby numerical data sets standards against which performance is to be measured (Davis, Kingsbury, & Merry, 2012), signalling to those being measured the conduct expected of them, and re-shaping their identities, influencing future acts and altering actor relations. This reverses the idea of tests as representing ‘objective reality’ so that high-stakes tests themselves become the reality against which practices of schooling will be adjusted. The breadth and extent of this phenomenon have been well documented in different education systems (see Chapter 7, and Popham, 2001). What we would like to highlight here is that the culture of performativity in high-stakes testing is inherent in the two registers of objectivity. First, it is impossible to legitimate the use of testing without recourse to the idea of representing an objective reality, i.e. student achievements. Second, it is seen as paramount that the processes of high-stakes testing, especially in hierarchical and extensive education systems, be rigorously standardized to prevent subjective distortions. In the culture of performativity, both aspirations are constantly in danger of being compromised. The line between adjusting behaviour to performance criteria, and falsifying numbers to reach the expected outcomes in order to prove one’s trustworthiness and effort, is a fine one. In relation to the two registers of objectivity, performativity engenders both cynical compliance and misrepresentation. Both versions are produced purposefully, and in both ways numbers have a ‘transformational and disciplinary impact’ (Ball, 2003: 224) and make the phenomena measured ever more opaque.

The two faces of objectivity

We examine the problematic relationship between objectivity and performativity in the governing of large education systems such as Russia and the United States since the turn of the millennium. We focus on two nationwide testing technologies – the annual high-stakes tests for select grades in implementing the No Child Left Behind Act of 2002 in the United States and the USE – the Unified State Exam, a set of standardized, compulsory graduation exams after the last grade of high school in Russia. In 2009, a Russian country-wide examination – the USE, which combines in a single procedure the school leaving examination with entrance exams to

tertiary education – replaced the dual system of separate and independent school graduation and admissions exams that had existed since the Soviet times (Luk’yanova, 2012; Piattoeva, 2015). After a lengthy piloting of eight years, the law on education confirmed the USE as the only format for school graduation examinations (Ministry of Education and Science [MOES], 2013). The exam items comply with the state education standards, that is, with the part of the curriculum compulsory throughout the Russian Federation. The exam consists of ‘standardised exam items which enable the evaluation of the level of attainment of the federal education standards’ (MOES, 2011). Tyumeneva (2013: xi) calls the USE the most important education assessment procedure in the country, and explains that since there is no national large-scale assessment programme for ‘system monitoring and accountability purposes, the USE has ended up being used to fill this gap’, despite the fact that it was not initially designed to yield this kind of information. An indication of the pivotal role of the Russian USE as evidence for policymaking could be found in a circular distributed by the Federal Service for Supervision in the Sphere of Science and Education (MOES, 2014) under the title ‘On implementing the USE in 2015’. The document argues that the USE data should motivate measures that improve the quality of education in schools and also the level of pedagogical proficiency among teachers. Simultaneously, USE data have turned into a means to hold teachers, schools and municipal and regional authorities accountable for the numerical outcomes of their work (Piattoeva, 2015), while channelling resources from municipal and federal funds to the high-performing schools. In the United States, the No Child Left Behind Act was passed in 2002 by G.W. Bush’s administration to introduce a standards-based reform in the US school system. Its aim was ‘to close the achievement gap with accountability, flexibility, and choice, so that no child is left behind’ (No Child Left Behind Act of 2001, 2002). Federal funding for schools was made contingent upon the state imposing the National Assessment of Educational Progress (NAEP) at select grade levels (NCES, 2005). With the rapid growth of demand the biggest testing companies such as Harcourt Educational Measurement, CTB McGraw-Hill, and NCS Pearson gained hegemony in the high-stakes testing industry (Educational Marketer, 2001). While the NCLB programme aimed at promoting better achievement, especially among disadvantaged students, only those aspects of teaching which were explicitly sanctioned by tests were systematically improved, while low proficiency levels (especially from poor districts) that were not measured, were largely neglected. This in turn resulted in poorer outcomes

among the disadvantaged, and so in the NCLB failing to achieve its general aims (Mintrop & Sunderman, 2013: 29–30). Moreover, as NCLB implemented a categorization of achievements into three levels (basic, proficient and advanced), the focus was trained on those ‘bubble kids’ who were located just below the level cutoffs (Taubman, 2009: 39). Testing technology is able to overcome mistrust of its objectivity by adjusting relentlessly testing procedures and data management. In the USA, stakeholder organizations in teacher education and educational research have frequently expressed mistrust of the testing apparatus implemented in the NCLB (see e.g. Toch, 2006). Media discussion has also drawn attention to the testing companies’ failures in objective grading and narrow indicators (see e.g. Henriques & Steinberg, 2001; Eskelsen García & Thornton, 2015). More often than not, these criticisms have called not for the dismantling of the testing apparatus altogether, but for more ‘objective’, ‘valid’ and ‘reliable’ forms of testing. Taubman concludes that it is often as if critiques, as well as reformist and development discourses of education in the USA cannot bypass the language of quantification and accountability to make their arguments seem legitimate (Taubman, 2009: 34–38). For instance, in the 1980s there was a call to introduce alternative, more open forms of testing in the USA. The prevalent multiple-choice format was criticized for not objectively capturing the manifold abilities of pupils. Instead, it was argued, written answers and portfolios could give a fuller account of achievements and also motivate learning. In some states, assessment technologies were relaxed. But in the course of the 1990s, these alternative forms were in turn criticized from the viewpoint of procedural objectivity, as being subject to corruption and subjective interpretations. More rigorous testing procedures ensued (Clarke et al., 2000; Mintrop & Sunderman, 2013: 26). In Russia the multiple-choice format of the tests gave rise to mounting criticism about skewing education towards rote learning to the test, and even letting the student pass important exams and secure a higher education placement by mere guesswork. Therefore in the last few years the multiplechoice part has been replaced by open-ended questions. Moreover, testing in foreign languages and Russian as a compulsory state language now includes an oral component. Interestingly, when introducing this oral component of the exam in Russian to a group of international evaluation specialists, the head of Rosobrnadzor said ‘it is important for us not to lose objectivity after the oral part has been introduced’ (Rosobrnadzor, 2016). To counteract the allegedly diminishing objectivity of tests that need to be scored by subject

specialists, the authorities tightened their control over those who mark the tests. Since 2014, for the first time on a nationwide scale, the buildings that accommodate students taking the mandatory state exams, the staffrooms, and the regional centres of information processing where the exam assessors gather to mark the exam papers or resolve conflictual situations have been equipped with surveillance cameras (Rosobrnadzor, 2014). Non-mandatory recommendations with similar content apply to all other national testing. Numerous federal inspectors and voluntary public observers, who monitor the exam at its different phases through online surveillance or on the spot, have to ensure that the markers do not violate the rules, e.g. do not use mobile phones to identify whose examination paper they are marking (Izvestiia, 2015). The digitalization of completed examination papers allows them to be sent to other parts of Russia to be checked by subject specialists. These specialists are considered to be sufficiently removed from the schools and students tested and thus to have fewer incentives to fabricate the exam scores. Equally, voluntary public observers travel to regions other than their own in order to demonstrate impartiality. This has required additional efforts on the part of the authorities to persuade university students to take on unpaid work in distant locations, for the token reward of an interesting cultural programme or an additional item in their personal portfolios. Even the medical personnel who are on duty on the examination premises to provide first aid have been found guilty of helping students to cheat, causing the authorities to organize additional briefing sessions to remind people of the sanctions imposed for rule violations. Overall, the system perceives each and every exam participant – and, as we can see, the spectrum of participants is very diverse and growing – as untrustworthy, establishing additional legal and pedagogical structures and deploying technological devices to standardize the examination process and to discipline its various actors.

The new reality of performativity produces unexpected consequences which, instead of prompting a profound reimagining of objectivity, seem only to fuel further intensification and adjustment of the testing apparatuses that produced such problems in the first place. These effects were prominent in the NCLB programme in the USA and in the USE in Russia. In the Russian case, in order to confront the falsification of exam results by students and teachers during the examinations, the federal educational authorities created the so-called ‘test item bank’, which contains exam questions from preceding years. This measure was justified as a means of taking pressure off the exam, and allowing
students and teachers to foresee the type of questions that commonly appear on the examination papers. Moreover, independent centres of educational evaluation, the first of them established in Moscow in 2015, allow students to ‘sit the exam several times per year, thus making the idea of exam cheating unnecessary because it is easier to simply prepare well’ (Kommersant, 2014). The centres are described as more reliable places to take the exams as they function independently of the teachers and school administrators. The administrative attempts to diminish the potential skewing of data due to the intervention of the ‘human factor’ seek to disassociate the numbers from those who produce them. Yet, paradoxically, they also aspire to encourage teachers and municipal actors to pay due attention to the test content and results as desirable learning objectives and reliable representations of the levels of learning achieved.

In order to encourage the use of testing data for thoughtful decision-making in the regions, the Russian federal authorities recently removed the indicator of education quality based on the USE scores from the measures of performance that rank municipal and regional authorities. Rosobrnadzor seems to have supported and even driven this amendment, concerned that the indicator pushes local authorities to embrace quick fixes, as the reactions quoted in the Introduction suggest. However, since 2014, the leaders of education authorities in the regions have been evaluated on the basis of the level of transparency of the USE as a procedure, and Rosobrnadzor ranks regions according to how ‘objectively’ they implement the USE. The indicators look at the availability of online surveillance and other necessary surveillance technologies (see Piattoeva, 2016). Regions that end up in the red zone of the ranking, that is, exhibit low levels of objectivity, are targeted for additional inspections. Statistical operations that compare the scores obtained and their distribution to the so-called ‘ideal exam model’ play an important role in helping to identify areas with anomalously good test results. Regions with exceptionally high scores are likely to attract the authorities’ attention, leading to the re-examination of the exam papers of the students who received the maximum score. These students become a focus of both positive and negative attention. They raise suspicion, especially when coming from regions where extensive cheating schemes were exposed in previous years, such as the poor and conflict-ridden North Caucasian Federal District. At the same time, they can be said to represent educational establishments with pedagogical practices to emulate.
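The kind of statistical screening described above can be sketched in deliberately simplified form. The comparison below, which tests a region’s observed score distribution against a reference (‘ideal’) distribution with a chi-square statistic and flags regions skewed towards the top score bands, is a hypothetical illustration only: the actual models, score bands and thresholds used by the Russian authorities are not documented in the sources cited here, so every concrete value in the sketch is an assumption.

```python
# Illustrative sketch only. The reference distribution, score bands and
# threshold below are assumed for the example; they are not the procedure
# actually used by Rosobrnadzor.
from scipy.stats import chisquare

# Assumed share of students per score band (low, middle, high, maximum)
REFERENCE = [0.25, 0.50, 0.20, 0.05]  # stands in for the 'ideal exam model'

def flag_anomalous(region_counts, alpha=0.001):
    """Flag a region whose score distribution departs significantly from the
    reference and is skewed towards the upper score bands."""
    total = sum(region_counts)
    expected = [share * total for share in REFERENCE]
    _, p_value = chisquare(region_counts, f_exp=expected)
    observed_top = (region_counts[2] + region_counts[3]) / total
    expected_top = REFERENCE[2] + REFERENCE[3]
    return p_value < alpha and observed_top > expected_top

# A region with an unusually large share of high and maximum scores
print(flag_anomalous([150, 400, 300, 150]))  # True: candidate for re-inspection
```

Even such a toy comparison shows that the screening itself depends on a prior model of what a ‘normal’ distribution of results should look like – a model that is itself part of the infrastructure of objectivity discussed in this chapter.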

Conclusion – assembling the infrastructure of objectivity

Since the 1980s, the use of testing data has assumed a new significance in the form of performativity – a mode of governance that relies on the production and display of numbers to form judgements from a distance in order to redistribute material or symbolic resources, and to incentivize individual and collective actors. The emergence of performativity in governing education systems has been coterminous with the design and proliferation of apparatuses of standardized testing. In this chapter we looked at testing as a dynamic process and highlighted that debates about and decisions on educational testing borrow from scientific and governmental discourses particular definitions of objectivity, which are mobilized to legitimize and develop, as well as to criticize, testing technologies. The two registers of objectivity delineated here – the absolute and the procedural – are indelibly etched in the prevailing performative culture of high-stakes testing.

The focus on the different registers of objectivity at play helps to identify a conspicuous dynamic within the culture of performativity. While performativity may undermine claims to absolute and procedural objectivity, paradoxically, it may lead to an intensification of testing rather than its complete dismantling. In the era of performativity the tireless production of objectivity lies at the core of sustaining educational testing: testing ceaselessly expands its reach while feeding off the critique of its objective nature. Because the common currency of numerical data emanates from their assumed detachment and objectivity, we call the complex infrastructure created to sustain numbers the infrastructure of objectivity. The coordinative effort needed to sustain the information system is vast, and it is inevitable that it will never reach perfection, making it susceptible to further criticism and thus to questioning of the reliability of its outputs. As more stakes are attached to data, the problems of trust and data quality can be expected to intensify (Anagnostopoulos & Bautista-Guerra, 2013: 54). Thus testing infrastructures manifest dynamic processes of negotiating, innovating and remodelling practices and discourses of objectivity rather than fixed arrangements. Moreover, although these infrastructures may manifest broad and centralized governmental systems (as in the case of the United States and Russia), their effects cannot be entirely controlled. These unpredictable repercussions do not altogether invalidate testing infrastructures, but instead foster their further adjustment and development. The uneasy relationship between objectivity
and performativity is thus surprisingly productive for the expansion of testing infrastructures. What seems an uncomfortable paradox, one that should undermine the legitimacy of the numerical data produced through testing, instead ends up intensifying testing.

References

Anagnostopoulos, D. & Bautista-Guerra, J. (2013), ‘Trust and numbers: Constructing and contesting statewide information systems’, in D. Anagnostopoulos, S. A. Rutledge, & R. Jacobsen (eds), The infrastructure of accountability: Data use and the transformation of American education, 41–56, Cambridge, MA: Harvard Education Press.
Ball, S. J. (2003), ‘The teacher’s soul and the terrors of performativity’, Journal of Education Policy, 18 (2): 215–228.
Clarke, M. M., Madaus, G. F., Horn, C. L., & Ramos, M. A. (2000), ‘Retrospective on educational testing and assessment in the 20th century’, Journal of Curriculum Studies, 32 (2): 159–181.
Danziger, K. (1990), Constructing the subject, New York: Cambridge University Press.
Daston, L. (1992), ‘Objectivity and the escape from perspective’, Social Studies of Science, 22 (4): 597–618.
Davis, K. E., Kingsbury, B., & Merry, S. E. (2012), ‘Introduction: Global governance by indicators’, in K. E. Davis, A. Fisher, B. Kingsbury, & S. E. Merry (eds), Governance by indicators, 3–28, Oxford: Oxford University Press.
Desrosières, A. (2015), ‘Retroaction: How indicators feed back onto quantified actors’, in R. Rottenburg, S. E. Merry, S. J. Park, & J. Mugler (eds), The world of indicators: The making of government knowledge through quantification, 329–353, Cambridge: Cambridge University Press.
Educational Marketer (2001), ‘Publishers suffer growing pains as testing market mushrooms’, Educational Marketer, 8 October. Available online: https://business.highbeam.com/5083/article-1G1-78993551/publishers-suffer-growing-painstesting-market-mushrooms
Eskelsen García, L. & Thornton, O. (2015), ‘“No child left” behind has failed’, Washington Post, 13 February. Available online: https://www.washingtonpost.com/opinions/no-child-has-failed/2015/02/13/8d619026-b2f8-11e4-827f-93f454140e2b_story.html?utm_term=.7d4d2f233e93 (accessed 29 June 2017).
Graham, C. & Neu, D. (2004), ‘Standardized testing and the construction of governable persons’, Journal of Curriculum Studies, 36 (3): 295–319.
Henriques, D. B. & Steinberg, J. (2001), ‘Right answer, wrong score: Test flaws take toll’, New York Times, 20 May. Available online: http://events.nytimes.com/learning/general/specials/testing/20EXAM.html (accessed 29 June 2017).
Izvestiia (2015), ‘The number of public observers who monitor the USE of 2015 will double’, 16 March. Available online: http://iz.ru/news/584073 (accessed 29 September 2016). [In Russian].
Kommersant (2014), ‘USE was sent for re-examination’, 26 December. [In Russian].
Latour, B. (1983), ‘Give me a laboratory and I will raise the world’, in K. Knorr-Cetina & M. Mulkay (eds), Science observed: Perspectives on the social study of science, 141–170, London: SAGE.
Latour, B. (1987), Science in action: How to follow scientists and engineers through society, Milton Keynes: Open University Press.
Lawn, M. (2008), ‘Introduction: An Atlantic crossing? The work of the international examinations inquiry, its researchers, methods and influence’, in M. Lawn (ed.), An Atlantic crossing? The work of the international examinations inquiry, its researchers, methods and influence, 7–37, London: Symposium.
Lingard, B. (2011), ‘Policy as numbers: Ac/counting for educational research’, Australian Educational Researcher, 38 (4): 355–382.
Luk’yanova, E. (2012), ‘Russian educational reform and the introduction of the unified state exam: A view from the provinces’, Europe-Asia Studies, 64 (10): 1893–1910.
Megill, A. (1994), ‘Introduction: Four senses of objectivity’, in A. Megill (ed.), Rethinking objectivity, 1–20, Durham, NC: Duke University Press.
Mintrop, H. & Sunderman, G. (2013), ‘The paradoxes of data-driven school reform’, in D. Anagnostopoulos, S. A. Rutledge & R. Jacobsen (eds), The infrastructure of accountability: Data use and the transformation of American education, 23–39, Cambridge, MA: Harvard Education Press.
MOES (2011), ‘Decree on approving the procedure of holding a unified state exam’. Available online: http://www.edu.ru/abitur/act.31/index.php (accessed 15 January 2013). [In Russian].
MOES (2013), ‘The federal law on education in the Russian Federation’. Available online: http://минoбpнayки.pф/нoвocти/2973 (accessed 4 May 2013). [In Russian].
MOES (2014), ‘On implementing the USE in 2015’, Circular 16 September, no. 02-624. Available online: www.edumsko.ru/lawbase/fed/pismo_rosobrnadzora_ot_16092014_n_02624_o_procedure_provedeniya_ege_v_2015_godu/ (accessed 12 May 2015). [In Russian].
NCES (2005), ‘Important aspects of No Child Left Behind relevant to NAEP’. Available online: https://nces.ed.gov/nationsreportcard/nclb.aspx (accessed 29 June 2017).
No Child Left Behind Act of 2001, 20 U.S.C. §§ 107–110 (2002).
Olssen, M. & Peters, M. A. (2005), ‘Neoliberalism, higher education and the knowledge economy: From the free market to knowledge capitalism’, Journal of Education Policy, 20 (3): 313–345.
Piattoeva, N. (2015), ‘Elastic numbers: National examinations data as a technology of government’, Journal of Education Policy, 30 (3): 316–334.
Piattoeva, N. (2016), ‘The imperative to protect the data and the rise of surveillance cameras in administering national testing in Russia’, European Educational Research Journal, 15 (1): 82–98.
Popham, W. J. (2001), The truth about testing: An educator’s call to action, Alexandria: ASCD.
Porter, T. (1994), ‘Objectivity as standardization: The rhetoric of impersonality in measurement, statistics, and cost-benefit analysis’, in A. Megill (ed.), Rethinking objectivity, 197–238, Durham, NC: Duke University Press.
Power, M. (2004), ‘Counting, control and calculation: Reflections on measuring and management’, Human Relations, 57 (6): 765–783.
Privateer, P. M. (2006), Inventing intelligence: A social history of smart, Malden, MA: Blackwell Publishing.
Rose, N. (1999), Powers of freedom: Reframing political thought, Cambridge: Cambridge University Press.
Rose, N. & Miller, P. (1992), ‘Political power beyond the state: Problematics of government’, The British Journal of Sociology, 43 (2): 173–205.
Rosobrnadzor (2014), ‘In the reserve period of the USE Rosobrnadzor engages in the full-time monitoring of the internet’, 29 April. Available online: http://obrnadzor.gov.ru/ru/press_center/news/ (accessed 28 February 2017). [In Russian].
Rosobrnadzor (2016), ‘The head of Rosobrnadzor told international experts about the new technologies in Russian education’, 30 September. Available online: http://obrnadzor.gov.ru/ru/press_center/news/index.php?id_4=5954 (accessed 30 September 2016). [In Russian].
Star, S. L. & Griesemer, J. R. (1989), ‘Institutional ecology, translations, and boundary objects: Amateurs and professionals in Berkeley’s museum of vertebrate zoology, 1907–39’, Social Studies of Science, 19 (3): 387–420.
Taubman, P. M. (2009), Teaching by numbers: Deconstructing the discourse of standards and accountability in education, New York: Routledge.
Toch, T. (2006), Margins of error: The education testing industry in the No Child Left Behind era, Washington, DC: Education Sector.
Tyumeneva, Y. (2013), Disseminating and using student assessment information in Russia, Washington, DC: World Bank.
Uchitelskaia Gazeta [Teachers’ Gazette] (2015), ‘Regions will be divided according to the traffic lights’, 31 March. [In Russian].
Uchitelskaia Gazeta [Teachers’ Gazette] (2016), ‘Sergey Kravtsov, head of Rosobrnadzor: They write and do not fear to receive a low grade for expressing their opinions’, 5 January. [In Russian].
Weber, M. (2015), ‘Bureaucracy’, in T. Waters & D. Waters (eds), Weber’s rationalism and modern society, 73–127, New York: Palgrave Macmillan.

4

Detecting student performance in large-scale assessments

Margareta Serder

Introduction

In this chapter, I develop the argument that PISA, as an international testing programme, is a kind of detector, not unlike those deployed in the natural and physical sciences for collecting and interpreting signals of different phenomena in order to construct a representation of the world. In her exploration of Epistemic cultures: How the sciences make knowledge, Karin Knorr Cetina wrote about such detectors as ‘a sort of ultimate seeing device’. She noted:

the detector functions not unlike the retina; in UA2,1 participants said the detector ‘sees’ or ‘doesn’t see’ certain events, that it was ‘not looking’, ‘watching’, or that it ‘looked away’; they said it was ‘blind’, ‘sensitive’ or ‘insensitive’. (Knorr Cetina, 1999: 115)

As implied by this quote, understanding large-scale assessment as a detector means seeing it as a specific kind of device, constructed to detect certain things of interest at the cost of being blind to others. To regard PISA as being, or acting as, a detector means going beyond the idea of a test that somehow mirrors what 15-year-olds know about science, mathematics, and reading, towards something that actively selects and assembles an output from the signals that the test is able to disclose and identify. Therefore, it can be metaphorically compared to other detecting technologies used in scientific investigations or in everyday life, such as particle detectors or fire alarms. I

1 An experiment at CERN, which is one of the scientific laboratories in Knorr Cetina’s ethnographic study.

will develop the argument of PISA as a detector within a material–semiotic methodology of exploration, most often known as actor–network theory (Gorur, 2015). In this chapter, following a theoretical exploration of the notion of the detector, I will present examples of how the PISA detector operates in practice. To do this, I will present data from encounters in which 15-year-old students collaboratively worked with test questions that were formerly used in the PISA programme for assessing students’ so-called scientific literacy. The encounters between students and test questions were designed to enable an analysis of the socio-material conditions for the construction of responses in the assessment of literacy. In retrospect, I consider the material as helping to shed light on the response process, broadly defined as ‘the mechanisms that underlie what people do, think, or feel when interacting with, and responding to, the item or task and are responsible for generating observed test score variation’ (Hubley & Zumbo, 2017: 2). The present study was part of a larger research project that asked questions about possible reasons for the declining results of Swedish students in international comparisons from 2006 to 2012. The project has been discussed in Serder (2015), Serder and Ideland (2016), Serder and Jakobsson (2016) and Serder and Jakobsson (2015).

Since its inception in the late 1990s, the OECD Programme for International Student Assessment (or PISA) has grown larger with each test cycle and become so influential that it is now the natural reference point for statements on the state of the art of schools around the globe: PISA has – in its own words – ‘become the world’s premier yardstick for evaluating the quality, equity and efficiency of school systems’ (OECD, 2013: preface). In other words, it is not only a knowledge measurement interested in the knowledge and skills of students, but also an instrument aimed at representing the intrinsic qualities of entire school systems. To deploy an ANT (actor–network theory) vocabulary to describe this development, we could say that the links between the actors in the PISA network have multiplied, strengthened, and been stabilized to the point at which they have become blackboxed (Latour, 2005). It is no longer questioned how and from where PISA grew, because it has become a given. The practice around which it all oscillates is the two-hour-long PISA test that takes place in three-year cycles in all countries of the OECD and its partners. During this short period of time, evidence of knowledge is ‘tapped’ – in the shape of inscriptions on paper sheets or computer hardware – from half a million 15-year-old students around the globe in a veritable ‘hjärnkamp’ (brains’ fight), as it is expressed in the Swedish advertisement produced to motivate
students to do their best in the PISA test (‘Du är utvald’ [You have been chosen], Skolverket, 2014). The practice of individual, written testing is guided by the assumption that the inscriptions produced also constitute evidence about the school systems that are educating these students. This chapter looks at that testing procedure, and especially the part of it in which the skills are assessed. The goal is to show how performance is brought into being in this practice, with scientific literacy as an example.

Background and theoretical assumptions

Scholars have researched PISA from many perspectives and with a number of research questions. Just to give a glimpse of the large body of research, researchers have dealt with PISA’s analytic demands and pitfalls (Laukaityte, 2016), its statistical validity and reliability (Kreiner & Christensen, 2014), its historic origins and influences (Tröhler, 2013), its reception among policy makers and politicians (Grek, 2012; Sellar & Lingard, 2014) and in the media (Waldow, Takayama, & Sung, 2014), its policy impact (Martens, 2007), and many more aspects. Some researchers have investigated the tests’ comparability, for instance between versions in different languages (Arffman, 2010), or between the PISA framework and national curricula (Lau, 2009). Other scholars have studied the circumstances, conditions and processes that constitute the PISA programme using interviews, meeting protocols, etc. as their main information source (Morgan, 2007; Gorur, 2011). An important insight from the latter type of research concerns the long – and most often unconsidered – chain of events that brings the PISA test and the reported results based on it into being. During each step in this chain, the assessment is gradually reshaped from one type of assumption or data into a slightly different one. In ANT, this is called a process of translation (Latour, 1999). Gorur (2011) described a process of translation with respect to PISA as follows:

At every point there is a translation – from ‘knowledge and skills for life’ to ‘literacy, numeracy and scientific literacy’, to a few items that stand for these literacies. The few test items, validated by the framework and the field testing, refer to the three literacies being tested, which in turn represent the knowledge and skills for life, which in turn represent the work of millions of teachers in tens of thousands of schools in a sizeable portion of the world, and stand for the performance of education systems. (Gorur, 2011: 84)

The translations of specific interest in this chapter are those that are produced in encounters between the students and the test, during which the students’ knowledge is to be tapped. So, what is taking place during these encounters between students and test questions? What are the consequences for the data being produced during these encounters? What does this mean if we choose to view PISA as a detector?

Actor–network theory and a useful analogy

This study draws on actor–network theory, or ANT (Latour, 1999, 2005). ANT has been developed within a vast range of academic disciplines, under the umbrella of Science and Technology Studies, or STS (Fenwick & Edwards, 2012; Gorur, 2015). The work of Knorr Cetina (1999) referred to in the introduction is clearly one of the most cited studies in this field. The purpose and means of STS are to investigate, at close range, the production of scientific and technological achievements, something that is enabled by an ANT methodology. The guiding principle is that scientific work is a practice that is necessary to bring facts into reality, meaning that facts become facts only under specific circumstances. However, once a scientific fact is set, the chain of events that has preceded it becomes obscured – and the more established a fact is, the more ‘blackboxed’ it becomes (Latour, 1999). The assignment of the ANT researcher is to study how facts are produced and to describe the fact-making process in order to open the black box. This means taking seriously how material and social actors engage in the practices needed to claim scientific facts, and how these practices evolve over time. Thus, facts are consequences of practices, such as using specific detectors to collect and interpret information (Knorr Cetina, 1999).

In one of the first ANT articles about PISA, Radhika Gorur (2011) suggested that we should view the PISA production of facts about students’ skills and knowledge as facts produced within scientific laboratories. This is the assumption that I adopt in this chapter. In a previous article (Serder & Ideland, 2016), we proposed comparing the laboratory activities of PISA with the activities that took place in the first laboratory studied in the name of STS, namely that of the scientist Louis Pasteur in Paris during the nineteenth century and his discovery of microbes (Latour, 1999). This analogy helps to make an argument about what it takes for any category (in this case, a PISA competency or a specific level of performance) to exist. In Latour’s philosophical analysis of Pasteur’s scientific work, the
question being elaborated is what it means to ‘discover’ something, such as the microbes that Pasteur became famous for having described and named. Latour poses a compelling question: can one assert the existence of microbes, or their location, before Pasteur? Can we claim that they even existed before ‘the discovery’? Latour writes that many of us would find it commonsensical to answer that the microbes were in the same place before and after Pasteur; that Pasteur did not change the course of events and that he only discovered a preexisting category, and named it ‘microbes’. However, Latour argues that what it takes to bring things and categories into reality is to give them attributes. This is what the scientific process contributes to: defining the object. The microbes in Pasteur’s laboratory needed to be given circumstances, conditions, specific nutritional demands and ways of being detected; thereafter, they became microbes as we know them. Then and only then could they be discerned as something. Obviously, there are differences between microbes and students’, or even countries’, levels of performance. However, this parallel adds to our thinking a question about performance results: in what way do the knowledge and skills detected and ‘discovered’ by the PISA test exist before the two-hour test is conducted and added to the chain of events in which PISA performance is calculated? What are the consequences for our understanding of school systems and students’ knowledge and skills?

For instance, the notion of a detector can be exemplified by the instruments constructed for reading the level of haemoglobin in a blood test. Following some procedures, those instruments can detect haemoglobin, but not, for instance, the content of virus or thyroid hormone. Meanwhile, the ensemble of the instrument and the blood also constructs a particular way to perceive and know about blood. (Serder & Ideland, 2016: 3–4)

Methodology

A methodological assumption in ANT is that valuable data come from careful deconstruction of what is taken for granted, accomplished through tracing a network’s socio-material actors and their relations (Fenwick & Edwards, 2012). Actors in the network that makes up the phenomenon of interest in this chapter, PISA, include the test developers, researchers, policy makers and 15-year-olds, but also nations and economies world-wide, coding schemes,
test items, calculation programmes and software, frameworks, and scientific experts in a multitude of fields. Tracing the internal relations of these actors could mean studying how discourses and practices travel across actors, and how, when and through which actors they are negotiated, used and sometimes stabilized. This empirical example traces the possible beginnings of responses to standardized test questions in PISA. Another notion that will be added to the theoretical framework is that of mangling,2 which is the contingent adjustment between material and human actors, evolving in the constant resistance and accommodation in the mangle of practice (Pickering, 1995). The mangle metaphor can be used as an analytic tool for grasping how students’ oral and written answers come into being in situated encounters with PISA materials. While the PISA test is a detector of student performance, the mangle provides a way to perceive how a detector is functioning. The mangle invites the observer to see performance not as representations of the knowledge of a person, but as a result of resistances and accommodations that are opened up in the encounters of students, test materials, test instruction and the emotions, needs and socio-material conditions of each encounter. To use the analogy with the microbes in Pasteur’s laboratory, this tracing can help us understand the situated conditions and attributes forming ‘PISA responses’. In PISA, these responses will be translated into ‘results’, into ‘scientific-literacy competencies’, and ultimately into ‘certitudes’ concerning the status of whole school systems.

In the PISA test, students normally work alone and in silence as they respond to test questions. This means that no group activities take place. In contrast, the present study was designed as physical encounters, where groups of three to five students worked together to construct responses to a selection of eleven PISA test questions from the scientific-literacy section of the publicly released test items (Serder, 2015). In total, sixteen hours of activities in twenty-one groups of students were video-recorded. A delicate question is whether this kind of encounter can inform research about the response processes of ‘real’ testing situations; something for which a widely accepted practice is still lacking (Hubley & Zumbo, 2017). I argue that through this collaborative approach, aspects of students’ co-actions with the materials become observable in talk, in propositions and in what the students do with the material, in ways that are not afforded in individual set-ups. Therefore, this study makes visible enactments in the process of mangling with the test that are also possible in real testing situations. Central to the analysis of the data material is what the students are doing with the test items (viewed as

2 A mangle is a large machine for ironing sheets or other fabrics, usually when they are damp.

a particular kind of detector) and vice versa: what the test items (detectors) are doing with the students in these socio-material encounters.

The PISA framework and test instrument

The construction of a large-scale standardized test like PISA’s is guided by a framework (see, for instance, OECD, 2006), a detailed description of the procedures and features that should characterize the test and its operationalization. From an ANT perspective, an international assessment framework is more than that – it is a socio-material actor in the test’s actor–network, which, as Maddox (2014: 134) suggested, ‘defines and “black boxes” a series of concepts, procedures and experiences and puts them beyond debate’. From a detector’s point of view, the framework sets the frame for how and what to assess in PISA. It articulates the outer borders of the detector’s scope and what type of signals it will and will not look for. The PISA framework is specific about how the 15-year-olds’ knowledge should be assessed. In the PISA 2006 framework, comprising the assessment guidelines for the test questions used in this study, the following text describes a certain requirement:

An important aspect of scientific literacy is engagement with science in a variety of situations. In dealing with scientific issues, the choice of methods and representations is often dependent on the situations in which the issues are presented. The situation is the part of the student’s world in which the tasks are placed. Assessment items are framed in situations of general life and not limited to life in school. (OECD, 2006: 26)

This passage ties the tasks presented in the assessment of scientific literacy to ‘the student’s world’ and to situations of something called ‘general life’. This means that the test instrument – the detector – needs to mimic ‘general-life situations’ of the 15-year-old students. What kind of materials meet this requirement of the framework? Testing is a practice of secrecy: most of the items used in the PISA test are not published. This is due to standards for the assessment procedures, which assume that the content should remain unknown to the test-takers until the moment of testing. Nonetheless, a sample of items is publicly released after each PISA testing cycle. These items are excluded from the next testing cycle, but they contribute some transparency. In this chapter, one of the three PISA items used for this study, ‘Sunscreens’ (S447), will serve as an example of a PISA test item constructed to assess 15-year-olds’ scientific literacy. Released items
Figure 4.1 Sunscreens (S447) introduction text

from the literacy tests in reading, science and mathematics can be found on the OECD website, and some are also published in the international and national PISA reports. The introductory text of Sunscreens describes an experiment conducted by two fictional students, Dean and Mimi. Their experiment is designed to compare the effectiveness of four different sunscreens (called S1, S2, etc.) using a set of materials: creams of the four sunscreens, mineral oil (M) and zinc oxide (ZnO). One of the introductory passages reads as shown above (Figure 4.1). The next passage says that Dean and Mimi ‘included mineral oil because it lets most of the sunlight through, and zinc oxide because it almost completely blocks sunlight’ (S447; OECD, 2006). Thus, the roles of the ZnO and the M should be understood as extreme measuring points (references) in this experiment. In line with the framework, the introduction serves as a ‘real-life’ scene for the scientific experiment, which is followed by four consecutive test questions, including a multiple-choice question to measure students’ knowledge about scientific methods (the PISA competency Identifying scientific issues; Figure 4.2). The evidence to be collected by the test instrument consists of the responses from half a million students, produced in individual encounters with various national test versions. These tests have been translated from the original English and French test questions and selected to be valid under the local conditions of each participating country.3 Each response will be coded as correct, incorrect

3 This means that items with peculiar response patterns, or items deemed too ‘easy’ or too ‘apart’, are excluded from the national assessment in line with standardized procedures.

Figure 4.2 Sunscreens test item. Question 2

(or, in some cases, partly correct), in a process that lasts for twenty months until the final results are officially reported. For this item, Option D is correct; that is, Option D is a response that corresponds to the detection of scientific literacy according to PISA standards and protocol. In the next two sections, one successful and one unsuccessful encounter between the students in this study and the Sunscreens item are presented.

Encounters between test questions and students

Option D is selected by Caroline, Lucy and their two group members in their encounter with this item. However, their method is apparently not the one anticipated by the PISA test constructors: what Lucy and Caroline are doing is comparing the introductory text to the test question:

Lucy: Mm … Yes, but THAT Dean placed a drop of each substance inside a circle … [reading aloud the text just above the picture in Figure 4.1] Then both are … ?
Caroline: Then both are reference substances [Option D, Figure 4.2].
(Group J; Nov 2010, 13:25–14:00)

The students’ approach appears to be about finding the word substance both in the description of ZnO and M and in the wording of Option D. This idea leads Lucy and Caroline to give a correct response. However, the point is that the detector cannot judge whether this signal results from ‘luck’ or from ‘competence’. For the detector, a correct response represents individual, measurable knowledge while an incorrect response corresponds to a lack of measurable knowledge. But among the student–item encounters designed for this study, item responses are seen to be constrained by many things, such as
the space afforded on the test sheet, difficulties in formulating the responses briefly enough, or it simply being too bothersome to construct an answer at all. In the encounters, some of the constraints are named explicitly: hunger, thirst, boredom, patience, silliness, and fatigue (for excerpts, see Serder & Ideland, 2016; Serder & Jakobsson, 2015, 2016). I consider them socio-material actors that interfere and interact with the detector. Ideally, and from a detector’s point of view, words would mean the same thing to all test-takers and their difficulty would correlate with the competencies of interest. The everyday-life situation suggested by the detector would travel smoothly between local contexts. However, the following example, a different type of encounter in the video material, shows how the meanings of specific words and contexts interact with the detection of competencies. Here Richard, Jana and Cornelia are working with the same item as Lucy and Caroline; they end up choosing C, an incorrect response to ‘Sunscreens’ Question 2.

Richard: [reading] The words are so strange
Jana: What?
Richard: Reference substance … [reading aloud]
Jana: [shaking her head] Yeah, like … what is that?
Cornelia: [looking out through the window]
[All students read in silence]
Richard: If only one knew what reference means? [sighs]
[The students are reading again, searching in the text]
Jana: This [Option C] is the most reasonable because … zinc oxide does not let sunlight through. One could say perhaps it is a factor, as in sun-protection factor … which is supposed to stop sunlight. Just as sun protection. And they [zinc oxide and mineral oil] are like … not the same … so let’s go for C.
(Group L, Nov. 2010, 20:29–22:40)

The enactments of this encounter have little to do with knowledge about scientific methods, which is the competence to be measured. Instead, the encounter revolves around the meanings of the words ‘reference’ and ‘factor’, and their relationship to the students’ real-life experiences of sunscreens. To recall the metaphor of the mangle of practice (Pickering, 1995), this can be considered as situated resistances and accommodations that result in specific decisions and directions forward. For example, as a response to Richard’s and Jana’s uncertainty about the word reference, the word factor becomes their focus for action. This group’s use of the word factor might surprise readers who are not native Swedish speakers. Here, as in multiple other occurrences in the data material (see further Serder & Jakobsson, 2016), the word factor is used in the sense of sun
protection. ‘Sun factor’ (solfaktor) is an everyday Swedish term denoting sun protection. In the above quote, Jana says that Option C is ‘the most reasonable’ and then suggests the (incorrect) Option C. In Option C, ZnO is called a factor and M a reference, suggesting that they are substances of different character. According to the introductory text, M lets sunlight through while ZnO blocks it – information suggesting that they actually have opposite properties. Whatever the reason for Jana to find Option C the most reasonable, the detection of skills and competencies here becomes entangled with everyday-life experiences, local use of certain vocabulary (factor as equal to sun protection) and the missing experience of ‘references’. As Maddox wrote, ‘the international character of large-scale assessment programmes highlights the challenge of maintaining cross-cultural validity, and associated questions of test item adaptation, relevance and comparability’ (Maddox, 2014: 482). The requirement for items, words and contexts to be able to travel is an outstanding attribute of this specific practice that Maddox (2014) termed ‘globalizing assessment’. In my example above, the detector has difficulty with (Swedish, teenager) everyday life getting in the way, even though everyday life is also supposed to be part of the framework.

The detection of competencies can also fail because students know too much about the topic. For instance, as Edward and Cathrine are working with Question 3 (Figure 4.3), knowing a lot about the ZnO substance that occurs in the Sunscreens unit does not seem to be an advantage. I will return to this dilemma of ‘knowing too much’ shortly by describing how the encounter unfolds. However, for this encounter it appears that language translation between national versions of the PISA test also plays a role: in the Swedish version, the words ‘less’ and ‘more’ in Options C and D have been translated into words with (at least) double meanings. This means that even though the item deploys

Figure 4.3 Sunscreens test item. Question 3

correct Swedish, new meanings have become entangled through the use of the equivalents of worse (sämre) and better (bättre):

C Finns det något solskyddsmedel som skyddar sämre än mineralolja? [Is there any sunscreen that protects worse than mineral oil?]
D Finns det något solskyddsmedel som skyddar bättre än zinkoxid? [Is there any sunscreen that protects better than zinc oxide?]

(Skolverket, 2007)

To ensure assessment quality, language translation procedures in PISA are highly standardized (Grisay, 2003). On the other hand, research reports that item difficulty (with respect to word frequency) is unequally distributed among national test versions (Arffman, 2010). What could this mean for the detection of students’ scientific-literacy competencies? Because the Swedish words for ‘worse’ and ‘better’ are used, the test question can be interpreted as asking which substance is worse or better than the other:

[After reading in silence, Edward turns to Cathrine]
Edward: Yes, but how is … which one protects the best? I mean how does it protect?
Cathrine: Yes, yes.
Edward: But it must be zinc oxide shuts out the most … out almost all sunlight, and that isn’t good for your body [health] to have … or zinc is no good at all … because some are allergic to zinc and then that is not good. I mean metals and such things, but I think it’s which one gives a better protection than zinc oxide [Option D], so it’s closer to D, I’d say.
Cathrine: Is it this one (points)? Is there (marking D on the test sheet)
(Group N: 19:01–20:39)

Edward’s and Cathrine’s encounter with Question 3 ends with them selecting the incorrect Option D (A is correct). However, this is an example of knowing ‘too much’ to be able to interpret this test question as anticipated by those constructing it (cf. Maddox, 2014). Edward’s concern seems to be the properties of ZnO causing allergies. He also argues that ‘shutting out sunlight … isn’t good for your body’. One possibility is that he is referring to the benefits of exposing human skin to sunlight, thus allowing the production of vitamin D. Within that reasoning, the experiment could be about finding a better – in the sense of less harmful – sun protection. However, the detector is not able to detect Edward’s scientific knowledge. My last example of an encounter involves Damien, Peter and Ali and takes another direction. These boys are also having a conversation about the Sunscreens item, but trying to understand the bigger picture: why conduct an experiment like the one suggested in the test at all?

Peter: Do you understand this?
Ali [reading the test question aloud]: ‘They used mineral oil because it lets through most of the sun’ [points at the text] … Hmmm, I don’t know.
Damien: But why would they care about which sun cream to use? Err … or sunscreen
Ali: They aren’t tested [the mineral oil and the zinc oxide] because they already know what they do (.) That zinc oxide—
Damien: I know … But why bother so incredibly much that they do … well … that [experiment]?
(Group B: 20:27–22:51)

In this encounter, science, in the shape of a scientific experiment on sunscreens, is put under the microscope of the not-yet-convinced scientists. As Damien asks: ‘Why bother so incredibly much that they do … that?’ Later on, the group suggests that instead, one could ‘go to the store and just buy [sun protection]’. Here, the resistance and accommodation does not necessarily affect the detector – it is rather the detector that is affecting those encountering it. What is the image of science that the Sunscreens experiment is presenting? It might seem a general-life situation to the test constructors, but is it one in the lives of the 15-year-olds?

Discussion

In this chapter I have proposed viewing PISA as a detector, a specific kind of device that is constructed to detect certain things at the cost of being blind to others. Parallels have been drawn to detectors like haemoglobin meters, smoke detectors, and even the retina (Knorr Cetina, 1999), all of which offer a certain way of knowing about a phenomenon of interest. In light of the examples presented, then, what kind of detector is PISA, and what kind of seeing does it provide? Or, to use the insights of Latour’s laboratory studies (1999), how does the PISA detector define its object and by what names and descriptions does it bring this object into being? In this closing section I will elaborate on how PISA makes available certain, but not other, ways to perceive and know about students and school system performance.

Measurement is ‘a performative act’ (Hamilton, Maddox, & Addey, 2015: ix). ANT invites us to see performance as a result of resistances and accommodations of students, test materials, test instruction and the emotions, needs and socio-material conditions of each encounter between students and test items. One
such condition is the image of science that is implicitly interwoven into the test, which Damien is questioning when he says ‘Why bother so incredibly much?’ (Serder & Jakobsson, 2015). Too much context, too much knowledge and local understandings of the ‘globalized’ and ‘neutralized’ real-life features of the test (Maddox, 2014) are other examples. The familiarity of the situations presented to the students also introduces a dilemma: the situations should not be confused with real real life (Maddox, 2014). In her work on detectors of advanced scientific practices, Karin Knorr Cetina described how a physicist must deal with the background of detectors, and ‘not only “fights against” the background but tries to “kill” it’ (1999: 124). In knowledge measurement, the equivalent of the background is the ‘noise’ produced during assessment: the signals of the unexpected events that occur despite all precautions taken by the framework and its procedures. For PISA as well, the statistics experts work hard to reduce the noise to a minimum (Morgan, 2007). However, by observing response processes such as those of Edward, Jana, Caroline and the other students of the study presented here, it becomes possible to know rather than to kill.

However, the literacy performance as detected by individual, standardized testing and compared to an international reference point is a rationalistic and dualistic object that only offers an understanding of results as representations of the knowledge of a person (Serder, 2015). In the PISA actor–network, discourses and practices travel across actors, being negotiated, used and sometimes stabilized. I argue that, in Sweden – my geographical context – at least at a policy level, one of the things that has become stabilized is a monological, rationalistic understanding of knowledge and perhaps also a belief in a linear correlation between results and school activities. PISA, taken as the ‘true story of education’, has contributed to the view that comparison with other school systems is the master scheme for raising quality. And, even more, to ‘increased results’ in maths, reading and science as the master goal of education. This means that, here, this specific detector leaves us with very narrow understandings of knowledge and performance, reduced from their complexities. However, this is not entailed by PISA itself, but is a result of the many practices entangled with it in our historical and geographical context. Paradoxically, PISA may be a detector that makes us blind to real real-life skills, and one that constrains the possibilities for learning in depth, and for understanding what it means to be a knowledgeable citizen.

Reprinted with permission from the OECD.

References

Arffman, I. (2010), ‘Equivalence of translations in international reading literacy studies’, Scandinavian Journal of Educational Research, 54 (1): 37–59.
Fenwick, T. & Edwards, R. (eds) (2012), Researching education through actor-network theory, New York: John Wiley & Sons Inc.
Gorur, R. (2011), ‘ANT on the PISA trail: Following the statistical pursuit of certainty’, Educational Philosophy and Theory, 43 (1): 76–93.
Gorur, R. (2015), ‘Situated, relational and practice-oriented: The actor-network theory approach’, in K. N. Gulson, M. Clarke, & E. B. Petersen (eds), Education policy and contemporary theory: Implications for research, 87–98, London: Routledge.
Grek, S. (2012), ‘What PISA knows and can do: Studying the role of national actors in the making of PISA’, European Educational Research Journal, 11 (2): 243–254.
Grisay, A. (2003), ‘Translation procedures in OECD/PISA 2000 international assessment’, Language Testing, 20 (2): 225–240.
Hamilton, M., Maddox, B., & Addey, C. (2015), Literacy as numbers: Researching the politics and practices of international literacy assessment, Cambridge: Cambridge University Press.
Hubley, A. & Zumbo, B. (2017), ‘Response processes in the context of validity: Setting the stage’, in A. M. Hubley & B. D. Zumbo (eds), Understanding and investigating response processes in validation research, 1–12, Social Indicators Research Series 69, Cham, Switzerland: Springer.
Knorr Cetina, K. (1999), Epistemic cultures: How the sciences make knowledge, Cambridge, MA: Harvard University Press.
Kreiner, S. & Christensen, K. B. (2014), ‘Analyses of model fit and robustness: A new look at the PISA scaling model underlying ranking of countries according to reading literacy’, Psychometrika, 79 (2): 210–231.
Latour, B. (1999), Pandora’s hope: Essays on the reality of science studies, Cambridge, MA: Harvard University Press.
Latour, B. (2005), Reassembling the social: An introduction to actor-network-theory, Oxford: Oxford University Press.
Lau, K. C. (2009), ‘A critical examination of PISA’s assessment on scientific literacy’, International Journal of Science and Mathematics Education, 7 (6): 1061–1088.
Laukaityte, I. (2016), ‘Statistical modeling in international large-scale assessments’, doctoral dissertation, Umeå: Umeå University.
Maddox, B. (2014), ‘Globalising assessment: An ethnography of literacy assessment, camels and fast food in the Mongolian Gobi’, Comparative Education, 50 (4): 474–489.
Martens, K. (2007), ‘How to become an influential actor – the “comparative turn” in OECD education policy’, in K. Martens, A. Rusconi, & K. Lutz (eds), Transformations of the state and global governance, 40–56, London: Routledge.
Morgan, C. (2007), ‘OECD Programme for International Student Assessment: Unraveling a knowledge network’, doctoral dissertation, ProQuest.
OECD (2006), Assessing scientific, reading and mathematical literacy: A framework for PISA 2006, Paris: Organisation for Economic Co-operation and Development (OECD).
OECD (2013), PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy, Paris: Organisation for Economic Co-operation and Development (OECD).
Pickering, A. (1995), The mangle of practice: Time, agency, and science, Chicago, IL: The University of Chicago Press.
Sellar, S. & Lingard, B. (2014), ‘The OECD and the expansion of PISA: New global modes of governance in education’, British Educational Research Journal, 40 (6): 917–936.
Serder, M. (2015), ‘Möten med PISA. Kunskapsmätning som samspel mellan elever och provuppgifter i och om naturvetenskap’ [Encounters with PISA. Knowledge measurement as co-action between students and test assignments in and about science], doctoral dissertation, Malmö: Malmö University.
Serder, M. & Ideland, M. (2016), ‘PISA truth effects: The construction of low performance’, Discourse: Studies in the Cultural Politics of Education, 37 (3): 341–357.
Serder, M. & Jakobsson, A. (2015), ‘“Why bother so incredibly much?”: Student perspectives on PISA science assignments’, Cultural Studies of Science Education, 10 (3): 833–853.
Serder, M. & Jakobsson, A. (2016), ‘Language games and meaning as used in student encounters with scientific literacy test items’, Science Education, 100 (2): 321–343.
Skolverket (2007), PISA 2006: 15-åringars förmåga att förstå, tolka och reflektera – naturvetenskap, matematik och läsförståelse [PISA 2006: 15-year-olds’ ability to understand, interpret and reflect – science, mathematics and reading comprehension], no. 306, Stockholm: Skolverket.
Skolverket (2014), ‘Du är utvald’ [You have been chosen]. Available online: https://www.youtube.com/watch?v=DFJ96wirhk
Tröhler, D. (2013), ‘The OECD and cold war culture: Thinking historically about PISA’, in H. Meyer & A. Benavot (eds), PISA, power, and policy: The emergence of global educational governance, 141–161, Oxford: Symposium Books Ltd.
Waldow, F., Takayama, K., & Sung, Y. K. (2014), ‘Rethinking the pattern of external policy referencing: Media discourses over the “Asian Tigers” PISA success in Australia, Germany and South Korea’, Comparative Education, 50 (3): 302–321.

Part Two

Observing data production

5

Starting strong: on the genesis of the new OECD survey on early childhood education and care

Simone Bloem

Introduction

The Organisation for Economic Co-operation and Development (OECD), with the help and guidance of the OECD Network on Early Childhood Education and Care (ECEC), decided in 2014 to develop and implement a new international comparative survey: the TALIS Starting Strong Survey. It will collect self-reported data from pedagogical staff and ECEC centre leaders for the first time in 2018 in order to internationally compare pedagogical practices and beliefs, working conditions, job satisfaction and a wide range of other themes. This chapter describes elements of the production process of international surveys with the example of this new OECD survey on ECEC. In this way, I will provide insights into the ‘black box’ within which data production takes place. Usually, once surveys are administered and result reports prepared, they appear as static and immovable objects. Yet the development of a survey is a complex process which, among other things, involves questions of power, resources, decision-making and bargaining (Desrosières, 2008a). The chapter is inspired by works that can be attributed to the field of sociology of quantification, which considers statistics to be social products and thus analyses statistical production within its specific social and historical contexts (Porter, 1995; Espeland & Sauder, 2007; Espeland & Stevens, 2008; Desrosières, 2008a, 2008b). I also refer to works from global governance research that analyse international organizations as actors with the power to shape international and national policy themes and discourses (Barnett & Finnemore, 2004; Marcussen, 2004; Porter & Webb, 2007; Martens & Jakobi, 2010; Martens & Niemann,
2010; Sellar & Lingard, 2013). From this perspective I consider the OECD as an independent actor with its own interests and a strong agenda, whose role in the survey development goes beyond pure coordination. I show that ‘institutional dynamics’ appear in the development of the TALIS Starting Strong Survey. Such ‘institutional dynamics’ are informed by the OECD secretariat (Nagel, Martens & Windzio, 2010: 6) and underpin the influential role of the OECD in the survey development. However, I do so without neglecting the complex relationships with member countries and the extensive committee and network structures within the OECD that enable common agenda setting at the OECD (Marcussen, 2004; Woodward, 2009; Carroll & Kellow, 2011; OECD, 2017). This chapter therefore illustrates the decisive role of the OECD ECEC Network and participating countries in the development of the TALIS Starting Strong Survey.

The chapter draws on the knowledge I acquired through an insider perspective, in the role of a participant at the OECD ECEC Network meetings and National Project Manager of the TALIS Starting Strong Survey in Germany. I took up this position in March 2016 at the German Youth Institute, which the German Federal Ministry of Family Affairs has given the task of providing scientific support within the framework of the OECD ECEC Network and of administering the TALIS Starting Strong Survey in Germany. This methodological approach can be characterized as ‘participant observation’, which stresses active engagement within the field of study (Soulé, 2007). I also draw on information obtained through four formal interviews and informal exchanges with OECD ECEC Network representatives and OECD staff. It is important to note that the content of the interviews does not necessarily reflect the official view of countries or the OECD, but remains the personal account of persons who are involved in the production of the surveys. Lastly, I refer to background documents which are in most cases prepared by the OECD secretariat in preparation for the OECD ECEC Network meetings. It is important to emphasize that this contribution is not intended to judge or criticize decisions that have been made in the course of the methodological and theoretical development of the survey, but rather to describe development processes in order to depict the process-oriented character of statistics. As this chapter can only provide a short insight into international survey development procedures, further research is indispensable to analyse and better understand the role of the OECD and of its different bodies in OECD’s programme development.

The chapter is structured as follows. The following section presents the TALIS Starting Strong Survey and describes the way of working and mandate

It notably presents the functioning of the OECD from the perspective of international governance research, which highlights the autonomy of action of international organizations. ‘The genesis of the TALIS Starting Strong Survey’ then traces the genesis of the survey by looking at two elements of survey development: fundamental decisions on the methodological approach used to generate data on ECEC systems, and questionnaire development. The ‘Conclusion’ recapitulates the OECD’s role in the development of the TALIS Starting Strong Survey and raises some questions for future research, among them how the roles of the different actors involved will shift and change in the further development of the TALIS Starting Strong Survey, and how this new survey will affect the OECD’s work in the field of ECEC.

The TALIS Starting Strong Survey and the OECD Network on Early Childhood Education and Care

The OECD is well known for its international education surveys and databases, most prominently the Programme for International Student Assessment (PISA), the OECD Survey of Adult Skills (PIAAC), the OECD Teaching and Learning International Survey (TALIS) and the annually released Education Indicators (Education at a Glance). The OECD’s wealth of data will be further extended in the area of ECEC by means of two new studies: the TALIS Starting Strong Survey, targeting pedagogical staff working with children aged (0)3–6 years in ECEC settings as an extension of TALIS, and the OECD’s International Early Learning and Child Well-being Study (IELS), an assessment study of the early learning, competences and well-being of 5-year-olds. Both ECEC studies are currently under development. The piloting of the draft questionnaires of the TALIS Starting Strong Survey took place in autumn 2016; the field trial was conducted in spring 2017; the main survey is scheduled for 2018; and data analysis and reporting take place in 2019 and 2020. Pedagogical staff and leaders of ECEC centres in nine participating OECD countries1 are asked about a wide range of topics: their initial qualification and professional development activities, their working conditions and job satisfaction, the learning environment and climate in their ECEC centre, centre management practices, and their pedagogical practices and beliefs.

1 Four of these nine countries conduct an additional study that targets pedagogical staff and centre leaders who work with children under the age of three.


An international consortium led by the International Association for the Evaluation of Educational Achievement (IEA), in cooperation with RAND Europe and Statistics Canada, was commissioned in 2016 to operationalize the development and implementation of the survey internationally. National partners in participating countries contribute to the survey development through various forms of input and feedback at different stages and are responsible for the implementation of the survey at the country level. The OECD ECEC Network, founded in 2007, has been involved in the development of the TALIS Starting Strong Survey. It is one of several networks and working groups which oversee specific projects and programmes of the OECD’s education work, and it does so in particular for ECEC. By fostering exchange and discussion between the OECD Secretariat and national experts on specific themes and policy issues – themes and issues which are then concretized in specific OECD projects or programmes – these OECD networks and working groups contribute to the strategic and content orientation of the OECD’s projects and programmes. Ultimate responsibility for the OECD’s education work and budget falls to the Education Policy Committee (EDPC), a high-level body of senior (ministry) officials of OECD member countries. It was also the EDPC that mandated the OECD Secretariat in 2014 to develop and implement the TALIS Starting Strong Survey. The OECD ECEC Network meets twice a year and encompasses representatives from OECD member states and partner countries – often senior decision-makers on ECEC policy and government-associated researchers specializing in ECEC issues – as well as representatives from international organizations, notably the European Commission, UNESCO and the Trade Union Advisory Committee (TUAC). According to the document on the renewal of the network’s mandate, its role is ‘to assist countries in the development of effective policies and practices in the field of early childhood education and care to promote better social, cultural, educational and economic outcomes for children’ (OECD, 2011: 2). It further states that the network’s main tasks are the development and dissemination of information on countries’ experiences of policy, research and good practice; the identification of unresearched topics; and the assessment of data developments on ECEC systems (OECD, 2011: 2). Within this mandate, the OECD’s proposal for the development of an international large-scale survey was substantiated and further concretized. In 2015, the OECD ECEC Network formed a subgroup, the Extended ECEC Network on the TALIS Starting Strong Survey, which discusses, in its subgroup meetings, issues specifically related to the TALIS Starting Strong Survey.


These meetings often take place back-to-back with meetings of the TALIS Governing Board,2 with the aim of benefiting from synergies. Data-related activities are thus an important element of the work of the network. This also applies to the OECD Directorate for Education and Skills, and to the OECD as a whole: the majority of the organization’s activities rely heavily on data – the coordination of data collection, data analysis and the preparation of data reports (Jakobi & Martens, 2010a, b). The team at the OECD Directorate for Education and Skills which works specifically on projects related to early childhood prepares international reports on different aspects of ECEC, such as the ‘OECD Starting Strong’ publication series, and develops international data and surveys, among them the TALIS Starting Strong Survey. It also works on communication and dissemination activities, including the preparation of information brochures and background documents for the OECD ECEC Network meetings, setting up and maintaining the website, organizing launch events and so on. In relation to this data-driven approach, the organization has a number of general objectives that guide its work on educational themes. These include, for instance, enhancing the impact of standards, the growth of productivity and prosperity, social inclusiveness and well-being, and support for the global agenda of the Sustainable Development Goals. Its goal is also to achieve high relevance and to increase the productivity of its own work (OECD, 2017). The above already provides an indication of the important role the OECD plays in international education due to its specialization in the collection and analysis of data (Martens & Jakobi, 2010; Martens & Niemann, 2010). Not least due to the increasing number of education surveys and the rising number of countries participating in them, the organization’s impact on education policies at the national and international level has grown over the last twenty years (Steiner-Khamsi, 2009; Martens et al., 2010; Jakobi & Martens, 2010a; Pereyra, Kotthoff & Cowen, 2011; Meyer & Benavot, 2013; Sellar & Lingard, 2013).

2 The TALIS Governing Board is the board of national representatives of participating countries in the TALIS survey, Education International (representing teacher unions worldwide) and the European Commission. The board has primary decision-making power over the (further) development of the survey, including the goals of the themes included in the questionnaires.


Following the argumentation of Barnett and Finnemore (2004), international organizations like the OECD can be understood as ‘bureaucracies’ that have the power to set up rules and to exercise power. From their member states they receive a certain degree of autonomy, which in turn should enable them to fulfil their mission: the production of specific global knowledge which states are hardly able to produce themselves. At the same time, bureaucracies function according to their own rules and internal processes and procedures (more or less effectively and purposefully). As a result of these features, international organizations can evolve and expand in an independent manner, sometimes in ways other than those desired or planned by their member states. The TALIS Starting Strong Survey provides a case study for understanding the ‘institutional dynamics’ that unfold within the OECD’s programme development (Nagel et al., 2010: 6). Such ‘institutional dynamics’ draw upon OECD staff who develop ideas and design projects which then make their way into the various subsidiary and decision-making bodies of the OECD, including the OECD ECEC Network. OECD staff prepare the agenda and the majority of the background documents which network participants receive prior to the meetings in order to prepare for them.3 Furthermore, the OECD Secretariat commissions (in liaison with the network) ECEC experts to draft expert papers or to give presentations on specific topics at the network meetings. In this way, the OECD Secretariat not only plays a central role in the coordination of the network, but also influences the substantive conception and formation of the contents of the network meetings. This in turn can have an impact on the network’s decisions and activities. It illustrates on a small scale how international organizations like the OECD become ‘own authorities’ that operate partly independently of member states (Barnett & Finnemore, 2004: 156) and create opportunities to develop their ‘own agendas’ within the organization, and thus the power to frame problems and issues and to persuade member states (Nagel et al., 2010: 6). But the network is also an important actor in the OECD’s programme development in the field of ECEC. With regard to the TALIS Starting Strong Survey, the OECD highlights the decisive role of the network in the survey development: it defines the policy areas where data should be collected and provides constant guidance and advice on methodological issues and questionnaire development (OECD, 2016). The formation of the network’s subgroup on the TALIS Starting Strong Survey in 2015 was intended to further support and strengthen the operational governance of the survey development.

3 The proposed agenda is arranged with the advisory group of the OECD ECEC Network and it is generally possible that the advisory group also proactively suggests agenda items.


The genesis of the TALIS Starting Strong Survey

The TALIS Starting Strong Survey will be the first international large-scale survey at an institutional level (i.e. ECEC centres) in the area of early childhood education under the lead of an international organization. This is despite the fact that early childhood education has been high on the agenda of international organizations such as the OECD, UNESCO and the World Bank, as well as of national and local policy makers, for some time now (Mahon, 2016). The OECD has addressed the field of ECEC since 1996, initially as an element of larger policy matters relating to national labour market policies. Its early work on this topic is renowned and internationally acknowledged, with the first two ‘OECD Starting Strong’ studies from 2001 and 2006 providing an important contribution (Penn, 2011). Starting Strong I was the first international comparative study of ECEC systems and policies by the OECD (OECD, 2001). Starting Strong II shows the progress made by participating countries since 2001 and identifies ten areas considered key for improving ECEC policies and systems (OECD, 2006). Essential to the decision to develop an international large-scale survey in ECEC was the publication ‘Starting Strong III: A Quality Toolbox for Early Childhood Education and Care’, which was drafted by the OECD Secretariat under the oversight of the OECD ECEC Network (Starting Strong I and II were published before the foundation of the network) (OECD, 2012a). Starting Strong III contains a policy toolbox with five key levers which are considered crucial for developing and improving quality in ECEC. As reported by OECD staff, it was notably policy lever 5, ‘Advancing data collection, research and monitoring’, which led to spin-off discussions in the OECD ECEC Network on how the network could itself contribute to this lever. In a network exercise in 2012 aimed at identifying existing data gaps, priority areas for policy-relevant data collection were suggested by the OECD ECEC Network (OECD, 2012b). A key role of the OECD Secretariat was to help identify the indicators that were needed to address these priority areas. In this way, the OECD helped to clarify and formalize the suggestion of an international data collection under its coordination. In this regard, the OECD sees its mission as advising countries ‘to evaluate the effectiveness of early childhood education and care (ECEC) policy interventions and to design evidence-based ECEC policies and in particular, in times of strong budgetary constraints, more cost-effective policies’ (OECD, 2015: 10).


Linkage to TALIS and/or observational studies

It was not clear from the beginning what type of international survey would be appropriate to provide information about the quality of national ECEC systems. A survey of ECEC staff in which staff report on their pedagogical practices and an international observational study of pedagogical practices in ECEC settings were both discussed as possible approaches. The OECD Secretariat started taking stock of lessons learnt on staff-level development from institutions including NEPS (the National Educational Panel Study), NIEER (the National Institute for Early Education Research) and the IEA, which was working in parallel on its own Early Childhood Education Study (ECES):

Data on pedagogy are not well covered by currently available international data and an investment in the development of staff surveys or staff observational studies is needed to better assess the quality of interactions and pedagogy children experience (OECD, 2013: 8).

Many network representatives recognized the methodological advantages of observational studies, most notably that they are, in contrast to self-reported surveys, less affected by social desirability. Yet observational studies face other methodological problems, such as the comparability of observations across settings and countries, and observers’ subjective perceptions and ratings. In the end, the development of an internationally comparable observational study was not pursued further, as it was considered too cost-intensive and the methodological difficulties too large. Further, and most importantly as reported by OECD staff and OECD ECEC Network members, the decision to move ahead with the collection of data through a self-reported survey of ECEC staff was also affected by the fact that such a survey could build upon TALIS and draw on the good experiences gained from the development and implementation of that OECD survey. OECD staff additionally stressed that TALIS has become highly politically relevant to participating countries, which gives good cause to hope that a self-reported survey of ECEC staff could achieve equally high policy relevance in the ECEC sector. TALIS was first conducted in 2008 and collected self-reported data from teachers and school principals at secondary education level. Since the second cycle of TALIS (conducted in 2013), a number of countries have opted to additionally collect data on the teaching practices and working conditions of primary teachers. The envisaged survey in ECEC was given a first unofficial name: ‘pre-school TALIS’ (OECD, 2013: 8).


In order to extend TALIS to the level of early childhood education and care, it was necessary that the TALIS Governing Board agree to this extension of TALIS to ECEC. Alternatively, a distinct TALIS-like survey would have had to be developed by the network (OECD, 2013: 8). Earlier in 2013, the OECD had commissioned a scientific expert to explore different approaches to developing a ‘staff-level survey about the teaching, learning and well-being environment in ECEC settings’ (Bäumer, 2014). That paper discussed how an ECEC staff survey could be constructed and structured. The author’s systematic comparison of the envisaged ECEC staff survey and TALIS brought him to the conclusion that ‘TALIS items can be used for an ECEC survey, with (minor) adaptations’ for many themes, such as professional development or leadership (Bäumer, 2014: f.). Overall, Bäumer describes TALIS as an ‘excellent and primary source’ for a staff survey in ECEC after adaptations to the ECEC context (Bäumer, 2014: 9). It may be added that the commissioning by the OECD Secretariat of an expert paper on the development of an international survey on quality in ECEC is one way in which the OECD exerts influence on activities under the frame of the network. With the expert paper, the proposal on the methodology for data collection could be substantiated and pushed forward in a certain direction. In this sense, the OECD Secretariat acted as a facilitator and driver of the idea of developing an international study. The commissioning of the expert paper is an example of the ‘institutional dynamics’ at play in the OECD Secretariat and was an important factor in the realization of the international survey. The proposal raised in the expert paper to complement the staff survey with observational studies was further pursued by the OECD, which brought it to the discussion at network meetings and organized two webinars with the OECD ECEC Network in 2015 and 2016. The OECD was looking for interested countries that could take the lead in the linkage exercise, but only a small number of countries expressed interest in investing in such an endeavour. Due to the low number of interested countries, the OECD had to conceptualize the linkage of the staff survey to an observational study as a ‘national option’ which would have to be fully developed, organized and implemented by the country itself. However, no country pursued that option. The high personnel and financial expenditure related to the methodological development and implementation of observations of pedagogical practices constituted the most significant obstacle to complementing the staff survey with observations, at least in the case of Germany. This was despite the fact that a validation of findings from the staff survey using observational data was generally considered highly valuable.


Moreover, Germany would have favoured a common international approach, which appeared unlikely to be established due to the lack of interest from other countries and thus the lack of a mandate for the OECD Secretariat. The reasons why Germany and the OECD Secretariat did not follow up the linkage of the staff survey to observational studies, as presented above, illustrate well one of the key characteristics of statistics, namely that their production (quantification) is labour-intensive (Espeland & Stevens, 2008). As the authors underline, quantification ‘requires considerable work, even when it seems straightforward’ (Espeland & Stevens, 2008: 410). In some cases, as with the observational studies, the fact that the development of a particular type of data is very labour-intensive can ultimately mean that it is not realized. The OECD depended heavily on countries’ interest and on their own initiatives and actions for the realization of the idea of linking the staff survey to observations, which ultimately did not materialize. Not least, this illustrates the significance of the network for the OECD’s activities in the field of ECEC.

Questionnaire development

Once the decision had been made that a self-reported staff survey should be developed, the selection of, and agreement on, the specific thematic blocks to be examined in the survey was the next consequential step. The TALIS Starting Strong Survey collects data from pedagogical staff and centre leaders of ECEC settings with two separate questionnaires. The questionnaires for staff and for ECEC centre leaders are developed by a Questionnaire Expert Group (QEG), which is made up of experts on questionnaire development, content experts, experts responsible for alignment with TALIS, and staff members from the OECD Directorate for Education and Skills responsible for oversight of policy goals and priorities. The QEG met for the first time in April 2016. The broad themes to be covered in the questionnaires were decided beforehand on the basis of priority-rating activities in 2015, in which countries could indicate their priorities among a list of possible themes and related indicators to be dealt with in the survey. Because of restrictions on the length of the questionnaires (a maximum of fifty minutes), the priority ratings were intended to allow quasi-democratic decision-making among interested countries, so as to select those themes favoured by the majority of participating countries. In this way, the OECD stresses the political relevance that should be achieved, which is presented as a key characteristic and strength of the OECD’s surveys.
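To make the logic of such a priority-rating exercise concrete, the following short Python sketch shows, in highly simplified form, how ratings from several countries might be aggregated and themes selected under a fixed questionnaire-length budget. It is a hypothetical illustration only: the country labels, themes, ratings and minutes-per-theme values are invented and do not reproduce the actual OECD rating instrument or its decision rules.

# Hypothetical sketch of a priority-rating exercise: each country rates
# candidate themes (higher = more important); themes are selected in order
# of average rating until an assumed questionnaire-time budget is used up.

ratings = {
    'Country A': {'professional development': 5, 'working conditions': 4,
                  'pedagogical beliefs': 3, 'centre management': 2},
    'Country B': {'professional development': 4, 'working conditions': 5,
                  'pedagogical beliefs': 2, 'centre management': 3},
    'Country C': {'professional development': 5, 'working conditions': 3,
                  'pedagogical beliefs': 4, 'centre management': 2},
}

# Assumed answering time per theme, in minutes (invented values).
minutes_per_theme = {'professional development': 15, 'working conditions': 15,
                     'pedagogical beliefs': 12, 'centre management': 10}

BUDGET_MINUTES = 50  # the maximum questionnaire length mentioned in the chapter


def select_themes(ratings, minutes_per_theme, budget):
    """Rank themes by mean country rating and keep them while the budget allows."""
    themes = list(minutes_per_theme)
    mean_rating = {t: sum(r[t] for r in ratings.values()) / len(ratings) for t in themes}
    selected, used = [], 0
    for theme in sorted(themes, key=mean_rating.get, reverse=True):
        if used + minutes_per_theme[theme] <= budget:
            selected.append(theme)
            used += minutes_per_theme[theme]
    return selected, used


if __name__ == '__main__':
    chosen, minutes = select_themes(ratings, minutes_per_theme, BUDGET_MINUTES)
    print(chosen, minutes)  # with these invented values, three themes fit within the budget

In the actual development process, of course, the selection was the outcome of negotiation and consultation rather than of a mechanical rule, as the following paragraphs make clear.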


At the network meeting in spring 2016, the network was informed by the OECD Secretariat and the international contractor that, on the basis of a mapping exercise in which the TALIS indicators were compared with the indicators considered for the ECEC staff survey, an estimated overlap of around 70 per cent between TALIS and ECEC staff survey indicators could be achieved. However, the main survey version of the questionnaires finally contains only about 30 per cent of items that are the same as, or only minimally adapted from, TALIS. This lower carry-over from TALIS arises, first, because in the adaptation and revision process many questions and items from TALIS needed to be heavily revised in order to fit the specific context of ECEC. Second, in the process of shortening the questionnaires for the field trial, many questions from TALIS were evaluated as less important by participating countries and were dropped. Lastly, the necessary reductions in the length of the final main survey questionnaires also resulted in the removal of questions and items from TALIS. For the OECD ECEC Network, the fit of the questionnaires to the specificities of the ECEC sector was always given priority over comparability with TALIS. But the linkage of the ECEC staff survey to TALIS had pragmatic advantages, because questions did not have to be developed from scratch but could be taken from an existing survey and adapted to the specificities of the ECEC context. Moreover, new analysis potential arises because the results of both surveys can be directly compared. The OECD promotes such synergies between different strands of work in the area of education, as overlaps between different programmes and projects in the OECD are likely to increase the visibility of the OECD’s products, because lines of discourse can be interrelated, assembled and diffused more widely. This, in turn, strengthens the policy relevance and impact of the different surveys as well as the OECD’s education agenda built on these surveys (Bloem, 2016). Although Germany does not participate in the TALIS survey, an interviewed expert noted that the strong linkage to TALIS can be an advantage for all countries and for the ECEC Network, as it will tie the field of ECEC more strongly to the education sector in general. In this way, ECEC could experience an upgrade within the education sector, because the direct attachment of ECEC to higher levels of education stresses the importance of early learning for future learning and life and valorizes the educational element in ECEC (in contrast to care only). Yet a possible risk is the adoption of a ‘school-based’ approach in early childhood education and a growing focus on outcomes. This has been and still is a topic of discussion for the OECD ECEC Network.
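In essence, a mapping exercise of this kind compares two sets of indicators. The short Python sketch below is a hypothetical illustration of such a comparison, not the actual mapping procedure: the indicator labels are invented, and a real mapping also involves judgements about partial or adapted matches rather than exact equality.

# Hypothetical indicator-mapping sketch: what share of the ECEC staff survey
# indicators could be covered by (possibly adapted) TALIS indicators?
# All labels are invented for illustration.

talis_indicators = {
    'participation in professional development',
    'job satisfaction',
    'feedback received',
    'classroom climate',
    'leadership practices',
}

ecec_indicators = {
    'participation in professional development',
    'job satisfaction',
    'process quality of interactions',
    'leadership practices',
    'work with parents',
}


def overlap_share(source, target):
    """Proportion of target indicators that also appear in the source set."""
    return len(source & target) / len(target)


share = overlap_share(talis_indicators, ecec_indicators)
print(f'{share:.0%} of the ECEC indicators map onto TALIS')
# With these invented sets the share is 60 per cent; the chapter reports an
# estimated 70 per cent at the mapping stage and about 30 per cent of items
# retained from TALIS in the final main-survey questionnaires.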


In June 2016 a first draft of the questionnaires was released. Participating countries were consulted again to provide feedback on the draft versions. In autumn 2016 the questionnaires were piloted through focus group discussions with pedagogical staff and centre leaders in participating countries in order to test the relevance of the questionnaires in national contexts. Based on the feedback from the pilot, the QEG revised the questionnaires and released an updated version in January 2017. As the questionnaires were still too long, participating countries were asked to select several questions for deletion, which led to the omission of those questions that received the most votes. Methodological considerations aimed at achieving international comparability of questions, items and the collected data stand in some tension with sensitivity towards national contexts and specificities. As international comparability is the prerequisite of the study, national interests in specific topics or approaches need to be sacrificed, at least in part. Countries undergo a rigid verification process in which all adaptations made to the international version of the questionnaire need to be well justified and must remain within a maximum number of adaptations, this threshold being considered a degree of divergence that still allows the data to be regarded as internationally comparable. This has consequences for the relevance and fit of questions in what are sometimes very distinct national ECEC contexts. The development of the questionnaires is thus guided by the aim of finding a viable compromise between international comparability and national relevance, from the perspective of international actors such as the OECD, as coordinator and an important end-user of the survey, but also from that of national partners. Some questions were developed and included in the draft questionnaires so as to ensure synergies with other OECD projects, e.g. the Policy Review: Quality Beyond Regulations and the OECD Future of Education and Skills 2030. The questions related to OECD Education 2030 concern so-called twenty-first-century competencies. Pedagogical staff are asked to evaluate how important they consider the development of certain competencies to be in preparing children for the future. These so-called twenty-first-century competencies play a major role in the OECD’s overall education policy agenda and in the ‘Future of Education and Skills: Education 2030’ project in particular. An important strand of work in this project is the development of a conceptual learning framework relevant for 2030, which includes a set of knowledge, skills, values and attitudes which constitute the competencies considered necessary for students to exercise their ‘agency’ and shape their own future.


This should help them to find both opportunities and solutions in response to ‘globalisation, technological innovations, climate and demographic changes and other major trends’.4 Besides this, a question on the spending priorities of ECEC staff in the event of a budget increase in ECEC was proposed by the OECD for inclusion in the questionnaire, in light of the real choices decision-makers face across OECD countries. While there was a strong policy interest in those topics, some national study coordinators voiced criticism regarding the relevance of, and problems with, these questions in the pilot. The questions were subsequently revised and remained in the final questionnaires. The active involvement of the OECD Secretariat in the instrument development is a further example of the ‘institutional dynamics’ within the organization which make the OECD an active player in international survey development. In this regard, the OECD acts on its organizational interests, which are notably the achievement of high policy relevance for its surveys and results (which is also in the interest of countries) and the use of synergies and the alignment of the survey with other OECD education projects (Bloem, 2016). The staffing of the QEG with experts who are responsible for the alignment of the TALIS Starting Strong Survey with TALIS illustrates the importance given to synergies and complementarities between different strands of work within the OECD Directorate for Education and Skills.

Conclusion

The aim of this chapter was to illustrate the dynamic and complex social process of ILSA development with the example of the new study in ECEC under the coordination of the OECD, the TALIS Starting Strong Survey. This contribution has been inspired by the field of the sociology of quantification, which considers numbers as a social product that has to be analysed as such. Moreover, I referred to work from international governance research which considers international organizations like the OECD as autonomously acting institutions. Even though the chapter could only depict a number of examples of the social processes that underlie the development of large-scale international surveys, it offers at least some insights into this complex and multi-faceted process. It has shown how and why the TALIS Starting Strong Survey was linked to TALIS and has been developed as an extension of TALIS to the pre-primary level.

4 http://www.oecd.org/edu/school/education-2030.htm (accessed 27 June 2018).


It has also highlighted that this orientation developed in the course of the survey’s development process and thus did not exist from the beginning. The methodological approach concerning how data on the quality of ECEC systems can and should be collected through an international survey was not clear at the outset. Besides a self-reported survey of staff in ECEC settings, the possibility of conducting an international observational study of pedagogical practices was initially taken into consideration. The use of synergies with TALIS was a central factor that led to the decision to go for the first option and hence to develop a self-reported survey of ECEC staff. With regard to a possible linkage of a self-reported staff survey to observations, the immense personnel and financial expenditure was stressed as the reason why the initially envisaged and recommended mixed-method approach was ultimately not further concretized. The OECD appeared as an active agent pushing the development of international surveys forward at several points of the survey development: by bringing the envisaged surveys onto the agenda of ECEC Network meetings, by commissioning experts to evaluate different options for how international surveys in the field of ECEC could be conceptualized, and by proposing its own questions for inclusion in the survey questionnaire. Such activities have been described as ‘institutional dynamics’ within the OECD which ultimately make the organization an independent actor with its own interests and the power to frame problems and issues and to persuade member countries. But this chapter has equally highlighted the important role of the network in the survey development process and has illustrated the role of participating countries using the example of test instrument development. In light of the fact that conducting international surveys is an ongoing process with changing emphases and priorities, it will be of interest to observe how the roles of the actors involved – the OECD Secretariat, the network and participating countries, and the international consortium, whose role it has not been possible to discuss in this contribution – evolve in the further development of the TALIS Starting Strong Survey. While at first the most important issues were notably the determination of a methodological approach and questionnaire development, in the medium to long term – if the Starting Strong Survey is repeated in a six-year cycle – the inclusion of new topics in the questionnaires while ensuring comparability across cycles will become a central concern that requires new negotiations and decision-making. Finally, an important aspect of carrying out an international survey, which will be of significance from 2019, will be the interpretation and dissemination of results by different actors.


The survey will lead to new data and will also complement the OECD’s existing education data, which offers new potential for data analysis and for the formation of policy discourse on the basis of empirical data – for researchers but also for the OECD itself. The OECD’s own strongly data-driven approach, which allows the organization to give evidence-based policy advice to countries, is considered one of its strengths, lending the organization and its policy discourses an objective character (Bloem, 2016). This raises the interesting question of how the OECD will further promote its ECEC policy agenda with the introduction of the new ECEC surveys and extend its role in early childhood education internationally.

References Barnett, M. & Finnemore, M. (2004), Rules for the world: International organizations in global politics, Ithaca, NY: Cornell University Press. Bäumer, T. (2014), Network on Early Childhood Education and Care. Technical review of the analytical benefits to be gained from collecting staff-level data on ECEC. EDU/EDPC/ECEC(2013)14/REV1. Bloem, S. (2016), Die PISA-Strategie der OECD: Zur Bildungspolitik eines globalen Akteurs, Weinheim: Beltz Juventa. Carroll, P. & Kellow, A. (2011), The OECD: A study of organisational adaptation, Cheltenham: Edward Elgar. Desrosières, A. (2008a), Pour une sociologie historique de la quantification, 1, Paris: Presses de l’École des Mines. Desrosières, A. (2008b), Pour une sociologie historique de la quantification, 2, Paris: Presses de l’École des Mines. Espeland, W. & Sauder, M. (2007), ‘Rankings and reactivity: How public measures recreate social worlds’, American Journal of Sociology, 113 (1): 1–40. Espeland, W. & Stevens, M. (2008), ‘A sociology of quantification’, European Journal of Sociology, 49 (3): 401. Jakobi, A. P. & Martens, K. (2010a), ‘Expanding and intensifying governance: The OECD in education policy’, in K. Martens & A. P. Jakobi (eds), Mechanisms of OECD governance: International incentives for national policy-making?, 163–179, Oxford: Oxford University Press. Jakobi, A. P. & Martens, K. (2010b), ‘Introduction: The OECD as an actor in international politics’, in K. Martens & A. P. Jakobi (eds), Mechanisms of OECD governance: International incentives for national policy-making?, 1–25, Oxford: Oxford University Press.


Mahon, R. (2016), ‘Early childhood education and care in global discourses’, in K. Mundy, A. Green, B. Lingard, & A. Verger (eds), Handbook of global education policy, Malden, MA: Wiley-Blackwell. Marcussen, M. (2004), ‘OECD governance through soft law’, in U. Mörth (ed.), Soft law in governance and regulation: An interdisciplinary analysis, 103–128, Cheltenham: Edward Elgar. Martens, K. & Jakobi, A. P. (eds) (2010), Mechanisms of OECD governance: International incentives for national policy-making?, Oxford: Oxford University Press. Martens, K. & Niemann, D. (2010), ‘Governance by comparison: How ratings & rankings impact national policy-making in education’, TranState working papers (139). Martens, K., Nagel, A. K., Windzio, M., & Weymann, A. (eds) (2010), Transformations of the state, transformation of education policy, Basingstoke: Palgrave Macmillan. Meyer, H.-D. & Benavot, A. (2013), PISA, power, and policy: The emergence of global educational governance, Oxford: Symposium Books. Nagel, A. K., Martens, K., & Windzio, M. (2010), ‘Introduction – education policy in transformation’, in K. Martens, A. K. Nagel, M. Windzio, & A. Weymann (eds), Transformations of the state, transformation of education policy, 3–27, Basingstoke: Palgrave Macmillan. OECD (2001), Starting strong early childhood education and care, Paris: OECD Publishing. OECD (2006), Starting strong II: Early childhood education and care, Paris: OECD Publishing. OECD (2011), Proposed renewal of the mandate of the network on early childhood and care, EDU/EDPC/ECEC(2011)10/REV2. OECD (2012a) Starting strong III: A quality toolbox for early childhood education and care, Paris: OECD Publishing. OECD (2012b), Network on Early Childhood Education and Care: Indicators of learning and well-being environments for children, EDU/EDPC/ECEC(2012)4. OECD (2013), Network on Early Childhood Education and Care: From policy questions to new ECEC indicators. Draft background paper for the 14th meeting of the OECD ECEC Network, EDU/EDPC/ECEC(2013)9. OECD (2015), Call for Tenders 100001310. Implementation of the first cycle of the International Survey of Staff in Early Childhood Education and Care. Available online: https://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd= 1&ved=0ahUKEwi69tvm-bnSAhXI1RQKHcrvBygQFggcMAA&url= http%3A%2F%2Fwww.oecd.org%2Fcallsfortenders%2F2015%252006%252008%25 20TERMS%2520OF%2520REFERENCE%2520ECEC%2520Staff%2520Sur vey%2520REV4%2520FINALclean.pdf&usg=AFQjCNE0P1CReGXkoGC_ FlTREbSeiUDCyg&cad=rja (accessed 27 June 2018). OECD (2016), TALIS: ECEC, Starting Strong Survey: Towards a conceptual framework for an international survey on ECEC Staff, EDU/EDPC/ECEC/RD(2016)2.


OECD (2017), OECD work on education & skills, Paris: OECD Publishing. Penn, H. (2011), Quality in early childhood services: An international perspective, Maidenhead: McGraw-Hill Education. Pereyra, M. A., Kotthoff, H.-G., & Cowen, R. (eds) (2011), Pisa under examination changing knowledge, changing tests, and changing schools, Rotterdam: Sense Publishers. Porter, T. (1995), Trust in numbers: The pursuit of objectivity in science and public life, Princeton, NJ: Princeton University Press. Porter, T. & Webb, M. (2007), The role of the OECD in the orchestration of global knowledge networks, Saskatchewan, Canada. Available online: https://www.google. de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwjqy7jp2NH MAhWDXSwKHXKSBZcQFggiMAA&url=https%3A%2F%2Fwww.cpsa-acsp. ca%2Fpapers-2007%2FPorter-Webb.pdf&usg=AFQjCNEunPfD8EndAdophAMHeL aF8Vo7ug&cad=rja (accessed 27 June 2018). Sellar, S. & Lingard, B. (2013), ‘The OECD and global governance in education’, Journal of Education Policy, 28 (5): 710–725. Soulé, B. (2007), ‘Observation participante ou participation observante? Usages et justifications de la notion de participation observante en sciences socials’, Recherches qualitatives, 27 (1): 127–140. Steiner-Khamsi, G. (2009), ‘The politics of intercultural and international comparison’, in S. Hornberg, İ. Dirim, G. Lang-Wojtasik, & P. Mecheril (eds), Beschreiben – Verstehen – Interpretieren: Stand und Perspektiven international und interkulturell vergleichender Erziehungswissenschaft in Deutschland, 39–61, Münster: Waxmann. Woodward, R. (2009), The Organisation for Economic Co-operation and Development (OECD), Abingdon: Routledge.

6

The situation(s) of text in PISA reading literacy
Jeanne Marie Ryan

In developing the Programme for International Student Assessment (PISA), the Organisation for Economic Co-operation and Development (OECD) has created a system to assess for the new construct of Reading Literacy. While one can easily envision examples of reading – the weight of a book in one’s hands, the feel of pages turning, perhaps scrolling through articles on one’s tablet or phone – are there similar images conjured by reading literacy? What might be the difference between the two? PISA, designed to be a decontextualized assessment tool, attempts to evaluate the analytical and reasoning skills of each test-taker from around the world without the use of contextual clues which could be pointedly linked to any country’s individual curriculum. ‘The knowledge and skills tested … are defined not primarily in terms of a common denominator of national school curricula but in terms of what skills are deemed to be essential for future life. This is the most fundamental and ambitious novel feature of OECD/PISA’ (OECD, 1999: 11). With the establishment of literacy as that novel feature, how is reading literacy in particular defined, and how is it subsequently operationalized into tasks and texts for assessment? This chapter investigates the construct of reading literacy in PISA: its development, its definitions, its assessment. Central focus is placed upon the nature of texts selected for use within the assessment of reading literacy: how and why are these texts chosen, and what does this mean in terms of the construct of reading literacy? In lieu of referring to curricular documents or assessment specifications there is instead an exploration of the PISA framework documents which detail the development, definitions and assessment of PISA Reading Literacy, with some supplementary insight provided from interviews with PISA Reading Literacy


designers. Furthermore, examples are provided from the publicly available PISA Reading Literacy items for close analysis of the features assessed therein.

The construct

Before analysing the construct of Reading Literacy as found in PISA, it is necessary to provide a brief overview of the nature of the construct itself. The origin of the construct can be traced to a paper published in 1955 by Lee Cronbach and Paul Meehl, in which they explain that the creation of the construct arose from a need for attribute- or characteristic-related validity in psychological testing. As a result, Cronbach and Meehl defined the construct as ‘some postulated attribute of people, assumed to be reflected in test performance’ (1955: 283): the construct forms the underlying structure of attributes that a test aims to assess. Since the initial work of Cronbach and Meehl, the definition of the construct has been more recently adapted, as seen in part in Table 6.1.

Table 6.1 Definitions of construct
Cronbach and Meehl (1955: 283): ‘some postulated attribute of people, assumed to be reflected in test performance’
AERA, APA and NCME (1999: 5): ‘the concept or characteristic that a test is designed to measure’
Alderson (2000: 118): ‘a psychological concept, which derives from a theory of the ability to be tested’
Kline (2000: 25): ‘the object of investigation or study but [is] only useful where [it] can be precisely defined’

In the Standards for Educational and Psychological Testing the term construct is defined as ‘the concept or characteristic that a test is designed to measure’ (AERA, APA & NCME, 1999: 5). However, how does the construct move from a definition towards its realization as tasks and texts within an assessment? As the Standards describe, the construct is not a stand-alone concept but is further elaborated by test developers into a conceptual framework that outlines the properties targeted for assessment.

To support test development, the proposed interpretation [of the construct] is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, processes, or characteristics to be assessed. (AERA, APA & NCME, 1999: 9)


The conceptual framework therefore serves as an operationalization of the construct from which test developers build, while also providing a more expansive understanding of the underlying construct in practice. In a similar vein as the Standards, Alderson (2000) too links the construct with test content. In the context of reading assessment, specifically, the texts, tasks, and inferences encompassed in an assessment come together to form the full operationalization of a construct: Constructs come from a theory of reading, and they are realised through the texts we select, the tasks we require readers to perform, the understandings they exhibit and the inferences we make from those understandings, typically as reflected in scores. In designing a test, we do not so much pick the ‘psychological entity’ we want to measure, as attempt to define that entity in such a way that it can eventually be operationalised in a test. (Alderson, 2000: 117, 119)

The development of a construct and, consequently, a conceptual framework, is thus fundamental to the functionality of an assessment, as the underlying entity, as Alderson puts it, uniting the elements of a test. How, then, has PISA developed its construct of Reading Literacy as expressed in its conceptual framework documents?

Reading and reading literacy

As a means of introduction to the construct, the 2015 PISA Reading Literacy framework document identifies the distinction to be made between ‘Reading Literacy’ and ‘reading’. Reading Literacy is envisioned as a broader construct than ‘[reading] in a technical sense’ (OECD, 1999: 19):

The term ‘reading literacy’ is preferred to ‘reading’ because it is likely to convey to a non-expert audience more precisely what the survey is measuring. ‘Reading’ is often understood as simply decoding, or even reading aloud, whereas the intention of this survey is to measure something broader and deeper. Reading literacy includes a wide range of cognitive competencies, from basic decoding, to knowledge of words, grammar and larger linguistic and textual structures and features, to knowledge about the world. (OECD, 2015: 49)

The same phrase regarding the broader and deeper nature of PISA Reading Literacy in comparison to reading also reoccurs in several framework documents:


‘Reading’ is often understood as simply decoding, or reading aloud, whereas the intention of this survey is to measure something broader and deeper. (OECD, 1999: 20; OECD, 2009a: 23; OECD, 2015: 49)

With the above statement reappearing over the course of several years of framework documents, the juxtaposition of reading as ‘simply decoding’ in comparison to the larger construct of Reading Literacy stands over time. What, however, is intended when the documents refer to measuring something broader and deeper? As these statements address the nature of Reading Literacy being wider ranging than reading alone, the definition of Reading Literacy provides further clarification, as shown in Table 6.2.

Table 6.2 PISA Reading Literacy definitions
2000–2006: ‘Reading literacy is understanding, using, and reflecting on written texts, in order to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society.’
2009–2015: ‘Reading literacy is understanding, using, reflecting on and engaging with written texts, in order to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society.’

Following these provided definitions, the PISA Reading Literacy framework documents break down each of the definition’s key components, from understanding through to society, providing insight into what these concepts might mean within the frameworks and, therefore, within the scope of the construct. Understanding, according to the framework documents, ‘is readily connected with “reading comprehension”, a well-accepted element of reading’ (OECD, 2015: 49). The immediate connection between understanding and reading comprehension is, elsewhere in reading research, not as immediate: comprehension involves multiple levels of processes before understanding can be reached, from the linguistic decoding which the framework documents have referenced as basic reading, to the understanding of semantics (word meaning) at the level of word, phrase, sentence and text (Kintsch & Rawson, 2005: 210). Interestingly enough, though, research previously conducted into reading comprehension by Kintsch is also cited as an influence in the development of the construct of reading literacy (OECD, 2009a: 20). Therefore, despite the additional provision of explication into the key concepts of the reading literacy


definition, there are still some simplifications concerning the nature of complex reading processes. Although the caveat has been provided that the definition of reading literacy is attempting to convey to a ‘non-expert’ audience what exactly is being assessed, the framework documents assert that ‘reading literacy includes a wide range of cognitive competencies, from basic decoding, to knowledge of words, grammar and larger linguistic and textual structures and features, to knowledge about the world’ (OECD, 2009a: 23). In comparison, this conceptualization of reading literacy does not vary widely from previous research into reading itself (e.g., Snow, 2002; Kintsch & Rawson, 2005), thereby suggesting that there is some conflation between the two concepts of reading and reading literacy. In further attempting to differentiate reading literacy from reading, the framework documents group understanding, using and reflecting on as a unit, linking together the elements of reading comprehension, applying knowledge from reading, and drawing from individual knowledge or experience together to form a central component of reading literacy as an interactive practice (OECD, 2015: 49). The idea of reading as an interactive process too has been developed and researched, focusing particularly upon the complex relationship between the text and the reader, with acknowledgement of the contextual and situational factors influencing the nature of said relationship (Snow, 2002). The addition of the term engagement to the 2009 PISA Reading Literacy definition similarly reflects an increased interest in ‘affective and behavioural characteristics’ (OECD, 2009a: 24) of the reader that are involved in the reading process, particularly those relating to motivation and the desire to read (and to enjoy reading) not only within the context of school but in life more generally. Significantly, the addition of engagement to the definition also acknowledges ‘the social dimension of reading’ and the potential for ‘diverse and frequent reading practices’ throughout one’s life (OECD, 2009a: 24). While many of the key components of the Reading Literacy definition are present in other reading research, the link made between reading literacy and an individual’s ability to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society are particularly crucial to the novelty of the PISA Reading Literacy construct. The definitions of Reading Literacy in all of the PISA frameworks incorporate a reader’s ability to participate in society (OECD, 1999: 20; 2003: 108; 2006: 46; 2009a: 23; 2012: 61; 2015: 49). In the PISA Reading Literacy framework documents, society carries a distinctive meaning: ‘the term “society” includes economic and political as well as social


and cultural life’ (OECD, 1999: 21). The shift towards featuring the connection between literacy and the economy as a prominent feature of reading literacy is a notable epistemological approach possessed by the OECD. While previous research focused on the social and cultural implications of reading (e.g. Gee, 2004) for the individual and for society, fewer examples discuss the link between reading and the economy or, additionally, between reading, technology and the economy (Leu, Jr., 2000) in juxtaposition to sociocultural elements. Having provided the definition of PISA Reading Literacy, how is this definition expanded upon in order to operationalize the way in which Reading Literacy is to be assessed? More specifically, with the emphasis placed upon the intersection between reading literacy and society, how can the assessment of reading literacy be located in the context of society more generally? The framework documents provide some explanation as to the construct in context, using text-based situations to reflect aspects of society within the assessment of reading literacy.

The construct in context

Of central importance to the assessment of reading literacy is the selection of texts to be assessed (Alderson, 2000; Fulcher & Davidson, 2007) in relation to the defined construct. In describing the different types of texts which can be used for assessment, the PISA Reading Literacy documents introduce the idea of situations. These situations describe the texts selected for assessment as being personal, public, occupational or educational in nature, with the selection of texts intended to be representative across these four situational categories (OECD, 2015). A situation is defined as a variable of text relating to ‘the range of broad contexts or purposes for which reading takes place’ (OECD, 2009a: 25) and relates to the selection of texts to be used within each item or item cluster. The situations in PISA Reading Literacy are acknowledged as being based directly upon those found in the Common European Framework of Reference for Languages (CEFR):

A useful operationalisation of the situation variables is found in the Common European Framework of Reference (CEFR) developed for the Council of Europe (Council of Europe, 1996). Although this framework was originally intended to describe second- and foreign-language learning, in this respect at least it is relevant to mother-tongue language assessment as well. The CEFR situation


categories are: reading for private use; reading for public use; reading for work and reading for education. They have been adapted for PISA to personal, public, occupational and educational contexts, and are described in the paragraphs below. (OECD, 2009a: 25; 2012: 62; 2015: 51)

The adaptation of what are referred to as domains in the CEFR – reading for private use, reading for public use, reading for work and reading for education – into situations in PISA Reading Literacy is a significant one: as stated above, the CEFR domains were created for use in assessing language proficiency amongst second- and foreign-language learners. The application of CEFR domains to PISA creates a hybrid approach to language assessment, using theoretical approaches from second language assessment for the assessment of both native and non-native speakers cross-culturally. In the CEFR, the discussion of situation refers heavily to the importance of context within language use and asserts that ‘language is not a neutral instrument of thought’ (Council of Europe, 2001: 44). Of the four domains found in the CEFR and adapted as PISA Reading Literacy situations – personal, public, occupational and educational – the CEFR additionally highlights the idea that the personal domain is able to impact upon all other domains via the input of the individual and of the individual’s personality.

It should be noted that in many situations more than one domain may be involved. … the personal domain individualises or personalises actions in the other domains. Without ceasing to be social agents, the persons involved situate themselves as individuals; a technical report, a class presentation, a purchase made can – fortunately – enable a ‘personality’ to be expressed other than solely in relation to the professional, educational or public domain of which, in a specific time and place, its language activity forms part. (Council of Europe, 2001: 44–46)

In contrast, the situations as outlined in the PISA Reading Literacy 2009 framework describe the same four situations with additional emphasis on the types of texts and tasks that could represent each situation. In discussing the situations as chosen, there is an additional connection made between situations and the content of texts representing these situations: While content is not a variable that is specifically manipulated in this study, by sampling texts across a variety of situations the intent is to maximize the diversity of content that will be included in the PISA reading literacy survey. (OECD, 2009a: 26)


Therefore, in distributing texts and tasks across the chosen situations there is also the implication that the content, with any social, cultural, economic, political or other cues therein contained, also varies between the selection of tasks and texts in particular. With regards to the development of PISA Reading Literacy, the adaptation of the CEFR domains into situations was a conscious decision made: it all hangs together in the sense that if you define language ability not in knowledge of particular aspects like the past particle or the gerund or whatever, but if you define language ability as being able to use the language in as many circumstances as possible with as much refinement as possible then it is a unidimensional skill. And it doesn’t matter whether you talk about … I can place native speakers on the same scale as foreign-language learners because contrary to the given, uh, hypothesis by many people that native speakers are the target for language learners they are not because native speakers do not know their language completely otherwise we would not have PISA because we are testing native speakers. (Interview with PISA Reading Literacy designer, 2015)

With native speakers being classified in the same manner in which non-native language speakers were classified in the CEFR, some similarities can thus be seen between the nature of PISA Reading Literacy and a proficiency test for second language learners. Further ties can be seen in the focus proficiency tests place upon skills which are not necessarily embedded in a certain curriculum or experience: ‘in proficiency testing, the way you learned the language is supposed to be irrelevant. In other words, proficiency tests are not meant to be tied to any particular curriculum’ (Bailey, 1998: 38). Moreover, PISA Reading Literacy scores are presented in ‘proficiency scales that are interpretable for the purposes of policy’ (OECD, 2015: 58). What are the implications of blurring the lines between native and non-native language assessment? Without a curriculum as referent, PISA Reading Literacy shifts its focus towards contextual situations to assess the reading literacy skills present amongst its test-takers.

Let the items speak for themselves

In order to consider the operationalization of the Reading Literacy construct into selected texts and their relevant questions, it is best to turn to examples of PISA Reading Literacy items as publicly released via the OECD. Using a random
number generator to select from released items, the following includes one print-based item used in the PISA 2000 survey and a sample digital reading item used in the 2009 field trial. Given space constraints here a larger number of examples cannot be presented, and it is crucial to note that these two items are not representative of every PISA Reading Literacy item: further examples can be found on the OECD’s website at www.oecd.org/pisa. Since not all items are available for public consumption, it is, however, difficult to discern the content coverage that might exist throughout all items even upon consulting those which are available. Items are intended to be representative of the four situations discussed above – personal, public, occupational and educational – but, within these situations, the socially and culturally embedded topics included amongst unreleased items are not known. In looking at PISA Reading Literacy items, there is also the fact that without a curricular base there cannot have been, for example, set texts which pupils may have read in advance of the testing situation; instead, all texts are unseen to the reader. From the first cycle of PISA in 2000 in which Reading Literacy was the main domain, items used in the PISA Reading Literacy assessment were paper-based. During the fourth cycle of PISA in 2009 in which Reading Literacy once again returned as the main domain, electronic texts were introduced for assessment. Furthermore, from 2015 onwards computer-based testing has become the dominant mode of assessment for PISA Reading Literacy items. Despite items now being electronic in nature, the paper-based items functioned throughout the majority of PISA’s cycles thus far and, as a result, have been included in example format here.

Print reading example item: The flu

The example seen in Figure 6.1 is taken from the 2009 document produced by the OECD entitled Take the test: sample questions from OECD's PISA assessments. The excerpt below, from one example item (The Flu), presents an information sheet distributed by the human resources department of a hypothetical organization (ACOL), informing its employees as to how they might participate in the 'voluntary immunization process' occurring at the company. The text itself is continuous, with graphics representing the immunization jab and a cartoonish drawing of the flu virus intended to convey the sense of levity a human resources department member was attempting to create in the information sheet.


Figure 6.1 (OECD, 2009b: 19)

The OECD, in creating PISA, not only propelled forth the concept of literacies in the domains of mathematics, reading and science, but also promulgated the importance of cross-domain knowledge. The above PISA task involves elements of the immunization process, diet, health and fitness which may be seen as scientific in nature or, at least, would not be relegated strictly to the domain of reading. PISA’s inclusion of scientific information and other information which goes beyond the literary within their Reading Literacy tasks reinforces the PISA assertion that all three literacies – reading, scientific and mathematics – are intrinsically linked (OECD, 2009a). Equally important concerning the selected text is that the information being examined concerns not only information about the flu and about the immunization process but, specifically, information regarding the immunization process within an employment based setting. Of the four situations in PISA Reading Literacy, the information sheet falls into both the public (i.e., it is available for public consumption) and the occupational situation categories. In relating the occupation situation as a context for PISA Reading Literacy, the framework documents explicitly mention the connection between reading literacy and the economy through the lens of human capital: ‘Reading literacy skills matter not just for individuals, but for economies as a whole … in modern
societies, human capital – the sum of what the individuals in an economy know and can do – may be the most important form of capital’ (OECD, 2009a: 21; OECD, 2012: 61; OECD, 2015: 49). Importantly, in the development of PISA Reading Literacy it was acknowledged that even at the age of 15 or 16 a student may wish to have economically viable knowledge. In the words of one of the PISA Reading Literacy developers: [there is] another issue in the educational world in that all people involved in education, well not all, but generally people involved in education come from circles of higher educated people so they don’t know about all these kids that go into the workforce at 15 years old. They don’t know! They don’t know them. It’s a different world. And that is of course stronger in developing countries but it is also true in the developed countries. (Interview with PISA Reading Literacy designer, 2015)

The above example item speaks to the choice of texts which assess for the reader’s occupational knowledge: although questions may refer to the style of a text or have the reader evaluate the most effective way of protecting against the flu virus, the act of reading comprehension is contextualized within an office setting.

Digital reading example item: Ice cream

From 2015 onwards, the introduction of computer-based items adds a new level of complexity to the nature of reading assessment. Concurrently, however, digital reading assessment also provides the opportunity to assess forms of media not easily captured within paper-and-pencil-based assessment. Although all items assessed in the 2015 PISA cycle and the examples released thus far are fixed or static in nature, the computer-based Reading Literacy items are intended to be dynamic in nature: 'dynamic text is synonymous with hypertext: a text or texts with navigation tools and features that make possible and indeed even require non-sequential reading' (OECD, 2015: 52). The image of a screen seen in Figure 6.2 represents an example of a digital reading text featuring web search results from the search term 'ice cream'. Unlike print-based items which required a test-taker to, for example, select one of four responses for multiple-choice or write in a short answer as a response, the digital reading items ask a test-taker to choose the radio button of the selected response. The example seen in Figure 6.2 is intended to assess the ways in which test-takers use reading literacy in connection with the internet. As provided in the OECD's scoring explanation: 'this question represents another very typical task faced by users of the Internet, that is, evaluating the trustworthiness of the results for a particular purpose' (OECD, 2009a: 242). Since the first cycle of PISA, internet usage amongst 15 to 16 year olds has become, if anything, more ubiquitous; the inclusion of internet search results in the assessment of reading does, again, contribute to the novelty of the reading literacy construct in that test-takers are applying their reading literacy skills towards knowledge building in an internet-based scenario.

Figure 6.2 Digital reading example item 'Ice cream' (OECD, 2009a: 240, 242)


Given the current buzz surrounding society's movement into a post-truth era, with courses being devised to develop students' ability to analyse the veracity of sources, the above example seems relevant to students as they navigate the internet or indeed media more generally. In reference to the PISA Reading Literacy definition, such abilities to evaluate truth amongst media sources do relate back to the principle that reading literacy should allow a student (or non-student) to participate in society effectively – or, in this instance, to navigate society and its media more dexterously.

Conclusion

In PISA's emphasis on the decontextualization of reading items by removing identifying national or curricular ties, a lack of context becomes a context of its own: that of global, generalized skills. This constraint of context is not at all isolated to PISA but, instead, is common to the regime of nomothetic testing – that is, testing as a measure of covariance in performance amongst test-takers, in contrast to an idiographic means of testing which would focus upon the interaction between an individual and the assessment (Haynes, 2000). In terms of defining what this means for the construct of reading or reading literacy, how do test makers then choose the contexts which most accurately depict the global skills required by the majority of 15 to 16 year olds? In discussing previous attempts at assessment in the area of intelligence testing and other cross-cultural psychological batteries, Cole (1996) addresses the issue of culture as a confounding factor:

The simple fact is that we know of no tests that are culture-free, only tests for which we have no good theory of how cultural variations affect performance. (Cole, 1996: 56)

The problem of addressing cross-cultural variation has existed from the time of Binet’s intelligence testing onwards (Cole, 1996), and continues to be a problem for all assessments or assessment systems attempting to test any population with cultural variation amongst its test-takers. However, the choice of domains (as in the CEFR) and now situations in PISA Reading Literacy attempt to identify realms in which the 15 or 16 year old might find herself or himself involved in the practice of reading. Furthermore, given the rapidly evolving advancements occurring in the assessment of reading via, for example, the ongoing use of computers and other technology such as tablets, there is an ever-increasing opportunity to incorporate more complex interactions between text and reader in the assessment of reading
(Pearson, Valencia & Wixson, 2014). There also arises the possibility for more advanced forays into idiographic assessment, should such a path be chosen. In choosing to do so, however, what additional contexts could be used or focused on in the selection of texts for assessment? Would expanding beyond the domains of the personal, the public, the occupational, and/or the educational help to assess reading capacity or hinder the ability to wade through complex interactions arising between text and reader? In second language assessment the use of situations is meant to assure that a test-taker can use the target language proficiently across a range of possible scenarios (Bachman & Palmer, 1996). In PISA Reading Literacy, the use of situations intends to assess native and non-native speakers across a range of possible scenarios not in order to identify their proficiency in an acquired language but to identify their proficiency in language use in society and in life overall – to reiterate, ‘the knowledge and skills tested … are defined not primarily in terms of a common denominator of national school curricula but in terms of what skills are deemed to be essential for future life’ (OECD, 1999: 11). In James Paul Gee’s work on reading, he discusses the reader – or, as it would be in the case of PISA Reading Literacy, the test-taker – as a player in a game: learning to read, or any learning for that matter, is not all about skills. It is about learning the right moves in embodied interactions in the real world or virtual worlds, moves that get one recognized as “playing the game”: that is, enacting the right sort of identity for a given situation. (Gee, 2004: 48–49)

In relation to PISA Reading Literacy, then, the act of reading or indeed the skills developed from reading are not all that is in play: the chosen situations attempt to elicit the test-taker’s ability to play the game – but what game? The game of assessment or the game of life? PISA Reading Literacy has created an ‘ambitious novel’ construct in the form of Reading Literacy, forcing us all to question whether the assessment of reading and, now, of reading literacy can possibly reflect the multitude of abilities required for a student to take on life – or, perhaps, the game of life.

References AERA, APA & NCME (1999), Standards for educational and psychological testing, Washington, DC: AERA. Alderson, J. C. (2000), Assessing reading, Cambridge: Cambridge University Press. Bachman, L. F. & Palmer, A. S. (1996), Language testing in practice, Oxford: Oxford University Press.


Bailey, K. M. (1998), Learning about language assessment: Dilemmas, decisions and directions, London: Heinle and Heinle. Cole, Michael (1996), Cultural psychology: A once and future discipline, Cambridge, MA: Belknap Press of Harvard University Press. Council of Europe (1996), Modern languages: Learning, teaching, assessment. A Common European Framework of Reference, Strasbourg: CC LANG (95) 5 Rev. IV. Council of Europe (2001), Common European framework of reference for languages: Learning, teaching, assessment, Cambridge: Cambridge University Press. Cronbach, L. J. & Meehl, P. E. (1955), ‘Construct validity in psychological testing’, Psychological Bulletin, 52 (4): 281–302. Fulcher, G. & Davidson, F. (2007), Language testing and assessment: An advanced resource book, London: Routledge. Gee, J. P. (2004), ‘Reading as situated language: A sociocognitive perspective’, in R. B. Ruddell & N. J. Unrau (eds), Theoretical models and processes of reading, fifth edition. Newark, DE: International Reading Association. Haynes, S. N. (2000), ‘Idiographic and Nomothetic Assessment’, in Haynes, S. N. & O’Brien, W. H. (2000), Principles and practice of behavioral assessment, New York: Kluwer Academic Publishers. Kintsch, W. & Rawson, K. A. (2005), ‘Comprehension’, in M. J. Snowling & C. Hulme (eds), The science of reading: A handbook, Oxford: Blackwell Publishing. Kline, P. (2000), The handbook of psychological testing, London: Routledge. Leu, D. J., Jr, (2000), ‘Literacy and technology: deictic consequences for literacy education in an information age,’ in M. L. Kamil, P. B. Mosenthal, P. D. Pearson & R. Barr (eds), Handbook of reading research, Mahwah, NJ: Erlbaum. OECD (1999), Measuring student knowledge and skills: A new framework for assessment, Paris: OECD. OECD (2003), The PISA 2003 assessment framework: Mathematics, reading, science and problem solving knowledge and skills, Paris: OECD. OECD (2006), Assessing scientific, reading and mathematical literacy: A framework for PISA 2006, Paris: OECD. OECD (2009a), PISA 2009 assessment framework: Key competencies in reading, mathematics and science, Paris: OECD. OECD (2009b), Take the test: Sample questions from OECD’s PISA assessment, Paris: OECD. OECD (2012), PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy, Paris: OECD. OECD (2015), PISA 2015 assessment and analytical framework: Science, reading, mathematic and financial literacy, Paris: OECD. Pearson, P. D., Valencia, S. W., & Wixson, K. (2014), ‘Complicating the world of reading assessment: Toward better assessments for better teaching’, Theory into Practice, 53 (3): 236–246. Snow, C. E. (2002), Reading for understanding: Toward an R&D program in reading comprehension (Rand Reading Study Group), Rand Corporation.

7

Self-reported effort and motivation in the PISA test

Hanna Eklöf and Therese N. Hopfenbeck

Do students care about doing their best on a low-stakes test like the PISA test? How are the PISA achievement test and the PISA student questionnaire perceived by students? These questions are part of a larger research programme on student motivation in PISA, and the focus of what will be presented and discussed in the following, with a particular focus on the Swedish and Norwegian PISA context, and with the key purpose of drawing attention to perceptions and reactions of the hub inside the assessment machine: the students taking the tests. The thoughts and behaviours of students in the assessment situation are seldom recognized in the discussions surrounding large-scale international assessments like PISA and we find this noteworthy, as it is the students who are actually completing the tests and the questionnaires, generating the scores and thereby providing policy makers, media, researchers and other stake holders with information. Indeed, student literacy in different core subjects is all that is discussed in the context of international comparative studies, and comparisons of student literacy between countries and over time sometimes give rise to heated debate about school quality, in some cases even leading to educational reform. However, the students’ voices and how they think and feel about participating in tests like PISA are often overlooked, even though this could be regarded an important factor to consider, from a test validity perspective (i.e. are the test results valid indicators of student knowledge or are there other student variables affecting performance that might be important to take into account when we interpret the findings?), a test-taking psychology perspective (how do students seem to react in the test situation?) as well as a
fairness perspective (is it correct to overlook the thoughts of the individuals actually generating the test scores?). In the following we will, based on findings from empirical studies performed in the PISA context, discuss and reflect on certain aspects of test-taking: self-reported effort and motivation in the test situation, and students' perceptions of the test and the questionnaire used in PISA. In this chapter, we do not set out to investigate to what extent test-taking effort and motivation relate to or can explain test performance and differences in test performance (for studies investigating this issue, see Eklöf & Knekta, 2017; Hopfenbeck & Kjaernsli, 2016; Eklöf, Reis Costa, & Knekta, 2017), or how students in more general terms experience PISA, but rather to provide a glimpse of how students, and Swedish and Norwegian students in particular, seem to perceive PISA in terms of motivation, as measured by self-report questionnaire items and student interviews. Thereby, we aim to shine a light on students as active participants in the assessment situation (Harris & Brown, 2016).

The relevance of acknowledging the students and their test-taking motivation in PISA

International large-scale comparative studies of student knowledge and proficiency have had a significant influence on discussions of educational quality and educational policy around the world, and they have grown in impact and scope during the last decades (see Cresswell, 2016; Rutkowski & Rutkowski, 2016; Hopfenbeck et al., 2018). Given the attention these studies get and the impact they have in the educational debate, it is important that they provide high-quality and trustworthy results; that they provide fair and valid assessments. Studies like PISA are technically very sophisticated, from the sampling of students and assembly of test and questionnaire materials to the calculation of scores and reporting of findings (Lüdtke et al., 2007; OECD, 2009). Fairness and validity are, however, not only technical or psychometric issues, but also include ethical, political, and social perspectives (Stobart, 2005). The Standards for Educational and Psychological Testing acknowledges that technical and social perspectives should be considered when analysing and interpreting the outcomes of testing (AERA, APA, & NCME, 2014). The Standards also state that test-takers' level of motivation should be considered when interpreting test results, especially when scores are not reported to test-takers or otherwise have no consequences for the test-takers (so-called low-stakes tests, AERA, APA, & NCME, 2014). PISA is an example of such a test, and therefore student test-taking motivation may be a construct relevant to consider. Test-taking motivation is a motivational state associated with a given test or task, and can be defined as 'the willingness to work on test items and to invest effort and persistence in this undertaking' (Baumert & Demmrich, 2001: 441). Completing the PISA test is a lengthy process and is followed by an extensive questionnaire, and so for some students this situation may be quite demanding, and a fair amount of effort and persistence is required for a good performance. However, the result has no consequences for the individual students, their teachers or their school. Furthermore, the students get no feedback on their performance and individual results are never revealed to anyone. With this in mind, concerns are sometimes raised that not all students are motivated to do their best, and that they as a result might underperform on the test, leading to invalid conclusions regarding their proficiency level (Holliday & Holliday, 2003; Sjøberg, 2014). Still, even if PISA is a low-stakes test in terms of personal consequences, it is not unlikely that students may still put value on a good performance. The sampled students are representing their country in an international study and this could contribute to creating motivation (Sellar et al., in this volume; Baumert & Demmrich, 2001). We should also not underestimate students' sense of 'academic citizenship' (Wise & Smith, 2016): that they try their best because they are told to do so, because they always do their best, because they want to see what they know and can do, or in order to help with the study. In a given test situation, different students may react and behave in different ways, but as long as we don't ask the students themselves, discussions about students' perceptions of PISA and their motivation to do their best in PISA become rather speculative.

Previous research on student perceptions of tests and test-taking motivation

Studies in different assessment contexts have shown that test-taking motivation as well as test anxiety and test-taking strategies are important to consider when interpreting assessment results (Pintrich & DeGroot, 1990; Nevo & Jäger, 1993; Wise & DeMars, 2005; Hong, Sas, & Sas, 2006). Findings from empirical studies have suggested that test perceptions vary with test-takers' age, gender, and experience with standardized testing and grading (Sundre & Kitsantas, 2004; DeMars, Bashkov & Socha, 2013; Eklöf, Japelj & Gronmo, 2014). Findings also rather consistently show that there is a relationship between motivation in the test situation and test performance, but that this relationship tends to be small to moderate in strength (cf. Eklöf & Knekta, 2017; Wise & DeMars, 2005; Abdelfattah, 2010). Research has also shown that student beliefs about their own ability to perform well and the value they attach to the task affect their self-reported test-taking effort (Cole, Bergin, & Whittaker, 2008; Knekta & Eklöf, 2015; Penk, Pöhlmann, & Roppelt, 2014). There is also evidence that the purpose the students assign to the test affects their efforts as well as their performance (Brown, 2011). Other social influences, such as the parents' attitudes towards tests, seem to affect student effort (Zilberberg et al., 2014). Research in the context of large-scale international studies like PISA and TIMSS has shown that student test-taking motivation is lower in these contexts compared with tests that are used for grading, that it varies between individuals and that the correlation between test-taking motivation and test performance is in the same range as for socio-economic background and gender (Butler & Adams, 2007; Eklöf & Knekta, 2017; Eklöf, Japelj, & Gronmo, 2014). Butler and Adams (2007) concluded that the variation in test-taking effort between countries in PISA (as measured by the effort thermometer in 2003) was rather small, while Boe, May, and Borouch (2002) came to another conclusion and argued that a large part of the variation in performance in TIMSS could be explained by a between-country variation in student persistence. Even if students perceive tests with different stakes differently, studies have also found that students in general tend to report a rather high level of effort and motivation even if the test is low-stakes (Eklöf, Hopfenbeck, & Kjaernsli, 2012; Knekta, 2017), hence claims that students underperform in low-stakes tests due to low motivation might be unwarranted (see Hopfenbeck & Kjaernsli, 2016).

Effort and motivation in PISA – what do the students tell us?

In Sweden and Norway, we have conducted research on students' self-reported effort and motivation in a number of PISA cycles, starting with PISA 2006 in Norway. After the PISA 2006 study, students were interviewed about their experience of the test and their motivation to do their best (Hopfenbeck, 2009). In the next two cycles, interview and survey data from PISA 2009 and 2012 were collected in Norway (see Hopfenbeck & Kjaernsli, 2016). A six-item test-taking motivation scale was developed and included as part of the items labelled 'national options'. In Sweden, student test-taking motivation has been assessed with the test-taking motivation scale in PISA 2012. Sweden and Norway are neighbouring Scandinavian countries with similar cultures and similar educational systems. Both Norway and Sweden have experienced a declining performance in PISA; in both countries there has been an intense political and media interest in the results from PISA, and in both countries large educational reforms have been implemented during the last decade, with increased national assessments within the educational system. Also, in both Sweden and Norway, there have been claims that students do not take the PISA test seriously (Dagens Nyheter, 2014; Sjøberg, 2014). Hence, the Swedish/Norwegian case may be an interesting case to study with respect to student motivation to spend effort on the test. In PISA internationally, student test-taking effort has also been assessed through an 'effort thermometer' as part of the international test battery (used in 2003, 2006 and 2012). Findings from this measure are briefly reviewed below.

The effort thermometer

Before turning to a discussion of findings from the test-taking motivation scale in Sweden and Norway, and the interviews in Norway, we will present descriptive findings from the PISA 2012 effort thermometer in order to give an overview of students' reported test-taking effort internationally in PISA. In the PISA effort thermometer, students are asked two questions, each of which is to be responded to on a 10-point scale, a 'thermometer' (Figure 7.1). First, students are asked to mark how much effort from 1 to 10 they spent on the PISA test compared to a situation in which they would have spent maximum effort (see Figure 7.1). Then, students are asked to mark how much effort they would have spent on the PISA test had the test result counted towards their grade. From these two ratings, a 'relative effort' rating can be calculated, i.e. the difference between reported effort on the PISA test and the amount of effort the student would have spent had the test result counted towards their grade.
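To make the arithmetic concrete, the following minimal sketch (in Python, using pandas) computes a relative-effort score from the two thermometer ratings and aggregates it by country; the data frame, column names and values are hypothetical and purely illustrative, not taken from the PISA instruments or data files.

```python
import pandas as pd

# Hypothetical student-level thermometer ratings on the 1-10 scale; illustrative values only.
ratings = pd.DataFrame({
    "country": ["Sweden", "Sweden", "Japan", "Kazakhstan"],
    "effort_pisa": [7, 8, 6, 9],         # effort reported for the PISA test
    "effort_if_graded": [10, 9, 8, 10],  # effort had the result counted towards the grade
})

# Relative effort: how much more effort students say they would have invested
# had the test counted towards their school mark (positive differences, as in Figure 7.2).
ratings["relative_effort"] = ratings["effort_if_graded"] - ratings["effort_pisa"]

# Country-level averages of the kind plotted in Figure 7.2.
print(ratings.groupby("country")["relative_effort"].mean())
```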


Figure 7.1 The effort thermometer as it appears in the PISA test booklet

In all PISA administrations where the effort thermometer has been used, the most common rating of PISA effort (2nd column in Figure 7.1) internationally is an eight on the ten-point scale, and most countries have a mean score on the effort thermometer that falls between seven and eight. On average then, students participating in PISA report a fair level of effort on this low-stakes test and on average, there is relatively little variation on the country level. Still, there are some differences between countries. For example, in PISA 2012, the country with the lowest level of reported effort, on average, was Japan, with a mean score of 6.29 on the effort thermometer. The country with the highest average level of reported effort was Kazakhstan, with a mean score of 8.96. From these results, however, it cannot simply be concluded that students in Japan spent less effort on the PISA test than students in Kazakhstan did, as students’ responses to items like the ones in the effort thermometer could be conditioned on social/cultural environment, literacy level and type of educational system (Butler, 2008). The most common rating of how much effort the student would have spent had the PISA test counted towards the student’s mark (3rd column in Figure 7.1)
has been a ten in all administrations of the thermometer, and most countries have had a mean score on this question between 9 and 10. When looking at the relative effort variable (the difference between reported effort in PISA and estimated effort had the PISA test counted towards the school marks, a culturally less biased variable according to Butler, 2008), in all countries students claim that they would have invested more effort in the test had it been important for the school marks. In some countries, this difference is small while in others, it is larger (see Figure 7.2) but in all countries, students seem to make a difference between low-stakes tests like PISA and tests that are used for grading. This is an unsurprising result. Students are not machines, and it could be considered a rather adaptive behaviour to invest ‘enough’, but not too much, energy in a test that does not count. Sweden was the country with the largest difference score in PISA 2012 (followed by a number of other industrialized countries mainly in Europe and North America). Vietnam was the country with the smallest difference (together with a number of countries mainly from Southeast Asia and South America). Thus, students in certain countries (cultural milieus) seem to make more of a difference between the PISA test and a test that counts towards the school mark
compared to students in other cultural and geographical areas. This may be related to students' views on assessment generally and low- vs high-stakes tests specifically, but also to the way students tend to respond to items like the effort items, or to how students understand the questions in the thermometer. Even if there are differences, most students in most countries seem to invest effort in the PISA test, at least according to their self-report, which suggests that students take the PISA test seriously. It should be noted that this interpretation is based on answers to one self-report item; there is still a lot we do not know about student experiences and behaviours in assessment situations like PISA (not least when it comes to cross-cultural variability), and many student-related variables we need to learn more about before we can arrive at any definite conclusions. Also, although previous studies have supported the reliability and validity of self-report measures of test-taking motivation within cultures (cf. Thelk et al., 2009), cross-country comparisons of self-reported attitudes in ILSA contexts have for example rather consistently revealed an 'attitude-achievement' paradox, with positive within-country relationships and negative between-country relationships between for example motivation and performance (Lu & Bolt, 2015), indicating the need for caution when different countries and cultures are being compared.

Figure 7.2 Relative effort: the difference (country average) between reported effort in PISA 2012 and effort had the test score counted towards the grade. (Countries are ordered from the largest difference, Sweden, to the smallest, Viet Nam; the vertical axis shows the difference score, 0.00–3.00.)

Student responses to the test-taking motivation scale in Sweden and Norway

As noted, in addition to the effort thermometer, student test-taking motivation in PISA has been measured with a six-item test-taking motivation scale in Norway in PISA 2009 and 2012, and in Sweden in PISA 2012 (see Skolverket, 2015). In Norway, students have also been interviewed in connection with PISA 2006, 2009 and 2012. In all assessments, a majority of students in Norway as well as in Sweden have either strongly agreed or agreed with the statements that they did their best in PISA, that they felt motivated and that they persisted and tried to answer difficult items (see Table 7.1). A minority of the students (with the exception of one important item in Norway in 2009) agreed that a good performance felt important to them, or that it meant a lot for the student to do well on the test. Thus, many students claim to do their best even if they do not perceive the test as very important. Again, that fewer students perceive the test as important is reasonable; we cannot expect of students that they will perceive a low-stakes test as highly important, and it also does not seem fair to expect that all students would report a very high level of effort on these types of tests. A situation where all students report a very high level of effort is unlikely also in high-stakes test contexts, as shown for example in the context of the Swedish test used for admission to higher education (the SweSAT, Stenlund, Eklöf, & Lyrén, 2017). Different individuals approach different tests with different mindsets and with different goals. When looking at the 2012 results, the Swedish students are somewhat more modest in their ratings of effort compared to the Norwegian students. When comparing the Norwegian results from 2009 to 2012, a more negative attitude is visible in 2012. This could possibly have to do with the main subject in the different PISA cycles. In 2009, the main subject was reading while in 2012, it was mathematics.

Table 7.1 Percentages of students who either strongly agreed or agreed with the statements in the test-taking motivation questionnaire, PISA 2009 and PISA 2012

Item                                                             Norway 2009   Norway 2012   Sweden 2012
I put in a good effort throughout the PISA test                       86            81            74
I made my best effort on the PISA test                                80            69            66
I was motivated to do my best in the PISA test                        75            69            63
While taking the PISA test, I was able to persist even if
  some of the tasks were difficult                                    70            60            62
Doing well in the PISA test was important to me                       54            47            46
Doing well in PISA meant a lot to me                                  43            36            37
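The percentages in Table 7.1 are simply the share of respondents who chose 'agree' or 'strongly agree' for each statement. A minimal sketch of that aggregation is shown below, assuming a hypothetical data set with invented country labels, column names and response values; it is illustrative only and does not reproduce the actual questionnaire coding.

```python
import pandas as pd

# Hypothetical responses to one scale item, on a four-point Likert scale.
responses = pd.DataFrame({
    "country": ["Norway", "Norway", "Norway", "Sweden", "Sweden"],
    "did_my_best": ["strongly agree", "agree", "disagree",
                    "agree", "strongly disagree"],
})

# Percentage of students per country who either agreed or strongly agreed,
# which is how the figures in Table 7.1 are expressed.
agreed = responses["did_my_best"].isin(["agree", "strongly agree"])
pct_agree = agreed.groupby(responses["country"]).mean() * 100
print(pct_agree.round(1))
```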

Student voices: Interviews with Norwegian students participating in PISA 2006

The first interviews of participating students in PISA Norway were conducted in five different schools with twenty-two participants in April and May 2006. Students were interviewed in their school after they had taken the test, using a semi-structured interview guide. Students were asked how
the school had prepared them for the PISA test, their motivation for doing their best and experience of sitting the PISA test. Analysis of the transcribed interview data showed that sixteen of the twenty-two students explained that it was an international test and half of the students (eleven of the twenty-two) explained the teacher had said that they did not need to prepare themselves in any particular way. A typical answer would be: ‘The teacher told us to do what we could and not study anything by heart before the test. I have not read anything.’ Whether they had to prepare themselves before the PISA test was probably what the students were most concerned about, given that these students are in their final lower secondary school year. This is because Norwegian students in 10th grade are preparing for their exams and finishing projects the last year. Similar reports were obtained from students in all schools: the teacher (who also was the test-administrator in all the schools) had given them information about the test, but told them that this was not a test they needed to prepare for. In one school, students had been provided with example items from previous PISA studies for support. Seven of the students explained that the teacher had encouraged them to do their best, or stressed the importance of the test, by uttering such as: ‘Important study’, ‘try to do it as well as other schools’, ‘an honour to participate’, ‘almost like an exam’, ‘you have to take it seriously’, ‘it is important that we do our best’. The only statement from the interviews indicating that the teachers had been influencing the students in any particular way, was given by a boy who suggested his teacher said it was ‘an honour for me to participate’. Whether students worked harder due to the teachers’ statements, we cannot know, but it shows that overall, teachers acting as test-administrators for the PISA test take this job very seriously. A similar impression was obtained from a previous interview study in the Swedish TIMSS 2003 context: the teachers had tried to encourage the students to do their best (Eklöf, 2006). The interviews further revealed that students may react adversely to the questionnaire rather than to the achievement test. Among the questions which disturbed students the most, according to themselves, were the questions about their parents’ background. In the student questionnaire, in 2000, 2003 and 2006, students are asked about their parents’ occupation, and the highest level of education their parents had. The questions were asked separately, first about the mother, then about the father, and covering approximately three and a half pages of the questionnaire. Six of the twenty-two students commented upon these questions: ‘There were a lot of strange questions. … What my mother and father are working with, and possessions which we have in our house.’


Another student, who commented on the questions about the parents, had recently experienced the loss of his mother only months before sitting the PISA test. He said that he had answered the questions about his mother, but that he did not know what kind of education she had when she was a nurse. He continued to say that he was ok with the questions, because he said: 'It is not that many students who loses their parents.' The statement indicates that he had an understanding of the need for the questionnaire to ask about family background. Other students reacted more strongly: 'I reacted to the questions. I thought they were personal. I was kind of surprised, not negatively, not that, but they were concerned about what your parents were doing, and I did not understand what that had to do with the issue, so I only answered them … But I did not understand this about the family.' One of the girls explained it as follows: 'I think there were quite a few unpleasant questions, about parents and such. I do not know how much that has to do with me … There are things that are personal, and there are things that are public, and have nothing to do with the school.' One of the boys said: 'I must admit that I did not particularly like it, because I do not understand what it has to do with the education of my mother and father', while one of the girls commented that she did not like the questions about the parents, because 'it is not everybody who is proud of their parents, and that is why it can be unpleasant for them'. If the questionnaire contains items that students perceive as offensive, uncomfortable or 'strange', this may also affect their motivation to spend effort on completing the questionnaire, which in turn may lead to invalid interpretation of the significance of different background variables.

What can we learn from acknowledging the student perspective? Our findings from analyses of the PISA effort thermometer, the test-taking motivation scale over different PISA cycles and interview data suggest that many students seem to be motivated and put effort into the PISA test, despite the low stakes of the test. It does not seem like students in general simply dismiss studies like PISA. It is true that many students may not perceive PISA as a particularly important test (and in terms of the consequences of the results for the participating students, it is not), but many of the students who report that the test is not important still agree that they spent effort on the test and did their best.


Even if many students report effort and motivation, not all do, and measures could be taken to present the study and the test context in a way that helps the students sense the value of a good performance, without raising the stakes of the test (see Sellar et al., this volume, for different strategies in different countries for preparing the students). Creating a positive climate around the PISA test and other ILSA tests is not a responsibility of the students, but of the authorities, teachers and school leaders. The interviews performed in PISA Norway suggest that teachers have tried to motivate the students, and it is important that all test proctors in all countries explain to the students the purpose of the test and the importance of students giving a good effort. Lack of test-taking motivation is sometimes put forward as a possible threat to the validity of results in low-stakes tests like PISA and TIMSS. Our interpretation is that this threat does not seem to be very severe when looking at the aggregated level, but further small-scale as well as large-scale studies focusing on the student perspective in these studies are needed, and could contribute with important knowledge on these issues. When students are acknowledged as test-takers in PISA, the focus tend to be on their behaviour on the achievement tests, effects of fatigue, how students understand achievement test items (Serder, 2015), alignment with national curricula etc. These are all relevant issues, but our findings from in particular the interviews also suggest that students may be concerned with different aspects of the tests, such as the questionnaire, in which there were items which some students found to be ‘too personal’. In Norway, when the students were interviewed about their experience of sitting the PISA test, the students did not seem to worry much about the tasks in the test or the result as the test was not graded, but they did question the questionnaire and why they had to be asked about their personal life, parents and home belongings. Findings like these are important to consider, not least in light of the fact that PISA has an interest in collecting more information on ‘personal issues’ from the participating students, such as the item on life satisfaction used in PISA 2015 (OECD, 2017). Overall, it is our view that we need to pay attention to the students and their voices in assessment situations and that by doing this, we can learn more about the validity of the tests we are using, and more about the students as active co-constructors of the assessment environment. Not asking students about their views of the tests they are taking can lead to misinterpretation of results as we may miss information about non-cognitive variables affecting student performance, and can also be seen as somewhat disrespectful. There is a student behind every test score and questionnaire index, and it is the students that need to cope with the test situation and to perform at their best level in this situation.


Not only can student reactions and behaviour affect test and questionnaire results, the questionnaires and the tests can also affect the students and their views of learning, performance, education and themselves in this context, with short-term as well as long-term consequences. Although a growing body of studies have stressed the importance of including students’ voices in key areas of assessment debate (Elwood, 2012; Smyth & Banks, 2012; Elwood & Baird, 2013; Murphy et al., 2013), we argue that assessment is still an area where students are basically acted upon – not included as agents. We agree with Elwood et al. (2017) who write that neglecting students’ views on assessment means that we are missing out important and valuable insight into how assessment problems play out in reality for test-takers. With the increased use of ILSA studies such as PISA, we would encourage taking students’ perspective into consideration, both when developing the tests, and when interpreting the results.

References Abdelfattah, F. (2010), ‘The relationship between motivation and achievement in lowstakes examinations’, Social Behavior and Personality, 38 (2): 159–168. American Educational Research Association [AERA], American Psychological Association [APA], & National Council for Measurement in Education [NCME] (2014), Standards for educational and psychological testing, Washington, DC: American Educational Research Association. Baumert, J. & Demmrich, A. (2001), ‘Test motivation in the assessment of student skills: The effects of incentives on motivation and performance’, European Journal of Psychology of Education, 16 (3): 441–462. Boe, E. E., May, H., & Borouch, R. F. (2002), Student task persistence in the Third International Mathematics and Science Study: A major source of achievement differences at the national, classroom and student levels. (Report No. CRESP-RR2002-TIMSS1). Available online: http://files.eric.ed.gov/fulltext/ED478493.pdf Brown, G. T. L. (2011), ‘Self-regulation of assessment beliefs and attitudes: A review of the Students’ Conceptions of Assessment inventory’, Educational Psychology, 31 (6): 731–748. Butler, J. C. (2008), Interest and effort in large-scale assessment: The influence of student motivational variables on the validity of reading achievement outcomes. PhD thesis, University of Melbourne, Faculty of Education. Butler, J. & Adams, R. J. (2007), ‘The impact of differential investment of student effort on the outcomes of international studies’, Journal of Applied Measurement, 8 (3): 279–304.


Cole, J. S., Bergin, D., & Whittaker, T. (2008), ‘Predicting student achievement for lowstakes tests with effort and task value’, Contemporary Educational Psychology, 33 (4): 609–624. Cresswell, J. (2016), System level assessment and education policy. Available online: http://research.acer.edu.au/assessgems/10 DeMars C. E., Bashkov, B. M., & Socha A. B. (2013), ‘The role of gender in test-taking motivation under low-stakes conditions’, Research & Practice in Assessment, 8: 69–82. Dagens Nyheter (14 June 2014), Just how little do students care about the PISA test. Available online: www.dn.se/nyheter/just-how-little-do-students-care-about-thepisa-test/ Eklöf, H. (2006), Motivational beliefs in the TIMSS 2003 context: Theory, measurement and relation to test performance. PhD Dissertation, Umeå University, Umeå. Eklöf, H. & Knekta, E. (2017), ‘Using large-scale educational data to test motivation theories: A synthesis of findings from Swedish studies on test-taking motivation’, International Journal of Quantitative Research in Education, 4: 52–71. Eklöf, H., Hopfenbeck, T. N., & Kjaernsli, M. (2012), ‘Hva vet vi om testmotivasjon i Sverige och Norge?’ [What do we know about test-taking motivation in Sweden and Norway?], in T. N. Hopfenbeck, M. Kjaernsli, & R. V. Olsen (eds), Kvalitet i norsk skole, Internasjonale og nasjonale undersøkelser av laeringsutbytte og undervisning, Oslo: Universitetsförlaget. Eklöf, H., Japelj, B., & Grønmo, L. S. (2014), ‘A cross-national comparison of reported effort and mathematics performance in TIMSS Advanced 2008’, Applied Measurement in Education, 27: 31–45. Eklöf, H., Reis Costa, D., & Knekta, E. (2017), ‘Changes in self-reported test-taking motivation in relation to changes in PISA mathematics performance. Findings from PISA 2012 and PISA 2015 in Sweden’, Submitted manuscript. Elwood, J. (2012), ‘Quali cations, examinations and assessment: Views and perspectives of students in the 14–19 phase on policy and practice’, Cambridge Journal of Education, 42 (4): 497–512. Elwood, J. & Baird, J.-A.(2013), ‘Students: researching voice, aspirations and perspectives in the context of educational policy change in the 14–19 phase’, London Review of Education, 11 (2): 91–96. Elwood, J., Hopfenbeck, T., & Baird, J.-A. (2017), ‘Predictability in high-stakes examinations: Students’ perspectives on a perennial assessment dilemma’, Research Papers in Education, 32 (1): 1–17. Harris, L. R. & Brown, G. T. L. (2016), ‘The human and social experience of assessment: Valuing the person and content’, in G. T. L. Brown & L. R. Harris (eds), Handbook of human and social conditions in assessment, 1–17, London: Routledge. Holliday, W. G. & Holliday, B. W. (2003), ‘Why using international comparative math and science achievement data from TIMSS is not helpful’, The Educational Forum, 67: 250–257. Hong, E., Sas, M., & Sas, J. C. (2006), ‘Test-taking strategies of high and low mathematics achievers’, Journal of Educational Research, 99: 144–155.


Hopfenbeck, T. N. (2009), Learning about Students´Learning Strategies: An empirical and theoretical investigation of self-regulation and learning strategy questionnaires in PISA, PhD, University of Oslo. Hopfenbeck, T. N. & Kjaernsli, M. (2016), ‘Students’ test motivation in PISA: The case of Norway’, The Curriculum Journal, 27 (3): 406–422. Hopfenbeck, T. N., Lenkeit, J., El Masri, Y., Cantrell, K., Ryan, J. & Baird, J.A. (2018), ‘Lessons learned from PISA: A systematic review of peer- reviewed articles on the programme for international student assessment’, Scandinavian Journal of Educational Research, 62 (3): 333–353. Knekta, E. (2017), Motivational aspects of test-taking: Measuring test-taking motivation in Swedish National Test contexts. PhD Dissertation, Umeå University, Umeå. Knekta, E. & Eklöf, H. (2015), ‘Modeling the test-taking motivation construct through investigation of psychometric properties on an expectancy-value based questionnaire’, Journal of Psychoeducational Assessment, 33: 662–673. Lu, Y. & Bolt, D. M. (2015), ‘Examining the attitude-achievement paradox in PISA using a multilevel multidimensional IRT model for extreme response style’, LargeScale Assessments in Education, 3 (2): doi:10.1186/s40536-015-0012-0 Lüdtke, O., Robitzsch, A., Trautwein, U., Kreuter, F., & Ihme Marten, J. (2007), ‘Are there test administrator effects in large-scale educational assessments?’, Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 3 (4): 149–159. Murphy, C., Lunday, L., Emerson, L., & Kerr, K. (2013), ‘Children’s perceptions of primary science assessment in England and Wales’, British Educational Research Journal, 39 (3): 585–606. Nevo, B., & Jäger, R. S. (1993), Educational and psychological testing: The test taker’s outlook, Stuttgart: Hogrefe & Huber Publishers. OECD (2009), PISA 2006 technical report, Paris: OECD. OECD (2017), PISA 2015 Results (Volume III). Students’ well-being, Paris: OECD Publishing. Penk, C., Pöhlmann, C., & Roppelt, A. (2014), ‘The role of test-taking motivation for students’ performance in low-stakes assessments: An investigation of school-trackspecific differences’, Large-scale Assessments in Education, 2 (1): 1–17. Pintrich, P. R. & DeGroot, E. V. (1990), ‘Motivational and self-regulated learning components of classroom academic performance’, Journal of Educational Psychology, 82 (1): 33–40. Rutkowski, L. & Rutkowski, D. (2016), ‘A call for a more measured approach to reporting and interpreting PISA results’, Educational Researcher, 45 (4): 252–257. Serder, M. (2015), Möten med PISA. Kunskapsmätning som samspel mellan elever och provuppgifter i och om naturvetenskap. [Encounters with PISA. Knowledge testing as interaction beween students and test items in and about natural science]. PhD Dissertation, Malmö Studies in Educational Sciences, No75. Sjøberg, S. (2007), ‘PISA and real life challenges: Mission impossible?’, in S. Hopman (ed.), PISA according to PISA. Does PISA keep what it promises?, Wien: LIT Verlag.


Sjøberg, S. (2014), ‘Pisa-syndromet. Hvordan norsk skolepolitikk blir styrt av OECD. [The PISA syndrome. How Norwegian education policy is steered by OECD]’ , Nytt Norsk Tidsskrift, 1, 30–43. Skolverket (Swedish National Agency for Education) (2015), To respond or not to respond: The motivation of Swedish students in taking the PISA test, Stockholm: Skolverket. Smyth, E. & Banks, J. (2012), ‘High stakes testing and student perspectives on teaching and learning in the republic of Ireland’, Educational Assessment, Evaluation and Accountability, 24: 283–306. Stenlund, T., Eklöf, H., & Lyrén, P. E. (2017), ‘Group differences in test-taking behaviour: An example from a high-stakes testing program’, Assessment in Education: Principles, Policy & Practice, 24 (1): 4–20. Stobart, G. (2005), ‘Fairness in multicultural assessment systems’, Assessment in Education, 12 (3): 275–287. Sundre, D. L. & Kitsantas, A. (2004), ‘An exploration of the psychology of the examinee: Can examinee self-regulation and test-taking motivation predict consequential and non-consequential test performance?’, Contemporary Educational Psychology, 29(1): 6–26. Thelk, A. D., Sundre, D. L., Horst, S. J., & Finney, S. J. (2009), ‘Motivation matters: Using the student opinion scale to make valid inferences about student performance’ , The Journal of General Education, 58 (3): 129–151. Wise, S.L. & DeMars, C. E. (2005), ‘Low examinee effort in low-stakes assessment: Problems and potential solutions’ , Educational Assessment, 10 (1): 1–17. Wise, S. L. & Smith, L. F. (2016), ‘The validity of assessment when students don’t give good effort’, in G. T. L. Brown & L. R. Harris (eds), Handbook of human and social conditions in assessment, 204–220, London: Routledge. Zilberberg, A., Finney, S. J., Marsh, K. R., & Anderson, R. D. (2014), ‘The role of student’s attitudes and test-taking motivation on the validity of college institutional accountability tests: A path analytic model’, International Journal of Testing, 14 (4): 360–384.

8

Student preparation for large-scale assessments: a comparative analysis

Sam Sellar, Bob Lingard, David Rutkowski and Keita Takayama

Introduction The Organisation for Economic Cooperation and Development’s (OECD) Programme for International Student Assessment (PISA) has become arguably the most influential and successful international large-scale assessment (ILSA) and represents a major global investment in test development, administration, data collection and analysis. PISA assesses reading, mathematical and scientific literacy and is conducted with 15-year-olds in approximately seventy countries every three years. This chapter examines different approaches to preparing students to sit for PISA. We compare four national cases at different points in time: (1) Prince Edward Island, Canada in PISA 2003; Japan following PISA 2003; Scotland in PISA 2012; and Norway in PISA 2015. Approaches to preparation across the four cases span a continuum from close adherence to the test administration guidelines to structural reforms that increase alignment between curricula and what PISA assesses. We aim to answer the following question: what can different preparation strategies for PISA tell us about the impact and use of PISA in these contexts? We do not aim to demonstrate a relationship between preparation and performance. Rather, we show how politics and policy within nations produce different approaches to preparing for and administering PISA, and we illustrate how this aspect of the assessment can deviate substantially from the OECD’s technical standards. Much has been written about how variation in sampling processes and the translation of test items into different languages and cultural contexts may affect test performance, as well as about reception and use of test results. However,

little attention has been given to different approaches to preparing students for the PISA test. The four cases examined in this chapter cover very different test preparation strategies, including: (1) minimal preparation (e.g. reading a brief script to introduce the test); (2) test motivation strategies (e.g. videos); (3) familiarization of students with test format, content and test-taking strategies using customized handbooks for teachers and students; and (4) structural preparation through reform to pedagogy and curriculum that may increase familiarity with the types of reasoning assessed by PISA. As we will discuss, the OECD attempts to standardize the administration of the PISA test on the day that it is conducted and immediately beforehand, but the practices in cases 2 through 4 fall outside of the OECD’s technical guidelines for administering the test. Thus, we are largely concerned with activities that fall into a ‘grey area’ of test preparation that is neither officially recognized nor forbidden by the OECD. These activities entail costs to nations beyond those incurred by simply participating in the programme and thus provide insight into the importance ascribed to, and the uses of, PISA by governments. This chapter draws on data in the form of policy documents, research interviews and test preparation materials in order to compare different approaches to test preparation across the four contexts. We first provide a background discussion of the PISA technical standards produced by the OECD and the protocols followed by National Project Managers (NPMs), test administrators and staff in schools who have responsibility for managing the assessment. We then outline our theoretical framework, which combines an education governance perspective with assessment literature that examines test preparation, and briefly describe our methodology. The four case studies are then presented before moving to a discussion and conclusion that draws out the key findings and summarizes our overall argument.

Background and theoretical framework

With the introduction and ongoing development of PISA, the OECD has generated a global testing infrastructure that connects national assessment programmes with its education work as an intergovernmental organization. Standardization has played a crucial role in the development of this global infrastructure. The OECD cannot compel nations to participate in the test and instead exerts influence by promoting (a) the importance of the assessment and (b) a model of evidence-based policymaking that requires the kinds of data

generated by PISA. This form of standardization operates through alignment of values within the epistemic communities that are cultivated by the OECD (Sellar & Lingard, 2014a). The OECD also promotes standardization through the technical work undertaken within and across countries to develop the test instruments, administer the test and analyse the results. This standardization is a form of infrastructural governance (Sellar & Lingard, 2014a) that provides the OECD with new capacities to shape national school systems through the provision of tools and platforms that embody a particular view of education (Gorur, 2016). Standardization of this latter kind has been a major project of modernity (Busch, 2011) and is central for the establishment of information infrastructures (Hanseth et al., 1996), of which the OECD’s educational data work is a good example. Our focus is upon variations that emerge across different articulations of international and national testing infrastructures. Specifically, we focus on ‘grey areas’ where the PISA standards neither prescribe nor proscribe test preparation strategies that are employed by some participants. The OECD publishes a number of documents that are designed to standardize the administration of PISA. For each assessment, a technical document is developed that lists ‘the set of standards upon which the … data collection activities will be based’ in order to ensure that all partners involved in data collection ‘contribute to creating an international dataset of a quality that allows for valid cross-national inferences to be made’ (OECD, 2015: 3). With regard to test administration, the most recent 2015 standards explain that ‘[c]ertain variations in the testing procedure are particularly likely to affect test performance’ and these include ‘the instructions given prior to testing’. Here we see an emphasis on ensuring data quality and objectivity through controlling the test situation (see Piattoeva, 2016) and the information provided to students about the test. Given concerns raised about the reliability of its international comparative testing, the OECD sees mechanisms for achieving procedural and administrative standardization as central to its legitimation strategy. Successful challenges to the reliability of PISA data and validity of PISA interpretations and use may reduce the impact of what has become one of the OECD’s most successful policy products, and the OECD provides technical reports and establishes various standards and protocols to guard against and respond to such challenges. In its PISA technical reports, the OECD has consistently stressed participating countries’ strict adherence to the administrative guidelines. The reports specify that within each participating nation, NPMs oversee the conduct of PISA and develop documents that are designed to facilitate standardized data collection.

Manuals prepared for test administrators and school-based staff who manage the administration of the assessment advise that no help can be given to students in relation to items in the test booklet, and the only instructions to be given to students are those included in national scripts that are read word-for-word at the beginning of the testing session. These scripts emphasize the global scale of the assessment and the importance of the data for policymakers. Students are urged to do their best, but the script largely focuses on the processes for completing the test. In the OECD school co-ordinator’s manual, school staff are advised to introduce PISA to other staff and students ahead of the test, to distribute promotional materials that raise awareness about the assessment and to encourage students to participate. This promotion is designed to ensure sufficient participation to meet the sampling requirements. The cases examined here, however, illustrate how various kinds and degrees of test preparation occur outside the scope of these OECD standards and manuals for test administrators. Our discussion of the four cases draws on the framework for categorizing different approaches to test preparation or coaching described by Allalouf and Ben-Shakhar (1998) and Brunner, Artelt, Krauss and Baumert (2007). This framework comprises what Brunner et al. (2007) describe as the familiarity approach, the content approach and the test-wiseness approach. The familiarity approach involves training using materials and conditions that closely simulate the actual testing situation. The content approach involves targeted study in areas that are tested, and the test-wiseness approach involves the teaching of general test-taking strategies (see Millman et al., 1965), the role of which have been debated in relation to PISA and other ILSAs (Dohn, 2007; Maddox, 2014). In Germany, Brunner et al. (2007) found that these forms of preparation did not have a significant effect on PISA performance and do not threaten the validity of PISA. Baumert and Demmrich (2001), also in the German context, have shown that incentivizing students using mechanisms such as feedback, grading or financial reward did not significantly improve test motivation beyond that generated by explaining the societal benefit of participation. In the US, however, Braun, Kirsch, and Yamamoto (2011) used monetary incentives as an intervention strategy in a low-stakes literacy assessment and did find general increases in achievement when compared to the nonincentivized group. PISA is a low-stakes test for students and the utility of participation is generally framed in terms of the contribution that students make to understanding education and skills in different societies. While there are certainly anecdotal stories of incentives being offered to students to increase test motivation – including free pizza or free breakfasts – and different findings

regarding the effects of incentivization in low-stakes assessments, our focus is not on determining whether such activities have a measurable impact on test performance. Instead, we focus upon what test preparation approaches can tell us about the importance governments ascribe to PISA and how PISA is used by governments in the context of national political and policy agendas.

Methodology

The following cases have been developed from separate research studies and are based upon a combination of document analysis and research interviews. Case 1 is based on a research interview with Norwegian stakeholders who are familiar with the PISA administration process and textual analysis of test administration materials. Case 2 was developed from 'videological analysis' (Koh, 2009) of a short film that was used to motivate students who sat for PISA in Scotland. Case 3 is based on textual analysis of teacher handbooks that were used for test preparation in Prince Edward Island, Canada, as well as analysis of relevant PISA technical documents. Case 4 was developed from research interviews with key stakeholders in Japan. The documents that we discuss are in the public domain and appropriate ethical approval was obtained for all research interviews. Each case is treated as an example of a different approach to test preparation, rather than as an isomorphic unit of analysis that can be compared with the other cases in all respects.

Analysis: Four different approaches to preparing for PISA

Case 1, Norway: Following the template

Consistent PISA results around the OECD average have influenced Norwegian educational policies and public opinion (Nusche et al., 2011), yet Norwegian students receive little coaching or, for that matter, information regarding their participation in PISA. As such, this case provides an illustrative example of the approach explicitly endorsed by the OECD. Following OECD guidelines, administrators of selected schools are sent a letter in early November informing them that their school has been randomly selected to participate. Before 2015, school participation in PISA was voluntary; however, for the 2015 cycle

school-level participation in international assessments became obligatory (mandated by the national government because of low participation in various international assessments). Consequently, the proportion of selected schools agreeing to participate rose from 85 per cent in 2012 (the OECD average was 92 per cent) to 95 per cent in 2015 (the OECD average was 91 per cent). The increase in the participation rate is important and is demonstrative of Norway's commitment to international assessments as a key indicator for measuring education systems' performance (Nusche et al., 2011: 26). Once schools are selected they are required to send a list of all enrolled 15-year-olds. Partly due to a large number of small rural schools, Norway only selects thirty students per school to participate. This differs from the typical PISA sample of forty-two students for schools that take the computer-based assessments and thirty-five for paper-based assessments (OECD, 2017). In the context of PISA, two things are somewhat unique about Norway. First, most 15-year-olds (approximately 99 per cent) are in the 10th year of schooling, which differs from a country like the US, where 15-year-olds can be in grade 7 or higher, with the majority of students coming from grades 9 to 11 (Kastberg et al., 2014). The second unique aspect is that in Norway, 15-year-olds – given that they are of legal age – are able to provide consent to take the assessment. This means that, unlike schools in the US, students in Norway have the right to decide whether to participate in or opt out of PISA. As such, PISA is relieved from the burden of collecting parent consent forms, which is a requirement for other international studies such as TIMSS that assess younger populations (e.g. Grade 9 in Norway). However, after reviewing documentation sent to the parents and the script that is read out to students, it is unclear whether the ability to opt out of the assessment is ever explicitly explained to students. During an interview with Norwegian stakeholders it was argued that most Norwegian students are aware of this right and that they have heard of few instances where students refused to take part in the assessment (personal communication). Although participation in PISA and other international assessments is now a requirement from the Norwegian educational directorate, in general school autonomy remains an important tenet of the education system. Schools are given freedom regarding how to inform parents that their children have been chosen to participate in PISA. Following PISA instructions, the NPMs provide schools with a form letter that is rarely adjusted by the schools (personal communication). The one-page letter explains that PISA is an international survey, but does not include any mention of the OECD or its goals. The letter also states that the Norwegian Ministry of Education has decided that participation is mandatory and that the University of Oslo has the responsibility for administering the

assessment in Norway. The letter then explains what is assessed by PISA, the test duration (two hours), and that the survey and assessment are anonymous and cannot be traced back to an individual student. Absent from the letter is any discussion concerning a student's ability to opt out or their need to provide explicit consent. On the day of the assessment, as per OECD procedure, selected students are removed from the classroom (unless the entire class is selected, as is the case in small schools) and are taken to another room where the assessment is administered. The following information provided by the OECD is the only official information the students are told about their participation:

You are selected to participate in an international survey called PISA. In the survey, we want to find out what students your age can do in science, reading, math and problem solving. In Norway, about 6,000 students attend. Norway is one of the 70 countries participating in PISA. Overall there will be more than 300,000 pupils involved, from over 9,000 schools. This is an important study because it says something about what you have learned and how you experience school. The answers you provide will mean something to decisions made about the Norwegian school in the future. We ask you therefore to do the best you can. (Author's translation, Institutt for Lærerutdanning og Skoleforskning, 2014)

A similar script is read out in other participating countries. Although PISA has clearly had an influence on the Norwegian educational system (Baird et al., 2011; Breakspear, 2012), our research has found no evidence that students are provided with any preparation before the assessment. The only ambiguity that emerged during our review of Norwegian administration guidelines, and interviews with those who administer the assessment in Norway, is the extent to which students clearly understand that their participation is voluntary. Cultural issues may explain Norway’s particular case (e.g. students understand their rights). However, one question that emerges from this case is the extent to which other systems mandate participation in PISA and what threats to the validity of cross-system comparison emerge from those differences.

Case 2, Scotland: Representing your country

In contrast to the Norwegian case, explicit attempts have been made to motivate Scottish students for PISA during previous assessment cycles. Here we focus on the three-minute video, Representing Your Country: PISA 2012, prepared by the Scottish Department of Education for the Scottish National Party (SNP)

government. The video was distributed for screening at all schools attended by students who were part of the 2012 PISA sample. In our brief ‘videological analysis’ (Koh, 2009), we draw on earlier work (Lingard & Sellar, 2014b) that considered how this motivational video was located within Scottish politics. This video constitutes an explicit test motivation strategy, which builds on the societal value of participation that is emphasized in the official script read to students, and it was inextricably linked to the Scottish independence movement at the time, which saw quality schooling as central to the (economic) future of an independent Scotland. The United Kingdom (UK) is the unit of analysis for PISA and it is the UK that is represented on the PISA Governing Board at the OECD. Nonetheless, Scotland has sent an observer to PISA Governing Board meetings from the outset and devolved Scottish governments have used the OECD to conduct a number of reviews of schooling. Since PISA was first administered in 2000, each of the devolved and constituent parts of the UK (Scotland, Wales, Northern Ireland) has been oversampled, so that UK PISA results can be disaggregated for policy use at these devolved levels of government. Under devolution in Scotland, schooling is the responsibility of the Scottish parliament (created in 1999) with a separate Minister for Education. Schooling has been highly valued historically by the Scottish people and is also seen as central to the construction of ‘Scottishness’ (Paterson, 2003, 2009; McCrone, 2005; Lingard, 2014: 118–130). McCrone (2005: 74) observes that, ‘people think of themselves as Scottish because of the micro-contexts of their lives reinforced by the school system’. The centrality of schooling to Scottish national identity has taken on reinvigorated salience in the context of the independence movement and is manifested in the video, Representing Your Country: PISA 2012. From the time of devolution, the Scottish independence movement has strengthened, culminating in the referendum in late 2014 that was narrowly defeated. The human capital and economic importance of schooling is also evident in the SNP government’s Smarter Scotland project, which aims to create Scotland as an independent knowledge economy. In analysing this project and the SNP government’s education policies, Arnott and Ozga (2010) have argued that both use Scottish nationalism as a resource. A similar discursive strategy is at work in the video. Additionally, the video semiotically constructs and assumes an independent Scottish nation. Although distributed well prior to the independence vote, the Representing Your Country video already portrayed Scotland as a separate, independent

nation. The video is anchored by the Minister for Education, the then Scottish National Party MSP, Michael Russell, and by a voice-over from a Scottish policy-maker, utilizing the intertextual semiotic signifier in the conclusion of a Smarter Scotland. The backdrop to the Minister is the Edinburgh landmark, the Salisbury Crags, and the science centre, Our Dynamic Earth, created as part of the urban renewal of this part of Edinburgh, as was the Scottish Parliament building in which the Minister is standing. All of these are resonant semiotic markers of Scotland, old and new, natural and cultural. The Scottish colours of blue and white – the colours of the saltire, the Scottish flag – saturate the video. We also note how a fluttering saltire is the backdrop to the statements by a number of young athletes stressing how great it is to be selected to represent Scotland. Koh (2009: 284) uses the concept of 'visual design' to analyse the semiotics of educational videos and observes that such design 'works ideologically to constrain the semiotic meaning potential of visual texts to a preferred reading path, and that "design" textually contributes to a closed rather than an open, multiple or contradictory reading of the text'. Representing Your Country seamlessly weaves together visual images and its explicit message, and also proffers a preferred reading that assumes Scotland is an independent nation, rather than the political aspiration it actually still is. The voices of the Minister for Education and the policy-maker evoke a sense for participating students and teachers that being chosen as part of the PISA sample is exciting and significant. The Minister says, 'You have been selected to represent Scotland'. We see here what Fairclough (2003: 88) refers to as a 'logic of equivalence'; participating in the random PISA sample means representing Scotland and is framed as equivalent to a young athlete being chosen to represent Scotland in sport, with the parallel necessity of doing one's best for one's self and for one's country. The video closes by noting that individuals are not often asked to represent their country. This, of course, also links to the necessity of getting the appropriate PISA sample size so that the data are useful for policy and comparative purposes inside Scotland. Students are exhorted to do their best for Scotland because, the video implies, a good PISA performance will mean more investment in the Scottish economy and thus better opportunities for all. The Minister argues that good performance on PISA will mean that Scotland will be seen as a 'Great place to invest for the future'. There are also clips of North Sea oil rigs and wind turbines as signifiers of the Scottish economy. The video thus creates what Fairclough (2003: 91) calls a 'higher-level semantic' in relation to participation in the 2012 PISA sample in Scotland.

In this case, the Representing Your Country video was used strategically in an attempt to motivate students to do their best on the test. This strategy leveraged the emphasis on the social utility of participation that is conveyed in the standard PISA script. The production and distribution of this video indicate that the Scottish government deemed motivation for, and strong performance on, PISA to constitute an important contribution toward a viable and independent Scottish nation with a productive knowledge economy.

Case 3, Prince Edward Island (Canada): Teaching to the test

In contrast to the Scottish and Norwegian cases, Prince Edward Island (PEI), Canada, employed a more substantial approach to test preparation by using teacher handbooks and sample items in the lead-up to PISA 2003. Similar handbooks with sample items and advice to teachers on how to prepare students have been used elsewhere, including in Wales (Welsh Government, 2012) and Mexico (Secretaria de Educacion Publica, 2011). However, this case provides an early example of this preparation strategy in a provincial education system from a country that is an OECD member and which has been a top performer on PISA. Canada does not have a federal education ministry and education is the responsibility of provinces. Like the UK, Canada oversamples for PISA in order to disaggregate province-level results to enhance their relevance for the decision-making of provincial education ministries. The four Atlantic provinces (New Brunswick, Newfoundland and Labrador, Nova Scotia and PEI) performed significantly lower than other Canadian provinces in PISA 2000, but PEI was not the lowest performer among this sub-group. However, in 2003, PEI was the lowest performing province across all three domains. The preparation discussed here occurred in the lead-up to the 2003 assessment and was based on sample items from the 2000 assessment. PEI is the smallest Canadian province and had a population of approximately 135,000 people in 2003. Twenty-six schools were selected to participate, which is most of the schools in the province that enrol 15-year-old students, and 1,653 of the 1,832 students who were sampled participated. Thus, PISA is effectively a census test in PEI, rather than being conducted with a minority of sampled students. A systematic approach to test preparation in PEI thus does not entail unnecessarily preparing students who will not be sampled to sit the test. Preparation for PISA 2003 in PEI involved the production of a twenty-six-page teachers' handbook for each domain – reading, mathematics and science –

along with matching student handbooks. The handbook for each domain follows a similar format. Here we focus on the mathematics handbook, titled Preparing for PISA: Mathematical Literacy Teacher’s Handbook (Prince Edward Island Government, n.d.), because mathematics was the major domain in 2003. A brief note on the contents page explains that each handbook is based substantially on the OECD document Sample Tasks from the PISA 2000 Assessment: Reading, Mathematics and Scientific Literacy (2002). The handbook specifies that the OECD gave permission for PEI to reproduce this material.1 The OECD (2002) document provides a detailed description, for each domain, of the PISA definition of literacy, the format of the questions and the assessment process. A series of sample items are also included. While the PEI teacher’s handbook does reproduce items and passages of text from the OECD document, it also includes additional text directed at teachers in the province. Firstly, a box out titled Preparing Atlantic Canadian Students for PISA (p. 3) explains that the document has been published to enable students, with the help of their teachers, to attain a clear understanding of the assessment and how it is scored and to help ensure more confident and successful participation. There is also a pamphlet for parents to raise awareness of the purpose, methodology, and significance of PISA. (p. 3)

On the following page, under the heading of Suggestions for Teachers (p. 4), the handbook encourages teachers: to discuss the sample tasks with students in class groups or individually; to carefully review the scoring criteria, which 'are the same as those used by PISA markers to mark the actual assessment', and discuss acceptable answers with students; 'to help students become comfortable with the way PISA questions are formatted and classified'; to incorporate the sample tasks into curriculum, instruction and assessment; and to encourage students 'to take the assessment seriously and strive for excellence'. The original OECD (2002) document is not addressed to teachers as a guide for preparing students to sit PISA and does not provide tips for improving test-taking capacities among students. The OECD does not explicitly provide materials for test preparation. However, the handbooks produced in PEI adapt the OECD document to support test preparation that falls outside the frame of the PISA standards and test administration manuals.

1 The PEI handbooks contain the following text: 'Based almost entirely on the Organisation for Economic Co-operation and Development document Sample Tasks from the PISA 2000 Assessment: Reading, Mathematics and Scientific Literacy © "OECD (2002). Reproduced by permission of the OECD"'. (Prince Edward Island Government, n.d.: 2). We can thus infer that the OECD was aware of the production of the PEI handbooks when giving permission for the reproduction of its document in this format.

The OECD has thus given permission for the reproduction of sample items in documents explicitly created for the purpose of preparing students to perform better on the test than they would otherwise. Teachers who follow each of the suggestions in the PEI handbooks would be covering each component of test preparation outlined above: familiarizing students with the format of the test using previous iterations; preparing students for tested content; and increasing test-wiseness by teaching test-taking strategies (Allalouf & Ben-Shakhar, 1998; Brunner et al., 2007). Indeed, page six of the document is dedicated to a box out titled Assessment-taking strategies and gives advice such as: 'Give each question a try, even when you're not sure. Remember partial value is given for partially correct answers'. This is a clear example of a strategy for increasing test-wiseness. In the case of PEI, the small size of the system changes the nature of PISA from a sample to a census test and could be seen by administrators to increase the potential effectiveness of test preparation. Moreover, the performance of PEI at the lower end of the Canadian provinces in 2000 may have also created incentives to improve performance through a systematic test preparation strategy. Canadian PISA performance is reported at the level of the provinces that participate, as well as in relation to Canada as a whole. The nature of Canadian federalism can thus encourage a competitive dynamic between provinces, and PEI's test preparation can be understood in relation to this dynamic, which may create incentives for different levels of investment in preparation across provinces within a single country.

Case 4, Japan: Reforming the system

In the case of Japan, PISA has powerfully shaped the policy reform discourse over the last two decades, starting with the 'crisis' generated in the immediate aftermath of the PISA 2003 data release (see Takayama, 2008). In contrast to the previous two cases, these reforms do not constitute an immediate form of test preparation, but rather a subtle, embedded form of content preparation. This final case thus represents the end of a continuum extending from a standard approach to test delivery, through motivational, familiarity and test-wiseness strategies, to structural alignment of national curriculum and assessment content with PISA. The PISA 'crisis' in Japan was particularly mobilized by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) to revamp the

constructivist curricular reform initiated in the mid-1990s, which was heavily criticized by those who attributed the declining academic (and by extension moral) standards of schools to the child-centric, process-oriented curricular reform that focused on the 'how' over the 'what' in children's learning. MEXT's policy reference to PISA was prevalent in the years after the crisis, but it was constantly utilized to keep intact, or further advance, the curricular reform which had already been set in motion (Tobishima, 2012; Takayama, 2014). It is in the context of this complex intermingling of the national and the global that the issue of 'PISA preparation' must be understood in the case of Japan. According to two NPM members who are researchers at the National Institute for Educational Policy Research (NIER), no cases of inappropriate test preparation were observed. They explained that the sampling and administrative procedures prescribed by PISA were carefully overseen by teachers and school administrators at the sampled schools, so much so that numerous complaints were made by participating schools about the logistical challenges caused by strict compliance. No attempt was made to ignite students' nationalistic pride as in the case of Scotland, though our interviews at two participating schools suggest that they stressed that the students were 'selected', without mentioning that it was a random selection, to encourage them to take PISA seriously. In fact, one of the interviewed NPM members was concerned about students' lack of seriousness about PISA. Having observed many inattentive and distracted students, including those falling asleep during the testing, he was pleasantly surprised at Japan staying among the top-performing countries and economies in PISA 2012. Unlike in Norway, participation of randomly selected schools in PISA is not mandatory in Japan. This has created considerable challenges for the NPM to secure the required number of participating students and schools. In particular, PISA sampling in Japan requires proportional representation of different types of high schools where 15-year-olds study: private, public (national and prefectural), comprehensive, commercial, technical and agricultural. Because of the limited pool of schools in some cases (e.g., private agricultural high schools), some schools have been selected for PISA multiple times. While some schools, especially those who participate for the first time, might consider participation to be an honour and, if private, use the fact of participation for marketing purposes, others, particularly those who have been sampled repeatedly, consider it a burden and disruption to their day-to-day operations. It is a disruption to participating schools, because these schools are required to randomly sample students from every single classroom in Year 10. Without any tangible benefits to

the participating schools and students, the NPM struggles to secure the required number of participants. The interviewed NPM officials were reluctant to discuss how to convince hesitant schools to participate in PISA. The most significant PISA 'preparation' in Japan took place when the 2007 national academic assessment (for Grades 6 and 9) introduced two parts, Part A (basics) and Part B (application). Japanese researchers and education commentators all agree that the introduction of Part B was strongly guided by the assessment orientation of PISA, the application of skills and knowledge in everyday contexts, and that this was a deliberate attempt by MEXT to prepare students for PISA (see Takayama, 2013). The types of problems in PISA were considered unfamiliar to many Japanese students and educators at the time, and this was particularly the case in the area of reading literacy (MEXT, 2005).2 The integration of PISA-type problems in the national academic assessment, which has been made mandatory for all eligible students since 2012, has facilitated nationwide uptake of PISA. Central to this process was the intense downward pressure generated by publication of national assessment average scores by prefectures. Low-ranking prefectures (e.g. Okinawa and Osaka) were stigmatized while highly ranked ones (e.g. Akita and Fukui) quickly became important places to visit and learn from for education administrators and educators from across the country. In response, many prefectural boards of education, under increasing pressure to improve their rankings, organized a range of opportunities where school administrators, teachers and municipal-level instructional advisors learned the general orientation of PISA and instructional strategies to prepare students for PISA-type problems. Many of these instructional advisors then worked closely with local schools to ensure that teachers were adequately supported to introduce the new type of PISA-informed teaching and learning. In the course of this process, however, PISA became indistinguishable from Part B of the national assessment. As the 'new' curricular orientation of PISA travelled downward through the centralized institutional channel, it was increasingly absorbed as part of the constructivist, process-oriented curriculum reform that MEXT had pursued in the name of 'zest for living' since the mid-1990s. In fact, from 2008 MEXT began claiming that its 'zest for living' curricular reform had preceded PISA, stressing the domestic 'origin' of its ongoing reform as well as its policy consistency (see Takayama, 2014).

2 This does not mean that a policy did not exist prior to this point that resembled the curricular logics of PISA. In fact, the social constructivist ethos underpinning PISA had been widely shared among Japanese educational psychologists and implemented partially at the policy level prior to the 2003 PISA crisis. But the new curricular orientation had not reached the classroom level and hence PISA was still unfamiliar to many teachers and students (Ichikawa, 2007).

The integration of PISA-style questions in the Japanese national assessment does not guarantee the desired changes at the classroom level, however. According to Kayo Matsushita (2012), the 2006 introduction of the new managerial 'Plan-Do-Check-Act' (PDCA) cycle was critical in this regard, as it required that schools use the national assessment data as part of the newly mandated self-improvement planning for individual schools and boards of education (see Takayama, 2013). This move was followed by the integration of the same PISA-informed curricular orientation in the 2008/9 revision of the national course of studies. What is notable in Japan is that MEXT, in response to the 'PISA crisis', introduced the curricular shift in an unconventional way, via the addition of Part B in the national assessment, prior to the scheduled revision of the national course of studies in 2008/9, which is the normal procedure in Japan for curricular policy change.

Discussion and conclusion

Each of the four cases illustrates how participation in PISA articulates in different ways with (sub-)national political agendas and influences the development of national testing infrastructures. Conforming to the procedures specified in the OECD standards and manuals is the simplest and least expensive option for participating nations and systems. However, cases 2 through 4 involve governments committing additional resources to the production of test preparation materials or larger-scale curriculum reforms. This commitment signals a degree of heightened concern about PISA performance, but the reasons for this concern differ across the cases. In Norway, the OECD's attempts to standardize the administration of PISA largely determine the implementation of the assessment, with only very minor variation due to the idiosyncratic nature of young people's right to give or withdraw consent to participate. This idiosyncrasy raises interesting questions about the impact that the inability to opt out may have on the sample in other contexts. Indeed, we would stress the importance of further investigating issues of informed consent in relation to the administration of ILSAs. School participation in the PISA sample is also compulsory in Norway, in contrast to Japan, for example, where participation is voluntary and there are considerable administrative challenges in securing the required sample size for particular

types of high schools. This is a matter that demands comparative research across national jurisdictions and should at the very least be reported in OECD technical documentation. While PISA has had an impact on policy and public debate in Norway, it has not prompted the development of ancillary test preparation structures to improve performance. The Scottish case exemplifies a relatively low-cost addition to the national PISA infrastructure in the form of a systematic test motivation strategy. The form and content of the Representing Your Country video demonstrate the significance of PISA participation within the broader agenda of Scottish independence and the promotion of Scotland as having a strong knowledge economy. In this case, the political dynamics of the UK and, specifically, the Scottish independence agenda, have driven the development of a modest addition to the national testing infrastructure that is designed to leverage the emphasis given, in the standard script read to students, to participation being of benefit to one's country and its schools. Participation in PISA has been used within Scottish politics to emphasize the quality of Scottish education (and by extension, its human capital), as well as its distinctiveness from English schooling. The more developed approach to test preparation evident in the PEI case draws attention to the potential incentives for small school systems to boost PISA performance through test preparation strategies. Where systems are small enough that most or all 15-year-olds are sampled for the test, systematic test preparation becomes more feasible because it does not require preparing a large cohort of students for a test that only a small group will sit. Moreover, the politics of Canadian federalism can create a competitive dynamic between provinces in relation to educational performance and this may also have contributed to the implementation of preparation strategies, given that PEI is not one of the stronger-performing provinces. Finally, the Japanese case demonstrates how PISA results can be mobilized by national governments to legitimize education reform strategies and, in this case, a reform to the national assessment infrastructure that brought Japan's education system into closer alignment with the content assessed by PISA. Here we can see a structural convergence of national and international assessments. While this more embedded approach to preparation through curriculum reform reflects heightened concern about PISA performance in Japan, the reforms were already in train before the release of the results that prompted the Japanese 'PISA crisis'. Thus, this case illustrates that test preparation may not arise from immediate concern to boost performance, as the PEI and Scottish cases suggest, but rather can reflect the use of PISA to legitimize internal reform agendas.

The various forms of preparation for PISA that are explored in this chapter might raise a question about the foundational rationale of the programme: its claim to assess 15-year-old students' life skills and knowledge, as opposed to what is taught in schools. The distinguishing feature of PISA from other international large-scale assessments, and thus its raison d'être, lies in the fact that PISA is not embedded in formal curricula in participating nations and sub-national entities. PISA has used this logic of 'independence' to assert its originality and the reliability and fairness of its international assessment. The various forms of undocumented test preparation detailed in this chapter, however, point to increasing alignment of national and sub-national curriculum and assessment infrastructures (broadly defined) with the type of skills and knowledge tested in PISA. Hence, in an ironic way, PISA is increasingly 'schooled' and less about life; the more aligned national and sub-national systems become with the PISA framework, the more its original rationale becomes undermined. What this actually means for PISA and participating jurisdictions remains to be seen, but this chapter has shown considerable variance in the level of preparation that each jurisdiction is prepared to undertake, with Japan providing a case of the most extensive and systemic preparation and Norway the most minimalist approach. This could suggest that the performance of participating countries and school systems is subject to what goes on in the 'grey areas' where various official and unofficial strategies are being used to prepare students for PISA. Hence, in a paradoxical way, the increasing international adaptation to PISA could potentially delegitimize the assessment insofar as it is not designed to measure how well schools prepare students for the test.

References

Allalouf, A. & Ben-Shakhar, G. (1998), 'The effect of coaching on the predictive validity of scholastic aptitude tests', Journal of Educational Measurement, 35 (1): 31–47.
Arnott, M. & Ozga, J. (2010), 'Education and nationalism: The discourse of education policy in Scotland', Discourse: Studies in the Cultural Politics of Education, 31 (3): 335–350.
Baird, J., Isaacs, T., Johnson, S., Stobart, G., Yu, G., Sprague, T., & Daugherty, R. (2011), Policy effects of PISA, Oxford: Oxford University Centre for Educational Assessment. Available online: http://research-information.bristol.ac.uk/files/14590358/Baird_et_al._2011.pdf
Baumert, J. & Demmrich, A. (2001), 'Test motivation in the assessment of student skills: The effects of incentives on motivation and performance', European Journal of Psychology of Education, 16 (3): 441–462.

Braun, H., Kirsch, I., & Yamamoto, K. (2011), 'An experimental study of the effects of monetary incentives on performance on the 12th-grade NAEP reading assessment', Teachers College Record, 113: 2309–2344.
Breakspear, S. (2012), The policy impact of PISA: An exploration of the normative effects of international benchmarking in school system performance (OECD Education Working Papers No. 71), Paris: OECD Publishing. Available online: http://dx.doi.org/10.1787/5k9fdfqffr28-en
Brunner, M., Artelt, C., Krauss, S., & Baumert, J. (2007), 'Coaching for the PISA test', Learning and Instruction, 17: 111–122.
Busch, L. (2011), Standards: Recipes for reality, Cambridge, MA & London: The MIT Press.
Dohn, B. N. (2007), 'Knowledge and skills for PISA: Assessing the assessment', Journal of Philosophy of Education, 41: 1–16.
Fairclough, N. (2003), Analysing discourse: Textual analysis for social research, London: Routledge.
Gorur, R. (2016), 'Seeing like PISA: A cautionary tale about the performativity of international assessments', European Educational Research Journal, 15 (5): 598–616.
Hanseth, O., Monteiro, E., & Hatling, M. (1996), 'Developing information infrastructure: The tension between standardization and flexibility', Science, Technology, & Human Values, 21 (4): 407–426.
Ichikawa, S. (2007), 'Intabyuu/gakuryokuchousa de hakaru "pisa gata gakuryoku" towa, ittaidonoyouna gakuryoku o sasunoka' [What does PISA-type achievement, assessed in national assessment, actually refer to?], Sougoukyouiku gijutsu, May: 68–71 (in Japanese).
Institutt for Lærerutdanning og Skoleforskning (2014), Instruksjon for prøvedagen PISA 2015 [Instructions for the test day, PISA 2015], Oslo: University of Oslo.
Kastberg, D., Roey, S., Lemanski, N., Chan, J. Y., & Murray, G. (2014), Technical report and user guide for the Program for International Student Assessment (PISA) (No. NCES 2014-025), Washington, DC: U.S. Department of Education. Available online: http://nces.ed.gov/pubsearch
Koh, A. (2009), 'The visualization of education policy: A videological analysis of Learning Journeys', Journal of Education Policy, 24 (3): 283–315.
Lingard, B. (2014), Politics, policies and pedagogies in education: The selected works of Bob Lingard, London: Routledge.
Lingard, B. & Sellar, S. (2014), 'Representing your country: Scotland, PISA and new spatialities of educational governance', Scottish Educational Review, 46 (1): 5–18.
Maddox, B. (2014), 'Globalising assessment: An ethnography of literacy assessment, camels and fast food in the Mongolian Gobi', Comparative Education, 50 (4): 474–489.
Matsushita, K. (2012), 'Gakkou wa naze konnanimo hyouka mamirenanoka' [Why are schools inundated with this much assessment?], in Guruupu didakutika (ed.), Kyoshini narukoto, kyoshide aritsuzukerukoto – konnannonakanokibou [What it means to become a teacher, what it means to remain as a teacher – hope in challenges], Tokyo: Keisoushobou (in Japanese).

McCrone, D. (2005), 'Cultural capital in an understated nation: The case of Scotland', British Journal of Sociology, 56 (1): 65–82.
MEXT (2005), 'Dokukairyoku koujou ni kansuru shidou shiryou' [Instruction guidance for reading comprehension], Tokyo: MEXT. Available online: http://www.mext.go.jp/a_menu/shotou/gakuryoku/siryo/05122201.htm (in Japanese).
Millman, J., Bishop, C. H., & Ebel, R. (1965), 'An analysis of test-wiseness', Educational and Psychological Measurement, 25 (3): 707–726.
Nusche, D., Earl, L., Maxwell, W., & Shewbridge, C. (2011), OECD reviews of evaluation and assessment in education: Norway, Paris: OECD Publishing.
OECD (2002), Sample tasks from the PISA 2000 assessment: Reading, mathematics and scientific literacy, Paris: OECD Publishing.
OECD (2015), PISA 2015: Technical standards, Paris: OECD. Available online: https://www.oecd.org/pisa/pisaproducts/PISA-2015-Technical-Standards.pdf
OECD (2017), PISA 2015: Technical report (draft), Paris: OECD Publishing.
Paterson, L. (2003), Scottish education in the twentieth century, Edinburgh: Edinburgh University Press.
Paterson, L. (2009), 'Does Scottish education need traditions?', Discourse: Studies in the Cultural Politics of Education, 30 (3): 269–281.
Piattoeva, N. (2016), 'The imperative to protect data and the rise of surveillance cameras in administering national testing in Russia', European Educational Research Journal, 15 (1): 82–98.
Prince Edward Island Government (n.d.), Preparing students for PISA (mathematical literacy): Teachers handbook. Available online: www.gov.pe.ca/photos/original/ed_PISA_math1.pdf
Secretaria de Educacion Publica (2011), Competencias para el Mexico que queremos: Hacia Pisa 2012. Manual del alumno, Mexico City: Secretaria de Educacion Publica.
Sellar, S. & Lingard, B. (2014), 'The OECD and the expansion of PISA: New global modes of governance in education', British Educational Research Journal, 40 (6): 917–936.
Takayama, K. (2008), 'The politics of international league tables: PISA in Japan's achievement crisis debate', Comparative Education, 44 (4): 387–407.
Takayama, K. (2013), 'Untangling the global-distant-local knot: The politics of national academic achievement testing in Japan', Journal of Education Policy, 28 (5): 657–675.
Takayama, K. (2014), 'Global "diffusion," banal nationalism, and the politics of policy legitimation: A genealogical study of "zest for living" in Japanese education policy discourse', in P. Alasuutari & A. Qadir (eds), National policy-making: Domestication of global trends (Routledge Advances in Sociology), 129–146, New York: Routledge.
Tobishima, S. (2012), 'The politics of the re-definition of "zest for living": Based on Basil Bernstein's theory of pedagogic device', The Annual Review of Sociology, 23: 118–129 (in Japanese).
Welsh Government (2012), A guide to using PISA as a learning context, Cardiff: Welsh Government.

9

Investigating testing situations

Bryan Maddox, Francois Keslair and Petra Javrh

Introduction

Several of the chapters in this book describe public encounters with International Large-Scale Assessments (ILSAs) and focus on the big picture, where 'Performance' (with a big P) is associated with results at national and international levels, and on education systems and economies. In this chapter, we take a different approach. Our focus is on assessment performance with a small p: that is, on small-scale process data on behaviour in testing situations, generated from the analysis of talk and gesture and from item response time data obtained from computer-generated log files. Our contention is that these different scales of micro and macro data are connected: that the small-scale processes of test performance, interaction and engagement are bound up with the large-scale results, validity and consequences of large-scale assessment data. The chapter discusses data from PIAAC, the OECD Programme for the International Assessment of Adult Competencies. PIAAC involves the administration of a background questionnaire, followed by a test of literacy, numeracy and problem-solving in technology-rich environments. The background questionnaire and assessment are administered by a computer that runs a multi-stage Computer Adaptive Test (CAT), and the assessment can also be delivered as a paper-based test where respondents are not sufficiently familiar with and able to use a computer (OECD, 2013). What makes the PIAAC assessment interesting as a source of field observations is the fact that, as a low-stakes assessment, it is administered in the presence of an interviewer in the respondent's home rather than a testing centre. That necessity introduces the potential for multiple sources of variation – from the performance of the

interviewer, to the engagement of the respondent in the household setting (see Maddox, 2017; Maddox & Zumbo, 2017). To illustrate our argument, we present a case from the Slovenian PIAAC assessment. We discuss and examine a hypothesis about interviewer effects, initially identified in video-ethnographic data, that the conduct of the interviewer, and especially their seating arrangements during the assessment, can impact on the performance and engagement of the respondents (for a wider discussion on interviewer effects see for example, West & Blom, 2016; and on interviewer effects in PIAAC, see Ackermann-Piek & Massing, 2014). In the PIAAC assessment the interviewer first administers the background questionnaire. Then they hand over the computer to the respondent to complete the assessment following the instructions on the screen. Provided that the respondent is sufficiently able to manage the use of the computer they are expected to work alone through the assessment tasks. However, there are occasions when the respondent may want to ask the interviewer for guidance about procedure or assistance in the correct use of the computer. In addition, the presence of the interviewer in the testing situation introduces the possibility for ‘off script’ improvisations – such as gently encouraging the respondent to remain engaged for the duration of the assessment, and ensuring that by-standers such as family members don’t contribute to the assessment process. Indeed, interviewer attentiveness to the testing situation is encouraged in the interviewer manual as features of good testing practice. By handing over the computer, the interviewer also hands over some of the agency to the computer for the delivery and management of the assessment process. This introduces additional scope for improvisation and variation in the administering of the assessment, as departures from pre-defined assessment protocols (see Ackermann-Piek & Massing, 2014). As we shall see, some of the interviewers in the Slovenian PIAAC remained close by, within the ‘ecological huddle’ (Goffman, 1964) of respondent, computer and interviewer. They seemed to be saying ‘I am with you’ as the respondent completed the test. Those who sat beside or perpendicular to the respondent were often able to display their attentiveness by their physical proximity, their facial expressions, talk and their joint attention to the content of the computer screen. In contrast, some interviewers took a more distant approach. After handing over the computer they moved some way away. They were not able to see the content of the screen, and often displayed their more distant approach by occupying themselves with administrative tasks. They seemed to be saying ‘you are on your own’.

Our ability to link very small-scale observations of response processes and people's behaviour in individual testing situations, involving human-human and human-computer interaction, with large-scale assessment marks out the new terrain of data analytics (Ercikan & Pellegrino, 2017; Zumbo & Hubley, 2017). What is new in the analysis of assessment data is the potential to systematically link data from these different scales of analysis through the application of digital technologies and computer-based testing (Maddox & Zumbo, 2017). This creates the opportunity to integrate affective response, stance, embodied interaction and its ecological setting into the analysis of test performance. The chapter is structured as follows. We begin by carefully and systematically considering observational evidence from individual testing situations. We present three fragments of interaction in testing situations produced from video-ethnographic observations. These data fragments obtain their significance and sovereignty as particular sequences of assessment interaction located in bodies, time and space. They are not intended to somehow stand for an aggregated and decontextualized whole in the metonymic sense. Rather, they operate as clues to patterns in assessment behaviour that can be explored in larger-scale data. We illustrate a process of going to scale by considering the testimony of the Slovenian interviewers. This provides information about the views and attitudes of the interviewers, and about how they conducted the assessments, including seating arrangements and behaviour. Finally, we make use of computer-generated log files as data forensics, together with data from the observational module, from the entire Slovenian PIAAC assessment. In PIAAC computer-based assessments, the computer generates a log-file process record of all keystrokes, mouse clicks and the time taken on each test item. This illustrates how multiple sources of information on assessment response processes can be investigated and analysed.
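To make the idea of log files as process data concrete, the short sketch below (in Python) shows how per-item response times might be recovered from a simplified event log. The log format and field names used here are illustrative assumptions for the purposes of exposition; they are not the actual PIAAC log-file schema.

```python
# A minimal sketch, assuming a simplified event log with one row per event.
# This is NOT the real PIAAC log-file format; field names are hypothetical.
from collections import defaultdict
from datetime import datetime

events = [
    # (respondent_id, item_id, event_type, ISO timestamp)
    ("R001", "N1", "item_start", "2014-10-01T10:02:10"),
    ("R001", "N1", "item_end",   "2014-10-01T10:03:55"),
    ("R001", "N2", "item_start", "2014-10-01T10:03:58"),
    ("R001", "N2", "item_end",   "2014-10-01T10:05:20"),
]

def response_times(events):
    """Return seconds spent on each (respondent, item) pair, using start/end events."""
    starts = {}
    times = defaultdict(float)
    for resp, item, event_type, ts in events:
        t = datetime.fromisoformat(ts)
        if event_type == "item_start":
            starts[(resp, item)] = t
        elif event_type == "item_end" and (resp, item) in starts:
            times[(resp, item)] = (t - starts.pop((resp, item))).total_seconds()
    return dict(times)

print(response_times(events))
# {('R001', 'N1'): 105.0, ('R001', 'N2'): 82.0}
```

Timing measures of this kind can then, in principle, be joined to interview and observation records for the same respondents, which is the kind of linkage between micro and macro data that the chapter goes on to explore.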

Part 1. Data fragments

In the autumn of 2014 Maddox conducted a series of video-ethnographic observations of real-life PIAAC assessment events. This involved a non-invasive ethnographic approach and established methods of participant observation. The assessments took place in the respondents’ homes. The ethnographer remained silent during the assessment and took detailed observational fieldnotes of content such as interaction, and documented observations such as body posture, gesture and facial expression. This supported later analysis of the video content.


After the event, linguistic transcripts were produced, in Slovene and then in English. These were carefully synchronized with the video recordings to provide a rich set of data. The assessments typically involved short sequences of verbal interaction (Maddox, 2017; Maddox & Zumbo, 2017) involving questions of clarification about the assessment procedure, the time required for the assessment, or encouragement to continue to complete the assessment.

This perspective on face-to-face interaction in assessment is informed by gesture studies (Kendon, 1997; McNeill, 1985; Duncan, Cassell, & Levy, 2007). We consider talk and gesture that is ‘environmentally coupled’ (Goodwin, 2007a) and oriented to a common focal activity (Goodwin, 2007b). Integral to this understanding of participation is that participants share a common orientation to a task – their ‘felicity conditions’ – to produce a framework of shared sense of communication (Goffman, 1983). The content of talk, gesture and facial expression in face-to-face interaction therefore represents public displays of stance and emotional affect rather than internal emotional states (Du Bois & Karkkainen, 2012; Goodwin, Cekaite, & Goodwin, 2012):

there is no doubt that the scope of emotion is not restricted to the individual who displays it. By virtue of their systematic expression on the face (and elsewhere, such as in prosody) emotions constitute public forms of action. (Goodwin, Cekaite, & Goodwin, 2012: 17)

These public displays of stance and emotion are important features of social interaction in testing situations. However, the extent to which the participants are able to develop a shared sense of what is going on in the testing situation depends on the opportunity and frequency of talk, and the possibility for them to jointly view the content of the computer screen. Differences in interviewer stance or orientation toward their administration of the computer-based assessment were observed in the Slovenian PIAAC assessment. We can analyse these differences in the content of linguistic transcripts as process data and investigate their implications for respondent performance.

To develop this argument, we partition the seating arrangements into three categories based on post-hoc analysis of the ethnographic video recordings (i.e. observations of real-life assessment events). The first set of interviewers chose to sit beside the respondent. Those interviewers were attentive to the assessment, and could see the computer screen and the respondent’s answers (see Figure 9.1). We can therefore describe their behaviour as one of joint attention. A second set of interviewers chose to sit near to the respondent, often perpendicular around the corner of a table. Those interviewers were physically present and available, but


Figure 9.1 Interviewer sitting beside the respondent

they could not easily see the content of the screen. The final set of interviewers we call distant. They physically removed themselves from the immediate focus of the assessment and typically signalled their distance by occupying themselves with administrative tasks. These differing characteristics of testing situations are illustrated in the following ethnographic fragments.

Fragment 1. Beside

Interviewer: Here. ((the interviewer presses a key on laptop))
Respondent: I see.
Interviewer: Now this keyboard will work as well.
Respondent: Aww, great.

The first example illustrates Goffman’s (1983) ‘felicity conditions’. The interviewer and respondent sit side by side with joint attention to the tasks on the screen. They have a common understanding of the shared activity, so that few words and gestures are required as the interviewer helps the respondent with a problem with operating the keyboard.


The example demonstrates the intimate conditions of ‘joint attention’ (Bayliss, Griffiths, & Tipper, 2009) as interviewer and respondent align their gaze to the test items. This meeting of minds can be viewed as efficient in terms of test administration, and we might speculate from this evidence that it may encourage high levels of respondent engagement. However, the close physical proximity of the interviewer may not be welcomed by some respondents, who may feel that they are under surveillance. Respondents may feel embarrassed to be observed as they complete test items, or may lack the sense of liberty that would enable them to skip items that they find too difficult, or that they wish to pass over out of boredom or fatigue. There may also be gender-based reasons why a respondent might feel uncomfortable sitting next to the interviewer, in which case the interviewer might offer to sit further away. Assuming that such gendered effects exist (see West & Blom, 2016), this suggests that in an international assessment such as PIAAC, cultural perceptions of gender might also influence interviewer behaviour and impact on response processes.

Fragment 2. Perpendicular

Interviewer: I can’t see where.. ((the interviewer moves to see the screen))
Respondent: yes, how far I am?
Interviewer: Yes
Respondent: Pardon?
Interviewer: I don’t know.
Respondent: A question.. the second exercise.. Is there still much to go?
Interviewer: I never know the precise number of questions.
Respondent: I see
Interviewer: so that
Respondent: yes, yes, okay
Interviewer: Yes.
Respondent: Are you tired?
Interviewer: Yes. well, so so.

In this example, the respondent initiates the sequence of talk by directing her gaze from the screen to the interviewer and non-verbally indicating that she is struggling to complete the test. Her non-verbal ‘response cry’ (Goffman, 1981) leads to the interviewer response ‘I can’t see where’, indicating that she is not able to see the computer screen. From where she is sitting nearby around the corner


of a table, face-to-face communication is possible, but she has to stand and move closer to be able to read the content of the screen. From her seating position, the interviewer is able to monitor the emotional state of the respondent and their engagement with the test through observations of body posture and interaction with the computer. However, in these seating arrangements joint gaze (to the computer screen) is not continuous, and depends in part on the request of the respondent. While this makes interviewer monitoring of response processes more difficult, it provides the respondent with a certain privacy and liberty, meaning that they can skip items without a sense of embarrassment about being observed, as the respondent in the second fragment did on several occasions.

Fragment 3. Opposite and distant

Respondent: Here.. I don’t understand something. Quite a lot … I thought that it was some instructions, because I made another round
Interviewer: yes?
Respondent: Something with the buttons … entering numbers with buttons. I don’t know, I think I skipped quite a few for no good reason…
((The respondent reads from the screen to himself, inaudible))
((The interviewer stands and walks over to the respondent and computer))
((they look together at the screen))
Interviewer: This is again only an explanation!
Respondent: I see, it’s only explanation!
Interviewer: Again, yes. Continue
Respondent: I see, OK.
Interviewer: Only now you will start to solve
Respondent: I see
Interviewer: Yes.
Respondent: I thought so.
Interviewer: Yes. Go ahead.

This third fragment provides a contrast with the first two examples. The interviewer had sat some distance away and occupied herself with an administrative task. She had indicated verbally and physically that the respondent was expected to complete the task on his own. In this interview, there are only a couple of sequences of verbal interaction. When the respondent asks for the interviewer’s attention ‘Here.. I don’t understand something’ – the interviewer


has to stand up and walk over to where the respondent is sitting. She stands beside him briefly for the interaction and then returns to her seat. Joint gaze is not possible for most of the assessment and the physical and social distance displayed by the interviewer clearly indicates that the respondent is on his own. Nevertheless, as the transcript illustrates, the interviewer can be called (hailed) when there is an area of difficulty.

Part 2. Survey and focus groups

Informed by the video-ethnographic data, Javrh conducted focus group discussions and a survey involving almost the entire set of Slovenian PIAAC interviewers. Thirty-seven interviewers completed the survey, and between them they had conducted the majority of the completed assessments. We were keen to find out more about the interviewers’ views on the conduct of the assessment, and to gather their reports on their typical seating arrangements.

The survey showed that the majority of interviewers were experienced, and we had no reason to question the quality of their work. Indeed, the field observations had suggested that the interviewers showed a high level of professionalism and commitment. Nevertheless, many of the interviewers reported that, in comparison to other surveys, the PIAAC assessment was especially long and demanding. Almost all reported that they had to encourage respondents to stay engaged and complete the survey. Sixty-five per cent of the interviewers reported that their presence in the testing situation appeared to affect the respondent’s responses (e.g. through the role of encouragement).

Nevertheless, there were differences between interviewers in terms of their reported seating arrangements and views about the best way to conduct the assessment. We found that most of the interviewers had clearly articulated preferences about the most appropriate seating arrangements. In response to a survey question about their typical seating arrangements, 31 per cent of interviewers reported that they sat beside the respondent. Twenty-two per cent said that they sat perpendicular to the respondent around the side of the table, and 19 per cent sat further away, opposite the respondent. Only around 6.5 per cent reported that they sat according to the wishes of the respondent. The remaining proportion did not express a view. Some of the interviewers said that they deliberately sat far enough away that they were not able to view the computer screen. The self-reported data on interviewer seating arrangements supported our emergent hypothesis that seating arrangements have the potential to affect respondents’ response processes and performance.


Part 3. Data forensics

In the final stage of the investigation we used log file data on response times, combined with information from the interviewer-completed observational module, to examine the relationship between seating arrangements and respondent performance. We linked performance data to self-reported survey information from interviewers about their reported (typical) seating arrangements in the PIAAC assessments. We recognize that the self-reported data may operate as a marker for a wider set of interviewer attitudes and preferences. Furthermore, it is not a precise measure of the seating arrangements in each interview. There may also be contextually specific reasons why interviewers might alter their preferred seating arrangements in a particular interview situation, i.e. we cannot make any direct claims of causation. However, as the interviewers reported that they tended to maintain their seating arrangements (i.e. regardless of contextual factors), we believe that it is a useful exploratory measure (Table 9.1). On the question of whether the interviewer felt that the respondent understood the questions, there were small reported differences according to the proximity of the interviewer (Table 9.2).
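The linking step described above can be illustrated with a minimal sketch. The data structures and column names below (such as an interviewer identifier shared by both files) are assumptions made for the purpose of illustration rather than the actual Slovenian PIAAC data layout.

```python
# Sketch only: joins respondent-level assessment records to interviewer survey
# responses via a hypothetical shared interviewer identifier.
import pandas as pd

respondents = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "interviewer_id": [10, 10, 11, 12],
    "total_minutes": [95, 120, 80, 105],
    "disengaged": [0, 1, 0, 0],
})
interviewers = pd.DataFrame({
    "interviewer_id": [10, 11, 12],
    "seating_preference": ["beside", "perpendicular", "opposite"],
})

linked = respondents.merge(interviewers, on="interviewer_id", how="left")
# Average proportions by seating preference, as reported in the tables that follow
print(linked.groupby("seating_preference")["disengaged"].mean())
```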

Table 9.1 Assessment taking too long

Interviewer’s typical seating arrangement (reported preference)   Average proportion   Adjusted difference
Beside          0.336   Ref.
Perpendicular   0.317   −0.014 (0.018)
Opposite        0.302   −0.034 (0.019)
Resp. wish      0.356   0.018 (0.029)

Table 9.1 shows the proportion of interviewers’ positive answers to the question ‘Did the respondent complain that the interview was taking too long?’, according to the interviewer’s declared seating arrangement preference. Adjusted differences show coefficients of linear regressions at the respondent’s level, using the respondent’s age, immigrant status, education level, gender and employment status as controls; the outcome is a dummy that yields one when the interviewer reported a length complaint.
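The ‘adjusted differences’ reported in Tables 9.1–9.5 can be read as coefficients from a linear probability model of a binary outcome on seating-arrangement dummies (with ‘beside’ as the reference category) plus respondent-level controls. The sketch below, using synthetic data and hypothetical variable names, is one plausible way to produce such estimates; it is not the authors’ actual estimation code.

```python
# Sketch of an 'adjusted difference' estimate: OLS on a binary outcome (a linear
# probability model) with seating dummies and respondent controls. Synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "seating": rng.choice(["beside", "perpendicular", "opposite", "resp_wish"], size=n),
    "age": rng.integers(16, 66, size=n),
    "female": rng.integers(0, 2, size=n),
    "tertiary": rng.integers(0, 2, size=n),
})
# Hypothetical binary outcome, e.g. 'respondent complained about length'
df["length_complaint"] = (rng.random(n) < 0.33).astype(int)

model = smf.ols(
    "length_complaint ~ C(seating, Treatment(reference='beside'))"
    " + age + female + tertiary",
    data=df,
)
result = model.fit(cov_type="HC1")  # heteroskedasticity-robust standard errors
print(result.params)  # seating coefficients = adjusted differences relative to 'beside'
print(result.bse)     # standard errors, as shown in parentheses in the tables
```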


A telling finding related to the observational module comes in the question about whether the respondent required clarification on some aspect of the assessment. On that question, many more interviewers – almost half of those who sat beside the respondent – reported that clarification was necessary, compared to around one quarter of those who sat nearby (perpendicular) and around one third of those who sat opposite (see Table 9.3). This suggests that when interviewers and respondents sat beside each other with shared observation (joint gaze) of the screen, respondents may have felt encouraged or obliged to ask questions of clarification. Adjusting for respondent characteristics, respondents are 25 percentage points more likely to ask for clarification if the interviewer is sitting beside them rather than perpendicular, and this difference is statistically significant. This suggests that when the interviewer and respondent sat side by side, the frequency of verbal interaction was greater. Computer-generated log file data on response times also supports this finding. We find that where interviewers sat opposite the respondent the assessment was more likely to be among the shorter half of the interviews, whereas assessments where the interviewer sat beside or nearby (perpendicular) were more likely to be in the longer half of the interviews (Table 9.4). Finally, on the question of respondent engagement, we observe another significant finding. Following Goldhammer et al. (2016), a respondent is

Table 9.2 Respondent understood the question

Interviewer’s typical seating arrangement (reported preference)   Average proportion   Adjusted difference
Beside          0.483   Ref.
Perpendicular   0.582   0.104 (0.019)
Opposite        0.488   0.015 (0.019)
Resp. wish      0.385   −0.091 (0.029)

Table 9.2 shows the proportion of interviewers who answered ‘very often’ to the question ‘Did the respondent understand the questions?’, according to the interviewer’s declared seating arrangement preference. ‘Very often’ is the maximum value on a 5-point Likert scale. Adjusted differences show coefficients of linear regressions at the respondent’s level, using the respondent’s age, immigrant status, education level, gender and employment status as controls; the outcome is a dummy that yields one when the interviewer reported ‘very often’ to the question.


Table 9.3 Clarification necessary

Interviewer’s typical seating arrangement (reported preference)   Average proportion   Adjusted difference
Beside          0.489   Ref.
Perpendicular   0.245   −0.249 (0.018)
Opposite        0.325   −0.171 (0.019)
Resp. wish      0.282   −0.214 (0.028)

Table 9.3 shows the proportion of interviewers who gave a positive answer to the question ‘Did the respondent ask for clarification in any of the questions?’, according to the interviewer’s declared seating arrangement preference. Adjusted differences show coefficients of linear regressions at the respondent’s level, using the respondent’s age, immigrant status, education level, gender and employment status as controls; the outcome is a dummy that yields one when the interviewer reported a clarification request.

considered ‘disengaged’ if they do not spend sufficient time on at least 10 per cent of the test items. In this measure, the minimum amount of time needed to solve an item varies with the time spent by the fastest respondents who gave a correct answer. A respondent who spent less time than this is not considered to have spent enough time to acquire all the information available in the item. While the 10 per cent

Table 9.4 Longest half of the interviews

Interviewer’s typical seating arrangement (reported preference)   Average proportion   Adjusted difference
Beside          0.478   Ref.
Perpendicular   0.484   0.017 (0.018)
Opposite        0.409   −0.058 (0.019)
Resp. wish      0.533   0.070 (0.028)

Table 9.4 shows the proportion of interviews falling in the longest half of the interviews, according to the interviewer’s declared seating arrangement preference. Adjusted differences show coefficients of linear regressions at the respondent’s level, using the respondent’s age, immigrant status, education level, gender and employment status as controls; the outcome is a dummy that yields one when the interview is in the longest half.


Table 9.5 Log file data on respondent disengagement

Interviewer’s typical seating arrangement (reported preference)   Average proportion   Adjusted difference
Beside          0.039   0.000
Perpendicular   0.042   0.002 (0.005)
Opposite        0.070   0.031 (0.005)
Resp. wish      0.117   0.076 (0.007)

Table 9.5 shows the proportion of disengaged respondents, according to the interviewer’s declared seating arrangement preference. Adjusted differences show coefficients of linear regressions at the respondent’s level, using the respondent’s age, immigrant status, education level, gender and employment status as controls; the outcome is a dummy that yields one when the respondent is disengaged.

threshold is a somewhat arbitrary characterization of disengagement, it is helpful for our analysis. The log file data on item response times suggested that respondents disengaged and skipped items much more frequently when they sat opposite the interviewer, and much less frequently when they sat beside the interviewer. The log file data also indicated that when interviewers offered the respondent choice in the seating arrangements the rates of disengagement were considerably higher (see Table 9.5).
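As a concrete illustration of this measure, the sketch below derives an item-level time threshold from the fastest respondents who answered correctly (here, as an assumption, the 5th percentile of correct-response times) and flags a respondent as disengaged when their response time falls below that threshold on at least 10 per cent of items. The data layout and the percentile choice are illustrative, not the exact procedure of Goldhammer et al. (2016).

```python
# Illustrative sketch: flag 'disengaged' respondents from item response times.
# The 5th-percentile threshold and column names are assumptions for this example.
import numpy as np
import pandas as pd

def flag_disengaged(responses, pct=5, share=0.10):
    """responses: columns respondent_id, item_id, rt_seconds, correct (0/1)."""
    thresholds = (
        responses[responses["correct"] == 1]
        .groupby("item_id")["rt_seconds"]
        .quantile(pct / 100.0)
        .rename("threshold")
    )
    merged = responses.join(thresholds, on="item_id")
    merged["rapid"] = merged["rt_seconds"] < merged["threshold"]
    # Flag respondents whose share of below-threshold responses reaches the cut-off
    return merged.groupby("respondent_id")["rapid"].mean() >= share

rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "respondent_id": np.repeat(np.arange(20), 15),
    "item_id": np.tile(np.arange(15), 20),
    "rt_seconds": rng.gamma(shape=2.0, scale=20.0, size=300),
    "correct": rng.integers(0, 2, size=300),
})
print(flag_disengaged(demo).mean())  # proportion of respondents flagged as disengaged
```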

Discussion

In recent years there has been a growth in the use of assessment process data for the analysis of test performance and validity (Ercikan & Pellegrino, 2017; Zumbo & Hubley, 2017). In the era of machine learning and data analytics this has significant implications, as data such as computer-generated log files on response times, eye movements and facial expressions can operate as extensions of the test and be integrated into real-time data on respondent performance (Oranje et al., 2017). By looking beyond individual performance to the relational and material conditions of the testing situation, this chapter highlights the risks to test validity if such data is used without a concern for the ecological characteristics of the testing situation. While seating arrangements might appear to be a relatively mundane aspect of test


administration, our observation that they might operate as a source of variation in performance illustrates how hidden ecological aspects of test administration might unwittingly introduce confounders into process data that are bound up with wider issues such as cultural context and gender, with the additional potential for international variation. This has the potential not only to influence the interpretation of test performance, but also to inform techniques for computer-adaptive design as process data is integrated into real-time assessment.

In the case of PIAAC, log file data has the potential to reveal variation in test performance associated with different interviewers (i.e. as an interviewer effect). However, identification of the source of that variation was unlikely without additional information on test administration and the dynamics of the testing situation. By working across sources and scales of data we have been able to identify a source of variation and observe it in large-scale data, with implications for future test administration. Relatively inexpensive video-ethnographic observations were able to identify an issue in ILSA administration that has the potential to improve future test quality and to inform the analysis of log file data. The incorporation of the testing situation into the analysis of test performance and validity illustrates the potential to identify contextual sources of variation that are not associated with routine features of assessment practice.

How should test administrators respond? One approach would be to remove the interviewer from the assessment process – to leave it to the computer. That might appear to be the easiest way to ‘standardize’ test administration as it removes the scope for ‘off script’ interviewer improvisations. However, observations of the Slovenian PIAAC assessment suggest that the interviewers add value to the testing situation through procedural advice and encouragement. Their absence (as in the case of ‘distant’ interviewers) may have a disproportionate impact on certain groups, such as people who are less skilled in the use of a computer, less familiar with the testing format and content, or who have low levels of ability, for whom encouragement to stay engaged and complete the test is likely to be most useful.

If the interviewer is present, it is by no means clear which seating arrangements would be ideal. Indeed, it seems likely that identification of best practice would have to consider issues of gender and cultural diversity, which may vary between participating countries, and how these might impact on data standardization and comparison (while maintaining a commitment to the overall goal of standardization). Whatever the solution, the chapter suggests that administrative care over seating arrangements could help to optimize respondent performance, and reduce the scope for interviewer effects as an unintended source of variation


in test performance. Further work would be required to investigate the scale and significance of such variation within and between countries.

Conclusion

This chapter has shown how investigations of the testing situation can inform our understanding of ILSA quality and validity. Its consideration in validation practice is illustrative of the rise in the use of process data in assessment, and its implications – perhaps even obligations – for inter-disciplinary enquiry (Shaffer, 2017). What is new in assessment is the potential for data such as video-ethnographic observations and log files to be integrated to make meaning about assessment performance (Maddox, 2017). These new techniques and collaborations look set to transform the field as it embraces ‘next generation’ computer-based testing (Oranje et al., 2017), machine learning and data analytics (Ercikan & Pellegrino, 2017; Zumbo & Hubley, 2017).

As Williamson argued (this volume), the world of assessment is investing heavily in its imagined futures. In that context, as new techniques and data objects emerge, innovation necessarily involves inter-disciplinary collaboration to make the most of these new opportunities, and to resolve or bypass old, intractable problems (Gee, 2017). This includes a radical re-positioning of context and the ecology of the testing situation, from a polluting source of noise to a viable source of data and meaning (Maddox & Zumbo, 2017).

With the rise of new data techniques in assessment there is a concern, felt by some (discussed in the introduction of this book), about the rise of machine-based techniques that might alienate us from our distinctly human qualities. However, as this chapter has illustrated, new data techniques also offer the potential for human traits and behaviours that are displayed in the testing situation to be integrated into new forms of validity practice. These include human qualities of spatial and temporal awareness, displays of stance, emotional affect, empathy, and the shared orientation to an activity that is typical of Goffman’s ‘felicity conditions’. Somewhat ironically, the ‘assessment machine’ can, and indeed must, integrate these subtleties of context and interaction if it is to make valid and high-quality inferences about large-scale assessment performance.

To conclude, then, we can affirm that the ‘discovery of the social’ in computer-based testing is not a threat to the integrity and validity of assessment data. Rather, it recognizes the combined socio-technological character (Latour, 2005)


of assessment practice. That the apparently mundane question of where to sit in an assessment should affect ‘cognitive’ data on literacy, numeracy and problem-solving affirms the importance of process data in the analysis of assessment performance.

References

Ackermann-Piek, D. & Massing, N. (2014), ‘Interviewer behaviour and interviewer characteristics in PIAAC Germany’, Methods, Data, Analyses, 8 (2): 199–222.
Bayliss, A., Griffiths, D., & Tipper, S. T. (2009), ‘Predictive gaze cues affect face evaluations: The effect of facial emotion’, European Journal of Cognitive Psychology, 21 (9): 1072–1084.
Du Bois, J. & Karkkainen, E. (2012), ‘Taking a stance on emotion: Affect, sequence and intersubjectivity in dialogic interaction’, Text and Talk, 32 (4): 433–451.
Ercikan, K. & Pellegrino, J. (eds) (2017), Validation of score meaning for the next generation of assessments: The use of response processes, London: Routledge.
Gee, J. P. (2017), Preface to D. W. Shaffer, Quantitative ethnography, Madison, WI: Cathcart Press.
Goffman, E. (1964), ‘The neglected situation’, American Anthropologist, 66: 133–136.
Goffman, E. (1981), Forms of talk, Philadelphia, PA: University of Pennsylvania Press.
Goffman, E. (1983), ‘Felicity’s condition’, American Journal of Sociology, 89 (5): 1–53.
Goldhammer, F., Martens, T., Christoph, G., & Lüdtke, O. (2016), Test-taking engagement in PIAAC, OECD Education Working Papers, Paris: OECD Publishing.
Goodwin, C. (2007a), ‘Environmentally coupled gestures’, in S. Duncan, J. Cassell, & E. Levy (eds), Gesture and the dynamic dimension of language, 195–212, Amsterdam: John Benjamins.
Goodwin, C. (2007b), ‘Participation, stance and affect in the organisation of activities’, Discourse and Society, 18 (1): 53–73.
Goodwin, M., Cekaite, A., & Goodwin, C. (2012), ‘Emotion as stance’, in M.-L. Sorjonen & A. Peräkylä (eds), Emotion in interaction, 16–41, Oxford: Oxford University Press.
Kendon, A. (1997), ‘Gesture’, Annual Review of Anthropology, 26: 109–128.
Latour, B. (2005), Reassembling the social: An introduction to actor-network theory, Oxford: Oxford University Press.
Maddox, B. (2017), ‘Talk and gesture as process data’, Measurement: Interdisciplinary Research and Perspectives, 15 (3–4): 113–127.
Maddox, B. & Zumbo, B. D. (2017), ‘Observing testing situations’, in B. D. Zumbo & A. M. Hubley (eds), Understanding and investigating response processes in validation research, 179–192, New York: Springer.
McNeill, D. (1985), ‘So you think that gestures are non-verbal?’, Psychological Review, 92 (3): 350–371.


Oranje, A., Gorin, J., Jia, Y., & Kerr, D. (2017), ‘Collecting, analyzing, and interpreting response time, eye-tracking, and log data’, in K. Ercikan & J. Pellegrino (eds), Validation of score meaning for the next generation of assessments: The use of response processes, London: Routledge.
Shaffer, D. W. (2017), Quantitative ethnography, Madison, WI: Cathcart Press.
West, B. T. & Blom, A. G. (2016), ‘Explaining interviewer effects: A research synthesis’, Journal of Survey Statistics and Methodology, 5 (2): 175–211.
Zumbo, B. D. & Hubley, A. (2017), Understanding and investigating response processes in validation research, New York: Springer.

Part Three

Reception and public opinion

10

Managing public reception of assessment results

Mary Hamilton

Introduction

Earlier chapters in this volume deal with the construction of tests and how they are carried out by testing agencies and their partners. The particular focus of this chapter is on how findings from ILSAs reach into national contexts and establish the public presence which enables them to intervene effectively in policy and practice, as they are designed to do. In particular, it examines how testing organizations and the media manage public and policy discourse on assessment results. It takes the example of the assessment of adult skills by the OECD, which has already gone through several iterations and is currently embodied in the PIAAC (the Programme for the International Assessment of Adult Competencies).

Though they often have less visibility and funding leverage in national debates, adult skills are arguably of crucial significance to the development of visions of the global knowledge economy. They cross education, employment and citizenship domains, each of which has its own goals and discourses in relation to adult skills policy. The assessment of adults entails methodologies and alignments with policy actors that are different from those involved in assessing school children: the tests are carried out as household surveys, there is a greater reliance on the background questionnaire filled in by participants to establish sub-group membership and life experience, and they have a more complex and immediate relationship to ‘life-skills’. This is also an ambiguous and less codified area of research and practice than schooling, and is thus an interesting case to explore for anyone interested in how ILSAs construct and shape the domains they measure. Its very plasticity (Carvalho, 2012) presents new potential, but also new risks.


The PIAAC sits on an international timeline that stretches back to the IALS (the International Adult Literacy Survey), and arguably even further, to the activities of UNESCO in documenting the literacy rates of adults since the early 1950s. It also sits alongside a growing number of other ILSAs (Addey et al., 2017) of the achievements of school children, early years education, vocational skills training, adult lifelong learning and higher education.

The claims made for the benefits of surveys like PIAAC are far-reaching. For example, in the foreword to Hanushek and Woessmann (2015), improving the basic skills of adults is claimed to lead to ‘remarkable overall economic gains while providing for broad participation in the benefits of development and facilitating poverty reduction, social and civic participation, health improvement, and gender equity’. In the OECD’s overview of PIAAC for the European Union they explain how the empirical knowledge gathered through the surveys enables countries ‘to base policies for adult skills on large-scale facts and figures. Just as PISA2 had deep repercussions for schools, we can expect the ground-breaking evidence provided by the Survey to have far reaching implications, for the way skills are acquired, maintained, stepped-up and managed throughout the entire lifecycle’ (OECD, 2013). Thus the PIAAC is a significant move in developing global skills policy in that it is designed to address lifelong learning, thereby supplying a missing link between school-based studies and the world of work and adulthood (Grek, 2010). In doing so, it draws together a host of new actors and discourses to the ILSAs from the spheres of education, vocational skills, human capital and citizenship. PIAAC is of interest not only to educational and development specialists but also to business and financial communities.

The timescale of PIAAC is still too short to establish direct impact on policy and practice, though a growing research literature attempts to assess the validity of impact claims based on associative links between test results and background variables (see Hanushek et al., 2015; Pena, 2016; and the critique by Komatsu & Rappleye, 2017). The present chapter does not attempt to add to this literature, but rather to explore some of the factors that intervene between the testers’ intentions and policy outcomes, especially the role of the media in national contexts. The case study material, based on documentary data collected in a range of participating countries, highlights the dilemmas and difficulties that arise as the media coverage of assessment data frames and interprets ILSA results. It details the strategies used by the testing agencies to publicize and spread the PIAAC assessment findings through heterogeneous publics. It shows the trajectories taken by the survey findings through print


and digital media, the degree to which the methodologies and ambitions of the PIAAC surveys are critically appraised, as well as the conflicting interests and priorities with which they have to deal. Findings are compared from Japan, the UK and France for the first round of PIAAC, and from Singapore and Greece for the second round. The chapter explores some aspects of media and public communication, but there are others. For example, the semiotic form of reports using narrative text, visuals and numbers would bear much more detailed examination (e.g. see Hamilton, forthcoming; Williamson, 2016), as would the uses of social media and the effects on public opinion (see Pizmony-Levy et al., this volume).

Conceptual framing

I view ILSAs through the lens of sociomaterial theory, specifically actor-network theory (ANT). ANT emerged as a strand of science and technology studies (see Latour, 2005; Fenwick et al., 2015) and focuses on detailed ethnographies of the laboratory work involved in developing scientific innovations. Using ANT we can consider ILSAs in a parallel way, as social scientific projects in the making, tracing their innovations and struggles to establish a place for themselves as global policy actors (Gorur, 2011; Carvalho, 2012; Hamilton, 2012).

This approach has a number of advantages for analysing ILSAs. Firstly, it allows me to take on board the time dimension – the history of development of these assessments with its twists and turns, and a sense of the provisionality of the developments and the test instruments as unfinished stories with uncertain outcomes – rather than searching for, or asserting, an orderliness to them that would be hard to establish. Secondly, this approach also fits the networked nature of the assessments (Morgan, 2007; Sellar & Lingard, 2013), especially in the case of the surveys co-ordinated by the OECD as a distant ‘centre of calculation’, which are also conceptualized and carried out by a dispersed set of national and international actors who interact at all stages of the testing process to influence its outcome (Ozga et al., 2011). This cast of actors includes, at different times and places, not only testing experts, researchers and policy-makers, but advocacy groups, corporate bodies, practitioners and end users, the adult population itself. In the case of PIAAC, those advocating for adult skills internationally, at European level and in national governments and agencies, all play a part in the public discourses and policy actions that develop around it.


ANT points out that agency can also be delegated to non-human actors such as statistical, data science and psychometric methods, and software algorithms (see Williamson, 2016; also O’Keeffe, 2015). Increasingly the media are intertwined in these networks, using – and used by – this rich mix of actors and the ILSA findings themselves (Rawolle & Lingard, 2014). Moreover, the media themselves are creating new publics through expanding interactive possibilities and accessibilities.

The OECD as an agency exerts ‘soft power’ (Henry et al., 2001; Lawn and Grek, 2012) in that it has no direct legislative or policy powers. This means that the entanglement and active enrolment of the range of national actors is crucial to the success of its ILSA project. In the process of diffusion of policy agendas, the first stage is therefore to actively seek and bring about such entanglements. Only then can the survey findings surface in the public media arena. This process is referred to as ‘reception’ in the policy borrowing literature (Steiner-Khamsi & Waldow, 2012), and is wider than the use of the term within media studies. However, the term ‘reception’ seems to imply a passive and one-way process, which is emphatically not the case with ILSAs. The alternative terms offered by ANT are those of ‘enrolment’ and ‘mobilization’, which centre on the complexities of agency in the process of policy translation. This chapter focuses only on the role of the media, but the process of policy reception is wider than this, as discussed in Hamilton (2017).

In focusing on networks and entanglements, ANT encourages us to look behind the ‘front stage’ (Goffman, 1988) of political and media spectacle to the significant invisible work of creating and maintaining the ILSAs that goes on ‘backstage’ (Bowker and Star, 2000; Denis and Pontille, 2015). ANT also gives us a handle on the issue of how ILSAs come to frame, and to be framed by, public discourse about adult skills as their underlying assumptions are naturalized as common sense (Law, 2011).

How to study the reception of PIAAC

Questions to ask

The framework of sociomaterial theory suggests the value of enquiring into the work that enables the successful reception of ILSAs in national contexts: building public confidence in them, creating interested publics and maintaining the credibility of a seamless spectacle. This involves understanding the processes and identifying the actors who interact in relation to ILSAs, and the struggles to


get particular voices and interpretations heard as survey findings move into the public domain. The soft power exerted by and through testing agencies, international and national policy bodies involves a number of layers of persuasive and disciplinary work to align other actors in the dissemination and legitimation of ILSAs. Journalists themselves are subject to this, in part through privileged and embargoed access to the data before the release date, granted on condition that they do not violate the terms of this access.

The questions pursued in this study follow from this analysis. I ask, ‘How are ILSA networks assembled in diverse national contexts?’ Specifically:

• To what extent do the unique conditions and actants within a given national context affect the media coverage, the public response and the subsequent policy actions of governments in relation to ILSAs?
• What sources of information about ILSAs appear to be used by journalists and how far do they rely on the country notes supplied by the OECD?
• Is the OECD’s underlying model of adult skills accepted or challenged in the media and in public discourse?
• What is the relationship of public opinion and the media to the functioning of policy in the varied contexts in which ILSAs intervene?
• What other, less visible, avenues of influence on policy actors could productively be researched?

Collecting data about the PIAAC

How, then, to go about documenting how ILSA networks are assembled? The methods needed for such an enquiry break new ground – not just in tracing the sociomaterial relations, but also in working with media reports, where the range, structure, interactive nature and accessibility to different audiences of the media themselves are changing so rapidly. Much of this data is online. For example, in the UK, the Daily Mail is a best-selling print newspaper that has an associated online site with both national and international versions, a facility for interactive comments, and links to other sites, blogs and related material (such as a sample of PIAAC survey items). News items circulate across many interdependent media platforms. The importance of print newspapers varies in different countries. The broadcast media are extensively used and Twitter is also significant, especially for media, testing and policy professionals themselves to reinforce the findings and comment on them. In this study we looked only at print newspapers and


their related online sites. Online and interactive social media, documentary programmes, specialist advocacy and professional websites and publications make explicit links to a wide range of relevant resources, making it possible for journalists, policy-makers and the public to delve more deeply into the findings and the background methodology.

It is important to take on the methodological challenges posed by the media, but also to be circumspect about what can currently be achieved, so some cautions are needed about the data produced by these methods and the limitations of the work done to date. Comparison across countries proved to be perplexing, due to differences in media industries and translation problems with key concepts. Keyword searches are more difficult than might be thought at first glance, especially when searching across languages. PIAAC is not a term recognized by most journalists, and related terms such as OECD, survey, literacy and numeracy, and adult skills produce different but overlapping results. Our study thus highlights the complexities faced by the international surveys themselves in working to understand and influence policy across diverse contexts and languages.

I report on two stages of data collection relating to the two rounds of PIAAC findings so far released (in 2013 and 2016). Taking the PIAAC as an example case study illustrates both the approach to research and analysis and the insights that can result. It is a clearly time-bounded study that involves an expanding set of national contexts, and a common process orchestrated by one international body, namely the OECD. The first round of the PIAAC surveyed twenty-four countries, clustered around the primary OECD countries. The second round added nine countries with much more varied circumstances. In each case the OECD produced a summary of the findings for individual countries and an overall report. The 2016 overall report integrated the round two countries with the analysis from the first round, so it should have been of interest to all participating countries.

The data on which the study is based come from a range of sources: primary research on internet documents and media reports of the survey findings, including many sources from the OECD itself, especially its Country Notes; and advice, information, press releases and statements from advocacy groups, national governments and other interested agencies. I also draw on an email exchange with Spencer Wilson, head of the OECD media unit, supplemented by extracts from an interview with him reported in Lingard (2016). Where relevant I refer to the small but developing body of other research papers relating to the PIAAC.


To collect the media reports for both rounds we carried out newspaper keyword searches using the Nexis and Factiva databases, and supplemented these with internet searches, which turn up not only the word-based narratives but also the accompanying images, which are stripped out of the reports by the database software. I enlisted the help of other researchers with first-hand knowledge of the countries in question. This is essential because details of the socio-economic, political and demographic background of a country, the structure of its media industry, the policy background in relation to LLL (lifelong learning), and the language context are needed to interpret the media coverage. We constrained the data collection to particular news sources and synchronized our research by organizing it around a common template of guidance and questions [see Appendix 1]. This template was refined for the second round, based on the findings from our first study, and was used to guide the thematic analysis of the media reports.
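As a purely illustrative sketch of the kind of housekeeping this involved, the snippet below combines article lists exported from two databases and removes duplicates by a normalized headline and publication date. The field names and the matching rule are assumptions for the example, not the procedure actually used in the study.

```python
# Illustrative only: combine exports from two news databases and de-duplicate
# by normalized headline plus publication date (an assumed matching rule).
import pandas as pd

def normalise(title: str) -> str:
    return " ".join(title.lower().split())

factiva = pd.DataFrame({
    "headline": ["Adult skills survey shows gap", "OECD: adults lag in numeracy"],
    "date": ["2016-06-28", "2016-06-28"],
    "database": ["Factiva", "Factiva"],
})
nexis = pd.DataFrame({
    "headline": ["Adult skills survey shows gap ", "PIAAC results released"],
    "date": ["2016-06-28", "2016-06-29"],
    "database": ["Nexis", "Nexis"],
})

combined = pd.concat([factiva, nexis], ignore_index=True)
combined["match_key"] = combined["headline"].map(normalise) + "|" + combined["date"]
deduplicated = combined.drop_duplicates(subset="match_key").drop(columns="match_key")
print(deduplicated)
```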

Case studies for detailed analysis

Comparative assessment is fundamentally concerned with establishing recognizable norms for describing differences and similarities, and it seems appropriate for research to interrogate these norms by looking at a range of divergent cases. This chapter focuses on five contrasting cases that enable me to analyse similarities and differences among national contexts in order to interpret the media coverage of PIAAC and the possible policy actions associated with it. The three taken from the first round study were chosen because they were placed differently in the PIAAC league tables for literacy and numeracy: Japan (top in both proficiencies), UK (around the middle) and France (close to the bottom). The two chosen from the second round study were placed similarly in the overall league table of nations but have many contrasting features in terms of the context and media coverage of the findings (Singapore and Greece).

Round 1: Japan, UK and France

Yasukawa et al. (2016) examined and compared how the national media of Japan, England and France reported on the differing PIAAC results of their countries, and the extent to which these reports mirrored key messages from the OECD’s Country Notes. We identified all national newspaper coverage in each of the three countries during the week following the release of the findings and traced the ways that the OECD PIAAC agendas establish a framework for the articulation


of national policies. Within each country, media coverage in this initial period of PIAAC appeared to be limited, although the head of the OECD’s media unit was pleased with the international impact (measured through Factiva), since the survey generated more interest than most reports that the OECD produces (with the exception of PISA and the twice-yearly Economic Outlook):

According to Factiva, the Skills Outlook generated 384 articles around the launch … … That’s a good result also when you consider that it only included 24 countries (out of 34 OECD members) … When we have the results of a second round of the Survey, then I think we’d see a big impact. [Spencer Wilson, email 18/4/2016]

Some key lessons emerged from this exercise. Firstly – unlike for PISA – the media coverage of the PIAAC findings was short-lived; the details quickly decayed into headlines that merged with other surveys. The core news coverage in all papers took place on 8/9 October. Thereafter, any follow-up in the UK, for example, was in opinion pieces and readers’ letters, often just as a passing reference in relation to other surveys and wider issues. Other reports on social inequality were released in October by the Save the Children fund (early childhood) and Legatum (the prosperity index). Comment on these was used as an opportunity to refer again to the headline PIAAC findings. Finally, after the release of the PISA findings on 3 December 2013 the coverage tended to deal with both surveys, with PISA dominating. However, the PIAAC was still new to journalists and the public, and the coverage may be different for future waves of the survey as it becomes more familiar.

Secondly, the OECD Country Notes played a central role in what was reported in the media, summarizing complex data that are otherwise not easy for journalists to quickly access and absorb. They inevitably direct attention to particular facts and issues in a format that is easy to translate into press reports and headline news. We found that press reports rarely went beyond this material, although they were selective in the aspects of it they highlighted. Little attention was given, for example, to the novel digital dimension of the survey, in part perhaps because understanding of this was unstable and therefore difficult for media and policy-makers to make use of. In all countries there was a pervasive interest in extrapolating trends across time. Journalists asked: ‘Have things changed?’ ‘Are literacy skills getting better or worse?’ even though these questions cannot be answered by a single-point survey like PIAAC.


The analysis also showed how, in each national case, particular aspects of the PIAAC results were foregrounded, depending not only on the performance measures themselves, but also on how accounts of the results were assembled to extend national cultural narratives and debates around education and social policy. For example, the Lifelong Learning focus of PIAAC was reported in France, but was not taken up in either the UK or Japanese media where discussion reverted to young people and schooling. Neither did Japan take up the gender-related aspects of the findings highlighted by the OECD.

A further observation was that there was a lack of critical discussion of the household survey methodology and the content of the surveys in media reports. Although much important detail was available in OECD technical documents, there was little evidence of journalists accessing this information. Thus key issues in the design of PIAAC seem not to have been taken up in any of the press coverage.

The media are often accused of biased and superficial coverage of key policy issues. However, our investigation concluded that journalists have limited time in which to interpret and present complex statistical findings so that easily digested information sources and press releases are at a premium. The news media demand an immediate response from journalists but the release of the findings is closely controlled by the OECD and approved journalists usually have only one or two days’ embargoed access to them before publication (see correspondence with Spencer Wilson, OECD).1 The currently increasing interest in, and use of, big data sets in policy discussions opens up the need for new kinds of ‘data journalists’, who are trained to scrutinize, analyse or reanalyse, and summarize the findings from such datasets (Knight, 2015; Javrh, 2016; Rogers, 2016).

Round 2: Nine additional countries

It was harder to be sure that we had adequately identified the coverage from the second round countries because of the range of languages involved. The news databases Factiva and Nexis produced overlapping datasets, but each was limited in the items it picked up. In-country informants identified more items for each individual country than the general database search found.

1. ‘Our policy is a level playing field for media – no exclusives – and embargoes are strictly monitored and as a whole respected by media. The sanction for breaking an embargo is that the media organization loses the right to receive embargoed material for a certain period, usually between 3 and 6 months. For an organization like Reuters, for example, that is a credible sanction, given the number of market-moving economics reports we release’. The launch date [for PISA 2012] is announced to the media more than a month in advance of launch. Details of specific events are then communicated 7 days before. Embargoed briefings are held at OECD Media Centres and London the day before. Other events, organized in concert with government ministries, are held on the day of launch itself. Media receive the report under embargo 24 hours in advance – those attending embargoed briefings on the Monday would receive it on Friday. OECD experts are available for embargoed interviews the day before launch. [Spencer Wilson, email 18/4/2016]


Despite the expectations of the OECD and the care with which they prepared the release of the results, our scan of the second round countries showed that media coverage across the nine new countries was very uneven. Chile, Greece, Israel, NZ and Singapore appeared to show a good deal of interest, while Indonesia and Turkey seemed to pay little attention to this international ‘news’, and Lithuania and Slovenia were in between. We also looked again at the Round 1 case study countries, the UK, France and Japan, and found that they took very little notice of the Round 2 findings, since there was no new information about their own context. PIAAC, as explained earlier, crosses several fields of policy interest and therefore potentially targets journalists who relate to these different specialisms – not all of whom may have recognized the news value of these findings from their own perspective. The coincidence of highly newsworthy political events, including a bomb explosion at an airport in Turkey, the UK referendum on Europe and national holidays such as Eid, eclipsed the focus on adult skills – a topic that in any case rarely hits the headlines. In several of the Round 2 countries (e.g. Singapore, Turkey and Indonesia) the media are strongly controlled by the national government and it is not easy to judge the effects of this on the reporting of PIAAC.

The contrasting cases of Singapore and Greece

Singapore and Greece were both included in the nine countries that participated in the second round of PIAAC. Both countries have websites with downloadable brochures and videos, which suggests that the exercise was taken reasonably seriously by the co-ordinating agencies in each. Both generated a good deal of media coverage. Neither country was top ranked in the PIAAC league table. After this, the similarities end: the context and current situation of the two countries are very different, as is the way the media treated the findings from the PIAAC and the ways in which the OECD and others interpreted their performance. I chose Singapore because it was prominently featured as an example of good practice by Schleicher (2016) in his publicity about the survey. I chose Greece as



a contrasting example of a country that scored low in the rankings, and is a good candidate for ‘PIAAC-shock’. A summary of the OECD’s note for each country can be seen in Table 10.1.

Singapore is a rich city state of 5.5 million multilingual people. With no natural resources of its own, the economy depends on a large service sector and trading links with other countries in the Southeast Asia region. The state government keeps tight control over many aspects of the economy, including migration, which is very significant: the percentage of non-residents in the population is currently 30 per cent and growing, and there

Table 10.1 Key issues in OECD country notes for case study countries Singapore and Greece

Compared with other countries participating in PIAAC:

Singapore
• Adults showed below-average proficiency in literacy and numeracy, but above-average proficiency in problem-solving in technology-rich environments.
• A wide contrast exists between older and younger adults, with the youngest age group scoring the highest of all PIAAC#2 countries.
• The disadvantage of older adults is partly explained by the higher prevalence of non-native English speakers and by their relatively low levels of educational attainment.
• The dispersion of proficiency scores is wider than in most other participating countries/economies.
• There is a strong link of skills proficiency and frequent use of skills at work with wages and non-economic outcomes.(a)

Greece
• There are fewer high scorers and more low scorers in literacy and numeracy.
• In contrast to what is observed in other countries, 25–34 year-olds perform as well in literacy as 55–65 year-olds.
• Is one of the few countries where women outperform men in literacy.
• Tertiary-educated adults in Greece have relatively low proficiency in literacy, numeracy and problem-solving in technology-rich environments.
• The relationship between skills proficiency and non-economic outcomes is considerably weaker than in other participating countries/economies.
• Workers in Greece use their skills at work to the same degree as in other countries, but there is a weak link of skills proficiency with wages.

(a) Non-economic outcomes are identified as trust in others, political efficacy, participation in volunteer activities and self-reported health.

International Large-Scale Assessments in Education

is a constant two-way flow of both low- and highly-skilled people. The state owns the media outlets and has invested heavily in recent years in education, including continuing vocational education. It has in place serious policies for developing education, human resources and technology (see Luke et al., 2005; Tan, 2017). The economy has faltered in recent years but is stabilizing again, with the unemployment rate currently at 2.3 per cent. It is possible to see Singapore, therefore, as mirroring the aspirations of the OECD for PIAAC and for developing lifelong learning. In the PIAAC survey, Singapore scored above average in problem-solving, but below average in reading and numeracy. The disparity of achievement between older and younger adults was explained by the lack of educational opportunities experienced in the past by older adults, and by the fact that around 85 per cent of older adults do not have English (the language of the test) as their mother tongue. Greece is a member of the European Union and has been suffering from extreme economic difficulties in recent years. The official language is Greek which is spoken by almost everyone in the population, estimated at around 11.2 million in 2007 according to the European Journalism Centre (EJC). It currently has an unemployment rate of 25 per cent which rises to 45 per cent among the 16–25 year age group. Political debate is fierce in the country with a controversial change of government in recent years towards the left and the live possibility of an exit from the EU due to inability to settle national debts. These debates are evident in the coverage of a competitive, privately owned media with a variety of political affiliations. The weakness of the economy has meant that Greece has been subject to reforms imposed by the EU, including curtailment of public funding and state control, which are unresolved and have been highly contested. Within these constraints, accredited continuing and vocational education has developed in line with EU policy, largely delivered by private training organizations. This is a weakened and demoralized country that is unlikely to take further bad news well. In the PIAAC survey, Greece scored among the bottom countries, next to its close neighbour Turkey, and to Chile and Indonesia. It was significantly below the OECD average in all three dimensions. Despite the fact that access to education and participation rates have greatly improved for younger people, their performance was not much better than the older age groups. Two unusual results were that women scored more highly than men, and that there was no relationship between unemployment and skill achievement. So what did the OECD and the media make of these results?


Firstly, the OECD country notes in both cases make a point about the age differences in achievements, making important assumptions. For Singapore it is assumed that the higher level of achievement of the youngest groups is an endorsement of the educational reforms of recent years, although the language issue is mentioned as a possible impediment for the older adults. This explanation is foregrounded in Andreas Schleicher's commentaries (a newspaper article, blog, webinar and video addressed to an international audience) and, although Singapore is not top of the league table of nations, it is held up as a positive role model ('glorified' – see Steiner-Khamsi, 2003) for other countries aspiring to improve their position. The aspects singled out in the OECD country notes are also picked up by the media, which in most respects keep close to the OECD's text.

For Singapore, ten news articles and seven online sources were examined. In general, the tone is factual (neither glorifying nor scandalizing) and very little reference is made to the political or policy context, except to say that the government's existing substantial educational reform strategies, including the lifelong learning SkillsFuture programme (Tan, 2017), seem to be paying off and are appropriately addressing the needs of the adult population. One business news article carried a very negative headline about the skills of older people, but then carried on to simply repeat the OECD country note. The media referred to a wide range of other countries from Europe, Asia and North America. Two comment pieces picked out some issues highlighted by the OECD but ignored by the other media, referring to employment-related aspects of skills levels (e.g. '5 things you didn't know about the PIAAC'), but these did not critique either the survey methodology or the national policy.

The coverage we examined from Greece consisted of twelve newspaper articles and five online sources. There was overlap in the information and issues presented in all of these, and in some cases the same article was published verbatim in different news sources. There were several items in business-related news sources, as well as general and educational publications. As in Singapore, the media reports in Greece stayed close to the OECD's own summaries in reporting factual aspects of the survey findings, but their explanations of what were clearly regarded as dismal findings departed from the OECD's measured tone, strongly reflecting ongoing political controversies and tensions and the desperate circumstances of contemporary Greek society, especially the plight of younger people. While some articles were highly critical of government policy and the education system in particular, they also focussed on the massive exodus of Greek citizens to other countries: a total of 400,000 people have left since 2008, mostly skilled professional adults and many younger people.


Given that the overall population of Greece is 11.2 million, this exodus is seen as significant enough to skew the results of the survey. This discussion was reinforced by the simultaneous release of a Bank of Greece report on the same issue, and a later statement (in mid-July) from the Hellenic Federation of Enterprises. One article also made the link with the Brexit referendum result, which had just been announced – relevant because Greece may itself exit the EU. The enmeshing of political issues with the coverage was very notable and a great contrast to the coverage in Singapore.

The very different tone of the coverage in the two countries is especially interesting given that both were, overall, below the OECD average in the league table of thirty-three nations. Singapore acted more as you would expect from a mid-ranging achiever, and left the glorification of the country's record to the OECD. The Greek media did not make any positive points, for example that numeracy was the strongest of the three results, or that – unusually – women were higher performing than men. There seemed to be a halo of negativity around the results and the political discourse more generally, that did not allow for rays of sunshine or silver linings! The headline (as in Bolívar, 2011) could have been 'more bad news'.

Discussion and conclusions

This chapter has explored the trajectories of findings from the OECD's adult skills survey, PIAAC, into the public discourse of national contexts, detailing five contrasting cases from a wider database of media reports and documentary evidence. This exploration begins to reveal the work done by interconnected actors to move the findings from the desks and computer screens of the survey teams to the policy domains they are designed to influence. Elsewhere (see Hamilton, 2017) I have discussed evidence of how national actors, some of whom are involved in initial decisions to participate in the survey, are fully implicated in how the findings are presented and interpreted, thus undermining the idea that the international agencies impose these measures on a passive audience whose only agency is to react to or ignore the information they bear. Both government agencies and advocacy groups prepare ahead for the findings and position themselves in relation to them to benefit their own policy agendas. On the one hand, this means the survey findings constantly escape the intentions of the test-producers. The findings are selectively presented, interpreted in the light of existing national debates and preoccupations and may also be misunderstood and ignored. On the other hand, the entanglement of


national actors ensures that decontextualized findings about the achievements of adult populations are re-embedded in local contexts and refashioned for local purposes.

The media have an important but varied role to play in fashioning the trajectory of the findings. As is especially evident from the data collected for the second round survey, this role is constrained in different ways according to the structure of the media industry in each country and who controls the content of the newspapers. Privately owned papers may be independent from the state but are dependent on the economics of selling a newsworthy story, while state-supported and monitored press may express interpretations of the findings that align with existing policy. This chapter has documented the results of some of these differences of approach as well as identifying more constant aspects of media coverage: for example, the necessity of simplifying ILSA data into short items and headlines, and the short afterlife of the data as the media move on to other stories.

Compared with PISA, both Round 1 and Round 2 findings generated low-key coverage, and even this could be easily derailed by coincidental news events. Despite the OECD's careful efforts not to give overall rankings to individual countries (see OECD, 2016), media headlines reduced the findings to a single judgement which reverberated over time. However, there was some evidence in Round 2 that the three dimensions of the PIAAC test were more disaggregated in media reports, with more interest in computer use and digital problem-solving than had been evident in our Round 1 cases.

In almost all media reports, and in government press releases too, the OECD country notes, which present selected facts and interpretations of the data, were major determinants of content. We found no examples of journalists or researchers hunting for information beyond these. However, reports differed in the tone with which the facts were discussed (summarized by one Slovenian journalist, Ranka Ivelja, in her Dnevnik opinion column as the 'cup half-full or half-empty' syndrome; Ivelja, 2016). Some aspects of the OECD's guidance were highlighted while others were ignored.

Particularly interesting in Round 2 was the low interest in the relationship of skills to employment conditions and to what the OECD calls 'non-economic' outcomes, which were emphasized in all country notes but rarely mentioned in media reports. Employers were not challenged to respond to skills inequalities. This suggests that the OECD's own primary model of individual skills imparted through initial schooling is the dominant one, and other arguments about the implications


for rights to training opportunities, for general well-being and citizenship participation are not attended to.

A second noteworthy departure from the OECD's notes is the importance given to migration in the media reports. Immigrant populations are not mentioned by the OECD but, in fact, migration patterns and attitudes are significant in both countries, and in other countries in Round 2. They play out very differently in each and this is reflected in the media coverage. Refugee populations prompt new anxieties and skills challenges in Greece, while the two-way exchange of inward and outward skilled migrants is a major issue for both Greece and Singapore. These flows and exchanges of populations make countries permeable entities and, far from being an unintended problem, they are essential to the global economy.

Although the within-population variations are discussed in news articles, the headline finding is still the overall country ranking compared with others. This shows how international comparative testing tends to direct attention away from systemic issues like inequality and towards external reference countries. A range of external reference societies (Waldow et al., 2014) appear in media reports and these give clues as to how the countries identify themselves. For example, Singapore, as a cosmopolitan trading centre, refers to an eclectic range of countries, Greece to nearby countries in the EU. New Zealand refers mostly to English-speaking developed countries rather than to Pacific Asia. Slovenia compares itself with other former communist countries in Eastern Europe as well as the EU more generally. And everyone refers to the top scorers Japan, Finland and Sweden.

Existing national debates were strongly evident in the Round 1 cases we examined and also in Greece. In these debates the findings are often framed in terms of a 'blame game' between politicians of different persuasions and reform records (Elstad, 2012). Little of this was evident in the case of Singapore, however, suggesting that adult skills policy is not a controversial issue for this country. We could also speculate that the government-owned media in Singapore may flatten the discussion, especially in light of the country's strong existing investment and well-developed alignment with the OECD's lifelong learning skills strategy.

In both Round 1 and Round 2 countries, very brief attention is paid to methodology and there is no critical discussion about this. The 'facts' about achievement are accepted, and debate centres on what they mean and who or what is to blame for them. No post-truth scepticism is evident! Just as for Singapore and Greece, each country, if examined in detail, has unique conditions that have to be understood not just in relation to the test scores, but also to interpret the media coverage, the public response and the


subsequent policy actions of governments. As the PIAAC widens its coverage of countries, so these diversities are likely to become more significant. The existing structures, the degree of economic stability of the country, its existing alignment with OECD policy priorities, demographic features, proximity to conflict – all may affect priorities and readiness to take on issues of lifelong learning. However, our study found that the model of adult skills that underlies and is disseminated through the PIAAC is uniformly accepted. This model organizes and naturalizes the understandings of key actors and publics, with little apparent interruption from the media, who rather amplify and promote it. Given that the OECD's ultimate aim is to influence the policy arms of national governments, advocacy groups and other key actors, including corporate ones, it could be argued that embedding this basic model is more important than the findings themselves, and international testing agencies have been phenomenally successful at this – so successful (as we have seen above) that once accepted, it is hard to change the model to incorporate new features.

Furthermore, while in some countries public opinion matters to the functioning of policy and the media play a key role, there are others where the relationships are different. Given the variability of the relationship between the media, the state, religious and corporate powers and public discourse, it can be argued that less publicly visible avenues of influence on policy actors, such as policy-focussed reports, international meetings, seminar briefings and training, are more important places to look for effects on policy reform. Aligning a country's key actors with the global vision of the OECD's skills strategy is a long-term enterprise, punctuated by the release of the survey findings, which are the tip of an iceberg.

Although the PIAAC is a new survey and this chapter has only been able to track its life over a short period, our analysis and other existing research (Mons et al., 2009) suggest the importance of constructing an historical timeline to track the growing recognition of the survey brand and its recognition by journalists and publics, to show the push and pull of actor networks, and to attend to the backstage work of creating and maintaining the ILSAs done by the OECD, its allies and advocates, who are often based in national contexts. There is value in looking more widely at the 'afterlife' of the data as it resonates in academic and policy worlds. The 'gist' of the findings becomes integrated into policy and popular discourse, which amplifies them through repetition across different news domains and media outlets. All this leads to what we might call the disappearing power of the data as it is reduced, reframed, ignored and superseded. I would argue that this process is inevitable, but it frustrates the test-producers, who spend time worrying about how to control public and


policy reactions that misunderstand or misuse the data (see Roseveare, 2014; O'Leary et al., 2017). This anxiety is perhaps misplaced. One important effect of the OECD's successfully embedded technicized model is that it renders lay people unable to assess their own (or others') skills, so that specialized expertise is needed to interpret the large-scale survey data. The expert role of the OECD in managing the release of ILSA findings is central and unquestioned in the public discourse we have examined. It usefully filters complex findings, suggests directions, and offers resources for countries that wish to align themselves with a particular pathway to the future.

Acknowledgement

Sincere thanks to all the people who contributed to the research reported in this paper: Margarita Calderon (Chile); Cormac O'Keefe (France); Natalie Papanastasiou, Sofia Ntalapera, Despina Potari, Jeff Evans, Anna Tsatsaroni (Greece); Ari Danu, Didi Sukyadi, David Mallows (Indonesia); Oren Pizmony-Levy (Israel); Tomoya Iwatsuki (Japan); Justina Naujokaitiene (Lithuania); Janet Coup/Pat Strauss (NZ); Stanley Koh (Singapore); Petra Javrh (Slovenia); Caroline Runesdottir (Sweden); Ahmed Yildez (Turkey); Keiko Yasukawa (Factiva searches).

References

Addey, C., Sellar, S., Steiner-Khamsi, G., Lingard, B., & Verger, A. (2017), 'The rise of international large-scale assessments and rationales for participation', Compare: A Journal of Comparative and International Education, 47 (3): 434–452.
Bolívar, A. (2011), 'The dissatisfaction of the losers', in M. A. Pereyra, H. G. Kotthoff, & R. Cowen (eds), PISA under examination, 61–74, Rotterdam: Sense Publishers.


Bowker, G. C. & Star, S. L. (2000), Sorting things out: Classification and its consequences, Cambridge, MA: MIT Press.
Carvalho, L. M. (2012), 'The fabrications and travels of a knowledge-policy instrument', European Educational Research Journal, 11 (2): 172–188.
Denis, J. & Pontille, D. (2015), 'Material ordering and the care of things', Science, Technology, & Human Values, 40 (3): 338–367.
Elstad, E. (2012), 'PISA debates and blame management among the Norwegian educational authorities: Press coverage and debate intensity in the newspapers', Problems of Education in the 21st Century, 48: 10–22.
Fenwick, T., Edwards, R., & Sawchuk, P. (2015), Emerging approaches to educational research: Tracing the socio-material, Abingdon: Routledge.
Goffman, E. (1988), Exploring the interaction order, Cambridge: Polity Press.
Gorur, R. (2011), 'ANT on the PISA trail: Following the statistical pursuit of certainty', Educational Philosophy and Theory, 43 (s1): 76–93.
Grek, S. (2010), 'International organisations and the shared construction of policy "problems": Problematisation and change in education governance in Europe', European Educational Research Journal, 9 (3): 396–406.
Hamilton, M. (2012), Literacy and the politics of representation, Routledge.
Hamilton, M. (2017), 'How ILSAs engage with national actors: Mobilizing networks through policy, media, and public knowledge', Critical Studies in Education.
Hamilton, M. (forthcoming), 'The discourses of PIAAC: Re-imagining literacy through numbers', in F. Finnegan & B. Grummell (eds), Power and possibility: Adult education in a diverse and complex world, Rotterdam: Sense Publishers.
Hanushek, E. & Woessmann, L. (2015), Universal basic skills: What countries stand to gain, Paris: OECD Publishing.
Hanushek, E. A., Schwerdt, G., Wiederhold, S., & Woessmann, L. (2015), 'Returns to skills around the world: Evidence from PIAAC', European Economic Review, 73: 103–130.
Henry, M., Lingard, B., Rizvi, F., & Taylor, S. (2001), The OECD, globalisation and education policy, Oxford: Pergamon.
Ivelja, R. (2016), 'Pismenost: kako poln je kozarec?' [Literacy: How full is the glass?], Dnevnik, 30 June 2016. Available online: https://www.dnevnik.si/1042741759/mnenja/kolumne (accessed 10 August 2017).
Javrh, P. (2016), How to work with media and policy makers, presentation at the ESRC Seminar on The Politics of Reception, Lancaster, April. Available online: https://www.youtube.com/watch?v=6CWj8AF6Eeg&feature=youtu.be (accessed 10 August 2017).
Knight, M. (2015), 'Data journalism in the UK: A preliminary analysis of form and content', Journal of Media Practice, 16 (1): 55–72.
Komatsu, H. & Rappleye, J. (2017), 'A new global policy regime founded on invalid statistics? Hanushek, Woessmann, PISA, and economic growth', Comparative Education, 53 (2): 166–191.


Latour, B. (2005), Reassembling the social: An introduction to actor-network-theory, Oxford: Oxford University Press.
Law, J. (2011), 'Collateral realities', in F. D. Rubio & P. Baert (eds), The politics of knowledge, 156–178, London: Routledge.
Lawn, M. & Grek, S. (2012), Europeanizing education: Governing a new policy space, Symposium Books Ltd.
Lingard, B. (2016), 'Rationales for and reception of the OECD's PISA', Educação & Sociedade, 37 (136): 609–627.
Luke, A., Freebody, P., Shun, L., & Gopinathan, S. (2005), 'Towards research-based innovation and reform: Singapore schooling in transition', Asia Pacific Journal of Education, 25 (1): 5–28.
Mons, N., Pons, X., Van Zanten, A., & Pouille, J. (2009), 'The reception of PISA in France', in Connaissance et régulation du système éducatif, Paris: OSC.
Morgan, C. (2007), OECD programme for international student assessment: Unraveling a knowledge network, ProQuest. Available online: https://scholar.google.co.uk/citations?user=cSuiBH4AAAAJ&hl=en&oi=sra (accessed 8 May 2017).
O'Keeffe, C. (2015), Assembling the adult learner: Global and local e-assessment practices, PhD thesis, Lancaster University.
OECD (2013), 'The survey of adult skills (PIAAC): Implications for education and training policies in Europe'. Available online: https://www.oecd.org/site/piaac/PIAAC%20EU%20Analysis%2008%2010%202013%20-%20WEB%20version.pdf
OECD (2013), OECD skills outlook 2013: First results from the survey of adult skills, Paris: OECD.
OECD (2016), Further results from the survey of adult skills (PIAAC), Paris: OECD.
O'Leary, T. M., Hattie, J. A., & Griffin, P. (2017), 'Actual interpretations and use of scores as aspects of validity', Educational Measurement: Issues and Practice, 36 (2): 16–23.
Ozga, J., Dahler-Larsen, P., Segerholm, C., & Simola, H. (eds) (2011), Fabricating quality in education: Data and governance in Europe, London: Routledge.
Pena, A. (2016), 'PIAAC skills and economic inequality', Journal of Research and Practice for Adult Literacy, Secondary, and Basic Education, 5 (2): 17.
Pizmony-Levy, O., Doan, L., Carmona, J., & Kessler, E. (2019), 'The public and international assessments', in B. Maddox (ed.), International large-scale assessments in education: Insider research perspectives, London: Bloomsbury Publishing.
Rawolle, S. & Lingard, B. (2014), 'Mediatization and education: A sociological account', in K. Lundby (ed.), Mediatization of communication (Vol. 21), 595–614, Berlin: Walter de Gruyter GmbH & Co KG.
Rogers, S. (2016), Data journalism matters more now than ever before. Available online: http://simonrogers.net/2016/03/07/data-journalism-matters-more-now-than-everbefore/ (accessed 8 May 2017).
Roseveare, D. (2014), PIAAC and the OECD skills strategy, INSIGHT Seminar 4, Academy of Social Sciences. Available online: https://www.acss.org.uk/iagseminar-series-seminar-4-report/


Schleicher, A. (2016), Why Skills Matter. Available online: http://oecdeducationtoday.blogspot.co.uk/2016/06/why-skills-matter.html (accessed 8 May 2017).
Sellar, S. & Lingard, B. (2013), 'The OECD and global governance in education', Journal of Education Policy, 28 (5): 710–725.
Steiner-Khamsi, G. (2003), 'The politics of league tables', JSSE-Journal of Social Science Education, 2 (1).
Steiner-Khamsi, G. & Waldow, F. (eds) (2012), World yearbook of education 2012: Policy borrowing and lending in education, Abingdon: Routledge.
Tan, C. (2017), 'Lifelong learning through the SkillsFuture movement in Singapore: Challenges and prospects', International Journal of Lifelong Education, 36 (3): 278–291.
Waldow, F., Takayama, K., & Sung, Y. K. (2014), 'Rethinking the pattern of external policy referencing: Media discourses over the "Asian Tigers" PISA success in Australia, Germany and South Korea', Comparative Education, 50 (3): 302–321.
Williamson, B. (2016), 'Digital education governance: Data visualization, predictive analytics, and "real-time" policy instruments', Journal of Education Policy, 31 (2): 123–141.
Yasukawa, K., Hamilton, M., & Evans, J. (2016), 'A comparative analysis of national media responses to the OECD Survey of Adult Skills: Policy making from the global to the local?', Compare: A Journal of Comparative and International Education, 47 (2): 271–285.

Appendix 1: Template used to summarize second round media reports

Collaborators with first-hand knowledge of the participating countries were asked to collect newspaper and online articles that refer to the PIAAC findings in their country during the week following the release of the findings and to write a short report about them which would answer the following questions. These are based on the findings from our earlier research:

1. What results are presented in the media (rankings, country scores, within-country group differences such as age or gender)?
2. What policy issues are drawn from the results?
3. Do the reports simply summarize or extract from the OECD country notes, or is there an attempt to provide richer interpretations?
4. What reference countries are mentioned (positive and negative), and is the general tone of the results positive or negative?
5. Is the survey methodology mentioned or critiqued in the reports?
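A minimal sketch of how answers to these five questions could be stored as a structured record for cross-country comparison is given below. The field names and the example values are illustrative assumptions that simply mirror the template; they are not part of the project's actual coding instrument.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MediaReportSummary:
    """Structured summary of one country's PIAAC media coverage,
    mirroring the five template questions above (field names are
    illustrative, not taken from the project's instrument)."""
    country: str
    results_presented: List[str] = field(default_factory=list)   # Q1: rankings, scores, group differences
    policy_issues: List[str] = field(default_factory=list)       # Q2: policy issues drawn from the results
    beyond_country_notes: bool = False                            # Q3: richer interpretation than the OECD notes?
    reference_countries: List[str] = field(default_factory=list)  # Q4: reference countries mentioned
    overall_tone: str = "neutral"                                  # Q4: positive / negative / neutral / mixed
    methodology_critiqued: bool = False                            # Q5: methodology mentioned or critiqued?

# Invented example record, for illustration only.
example = MediaReportSummary(
    country="Slovenia",
    results_presented=["overall ranking", "age-group differences"],
    policy_issues=["adult education funding"],
    reference_countries=["Finland", "other EU countries"],
    overall_tone="mixed",
)
print(example)
```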

11

The public and international assessments

Oren Pizmony-Levy, Linh Doan, Jonathan Carmona and Erika Kessler

Introduction

International large-scale assessments (ILSA) are a public matter. At the most basic level, ILSA are funded with public money to inform stakeholders about the output of what is often considered a large component of the governmental sector (Howie & Plomp, 2005). The International Association for the Evaluation of Educational Achievement (IEA) Trends in International Mathematics and Science Study (TIMSS), for example, aims to evaluate the extent to which the intended curriculum is actually implemented and attained. The Organisation for Economic Co-operation and Development (OECD) Programme for International Student Assessment (PISA) aims to evaluate the extent to which students can apply what they have learned in school to 'real-world' situations. Because education as a social institution has various societal effects (Meyer & Rowan, 1977), the impact of ILSA on the education system could potentially affect the larger society (e.g. by altering what counts as school knowledge).

Another aspect of the public nature of ILSA is the fact that their results attract media attention as headline news and take on symbolic substance worldwide. Indeed, following the immense growth in ILSA, scholars have examined the public discourse around these assessments (Stack, 2007; Pons, 2012; Dixon et al., 2013; Waldow, Takayama, & Sung, 2014; Yasukawa, Hamilton, & Evans, 2017; Hamilton, in this volume; Pizmony-Levy, forthcoming; Pizmony-Levy & Torney-Purta, 2018). Steiner-Khamsi (2003), for example, categorized three types of discursive reactions to ILSA results: (1) scandalization, (2) glorification, and (3) indifference. These reactions characterize the responses by elected officials, policy makers and the mass media.


To date, however, scholars have paid little attention to the links between ILSA and public opinion. Although the term is commonly used, scholars do not agree on a single definition of public opinion. For our purpose, we define 'public opinion' as opinions concerning social or governmental matters rather than private matters (Clawson & Oxley, 2013). It comprises opinions, attitudes, preferences, beliefs and values. These opinions are held by individuals in society, but they become public through technical practices such as standardized representative surveys (Perrin & McFarland, 2011). The public nature of ILSA makes them an appropriate topic for public opinion research.

We posit that understanding public opinion toward ILSA is important for several reasons. First, public opinion plays an important role in the development of public policy. In their Advocacy Coalition Framework, Sabatier and Jenkins-Smith (1999) argue that changes in public opinion represent external events that are a critical prerequisite to major policy changes. The premise of this argument is that public opinion can alter general spending priorities and the perceived seriousness of various problems (Berkman & Plutzer, 2005). The relative stability of public opinion is also seen as a key mechanism for the path dependency of social policy because it signals to policymakers what their constituents expect (Brooks & Manza, 2006). Second, as stated above, one of the key objectives of ILSA is to inform stakeholders, including the general public, about the state of education in their country. Drawing on the initial framework of PISA, for example, Pizmony-Levy (2017) has demonstrated how the Organisation for Economic Co-operation and Development (OECD) sought to address information gaps regarding the performance of young adults among parents, students, the general public and policymakers.

As this book investigates the choreographed rituals of ILSA and the associated discourse about education, this chapter focuses specifically on the implications of ILSA for public opinion. The chapter examines four research questions:

1. What does the general public think about international assessments?
2. To what extent is the general public engaged with results of international assessments?
3. To what extent is public engagement with international assessments linked to attitudes towards education?
4. To what extent is national performance on international assessments correlated with attitudes towards education?


The first two questions above assess public opinion toward and engagement with ILSA. The second two questions investigate the relationship between ILSA results and more general attitudes about educational policy.

Background

Although scholars have called for more studies that examine public opinion about education (Jacobsen, 2009), this research remains limited. Therefore, it is not surprising to find very few publications about the intersection between ILSA and public opinion. Most of the publications rely on the work of think tanks and polling organizations that track public opinion about education on an annual basis. Thematically, we identify two lines of research.

The first line of research examines public views towards ILSA (for review, see Pizmony-Levy, 2017). Surveys assess public knowledge about PISA results by asking respondents to indicate how 15-year-old students in their country perform on ILSA (e.g. bottom, middle and top of a ranking table). In the United States and Israel, respondents underestimate the performance of students in their country. In both countries, college-educated respondents are more critical of their country's performance on PISA. In addition, surveys assess the level of engagement with ILSA results. According to a 2014 PDK/Gallup survey, for example, only one-third of Americans (30 per cent) remember reading or hearing about PISA scores when they were released (PDK International, 2014). The same survey reveals uncertainty about the accuracy of PISA and about the importance of PISA for helping improve schools. Nonetheless, Americans endorse PISA as reflected in the 2014 Education Next-PEPG survey. This line of research relies heavily on samples from the United States and Israel, and thus we have a limited perspective on the phenomena.

The second line of research explores how ILSA results shape public opinion about education. Using a national survey experiment in the United States, Morgan and Poppe (2012) show how framing educational policy with the goal of enhancing international competitiveness lowers the subjective evaluation of the quality of local schooling without increasing interest in additional spending to improve the nation's education system. Moving to Scandinavia, Fladmoe (2013) demonstrates how awareness of PISA could influence the effect of news consumption on subjective evaluation of the national education


system the more they consume certain sources of news. Cross-national research, using data from the International Social Survey Program (ISSP) from thirty countries, shows the link between country performance on PISA and public confidence in education (Pizmony-Levy & Bjorklund, 2018).

Data & methods

To begin exploring public engagement with ILSA, we conducted a pilot study to investigate public opinion towards education and international assessments. The study was implemented between November and December 2016. During that time, the IEA released the results of TIMSS 2015 (29 November) and the OECD released the results of PISA 2015 (6 December). By administering the survey during that timeframe, we intended to investigate the extent to which releases of new ILSA results – especially in a year when TIMSS and PISA are released simultaneously – affect public opinion and views.

We designed the survey as an online, self-administered questionnaire. It included questions gauging respondents' attitudes toward ILSA and general questions about education (e.g. confidence in education, and opinions regarding public spending on education). Most of the items about ILSA were adapted from the 2014 PDK/Gallup Poll of the Public's Attitudes Toward the Public Schools and the 2013 Pew Research Center's Public Knowledge of Science and Technology Survey (for review, see Pizmony-Levy, 2017). The general items about education were adapted from the 2006 Role of Government Module of the International Social Survey Program (ISSP). The survey contains detailed information on respondents' socio-demographic characteristics. The final question invited respondents to share open-ended comments about issues raised in the survey. The survey was written in English and then translated by native speakers into sixteen languages: Arabic, Armenian, Bahasa Indonesia, Dutch, French, Hebrew, Hungarian, Japanese, Korean, Simplified Mandarin, Traditional Mandarin, Portuguese, Russian, Spanish, Turkish and Vietnamese (for the complete version of the survey instrument, see Pizmony-Levy et al., 2017). The majority of respondents completed the survey within 8–10 minutes.

The pilot study is based on a convenience sample of adults. We recruited respondents through social media outlets including Facebook, Twitter, WeChat and others. A research team of sixty-one individuals from twenty-five countries disseminated scripted announcements and reminders about the survey throughout the data collection period. While the advantages of convenience sampling are clear (e.g. simplicity and cost effectiveness), we should acknowledge


two limitations. First, the sample is highly vulnerable to selection bias (see discussion of sample demographics below). Second, because this is a non-probability sample, the results are not generalizable. An additional caveat is that there is not equitable access to the internet across the globe (Pearce & Rice, 2013). These limitations are important; however, we believe that for the purpose of generating new hypotheses regarding public engagement with ILSA, this approach suffices. The final sample included 4,585 respondents from seventy-eight different countries (including eighty respondents from eighteen countries that did not participate in TIMSS or PISA 2015). The analytical sample is slightly smaller than the full dataset and includes a total of 4,306 respondents. The sample is restricted to twenty-one countries in which at least thirty responses were recorded (see Appendix A). Further, the sample includes seven countries in which at least thirty valid responses were recorded before and after the release of TIMSS and PISA 2015.

Table 11.1 presents the socio-demographic background of the sample. Two-thirds (66.7 per cent) of the sample are women. Half (50.8 per cent) of the sample are young respondents between ages 18 and 29.

Table 11.1 Sample demographics (n = 4,306)

Characteristics                                Per cent

Gender
  Men                                          33.3
  Women                                        66.7
Age group
  18–29                                        50.8
  30–49                                        37.8
  50–64                                         9.9
  65+                                           1.5
Education
  High school or less                          11.1
  University/College degree                    47.1
  Graduate degree                              41.8
Employment status
  Not employed                                 22.7
  Part time                                    18.2
  Full time                                    59.1
Social class
  Lower class                                   3.6
  Working class                                11.0
  Lower middle class                           32.3
  Upper middle class                           47.1
  Upper class                                   6.0
Community
  Rural (fewer than 15K people)                 7.7
  Town or suburban (15K to about 1M people)    44.2
  A large city (with over 1,000,000 people)    48.1
Parental status
  Non-parent                                   74.8
  Parent                                       25.2

A vast majority of the sample holds an academic degree (88.9 per cent hold at least an undergraduate degree, and 41.8 per cent hold a graduate degree). Similarly, almost all respondents reside in urban or suburban communities (92.3 per cent). Only one-quarter of the sample (25.2 per cent) are parents of school-aged children. We found little variation related to socio-demographic background across countries. Given the composition of the sample (i.e. young, highly educated and apparently concerned respondents), it is important to mention that some of the patterns reported in this chapter will look very different in a study using probability-based, representative survey data. Specifically, we speculate that our sample may overestimate engagement with ILSA.

Because the sample is not representative and the sample size varies across countries, the analytical technique relies on descriptive statistics and bivariate analysis (e.g. cross-tabulation). In addition to the analyses reported here, we also analysed the data with country weights that address the uneven sample sizes. All supplemental analysis is available upon request.
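To illustrate the kind of analysis this implies, the short sketch below computes an unweighted cross-tabulation and a country-weighted pooled estimate from hypothetical respondent-level data. It is a minimal sketch only: the column names, the toy data and the equal-country-weight scheme are assumptions for illustration, not the study's actual code or weighting procedure.

```python
import pandas as pd

# Hypothetical respondent-level data: one row per respondent, a country
# code and a 1/0 indicator for agreeing that ILSA are accurate.
# Column names and values are invented for illustration.
df = pd.DataFrame({
    "country": ["SGP", "SGP", "GRC", "GRC", "GRC", "USA", "USA"],
    "agree_accuracy": [1, 0, 1, 1, 0, 0, 1],
})

# Restrict to countries with a minimum number of responses (the chapter
# uses thirty; two is used here only so the toy data survive the filter).
MIN_RESPONSES = 2
counts = df["country"].value_counts()
df = df[df["country"].isin(counts[counts >= MIN_RESPONSES].index)].copy()

# Unweighted bivariate analysis: per cent agreeing within each country.
unweighted = pd.crosstab(df["country"], df["agree_accuracy"], normalize="index") * 100

# Equal country weights, so each country contributes the same total weight
# regardless of how many respondents it supplied.
df["weight"] = df["country"].map(1.0 / counts)

# Pooled share agreeing, weighted so large national samples do not dominate.
weighted_share = (df["agree_accuracy"] * df["weight"]).sum() / df["weight"].sum() * 100

print(unweighted)
print(f"Weighted per cent agreeing: {weighted_share:.1f}")
```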

Results

Attitudes toward ILSA

We will first explore the descriptive statistics for attitudes towards ILSA. For the sake of simplicity, the survey used the term 'international comparisons tests'.


The survey paid particular attention to four factors: (a) perceived accuracy of ILSA; (b) perceived contribution of ILSA to improving schools; (c) perceived importance of good country performance on ILSA; and (d) support for country participation in ILSA.

A key premise of ILSA is the provision of high-quality comparable data across different countries and cultures. To help us measure the public's perceived accuracy of ILSA, the survey prompted respondents to rate their agreement/disagreement with the following statement: 'International comparisons tests such as PISA and TIMSS accurately measure student achievement across nations.' Overall, respondents reported that they doubt the accuracy of ILSA. Slightly more than one-third (36.0 per cent) agreed with this statement, whereas the rest responded neither agree nor disagree (41.6 per cent) or disagree (22.4 per cent). Perceived accuracy of ILSA varied across countries, as illustrated in Figure 11.1. The majority of respondents in Australia and Hungary, for example, endorsed the notion that ILSA accurately measure student achievement cross-nationally. Conversely, a small minority of respondents in the United Kingdom and Denmark endorsed this idea. Importantly, the figure for the United States is similar to figures found in a national representative sample (PDK/Gallup, 2014).

[Bar chart: per cent of respondents in each participating country who agree, shown against the overall per cent agree (total); country-level values range from 13 per cent to 74 per cent.]

Figure 11.1 International comparisons tests such as PISA and TIMSS accurately measure student achievement across nations, by country
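As a rough illustration of how item responses can be reduced to the agreement shares reported above and summarized in Figure 11.1, the sketch below collapses a hypothetical five-point agreement item into agree/neither/disagree and tabulates the shares by country and overall. The five-point coding, the variable names and the toy responses are assumptions for the example, not the survey's actual coding scheme.

```python
import pandas as pd

# Hypothetical responses to the accuracy item on a five-point scale
# (1 = strongly disagree ... 5 = strongly agree); coding is illustrative.
responses = pd.DataFrame({
    "country": ["Australia", "Australia", "Hungary", "Denmark", "Denmark"],
    "accuracy_item": [5, 4, 4, 2, 3],
})

def collapse(score: int) -> str:
    """Collapse the five-point scale into three reporting categories."""
    if score >= 4:
        return "agree"
    if score == 3:
        return "neither"
    return "disagree"

responses["category"] = responses["accuracy_item"].apply(collapse)

# Per cent in each category by country (the country bars in Figure 11.1)
# and overall (the 'Percent Agree (total)' reference value).
by_country = (pd.crosstab(responses["country"], responses["category"],
                          normalize="index") * 100).round(1)
overall = (responses["category"].value_counts(normalize=True) * 100).round(1)

print(by_country)
print(overall)
```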


Policymakers who evaluate their educational systems using ILSA believe these assessments will help them to improve education quality. International organizations responsible for ILSA (i.e. the IEA and the OECD) make similar claims. The survey therefore asked respondents to rate their agreement/disagreement with the following statement: 'International comparisons tests are critical to helping improve schools in this country.' Respondents are split when it comes to the notion that ILSA are a critical tool for school improvement. Less than half (48.7 per cent) agreed with this statement, whereas the rest neither agreed nor disagreed (26.1 per cent) or disagreed (20.3 per cent). In further analysis (not reported), we found that the public's view of the perceived contribution of ILSA to improving schools varied across countries.

Previous research on public discourse has suggested that in many countries policymakers take action in response to ILSA results (Figazzolo, 2009; Breakspear, 2012). By doing so, they advance the idea that good performance on TIMSS and/or PISA is desirable and an important pursuit. The survey asked respondents to rate their opinion on the following question using a four-point scale: 'How important is it that your country performs well on these tests compared to other countries?' Although respondents doubted the accuracy of ILSA and reported being unsure about the contribution of ILSA to improving schools, a large majority of them (79.7 per cent) viewed good performance on ILSA as important (42.5 per cent answered somewhat important, and 37.2 per cent answered very important).

We found a similar pattern for public support for the country's participation in further cycles of ILSA. The survey asked respondents: 'Do you support or oppose the country's participation in international comparisons tests in science, mathematics, and reading in the coming years?' Slightly more than two-thirds (68.2 per cent) indicated that they support participation in ILSA, with a small minority (9.5 per cent) opposing this action.

Table 11.2 presents a correlation matrix of the four attitudes towards ILSA. Overall, we found positive and significant correlations between all the variables. The correlation between perceived accuracy of ILSA and perceived importance of good country performance is relatively weak (r = .27, p