Measuring Attitudes Cross-Nationally
The contributors

Jaak Billiet – Professor of Social Methodology at the Katholieke Universiteit Leuven, Centre for Sociological Research
James Davis – Senior Lecturer in Sociology at the University of Chicago and Senior Research Scientist at the National Opinion Research Center
Gillian Eva – Research Fellow at City University and a member of the ESS Central Coordinating Team (CCT)
Rory Fitzgerald – Senior Research Fellow at City University and a member of the ESS CCT
Irmtraud Gallhofer – Member of the ESS CCT and senior researcher at ESADE Business School, Universitat Ramon Llull, Barcelona
Sabine Häder – Senior Statistician at Zentrum für Umfragen, Methoden und Analysen (ZUMA), Mannheim, Germany and a member of the European Social Survey sampling panel
Janet A. Harkness – Senior Research Scientist at ZUMA, Mannheim, Germany and Director of the Survey Research and Methodology Program at the University of Nebraska, USA
Bjørn Henrichsen – Director at Norwegian Social Science Data Services
Roger Jowell – Research Professor at City University London and Principal Investigator of the European Social Survey (ESS)
Max Kaase – Emeritus Professor of Political Science at the University of Mannheim, past President of the International Political Science Association, and chair of the ESS Scientific Advisory Board
Achim Koch – Senior Researcher at the European Centre for Comparative Surveys (ECCS) at ZUMA, Mannheim, Germany
Kirstine Kolsrud – Senior Adviser at Norwegian Social Science Data Services
Peter Lynn – Professor of Survey Methodology at the University of Essex, UK and a member of the European Social Survey sampling panel
Peter Mohler – Director of Zentrum für Umfragen, Methoden und Analysen (ZUMA), Mannheim, Germany and Professor at Mannheim University
José Ramón Montero – Professor of Political Science at the Universidad Autónoma de Madrid and the Instituto Juan March, Madrid
Kenneth Newton – Professor of Comparative Politics at the University of Southampton and Visiting Fellow at the Wissenschaftszentrum Berlin
Pippa Norris – Director of the Democratic Governance Group, United Nations Development Program and the McGuire Lecturer in Comparative Politics, Harvard University
Michel Philippens – Former research assistant at the Katholieke Universiteit Leuven, Centre for Sociological Research
Willem E. Saris – Member of the ESS CCT and Professor at the ESADE Business School, Universitat Ramon Llull, Barcelona
Shalom H. Schwartz – Sznajderman Professor Emeritus of Psychology at the Hebrew University of Jerusalem, Israel
Knut Kalgraff Skjåk – Head of Department at Norwegian Social Science Data Services
Ineke Stoop – Head of the Department of Data Services and IT at the Social and Cultural Planning Office of the Netherlands
Measuring Attitudes Cross-Nationally
Lessons from the European Social Survey
EDITORS
Roger Jowell, Caroline Roberts, Rory Fitzgerald and Gillian Eva
Centre for Comparative Social Surveys at City University, London
© Centre for Comparative Social Surveys, City University, London 2007

First published 2007

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act, 1988, this publication may be reproduced, stored or transmitted in any form, or by any means, only with the prior permission in writing of the publishers, or in the case of reprographic reproduction, in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

SAGE Publications Ltd
1 Oliver’s Yard
55 City Road
London EC1Y 1SP

SAGE Publications Inc.
2455 Teller Road
Thousand Oaks, California 91320

SAGE Publications India Pvt Ltd
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, Post Bag 7
New Delhi 110 044

SAGE Publications Asia-Pacific Pte Ltd
33 Pekin Street #02-01
Far East Square
Singapore 048763

Library of Congress Control Number: 2006932622

British Library Cataloguing in Publication data
A catalogue record for this book is available from the British Library

ISBN 978-1-4129-1981-4
Typeset by C&M Digitals (P) Ltd, Chennai, India
Printed in Great Britain by Athenaeum Press, Gateshead
Printed on paper from sustainable resources
Contents

Foreword   xi

1   The European Social Survey as a measurement model   1
    Roger Jowell, Max Kaase, Rory Fitzgerald and Gillian Eva
       Introduction   1
       In defence of rigour   4
       The pursuit of equivalence   6
       The ESS model in practice   9
       Continuity   11
       Governance   12
       Division of tasks   15
       Workpackages 1 and 2: Overall project design and coordination   15
       Workpackage 3: Sampling   16
       Workpackage 4: Translation   17
       Workpackage 5: Commissioning fieldwork   18
       Workpackage 6: Contract adherence   19
       Workpackage 7: Piloting and data quality   21
       Workpackages 8 and 9: Question reliability and validity   22
       Workpackage 10: Event monitoring   24
       Workpackage 11: Data access and aids to analysis   24
       Conclusion   26
       Notes   27
       References   29

2   How representative can a multi-nation survey be?   33
    Sabine Häder and Peter Lynn
       Introduction   33
       Equivalence of samples   34
       Sample sizes   36
       Achieving equivalence   37
       Population coverage   38
       Sampling frames   38
       Sample designs   40
       Design weights   43
       Design effects   44
       Sample size   49
       Organisation of the work   50
       Conclusion   51
       References   52

3   Can questions travel successfully?   53
    Willem E. Saris and Irmtraud Gallhofer
       Introduction   53
       Seven stages of questionnaire design   54
       Background to the evaluation of questions   56
       Evaluation of ‘concepts-by-intuition’   57
       Quality criteria for single survey items   58
       The Multitrait-Multimethod design   60
       Predicting the quality of questions   61
       Evaluation of ‘concepts-by-postulation’   61
       Political efficacy   62
       The Human Values Scale   65
       An evaluation of cross-cultural comparability   68
       Conclusion   71
       References   72
       Appendix   75

4   Improving the comparability of translations   79
    Janet A. Harkness
       Introduction   79
       Source and target languages   80
       Organisation and specification   81
       Organisation   81
       Specification   81
       The Translation Procedure: TRAPD   83
       Split and parallel translations   84
       Countries with more than one language   85
       Producing multiple translations   85
       Sharing languages and harmonisation   87
       Ancillary measures to support translation   87
       Annotating the source questionnaire   87
       Query hotline and FAQs   88
       Documentation templates   88
       Lessons learned   89
       Source questionnaire and translation   89
       Advance translation   89
       Templates and production tools   90
       Attention to detail   90
       Identifying translation errors   91
       Conclusion   91
       References   92

5   If it bleeds, it leads: the impact of media-reported events   95
    Ineke Stoop
       Introduction   95
       “Events, dear boy, events”   97
       Events in the media   98
       News flow and event identification   100
       Guidelines and database   102
       Meanwhile, what was happening in Europe?   105
       Looking ahead   108
       Notes   110
       References   111

6   Understanding and improving response rates   113
    Jaak Billiet, Achim Koch and Michel Philippens
       Introduction   113
       Response quality: standards and documentation   115
       The conduct of fieldwork   117
       Response and non-response   118
       Why such large country differences in response rates?   120
       Country differences in non-contact rate reduction   121
       Contact procedures   122
       Number of contact attempts   122
       Contactability   124
       Country differences in refusal conversion   126
       Differentiation of respondents according to readiness to co-operate   129
       Estimation of non-response bias   129
       Conclusion   132
       References   133
       Appendix   136

7   Free and immediate access to data   139
    Kirstine Kolsrud, Knut Kalgraff Skjåk and Bjørn Henrichsen
       Introduction   139
       Data access barriers   140
       Standardising the production of data and meta data   142
       The data   142
       The survey documentation   146
       Dissemination   149
       Conclusion   155
       References   156

8   What is being learned from the ESS?   157
    Peter Mohler
       Introduction   157
       Consistency   159
       Transparency   160
       Coordination and management   161
       Innovative probability samples   162
       A source of data on error and bias   162
       Translation   164
       Free and easy access to data   164
       Capacity building   165
       Conclusion   166
       References   167

9   Value orientations: measurement, antecedents and consequences across nations   169
    Shalom H. Schwartz
       Introduction   169
       The nature of values   170
       Current survey practice and the conception of values   172
       A theory of the content and structure of basic human values   173
       Ten basic types of value   173
       The structure of value relations   174
       Comprehensiveness of the ten basic values   176
       But are self-reports valid indicators of values?   177
       Measuring values in the ESS   177
       Development of the Human Values Scale   177
       Methodological issues in designing the scale   179
       Correcting for response tendencies   180
       Reliability of the ten values   181
       Value structures in the ESS countries   182
       Value priorities in the ESS countries   184
       Sources of individual differences in basic values   188
       Age and the life course   188
       Gender   189
       Education   189
       Income   190
       Basic values as a predictor of national and individual variation in attitudes and behaviour   190
       Attitudes to immigration   190
       Interpersonal trust   192
       Social involvement   193
       Organisational membership   194
       Political activism   195
       Conclusion   196
       References   197
       Appendix 1   201
       Appendix 2   202
       Appendix 3   203

10  Patterns of political and social participation in Europe   205
    Kenneth Newton and José Ramón Montero
       Introduction   205
       Individual participation: fragmented and multi-dimensional   206
       National levels of participation: also fragmented and multi-dimensional?   209
       Types of participation   210
       Participation in voluntary associations   210
       Social and helping behaviour   214
       Conventional political participation   217
       Protest politics   219
       Overall participation   221
       What explains the national patterns?   223
       Conclusion   227
       References   229
       Appendix 1   230
       Appendix 2   231
       Appendix 3   233
       Appendix 4   234
       Appendix 5   235
       Appendix 6   236
       Appendix 7   237

11  A continental divide? Social capital in the US and Europe   239
    Pippa Norris and James Davis
       Introduction   239
       Tocquevillian theories of social capital   241
       Social networks and social trust matter for societal co-operation   242
       Social capital has important consequences for democracy   243
       Social capital has declined in post-war America   243
       Social capital in advanced industrialised societies   247
       Evidence and measures   249
       Comparing social capital in Europe   251
       Cohort analysis of social capital   255
       Conclusions   261
       References   262
       Appendix   264

Index   265
Foreword

This book describes the product of a remarkable collaboration across national borders between researchers and funders whose singular purpose has been to build a regular and rigorous means of charting attitudinal and behavioural change in a changing Europe. The project’s starting point (and its continual pre-occupation) has been to find ways of tackling the longstanding and seemingly intractable difficulties of achieving equivalence in comparative social surveys.

This volume is about the problems facing comparative social research generally and new approaches to finding solutions. Almost all chapters have been written by one or more of the primary architects and initiators of the European Social Survey (ESS). Each chapter deals with a particular aspect of comparative social surveys – from sampling to translation, response rate enhancement to harmonisation of data, and so on – tracing the difficulties and describing how the ESS attempts to solve them.

Chapter 1 records the origins of the European Social Survey, its underlying philosophy and purpose. It also introduces and summarises its many innovations – both methodological and organisational.

Chapter 2 discusses the obstacles to achieving equivalent random samples within different countries. It documents the ESS’s unprecedented approach to achieving a viable solution.

Chapter 3 describes the unusual collection of hoops through which ESS questions have to pass before they are adopted as part of the questionnaire, warning of the hazards of less rigorous approaches.

Chapter 4 documents the unusual procedures and protocols employed in the ESS to obtain equivalent translations from the source questionnaire into well over 20 languages, contrasting ESS methods with alternative approaches.

Chapter 5 reviews the possible impact of major national or international events on attitudinal trend data and describes the methods the ESS has developed to monitor and record such events with the purpose of informing subsequent data analyses.

Chapter 6 is about patterns of declining response rates in surveys, and the particular problem of differential response rates in cross-national surveys. It describes the range of counteractive measures taken in the ESS and assesses their effectiveness.

Chapter 7 tackles the formidable difficulty of producing an equivalent, user-friendly and timely dataset in the same form from over 20 separate countries. It outlines the meticulous procedures and protocols employed by the ESS to achieve this.

Chapter 8 assesses what lessons we are learning from the various ESS innovations in methodology and organisational structure, acknowledging what has already been learned from predecessor cross-national social surveys.

Chapter 9 outlines the origins and development of the ‘human values scale’ employed in the ESS and demonstrates its utility for mapping the structure of values across nations.

Chapter 10 analyses the results of the rotating module in Round 1 of the ESS on citizen involvement and democracy, showing distinctly different national patterns of participation in both voluntary and political activity.

Chapter 11 compares ESS data with data from the US General Social Survey to investigate to what extent the well-documented ‘crisis’ of declining social capital in the US applies to European nations too.

The huge debts we owe to colleagues throughout Europe are too numerous to itemise here. The organisational structure of the ESS means that in each of 32 countries there are numerous individuals and organisations that have taken on the task of making the ESS a success in their own country. They include, above all, the National Coordinators who orchestrate the work in their country and who generously contribute their ideas and expertise, the survey agencies that carry out the fieldwork and data preparation to remarkably high standards, and, of course, the national funding agencies that have consistently financed successive rounds of fieldwork and coordination in their country.

In addition, members of our various advisory boards and committees – the Scientific Advisory Board, the Funders’ Forum, the Methods Group, the Sampling Panel and the Translation Panel – have played an invaluable role in helping to secure and sustain the quality of the project. We greatly appreciate their respective contributions and realise how much we depend on them – individually and collectively – to help us manage such a large and complex multinational enterprise.

As for the production of the book itself, we have relied heavily on the talents and meticulousness of Sally Widdop, a research assistant at our Centre, who has kept us on track and told us precisely what to do – for all of which we owe her a heartfelt vote of thanks.

Editors
Roger Jowell
Caroline Roberts
Rory Fitzgerald
Gillian Eva
1

The European Social Survey as a measurement model

Roger Jowell, Max Kaase, Rory Fitzgerald and Gillian Eva*

* Roger Jowell is a Research Professor at City University London and Principal Investigator of the European Social Survey (ESS); Max Kaase is an emeritus Professor of Political Science at the University of Mannheim, past President of the International Political Science Association, and chair of the ESS Scientific Advisory Board; Rory Fitzgerald is a Senior Research Fellow at City University and a member of the ESS Central Coordinating Team (CCT); Gillian Eva is a Research Fellow at City University and a member of the ESS CCT.

Introduction

The importance to social science of rigorous comparative research is incontestable. It helps to reveal not only intriguing differences between countries and cultures, but also aspects of one’s own country and culture that would be difficult or impossible to detect from domestic data alone. As Durkheim famously put it: “Comparative sociology is not a particular branch of sociology: it is sociology itself” (Durkheim, 1964, p.139).

Even so, the strict methodological standards that have long been employed in many national studies have tended to be beyond the reach of many comparative studies (Scheuch, 1966; Teune, 1992). One obvious reason is their expense. But there are other even more compelling reasons, notably that comparative studies have to deal with competing cultural norms and national methodological preferences that single-nation studies do not begin to face. Although these problems are not necessarily insuperable, it seems that national customs and conventions have too often held sway over methodological consistency. As a result, design inconsistencies that would never be tolerated in important national studies have frequently been shrugged off in important comparative studies. Only after the event have the
methods of several celebrated comparative studies been shown to be less consistent between nations than they ought to be (see Verba, 1971; Saris and Kaase, 1997; Park and Jowell, 1997).

This was the situation that confronted the team responsible for the ‘Beliefs In Government’ project which started in 1989, sponsored by the European Science Foundation (ESF) and led by Max Kaase and Ken Newton (1995). The project was designed to compile and interpret existing data about changes over time in the socio-political orientations of European citizens in different countries. Many sources of data were available to the study – notably time series such as the Eurobarometers, the International Social Survey Programme, the European (and World) Value Surveys, and sets of national election studies. But although these studies formed the essential source material for the study, the scope for rigorous comparative analysis across countries and over time was limited by their discontinuities and internal inconsistencies. This discovery was the inspiration behind the European Social Survey.

A member of the ESF Standing Committee of the Social Sciences (SCSS) at the time, Max Kaase proposed to his colleagues a project to investigate the feasibility of starting a new European Social Survey with a view to mitigating the limitations that the Beliefs in Government project had revealed. The SCSS agreed and set up an eight-person ‘Expert Group’ to pursue the idea (see Note 1 at the end of this chapter). At the end of its year-long deliberations, it concluded that a new rigorous and meticulously planned pan-European general social survey was both desirable and feasible (ESF, 1996). As importantly, it concluded that, with the aid of the ESF and its member organisations throughout Europe (plus, it was hoped, the European Commission – EC), the project was likely to be fundable.

Thus encouraged, the SCSS set up and financed two new committees: the first – a Steering Group (see Note 2 at the end of this chapter) – representing social scientists selected by each of the ESF’s interested member organisations; and the second – a Methodological Committee (see Note 3 at the end of this chapter) – consisting of a smaller number of specialists from a range of European countries. These two groups were jointly charged with turning the idea into a well-honed blueprint for potential action. After parallel deliberations, though with some overlaps in membership, the chairs of the two committees (Kaase and Jowell), together with the SCSS scientific secretary (John Smith), jointly produced a Blueprint document (ESF, 1999), which was duly presented to and endorsed by the SCSS and distributed to all ESF member organisations.

Here at last was a document which contained not only a call for regular, rigorous monitoring of changes in values within modern Europe, but also a detailed specification of how such a highly ambitious project might be set up and implemented in an equivalent way across a diverse range of European countries. The Blueprint also made
clear that the project could not be a one-shot comparative survey. To achieve its essential aim of monitoring and interpreting change, it had to undertake repeat measurements over an extended period.

The Blueprint was soon welcomed by many academics in the field throughout and beyond Europe, but also – and more importantly perhaps – by the many national social science funding agencies that, as ESF members, might be called on to contribute resources to such a project. The proposal had its detractors too, most of whom saw the potential value of the project but believed it might be too ambitious and expensive to get off the ground. As the remainder of this book shows, these fears fortunately proved to be unfounded.

Following publication of the Blueprint, a small team led by Roger Jowell was assembled (see Note 4 at the end of this chapter) to formulate an application to the EC for core funding of the project that would cover the ESS’s detailed design and continuing coordination, but not its fieldwork – which was always to be financed at a national level. Meanwhile, the ESF had begun seeking commitments from its member organisations that – if EC funding was in the event to materialise for the ESS core activities – they would in turn be ready to meet the costs of their own national fieldwork and domestic coordination.

Learning from the experience of other studies, however, no potential funding agency was left in any doubt that the hallmark of the ESS was to be consistency across nations and exacting standards. Thus, familiar but inappropriate national variations in methodology were in this case to be firmly resisted. Rather, the design was to be based on the now publicly available Blueprint and determined by a Central Coordinating Team. Although there would, of course, be consultation with all participants and advisers, the ESS was above all to be implemented according to a uniform (or equivalent) set of principles and procedures.

Given the fact that many of the potential participating countries would have to go through complicated funding hoops to secure support for this new venture, the core application to the Commission cautiously assumed that around nine nations would participate in the first round. Others, it was hoped, would follow suit in subsequent rounds. As it turned out, however, not long after the successful outcome of the EC application had been announced, an astonishing 22 countries had opted to join the ESS’s first biennial round in 2002/2003, each funding its own share of the study’s costs. All but one of those same nations then also took part – again on a self-funding basis – in the second round in 2004/2005 and were joined by five new nations. Now almost all of these nations are participating in the third round in 2006/2007, again with some important new entrants. Critically, at each new round the EC has also supported applications from the central coordinating team to cover the project’s continuing design and coordination.
Apart from its unusual rigour for a comparative attitudinal survey, two further features of the ESS attracted immediate and widespread interest among social scientists. The first was the division of the ESS questionnaire into two halves – one half devoted to its core measures and the other half to two rotating modules, both subject to a Europe-wide competition among multinational teams of social scientists. This arrangement ensures on the one hand that there is appropriate continuity between rounds, but on the other that the central team is not the sole arbiter of the study’s content. It also means that many academics in many countries look to the ESS as a potential vehicle for the collection of valuable multinational data in their field.

The second feature of the ESS that attracted immediate attention is its firm policy of transparency and open access. All its protocols and methods are made immediately available on the ESS website (www.europeansocialsurvey.org), and each round of data is also made immediately available on the ESS data website (http://ess.nsd.uib.no), giving everyone simultaneous access and allowing no privileged prior access to the principal investigators.

Perhaps it was these features of the ESS that so swiftly alerted social scientists to its existence, particularly those throughout the world who are involved in comparative social measurement. But the interest in the project seemed to expand exponentially when it was announced in 2005 that the ESS team had won the coveted Descartes Prize “for excellence in collaborative scientific research”. As the first social science project ever even to have been short-listed for this top European science prize, it was a welcome sign that the project had met the approval of the wider scientific community in Europe.

Before dealing with the specific components of the ESS model, we wish briefly to rehearse some of the broader motivations behind the enterprise.

In defence of rigour

Good science – whether natural science or social science – should never turn a blind eye to its known imperfections. Nor should those imperfections be concealed from potential users. Some might argue that the social sciences are always an order of magnitude more error-prone than are the natural sciences. That is disputable, but in any case it provides all the more reason for greater rather than less vigilance in social science methodology.

In some respects too, the social sciences are even more complicated than the natural sciences. Although they do not have to explain the complexities of the physical and natural world, they do have to interpret and explain the complexities of people’s interactions – whether with one another or with their world. And human interactions are in some ways more complicated than are interactions in the physical and natural world. For one thing, ‘laws of behaviour’ are less in evidence among human populations than among,
say, physical objects, or chemicals, or even creatures. Thus, social scientists cannot as confidently make assumptions about the likely regularities of human interactions as, say, chemists sometimes can about the interactions between certain gases. Not only do cultural variations complicate the measurement of human behaviour and attitudes across nations, but so perhaps do larger and more unpredictable individual variations within the same populations.

Moreover, human beings have their own value systems and are ‘opinionated’ in ways that their counterparts in the natural world are not. They are also all too capable of believing one thing and doing (or saying) quite another. So, the social sciences often have to start off by overcoming barriers which are erected (whether deliberately or intuitively) by the objects of their measurements themselves. Unless they succeed, these barriers may distort or nullify their findings.

All of which makes the general domain of the social scientist particularly tricky. But, as in all fields, some aspects are a great deal trickier than others. Three features of the ESS (and other similar studies) place it near the extreme of this notional spectrum of difficulty:

• Measuring social attitudes and values is for many reasons more risky and error-prone than measuring validatable facts and behaviour patterns, because they tend to be even more fluid and context-dependent.
• Measuring change over time adds a level of complexity to the analysis and interpretation of findings that rarely applies to studies that are able to rely on one-off measurements.
• Measuring cross-national differences and similarities is made infinitely more difficult by simultaneous variations in social structure, legal systems, language, politics, economics and culture that would be rare indeed in a single-country study.

Cross-national studies of attitude change simultaneously incorporate all three of these daunting aspects of quantitative social measurement. But the ESS was fortunate in coming late to the scene, by which time many distinguished comparative studies had already laid the groundwork, such as Almond and Verba (1963), Barnes et al. (1979) and, more recently, a series of comparative surveys of attitude and value change, including the Eurobarometers, the International Social Survey Programme and the European (and World) Value Surveys. The ESS was determined not only to learn from these studies, but also, wherever possible, to mitigate the methodological difficulties they had encountered, just as other present and future projects will doubtless build on the ESS model.
The initiators of the ESS also found themselves with an enviable remit. Their role was not just to determine the structure and style of a new improved time series on European attitude change, but to do so without compromising the highest standards of scientific rigour. The enthusiastic and widespread support they received for this goal was as surprising as it was inspiring. It came not just from individual members of numerous specialist advisory groups, but also from the officials (and ultimately the referees) who deal with EC Framework Programmes, as well as from a range of funding councils throughout Europe (well beyond the borders of the EU itself). The time was clearly ripe for a brave new initiative which would not only monitor value change in a changing Europe according to the highest technical standards, but also meticulously (and openly) document the process for the benefit of others in the field. At last rigour, as opposed to speed and cost alone, was firmly back on the agenda.

The pursuit of equivalence

All quantitative research depends for its reliability on what may be called a “principle of equivalence” (Jowell, 1998). For instance, even in national surveys the probability of an individual citizen’s selection in a sample should be equal (or at least known and non-zero) to satisfy the demands of representativeness. Similarly, co-operation or response rates should not vary greatly between different subgroups within a nation if the pursuit of equal representation is to be sustained. Questions should have a broadly equivalent meaning to all respondents to ensure that variations in the data derive from differences in their answers rather than in their interpretation of the questions. Coding schemas must be devised to ensure that it is the codes rather than the coders that account for differences in the distribution of answers. And so on.

A great deal of work in national surveys therefore goes into the sheer process of ensuring that different voices in the population are appropriately represented and taken equally into consideration. Only to the extent that a national survey succeeds in that objective are its findings likely to approximate to some sort of social reality.

But to the extent that these problems of achieving equivalence affect national surveys – since no nation is homogeneous with respect to vocabulary, first language, modes of expression, levels of education, and so on – they are, of course, greatly magnified when it comes to multi-national surveys. For a range of well-documented reasons, most comparative surveys have not entirely succeeded in coming to grips with them. Cultural, technical, organisational and financial barriers have undermined equivalence in comparative studies for at least three decades – from the ‘courtesy bias’ first discovered in South East Asian studies (Jones, 1963), to the recognition that ‘spurious lexical equivalence’ often disguises major differences in meaning (Deutscher, 1968; Rokkan, 1968;
Cseh-Szombathy, 1985). Hantrais and Ager (1985) have argued for more effective cooperation between linguists and social scientists, but – to the extent that this has happened at all – it has not improved things markedly. The fact remains that different languages are not necessarily equivalent means of defining and communicating the same ideas and concepts; they are also reflections of different thought processes, institutional frameworks and underlying values (Lisle, 1985; Harding, 1996; Harkness, 2003).

From the start, comparative researchers were also frustrated by country-specific differences in methodological and procedural habits – such as in their preferred modes of interviewing, their deeply ingrained preferences for different sampling models and procedures, major differences in how they defined ‘acceptable’ response rates, the different ways in which they employed visual aids, variations in their training of interviewers and coders, and their often tailor-made socio-demographic classifications (see, for instance, Mitchell, 1965). Comparative social scientists also soon discovered that certain ‘standard’ conceptualisations of cleavages within one country (such as the left–right continuum, or the liberal–conservative one) had no direct counterpart in another, and that seemingly identical questions about concepts such as strong leadership or strong government, or nationalism or religiosity, tended to be interpreted quite differently in different countries according to their different cultural, social structural and political conditions (Miller et al., 1981; Scherpenzeel and Saris, 1997; Saris and Kaase, 1997).

Many impressive attempts have been made to mitigate these problems, but with patchy results. For instance, having heeded the problems faced by predecessor comparative studies, the International Social Survey Programme (ISSP) started off with strict standardisation in mind (Davis and Jowell, 1989). Although the ISSP did in fact make large strides towards consistency, it was thwarted by an absence of any available central coordinating budget with which to help enhance its equivalence across nations. Each of the (now) 39 national institutions in the ISSP has to find its own annual funds to carry out the survey and although they all ‘agree’ to follow the project’s clearly laid-out methods and procedures, some of them have found themselves unable to comply without stretching the meaning of concepts such as ‘probability sampling’ or ‘no substitution of refusals’. Moreover, unlike the ESS which has the resources to identify such problems in advance and to monitor the implementation of agreed standards, embarrassing variations in the ISSP were discovered only after the event. And despite the heroic efforts by the ISSP secretariat to remedy these problems in subsequent rounds of the survey, some have proved difficult to shift.

These experiences confirmed to the architects of the ESS that, in the absence of appropriate budgetary or executive sway, too many participants in multi-national surveys will inevitably take decisions into their own hands with potentially serious consequences for equivalence and reliability.
One key aspect of the ESF Blueprint was to prove critical in mitigating this problem. A two-pronged approach was devised to help ensure compliance with the ESS’s centrally-determined specification. In the first place, the ever-present Central Coordinating Team is responsible for designing, specifying and monitoring the use of equivalent methods in all nations. Equally, all national funding organisations make their own separate commitments (via the ESF) that they too will ensure compliance on behalf of their selected national teams. It is probably this dual arrangement, above all, that sustains the extent of methodological equivalence which has come to define the ESS.

Inevitably, however, plenty of national deviations still manage to arise. True, most but not all are minor, and most but not all are inadvertent. But in keeping with the project’s spirit of transparency, all such deviations are identified and published at the conclusion of each round of the survey. This practice is by no means designed to ‘name and shame’ those responsible for the deviations. It has two quite different motives. First, it shows to all participants what can go wrong with a view to preventing similar breaches in future rounds; and secondly, potential users of the data have a right to have early knowledge of such deviations in case they affect their analyses, or even their choice of which nations to include in their comparisons.

There is, of course, an almost endless list of potential hazards that can crop up in one corner or another of a large cross-national study – from subtle translation discrepancies to uncharted sampling differences, from esoteric variations in coding conventions to differential context effects, from major response rate variations to more straightforward transcription errors, from variations in ‘standard’ definitions to mundane timetable slippages, and so on. All these hazards can be reduced to a greater or lesser extent, but they cannot, of course, ever be eliminated. All the ESS protocols, which are published on its website, go into meticulous detail to help ensure that these risks are minimised. Practical steps are also taken, such as setting up a standing sampling panel, a methods group and a translation panel to give detailed help on a range of technical issues.

As with all multi-national studies, one of the most difficult tasks is to achieve functionally equivalent translations of questionnaires and other documents. In the case of the ESS, the Blueprint argued for English as the project’s official language – for its meetings as well as all its central documentation. This proposal prevailed. Thus, all original ESS protocols, questionnaires and field materials are formulated in English and subsequently translated by national teams as necessary into their own languages (well over 20 in all) – see chapter 4. Although this practice has a strong whiff of hegemony about it, it is nonetheless a massive administrative convenience for a unified project such as the ESS. But it also has its hazards because certain English phrases (and especially idioms) have no equivalent counterpart in many other languages. On balance, however, operating in a single widely spoken language is surely preferable to the potentially chaotic alternative. And we are fortunate in having
the help of a group of admirably bilingual National Coordinators and their colleagues to prevent the most obvious errors.

We stress these issues to illustrate the numerous inherent obstacles to equivalence that a multi-national survey covering such a large number of heterogeneous countries inevitably faces. Issues of taxonomy, technique, human error, lapses in communication, cultural and political circumstances, and a host of other factors all get in the way of equivalence to a greater or lesser extent. And these difficulties increase with the number and heterogeneity of the countries involved.

Nonetheless, we should not exaggerate the rigidity with which the ESS pursues absolute methodological consistency come what may. Its goal is to achieve equivalent methods and measures, not identical ones. It would, for instance, be wholly unrealistic to require all countries to use precisely the same sampling procedures. Some countries – notably the Nordic countries – have publicly available registers of all individuals which contain details of their demographic and economic characteristics of a sort that would infringe the privacy laws of other countries. Alas, most countries do not constitute such a ‘sampling heaven’, and some have no reliable publicly available list of individuals or addresses at all. To select equivalent probability samples in these very different circumstances necessitates different approaches to the same end. So although the ESS specifications do rigidly require each national sample to be based on random (probability) methods designed to give every resident of that country (not just citizens) an equal (non-zero) chance of selection, each country has to achieve that overall objective taking due account of its particular set of opportunities and obstacles. Working closely with the central sampling panel, this process may involve quite a bit of to-ing and fro-ing before an optimal solution is reached, but in no case has the goal of sampling equivalence been breached (see chapter 2).

The ESS model in practice

The ESS’s three main aims are:

• to produce rigorous data about trends over time in people’s underlying values within and between European nations
• to rectify longstanding deficits in the rigour and equivalence of comparative quantitative research, especially in attitude studies
• to develop and gain acceptance for social indicators, including attitudinal measures, that are able to stand alongside the more familiar economic indicators of societal progress.
If we were ever remotely to fulfil these aims, we required not only a well-formulated model, as provided by the Blueprint document, but also a detailed modus operandi that was demonstrably capable of delivering that model on the ground. This issue loomed large in the initial application to the European Commission for Round 1 funding, submitted in June 2000, which – we reasoned – was not aimed solely at the Commission but also at the many national funding agencies that might soon be called on to fund their own fieldwork and national coordination for the first round. Our plans thus had to stand up to the detailed scrutiny not only of the European Commission’s officers and referees, but also of more than 20 separate national funders. The plans also had to persuade the wider academic community from among whom National Coordinators would subsequently be appointed that it was not only doable but worth doing. And they had to be acceptable to the various national field agencies that would ultimately be asked to implement the plans on the ground. In summary, our initial task was to persuade an unusually large number of knowledgeable and habitually sceptical observers that the ESS was capable of becoming an especially authoritative and influential study, both substantively and methodologically.

It is clearly a long journey from the starting point of even a splendid design to its simultaneous implementation in over 20 countries. In this chapter we briefly summarise not only the range of design characteristics and innovations that we believe have been critical to the success of the ESS, but also the set of structural arrangements that have contributed most to their implementation. Subsequent chapters deal in more detail with many of these topics.

But we should re-emphasise that the detailed design specification for the ESS is not followed in all cases with quite the same precision as it is in others. As noted, some of the inherent difficulties of cross-national studies have proved extremely difficult to solve, and there have been errors of omission and commission en route. The deviations that have occurred are discussed later in this chapter. Thankfully, however, the compliance rate on most of the ESS’s demanding list of requirements is impressive. And for this, a great deal of credit goes to the National Coordinators.

So, as noted, we believe we have achieved more than expected in terms of sampling equivalence between countries. But in aspects of fieldwork, we still have some way to go. Granted, face-to-face interviewing is universally applied in the ESS, as are many other key fieldwork requirements, but the reality is that fieldwork organisations tend to have their own preferred procedures, which even the most well-monitored survey cannot easily influence. For instance, although we specify a maximum number of
respondents per interviewer in order to reduce the impact of interviewer variability on the results, this requirement is often unilaterally abandoned (perhaps appropriately) when it is seen to conflict with the achievement of high response rates. The same reasoning sometimes applies to the stretching of fieldwork deadlines, resulting in a wider than hoped for range of national fieldwork periods.

Continuity

Any multinational time series such as the ESS depends above all not just on a consistent methodology but also on continuity of participation by the nations involved and, of course, on uninterrupted funding. Although in these respects the ESS has been particularly fortunate so far, it has not yet achieved any real security. Instead, it still has to rely on the round by round success of applications for funding both of its coordination and of each country’s participation in the enterprise. So every biennial round of the ESS involves over 25 independent funding decisions – each of which, if negative, could inflict damage on the project as a whole. We hope this may change in EC Framework 7, but we will have to wait and see. Meanwhile, the continuity of national participation and funding throughout the first three rounds of the ESS has admittedly been remarkably smooth.

Table 1.1 shows the pattern of national participation over the first three biennial rounds of the ESS by the 32 countries that have funded and fielded at least one round.
Table 1.1   The 32 ESS participating countries to date (participation in Rounds 1–3)

Austria, Belgium, Bulgaria, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Israel, Italy, Latvia, Luxembourg, Netherlands, Norway, Poland, Portugal, Romania, Russia, Slovakia, Slovenia, Spain, Sweden, Switzerland, Turkey, UK, Ukraine

Notes: Number of countries in Round 1: 22; number of countries in Round 2: 26; number of countries in Round 3: 25–28
In sum, 18 European countries may be described as perennial ESS participants, having taken part in all three rounds to date.¹ Four further countries who joined at Round 2 are also participating in Round 3.² Five further Round 3 joiners will, we hope, sustain their participation into future rounds. And the five remaining participants that failed to obtain funding for Round 2 and/or Round 3 are all determined to remain in the fold and to rectify their funding gap in Round 4 and beyond.

¹ Italy is included in this figure though their funding for Round 3 is still uncertain.
² Iceland and Turkey are included in this figure though their funding for Round 3 is still uncertain.

So although results suggest that we ought perhaps to be confident about the longer-term stability of the ESS, the persistence of the present funding regime – with its multiplicity of independent decision trees – is simply not conducive to a strong sense of security. On a more positive note, some countries have recently managed to secure a longer-term commitment to ESS participation (usually up to two rounds ahead), on condition that the EC’s core-funding of the project – itself subject to a round by round competition – continues to flow. We are delighted to report that an early decision by the Commission to core-fund ESS Round 4 (in 2008/2009) has recently been secured.

The continuity of funding and national participation that the project has enjoyed so far has undoubtedly been a key factor in attracting analysts to its dataset. Not only does the relatively stable range of countries within each round enable cross-national comparisons to be validated, but the repeated participation of over 20 countries enables all-important analyses to be made of changes within and between nations.

Governance

The origins of the unusual governance arrangements of the ESS may be found in its initial Blueprint, though they have been adjusted as necessary to fit the circumstances of a larger and potentially more cumbersome enterprise than had been envisaged. Figure 1.1 summarises the overall organisational structure of the ESS.

At the heart of the governance arrangements are the six institutions listed at the centre and centre-left of Figure 1.1. They constitute the ESS Central Coordinating Team (CCT) (see Note 4 at the end of this chapter), which collectively holds the various grants for the project and takes overall responsibility for the programme of work (see ‘Division of tasks’ in this chapter). But the successful execution of the project at a national level relies equally on the country teams on the right of the Figure (National Coordinators and survey institutes) which ensure that the project is faithfully adapted, translated and carried out to the same exacting standards in all nations.

Figure 1.1   ESS organisational structure

The four bodies at or near the top of Figure 1.1 collectively ensure that the project adheres to or exceeds its ambitious ideals. Chaired by Max Kaase, the Scientific Advisory Board (see Note 5 at the end of this chapter) meets twice a year and has been remarkably stable in its membership. Board members are eminent social scientists from all ESS participating countries, each nominated by their main academic Funding Council. Individually and collectively, they help to steer the ESS in virtuous directions, influencing its key decisions. Moreover, the Board also plays the sole executive role in the selection of specialist Question Module Design Teams, the bodies which help to design one half of the questionnaire at each round.

The Funders’ Forum (see Note 6 at the end of this chapter) consists of senior staff members from each of the national funding bodies (plus the EC and the ESF). It meets less frequently – usually about once a year – and its key role is to monitor the progress of the project and, in particular, its role as a large, long-term multinational investment. It attempts to foresee and prevent unintended funding discontinuities.
The smaller Methods Group (see Note 7 at the end of this chapter) is chaired by Denise Lievesley and consists of four other eminent survey methodologists. It also meets about once a year to tackle the knotty technical and statistical issues that a project of this size and complexity inevitably throws up. They respond admirably to the numerous technical conundrums that are put to them, guiding the CCT towards appropriate solutions. And they advise on the ESS’s methodological programme, injecting new ideas and helping to produce elegant solutions.

As noted, new Question Module Design Teams (see Note 8 at the end of this chapter) are selected at each round to help formulate the rotating elements of the questionnaire, which form nearly one half of its content. This procedure is designed to ensure that the ESS’s content is determined not only by the need for continuity but also by a dynamic ‘bottom-up’ process. An advertisement is placed in the ‘Official Journal’ well before each round starts and it is publicised through National Coordinators within their own countries. It invites multi-national teams of scholars to apply for the chance to help design a (now) 50-item module of questions on a subject of their choosing for the following round of the survey. In general, two such teams are selected by the project’s Scientific Advisory Board, having considered the suitability of the subject and the experience of the prospective team. The successful teams then work closely with the CCT to develop suitable rotating modules for pilot and subsequent fielding in the next round of fieldwork (refer to the Questionnaire section of the ESS website). Seven rotating modules have been fielded to date in one or other of Rounds 1 to 3, and their data are widely quarried by analysts (see the description of Workpackages 8 and 9 later in this chapter).

There were concerns at the start about whether this procedure for designing rotating modules would work. But thanks largely to the quality of the teams selected at each round, and to the astute comments and suggestions we receive from National Coordinators, it has worked very well, extending both the depth and breadth of the project as a whole.

As far as the National Coordinators and survey institutes (see Note 9 at the end of this chapter) are concerned, we are fortunate in having a skilled and committed body of people and organisations who are in all cases appointed and financed by their national academic funding agencies. They collectively represent the leading edge of social survey research practice in Europe. Although their official role is country-specific, they also lend considerable expertise to the project as a whole through a series of National Coordinator meetings and regular email and telephone contact. Their task above all is to ensure that what happens on the ground in their country matches as closely as possible the requirements and expectations of the ESS specification – whether in respect of sampling, translation, fieldwork or coding. As the essential link between the CCT at the
centre and what happens in each nation, they take legitimate credit for bolstering the consistent standards to which the ESS tries to adhere.

Division of tasks

In common with most Commission-funded projects, the ESS work programme is divided in advance into distinct but overlapping ‘workpackages’, each the responsibility of one or more of the CCT institutions. The 11 workpackages are:
Workpackages 1 and 2: Overall project design and coordination

The City University team in London³ is contractually responsible for the design and subsequent delivery of the whole programme of work according to budget and timetable, for initiating and convening team meetings, and for liaison with funders, advisers, National Coordinators and the wider social science community. Although CCT meetings are regular events, most of the coordination and communication naturally takes place outside these meetings. So City acts as the hub of the project and is at the centre of communication and discourse with CCT members, national teams, the project’s many influential advisers, the growing number of scholars in the wider social science community who have an interest in ESS methods and outputs, the project’s core funders (the EC and the ESF), and the many national funding bodies that collectively supply the bulk of the overall budget for the project.

³ Roger Jowell, PI and ESS Coordinator; Rory Fitzgerald; Caroline Roberts; Gillian Eva and Mary Keane. Recent additions to the City team are Daniella Hawkins, Eric Harrison, Sally Widdop and Lynda Sones. In addition, Rounds 1 and 2 would never have got off the ground so smoothly and efficiently in the absence of three former members of staff – Caroline Bryson, Ruth O’Shea and Natalie Aye Maung.

The City team is also responsible for framing the ‘Specification for Participating Countries’, updated at every round, which lays out in meticulous detail the procedures, standards and outputs required for each aspect of the survey’s implementation (see Project Specification section of the ESS website).

But City also has the lead role in questionnaire design at each round of the ESS. While the core questionnaire – which accounts for about one half of the total interview duration – remains as stable as possible from round to round, it is nonetheless continually under review by both the CCT and the Scientific Advisory Board. Limited changes have been introduced at each round, some
to remove or amend demonstrably ‘bad’ items, others to introduce new items on emerging issues. But the very purpose of the core – to measure long-term value changes – requires that we should avoid being fidgety with its content.

The main round by round task of the City team in respect of questionnaire design is to work closely with the respective Question Module Design Teams (QDTs) on the shape and content of the rotating modules for each round – a protracted process involving face-to-face meetings, several drafts of questions, and two pilot studies (in separate countries) to iron out problems. Only after a detailed analysis of the pilot studies, followed by extensive consultations with the QDTs and National Coordinators, is the module eventually ‘put to bed’ and sent out for translation into multiple languages (refer to the Questionnaire section of the ESS website). The whole questionnaire design process, including its various interim ‘conclusions’, is documented as it takes place and made available on the web immediately so that National Coordinators and others can join the discussions ‘in real time’ and have their say.
Workpackage 3: Sampling

The Sampling Panel (see Note 10 at the end of this chapter) is convened by Sabine Häder at ZUMA4 and has three other specialist members. The ESS has an unusual and innovative sampling specification which requires, among other things, that each country aim for the same ‘effective sample size’, not necessarily the same nominal sample size (see chapter 2). So it is not just the anticipated response rate that a National Coordinator and the Sampling Panel have to take into account in determining the starting number of individuals (or addresses) to select, but also the ‘design effects’ that their chosen design will generate – a function of its extent of clustering. The greater the degree of clustering in the sample design, the larger the starting sample size must be.

It is the Sampling Panel’s role to ensure that these ‘rules’ are closely adhered to. To help achieve this, the Panel allocates each of its individual members to work with a particular set of countries, ensuring that each country has a single named adviser to consult with as necessary. Where the situation requires it, this adviser will travel to the country concerned to investigate possibilities and help find solutions. In any event, each national sample design has in the end to be ‘signed off’ by the Sampling Panel before it is adopted and implemented. We are confident that by these means the ESS achieves equivalent random samples of an unusually high standard. Each national sample is designed to be a probability sample of all residents in that country (not just of its citizens) who are aged 15 and over (with no upper age limit). For full details of definitions and precise sampling procedures, see the Methodology section of the ESS website.

4 The ZUMA team as a whole consists of Peter Mohler, Janet Harkness, Sabine Häder, Achim Koch and Siegfried Gabler. Recent additions to the team are Annelies Blom, Matthias Ganninger and Dorothée Behr.
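The arithmetic this implies can be sketched briefly. The function and figures below are illustrative only (effective sample size targets, design effect predictions and response rate assumptions are set country by country in the ESS Specification), but they show how clustering and anticipated non-response jointly determine how many individuals or addresses have to be issued to the field:

import math

def issued_sample_size(n_effective, deff, response_rate, eligibility_rate=1.0):
    """Number of persons or addresses to issue so that, after non-response and
    after allowing for design effects, the achieved sample still delivers the
    target effective sample size (the precision of an equivalent simple random sample)."""
    return math.ceil(n_effective * deff / (response_rate * eligibility_rate))

# Illustrative figures, not actual ESS parameters: a clustered design with a
# predicted design effect of 1.3 and an anticipated 65 per cent response rate.
print(issued_sample_size(n_effective=1500, deff=1.3, response_rate=0.65))  # 3000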
Workpackage 4: Translation

The Translation Taskforce (see Note 11 at the end of this chapter), chaired by Janet Harkness at ZUMA, exists for similar reasons but is not in a position to sign off every translation in every country. As noted, the questionnaire is drafted in English and is then translated not only into each country’s majority language(s) but into every language spoken in that country as a first language by more than five per cent of the population. So several countries – not just the usual suspects such as Switzerland and Belgium – have to translate the source questionnaire into more than one language. The role of the Translation Taskforce is to design, implement and continually refine the protocols and procedures for ensuring equivalent translations, as well as to advise and guide National Coordinators on problems as they arise (see chapter 4).

To facilitate the work of the translators, reviewers and adjudicators who are assembled in each country for the purpose of turning all source questions into their own language(s) without changing their meaning, all identifiably ambiguous words and phrases in the source questionnaire are ‘annotated’ in advance with a brief description of their underlying meaning. For instance, one of the batteries of questions in the ESS is designed to measure political tolerance and asks respondents how much they agree or disagree with a series of statements. One of those statements is: “Political parties that wish to overthrow democracy should be banned.” Because of the potential ambiguity of the word ‘democracy’ in this context, the pre-translation source questionnaire contains the following annotation to help translators find an equivalent form in their own language: “ ‘Democracy’ here refers to an entire system or any substantial part of a democratic system, such as the government, the broadcasting service, the courts, etc.” Similarly, the question “How often do you meet socially with friends, relatives or work colleagues?” is accompanied in the pre-translation source questionnaire by the annotation “ ‘Meet socially’ implies meet by choice rather than for reasons of either work or pure duty.” These annotations and many others do not of course appear on the final translated questionnaire, since they are certainly not for the interviewers’ use. Rather they are available solely for translators to help them find the most suitable equivalent phrase in their language.

The protocols for translation lay down procedures about what should be done and not done in reaching conclusions on equivalent translations, and how to resolve difficulties. They also give detailed guidance on when and how to use translations that have already been made into the same language by another ESS country (there are many more language overlaps among ESS countries than we had casually anticipated). For full details of the content of the protocols, refer to the Translation Guidelines in the ESS documents part of the ESS website.
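The division between what translators see and what interviewers see can be pictured schematically. The structure below is a hypothetical illustration of our own, not the format the ESS itself uses; the annotation text is the example quoted above:

# Hypothetical representation of an annotated source item; field names are invented.
source_item = {
    "source_text": "Political parties that wish to overthrow democracy should be banned.",
    "annotation": ("'Democracy' here refers to an entire system or any substantial part "
                   "of a democratic system, such as the government, the broadcasting "
                   "service, the courts, etc."),
}

def text_for_translators(item):
    # Translators receive the source wording together with its annotation.
    return f"{item['source_text']}\n[Note to translators: {item['annotation']}]"

def text_for_interviewers(item):
    # The annotation is stripped from the questionnaire that interviewers administer.
    return item["source_text"]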
Workpackage 5: Commissioning fieldwork

Although the selection of fieldwork agencies in each country is the responsibility of the national funding body together with the National Coordinator, the process is coordinated and documented by Ineke Stoop5 at the Social and Cultural Planning Office in The Netherlands. We decided early on not to entrust the fieldwork to a single multi-national supplier, but rather to leave each country to find its preferred supplier and try to ensure that it adhered to the Specification. (In some countries we anticipated that the preferred supplier would be the National Statistical Institute, and so it proved.)

But a hazard of remote management of the sort that characterises multi-national surveys is that the longer the chain of command, the likelier it is to break down. In this context, we had to ensure that the survey houses, not just the National Coordinators, knew precisely what they would be signing up to before the project was costed. So the Specification for Participating Countries is provided to every potential fieldwork supplier as part of their invitation to bid for the contract. As noted, it contains details such as the required sampling procedure and size, the target response rate, the maximum number of sampled people/addresses which may be assigned to any one interviewer, the number of calls required before a sampled address may be abandoned, the maximum acceptable proportion of non-contacts, and much besides. These explicit specifications give field agencies advance knowledge of the size and nature of the task they are committing themselves to, helping them to avoid under-costing and, as a result, an inevitable lowering of standards.

As always, the quality of a survey project of this nature depends critically on the quality of its fieldwork. So a great deal rests on the survey houses, which are at one step removed from the central project management. To mitigate this potential problem, the CCT has to work closely with National Coordinators as early as possible in the process, helping to ensure that their communication with field agencies is as clear and comprehensive as possible. Once the contract has been completed, an English summary of it is passed on to the CCT.

It must be stressed that none of these measures indicates the least lack of confidence in either the National Coordinators or the fieldwork agencies, both of whom do their jobs conscientiously and with consummate skill. They are introduced because – on the basis of the experience of other multi-national surveys – things do go wrong, and trying to correct them post hoc is usually either less effective or plain impossible.

5 Recent additions to the team are Thomas van Putten, Paul Dekker, Peter Tammes and Jeroen Boelhouwer.
Workpackage 6: Contract adherence

The task of monitoring and helping to ensure contract adherence in all aspects of national performance is the responsibility of Achim Koch at ZUMA, recently with the help of Annelies Blom. There is, of course, a fine balance to be struck between policing on the one hand and persuasion on the other. Although the Specification for Participating Countries contains details of the respective responsibilities of National Coordinators, field agencies and the CCT itself, the project ultimately stands or falls according to how closely the specification is adhered to. To assess this, and where necessary to remedy it, close monitoring is essential, together with readily available support when it is required.

A series of questionnaires filled in by National Coordinators – on the progress of sampling, translation and fieldwork subcontracting – provides the raw material for progress monitoring. Signs of potential non-compliance with the specification, nearly always inadvertent, may be picked up in the process and rectified before it is too late. Similarly, unanticipated difficulties on the ground, such as late fieldwork or lower than predicted response rates, may be discovered and discussed. Many of these problems cannot, of course, be rectified, but some can and others provide the sort of accumulated intelligence that allows a time series to improve round by round. For instance, from Round 2 onwards, National Coordinators have been providing a projection of the number of interviews expected to be completed per week as a benchmark against which to chart progress within and between countries.

Naturally, some national variation is inevitable and even appropriate. Certain deviations from the Specification are necessary to comply with domestic laws, conventions or circumstances. For instance, the standard practice in the ESS of recording brief descriptions of non-respondents’ neighbourhoods is apparently contrary to data protection laws in some countries. So it cannot be pursued. By the same token, fieldwork has been delayed in certain countries because it would have clashed with the build-up to a national election, or simply because of hold-ups in funding. Other deviations come to light only once the data from a country is scrutinised as part of the archiving procedures. In these cases the deviation is flagged both in the end of grant report (Central Coordinating Team, 2004) and in the dataset itself so that data users are aware of it. This sort of openness about the project’s shortcomings (rather than ignoring or suppressing them) may
make the ESS seem like the most error-prone survey of all time; but so be it. In extreme cases, we not only flag the issue but also remove the offending variable from the combined data file (noting that we have done so), thus preventing inattentive data users from mistakenly treating it as equivalent. An example was a source question containing the word “wealthy”, which was inadvertently translated in one country as “healthy” – with predictable consequences for the findings in that country. In similar vein, the CCT discovered that one country had mistakenly excluded non-citizens from its sampling frame. But this error could be rectified in time by interviewing a separate, appropriately sized random sample of resident non-citizens and merging the data into the main dataset.

As noted, we are able to report with some relief that the number of serious deviations in any round of the ESS so far has been small. And for this a great deal of the credit goes to the National Coordinators. In any case, all countries do indeed employ strict probability sampling methods, all countries do conduct translations for minority languages spoken as a first language by at least five per cent of the population, and all countries do conduct face-to-face interviews (apart from a few agreed experimental treatments by telephone). But overall, compliance is lower when it comes to fieldwork procedures because they are inherently more difficult to influence remotely. Although of course face-to-face interviewing is universal in the ESS, as are most other key fieldwork practices, it is nonetheless the case that many fieldwork organisations have their own preferred procedures which even the most well-monitored survey cannot easily influence. Notably, fieldwork in some countries has run significantly over the deadline, usually reflecting the fragility of their national funding arrangements. Similarly, several countries fail to get even close to the ESS’s specified target response rate of 70 per cent (see chapter 6). Nonetheless, ESS response rates are generally higher or much higher than those achieved in similar social surveys in the same countries, suggesting perhaps that even an unattainable target can in some circumstances be an effective motivator.

The conclusion we draw from the balance we have struck between persuasion on the one hand and policing on the other is that both are essential to some extent in a dispersed multinational survey such as the ESS. We have found, for instance, that deviations from standard practice have been most in evidence when central attention to those practices has been least in evidence (such as our failure to monitor the laid-down maximum size of interviewer assignments in Round 1). Our experience also suggests that both top-down and bottom-up improvements can be introduced, sometimes even during a particular round, but certainly between rounds. So at the start of each round the CCT draws the attention of National Coordinators to the deviations that occurred in the previous round, alerting
them to possible pitfalls and how to avoid them. But individual countries are also encouraged and helped to introduce their own measures to improve standards. Switzerland, for example, has made heroic (and successful) efforts to raise its very poor Round 1 response rate significantly in Round 2, while The Netherlands has made important strides in testing how to motivate both respondents and interviewers, using a series of alternative incentives.
Workpackage 7: Piloting and data quality

This work is led by Jaak Billiet at the University of Leuven. In Rounds 1 and 2, he was ably supported by two outstanding PhD researchers within the university.6 As noted, having been drafted and re-drafted, the ESS final draft questionnaire then goes through a simultaneous pilot in two countries, one of which is an English-speaking country. The Round 1 pilot took place in the UK and The Netherlands, the Round 2 one in the UK and Poland, and the Round 3 one in Ireland and Poland. The sample size for the pilots is around 400 per country, sufficiently large to test scales and questions. Although their primary purpose is to test the rotating modules at each round, several other questions are included in the pilots either to investigate new issues or as independent variables in the data analysis. Some of the questions also go through a ‘Multitrait Multimethod’ analysis to test alternative versions of the same basic question. The pilot data are thoroughly analysed by both the CCT and the QDTs, after which the questions are re-appraised in the light of the results.

The prime motive behind the ESS is to provide for scholars in Europe and beyond a regular and rigorous set of comparative datasets as a basis for measuring and analysing social change. So the achievement of data quality has always been a primary concern. There are, of course, numerous components of data quality – among them a representative sample, well-honed questions, skilled interviewing, harmonised coding, and many others. But the University of Leuven’s main focus has been on how to measure and mitigate the potentially damaging effects of different response rates in different countries. This work takes place against the background of falling response rates over the years – a trend that in many cases has simply been accepted as a sign of the times and as yet another instance of the democratic deficit at work. But inertia on the part of social scientists in relation to issues as serious as this would slowly undermine the reliability of quantitative social measurement.

6 Michel Phillippens and Stefaan Pleysier. Other members of the ESS team at the University of Leuven include Silke Devacht, Geert Loosveldt and Martine Parton.
It was with this in mind that the ESS has from the start set a target response rate for all countries of 70 per cent. We realised, of course, that this target was unlikely to be universally achieved, but we introduced it in the hope that it would nonetheless help to raise the bar significantly. And, on all the available evidence to date, this is precisely what has happened. The most important requirement is, of course, that contracts with survey houses incorporate the means of achieving high response rates, and that they are appropriately budgeted for. These include a minimum duration of fieldwork that allows enough time to find elusive respondents, and call patterns which ensure that unsuccessful visits to addresses are repeated at different times of day and on different days of the week (see chapter 6). But simply setting such standards does not guarantee they will be faithfully followed in all cases. Also necessary is an objective means of monitoring, documenting and analysing what happens on the ground, providing regular checks of the process and – in the longer term – helping to improve it round by round (Lynn, 2003). So for each call at each address throughout the course of fieldwork, interviewers are required to complete a detailed ‘Contact Form’ containing valuable information about the interaction between interviewers and the addresses they visit. It is this unique set of records that informs the Leuven team’s analyses of response rates, refusals and non-contacts, and which guides the CCT (and the wider survey research community) on strategies for arresting and perhaps reversing the downward trend in response rates.
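It is these contact-form records that make outcome rates comparable across countries. The sketch below is a deliberately simplified illustration of the kind of calculation involved; the disposition codes and the data are invented, and the real ESS contact forms record far more detail about every call:

from collections import Counter

# Simplified final dispositions for a handful of sampled units (invented data).
dispositions = ["interview", "refusal", "noncontact", "ineligible",
                "interview", "interview", "refusal", "noncontact", "interview"]

counts = Counter(dispositions)
eligible = sum(n for code, n in counts.items() if code != "ineligible")

response_rate = counts["interview"] / eligible
refusal_rate = counts["refusal"] / eligible
noncontact_rate = counts["noncontact"] / eligible

print(f"response {response_rate:.0%}, refusal {refusal_rate:.0%}, "
      f"non-contact {noncontact_rate:.0%}")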
Workpackages 8 and 9: Question reliability and validity

Work on establishing the reliability and validity of individual ESS questions has from the start been the responsibility of Willem Saris and Irmtraud Gallhofer, formerly at the University of Amsterdam and now at ESADE Business School at the Universitat Ramon Llull, Barcelona. Discussions about the possible content of the perennial core ESS questions started well in advance of the publication of the Blueprint document. In the end, a consensus emerged that three broad themes should be included in the core:

• People’s value orientations (their world views and socio-political standpoints).
• People’s cultural/national orientations (sense of attachment to various groups and their feelings towards outgroups).
• The underlying social structure of society (socio-economic and socio-demographic characteristics).

Within each of these broad areas we identified a larger number of sub-areas and then commissioned academic specialists in each field to prepare a paper based on a literature review recommending what questions in each sub-area they would regard as essential components of the proposed ESS.
Not unexpectedly, several of the desired topics turned out to lack appropriate or well-honed existing questions. But the papers nonetheless provided an excellent background for theory-based questionnaire construction. A draft core questionnaire for Round 1 was eventually produced containing all the proposed essential elements, some of which were represented by ‘classic’ questions, others by freshly minted ones. This draft, together with the drafts of the two rotating modules for Round 1, then went through a number of checks and tests before being adopted. They included ‘predictions’ of their measurement qualities based on each question’s basic properties and making use of the Survey Quality Program, or SQP (Saris et al., 2004). A two-nation pilot and its subsequent analysis followed, in which alternative items were tested for reliability and validity using the Multitrait Multimethod technique (Scherpenzeel and Saris, 1997) (see chapter 3, and the brief sketch following the lists below). All subsequent rotating modules and new core questions have gone through a similar process before being finalised. Although these measures do not help us to make choices between topics, they certainly help to guide question and scale construction. The body of work will in time also help to identify and, we hope, rectify problems with translation from the English source questionnaire into other languages, as well as enabling differential measurement error between countries to be neutralised.

The core questionnaire contains the following broad list of topics:

- Trust in institutions
- Political engagement
- Socio-political values
- Social capital, social trust
- Moral and social values
- Social exclusion
- National, religious, ethnic identities
- Well-being and security
- Demographic composition
- Education and occupation
- Financial circumstances
- Household circumstances

The rotating modules fielded to date are:

Round 1
- Immigration
- Citizen engagement and democracy

Round 2
- Family, work and well-being
- Economic morality
- Health and care-seeking

Round 3
- Indicators of quality of life
- Perceptions of the life course
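The MTMM logic referred to above can be illustrated with simulated data. The sketch below is our own illustration rather than the ESS analysis itself (which fits formal true-score models, as described in chapter 3): two latent traits are each measured with two response formats, and repetitions of the same trait across formats correlate more strongly than different traits sharing a format.

import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two latent traits (e.g. two attitudes), moderately correlated.
t1 = rng.normal(size=n)
t2 = 0.4 * t1 + np.sqrt(1 - 0.4 ** 2) * rng.normal(size=n)

# A 'method' factor shared by items asked with the same response format.
m_a = rng.normal(size=n)   # method A, e.g. an 11-point scale
m_b = rng.normal(size=n)   # method B, e.g. a 4-point scale

def item(trait, method, validity=0.85, method_load=0.30, noise=0.45):
    # Observed answer = trait component + method component + random error.
    return validity * trait + method_load * method + noise * rng.normal(size=n)

items = np.column_stack([item(t1, m_a), item(t2, m_a),   # traits 1 and 2, method A
                         item(t1, m_b), item(t2, m_b)])  # traits 1 and 2, method B
corr = np.corrcoef(items, rowvar=False)

# Same trait measured by different methods should correlate more strongly than
# different traits measured by the same method.
print("trait 1, methods A and B:", round(corr[0, 2], 2))   # roughly 0.7
print("traits 1 and 2, method A:", round(corr[0, 1], 2))   # roughly 0.4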
Workpackage 10: Event monitoring

A time series that monitors changes in attitudes can certainly not afford to assume that attitudes exist in a vacuum. They change over time in response to events and to a range of other factors. A simplifying assumption sometimes made is that within a particular round of a time series the impact of events is likely to be rather small. But while this may be true enough in a national survey, it is much less likely to be true in a multi-national survey involving a large number of disparate countries. This was the reason that we introduced event monitoring (see chapter 5) into each round of the ESS. Its purpose in essence is to describe and record the primary short-term events that might create turbulence in the trend-lines of different countries. Unless these events are charted, future analysts might find it difficult to explain why certain apparently inexplicable blips had occurred. From the start, this work has been initiated and coordinated by Ineke Stoop at the Social and Cultural Planning Office in the Netherlands.

The 9/11 attack on New York or the Chernobyl nuclear power disaster are examples of events that deeply affected attitudes and perceptions worldwide. But lesser events, such as a rash of serious crimes in a particular country, or even moments of political turbulence (such as a national election) can also have an impact on public opinion at a national level, so they also ought to be monitored and recorded. Because no budget had been allocated to this process, a somewhat rudimentary system of event recording was devised for Round 1, which has since been upgraded and is soon to be upgraded further. So far, however, it has been up to National Coordinators to compile regular reports of any events just prior to or during fieldwork that receive ‘prominent’ attention in the national press (such as repeat front page coverage or sustained coverage in later pages). Relevant events are those that might conceivably have an impact on responses to ESS questions. These events are allocated to fixed categories with keywords, a short description, plus start and end dates. Although now under consideration for an overhaul, the basic system of event recording devised for the early rounds of the project has served its purpose well and has been greatly welcomed by many data users.
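A minimal illustration of what such an event record might contain follows. The field names and the example are invented for illustration; the actual reporting template completed by National Coordinators is defined centrally and may differ:

from dataclasses import dataclass
from datetime import date

@dataclass
class EventReport:
    # One nationally reported event: a fixed category, keywords, a short
    # description, and start and end dates, as described in the text.
    country: str
    category: str
    keywords: list
    description: str
    start: date
    end: date

example = EventReport(
    country="XX",                                 # invented country code
    category="national politics",                 # one of the centrally agreed categories
    keywords=["election", "government formation"],
    description="Campaign and national election held shortly before fieldwork.",
    start=date(2004, 10, 1),
    end=date(2004, 11, 15),
)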
Workpackage 11: Data access and aids to analysis

The large corpus of activities that falls under this heading is the responsibility of the team at Norwegian Social Science Data Services (NSD) in Bergen.7

7 Bjørn Henrichsen, Knut Kalgraff Skjåk, Kirstine Kolsrud, Hilde Orten, Trond Almendingen, Atle Jastad, Unni Sæther, Ole Voldsæter, Atle Alvheim, Astrid Nilsen, Lars Tore Rydland, Kjetil Thuen and Eirik Andersen.
In strict accordance with the original motivation behind the ESS, the combined dataset containing the data from all participating countries is made available free of charge to all as soon as is practically possible (see chapter 7). As noted, neither the CCT members, nor the National Coordinators, nor the QDTs, nor for that matter anyone else is granted prior access to the dataset, except for checking and quality control purposes. So none of the key players in the design, formulation and execution of any round of the ESS has had the all too familiar ‘lead time’ that enables them to quarry the data and prepare publications earlier than others. Instead, the overriding principle of the ESS model is that it should speedily provide a high-quality public-use dataset that is freely and easily accessible on-line to all comers. This policy may help to explain why, in less than three years since the initial (Round 1) dataset was released, the ESS has already acquired more than 10,000 registered users.

All that potential users have to do is to enter the data website (http://ess.nsd.uib.no) and provide their name and email address. They are then granted immediate access to the fully documented dataset, along with a considerable amount of metadata. The NESSTAR distribution system enables users to browse on-line and to produce tabulations more or less at the touch of a button. But it also allows them instantly to download all or parts of the dataset in a number of formats for subsequent more complex analyses. More than 50 per cent of the 10,000 registered users have used this download facility. Each fully documented ESS dataset so far has been made publicly available on the web within nine months of each Round’s fieldwork deadline. But the initial releases do not, of course, include the small minority of countries whose fieldwork has run exceptionally late in that round. Their data are in each case merged into the combined dataset later.

As Kolsrud and Skjåk (2004) have observed, it is the ESS’s central funding and organisational structure that has enabled it to produce a set of equivalent measures which should be the hallmark of quality in a multi-national dataset. This is largely a product of input harmonisation, including adherence to internationally accredited standards, which is achieved by giving National Coordinators on-line access to the documentation, standards, definitions and other tools required to ensure such adherence. Flaws are thereby minimised. (See the ESS Data Protocol in the Archive and Data section of the ESS website for a comprehensive guide to the required procedures for depositing data and the accompanying documentation.)

The data website also contains information about the socio-cultural context in each country. Based on data assembled at a national level, the site contains population statistics on age and gender, education and degree of urbanisation and – in response to requests from Questionnaire Design Teams – specific background statistics that help set the context for their particular modules, such as
the racial composition of the population or levels of immigration. NSD also routinely adds data on national elections, GDP and life expectancy and provides links to the SCP-compiled Event Data referred to earlier. A guide to sources of pan-European context data, compiled by SCP Netherlands, is also provided on the site.

As well as the organisation, archiving, and provision of access to successive rounds of ESS datasets, NSD is also responsible for ‘EduNet’ (http://essedunet.nsd.uib.no), an on-line training facility that makes use of the ESS datasets to introduce and guide new researchers through a range of data analysis methods. So far, the EduNet facility has utilised the Human Values Scale, a regular part of the ESS core, and the Round 1 rotating module on citizenship, on which chapters 8, 9 and 10 all draw.

Conclusion

All surveys are judged by the analytical power they provide. On the evidence of the very large body of serious data users that the ESS dataset has already attracted, we hope that the project comfortably passes this test and will continue to do so. But we also hope that the ESS will be judged by the methodological contributions it is making to the design and conduct of large-scale comparative surveys more generally (see chapter 11).

The project is after all a product of a bottom-up process from within the European social science community. It was the ESF – representing almost all national academic science funders in Europe – that provided the project’s initiators with the seed money to investigate the possibilities, and then further seed money to turn a seemingly plausible idea into a working reality. The ESF has also steadfastly continued to finance and service the project’s Scientific Advisory Board – which, as noted, is by no means a token institution. And the national academic science funding agencies provide the lion’s share of the total cost of the ESS through their continued financing of domestic coordination and fieldwork. The project is particularly fortunate in being able to benefit not only from the financial support of these bodies, but – as importantly perhaps – from the authority they give to the ESS and its methods.

But the ESS has also been very fortunate to enjoy long-term core support and finance from the EC. In the absence of this central funding, the project would have been still-born. Not only has the Commission already agreed to fund four biennial rounds of the survey to date, stretching its central funding from 2001 to at least 2009, but it has also more recently agreed to provide large-scale ‘Infrastructure’ support for the project until 2010. This new form of support for the social sciences is quite different in character from the round-by-round support that the project has enjoyed so far. It is instead
designed to encourage more outreach, innovation, training and methodological work, and to extend and enhance the ESS’s existing access provision. Its budget is supporting a modest expansion of the staff in the CCT institutions, the addition of a new partner institution of the CCT – the University of Ljubljana, Slovenia – and a range of new organisational and methodological enhancements. It is a welcome recognition – just as was the award to the ESS of the Descartes Prize in 2005 – that ‘big’ social science has similar characteristics and similar needs to those of ‘big’ science.

The real heroes of the ESS, however, are the (so far) three sets of around 35,000 respondents at each round, spread across a continent, who have voluntarily given their time and attention to our endless questions. We owe them a huge debt of gratitude.

Notes

1. Members of the Expert Group were: Max Kaase, Chair; Bruno Cautrès; Fredrik Engelstad; Roger Jowell; Leif Nordberg; Antonio Schizzerotto; Henk Stronkhorst; John Smith, Secretary.
2. Members of the Steering Committee were: Max Kaase, Chair; Rune Åberg; Jaak Billiet; Antonio Brandao Moniz; Bruno Cautrès; Nikiforos Diamandouros; Henryk Domanski; Yilmaz Esmer; Peter Farago; Roger Jowell; Stein Kuhnle; Michael Laver; Guido Martinotti; José Ramón Montero; Karl Müller; Leif Nordberg; Niels Ploug; Shalom Schwartz; Ineke Stoop; Françoise Thys-Clement; Niko Tos; Michael Warren; John Smith, Secretary.
3. Members of the Methodology Committee were: Roger Jowell, Chair; Jaak Billiet; Max Kaase; Peter Lynn; Nonna Mayer; Ekkehard Mochmann; José Ramón Montero; Willem Saris; Antonio Schizzerotto; Jan van Deth; Joachim Vogel.
4. The institutional grant-holders and senior members of the Central Coordinating Team (CCT) are: Roger Jowell (City University, London, UK), Principal Investigator; Jaak Billiet (University of Leuven, Belgium); Bjørn Henrichsen (NSD, Bergen, Norway); Peter Mohler (ZUMA, Mannheim, Germany); Ineke Stoop (SCP, The Hague, Netherlands); and Willem Saris (University of Amsterdam, Netherlands, now at ESADE Business School, Universitat Ramon Llull, Barcelona). Institutional membership has recently expanded to include Brina Malnar at the University of Ljubljana, Slovenia, assisted by Vasja Vehovar, Tina Zupan and Rebeka Falle.
5. Members of the SAB are: Max Kaase, Chair; Austria: Anton Amann; Belgium: Piet Bracke and Pierre Desmarez; Bulgaria: Atanas Atanassov; Cyprus: Kostas Gouliamos; Czech Republic: nomination pending; Denmark: Olli Kangas; Estonia: Dagmar Kutsar; Finland: Matti Heikkilä; France: Bruno Cautrès; Germany: Ursula Hoffmann-Lange; Greece: John Yfantopoulos; Hungary: Gergely Böhm; Iceland: Stefán Ólafsson; Ireland: Seán Ó Riain; Israel: Shalom Schwartz; Italy: Guido Martinotti; Latvia: Aivars Tabuns; Luxembourg: Andrée Helminger; Netherlands: Jacques Thomassen; Norway: Ann-Helén Bay; Poland: Henryk Domanski; Portugal: Manuel Villaverde Cabral and João Ferreira de Almeida; Romania: nomination pending; Russia: Vladimir Magun; Slovak Republic: L’ubomir Falt’an; Slovenia: Niko Tos; Spain: José Ramón Montero; Sweden: Robert Erikson; Switzerland: Peter Farago; Turkey: Yilmaz Esmer; Ukraine: Eugene Golovakha; United Kingdom: Jacqueline Scott; European Commission: Virginia Vitorino and Andrea Schmölzer; European Science Foundation: Henk Stronkhorst and Gün Semin.
6. Funders’ Forum members: Austria: Richard Fuchsbichler; Belgium: Benno Hinnekint and Marie-José Simoen; Bulgaria: Atanas Atanassov; Cyprus: Spyros Spyrou and Antonis Theocharous; Czech Republic: nomination pending; Denmark: Lars Christensen; Estonia: Reesi Lepa; Finland: Helena Vänskä; France: Roxane Silberman; Germany: Manfred Niessen; Greece: John Yfantopoulos; Hungary: Katalin Pigler; Iceland: Fridrik Jónsson; Ireland: Fiona Davis; Israel: Bob Lapidot; Italy: Anna D’Amato; Latvia: Maija Bundule; Luxembourg: Ulrike Kohl; Netherlands: Ron Dekker; Norway: Ingunn Stangeby; Poland: Henryk Domanski and Renata Kuskowska; Portugal: Ligia Amâncio and Olga Dias; Romania: Ioan Dumitrache; Russia: Vladimir Andreenkov; Slovak Republic: Dusan Kovac and Daniela Kruzinská; Slovenia: Peter Debeljak and Ida Pracek; Spain: Martin Martinez Ripoll; Sweden: Rune Åberg and Rolf Höijer; Switzerland: Brigitte Arpagaus; Turkey: Bilal Ahmetceoglu; Ukraine: Natalia Pohorilla; United Kingdom: Stephen Struthers; European Commission: Virginia Vitorino; European Science Foundation: Henk Stronkhorst and Gün Semin.
7. Members of the Methods Group are: Denise Lievesley, Chair; Norman Bradburn; Vasja Vehovar; Paolo Garonna; and Lars Lyberg.
8. Round 1 Questionnaire Design Teams and topics: ‘Citizenship, involvement and democracy’: Ken Newton; Hanspeter Kriesi; José Ramón Montero; Sigrid Rossteutscher; Anders Westholm; and ‘Immigration’: Ian Preston; Thomas Bauer; David Card; Christian Dustmann; James Nazroo. Round 2 Questionnaire Design Teams and topics: ‘Family, work and social welfare in Europe’: Robert Erikson; Janne Jonsson; Duncan Gallie; Josef Brüderl; Louis-André Vallet and Helen Russell; ‘Opinions on health and care seeking’: Sjoerd Kooiker; Nicky Britten; Alicja Malgorzata Oltarzewska; Jakob Kragstrup; Ebba Holme Hansen; and ‘Economic morality’: Susanne Karstedt; Stephen Farrall; Alexander Stoyanov; Kai Bussman and Grazyna Skapska. Round 3 Questionnaire Design Teams and topics: ‘Personal and social well-being’: Felicia Huppert; Andrew Clark; Claudia Senik; Joar Vitterso; Bruno Frey; Alois Stutzer; Nic Marks; Johannes Siegrist; ‘The timing of life: the organisation of the life course’: Francesco Billari; Gunhild Hagestad; Aart Liefbroer and Zsolt Spéder.
9. National Coordinators and countries: Austria: Karl Müller; Belgium: Geert Loosveldt and Marc Jacquemain; Bulgaria: Lilia Dimova; Cyprus: Spyros Spyrou and Antonis Theocharous; Czech Republic: Klára Plecitá-Vlachova; Denmark: Torben Fridberg; Estonia: Kairi Talves; Finland: Heikki Ervasti; France: Daniel Boy, Bruno Cautrès and Nicolas Sauger; Germany: Jan van Deth; Greece: Yannis Voulgaris; Hungary: Peter Robert; Iceland: Fridrik Jónsson; Ireland: Susana Ferreira; Israel: Noah Lewin-Epstein; Italy: Sonia Stefanizzi; Luxembourg: Monique Borsenberger and Uwe Warner; Netherlands: Harry Ganzeboom; Norway: Kristen Ringdal; Poland: Pawel Sztabinski; Portugal: Jorge Vala; Romania: Mihaela Vlasceanu and Catalin Augustin Stoica; Russia: Anna Andreenkova; Slovak Republic: Jozef Vyrost; Slovenia: Brina Malnar; Spain: Mariano Torcal; Sweden: Mikael Hjerm; Switzerland: Dominique Joye; Turkey: Yilmaz Esmer; Ukraine: Andrii Gorbachyk; United Kingdom: Alison Park.
10. Sampling Panel members: Sabine Häder; Siegfried Gabler; Seppo Laaksonen; Peter Lynn.
11. Translation Panel members: Janet Harkness; Paul Kussmaul; Beth-Ellen Pennell; Alisu Schoua-Glousberg; Christine Wilson.
References

Almond, G. and Verba, S. (1963), The civic culture: political attitudes in five nations, Princeton: Princeton University Press.
Barnes, S. and Kaase, M. et al. (1979), Political action: mass participation in five western democracies, Beverly Hills: Sage.
Central Coordinating Team, City University (unpublished, 2004), European Social Survey: Round 1: End of Grant Report, July 2004.
Cseh-Szombathy, L. (1985), ‘Methodological problems in conducting cross-national research on lifestyles’ in: L. Hantrais, S. Mangen and M. O’Brien (eds), Doing Cross-National Research (Cross-National Research Paper 1), Birmingham: Aston University, pp.55–63.
Davis, J. and Jowell, R. (1989), ‘Measuring national differences: An introduction to the International Social Survey Programme (ISSP)’ in: R. Jowell, S. Witherspoon and L. Brooks (eds), British Social Attitudes: Special International Report, Aldershot, UK: Gower, pp.1–13.
Deutscher, I. (1968), ‘Asking questions cross-culturally: Some issues of linguistic comparability’ in: H.S. Becker, B. Geer, D. Riesman and R.S. Weiss (eds), Institutions and the Person, Chicago: Aldine, pp.318–341.
Durkheim, E. (1964), The rules of the sociological method, 8th edition, New York: The Free Press.
ESF (European Science Foundation), Standing Committee of the Social Sciences (1996), The European Social Survey: Report of the SCSS Expert Group (April, 1996), Strasbourg: European Science Foundation.
ESF (European Science Foundation), Standing Committee of the Social Sciences (1999), The European Social Survey (ESS) – a Research Instrument for the Social Sciences in Europe: Summary, Strasbourg: ESF.
Hantrais, L. and Ager, D. (1985), ‘The language barrier to effective cross-national research’ in: L. Hantrais, S. Mangen and M. O’Brien (eds), Doing Cross-National Research (Cross-National Research Paper 1), Birmingham: Aston University, pp.29–40.
Harding, A. (1996), ‘Cross-national research and the “new community power”’ in: L. Hantrais, S. Mangen and M. O’Brien (eds), Doing Cross-National Research (Cross-National Research Paper 1), Birmingham: Aston University, pp.29–40.
Harkness, J.A. (2003), ‘Questionnaire Translation’ in: J.A. Harkness, F. Van de Vijver and P. Mohler (eds), Cross-Cultural Survey Methods, NJ: Wiley.
Jones, E. (1963), ‘The courtesy bias in SE Asian surveys’, International Social Science Journal, 15 (1), pp.70–76.
Jowell, R. (1998), ‘How Comparative is Comparative Research?’, American Behavioral Scientist, 42 (2), pp.168–177.
Kaase, M. and Newton, K. (1995), Beliefs in Government, Oxford: Oxford University Press.
Kolsrud, K. and Skjåk, K.K. (2004), ‘Harmonising Background Variables in International Surveys’, Paper presented at the RC33 Sixth International Conference on Social Science Methodology, 16–20 August 2004, Amsterdam.
Lisle, E. (1985), ‘Validation in the social sciences by international comparison’ in: L. Hantrais, S. Mangen and M. O’Brien (eds), Doing Cross-National Research (Cross-National Research Paper 1), Birmingham: Aston University, pp.11–28.
Lynn, P. (2003), ‘Developing quality standards for cross-national survey research: five approaches’, International Journal of Social Research Methodology, 6 (4), pp.323–336.
Miller, J., Slomczynski, K. and Schoenberg, R. (1981), ‘Assessing comparability of measurement in cross-national research’, Social Psychology Quarterly, 44 (3), pp.178–191.
Mitchell, R.E. (1965), ‘Survey materials collected in the developing countries: Sampling, measurement and interviewing obstacles to intra- and inter-national comparisons’, International Social Science Journal, 17 (4), pp.665–685.
Park, A. and Jowell, R. (1997), Consistencies and Differences in a Cross-National Survey, London: SCPR.
Rokkan, S. (1968), Comparative Research Across Cultures and Nations, Paris: Mouton.
Saris, W. and Kaase, M. (1997), Eurobarometer: Measurement Instruments for Opinions in Europe, Amsterdam: University of Amsterdam.
Saris, W., Van der Veld, W. and Gallhofer, I. (2004), ‘Development and improvement of questionnaires using predictions of reliability and validity’ in: S. Presser, J. Rothgeb, M. Couper, J. Lessler, E. Martin and E. Singer (eds), Questionnaire development, evaluation and testing, New York: Wiley.
Scherpenzeel, A. and Saris, W.E. (1997), ‘The validity and reliability of survey questions: A meta-analysis of MTMM studies’, Sociological Methods and Research, 25 (3), pp.341–383.
Scheuch, E.K. (1966), ‘The development of comparative research: Towards causal explanations’ in: E. Oyen (ed.), Comparative methodology: Theory and practise in international social research, London: Sage (1990), pp.19–37.
Teune, H. (1992), ‘Comparing countries: Lessons learned’ in: E. Oyen (ed.), Comparative methodology: Theory and practise in international social research, London: Sage (1990), pp.38–62.
Verba, S. (1971), ‘Cross-National Survey Research: The Problem of Credibility’ in: I. Vallier (ed.), Comparative Methods in Sociology: Essays on Trends and Applications, Berkeley: University of California Press.
2
How representative can a multi-nation survey be?
Sabine Häder & Peter Lynn*
Introduction

It is very important for a multi-nation survey to select an equivalent sample in each country. Lack of equivalence in the samples can undermine the central objective of cross-national comparison. However, this is a big challenge. Not only do available sampling frames vary greatly in their properties from country to country, but so does usual sampling practice. Failure to select equivalent samples has been a common criticism of some multi-nation surveys. One of the most important innovations of the ESS has been the system it has devised and implemented to ensure that equivalent high-quality random samples are selected in each country. In this chapter we explain the principles underlying the ESS approach to sampling, we describe the process used to implement these principles and we discuss the extent to which the ESS approach might usefully be applied to other cross-national surveys.1
* Sabine Häder is a Senior Statistician at Zentrum für Umfragen, Methoden und Analysen (ZUMA), Mannheim, Germany. Peter Lynn is Professor of Survey Methodology at the University of Essex, Colchester, UK. Both are members of the European Social Survey sampling panel.
1 We would like to thank Siegfried Gabler and Matthias Ganninger (both Centre for Survey Research and Methodology, Germany) for helping us in the calculation of design weights and the estimation of design effects. We also would like to acknowledge the helpful comments made about this chapter by Caroline Roberts, City University, London.
Equivalence of samples

Although many surveys over the years have been used to make cross-national comparisons, such uses were often an after-thought. In very few cases have surveys been specifically designed to facilitate cross-national comparisons. Only in the last decade or so has it become recognised that design objectives – and therefore methods – for cross-national surveys should be somewhat different from those for national surveys. In 1997, Leslie Kish acknowledged this shifting perception:

Sample surveys of entire nations have become common all over the world during the past half century. These national surveys lead naturally to multinational comparisons. But the deliberate design of valid and efficient multinational surveys is new and on the increase. New survey methods have become widespread and international financial and technical support has created effective demand for multinational designs for valid international comparisons. The improved technical bases in national statistical offices and research institutes have become capable of implementing the complex task of coordinated research. However, to be valid, multinational surveys have to be based on probability sample designs of comparable national populations, and the measurements (responses) should be well controlled for comparability. (Kish, 1997, p.vii)

It may seem self-evident that samples should be equivalent for any comparative survey, but it is less evident what ‘equivalent’ means in practice. The ESS approach is to define equivalence in terms of two fundamental characteristics of a survey sample: the population that it represents and the precision with which it can provide estimates of the characteristics of that population.

Our first step was to define the population that each national sample should represent and to insist that the definition was adhered to in each country so that the sample design provided complete, or very near-complete, coverage of that population. The definition of the population was, “all persons aged 15 years or older resident in private households within the borders of the nation, regardless of nationality, citizenship, language or legal status”.

The second step was to define what the word ‘represent’ should mean. We concluded that this should mean that each sample would be capable (at least in the absence of non-response) of providing statistically unbiased estimates of characteristics of the population. The only way to guarantee this is to use a strict probability sampling method, where every person in the population has a non-zero chance of being selected, and the selection probability of every person in the sample is known.
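It is these known selection probabilities that make design-weighted estimation possible: each respondent is given a weight equal to the inverse of his or her selection probability. A minimal sketch of our own (not taken from the ESS documentation) showing how weighting removes the bias created by deliberately unequal selection probabilities:

import numpy as np

rng = np.random.default_rng(1)

# A toy population in which the characteristic of interest differs between two strata.
y = np.concatenate([rng.normal(10, 2, 8000),   # stratum 1
                    rng.normal(20, 2, 2000)])  # stratum 2
# Unequal selection probabilities: stratum 2 is deliberately over-sampled.
p = np.concatenate([np.full(8000, 0.02), np.full(2000, 0.10)])

selected = rng.random(y.size) < p
weights = 1.0 / p[selected]                    # design weight = 1 / selection probability

unweighted = y[selected].mean()                        # pulled towards the over-sampled stratum
weighted = np.average(y[selected], weights=weights)    # approximately unbiased

print(f"population mean {y.mean():.2f}, unweighted {unweighted:.2f}, weighted {weighted:.2f}")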
But providing unbiased estimates is necessary but not sufficient for comparative research. The estimates also have to be sufficiently precise to be useful. Broadly, they should be precise enough to detect all but minor differences between nations. However, to assess the implication of this requirement for sample sizes would require separate consideration not only of the population variability of each estimate, but also of the likely magnitude of between-nation differences. For a survey such as the ESS that asks a wide variety of questions, this would be an enormous task. And even if it could be achieved, the conclusion would be different for each estimate. So instead, our approach was to consider a single imaginary estimate with “typical” properties and then to identify the minimum precision that would be required for such a variable. On this basis, it was agreed that precision equivalent to a simple random sample (SRS) of 1,500 respondents would be sufficient for most purposes without being prohibitively expensive.

The strategy from then on was to ensure that this sort of precision was achieved in each nation. But since the design in most nations would involve departures from simple random sampling, for instance clustering and variable selection probabilities, it was also necessary to predict the impact of these departures on precision (known as ‘design effects’) before determining the required nominal sample size in each nation. This is a very unusual step for a cross-national survey to take at the specification stage. Typically, such surveys content themselves with specifying the same nominal sample size for each country (i.e. the same number of interviews) under an implicit assumption that this is what determines the precision of estimation. Instead, the ESS introduced a further innovation by implementing a standard way of predicting design effects in each country and then requiring each nation to aim for the same effective (rather than nominal) sample size.

Design effects will, of course, differ for different estimates, depending on how they are associated with selection probabilities, strata and weights. So, to ensure consistent decisions across nations we determined that design effects should be predicted with respect to an imaginary estimate with standard properties, in particular that the estimate is not correlated with selection probabilities or strata and is modestly correlated with clusters (an intra-cluster correlation of 0.02). Although this may be a reasonable approximation to the properties of many ESS variables, there would, of course, also be many variables for which the approximation did not hold. But the strategy has the advantage both of simplicity (as we shall see later, it is not too demanding to predict design effects once these simplifying assumptions have been made) and of consistency (the same basis for decisions is used in every country). The strategy is evaluated in Lynn et al. (forthcoming).

2 Alternative intra-cluster correlation predictions were permitted provided that these could be justified based on estimates from national surveys. In practice, almost all participating nations used the default value of 0.02.
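Under these simplifying assumptions the prediction is indeed undemanding. A sketch of our own, with made-up figures, of the clustering component of the design effect and the effective sample size it implies; the full ESS predictions also allow for unequal selection probabilities, which this sketch omits:

def design_effect_clustering(b, rho=0.02):
    # Kish approximation for the design effect due to clustering:
    # deff = 1 + (b - 1) * rho, where b is the average number of interviews per
    # cluster and rho is the intra-cluster correlation (0.02 is the default above).
    return 1 + (b - 1) * rho

def effective_sample_size(n_interviews, deff):
    # Precision of the clustered sample expressed as an equivalent SRS size.
    return n_interviews / deff

# Illustrative figures only: 2,000 interviews spread over 200 clusters of 10.
deff = design_effect_clustering(b=10)
print(round(deff, 2))                             # 1.18
print(round(effective_sample_size(2000, deff)))   # about 1695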
Sample sizes

It is perhaps worth emphasising why equal precision in each nation is a desirable goal. It stems from the idea that a key objective of cross-national research is to make comparisons between nations, whether in terms of simple descriptive statistics, associations between variables, or complex multivariate statistics. For any given overall budget, the precision of an estimate of the difference between parameters for two nations is maximised if the precision of the estimates of the two individual parameters is approximately equal (assuming that precision comes at a similar price in each nation).

Figure 2.1 The relationship between precision of an estimated difference between nations and relative national sample sizes, assuming simple random sampling
[Figure: line graph. The vertical axis shows the standard error of the estimate of the difference; the horizontal axis shows n2/n. Three curves are plotted: equal costs and variances; equal cost, variance 50 per cent greater in nation 1; equal variance, cost double in nation 2.]
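The shape of these curves follows directly from the variance of a difference between two independent sample means under simple random sampling. A short sketch of our own, with illustrative values, reproduces the equal-cost, equal-variance case:

import numpy as np

def se_difference(share_nation2, n_total=3000, var1=1.0, var2=1.0):
    # Standard error of the estimated difference in means between two nations
    # under SRS, as a function of nation 2's share of the total sample size.
    n2 = share_nation2 * n_total
    n1 = n_total - n2
    return np.sqrt(var1 / n1 + var2 / n2)

for share in np.linspace(0.1, 0.9, 9):
    print(f"n2/n = {share:.1f}   SE = {se_difference(share):.4f}")

# The minimum lies at n2/n = 0.5 when costs and variances are equal; setting
# var1 = 1.5 shifts the optimum towards a larger sample in nation 1, as the
# asymmetric curves in the figure suggest.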
This can be illustrated by a simple example. Suppose we want to compare the mean of a variable y between two nations based on SRS in each nation. If the population variability of y is similar in each nation, then precision depends only on sample size. Figure 2.1 shows that, if the cost per sample
member is the same in each nation, maximum precision is obtained by having the same sample size in each nation (the lowest of the three lines on the graph). Reassuringly, the graph is fairly flat in its central region, indicating that slight departures from equal sample size in each nation will make little difference to the precision of comparisons. The graph also shows what happens to the precision of an estimated difference between nations if the assumptions of equal population variance or equal costs do not hold. The graph becomes asymmetrical, implying that a larger proportion of the sample should be devoted to the nation with the lower data collection costs or the higher variability in its population. However, provided that the differences in costs or variance are not too great, an equal sample size in each country will still provide precision that is almost as good as the maximum theoretically obtainable. In any case, in a cross-national survey such as the ESS, it would not be possible to vire data collection costs between nations. So it is not appropriate to consider costs when determining sample allocation across nations.

Achieving equivalence

It is one thing to have a goal of equivalent samples and a concept of what equivalence should mean. It is quite another to achieve equivalence in practice. The experiences of previous multi-nation surveys (e.g. Park and Jowell, 1997) led us to believe that the process by which sample designs are developed, agreed and implemented is at least as important as the specification to which designs should adhere. Consequently, we developed a detailed process that we felt would minimise the chances of ending up with sample designs that failed to meet the specification or that were sub-optimal in other preventable ways. We recognised that this process would require considerable resources, but felt it to be justified given that the sample is one of the foundation stones on which any survey is constructed.

So, a small panel of sampling experts was set up from the start – the “ESS sampling panel”. The panel is responsible for producing the sample specification, agreeing the actual sample design for each nation and reviewing the implementation of the designs. Two features of the operation of the panel are notable. First, its approach is co-operative rather than authoritarian. The sampling panel sees its primary role as providing advice and assistance where needed, even though it also has to sign off and ‘police’ the sample designs. Second, its interactions are intensive. Regular contact is made between relevant parties, promoting rapport and co-operation. To our knowledge, this process is unique among cross-national social surveys, and is described in more detail later in this chapter. An important feature of the process is that a particular member of the sampling panel is allocated to each nation and the goal is to develop close working
arrangements with the respective National Coordinator – a process that often leads to an extended dialogue which seems to be much appreciated by both parties. We now describe each element of the ESS’s sampling procedure. Population coverage We have already referred to the definition of the ESS target population. As it turned out, however, this seemingly simple definition proved surprisingly difficult to apply in several nations. Some nations were not used to working with a lower age limit as low as 15; others were used to applying an upper age limit, typically of either 75 or 80. Even so, the ESS’s age specification (15+, no upper limit) has been strictly adhered to in all countries. The lower limit of 15 proved to be especially difficult for two countries (Italy and Ireland) which used their electoral registers as a sampling frame. So in these countries the registers were used just as a frame of households or addresses, with interviewers implementing a special procedure to sample from all persons resident at each selected address. In some countries (e.g. Ireland and Poland), 15–17 year olds were only able to be interviewed with parental consent. Another important aspect of the population definition was that a person’s first language should not be an undue barrier to selection or participation. Thus questionnaires had to be made available in all languages spoken as a first language by five per cent or more of the population and interviewers had to be available to administer them. In some countries it turned out that ‘complete’ geographical coverage of the population would be too costly or even too dangerous to achieve. For instance, on cost grounds Jersey, Guernsey and the Isle of Man were excluded from the United Kingdom sample, just as Ceuta and Mallila were excluded from the sample for Spain, and the smaller islands were excluded from the sample for Greece. In Round 1, the Palestinian residents of (East) Jerusalem were excluded from the sample for Israel because at that time it would have been too dangerous for interviewers to work there. And in some countries where the sampling frame was a population register it proved impossible to include illegally resident persons. Such deviations from the ideal were discussed and agreed in advance with the sampling panel, which ensured that there were no serious deviations from the definition of the target population. Sampling frames An important prerequisite for sampling is the availability of a suitable sampling frame. In many countries it was a major challenge to find a regularly
Table 2.1  Sampling frames used on ESS Round 2

Country          Frame                                                   Remarks
Austria          Telephone book                                          Additional non-telephone households were sampled in the field, as described in the text below
Belgium          National Register
Czech Republic   Address register UIR-ADR                                Area-based sampling
Denmark          Danish Central Person Register
Estonia          Population Register
Finland          Population Register
France           None                                                    Area-based sampling
Germany          Registers from local residents' registration offices
Greece           None                                                    Area-based sampling
Hungary          Central Registry
Iceland          National Register
Ireland          National Electoral Register
Luxembourg       Social Security Register
Netherlands      Postal address list (TPG Afgiftenpuntenbestand)
Norway           BEBAS Population Register
Poland           National Register of Citizens (PESEL)
Portugal         None                                                    Area-based sampling
Slovakia         Central Register of Citizens
Slovenia         Central Register of Population
Spain            Continuous Census
Sweden           Register of population
Switzerland      Telephone register
Turkey           Clusters of addresses
UK               Postal address list (Postcode Address File)
Ukraine          None                                                    Used to select streets, followed by field enumeration
updated, complete and accessible frame. For instance, whereas Austria has a regularly updated computer-based population register, it may not be used as a sampling frame in social surveys. To illustrate the diversity of available frames, Table 2.1 lists the frames used in Round 2 for the selection of individuals, households or addresses. Almost all countries that participated in both of the first two rounds of the ESS used the same frame on both occasions. There were, however, two exceptions: the Czech Republic and Spain. In the Czech Republic, it turned out that the frame used in Round 1 (see footnote 3) did not meet expectations in respect of coverage and updating.
3
The “SIPO” database of households. This is compiled by merging utility lists of households that subscribe to electricity, gas, radio, television or telephone.
It was thus changed in Round 2. In Spain, we took advantage of the fact that it was possible to change from a frame of households (in Round 1) to one of individuals (in Round 2). In five countries (Czech Republic, France, Greece, Portugal and Ukraine), it proved impossible to find an appropriate frame, and in each case area-based designs had to be deployed. Sample designs We have already mentioned that to meet the ESS objectives in respect of bias and precision, it was essential to use strict probability samples everywhere. Even so, the nature of the sample designs was to vary greatly between nations. That in itself is, of course, not a problem since to achieve equivalence of outcomes it is not necessary to use identical inputs. But it did mean that the sampling panel had to exert careful control over all designs to ensure that they really were comparable. In Table 2.2 we give an overview of the designs applied in Round 2 of the ESS. It can be seen that the designs varied from simple random sampling (e.g. Denmark) on the one hand to four-stage stratified, clustered designs (e.g. Ukraine) on the other. Moreover, while in some cases a sample of persons could be selected directly from the frame (indicated by a P in the Units column of Table 2.2), in other cases – where the units selected were households or addresses – the final task of selecting an individual to interview had to be carried out in the field. Table 2.2 can thus provide only a summary of the variation in designs. To illustrate the complexity of some of the designs implemented, we now describe two of them in more detail. In Austria the only available sampling frame is the telephone book. But this covers only about 90 per cent of households. Not covered are households without a fixed-line telephone and households with secret numbers. To give these latter groups a chance of selection we developed the following design. Firstly, the Austrian municipalities were sorted into 363 strata, formed from 121 districts and three classes of population sizes (small municipalities with less than 2,500 inhabitants, medium municipalities with 2,500 to less than 10,000, and large municipalities with 10,000 inhabitants or more). At stage 1 the Primary Sampling Units (PSUs) were selected. These are clusters in municipalities. The number of clusters in a stratum was proportional to the size of its target population. The allocation was done by controlled rounding (Cox, 1987). Within a stratum, clusters were selected by systematic proportional-to-size random sampling. At stage 2, five addresses of households in the telephone book were drawn in each cluster. These formed the first part of the sample. To include also the non-listed households, the interviewer took each "telephone household" as a starting point and identified the tenth subsequent household in the field (according to a specified rule for the random walk). The five households found with this method formed the second part of the sample.4 Then, at stage 3 an individual was selected at each address using the next birthday method. 4
This method gives telephone households twice the chance of selection of nontelephone households, so this was taken account of by weighting.
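To make the weighting consequence of this two-part design concrete, the following minimal sketch (not the ESS's own code; the field name and toy data are invented) derives relative design weights from the fact that listed telephone households can enter the sample in two ways while unlisted households can enter only via the random walk.

```python
"""Illustrative design-weight sketch for a two-part sample in which one group
(here: households listed in the telephone book) has twice the selection chance
of the other.  Data and field names are invented."""

def relative_design_weight(listed_in_phone_book: bool) -> float:
    # Weight is proportional to 1 / inclusion probability.  Listed households
    # can be reached both directly from the book and via the random walk, so
    # their relative selection probability is taken as 2 rather than 1.
    return 1.0 / (2.0 if listed_in_phone_book else 1.0)

def normalised_weights(sample):
    """Scale the relative weights so that they average 1 over the sample."""
    raw = [relative_design_weight(h["listed"]) for h in sample]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

if __name__ == "__main__":
    # Toy sample: 8 telephone-book households and 2 picked up only by the
    # random walk (roughly the 90 / 10 split mentioned in the text).
    sample = [{"listed": True}] * 8 + [{"listed": False}] * 2
    for household, weight in zip(sample, normalised_weights(sample)):
        print(household, round(weight, 3))
```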
Table 2.2  Overview of ESS sample designs in Round 2

Country          Design                                               Units   ngross   nnet   neff
Austria          strat, clus, 3 stages, 360 pts                       H       3672     2556   1390
Belgium          cities: srs; rest: strat, clus, 2 stages, 324 pts    P       3018     1778   1487
Czech Republic   strat, clus, 4 stages, 275 pts                       A       4333     3026    773
Denmark          srs                                                  P       2433     1487   1487
Estonia          systrs                                               P       2867     1989   1989
Finland          systrs                                               P       2893     2022   2022
France           strat, clus, 3 stages, 200 pts                       A       4400     1806   1114
Germany          strat, clus, 2 stages, 163 pts                       P       5868     2870   1288
Greece           strat, clus, 3 stages, 528 pts                       A       3056     2406   1469
Hungary          cities: srs; rest: strat, clus, 2 stages             P       2463     1498     ?5
Iceland          srs                                                  P       1200      579    579
Ireland          strat, clus, 3 stages, 250 pts                       A       3981     2286    936
Luxembourg       stratrs                                              P       3497     1635   1419
Netherlands      stratrs                                              A       3009     1881   1568
Norway           srs                                                  P       2750     1760   1761
Poland           cities: srs; rest: strat, clus, 2 stages, 158 pts    P       2392     1717   1078
Portugal         strat, clus, 3 stages, 326 pts                       A       3094     2052    787
Slovakia         srs                                                  P       2500     1512   1512
Slovenia         strat, clus, 2 stages, 150 pts                       P       2201     1442    952
Spain            strat, clus, 2 stages, 503 pts                       P       3213     1663   1176
Sweden           srs                                                  P       3000     1948   1948
Switzerland      strat, clus, 3 stages, 287 pts                       A       4863     2141   1398
Turkey           strat, clus, 3 stages, 200 pts                       A           –        –      –
UK               GB: strat, clus, 3 stages, 163 pts; NI: srs          A       4032     1897   1123
Ukraine          strat, clus, 4 stages, 300 pts                       A       3050     2031    600

Notes: strat: stratified. clus: clustered. srs: simple random sample. systrs: systematic random sample. stratrs: stratified (unclustered) random sample. stages: number of stages of selection. pts: number of sample points. H: household. P: person. A: address. The sample of Turkey was not included in data release 2 so exact sample sizes are not yet known
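The last two columns of Table 2.2 are linked by the total design effect: the effective sample size is simply the number of completed interviews deflated by that design effect. A tiny illustrative check in Python, using the UK row together with the UK design effects reported in Tables 2.3 and 2.4 later in this chapter, might look as follows.

```python
"""Sketch relating n_net and n_eff in Table 2.2.  The UK figures are used as
the worked example; the function itself is generic."""

def effective_sample_size(n_net, deff_p, deff_c):
    # n_eff = n_net / (DEFFp * DEFFc)
    return n_net / (deff_p * deff_c)

if __name__ == "__main__":
    # UK, Round 2: 1,897 interviews, DEFFp of about 1.3, DEFFc of about 1.3
    print(round(effective_sample_size(1897, 1.3, 1.3), 1))  # about 1,123, as in Table 2.2
```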
The Austrian example shows that even if no complete frame is available it is possible to implement a probability sample design. Moreover, Austria is an example of a country in which we were able to improve the design at Round 2 compared to Round 1, based on the evidence of Round 1. Having found a high level of homogeneity within sample points at Round 1 which resulted in a relatively large design effect due to clustering, we decided to increase the number of clusters
5
The Hungarian team used a sample design that was not agreed with the sampling expert panel and that differs from the signed-off design form. This design led to huge design effects due to varying inclusion probabilities (see Table 2.3) and due to clustering (see Table 2.4). However, owing to a personnel change in the statistical department it was not possible to clarify details of the sample design subsequently. An improvement of the design in Round 3 is essential.
from 324 (in Round 1) to 360 (in Round 2). In addition, the clustering (and resultant design effect) was further reduced by requiring the interviewer to visit only the tenth household after the starting address instead of the fifth. A second rather complex design is the one in Ukraine. Since probability sampling is not the usual method there and because no population register is available as a sampling frame it was challenging to find an acceptable design. The first stage of the agreed design was to stratify by settlement. The strata were defined by 11 geographic regions and seven types of settlements. Altogether, there proved to be 56 non-empty strata. 300 sample points were allocated to the strata in proportion to the size of the stratum population using the Cox method. In some cases, the stratum consisted of a single settlement (large city), while in others it was necessary to select settlements with probability of selection proportional to population size with replacement. At stage 2, streets were selected as sample points using simple random sampling. For this, a register of streets within settlements was available as a sampling frame. Unfortunately, however, this register did not contain any information on the number of addresses or households in each street. So selecting streets with equal probabilities was the only possibility. At stage 3 the interviewers listed the dwellings in each sampled street, excluding any that were obviously vacant. The lists were returned to the central office, where an appropriate number of selections were made from each street using systematic random sampling with a fixed sampling fraction (a brief illustrative sketch of this kind of selection appears below). Finally, at stage 4, one eligible person was selected by the interviewer using the last birthday method. Table 2.2 also shows that the required minimum effective sample size, neff = 1500, was not reached in a number of countries, such as the Czech Republic, Ireland, Portugal or Ukraine. The reason for this in many cases was that the intra-class correlation turned out to be higher than predicted. For instance, in Portugal we found a median ρ = 0.16 in a set of eight selected variables, a large deviation from the value of ρ = 0.02 that we had initially suggested. But this suggestion had emerged from an analysis of the British Social Attitudes Survey which includes several items similar to ESS items. A similar level of intra-class correlation (ρ = 0.04) is also reached in the ESS as far as the UK and three other countries are concerned. But there seem to be marked cross-cultural differences in the amount of homogeneity within clusters. In largely rural countries such as Portugal, clustering effects turn out to have a bigger impact than they do in highly urbanized countries such as Germany. Another source of high homogeneity in certain countries could be fieldwork practices, but we cannot yet be sure. In particular, the design effect due to interviewer effects has to be considered, and it may still be that in future rounds the minimum effective sample size neff = 1500 is too demanding to sustain as a strict requirement.
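The fixed-fraction systematic selection used at stage 3 of the Ukrainian design can be sketched in a few lines of Python; the street listing and the sampling fraction below are invented for illustration only.

```python
"""Sketch of systematic random sampling with a fixed sampling fraction, as
applied to the field-listed dwellings of one sampled street.  The dwelling
list and the fraction are invented."""
import random

def systematic_sample(units, fraction, rng=random):
    """Select roughly fraction * len(units) units with a random start."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    interval = 1.0 / fraction            # fixed sampling interval
    position = rng.uniform(0, interval)  # random start within the first interval
    picks = []
    while position < len(units):
        picks.append(units[int(position)])
        position += interval
    return picks

if __name__ == "__main__":
    dwellings = [f"Green Street {n}" for n in range(1, 38)]  # 37 listed dwellings
    print(systematic_sample(dwellings, fraction=0.2, rng=random.Random(1)))
```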
Design weights It is clear that the designs we described for Austria and the Ukraine are not equal probability designs. The same was true for many of the other ESS designs too. So it was vitally important for the selection probabilities at each stage to be recorded. These records were subsequently used by the sampling panel to calculate design weights which were the inverse of the product of the inclusion probabilities at each stage. In Round 1, the weights for all respondents were equal only in Belgium, Denmark, Finland, Sweden, Hungary and Slovenia. In Round 2, the same was true for Estonia, Iceland, Norway and Slovakia.6 These countries used equal probability selection methods. In all other countries, however, selection probabilities (and hence design weights) varied to some degree for the following reasons:
• Unequal inclusion probabilities in countries where frames of households or addresses were used, such as the United Kingdom and Greece. For example, persons in households with four persons aged 15 years and older had a selection probability one-quarter that of a person in a single adult household.
• Unequal inclusion probabilities because of over-sampling in some strata. An example was Germany, where because separate analyses were required for eastern and western parts of Germany, East Germans were over-sampled. As a result, 1020 persons in East Germany and 2046 persons in West Germany were interviewed even though the actual proportion of East Germans is only around 20 per cent. In other countries, such as the Netherlands and Poland, sampling fractions were varied slightly over strata in anticipation of differences in response rates.
When analysing the data it is, of course, important to use the design weights that accompany the dataset, which have been calculated to correct for these variations in selection probabilities. Unless they are used, some samples will turn out strangely, for instance heavily over-representing East Germans or persons living alone, according to the above examples. In Spain, for instance, before introducing the design weights, some 14.9 per cent of respondents apparently lived in single-person households, but – after properly applying them – this proportion reduces to 6.1 per cent. Figure 2.2 gives an overview of the variation in design weights calculated for Round 2. The differences between nations in the shape of the distribution are striking. The nine nations with equal-probability designs stand out as consisting of a single vertical bar.
6
In Round 2 Hungary changed to an unequal probability sampling scheme.
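The logic just described, design weights as the inverse of the product of the stage-wise inclusion probabilities and their effect on a sample proportion such as the Spanish single-person-household share, can be sketched as follows. The tiny data set is invented; only the logic mirrors the text.

```python
"""Sketch: design weights from stage-wise inclusion probabilities, and their
effect on a sample proportion.  All numbers are invented for illustration."""

def design_weight(stage_probabilities):
    """Inverse of the product of the inclusion probabilities at each stage."""
    p = 1.0
    for prob in stage_probabilities:
        p *= prob
    return 1.0 / p

def weighted_share(flags, weights):
    """Weighted proportion of cases where the flag is True."""
    return sum(w for f, w in zip(flags, weights) if f) / sum(weights)

if __name__ == "__main__":
    # Address sample: within a selected household one adult is chosen, so the
    # person-level probability includes a factor 1 / (number of adults).
    adults_per_household = [1, 1, 1, 4, 4, 2, 3, 2, 1, 4]
    weights = [design_weight([0.001, 1.0 / n]) for n in adults_per_household]
    lives_alone = [n == 1 for n in adults_per_household]
    print("unweighted share living alone:", sum(lives_alone) / len(lives_alone))
    print("weighted share living alone:  ", round(weighted_share(lives_alone, weights), 3))
```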
Figure 2.2  Distribution of design weights in Round 2
Note: Round 2 data from Turkey was not included in data release 2 and no sample design file has been received, so exact sample sizes, design weights and design effects are not yet known. Therefore, Turkey is excluded from this and the following tables and figures
Design effects We have argued that it is necessary and even desirable to allow variation in sample designs in order to achieve equivalence. However, the more a
sample is clustered and the greater the variation in its inclusion probabilities, the less “effective” it is. In other words, more interviews need to be conducted to obtain the same precision of estimates when a complex design is used as compared with simple random sampling. As noted, we can measure this loss in precision by the design effect (Kish, 1965). The variation between nations in sample designs meant that there was likely to be considerable variation in design effects too. That is why we tried to predict the design effects and to use them to determine the number of interviews that would be needed to meet the criterion of a minimum effective sample size of 1500. For the prediction of design effects we chose a model-based approach that takes into account two components: • Design effect due to differing selection probabilities (DEFFp) If differing selection probabilities were to be used, then the associated design effect was predicted using the following formula:
DEFFp = m · Σi mi wi² / (Σi mi wi)²

with m = Σi mi the total number of respondents, and
where there are mi respondents in the ith selection probability class, each receiving a weight of wi. An overview of the predicted and estimated design effects due to differing inclusion probabilities for both Rounds 1 and 2 of the ESS is given in Table 2.3. The post-fieldwork estimated design effects have been calculated in exactly the same way as the pre-fieldwork predictions, but using the realised design weights and sample sizes rather than the anticipated ones. In other words they apply to our imaginary variable that is not correlated with strata or clusters. In most countries the predicted design effects were rather close to the realised design effects in Round 1. Only in Portugal and Israel did we notably under-predict the design effect. But in Round 2 we were able to improve on our predictions, and in some cases on the design too, based on the Round 1 data. This led to a lower anticipated design effect in a number of countries (Austria, France, Ireland, Luxembourg). Indeed, in three of those four countries the estimated design effects were equal to the predictions or even lower. Meanwhile, in Spain and Norway a new design was used that did not involve varying inclusion probabilities, so DEFFp=1. And in Portugal a remarkable decrease of DEFFp took place, due to an improvement in the design. Only in the Czech Republic and the Ukraine were there increases in the design
effects due to differing selection probabilities and in the Ukraine, this was a result of new sample designs.7

Table 2.3  Design effects due to differing selection probabilities: Rounds 1 and 2

Country          Predicted DEFFp Round 1   Estimated DEFFp Round 1   Predicted DEFFp Round 2   Estimated DEFFp Round 2
Austria          1.4                       1.2                       1.3                       1.3
Belgium          1.0                       1.0                       1.0                       1.0
Czech Republic   1.2                       1.2                       1.5                       1.5
Denmark          1.0                       1.0                       1.0                       1.0
Estonia          –                         –                         1.0                       1.0
Finland          1.0                       1.0                       1.0                       1.0
France           1.3                       1.2                       1.2                       1.2
Germany          1.1                       1.1                       1.1                       1.1
Greece           1.2                       1.2                       1.2                       1.2
Hungary          1.0                       1.0                       1.0                       2.2
Iceland          –                         –                         1.0                       1.0
Ireland          1.3                       1.0                       1.0                       1.3
Israel           1.3                       1.6                       –                         –
Italy            1.0                       1.2                       –                         –
Luxembourg       1.4                       1.3                       1.3                       1.2
Netherlands      1.2                       1.2                       1.2                       1.2
Norway           1.2                       1.0                       1.0                       1.0
Poland           1.0                       1.0                       1.0                       1.0
Portugal         1.1                       1.8                       1.2                       1.4
Slovakia         –                         –                         1.0                       1.0
Slovenia         1.0                       1.0                       1.0                       1.0
Spain            1.1                       1.2                       1.0                       1.0
Sweden           1.0                       1.0                       1.0                       1.0
Switzerland      1.2                       1.2                       1.2                       1.2
UK               1.2                       1.2                       1.2                       1.3
Ukraine          –                         –                         1.2                       3.4

Note: Figures are rounded to one decimal place
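For readers who want to reproduce DEFFp calculations of this kind, a minimal sketch of the formula given above follows; the selection-probability classes in the example are invented.

```python
"""Sketch of DEFFp = m * sum_i(m_i * w_i**2) / (sum_i(m_i * w_i))**2 for a set
of selection-probability classes.  The class sizes and weights below are
invented purely to exercise the formula."""

def deff_p(classes):
    """classes: iterable of (m_i, w_i) pairs, respondents and weight per class."""
    m = sum(m_i for m_i, _ in classes)
    numerator = m * sum(m_i * w_i ** 2 for m_i, w_i in classes)
    denominator = sum(m_i * w_i for m_i, w_i in classes) ** 2
    return numerator / denominator

if __name__ == "__main__":
    # Example: persons in 1-, 2-, 3- and 4-adult households drawn from an
    # address frame, weighted by the number of adults in the household.
    classes = [(300, 1.0), (900, 2.0), (400, 3.0), (400, 4.0)]
    print(round(deff_p(classes), 2))   # equal weights would give exactly 1.0
```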
• Design effect due to clustering (DEFFc) In most countries multi-stage, clustered, sample designs were used. In such situations there is also a design effect due to clustering, which can be calculated as follows: DEFFc = 1 + (b-1) ρ where b is the mean number of respondents per cluster (but see Lynn and Gabler, 2005) and ρ is the intra-cluster correlation (or “rate of homogeneity”) – a measure of the extent to which persons within a clustering unit are more 7
We do not discuss the case of Hungary here (see footnote 5).
homogeneous than persons within the population as a whole (Kish, 1995) – see Table 2.4. In the first round this design effect could be predicted, at least crudely, from knowledge of other surveys and/or the nature of the clustering units. In some countries calculations were made to estimate intra-class correlation coefficients from earlier surveys with similar variables. If there was no available empirical evidence upon which to base an estimate of the intra-class correlation coefficient, then a default value of 0.02 was used, and in fact this value was used in most countries. However, as noted, this turned out in many countries to be an under-prediction of the actual homogeneity within clusters.

Table 2.4  Design effects due to clustering: Rounds 1 (R1) and 2 (R2)

Country          Predicted DEFFc R1   Estimated DEFFc R1   Predicted DEFFc R2   Estimated DEFFc R2
Austria          1.1                  1.6                  1.2                  1.5
Belgium          1.1                  1.2                  1.2                  1.2
Czech Republic   1.1                  1.3                  1.3                  2.6
Denmark          1.0                  1.0                  1.0                  1.0
Estonia          –                    –                    1.0                  1.0
Finland          1.0                  1.0                  1.0                  1.0
France           1.2                  1.3                  1.2                  1.4
Germany          1.4                  2.0                  1.7                  2.0
Greece           1.1                  1.6                  1.1                  1.4
Hungary          1.2                  1.4                  1.0                  3.3
Iceland          –                    –                    1.0                  1.0
Ireland          1.2                  1.9                  1.3                  1.9
Israel           1.2                  2.4                  –                    –
Italy            1.1                  1.8                  –                    –
Luxembourg       1.0                  1.0                  1.0                  1.0
Netherlands      1.2                  1.0                  1.0                  1.0
Norway           1.2                  1.6                  1.0                  1.0
Poland           1.1                  1.8                  1.1                  1.6
Portugal         1.1                  1.6                  1.2                  1.9
Slovakia         –                    –                    1.0                  1.0
Slovenia         1.4                  1.3                  1.4                  1.4
Spain            1.2                  1.6                  1.3                  1.4
Sweden           1.0                  1.0                  1.0                  1.0
Switzerland      1.2                  1.3                  1.1                  1.2
UK               1.3                  1.4                  1.3                  1.3
Ukraine          –                    –                    1.1                  1.7

Note: Figures are rounded to one decimal place
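The clustering formula DEFFc = 1 + (b - 1)ρ can also be explored directly. The sketch below uses an invented interview total and shows how spreading interviews over more sample points (a smaller b) limits the damage when ρ turns out to be large; 0.02 was the default prediction and 0.16 the median found in Portugal.

```python
"""Sketch of DEFFc = 1 + (b - 1) * rho and of how more sample points (hence a
smaller mean cluster size b) reduce it.  The interview total is invented."""

def deff_c(mean_cluster_size, rho):
    return 1.0 + (mean_cluster_size - 1.0) * rho

if __name__ == "__main__":
    interviews = 2000
    for points in (150, 200, 326):          # numbers of sample points to compare
        b = interviews / points
        for rho in (0.02, 0.16):             # default prediction vs. Portuguese median
            print(f"{points} points, rho={rho}: b={b:.1f}, DEFFc={deff_c(b, rho):.2f}")
```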
Since the number of respondents per cluster (b) and the intra-class correlation (ρ) both influence the design effect we were – based on estimated design effects from Round 1 – able to suggest improvements of the designs in Round 2 in some countries. The improvements were made either through a reduction of b by increasing the number of sample points, or through increasing the size
of the sample points in the hope that this would lead to a reduction in ρ. So in Switzerland we increased the number of sample points from 220 to 287, in France from 125 to 200, and in Portugal from 150 to 326. But such changes were not always possible, because they tend to increase fieldwork costs. Some estimates of ρ and the realised values of b are presented in Table 2.5 for two of the ESS's attitude scales. The differences between nations in the extent to which attitudes seem to cluster in the population are striking. On the left–right dimension of political values, for instance, small areas seem to be quite homogeneous in Austria, the Czech Republic, Greece, Ireland, Slovenia, Spain and especially in Ukraine, but they are relatively heterogeneous in France, Hungary, Poland and the UK. A similar picture emerges with the life satisfaction scale. Austria, the Czech Republic, Greece, Ireland, Portugal, Spain and Ukraine are again amongst those with large intra-cluster correlations, but Slovenia is now missing from this group.

Table 2.5  Estimated ρ for two measures, Round 2

                 Left–right scale                                      Satisfaction with life
Country          Intra-class correlation ρ   Mean cluster size b       Intra-class correlation ρ   Mean cluster size b
Austria          0.10                        5.5                       0.10                        6.2
Belgium          0.05                        5.0                       0.03                        5.4
Czech Republic   0.09                        9.5                       0.09                        11.1
France           0.03                        8.7                       0.04                        9.3
Germany          0.05                        16.0                      0.07                        17.6
Greece           0.08                        4.1                       0.09                        4.7
Hungary          0.03                        23.2                      0.05                        27.5
Ireland          0.08                        10.3                      0.09                        12.2
Poland           0.00                        4.2                       0.07                        4.5
Portugal         0.13                        4.9                       0.22                        6.4
Slovenia         0.10                        6.4                       0.04                        8.8
Spain            0.10                        3.8                       0.11                        4.0
Switzerland      0.06                        7.3                       0.04                        7.9
UK               0.04                        10.2                      0.05                        11.3
Ukraine          0.27                        6.8                       0.09                        9.4

Note: Differences in b between the two measures are due to differences in the level of item non-response
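The ρ values in Table 2.5 are intra-cluster correlations. Purely as an illustration of the quantity being reported (we do not claim this is the estimator used to produce the table), the following sketch computes a simple one-way ANOVA (moment) estimate of ρ for clusters of equal size.

```python
"""Sketch of a one-way ANOVA (moment) estimator of the intra-cluster
correlation rho for equal-sized clusters.  The toy responses are invented and
the ESS estimates may have been produced with a different estimator."""

def anova_rho(clusters):
    """clusters: list of equal-length lists of responses."""
    b = len(clusters[0])                                  # cluster size
    k = len(clusters)                                     # number of clusters
    grand = sum(sum(c) for c in clusters) / (k * b)
    means = [sum(c) / b for c in clusters]
    msb = b * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c) / (k * (b - 1))
    return (msb - msw) / (msb + (b - 1) * msw)

if __name__ == "__main__":
    homogeneous = [[1, 1, 2, 1], [5, 4, 5, 5], [9, 9, 8, 9]]   # like-minded neighbourhoods
    mixed = [[1, 5, 9, 4], [2, 9, 5, 1], [9, 1, 4, 5]]         # well-mixed neighbourhoods
    print("homogeneous clusters:", round(anova_rho(homogeneous), 2))
    print("mixed clusters:      ", round(anova_rho(mixed), 2))
```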
The total design effect is the product of the design effect due to differing selection probabilities (DEFFp) and the design effect due to clustering (DEFFc). These total design effects vary considerably between countries. It is also instructive to note that a change in sample design within a country is always reflected in the design effect. When improvements are made – such as an increase in the number of PSUs or the application of a less complex design – this leads to a
reduction of DEFF. In this respect, design effects seem to be a good measure of the quality of samples. Once the design effect had been predicted for a country, a target minimum net sample size (number of completed interviews) was set for that country, so as to produce approximately equal precision for all countries (see the last three columns of Table 2.2). Moreover, as can be seen in Table 2.2, we were in most but not all cases successful with our predictions. But how was this calculation of net sample sizes carried out? Sample size For the calculation of the sample sizes we used the following formulae:
nnet = neff * DEFFc * DEFFp = 1500 * DEFF
ngross = nnet / (ER * RR)
where nnet is net sample size, neff is effective sample size, fixed at 1500 (see footnote 8), ngross is gross sample size, ER is eligibility rate and RR is response rate. To illustrate these formulae we examine the case of Greece (Round 2). In Greece we had estimated from Round 1 data a design effect due to differing inclusion probabilities within the households as DEFFp = 1.22. (We had previously predicted DEFFp = 1.18 at Round 1 based on external data but, as with many countries, were now able to use Round 1 data to inform our prediction for Round 2.) The prediction of DEFFc was based upon the anticipated mean number of interviews per sample point (4) and the predicted ρ (0.04, predicted at Round 1 because the units used as sample points were smaller than in most countries and therefore might be expected to be more homogeneous). So, DEFFc = 1 + (4-1) * 0.04 = 1.12. Thus DEFF = 1.22 * 1.12 = 1.37. Given the target minimum effective sample size (neff = 1500) and the design effect (DEFF = 1.37), the net sample size has to be at least nnet = 1500 * 1.37 = 2055. Since about one per cent ineligibles and a response rate of 70 per cent were predicted (from Round 1), the gross sample size should have been at least ngross = 2055/(0.70 * 0.99) = 2965. In fact, the gross sample size was set at 3,100 to allow for a margin of error. This should have meant that the effective sample size target was met so long as the design effect turned out not to exceed 1.43 (if response rate assumptions held) or so long as the response rate exceeded 67 per cent.
8
Except for “small” countries with populations of less than 2 million, for whom the minimum neff was 800 on grounds of affordability.
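The Greek calculation above can also be written out in a few lines of code. The sketch below is generic; only the example values are taken from the text.

```python
"""Sketch of the ESS sample-size arithmetic, using the Greek Round 2 figures
quoted in the text (DEFFp = 1.22, DEFFc = 1.12, about 1% ineligible, 70%
expected response).  The function itself is generic."""

def sample_sizes(n_eff, deff_p, deff_c, eligibility_rate, response_rate):
    deff = deff_p * deff_c
    n_net = n_eff * deff                                   # completed interviews needed
    n_gross = n_net / (eligibility_rate * response_rate)   # units to be issued to the field
    return deff, n_net, n_gross

if __name__ == "__main__":
    deff, n_net, n_gross = sample_sizes(1500, 1.22, 1.12, 0.99, 0.70)
    # The chapter rounds DEFF to 1.37 before multiplying, giving n_net = 2055 and
    # n_gross of roughly 2965; Greece in fact issued 3,100 units as a margin of error.
    print(f"DEFF={deff:.2f}  n_net>={n_net:.0f}  n_gross>={n_gross:.0f}")
```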
In contrast, it is, of course, much easier to calculate sample size in countries with simple random sampling, since DEFF=1.0. For example, in Sweden it was decided to select 3000 individuals to form the gross sample. Since the response rate was expected to be 75 per cent and the number of ineligibles about 2.3 per cent, the net sample size nnet = (3000*0.977) * 0.75 = 2198. This means the effective sample size is nnet /DEFF=2198/1.0=2198, which is clearly higher than the required minimum effective sample size of neff = 1500. To reach this benchmark Sweden would have needed only a gross sample size of ngross = nnet /(ER*RR) = 1500/(0.977*0.75) = 2048 individuals. Organisation of the work As noted, a sampling panel was set up to produce the sample specification, to develop the design for each nation in co-operation with the National Coordinator, to agree the final sample designs and to review the implementation of the designs. The panel consists of:
• Sabine Häder (Centre for Survey Research and Methodology, Germany) – Chair
• Siegfried Gabler (Centre for Survey Research and Methodology, Germany)
• Seppo Laaksonen (University of Helsinki, Finland)
• Peter Lynn (University of Essex, UK).
During each of the first two rounds of the survey the panel met three times. At the first meeting each panel member was assigned about six nations with which to liaise. Since the teams (panel member, National Coordinator, statisticians in survey institutes) had all co-operated so successfully at Round 1, the same assignment was maintained as far as possible in Round 2. Nations joining the survey for the first time at Round 2 were assigned to one or other panel member to ensure that the allocation of work would be as equal as possible. In the first place, panel members would contact 'their' National Coordinators asking for information about the envisaged sample design. Thereafter, a cooperative process of discussion and decision making between the National Coordinators, the survey organisations and the panellists got under way. As noted, with the knowledge gained from the designs applied in Round 1 we were able to improve the sample design in several nations at Round 2. At the second panel meeting in each round the panel addressed practical problems occurring in the different countries as well as theoretical
questions affecting the calculation of design effects. (For example, we found that theory did not exist for estimation when substantively different designs were used in different domains within a nation, so we had to develop the theory ourselves – Gabler et al., 2006.) Issues such as the possible future inclusion of non-response weights in the ESS were also discussed. In some cases panel members had made visits to one or more of ‘their’ countries for a detailed discussion of problems – whether an expected low response rate, or the selection of an appropriate survey organisation, or the development of a completely new design (e.g. Portugal, Italy, Switzerland, France). Once all questions were clarified and resolved, a country’s design was considered to be “ready for signing off”. A pre-designed form was used for the purpose containing full details of each design and other details that derived from the discussions between panellists, National Coordinators and survey organisations. Only when all panel members agreed a design was it ‘signed off’. Otherwise, the discussion with the National Coordinator would carry on until all the perceived problems were resolved. When fieldwork was complete, the panel members guided National Coordinators on how to create the sample design data files which would (after merging the file with the substantive data) be used to compute design weights and estimate design effects. Conclusion The ESS represents a significant step forward in the control of sample design in multi-nation surveys. We believe that the detailed sample design specification we have developed is based on clear principles, is appropriate for cross-national comparisons, and is capable of consistent implementation. We have provided clear guidance on the important role of predicted design effects in sample design and have demonstrated the benefit of a co-operative process of design in the pursuit of equivalence between countries. Naturally by no means everything has gone precisely according to plan. Some designs have failed to meet the specification owing to budgetary constraints or poorer than anticipated response rates which have reduced the ‘ideal’ sample size. But in Round 1, our predictions of design effects proved inaccurate in only two cases. Overall, we are able to report that the sampling strategy developed for the ESS has produced genuinely comparable samples across countries in all important respects, and that the samples without exception provide reasonable precision of estimation.
References
Cox, L.H. (1987), 'A constructive procedure for unbiased controlled rounding', Journal of the American Statistical Association, 82 (398), pp.520–524.
Gabler, S., Häder, S. and Lynn, P. (2006), 'Design effects for multiple design samples', Survey Methodology, 32 (1), pp.115–120.
Kish, L. (1965), Survey Sampling, New York: Wiley.
Kish, L. (1995), 'Methods for design effects', Journal of Official Statistics, 11 (1), pp.55–77.
Kish, L. (1997), 'Foreword' in: T. Lê and V. Verma, DHS Analytical Reports No. 3: An Analysis of Sample Designs and Sampling Errors of the Demographic and Health Surveys, Calverton, MD: Macro International Inc.
Lynn, P. and Gabler, S. (2005), 'Approximations to b* in the prediction of design effects due to clustering', Survey Methodology, 31 (1), pp.101–104.
Lynn, P., Gabler, S., Häder, S. and Laaksonen, S. (forthcoming), 'Methods for achieving equivalence of samples in cross-national surveys', Journal of Official Statistics, 22 (4).
Park, A. and Jowell, R. (1997), Consistencies and Differences in a Cross-National Survey, London: SCPR.
3
Can questions travel successfully? Willem E. Saris and Irmtraud Gallhofer∗
Introduction Among the most important features of the ESS is the fact that so many different checks on quality have been built into its design. As other chapters show, these checks relate not only to areas such as sampling, non-response and fieldwork, but also to the quality of the questionnaires. Without them, we would always be wary of the validity of cross-cultural comparisons through the ESS. We need to know not only about the quality of the English-language source questionnaire, but also about differences in quality across the participating countries which may arise from the way in which the same questions come across in different cultural settings. As we will show, the ESS places an unusually strong emphasis on checking data quality at its questionnaire design phases. First, every proposed new question is subjected to prior checks based on predictions of its reliability and validity and an evaluation of the expected quality of the construct it belongs to. New items are also tested in the two-country pilot surveys that are built into the design phase, which are large enough to sustain analyses of their performance in the field. But the most unusual characteristic of the ESS is its attempt to assess the comparability of its final fielded questions in all countries and languages by means of Multitrait-Multimethod experiments, which allow error structures for several items to be compared and subsequently corrected for measurement error. ∗
Willem Saris is a member of the ESS Central Coordinating Team and Professor at the ESADE Business School, Universitat Ramon Llull, Barcelona; Irmtraud Gallhofer is a member of the ESS Central Coordinating Team and senior researcher at ESADE.
In this chapter we focus on the series of checks built into the ESS that help to ensure the quality of its questionnaires within and across the ESS participating countries. Seven stages of questionnaire design Having determined the subject matter of both the largely unchanging core module and the two rotating modules of the questionnaire for each round of the survey, the detailed shape and content of the questionnaire is coordinated and ultimately determined by the Central Coordinating Team, in consultation with the Scientific Advisory Board, the National Coordinators and the two Question Module Design Teams. The content of the core itself was guided by a group of academic specialists in each subject area. The questionnaires at each round go through the following stages:
• Stage 1 The initial aim is to ensure that the various concepts to be included in the questionnaire – always based on a set of detailed proposals – are represented as precisely as possible by the candidate questions and scales. Since subsequent data users require source material that makes these links transparent, we make available all the initial documents and the subsequent stages of evaluation and re-design which lead to the final questionnaire.
• Stage 2 To achieve the appropriate quality standard, the questions and scales undergo an evaluation using standard quality criteria such as reliability and validity. Where possible, these evaluations are based on prior uses of the question in other surveys. For new questions, however, the evaluations are based on 'predictions' of quality that take into account their respective properties. Such predictions are based on the SQP program.1 Naturally, validity and reliability are not the only criteria that matter. Attention is also given to other considerations such as comparability of items over time and place, anticipated item non-response, social desirability and other potential biases, and the avoidance of ambiguity, vagueness and double-barrelled questions.
1
Some explanation of the prediction procedure will be given below. For more details refer to Saris et al. (2004a).
• Stage 3 The next step constitutes the initial translation from the source language (English) into one other language for the purpose of two large-scale national pilots. The Translation Panel guides this process, which is designed to ensure optimal equivalence between the languages.
• Stage 4 Next comes the two-nation pilot itself (400 cases per country), which also contains a number of split-run experiments on question wording alternatives. Most of these split-run experiments are built into a self-completion supplement. Some interviews are, on occasions, tape-recorded for subsequent analysis of problems and unsatisfactory interactions.
• Stage 5 The pilot is then analysed in detail to assess both the quality of the questions and the distribution of the substantive answers. Problematical questions, whether because they have displayed weak reliability or validity, deviant distributions or weak scales, are sent back to the drawing board. It is on the basis of these pilot analyses that the final source questionnaire is subsequently developed.
• Stage 6 The source questionnaire then has to be translated into multiple languages. The process is helped by the fact that potentially ambiguous or problematic questions in the English version are annotated to expand on their intended meaning. These annotations – carried out in collaboration with the various authors of the questions – attempt to provide greater definition to, or clarification of, the concept behind the questions, and are especially useful when the words themselves are unlikely to have direct equivalents in other languages. The translation process, which is designed to ensure that the questions in all languages are optimally functionally equivalent, is discussed in chapter 4.
• Stage 7 Regardless of the effort that is put into making the questions as functionally equivalent as possible in all languages, it is inevitable that certain questions will be interpreted in a different way by respondents in certain countries. This may happen either because they use the labels of the response scale in a
country-specific way (systematic errors), or because they are simply so unfamiliar with the concepts addressed by the question that their answers tend to be haphazard (random errors). It was with these problems in mind that we incorporated into the ESS fieldwork a supplementary questionnaire containing questions designed to elicit the extent of random and systematic error in different countries. The data from these supplementary questionnaires will ultimately allow for the correction of measurement error across countries, thus making the findings from different countries more equivalent. Background to the evaluation of questions The effect on responses of how questions are worded has been studied in depth by a number of scholars (e.g. Schuman and Presser, 1981; Sudman and Bradburn, 1982; Andrews, 1984; Molenaar, 1986; Alwin and Krosnick, 1991; Költringer, 1993; Scherpenzeel and Saris, 1997). In contrast, little attention has been paid to the difficulties of translating concepts into questions (Blalock, 1990; de Groot and Medendorp, 1986; Hox, 1997). In addition, Northrop (1947) was the first to distinguish between concepts-by-intuition and concepts-by-postulation. What he refers to as 'concepts-by-intuition' are simple concepts whose meaning is immediately apparent. 'Concepts-by-postulation', or 'constructs', are, in contrast, concepts that require explicit definition to be properly understood. So concepts-by-intuition would include judgements, feelings, evaluations, norms and behaviours, where it is relatively clear even on the surface what is meant – such as "people tend to behave in a particular way" or "a certain group is especially likely to have a certain characteristic". In contrast, concepts-by-postulation would require definition, such as "ethnocentrism" or "authoritarianism". These more complicated concepts can usually be captured only by multiple items in a survey questionnaire. Attitudes used to be defined as a combination of cognitive, affective and action tendencies (Krech et al., 1962). But this conceptualisation was challenged by Fishbein and Ajzen (1975), who defined them instead on the basis of evaluations. Although these two definitions were different, it is interesting that both defined attitudes on the basis of concepts-by-intuition. Blalock (1968) had noted the gap between the language of theory and the language of research, and when he returned to the subject two decades later (Blalock, 1990), he noted that the gap had not narrowed. Accepting that such a gap was inevitable to a degree, he also argued that insufficient attention had been given to the development of concepts-by-postulation. Now a further two decades on, the ESS has paid more attention to the development of concepts-by-postulation. Tests have been carried out to check the structure or
dimensionality of particular concepts. But the first step in developing concepts-by-postulation is a clear view of each concept’s individual components, which are in effect a series of concepts-by-intuition. Therefore we start with a focus on the quality of these simpler concepts. Evaluation of ‘concepts-by-intuition’ Developing a survey item for a concept-by-intuition involves choices, some of which follow directly from the aim of the study and the precise measurement objective, such as whether we want from the respondent an evaluation or a description. But many other choices influence the quality of the survey item, such as the nature or structure of the question, its wording, whether response scales are involved, and the mode of data collection. Several procedures have been developed over the years to evaluate survey items before they are set in stone. The oldest and still most commonly used approach is a pre-test followed ideally by a de-briefing of the interviewers involved. Another approach, suggested by Belson (1981), and now known as ‘cognitive interviewing’, is to ask people after they have answered a question in a pre-test how they interpreted the different concepts in the item. A third approach is the use of other “think aloud” protocols during interviews. A fourth approach to assess the cognitive difficulty of a question is to refer the matter to an expert panel (Presser and Blair, 1994), or to judge its linguistic or cognitive difficulty on the basis of a specially devised coding scheme or computer program (Forsyth et al., 1992; van der Zouwen, 2000; Graesser et al., Esposito and Rothgeb, 2000a/b). Another approach is to present respondents with different formulations of a survey item in a laboratory setting in order to see what the effect of these wording changes is (Esposito et al., 1991; Esposito and Rothgeb, 1997; Snijkers, 2002). (For an overview of these different cognitive approaches, see Sudman et al., 1996.) A rather different approach is to monitor the interaction between the interviewer and the respondent through behavioural coding to see whether or not it follows a standard pattern (Dijkstra and van der Zouwen, 1982). Non-standard interactions may indicate problems related to specific concepts in the items. In all these approaches the research attempts to detect response problems, the hypothesis being that certain formulations of the same item will increase or reduce the quality of responses. But the standard criteria for data quality – notably validity, reliability, method effect and item non-response – are not directly evaluated by these methods. As Campbell and Fiske (1959) noted, the problem is that validity, reliability and method effects can only directly be evaluated by using more than one method to measure the same trait. Their design – the Multitrait-Multimethod or MTMM design – is now widely used in psychology and psychometrics (Wothke, 1996), but has also attracted the
attention of scholars in marketing research (Bagozzi and Yi, 1991). In survey research, the MTMM approach has been elaborated by Andrews (1984) and applied in several languages: English (Andrews, 1984), German (Költringer, 1995) and Dutch (Scherpenzeel and Saris, 1997). This same approach to evaluating questions has been applied in the ESS. But before describing the ESS approach, we must define the quality criteria we employed. Quality criteria for single survey items It goes without saying that it is desirable to minimise item non-response in surveys. This is probably the primary criterion for evaluating survey items. Missing values disturb the analysis of answers and may lead to a distortion of the 'true' results. A second criterion is bias, defined as a systematic difference between the true scores of the variable of interest and the observed scores (having been corrected for random measurement error). For validatable factual variables these true scores can be obtained, giving us a benchmark against which to measure and evaluate the (corrected) observed scores. So in electoral research, for instance, the published turnout rate may be compared against survey measurements of turnout rates. It turns out, however, that surveys using standard questions on electoral participation tend to over-estimate turnout. Different formulations may thus be tested to see whether they help to improve the survey estimates. But for attitudinal variables, where the 'true' values are, of course, unknown, the only possibility is to study different methods that may in turn generate different distributions. And the problem is that either or neither of any two different distributions might be correct. These issues have attracted a great deal of scholarly attention (see Schuman and Presser, 1981 for a summary). Meanwhile, Molenaar (1986) has studied the same issues in non-experimental research. As noted, also relevant for any survey instrument are reliability, validity and method effect. The way these concepts can be defined in the context of surveys is illustrated in Figure 3.1, which represents a measurement model for two concepts by intuition – "satisfaction with the government" on the one hand, and "satisfaction with the economy" on the other. In this model it is assumed that:
– fi is the trait factor i of interest measured by a direct question;
– yij is the observed variable (variable or trait i measured by method j);
– tij is the "true score" of the response variable yij;
– mj is the method factor, which represents the specific reaction of people to the method and therefore generates a systematic error; and
– eij is the random measurement error term for yij.
Figure 3.1  Measurement model for two variables measured by the same method
Notes: f1,f2 = variables of interest; vij = validity coefficient for variable i; Mj = method factor for both variables; mij = method effect on variable i; tij = true score for yij; rij = reliability coefficient; yij = the observed variable; eij = the random error in variable yij
The rij coefficients represent the standardized effects of the true scores on the observed scores. This effect is smaller if the random errors are larger. So this coefficient is called the reliability coefficient. The vij coefficients represent the standardized effects of the variables one would like to measure on the true scores for those variables. (This coefficient is called the validity coefficient.) The mij coefficients represent the standardized effects of the method factor on the true scores. (This coefficient is called the method effect, and the larger it is, the smaller is the validity coefficient.) It can be shown that in this model mij2 = 1 – vij2, so the method effect is equal to the invalidity due to the method used.
Reliability is defined as the strength of the relationship between the observed response (yij) and the true score (tij) which is rij2. Validity is defined as the strength of the relationship between the variable of interest (fI) and the true score (tij) which is vij2. The systematic method effect is the strength of the relationship between the method factor (mj) and the true score (tij) which is mij2. The total quality of a measure is defined as the strength of the relationship between the observed variable and the variable of interest which is: (rijvij)2. The effect of the method on the correlations is equal to: r1jm1jm2jr2j. So, by looking at the effect of the characteristics of the measurement model on the correlations between observed variables, it can be seen that both the definitions and the criteria are appropriate. Using elementary path analysis, it can be shown that the correlation between the observed variables ρ(y1j,y2j) is equal to the correlation due to the variables we want to measure, f1 and f2, reduced by measurement error plus the correlation due to the method effects or as provided in the formula below: ρ(y1j,y2j) = r1jv1j ρ(f1,f2)v2jr2j + r1jm1jm2jr2j
(1)
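To make formula (1) concrete, the sketch below evaluates it for a set of invented coefficients and then inverts it to recover the correlation between the variables of interest, which is the correction for measurement error discussed next.

```python
"""Numerical sketch of formula (1): the observed correlation between two items
measured with the same method, given reliability coefficients r, validity
coefficients v, method effects m and the true correlation between the
variables of interest.  All coefficient values are invented."""

def observed_corr(rho_f, r1, v1, m1, r2, v2, m2):
    """Formula (1): attenuation through r and v plus inflation by the shared method."""
    return r1 * v1 * rho_f * v2 * r2 + r1 * m1 * m2 * r2

def corrected_corr(rho_obs, r1, v1, m1, r2, v2, m2):
    """Invert formula (1) to recover the correlation between f1 and f2."""
    return (rho_obs - r1 * m1 * m2 * r2) / (r1 * v1 * v2 * r2)

if __name__ == "__main__":
    r1, v1, m1 = 0.85, 0.95, 0.31     # note that m_i**2 = 1 - v_i**2
    r2, v2, m2 = 0.80, 0.90, 0.44
    true_rho = 0.60
    obs = observed_corr(true_rho, r1, v1, m1, r2, v2, m2)
    print("observed correlation:", round(obs, 3))
    print("recovered true correlation:", round(corrected_corr(obs, r1, v1, m1, r2, v2, m2), 3))
```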
Note that rij and vij, which are always smaller than 1, will reduce the correlation (see first term), while the method effects – if they are not zero – may generate an increase of the correlation (see second term). So if one knows the reliability coefficients, the validity coefficients and the method effects, one can estimate the correlation between the variables of interest corrected for measurement error. The Multitrait-Multimethod design The problem is that the above coefficients cannot be estimated if only one measurement of each trait is available, since in that case there would be only one observed correlation available to estimate seven free parameters. That was why the MTMM design with three traits, each measured with three different methods, was suggested by Campbell and Fiske (1959). Several such experiments have been included in the ESS. Naturally we crafted questions for the ESS interview which we considered optimal for each of the variables. But in the supplementary questionnaire, which is completed by respondents after the main questionnaire, we experimented with alternative versions of some of the same questions. For these
experiments the sample was split up randomly in two or six groups, allowing each version of each question to be tested on at least two random groups. Combining the data from these two groups, three questions about three traits were asked in different ways so that any one respondent received only one repetition of a question for the same trait. This design has been called the Split Ballot MTMM (SB-MTMM) design (Saris et al., 2004b), and it is an alternative to the classical MTMM design as used by Andrews (1984). The important difference between SB-MTMM and the classic MTMM is that in the latter case each respondent receives two repetitions of the same question, and in the former case only one repetition. Memory effects are thus smaller and the response burden lower in SB-MTMM. The disadvantage of SB-MTMM, however, is that the data matrix is incomplete, but Saris et al. (2004b) have shown that all parameters can still be estimated using multiple group analysis (Jöreskog, 1971). Predicting the quality of questions It will be clear that such experiments cannot realistically be contemplated for all questions in a questionnaire as long as the ESS's. In any case, estimates of data quality are affected by other factors, such as the item's position in the questionnaire, its distance from another MTMM measure, the length of its text, and so on. To be able to take these factors into account and to make predictions of the quality of questions which have not been studied explicitly one therefore needs another approach, known as the meta-analysis of MTMM experiments (Andrews, 1984; Scherpenzeel and Saris, 1997). Provided that the MTMM experiments are chosen in such a way as to cover the most important choices involved in the design of the questions, a meta-analysis of the results of these experiments can provide an estimate of the effect of those choices on the reliability, validity and method effects in the ESS as a whole. And for this purpose of evaluating the overall quality of survey instruments, the program SQP was duly developed.2 The predictions are founded on the properties of more than 1000 survey questions, apart from the questions in the ESS itself. Evaluation of 'concepts-by-postulation' We provide below a number of examples of checks we conducted on the ESS. For a complete description of the evaluation of all instruments, see the ESS website (www.europeansocialsurvey.org.uk). Each of these examples illustrates a different aspect of the checks performed. 2
In this case a Windows version of SQP developed by Oberski, Kuipers and Saris (2004) has been used. The latest version of the program can be obtained from the author writing to [email protected]
Political efficacy The concept of political efficacy is a longstanding component of surveys in political science. Two forms of efficacy are often distinguished – beliefs about the responsiveness of the system on the one hand (external efficacy), and beliefs about one's own competence on the other (internal efficacy). Only the second component was included in the ESS core questionnaire. We follow below the stages of design and testing that led us to the final questions. The original formulation of the three proposed political efficacy questions was as follows:
How much do you agree or disagree with each of the following statements?3
1. Voting is the only way that people like me can have any say about how the government runs things.
2. Sometimes politics and government seem so complicated that a person like me cannot really understand what is going on.
3. It is difficult to see the important differences between the parties.
The initial SQP analysis (Table 3.1) suggested that the first two questions were acceptable, but the third one less so because it is a complex assertion.4

Table 3.1  1st SQP evaluation of initial three items on political efficacy

Item   Reliability   Validity   Method effect   Total quality
1      .77           .82        .18             .63
2      .76           .83        .17             .63
3      .62           .76        .24             .47
But the CCT had its doubts about the concept by postulation approach in this instance because of the different premises on which the three items were based. While the first item presents a relationship, the second represents an
3
The five-point agree-disagree scale ranged through ‘disagree strongly’, ‘disagree’, ‘neither agree nor disagree’, ‘agree’, ‘agree strongly’. 4 Here we use specifications of concepts suggested by Saris and Gallhofer (forthcoming).
evaluative belief and the third a complex judgement.5 So the items were seeking to measure different concepts. And these doubts were confirmed by prior studies of the same items. Indeed, in Netherlands election studies, the first two items did not even end up in the same scale because of the low correlation between them. The specialist response to our concerns was that there had indeed been debate about the quality of these questions, and that an elaborate study by Vetter (1997) had shown the old questions to be flawed by unclear factor structures. Experiments had been conducted on other items shown below which had obtained a clearer factor structure:
How much do you agree or disagree with each of the following statements?
1. I think I can take an active role in a group that is focused on political issues.
2. I understand and judge important political questions very well.
3. Sometimes politics and government seem so complicated that a person like me cannot really understand what is going on.
These items were indeed more homogeneous, all being perceptions of the respondent's own subjective competence with respect to politics. But when checking the quality of these questions using SQP, the results (shown in Table 3.2) were not at all encouraging.

Table 3.2  2nd evaluation by SQP of alternative three items on political efficacy

Item   Reliability   Validity   Method effect   Total quality
1      .64           .72        .28             .46
2      .63           .91        .09             .57
3      .65           .87        .13             .57
Taking into account the suggested lower quality of items which were supposed to be more homogeneous than the first group, the CCT decided to investigate further using an alternative measurement procedure. Since Saris et al. (2003) had suggested that better results can be obtained by questions with trait specific response categories than by questions with batteries of
5 Here we also use specifications of concepts suggested by Saris and Gallhofer (forthcoming).
agree/disagree items, a test was conducted in the pilot study using the following alternative form of Vetter's three items:
How often do politics and government seem so complicated that you can't really understand what is going on?6
Do you think that you could take an active role in a group that is focused on political issues?7
How good are you at understanding and judging political questions?8

Table 3.3  3rd evaluation by SQP of further variation in items on political efficacy

Reliabilities      Complexity           Active role          Understand
Method             NL        GB         NL        GB         NL        GB
A/D 5 cat          0.65      0.83       0.66      0.71       0.69      0.78
TS 5 cat           0.88      0.70       0.94      0.86       0.86      0.84
A/D 4/5 cat9       0.78      0.73       0.87      0.82       0.82      0.80
The important difference between the two sets of questions is that the earlier one used the same introduction and response categories for all three items (a ‘battery’) while the one above uses different response categories for each question (a ‘trait specific scale’) (Saris et al., 2003). The contrasting results are shown in Table 3.3, where A/D means an agree – disagree scale was used, while TS means that a trait specific scale was used. This table shows that in the Netherlands the trait-specific format had a much higher reliability than the two agree – disagree formats. In Britain the size of this effect is much less clear, but it holds true for two of the three items. Given the low reliabilities in the first agree – disagree measure it is no wonder that the correlations are normally so low between these items. Based on the results of this pilot study, we decided to use the trait specific scales with categories in all countries. The correlations for these items are now much higher than they were.

6 The response categories were ‘never’, ‘seldom’, ‘occasionally’, ‘regularly’, ‘frequently’, and ‘don't know’.
7 The response categories were ‘definitely not’, ‘probably not’, ‘not sure either way’, ‘probably’, ‘definitely’ and ‘don't know’.
8 The response categories were ‘very bad’, ‘bad’, ‘neither good nor bad’, ‘good’, ‘very good’, ‘don't know’.
9 MTMM requires three methods to be used for each trait. The third one was similar to the first, but for some items a four-point scale was used in error.
The Human Values Scale

A second and in principle more difficult evaluation was provided before Round 1 by the proposal to include the highly praised Human Values Scale into the ESS questionnaire (Schwartz, 1997; see also chapter 8). The scale differentiates between people’s underlying value orientations, which Schwartz describes as “affect-laden beliefs that refer to a person’s desirable goals and guide their selection or evaluation of actions, policies, people and events”. The problem we faced was that each item of the Scale contained two different ways of expressing what was essentially the same motivation or belief. All the questions refer to an imaginary person who feels or acts in different ways. But to get across both the motivational and the value components of the imaginary person, two statements are invariably employed – the first about what is important to that person and the second more about the person’s motivations. The problem for questionnaire design, however, is that it is generally advisable to avoid double-barrelled questions for the simple reason that some respondents will wish to agree with one half of the question and to disagree with the other half. So the CCT’s first reaction was to query whether one statement for each item – the clear value statement – might not suffice. The following example of a single item in the scale helps to explain the perceived problem. The 21-item scale was preceded by the following introduction:

Here we briefly describe some people. Please read each description and think about how much each person is or is not like you. Put an X in the box to the right that shows how much the person in the description is like you.10 How much like you is this person?

After this general introduction the different statements were read to the respondents. The one for the value of an “exciting life” was formulated as follows:

He likes surprises. It is important to him to have an exciting life.

The first part of this item relates to a feeling on the part of the imaginary person: “he likes surprises”. The second part is a value: “it is important to him to have an exciting life”. As suggested by Saris and Gallhofer (2005), values are ideally measured by statements that take the form of the second sentence, which express the
The response categories were: ‘very much like me’, ‘like me’, ‘somewhat like me’, ‘a little like me’, ‘not like me’, ‘not like me at all’.
importance to the individual of “an exciting life”. Any difference between the actual and observed variable would probably be attributable to measurement error alone, as indicated in Figure 3.2.
Figure 3.2  Importance and value position
The other part of the item (“he likes surprises”) relates to a feeling on the part of the imaginary person. But is such a feeling the same as a value or – more probably – a consequence of the value? There may indeed be other variables that influence the feeling, such as ‘risk aversion’. That is, two people with the same view of the importance of an exciting life might differ in their attitude to the risks inherent in such a life. This could be modelled as in Figure 3.3. So although the two measures might be the same, apart from measurement error, they might turn out to be different as a result of the intervention of a third variable – in this case risk aversion or some other similar influence. As this was an important issue to clarify in advance, we decided to conduct a pilot experiment to test whether the direct value measures received the same answers as the feeling questions. Three of the Human Values Scale items were decomposed into an importance assertion and a feeling assertion and measured in the supplementary questionnaire. We present here the analysis from the Dutch pilot only, but the British pilot produced the same results. First we present the results of a standard MTMM analysis in Table 3.4.
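Before turning to those results, the attenuation mechanism sketched in Figure 3.3 can be illustrated with a minimal simulation. This is a hypothetical sketch rather than part of the ESS analysis: if the feeling indicator reflects the value but also an independent influence such as risk aversion, the two indicators correlate at well below 1 even when both are measured without any error.

```python
# Hypothetical illustration of the Figure 3.3 model: the 'feeling' indicator
# ('he likes surprises') depends on the value and on an unrelated influence
# (labelled risk aversion here), so its correlation with the direct value
# measure falls below 1 even in the complete absence of measurement error.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

value = rng.standard_normal(n)             # importance of an exciting life
risk_aversion = rng.standard_normal(n)     # independent third variable

importance_item = value                            # direct value statement
feeling_item = 0.8 * value - 0.6 * risk_aversion   # feeling statement

print(np.corrcoef(importance_item, feeling_item)[0, 1])   # roughly 0.8
```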
Figure 3.3  Importance, value and risk

Table 3.4  MTMM experiment on Human Values Scale (Netherlands pilot, Round 1)
[Table 3.4 reports reliability and validity estimates for the importance assertion, the feeling assertion and the complete item, for each of the three decomposed items; all estimates lie between .76 and .99. The column layout of the printed table did not survive text extraction.]
There is little indication from this analysis that either of the measures is better than, or systematically different from, the other. So as Table 3.5 shows, we then checked the correlations between the three items after correction for measurement error. Again, even without a formal test, it is clear that there is no substantive or relevant difference between these correlations.
Table 3.5  Correlations between the three items and their components

Correlations of the variables   Items only importance   Items only feelings   Items combinations
1 with 2                        .74                     .70                   .71
2 with 3                        .55                     .52                   .49
1 with 3                        .50                     .50                   .49
Although we now had very strong evidence of the equality of these measures, it could still be argued that the variables were different even though the intercorrelations between them were the same. Given the absence of any relevant method effect, we could directly test the equality of the measures by measuring whether the correlations between the importance and the feeling assertions for each separate item were in fact equal (or almost equal) to 1. We did this using the congeneric test model of Jöreskog (1971), which we show in Table 3.6.

Table 3.6  Test of the equality of the importance and feeling variables

                        assumption corr = 1        assumption corr free
Number of the item      chi2      df       n       chi2      df       corr
1                       26        2        200     4         1        0.85
2                       16.6      2        200     0.1       1        0.91
3                       15.6      2        200     9.4       1        0.95
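Table 3.6 implies a one-degree-of-freedom nested comparison for each item: the model that fixes the correlation at 1 against the model that leaves it free. The sketch below computes the chi-square difference test from the figures in the table; treating the two chi-square values this way is our reading of the table rather than a re-analysis of the data.

```python
# Chi-square difference test implied by Table 3.6: for each item, the model
# with the correlation fixed at 1 (df = 2) is compared with the model that
# leaves the correlation free (df = 1), giving one degree of freedom.
from scipy.stats import chi2

table_3_6 = {      # item: (chi-square with corr = 1, chi-square with corr free)
    1: (26.0, 4.0),
    2: (16.6, 0.1),
    3: (15.6, 9.4),
}

for item, (constrained, free) in table_3_6.items():
    delta = constrained - free
    p_value = chi2.sf(delta, df=1)
    print(f"item {item}: delta chi2 = {delta:.1f}, p = {p_value:.4f}")
```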
Although the requirement that the correlation should be 1 is not formally met, its estimated value after correcting for measurement error is between .85 and .95, which was certainly high enough for our purposes. On the basis of these results the CCT concluded that it did not matter that two statements were combined into single items. So the Human Values Scale remained intact and is proving highly productive (see chapter 8).

An evaluation of cross-cultural comparability

In each round of the ESS we conduct six MTMM experiments in all countries, specifically to detect whether the quality of our measurement instruments is the same in different countries. These MTMM experiments also help us to evaluate the effects of different decisions during the questionnaire design phase on the quality of the instruments. When a sufficient number of these MTMM experiments have been conducted,
we might well be able to predict the quality of all questions before the data are collected. This point has not been reached, but the information is building up. Here we discuss only one experiment in detail, while giving an overview of the results of the six experiments conducted in Round 2. In the ESS questionnaire we generally employ answer categories with fixed reference points – that is, labels on an underlying scale, such as “extremely satisfied” or “extremely dissatisfied” as end points. If a less fixed reference point, such as “very dissatisfied”, is used as the end of the scale, some respondents might regard it as an end point and others might see it as an intermediate position – a difference of perception that may cause differences in responses which have little to do with differences in substantive opinions. So it is generally believed that fixed reference points have advantages, and we conducted an experiment during the main ESS fieldwork to test this assumption in all participating countries. At the same time we were able to test whether on a 0 – 10 numeric satisfaction scale a third fixed reference point in the scale – a labelled mid-point (“neither dissatisfied nor satisfied”) – would work better than just two. The topic for the experiments was a group of well-worn questions about people’s satisfaction levels with different aspects of society – the state of the economy, the way the government is doing its job, and the way democracy is working. The questions themselves are shown in the appendix for this chapter. All three statements were subjected to three separate experimental treatments – one with fixed end reference points (extremely dissatisfied and satisfied), one without fixed end reference points (very dissatisfied and satisfied), and the third with three fixed reference points (extremely dissatisfied and satisfied, plus a labelled mid-point, “neither dissatisfied nor satisfied”). The mean quality (total) across the three forms shows that the standard form with two fixed reference points is quite a bit better (.77) than the form without fixed reference points (.70) and the one with three reference points (.70). But how much did the data quality vary across the 17 countries we analysed? Table 3.7 shows the findings for each of the three questions in the main questionnaire. We see first of all that the question asking about satisfaction with the economy produces considerably lower quality data than do the other two. This is not the effect of just a few countries; in all countries the quality of the first question is lower than that of the other two questions. And, as Table 3.7 shows, the differences across countries are large – more than .2 for each of the three questions – which would in turn produce appreciable differences in correlations with other variables. These findings show that it is necessary to correct for data quality before comparing correlations across countries. Such corrections can be made using
Table 3.7  The quality of the three questions in different countries

Country   Economy   Government   Democracy
Aus       .7921     .8836        .8649
Bel       .7056     .8464        .8649
Cze       .5997     .6555        .6561
Den       .6889     .9025        .8100
Est       .7921     .9025        .8649
Fin       .6724     .8281        .8281
Ger       .5219     .7792        .8129
Gre       .7632     .7964        .8138
Lux       .7225     .7921        .9801
Nor       .7396     .9801        .7569
Pol       .7569     .9025        .8836
Por       .8100     .8464        .8281
Slo       .5858     .7162        .6416
Spain     .5675     .6688        .6688
Swe       .6235     .7474        .6521
Swi       .7396     .8100        .9409
UK        .7744     .8836        .8100
Total     .6974     .8201        .8046
the information contained in Table 3.7. For instance, take the observed correlation between the ‘economy’ item and the ‘government’ item (r12). In Estonia the correlation between these two variables is .659 and in Spain .487 – a large difference. Before we know whether this is substantively relevant, however, we must correct for measurement error, especially because – as we have seen – the quality of the measure is much higher in Estonia than in Spain. In Estonia the quality measure for these two items is .792 and .903 respectively, while in Spain it is .567 and .669. The correction for measurement error can be made by dividing the observed correlation by the product of the square root of the quality estimates for the two variables,11 or the correlation corrected for measurement error (ρ12):

ρ12 = r12 / (q1 q2)    (2)
Using this formula we obtain for Estonia a correlation corrected for measurement error of .78 and for Spain .79. These two correlations are both larger than the observed correlations, but the correction reduces the difference in correlations between the two countries from nearly .2 to .01, suggesting the importance of these corrections.

11 Note that the quality estimates in the tables are q2. Therefore one has to take the square root of the quality estimates to obtain the quality coefficients (q).
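A minimal worked example of formula (2), using the Estonian and Spanish figures quoted above (observed correlations of .659 and .487, and the quality estimates from Table 3.7):

```python
# Correction for measurement error as in formula (2): the observed correlation
# is divided by the product of the quality coefficients q, i.e. the square
# roots of the quality estimates (q squared) reported in Table 3.7.
from math import sqrt

def corrected_correlation(r_observed, q_sq_1, q_sq_2):
    return r_observed / (sqrt(q_sq_1) * sqrt(q_sq_2))

# 'economy' and 'government' items
print(round(corrected_correlation(0.659, 0.7921, 0.9025), 2))   # Estonia: 0.78
print(round(corrected_correlation(0.487, 0.5675, 0.6688), 2))   # Spain:   0.79
```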
Just why the quality varies so much for different countries remains an open question. It may well have something to do with the translations or the formulation of the questions, which we know to be flawed. But it may also have something to do with the number of interviews per interviewer. We are still investigating this phenomenon.

Conclusion

We have described the attention we have given to evaluating the quality of single survey items as well as the link between these items and the ‘concepts by postulation’ that were originally proposed. In several instances this led to changes in the items. Whenever the CCT was uncertain about the quality of the new formulations, they were tested in the pilot study against alternative formulations. It was on this basis that the source questionnaire was formulated. But the source questionnaire was then translated from English into many different languages. Despite all the efforts we made to achieve functional equivalence for all questions, it is, of course, still likely that certain errors will be of different sizes in different languages. By including MTMM experiments in our procedures, we were able to estimate these different error variances, as well as the reliability and validity of some questions in the different languages. Our analyses show that there are indeed differences in error structures between certain languages and thus in the reliability and validity of certain questions by country. But when corrections for measurement error are made, this sometimes turns out to reduce the differences we had first found. But the corrections can of course also have the opposite effect, turning similar observed correlations into rather larger differences. Either way, our analyses show that conclusions drawn from the observed and corrected correlations are sometimes appreciably different. The fact is that it is risky to derive statistics from observed correlations in advance of corrections for measurement error. The ESS is unusual in its emphasis on estimating the reliability and validity of questions across countries, but – since only six experiments per round are possible – this has not yet been done for all questions. But more will be done as we go along to facilitate the correction of measurement error. And in time we hope to have enough readings to conduct a meta analysis which includes data from all countries so that a predictive program such as SQP can be developed for all languages represented in the ESS. Then the predictive program could be used to determine the size of the measurement errors in different countries and tools could be provided to correct for them. Data from different countries may then be compared without the obstacle of differential measurement error.
References Alwin, D.F. and Krosnick, J.A. (1991), ‘The reliability of survey attitude measurement: the influence of question and respondent attributes’, Sociological Methods and Research, 20, pp.139–181. Andrews, F.M. (1984), ‘Construct validity and error components of survey measures: a structural modelling approach’, Public Opinion Quarterly, 48, pp.409–422. Bagozzi R.P. and Yi, Y. (1991), ‘Multitrait-multimethod matrices in Consumer Research’, Journal of Consumer Research, 17, pp.426–439. Belson, W. (1981), The design and understanding of survey questions, London: Gower. Blalock H.M. Jr. (1968), ‘The measurement problem: A gap between languages of theory and research’ in: H.M. Blalock and A.B. Blalock (eds), Methodology in the Social Sciences, London: Sage. Blalock H.M. Jr. (1990), ‘Auxilary measurement theories revisited’, in: J.J. Hox and J. de Jong-Gierveld (eds), Operationalisation and research strategy. Amsterdam: Swets & Zeitlinger, pp.33–49. Campbell, D.T. and Fiske, D.W. (1959), ‘Convergent and discriminant validation by the multimethod-multitrait matrix’, Psychological Bulletin, 56, pp.833–853. Dijkstra, W. & van der Zouwen, J. (1982), Response behaviour in the survey-interview, London: Academic Press. Esposito J, Campanelli, P.C, Rothgeb, J. and Polivka, A.E. (1991), ‘Determining which questions are best: Methodologies for evaluating survey question’ in: Proceedings of the Survey Research Methods Section of the American Statistical Association (1991), pp.46–55. Esposito, J.P. and Rothgeb, J.M (1997), ‘Evaluating survey data: Making the transition from pretesting to quality assessment’ in: P. Lyberg, P. Biemer, L. Collins, E. de Leeuw, C. Dippo, N. Schwarz and D. Trewin (eds), Survey measurement and Process quality, New York: Wiley, pp.541–571. Fishbein, M. and Ajzen, I. (1975), Belief, Attitude, Intention and Behavior: An Introduction to Theory and Research, Reading, MA: Addison Wesley. Forsyth B.H., Lessler, J.T and Hubbard, M.L. (1992), ‘Cognitive evaluation of the questionnaire’ in: C.F. Tanur and R. Tourangeau (eds), Cognition and Survey research, New York: Wiley, pp.183–198. Graesser, A.C., Wiemer-Hastings, K., Kreuz, R. and Wiemer-Hastings, P. (2000a), ‘QUAID: A questionnaire evaluation aid for survey methodologists. Behavior Research Methods, Instruments, and Computers’ in: Proceedings of the Section on Survey Research Methods of the American Statistical Association, pp.459–464. Graesser, A.C., Wiemer-Hastings, K. Wiemer-Hastings, P. and Kreuz, R. (2000b), ‘The gold standard of question quality on surveys: Experts, computer tools, versus statistical indices’ in: Proceedings of the Section on Survey Research Methods of the American Statistical Association, pp.459–464.
Groot A.D. de, and Medendorp, F.L. (1986), Term, begrip, theorie: inleiding to signifische begripsanalyse, Meppel: Boom. Hox J.J. (1997), ‘From theoretical concept to survey questions’ in L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N.Schwarz and D. Trewin (eds), Survey Measurement and Process Quality, New York: Wiley, pp.47–70. Jöreskog K.G. (1971), ‘Simultaneous factor analysis in several populations’, Psychometrika, 34, pp.409–426. Költringer, R. (1993), Gueltigkeit von Umfragedaten, Wien: Bohlau. Költringer, R. (1995), Measurement quality in Austria personal interview surveys’ in: W.E Saris and A.Münnich (eds), The Multitrait-Multimethod Approach to evaluate measurement instruments, Budapest: Eötvös University Press, pp.207–225. Krech D., Crutchfield R. and E. Ballachey (1962), Individual in Society, New York: McGraw-Hill. Molenaar, N.J. (1986), Formuleringseffecten in survey-interviews, Amsterdam: VUuitgeverij. Northrop F.S.C. (1947), The Logic of the Sciences and the Humanities, New York: World Publishing Company. Presser S. and Blair, J. (1994), ‘Survey Pretesting: Do different methods produce different results?’ in: P.V. Marsden (ed), Sociological Methodology, Oxford: Basil Blackwell, pp.73–104. Saris, W.E. and Gallhofer, I.N (2005), A scientific method for questionnaire design: SQP, Amsterdam: SRF. Saris W.E. and Gallhofer, I.N. (forthcoming), Design, evaluation and analysis of questionnaires for survey research, Hoboken: Wiley. Saris, W.E., Krosnick, J.A. and Shaeffer, E.M. (Unpublished, 2003), ‘Comparing the quality of agree/disagree questions and balanced forced choice questions via an MTMM experiment’, Paper presented at the Midwestern Psychological Association Annual Meeting, Chicago, Illinois. Saris W.E., Satorra, A. and Coenders, G. (2004b), ‘A new approach for evaluating quality of measurement instruments’, Sociological Methodology, pp. 311–347. Saris W.E., van der Veld, W. and Gallhofer, I.N. (2004a), ‘Development and improvement of questionnaires using predictions of reliability and validity’ in: S. Presser, M.P. Couper, J.T. Lessler, E. Martin, J. Martin, J.M. Rothgeb and E. Singer (eds), Methods for testing and evaluating survey questionnaires Hoboken: Wiley, pp. 275–299. Scherpenzeel A. and Saris, W.E (1997), ‘The validity and reliability of survey questions: A meta analysis of MTMM studies’, Sociological Methods and Research, 25, pp.341–383. Schuman, H. and Presser, S. (1981), Questions and answers in attitude surveys: experiments on question form, wording and context, New York: Academic Press. Schwartz S.H. (1997), ‘Values and culture’ in: D. Muno, S. Carr and J. Schumaker (eds), Motivation and culture, New York: Routledge, pp.69–84. Snijkers G. (2002), Cognitive laboratory experiences: on pretesting, computerized questionnaires and data quality, PhD thesis, University of Utrecht.
Sudman S., and Bradburn, N. (1982), Asking questions: A practical guide to questionnaire design, San Francisco: Jossey-Bass. Sudman S., Bradburn, N. and Schwarz, N. (1996), Thinking about answers: The Application of Cognitive Processes to Survey Methodology, San Francisco: Jossey-Bass. van der Veld, W.M., Saris, W.E. and Gallhofer, I. (September, 2000), ‘SQP: A program for prediction of the quality of Survey questions’, Paper presented at the ISA – methodology conference, Köln. van der Zouwen, J. (2000), ‘An assesment of the difficulty of questions used in the ISSP questionnaires, the clarity of their wording and the comparability of the responses’, ZA-information, 45, pp.96–114. Vetter, A.(1997), ‘Political Efficacy: Alte und neue Meßmodelle im Vergleich’, Kölner Zeitschrift für Soziologie und Sozialpsychologie, 49, pp.53–73. Wothke W. (1996), ‘Models for multitrait-multimethod matrix analysis’ in: G.C. Marcoulides and R.E. Schumacker (eds), Advanced structural equation modelling: Issues and techniques, Mahwah, N.J: Lawrence Erlbaum, pp.7–56.
Appendix: The ESS Satisfaction questions from Round 2

Form as in the main questionnaire

B25 STILL CARD 13: On the whole how satisfied are you with the present state of the economy in [country]? Still use this card.
Extremely dissatisfied 00  01  02  03  04  05  06  07  08  09  10  Extremely satisfied   (Don’t know) 88
B26 STILL CARD 13: Now thinking about the [country] government, how satisfied are you with the way it is doing its job? Still use this card.
Extremely dissatisfied 00  01  02  03  04  05  06  07  08  09  10  Extremely satisfied   (Don’t know) 88
B27 STILL CARD 13: And on the whole, how satisfied are you with the way democracy works in [country]? Still use this card.
Extremely dissatisfied 00  01  02  03  04  05  06  07  08  09  10  Extremely satisfied   (Don’t know) 88
Second Form

B25 STILL CARD 13: On the whole how satisfied are you with the present state of the economy in [country]? Still use this card.
Very dissatisfied 00  01  02  03  04  05  06  07  08  09  10  Very satisfied
B26 STILL CARD 13: Now thinking about the [country] government, how satisfied are you with the way it is doing its job? Still use this card.
Very dissatisfied 00  01  02  03  04  05  06  07  08  09  10  Very satisfied
B27 STILL CARD 13: And on the whole, how satisfied are you with the way democracy works in [country]? Still use this card.
Very dissatisfied 00  01  02  03  04  05  06  07  08  09  10  Very satisfied
Third Form

B25 STILL CARD 13: On the whole how satisfied are you with the present state of the economy in [country]? Still use this card.
Extremely dissatisfied 00  01  02  03  04  05 (neither satisfied nor dissatisfied)  06  07  08  09  10  Extremely satisfied
B26 STILL CARD 13: Now thinking about the [country] government, how satisfied are you with the way it is doing its job? Still use this card.
Extremely dissatisfied 00  01  02  03  04  05 (neither satisfied nor dissatisfied)  06  07  08  09  10  Extremely satisfied
B27 STILL CARD 13: And on the whole, how satisfied are you with the way democracy works in [country]? Still use this card.
Extremely dissatisfied 00  01  02  03  04  05 (neither satisfied nor dissatisfied)  06  07  08  09  10  Extremely satisfied
4
Improving the comparability of translations
Janet A. Harkness∗
Introduction

When a survey is conducted in multiple languages, the quality of questionnaire translations is a key factor in determining the comparability of the data collected. Conversely, poor translations of survey instruments have been identified as frequent and potentially major sources of survey measurement error. As the ESS questionnaires consist of more than 250 substantive and background questions, around one half of which are new at each round, it is hardly surprising that considerable time, resources and methodological effort have been devoted to developing rigorous procedures for translation and translation assessment. This chapter describes the general approach and individual procedures adopted by the ESS to enhance the quality of translations. In line with recent developments in survey translation practice, the ESS has replaced more traditional survey approaches involving back translation with the team-based approach described and assessed here.
∗ Janet Harkness is a Senior Research Scientist at ZUMA, Mannheim, Germany and Director of the Survey Research and Methodology Program at the University of Nebraska, USA, where she holds the Donald and Shirley Clifton Chair in Survey Science.
Source and target languages

Following terminology from translation science, we distinguish here between ‘source’ languages and ‘target’ languages and thus between source language questionnaires and target language questionnaires. While a source language is the language from which a translation is made, a target language is the language into which translation is made. In the case of the ESS, the source language is (British) English and all translations into other languages are required to be made from the original British source documents.

Multi-national, multi-lingual surveys can follow either of two broad strategies for designing instruments. The first is to try to collect equivalent data by an ‘ask the same questions’ (ASQ) approach to instrument design. The second is to try to collect comparable data by an ‘ask different questions’ (ADQ) approach. In ASQ approaches, the various language versions of the questionnaire are produced on the basis of translation of a source questionnaire. In ADQ approaches, construct or conceptual overlap of the questions used in each context is the basis of comparability, not translation. The ESS, in line with the majority of cross-national surveys, has essentially adopted what may be referred to as a sequential ASQ approach – that is, we finalise the source questionnaire before embarking on other language versions. Two further ASQ models – the simultaneous approach and the parallel approach – cannot be discussed here but see Harkness (forthcoming) for details.

Sequential models have certain advantages which undoubtedly have contributed to their popularity in cross-national research. For instance, a sequential ASQ approach is relatively economical and straightforward to organise. More important still, it permits an unlimited number of target language versions to be produced and allows projects to aim at replicating existing questions. In contrast to ADQ approaches, it offers analysts the chance to make item-for-item comparisons across data sets (cf. van de Vijver, 2003). At the same time, an ASQ sequential approach focuses less on cross-cultural input at the instrument design stage than other models, as discussed in Harkness (forthcoming).

Sequential ASQ models operate on the underlying assumption that functionally equivalent questions can be produced for different cultures and languages by translating the semantic content of the source questions. Their success depends both on the suitability of source question content and formulation and on the quality of the translations made. In saying this, it is important to remember that contextual considerations contribute to determining what respondents perceive questions to mean (see, for example, Schober and Conrad, 2002; Harkness, 2004; Braun and Harkness, 2005). As a result, semantic equivalence of words or questions is, of itself, not a sufficient guarantee of functional equivalence. At the same time, the considerable attention paid to the selection and formulation of source questionnaire items in sequential approaches often means that semantic equivalence (i.e., meaning at
face value) across languages is assumed to be a strong indicator of an accompanying underlying conceptual equivalence.

Apart from the ESS, prominent multi-national surveys that follow a sequential ASQ model include:

• The International Social Survey Programme (an annual academic social science survey that covers around 40 countries: http://www.issp.org)
• The European and World Values surveys
• The family of ‘Barometer’ surveys in Eastern Europe, Asia, Africa and Latin America
• the Survey on Health, Ageing and Retirement in Europe (SHARE)
• the WHO World Mental Health Initiative.

Like many other international studies, the ESS includes questions that have been used in other studies. When questions that were designed for one study are replicated in another, problems may arise in respect of equivalence, translation and adaptation (Harkness, 2004; Harkness et al., 2004). The procedure followed in the ESS is to invite comments and contributions from all participating countries in each round on the draft source questions. The substantive questions in the ESS are accompanied by a prescribed set of socio-demographic ‘background’ questions. Some of these background questions can be translated in the same way as substantive questions, while others, such as education, require country-specific content and formulation.

Organisation and specification
Organisation

As noted in chapter 1, the translation arrangements and their coordination are the primary responsibility of a team within ZUMA, one of the ESS partner institutions, aided by a Translation Expert Panel (see chapter 1). The task of annotating the source questions and of responding to queries from participating countries is shared between ZUMA and the ESS coordinating office at City University.
Specification

One of the ‘rules’ governing ESS translation practices is derived from the original Blueprint document (ESF, 1999), which stipulated that translations should be made into any language used as a first language by five per cent or more of a country’s population. As a result, nine countries have been required to translate their questionnaires into more than one language: two languages each in Belgium, Estonia, Finland, Slovakia, Spain and Ukraine, three
languages each in Israel and Switzerland, and four languages in Luxembourg (see later section of this chapter for a more detailed listing). The detailed arrangements and recommendations for translation were developed by the head of the translation team at ZUMA, in consultation with the Translation Expert Panel. Each participating country is required not only to reserve an appropriate budget for their translations and pre-testing according to the centrally provided specification but also to assume responsibility for the quality of the translations it ultimately produces. The specification produced in Round 1 has been somewhat modified in subsequent rounds, but remains substantially intact (for the latest version see www.europeansocialsurvey.org). The ESS translation guidelines and support materials are designed to provide National Coordinators with detailed guidance on translation and translation assessment procedures. The key points covered are as follows:

• Countries are required to use two translators for each language they employ and to adopt a team approach to the process (TRAPD – described in detail below). Details of the approach identify the range of skills that each team needs to have and the procedures to be followed in producing translations and evaluating and revising the outcomes. Revisions are based on team discussions of the draft translations in direct comparison with the English-language source questionnaire. This is then followed by pre-testing of the questionnaire and examination of pre-test findings.
• As noted above, the questionnaire and supporting field documents must be translated into any language spoken as a first language by five per cent or more of a country’s population.
• As numerous questions may be replicated in future rounds, it is important to decide on formulations/translations that are likely to stand the test of time. Wording changes between rounds are to be avoided wherever possible.
• Countries that share languages are required to consult one another on their versions in order to reduce unnecessary differences between same language versions.
• Documentation of the translation process and translation products is required in order to facilitate replication in future rounds as well as comparisons within a round of different national versions in the same language.
• All countries are required to take into account the literacy level of their population.

Since the ESS translation requirements are detailed, concrete and innovative, countries participating for the first time might not be familiar with the kind of procedures proposed. The translation guidelines thus aim to be both detailed and accessible for all participating countries, and a help and query desk is also part of the support service provided by the ZUMA team. Participants in Round 1 were provided with sets of guidelines, descriptions and examples at a series of
National Coordinators’ Meetings. New entrants to the ESS in later rounds are provided with an overview that covers these materials and are given the opportunity to consult one-to-one with ESS translation specialists.

The Translation Procedure: TRAPD

An acronym for Translation, Review, Adjudication, Pre-testing and Documentation, TRAPD consists of five interrelated procedures (Harkness, 2003). These form the basis of the team approach to translation developed for the ESS. The procedures are open to iteration so that – as adjustments are made to draft versions of translated questions – the review and documentation activities may also need to be repeated. The three key roles involved in the translation effort are those of translator, reviewer and adjudicator.

Translators need to be skilled practitioners who have ideally received training in the translation of questionnaires. The ESS calls for two translators per questionnaire, each of whom is required to translate out of English into their strongest language (normally their ‘first’ language). Having translated independently from each other, they then take part in the subsequent review session(s). The notes they make on their translations provide valuable information for the review session.

Reviewers need not only to have good translation skills but also to be familiar with the principles of questionnaire design and the particular study design and the topics covered. A single reviewer with linguistic expertise, experience in translating, and survey knowledge is sufficient. But if one person cannot be found with all these skills, then more than one may be used to cover the different aspects. Reviewers do not produce either of the two draft translations but attend the review sessions and contribute to and guide the revisions. Once again, notes from the review may inform any subsequent adjudication.

The adjudicator is the person responsible for the final decisions about which translation options to adopt. This is often the National Coordinator but may also be another nominated person in the team. In any event, adjudicators must have knowledge of the research topics and survey design more generally, as well as being proficient in the languages under discussion. Especially when there are multiple languages involved, the adjudicator – as the person responsible for signing off the translations – may not be sufficiently proficient in one or other of the languages. In those cases, the adjudicator is required to work closely with the reviewer or a further suitable consultant to complete the task. The adjudicator ideally attends the review and contributes to revisions. Failing this, adjudication takes place at a later independent meeting with the reviewer and any other relevant experts.

This multi-stage approach was adopted for three main reasons: to mitigate the subjective nature of translation and text-based translation assessment; to ensure appropriate stage-by-stage documentation which helps both adjudicators and
subsequent analysts; and to allow comparisons of translations by countries which share a language.

The decision taken to secure input from two translators was motivated by a growing body of research that points to the advantages of discussing translations in a team (e.g., Guillemin et al., 1993; Acquadro et al., 1996; McKay et al., 1996; Harkness et al., 2004). By giving the team two draft versions to discuss, more options are automatically available for discussion and one translator is not called on to ‘defend’ her/his product against the rest of the team. In addition, regional variance, idiosyncratic interpretations, and inevitable translator oversights can be better dealt with (Harkness and Schoua-Glusberg, 1998). Moreover, team approaches enable closer appraisals and more detailed revisions than do methods that use a single translator and compartmentalised review procedures. The team approach specified for the ESS ensures that people with different and necessary fields of expertise are involved in the review and adjudication process. Properly conducted by a well-selected and well-briefed team, this approach brings considerable benefits. Even so, the procedures themselves are no automatic guarantee of quality.

As noted, the TRAPD strategy incorporates review into translation production. This approach was selected in preference to the more traditional model of translation followed by back translation. A major drawback of back translation as a review process is that it ultimately focuses attention on two versions of the source questionnaire rather than concentrating appropriate attention on the translation in the target language. For detailed discussion of back translation and its inherent weaknesses, see Harkness (2003) and Harkness (forthcoming); see, too, Brislin (1970, 1980, 1986), Hambleton (2005), Pan and de la Puente (2005) and McKenna et al. (2005).

Split and parallel translations

As noted, the first stage of our procedures involves two translators per language working independently of each other to produce two parallel translations. At a subsequent reconciliation meeting, the translators and a reviewer go through the questionnaire question by question, discussing differences and, if possible, coming to a consensus on a single new version. If the adjudicator is also present, a final version may be reached in one sitting. Otherwise, any unresolved differences go forward to an adjudication meeting. However, ESS countries that plan to discuss their translations with other countries producing translations in the same language are permitted to produce only one translation. At the same time, the fact that a country intends to discuss its version with another country does not imply that the translation produced within one country is less important, nor that it needs
to be undertaken with less care. Countries are encouraged to plan ahead so as to schedule sharing activities for the new ‘rotating’ modules at each round.

In producing their single country version, each country is still required to use two translators. The source questionnaire is split up between these two in the alternating fashion used to deal cards in many card games, thus ensuring each translator gets an even spread of the questionnaire to translate (Schoua-Glusberg, 1992). The two halves of the questionnaire are then merged for the review discussion, during which the translators, the reviewer and (usually) the adjudicator go through the translation question by question, discussing alternatives and agreeing on a single version. Once these national versions are produced, the countries sharing languages arrange to discuss their different versions with a view to harmonising them wherever appropriate. In discussions between two countries, therefore, two versions of the questionnaire are available, one produced by one country, one produced by the other. The precise steps recommended for sharing are described in documentation available on the ESS website. In recommending that countries who share a language should co-operate on reducing differences across their questionnaires, the ESS chose to encourage harmonisation where appropriate but to refrain from a strict policy of enforced harmonisation by which countries sharing a language would be required to use identical question wording.
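To make the ‘dealing’ metaphor concrete, the sketch below shows how such an alternating split might be implemented; the question identifiers are invented for the illustration and are not taken from the ESS questionnaire.

```python
# Illustrative only: alternate questions between the two translators in the
# way cards are dealt, then merge the two halves back into questionnaire
# order for the team review. The identifiers here are hypothetical.
questions = [f"Q{i:02d}" for i in range(1, 11)]

translator_a = questions[0::2]    # Q01, Q03, Q05, ...
translator_b = questions[1::2]    # Q02, Q04, Q06, ...

merged_for_review = sorted(translator_a + translator_b, key=questions.index)
print(merged_for_review == questions)   # True
```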
Producing multiple translations With the exception of Ireland and the UK, all ESS countries need to translate into at least one language and many had to field the survey in multiple languages so as to meet the five per cent rule referred to earlier. Translation demands on individual countries can be very different. For example, in Round 1, Switzerland and Israel both produced three written translations. Whereas Switzerland shared each language with at least one other country and could theoretically consult with these, Israel was the only country in that round to translate into Hebrew, Arabic and Russian. Luxembourg also fielded the questionnaire in several languages in the first round but, contrary to ESS requirements, did not produce a written translation in that round for each language in which interviews were conducted. Table 4.1 indicates that for Round 2, nine languages were shared and that 16 of the countries listed could, in principle, have shared the development of at least one translation with another country or countries.
Table 4.1  Countries translating and/or sharing languages in Rounds 1 and 2

[The table lists each participating country, the language(s) into which it translated the questionnaire, the other countries with which each language was in common, and the round(s) – 1 and/or 2 – concerned. The column alignment of the printed table did not survive text extraction; the Round 2 language-sharing pattern is summarised in Table 4.2 below.]

Notes: (1) Written Dutch and Belgian Flemish are very similar; (2) Israel did not participate in Round 2, Ukraine and Estonia did not participate in Round 1; (3) Luxembourg did not have written translations for some languages in Round 1 and it also fielded some questionnaires in English
Sharing languages and harmonisation

The ESS countries that produced a questionnaire in the same language as another country in Round 2 are set out in Table 4.2. We have treated Dutch and Belgian Flemish as close enough to be able to be classified in this way.

Table 4.2  Round 2 countries translating questionnaires into the same languages

Shared language   Countries
Dutch/Flemish     Belgium, The Netherlands
French            Belgium, France, Switzerland, Luxembourg
German            Austria, Germany, Switzerland, Luxembourg
Italian           Italy, Switzerland
Hungarian         Hungary, Slovakia
Russian           Ukraine, Estonia
Swedish           Sweden, Finland
Ancillary measures to support translation

A number of ancillary measures have been developed to facilitate ESS translation efforts.
Annotating the source questionnaire

Survey questionnaires tend to look rather simple, but they are, instead, complex measurement tools. One reflection of this is the fact that interviewer manuals often provide interviewers with definitions and clarifications of terms as they are intended in a given questionnaire. For instance, a household composition question such as ‘How many people, including children, live in this household?’ is often accompanied in the interviewer manual by definitions of what counts as a ‘household’, what counts as ‘live in’, and reminders about various categories of people who may stay for different periods and on different financial bases in the dwelling unit (for example, boarders, lodgers, servants, visiting relatives). This information thus clarifies which categories of people are to be counted and guides interviewers on how to answer respondent queries about what the question means. Annotations in the source questionnaire for translators and others involved in producing new versions serve a similar purpose. They provide information on what the question is intended to measure. For instance, ‘household’ in the question above might be automatically associated with ‘home’ and hence with ‘family’ in some cultures. In measurement terms, however, the household
composition question is intended to refer to a dwelling unit. A note for those producing other language versions of this question could guide them on the required concept, however differently this might then be realised in different countries. Annotations are not intended to be incorporated literally into translated questions, nor provided to interviewers as notes. They are simply to be used as aids to the design of functionally equivalent questions.
Query hotline and FAQs

ESS participants are encouraged to contact the translation team based at ZUMA about any difficulties they encounter during translation. By the second round of the ESS, we were thus able to compile a list of frequently asked questions and the appropriate replies. In the course of doing so, it has become evident that many of the queries have at least as much to do with measurement properties of the source questionnaire as with translation issues per se.
Documentation templates

In the first two rounds, the ESS questions were delivered to participating countries in two formats. On the one hand, a formatted paper-and-pencil source questionnaire was distributed as a Word file. On the other, the questionnaire was sent out in a Word table template that assigned each question, instruction, and set of response categories to individual rows, and provided empty columns for the two envisaged translations and the translation and review comments that would inform the review process and later document the translation output. In social science survey projects, documentation of translation decisions and difficulties is, to date, rare. ESS countries were asked to document translation and review decisions using the template provided for six main reasons:

• The TRAPD approach is based on discussion and revision of drafts to arrive at final versions. Having a record of the points felt to be at issue in the forms of notes or comments from translators can greatly facilitate the review discussion. Note-taking is a tool that trained translators often use to accelerate translation and revision and it helps them recall the rationale for particular decisions.
• Countries sharing languages need a record of the individual decisions and their rationale in order to compare and contrast their alternative versions.
• Later rounds of the survey will be informed by good records of the initial problems encountered and their solutions. If changes are made in
either source questions or translations over time, records need to be available of the chain of changes across translations (version tracking and documentation).
• New members joining the programme that share languages with countries already involved will have the documentation available to them.
• Data analysts will be able to consult such documents as part of their efforts to interpret the data.
• The template was a means of encouraging teams to take notes alongside points at issue as they worked. Experience has shown that if writing up problems or solutions is delayed, many details will be forgotten.

Lessons learned

The development and refinement of ESS translation guidelines and support materials has been a learning experience for those involved. Insights gained are summarised briefly below.
Source questionnaire and translation

Although participating countries are given every chance to comment on draft modules and encouraged to help question drafting teams resolve cultural and linguistic problems at an early stage, most problems are either not noticed or neglected until the questions come under close scrutiny when the translation proper begins. The advance translation procedures described below could help remedy this.
Advance translation

While national teams are asked not to begin their official translations before the source questionnaire for each round has been finalised, we do suggest that they use even rough-and-ready translations as a problem spotting tool (see discussion in Harkness, 1995, 2003; Harkness and Schoua-Glusberg, 1998; Harkness et al., 2004; Braun and Harkness, 2005). The expectation is that by jotting down a first translation of each question during their appraisal of the draft source questionnaire, participating countries can help identify problems before the source questionnaire is finalised, at a time when wording can still be more readily changed. The dynamic process of ESS question design can now be monitored ‘live’ on the web and National Co-ordinators will be encouraged in future to try this technique of advance translation in appraising the draft source questions.
Templates and production tools

It is important to minimise changes to the source documents once translation has begun, but changes cannot be completely avoided even at this stage, because they often arise from late feedback. Alterations to ‘finalised’ source documents were much reduced in Round 2 compared to Round 1, so perhaps we are coming to grips with this problem. Even so, better tools that help national teams to keep up to date with changes and allow them to continue to use a template aligning source questions and translations would clearly be of great benefit. The infrastructure grant that the ESS has recently received will enable us to produce a blueprint for such tools by the end of 2007. Meanwhile, checklists can be used to ensure that common production errors (such as the reversal of response options, or omissions) are swiftly spotted and remedied during the translation or review process.
Attention to detail

Initial evaluations of some of the ESS translations from the first two rounds suggest that countries sharing languages would benefit from across-country discussion of the draft questionnaires. Efforts will be made to enhance such collaboration in coming rounds. Question-by-question review by a team of specialists, as recommended in the ESS specification, greatly increases the likelihood of finding and remedying major translation mistakes as well as subtler errors. It is clear, however, that some countries have been more successful than others in meticulously implementing the ESS procedures. Certainly, mistakes found in the first two rounds suggest that some countries had failed to use the specified team approach effectively. Countries also need to remember to carry out normal copy-editing and proof-reading of their translations to ensure that nothing has been omitted or incorrectly placed in the course of the translation effort. The copy-editing review should check both the correctness of the translated text as text and its completeness. Copy-editors also need to check back against the source questionnaire to be able to identify inadvertent omissions or reversals. Questions that countries have raised with the head of the translation team at ZUMA reflect the difficulty translators sometimes have in understanding the purpose of survey translations. Since techniques and strategies that translators find useful for other kinds of texts are often not appropriate for survey translations, even very good translators need to be thoroughly
briefed on the special nature of survey translation. It may take some concentrated effort before translators and possibly their supervisors can be brought to appreciate this.
Identifying translation errors

Errors in translation have a debilitating effect on data comparability. An important and evolving part of the quality control work performed by the CCT includes identifying survey error related to translations. For this, a combined approach is necessary (Harkness et al., 2004): on the one hand, the quality of translations needs to be monitored by people knowledgeable about the languages involved and the characteristic properties of questionnaires. On the other hand, a variety of statistical procedures can be employed to check various sources of potential error in the data, of which translation errors may be one. Billiet (2006) provides an example, based on ESS questions in French.

Conclusion

For a variety of reasons, good survey translation is far more difficult than it may appear. Without an understanding of the principles of questionnaire design and survey measurement, even skilful translators will not be able to produce the kind of document required. By bringing together translators and survey specialists, the team approach adopted by the ESS goes a good way towards recognising this. At the same time, research on ESS translation indicates that translators would benefit from more briefing and training on survey translations and that translation supervisors might need to be better informed about the risks of corner-cutting. Poor translations deprive researchers of the opportunity to present respondents with the questions they intended. The ESS has invested unusual effort and expense in attempting to develop a practical, theoretically sound and rigorous framework for translation. We are aware that challenges and problems remain and continuing efforts are being made to iron out remaining difficulties. The recent evaluation of ESS translation outputs so as to inform new strategies is one example. Our intention is to introduce ‘review-and-learn’ projects in future rounds that will contribute to our common understanding of what needs to be done and how it should best be implemented on the ground.
References Acquadro, C., Jambon, B., Ellis, D. and Marquis, P. (1996), ‘Language and Translation Issues’ in: B. Spilker (ed.), Quality Life and Pharmacoeconomics in Clinical Trials, 2nd edition, Philadelphia: Lippincott-Raven. Billiet, J (2006), ‘Things that go wrong in Comparative Surveys – evidence from the ESS’, Paper presented at the ESRC Methods Festival, Oxford University, 20 July 2006. Braun, M. and Harkness, J.A. (2005), ‘Text and Context: Challenges to Comparability of Survey Questions’ in: J.H. Hoffmeyer-Zlotnik and J.A. Harkness (eds), ZUMA-Nachrichten Spezial No.11. Methodological Aspects of CrossNational Research, Mannheim: ZUMA. Brislin, R.W. (1970), Back-translation for cross-cultural research,. Journal of CrossCultural Psychology, 1 (3), pp.185–216. Brislin, R.W. (1980), ‘Translation and Content Analysis of Oral and Written Materials’ in: H.C. Triandis and J.W. Berry (eds), Handbook of cross-cultural Psychology, Boston: Allyn & Bacon. Brislin, R.W. (1986), ‘The wording and translation of research instruments’ in: W.J. Lonner and J.W. Berry (eds), Field methods in cross-cultural research. Beverly Hills, CA: Sage. de Mello Alves, M.G., Chor, D., Faerstein, E., de S Lopes, C. and Guilherme L. (2004), ‘Short version of the “job stress scale”: a Portuguese-language Adaptation’, Rev Saúde Pública, 38 (2), pp.164–171. ESF (European Science Foundation) (1999), Blueprint for a European Social Survey, Strasbourg: ESF. Guillemin, F., Bombardier, C. and Beaton, D. (1993), ‘Cross-Cultural Adaptation of Health-Related Quality of Life Measures: Literature Review and Proposed Guidelines’, Journal of Clinical Epidemiology, 46 (12), pp.1417–1432. Hambleton, R.K. (2005), ‘Issues, Designs, and Technical Guidelines for Adapting Tests in Multiple Languages and Cultures’ in: R.K. Hambleton, P. Merenda and C.D. Spielberger (eds), Adapting Educational and Psychological Tests for CrossCultural Assessment, Hillsdale: Erlbaum. Harkness, J.A. (1995), ISSP Methodology Translation Work Group Report 1995, Report to the ISSP General Assembly at the 1995 Cologne ISSP meeting. Harkness, J.A. (2003), ‘Questionnaire Translation’ in: J.A. Harkness, F. van de Vijver and P. Mohler (eds), Cross-Cultural Survey Methods, New York: John Wiley and Sons. Harkness, J.A. (2004), ‘Overview of Problems in Establishing Conceptually Equivalent Health Definitions across Multiple Cultural Groups’ in: S.B. Cohen, and J.M. Lepkowski (eds), Eighth Conference on Health Survey Research Methods, Hyattsville: US Department of Health and Human Services, pp. 85–90. Harkness, J.A., (forthcoming), ‘Comparative Survey Research: Goals and Challenges’ in: J. Hox, E.D. de Leeuw and D. Dillman (eds), International Handbook of Survey Methodology, and Mahwah: Lawrence Erlbaum.
Harkness, J.A., Pennell, B.E. and Schoua-Glusberg, A. (2004), 'Survey Questionnaire Translation and Assessment' in: S. Presser, J.M. Rothgeb, M.P. Couper, J.T. Lessler, E. Martin, J. Martin and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires, Hoboken: John Wiley and Sons.
Harkness, J.A. and Schoua-Glusberg, A. (1998), 'Questionnaires in Translation' in: J.A. Harkness (ed.), Cross-Cultural Survey Equivalence, ZUMA-Nachrichten Spezial No.3, Mannheim: ZUMA.
McKay, R.B., Breslow, M.J., Sangster, R.L., Gabbard, S.M., Reynolds, R.W., Nakamoto, J.M. and Tarnai, J. (1996), 'Translating Survey Questionnaires: Lessons Learned', New Directions for Evaluation, 70, pp.93–105.
Pan, Y. and de la Puente, M. (2005), Census Bureau Guidelines for the Translation of Data Collection Instruments and Supporting Materials: Documentation on How the Guideline Was Developed, Statistical Research Division, U.S. Census Bureau.
Schober, M.F. and Conrad, F.G. (2002), 'A Collaborative View of Standardized Survey Interviews' in: D.W. Maynard, H. Houtkoop-Steenstra, N.C. Schaeffer and J. van der Zouwen (eds), Standardization and Tacit Knowledge: Interaction and Practice in the Survey Interview, New York: John Wiley and Sons.
Schoua-Glusberg, A. (1992), Report on the Translation of the Questionnaire for the National Treatment Improvement Evaluation Study, Chicago: National Opinion Research Centre.
van de Vijver, F.J.R. (2003), 'Bias and Equivalence: Cross-cultural Perspectives' in: J.A. Harkness, F.J.R. van de Vijver and P. Mohler (eds), Cross-Cultural Survey Methods, New York: John Wiley and Sons.
5

If it bleeds, it leads: the impact of media-reported events

Ineke Stoop∗
Introduction1

In November 2002, the tanker Prestige broke up off the west coast of Spain, causing what many predicted would be the most damaging oil spill since the Exxon Valdez ran aground off the coast of Alaska in 1989. Several towns and beaches along the Galician coast, which earned its nickname 'coast of death' for the many shipwrecks near its shores, were fouled by oil. In Madrid, environmental activists held demonstrations against the oil spill and in Santiago de Compostela, social groups and politicians from Spain's opposition parties led a march under the battle cry of nunca mais ('never again'). According to the organisers and local police, 150,000 people marched in protest against the inadequate response of the regional and national governments in handling the crisis.

At about the same time, fieldwork for the first round of the European Social Survey was starting in Spain. The damaging oil spill, the perceived inadequacy of the government to deal with it and the large protest rallies, which received widespread media attention all over Europe, may well have had an impact on the answers of Spanish citizens to ESS questions about trust in government and trust in politicians. Any differences found between
∗ Ineke Stoop is Head of the Department of Data Services and IT at the Social and Cultural Planning Office of the Netherlands.
1 A number of ideas in this chapter on the use of news sources and ways to improve event reporting stem from discussions with and papers from Howard Tumber (City University London), Paul Statham (University of Bristol) and David Morrison (University of Leeds).
levels of trust in government in Spain and other European countries, or between the level of political trust in Spain in Round 1 compared with that in subsequent rounds of the time series, might be directly attributable to the short-term impact of these events. The role of the ESS is to measure social, cultural and attitudinal climate changes across Europe, rather than transitory changes in the attitudinal weather. Analysts of the data should thus ideally be in a position to distinguish between the two, or at least to identify and possibly discount the impact of short-term events on expressed attitudes. This is especially important because socio-political conditions in Europe seem to have become particularly volatile during the last decade or so. Although the ESS’s initiators had noted the need to record events when they first started their planning work in the mid-1990s, the events they had in mind at that stage were primarily national elections or short-term political upheavals. Since that time, however, we have witnessed the 9/11 attacks in the USA, major terrorist attacks in Madrid and London, a war in Afghanistan, a war in Iraq, the Darfur crisis, a devastating tsunami in Asia, political assassinations in the Netherlands, plans in some countries to suspend certain civil rights, an overwhelming ‘no’ vote in two national referendums on the EU constitution, and a fair number of political and financial scandals. Is Europe simply going through an isolated outbreak of extreme weather conditions, or is the political climate changing rather rapidly? Either way, it is increasingly important to make available this sort of background information to survey analysts of a cross-national time series. The Blueprint for the ESS (ESF, 1999) recognised this and called for an ‘event database’ to be made available alongside the substantive database. Although no funds were initially available for event reporting, we considered it important enough to make a start by developing a parsimonious but reasonably effective system. National Coordinators (NCs) and their teams in each country took on the task of producing a systematic overview of events during the ESS fieldwork period and submitting their reports for collation into a central event database, which has since been set up (http://ess.nsd.uib.no). As a matter of fact, the description of the Prestige oil spill at the beginning of this chapter is taken from this database which now contains information on media reported events that took place during the fieldwork periods of ESS Rounds 1 and 2. The remainder of this chapter goes on to describe in more detail not only the rationale behind event reporting, but also how it was implemented in the ESS. It also assesses its success to date, examines inherent
problems in its implementation and considers ways of improving it in future rounds. “Events, dear boy, events”2 As noted, the ESF’s Blueprint document for the ESS called for an events database, giving the following justification for it: It is well known from earlier comparative survey research that in some fields, such as electoral analysis, individual reactions to certain questions will be influenced by contextual factors and by significant events. For example, a question about the subjective interest in politics of a respondent may well be answered differently at the height of a national campaign for a general election compared to a time when no election is imminent. The contextual impact on individual response behaviour will not create major difficulties for the ESS as long as the contexts and events vary individually in an idiosyncratic fashion. The impact, however, of a contextual factor or an event must be considered and, whenever possible, controlled as soon as whole societies are thus influenced in a way which is not uniform across the countries in the ESS. In addition, it has to be remembered that the ESS will in the long run also become an important asset for historical micro analysis. As a consequence, from the beginning an information tool which for the lack of a better term may be called an event data inventory will have to be designed. This inventory must offer to the researchers a brief, pertinent synopsis of major political, social and other potentially relevant events in the ESS countries; this is particularly important for the ESS since its modular approach will in the long run cover a wide area of substantive concerns (ESF, 1999, p.33). An early example of (unexpected) consequences of major events was given by Bradburn (1969) who studied the trauma caused by the assassination of President John F. Kennedy. He found that this event not only caused feelings of shock, grief and personal loss but also occasioned an increase in interpersonal communication and social cohesion. According to an analysis by Das et al., (2005) of the more recent assassination in the Netherlands, 2
Remark attributed to the British Prime Minister Harold Macmillan when asked by a young journalist after a long dinner what can most easily steer a government off course.
there was no increase in social cohesion, but instead a rise in social disorganisation and depression. In the course of an experimental study they were conducting on attitudes to terrorist attacks, the Dutch filmmaker Theo van Gogh was assassinated by a Muslim fundamentalist. The event attracted wide national and international media attention, thus complicating their initial experiment but opening up new avenues of exploration to them. In the week following van Gogh's murder, numerous anti-Muslim attacks took place in the Netherlands (http://news.bbc.co.uk/2/hi/europe/4057645.stm). Das et al. found that when terrorism occurs on one's own doorstep (rather than in some distant land) the fear of death is magnified greatly. In the Netherlands this resulted in an increase in terror-induced prejudice.

Although, of course, personal and family events may usually have a more profound impact on people's lives and thoughts than will more distant political events, these personal sources of turbulence do not tend to have a systematic impact on survey outcomes. They are, in effect, randomly distributed across the population. In contrast, what is important for the ESS are any systematic effects of events on attitudes at a particular time or in a particular place. It is this sort of turbulence that can cause differences between countries, changes over time and variations between subgroups of the population.

Events in the media

From the range of events that occur in the world every day or week, only a small selection becomes salient to the public. It is these salient, well-popularised events, with high exposure either to large sub-groups in a country or to an entire country, or even to a large group of countries, that have the potential for focusing and shaping the attention of members of the public. The mass media are, of course, the primary conduits through which these potentially salient events are conveyed to the public. So our primary interest was in media-reported events.

The role of the mass media has long been a subject of research in its own right. Nas (2000), for instance, provides an overview of the impact of the media on attitudes to environmental issues. She argues that in earlier decades the favoured theory of the media's impact was the 'hypodermic needle' theory, suggesting that the public as passive consumers of news simply get injected with elements of news that the media choose to report. From the late 1940s, these assumptions about a passive public began to change in favour of the existence of selective filters between medium and recipient (selective attention, social networks, selective exposure and interpersonal communication). By the 1960s, however, a
new agenda-setting theory had found favour. As Cohen (1963, p.13) puts it: “(The press) may not be successful in telling people what to think, but it is stunningly successful in telling its readers what to think about.” The agendasetting theory does not imply that media users are merely passive agenda-followers, because it also accepts the existence of filters between the transmitters and recipients of messages (see www.agendasetting.com). Agenda-setting by the media is by no means a cross-culturally uniform process. As Pfetsch (2004, p.60) points out after studying the role of newspaper editorials in different countries on the subject of European integration: The media play a significant role as political actors as they use the format of editorials for claims-making, thereby assigning relevance and frames to political issues and introducing their own opinions into public discourse and political debate. In their dual role as communication channels of political actors and as actors in their own right they constitute the major communicative linkages within and between national public spaces which are a basic prerequisite for the Europeanisation of the public sphere. Even so, Pfetsch concludes that the role of the media is by no means confined to agenda-setting as “the media’s opinion about Europe resonates with the position of the national political elites and at the same time reinforces it” (p. 61). There were, it emerged, large differences between the media in different European countries. The British media, for instance, seemed to try hard to ignore the European perspective whenever possible, while the French national media seemed to be the most open to it. Differences between the newspapers and TV channels of different countries are also the main focus of media research conducted by the German mediawatching organisation, Medien Tenor (www.medientenor.de). They have monitored, for instance, the extent to which UN Secretary General Kofi Annan has been ignored in international TV news, the differences in the reporting of public attitudes towards the Euro in German v UK newspapers and TV stations, and the way in which the US mass media tend to reinforce ethnic stereotypes. These findings are relevant to the ESS, since the starting point of our event-reporting is to employ media reports simply as a means of highlighting the most salient events in each country. The fact that a country’s mass media may have systematic biases and that there is therefore an interaction between public attitudes and media-reported events is by no means an obstacle to our work, rather the essence of it. To some extent we have to make the further simplifying assumption that as long as an event is salient enough to be reported widely in the newspapers, it is
likely to have at least some impact on the consciousness of non-newspaper readers as well.

Within the context of the ESS, it would not have been possible to set up a comprehensive multi-national media watch system. Instead, we had to make parsimonious choices as to which media to use as source material for event-reporting. Although there were arguments for monitoring television news, the difficulties and costs of systematically recording and coding television news bulletins across over 20 nations were simply too daunting.3 Instead we chose to monitor newspapers, in the hope that television news agendas and press agendas tend to coincide in many cases.

News flow and event identification

What sorts of event did we wish to record? We quickly rejected the notion that events that take place far away will necessarily have less impact than events closer to home. The war in Iraq, for instance, had seemingly major (and different) national ramifications. Thus, German Chancellor Schröder's opposition to the war may well have benefited his successful campaign in the German elections. Meanwhile, major protests about the war took place in the UK, including the resignation of some ministers, and the coalition government in the Netherlands seemed at one time to fall apart precisely because of conflicting points of view on Dutch involvement in the war. Many other countries had protest demonstrations which may or may not have affected public opinion in those countries differentially.

Similarly, Round 2 of the ESS witnessed sustained media attention being devoted to hostage-taking in Iraq involving aid workers and journalists from, among other places, France, Italy and the UK. And then the tsunami in Asia on 26 December 2004 occupied press reports for weeks. But certain countries, notably Sweden, witnessed even more press attention than did other countries because they had lost hundreds of citizens in the disaster. Meanwhile political criticisms were made in several countries about the lack of appropriate support in the aftermath of the disaster, and some intended fundraising events turned quickly into major national events. Another example of a major international event with the potential for different national implications was the death of Pope John Paul II in April 2005, which attracted especially sustained coverage in Italy and Poland.
3 In the first round of the ESS Greece did tape the news for event-reporting purposes.
So, more relevant to us than where an event took place was when it took place. Our aim was to be able to link events to survey answers, so the ideal situation was when an event could be identified that had a clear start and end. But many events do not behave like this. Instead they linger on and sometimes re-emerge. Events can also have even later repercussions, attracting renewed attention, say one year after they took place. For instance, Maney and Oliver (2001) investigated the use of news media versus police records to track shifting patterns of protest across time and place. They concluded that neither the news media nor police records fully captured the picture and they challenged the (usually tacit) assumption that newspaper coverage of an event reasonably closely matches the event itself, noting (p.166) that “much of the event coverage appeared weeks, if not months, before or after an event’. Even elections and referendums that in most countries happen on a single day tend to have a longish period before and after during which they cast their shadow on public perceptions and attitudes. It is for this reason that several countries have postponed ESS fieldwork so as to avoid the immediate impact of national elections. But in any case not all ‘events’ are in the end directly related to their ostensible subject matter. Take the Hutton Inquiry in the UK, a judicial enquiry set up to investigate a row between the BBC and the government over the ‘exposure’ of a BBC source who then committed suicide. Although the Inquiry effectively found in favour of the government, most astute commentators believe that it resulted in a serious loss of public faith in the government. Yet in Italy, for instance, where the Prime Minister was alleged to be involved in dubious financial dealings, most commentators believe that the scandal inflicted little or no lasting damage on public trust in politicians. The fact is that different countries see seemingly similar events through quite different lenses. In the same way, the two national referendums on the EU constitution in 2005, one in France and one in the Netherlands, both took place during ‘difficult’ political times in their respective countries. So they may well have tapped hostility towards their governments at the time at least as much as hostility toward the proposed constitution. And the effect of these two ‘no’ votes may well have influenced attitudes to the EU in many other countries simultaneously. Thus, ‘simple’ events like these, in that they are closely fixed in time and place, may nonetheless have far-reaching effects on many issues in many countries, even sometimes before the actual event has taken place.
Guidelines and database

As noted, ESS event reporting has developed alongside the ESS survey. Well before the start of Round 1 fieldwork, we conducted a trial run by asking NCs4 to collect event data for a trial period of six weeks. They were asked to record major national (or international) events that might influence the answers to substantive questions in the (then draft) questionnaire. We provided them with a prompt-list of possibly relevant political events – inspired by Taylor and Jodice (1986) (see Note 1 at the end of this chapter) – and complemented it with a number of less political events, asking them to record such events only if they appeared on the front pages of national newspapers for two or more days in the period, as well as attaining television coverage. No report format was provided.

Having reviewed the outcome of the trial, we produced new guidelines for event-recording to be implemented in ESS Round 1. Each participating country was to send in monthly reports on events that received 'prominent attention' in national newspapers. This was defined to mean 'front page news' or 'appearing regularly in larger articles on later pages' on several days. NCs were asked to assign events to fixed categories, to provide keywords and a description, to give a start date and end date (if possible), to mention the source, and to assess the likely size and direction of the event's impact on the survey answers.

Although this work certainly produced a helpful database, it was based on a system with too much built-in leeway for individual reporting variation. This led us to revise the guidelines for Round 2, as presented in Table 5.1, so that they included a more standardised format and were based on weekly, rather than monthly, reports. The NCs were also required to provide information on the newspapers they had used.5 Information could be collected from the newspapers themselves or from websites containing the newspapers. In Round 2, reporting started two weeks before the start of fieldwork, which differed somewhat across countries, and NCs were also asked for a short overview of any major events in their country since Round 1 that might have shaped or altered public attitudes or perceptions. This overview was simply to provide some idea of (changes in) the political landscape.

4 In several countries event-reporting was done by a special 'event-reporter' on behalf of the NC.
5 Based on the experiences of Round 1, the original request to national reporters was to use two newspapers, a broadsheet and a tabloid, or – even better – a left-wing broadsheet, a right-wing broadsheet and a tabloid. However, feedback from NCs revealed that this was not always the best option in certain countries, both because tabloids and broadsheets may not be comparable across countries, and because some countries (such as Switzerland) have very different newspapers in different regions, in some cases in different languages.
Table 5.1 Framework for event-reporting, ESS 2004, and variety of entry types

Name
Explanation: Name of specific event. The name of a specific event may be an interpretable newspaper headline (but not 'Dust to dust', or 'Double Dutch' or 'Home alone' or 'Trojan horse victory').
Examples: Minister of Education steps down after school fraud; 500 000 turnout at demonstration against care budget cuts; Paid parental leave: Mum can stay at home now; Greece are European football champions; Fundamentalist Muslims accused of terrorism; Housing market collapses; Scathing judgment on quality of childcare; Tornado in Toledo; Kidnapping in Iraq; Herb cure saves lives; Prince Claus dies; Hospital scandal: 30 patients infected; Major credit card fraud; Opening of parliamentary year: the future looks bleak; Opinion poll on democracy: all in for personal gain; Low turnout at EU referendum.

Category
Explanation: Select one or more categories; add a category if necessary (highlight).
Categories: Election (national, local), plebiscite, referendum; Resignation, appointment, dismissal of politically significant person; Fall of cabinet, change of government, new government; Significant change in law; Strikes, demonstrations, riots (mention topic); Acts of terrorism; Events involving ethnic minorities, asylum seekers; Events concerning the national economy, labour market; Political, financial, economic scandal, frauds; National events (royal weddings, sports championships); Health issues; Family matters; Crimes (kidnappings, robberies); Disasters (outbreaks of foot and mouth/mad cow disease, extreme weather conditions); International conflict with national impact (Israel–Palestine; Iraq, Pakistan); Major international events that attract close attention locally.

Short description
Explanation: Similar to header in newspaper or introduction of news item.
Example: Prince Claus has died after 20 years of serious health problems. The nation mourns. Prince Claus was beloved by many Dutch people for his original contribution to being a Prince. He has become famous for his contributions to developing countries. Many people come to pay him their last respects.

Timing
Explanation: Date event in media, date of event, duration (sudden, continuing).
Example: Prince Claus died on 6 October and was buried on 16 October. Wide media coverage of his life, his lying in state and the tribute paid to him by Dutch citizens and dignitaries, and funeral during these 10 days.

Coverage
Explanation: Attention in media.
Example: All national newspapers and TV journals/programmes, front pages of tabloids, extra breaking news.

Source
Explanation: Which newspaper/website.

Web link
Explanation: Only if free and (semi-)permanent.
Example: http://news.bbc.co.uk/2/hi/health/3856289.stm

Link to questionnaire
Explanation: If direct relationship with identifiable question blocks.
Examples: B18: lawful demonstration (when large demonstration); B19: consumer boycott (when large consumer boycott); B12: satisfied about state of education (when educational abuses denounced); B34: European unification (when heated debates occur, e.g. discussions on Turkey in EU); C1: How happy are you (when your country wins European football match).

Possible effect on fieldwork
Examples: Areas closed off because of animal diseases; heavy storms; confidentiality scandals.

Additional information
Explanation: All additional information.
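Seen as a data structure, each entry in Table 5.1 is simply a record with a fixed set of fields. Purely as an illustration of that structure (the field names and the shortened category labels below are paraphrased from the table and are not the actual schema of the ESS event database), an event report might be represented along these lines:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Category labels paraphrased and shortened from Table 5.1; the real ESS
# event database may use different codes and labels.
CATEGORIES = {
    "election/referendum", "resignation/appointment", "change of government",
    "change in law", "strikes/demonstrations/riots", "terrorism",
    "ethnic minorities/asylum", "economy/labour market", "scandal",
    "national events", "health", "family", "crime", "disaster",
    "international conflict with national impact", "major international event",
}

@dataclass
class EventReport:
    """One media-reported event, as described by a national event reporter."""
    name: str                         # interpretable headline, e.g. "Prince Claus dies"
    categories: set                   # one or more labels from CATEGORIES
    description: str                  # similar to the newspaper's own header or lead
    date_in_media: date               # when the story broke
    end_date: Optional[date] = None   # None for sudden, one-off events
    coverage: str = ""                # e.g. "all national newspapers, front pages"
    source: str = ""                  # newspaper or website consulted
    web_link: Optional[str] = None    # only if free and (semi-)permanent
    questionnaire_items: list = field(default_factory=list)  # e.g. ["B18", "B19"]
    fieldwork_effect: Optional[str] = None  # e.g. "areas closed off by flooding"
    additional_info: str = ""

# Example based on the Prince Claus entry used in Table 5.1 (year assumed: 2002).
example = EventReport(
    name="Prince Claus dies",
    categories={"national events"},
    description="Prince Claus has died after 20 years of serious health problems.",
    date_in_media=date(2002, 10, 6),
    end_date=date(2002, 10, 16),
    coverage="All national newspapers and TV news, front pages",
    source="national broadsheet (illustrative)",
)
```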
From the beginning of Round 1, incoming event reports have been posted on our website (accessible via http://ess.nsd.uib.no) to give an up-to-date overview of reported events, and also to show guidelines, information notes and background information about the process. This transparency was helpful not only to users who wished to have an overview of weekly or monthly events in each country, but also to NCs themselves as a way of checking how their colleagues in other countries were using the system. The web page also contains FAQs and ultimately provides the final ESS media-reported event inventory for each round. In Round 3 the procedure is very similar to the one in Round 2, with one difference: event reports can now be uploaded by the national event reporters themselves and will be part of a more structured database. As a result, different types of overview (per week, per country, per keyword) will be easier to obtain. As Round 3 of the survey is still in the field at the end of 2006, no substantive results are yet available.
Meanwhile, what was happening in Europe?

In March 2003 the Coalition Forces attacked Iraq, by which time ESS Round 1 fieldwork had ended in most countries. Even so, the preparations for and the threat of war had been a major 'event' during the Round 1 fieldwork. As noted, what happened in Iraq was not simply an event in a faraway country but had an impact very close to home, even in European countries that were not to be engaged in the war. Other faraway events that drew a great deal of attention were the Bali terror attack, the North Korean nuclear threats and the Chechen hostage disaster in Moscow. At the start of fieldwork there were also devastating floods in central Europe and Belgium. In 2002/2003 the economic situation in Europe deteriorated. There were redundancies, bankruptcies and shutdowns. Although some national events, such as the Prestige oil spill, ETA terrorist attacks in Spain, conflicts within Haider's Freedom Party in Austria, the Pim Fortuyn murder in the Netherlands, and the Israel–Palestine conflict, did make it to the front pages of foreign newspapers, their impact on those other countries was minimal. Other national events had greater cross-national echoes, but not necessarily with the same meaning or impact. In almost half of the countries, elections took place. There were also a good few political and financial scandals, several strikes and demonstrations, and, of course, the inevitable sporting triumphs and disasters. Meanwhile, immigration was a rising issue in several countries, as was EU enlargement.

Whereas Round 1 of the ESS was accompanied by preparations for the Iraq war, Round 2 took place during the war itself. So among the events recorded were a large number of terror attacks and hostage-takings, as well as – in January 2005 – the national elections in Iraq. Other recorded international events in Round 2 were the Darfur conflict in Sudan, the Beslan school siege in Russia, bomb attacks in Jakarta and Egypt, elections in Afghanistan, and – most prominent of all, perhaps – the tsunami disaster in Asia on 26 December. The economic situation in much of Europe had not improved substantially in comparison with Round 1, and the issue of immigration continued to rumble on. During fieldwork in the final months of 2004, there was a seriously contested election in Ukraine, one of the ESS countries, and the Theo van Gogh murder in the Netherlands, followed by an outbreak of anti-Islam incidents and anti-terror raids. Early in December 2004 the PISA report on education was published to widespread press attention in many countries; in November the US presidential election had resulted in a second term for George W. Bush, and Yasser Arafat had died. Meanwhile Pope John Paul II had become terminally ill (he died in April 2005).
In 2004, 10 new countries entered the EU, of which six (the Czech Republic, Estonia, Hungary, Poland, Slovakia and Slovenia) were already participating in ESS Round 2. All but Estonia and Slovakia had also participated in Round 1, enabling changes to be monitored over time. Other major EU events during Round 2 were the start of the accession talks with Turkey in December (also an ESS country) and the rejection of the proposed new EU constitution by referendums in two other ESS countries – France and the Netherlands – leading to the abandonment of planned referendums in other countries.

The overview above is simply an impressionistic view of some major media stories that broke during and around the fieldwork of ESS Round 2, and which may have had a one-off impact on expressed public values. The event database is more crowded and provides more detail. A proper academic analysis of media-reported events in the period would, however, have required a special coding operation for which we did not have a budget. It would have recorded more precise details of the timing of each event, the differential exposure of particular events by country, and the coding of specific events within general classes. Thus, the Iraq hostage-taking, the van Gogh murder and the Beslan school siege would become discrete events in the coding rather than being classed simply as 'acts of terrorism'. Similarly, the elections in Ukraine would not be classified simply as an election but as a specific and more prolonged event.

Stathopoulou (2004) used a sophisticated combination of linguistic processing of textual data, correspondence analysis and cluster analysis on the ESS Round 1 data to produce a set of event clusters per month, some general and some specific. Her results make it possible to distinguish groups of events from those that relate to a single country and to follow events over time, identifying similarities and differences between countries. Figure 5.1 presents the results of a much simpler correspondence analysis based on word counts of the event reports in October 2003. Events that were infrequently mentioned or related only to a single country have been removed (such as Haider's success in Austria or the Prestige oil spills in Spain). Data for this correspondence analysis are the number of times a particular word is mentioned in a particular country.

Figure 5.1 Countries and events, October 2003, weighted by word count
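For readers who want to reproduce this kind of map from the published event reports, the computation behind a simple correspondence analysis is compact. The sketch below is our own illustration rather than the code used to produce Figure 5.1, and the small word-count matrix in it is invented purely to show the mechanics; a real analysis would use the full country-by-word frequency table derived from the event reports.

```python
import numpy as np

def correspondence_analysis(counts):
    """Classical correspondence analysis of a contingency table.

    counts: 2-D array of non-negative frequencies (countries x words).
    Returns row (country) and column (word) principal coordinates and
    the singular values of each dimension.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                               # correspondence matrix
    r = P.sum(axis=1)                             # row masses
    c = P.sum(axis=0)                             # column masses
    # Standardised residuals from the independence model
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]   # principal coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv

# Invented toy data: how often three themes occur in three countries' reports.
counts = np.array([
    [12,  2,  1],   # themes could be "economy", "EU enlargement", "strike"
    [ 3, 10,  2],
    [ 1,  2,  9],
])
rows, cols, singular_values = correspondence_analysis(counts)
print(rows[:, :2])   # first two dimensions, as plotted in a map like Figure 5.1
```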
Thus, Finland, Norway, Sweden, Flanders, Portugal and the Netherlands are all in the upper right quadrant of Figure 5.1, because they were all characterised by a greater than average number of stories involving their economies, including business and financial scandals. In contrast, the Czech Republic, Austria, Slovenia and Hungary are all in the lower right quadrant because they were more than averagely involved – as was Denmark – with stories about EU accession and enlargement. Meanwhile Switzerland was at that time somewhat pre-occupied with demonstrations and strikes, mainly in its milk-processing industry. Strikes and demonstrations were also rife in Israel and Italy at the time. For instance, the event report for October 2003 from Italy started as follows:

Tens of thousands of Italian workers have been taking part in rallies as part of a general strike to protest against labour reforms and budget cuts by the government of Prime Minister Silvio Berlusconi. The strike (there were demonstrations in 120 towns all over Italy), called by Italy's largest and most left-wing trade union, CGIL, caused chaos in the transport sector with air, rail and local transport severely affected.
Similarly, Israel's record covering no more than a single week in October 2003 contained the following three reports:

October 13: some 100,000 workers in municipalities, local authorities and religious councils launched an open-ended strike. October 16: the strike spread to sea ports. Due to the strike, most government offices were closed all week while most local services continued to be halted by municipal workers. October 19: the Histadrut decided to suspend the strike for a while due to the security situation.

So, despite less than standardised event-reporting from different countries, our subsequent coding and analysis seemed to capture the gist of the major events that might have affected answers to the ESS questionnaire at the time. And this was precisely the aim of the event database – not merely to provide a view of what happened in Europe during the period of ESS fieldwork, but also to track the possible impact of events on attitudes and opinions.

On the other hand, our first examination of attitudes over the period of ESS Round 1 itself produced no clear picture of the short-term impact of particular events on attitudes. There was, for instance, no clear impact on trust in politicians or institutions (not even the UN) that seemed to stem from the considerable political turmoil at the time, mainly over the Iraq war. The fact is that measuring the impact of events on attitudes is a highly uncertain and complex affair. For instance, despite the political turmoil in the Netherlands as a result of the Theo van Gogh murder, responses to a question on freedom of speech in a regular Netherlands-based survey on cultural change (Verhagen, 2006) showed an abrupt change in the immediate aftermath, but went rapidly back to the original level. As Bradburn (1969) has recommended, the event reports need to be supplemented by more systematic research on psychological reactions to significant events. Although events may alter attitudes, their effect may be highly specific and short-lived. The ESS was, of course, set up to measure long-term climate changes in attitudes rather than short-term changes in public opinion. Thus, our interest in event-reporting is not so much in the short-term impact of a particular event per se, but rather in the way it might affect or distort the measurement of long-term trends.

Looking ahead

Our experience to date with event reporting in the ESS has already shown us that – useful as it undoubtedly is – it could certainly be improved upon.
Several issues remain to be solved, among them our perhaps too strong emphasis on front page stories, our unresolved difficulties in the coding of events, and our as yet less than standardised approach to the whole reporting process. In particular, our concentration on front page reports was motivated by our wish to cover the most important and potentially most impact-making stories. But not only do front pages differ across newspapers, but newspapers also differ across countries. Some front pages concentrate heavily on sensationalist stories ('if it bleeds, it leads'), while others focus more on stories of national importance. Either way, our focus on front pages may miss important cross-national similarities, thus exaggerating cross-cultural differences. For this and other reasons, events might in future have to be taken from articles on other pages and possibly from editorial as well as 'op-ed' pages too.

In any case it has been clear to us from the outset that for a multilevel study such as the ESS, the parsimonious method of narrative event reporting that we have adopted so far can be no more than a short-term device. We need to move on to more rigorous methods, incorporating events coded to a standardised frame. An example of such methods is the Kansas Event Data System project (KEDS, www.ku.edu/~keds/index.html), which was initially designed to develop appropriate techniques that could convert English-language reports of political events into standardised event data. By classifying events in the first place we might be able to move on to a system of transparent and reproducible automated coding. An overview of such automated methods is available at www.ku.edu/~keds/papers.html (see also Schrodt, 2001). Colleagues at ZUMA (Cornelia Züll and Juliane Landmann, 2003) have already begun carrying out experiments on ESS event data using automatic coding. Our present event material in the ESS, based as it is on reports from NCs, may be too dependent on subjective choices to be a good base for such coding. They have now started to use the original newspaper articles as an alternative source. This poses the additional problem of language: the ESS event reports are in English whereas the newspapers used are in more than 25 languages.

In a collaborative effort between the Social and Cultural Planning Office in the Netherlands, City University, London, the University of Bristol, and ZUMA in Germany, plans are being made to develop a more standardised, impartial, comprehensive and accessible tool for event reporting. This work is being funded by the European Commission as part of their infrastructure support of the ESS. This new tool will also use newspapers as the primary source of event reports, but the electronic version from the Lexis-Nexis database rather than the paper version. Lexis-Nexis provides an on-line search mechanism for newspapers in a wide variety of countries, but – where no such coverage exists – we can buy and store the relevant newspapers (intact or on microfiche) to be quarried at a later date.
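To make the idea of 'transparent and reproducible automated coding' concrete, the fragment below sketches the crudest possible variant: a fixed keyword dictionary applied mechanically to the text of an event report. It is an illustration of the principle only; the dictionary is invented and is not the coding frame used by KEDS, ZUMA or the ESS, which rely on far more sophisticated parsing.

```python
import re

# Illustrative keyword dictionary: category -> trigger words.
# A real coding frame would be far larger and language-specific.
CODING_FRAME = {
    "election": ["election", "referendum", "ballot", "turnout"],
    "terrorism": ["terror", "bomb", "hostage", "hijack"],
    "strike/demonstration": ["strike", "demonstration", "protest", "rally"],
    "economy": ["unemployment", "bankruptcy", "budget", "inflation"],
    "scandal": ["scandal", "fraud", "corruption"],
}

def code_event(report_text):
    """Assign one or more category codes to an event report.

    Returns the set of categories whose keywords occur in the text, so that
    the same rules applied to the same text always yield the same codes
    (the 'transparent and reproducible' property).
    """
    text = report_text.lower()
    codes = set()
    for category, keywords in CODING_FRAME.items():
        if any(re.search(r"\b" + re.escape(word), text) for word in keywords):
            codes.add(category)
    return codes

print(code_event(
    "Tens of thousands of workers took part in rallies as part of a "
    "general strike against budget cuts."
))
# -> {'strike/demonstration', 'economy'}
```

Whether anything along these lines could work reliably across reports drawn from newspapers in more than 25 languages is, of course, part of what still has to be studied.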
Human coders will code events post hoc. This allows for the coding of different countries to proceed at different paces from one another. The possible implementation of automatic coding will have to be studied. By selecting a wide range of newsprint media, we should ideally be able to retrieve the salient events that have occurred in different types of newspaper (left/right; elite/mass readership) in all countries and to develop a systematic coding frame of event variables (time of event, place, actor, geographical scope, etc.). The resulting coded event database would ideally also provide the means by which researchers would be able to derive electronically not only a description of the events, but also their relative salience in different countries. The database would also be able to be interrogated in relation to specific time periods and/or issue fields, making it possible to find clues as to whether any special national or cross-national factors have influenced the responses in particular rounds of the ESS. For the moment we estimate that around four newspapers per country would suffice for this purpose – covering left-leaning and right-leaning elite newspapers, and the two most 'popular' newspapers, drawing samples from each of their main news sections.

It is, of course, possible that these plans may prove to be over-ambitious. It is too early to tell. So far, however, they appear to be feasible and well worth pursuing. And to the extent that we succeed, these sorts of media-reported events could develop into an important resource not just for the ESS, but for cross-cultural time series in general.

Notes

1. The Inter-university Consortium for Political and Social Research (ICPSR, www.icpsr.umich.edu/org/index.html) holds several studies on political events, often within a particular period and within a particular area. An interesting example is the World Handbook of Political and Social Indicators series (Taylor and Jodice, 1986) which contains a large number of daily, quarterly and annual political events all over the world, from Afghanistan to Zimbabwe, and in addition aggregate political, economic, and social data and rates of change for 155 countries. Events covered include demonstrations, riots, strikes, assassinations, elections, referendums, and imposition of political restrictions, including censorship, and in particular periods also bombing, ambush, raid, arrest, release of the arrested, imposition of martial law or curfew, and relaxation of martial law or curfew.
References

Bradburn, N.M. (1969), The Structure of Psychological Well-Being, Chicago: Aldine Publishing. Available on-line from NORC Library: cloud9.norc.uchicago.edu/dlib/spwb/index.htm
Cohen, B. (1963), The Press and Foreign Policy, Princeton: Princeton University Press.
Das, E.H.H.J., Bushman, B.J. and Bezemer, M. (2005), 'The impact of terrorist acts on Dutch society: The case of the Van Gogh murder', Presentation at the First EASR Conference, Barcelona, July 2005.
ESF (European Science Foundation) (1999), Blueprint for a European Social Survey, Strasbourg: ESF.
Maney, G.M. and Oliver, P.E. (2001), 'Finding Collective Events: Sources, Searches, Timing', Sociological Methods & Research, 30 (2), pp.131–169.
Nas, M.A.J.C. (2000), Sustainable Environment, Unsustained Attention. A Study of Attitudes, the Media and the Environment, The Hague: SCP. (Full text in Dutch; abstract and summary available in English from: www.scp.nl/english/publications/summaries/9057495244.html)
Pfetsch, B. (2004), The Voice of the Media in the European Public Sphere: Comparative Analysis of Newspaper Editorials. Available on-line from: http://europub.wz-berlin.de/project%20reports.en.htm
Schrodt, P.A. (2001), 'Automated Coding of International Event Data Using Sparse Parsing Techniques', Paper presented at the International Studies Association, Chicago, February 2001.
Stathopoulou, T. (2004), 'Modelling Events for European Social Survey: Towards the Creation of an Autonomous Tool for Survey Research', Paper presented at the Sixth International Conference on Social Science Methodology, Amsterdam, The Netherlands, August 2004.
Taylor, C.L. and Jodice, D.A. (1986), World Handbook of Political and Social Indicators III: 1948–1982 [Computer file]. Compiled by C.L. Taylor, Virginia Polytechnic Institute and State University. 2nd ICPSR edition, Ann Arbor, MI: University of Michigan, Inter-university Consortium for Political and Social Research (producer and distributor).
Verhagen, J. (2006), 'Robuuste meningen. Het effect van responsverhogende strategieën bij het onderzoek Culturele Veranderingen in Nederland' [Robust opinions: the effect of response-enhancing strategies in the Cultural Changes in the Netherlands survey], The Hague: SCP.
Züll, C. and Landmann, J. (2003), European Social Survey and Event Data, Working Paper, Mannheim, Germany: ZUMA.
6
Understanding and improving response rates

Jaak Billiet, Achim Koch and Michel Philippens∗
Introduction

The ESS was from the outset designed to be a high-quality research instrument for the social sciences. One way in which the quality of a survey is often measured is its overall response rate – not an unreasonable premise, since the higher the proportion of its target respondents who participate, the more reliable its results are likely to be. Although this somewhat oversimplifies the issue, the headline co-operation rate and the associated issue of non-response bias nonetheless remain central to survey quality. So if the ESS was to place emphasis on its methodological quality, response rates were inevitably a key variable, though, of course, by no means the only variable.1

This chapter refers to the issue of survey participation and its effect on the quality of survey findings. There will, of course, always be some designated respondents in a survey who cannot be located by the interviewers during the fieldwork period (non-contacts). There are also others who are contacted but then decline to participate (refusals). And there are still others who simply cannot participate because of, say, illness or language problems (unable to answer). In cross-national surveys in particular, non-response can threaten the validity of comparisons between nations.
∗ Jaak Billiet is professor of social methodology at the Katholieke Universiteit Leuven, Centre for Sociological Research. Michel Philippens was formerly a research assistant at the same institute; Achim Koch is a Senior Researcher at the European Centre for Comparative Surveys (ECCS) at ZUMA, Germany.
1 All methodological quality measures are documented on the ESS website (http://www.europeansocialsurvey.com/) and on the ESS data archive website (http://ess.nsd.uib.no/).
In a review of the literature on non-response in cross-national surveys, Couper and De Leeuw (2003, p.157) comment: “Only if we know how data quality is affected by non-response in each country or culture can we assess and improve the comparability of international and cross-cultural data.” The most important question in this context is whether non-response leads to bias in the resulting survey estimates. This will be the case when respondents and non-respondents differ systematically with respect to different survey variables, in which case the generalisability of the survey results to the target population and the comparability of results across countries might potentially be put at risk. Despite their obvious importance, non-response issues are often ignored in cross-national surveys. For some reason, the strict standards that are applied to the evaluation of national surveys are often suspended when it comes to cross-national studies (Jowell, 1998). We describe in this chapter the measures we have introduced in the ESS both to reduce non-response and to derive information about non-response. Our focus here is on non-contacts and refusals. By discovering the particular factors affecting non-contacts and refusals in different ESS countries, we hope to find ways of improving response rates in future rounds of the survey and, we hope, in similar studies. Our data come primarily from an analysis of Round 1 ESS contact forms in which interviewers in all countries are required to record the mode, time and outcome of all contact attempts they make. In addition, we make use of aggregate level data for each country from the National Technical Summaries (see chapter 7) which all countries provide when delivering data to the ESS archive. We use a pragmatic approach to data quality assessment in which process and outcome variables are treated as equally important (Loosveldt et al., 2004). Thus our evaluation of data quality deals not only with each step in the process of data collection (i.e. the contact attempts of interviewers), but also, of course, with the overall outcomes of the survey (response rates, comparability of estimates with known distributions in the population, and so on). Bringing these two approaches together allows us to develop an analytical tool that will become more powerful at each new round of the ESS, allowing us to draw practical conclusions about how to improve data quality. So in this chapter we will look first at the standards and documentation required by the ESS to maximise its response rates, to reduce bias and to provide for the analysis of these phenomena at a macro and micro level. We will examine ESS Round 1 fieldwork and assess how it performed from a quality perspective looking in detail at both response and non-response. We will try to identify some of the reasons for differences between countries on these measures with an emphasis on factors that can be influenced by the research design and its implementation. In particular we will examine the effectiveness of efforts to reduce non-contacts and convert refusals. We will then reflect on the extent to which non-response creates bias in the substantive
findings of the ESS. For convenience, we refer in all cases to Round 1 of the ESS, but the same basic story can be told of Round 2.

Response quality: standards and documentation

When setting up the ESS we developed, in co-operation with other experts, a set of methodological standards that had to be pursued in each participating country. The quality standards we set were based not on the lowest common denominator across all countries, but oriented at those countries which were normally associated with the highest research quality (Lynn, 2003). In relation to response rate enhancement, the standards and specifications were as follows:

• Data had to be collected by face-to-face interview
• Interviewers had to make at least four visits (with at least one evening and one weekend visit) before a case could be abandoned as non-productive
• Unless a country had an individual-named sampling frame with accompanying telephone numbers, all visits – including the first contact – had to be made in person
• Substitution of difficult to reach or reluctant target persons was not permitted under any circumstances
• All interviewers had to be personally briefed on the survey prior to fieldwork
• The workload of any single interviewer was limited to a maximum of 48 issued sampling units
• Fieldwork was to be closely monitored, including producing fortnightly reports on response
• The fieldwork period had to cover at least 30 days.

In addition to these basic standards, challenging targets were set. With respect to response rates, countries were asked to aim for (and budget for) a response rate of at least 70 per cent. Although we realised that this response rate would be very challenging for some countries (to say the least), we thought it appropriate to aim as high as possible, both to raise the lowest response rates and to not depress the highest ones. To help countries reach this target response rate they were encouraged to implement a set of Current Best Practice guidelines, which included:

• Selecting the most experienced interviewers whenever possible
• Boosting interviewers' confidence about their abilities
• Briefing all interviewers in personal training sessions lasting at least half a day
• Training interviewers in doorstep introductions and persuasion skills
• Considering the use of incentives for respondents
• Reissuing all "soft" refusals and as many "hard" refusals as possible.

It was, however, clear that simply setting standards and targets would not be enough (Park and Jowell, 1997). We also had to introduce careful monitoring, evaluation and feedback. If the ESS aimed to improve standards more generally, then it needed to document and report on the extent to which standards are met. By feeding back information on compliance with or deviations from standards into the survey process, actions can be taken to improve procedures and standards round by round. Thus the Central Coordinating Team (CCT) carefully documents non-response and requires the National Technical Summaries to include:

• Length of fieldwork period
• Payment and briefing of interviewers
• Number of visits required (including the number of visits required to be in the evenings or at weekends)
• The use of quality-control back-checks
• The use of special refusal conversion strategies2
• The use of advance letters, brochures and respondent incentives
• The distribution of outcome codes for the total issued sample, according to a pre-defined set of categories.

Indeed, we went even further in documenting non-response. We have standardised information on non-response not only at the aggregate level for each country, but also at the level of each individual sample unit. As noted, every country had to use contact forms to record detailed fieldwork information at each visit. Developing such uniform contact forms in the context of a cross-national survey was a rather complex task. We first needed to make an inventory of contact forms used by several European survey organisations, and we then had to develop separate versions of contact forms for each class of sampling frame and selection procedure used in the ESS. We had to strike a delicate balance between the burden this process would place on interviewers who had to record the data and the necessity to have detailed contact records available for subsequent analysis (Devacht et al., 2003; Stoop et al., 2003). In the end we were able to produce a standardised contact form specification and resultant standard data file comprising information on:
2 We use the term 'refusal conversion' because it is widely used in the methodological literature. This does not refer to a 'flat' or 'final' refusal. It would perhaps be more appropriate to refer to 'repeated attempts to persuade initially reluctant persons to reconsider their participation in the survey'.
• Date and time of each visit
• Mode of each visit (face-to-face vs. telephone)3
• Respondent selection procedure in the household
• Outcome of each visit (realised interview, non-contact, refusal, etc.)
• Reason for refusal, plus gender and estimated age of target person
• Neighbourhood characteristics of each sample unit
• Interviewer identification.
Most countries (17 out of 22)4 successfully delivered a complete call record dataset. No comparable information was made available from five countries5 for a number of reasons. In some cases, the survey agencies were not familiar with the collection of call record data and found the burden too heavy; in others restrictive confidentiality laws prevented the release of information about refusals, non-contacts or even neighbourhood characteristics (Devacht et al., 2003; Stoop et al., 2003). The conduct of fieldwork In order to cope with the high methodological standards of the ESS, only survey organisations capable of carrying out probability-based surveys to the highest standard of rigour were to be appointed. This was clearly easier to achieve in some countries than in others, depending on the prevalence of high-quality survey practices. As it turned out, the majority of countries in Round 1 (12 out of 22) selected a commercial survey agency, four selected a university institute, three a non-profit survey organisation, and the other three countries their national statistical institute (see appendix for details). As noted, the prescribed method of data collection was face-to-face interviewing, with countries free to choose between traditional paper-and-pencil interviewing (PAPI) and computer-assisted interviewing (CAPI). In Round 1, 12 countries used PAPI and 10 used CAPI. The fieldwork duration and period specified for Round 1 was at least one month in a four-month period between September and December 2002. By trying to make national fieldwork periods broadly coincide, we would, we hoped, help to reduce the impact of external events on the findings (see chapter 5). On 3
Under certain specified circumstances, some contacts were permitted to be by telephone. 4 Austria (AT), Belgium (BE), Germany (DE), Finland (FI), Great Britain (GB), Greece (GR), Hungary (HU), Ireland (IE), Israel (IL), Italy (IT), Luxembourg (LU), Poland (PL), Portugal (PT), Spain (ES), Switzerland (CH), The Netherlands (NL) and Slovenia (SI). 5 Czech Republic (CZ), Norway (NO), Sweden (SE), Denmark (DK) and France (FR).
average, fieldwork took 124 days, but there were large differences between countries. The shortest fieldwork period was 28 days in Hungary, the longest 240 days in Austria. Only five countries managed to complete their fieldwork in 2002, and another seven before the end of February 2003. The last country finished fieldwork in December 2003, though it started fieldwork very late too. Problems in obtaining the necessary funding were the main reason for the delays observed. But in any event, our pursuit of simultaneous fieldwork periods in all countries proved to be less than successful, to say the least. Fourteen of the 22 countries each achieved their specified sample size requirement of at least 2000 achieved interviews (or 1000 interviews in countries with a population of less than two million – see chapter 2). There were a number of reasons that the other eight countries did not meet their target, among them lower budgets than necessary and in some cases lower response rates than anticipated. In the remainder of this chapter we focus on the analysis of our unique dataset (at least in a cross-national context) of response and non-response data at an individual and aggregate level. We focus in particular on the information we gathered about response rates, refusals and non-contacts. Response and non-response The call record data collected in the ESS offer the advantage that the same non-response outcome definitions and non-response rate formulae may be used across countries, thereby enabling valid cross-national non-response comparisons. As noted, not all countries delivered a dataset containing the necessary information. So, for countries with no suitable call record data, we report response and non-response rates that we calculated on the basis of the information provided in the National Technical Summaries, recognising that they may not be directly comparable and need to be treated with due caution. Before referring to the response rates themselves we must describe the definitions and formulae used to calculate them. We needed in the first place to construct an overall non-response disposition of each sample unit, since the call record dataset did not contain a variable that expressed this in its final form. Instead, we had to combine the separate outcomes of the separate calls into a single final code. This could be done either by taking the outcome of the last contact (with any member of the household) as the final nonresponse code (see AAPOR, 2004), or by setting up a priority order and then selecting the outcome with the highest priority (see, for instance, Lynn et al., 2001). Thus a refusal code that comes early in a sequence of visits may be given priority over a non-contact code at a subsequent or final visit.
We chose to use a combination of these two approaches. Thus, we took the outcome of the last contact as the final non-response code, except when a refusal occurred at an earlier visit and subsequent contacts with the household resulted in other eligible non-response outcomes. In these cases, we took the final non-response code to be “refusal to participate” (Philippens and Billiet, 2004). When a non-response code was followed by a response as a result of a successful conversion attempt, the final outcome became a response code, since responses had the highest priority in the coding procedure. With respect to the definition of outcomes, we classify as ‘refusals’ any unwillingness to participate, whether expressed by a proxy, by a household member on behalf of the whole household, or by the target respondent. Similarly, people were classified as refusals if they broke their appointments, were at home but did not answer the door, or broke off the interview before it was completed. Non-contacts, on the other hand, are defined as those addresses or households at which no contact with anyone was made at any visit.
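To make the coding procedure concrete, the following minimal sketch shows one way such a rule could be implemented. The outcome labels and function are purely illustrative assumptions and do not reproduce the actual ESS contact-form codes.

```python
# Illustrative sketch (not the actual ESS coding) of deriving a final
# disposition from a sequence of call outcomes: a completed interview always
# wins, an earlier refusal overrides later non-response outcomes, and
# otherwise the outcome of the last call is taken as final.

def final_disposition(call_outcomes):
    """call_outcomes: outcome codes for one sample unit, in call order,
    e.g. ["noncontact", "refusal", "noncontact"]."""
    if "interview" in call_outcomes:
        return "interview"      # successful (possibly converted) response
    if "refusal" in call_outcomes:
        return "refusal"        # an earlier refusal takes priority
    return call_outcomes[-1]    # otherwise: outcome of the last call

# A refusal at the second call followed by a non-contact at the third
# is still coded as a refusal:
print(final_disposition(["noncontact", "refusal", "noncontact"]))  # refusal
```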
Table 6.1  Response, refusal and non-contact rates

Country   Response   Non-contact   Refusal   Eligible       Total
          rate %     rate %        rate %    sample size    sample size

GR          79.6        1.7         16.9        3222          3227
FI          73.3        1.4         20.9        2728          2766
PL          72.2        0.8         19.6        2921          2978
SI          71.8        2.4         15.3        2114          2175
IL          70.9        3.0         21.3        3523          3600
HU          70.3        3.2         15.1        2398          2484
SE          69.0        4.0         21.0        2878          3000
PT          68.8        3.2         26.9        2196          2366
DK          68.4        4.6         23.0        2143          2150
NL          67.8        2.5         26.2        3486          3570
NO          65.0        3.0         25.0        3109          3215
IE          64.4        8.1         22.9        3179          3185
AT          60.6       10.1         27.0        3725          3828
BE          59.3        4.5         25.6        3204          3340
GB          55.0        3.5         30.6        3730          4013
DE          53.7        5.9         29.3        5436          5796
ES          53.6        7.9         35.3        3227          3657
IT          43.4        2.8         45.8        2778          3000
LU          43.2        6.9         37.0        3589          3773
CZ 1        43.0         –            –           –             –
CH 2        33.0        2.0         55.1        4652          5086
Notes: 1 No detailed information is available for the Czech Republic. 2 For Switzerland, two recruitment approaches were used: face-to-face and telephone. Here we report only on the telephone part of the survey, since the contact form data for the face-to-face part were not suitable for analysis.
Source: Contact forms data file
However, respondents who moved within the country and were not re-approached were not treated as non-contacts, so as to enhance comparability between household and individual-named samples on the one hand and address samples on the other.

The response rates, refusal rates and non-contact rates are shown in Table 6.1. All figures are expressed as percentages of the total eligible sample size. In effect, the eligible sample comprises all addresses or households selected that were residential and occupied by residents aged 15 and over. The figures in Table 6.1 illustrate that about half of the participating countries obtained response rates close to or higher than the specified target rate of 70 per cent. But they also show rather large differences between countries. Some countries (Greece, Finland, Poland, Slovenia, Israel and Hungary) achieved response rates higher than 70 per cent, while others (Italy, Luxembourg, the Czech Republic and Switzerland) obtained response rates lower than 50 per cent. These large non-response differences could raise questions about the validity of cross-national comparisons. But the distribution of non-response appears to be rather similar across countries, and refusals are comfortably the most common reason for non-participation.
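Since all rates in Table 6.1 are expressed as percentages of the eligible sample, the arithmetic can be illustrated with a minimal sketch; the disposition counts below are purely hypothetical and do not correspond to any ESS country.

```python
# Minimal sketch of the rate definitions used in Table 6.1: response, refusal
# and non-contact rates as percentages of the eligible sample. All counts are
# hypothetical.

from collections import Counter

dispositions = Counter(interview=1432, refusal=420, noncontact=58,
                       other_nonresponse=90, ineligible=75)

# Eligible sample: all selected units except those found to be ineligible
eligible = sum(n for code, n in dispositions.items() if code != "ineligible")

response_rate   = 100 * dispositions["interview"]  / eligible
refusal_rate    = 100 * dispositions["refusal"]    / eligible
noncontact_rate = 100 * dispositions["noncontact"] / eligible

print(f"eligible sample size: {eligible}")
print(f"response rate:    {response_rate:.1f}%")
print(f"refusal rate:     {refusal_rate:.1f}%")
print(f"non-contact rate: {noncontact_rate:.1f}%")
```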
Our aim to keep non-contact rates in all countries to a strict minimum (a target of 3 per cent or less) seems to have been achieved in most cases. In 16 of the countries, non-contact rates in Round 1 were lower than 5 per cent. The exceptions were Austria (10 per cent), Ireland (8 per cent), Spain (8 per cent), Luxembourg (7 per cent) and Germany (6 per cent). Action was taken in Round 2 and beyond to try to increase contact rates, and therefore overall response rates, by increasing the number of call attempts and introducing greater variety in call patterns.

One remarkable observation based on the Round 1 (and Round 2) data is that the well-documented problem of non-response in the Netherlands is not replicated in the ESS (see, for instance, De Heer, 1999; Hox and De Leeuw, 2002; Stoop and Philippens, 2004). The Netherlands response rate was in fact close to the specified target rate of 70 per cent, and in the next section we show how this result was achieved.

Why such large country differences in response rates?

Many factors may be responsible for the observed differences in response rates. On the one hand, societal factors that cannot, of course, be influenced by the research design may play a part (Groves and Couper, 1998). There are, for instance, differences in the ‘survey climate’ across countries. Not only do survey practices vary by country, but so do public attitudes to surveys and the extent to which people consider them useful or legitimate. These survey-climate factors may influence overall co-operation and refusal rates. But apart from these considerations, there is also variation in ‘at-home’ patterns in different countries. These at-home patterns influence the contactability of households and will affect the efforts needed to bring down non-contact rates. Given the large demographic variations between countries (with respect to birth rates, the proportion of women in employment, and so on), ‘at-home’ patterns are likely to vary rather strongly across countries (see De Leeuw and De Heer, 2002).

Although these survey-climate and at-home patterns are interesting and important from a theoretical point of view, they have limited practical importance, since they cannot be manipulated by the research design (other than, for instance, by appropriate planning of fieldwork around the prevailing ‘at-home’ patterns). More pertinent in this context are factors that are, at least in principle, under the control of the researcher. According to De Heer (1999, pp.136–137) these can be divided into three categories:

• General design factors: mode of data collection, survey method (panel vs. cross-sectional), observational unit (household vs. individual).
• Fieldwork efforts: number of contact attempts, refusal conversion efforts, interviewer and respondent incentives, and interviewer training.
• Survey organisation: conditions of employment of interviewers, supervision of interviewers.

Our analysis focuses primarily on differences in the second category (fieldwork efforts). We first discuss the number and timing of contact attempts and possible explanations for differences in non-contact rates, and then move on to a comparison and evaluation of refusal conversion attempts.

Country differences in non-contact rate reduction

In order to minimise fieldwork variation between countries, the ESS specifies a common calling strategy in all countries. Interviewers are required to make at least four personal visits to each sampling unit before abandoning it as non-productive, including at least one attempt in the evening and at least one at the weekend. Moreover, these attempts have to be spread over at least two different weeks. But even when these instructions were scrupulously followed, there were often significant differences in contactability. Many countries decided to exceed these minimum requirements and were encouraged to do so.
For instance, Irish, Slovenian and Greek interviewers were required to make at least five contact attempts, while Polish and Slovenian interviewers had to make at least two evening calls.
Contact procedures
The first contact with potential respondents, often following an advance letter, was required to be in person. Only after initial contact had been made could interviewers make appointments by telephone for a face-to-face interview. As noted, however, the restriction on making initial contact by a personal visit was relaxed for countries with registers of named individuals that included telephone numbers. Analysis of the call record data shows that only Switzerland, Sweden, Finland and Norway relied mainly on telephone attempts to recruit respondents. Everywhere else, almost all contact attempts were made face to face.
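A brief sketch of how the predominant mode of contact attempts per country can be read off the call records is shown below; the column names, codes and records are assumptions for illustration, not the actual ESS contact-form variables.

```python
# Sketch: share of contact attempts made by each mode, per country.
# Column names, codes and records are illustrative assumptions.

import pandas as pd

attempts = pd.DataFrame({
    "country": ["CH", "CH", "CH", "GB", "GB", "GB"],
    "mode":    ["telephone", "telephone", "face_to_face",
                "face_to_face", "face_to_face", "face_to_face"],
})

# Proportion of attempts by mode within each country
mode_share = (attempts.groupby("country")["mode"]
              .value_counts(normalize=True)
              .unstack(fill_value=0))
print(mode_share)
```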
Number of contact attempts
It is generally assumed that increasing the number of contact attempts is the most effective strategy for reducing non-contact rates. Figure 6.1 plots the average number of call attempts made to non-contacts against the percentage of non-contacts in the eligible sample. As we would expect, there is a negative relationship between the achieved non-contact rate and the average number of contact attempts (Spearman’s rho = -0.41). Figure 6.1 indicates that countries such as Germany, Belgium, Ireland, Luxembourg and Austria, which made on average fewer than the prescribed number of contact attempts to non-contacts, did not achieve the target non-contact rate of three per cent. A detailed analysis of all call records reveals that in Ireland, Germany and Belgium most interviewers did make the required “minimum of four contact attempts”, and that it was a small group of interviewers who failed to make the prescribed number of calls and recorded high non-contact rates. In Ireland and Germany, for instance, five per cent of the interviewers were responsible for approximately 50 per cent of all non-contacted units that did not receive four contact attempts, while in Belgium, five per cent of interviewers were responsible for 67 per cent of all non-contacted units that received fewer than four contact attempts. In these countries, the contact rate could almost certainly have been raised by closer monitoring of interviewers and by reissuing assignments to other interviewers. In Luxembourg, on the other hand, a large majority of interviewers routinely broke the “minimum of four attempts” rule for at least some of their potential respondents, suggesting a more structural problem. It is likely that interviewers there were not fully aware that the prescribed guidelines were mandatory.
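The country-level association reported above can be illustrated with a short sketch; the figures below are invented for illustration and are not the ESS results.

```python
# Sketch of the rank correlation between fieldwork effort and non-contact
# rates at the country level. The values are illustrative, not ESS data.

from scipy.stats import spearmanr

# Per-country average number of call attempts made to final non-contacts
mean_attempts_to_noncontacts = [2.2, 3.1, 3.4, 4.8, 6.5, 8.9]
# Per-country non-contact rate (per cent of the eligible sample)
noncontact_rate = [9.8, 6.5, 7.2, 4.1, 2.6, 3.0]

rho, p = spearmanr(mean_attempts_to_noncontacts, noncontact_rate)
print(f"Spearman's rho = {rho:.2f} (p = {p:.3f})")
# A negative rho indicates that countries making more attempts to
# hard-to-reach units tend to end up with lower non-contact rates.
```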
Countries that complied with the “minimum of four call attempts” rule generally reached the target non-contact rate of three per cent, or at least came close to it. Even so, the relationship between the average number of contact attempts and the achieved non-contact rate is not clear-cut. In the UK and Spain, for instance, although more contact attempts were made than were strictly required, the target rate remained out of reach. And in Israel, although only 2.8 contact attempts were made on average before cases were abandoned, a non-contact rate of three per cent was nonetheless achieved. Differences in the timing of calls may mediate the relationship between the number of call attempts and the achieved non-contact rate. Countries with large hard-to-reach populations will tend to have to make more contact attempts to obtain the same contact rate. Conversely, in countries where interviewers routinely make more evening calls, fewer contact attempts will be needed to achieve the same non-contact rates (Purdon et al., 1999).
Figure 6.1 Scatterplot of average number of contact attempts versus achieved non-contact rate
Contactability
Following Groves and Couper (1998, p.80), we define “contactability” as the propensity for a household to be contacted by an interviewer at any given point in time. Because ‘at-home’ patterns tend to differ between countries, we should expect some populations to be harder to contact than others. To verify this, we examined the probability of contacting a household at the first call attempt at different times of the day and on different days of the week. As Figure 6.2 shows, the probability of contacting a household during a weekday morning or afternoon is relatively low in Spain, the UK, the Netherlands, Switzerland and Portugal, but relatively high in Poland, Israel and Italy. So it does appear that interviewers in certain countries have to make more contact attempts to reach their sample units than those in others. Indeed, these figures may partly explain why Israeli interviewers on average needed to make only 2.8 contact attempts to reach their target non-contact rate, while British interviewers on average had to make close to nine attempts to achieve a similar rate.
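A sketch of how such first-call contact probabilities could be derived from the call records is given below; the column names, timing categories and records are assumptions for illustration, not the actual ESS contact-form variables.

```python
# Sketch: probability of contact at the first call attempt by the timing of
# that attempt. Column names, categories and records are purely illustrative.

import pandas as pd

first_calls = pd.DataFrame({
    "timing":    ["weekday_day", "weekday_day", "weekday_evening",
                  "weekday_evening", "weekend", "weekend"],
    "contacted": [0, 1, 1, 1, 1, 0],   # 1 = any contact made at first attempt
})

# Share of first attempts yielding contact, by timing of the attempt
contact_prob = first_calls.groupby("timing")["contacted"].mean()
print(contact_prob)
```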
Figure 6.2  Probability of contact at first call attempt by timing of first call attempt
Figure 6.2 also illustrates that, in line with previous research, evening and weekend contact attempts are in general more productive than weekday morning or afternoon attempts. In all countries, except for Italy and Poland, we found a significant relationship (p