The SAGE Handbook of
Survey Methodology
SAGE was founded in 1965 by Sara Miller McCune to support the dissemination of usable knowledge by publishing innovative and high-quality research and teaching content. Today, we publish over 900 journals, including those of more than 400 learned societies, more than 800 new books per year, and a growing range of library products including archives, data, case studies, reports, and video. SAGE remains majority-owned by our founder, and after Sara’s lifetime will become owned by a charitable trust that secures our continued independence. Los Angeles | London | New Delhi | Singapore | Washington DC | Melbourne
The SAGE Handbook of
Survey Methodology
Edited by
Christof Wolf, Dominique Joye, Tom W. Smith and Yang-chih Fu
SAGE Publications Ltd, 1 Oliver's Yard, 55 City Road, London EC1Y 1SP
SAGE Publications Inc., 2455 Teller Road, Thousand Oaks, California 91320
SAGE Publications India Pvt Ltd, B 1/I 1 Mohan Cooperative Industrial Area, Mathura Road, New Delhi 110 044
SAGE Publications Asia-Pacific Pte Ltd, 3 Church Street, #10-04 Samsung Hub, Singapore 049483

Editor: Mila Steele
Editorial Assistant: Mathew Oldfield
Production Editor: Sushant Nailwal
Copyeditor: David Hemsley
Proofreader: Sunrise Setting Ltd.
Indexer: Avril Ehrlich
Marketing Manager: Sally Ransom
Cover Design: Wendy Scott
Typeset by Cenveo Publisher Services
Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY
At SAGE we take sustainability seriously. Most of our products are printed in the UK using FSC papers and boards. When we print overseas we ensure sustainable papers are used as measured by the PREPS grading system. We undertake an annual audit to monitor our sustainability.
Editorial arrangement © Christof Wolf, Dominique Joye, Tom W. Smith and Yang-chih Fu 2016 Chapter 1 © Dominique Joye, Christof Wolf, Tom W. Smith and Yang-chih Fu 2016 Chapter 2 © Tom W. Smith 2016 Chapter 3 © Lars E. Lyberg and Herbert F. Weisberg 2016 Chapter 4 © Timothy P. Johnson and Michael Braun 2016 Chapter 5 © Claire Durand 2016 Chapter 6 © Geert Loosveldt and Dominique Joye 2016 Chapter 7 © Kathy Joe, Finn Raben and Adam Phillips 2016 Chapter 8 © Kathleen A. Frankovic 2016 Chapter 9 © Ben Jann and Thomas Hinz 2016 Chapter 10 © Paul P. Biemer 2016 Chapter 11 © Edith de Leeuw and Nejc Berzelak 2016 Chapter 12 © Beth-Ellen Pennell and Kristen Cibelli Hibben 2016 Chapter 13 © Zeina N. Mneimneh, Beth-Ellen Pennell, Jennifer Kelley and Kristen Cibelli Hibben 2016 Chapter 14 © Jaak Billiet 2016 Chapter 15 © Kristen Miller and Gordon B. Willis 2016 Chapter 16 © Jolene D. Smyth 2016 Chapter 17 © Melanie Revilla, Diana Zavala and Willem Saris 2016 Chapter 18 © Don A. Dillman and Michelle L. Edwards 2016 Chapter 19 © Dorothée Behr and Kuniaki Shishido 2016 Chapter 20 © Silke L. Schneider, Dominique Joye and Christof Wolf 2016 Chapter 21 © Yves Tillé and Alina Matei 2016
Chapter 22 © Vasja Vehovar, Vera Toepoel and Stephanie Steinmetz 2016 Chapter 23 © Siegfried Gabler and Sabine Häder 2016 Chapter 24 © Gordon B. Willis 2016 Chapter 25 © Annelies G. Blom 2016 Chapter 26 © François Laflamme and James Wagner 2016 Chapter 27 © Ineke A. L. Stoop 2016 Chapter 28 © Michèle Ernst Stähli and Dominique Joye 2016 Chapter 29 © Mary Vardigan, Peter Granda and Lynette Hoelter 2016 Chapter 30 © Pierre Lavallée and Jean-François Beaumont 2016 Chapter 31 © Stephanie Eckman and Brady T. West 2016 Chapter 32 © Heike Wirth 2016 Chapter 33 © Christof Wolf, Silke L. Schneider, Dorothée Behr and Dominique Joye 2016 Chapter 34 © Duane F. Alwin 2016 Chapter 35 © Jelke Bethlehem and Barry Schouten 2016 Chapter 36 © Caroline Roberts 2016 Chapter 37 © Martin Spiess 2016 Chapter 38 © Victor Thiessen† and Jörg Blasius 2016 Chapter 39 © Jan Cieciuch, Eldad Davidov, Peter Schmidt and René Algesheimer 2016 Chapter 40 © Lynette Hoelter, Amy Pienta and Jared Lyle 2016 Chapter 41 © Rainer Schnell 2016 Chapter 42 © Jessica Fortin-Rittberger, David Howell, Stephen Quinlan and Bojan Todosijević 2016 Chapter 43 © Tom W. Smith and Yang-chih Fu 2016
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act, 1988, this publication may be reproduced, stored or transmitted in any form, or by any means, only with the prior permission in writing of the publishers, or in the case of reprographic reproduction, in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers. Library of Congress Control Number: 2015960279 British Library Cataloguing in Publication data A catalogue record for this book is available from the British Library ISBN 978-1-4462-8266-3
Contents

List of Figures  ix
List of Tables  xi
Notes on the Editors and Contributors  xiii
Preface  xxiv

PART I  BASIC PRINCIPLES  1

1.  Survey Methodology: Challenges and Principles  3
    Dominique Joye, Christof Wolf, Tom W. Smith and Yang-chih Fu

2.  Survey Standards  16
    Tom W. Smith

3.  Total Survey Error: A Paradigm for Survey Methodology  27
    Lars E. Lyberg and Herbert F. Weisberg

4.  Challenges of Comparative Survey Research  41
    Timothy P. Johnson and Michael Braun

PART II  SURVEYS AND SOCIETIES  55

5.  Surveys and Society  57
    Claire Durand

6.  Defining and Assessing Survey Climate  67
    Geert Loosveldt and Dominique Joye

7.  The Ethical Issues of Survey and Market Research  77
    Kathy Joe, Finn Raben and Adam Phillips

8.  Observations on the Historical Development of Polling  87
    Kathleen A. Frankovic

PART III  PLANNING A SURVEY  103

9.  Research Question and Design for Survey Research  105
    Ben Jann and Thomas Hinz

10. Total Survey Error Paradigm: Theory and Practice  122
    Paul P. Biemer

11. Survey Mode or Survey Modes?  142
    Edith de Leeuw and Nejc Berzelak

12. Surveying in Multicultural and Multinational Contexts  157
    Beth-Ellen Pennell and Kristen Cibelli Hibben

13. Surveys in Societies in Turmoil  178
    Zeina N. Mneimneh, Beth-Ellen Pennell, Jennifer Kelley and Kristen Cibelli Hibben

PART IV  MEASUREMENT  191

14. What does Measurement Mean in a Survey Context?  193
    Jaak Billiet

15. Cognitive Models of Answering Processes  210
    Kristen Miller and Gordon B. Willis

16. Designing Questions and Questionnaires  218
    Jolene D. Smyth

17. Creating a Good Question: How to Use Cumulative Experience  236
    Melanie Revilla, Diana Zavala and Willem Saris

18. Designing a Mixed-Mode Survey  255
    Don A. Dillman and Michelle L. Edwards

19. The Translation of Measurement Instruments for Cross-Cultural Surveys  269
    Dorothée Behr and Kuniaki Shishido

20. When Translation is not Enough: Background Variables in Comparative Surveys  288
    Silke L. Schneider, Dominique Joye and Christof Wolf

PART V  SAMPLING  309

21. Basics of Sampling for Survey Research  311
    Yves Tillé and Alina Matei

22. Non-probability Sampling  329
    Vasja Vehovar, Vera Toepoel and Stephanie Steinmetz

23. Special Challenges of Sampling for Comparative Surveys  346
    Siegfried Gabler and Sabine Häder

PART VI  DATA COLLECTION  357

24. Questionnaire Pretesting  359
    Gordon B. Willis

25. Survey Fieldwork  382
    Annelies G. Blom

26. Responsive and Adaptive Designs  397
    François Laflamme and James Wagner

27. Unit Nonresponse  409
    Ineke A. L. Stoop

28. Incentives as a Possible Measure to Increase Response Rates  425
    Michèle Ernst Stähli and Dominique Joye

PART VII  PREPARING DATA FOR USE  441

29. Documenting Survey Data Across the Life Cycle  443
    Mary Vardigan, Peter Granda and Lynette Hoelter

30. Weighting: Principles and Practicalities  460
    Pierre Lavallée and Jean-François Beaumont

31. Analysis of Data from Stratified and Clustered Surveys  477
    Stephanie Eckman and Brady T. West

32. Analytical Potential Versus Data Confidentiality – Finding the Optimal Balance  488
    Heike Wirth

33. Harmonizing Survey Questions Between Cultures and Over Time  502
    Christof Wolf, Silke L. Schneider, Dorothée Behr and Dominique Joye

PART VIII  ASSESSING AND IMPROVING DATA QUALITY  525

34. Survey Data Quality and Measurement Precision  527
    Duane F. Alwin

35. Nonresponse Error: Detection and Correction  558
    Jelke Bethlehem and Barry Schouten

36. Response Styles in Surveys: Understanding their Causes and Mitigating their Impact on Data Quality  579
    Caroline Roberts

37. Dealing with Missing Values  597
    Martin Spiess

38. Another Look at Survey Data Quality  613
    Victor Thiessen† and Jörg Blasius

39. Assessment of Cross-Cultural Comparability  630
    Jan Cieciuch, Eldad Davidov, Peter Schmidt and René Algesheimer

PART IX  FURTHER ISSUES  649

40. Data Preservation, Secondary Analysis, and Replication: Learning from Existing Data  651
    Lynette Hoelter, Amy Pienta and Jared Lyle

41. Record Linkage  662
    Rainer Schnell

42. Supplementing Cross-National Survey Data with Contextual Data  670
    Jessica Fortin-Rittberger, David Howell, Stephen Quinlan and Bojan Todosijević

43. The Globalization of Surveys  680
    Tom W. Smith and Yang-chih Fu

Index  693
List of Figures

3.1  The different types of survey error source  29
10.1  A high-level process flow diagram for the CES data collection process  127
10.2  Simulated dashboard for monitoring production, costs, and interview quality  129
11.1  Example of auto-advance or carousel question format  153
14.1  Schematic overview of the operationalization process  200
14.2  Operationalization of the perception of welfare state consequences  202
17.1  The different steps applied to the importance of the value honesty  242
18.1  Percent of respondents choosing the most positive endpoint category for five long distance satisfaction questions in a survey, by assigned response mode  263
18.2  Example of unified design question format, using the same question structures, question wording and visual appearance in both the mail (on left) and web (on the right) versions of the question; next and back buttons on web screens are not shown here  265
19.1  Harmonization between survey target regions  279
19.2  Response distribution of 18 survey items  282
19.3  Examples of Japanese translations of 'strongly agree'  283
25.1  Trading off fieldwork objectives  383
25.2  Checklist for fieldwork planning  385
25.3  Interviewer effects in terms of the Total Survey Error framework  393
26.1  Subgroup response rates by day for the NSFG  401
26.2  Ratio of screener to main calls by day for NSFG  402
26.3  Responsive design (RD) strategy for CATI surveys  403
26.4  Key responsive design indicators  405
28.1  Modes and use of incentives in last ISSP survey by per capita GDP and response rate  436
29.1  Research data life cycle  444
29.2  The survey life cycle for cross-cultural surveys  445
29.3  Generic Statistical Business Process Model (GSBPM)  448
29.4  Visualizing the path through an instrument based on metadata about skip patterns  449
29.5  Rich variable-level metadata in the IFSS harmonized file  451
29.6  Variable comparison tool based on DDI metadata  451
29.7  Table of Contents from 1960 US Census Codebook  452
29.8  Interactive codebook for the Collaborative Psychiatric Epidemiology Surveys (CPES)  453
29.9  Sample variable from the NLAAS, which is part of the harmonized CPES  455
29.10  Variable discovery using the ICPSR Social Science Variables Database  456
31.1  Sampling distribution for simple random sampling, stratified simple random sampling using proportional allocation, and clustered simple random sampling  478
33.1  From theoretical construct to questionnaire item  503
33.2  Overview of harmonization approaches  504
34.1  Path diagram of the classical true-score model for a single measure  530
34.2  Path diagram of the classical true-score model for two tau-equivalent measures  531
34.3  Path diagram of the classical true-score model for a single measure composed of one trait and one method  534
34.4  Path diagram for the relationship between random measurement errors, observed scores and true scores for the multiple measures and multiple indicator models  538
34.5  Path diagram of the quasi-Markov simplex model – general case (P > 4)  543
34.6  Path diagram for a three-wave quasi-Markov simplex model  544
35.1  Distribution of the estimated response propensities  562
35.2  Boxplots of response propensities by degree of urbanization  563
35.3  General optimization setting for adaptive survey designs  568
35.4  Raking ratio estimation  573
38.1  South Africa: Bar graph of factor scores  623
39.1  A model for testing for measurement invariance of two latent variables measured by three indicators across two groups with continuous data. The two factors are allowed to covary  632
39.2  A model for testing for measurement invariance of two latent variables measured by three indicators across two groups with ordinal data. The two factors are allowed to covary  635
39.3  A model for testing for measurement invariance using an ESEM approach with two factors, three indicators measuring each factor and two groups. The two factors are allowed to covary  640
List of Tables

3.1  Survey quality on three levels  35
10.1  CTQs by process step, potential impacts, and monitoring metrics  128
10.2  Sources of error considered by product  136
10.3  Quality evaluation for the Labour Force Survey (LFS)  138
12.1  Dimensions of survey context  159
14.1  Construct 'popular perceptions of welfare state legitimacy': Scalar invariant measurement model and structural relations in Flemish and Walloon samples in Belgium (ESS Round 4, 2010)  203
17.1  The classification of concepts-by-intuition from the ESS into classes of basic concepts of the social sciences  238
17.2  The basic structures of assertions  240
17.3  The characteristics of the questions to be taken into account  244
17.4  Two survey questions for a concept-by-intuition  246
17.5  Quality predictions in SQP  247
17.6  An improved question for the same concept-by-intuition  247
17.7  Quality predictions for Q3a and Q3a-bis  247
17.8  Categories for differences in the SQP codes for two languages  250
19.1  Core features of good practice translation and assessment methodology  271
19.2  Overview of adaptation domains and types  276
19.3  Survey item for preferred qualities of friends  279
20.1  The CASMIN education scheme  294
20.2  ISCED 1997 and 2011 main levels  295
20.3  Detailed educational attainment categories and their coding in the ESS (edulvlb), ES-ISCED, ISCED 2011 and 1997  296
20.4  Structure of ISCO-08  299
21.1  Main sampling designs with maximum entropy in the class of sampling designs with the same first-order inclusion probabilities  325
23.1  Telephone access in Europe  348
27.1  Temporary outcomes and final disposition codes  411
31.1  Example population for stratified sampling  480
31.2  Variance of different stratified designs  480
31.3  Design effects for selected estimates in the 2012 General Social Survey  482
33.1  IPUMS Integrated Coding Scheme for Marital Status, slightly simplified  515
35.1  Response rate, R-indicator, coefficient of variation, and partial R-indicators for the six selected auxiliary variables. Standard errors in brackets  567
35.2  Category-level partial R-indicators for urbanization after one month and after two months. Standard errors in brackets  568
35.3  Category-level unconditional partial R-indicators for the 16 strata. Standard errors in brackets  569
35.4  Values of the indicators for the adaptive survey design with restricted follow-up in month 2. Standard errors in brackets  570
35.5  Estimating the percentage having a PC  575
35.6  Estimating the percentage owning a house  575
35.7  Weighting techniques using all six auxiliary variables  576
36.1  Description of eight common response styles  581
38.1  Student response behaviours by reading achievement quintile, Australia and USA  618
38.2  Interviewer effects in ESS 2010  621
38.3  ISSP 2006 – partial listing of South African duplicated data  625
Notes on the Editors and Contributors
THE EDITORS

Christof Wolf is acting president of GESIS – Leibniz Institute for the Social Sciences and Professor of Sociology at Mannheim University. He is currently serving as Secretariat of the International Social Survey Programme (ISSP) and is a member of the Executive Committee of the European Values Study (EVS). His main research interests include sociology of religion, social networks, sociology of health, and survey methodology. He is co-editor of the SAGE Handbook of Regression Analysis and Causal Inference (2015).

Dominique Joye is Professor of Sociology at the University of Lausanne and associated with FORS. He is involved in the analysis of inequality and the life course and participates in NCCR LIVES in Switzerland; part of this handbook was also realized within this framework. He has published many papers in this area, as well as helping to define the way socio-professional positions are measured in Switzerland by the Swiss Statistical Office. He is also interested in comparative surveying, and is a member of the methodological advisory board of the European Social Survey (ESS), of the executive and methodological committees of the European Values Study (EVS), and Chair of the methodological committee of the International Social Survey Program (ISSP).

Tom W. Smith is Senior Fellow and Director of the Center for the Study of Politics and Society of NORC at the University of Chicago. He is Principal Investigator and Director of the National Data Program for the Social Sciences, which conducts the General Social Survey and collaborates with the International Social Survey Program. He studies survey methodology, societal trends, and cross-national, comparative research.

Yang-chih Fu is Research Fellow at the Institute of Sociology and former Director of the Center for Survey Research, Academia Sinica, Taiwan. He is Principal Investigator of the Taiwan Social Change Survey, a large-scale survey series launched in 1984. His recent research focuses on the social desirability effects that occur during face-to-face interviews, as well as multilevel analyses that use contacts as the building blocks of interpersonal ties and social networks.
THE CONTRIBUTORS René Algesheimer is Professor of Marketing and Market Research and Director of the University Research Priority Program ‘Social Networks’ at the University of Zurich. His main research interests lie in social networks, social media and the consequences of the digital transformation on firms and individuals. He has authored several articles in leading journals of the field, such as Marketing Science, Journal of Marketing Research, Journal of Marketing, Harvard Business Review or Sociological Methods and Research. Duane F. Alwin is the inaugural holder of the Tracy Winfree and Ted H. McCourtney Professorship in Sociology and Demography, and Director of the Center for Life Course and Longitudinal Studies, College of the Liberal Arts, Pennsylvania State University, University Park, PA. He is also Emeritus Research Professor at the Survey Research Center, Institute for Social Research, and Emeritus Professor of Sociology, University of Michigan, Ann Arbor. In addition to his interest in improving survey data quality, he specializes in the integration of demographic and developmental perspectives in the study of human lives. His work is guided by the life course perspective, and his current research focuses (among other things) on socioeconomic inequality and health, parental child-rearing values, children’s use of time, and cognitive aging. Jean-François Beaumont is Chief in statistical research in the International Cooperation and Corporate Statistical Methods Division at Statistics Canada. He is responsible for the Statistical Research and Innovation Section. His main research projects and publications are on issues related to missing data, estimation, including robust estimation and, more recently, small area estimation. Dorothée Behr is a senior researcher at GESIS – Leibniz Institute for the Social Sciences, Mannheim. Her research focuses on questionnaire translation, translation and assessment methods, and item comparability as well as cross-cultural web probing. Besides publishing in these fields, she provides consultancy and training in the wider field of cross-cultural questionnaire design and translation. Nejc Berzelak is a researcher in the field of survey methodology at the Faculty of Social Sciences, University of Ljubljana. The main topics of his research include questionnaire design, measurement errors, mode effects, analysis of survey response process, and cost-error optimization of mixed-mode surveys. He is participating in several research projects related to the development of survey methods and works as a methodological consultant for surveys conducted by academic, governmental, and private organizations. Jelke Bethlehem is Professor of Survey-methodology at the Leiden University in The Netherlands. Until recently he was also senior survey methodologist at Statistics Netherlands. His research interests are nonresponse in surveys, online surveys, and polls and media. He is author or co-author of several books about surveys. Paul P. Biemer is Distinguished Fellow of Statistics at RTI International and Associate Director for Survey R&D in the Odum Institute for Research in Social Science at University of North Carolina. His main interests lie in survey statistics and methodology, survey quality
evaluation and the analysis of complex data. He is the author, co-author and editor of a number of books including Introduction to Survey Quality (Wiley, 2003) and Latent Class Analysis of Survey Data (Wiley, 2011). Jaak Billiet is Emeritus Professor of Social Methodology, Centre of Sociological Research, University of Leuven. He combines methodological research with substantial longitudinal and comparative research on ethnocentrism, political attitudes and religious orientations. He is author or co-author of many published book chapters, articles in academic journals, and several co-authored books and edited volumes including Cross-Cultural Analysis (Routledge, 2011). Jörg Blasius is a Professor of Sociology at the Institute for Political Science and Sociology, University of Bonn, Germany. His research interests are mainly in explorative data analysis, especially correspondence analysis and related methods, data collection methods, sociology of lifestyles and urban sociology. Together with Simona Balbi (Naples), Anne Ryen (Kristiansand) and Cor van Dijkum (Utrecht) he is editor of the SAGE series ‘Survey Research Methods in the Social Sciences’. Annelies G. Blom is Assistant Professor at the University of Mannheim and Principal Investigator of the German Internet Panel (GIP). Her research looks into sources of survey error during fieldwork, in particular interviewer effects and nonresponse bias. She is author and coauthor of many peer-reviewed articles that appeared in scholarly journals such as Public Opinion Quarterly, Journal of the Royal Statistical Society: Series A, International Journal of Public Opinion Research, Journal of Official Statistics, and Field Methods. Michael Braun is Senior Project Consultant at GESIS – Leibniz Institute for the Social Sciences at Mannheim and Adjunct Professor at the University of Mannheim. His main research interests include cross-cultural survey methodology and intercultural comparative research in the areas of migration and the family. He is co-editor of Survey Methods in Multinational, Multiregional and Multicultural Contexts. Kristen Cibelli Hibben is a PhD Candidate at the University of Michigan Program in Survey Methodology and Research Assistant in the International Unit at the Institute for Social Research’s Survey Research Operations. Her research interests include respondent motivation and data quality, cross-cultural survey research, and the application of survey methods in challenging contexts such as post-conflict or in countries with little survey research tradition. She has co-authored book chapters in the present volume as well as Hard-to-Survey Populations (Tourangeau, Edwards, Johnson, Wolter, and Bates, 2014) and Total Survey Error in Practice (Biemer, de Leeuw, Eckman, Edwards, Kreuter, Lyberg, Tucker, and West, 2017). Jan Cieciuch is Project Leader of the University Research Priority Program ‘Social Networks’ at the University of Zurich. His interests are applications of structural equation modeling especially in psychology, with focus on the investigation of human values and personality traits. Recent publications appeared in leading journals such as the Journal of Personality and Social Psychology, Journal of Cross-Cultural Psychology, Annual Review of Sociology, and Public Opinion Quarterly. Eldad Davidov is Professor of Sociology at the University of Zurich and president of the European Survey Research Association (ESRA). His research interests are applications of
structural equation modeling to survey data, especially in cross-cultural and longitudinal research on which he has published many papers. Applications include human values, national identity, and attitudes toward immigrants and other minorities. Don A. Dillman is Regents Professor of Sociology and Deputy Director for Research in the Social and Economic Sciences Research Center at Washington State University in Pullman, Washington. His research emphasizes methods for improving response to sample surveys in ways that reduce coverage, measurement and nonresponse errors. He has authored more than 250 publications including the 4th edition of his book, Internet, Phone, Mail and Mixed-Mode Surveys: The Tailored Design Method (Wiley, 2014), coauthored with Jolene Smyth and Leah Christian. Claire Durand is Professor of Survey Methods and Quantitative Analysis, Department of Sociology, Université de Montréal. Her main research interests pertain to the quality of electoral polls, the historical analysis of survey data and the statistics related to the situation of aboriginal people. She is currently vice-president/ president elect of WAPOR. She is author of numerous articles, book chapters and blog posts on the performance of electoral polls in various elections and referendums. Stephanie Eckman is a Senior Survey Research Methodologist at RTI International in Washington, DC. She has published on coverage errors in face-to-face, telephone and web surveys and on the role of respondents’ motivation in survey data quality. She has taught sampling and survey methods around the world. Michelle L. Edwards is Assistant Professor of Sociology, Sociology and Anthropology Department, Texas Christian University. Her main research interests lie in research methodology, environmental risk, and public perceptions of science. She has previously co-authored an article with Don A. Dillman and Jolene D. Smyth in Public Opinion Quarterly on the effects of survey sponsorship on mixed-mode survey response. Michèle Ernst Stähli is Head of group International Surveys at FORS (Swiss Centre of Expertise in the Social Sciences), running in Switzerland the European Social Survey (ESS), the International Social Survey Programme (ISSP), the European Values Study (EVS) and the Survey of Health, Ageing and Retirement in Europe (SHARE). Holding a PhD in sociology of work, since 2010 she has focused her research on topics related to survey methodology such as translation, nonresponse and mixed mode. Jessica Fortin-Rittberger is Professor of Comparative Politics at the University of Salzburg and a former member of the Secretariat of the Comparative Study of Electoral Systems (CSES). Her main areas of research interest include political institutions and their measurement, with particular focus on electoral rules. Her work has appeared in Comparative Political Studies, European Journal of Political Research, European Union Politics, and West European Politics. Kathleen A. Frankovic retired in 2009 as CBS News Director of Surveys and Producer, where she managed the CBS News Polls and (after 2000) CBS News election night projections. Since then, she has consulted with CBS News, YouGov, Harvard University and the Open Society Foundations, among others. A former AAPOR and WAPOR President, Frankovic has published many articles on the linkages between journalism and polling.
Siegfried Gabler is the leader of the statistics team at GESIS – Leibniz Institute for the Social Sciences and Privatdozent at University of Mannheim. He is a member of the Sampling Expert Panel of the European Social Survey. His research area covers sampling designs, especially for telephone surveys and for cross-cultural surveys, weighting for nonresponse, design effects, and decision theoretic justification of sampling strategies. He is involved in several projects in the context of telephone surveys and was jointly responsible for the Census 2011 project for Germany. He has published on a wide field of statistical topics. Peter Granda is Associate Director of the Inter-university Consortium for Political and Social Research (ICPSR). Most recently he has participated in a number of collaborative projects with colleagues at the University of Michigan including acting as Director of Data Processing for the National Survey of Family Growth and as Co-Principal Investigator of the Integrated Fertility Survey Series. He has interests in the creation and use of comparative and harmonized data collections and has had a long association with the cultures of South Asia, where he spent several years of study in the southern part of the Indian subcontinent. Sabine Häder is Senior Statistician at GESIS – Leibniz Institute for the Social Sciences, Mannheim. She is also a member of the Sampling Expert Panels of the European Social Survey. Sabine Häder holds a Doctorate in Economics. Current research areas are: sampling designs, especially for telephone surveys and for cross-cultural surveys. She has published widely on sampling topics. Thomas Hinz is Professor of Sociology at the University of Konstanz. His research interests cover social inequality, labor market sociology, economic sociology, and survey research methodology, particularly survey experiments. Together with Katrin Auspurg, he authored Factorial Survey Experiments (SAGE Series Quantitative Applications in the Social Sciences, Vol. 175, 2015). Lynette Hoelter is an Assistant Research scientist and Director of Instructional Resources at ICPSR and a research Affiliate of the Population Studies Center at the University of Michigan. At ICPSR, she is involved in projects focusing on assisting social science faculty with using data in the classroom, including the Online Learning Center and TeachingWithData.org, and generally oversees efforts focused on undergraduate education. Lynette is also a Co-Principal Investigator of the Integrated Fertility Survey Series, an effort to create a dataset of harmonized variables drawn from national surveys of fertility spanning 1955–2002. Her research interests include the relationship between social change and marital quality, gender in families, and the study of family and relationship processes and dynamics more broadly. She has also taught in the Department of Sociology and the Survey Methodology Program at the University of Michigan. David Howell is Associate Director of the Center for Political Studies at the University of Michigan, and Director of Studies for the Comparative Study of Electoral Systems (CSES). His interests include public opinion, cross-national research, survey methodology, and developing local research capacity in international settings. Ben Jann is Professor of Sociology at the University of Bern. His research interests include social-science methodology, statistics, social stratification, and labor market sociology. 
Recent publications include the edited book Improving Survey Methods: Lessons from Recent Research
(Routledge 2015) and various methodological papers in journals such as Sociological Methodology, Sociological Methods & Research, the Stata Journal, the Journal of Survey Statistics and Methodology, and Public Opinion Quarterly. Kathy Joe is Director, International Standards and Public Affairs at ESOMAR, the World Association of Social, Opinion and Market Research. She works with international experts in the development of strategies relating to data privacy legislation and global professional standards, including the ICC/ESOMAR International Code on Market and Social Research. Recent areas of activity also include guidelines on fast-changing areas such as social media research, online research and mobile research. Kathy has worked at various publications including The Economist and Euromoney, and she is also co-editor of Research World. Timothy P. Johnson is Director of the Survey Research Laboratory and Professor of Public Administration at the University of Illinois at Chicago. His main research interests include measurement error in survey statistics, cross-cultural survey methodology, and social epidemiology. He has edited one book (Handbook of Health Survey Methods) and co-edited two others (Survey Methods in Multinational, Multiregional and Multicultural Contexts, and Hard-to-Survey Populations). Jennifer Kelley is a Research Area Specialist in Survey Methodology at the Survey Research Center, Institute for Social Research, University of Michigan. Her main research interests are measurement issues, specifically questionnaire design and interviewer effects. Her operational interests include surveys conducted in international settings, particularly those in developing or transitional countries. François Laflamme is Chief of the data collection research and innovation section at Statistics Canada. His main interests relate to operational research on various aspects of survey operations, aimed at improving the way data collection is conducted and managed in order to achieve more cost-effective collection and better data quality. He is the author of many papers on paradata research and responsive design. Pierre Lavallée is Assistant Director at the International Cooperation and Corporate Statistical Methods Division at Statistics Canada. His fields of interest are indirect sampling, sampling methods for hard-to-reach populations, longitudinal surveys, business survey methods, and non-probabilistic sample designs. Pierre is the author of the book Le Sondage Indirect, ou la Méthode Généralisée du Partage des Poids (Éditions Ellipses) in French and of Indirect Sampling (Springer) in English. He has also contributed to many monographs and papers on survey methods. Edith de Leeuw is MOA-Professor of Survey Methodology at the Department of Methodology and Statistics, Utrecht University. Her main research interests lie in online and mixed-mode surveys, new technology, total survey error, and surveying special populations, such as children. Edith has over 140 scholarly publications and is co-editor of three internationally renowned books on survey methodology: The International Handbook of Survey Methodology, Advances in Telephone Methodology, and Survey Measurement and Process Quality. Geert Loosveldt is Professor at the Center for Sociological Research of the Catholic University of Leuven (KU Leuven), where he teaches Social Statistics and Survey Research Methodology.
His research focuses on evaluation of survey data quality with special interest in the evaluation of interviewer effects and the causes and impact of non-response error. He is a member of the core scientific team of the European Social Survey. Lars E. Lyberg , PhD, is senior adviser at Inizio, Inc., a research company, and CEO at Lyberg Survey Quality Management, Inc. His research interests lie in comparative surveys, survey quality and general quality management. He has edited and co-edited a number of monographs covering various aspects of survey methodology and is the co-author of the book Introduction to Survey Quality (Wiley, 2003). He is the founder of the Journal of Official Statistics and served as its Chief Editor between 1985 and 2010. Jared Lyle is Associate Archivist at the Inter-university Consortium for Political and Social Research (ICPSR), Institute for Social Research, University of Michigan. His main research interests are in data sharing and digital preservation. He is author or co-author of several publications related to managing and curating data. Alina Matei is senior lecturer at the Institute of Statistics, University of Neuchâtel, Switzerland. Her main research interests and publications concern different features of survey sampling, like sample coordination, estimation in the presence of nonresponse, variance estimation, etc., as well as computational aspects of sample surveys. Kristen Miller is the Director of the Collaborating Center for Question Design and Evaluation Research at the National Center for Health Statistics. Her writings have focused on question comparability, including question design and equivalence for lower SES respondents, and the improvement of evaluation methods for cross-cultural and cross-national testing studies. She is a co-editor of two survey methodology books: Cognitive Interviewing Methodology (2014) and Question Evaluation Methods (2011). Zeina N. Mneimneh is an Assistant Research Scientist at the Survey Methodology Program, Survey Research Center, University of Michigan. Her main research interests include interview privacy, social desirability biases, and interviewer effects on sensitive attitudinal questions. Her main operations interests include monitoring surveys using paradata, reducing survey error in conflict-affected settings, and international survey capacity building. She has published more than 25 peer-reviewed articles and book chapters. Beth-Ellen Pennell is the Director of International Survey Operations at the Institute for Social Research’s Survey Research Center at the University of Michigan. Pennell also serves as the Director of the Data Collection Coordinating Centre for the World Mental Health Survey Initiative, a joint project of the World Health Organization, Harvard University and the University of Michigan. Her research interests focus on cross-cultural survey methods and their application in resource poor settings. Pennell is an elected member of the International Statistical Institute, led the development of the Cross-cultural Survey Guidelines (http://ccsg. isr.umich.edu) and was one of the co-editors of Survey Methods in Multinational, Multiregional and Multicultural Contexts, edited by J. Harkness, M. Braun, B. Edwards, T. Johnson, L. Lyberg, P. Mohler, B-E. Pennell, and T.W. Smith. Adam Phillips is a research consultant and Managing Director of Real Research. He has been Managing Director of AGB Nielsen, Euroquest and Mass-Observation and CEO of
Winona Research. He chairs ESOMAR’s Legal Affairs Committee, and through his knowledge of the public affairs arena and experience in liaising with UK and EU regulatory bodies, has broad experience in compliance and self-regulation. He chaired ESOMAR’s Professional Standards Committee for 15 years, worked with the Committee, to set up an international disciplinary process that binds ESOMAR members to uphold the ICC/ESOMAR International Code. Amy Pienta is a senior researcher at ICPSR in the Institute for Social Research. She is a faculty affiliate in the Population Studies Center, the Michigan Center for Demography of Aging, and the Michigan Center for Urban and African American Aging Research at the University of Michigan. Her training is in sociology (PhD from SUNY Buffalo) and demography (NIA postdoctoral fellowship at the Population Research Institute at the Pennsylvania State University). Her research centers on secondary analysis of the Health and Retirement Study (and other key datasets such as SIPP and NLS) exploring how marriage and/or family relationships affect a range of later life outcomes including: retirement, chronic disease, disability, and mortality. Her current research seeks to understand the underpinnings of a culture of data sharing in order to incentivize and strengthen this ethos across a broad range of scientific disciplines. Dr Pienta directs the National Addiction and HIV Data Archive Program (NAHDAP), funded by NIDA, and the National Archive of Data on Arts and Culture (NADAC), funded by the National Endowment for the Arts. Dr. Stephen Quinlan is Senior Researcher at the GESIS Leibniz Institute, Mannheim and Project Manager at the Comparative Study of Electoral Systems project. His research focuses on comparative electoral behavior, public opinion, and the impact of social media on politics. His research has been published in the journals Electoral Studies and Irish Political Studies. Finn Raben is Director General of ESOMAR, the World Association of Social, Opinion and Market Research, and has spent most of his working career in market research. Prior to joining ESOMAR, he had worked at Millward Brown IMS in Dublin, AC Nielsen, TNS and at Synovate. He is an ex Officio Director of MRII, the online educational institute partnered with the University of Georgia (USA); he serves as an external examiner at the International School of Management in Avans University (Breda, NL) and has joined the advisory board for the Masters of Marketing Research programme at Southern Illinois University Edwardsville. Melanie Revilla is a researcher at the Research and Expertise Centre for Survey Methodology at the University Pompeu Fabra, Barcelona, Spain. Her main research interests lie in survey methodology, quality of questions, questionnaire design, web and mobile web surveys. She is author of a series of papers about quality of questions in different modes of data collection, and for several years she has been teaching courses on survey design, measurement errors, etc. Caroline Roberts is Assistant Professor in Survey Methodology in the Institute of Social Sciences at the University of Lausanne, Switzerland. Her research interests relate to the measurement and reduction of different types of survey error, particularly in mixed mode surveys. She teaches courses on survey research methods and questionnaire design for the MA in Public Opinion and Survey Methodology, and is a member of the Scientific Committee of the European Survey Research Association.
Willem Saris is Emeritus Professor of the University of Amsterdam and currently visiting professor at the Universitat Pompeu Fabra in Barcelona. His research interests are structural equation models and their application to the improvement of survey methods, especially the correction for measurement errors. In that context he developed, together with others, the program SQP, which makes it possible to predict the quality of survey questions and to improve them. He has also been involved with Irmtraud Gallhofer in the study of the argumentation of politicians. He has published widely in all three fields. Peter Schmidt is Professor Emeritus of social science methodology at the University of Giessen. His research concentrates on foundations and applications of generalized latent variable models, especially structural equation models. Applications include cross-country, repeated cross-sections and panel data. The substantive topics deal with values, attitudes toward minorities, national identity and innovation. He is, together with Anthony Heath, Eva Green, Eldad Davidov, Robert Ford, and Alice Ramos, a member of the Question Design Team for the immigration module of the European Social Survey 2014. Silke L. Schneider is senior researcher and consultant at GESIS – Leibniz Institute for the Social Sciences, Mannheim. Her research interests cover comparative social stratification research, attitudes towards migrants, and survey methodology, especially the (comparative) measurement of socio-demographic variables. She has served as an expert with respect to education measurement and the International Standard Classification of Education for several cross-national surveys (e.g. ESS, SHARE), international organizations (e.g. UNESCO, OECD) and individual research projects. Rainer Schnell is the Director of the Centre for Comparative Surveys at City University London and holds the chair for Research Methodology in the Social Sciences at the University of Duisburg-Essen, Germany. His research focuses on nonsampling errors, applied sampling, census operations, and privacy-preserving record linkage. Rainer Schnell founded the German Record Linkage Center and was the founding editor of the journal Survey Research Methods. He is the author of books on Statistical Graphics (1994), Nonresponse (1997), Survey Methodology (2012), and Research Methodology (10th ed. 2013). Barry Schouten is Senior Methodologist at Statistics Netherlands. His research interests are nonresponse reduction, nonresponse adjustment, mixed-mode survey design and adaptive survey design. He has written several papers in these areas and was project coordinator for the EU FP7 project RISQ (Representativity Indicators for Survey Quality). Kuniaki Shishido is an Associate Professor in the Faculty of Business Administration, Osaka University of Commerce. His areas of specialty are social gerontology, social surveys and the quantitative analysis of survey data. He is in charge of designing the questionnaires of the Japanese General Social Surveys (JGSS). He also participates in cross-cultural survey projects such as the East Asian Social Survey (EASS). Jolene D. Smyth is an Associate Professor in the Department of Sociology and Director of the Bureau of Sociological Research at the University of Nebraska-Lincoln. Her research focuses on challenges with questionnaire design, visual design, and survey response/nonresponse. She is co-author of the book Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method (Wiley, 2014) and has published many journal articles focusing on issues of questionnaire design and nonresponse.
Martin Spiess is Professor of Psychological Methods and Statistics, Institute of Psychology, University of Hamburg. His main interests include survey and psychological research methodology, techniques to compensate for missing data, robust and semi-/non-parametric statistical methods as well as causal inference. From 1998 until 2008 he was responsible for compensating for unequal selection and response probabilities as a researcher at the German Socio-Economic Panel study. Stephanie Steinmetz is an Assistant Professor of Sociology at the University of Amsterdam and senior researcher at the Amsterdam Institute for Advanced Labour Studies (AIAS), Netherlands. Her main interests are quantitative research methods, web survey methodology, social stratification, and gender inequalities. Ineke A. L. Stoop is senior survey methodologist at The Netherlands Institute for Social Research/SCP. She is also Deputy Director Methodological of the European Social Survey, and Chair of the European Statistical Advisory Committee. Her main research interests lie in nonresponse and cross-national surveys. She is author and co-author of several books and book chapters on these topics. Victor Thiessen† was a Professor in the Department of Sociology and Social Anthropology at Dalhousie University, Halifax, Nova Scotia, Canada. During his career he served as Chair of his department, Dean of the Faculty of Arts and Social Sciences, and Academic Director of the Atlantic Research Data Centre, which made Statistics Canada surveys available to academic researchers. His substantive work focused on various transitions in young people's lives. Victor loved to play with statistics and to teach others how to do the same, something he continued to do as Professor Emeritus. He passed away suddenly and unexpectedly at the age of 74 on the evening of February 6th, 2016, in the company of his wife Barbara and very close friends. He is survived by his wife, his sister, two daughters and their partners, and six grandchildren. Yves Tillé is professor at the Institute of Statistics, University of Neuchâtel. His main research interests are the theory of sampling and estimation from finite populations, more specifically sampling algorithms, resampling, estimation in the presence of nonresponse, and the estimation of indices of inequality and poverty. Bojan Todosijević is Senior Research Fellow at the Center for Political Studies and Public Opinion Research, Institute of Social Sciences, Belgrade. His research interests include political psychology, political attitudes and behavior, and quantitative research methods. He has been affiliated with the Comparative Study of Electoral Systems (CSES) for a decade, mostly dealing with the integration of micro- and macro-level cross-national data. His work has been published in Political Psychology, International Political Science Review, and European Journal of Political Research. Vera Toepoel is an Assistant Professor at the Department of Methods and Statistics at Utrecht University, the Netherlands. Her research interests cover the entire survey process, with a particular focus on web and mobile surveys. She is the chairwoman of the Dutch and Belgian Platform for Survey Research and author of the book Doing Surveys Online.
Mary Vardigan, now retired, was an Assistant Director and Archivist at the Inter-university Consortium for Political and Social Research (ICPSR), a large archive of social and behavioral science data headquartered at the University of Michigan. At ICPSR, Vardigan provided oversight for the areas of metadata, website development, membership and marketing, and user support. She also served as Executive Director of the Data Documentation Initiative (DDI), an effort to establish a structured metadata standard for the social and behavioral sciences and as Chair of the Data Seal of Approval initiative. Vasja Vehovar is a Professor of Statistics, University of Ljubljana, Slovenia. His interests are in survey methodology, particularly web surveys. He co-authored the book Web Survey Methodology and is also responsible for the corresponding website (http://WebSM.org). James Wagner is a Research Associate Professor in the University of Michigan’s Survey Research Center. His research interests include survey nonresponse, responsive or adaptive survey design, and methods for assessing the risk of nonresponse bias. He has authored articles on these topics in journals such as Public Opinion Quarterly, the Journal of Survey Statistics and Methodology, Survey Research Methods, and the Journal of Official Statistics. Herbert F. Weisberg is Professor Emeritus of Political Science at The Ohio State University, Columbus, Ohio. His main research interests include American politics, voting behavior, and political methodology. He is author of The Total Survey Error Approach: A Guide to the New Science of Survey Research. Brady T. West is a Research Assistant Professor in the Survey Methodology Program, located within the Survey Research Center of the Institute for Social Research on the University of Michigan-Ann Arbor campus. His main research interests lie in interviewer effects, survey paradata, the analysis of complex sample survey data, and regression models for longitudinal and clustered data. He is the first author of the book Linear Mixed Models: A Practical Guide using Statistical Software (Second Edition; Chapman and Hall, 2014), and also a co-author of the book Applied Survey Data Analysis (Chapman and Hall, 2010). Gordon B. Willis is Cognitive Psychologist and Survey Methodologist at the National Cancer Institute, National Institutes of Health, Bethesda, MD. His main research interests are questionnaire design, development, pretesting, and evaluation; especially in the cross-cultural area. He has written two books on the use of Cognitive Interviewing in questionnaire design. Heike Wirth is senior researcher at the Leibniz Institute for the Social Sciences, GESIS, Mannheim, and also a member of the German Data Forum. She works in the areas of social stratification, sociology of the family, data confidentiality, and research methodology. She is author or co-author of several articles or chapters on the measurement of social class. Diana Zavala is a survey methodologist and a researcher at the Universitat Pompeu Fabra, Barcelona. Her main research interests lie in questionnaire design, survey translation, linguistic equivalence in multilingual surveys, structural equation modeling and measurement error. She is a member of the Core Scientific Team of the European Social Survey and the Synergies for Europe’s Research Infrastructures in the Social Sciences project.
Preface

The story of this Handbook covers five continents and five years! In the summer of 2011, during the ESRA conference in Lausanne, SAGE contacted one of us to develop the idea of a handbook of survey methodology, and a team of an American, a German and a Swiss, quickly joined by a Taiwanese, began to elaborate the concept for the volume. Taking advantage of scientific meetings in the USA, Croatia, and Australia, the editors developed a detailed proposal for the Handbook, which was then reviewed by colleagues in the field contacted by SAGE (thanks to them). On the basis of these reviews the table of contents was finalized and approved.

The contract for the volume between SAGE and us was signed when the four of us met in Santiago de Chile for the annual ISSP meeting in 2013. This marked the kick-off of the second stage of producing this Handbook: reaching out to a group of internationally acknowledged experts and inviting them to contribute a chapter. We started out hoping to recruit scholars from across the world, but were only partially successful: the 73 authors contributing to this Handbook reside in Asia, Europe, and North America. While the chapters were solicited, written, and reviewed, we used the opportunity of a meeting in the summer of 2014 in Yokohama to coordinate the content and make final adjustments. Again one year later we met at the annual ISSP meeting, this time in Cape Town, and later in the summer in Reykjavik, in order to finalize the last chapters, make a final adjustment to the Table of Contents and organize the writing of the introduction. A final meeting of the editors took place in Zurich in January 2016, bringing us back to Switzerland, where it all started in 2011.

The story of the development of this Handbook signifies its international character and reflects the importance and value we place on cross-national and cross-cultural perspectives, while at the same time striving for a fair and balanced synthesis of current knowledge. Hopefully this Handbook will stimulate more survey research, and the population of survey scholars will grow to a critical mass in even more regions.

Putting together this Handbook would not have been feasible without the support of our close collaborators, colleagues, and families, whom we thank for their encouragement and the freedom to pursue this work. We are also grateful for the continuous support and encouragement we have received from SAGE.

April 2016
CW, DJ, TWS, YF
PART I
Basic Principles
1 Survey Methodology: Challenges and Principles
Dominique Joye, Christof Wolf, Tom W. Smith and Yang-chih Fu
INTRODUCTION

There are a lot of reasons to publish a new handbook of survey methodology. Above all, the field of survey methodology is changing quickly in the era of the Internet and globalization. Furthermore, survey methodology is becoming an academic discipline with its own university programs, conferences, and journals. At the same time, survey methodology can also be seen as a bridge between disciplines, resting on the methodological preoccupations shared by specialists in very different fields. These are some of the challenges we address here.

Discussing actual practices in many contexts is an invitation to think from a global perspective, along two directions. On the one hand, surveys are today conducted all around the world in very different settings, which we call the globalization of surveys. On the other hand, the 'Total Survey Error' paradigm considers the complete survey life cycle and the interrelation of the different elements involved in the data collection process. This means that it would not be wise to pay too much attention to a single element at the risk of losing sight of the complete picture. This holds, of course, for survey designers, but also for those assessing the quality and potential uses of existing surveys.

A global perspective also requires a comparative frame. We even argue that integrating a comparative perspective from the beginning can enlighten many different aspects of survey design, even in a single national context. These points will be developed throughout this handbook, beginning with this introduction, with the idea of simultaneously providing a 'state of the art' and a perspective on upcoming challenges. One important point in this respect is not to treat surveying as a mere technique but to consider it an integral part of the scientific landscape and the socio-political context. But first, we should explicate what we mean by a 'survey'.
WHAT IS A SURVEY? The Oxford Dictionary of Statistical Terms begins with a broad definition of surveys: 'An examination of an aggregate of units, usually human beings or economic or social institutions' (Dodge 2010 [2003]: 398). Although many authors, such as Ballou (2008), agree on the polysemy of the concept, a more precise definition is given by Groves et al. (2004): 'A survey is a systematic method for gathering information from (a sample of) entities for the purpose of constructing quantitative descriptors of the attributes of the large population of which the entities are members' (p. 2). In this sense, the French word for survey, 'enquête', the same term used for a criminal inquiry, captures well this systematic quest for information. Sometimes 'surveying' is defined as obtaining information through asking questions, in line with the German word for survey: 'Befragung'. Dalenius (1985) recalls that observations are to be made according to a measurement process, i.e., a measurement method and a prescription on its use (Biemer and Lyberg 2003; see also Dodge 2010 [2003]: 399). That means that surveys defined in this sense share many commonalities with other forms of data collection. The idea of a systematic method for gathering observations includes, for example, exhaustive censuses as well as the use of a sample. In fact, some specialists explicitly limit surveys to data collection exercises conducted on samples (de Leeuw et al. 2008: 2). This handbook includes several chapters on the question of sampling (Chapters 21, 22, 23), and the sample survey will be our primary focus, even though we see no reason to exclude by definition censuses, which share a lot methodologically with surveys and are of great importance in the history of the quantitative observation of society. 'Quantitative descriptors' implies not only 'numbers' but also their interpretation, which in turn is placed in a broader interpretative
frame. There is a process of operationalization that progresses from theory to measurement (Chapters 9, 14 and 34). In this sense, 'descriptors', i.e., point estimates, can only be understood by taking into account the structure and functioning of a given society. In other words, how to build the measure of 'items' is also one of the main topics to address, and a full part of the handbook is dedicated to survey-based measurement (Chapters 14–20). One strength of this handbook is the attention it gives to measurement and survey quality. This definition of a survey nevertheless excludes a number of approaches useful for social research that are outside the scope of this handbook (but see Bickman and Rog 2009). Generally, qualitative methods are not considered, as we focus on quantitative descriptors. However, in certain parts of the survey life cycle, qualitative methods are well established and important to consider, such as in pretesting (Chapter 24). Along the same lines, 'big data', e.g., administrative data or data from the Internet, are not considered here because they are not organized a priori as a 'systematic method for gathering information', to rephrase Dalenius. Nevertheless, such data are becoming vital to understanding social life, and must be seen as complementary to surveys.1 In the last part of the handbook we will take into account the growing integration of surveys into a set of different sources of information (predominantly Chapter 42, but also Chapters 40 and 41 in some aspects). There are multiple ways of collecting information through surveys, and some distinctions between them are useful. Although a complete typology is outside the scope of this introduction, Stoop and Harrison (2012), for example, classify surveys based on the interrogatives who, what, by whom, how, when, where and why. Without mimicking their excellent chapter, we can highlight a few of these distinctions. In the 'by whom' category, different types of actors that sponsor survey activities can be mentioned:
•• The scientific community tries to develop theory and analytical models in order to explain behavior, attitudes or values as well as the distribution of health, wealth and goods in given societies.
•• The public administration needs quantitative information for governance – it is no coincidence that the words 'statistic' and 'state' have the same root – and an important task of a state is to assess the number of inhabitants or households it contains.
•• Commercial enterprises need knowledge about their clients and their clients' reactions to their products in order to be as efficient as possible in their markets.
•• Mass media are a special actor in the survey field and were already mentioned as a particular category in the 1980s (Rossi et al. 1985). Sometimes they use the results of polls more as a spectacle to attract an audience than as a piece of systematic knowledge about society. That is part of the debate about the concept of public opinion (Chapter 5).
This type of distinction is also of importance when thinking, for example, about ethical aspects of surveys (Chapter 7). Of course, the boundaries between these actors are not always clear and depend on the national context, at least for the relation between administrative organizations and academia, and this could be important for the definition of measurement tools (Chapters 5 and 20). Nevertheless, we can expect that these actors have different expectations of surveys, their quality and their precision. In fact, most of the examples used in this book are taken from the academic context, implying that we focus more on the link between theory and measurement (Chapters 9 and 14) than on other indicators of quality used, for example, in official statistics.2 This example also reminds us that no absolute criterion for quality exists independently of the goals. This is clearly stated in the definition of quality given by Biemer and Lyberg: 'Quality can be defined simply as "fitness for use." In the context of a survey, this translates to a requirement for survey data to be as accurate as necessary to achieve their intended purposes, be available at the time it is needed (timely) and be accessible to those for whom
the survey was conducted’ (2003: 9). This handbook has
many chapters assessing data quality and aspects that can jeopardize quality (Chapters 34 to 39), an important aspect that must also be taken into account in the design stage of a survey (Chapter 16, for example). This is of prime importance when developing the total survey error frame (Chapter 3). We can further distinguish between the different ways to acquire information, the 'how' mentioned by Stoop and Harrison (2012). The first distinction is between modes of data collection, even though the boundaries between them are blurring, and multi- or mixed-mode designs are more and more often utilized (Chapter 11). We will come back to this later in the introduction when considering the development of surveys during the last century. Another question is who (or what) the units of analysis of the survey are. In most of the chapters in this handbook they are individuals or households, but this is clearly a choice: a fairly large proportion of the surveys conducted by statistical offices are on establishments, even if it is individuals who give the information. In other cases we can aggregate information at some meso or macro level, such as occupations or regions. We include one chapter examining how survey data can be augmented by macro indicators (Chapter 42). Along the same lines, more complex data structures, such as those in which the basic units are members of a network and the connections or interactions among them, are left to the specialized literature, such as the Sage Handbook of Social Network Analysis (Scott and Carrington 2011). Stoop and Harrison also mention the time dimension as an important classificatory factor, the 'when', mostly to distinguish longitudinal and cross-sectional approaches. We cover the discussion on cross-sectional vs longitudinal survey designs in Chapter 9, and further details can be found in dedicated publications (e.g., Lynn 2009). But 'when' may also refer to the historical context (Chapter 8), a topic we turn to next.
SURVEYS IN HISTORICAL PERSPECTIVE Though the idea of probability originated many centuries ago, and the art of counting people through censuses has been known for several millennia (Hecht 1987 [1977]), the modern survey, organized on the basis of a random sample and statistical inference, is more or less only one century old (Seng 1951; see also Chapter 21). Of course, some precursory works can be mentioned, such as studies of poverty or social mobility,3 as well as the so-called straw polls for the prediction of political results. However, the use of a small random sample, for example to predict elections, instead of a large, but fallacious, non-random selection of cases, was proposed only in the interwar years (Chapter 8). More or less at the same time official statistics began using samples. Even if the advent of modern surveying was relatively recent, it has experienced major changes during the last decades. It is worth discussing these changes briefly, as they have structured the way surveys are conducted as well as the way they are used and even thought about. One aspect of this change can be seen in the predominant mode of data collection, which – at least for Western countries – shows the following sequence:
•• Just after the Second World War, most surveys were conducted face to face, or by mail if enough people were considered to be literate.
•• One generation later, in the 1980s, the telephone survey was seen as a new and efficient technology, at least when the coverage was sufficient. In some countries it was obligatory to be listed in the telephone directory, which therefore was seen to constitute an excellent sampling frame.
•• Another generation later, telephones seem more and more difficult to use, as mobile phones tend to replace landlines and centralized directories are no longer available. With the spread of the Internet, the web survey is seen as THE new methodology to adopt, in particular when considering the cost of interviewing.
•• Nowadays, in a context of declining response rates (Chapter 27), the idea in many cases is that combining modes (Chapters 11 and 18) is the way forward, either in order to contact the greatest possible part of the sample or to achieve the best cost/benefit ratio, such as by using adaptive sampling (Chapter 26). This also explains the choice in this handbook to discuss interviewer interactions and interviewer effects less in depth than would have been the case in earlier works (e.g., Biemer et al. 1991), and to put more emphasis on quality in general.
All these points pose challenges for data collection, which the survey industry has had to overcome. This has mostly meant adapting field techniques to the changing circumstances: in particular, they have had to become more transparent and more systematic, but also more flexible. Another story of change can be told in terms of the growing complexity of survey designs (Chapter 16), linked – once again – to the development of technology:
•• At the beginning of the period considered here, after the Second World War, most surveys were single cross-sectional surveys. Their analysis was promising for the disciplines involved, and many important books were based on this type of information. The practical work of analysis was complicated enough with a single survey, in particular considering the (lack of) availability and ease of use of the computers needed for the analysis: software like SPSS was only developed in the seventies, and terminals with video displays became available more or less at the same time.
•• In the next period, the time dimension gained importance, but under different modalities: Repeated cross-sectional surveys were put in place in many countries, like the General Social Survey (GSS) in the United States, the British Social Attitudes (BSA) survey in Great Britain and the Allgemeine Bevölkerungsumfrage in den Sozialwissenschaften (ALLBUS) in Germany. This was also an important change because it was no longer single researchers conducting scientific projects but a tool that had to serve an entire scientific community. In other words, that was the beginning of the implementation of infrastructures in this field.
Panel surveys, with multiple waves of data collection for the same respondents, have become more frequent. The Panel Study of Income Dynamics (PSID), running in the USA since 1968, is the longest-running longitudinal household survey in the world.4 Similar initiatives have been launched in other countries, like the German Socio-Economic Panel (G-SOEP), which has run since 1984; the British Household Panel Survey (BHPS), which began in 1991; or the Swiss Household Panel (SHP), running since 1999. Of course there were precursors to these big initiatives; for example, NORC's College Graduates Study was begun in 1961. In another disciplinary field, we can also mention cohorts like the National Child Development Study in the UK, based on a 1958 cohort, and the National Longitudinal Survey of Youth in the US (beginning in 1979). Most of these studies are further complicated by being household surveys following not only individuals but entire households, which of course change over the years, meaning they involve a very complex data structure.
•• The next step was to introduce the comparative dimension, in addition to the time axis. Here we also have to mention three situations:
{{ Comparative repeated cross-sectional projects. The European and World Values Surveys (EVS and WVS respectively) and the International Social Survey Program (ISSP) were put in place in the eighties, with precisely the idea of having a tool that allows putting countries in perspective while evaluating change. Additionally, 'barometers' have evolved in Europe and beyond, such as the Latino Barometer, Asian Barometer,5 Afro Barometer, Arab Barometer and Eurasia Barometer. Both the East Asian Social Survey (EASS) and the European Social Survey (ESS) were also built from the beginning with the same idea of measuring social change while keeping strict comparability and high quality.
{{ Harmonization of national panel studies to allow comparability. This was the challenge of the Cross-National Equivalent File (CNEF),6 for example, pulling together, among others, the British (BHPS), German (G-SOEP), Swiss (SHP) and US (PSID) panels.
{{ Panel surveys designed from the beginning with a comparative perspective.
There are not many examples, but we can nevertheless mention the case of the Survey of Health, Ageing and Retirement in Europe (SHARE) and the European Statistics on Income and Living Conditions (EU-SILC).
•• The tendency nowadays is also to combine different sources of data to exploit the growing availability of information, as well as to pursue a movement initiated some years ago by social scientists such as Stein Rokkan (Dogan and Rokkan 1969). Multilevel models are more and more often used in comparative projects, making use of data at the country level. Some other projects use other types of information at the contextual level, which could be not only geographic but linked, to mention some examples, to social networks or to occupation. Other examples include the ESS, which tracked the main events arising during the fieldwork period, or the CSES, which integrates not only geographical but also institutional information. SHARE is integrating not only answers to questions but also biomarkers, which will probably gain even more importance in the development of health-related surveys in the coming years. This is described in Chapter 42, among other projects.
•• A further sign of growing complexity comes from the fieldwork, which in recent years has given more attention to paradata providing supplementary information on respondents and the contexts in which they live. These data are a potential basis to identify and correct biases (Kreuter 2013) but also provide a means to improve fieldwork monitoring and adapt to the best design (Chapter 26). Control of the production process is an important aspect of survey quality (Chapter 25; see also Stoop et al. 2010).
•• In recent years, another type of survey based on the Internet has gained visibility – on-line or access panels:7
{{ These panels typically are opt-in surveys, meaning that potential respondents sign up for them. This allows for conducting very cheap surveys on a large number of respondents, which can be in the range of 10,000 to 100,000 panelists. Being a self-selected group, coverage of the population usually is poor, and though companies try to improve representativeness by weighting according to a few socio-demographic variables (see the illustrative sketch after this list), external validity typically is a problem. This therefore calls into question the quality of such surveys.
{{ In reaction to such a model, academic-driven surveys began to use an offline random recruitment process, called probabilistic panels. This strategy tries to combine the advantages of Internet surveys, which are cheap and easy to set up, and those of true random sampling. One of the most famous is probably the LISS panel, which has run in the Netherlands since 2007 (http://www.lissdata.nl/lissdata/About_the_Panel), but similar experiments are running in France with the ELLIPS initiative (http://www.sciencespo.fr/dime-shs) and in Germany with the German Internet Panel (http://reforms.uni-mannheim.de/internet_panel/home/) and the GESIS panel (http://www.gesis.org/unser-angebot/daten-erheben/gesis-panel/). The rise of these Internet surveys does not mean that some other aspects of survey design, such as sampling (Chapter 22) as well as attrition and selection of respondents, are totally solved.
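As a purely illustrative aside, and not part of the original discussion, the following minimal Python sketch shows the kind of post-stratification weighting referred to in the list above, in which opt-in panel respondents are weighted so that the sample distribution of a single socio-demographic variable matches assumed population shares. The variable, its categories and the population figures are invented for the example; real panel weighting typically uses several variables at once (for instance via raking) and still cannot remove self-selection bias.

```python
# Minimal post-stratification sketch (illustrative assumptions throughout):
# weight respondents so that the sample distribution of one variable
# matches assumed population shares.
from collections import Counter

# Hypothetical opt-in panel respondents with an age-group attribute
respondents = [
    {"id": 1, "age_group": "18-34"},
    {"id": 2, "age_group": "18-34"},
    {"id": 3, "age_group": "35-64"},
    {"id": 4, "age_group": "65+"},
]

# Assumed population shares (invented figures for the example)
population_shares = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

counts = Counter(r["age_group"] for r in respondents)
n = len(respondents)

for r in respondents:
    sample_share = counts[r["age_group"]] / n      # share of this group in the sample
    r["weight"] = population_shares[r["age_group"]] / sample_share

print(respondents)
```

Even with such weights, estimates from an opt-in panel remain vulnerable to the selection effects discussed above, which is one reason why the probabilistic panels described in the last item were developed.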
Looking at the history of surveys, there is something a little paradoxical: the recent proliferation of surveys, mainly Internet surveys, without random sampling or a clear description of the inference possibilities, promotes data by emphasizing the number of respondents more than the quality of the data. This in a sense harkens back to the situation of the thirties and the discussion about the Literary Digest poll, although there is perhaps a difference: we now have a better knowledge of non-probability sampling (Chapter 22) and its conditions of use.8 In this history, we can mention a last important distinction, between data produced by design, when a survey is designed according to a specific goal, and data produced by a process, like administrative data or even the 'big data' mentioned earlier. At the time of writing, it does not seem useful to us to claim the superiority of one type of data over the other in a historical, or prospective, perspective; it seems more important to think about the articulation between the research question and the way to answer it, as a function of the available sources of information, while also considering the limitations that each kind of data may have.
A DISCIPLINE WITHIN DISCIPLINES? The first survey practitioners were above all substantive researchers who learned methodology by conducting surveys and accumulating experience. In this sense, in the middle of the last century, surveying was something like a craft or an art. For example, a famous book by Stanley Payne (1951) was precisely entitled The Art of Asking Questions. Survey methodology began to accumulate as a discipline later on; for example, the first handbook was published in the eighties (Rossi et al. 1985). In order to move from an 'art' or a 'craft' to a 'science', there was therefore a need for a unifying paradigm. Total survey error was a perfect candidate for this (Chapters 3 and 10). Survey methodology tends more and more to be seen as a discipline of its own: it has its own journals, such as Public Opinion Quarterly (POQ), Survey Research Methods (SRM), and the Journal of Official Statistics (JOS); its own conferences and associations, including the American Association for Public Opinion Research (AAPOR), the European Survey Research Association (ESRA) and the World Association for Public Opinion Research (WAPOR); as well as handbooks, this one being one of them. These associations have contributed to the establishment of standards for the discipline (Chapter 2). This is also the case for market-oriented associations (Chapter 7). But survey research is also a bridge between disciplines. For example, medical cohorts and sociological panels use the same methodology, and they can begin to speak to each other precisely because they share so much in terms of data collection and analysis. It is nevertheless important to recall in this context that methodology is also deeply embedded in disciplines. In this context, it is probably vain to develop a methodology for methodology, independent of the substantive goal of the research. There is always the risk of development inside an ivory tower without taking into account the most
important social and scientific challenges. For this reason we argue that a comparative perspective in methodology is a way to reflect on the limits and conditions for the validity of survey results and, as such, a very important safeguard. In other words, methodology must be open to the preoccupations of the other disciplines and keep a broad perspective. Methodological excellence will also be better accepted by substantive researchers if it shares, at least in part, the same disciplinary vocabulary and preoccupations. Survey research is therefore positioned in an area of tension between methodology on the one hand and substantive research on the other. But the reverse is also true: from a substantive researcher's point of view, it is important to be aware of data quality and, more generally, of the question of how much methodology affects empirical results. We should not only consider whether survey methodology is a discipline by itself or a specific field of expertise within several disciplines, but also take into account that some disciplines are sources of support for solving problems specific to survey methodology:
•• The first discipline to mention is statistics, which can sometimes be seen as part of mathematics but also has its own position in the scientific landscape. From our point of view, statistics is important not only for sampling or data analysis but also for proposing tools for measurement (Chapter 17) and the estimation of quality (Chapter 34).
•• Psychology, and in particular cognitive psychology as well as social psychology, can be seen as key when looking at the modeling of the answer process and the interaction between interviewers and interviewees. These disciplines are also important when looking at models of answers, as in Chapter 15.
•• Psychometrics is important to consider in the discussion about measurement and measurement models (Chapter 17 or 34).
•• Sociology and political science can be considered when trying to understand differences in survey participation, for example by social position. This could be inspired by theory, like that of Bourdieu, but may also refer to the idea of social exchange models as posited by Dillman
et al. (2014). Along the same lines, the study of housing and living conditions of respondents can benefit from the work of urban sociology in order to conceptualize lifestyle and social conditions (Smith 2011). But sociology and political science are also important when embedding the construction of indicators into social and political constraints (Chapters 5 and 20).
•• Linguistics and translation studies are also important to consider for questions of formulation, as well as providing tools when considering translations (Chapter 19).
Interdisciplinarity is important in this frame. It is one of the conditions needed to fruitfully integrate methods, statistics and disciplinary perspectives. However, for survey methodologists, it is also important to find a common paradigm in order to address the questions of survey research. That is precisely the goal of the Total Survey Error perspective already mentioned, but some remarks can be added here. The discussion of survey quality, including how to develop reliable estimators and efficient tools, is probably as old as the history of surveys. A text by Deming (1944) is one of the oldest milestones along these lines; interestingly, it was published in the American Sociological Review. From this perspective, it is important to consider all the possible sources of errors, the ways in which they interact, and whether or not they endanger the results as well as the conclusions that can be drawn from the data (Smith 2005). This is why Chapter 3 is dedicated specifically to considering the possible sources of errors, their consequences and their interrelations. This does not prevent us from dedicating a full part of this work to the different facets of data quality (Chapters 34 to 39). This attention to data quality and its consequences is one of the characteristics of this handbook and is related to the other transversal theme: comparative design, to which the last chapter of each part is dedicated, as well as a general chapter on the challenges of comparative survey research (Chapter 4). This attention to comparison and comparability of
course has many consequences for the way we consider survey methodology.
COMPARATIVE FRAME AND THE NEED FOR MORE RESEARCH AND EXCHANGE BETWEEN CONTINENTS For many reasons, beginning with the size of the scientific community and the administration's need for information in a big country, survey methodology developed quickly in the United States after the Second World War. Part of this knowledge was 'exported' to Europe with the creation of important firms such as IFOP with Jean Stoetzel in France (Meynaud and Duclos 2007 [1985]) and the Allensbach Institute with Elisabeth Noelle-Neumann, both of which had earlier exchanges with George Gallup (Zetterberg 2008; but see also Chapter 43). This movement of exportation and dissemination was of course primarily the case for 'opinion studies' rather than for official statistics, in which the various countries invested more energy in autonomous development.9 The origin and development of survey methodology inside the United States has had many consequences. Many findings have been established in a US context, but their validity in other contexts has not always been tested. This is the case, for example, for topics like the use of incentives (Chapter 28), for which we can expect level of income and lifestyle to be determining factors. Along the same lines, many studies are based on meta-analyses of published results that have never taken into account the cultural origin of the studies on which they were based. These are of course strong arguments to also consider and promote studies conducted in different contexts. Indeed, as mentioned in Chapter 43, surveys have been developed on all continents. For example, the Global Barometer Surveys mentioned above that were inspired by the Eurobarometer now include the Latino Barometer, Asian Barometer, Afro Barometer,
Arab Barometer and Eurasia Barometer (http://www.globalbarometer.net/). The ISSP and the world extension of the values surveys have also covered six continents for many years. This represents not only the dissemination of a technique across the globe, which is interesting in itself, but also the possibility of important scientific developments (Haller et al. 2009). For example, to what extent can we compare different systems (Chapter 12)? What is the importance and impact of translation (Chapter 19)? What is the link between the general conditions of a country and the way of conducting surveys, either in drawing a sample or in choosing the most adequate mode (Chapter 23)? The comparative perspective is clearly a central point here. Comparison is considered for the planning, measurement, use and quality of the survey process and the resulting data. Another point can be mentioned here. Even though survey methodology is first of all a discipline founded on an empirical basis, there are still many practical elements that are unknown and need to be explored, especially from a comparative perspective. For example, what is the relation between interest in a topic or in the questionnaire on the one hand and quality of response on the other? Although such a question seems rather simple, there are still difficulties in measuring 'interest' in an appropriate manner. Likewise, even if the words in a questionnaire are chosen very carefully and discussed among experts after extensive cognitive testing or pretesting, there is still room to discuss the choice of a particular wording. This kind of information has to be documented in depth and published in order to allow scientific validation. In this context, it is somewhat astonishing that experiments in survey methodology are less often archived and re-analyzed than substantive surveys, even though survey methodologists are probably the people best trained to do secondary analysis in an appropriate way (Chapter 40; see also Mutz 2011). Once again, the comparative dimension adds a level of difficulty here: what is true in a given context is not necessarily true in another (Chapters 12 and 13). That means that many
experiments and empirical analyses have to be replicated in different countries before solutions can be adopted or adapted. As mentioned, the incentives presented in Chapter 28 have until recently been discussed only in a US context, without knowing how appropriate it would be to implement them in the same way in Africa, Asia, Europe or South America. Another case could be item validity, which may vary by behavior patterns or rules of social exchange deeply embedded in cultures. For example, the prevailing norm of 'saving face' during social interactions may lead Chinese respondents to give socially desirable responses during face-to-face interviews to an excessive extent.10 In some comparative surveys, such as the ISSP, a notably large number of East Asian respondents also choose the mid-point response category (e.g., neither agree nor disagree) on attitudinal items, consistently refraining from revealing definite opinions. It also remains unknown how such cultural differences could be taken into account in comparative survey studies. Similarly, the linguistic properties of questions and wordings are far better known in an Anglo-Saxon context than in other languages. In other words, we still have to make progress in order to find functional equivalence between countries when designing surveys (Chapters 19 and 33). But insisting on the differences in conditions between countries is also an invitation to examine the importance of differences between social groups within a country, in terms of shared validity and reliability, as well as functional equivalence. In other words, if we follow such an idea, every survey is comparative by nature! That means that considering the challenges posed by multicultural surveys also makes us aware of the heterogeneity of conditions and of respondents in each national context.
USE AND USEFULNESS OF SURVEYS IN A CHANGING WORLD We have already mentioned the usefulness of surveys, a minimal proof being the
development of the discipline. Let us discuss some points about this in more detail:
•• In any country the statistical office and other governmental agencies are an important source of information needed for governing and for making informed decisions. For example, the European Commission conducts the Eurobarometer in order to gain regular information on the attitudes of Europeans. More generally, the use of statistical data is part of a movement wherein decisions are based on information and facts. The whole social reporting movement and evaluation studies are based on this line of reasoning. 'Evidence-based policy' is similarly in line with the push for 'evidence-based medicine', and recent developments around genomics could be a further incentive in this sense.
•• A great deal of scientific work has been built on the Comparative Study of Electoral Systems (CSES), ESS, E(W)VS and ISSP, to mention four important international comparative projects. Combining these sources, there are probably more than one thousand comparative journal articles published each year, showing the integration of survey methodology into the scientific activity of the social sciences today, especially in a comparative context. This is a benefit not only in terms of knowledge of the social systems of the countries concerned, but also in terms of sharing methodological excellence and innovations among researchers. That means that the use of these important surveys increases the level of knowledge and competences within the scientific community of the participating countries. The encouragement of data infrastructures in Europe, including data production with projects like the ESS or SHARE and through the creation of ERICs,11 is probably one more sign of the vitality of surveys as a tool for knowledge production in the academic field, at least in a European context.
•• The relation to the media is sometimes more ambiguous, partly because of the question of the accuracy of electoral predictions, sometimes based on surveys lacking the necessary transparency (Chapters 2 and 5). On the other hand, the question of feedback to citizens and participants in surveys is clearly an important point, and it is even sometimes seen as a crucial element of a democratic system (Henry 1996). This is also part of the idea of a 'survey climate' developed in Chapter 6, at least if we think that discussion about surveys
and feedback to the people participating are part of a democratic culture (Chapter 5).
•• We have already mentioned the close relation between the words 'statistic' and 'state', which clearly puts on the agenda some concepts related to political science. At least four aspects are of interest here.
{{ Surveys are part of the democratic system, in which everyone must have the right to express his or her own opinion. In this sense, every act against freedom of expression is problematic, and surveys are clearly relevant in the context of establishing democracy (Chapter 5).
{{ There is also another link between surveys and democracy, as in many cases surveys mimic democracy by following the model 'one (wo)man, one vote', which we find when each adult inhabitant of a country has the same probability of being invited to participate in a survey.
{{ Along the same lines, potential respondents have some rights: the right not to answer and the right to receive feedback about what has been done with the information they have given. More generally, for respondents, taking part in surveys entails a cost that must be acknowledged by asking only carefully considered, meaningful questions.
{{ But surveys are also ways of forming opinions and are not a purely neutral tool of observation. As mentioned in Chapter 20, there is a performative effect in the definition and use of categories and subjects to be asked about. More generally, there is a stream of research that questions the relation between surveys and public opinion, or even the pertinence of the latter as a scientific concept.12
In fact, presented in this way, survey methodology can not only be seen as an interdisciplinary field but even be considered in terms of transdisciplinarity (Hirsch Hadorn et al. 2008), which means taking into account the social conditions in which interdisciplinarity operates. We hope to have demonstrated in this handbook the value of discussing surveying from a methodological point of view as well as from a far more general perspective, including social and political challenges. The condition for this is the practical
possibility of using the information contained in surveys in the most pertinent way. Above all, that means documentation of the data and their conditions of production (Chapter 29) as well as good practices of analysis (Chapters 30, 31 and 32).
ORGANIZATION OF THE HANDBOOK The handbook begins by introducing basic principles in Part I. It also introduces readers to the two main organizing principles: the idea of total survey error and the comparative perspective. Part II underlines that surveys are not just a technical tool described by a simple methodology but that they are anchored in societies and historical contexts. They are useful for many actors, such as the state, and they also have an impact of their own because they have developed within particular historical contexts. That is one reason that allows us to speak about a 'survey climate' (Chapter 6) and one more reason to take into account ethical issues (Chapter 7). The remainder of the handbook follows a simple flow model of conducting a survey: planning a survey, deciding about measurement, choosing a sampling method, thinking about specific features of data collection, preparing the data for use, and finally assessing and, where possible, improving data quality. The questions that are posed and that have to be answered in each of these steps are even more challenging if a survey is to be carried out as part of cross-cultural, comparative research. Therefore, this particular aspect is, as mentioned earlier, dealt with in specific chapters discussing the particular challenges of comparative survey research in the different phases of the survey process. The next chapters are dedicated to planning a survey (Part III). In this part, the research question that drives the choice of a suitable survey design has to be made explicit. A specific survey mode, or combination of modes, also has to be determined. This part also covers a discussion of the
total survey error paradigm in practice as well as a discussion of surveying in multicultural contexts, with an emphasis on doing surveys in difficult situations. The chapters in the remaining parts discuss more specific aspects of the survey process through the question of measurement on the one hand (Part IV) and of sampling on the other (Part V), which are the practical tools that surveys require. Finally, specific issues in the data collection phase are discussed (Part VI). From the total survey error perspective, all the different steps are important when it comes to assessing the final quality of the outcome. In other words, a survey is always a combination of interlinked steps, and its final quality depends on the weakest one. Data are of no value if they are not used. That is why so much effort is put into making data available for secondary research. This of course implies properly documenting the data, organizing access to them and respecting the characteristics of the samples through weighting as well as ensuring comparability. Preparing data for use is covered in Part VII. Quality can be threatened by a number of factors. Detecting, and hopefully correcting for, potential biases is central in this respect. Part VIII addresses this in terms of measurement questions, non-response and missing values, as well as comparability challenges. Part IX is dedicated to further issues. As mentioned, these can be divided into three components: better use of resources, beginning with secondary analysis; putting together different sources, comprising the different ways to link data; and framing all of this within the globalization of science and surveying. For us as editors, an important contribution of this handbook is not only to give tools to solve problems but also to offer elements to frame surveys in a more general context, allowing methodology and scientific practices to be linked in the most fruitful ways.
One of the most challenging developments in the near future will be to combine data from different sources including surveys. However, we firmly believe that the survey model based on a random sample of a population will continue to play an important role in the advancement of the social sciences and knowledge society in general.
NOTES 1 A report recently published by the AAPOR writes, 'The term Big Data is used for a variety of data as explained in the report, many of them characterized not just by their large volume, but also by their variety and velocity, the organic way in which they are created and the new types of processes needed to analyze them and make inference from them. The change in the nature of the new types of data, their availability, the way in which they are collected, and disseminated are fundamental'. And, as a recommendation: 'Surveys and Big Data are complementary data sources not competing data sources' (AAPOR report on big data, 12.2.2015, accessed 29.2.2016 from https://www.aapor.org/AAPOR_Main/media/Task-Force-Reports/BigDataTaskForceReport_FINAL_2_12_15_b.pdf). 2 See for example Quality Assurance Framework for the European Statistical System, version 1.2, edited by the European Statistical System, http://ec.europa.eu/eurostat/documents/64157/4392716/ESS-QAF-V1-2final.pdf/bbf5970c-1adf-46c8-afc3-58ce177a0646, accessed 28.11.2015. 3 For poverty see for example Bowley (1915; discussed in Kruskal and Mosteller 1980). For social mobility, some Scandinavian studies of the nineteenth century are mentioned by Merllié (1994). We can also think of the works of Galton and Pearson on the transmission of quality between generations, as reported for example by Desrosieres (2002). For general histories of survey research see Oberschall (1972) and Converse (2009). 4 Cf. https://psidonline.isr.umich.edu/, accessed 3.12.2015. 5 Asian Barometer (http://www.asianbarometer.org/), a partner in the Global Barometer network, is not to be confused with Asia Barometer, an independent regional comparative survey network jointly sponsored by governmental agencies and business firms in Japan (https://www.asiabarometer.org/).
6 See https://cnef.ehe.osu.edu/, accessed 3.12.2015. 7 See also the ISO norm, http://www.iso.org/iso/catalogue_detail?csnumber=43521, accessed 14.1.2016. 8 See the AAPOR report on the use of non-probability sampling, http://www.aapor.org/AAPOR_Main/media/MainSiteFiles/NPS_TF_Report_Final_7_revised_FNL_6_22_13.pdf, accessed 29.2.2016. 9 This question of different development between the academic world, the private survey organizations and the national statistical institutes is still of relevance and was one of the reasons for the launch of the Data without Boundaries project (DwB) in the context of the 7th framework program of the European Union. 10 For theoretical arguments about such social norms, see Hwang (1987). 11 For these institutions see for example https://ec.europa.eu/research/infrastructures/index_en.cfm?pg=eric, accessed on 26.12.2015. 12 In the French tradition, such a critical perspective exists in the stream initiated by Bourdieu in 1972 in the famous paper 'L'opinion publique n'existe pas' (reproduced at http://www.homme-moderne.org/societe/socio/bourdieu/questions/opinionpub.html, accessed on 26.12.2015). More recently see also the work of Blondiaux (1998). In English we can mention Bishop (2004).
REFERENCES Ballou J. (2008) 'Survey', in the Encyclopedia of Research Methods, Vol. 2, Paul Lavrakas (ed.), Beverly Hills: Sage, p. 860. Bickman L. and Rog D.J. (2009) Sage Handbook of Social Research Methods, Beverly Hills: Sage. Biemer P.P. and Lyberg L.E. (2003) Introduction to Survey Quality, Hoboken: Wiley. Biemer P.P., Groves R.M., Lyberg L.E., Mathiowetz N.A. and Sudman S. (1991) Measurement Errors in Surveys, Hoboken: Wiley. Bishop G.F. (2004) The Illusion of Public Opinion: Fact and Artifact in American Public Opinion Polls, United States of America: Rowman & Littlefield Publishers. Blondiaux L. (1998) La fabrique de l'opinion, Paris: Seuil. Converse J. (2009) Survey Research in the United States: Roots and Emergence 1890–1960, New Brunswick: Transaction Publishers.
Dalenius T. (1985) Elements of Survey Sampling, Stockholm: Swedish Agency for Research Cooperation with Developing Countries. Deming W.E. (1944) 'On Errors in Surveys', American Sociological Review 9(4), pp. 359–369. Desrosieres A. (2002) The Politics of Large Numbers: A History of Statistical Reasoning, Cambridge: Harvard University Press. Dillman D.A., Smyth J.D. and Christian L.M. (2014) Internet, Phone, Mail and Mixed Mode Surveys: The Tailored Design Method, Hoboken: Wiley. Dodge Y. (2010 [2003]) The Oxford Dictionary of Statistical Terms, Oxford: Oxford University Press. Dogan M. and Rokkan S. (eds) (1969) Quantitative Ecological Analysis in the Social Sciences, Cambridge: MIT Press. Groves R.M., Fowler F.J., Couper M.P., Lepkowski J.M., Singer E. and Tourangeau R. (2004) Survey Methodology, Hoboken: Wiley. Haller M., Jowell R. and Smith T.W. (2009) The International Social Survey Programme 1984–2009: Charting the Globe, London: Routledge. Hecht J. (1987 [1977]) 'L'idée de dénombrement jusqu'à la révolution', Pour une histoire de la statistique, Vol. 1, pp. 21–81, Paris: Economica/INSEE. Henry G. (1996) 'Does the Public Have a Role in Evaluation? Surveys and Democratic Discourse', in Marc T. Braverman and Jana Kay Slater (eds) Advances in Survey Research, New Directions for Evaluation, No. 70, pp. 3–15. Hirsch Hadorn G., Hoffmann-Riem H., Biber-Klemm S., Grossenbacher-Mansuy W., Joye D., Pohl C., Wiesmann U. and Zemp E. (2008) Handbook of Transdisciplinary Research, New York: Springer. Hwang Kwang-kuo (1987) 'Face and Favor: The Chinese Power Game', American Journal of Sociology 92(4), 944–974. Kreuter F. (2013) Improving Surveys with Paradata, Hoboken: Wiley. Kruskal J. and Mosteller F. (1980) 'The Representative Sampling IV: The History of the Concept in Statistics', International Statistical Review 48(2), 169–195.
Leeuw Edith D. de, Hox Joop J. and Dillman Don A. (2008) International Handbook of Survey Methodology, New York: Lawrence Erlbaum. Lynn P. (2009) Methodology of Longitudinal Surveys, Hoboken: Wiley. Merllié D. (1994) Les enquêtes de mobilité sociale, Paris: PUF. Meynaud H.Y. and Duclos D. (2007 [1985]), Les sondages d’opinion, Paris: La Découverte. Mutz D. (2011) Population-Based Survey Experiments, Princeton: Princeton University Press. Oberschall A. (ed.) (1972) The Establishment of Empirical Sociology, New York: Harper & Row. Payne S. (1951) The Art of Asking Questions, Princeton: Princeton University Press. Rossi P.H., Wright J.D. and Anderson A.B. (eds) (1985) Handbook of Survey Research, Orlando: Academic Press. Scott J.G. and Carrington P.J. (eds) (2011) The Sage Handbook of Social Network Analysis, Beverly Hills: Sage. Seng Y.P. (1951) ‘Historical Survey of the Development of Sampling Theories and Practice’,
Journal of the Royal Statistical Society, Series A (General) 114(2), 214–231. Smith T.W. (2005) ‘Total Survey Error,’ in Kempf-Leonard, K. (ed.) Encyclopedia of Social Measurement, New York: Academic Press, pp. 857–862. Smith T.W. (2011) ‘The Report on the International Workshop on Using Multi-level Data from Sample Frames, Auxiliary Databases, Paradata, and Related Sources to Detect and Adjust for Nonresponse Bias in Surveys’, International Journal of Public Opinion Research, 23, 389–402. Stoop I. and Harrison E. (2012) ‘Classification of Surveys’, in Gideon L. (2012) Handbook of Survey Methodology in the Social Sciences, Heidelberg: Springer, pp. 7–21. Stoop I., Billiet J., Koch A. and Fitzgerald R. (2010) Improving Survey Response: Lessons Learned from the European Social Survey, Hoboken: Wiley. Zetterberg H.L. (2008) ‘The Start of Modern Public Opinion Research’, Sage Handbook of Public Opinion Research, London: Sage, pp. 104–112.
2 Survey Standards Tom W. Smith
DIFFERENT TYPES OF STANDARDS First, there are informal common or customary practices. For example, in the field of survey research (as well as in many other disciplines), the general norm is to accept probabilities of 0.05 or smaller as ‘statistically significant’ and thus scientifically creditable. As far as I know, this rule is not codified in any general, formal standards, but it is widely taught in university courses and applied by peer reviewers, editors, and others at journals, publishers, funding agencies, etc. (Cowles and Davis, 1982). Other examples are the use of null hypotheses and including literature reviews in research articles (Smith, 2005). Second, standards are adopted by professional and trade associations.1 These may apply only to members (often with agreement to follow the organizational code as a condition of membership) or may be deemed applicable to all those in a profession or industry regardless of associational membership.
Enforcement is greater for members (who could be sanctioned or expelled for violating standards), but can also be applied to non-members (Abbott, 1988; Freidson, 1984, 1994; Wilensky, 1964). Third, standards are developed by standards organizations. These organizations differ from particular professional and trade associations in that they do not represent a specific group and they are not designed to promote and represent individual professions or industries, but to establish standards across many fields. The main international example is the International Organization for Standardization (ISO) and the many national standards organizations affiliated with the ISO (e.g. in the US the American National Standards Institute or in Togo the Superior Council of Normalization). Standards organizations typically both promulgate rules and certify that organizations are compliant with those rules (Smith, 2005). Fourth, standards are written into specific contracts to conduct surveys. Contracts of
course can stipulate any mutually agreeable, legal provisions. But in many cases they incorporate specific requirements based on the codes/standards of professional and trade associations and/or standards organizations. Finally, there are legally-binding standards established by governments. These can be local, national, or international. They may be set directly by legislation and/or established by regulatory agencies. Examples are the restrictions that many countries impose on pre-election polls (Chang, 2012; Smith, 2004). Enforcement can be through civil sanctions or criminal prosecutions. Government agencies sometimes work together with private organizations (usually trade, professional, or standards groups) to formulate and even enforce rules. In addition, governments also set standards by establishing rules for data collected by their own agencies (e.g. the US Bureau of the Census) or by those working for the government (OMB, 1999, 2006; Smith, 2002a; Subcommittee, 2001).
TYPES OF CODES OF PROFESSIONAL AND TRADE ASSOCIATIONS One key component of professionalization is the adoption of a code of standards which members promise to follow and which the professional association in turn enforces (Abbott, 1988; Freidson, 1984, 1994; Wilensky, 1964). Codes for survey-based research can have several different components. First, there are ethical standards that stipulate certain general and specific moral rules. These would include such matters as honesty, avoiding conflicts of interest, and maintaining confidentiality (American Statistical Association, 1999; Crespi, 1998). Even when applied to a specific industry/profession like survey research, they usually reflect general principles applicable across many fields. Second, there are disclosure standards that stipulate certain information, typically
methodological, that must be shared with others about one's professional work (Guide of Standards for Marketing and Social Research, n.d.; Hollander, 1992; Kasprzyk and Kalton, 1998; Maynard and Timms-Ferrara, 2011; Smith, 2002a). A prime example of this approach is the Transparency Initiative launched by the American Association for Public Opinion Research (AAPOR) and endorsed by such other organizations as the World Association for Public Opinion Research (WAPOR) and the American Statistical Association (AmStat). Third, there are technical and definitional standards. Essentially, these are detailed elaborations on what is meant by other standards. For example, AAPOR and WAPOR both require that the response rate of surveys be disclosed, and both endorse Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys (www.aapor.org/Standard_Definitions2.htm) as the way in which those and other outcome rates must be calculated and reported (see also Lynn et al., 2001; Kaase, 1999). Fourth, there are procedural standards. These indicate specific steps or actions that need to be executed when a professional activity is carried out. For example, checking cases through monitoring centralized telephone calls or recontacts might be stipulated procedures for interview validation. Finally, there are outcome or performance standards. These specify acceptable levels that are expected to be reached before work is considered satisfactory. This includes such things as having dual-entry coding show a disagreement rate below a certain level (e.g. less than 2 in 1000) or obtaining a response rate above some minimum (e.g. 70%). Codes can encompass few, many, or all of these types of standards. The different types are not independent of one another, but interact with each other in various, complex ways. For example, procedural standards would have to be consistent with ethical standards, and disclosure and technical/definitional standards are closely inter-related.
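To illustrate what such technical/definitional and performance standards amount to in practice, the following minimal Python sketch computes an outcome rate in the spirit of AAPOR's Standard Definitions response rate RR1 (completed interviews divided by all eligible and unknown-eligibility cases) and compares it against a hypothetical performance threshold. The disposition labels, the counts and the 70% threshold are invented for the example, and the sketch does not reproduce the full set of case codes and outcome rates defined in the Standard Definitions document itself.

```python
# Illustrative sketch only: an RR1-style response rate from final case
# dispositions. Consult AAPOR's Standard Definitions for the authoritative
# case codes and formulas; the numbers below are made up.

def response_rate_rr1(i, p, r, nc, o, uh, uo):
    """i: complete interviews, p: partial interviews, r: refusals,
    nc: non-contacts, o: other eligible non-interviews,
    uh: unknown if eligible household, uo: unknown eligibility, other."""
    return i / (i + p + r + nc + o + uh + uo)

rate = response_rate_rr1(i=612, p=48, r=190, nc=105, o=20, uh=15, uo=10)
print(f"RR1 = {rate:.3f}")                     # the disclosed outcome rate
print("meets 70% threshold:", rate >= 0.70)    # a hypothetical performance standard
```

In this spirit, a disclosure standard requires that the rate (and how it was calculated) be reported, a technical/definitional standard fixes the formula, and a performance standard sets the minimum value to be attained.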
Professional and Trade Associations There are many professional and trade associations in the field of survey research. First, there are the core professional and trade associations of the profession and industry of survey research. These include two, major, international professional associations: ESOMAR (formerly the European Society for Opinion and Marketing Research) and WAPOR; regional associations like the European Survey Research Association, Asian Network for Public Opinion Research, and the Latin American Chapter of WAPOR; many national professional associations such as the AAPOR, the Social Research Associations (SRAs) in Wales, Scotland, and Ireland, and the British Market Research Association (BMRA); and national trade associations such as the Association of the Marketing and Social Research Industry (Canada) (AMSRI), Council of American Survey Research Organizations (CASRO), Council of Canadian Survey Research Organizations (CCSRO), and the National Council of Public Polls (USA) (NCPP). Second, there are professional and trade associations in closely-related fields: market research, the social sciences, and statistics. Market research associations include ESOMAR, which bridges the fields of survey and market research, and such other groups as the Advertising Research Foundation (ARF), the Alliance of International Market Research Institutes (AIMRI), the American Marketing Association (AMA), the Association of Consumer Research (ACR), the Association of European Market Research Institutes (AEMRI), the European Federation of Associations of Market Research Organizations (EFAMRO), the Market Research Quality Standards Association (MRQSA), the Marketing Research Association (MRA), and more specialized groups within market research such as the Audit Bureau of Circulation (ABC) and the Media Ratings Council (MRC).
The social-science disciplines most closely tied to survey research are sociology, political science, and psychology, and they are represented by such cross-national groups as the International Sociological Association (ISA), the International Political Science Association (IPSA), and the International Association of Applied Psychology (IAAP). The main international statistical groups are the International Association of Survey Statisticians (IASS) and the International Statistical Institute (ISI). Finally, since survey research is often public and widely distributed to the mass media, and also sometimes done by or in collaboration with the media, standards relating to the media and journalism are also relevant. First, the survey-research field reaches out to and promotes best practices by journalists in their use of surveys. The NCPP focuses on the media, and both AAPOR (Zukin, 2012) and WAPOR (Smith, 2002b) have guides for journalists. Second, numerous media organizations have standards about reporting their own surveys and the surveys of others, such as the Canadian Broadcasting Corporation (n.d.) and Reuters (2009). Lastly, various media professional and trade associations have standards relating to surveys, such as the Canadian Association of Journalists (2012) and the German Press Council (2006). The organizational and associational codes of the media usually only touch on a few general points about using surveys.
Existing Professional and Trade Codes Most of the professional and trade associations discussed above have codes of standards that address survey research. But they are quite variable in what is and is not covered. A few examples will illustrate this. First, for codes of disclosure a comparison was made of nine documents (codes and supporting documents) by five organizations (AAPOR, CASRO, ESOMAR, NCPP, and
WAPOR) (Smith, 2002b). All organizations agreed on the reporting of the following elements of a survey: who conducted, who sponsored, sample design, sample size, sampling error, mode of data collection, when collected/dates, question wording, question order, sample population, and response rate. Also, mentioned in most of these codes and related documents were weighting/imputing and indicating the purpose of the survey. Second, standards on response rates were examined. The codes and official statements of 20 professional, trade, and academic organizations were examined (Smith, 2002a). Four have neither codes nor any relevant official statements. Another three organizations have only brief general statements about doing good, honest research. Yet another three have general pronouncements about being open about methods and sharing technical information with others, but no details on what should be documented. Then, there are 10 that have some requirement regarding nonresponse. Of these referring to nonresponse in their codes and statements, all require that response rates (or some related outcome rate) be reported. Only a subset of the 10 mentioning nonresponse require anything beyond a reporting requirement. Six organizations provide at least some definition of response and/or related outcome rates, but only the AAPOR/WAPOR, CASRO, and ABC definitions are detailed. Three organizations deal with the issues of nonresponse bias in their codes. The WAPOR code, right after requiring the reporting of the nonresponse rate, calls for information on the ‘comparison of the size and characteristics of the actual and anticipated samples’ and the ESOMAR and MRQSA codes require in client reports ‘discussion of any possible bias due to non-response’. Three organizations mention nonresponse bias in official documents. AAPOR in its ‘Best Practices’, but not its code, urges that nonresponse bias be reported. AmStat addresses the matter in its ‘What is a Survey?’ series. The AMA in its
publication, the Journal of Marketing Research, requires authors to ‘not ignore the nonrespondents. They might have different characteristics than the respondents’. Of the organizations that have an official journal, nine have definite standards about reporting and calculating response rates, two have some general pronouncements mentioning nonresponse bias or the response rate, one has a marginally relevant standard on data sharing, and two have no applicable statement. In brief, only the professional, trade, and academic organizations at the core of survey research and in the sub-area of media-ratings research take up nonresponse in their codes, official statements, and organizational journals. General market research and statistical organizations do not explicitly deal with nonresponse issues in their codes and standards and only marginally address these in the guidelines of their official journals. Even among the organizations that do address the matter of nonresponse, the proclaimed standards are mostly minimal. Some, but not automatic, reporting is required by all of the core organizations. However, definitions are provided by only six of the 10. Other aspects, such as nonresponse bias and performance standards, are only lightly covered. Thus, even among those organizations that consider nonresponse, reporting standards are incomplete, technical standards are often lacking and/or relegated to less official status, and performance standards are nearly non-existent. Finally, professional, trade, and academic organizations have advanced the cause of standards by their general promotion and dissemination of research methods at their conferences and in their official journals (e.g. AAPOR’s Public Opinion Quarterly, ESRA’s Survey Research Methods, WAPOR’s International Journal of Public Opinion Research). As Hollander (1992: 83) has observed, ‘the annual AAPOR conference was recognized early on, together with POQ, which is older still, as a means of advancing standards’.
STANDARDS ORGANIZATIONS Recently, the ISO initiated a major effort to develop standards for survey research. In 2003, Technical Committee 225 (TC225) was established to develop standards for ‘market, opinion, and social research’. ISO and its national members are bodies specializing in the development of standards per se and lack detailed knowledge of most specific fields and industries. As such, TC225 was composed of survey-research practitioners and relied on direction from technical advisory groups made up of survey researchers in the participating countries and from two international survey-research associations that are liaison members (ESOMAR and WAPOR) to develop the relevant definitions and rules. ISO 20252 on Market, Opinion, and Social Research – Vocabulary and Service Requirements was issued in 2006 and then updated in 2012 (www.iso.org/iso/catalogue_detail.htm?csnumber=53439). In addition, in 2009 ISO 26362 on Access Panels in Market, Opinion, and Social Research was adopted. The ISO standards are largely consistent with the existing codes of professional and trade associations. For example, their disclosure list of information to be included in research reports closely follows the existing minimum disclosure requirements of the major professional and trade associations. But the ISO standards go beyond most existing codes in two main regards. First, they specify the mutual obligations that exist between clients and research service providers (i.e. data collectors or survey firms). This includes stipulating elements that need to be in agreements between them, including such matters as confidentiality of research, documentation requirements, fieldworker training, sub-contracting/outsourcing, effectiveness of quality management system, project schedule, cooperation with client, developing questionnaires and discussion guides, managing sampling, data collection, analysis, monitoring data collection, handling research
documents and materials, reporting research results, and maintaining research records. Second, they have a number of procedural and performance standards. These include the following: (1) methods for doing translations and level of language competency for the translators; (2) type and hours of training for fieldworkers; (3) validation levels for verifying data collected by fieldworkers; (4) use of IDs by fieldworkers; (5) the notification that potential respondents must receive; (6) documenting the use of respondent incentives; (7) guarantees of respondent confidentiality; and (8) what records should be kept and for how long they should be retained.
INTERNATIONAL COLLABORATIONS Most major cross-national collaborations have standards for their participating members (Lynn, 2001). These include the European Social Survey, the International Social Survey Programme (ISSP), the OECD Programme for International Student Assessment, the Survey of Health, Ageing and Retirement in Europe, the World Health Survey (WHS), and the World Values Survey. A fuller listing of these programs appears in Chapter 43 in this volume. For example, the ISSP Working Principles (www.issp.org/page.php?pageId=170) contain various standards for data collection and documentation, including requirements about mode, sample design, the calculation of response rates, and methods disclosure. For the WHS rules, see Ustun et al. (2005).
OTHER ASSOCIATIONS AND ORGANIZATIONS Standards are promoted and developed by other groups besides the national and international professional and trade associations. These include several conference/workshop
series such as the Household Nonresponse Workshop (since 1990), the International Field Directors and Technologies Conference (since 1993), the International Total Survey Error Workshop (since 2005), the International Workshop on Comparative Survey Design and Implementation (CSDI) (since 2002), and the loosely-associated series of survey methodology conferences starting with the International Conference on Telephone Survey Methodology in 1987 through the International Conference on Methods for Surveying and Enumerating Hard-to-Reach Populations in 2012. CSDI, for example, has issued the Guidelines for Best Practice in Cross-Cultural Surveys (http://ccsg.isr.umich.edu/). Also, survey archives around the world have created standards for the documentation of survey research data (Maynard and Timms-Ferrara, 2011). The metadata initiatives in particular have increased the documentation required for deposited surveys and also made that information more accessible to users. Of particular importance is the Data Documentation Initiative (www.ddialliance.org/).
IMPLEMENTING AND ENFORCING CODES If the proof of the pudding is in the tasting, then the proof of standards is in their enforcement. Codes matter only if they are followed, and here the experience of survey research is mixed. Three examples illustrate the present situation and its limitations. First, most codes indicate specific methodological information that must be reported. Numerous studies over the years in various countries and covering both television and newspapers have repeatedly found that the basic methodological components required by disclosure standards are often not reported (Bastien and Petry, 2009; Sonck and Loosveldt, 2008;
Szwed, 2011; Weaver and Kim, 2002; Welch, 2002). For example, the share of news stories reporting sample size ranged from 21% in the USA in 2000 to 37% in the USA in 1996–98, 49% in Canadian newspapers in 2008, and 65% in Poland in 1991–2007. For question wording, reporting was even lower, ranging between 6% and 25% across studies. Similarly, reporting in academic journals also falls short of the disclosure standards. Presser (1984) examined what methodological information was reported in articles in the top journals in economics, political science, social psychology, sociology, and survey research. He found that, in articles using surveys, reporting ranged as follows: (1) sampling method from 4% in economics to 63% in survey research; (2) question wording from 3% in economics to 55% in survey research; (3) mode of data collection from 18% in economics to 78% in social psychology; (4) response rate from 4% in economics to 63% in survey research; (5) year of survey from 20% in social psychology to 82% in political science; and (6) interviewer characteristics from 0% in economics to 60% in social psychology. Likewise, when looking at response rates, Smith (2002a) found that reporting levels were low in top academic journals – 34% in survey research articles, 29% in sociology, and 20% in political science. In follow-up work, Smith (2002c) found in 1998–2001 that response-rate reporting remained low in political science and sociology, but was improving in survey research. However, even in survey research in 2001 only 53% of articles reported a response rate and just 33% provided any definition (see also Hardmeier, 1999; Turner and Martin, 1984). Perhaps even more telling are the shortfalls in the methodology reports released by the survey-research data collectors themselves. A study in Canada (Bastien and Petry, 2009) during the 2008 election found that sample size was reported 100% of the time, question wording 97%, weighting factors 62%, and interview mode 55%. A US study of 2012 pre-election polls found reporting for sample
size 98%, question wording 73%, weighting 37%, and interview mode 86% (Charter et al., 2013). Second, the professional associations are not well-equipped to handle specific instances of alleged code violation, which are commonly called standards cases. For example, WAPOR has no mechanism for or tradition of handling standards cases. AAPOR does have procedures and does conduct such reviews, but it has found that formal standards cases involve considerable effort, take a long time to decide, and, under some outcomes (e.g. exoneration or private censure), do not result in educating the profession. AAPOR procedures are by necessity complex and legalistic in order to protect the rights of the accused. Also, since the handling of standards cases is done by volunteers who must find time to participate, this creates a burden and takes considerable time to adjudicate. AAPOR believes that standards in the field can better be advanced by methods other than formal standards cases, such as by taskforce reports and the Transparency Initiative. Finally, many professions in part enforce their codes through the certification of members. But this practice is rare in the field of survey research. Globally, neither WAPOR nor ESOMAR has certification, nor does AAPOR or CASRO in the United States. However, MRA started a Professional Researcher Certification Program in 2005 (see www.mra-net.org/certification/overview.cfm). Its certification includes adherence to MRA’s Code of Marketing Research Standards. Also, as is true of ISO standards in general, ISO 20252 provides for the certification of survey-research organizations as compliant with its standards.
THE ROAD TO PROFESSIONALIZATION AND THE ROLE OF CODES Wilensky (1964) proposes five sequential steps that occupations go through to obtain professionalization: (1) the emergence of the
profession; (2) establishing training schools and ultimately university programs; (3) local and then national associations; (4) governmental licensing; and (5) formal codes of ethics. Survey research has only partly achieved the second, for although there are some excellent training programs and university programs, most practitioners are formally trained in other fields (statistics, marketing, psychology, sociology, etc.).2 Survey research has resisted certification and governmental licensing, although recent support for the proscription of fraudulent practices disguised as surveys (e.g. push polls and sugging – selling under the guise of a survey) and the ISO standards have moved the field more in that direction. On the development of the survey-research field, see Converse (1987). Studies of professionalization indicate that one of the ‘necessary elements’ of professionalization is the adoption of ‘formal codes of ethics…rules to eliminate the unqualified and unscrupulous, rules to reduce internal competition, and rules to protect clients, and emphasize the service ideal …’ (Wilensky, 1964: 145) and ‘codes of ethics may be created both to display concern for the issue [good character] and to provide members with guides to proper performance at work’ (Freidson, 1994: 174). Survey research has begun to follow the path of professionalization, but has not completed the journey. In the judgment of Donsbach (1997), survey research is ‘semi-professional’. Among other things, it has been the failure of survey researchers ‘to define, maintain, and reinforce standards in their area’ (Donsbach, 1997: 23) that has deterred full professionalization. As Crespi (1998: 77) has noted, ‘In accordance with precedents set by law and medicine, developing a code of standards has long been central to the professionalization of any occupation’. He also adds that ‘One hallmark of professionals is that they can, and do, meet performance standards’. In Donsbach’s analysis (1997: 26), the problem is that standards
have neither been sufficiently internalized nor adequately enforced: We have developed codes of standards, but we still miss a high degree of internalization in the process of work socialization. We also lack clear and powerful systems of sanctions against those who do not adhere to these standards. It is the professional organizations’ task to implement these systems and to enforce the rules.
The limited adoption and enforcement of standards and the incomplete professionalization have several causes. First, the survey-research profession is divided between commercial and non-commercial sectors. Coordinating the quite different goals and needs of these sectors is challenging. There has often been disagreement between these sectors on standards and related matters (Smith, 2002a, 2002d). Moreover, trade associations typically only include for-profit firms and exclude survey-research institutes at universities, government agencies, and not-for-profit organizations. But various steps have been taken to bridge this divide. AAPOR, for example, has certain elected offices rotate between commercial and non-commercial members, and, more informally, WAPOR and other associations try to balance committee appointments between the various sectors. Also, CASRO has opened membership to not-for-profits and universities. Second, for quite different reasons both sectors have not vigorously pursued professionalization. The academics have been the most open to professionalization in general and standards in particular, since most are already members of two well-organized professions: university teaching and their particular discipline (e.g. as statisticians, psychologists, or sociologists). But while this socialization has made them open to professionalization and standards, it has also hampered the professionalization of survey research since the academics are usually already professionals twice over and many have only a secondary interest in survey research as a field/profession.
The commercial practitioners have seen themselves more as businesspersons and less as professionals, and many see standards as externally-imposed constraints (akin to government regulations) that intrude on their businesses. Of course, it is not inevitable that businesses oppose standards or that people in business fields resist professionalization. For example, the Society of Automobile Engineers was successful from early on in establishing industry-wide standards and recommended practices (Thompson, 1954). However, this has not transpired within the survey-research industry. Suggested reasons for the limited development of cooperation within the survey field include a high level of competition (Bradburn, 1992) and that collaboration and coordination may offer fewer benefits.3 Third, survey research in general and public-opinion research in particular are information fields with strong formative roots in both journalism and politics (Converse, 1987). Some have seen any attempted regulation (especially by government, but even via self-regulation) as an infringement on their freedom of speech and as undemocratic. They lean more towards an unregulated, marketplace-of-ideas approach related to the freedom-of-the-press model. In sum, the incomplete professionalization of survey research has hindered the development and enforcement of professional standards. Incomplete professionalization in turn stems from inter-sector and interdisciplinary divisions in survey research and from the high value placed by practitioners on the ideal of independence and the proposition that the marketplace can exercise sufficient discipline. Both economic and intellectual laissez-faireism undermine the adoption and enforcement of standards.
CONCLUSION Standards codes exist for the key professional and trade associations in the field of survey
research and there is a high degree of agreement on many of their provisions. But largely because professionalization has been incomplete, actual practice has often lagged behind the standards and enforcement has been limited. However, this situation has begun to change in recent years. For example, AAPOR and WAPOR have both adopted Standard Definitions for the calculation and reporting of response and other outcome rates, and the ISO has worked with professional and trade associations in the field of survey research to establish international standards. Thus, the future prospects are for the spread and greater enforcement of standards and the continued professionalization of survey research.
NOTES 1 Trade or industry associations are those in which organizations rather than individuals belong. Professional and academic associations have individuals in a particular occupation or scholarly discipline as members. There are also hybrid associations that include both individuals and organizations as members. 2 University survey-research programs include the Survey Research and Methodology Program at the University of Nebraska and the Joint Program in Survey Methodology at the Universities of Maryland and Michigan and summer institutes such as the ICPSR Summer Program for Quantitative Methods of Social Research, the Essex Summer School in Social Science Data Analysis, and the GESIS Summer School in Survey Methodology. 3 The setting of a standard gauge for railroads is an example in which several industries benefited. Builders of railroad equipment needed to produce only one size of wheels and axles, shippers gained as transfer costs were reduced, and railroads won increased traffic as unnecessary costs were eliminated.
RECOMMENDED READINGS Interested readers may begin by studying professional standards, e.g. the AAPOR Standard Definitions (see http://www.aapor.org/
AAPOR_Main/media/publications/Standard-Definitions2015_8theditionwithchanges_April2015_logo.pdf) or the ICC/ESOMAR International Code on Market and Social Research (see www.esomar.org/uploads/public/knowledge-and-standards/codes-and-guidelines/ICCESOMAR_Code_English_.pdf). For comparative surveys the Cross-Cultural Survey Guidelines are of particular importance (see http://ccsg.isr.umich.edu/archive/index.html).
REFERENCES Abbott, Andrew (1988). The System of Professions: An Essay on the Division of Expert Labor. Chicago: University of Chicago Press. American Statistical Association (1999). ASA Issues Ethical Guidelines. Amstat News, 269(November): 10–15. Bastien, Frederick and Petry, Francois (2009). The Quality of Public Opinion Poll Reports during the 2008 Canadian Election. Paper presented to the Canadian Political Science Association, Ottawa, May 2009. Bradburn, Norman M. (1992). A Response to the Nonresponse Problem. Public Opinion Quarterly, 56(Fall): 391–7. Canadian Association of Journalists (2012). Canadian Association of Journalists Statement of Principles, at www.rjionline.org/MAS-Codes-Canada-CAJ-Principles [accessed on 6 June 2016]. Canadian Broadcasting Corporation (n.d.). Canada Code: Canadian Broadcasting Corporation Journalistic Standards and Practices, at www.rjionline.org/MAS-Codes-Canada-CBC [accessed on 6 June 2016]. Chang, Robert (2012). The Freedom to Publish Public Opinion Poll Results: A Worldwide Update of 2012. World Association for Public Opinion Research. Charter, Daniela et al. (2013). Transparency in the 2012 Pre-Election Polls. Paper presented to the American Association for Public Opinion Research, Boston. Converse, Jean M. (1987). Survey Research in the United States: Roots and Emergence, 1890–1960. Berkeley: University of California Press. Cowles, Michael and Davis, Caroline (1982). On the Origins of the .05 Level of Statistical
Significance. American Psychologist, 37: 553–8. Crespi, Irving (1998). Ethical Considerations When Establishing Survey Standards. International Journal of Public Opinion Research, 10(Spring): 75–82. Donsbach, Wolfgang (1997). Survey Research at the End of the Twentieth Century: Theses and Antitheses. International Journal of Public Opinion Research, 9: 17–28. Freidson, Eliot (1984). The Changing Nature of Professional Control. Annual Review of Sociology, 10: 1–20. Freidson, Eliot (1994). Professionalism Reborn: Theory, Prophecy, and Policy. Chicago: University of Chicago Press. German Press Council (2006). German Press Code, at http://ethicnet.uta.fi/germany/german_press_code [accessed on 13 June 2016]. Guide of Standards for Marketing and Social Research (n.d.). L’Association de l’Industrie de la Recherche Marketing et Sociale [Canada]. Hardmeier, Sibylle (1999). Political Poll Reporting in Swiss Print Media: Analysis and Suggestions for Quality Improvement. International Journal of Public Opinion Research, 11(Fall): 257–74. Hollander, Sidney (1992). Survey Standards. In Paul B. Sheatsley and Warren J. Mitofsky (eds), Meeting Place: The History of the American Association for Public Opinion Research. American Association for Public Opinion Research: pp. 65–103. International Organization for Standardization/Technical Committee 225 (2005). Market, Opinion and Social Research Draft International Standard. Madrid: AENOR. Kaase, M. (1999). Quality Criteria for Survey Research. Berlin: Akademie Verlag. Kasprzyk, D. and Kalton, G. (1998). Measuring and Reporting the Quality of Survey Data. Proceedings of Statistics Canada Symposium 97, New Directions in Surveys and Censuses. Ottawa: Statistics Canada. Lynn, Peter (2001). Developing Quality Standards for Cross-National Survey Research: Five Approaches. ISER Working Paper, 2001–21. Lynn, Peter J. et al. (2001). Recommended Standard Final Outcome Categories and Standard Definitions of Response Rate for
Social Surveys. ISER Working Paper 2001–23. Essex University, Institute for Social and Economic Research. Maynard, Marc and Timms-Ferrara, Lois (2011). Methodological Disclosure Issues and Opinion Data. Journal of Economic and Social Measurement, 36: 19–32. Office of Management and Budget (OMB) (1999). Implementing Guidance for OMB Review of Agency Information Collection. Draft, June 2, 1999. Washington, DC: OMB. Office of Management and Budget (OMB) (2006). Standards and Guidelines for Statistical Surveys. Washington, DC: OMB. Presser, Stanley (1984). The Use of Survey Data in Basic Research in the Social Sciences. In Charles F. Turner and Elizabeth Martin (eds), Surveying Subjective Phenomena, Vol. 2. New York: Russell Sage Foundation, pp. 93–114. Reuters (2009). Handbook of Journalism, at http://handbook.reuters.com/index.php?title=Main_Page [accessed on 13 June 2016]. Smith, Tom W. (2002a). Developing Nonresponse Standards. In Robert M. Groves, Donald A. Dillman, John L. Eltinge, and Roderick J.A. Little (eds), Survey Nonresponse. New York: John Wiley & Sons, pp. 27–40. Smith, Tom W. (2002b). A Media Guide to Survey Research. WAPOR. Smith, Tom W. (2002c). Reporting Survey Nonresponse in Academic Journals. International Journal of Public Opinion Research, 14(Winter): 469–74. Smith, Tom W. (2002d). Professionalization and Survey-Research Standards. WAPOR Newsletter, 3rd quarter: 3–4. Smith, Tom W. (2004). Freedom to Conduct Public Opinion Polls Around the World. International Journal of Public Opinion Research, 16(Summer): 215–23. Smith, Tom W. (2005). The ISO Standards for Market, Opinion, and Social Research: A Preview. Paper presented to the American Association for Public Opinion Research, Miami Beach. Sonck, Nathalie and Loosveldt, Geert (2008). Making News Based on Public Opinion Polls: The Flemish Case. European Journal of Communication, 23: 490–500. Subcommittee on Measuring and Reporting the Quality of Survey Data (2001).
Measuring and Reporting Sources of Error in Surveys. Statistical Working Paper No. 31. Washington, DC: OMB. Szwed, Robert (2011). Printmedia Poll Reporting in Poland: Poll as News in Polish Parliamentary Campaigns, 1991–2007. Communist and Post-Communist Studies, 44: 63–72. Thompson, George V. (1954). Intercompany Technical Standardization in the Early American Automobile Industry. Journal of Economic History, 14(Winter): 1–20. Turner, Charles F. and Martin, Elizabeth (1984). Surveying Subjective Phenomena. New York: Russell Sage Foundation. Ustun, T. Bedirhan et al. (2005). Quality Assurance in Surveys: Standards, Guidelines, and Procedures. In Household Sample Surveys in
Developing and Transition Countries, edited by the United Nations. New York: United Nations Publications, pp. 199–230. Weaver, David and Kim, Sung Tae (2002). Quality in Public Opinion Poll Reports: Issue Salience, Knowledge, and Conformity to AAPOR/WAPOR Standards. International Journal of Public Opinion Research, 14(Summer): 202–12. Welch, Reed L. (2002). Polls, Polls, and More Polls: An Evaluation of How Public Opinion Polls are Reported in Newspapers. Harvard International Journal of Press/Politics, 7: 102–13. Wilensky, Harold L. (1964). The Professionalization of Everyone. American Journal of Sociology, 70(September): 137–58. Zukin, Cliff (2012). A Journalist’s Guide to Survey Research and Election Polls. AAPOR.
3 Total Survey Error: A Paradigm for Survey Methodology
Lars E. Lyberg and Herbert F. Weisberg
INTRODUCTION Survey research began as a very practical enterprise, gathering facts and opinions from a large number of respondents, but with little underlying theory. As will be shown in this chapter, the ‘Total Survey Error’ approach (TSE) has been devised as an inclusive paradigm for survey research to guide the design of surveys, critiques of survey results, and instruction about the survey process. This chapter will review the TSE approach and place it within the broader concern for achieving a ‘Total Survey Quality’ (TSQ) product. TSE emphasizes the trade-offs that are required in conducting surveys. Whenever a researcher conducts any type of study, there are constraints: costs, ethics, and time. There are also possible errors in any type of research study. In the survey field, those possible errors, among others, include the error that results from using a sample to represent a larger population (‘sampling error’), the error from incomplete response to the survey and
its questions (‘nonresponse error’), and the error that occurs in survey responses (‘measurement error’). The TSE approach emphasizes the trade-offs that must be made in trying to minimize those possible errors within the constraint structure of the available resources. Minimizing all of these error sources at once would require an unlimited budget as well as a very long time schedule, a situation that never prevails in the real world. Instead, the researcher has to decide which errors to minimize, realizing that expending more resources to minimize one type of error means fewer resources are available to minimize another type of error. Decreasing sampling error by greatly increasing the sample size, for example, would take away from the amount of money left to give interviewers extensive training, call back to locate respondents who were not found in the first attempt, pretest the questionnaire more extensively, and so on. The TSE approach suggests that clients commissioning surveys
should weigh these trade-offs, deciding how they want to spend their limited resources to minimize the potential survey errors that they consider most serious. The goal of survey organizations is to achieve a quality product. Total Survey Quality (TSQ) includes the need for minimizing total survey error (‘accuracy’), but it also includes, among other factors, producing results that fit the needs of the survey users (‘relevance’) and providing results that users will have confidence in (‘credibility’). The well-run survey organization pays attention to these quality criteria while trying to minimize survey error within cost and other constraints. The TSQ argument emphasizes the usability of the survey results. Minimizing TSE does not necessarily produce a usable set of findings. Consider a well-designed simple random sample of 100 likely voters that shows that one candidate has a 60%–40% advantage over the other candidate. While that seems like a small sample, such a result would be statistically significant: there would be less than a 5% chance of obtaining such a result if the other candidate were ahead instead. Yet, regardless of the statistical significance of its findings, such a small survey would not be considered credible. Few newspapers would take it seriously enough to publish an article based on that small survey, and campaign consultants would not find enough useful results to help them shape their campaign. Governments generally have strict quality standards they impose on surveys they contract for. The TSQ approach points to the need to take the likely usefulness of the survey results into consideration when designing and conducting the survey. This chapter introduces the interrelated TSE and TSQ perspectives that underlie later chapters in this Handbook. We begin by relating the evolution of the TSE approach, and then describe how it is merged with the TSQ goals. We conclude by considering some recent developments that have the potential of upsetting usual practices in the survey research field.
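The significance claim in the 100-voter example above can be verified directly. The short check below uses the exact binomial distribution under the boundary assumption of an even split; it is an illustrative sketch, not part of the chapter’s own material.

```python
from scipy.stats import binom

# Chance of seeing 60 or more supporters out of 100 if the electorate were
# actually split evenly (the boundary case of the other candidate being
# tied rather than ahead).
p_value = binom.sf(59, n=100, p=0.5)   # P(X >= 60 | p = 0.5)
print(f"P(X >= 60 | p = 0.5) = {p_value:.3f}")   # roughly 0.03, i.e. under 5%
```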
THE DEVELOPMENT OF THE TOTAL SURVEY ERROR APPROACH While surveys are a very common research procedure today, they were very rare a century ago with the exception of censuses. The basic notion that one could represent a larger population with a probability sample was not yet understood. The first major breakthrough involved statistical sampling theory, particularly Neyman’s (1934) landmark article that provided a scientific basis for sampling. Neyman provided mathematical proof that sampling error could be measured by calculating the variance of the estimator. That was followed by Hansen’s experiments for the US Census Bureau that showed that small random samples are more accurate than nonrandom judgment samples in which individuals are chosen to represent different groups in the population (Hansen and Hauser 1945). Thus, by the middle of the twentieth century it was well understood that the inevitable ‘sampling error’ that results from surveying a random sample of a larger population of interest could be estimated mathematically. There also was an early realization that there are more possible errors in surveys than just sampling error (e.g., Deming 1944). Typical was Kish’s (1965) description of survey error as having two components: sampling error and non-sampling error, with the latter including measurement bias. Sampling error was emphasized because it could be computed for probability samples with mathematical precision. Measurement bias could not be formally estimated, but there were many practical efforts in the 1950s through 1980s to decrease measurement bias in surveys by improving interviewing procedures as well as by improving question wording. There are, of course, potential errors in every social science research technique. The idea of systematizing and classifying potential errors began in Campbell and Stanley’s (1963) careful classification of potential problems of ‘internal validity’ and ‘external validity’ in experimental research. There has
been less attention to detailing the full possible set of errors in content analysis and observation research. Robert Groves’s (1989) book on Survey Errors and Survey Costs provided a major theoretical development in the survey research field. It developed the concept of Total Survey Error (TSE), making it a paradigm for the survey field. Groves systematically listed the types of error in surveys and explained the cost calculation in minimizing each type of error. He showed that there are cost-benefit trade-offs involved, since minimizing one type of error within a fixed survey budget leads to less emphasis on controlling other types of error. Groves et al. (2004, 2009) updated Groves (1989) with research that had been done in the interim on the different stages of the survey process. In addition to survey errors and costs, Weisberg (2005) emphasized another important consideration by explicitly adding survey effects to the trade-offs between survey errors and survey constraints. Rather than involving errors that can be minimized, these effects involve choices that must be made in survey design for which there is no correct decision. For example, asking question
A before question B may affect the answers to question B, but asking question B before question A may affect the responses to question A. Thus, it may be impossible to remove question order effects in a survey regardless of how many resources are spent on them. As another example, a male respondent might give different answers on some questions to a male interviewer than he would to a female interviewer. Again, there is no easy way to remove gender-of-interviewer effects in surveys that use interviewers. Instead, survey researchers can seek to estimate the magnitude of such survey effects. At this point, it is useful to introduce the different types of survey error that will be considered in more detail in subsequent sections of this Handbook. Figure 3.1 provides a depiction that emphasizes that sampling error is just the ‘tip of the iceberg’, one of the several possible sources of error in surveys. The first set of errors arises from the respondent selection process. As already mentioned, sampling error is the error that occurs when interviewing just a sample of the population. If the sample is selected by probability sampling, then it is possible to compute mathematically the ‘margin of error’
Figure 3.1 The different types of survey error source: respondent selection issues (sampling error, coverage error, nonresponse error at the unit level), response accuracy issues (nonresponse error at the item level, measurement error due to respondents, measurement error due to interviewers), and survey administration issues (postsurvey error, mode effects, comparison error). Source: Adapted from Weisberg (2005, p. 19).
corresponding to a 95% confidence interval for the survey results. Sampling issues can be very technical. While it sounds good to give each individual an equal chance of being selected by using simple random sampling, that procedure is often neither possible nor cost-effective. Stratifying the sample can reduce the sampling error by ensuring that the right proportion of individuals is chosen within known subcategories of the population. Clustering the sample within known clusters (such as city blocks or banks of telephone numbers) can reduce costs, though that would increase the sampling error. A more serious problem is that probability sampling is often not feasible, as when conducting an Internet-based sample without having a list of the email addresses of the population of interest. Strictly speaking, sampling errors cannot be computed when the sampling is not probability-based, though many survey reports state what the sampling error would be for the number of individuals that were interviewed, had simple random sampling been used. Respondent selection issues involve more than sampling error. ‘Coverage error’ or ‘frame error’ occurs when the list from which a sample is taken does not correspond to the population of interest. A telephone sample based exclusively on landline phones would entail coverage error since it would be biased against young people who only have cell phones. Frame errors also occur when a sample includes people who are ineligible, such as a voter survey that includes non-citizens. When a sampled unit does not participate in the survey, there is ‘unit nonresponse’. Unit nonresponse occurs both when sampled respondents cannot be contacted and when they refuse to be interviewed. The response rate has fallen considerably in most surveys of the mass public, making it essential to consider this matter when designing a survey. Unit nonresponse becomes especially serious when it is correlated with variables of interest, such as if Republicans were less willing than Democrats to participate in US exit polls.
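As a point of reference (standard textbook formulas, stated here as an illustration rather than taken from the chapter), the 95% margin of error for an estimated proportion under simple random sampling, and the variance inflation produced by clustering with average cluster size $m$ and intraclass correlation $\rho$, are commonly written as:

```latex
\mathrm{MoE}_{95\%} \approx 1.96\,\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\qquad
\mathrm{deff} = 1 + (m-1)\,\rho,
\qquad
n_{\mathrm{eff}} = \frac{n}{\mathrm{deff}}
```

The effective sample size $n_{\mathrm{eff}}$, rather than the nominal $n$, is what the quoted margin of error should reflect when clustering (or other sources of correlated error) is present.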
Another set of survey errors involves response accuracy issues. The usual emphasis is on the measurement error that arises when respondents do not give accurate responses. That may be due to the respondents themselves, as when they are not sufficiently motivated to provide accurate answers. Alternatively, it could be due to the question wording, including unclear questions, biased question wording, or demanding more detailed information than respondents can be expected to know or remember (such as asking people to recall what month they last saw their doctor). Measurement error due to respondents can also be related to questionnaire construction, such as question-order effects or fatigue effects from overly long questionnaires. Interviewers can also introduce measurement error, which emphasizes the importance of interviewer training. One approach to minimize interviewer error is ‘standardized interviewing’, when interviewers are trained to ask the identical question in the identical nonjudgmental manner to all respondents. Some researchers instead prefer ‘conversational’ (or ‘flexible’) interviewing, with interviewers trained to help respondents understand the questions in the identical manner. Additionally, the aggregate set of responses on a survey question can be inaccurate when some respondents do not answer all the questions, known as ‘item nonresponse’ or ‘missing data’. This can be a matter of people skipping questions accidentally, intentionally refusing to answer questions, or not having an opinion on attitude questions. One can try to write survey questions in such a way as to minimize the likelihood of missing data, or one can try to deal with missing data problems at the data analysis stage. Missing data may not be a problem if it is truly missing at random, but the results would be biased if the occurrence of missing data were correlated with the variables of interest. Finally, there is a set of possible errors related to survey administration. ‘Postsurvey error’ can occur in processing and analyzing
the data, including errors made in coding the survey responses. While converting survey responses to code categories may seem routine, it often involves difficult and subjective judgments that can introduce considerable error. There can also be ‘mode effects’ related to how the survey is conducted (e.g., telephone versus web surveys). Mode effects could be related to whether or not there is a human interviewer, as well as whether the respondent hears or reads the questions. Mode differences can be particularly important on so-called ‘sensitive topics’. For example, people might be less willing to admit drug or alcohol use when an interviewer is asking the questions than when filling out an anonymous written questionnaire. ‘Social desirability effects’ can also be more common when there is a human interviewer, as when answering questions relating to racial prejudice. ‘Comparison error’ (Smith 2011) deals with the issue of non-equivalence of estimates on the same survey topic for different populations, which is relevant in cross-national and cross-cultural surveys. Comparison error also can occur when comparing answers on the same topic from surveys taken in different years if, as frequently is the case, the surveys word the questions differently. Each of these different errors can either be random or systematic. Random errors are ones that vary from case to case but are expected to cancel out. For example, human interviewers are likely to skip a word occasionally when reading questions to respondents, but there should not be a pattern to such slips. Systematic errors are ones that bias the results, distorting the mean value on variables. For example, interest groups that sponsor surveys often word their questions so as to make it more likely that respondents will support those interest groups’ positions, biasing the results to make it look like their position has more support than it would have with more neutral question wording. Another issue is whether the errors are uncorrelated or correlated. Ideally, errors
would be uncorrelated, as when an interviewer mistakenly records a respondent’s ‘yes’ answer as a ‘no’. What is more serious is when the errors for different respondents are correlated, which occurs when interviewers take multiple interviews and when cluster sampling is used. Having interviewers take multiple interviews (which is the only feasible way of taking interviews) and using cluster sampling help contain the cost of a survey, but correlated errors increase the variance of estimates due to an effective sample size that is smaller than the intended one and thereby make it more difficult to achieve statistically significant results. Correlated variances occur whenever multiple coders, editors, interviewers, supervisors and/or crew leaders are given assignments and affect their assignments in systematic but different ways. Thus, it is important to balance the cost savings from having to train just a small number of interviewers and other survey staff and from using cluster sampling versus the greater margin of error that results from those design decisions. The survey design goal is to minimize the ‘mean squared error’ (MSE), which is the sum of (1) the variance components (including sampling error) and (2) the squared bias from measurement and other sources. MSE has become the metric for measuring Total Survey Error. While MSE cannot usually be calculated directly, it is useful conceptually to consider how large the different components can be and how much they add to the total survey error. This discussion has spoken of ‘potential’ errors, since there is often no objective way to determine the ‘truth’ being measured in surveys. For example, ideally each person in a sample would answer the survey, but it is common to have some people refuse to cooperate. It may well be that the people who do not respond would have answered the same way as those who did respond, in which case their refusals did not create any error. However, their refusals create the possibility for error, since their answers might have been very different from those who responded to
the survey – and we generally have no way of knowing whether that is the case, so we must recognize the potential for error when some people in the sample are not interviewed. In dealing with error that is not related to sampling, the survey design decisions are whether to ignore such error, whether to try to control it, or whether to try to measure it. While the ideal might be to minimize every type of error, that is impossible under fixed monetary and time constraints, so many researchers instead try to measure those error sources they cannot control. For example, some resources might be used to gain side information about individuals who were selected for the sample but would not participate, so one can estimate how much bias was introduced by their nonparticipation. As the field has developed since the Groves (1989) book, there has been further research on each of the different error sources. That research will be presented in later sections of this Handbook, but it is important to stress a few of the most important developments and relevant controversies. One of the most important developments has been greater focus on the role of cognition in survey research, particularly as regards how respondents process survey questions. While early work viewed interviewing as a conversation, more recent theorizing has focused on how an interview is a cognitive task for the respondent. Of particular importance is the Cognitive Aspects of Survey Methodology (CASM) movement (Jabine et al. 1984), which began the process of using insights from the cognitive revolution in psychology to reduce measurement error in surveys. The CASM movement led to the idea of ‘think-aloud protocols’, in which respondents tell the interviewer what they are thinking as they formulate their answers. That is particularly useful in testing question wording when developing a questionnaire. The emphasis on cognitive processes also led to Krosnick and Alwin’s (1987) theory that there are two different levels of effort that respondents can exert in answering survey questions. While
researchers hope that respondents will follow the ‘high road’ that requires real thinking, respondents will instead frequently use the ‘low road’ of giving an answer that sounds plausible so as to get through the task quickly. Such ‘satisficing’ behavior increases measurement error. Tourangeau et al. (2000) provided a further breakthrough with their separation of the high-road response process into four cognitive components: comprehending the question, retrieving relevant information from memory, judgment of the appropriate answer, and, finally, selecting and reporting the response. Improving survey question wording requires understanding possible errors in each of these four steps as well as trying to minimize the likelihood of satisficing. Another important development in the TSE approach has been a greater focus on ‘selection bias’, which occurs when the actual respondents differ systematically from the intended population on the attitudes or behavior being measured. Non-probability samples introduce the possibility that the sample selection criterion is related to the attitudes or behavior of interest, thus biasing the survey results. The response rate in telephone surveys has fallen drastically over the years, leading to increased reliance on recruiting respondents who are willing to participate in web surveys. However, selection bias is a serious potential problem for web surveys because people who opt-in for a web survey might be very different from those who do not. Even weighting the sample on the basis of known population characteristics may not handle this problem because people who respond to such a web survey may differ from those with the same demographics who do not respond. Some pollsters try to resolve this problem by conducting a small supplementary telephone sample to use to weight the web sample. However, the basic argument of web survey proponents is that the response rate on telephone surveys today is so low that conventional random phone surveys also suffer from selection bias: the people who are willing to respond
to a telephone survey might be very different from those who do not. In any case, selection bias has become a key concern for surveys of the mass public.
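For reference before turning to design questions, the mean squared error metric introduced above is conventionally written as the sum of variance and squared bias of an estimator $\hat{\theta}$ of the target quantity $\theta$; this is the standard decomposition rather than a formula given explicitly in the chapter:

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathrm{E}\big[(\hat{\theta} - \theta)^{2}\big]
  = \mathrm{Var}(\hat{\theta}) + \big[\mathrm{Bias}(\hat{\theta})\big]^{2}
```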
DESIGNING SURVEYS FROM A TSE PERSPECTIVE From a TSE perspective, the goal in survey planning is to have the smallest possible Mean Squared Error within the constraints of a fixed monetary and time budget. Stating that is, however, easier than achieving it. For one thing, it is difficult to estimate the magnitude of many of the possible errors, and, for another, costs of minimizing particular types of error are hard to estimate in advance. The task becomes even more difficult in the usual situation where several variables are being measured and they have different likely error structures. Working to reduce the measurement error on one set of survey questions could increase the error for a different set of questions in the same survey. Cross-national surveys pose even greater difficulty, especially when different survey organizations conduct the fieldwork in the different countries. This is not to discount the importance of thinking about MSE, but it points out that one cannot expect precision in estimating it at the survey planning stage. The broader point is that there is not an overall formal survey planning theory. The TSE approach provides a framework for thinking about the several elements involved in planning a survey, but, as Lyberg (2012: 110) emphasized, there is no planning manual for surveys and ‘no design formula is in sight’. Later in this chapter we review a recent attempt by Biemer et al. (2014) to provide more of an overall assessment, but we do not expect a formal survey planning theory to be developed in the near future. The TSE perspective is certainly very useful as an outline for instruction about survey research. It is important that researchers
contemplating a survey understand the trade-offs required, and the TSE approach helps make those trade-offs clear. Still, it would be hard in practice for an investigator to make the choices required, particularly because of the difficulty in stating trade-off effects with precision. The most common trade-off situation is between sampling error and unit nonresponse. One way to reduce sampling error is to increase the sample size, but that increases interviewer costs considerably. Alternatively, one could expend more money on trying to obtain interviews with more people in the original sample, such as through more callbacks to people who could not be contacted originally, through conversion attempts to get interviews from people who refused on first contact, through offering alternative ways of answering the survey (such as web completion instead of a telephone interview) and/or through offering monetary inducements to respondents. In this day of big data, one might even be able to buy information about non-cooperating designated respondents, possibly including their consumer behavior, their house value, their frequency of voting in recent elections, and which political candidates they have contributed to – assuming that such inquiries about people without their consent can pass the ethical muster of an Institutional Review Board. Another aspect of the TSE approach is to include in the survey some measurement of survey effects that cannot be minimized. For example, when there is not a perfect way to word a survey question, random half-samples of the respondents can be asked different versions of the same question to measure how robust answers on a topic are to how the question is worded. Similarly, when asking closed-ended questions, the order of response options can be varied for different random half-samples to measure how robust results are to the ordering of the response options. Such survey experiments increase the cost of programming a survey and require extra effort to ensure that the results are analyzed
correctly, but these extra costs are minimal compared to the considerable usefulness of the information they can provide. It is similarly possible to design a survey so that some interviewer effects can be estimated. For example, it can be useful to keep track of the gender of interviewers in order to test for interactive effects between the gender of the respondent and the interviewer. Similarly, the race of interviewers can be included in the data so that race-of-interviewer effects can be analyzed. Though it is not always feasible, respondents would ideally be assigned to interviewers randomly (a procedure known as ‘interpenetration’), so interviewer effects can be estimated. The TSE approach has its roots in cautioning against the common practice of focusing on just the sampling error. A focus just on sampling error results in underestimating the real error, sometimes considerably so. Ideally the service provider should, together with the client or the main users, identify error sources that are major contributors to the MSE, control them during the implementation stage, and potentially modify the survey design based on the analysis of paradata collected from relevant processes (Couper 1998, Groves and Heeringa 2006, Kreuter 2013).1 In practice, however, errors and error structures are difficult to discuss with interested parties, since their complexity does not invite user scrutiny. Concepts such as correlated interviewer variance, design effects, and cognitive phenomena such as context effects and telescoping can be very difficult to discuss with a client or a user. Instead, the average client thinks that good accuracy is the responsibility of the service provider, and the service providers are selected based on their perceived credibility. Thus, service providers or producers of statistics usually place high priority on being trustworthy and accurate regarding data quality, while users place high priority on aspects that they can easily assess. Examples of aspects or dimensions of quality that users appreciate include relevance
(data satisfy user needs), timeliness (data are delivered on time), accessibility (access to data is user-friendly), interpretability (documentation is clear and comprehensible), and cost (data give good value for money). These dimensions are the components of so-called quality frameworks and there are a number of slightly different ones used by statistical organizations (Lyberg 2012). Groves et al. (2009) call them non-statistical dimensions, but nonetheless they are important to consider at the design stage, since resources have to be set aside to satisfy user needs regarding these dimensions. Typically, users are not only interested in data quality (the total survey error as measured by the MSE) but also in some of these other dimensions. This means that we have a trade-off situation not only when it comes to the TSE components but also between TSE components and other dimensions. For instance, if a user wants data really fast, there is less time for nonresponse follow-up and accuracy might decrease compared to a situation where there is ample time for this activity. The TSE framework is a typology of error sources with a prescription of how to control, measure, and evaluate their impact on survey estimates. It is a great conceptual foundation but very difficult to practice. In practice, survey organizations do not produce estimates of TSE on a regular basis because of costs, complexity, and/or lack of methodology. Also, the number of error sources increases as new technology is introduced, and some error sources might even defy expression. Many questions associated with the TSE framework have remained unanswered over the years. For instance, why are some error sources such as coding understudied when their consequences can be considerable depending on the use of coded data? One example is when movements on the labor market are studied and errors in repeated occupation coding result in an exaggerated picture of such movements. Other questions concern the allocation of resources. How should resources be allocated between measurement
of error sizes and actual improvement of the processes involved (Spencer 1985), and how should resources be allocated between prevention of errors, quality control, and evaluation, i.e., before, during, and after the survey? Also, most surveys are multipurpose, which is problematic from a design optimization point of view. Usually this problem is solved by identifying the most important variables and then working out a compromise design that best estimates those variables. It seems very unrealistic to expect statisticians to develop expanded confidence intervals or margins of error that take all major error sources into account. A much more realistic scenario is to work on continuous improvement of various survey processes so that biases and unwanted variations are gradually reduced to the extent that estimates of variances become approximations of the mean squared error. One way of accomplishing that would be to vigorously apply proven design principles and survey standards together with ideas from the world of quality management, most notably the notion of continuous quality improvement. This calls for a new way of thinking, where TSE is extended to total survey quality (TSQ), where survey quality is more than a margin of error. In the next section we will describe how a gradual merging between TSE and TSQ has evolved.
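The argument above – that reducing biases makes estimated variances approximate the full error – rests on the standard decomposition of the mean squared error into a variance term and a squared bias term. The identity is well known and is restated here only for reference; it is not drawn from the chapter itself:

```latex
\mathrm{MSE}(\hat{\theta}) \;=\; \mathbb{E}\big[(\hat{\theta}-\theta)^2\big]
\;=\; \mathrm{Var}(\hat{\theta}) \;+\; \mathrm{Bias}(\hat{\theta})^2 .
```

As the bias terms are driven toward zero by continuous process improvement, the reported variance (and hence the published margin of error) becomes an increasingly good approximation of the MSE.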
THE MERGING OF TOTAL SURVEY ERROR WITH TOTAL SURVEY QUALITY

In the late 1980s and early 1990s many statistical organizations became interested in aspects of survey quality beyond traditional measures of accuracy. The quality frameworks represent one such development, where users were informed about several dimensions of survey quality. Just measuring and describing quality dimensions was, however, not sufficient. The quality management movement became part of the work in these organizations, with ideas about continuous quality improvement, two-way communication with users, handling competition from other providers of surveys, streamlining survey processes by observing metrics so that unnecessary variation is reduced, trying to eliminate waste, and minimizing costs of poor quality (Lyberg et al. 1998). At the core is the idea that measuring quality must be combined with systematic improvement activities. Thus survey quality is more than a specified TSE. To be able to improve the TSE, we need to move to a state that we might call Total Survey Quality (TSQ), in which the ingredients listed above are present. TSQ is illustrated by Table 3.1, which is adapted from Lyberg and Biemer (2008). The table shows that it is possible to view TSQ as a three-level concept: product quality, process quality, and organization quality.
Table 3.1 Survey quality on three levels

Quality level: Product
  Main stakeholders: Users, clients
  Control instrument (examples): Product requirements, evaluation studies, quality frameworks, minimum standards
  Measures and indicators (examples): Estimates of MSE components, degree of compliance to requirements, results of user surveys

Quality level: Process
  Main stakeholders: Survey designer
  Control instrument (examples): Metrics and paradata, control charts, verification and other quality control measures, process standards and checklists
  Measures and indicators (examples): Paradata analysis, analysis of common and special cause process variation, results of evaluation studies

Quality level: Organization
  Main stakeholders: Service provider, government, society
  Control instrument (examples): Excellence models, audits, self-assessments
  Measures and indicators (examples): Assessment scores, identification of strong and weak points, staff surveys

Source: Adapted from Lyberg and Biemer (2008)
Product quality is the extent to which agreed-upon product requirements have been met. Such requirements might include a certain response rate or that the translation of survey materials has been checked according to specifications. Good product quality rests on good underlying processes. A good process is one that is stable in the sense that it always delivers what is expected. For instance, a stable interviewing process requires that a number of elements be in place. Examples of such elements are a proper training program, a compensation system that encourages interviewers to strive for good quality, and supervision and feedback activities that help interviewers improve. Quality is built into the process by such quality assurance measures. Quality control measures are then used to check whether these quality assurance elements are carried out and function as intended. This is done by using paradata or other metrics, i.e., data about the process. When paradata are plotted, it is usually possible to distinguish between process variation that has common causes and variation due to special causes by using statistical process control theory and methods, especially the control chart (Montgomery 2005). There are also simpler ways of checking parts of the interview process. For instance, checking with respondents that interviews have actually taken place can uncover interviewer falsification, as can simple response-pattern analyses. Simple Pareto diagrams of monitoring outcomes can identify questions that were especially problematic for the interviewers. Finally, good process quality cannot be achieved without good organizational quality. For instance, a survey organization must have leadership that makes sure that the organization has staff with the right competence, that processes are continuously improved, that suggestions and opinions from users and staff are acted upon, and that good processes are promoted within the organization and evaluated regularly.
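The control-chart logic mentioned above can be sketched briefly. The following is a minimal illustration only, assuming simulated interview-duration paradata and conventional Shewhart-style three-sigma limits estimated from a baseline period; it is not a reproduction of any particular organization's monitoring system.

```python
import numpy as np

# Hypothetical paradata: an interviewer's mean interview duration (minutes) per week.
rng = np.random.default_rng(42)
durations = rng.normal(loc=45, scale=4, size=30)  # common-cause variation only
durations[25] = 68                                # inject one special-cause shift

# Estimate control limits from an in-control baseline period (first 20 weeks),
# then monitor all weeks against those limits.
baseline = durations[:20]
center = baseline.mean()
sigma = baseline.std(ddof=1)
upper, lower = center + 3 * sigma, center - 3 * sigma

flagged = [i for i, d in enumerate(durations) if d > upper or d < lower]
print(f"center={center:.1f}, limits=({lower:.1f}, {upper:.1f}), weeks flagged for review: {flagged}")
```

Points outside the limits signal special-cause variation (for example, a protocol change or possible falsification) that warrants follow-up, whereas points within the limits are treated as common-cause noise.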
While the idea that survey quality should be measured by the mean squared error encompassing all variance and squared bias terms associated with an estimate seems reasonable, this becomes complicated in practice. It would be very expensive to estimate the sizes of various bias terms, since that would entail comparisons between the regular survey result and the result of a preferred survey procedure (which for some reason, probably budget constraints, could not be used in the regular survey). It would also be very complicated and expensive to estimate the correlated response variance due to interviewers, coders, editors, and supervisors, since that would require interpenetration experiments comparing the outcomes of clustered assignments with random ones. Furthermore, models decomposing the MSE of an estimate generally do not include all major error sources. For example, the US Census Bureau survey model (Hansen et al. 1964) does not include nonresponse and noncoverage errors. It also takes time to conduct such studies. On the other hand, it is quite disturbing that so many margins of error in surveys are understated. As mentioned, one way out of this dilemma is to gradually improve survey processes so that they approach ideal ones associated with small errors. The quality management literature has given us philosophies, methods, and tools to do that (Breyfogle 2003). The ASPIRE system (A System for Product Improvement, Review, and Evaluation) developed at Statistics Sweden (Biemer et al. 2014) is a recent attempt at handling survey quality assessment emphasizing TSE while using certain quality management tools. ASPIRE is a general system for evaluating TSE for the most important products at the agency. It uses six components for quality monitoring and process improvement, namely:

• MSE decomposed into sampling error, frame error, nonresponse error, measurement error, data processing error, and model/estimation error.
• Five quality criteria, namely the product's knowledge of the risks of each of the MSE components on the accuracy of the product, communication with data providers and users regarding these risks, compliance with best practices in the survey field regarding mitigation of errors, available expertise for monitoring and controlling errors, and the product's achievements toward risk mitigation and improvement plans.
• Quality rating guidelines for each of the quality criteria, with descriptions of what the assessments poor, fair, good, very good, and excellent mean, to ensure consistent ratings.
• Rating and scoring rules that help summarize progress in quality.
• Risk assessment, including the intrinsic risk of doing nothing and the residual risk remaining after mitigation measures have been applied, where the quality criteria are weighted by intrinsic risk (low, medium, and high). This allows for individual error source scores as well as a weighted overall product score (a minimal scoring sketch follows this list).
• An evaluation process, including pre-activities such as reviewing existing quality declarations and any program self-assessments, a quality interview with program staff, and post-activities such as reviewing comments from product owners on product ratings, possibly resulting in scoring adjustments.
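The sketch below illustrates how risk-weighted criterion ratings of the kind just listed could be rolled up into a single product score. The rating values, risk weights, and the 0–100 rescaling are hypothetical choices made for illustration; the actual ASPIRE rating and scoring rules are documented in Biemer et al. (2014) and are not reproduced here.

```python
# Hypothetical ratings and weights; not the actual ASPIRE scoring rules.
RATING_SCALE = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
RISK_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0}  # intrinsic risk of each error source

# One summary rating and one intrinsic-risk level per error source (illustrative values).
product = {
    "sampling error":         ("very good", "low"),
    "frame error":            ("good",      "medium"),
    "nonresponse error":      ("fair",      "high"),
    "measurement error":      ("good",      "high"),
    "data processing error":  ("very good", "medium"),
    "model/estimation error": ("good",      "low"),
}

def overall_score(ratings):
    """Risk-weighted average of the criterion ratings, rescaled from the 1-5 scale to 0-100."""
    num = sum(RATING_SCALE[r] * RISK_WEIGHT[w] for r, w in ratings.values())
    den = sum(RISK_WEIGHT[w] for _, w in ratings.values())
    return 100 * (num / den - 1) / 4

for source, (rating, risk) in product.items():
    print(f"{source:24s} {rating:9s} (intrinsic risk: {risk})")
print(f"weighted overall product score: {overall_score(product):.0f}/100")
```

Error sources with high intrinsic risk dominate the overall score, which mirrors the idea that improvement effort should be concentrated where the potential damage to accuracy is greatest.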
ASPIRE is a comprehensive approach that is easily understood by management. The use of quality management principles and tools such as self-assessment, risk assessment, staff capacity building, and identification of areas most important to improve makes it possible to gradually mitigate the TSE. ASPIRE has so far been used during five rounds of quality assessment of the most important products at Statistics Sweden. Quality improved over the five rounds for most of these products. ASPIRE does not really reflect all parts of the TSE, but it does a better job than other existing approaches toward mitigating the MSE. Admittedly the scoring process can be somewhat subjective and is, of course, highly dependent on the knowledge and skills of the evaluators. Furthermore, it is important that the evaluators are external, since most internal assessments have a tendency to underreport problems (Lyberg 2012).
It is possible to go beyond Total Survey Quality to consider the quality of the total research study. Total Research Quality (TRQ) would include not only the survey itself, but also the information needs that led to choosing to conduct a survey rather than a different research strategy. It would also include the analysis stage, checking whether the data analysis approach is appropriate for the research needs. Accordingly, Kenett and Shmueli (2014) recently launched a new concept called Information Quality, InfoQ, which takes survey quality one step beyond Total Survey Quality. InfoQ attempts to assess the utility of a particular data set for achieving a given analysis goal by employing statistical analysis or data mining. Obviously, it is possible to increase InfoQ at the survey design stage by investigating the various known information needs. A formal definition of the concept is InfoQ(f, X, g) = U[f(X|g)], where g is the analysis goal, X the available data, f the analysis method, and U the utility measure; InfoQ depends on the quality of each of these components. InfoQ is in some sense a measure of the Total Research Quality (TRQ). Given a stated goal, InfoQ can be assessed at the design stage, at the data release stage, or before embarking on any secondary analyses. It can reveal a faulty translation from statistical results to the substantive domain, and it should be potentially useful when integrating data sets. InfoQ is clearly a few steps away from the one-size-fits-all frameworks we have mentioned above, but it has still to be tested in practice. Thus, in our review of survey quality we have not only moved from sampling error to total survey error, but we have moved beyond both to total survey quality and ultimately to total research quality.
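A minimal sketch of the InfoQ composition U[f(X|g)] follows, using hypothetical stand-ins for the goal, data, analysis, and utility components; it illustrates only the structure of the definition, not an operational assessment procedure, and every name and threshold in it is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=500)            # the data set (e.g., microdata behind an estimate)
g = {"target": "population mean", "tolerance": 1.0}   # the analysis goal, with a required precision

def f(data, goal):
    """The analysis: estimate the quantity named in the goal, with its standard error."""
    return {"estimate": data.mean(), "std_error": data.std(ddof=1) / np.sqrt(len(data))}

def U(result):
    """A toy utility measure: 1 if the estimate meets the goal's precision requirement, else 0."""
    return 1.0 if 1.96 * result["std_error"] <= g["tolerance"] else 0.0

info_q = U(f(X, g))  # InfoQ(f, X, g) = U[f(X | g)]
print(info_q)
```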
THE FUTURE OF TOTAL SURVEY ERROR AND TOTAL SURVEY QUALITY

It is appropriate to conclude with some discussion of issues that are arising in contemporary
survey research that impinge on the TSE, TSQ, and even TRQ approaches. This involves recognition both of new data sources and of renewed debates as to how to conduct scientific surveys. First, there has been an explosion of new data sources in recent years, principally due to the Internet. Much more ‘administrative data’ is collected nowadays and posted in a manner that is accessible to researchers. The term ‘big data’ is often used to include the large amount of commerce and government data that is becoming available to researchers. Add to this the data from online platforms such as Google searches or Facebook posts. This explosion of data sources is leading to a field of ‘Data Science’ that is separate from, but not unrelated to, survey research. One result of this explosion is that the number of cases for data analysis is often orders of magnitude greater than the number of respondents in most surveys. While sample size has traditionally varied between different types of surveys, non-governmental surveys were usually in the range of 800–2,000 respondents. Web surveys can now yield ten times those numbers of respondents, while ‘big data’ can involve thousands of times that number of cases. On the one hand, the greater number of respondents makes it easier to detect small effects. A 1% difference is usually not statistically significant in a survey of 800–2,000 respondents, but it might be if there were 80,000–200,000 cases being analyzed. However, a 1% difference is still a small difference, often too small to matter substantively. On the other hand, the greater number of respondents is rarely achieved through probability-based sampling. Strictly speaking, that makes statistical significance calculations inappropriate. Some researchers simply claim that conclusions based on a very large number of cases should be accepted, regardless of how the respondents were recruited. Other researchers use complex weighting procedures to weight the respondents on the basis of known population characteristics. ‘Propensity weighting’ has become common in web surveys as a means to compensate for people's differential willingness to participate in surveys, such as by weighting a large web survey by some aspects of a much smaller telephone survey.
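A minimal sketch of the propensity-weighting idea follows, assuming a hypothetical opt-in web sample and a much smaller probability-based reference sample that share a few demographic variables; real applications use richer covariates, weight trimming, and calibration, none of which are shown here, and the data, sample sizes, and variable names are all invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical samples with shared covariates (numerically coded).
web = pd.DataFrame({"age": rng.normal(38, 12, 5000), "female": rng.integers(0, 2, 5000)})
ref = pd.DataFrame({"age": rng.normal(47, 17, 800), "female": rng.integers(0, 2, 800)})

# Model the propensity of appearing in the web sample rather than the reference sample.
X = pd.concat([web, ref], ignore_index=True)
y = np.r_[np.ones(len(web)), np.zeros(len(ref))]
p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X.iloc[: len(web)])[:, 1]

# Weight web respondents by the inverse odds of web-sample membership, so groups
# over-represented online are down-weighted toward the reference sample's profile.
weights = pd.Series((1 - p) / p, name="weight")
weights *= len(web) / weights.sum()  # normalize to the web sample size
print(weights.describe())
```

Weighted estimates can then be computed with these weights, although non-coverage of people who are never online remains unaddressed by this kind of adjustment.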
A further complication is the debate between two statistical schools. On one side are the ‘frequentists’, who have provided the basis for how survey results have traditionally been interpreted. On the other side are the ‘Bayesians’, who weight survey results with their prior knowledge, often based on previous surveys. A good example of the Bayesian perspective is provided by the US election forecasting models of Nate Silver and others, who build models based on a state's voting in previous elections and then update those models with new polls as they come out, giving more credence to polls from nonpartisan sources and human interviewers than to polls from survey organizations with ideological biases or to automated polls. The election forecasting example is a good illustration of what is becoming known as data-driven journalism. Whereas journalists 50 years ago would develop forecasts for an election by talking to insiders who were considered specialists in the politics of a state, nowadays they use polls, polls of polls, and complex statistical models to develop their election forecasts. Such a development is also occurring in many other decision-making areas, such as companies using a combination of focus groups and consumer surveys in deciding which new products to bring to market and how to prepare effective marketing campaigns for those products. Time and cost considerations are creating additional problems for traditional survey research. Many data users have a distinct need for improved timeliness, needing data on very short notice and not being willing to wait for a traditional survey to deliver that data. Furthermore, in-person national surveys have become very expensive in large industrial countries, and the cost of hiring interviewers has even made telephone surveys more expensive than many research budgets
permit. As a result, many users are not willing to pay for traditional surveys. There is still a lot of room for traditional surveys, but the survey community needs to adjust to these time and cost concerns. The Total Survey Error approach and the Total Survey Quality emphasis have both been important developments. They have created better understanding of the survey process and more focus on how to improve surveys. Yet the movement away from traditional survey procedures to big data, opt-in panels, and other modes of data collection may signify another paradigm shift that yields even greater changes in how survey research is conducted. The development from TSE and TSQ to Total Research Quality seems very relevant since future data collection will be very different from today’s. We can expect a plethora of data sources and a need to combine these. We need new theories and methods that can help us do that in a scientific way.
NOTE

1 Paradata in a survey are data about how the survey data were collected. For example, they would include how many attempts were made to contact a respondent, how long the interview took, and at what time of day the interview was conducted. Alternative definitions do not restrict paradata to the data collection process but rather cover all survey processes (Lyberg and Couper 2005).
RECOMMENDED READINGS

The following works are recommended for more information: Groves (1989), Groves et al. (2009), Lyberg (2012), Lyberg and Biemer (2008), Smith (2011), and Weisberg (2005).
REFERENCES

Biemer, P. P., Trewin, D., Bergdahl, H., and Japec, L. (2014). A System for Managing the
Quality of Official Statistics. Journal of Official Statistics, 30(3), 381–442. Breyfogle, F. W. (2003). Implementing Six Sigma: Smarter Solutions using Statistical Methods. Hoboken, NJ: John Wiley & Sons. Campbell, D. T., and Stanley, J. (1963). Experimental and Quasi-Experimental Designs for Research. Chicago: Rand-McNally. Couper, M. P. (1998). Measuring Survey Quality in a CASIC Environment. Paper presented at the Joint Statistical Meetings, American Statistical Association, Dallas, TX. Deming, W. E. (1944). On Errors in Surveys. American Sociological Review, 9(4), 359–369. Groves, R. M. (1989). Survey Errors and Survey Costs. New York: John Wiley. Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., and Tourangeau, R. (2004, 2009). Survey Methodology (1st and 2nd edn). New York: Wiley. Groves, R. M., and Heeringa, S. G. (2006). Responsive Design for Household Surveys: Tools for Actively Controlling Survey Errors and Costs. Journal of the Royal Statistical Society, A, 169, 439–457. Hansen, M. H., and Hauser, P. M. (1945). Area Sampling – Some Principles of Sample Design. Public Opinion Quarterly, 9(2), 183–193. Hansen, M. H., Hurwitz, W. N., and Pritzker, L. (1964). The Estimation and Interpretation of Gross Differences and Simple Response Variance. In C. Rao (ed.), Contributions to Statistics (pp. 111–136). Oxford: Pergamon Press. Jabine, T. B., Straf, M. L., Tanur, J. M., and Tourangeau, R. (1984). Cognitive Aspects of Survey Methodology: Building a Bridge between Disciplines. Washington DC: National Academy Press. Kenett, R. S., and Shmueli, G. (2014). On Information Quality. Journal of the Royal Statistical Society, A, 177(1), 3–27. Kish, L. (1965). Survey Sampling. New York: Wiley. Kreuter, F. (ed.) (2013). Improving Surveys with Paradata: Analytic Uses of Process Information. Hoboken, NJ: John Wiley & Sons. Krosnick, J. A., and Alwin, D. F. (1987). An Evaluation of a Cognitive Theory of Response-Order Effects in Survey Measurement. Public Opinion Quarterly, 51(2), 201–219.
Lyberg, L. E. (2012). Survey Quality. Survey Methodology, 38(2), 107–130. Lyberg, L. E., and Biemer, P. P. (2008). Quality Assurance and Quality Control in Surveys. In E. de Leeuw, J. Hox, and D. Dillman (eds), International Handbook of Survey Methodology, Chapter 22 (pp. 421–441). New York: Lawrence Erlbaum Associates. Lyberg, L. E., and Couper, M. P. (2005). The Use of Paradata in Survey Research. Invited paper, International Statistical Institute, Sydney, Australia. Lyberg, L. E., Japec, L., and Biemer, P. P. (1998). Quality Improvement in Surveys – A Process Perspective. Proceedings of the Section on Survey Research Methods, American Statistical Association, 23–31. Montgomery, D. C. (2005). Introduction to Statistical Quality Control (5th edn). Hoboken, NJ: John Wiley & Sons.
Neyman, J. (1934). On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society, 97(4), 558–606. Smith, T. W. (2011). Refining the Total Survey Error Perspective. International Journal of Public Opinion Research, 28(4), 464–484. Spencer, B. D. (1985). Optimal Data Quality. Journal of the American Statistical Association, 80(391), 564–573. Tourangeau, R., Rips, L. J., and Rasinski, K. (2000). The Psychology of Survey Response. Cambridge, UK: Cambridge University Press. Weisberg, H. F. (2005). The Total Survey Error Approach. Chicago: University of Chicago Press.
4 Challenges of Comparative Survey Research

Timothy P. Johnson and Michael Braun
INTRODUCTION

The origins of comparative survey research date back to the late 1940s (Smith, 2010). From those earliest experiences, the dangers of uncritically exporting social science research methodologies to new cultures and social environments were quickly recognized (c.f., Buchanan and Cantril, 1953; Duijker and Rokkan, 1954; Wallace and Woodward, 1948–1949; Wilson, 1958). Over the ensuing decades, the research literature began to demonstrate increasing awareness of both the opportunities and challenges of comparative survey research (Bulmer and Warwick, 1993; Casley and Lury, 1981; Cicourel, 1974; Frey, 1970; Przeworski and Teune, 1970; Tessler et al., 1987; van de Vijver and Leung, 1997; van Deth, 2013 [1998]). Today, the importance of comparative survey research is reflected in dramatic increases in the availability of internationally comparative survey data (Smith, 2010; see also Chapter 43 by Smith and Fu, in this Handbook), the volume
of substantive analyses of these data, and the continued growth and sophistication of methodological research focused on the problems associated with comparative survey research (c.f., Davidov et al., 2011; Harkness et al., 2010a; van de Vijver et al., 2008). In this chapter, we provide an overview of the challenges posed by comparative survey research and some potential strategies for addressing them.
THE CHALLENGE OF COMPARABILITY

In comparative survey research, much more than the problems common to all mono-cultural surveys and measures needs to be taken into consideration. In addition to depending on the quality of each individual national or cultural survey and measurement component, cross-cultural research is also dependent on their ‘comparability’. Within the comparative framework, the same
challenges apply whether the goal is comparison of different ethnic groups within a single country or comparisons across multiple countries. In the first case, multiple issues, such as sampling, accessibility, translations, and interviewer effects, may be relevant. In the second case, larger contextual factors must also be considered. These latter contexts might include considerations such as the social structures and information available for the drawing of samples and conducting fieldwork, as well as socio-economic and political environments. Because cross-national comparative surveys are the more general case, we will focus in this chapter on comparisons of different countries. However, comparing different cultural groups within a country will be addressed as well. An initial question which emerges early in the conduct of a comparative project is that of which countries are to be compared. The selection of countries or cultural groups is important both for the design of an original comparative survey project and for the analysis of secondary data. Depending on the research question, different strategies have been considered (e.g., Frey, 1970; Küchler, 1998; Przeworski and Teune, 1970; Scheuch, 1968). One approach involves the selection of countries which are contextually as different as possible. An important advantage of this ‘most different systems’ design (Przeworski and Teune, 1970: 34ff) is its ability to demonstrate the generalizability of relationships between the variables under study. If relationships between constructs are found to be consistent across countries that have little in common, then there is reason to believe that they are indeed generalizable and not dependent on the conjunction of very specific contextual conditions. If, on the other hand, the research aim is to demonstrate that a relationship originally found in one country or cultural group is unique and does not hold everywhere, then using different contexts (e.g., countries which differ in a large number of characteristics) is a disadvantage. This holds because, if the data seemingly are in
accordance with the expectation of different relationships, alternative explanations might make an unambiguous interpretation impossible. In this case it is rather helpful to select highly similar countries, which is consistent with Przeworski and Teune's (1970) ‘most similar systems’ design. This is because, if differences in relationships are found even between highly similar countries, it becomes more likely that such differences will also appear across more dissimilar national contexts. In research practice, however, these considerations are increasingly losing relevance, at least as far as the selection of countries is concerned, due to the ongoing development of analytic strategies that are able to address many of these concerns (see below). Many projects of international comparative survey research such as the International Social Survey Program (ISSP, http://www.issp.org) or the World Values Survey (WVS, http://www.worldvaluessurvey.org) strive for a more comprehensive coverage of all countries. And the survey projects restricted to specific regions of the world such as the European Social Survey (ESS, http://www.europeansocialsurvey.org), the European Values Study (EVS, http://www.europeanvaluesstudy.eu) or the Eurobarometer surveys (http://ec.europa.eu/public_opinion/index_en.htm) also pursue, in their domains, a maximally complete coverage of existing countries (for these projects, see Chapter 43 by Smith and Fu, this Handbook). Modern developments of regression analysis, in particular multilevel models, have provided adequate statistical procedures for the appropriate analysis of data from a large number of countries. These procedures allow one to investigate whether relationships between variables differ between countries and how such differences can be explained by characteristics on the level of countries. The problem of comparability not only concerns the nations under investigation but also the selection of groups to be compared between countries. In comparative research, attention also has to be paid to the possibility
that what is found is not a difference between countries but an artifact of comparing different groups. Scheuch (1968: 187), for instance, discusses whether farmers in the US are comparable with those in Europe: ‘… if one compares responses for both groups, much of what is done actually shows that similar labels refer to different groups, rather than demonstrating cross-cultural differences between the responses of otherwise comparable groups’.
PROBLEMS OF COMPARABILITY OF PROJECT COMPONENTS

In cross-cultural comparative research, the adequacy of conclusions depends on the quality and comparability of single national studies. In the presence of errors, similarities and differences between countries might simply be due to methodological artifacts. Indeed, a critical task of comparative survey research is to attempt to prevent, ex ante, and to detect and adjust for, ex post, these potential artifacts. Two broad classes of errors can be distinguished: (1) the degree to which samples which were drawn and realized (after fieldwork) represent intended populations in a comparable way (sampling, coverage, and non-response error), and (2) the extent to which the questionnaire in general and specific items in particular after translation are processed the same way by respondents from different national and/or ethnic groups (measurement error). For a discussion of the total survey error framework, see Chapter 10 by Biemer in this Handbook.
Sampling, Coverage, and Nonresponse Errors

‘Sampling error’ results from analyzing just a sample instead of the entire population. It can be computed exactly if the sample has
been drawn at random and the other, systematic, error components (see below) can be neglected. ‘Coverage errors’ exist if not all units of the population under investigation have a chance to be included in the sample. The conditions of the sampling might be responsible for this. That is, the complete and up-to-date lists of the members of the population which are necessary for unbiased sampling may not exist, be accessible, or be practicable to use in all nations (see Chapter 23 by Gabler and Häder, this Handbook). ‘Non-response errors’ (see Chapter 27 by Stoop, this Handbook) refer to non-participation of those units of the population that have been selected as part of the sample. Non-response comprises non-contacts, refusals, and those who are not able to take part in the survey for other reasons, such as limited accessibility, poor health, or language problems. Each of these components can differ significantly across countries, depending on the survey climate in a society (that is, how Western-originated, quantitative surveys are perceived in general, and the population's general willingness to participate in them), as well as its survey literacy, or experience with survey research practices (Harkness, 2007), fieldwork duration, contact protocols, use of incentives, the physical mobility and accessibility of the population, the acceptability of refusal conversion practices, average household size and the resources of the research organization. Lack of awareness of or appreciation for important religious or public holidays when planning fieldwork can also have serious effects on nonresponse (Wuelker, 1993). The consequences of excluding specific parts of the population – due to coverage and/or non-response errors – depend on the topics of the survey and the size of the groups affected but in particular on differences with respect to the variables of interest between these groups and those actually participating in the survey. The utilization of identical procedures within each participating country is no
guarantee for comparability (Pennell et al., 2010). On the contrary, different strategies which are adapted to the respective contexts might be preferable. It obviously makes little sense to prescribe the conduct of telephone surveys for all countries, irrespective of the local conditions: in countries with a low telephone density, a large part of the population would be excluded from the survey to begin with. Instead, in these countries, other survey modes should be employed that can be equally successful in covering and contacting the population. It might also be necessary to accept differences in sampling procedures and data collection mode if the approach that is optimal for each single country is to be applied. In Germany, for instance, a sample drawn from municipal registers is regarded as the gold standard for personal interviews. In the US, in contrast, such registers do not exist. There, the standard approach for area probability surveys consists of drawing a random sample of city blocks first, then another sample of households, and finally the sampling of actual respondents using within-household selection procedures. In some developing countries, even such a procedure is not feasible, for instance because there is less reliable information regarding the distribution of the population in small geographical units, because blocks of houses cannot be identified, or because a part of the population consists of nomads. In these cases, less precise procedures must sometimes be used out of necessity, although the now widespread availability of global positioning systems (GPS) and associated technologies has done much to alleviate these problems (c.f., Galway et al., 2012; Himelein et al., 2014). It would hardly be acceptable to use such sub-optimal procedures also in those countries in which good random samples could be drawn, only to preserve the formal equality of procedures.
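A minimal sketch of the multi-stage selection logic described above for US-style area probability surveys, assuming a hypothetical frame of blocks with known household lists; real designs use stratification, probability-proportional-to-size selection of blocks, and formal within-household selection rules (e.g., Kish tables), none of which are reproduced here.

```python
import random

random.seed(7)

# Hypothetical frame: block id -> list of household ids (sizes are invented).
frame = {b: [f"b{b}-hh{h}" for h in range(random.randint(20, 80))] for b in range(200)}

# Stage 1: sample of blocks (a real design would typically use PPS selection).
sampled_blocks = random.sample(list(frame), k=20)

# Stage 2: a fixed number of households within each sampled block.
sampled_households = [hh for b in sampled_blocks for hh in random.sample(frame[b], k=10)]

# Stage 3: within each household, one adult would then be selected by a rule such as
# the last-birthday method or a Kish grid; that step is only noted here, not implemented.
print(len(sampled_households), "households selected from", len(sampled_blocks), "blocks")
```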
Measurement Errors

Discrepancies between actual and reported values for measures targeted for assessment
by survey data collection methods are known as ‘measurement errors’. These include sources of error that may be contributed by survey instruments, respondents, interviewers, survey design and procedures, and differences in social systems (Biemer et al., 1991). In comparative research, each of these may have differential influence across nations and individual cultures. Survey instruments may contribute to measurement error by way of complex designs, the formulation of single questions (and, of course, inadequate translation), and the succession of questions and response alternatives. At the most basic level, the construction of individual items needs to be attended to carefully when developing measures for use in cross-cultural contexts. As an example, the President of the US has an entirely different function from the President in Germany (where he or she has a largely representative role only). When it comes to measuring attitudes towards the head of government, ‘President’ thus needs to be relabeled as ‘Chancellor’ in a German questionnaire. Harkness (2007) provides another example that considers the challenge of measuring religiosity in a comparable manner across nations using items that assess frequency of church attendance. At the respondent level, asking for both objective and subjective information can be affected by cultural variability in question information processing, including problems of understanding and interpretation, difficulties of recall of the desired information from memory, and variability in judgment formation, response mapping, and response editing processes (Johnson et al., 1997). In addition, it is known that social interactions and communication patterns are largely mediated by cultural norms, which may influence ‘standardized’ survey data in numerous ways. In comparative analyses, these differences may be misinterpreted as substantively meaningful respondent differences in attitudes, beliefs, and/or behaviors when they in fact represent variability in how respondents
react and respond to survey questions during social encounters. There are currently several conceptual frameworks for understanding cultural dimensions that may be useful for interpreting respondent behaviors during survey interviews. Some of these include those identified by Hofstede (2001), Inglehart and Oyserman (2004), Schwartz et al. (1992), Triandis (1996), and Trompenaars and Hampden-Turner (1998). Hofstede’s (2001) individualistic vs collectivistic orientations, for example, have been linked to national and cultural group differences in propensity to give socially desirable, acquiescent, and extreme responses when answering survey questions (Johnson et al., 2011; Johnson and van de Vijver, 2003). In addition, unique cultural patterns such as the Asian and Middle Eastern Courtesy bias (Ibrahim, 1987; Jones, 1963; Mitchell, 1993), the ‘ingratiation bias’ (Bulmer, 1993), the ‘sucker bias’ (Keesing and Keesing, 1956), Simpatía and other dimensions of Hispanic culture (Davis et al., 2011; Triandis et al., 1984), and the ‘honor’ culture found in the southern United States (Nisbett and Cohen, 1996), may also differentially influence respondent behaviors. Interviewers may be an additional source of variability in measurement error in comparative survey research. They may misread question text or register the answers of the respondents in a biased way, thereby contributing to error. They also might drift – consciously or unconsciously – from their role as a friendly but neutral observer, by cuing respondents to answer in a conforming or socially desirable manner. Recent research suggests, for example that in some countries, religious clothing may influence respondent answers (Benstead, in-press; Blaydes and Gillum, 2013; Koker, 2009). Countries differ in their survey traditions and, therefore, also in the standards of conduct, for instance the training of interviewers, and the acceptability of interactions between interviewers and respondents of different age or gender (Bulmer, 1993; Kearl, 1976; Newby et al., 1998; Pennell et al., 2010). The presence of
others during survey interviews is also likely to have differential effects on the quality of self-reported information across countries (Mitchell, 1993; Mneimneh, 2012). Survey design and protocols may also differentially add to measurement errors cross-nationally. Tendencies to give socially desirable answers, for example, are known to vary by survey mode in that they are most pronounced with personal interviews and least pronounced with self-administered surveys (Kreuter et al., 2008). As comparative projects must of necessity sometimes rely on varying modes of data collection across nations for various reasons, such as coverage issues (see above), efforts to minimize one potential source of cross-national error may unfortunately contribute to other sources of error. Differences in operational protocols, such as procedures for supervising and monitoring field interviewers, may have similar effects. Structural differences across social systems, if not properly accounted for, may also lead to serious measurement errors. For example, nearly all socio-demographic variables are potentially defined differently in different countries (Braun and Mohler, 2003; Hoffmeyer-Zlotnik and Wolf, 2003; Scheuch, 1968). Measures of education (Braun and Müller, 1997) are particularly problematic due to significant differences in the underlying education systems. Measuring income cross-nationally is equally challenging.
COMPARABILITY OF CONSTRUCTS AND ITEMS

Researchers addressing cross-cultural measurement error usually express concerns with the ‘equivalence’ of their measures and instruments. However, as Frey (1970: 240) has observed: ‘equivalence is never total’. Accordingly, Mohler and Johnson (2010) conceive of equivalence as an ‘ideal concept’
or ‘ideal type’ that is unattainable in practice, albeit a useful concept when considering the practical similarity of constructs, indicators, and measures across cultural groups. The notion of equivalence, however, is employed as a common heuristic in the literature concerned with cross-cultural methods and we use it here for the same purpose. It is of course essential to work to establish the comparability of the stimuli to which respondents are asked to respond. If there is no such comparability, the data will not only represent real differences between the countries or cultures under investigation but also measurement artifacts. Separating the two can be a difficult challenge. Obviously, researchers want to avoid any lack of comparability through careful construction of measurement instruments ex ante. However, such efforts have not been successful in many studies, despite significant effort. This means that the users of secondary data must often face the task of establishing ex post whether or not equivalence can be demonstrated for the measures being analyzed.
Securing Equivalence of Measurement Instruments Ex Ante

The comparability of data can be compromised both by an inadequate rendering of measures in different languages and by a poor fit to the social reality of different national contexts. Researchers typically attempt to achieve functional equivalence, ex ante, through careful construction of the source questionnaire, making ample use of existing multicultural competence and experience (Harkness et al., 2010b), and via professional translation and adaptation of questionnaires (Harkness et al., 2010c). To give an example, in the ISSP a so-called drafting group consisting of representatives of approximately six – ideally highly diverse – member countries prepares a preliminary source questionnaire in English. In an iterative procedure, this draft is then
discussed with representatives of the other member countries, tested in a smaller number of countries, and finally approved by the ISSP General Assembly after discussion and voting. During this process, country-specific particularities are taken into account, both with regard to the inclusion of single items and with regard to their formulation. For instance, in the past, East Asian countries have suggested a more abstract formulation of items on religion than would be necessary if all countries had a Christian background. The source questionnaire is then translated into the languages of the participating countries. For the translation, a team approach is recommended, in which several translators first translate the questionnaire independently of each other into the language of the country and then discuss their translations with experts on the specific topics dealt with in the questionnaire and with survey experts (see Chapter 19 by Behr and Shishido, this Handbook). Special attention is given to the question of whether the translated items are understood in the same way as the items in the source questionnaire. Finally, the country-specific versions of the questionnaire are tested in a (cognitive) pretest, with special attention to problems of comprehensibility of the translation and comparability of the stimuli with the source questionnaire. Often there is no clear distinction made between translation and adaptation, though this might be useful (see Chapter 19 by Behr and Shishido, this Handbook). Translation (in a narrow sense) refers to linguistic aspects. However, as a properly translated question might lead to different stimuli in different cultures, an adaptation is necessary which takes into consideration non-linguistic cultural information. Paradoxically, errors may be more frequent where cultural differences are seemingly smaller and more likely to be overlooked, such as when a common language is shared between countries that have somewhat unique cultures. An example is the question of whether children suffer if their mother works, where eastern German respondents have a tendency to think
of younger children and a higher amount of labor-force participation than do western German respondents. Thus, the answers of respondents are not directly comparable. Nevertheless, the revealed attitudes are still less traditional in eastern compared to western Germany. Both qualitative procedures and quantitative pretests are helpful to achieve equivalence ex ante. Among qualitative procedures, cognitive interviews are particularly helpful to investigate problems in the response process (Beatty and Willis, 2007; Willis, 2015; see also Chapter 15 by Miller and Willis in this Handbook). International comparative cognitive studies, however, are infrequent compared to intercultural comparative cognitive studies in one country, due to the high coordination effort necessary (as exceptions: Fitzgerald et al., 2011; Miller et al., 2011). In quantitative pretests the main study questionnaire is tested under actual conditions. When pretests are conducted in different countries and high numbers of cases are available, statistical procedures for testing equivalence can also be employed. It is, however, rare in practice that large quantitative pretests or qualitative studies are conducted in all participating countries. Even for the European Social Survey (ESS), qualitative studies were the exception and quantitative pretests have been conducted only in two countries, before the English-language source questionnaire was finalized. Moreover these pretests – as in the ISSP – were conducted mainly for the purpose of selecting substantively interesting and methodologically adequate questions and not in order to test the final measurement instrument to be used in the main study. Thus, it is often necessary to test equivalence ex post on the basis of the data collected in the main study.
Securing Equivalence of Measurement Instruments Ex Post

There are a large number of quantitative procedures for testing equivalence of
measurement instruments ex post (Braun and Johnson, 2010; Davidov et al., 2011; van de Vijver and Leung, 1997). One of the most frequently used procedures is confirmatory factor analysis (Brown, 2006; Vandenberg and Lance, 2000). For cross-cultural comparisons, multiple-group confirmatory factor analysis tests whether a measurement instrument which has proved to be adequate for one country can also be applied to other countries. Comparability can exist on different levels, which correspond to different criteria for equivalence. Braun and Johnson (2010) illustrate the use of several complex and less complex procedures for the same substantive problem, demonstrating that they lead to essentially the same conclusions. An important drawback of all quantitative procedures – perhaps with the exception of multilevel models (in particular multilevel confirmatory factor analysis) – is that they can only point to a problem but not account for the underlying causes. For these purposes – as already with establishing equivalence ex ante (that is, in the course of pretesting) – probing techniques can be used. Because the conduct of cognitive interviews – particularly on the international level – is very time-consuming, the conduct of additional web-based studies is a possible alternative. In such studies, probing questions can be included in a regular web questionnaire – comparable to Schuman's (1966) suggestion for ‘random probes’. Behr et al. (2012), for example, use both ‘comprehension probes’ (‘What ideas do you associate with the phrase "civil disobedience"? Please give examples’) and ‘category-selection probes’ (‘Please explain why you selected [response alternative]’) for an item on civil disobedience from the ISSP. They document two reasons for the lower support for civil disobedience in countries such as the US or Canada, compared to countries such as Germany and Spain. On the one hand, in the first group of countries civil disobedience is associated with violence to a higher degree than in the second group. This is a
methodological artifact as the respondents are actually answering different questions. On the other hand, trust in politicians is (even) lower in the second group of countries than in the first one. This is a substantive result. However, in the data methodological artifacts and real differences are actually confounded, and it is difficult to distinguish between the two. Braun et al. (2012) show that respondents from different countries, when asked to evaluate migrants, do think of largely comparable groups which correspond to the reality of migration in the different countries, though there are distortions. They use ‘specific probes’ (‘Which type of immigrants were you thinking of when you answered the question?’). Conventional statistical methods alone could not have answered the question whether the attitudes of respondents in different countries towards migrants are incomparable because respondents think of different groups even in the case of comparable realities of migration.
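Returning to the multiple-group confirmatory factor analysis mentioned at the start of this section, the different levels of comparability it tests are commonly written as a hierarchy of constraints on a factor model estimated separately per country g. The formulation below is the standard measurement-invariance hierarchy, added here for reference rather than taken from the chapter; x denotes the observed items, ξ the latent construct, Λ the loadings, τ the intercepts, and ε the residuals:

```latex
x_{ig} = \tau_g + \Lambda_g \,\xi_{ig} + \varepsilon_{ig}
\qquad \text{(configural invariance: the same factor structure in every group)}

\Lambda_g = \Lambda \ \ \forall g
\qquad \text{(metric invariance: equal loadings, permitting comparison of relationships)}

\Lambda_g = \Lambda,\ \ \tau_g = \tau \ \ \forall g
\qquad \text{(scalar invariance: equal loadings and intercepts, permitting comparison of means)}
```

Each level adds equality constraints across groups; the level that can be defended empirically determines which cross-country comparisons (correlations, regression coefficients, or latent means) are justified.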
ETHICAL AND OTHER CONCERNS IN COMPARATIVE SURVEYS

Conducting comparative survey research also requires consideration of potential variability and reconciliation of ethical standards of research conduct across the nations and/or groups being examined. In this regard, perhaps the most fundamental ethical question confronting comparative survey research is the appropriateness of imposing Western individualistic, quantitative social science methodologies onto cultures that do not share those traditions and values. Arguably, survey research is itself a culture-bound methodology designed to study social structures that are most specific to Western nations. Western assumptions, such as the equal value of all individual opinions, can be contested elsewhere (Rudolph and Rudolph, 1958) and suggest an inherent ‘democratic bias’
(Lonner and Berry, 1986) embedded within survey methodologies. From such a perspective, applications of survey research in other cultural environments may be interpreted by some as a form of ‘scientific imperialism’ (Drake, 1973). One important source of conflict over the past several decades in this regard has centered around questions of the traditional Western requirement of obtaining informed consent from all individual survey respondents. In some traditional, collectivist societies, the concept of the self continues to be de-emphasized in favor of group identities. In these environments, researchers have commented on the importance of obtaining approval from village leaders or family elders in lieu of individual informed consent from potential respondents (Bulmer, 1998; Drake, 1973; Kearl, 1976). It has been argued that requiring individual informed consent where it is not understood or valued is a coercive form of ‘ethical imperialism’ that may undermine local institutions and social practices (Barry, 1988; Newton, 1990). In contrast, Ijsselmuiden (1992) cautions that it is not always clear who, other than the research participant, would be the most appropriate person to provide consent, and how to be certain that such a gatekeeper is in fact acting in the respondent’s best interests. Common sense suggests that comparative researchers be aware of the importance of respecting and showing deference to local cultural traditions and values while also demonstrating respect for the importance of individual autonomy when making decisions regarding research participation. A related issue is the level of information that is required to meet informed consent requirements in different nations. Concerns have been expressed that some national requirements for complex, multi-page consent documents may be confusing and not always appropriate when exported to other cultural contexts, leaving potential respondents with the impression that researchers are more concerned with their own
legal protections rather than the welfare of respondents (Marshall, 2001). Some additional ethical concerns that are specific to comparative survey research include cross-national data sharing practices (Freed-Taylor, 1994), the ethics of collecting data in nations that may have ‘no intellectual or material interests in the results’, also known as ‘academic colonialism’ (Schooler et al., 1998), and abandonment of the strict standards of methodological rigor commonly seen in national surveys when conducting cross-national research (Jowell, 1998). In addition, the importance of equal professional and power status for researchers from various participating nations, as opposed to exploitive ‘hired hands’ research (Clinard and Elder, 1965; Kleymeyer and Bertrand, 1993; Warwick, 1993), cannot be over-emphasized. As O’Barr et al. (1973: 13) have commented ‘when decisions about research in one society are made by persons from another society or, worse yet, by foreign governments, the potential for abuse increases considerably, as does the anxiety of those being studied’. Fortunately, accumulated insights and experience over the past several decades have both brought to light these various ethical concerns and explored approaches to addressing them using strategies that recognize and demonstrate respect for societal values and for personal well-being. In particular, both Warwick (1980) and Hantrais (2009) have presented sets of recommendations for engaging in sound ethical practice while also maintaining high scientific standards in the conduct of comparative survey research.
SUMMARY

The collection of survey data across nations or cultures presents many additional challenges beyond those confronted when conducting research in more homogeneous environments. We have sought to provide an overview of those challenges specifically
associated with the design and execution of high quality comparative survey research, along with some of the potential solutions that are being actively pursued by investigators around the world.
RECOMMENDED READINGS

For further reading, we recommend the following recent texts concerned with unique conceptual and technical aspects of comparative survey research: Davidov et al. (2011); Hantrais (2009); Harkness et al. (2003, 2010a); Hoffmeyer-Zlotnik and Wolf (2003); and Porter and Gamoran (2002). In particular, Harkness et al. (2010a) provide a broad overview of cross-national and cross-cultural survey research methodologies. In addition to these texts, Scheuch (1968) is one of the earliest papers to carefully describe the challenges associated with comparability in cross-cultural survey research.
REFERENCES

Barry, M. (1988). Ethical considerations of human investigation in developing countries: The AIDS dilemma. New England Journal of Medicine 319(16): 1083–1086. Beatty, P.C. and Willis, G.B. (2007). Research synthesis: The practice of cognitive interviewing. Public Opinion Quarterly 71: 287–311. Behr, D., Braun, M., Kaczmirek, L. and Bandilla, W. (2012). Item comparability in cross-national surveys: Results from asking probing questions in cross-national Web surveys about attitudes towards civil disobedience. Quality & Quantity. DOI: 10.1007/s11135-012-9754-8. Benstead, L.J. (in-press). Effects of interviewer-respondent gender interaction on attitudes toward women and politics: Findings from Morocco. International Journal of Public Opinion Research. Advanced access at: http://ijpor.oxfordjournals.org/content/early/2013/09/27/ijpor.edt024.full.pdf+html. Accessed on 6 June 2016.
Biemer, P.P., Groves, R.M., Lyberg, L., Mathiowetz, N.A. and Sudman, S. (eds) (1991). Measurement Errors in Surveys. New York: Wiley. Blaydes, L. and Gillum, R.M. (2013). Religiosity-of-interviewer effects: Assessing the impact of veiled enumerators on survey response in Egypt. Politics and Religion. Accessed on April 25, 2015 at: http://web.stanford.edu/~blaydes/Veil3.pdf. Braun, M., Behr, D. and Kaczmirek, L. (2012). Assessing cross-national equivalence of measures of xenophobia: Evidence from probing in Web Surveys. International Journal of Public Opinion Research. DOI: 10.1093/ijpor/eds034. Braun, M. and Johnson, T.P. (2010). An illustrative review of techniques for detecting inequivalences, pp. 375–393 in J.A. Harkness, M. Braun, B. Edwards, T.P. Johnson, L. Lyberg, P.P. Mohler, B.-E. Pennell and T. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: Wiley. Braun, M. and Mohler, P.P. (2003). Background variables, in J. Harkness, F.J.R. van de Vijver and P.P. Mohler (eds), Cross-Cultural Survey Methods. Hoboken, NJ: Wiley, pp. 101–116. Braun, M. and Müller, W. (1997). Measurement of education in comparative research, in L. Mjøset et al. (eds), Methodological Issues in Comparative Social Science. Comparative Social Research 16, Greenwich, CT: JAI Press, pp. 163–201. Brown, T.A. (2006). Confirmatory Factor Analysis for Applied Research. New York: Guilford Press. Buchanan, W. and Cantril, H. (1953). How Nations See Each Other: A Study of Public Opinion. Urbana: University of Illinois Press. Bulmer, M. (1993). Interviewing and field organization, pp. 205–217 in M. Bulmer and D.P. Warwick (eds), Social Research in Developing Countries: Surveys and Censuses in the Third World. London: UCL Press. Bulmer, M. (1998). Introduction: The problem of exporting social survey research. American Behavioral Scientist 42: 153–167. Bulmer, M. and Warwick, D.P. (1993). Social Research in Developing Countries: Surveys and Censuses in the Third World. London: UCL Press. Casley, D.J. and Lury, D.A. (1981). Data Collection in Developing Countries. Oxford: Clarendon Press.
Cicourel, A.V. (1974). Theory and Method in a Study of Argentine Fertility. New York: Wiley. Clinard, M.B. and Elder, J.W. (1965). Sociology in India. American Sociological Review 30: 581–587. Davidov, E., Schmidt, P. and Billiet, J. (eds) (2011). Cross-cultural Analysis: Methods and Applications. New York: Routledge. Davis, R.E., Resnicow, K. and Couper, M.P. (2011). Survey response styles, acculturation, and culture among a sample of Mexican American adults. Journal of Cross-Cultural Psychology 42: 1219–1236. Drake, H.M. (1973). Research method or culture-bound technique? Pitfalls of survey research in Africa, pp. 58–69 in W.M. O'Barr, D.H. Spain and M.A. Tessler (eds), Survey Research in Africa: Its Applications and Limits. Evanston, IL: Northwestern University Press. Duijker, H.C.J. and Rokkan, S. (1954). Organizational aspects of cross-national social research. Journal of Social Issues 10: 8–24. Fitzgerald, R., Widdop, S., Gray, M. and Collins, D. (2011). Identifying sources of error in cross-national questionnaires: Application of an error source typology to cognitive interview data. Journal of Official Statistics 27: 569–599. Freed-Taylor, M. (1994). Ethical considerations in European cross-national research. International Social Science Journal 46: 523–532. Frey, F.W. (1970). Cross-cultural survey research in political science, pp. 173–294 in T.T. Holt and J.E. Turner (eds), The Methodology of Comparative Research. New York: Free Press. Galway, L.P., Bell, N., Sae, A.S., Hagopian, A., Burnham, G., Flaxman, A., Weiss, W.M., Takaro, T.K. (2012). A two-stage cluster sampling method using gridded population data, a GIS, and Google Earth™ imagery in a population-based mortality survey in Iraq. International Journal of Health Geographics 11: 12. Accessed on April 25, 2015 at: http://www.ij-healthgeographics.com/content/11/1/12. Hantrais, L. (2009). International Comparative Research: Theory, Methods and Practice. New York: Palgrave Macmillan. Harkness, J.A. (2007). In pursuit of quality: Issues for cross-national survey research, pp. 35–50 in L. Hantrais and S. Mangen
Challenges of Comparative Survey Research
(eds), Cross-National Research Methodology and Practice. London: Routledge. Harkness, J.A., Braun, M., Edwards, B., Johnson, T.P., Lyberg, L., Mohler, P.P., Pennell, B.E. and Smith, T. (2010a). Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: Wiley. Harkness, J.A., Edwards, B., Hansen, S.E., Miller, D.R. and Villar, A. (2010b). Designing questionnaires for multipopulation research, pp. 33–57 in J.A. Harkness, M. Braun, B. Edwards, T.P. Johnson, L. Lyberg, P.P. Mohler, B.-E. Pennell and T. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: Wiley. Harkness, J.A., van de Vijver, F.J.R. and Mohler, P.P. (eds) (2003). Cross-cultural Survey Methods. Hoboken, NJ: Wiley Harkness, J.A., Villar, A. and Edwards, B. (2010c). Translation, adaptation, and design, pp. 117–140 in J.A. Harkness, M. Braun, B. Edwards, T.P. Johnson, L. Lyberg, P.P. Mohler, B.-E. Pennell and T. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: Wiley. Himelein, K., Eckman, S. and Murray, S. (2014). Sampling nomads: A new technique for remote, hard-to-reach, and mobile populations. Journal of Official Statistics 30(2): 191–213. Hoffmeyer-Zlotnik, J.H.P. and Wolf, C. (eds) (2003). Advances in Cross-National Comparison: A European Working Book for Demographic and Socio-Economic Variables. New York: Kluwer/Plenum. Hofstede, G. (2001). Culture’s Consequences (2nd edn). Thousand Oaks, CA: Sage. Ibrahim, B.L. (1987). The relevance of survey research in Arab society, pp. 77–101 in M.A. Tessler, M. Palmer, T.E. Farah and B.L. Ibrahim (eds), The Evaluation and Application of Survey Research in the Arab World. Boulder, CO: Westview Press. Ijsselmuiden, C.B. (1992). Research and informed consent in Africa – another look. New England Journal of Medicine 326(12): 830–834. Inglehart, R. and Oyserman, D. (2004) Individualism, autonomy, self-expression. The human development syndrome, pp. 74–96 in H.
51
Vinken, J. Soeters and P. Ester (eds), Comparing Cultures: Dimensions of Culture in a Comparative Perspective. Leiden, The Netherlands: Brill. Johnson, T.P., O’Rourke, D., Chavez, N., Sudman, S., Warnecke, R., Lacey, L., et al. (1997). Social cognition and responses to survey questions among culturally diverse populations, pp. 87–113 in L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz and D. Trewin (eds), Survey Measurement and Process Quality. NY: John Wiley & Sons. Johnson, T.P., Shavitt, S. and Holbrook, A.L. (2011). Survey response styles across cultures, pp. 130–175 in D. Matsumoto and F.J.R. van de Vijver (eds), Cross-Cultural Research Methods in Psychology. Cambridge: Cambridge University Press. Johnson, T.P. and van de Vijver, F.J.R. (2003), Social desirability in cross-cultural research, in J. Harkness, F.J.R. van de Vijver, P.P. Mohler (eds), Cross-Cultural Survey Methods. Hoboken, NJ: Wiley, pp. 195–206. Jones, E.L. (1963). The courtesy bias in SouthEast Asian surveys. International Social Science Journal 15: 70–76. Jowell, R. (1998). How comparative is comparative research? American Behavioral Scientist 42: 168–177. Kearl, B. (1976). Field Data Collection in the Social Sciences: Experiences in Africa and the Middle East. New York: Agricultural Development Council. Keesing, F.M. and Keesing, M.M. (1956). Elite Communications in Samoa: A Study in Leadership. Stanford, CA: Stanford University Press. Kleymeyer, C.D. and Bertrand, W.E. (1993). Misapplied cross-cultural research: A case study of an ill-fated family planning research project, pp. 365–378 in M. Bulmer and D.P. Warwick (eds), Social Research in Developing Countries: Surveys and Censuses in the Third World. London: UCL Press. Koker, T. (2009). Choice under pressure: A dual preference model and its application. Economics Department Working Paper No. 60, Yale University, March. Accessed on April 25, 2015 at: http://economics.yale.edu/sites/ default/files/files/Working-Papers/wp000/ ddp0060.pdf.
52
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
Kreuter, F., Presser, S. and Tourangeau, R. (2008). Social desirability bias in CATI, IVR, and web surveys. Public Opinion Quarterly 72: 847–865. Küchler, M. (1998). The survey method: An indispensable tool for social science research everywhere? American Behavioral Scientist 42: 178–200. Lonner, W.J. and Berry, J.W. (1986). Sampling and surveying, pp. 85–110 in W.J. Lonner and J.W. Berry (eds), Field Methods in CrossCultural Research. Beverly Hills, CA: Sage. Marshall, P.A. (2001). The relevance of culture for informed consent in U.S.-funded international health research, pp. C-1–38 in Ethical and Policy Issues in International Research: Clinical Trials in Developing Countries. Volume II: Commissioned Papers and Staff Analysis. Bethesda, MD: National Bioethics Advisory Commission. Accessed on June 21, 2014 at: https://bioethicsarchive.georgetown.edu/nbac/clinical/Vol2.pdf. Miller, K., Fitzgerald, R., Padilla, J.-L., Willson, S., Widdop, S., Caspar, R., Dimov, M., Grey, M., Nunes, C., Prüfer, P., Schöbi, N. and SchouaGlusberg, A. (2011). Design and analysis of cognitive interviews for comparative multinational testing. Field Methods 23: 379–396. Mitchell, R.E. (1993), Survey materials collected in the developing countries: Sampling, measurement, and interviewing obstacles of intraand inter-national comparisons, pp. 219–239 in D.P. Warwick and S. Osherson (eds), Comparative Research Methods, Englewood Cliffs, NJ: Prentice-Hall. Mneimneh, Z. (2012). Interview privacy and social conformity effects on socially desirable reporting behavior: Importance of cultural, individual, question, design and implementation factors. Unpublished dissertation. Ann Arbor: University of Michigan. Accessed on April 25, 2015 at: http://deepblue.lib.umich. edu/bitstream/handle/2027.42/96051/ zeinam_1.pdf?sequence=1. Mohler, Ph. and Johnson, T.P. (2010). Equivalence, comparability, and methodological progress, pp. 17–29 in J.A. Harkness, M. Braun, B. Edwards, T.P. Johnson, L. Lyberg, P.P. Mohler, B.-E. Pennell and T. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: Wiley.
Newby, M., Amin, S., Diamond, I. and Naved, R.T. (1998). Survey experience among women in Bangladesh. American Behavioral Scientist 42: 252–275. Newton, L.H. (1990). Ethical imperialism and informed consent. IRB 12(3): 10–11. Nisbett, R.E. and Cohen, D. (1996). Culture of Honor: The Psychology of Violence in the South. Boulder, CO: Westview Press. O’Barr, W.M., Spain, D.H. and Tessler, M.A. (1973). Survey Research in Africa: Its Applications and Limits. Evanston, IL: N orthwestern University Press. Pennell, B.-E., Harkness, J.A., Levenstein, R. and Quaglia, M. (2010). Challenges in crossnational data collection, pp. 269–298 in J.A. Harkness, M. Braun, B. Edwards, T.P. Johnson, L. Lyberg, P.P. Mohler, B.-E. Pennell and T. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: Wiley. Porter, A.C. and Gamoran, A. (2002). Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, D.C: National Academy Press. Przeworski, A. and Teune, H. (1970). The Logic of Comparative Social Inquiry. New York: Wiley. Rudolph, L. and Rudolph, S.H. (1958). Surveys in India: Field experience in Madras State. Public Opinion Quarterly 22: 235–244. Scheuch, E.K. (1968). The cross-cultural use of sample surveys: Problems of comparability, pp. 176–209 in S. Rokkan (ed.), Comparative Research across Cultures and Nations. Paris: Mouton. Schooler, C., Diakite, C., Vogel, J., Mounkoro, P. and Caplan, L. (1998). Conducting a complex sociological survey in rural Mali. American Behavioral Scientist 42: 276–284. Schuman, H. (1966). The random probe: A technique for evaluating the validity of closed questions. American Sociological Review 31: 218–222. Schwartz, S. (1992). Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries, pp. 1–65 in M.P. Zanna (ed), Advances in Experimental Social Psychology. San Diego: Academic Press. Smith, T.W. (2010). The globalization of survey research, pp. 477–484 in J.A. Harkness, M.
Challenges of Comparative Survey Research
Braun, B. Edwards, T.P. Johnson, L. Lyberg, P.P. Mohler, B.-E. Pennell and T. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: Wiley. Snijders, T.A.B. and Bosker, R.J. (2011). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling (2nd edn). London: Sage. Tessler, M.A., Palmer, M., Farah, T.E. and Ibrahim, B.L. (1987). The Evaluation and Application of Survey Research in the Arab World. Boulder, CO: Westview Press. Triandis, H.C. (1996). The psychological measurement of cultural syndromes. American Psychologist 51: 407–417. Triandis, H.C., Marín, G., Lisansky, J. and Betancourt, H. (1984). Simpatía as a cultural script for Hispanics. Journal of Personality and Social Psychology 47: 1363–1375. Trompenaars, F. and Hampden-Turner, C. (1998). Riding the Waves of Culture: Understanding Diversity in Global Business (2nd edn). New York: McGraw-Hill. Vandenberg, R.J. and Lance, C.E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices and recommendations for organizational research. Organizational Research Methods 3: 4–69. van Deth, J.W. (ed.) (2013 [1998]). Comparative Politics: The Problem of Equivalence. ECPR Classics. Colchester: ECPR Press.
53
van de Vijver, F.J.R and Leung, K. (1997). Methods and Data Analysis for Cross-cultural Research. Thousand Oaks, CA: Sage. van de Vijver, F.J.R., van Hemert, D.A. and Poortinga, Y.H. (2008). Multilevel Analysis of Individuals and Cultures. New York: Lawrence Erlbaum Associates. Wallace, D. and Woodward, J.L. (1948–1949). Experience in the Time International Survey: A symposium. Public Opinion Quarterly 12: 709–711. Warwick, D.P. (1980). The politics and ethics of cross-cultural research, pp. 319–371 in H.C. Triandis and W.W. Lambert (eds), Handbook of Cross-Cultural Psychology, Volume 1. Boston: Allyn and Bacon. Warwick, D.P. (1993). The politics and ethics of field research, pp. 315–330 in M. Bulmer and D.P. Warwick (eds), Social Research in Developing Countries: Surveys and Censuses in the Third World. London: UCL Press. Willis, G.B. (2015).The practice of cross-cultural cognitive interviewing. Public Opinion Quarterly 79: 359–395. Wilson, E.C. (1958). Problems of survey research in modernizing areas. Public Opinion Quarterly 22: 230–234. Wuelker, G. (1993). Questionnaires in Asia, pp. 161–166 in M. Bulmer and D.P. Warwick (eds), Social Research in Developing Countries: Surveys and Censuses in the Third World. London: UCL Press.
PART II
Surveys and Societies
5
Surveys and Society
Claire Durand
This chapter examines the interrelationship between surveys and society, i.e., how changes in the society within which surveys arose affect how they are conducted and how they in turn influence society. Society is, in the broadest sense, the survey sponsor, and it therefore defines the terms of the contract. It influences polls, defining not only the survey questions but also the means used to reach respondents and the conditions under which survey results are analyzed and published. However, the products and results of surveys – e.g., standardized questions, two-way tables of means and proportions – also influence society. They define what is learned and how it is presented to the general public. The chapter examines the inter-influence between polls and society in four areas. It looks first at measurement, i.e., which questions are asked and how they are worded. It then goes on to examine data collection, data analysis, and the use of survey data. The chapter concludes with a review of emerging challenges in the ongoing process of adaptation between society and polls.
QUESTIONS, THEIR WORDING, AND THEIR CATEGORIES Survey sponsors, be they academia, governments, the media, or private business, define which information they require to achieve their interests. Policy makers need information about demographics, attitudes, and opinions as well as behaviors in order to be able to plan their interventions. The questions that can be asked vary historically both within and between societies (Sudman and Bradburn, 2004). However, over the course of its development, survey research has also tended to set in stone the way questions are asked, and this had an impact on society.
Measuring Demographics Demographics are seen as ‘universal’ factual information. However, the information that appears relevant changes along with society and differs among societies. The measurement process strives to fit individuals into
categories according to relevant concepts. For people to exist, statistically speaking, they must fit into some pre-determined category; falling into the ‘other’ category puts one in statistical limbo. If there is no category for same-sex partnership or for mixed origins, it is as if these realities do not exist in the public arena. At the same time, as some situations or conditions become common or visible enough, categories emerge to take them into account. In social science, sex and age have been considered major determinants of attitudes. Younger people, for example, are considered to be more liberal or more concerned with the environment. But when does ‘youth’ end and ‘adulthood’ begin? In survey research, age is usually grouped into categories influenced by legal factors (age of majority, of retirement, etc.) and statistical considerations (i.e., categories must be large enough). These categories end up taking on a life of their own. Young people are defined as 18 to 24 years old, sometimes as 18 to 29, and occasionally up to 34. Old age starts at 65. With life expectancy increasing steadily, new categories emerge like the young-old (65–79) and the old-old (80+), for example. Social class is currently measured using the concept of socioeconomic status (SES), a combination of income, education, and occupation. However, in countries where informal work is widespread, measures of ownership – of a house, a car, or a television – are used as measures of social class. Therefore, arriving at truly comparable measures of social class across countries is a tedious task. In many countries, marital status categories now include ‘separated or divorced’ and ‘living with a common-law partner’. In recent years, a category for living with or being married to a same-sex partner has appeared in some countries’ censuses. In others, household composition may include more than one spouse. All of these categories appear or disappear in conformity with the norms of the societies where the questions are asked. However, questions about sexual orientation remain extremely difficult to ask in most countries.
The measurement of ethnicity is not straightforward. Countries use widely varying categorizations to subdivide their population according to ethnicity; these include 'race', country of birth, ancestry, nationality, language, and religion (Durand et al., in press). Measuring ethnicity has been the focus of much writing and is not as simple as it may appear. First, categories that appear relevant in one country may be totally irrelevant in another. Second, the categories themselves are far from homogeneous. In the US, where skin color has been the main category used to classify people, 'black' encompasses recent immigrants of West Indian, Haitian, or African origin as well as descendants of former slaves. More recent immigration trends have led to the introduction of a new question on 'Hispanic/Latino' origin in the 1980 US Census, a category comprising Mexicans, Cubans, Central and South Americans, and Puerto Ricans (Prewitt, 2014; Simon, 1997). Third, in most countries, the proportion of the population with multiple origins is rapidly increasing. In Australia and Canada, for example, it was around 10% in 1980. By 2006, however, it had reached more than 25% in Australia and 35% in Canada (Stevens et al., 2015). Fourth, respondents tend to change their declaration according to the context in which the question is asked (Durand et al., in press). Aboriginal people in the Americas provide a good example of this process, called 'ethnic drifting' (Guimond, 2003). In Canada, the number identifying as Aboriginal, i.e., either First Nation/North American Indian, Inuit, or Métis, increased by 4.8% per year between 1991 and 2006 (Goldmann and Delic, 2014). Among those who identified solely as First Nation in the 2006 Canada census, 27.5% did not identify as such in 2001 (Caron-Malenfant et al., 2014). In addition, in both 2006 and 2012, around 30% of Aboriginals did not report the same self-identification in the Aboriginal Peoples Survey as in the Census taken only six months before (Durand et al., in press). In the US, the number of people reporting 'American Indian' as their race more than
tripled between 1960 and 1990 (Nagel, 1995) and it increased by 9.7% between 2000 and 2010 (Norris et al., 2012). These examples testify to the fact that people's willingness to report a given ethnicity may depend on a number of factors, including the social image of the different ethnic groups in a given society. Because of their breadth, censuses are where the necessity for new categories may first be noticed. Census data provide survey researchers with the information they need to compare sample composition with the general population in order to estimate and correct for biases. Therefore, surveys are dependent upon statistical agencies and their decisions, including those concerning the categories they use. The process of categorization creates and highlights groups of people. This can have either positive or negative impacts. People may or may not decide to identify with a given group for a number of reasons, one being the social image of that group (Prewitt, 2014; Simon, 1997). In the aftermath of World War II, countries like France, concerned with the possible impact of such categorizations, decided not to ask questions about origin or religion in their census.
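To make the link between census categories and bias correction concrete (a minimal illustrative sketch, not a procedure described in this chapter), a simple post-stratification weight for respondents falling into census category c can be written as

\[
w_c \;=\; \frac{N_c / N}{\,n_c / n\,},
\]

where N_c and N are the category and total counts in the census, and n_c and n are the corresponding counts in the realized sample. A category that is under-represented among respondents (n_c/n smaller than N_c/N) receives a weight greater than one. The correction is only as good as the categories themselves: if the census and the survey define ethnicity, age groups, or household types differently, the weights inherit that mismatch.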
Measuring Opinions and Behaviors
Opinion polls aim to measure public opinion. There are a number of assumptions at the foundation of this process, among them the idea that 'widely disparate or even idiosyncratic values can be expressed in standardized ways and that these expressions do not alter meanings relevant to decisions' (Espeland and Stevens, 1998: 324). A primary goal of surveys is to classify people according to their opinions and behaviors on the one hand and their characteristics and demographics on the other. When these two groups of measures are compared, a profile of potential 'clients' for political parties, social programs, and manufactured products emerges. To this end,
surveys have avoided natural language and open-ended questions in favour of a specialized dialect (Smith, 1987), i.e., instead of ‘What do you think of your government?’, they ask ‘Are you very satisfied, somewhat satisfied, not very satisfied or not at all satisfied with your government?’. This process is essential in order to compare answers to survey questions with one another. Standardization means that every single concept that the survey aims to measure must be translated into this survey questionnaire ‘dialect’. The need for standardization gave rise to typical formats such as the Likert scale of agreement. People became accustomed to these formats and learned to answer using them. Concepts such as satisfaction, agreement with a policy, or interest in a topic are now naturally linked to standard response formats in people’s minds. Standardized formats appear to make all questions simple and easy to ask, but this apparent clarity tends to oversimplify issues. Complex issues such as war strategies, protection of individual rights, or immigration policies require much information and thought. Yet they are nonetheless translated into short questions, which respondents must answer using a response scale with a limited number of response options. Political stakeholders then use these survey results to justify their policies. This is where opinion polls are vulnerable: they are influenced by current socio-political debates and dependent on the interests of political actors. In these cases, different polls on the same topic may generate different results due to question wording or question order, supplying ammunition to critics who say that polls can be manipulated to say anything.
COLLECTING DATA: MULTIPLE MODES, MULTIPLE BARRIERS The conduct of surveys is dependent on available resources within a society. In general there are three fundamental requirements: a
universal means of communication with respondents, availability and access to coordinates from which to select and contact them, and the willingness of respondents to participate. Availability of various means of communication has varied over time and between countries. However, the process by which survey researchers try to access respondents, along with the reaction that emerges to block this access, appears to be a universally prevalent cat-and-mouse game. As new means of communication develop, so do ways to protect respondents from unwanted intrusion.
Surveys using the Interview Mode
Surveys may be carried out by interviewers, who contact respondents, convince them to participate, and ask them questions. Initially, surveys using the interview mode were mostly conducted face-to-face. People were selected at random within areas, based on a number of criteria, and asked to complete a survey. This meant that it was feasible to knock on people's doors or stop them in the street, and that most people would kindly answer the questions they were asked. Face-to-face is still the preferred mode of data collection for many national statistical agencies and in countries that either have a small territory and a dense population or have difficulty accessing the whole population by other means. However, in many countries, apartment buildings are now equipped with security devices, making it very difficult to reach people who live in these types of dwellings. In countries with large territories and a dispersed population, face-to-face surveys are very costly; as soon as telephone use became widespread, it was quickly adopted as almost the sole mode for conducting surveys. Initially, directories were used to select sample members; however, it soon became possible for people to have their telephone number unlisted to avoid being bothered by unwanted phone calls. Researchers Warren Mitofsky and Joseph Waksberg developed
random digit dialing (RDD) to select any telephone number, listed or not (Montequila, 2008). However, when households began using more than one telephone line, any of which could be selected using RDD, it became more difficult to estimate any given respondent's probability of inclusion and to produce statistically reliable results. In recent years, mobile telephones have emerged and spread very rapidly. As of 2013, in developed countries, there was already more than one mobile phone subscription per person; in the developing world, mobile penetration was estimated at 89%, and in Africa at 63% (http://www.internetworldstats.com/mobile.htm). With mobile phones, sampling frameworks change from being household-based to individual-based. In developed countries, where frameworks for mobile and landline telephones must be used concurrently, it therefore becomes difficult to estimate any given person's probability of inclusion, taking into account the relative use of each telephone to which that person has access. To get around this problem, other means have appeared, such as address-based sampling, which aims to create lists that contain each household only once. However, in most developing countries, the emergence of mobile phones as a widespread means of communication has created new opportunities for survey research in areas where most surveys could previously only be conducted face-to-face. But sampling and cost are not the most difficult challenges faced by telephone surveys. As technology makes it easier to reach people, other technologies such as answering machines, voice mail, and caller ID allow people to decide whether or not to answer incoming calls. The result is that as people become easier to reach, it is likewise easier for them to avoid being reached.
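To make the multiplicity problem concrete (a stylized sketch under simplifying assumptions, not a formula given in this chapter), suppose telephone numbers are drawn with equal probability from a combined frame of landline and mobile numbers. A person who can be reached through k distinct numbers then has a selection probability roughly proportional to k, and the usual correction deflates the design weight accordingly:

\[
\pi \;\propto\; k, \qquad w \;\propto\; \frac{1}{k}.
\]

Applying this correction requires asking respondents how many numbers reach them and how intensively each one is used, which is precisely the information that is hard to collect reliably; this is part of the appeal of address-based sampling frames that list each household only once.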
Self-administered Surveys Self-administered surveys avoid social desirability bias, which can occur when an
interviewer is present, and they are preferable when asking sensitive or threatening questions (Sudman and Bradburn, 2004). Survey questionnaires can be printed and sent by mail; however, in many countries, the lack of complete mailing lists for the entire population or privacy laws making it difficult to build such lists mean that mail surveys are not practical for surveying the general population. For this reason, they have been primarily used either to survey specific populations or by government departments or other organizations that have access to such lists. In recent years, the Internet has emerged as an alternative method for conducting self-administered surveys. Though it has not yet become a universal means of communication, by 2015, over 88% of the population in North America and 73% of the European and Oceanian population had Internet access. Penetration had reached over 50% in Latin America and in the Middle East, 40% in Asia and close to 30% in Africa (http://www.internetworldstats.com/stats.htm). It is foreseeable that bias due to incomplete coverage may soon become negligible in most countries (Traugott, 2012). Like mail surveys, web surveys face the problem of building a sampling frame. Two solutions have emerged. One is to use telephone or mail surveys to recruit respondents to take part in panels for a limited period of time; Internet access is then provided to respondents who do not already have access. This framework is probabilistic in essence but very costly. The other solution is to recruit a convenience sample through various means, including websites where respondents may sign in. The latter is much less expensive, so that in small markets where cost is a main factor – small countries and sub-national constituencies – most electoral polls are now conducted using this method, called 'opt-in' web surveys. Of course, if and how these opt-in surveys or access panels may be representative of the general population is subject to much debate (Traugott, 2012). Web surveys are a non-intrusive, flexible, and inexpensive way of inviting people to
answer self-administered surveys. Like mail surveys, they have the advantage of allowing people to complete a survey when it is convenient to do so; but in the absence of pressure to answer, these surveys have difficulty reaching acceptable cooperation rates (Durand, 2014). However, as email becomes a main means of communication, some countries have passed laws requiring organizations to obtain prior approval before contacting someone by email if that person does not already have a relationship with the organization. Such laws will likely affect pollsters' ability to use the Internet to conduct surveys. Very recently, interactive voice response (IVR) – where people are contacted by recorded voice message and answer using the telephone keypad – has developed as an alternative for short surveys or as a complement to classical telephone surveys. This technology gets around the increasing cost of paying interviewers and avoids the possible social conformity bias associated with interviews. It may be used during classical telephone interviews to switch to questions that might be threatening or that are prone to social desirability biases, such as vote intention. The spread of mobile phone technology also allows for self-administered surveys by SMS. However, in some countries, such as the US, robo-calling cell phones, which IVR technology requires, has been prohibited. Finally, it was thought that social media might become another means to conduct surveys, but social media providers soon blocked any attempts to send bulk requests, regardless of the sender. In short, with each new method of reaching people, new barriers have arisen to protect people from unwanted intrusions. The 'right not to be bothered' has become an integral part of modern life.
THE CHOICE OF STATISTICS AND THEIR IMPACT ON SOCIETY Once collected, data must be made available in a timely fashion and presented in a way that is both accurate and understandable by
end users. The availability of survey data has made it possible to build archives, which can be used for innovative research, both historical and contemporary. Technological changes have influenced these capabilities, but ‘just in time’ availability of survey data has also changed how societies operate.
Availability of Data Archives
Thousands of surveys have been conducted since the emergence of survey research. Some of these surveys have been archived and may be retrieved from a number of public or academic sites. In some cases, they can be analyzed using web-based software available on-site. As a result, in some countries it is possible to monitor changes in public opinion on topics such as capital punishment, euthanasia, trust in institutions, or on certain major political debates, going as far back as the 1950s or 1960s. These archives also allow comparisons to be made between countries, an endeavor that would have been almost impossible before the emergence and dissemination of polls (Lagos, 2008; Mattes, 2008). In trying to understand social changes and how various events have impacted these changes, survey archives constitute a mine of information that can help put survey results in perspective for the general public and can contribute to public debate. Initiatives like the Barometers – Euro Barometer, Latino Barometro, Arab Barometer, Asia Barometer, and Afro Barometer – the World Values Surveys and the International Social Survey Programme allow for international comparisons of change over time for similar opinions and attitudes. In recent years, data gathered by statistical agencies – including censuses – have been made available to researchers in many countries. These initiatives to democratize access to data also open windows for new and more sophisticated analyses that will help build a more comprehensive understanding of social change. For instance, census data may signal the emergence of new social phenomena, such as the growth of couples of mixed ethnicity, new types of family units, or changing income distributions within households. This shows the way for what surveys will have to take into account in the future.
Analysis of Survey Data
Technology has not only changed how surveys are administered; it has also facilitated data recording and analysis. Except for mail surveys, data collection is now usually computer-assisted and recorded concurrently. This gives very quick access – sometimes in real time – to survey results. Once data become available, however, they must be analyzed. Data analysis has gone from hand compilation to the use of mainframe computers accessible only to a few specialists to the current use of user-friendly – and increasingly free and open-access – software for microcomputers. Meanwhile, the presentation of survey results has become automated and standardized in the form of tables showing the distribution of the main variables of interest and tabulations of these same variables with relevant demographics. While these products help to reveal the specific situations of various social groups, they also contribute to shaping and 'setting in stone' the images of these groups (Desrosières, 2002). This process creates an essentially homogeneous portrait of various groups, in which the emphasis is placed on average characteristics, behaviors, or attitudes, not on heterogeneity and diversity (Espeland and Stevens, 1998). For instance, all women tend to be seen as earning less than men, all young people as less interested in politics than older people, and all poor people as unemployed. In this respect, survey data have helped put flesh on the bones of social categories. Members of different social groups are trapped in the images that surveys attribute to them, and it is difficult for members of disadvantaged groups to free themselves from their associated stereotypes.
Data analysis has also been used to justify a priori positions. For example, factor analysis has been used in the debate on whether intelligence is a one-factor or a two-factor concept (Gould, 1996). The one-factor theory gave rise to most of the Intelligence Quotient (IQ) tests. Herrnstein and Murray (1994) used a sophisticated statistical apparatus to justify a racial theory of intelligence, a theory that has been much criticized, notably by Gould (1996).
THE USE OF SURVEY RESEARCH
The relatively low cost of polls and the timely availability of poll results have contributed to making polls ubiquitous in the socio-political sphere. Political stakeholders request poll results, and polls may also be pushed on political actors by various socioeconomic interest groups.
Electoral Polls and Their Influence
Knowing people's opinions is of no use unless one wants or needs to take them into account. Polls go hand in hand with parliamentary democracy, and as democracies emerge in most parts of the globe, so do opinion polls (Blondiaux, 1998; Lagos, 2008; Mattes, 2008). One of the most publicly known uses of surveys is the electoral poll. With technological advances, it has become possible to conduct polls and access the results in a matter of hours. The end of the twentieth century saw a proliferation of polls conducted over short periods of time and published right up to the end of the electoral campaigns. The availability of poll results has changed the way electoral campaigns are conducted. In countries where the head of state or prime minister decides when the election is called, polls can also influence the timing of election launches. Polls have taken an important role in election campaigns, as they prompt political
stakeholders to adjust their campaign discourse based on 'public opinion'. They sometimes serve as a criterion to decide whether a candidate will be invited into an electoral debate. They may also influence party activists, galvanizing or demoralizing them depending on poll results. From the very first use of polls in electoral campaigns, the notion that they may influence voters' choices or turnout has been an issue (Gallup and Rae, 1940). Though this possibility is commonly held to be true by political stakeholders, the media, and voters, evidence that polls regularly influence voters strongly enough to change election outcomes has yet to be found (Hardmeier, 2008). However, polls are not always accurate. As of 2010, the academic literature had documented nearly 50 instances of polls going wrong (Durand et al., 2010), and new instances have appeared regularly since then, such as in US primaries, provincial elections in Canada, and general elections in Israel, Great Britain, and Austria. Researchers have sought and are still seeking to determine why such situations occur. Were these polls really inaccurate, or did people simply change their minds at the last minute or fail to show up, in part because of poll results? The inter-influence between poll results and the course of electoral campaigns thus remains an open question. Election polls have also been important for methodological reasons. Elections are almost the only situation where polls can be compared with what they aim to measure, i.e., the vote. Consequently, electoral polls constitute an essential tool for estimating poll bias, and research using these polls has led to significant improvements in survey methodology.
Opinion Polls, Surveys, and Public Policy ‘Just in time’ poll results have helped define a new socio-political environment in which politicians and pressure groups alike must take public opinion into account and try to
use it to advance their own agendas. The emergence of polls has placed a new burden on all political stakeholders to sway public opinion in favor of their positions if they are to achieve their goals (Blondiaux, 1998). This raises a number of questions. For instance, do polls create public opinion or do they measure existing opinions? Sometimes surveys ask questions that are not current topics of public debate and about which people may not have all the information necessary to take an enlightened position, or that pertain to topics which may need advanced technical knowledge to comprehend. To what extent should public opinion be called upon in political decisions on specific issues, and how can it be reliably measured? This is a key question in the interaction between society and surveys. The principle behind representative democracy is that those elected act on behalf of the people they represent, work through the information provided concerning a given project or policy, and decide what they feel is in the best interests of their electorate and of the country as a whole. The emergence of polls has changed the way representative democracy works (Blondiaux, 1998). Many countries have experienced clashes between public opinion and the protection of minority rights – issues like gay rights; the right to wear religious garb such as yarmulke, turban, or hijab; or the right to build minarets. Who should prevail when minority rights clash with majority opinion? In most of these debates, polls have played and still play a role in the arguments political stakeholders advance in support of their positions.
FUTURE DIRECTIONS How surveys are and will be conducted depends on each country’s situation, history, territory, population density, and composition, along with its socio-political and technological development. As democratic movements emerge around the world, surveys and polls
spread and become tools in the hands of political leaders and governments, but also of the population itself. In a complex world, what are the questions that can and should be asked? Is it possible to convey the complexity of all social challenges using simple survey questions? Pollsters and survey researchers must learn how to assemble pools of questions that give a reliable image of the real 'pulse of democracy', as Gallup put it. To understand the impacts of globalization and the changes brought about by demographic diversification, our approach to demographics must be reviewed in order to understand which demographic categories, if any, are now related to individual opinions, attitudes, and behaviors, and why. Will categories change or become useless? Is it possible that in the near future, none of the demographics that were once related to people's answers will maintain these relationships? It will be a challenge to determine new predictors of people's opinions and, eventually, of their voting behavior or other kinds of behavior. Changes in data collection methods are reminders that opinion polls and surveys require cooperation from their subjects and from society as a whole. There is no such thing as an intrinsic right to communicate with a perfect stranger, whatever good reasons pollsters may think they have for doing so. Survey research is being conducted in an environment in which the means available to avoid communication are increasing at the same pace as the means available to communicate with potential respondents. What is the future of survey research if there is no access to survey respondents? The current situation has not occurred overnight or for no reason. Credibility is essential to gaining respondent cooperation, and surveys may have lost some public credibility. People receive too many requests, they are required to answer questionnaires that are often too long and not always well constructed, and they are asked to answer at the
exact moment they are contacted. How can survey researchers convince people to cooperate? New methods are being developed that make use of panel designs and that request cooperation for a limited number of surveys over a limited period of time, for instance in some of the web surveys using probabilistic sample frameworks. Research is being done on multimode surveys that allow people to answer using their preferred mode (Dillman et al., 2014). At the same time, making survey results more accessible and usable in people's lives would help surveys and polls regain credibility and encourage people to take part in them. All the tools needed to perform more sophisticated data analysis that would provide access to information about variations in opinions and behaviors are readily available and accessible. Established standard methods of presenting results can be misleading. For survey research to be useful to the general public, data analysis must move away from simple, if not simplistic, tables and toward more refined analysis. New standard products may emerge that combine advanced statistical procedures with technology that facilitates visual presentation in the form of figures and graphs to convey the complexity and diversity of modern society in an easy-to-understand fashion. Archives of survey data are an excellent tool for tracking social change. However, such archives have not been established in all countries and are far from comprehensive. Creating such archives is a vital endeavor, one that must be accomplished before too much data disappear. These archives are the foundation upon which future historians will base their understanding of the last century. To make full use of the data they contain, research is required to facilitate the combination of sources that measure similar concepts in different ways. Finally, what is and what should be the role of surveys in society? Political stakeholders and the general public alike are still in the process of learning when polls are likely to
be useful and reliable and when they are not. One thing is certain: surveys and polls are here to stay, and we are still discovering new ways to conduct and use them.
RECOMMENDED READINGS
For those who read French, a social history of the emergence of public opinion polls in the US and in France can be found in Blondiaux (1998). Desrosières (2002) provides a great synthesis of the history of statistics, tracing the relationship between states and the production of statistics, which both influence decision making and are influenced by it. Dillman et al. (2014) is a major text combining theory and practice of surveys, including how to take into account the social context in which they are conducted. A history of commensuration as an instrument of social thought and as a mode of power, and of its influence on sociological inquiry, can be found in Espeland and Stevens (1998). First published in 1981 and updated in 1996, The Mismeasure of Man by Gould (1996) demonstrates how statistical analysis has been used to justify biological determinism.
REFERENCES
Blondiaux, L. (1998). La fabrique de l'opinion, une histoire sociale des sondages [The fabrication of opinion, a social history of opinion polls]. Paris: Seuil.
Caron-Malenfant, E., Coulombe, S., Guimond, E., Grondin, C. and Lebel, A. (2014). La mobilité ethnique des Autochtones du Canada entre les recensements de 2001 et 2006 [The ethnic mobility of Aboriginal people in Canada between the 2001 and 2006 censuses]. Population, 69(1), 29–55. DOI: 10.3917/popu.1401.0029.
Desrosières, A. (2002). The Politics of Large Numbers. A History of Statistical Reasoning. Cambridge: Harvard University Press.
Dillman, D. A., Smyth, J.D. and Christian, L.M. (2014). Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method, 4th edn. New York: Wiley.
Durand, C. (2014). Internet Polls: New and Old Challenges. Proceedings of Statistics Canada 2013 Symposium: Producing Reliable Estimates from Imperfect Frames, http://www.mapageweb.umontreal.ca/durandc/Recherche/Publications/Statcan2013/07A1_Final_Durand_Eng.pdf, accessed December 15, 2015.
Durand, C., Goyder, J., Foucault, M. and Deslauriers, M. (2010). Mispredictions of Electoral Polls: A Meta-analysis of Methodological and Socio-Political Determinants Over Fifty Years, ISA 17th Congress, Goteborg, Sweden, July 15–21, 2010. http://www.mapageweb.umontreal.ca/durandc/Recherche/Publications/pollsgowrong/Pollsgowrong_AIMS2010.pdf, accessed December 15, 2015.
Durand, C., Massé-François, Y.-E., Smith, M. and Pena-Ibarra, L. P. (in press). Who is Aboriginal? Variability in Aboriginal Identification between the Census and the APS in 2006 and 2012. Aboriginal Policy Studies, 6(1).
Espeland, W. N. and Stevens, M. L. (1998). Commensuration as a Social Process. Annual Review of Sociology, 24, 313–343.
Gallup, G. and Rae, S. F. (1940). Is There a Bandwagon Effect? Public Opinion Quarterly, 4(2), 244–249.
Goldmann, G. and Delic, S. (2014). Counting Aboriginal Peoples in Canada, in F. Trovato and A. Romaniuk (eds), Aboriginal Populations: Social, Demographic, and Epidemiological Perspectives. Edmonton: University of Alberta Press, pp. 59–78.
Gould, S. J. (1996). The Mismeasure of Man. New York: W.W. Norton & Co.
Guimond, E. (2003). Changing Ethnicity: The Concept of Ethnic Drifters, in J. P. White, P. S. Maxim and D. Beavon (eds), Aboriginal Conditions: Research as Foundation for Public Policy. Vancouver: UBC Press, pp. 91–109.
Hardmeier, S. (2008). The Effects of Published Polls on Citizens, in M. Traugott (ed.), Handbook of Public Opinion Research. London: Sage, pp. 504–515.
Herrnstein, R.J. and Murray, C. (1994). The Bell Curve: Intelligence and Class Structure in American Life. New York: Free Press.
Jones, N.A. and Bullock, J. (2012). Two or More Races Population: 2010, US Census Bureau 2010 Census Briefs, https://www.census.gov/prod/cen2010/briefs/c2010br-13.pdf, accessed December 15, 2015.
Lagos, M. (2008). International Comparative Surveys: Their Purpose, Content and Methodological Challenges, in M. Traugott (ed.), Handbook of Public Opinion Research. London: Sage, pp. 580–593.
Liebler, C. A., Rastogi, S., Fernandez, L. E., Noon, J. M. and Ennis, S. R. (2014). America's Churning Races: Race and Ethnicity Response Changes between Census 2000 and the 2010 Census. CARRA Working Paper Series #2014–09. Washington: US Census Bureau.
Mattes, R. (2008). Public Opinion Research in Emerging Democracies, in M. Traugott (ed.), Handbook of Public Opinion Research. London: Sage, pp. 113–122.
Montequila, J. M. (2008). Mitofsky–Waksberg Sampling, in P. Lavrakas (ed.), Encyclopedia of Survey Research Methods. London: Sage, pp. 472–473. http://dx.doi.org/10.4135/9781412963947.n299.
Nagel, J. (1995). American Indian Ethnic Renewal: Politics and the Resurgence of Identity. American Sociological Review, 60(6), 947–965.
Norris, T., Vines, P.L. and Hoeffel, E.M. (2012). The American Indian and Alaska Native Population: 2010, US Census Bureau 2010 Census Briefs, http://www.census.gov/prod/cen2010/briefs/c2010br-10.pdf, accessed December 15, 2015.
Prewitt, K. (2014). What Is Your Race? The Census and Our Flawed Efforts to Classify Americans. Princeton: Princeton University Press.
Simon, P. (1997). La statistique des origines [The statistics of origins]. Sociétés Contemporaines, 26, 11–44.
Smith, T. W. (1987). The Art of Asking Questions. Public Opinion Quarterly, 51, S95–S108.
Stevens, G., Ishizawa, H. and Grbic, D. (2015). Measuring Race and Ethnicity in the Censuses of Australia, Canada, and the United States: Parallels and Paradoxes. Canadian Studies in Population, 42(1–2), 13–34.
Sudman, S. and Bradburn, N. M. (2004). Asking Questions. San Francisco: Jossey-Bass.
Traugott, M. (2012). Methodological Trends and Controversies in the Media's Use of Opinion Polls, in C. Holtz-Bacha and J. Stromback (eds), Opinion Polls and the Media. New York: Palgrave Macmillan, pp. 69–88.
6
Defining and Assessing Survey Climate
Geert Loosveldt and Dominique Joye
INTRODUCTION In Wikipedia, the Internet encyclopedia, climate is defined as: a measure of the average pattern of variation in temperature, humidity, atmospheric pressure, wind, precipitation, atmospheric particle count and other meteorological variables in a given region over long periods of time. Climate is different than weather, in that weather only describes the short-term conditions of these variables in a given region. (http://en.wikipedia.org/wiki/Climate, accessed January 29, 2014)
This classical climatological definition reveals the basic characteristics of a climate: it refers to average values of different aspects, it is regional, and it is only observable over a longer period. It differs from temporary weather. What does this general definition of climate imply for describing the survey climate, and why do survey researchers need information about the survey-taking climate? The definition of climate makes clear that the survey climate is not an individual characteristic but refers to the general context, and that measurements of several aspects over a longer period must, at a minimum, be taken into account to describe the survey climate. Just as normal climatological conditions affect (among other things) the kind of crops that are suitable in a particular region, the survey climate determines the design and efforts needed to implement surveys in an adequate and efficient way and to obtain high-quality data. When most people are reluctant to share personal information and communicate with external agencies, it will require much effort to collect personal data. Interviewers should also be aware of the survey climate conditions in which they will have to work. Should interviewers be prepared for a hostile environment, or can they expect a generally warm welcome? Therefore, information about the survey climate is also relevant for the training of interviewers. In the context of cross-national surveys, the survey climate makes clear that in some
countries particular procedures are perhaps more appropriate than others or that more effort is necessary to keep the same data quality over time (e.g. in different waves). In the next section, we search for the origins of the concept to trace the relevant aspects and dimensions of the survey climate. This will allow a more precise definition of the concept, which is necessary to answer the question how we can measure the survey climate.
BACKGROUND The origin of the concept of survey climate is situated in a paper about nonresponse research at Statistics Sweden (Lyberg and Lyberg, 1991). It refers to the negative trend in nonresponse rates that started with data collection problems during the 1970 census of population and housing. During the 1970 census, negative publicity in the media and individual protest stimulated a public debate on data collection and privacy. The debate led to the Swedish Data Act, which regulates data collection procedures and data storage more precisely (e.g. advance letters must emphasize that participation in a survey is voluntary). In this seminal paper, the survey climate is not precisely defined. The generally negative atmosphere and the regulation of data collection activities are considered important aspects. To monitor the survey climate, a nonresponse barometer that presents a time series of nonresponse rates for particular surveys during a particular period was suggested. This means that the general willingness to participate in surveys as indicated by response rates was considered as a general reflection of the survey climate. The survey climate is also mentioned in the conceptual framework for survey cooperation of Groves and Couper (Groves and Couper, 1998). The basic assumption of this framework is that the respondent’s decision to participate or not is directly and strongly influenced by the interaction between the
respondent and the interviewer. This interaction can be influenced by different types of factors. The factors related to the survey design (e.g. topic, mode of administration, etc.) and the interviewer (e.g. experience, expectations, etc.) are under the researcher's control. Other factors related to the social environment and the respondent or household characteristics are out of the researcher's control. The survey-taking climate is considered a characteristic of the social environment that determines the response rate. Here the survey climate refers to the number of surveys conducted in a society and the perceived legitimacy of surveys. Monitoring how many surveys are organized and on what subjects can be considered as an important source of information to describe the survey climate. For example, according to the Swiss Association of Survey Providers, each household is, on average, contacted for a survey each year (VSMS, 2013: 10). Another type of research that is relevant in the context of survey climate is the so-called surveys on surveys. In these surveys, respondents are questioned about several aspects of survey or opinion research. The first national survey or poll on polls was conducted by Cantril in November 1944, and the results were published by Goldman in Public Opinion Quarterly (Goldman, 1944–45). In that survey, a small number of questions were asked about the accuracy and the usefulness of poll results and the role of polls in democracy. This basic information about 'what impression the polls have made on the public itself' (Goldman, 1944–45: 462) was considered as relevant information to contextualize the discussion about the accuracy and fairness of the results of public opinion polls during that period. The general assumption is that public opinions about surveys and polls influence the decision to participate in surveys. Therefore, these opinions about surveys and polls can be considered as relevant contextual information for survey researchers and pollsters. Although thinking about the survey climate was initiated and stimulated by negative
experiences such as decreasing response rates and increasing concerns about privacy, it is clear that a positive survey climate is considered a necessary condition for the easy and smooth organization and implementation of high-quality surveys. The exploration of the origin and background of the survey-taking climate concept makes clear the most relevant characteristics to describe the survey climate. The concept refers to the public's general willingness to cooperate in surveys and to societal and organizational factors that influence this willingness. The extent to which people consider survey research and polls to be useful and legitimate is an example of an important societal factor that characterizes the survey climate. The capacity of a fieldwork organization to implement efficient fieldwork procedures is an example of a relevant organizational factor. So the general willingness to participate in surveys, the public opinion about surveys, and the way surveys and polls are organized and reported in the media are considered the key dimensions of the survey climate. In the next sections, these elements that are necessary to characterize the survey climate are discussed.
MEASUREMENT OF SURVEY CLIMATE To describe the survey climate, information is necessary at both the individual (willingness to participate and opinions about surveys) and the societal level.
Willingness to Participate The assessment that the survey climate was deteriorating was primarily inspired by the general observation that response rates in several surveys were decreasing. Although response rates are in general considered as valid indicators for the willingness to participate in surveys, some comments can be made. Response rates not only depend on the
individual's decision to participate but are also influenced by the fieldwork organization's capacity to implement an adequate contact and persuasion strategy. This means that a response rate is not a perfect indicator or expression of the respondent's willingness to participate. This viewpoint is supported by a trend analysis of the response rates in the European Social Survey (ESS). The ESS is a biennial face-to-face survey organized in as many European countries as possible. The first round was organized in 2002 (www.europeansocialsurvey.org/). In each country, data collected by a standardized contact form are used to calculate the response rates in a standardized way. The results of a trend analysis show that there are differences between and within countries (Loosveldt, 2013). Against the general expectation, in some countries there is an increase in the response rates over time. Does this mean that the survey climate is becoming more positive, or does it mean that the survey implementation was more adequate? Sometimes the change in response rate can be linked with a change in fieldwork organization. When response rates are used to compare the survey climate in different countries or at different points in time, it is important that the response rates are calculated in exactly the same way. This is less obvious than it seems at first glance. In particular, the definition of categories (e.g. ineligible) and the categories used in the calculation of the nonresponse rate are crucial. So it is important that a conclusion about differences in survey climate is not based on differences in the format and content (categories) of contact forms, the way they are used, and the way the response rates are calculated. Information about the unwillingness to participate is also relevant to assess the survey climate. Paradata on the 'Reasons for refusals' collected by means of the contact form is a first source of information about why people are unwilling or refuse to participate. Based on the doorstep interaction with the contacted sample unit, interviewers are
Information about the unwillingness to participate is also relevant to assess the survey climate. Paradata on ‘reasons for refusals’ collected by means of the contact form are a first source of information about why people are unwilling or refuse to participate. Based on the doorstep interaction with the contacted sample unit, interviewers are asked to register on the contact form all of the reasons mentioned during the doorstep interaction. These reactions to the request to participate are available for both respondents and non-respondents. The usual reasons listed on the contact form are: (1) No time or bad timing (e.g. sick, children, etc.); (2) Not interested; (3) Don’t know enough/anything about subject, too difficult for me; (4) Don’t like subject; (5) Waste of time; (6) Waste of money; (7) Interferes with my privacy / I give no personal information; (8) Never do surveys; (9) Cooperated too often; (10) Do not trust surveys; (11) Previous bad experience; (12) Do not admit strangers to my house / afraid to let them in. Notice that the first four reasons cannot be considered an indication of unwillingness; it is possible that respondents giving such a reaction will react positively to a future request to participate. The other reasons, however, express a negative attitude towards surveys. The survey climate can be considered more negative when many respondents mention one or several of these reasons or when the frequency of respondents giving one of these reasons increases. Results from analyses of reasons for refusals show that ‘no interest’ and ‘no time’ or ‘bad timing’ are the most frequently registered reactions (Couper, 1997; Stoop et al., 2010: 151). The relevance of these kinds of doorstep concerns for the measurement of the survey climate is demonstrated by the fact that they significantly improve the predictive power of nonresponse models (Bates et al., 2008). Once again, it is important to stress that the results of this kind of analysis are only useful and comparable when interviewers receive the same instructions about how to interpret and register the respondent’s reaction during the doorstep interaction. To get more in-depth information about respondents’ unwillingness and reasons for refusals, one can organize a nonresponse survey (Stoop et al., 2010; Ernst Stähli and Joye, 2013). In such a survey, non-respondents are asked a small number
of key questions from the survey. This can be done as a conclusion of the doorstep interaction (PEDAKSI: Pre-Emptive Doorstep Administration of Key Survey Items), or one can re-approach non-respondents by means of another survey mode shortly after the initial request to participate (e.g. a short mail questionnaire after a refusal of a face-to-face interview). The information obtained in this way is extremely useful for comparing respondents and initial non-respondents and for assessing the nonresponse error. Although the main objective of nonresponse surveys is to obtain information about the key substantive variables, one can also ask a limited number of questions about surveys (e.g. ‘Do you agree or disagree with the statement: surveys are valuable for the whole of society?’). It is important to underline that the answers to such questions differ considerably between respondents and non-respondents, which means that the social value of surveys is, among other things, an important parameter governing participation. The reasons for refusals and participation and experience with previous interviews are also important in this context (MOSAiCH, 2009). This kind of information can be used to characterize some aspects of the survey climate. In line with the short nonresponse questionnaire, one can also integrate the same questions about surveys and experience with surveys into the main questionnaire and ask them of the respondents. Moreover, it is good practice to ask at the end of the interview some questions about the previous interview, interview experience in the past and the intention to participate in the future, showing, once again, that past experiences are also important for predicting future behavior (MOSAiCH, 2009). Analysis of response rates in ongoing surveys, information about reasons for refusals and comparison of these reasons over time are important to document the trend in willingness to participate in surveys. Public opinions about surveys, which are discussed in the next section, can be considered as factors that have an impact on
this willingness. In this regard, these opinions are also relevant to characterize the survey climate.
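As a purely illustrative aid, the following sketch shows how the doorstep refusal reasons discussed above might be tallied across survey rounds to monitor a shift from ‘soft’ towards ‘hard’ reasons. The contact-record layout and reason codes are hypothetical and do not reproduce any particular contact form.

```python
# A minimal sketch of monitoring doorstep refusal reasons across survey rounds.
# The contact records and the reason codes below are hypothetical; they mirror
# the kinds of categories listed above but are not the ESS contact form.
from collections import Counter

SOFT_REASONS = {"no_time", "not_interested", "dont_know_subject", "dislike_subject"}

def reason_shares(contact_records):
    """Return each reason's share of all reasons registered on the contact forms."""
    counts = Counter()
    for record in contact_records:
        for reason in record["reasons"]:
            counts[reason] += 1
    total = sum(counts.values())
    return {reason: n / total for reason, n in counts.items()}

round_2012 = [
    {"reasons": ["no_time"]},
    {"reasons": ["not_interested", "waste_of_time"]},
    {"reasons": ["privacy"]},
]
round_2014 = [
    {"reasons": ["privacy", "never_do_surveys"]},
    {"reasons": ["no_time"]},
    {"reasons": ["privacy"]},
]

for label, records in [("2012", round_2012), ("2014", round_2014)]:
    shares = reason_shares(records)
    # An increase in 'hard' reasons (everything outside SOFT_REASONS) would be
    # read as a sign of a deteriorating survey climate.
    hard = sum(s for r, s in shares.items() if r not in SOFT_REASONS)
    print(label, round(hard, 2))
```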
Public Opinions about Surveys: Surveys on Surveys
To obtain information about public opinions about surveys, one can organize a survey. In fact, a survey about surveys can be considered a broadening of the nonresponse survey discussed in the previous section. As already mentioned, the first survey on surveys was organized by Cantril in 1944 as a reaction to widespread doubt about the accuracy of polls. Another influential paper, ‘A questionnaire on questionnaires’, was published by Sjoberg in 1955. This survey with ‘suggestive findings’ was presented as ‘an introduction to a relatively neglected area’. The principal question of the survey was ‘What attitudes do people have toward questionnaires and the interviewing process?’ (Sjoberg, 1955: 423). In this paper, the author noted that insufficient attention was being given in the literature to the attitudes of various persons toward interviewing and questionnaires. He argued that this kind of information is needed for interpreting respondents’ answers to substantive questions. Although this kind of survey never disappeared, attention to public opinion about polls remained limited. In a paper entitled ‘Trends in surveys on surveys’, published in Public Opinion Quarterly, the authors note that the data available for a long-term trend analysis of attitudes toward surveys are scant (Kim et al., 2011). In the next sections, we look at some important methodological aspects and the content of such surveys. Before discussing these points, however, we note another possible way of obtaining information about the survey climate: asking the opinion of experts. For example, we have data collected from the ISSP participating countries in 2013. In general, the answers show that the respondents are very sensitive to privacy
issues, even if they trust the institutions in charge of the survey. In any case, lifestyle issues, such as rarely being at home or living in dwellings with restricted access (intercoms, etc.), are a problem; a lack of cooperation is mentioned in nearly half of the countries. In this sense, we can indeed observe differences in survey climate, and also a geographical pattern: many countries in Eastern Asia and Western Europe mention a general lack of cooperativeness from the target population, while this is seen as less of a problem in South America or even Eastern Europe, for example.
Methodology
Surveys on surveys are confronted with a major methodological problem: reactivity, or the effect that the measurement instrument has on the measurements. This problem was clearly formulated by Goyder: ‘employing an instrument to measure its own performance is immediately contradictory’ (Goyder, 1986: 28). This is in line with Goldman’s earlier comment that ‘finding anything about polling by a poll is a little too syllogistic’ (Goldman, 1944–45: 466). This general formulation of the problem can be translated into two components of the total survey error: selection bias and measurement bias. One can assume that surveys on surveys are extremely sensitive to selection bias: people who are interested in surveys and have a positive attitude toward different aspects of them (e.g. reliability and usefulness) can be expected to be more likely to participate in them. It is clear that, due to this selection bias, the assessment of surveys is biased in a positive direction. Another consequence of selection bias is that, in comparative research between countries, opinions about surveys appear more positive in countries with higher nonresponse rates. One of the measurement problems in surveys on surveys is the setting in which such a survey takes place. One can assume that respondents seek consistency between their behavior (participation in a survey) and their answers to questions about survey research and survey participation. One can also assume
that the respondent will value the interaction with an interviewer in the case of a face-to-face interview and give socially desirable answers to questions about surveys. This kind of measurement error caused by social desirability will also bias the results in a positive direction. As a consequence of the tendency to be consistent and to give socially desirable answers, results from surveys on surveys do not give a correct representation of public opinion about surveys. Due to selection and measurement errors, one can expect opinions about surveys to be estimated too positively. Results from a sequential mixed-mode survey about surveys clearly demonstrate that, with respect to the mean scores on items used to measure respondents’ opinions about surveys, measurement and selection effects are apparently present and in line with expectation (Vannieuwenhuyze et al., 2012). Significant measurement effects lead to the conclusion that respondents report more positive opinions about surveys in face-to-face interviews than in mail questionnaires. The selection effects were also significant: the respondents in the follow-up face-to-face group were less positive toward surveys (controlling for measurement effects). The methodological complexity of surveying surveys makes clear that the comparison of results of different surveys on surveys should be done carefully and should take measurement and selection effects into account. The survey mode and the response rate are crucial survey characteristics in such a comparison. To avoid measurement effects such as social desirability, one can opt for a survey design without interviewers (a mail or a web survey). However, this can result in low response rates and further selection effects. To solve these problems and to evaluate the impact of measurement and selection effects, a mixed-mode design is advisable.
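The following schematic illustration, with invented numbers, shows why the raw difference between the two mode groups confounds the two effects; it is not the estimator proposed by Vannieuwenhuyze et al. (2012), and the counterfactual value it assumes is exactly what such a method has to estimate.

```python
# A schematic illustration (invented numbers, not the Vannieuwenhuyze et al.
# estimator) of why the raw difference between mode groups in a sequential
# mixed-mode survey on surveys confounds selection and measurement effects.

# Mean score on an 'opinions about surveys' item (higher = more positive).
mean_mail_respondents = 3.4   # answered the initial mail questionnaire
mean_f2f_followup = 3.9       # initial nonrespondents interviewed face-to-face

observed_difference = mean_f2f_followup - mean_mail_respondents

# Suppose we somehow knew (e.g. via an auxiliary comparison) what the
# follow-up group would have reported had they answered by mail instead.
assumed_counterfactual_mail_score = 3.3

measurement_effect = mean_f2f_followup - assumed_counterfactual_mail_score
selection_effect = observed_difference - measurement_effect

# A positive measurement effect (interviewer presence, social desirability)
# can mask a small negative selection effect: the follow-up group is in fact
# slightly less positive about surveys.
print(round(observed_difference, 2),
      round(measurement_effect, 2),
      round(selection_effect, 2))   # 0.5 0.6 -0.1
```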
Content
In surveys used to measure public opinion about surveys (surveys on surveys), one can ask questions about different aspects of the
survey climate. Trend reports about surveys on surveys published in Public Opinion Quarterly (Schleifer, 1986; Kim et al., 2011) show a great variety of questions asked in these surveys. Four general topics can be distinguished: pollsters as people; evaluation of public opinion polling; knowledge about public opinion polling; and participation in or experience with polling. Under ‘pollsters as people’, questions are presented about honesty of and trust in pollsters, the usefulness of the survey research industry and the value of polls to the public. Evaluation of public opinion polling covers the public’s belief in the virtue of polls, whether polls work for or against the best interests of the public, the importance and influence of polls and the opportunity they offer to provide feedback on public policy issues. Knowledge of polling is measured with questions about public understanding of how polls work and questions about the public’s belief in the accuracy of (non-election) polls and surveys. Different aspects of experience with polling can be asked about: participation, refusal to participate, evaluation of the pleasantness of past experiences, concerns about privacy and experience with illegitimate activities conducted under the guise of surveys (e.g. sales calls disguised as research). Based on the overview of the frequency distributions of all of these questions, the authors of the trend report conclude that between the mid-1990s and the first decade of the 2000s there was a markedly negative shift in attitudes toward public opinion researchers and polls (Kim et al., 2011: 165). This kind of information about different aspects of survey research practice, collected over a longer period, is very useful and extremely relevant for characterizing the survey climate. Most of the questions discussed in the trend report are related to previous attempts to measure respondents’ attitudes toward surveys (e.g. Hox et al., 1995; Rogelberg et al., 2001; Stocké, 2006; Loosveldt and Storms, 2008). To develop an instrument to measure respondents’ opinion about surveys as an important aspect of the survey climate, Loosveldt and Storms (2008) opt to select the dimensions
that are relevant to the individual decision to participate in a survey interview. Based on the leverage-salience theory of survey participation (Groves et al., 2000), they assume that survey enjoyment, survey value, survey cost, survey reliability and survey privacy are five relevant dimensions of the respondent’s opinion about surveys. Respondents will participate in an interview when they consider an interview a pleasant activity (survey enjoyment) that produces useful (survey value) and reliable (survey reliability) results, and when the perceived cost of participation (survey cost: time and cognitive effort) and the impact on privacy (survey privacy) are minimal. Starting from a list of 23 items for the five dimensions, it was possible to construct reliable scales to measure these five aspects and to create an ‘Opinions about surveys short scale’ (Cronbach’s α = 0.82) with six items: surveys are useful ways to gather information (survey value); most surveys are a waste of people’s time (survey cost); surveys stop people from doing important things (survey cost); surveys are boring for the person who has to answer the questions (survey enjoyment); I do not like participating in surveys (survey enjoyment); and surveys are an invasion of privacy (survey privacy). The reliability items, however, did not perform well, so this dimension is not part of the short scale. It is also important to keep in mind that the perceived legitimacy of a survey is a function of the principal investigator organizing it as well as of its subject. For example, in Switzerland, according to a survey done in 2009 (MOSAiCH, 2009), surveys about transport and mobility are better accepted when conducted by the Swiss Federal Statistical Office than surveys on income and living conditions, political behavior or biotechnology. For the latter subjects, universities seem more legitimate organizers of surveys, whereas newspapers are not seen as really legitimate conductors of surveys on these topics. One can consider the individual’s opinion about surveys as the expression of the
subjective experience of the survey climate. These opinions are the link between the survey climate at the societal level and the decision to participate at the individual level. One can also assume that there is an interaction between the individual’s opinion and some general societal characteristics that are relevant to describe the survey climate.
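For readers who want to check the internal consistency of a short scale like the one just described in their own data, the sketch below computes Cronbach’s α for a six-item instrument. The response data are simulated; the α of 0.82 reported by Loosveldt and Storms (2008) refers to their data, not to this toy example.

```python
# A minimal sketch of checking the internal consistency of a short opinions-
# about-surveys scale with Cronbach's alpha. The six response vectors are
# simulated; negatively worded items are assumed to have been reverse-coded
# beforehand. This does not reproduce the 0.82 reported in the literature.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(0)
# Simulate 200 respondents whose six item scores share a common component,
# so the items hang together reasonably well.
latent = rng.normal(size=(200, 1))
responses = latent + rng.normal(scale=0.8, size=(200, 6))

print(round(cronbach_alpha(responses), 2))
```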
General Societal Characteristics
The questions in a survey on surveys about survey participation and previous experience with interviews point to the ever-increasing collection of data from the general public by means of questionnaires. This creates an atmosphere of ‘over-surveying’. There is a general feeling that people are asked to participate in surveys too often and that the respondent burden is too high. Information about the topics and number of surveys in a country or region, the most common modes of data collection, the usual practice regarding incentives or payment of respondents, and the mean interview length is necessary to substantiate this feeling and to monitor respondents’ cost–benefit considerations, which are at the core of respondent burden. Privacy legislation and general concerns about privacy (e.g. the number of telephone numbers kept private), access to registers (e.g. the national population register) and the use of personal data (e.g. purchase data) contextualize the sensitivity to privacy matters. Privacy rules can have an impact on the way sample units can be approached and encouraged to participate (e.g. the number of contact attempts, the implementation of refusal conversion procedures). Changes in privacy legislation are sometimes important for understanding changes in the willingness to participate. The media frequently report on survey and poll results. The perception of the value and reliability of surveys can be influenced by the way results of polls and surveys are presented in the media. It is obvious that frequent questions and discussions about the accuracy of
survey results (e.g. election polls) do not contribute to trust in the survey research industry. Incidents involving fake surveys with incorrect or (partially) falsified data are of special interest in this context. An extremely relevant question here is: do newspapers assess the quality of a survey before they publish poll results (e.g. by evaluating the representativeness of the sample)? Systematic documentation of discussions and incidents in the media is useful for understanding public opinion about polls and allows an accurate description of the survey climate. For example, difficulties in correctly forecasting the results of referendums in Switzerland could be seen as an element negatively influencing the country’s survey climate. The same reasoning could apply after other referendums, such as the Scottish one (Durand, 2014), or after other contested poll predictions, such as the British elections of 1992 and 2015 and the French presidential election of 2002, to give a few examples.
CONCLUSION
Interest in the survey-taking climate has its origin in the negative experience that the willingness to participate is decreasing and that a smooth implementation of survey procedures is becoming more difficult. This awareness improves the understanding of the survey climate, which is valuable for organizing surveys in an efficient way and for collecting high-quality data. The survey climate can be considered a multi-level and multi-dimensional concept. Elaboration of the concept made clear that several aspects at the individual and societal levels are relevant to characterize the survey-taking climate. Just as a trend in average temperature is not sufficient to describe the climate, trends in nonresponse rates are not sufficient to describe the survey climate, although the willingness to participate is one of its key characteristics. Additional information at the
individual level, such as reasons for refusals, experience with previous surveys and opinions about surveys and polls, is extremely relevant for contextualizing nonresponse rates. Information at the societal level about how many surveys are organized in a country or region and why, the way results of polls and surveys are reported and discussed in the media, the sensitivity of privacy issues and the legal rules concerning privacy is important to characterize the context in which survey procedures must be implemented. It is therefore better to replace the original idea of documenting the survey climate with a nonresponse barometer by a survey climate barometer that integrates information on as many relevant indicators as possible. It is obvious that this approach requires more effort to collect the relevant information. However, the description should not be an end in itself; this additional investment, which results in a better understanding of the survey climate, can and must be used for better social marketing of the survey and poll industry. In fact, it is remarkable that surveys are used to improve social and commercial policy, but that the survey industry makes little use of its own capacities to promote the significance of its activities and to improve its own performance. National statistical institutes and other survey organizations can use a well-documented description of the survey climate to improve communication strategies about their activities and projects (e.g. the announcement of national surveys) on the one hand, and to target specific subgroups in an appropriate way on the other (Lorenc et al., 2013). The latter is in line with the idea of an adaptive design in which different groups are approached in different ways. However, more research on the translation of specific characteristics of the survey climate into effective fieldwork procedures is necessary. It is also clear that possible differences in survey climate between countries are an important issue in cross-national research. Differences in survey climate (e.g. privacy legislation) are not only responsible for
differences in fieldwork performance but are sometimes also relevant for explaining substantive differences. Although the concept of the survey-taking climate is strongly related to nonresponse research, it has the potential to enrich the typical nonresponse research questions and to shift the focus from why people refuse to participate in a survey to how people can be encouraged to participate. Reflection on the survey climate also makes clear that the survey research industry must interact with the social environment, and it emphasizes the importance of an adequate promotion and introduction of survey projects.
RECOMMENDED READINGS
1 Lyberg and Lyberg (1991) can be considered the first important paper on the survey climate.
2 Groves and Couper (1998) is one of the first and most complete discussions of the nonresponse problem in surveys. The concept of survey climate is introduced in chapter six, on social environmental influences on survey participation.
3 Loosveldt and Storms (2008) focus on one dimension of the survey climate and present a measurement instrument to collect opinions about surveys.
4 Kim et al. (2011) is a general review of surveys on surveys and presents trends in different aspects of public opinion about surveys.
REFERENCES
Bates, N., Dahlhamer, J., and Singer, E. (2008). Privacy concerns, too busy, or just not interested: Using doorstep concerns to predict survey nonresponse. Journal of Official Statistics, 24 (4), 591–612.
Couper, M. (1997). Survey introduction and data quality. Public Opinion Quarterly, 61, 317–338.
Durand, C. (2014). Scotland the day after: how did the polls fare? Retrieved July 14, 2015 from http://ahlessondages.blogspot.com/.
Ernst Stähli, M. and Joye, D. (2013). Nonrespondent surveys: pertinence and feasibility. The Survey Statistician, 67, 16–22.
Goldman, E. (1944–45). Poll on the polls. Public Opinion Quarterly, 8, 461–467.
Goyder, J. (1986). Survey on surveys: Limitations and potentialities. Public Opinion Quarterly, 50, 27–41.
Groves, R. and Couper, M. (1998). Nonresponse in Household Surveys. New York: John Wiley & Sons.
Groves, R., Singer, E., and Corning, A. (2000). Leverage–salience theory in survey participation: Description and an illustration. Public Opinion Quarterly, 64, 299–308.
Hox, J., de Leeuw, E., and Vorst, H. (1995). Survey participation as reasoned action; a behavioral paradigm for survey nonresponse? Bulletin de Méthodologie Sociologique, 48, 52–67.
Kim, J., Gershenson, C., Glaser, P., and Smith, T. (2011). The polls-trends: Trends in surveys on surveys. Public Opinion Quarterly, 75 (1), 165–191.
Loosveldt, G. (2013). Unit nonresponse and weighting adjustments: A critical review discussion. Journal of Official Statistics, 29 (3), 367–370.
Loosveldt, G. and Storms, V. (2008). Measuring public opinions about surveys. International Journal of Public Opinion Research, 20, 74–89.
Lorenc, B., Loosveldt, G., Mulry, M. H., and Wrighte, D. (2013). Understanding and Improving the External Survey Environment of Official Statistics. Survey Methods: Insights from the Field. Retrieved from http://surveyinsights.org/?p=161; DOI:10.13094/SMIF-2013-00003. Accessed on 6 June 2016.
Lyberg, I. and Lyberg, L. (1991). Nonresponse research at Statistics Sweden. Proceedings of the Survey Research Methods Section. American Statistical Association, Alexandria, VA. Retrieved from www.amstat.org/sections/SRMS/Proceedings/papers./1991_012.pdf. Accessed on 6 June 2016.
MOSAiCH (2009). Measurement and Observation of Social Attitudes in Switzerland. Survey archived by FORS. Retrieved July 10, 2014 from www.forscenter.ch.
Rogelberg, S., Fisher, G., Maynard, D., Hakel, M., and Horvath, M. (2001). Attitudes
towards surveys: Development of a measure and its relationship to respondent behavior. Organizational Research Methods, 4, 3–25.
Schleifer, S. (1986). Evaluating polls with poll data. Public Opinion Quarterly, 50, 17–26.
Sjoberg, G. (1955). A questionnaire on questionnaires. Public Opinion Quarterly, 18, 423–427.
Stocké, V. (2006). Attitudes toward surveys, attitude accessibility and the effect on respondents’ susceptibility to nonresponse. Quality & Quantity, 40, 259–288.
Stoop, I., Billiet, J., Koch, A., and Fitzgerald, R. (2010). Improving Survey Response: Lessons Learned from the European Social Survey. New York: John Wiley and Sons.
Vannieuwenhuyze, J., Loosveldt, G., and Molenberghs, G. (2012). A method to evaluate mode effects on the mean and variance of a continuous variable in mixed-mode surveys. International Statistical Review, 80 (2), 306–322.
VSMS (2013). Jahrbuch 2012. Retrieved July 7, 2014 from www.vsms-asms.ch/files/5813/5625/9754/vsms_Jahrbuch_2012_Low_Res.pdf.
7 The Ethical Issues of Survey and Market Research
Kathy Joe, Finn Raben and Adam Phillips
INTRODUCTION
There is a school of thought that says turning a discussion towards the subject of ethics in survey and market research is a guaranteed way to kill the conversation! However, it is important to remember that any breach of public trust, cultural standards, best practice or prevailing law (no matter how small) can impact upon the reputation of the entire Market Research industry. Every year, a huge number of consumers and citizens around the world provide market, social and opinion researchers with an extraordinary amount and range of information about their preferences, behaviour and views. The information gathered is used in a wide variety of ways, including (but not limited to):
i. helping to guide social policies by giving decision-makers impartial and unbiased information about what the public wants;
ii. measuring trends in society, or describing, monitoring, explaining or predicting market phenomena to help
local and national legislatures understand societal change;
iii. helping companies make better decisions about their products, services or employment standards;
iv. guiding the innovation process, thereby supporting economic development and growth;
v. providing citizens with a voice to express their views on the prevailing situation – be that social, political or economic.
Central to these applications – and core to the Market and Social Research profession – is the fact that relevant information can be expressed fully and collected accurately, without fear of collateral impact upon the research participant. This may seem like a ‘no-brainer’, but the collection and reporting of information using percentages and aggregated statistics – rather than on a personalised level or by naming individual contributors – is often misunderstood and/or misrepresented. Let us be absolutely clear: the continued success of the Market and Social Research profession relies almost solely on the public’s
trust, and therein lies the equally important need to maintain a publicised code of ethics. Traditionally, information on behaviour, attitudes and preferences has been gathered through a combination of quantitative techniques, which employ questionnaires and survey sampling, and qualitative research, which uses depth interviews and focus groups to gain a deeper understanding of the thoughts and emotions underlying people’s attitudes and behaviour. Both these sets of methodologies are rooted in the applied social sciences, and the non-commercial element of these approaches is clearly underlined in the very definition of market research:
Market research, which includes social and opinion research, is the systematic gathering and interpretation of information about individuals or organisations using the statistical and analytical methods and techniques of the applied social sciences to gain insight and support decision-making. The identity of respondents will not be revealed without explicit consent and no sales approach will be made to respondents as a direct result of their having provided information. (ICC/ESOMAR International Code on Market and Social Research)
Since the 1990s, we have seen the increasing use of digital data collection methods such as online panels and online qualitative research using mobile telephones and other portable devices, and, more recently, the rapid growth of passive methods that observe or measure behaviour without asking questions, including new audience assessment techniques (such as facial coding, eye tracking and electroencephalography), social media research, and self-reported ethnography using mobile devices with video recording capability. These new methodologies have led to an explosion in the amount of data that can be collected, and it has also become apparent that research participants now entrust researchers with a staggering amount of personal information. Noting, too, that the evolution of digital technology is not slowing down, the information participants are now providing is becoming increasingly granular,
for example: the use of photos and videos of their everyday life to illustrate, rather than explain in an interview, how they use certain products. Whilst a large proportion of this information might cover everyday subjects, such as preferences for household products (detergents, dairy products etc.), some of this information will cover topics that respondents would not even discuss with family members or close friends, such as their financial situation, personal health issues or how they intend to vote in the coming election. It is also true that in many cultures, there are heightened social and religious sensitivities on what can be discussed (and with whom), and in other societies there are clear restrictions on what can and can’t be said. Thus, the provision of this level of information necessitates a high level of trust in researchers, and in turn, it is incumbent on researchers that they be very aware and mindful of the responsibilities that they have to their research participants.
PROFESSIONAL CODES
Almost every profession that deals with members of the public (e.g. the legal profession, the medical profession, etc.) has a code of conduct, with disciplinary consequences for those who do not follow it; market research is no different. These codes serve to highlight for the public that there are standards (or principles) of behaviour that we must abide by, and which they can expect from us. Because of its inherent, interrogatory nature, the market research profession has probably been more concerned than most with the question: why should participants trust researchers and what reassurances should they expect from market, social and opinion researchers? While many kinds of survey research had been carried out for several decades beforehand, the end of World War II and the nascent
demand for multi-country research led to the establishment of ESOMAR, the European Society for Opinion and Market Research, and its publication of the first International Code of Marketing and Social Research Practice in 1948, in recognition of the need for ethical market research in the post-war economy. At the same time, WAPOR, the World Association for Public Opinion Research, was established with the principal objective of advancing the use of science in the field of public opinion research around the world. It also published a code of ethics. This development was followed by a range of codes for market, social and opinion research produced by national bodies in America, Europe and Asia and by the International Chamber of Commerce (ICC). During this time, the academic community was also developing similar codes with the same tenets, including confidentiality of personal data. In 1976, the ICC and ESOMAR agreed it would be preferable to have a single international code instead of two slightly differing ones, and this led to a joint ICC/ESOMAR Code which was published in 1977. There have been a number of revisions to that code since then, the last one being in 2007. The introduction to the current version explains why a code of ethics and practice is so important:
Market, social and opinion research depends for its success on public confidence that it is carried out honestly and without unwelcome intrusion or disadvantage to its participants. The publishing of this code is intended to foster public confidence and to demonstrate practitioners’ recognition of their ethical and professional responsibilities in carrying out market research.
Today, the ICC/ESOMAR International Code has been adopted by over 60 national and international associations around the world, and endorsed by an additional number of associations which have their own national codes that embody the same principles but are worded differently to take into account national, cultural and local legislative requirements.
A further distinguishing characteristic of market and social research professionals is that they subscribe to this code of their own volition. While it is possible to be a practising researcher who has not signed up to the Code, all of ESOMAR’s 4900 corporate and individual members in more than 130 countries distinguish themselves by adhering to this international code and agreeing to support both its ethical requirements and the disciplinary consequences of a breach, as a condition of membership.
THE PRINCIPLES
At the most fundamental level, there are three golden rules of market research: (i) do no harm; (ii) be respectful; and (iii) get consent. These may (again) seem self-evident, but social, cultural and legal differences between countries require a broader set of prescriptive principles. The following eight principles, to which all researchers must adhere, have formed the basis of the ICC/ESOMAR International Code since its inception and have stayed relatively unchanged over time:
1 Market researchers shall conform to all relevant national and international laws.
2 Market researchers shall behave ethically and shall not do anything which might damage the reputation of market research.
3 Market researchers shall take special care when carrying out research among children and young people.
4 Respondents’ cooperation is voluntary and must be based on adequate, and not misleading, information about the general purpose and nature of the project when their agreement to participate is being obtained and all such statements shall be honoured.
5 The rights of respondents as private individuals shall be respected by market researchers and they shall not be harmed or adversely affected as the direct result of cooperating in a market research project.
6 Market researchers shall never allow personal data they collect in a market research project to be used for any purpose other than market research.
7 Market researchers shall ensure that projects and activities are designed, carried out, reported and documented accurately, transparently and objectively.
8 Market researchers shall conform to the accepted principles of fair competition. (ICC/ESOMAR International Code on Market and Social Research)
The two aspects of research that have assumed greatest importance in recent times are informed consent and data protection. The emphasis placed on these two elements has been amplified by technological advances, to a point where legislation in most countries is now prescribing certain requirements under these two headings: for example, the collection and handling of personal data in Europe is now governed by the EU Data Protection Directive,1 which was introduced in 1995. This is in the process of being updated by the introduction of the EU Data Privacy Regulation,2 which has been agreed by the Council of Ministers and the European Parliament and will lead to more consistent application of the law when it takes effect across all 28 EU member states in 2018. The principles set out in the EU Data Protection Directive have since been adopted by the OECD3 and copied by many countries and regional associations. However, these principles are limited to privacy and data protection and their implementation and enforcement is very variable, which is why observing codes of ethics and professional conduct remains of critical importance for maintaining the trust of research users and the general public. While recognition of these issues by the legislature underlines the importance of adhering to our well-established ethical practices, it is also a reminder that the law often advances more slowly than the practice, and thus it becomes increasingly important for the industry – and associations such as ESOMAR – to promote best practice in these arenas. Under this latter heading, the Guidelines that ESOMAR publishes detail behaviours which practising researchers have adopted to fulfil all prevailing requirements (legal
and ethical), and they provide an up-to-date ‘roadmap’ for all involved in our profession.
CONSENT AND NOTIFICATION
One of the ‘golden rules’ is: get consent. A key principle of market and social research is that respondents’ cooperation is voluntary and must be based on adequate information about what they will be asked for, the purpose for which the information will be collected, how their identity will be protected, and with whom the information might be shared and in what form. Participants must never be left with the perception that they have been unfairly treated, misled, lied to, or tricked, and participants must be allowed to withdraw and have their personal information deleted at any time. This means that research must always be clearly distinguished and separated from non-research activities – especially any commercial activity directed at individual participants. Activities that must not be represented as market research include enquiries that aim to obtain personally identifiable information about individuals that can be used for sales or promotional approaches to individual respondents. For instance, market researchers cannot use or provide the personal contact details and answers obtained during a survey so that the information is used to create follow-up promotional activity with respondents. Equally, opinion polls based on applied scientific and statistical methods must not be confused with phone-in polls or other self-selecting surveys. Furthermore, any enquiry whose primary purpose is to obtain personally identifiable information about individuals (e.g. for the purposes of compiling or updating lists, for sales, advertising, fundraising or campaigning, or other promotional approaches) must not be misrepresented as opinion research. Researchers are also required to identify themselves promptly and enable respondents to check their identity and bona fides without
difficulty. In face-to-face interviews this can be done with an identity card containing the contact details and address of the organisation conducting the research. With telephone surveys, potential respondents should be provided with a phone number that they can call to validate the researcher’s identity and privacy policy, and with online data collection, researchers are required to have a privacy policy which is readily accessible and ‘traceable’ (i.e. including contact details for verification, not just an email address).
DATA PRIVACY
In collecting data, researchers should bear in mind that elements of the data collected might exceed the requirements of the study, and whilst personal data might be expected to include name, address, email address, telephone number or birth date, they should also be aware that images and speech collected in photographs, audio and video recordings, as well as social media user names, now also fall under the definition of personal data. Also, the potential to re-identify respondents in a small geographic area, due to a unique combination of answers, is receiving heightened attention from legislators, and must therefore be the subject of additional control measures and reassurance by practitioners. The application of research codes of conduct and practice has always placed great emphasis on the elimination of any risk of a breach of respondent confidentiality. This requirement is built into the ICC/ESOMAR Code and other self-regulatory codes used by research associations.
PROTECTING RESPONDENTS’ DATA
The issue of how the data is protected is particularly important in an age where there are regular news reports about web sites being
hacked or personal information (stored on a computer) being lost or stolen. Researchers must now ensure, and reassure people, that data sets or other materials (photographs, recordings, paper documents) collected in survey research which contain personally identifiable information are kept securely. When designing a research project, researchers are required both by law and by professional codes to limit the collection of personal data to only those data that are necessary for the research purpose. The same principle applies to passive data collection, where personal data may be ‘harvested’ by video recording or data capture (e.g. from social media) without the traditional asking and answering of questions. In such eventualities, those items that can identify a specific individual must be anonymised or deleted from any report or presentation, even when not required by local law. Researchers need to perform quality checks at every stage of the research process; this may involve monitoring and validating interviews. To allow for these checks, while protecting the personal data that identifies individual research participants, researchers must have procedures to separate the data records from the identification records, while retaining the ability to link the two data sets. This is usually done by means of pseudonymous identifiers; a master file linking participants’ names, addresses or phone numbers with their corresponding internally generated ID numbers must be kept secure, with access limited to only a small number of people (under the supervision of the data controller in jurisdictions where this is required by law). Permission to access personal data may be extended to include the sampling staff or those charged with running internal checks on the data, so long as their access is for a legitimate purpose. These security procedures enable researchers to process data, code open-ended responses and conduct other participant-level analyses without seeing individual participants’ names, addresses or phone
numbers. Such practices, which are often referred to as pseudonymisation, protect personal data by ensuring that, although it is still possible to distinguish individuals in a data set by using a unique identifier such as an ID number, their personal identification information is held separately, so that it can be used for checking purposes without the rest of the data being made available to those doing the checking and without the data analysts being able to identify the individual. This protection is particularly important in research involving sensitive subjects, such as an individual’s racial or ethnic origin, health or sex life, criminal record, financial status, political opinions or religious or philosophical beliefs. The file holding the identification information should be destroyed after the research is completed, taking into account the period required to check the data for statistical integrity, for instance in trend data. Other data minimisation safeguards to avoid the possibility of re-identification can include asking people to indicate their age range rather than give a specific birth date, combining postal codes to increase the number of people categorised under a geographic unit, or removing or replacing specific numbers from an IP address. The development of the internet and the potential for collecting data on the behaviour of individual users has created a debate about what is appropriate for researchers to collect and how to render the data anonymous. As a minimum, researchers should de-identify data prior to release to a client or for any wider publication. Where the individual’s consent cannot be obtained, this process involves the deletion or modification of personal identifiers to render the data into a form that does not identify individuals. Guidance on such techniques is offered by the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule, the Office of the Australian Information Commissioner APP Guidelines and the UK Anonymisation: Managing Data Protection Risk Code of Practice. The latter refers to the EU Data Protection Directive (95/46/
EC), which makes it clear that the principles of data protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable. Equally, however, this guidance also recognises that, ultimately, it may be impossible to assess the re-identification risk with absolute certainty and, therefore, there will be many borderline cases where careful judgement will need to be used, based on the circumstances of the case. It also acknowledges that powerful data analysis techniques that were once rare are now commonplace, and this means that research organisations must conduct a comprehensive risk analysis before releasing ‘anonymised’ data sets – to ensure that the de-identification processes are sufficiently rigorous. This is of particular relevance in social media research, where researchers should mask or anonymise quotes and comments unless the researcher has express consent to pass them on to a third party in identifiable form. In many instances, it is possible to re-identify participants by using search engines, and the ESOMAR Guideline on Social Media Research requires research companies to have a contractual agreement with their client not to try to re-identify an anonymised data set. The Guideline also refers to anonymising qualitative data and describes techniques such as blurring or pixelating video footage to disguise faces, electronically disguising or re-recording audio material and changing the details of a report such as a precise place name or date.
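The sketch below illustrates, with hypothetical field names rather than any prescribed format, how identification records can be separated from response data by means of internally generated pseudonymous IDs, combined with simple minimisation transforms such as age banding, postcode truncation and IP masking.

```python
# A minimal sketch (hypothetical field names, not an ESOMAR-specified format)
# of separating identification records from response data via internally
# generated pseudonymous IDs, plus a few simple data-minimisation transforms.
import uuid

def pseudonymise(raw_records):
    """Split raw records into a secure master file and a de-identified data file."""
    master_file = {}  # pseudonym -> direct identifiers; stored separately, restricted access
    data_file = []    # pseudonym + minimised substantive data; safe for analysts
    for record in raw_records:
        pid = uuid.uuid4().hex  # internally generated ID, meaningless outside the study
        master_file[pid] = {
            "name": record["name"],
            "phone": record["phone"],
        }
        data_file.append({
            "id": pid,
            # Age band instead of birth date.
            "age_band": f"{(record['age'] // 10) * 10}-{(record['age'] // 10) * 10 + 9}",
            # Truncated postal code to enlarge the geographic unit.
            "postcode_area": record["postcode"][:2],
            # Last octet of the IP address removed.
            "ip_prefix": ".".join(record["ip"].split(".")[:3]) + ".0",
            "answers": record["answers"],
        })
    return master_file, data_file

raw = [{"name": "A. Respondent", "phone": "012 345 678", "age": 47,
        "postcode": "8001", "ip": "192.0.2.17", "answers": {"q1": 4, "q2": 2}}]
master, data = pseudonymise(raw)
print(data[0]["age_band"], data[0]["postcode_area"], data[0]["ip_prefix"])
# -> 40-49 80 192.0.2.0
```

In practice the master file would be encrypted and held under the supervision of the data controller, and it would be destroyed once the checks it supports are complete.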
ADDITIONAL ETHICAL ISSUES
Be Respectful
This requirement covers a range of issues that are often overlooked but nonetheless important. For example, in certain cultures it is unacceptable to offer alcohol during an extended focus group; in others, it is impossible for women to be interviewed by men; and in
others again, it is unlikely that any respondent will be available to participate in research after 8 p.m. These are social and cultural idiosyncrasies that do require local knowledge to appreciate. Some research involves participants downloading an app to their mobile phone or computer. Apart from the usual requirements for notification and consent, the researcher also has a responsibility to minimise any consequences such a download might have. This can include ensuring that software is not installed that modifies settings beyond what is necessary to conduct the research, and does not cause any conflicts with operating systems or cause other software to behave erratically. Researchers should also ensure that any such software can be easily uninstalled and that it does not impact the device in a negative way. Equally, the participant should never have to bear any additional costs for participating in research and they should also be cautioned if any app installed has the potential to collect data from others through data capture (e.g. of their contact details), image or sound recordings. These should normally either have the other individual’s consent, be filtered, be disguised, be de-identified or deleted. Similarly, participants should be reminded to delete any residual information or application from their device relating to the research after they have finished providing data.
Do No Harm
The ICC/ESOMAR International Code, like all other responsible codes of conduct, has a general requirement that researchers must take all reasonable precautions to ensure that respondents are in no way harmed or adversely affected as a result of their participation in a market research project. This can cover a variety of situations. For instance, researchers must check that any foods or products provided for research purposes are safe to consume or handle, and
confirm this with the supplier of the product. They should also not encourage respondents to become involved in any illegal or ill-advised action, and should avoid putting participants at risk when researching criminal behaviour or activities that may be considered antisocial. Researchers should warn participants not to respond to a survey questionnaire by texting or otherwise using their mobile device while driving, and not to take photos in places or situations where this is prohibited. When research involves calling mobile phones, researchers might inadvertently contact potential respondents who are driving a vehicle or in a space where they might be overheard by others. The ESOMAR Guideline on Mobile Research requires that the researcher should confirm whether the potential respondent is in a situation where it is safe, legal and convenient to take the call before proceeding. An emerging debate within the profession relates to the role of research in enabling ‘commercial’ organisations to collect information which is detrimental to the individual. It is essential that researchers are aware of their social and ethical responsibilities and ensure that their work demonstrates respect for local law and culture, as well as the rights of the individual. This may include creating or joining public debate on the value to society of the research that is being carried out.
INTERVIEWING CHILDREN
At present there is no common international definition of a ‘child’ or ‘young person’. Some international laws define a ‘child’ as ‘a human being below the age of 18 years unless under the law applicable to the child, majority is attained earlier’. However, this definition has to be set against the right of the individual to self-expression, recognising that children’s and teenagers’ opinions may not always coincide with those of their parent(s) or legal guardian(s). Even within a single
country the definition of a child or young person may vary with the activity under consideration. Because it would be very difficult to agree on any general definition based on factors such as the child’s cognitive powers, to fulfil the objectives outlined above the ESOMAR Guideline on Interviewing Children and Young People requires that researchers conform to any relevant definitions incorporated in any national Code of Conduct or Practice and in national legislation. Where no such specific national definitions exist, ESOMAR recommends that a ‘child’ be defined as ‘under the age of 14’ and a ‘young person’ as ‘aged 14–17’. Although a child of 7 and a child of 13 often cannot sensibly be asked the same questions or about identical topics, this issue is usually more a matter of common sense and good research practice than one of ethics. However, researchers must be alert to situations where the sensitive nature of the research or the circumstances of the interview mean that exceptional care is called for in interviews with children and young people of any age group. A key criterion must always be that when the parent or other person responsible for the child hears about the content or circumstances of the interview, no reasonable person would expect him or her to be upset or disturbed. The researcher must take into account the degree of maturity of the child or young person involved when considering what subjects may or may not be safely dealt with in an interview. While it may be imperative to avoid certain subjects when interviewing children (e.g. a topic which might frighten or worry the child), the same subject might quite safely be covered with young people if the appropriate precautions are taken. This again is a question of good research practice as much as of ethics. Examples of topics where special care is needed when interviewing children and young people are ones which could disturb or worry them, such as their relationships with others in their peer
group, or ones which risk creating tensions between them and their parents. There are sometimes valid and important reasons (e.g. helping to guide social policies) for covering research topics of the kind referred to above where special care is needed. When this is the case, it is essential both that a full explanation is given to the responsible person (certainly in the case of a child, and if possible even in the case of a young person aged 14–17) and their agreement obtained, and also that steps are taken to ensure that the child or young person is not worried, confused or misled by the questioning. In carrying out such research, the welfare of the children and young people themselves is the overriding consideration – they must not be disturbed or harmed by the experience of being interviewed. Furthermore, the parents or anyone acting as the guardian of any child or young person taking part in a research project must be confident that the latter’s safety, rights and interests are being fully safeguarded. Interviewers and other researchers involved in the project must protect themselves against any misunderstandings or possible allegations of misconduct arising from their dealings with the children or young people taking part in that project. The protection of children and young people is a sensitive issue, and this means that authorities, and the public generally, must be confident that all research carried out with children and young people is conducted to the highest ethical standards; depending on the topic and circumstances, this may involve employing researchers who specialise in such research. Researchers should take similar precautions for other vulnerable groups, including the sick or the very elderly. The ESOMAR Guideline provides further information on when and how to obtain permission, and when it is advisable that a parent or guardian should be present or in the vicinity when a child or young person is being interviewed.
CONCLUSION
Maintaining good ethical and scientific standards has served the profession of market, social and opinion research very well for more than 60 years, noting that the first international Code was issued by ESOMAR in 1948 and has since been updated on a regular basis.4 However, with the continued growth of multi-country research, as well as research conducted via the internet (and thus accessing people without knowing where they are physically located), the need for globally consistent (and applicable) rules, definitions and standards becomes ever more important. The ICC/ESOMAR International Code can provide such guarantees, and is often supplemented with market-specific requirements that are clearly outlined by the national association(s); one arena where this is currently of particular importance and relevance is data privacy and protection, where local jurisdictional requirements can vary from country to country. This probably exemplifies ESOMAR’s role as a willing facilitator, with the objective of aligning national, regional and global perspectives to create and promote industry-led professional ethics and standards – designed by practising researchers. While the legal environment(s) under which our profession must operate will differ from country to country, our professional principles will not. We will continue to depend on public goodwill to collect the information we seek, and we must therefore continue to reassure (and demonstrate) that we are honest and transparent in handling that information. In this context, it is essential that we continue to adhere to the eight principles set out in the ICC/ESOMAR International Code and remember above all else to:
• Do no harm
• Be respectful
• Get consent
Given the increasing need for businesses and governments to stay in touch with consumer
85
preferences – particularly in this fast changing digital world – the development and promotion of professional ethics and standards, as well as the public demonstration of the value of market research services will ultimately safeguard and nurture economic and societal growth.
NOTES

1 Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data.
2 Regulation (EU) of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation).
3 OECD Privacy Principles, http://oecdprivacy.org/.
4 A survey conducted by ESOMAR among leading research companies revealed that, from over 45.5 million interviews conducted in the EU, there were only 1264 complaints (0.003%), and that most of these related to problems with incentives for conducting an interview.
RECOMMENDED READINGS

The ICC/ESOMAR International Code on Market and Social Research (published by ICC and ESOMAR, last revised 2007, available at www.esomar.org and www.iccwbo.org), which sets out the ethical and professional principles for researchers that have been adopted or endorsed by all ESOMAR members and over 60 national market research associations worldwide.
The ESOMAR GRBN Online Research Guideline (published by ESOMAR and GRBN, last updated 2016, available at www.esomar.org and www.grbn.org), which helps researchers address the legal, ethical and practical considerations in using rapidly changing technologies when conducting research online.
The ESOMAR WAPOR Guideline on Opinion Polls and Published Surveys (published by ESOMAR and WAPOR in 2014, available at www.esomar.org and www.wapor.org), which sets out the responsibilities of
researchers to conduct opinion polls in a professional and ethical way and to enable the public to differentiate between professional and unprofessional polls.
The ESOMAR Guideline on Mobile Research (updated by ESOMAR in 2012 and available at www.esomar.org), which addresses the legal, ethical and practical considerations of conducting market research using feature phones, smartphones, tablets and other mobile devices.
The ESOMAR Guideline on Interviewing Children and Young People (ESOMAR, 2009, available at www.esomar.org), which addresses the ethical considerations, special care and precautions the researcher must observe.
8
Observations on the Historical Development of Polling
Kathleen A. Frankovic
A variety of researchers have placed the beginnings of survey research in many places. Old questionnaires from centuries ago, reminiscent of modern-day queries, form part of survey research’s history. But much of the development of survey research as we know it today came from social research, from government, from the business world, including the world of journalism. Solving problems and gaining understanding about people are part of all those disciplines. Sometimes the most mundane reasons led to major advancements in the field. There are multiple reasons for the development of survey research – the curiosity and policy interests of governments to count the citizenry or attempt to determine their needs, the social welfare programs of service agencies interested in poverty, the attempts of business to sell more products, the efforts of journalists to report the public mood and to lure readers and viewers, and the work of academics in the social sciences, who not only studied opinion but studied how to measure it.
Although many survey research developments took place in the United States, there have been international contributions, too.
EARLY DEVELOPMENTS AND EXAMPLES

Full-scale population censuses date back to Roman times, when counts collected information for citizenship and taxation. Perhaps the first known questionnaire can be attributed to the time of Charlemagne, who was seeking empirical information about problems in his regime in 811 (Petersen, 2005). Joe Belden reported on a sixteenth-century, 50-item questionnaire sent to the New World by the King of Spain, Philip II. The Relaciones Geograficas asked Spanish officials in the New World about the terrain, the native government structure and traditions, animals, birds and religious institutions, among other topics. There are still 166 extant responses (Wauchope, 1972).
In the late 1780s Sir John Sinclair polled the Scottish clergy to learn about parishes and parishioners. His follow-ups for non-response included personal visits, and he received 881 questionnaires, each with 120 items. Those results would be published in the Statistical Account of Scotland (Martin, 1984). The Frenchman Frederic Le Play conducted an ‘intensive’ study of families, published in 1855 in Les Ouvriers Européens (Converse, 1987). Like B. Seebohm Rowntree in York, England, he interviewed subjects directly (Martin, 1984). To learn about the needs for social welfare, private agencies in Great Britain invented the social survey, studying the distribution of poverty. Charles Booth used school inspectors to create a sociological map of London. His volumes, Life and Labour of the People in London, were published from 1889 to 1903. The work marked the beginning of studies of poverty in other countries, including the United States (Converse, 1987: 19). In that same period, similar studies were being conducted by Hull House, a settlement house in Chicago, while W.E.B. DuBois (1976 [1899]) did an intensive study for his book The Philadelphia Negro, and journalist and reformer Paul Kellogg conducted his Pittsburgh survey in 1906. Large-scale academic sociological reports that utilized questionnaires include the 1930s’ study of unemployment in Marienthal, Austria, another example of the interest in discovering social needs. Paul Lazarsfeld and other Viennese scholars, including Marie Jahoda and Hans Zeisel (Jahoda et al., 2002 [1933]), examined the impact of unemployment on the village of Marienthal, where about half the workers lost their jobs when a textile factory shut down in 1931. The book, Marienthal: the Sociology of an Unemployed Community (in German Die Arbeitslosen von Marienthal), published in 1933, used multiple methods, including questionnaires and observation, to analyze the impact of joblessness on the community. The field work was immersive (Noelle-Neumann, 2001). The Austrian
history of social research included housing studies in the latter half of the nineteenth century; much like the British social surveyors did, E. von Philippovich examined the living conditions of community inhabitants (Amann, 2012). German academics were also part of early twentieth-century academic developments. As early as 1909, Hugo Münsterberg, along with the American Edward Rogers, began conducting trademark research. The subject was important to marketing research, as it could establish whether or not brands were being infringed upon. It also meant that surveys would be used for legal evidence. The first successful use of a survey in a trademark dispute was in Germany in 1933, predating any successful challenge in the United States (Niedermann, 2012). Modern survey research has early roots in some Eastern European countries, too. In Poland, the sociologist Florian Znaniecki and William Thomas co-authored the classic The Polish Peasant in Europe and America (Thomas and Znaniecki, 1918–1920). Although it did not include a survey as we understand it today, its use of personal information from sources such as private letters underscored the importance of asking about individuals and not just leaders. The US government’s role was critical to the development of survey techniques. One area where government data were collected through surveys was agriculture. In 1855, the President of the Maryland Agricultural Society sent questionnaires to county agricultural societies and other individuals asking them to report on crop conditions; by 1863, the Department of Agriculture received an appropriation to collect agricultural statistics (Henderson, 1942). Data collection took place on a regular basis beginning in 1866. In 1908, Liberty Hyde Bailey, of Cornell University, was appointed by Theodore Roosevelt as chairman of the National Commission on Country Life, and undertook a mailing to 530,000 rural residents, with questions about behavior, activities, and satisfactions.
100,000 questionnaires were returned. But the Department of Agriculture objected, and began its own similar studies in 1915, and set up an official research department in 1919. Among other projects, the Department surveyed officials to find out if food stamps would be a better method than direct distribution of food to those in need (Wallace and McCasey, 1940). Today, survey research is a massive industry, with links to advertising, market research, journalism, academia, and government. Data are collected face-to-face, by telephone (landline and cellphone), and online (desktop and mobile). The discipline faces challenges from lowered response rates, data collection changes, and the success of surveys themselves. The proliferation of polls and the general belief in poll findings make them targets of attack from those who dislike the results, and from government regulations that attempt to limit data collection and publications. How did the discipline of survey research reach this level of importance? This chapter will examine several of the turning points in survey research, with special emphasis on the most visible component of modern survey research – the public opinion poll.
THE NINETEENTH CENTURY

We can date the roots of polling and other survey research as we know them today to the nineteenth century, which saw the long-term developments and modifications that led to the surveys of today. Pre-nineteenth-century manifestations, although interesting, appear in hindsight to be one-time aberrations in survey development. For example, the Charlemagne surveys were indeed a set of questions about public feelings towards military service and the relationship between the church and the public, but they were addressed not to the public but to the ‘counts, bishops and abbots’. The answers were collected and reprinted as responses, but never coded and/or tabulated.
Reaching out to the public directly for their opinions is a process that begins with the move to popular democracy, and it seems to have begun in the United States. In 1824, at the time of an expansion of the electorate, political parties and (at the time, extremely partisan) newspapers collected and tabulated information about public preferences in that year’s presidential election. Preferences were taken at public meetings, in open registers moved from place to place, and even on riverboats. Juries polled themselves and their preferences which were duly reported in the press (Smith, 1990; Frankovic, 1998). The link to democratization is clear. In 1824, relatively few states selected their representatives to the Electoral College (which cast the official ballots for President) through public elections (which themselves were frequently restricted to white men who owned property). In seven of the 24 states in 1824, electors were named by the state legislature. This separation of the presidency from the public was envisioned in the US Constitution, but after more than 30 years, the limits were disappearing in some states at least. Polling the public was a way to demonstrate this lack of public representation, and the ‘straw polls’ as they were known, became devices to indicate that the legislature’s candidate might not be the public’s choice. Those straw polls were particularly common in the states without a public vote for electors, and were predominantly conducted on an ad hoc basis: in North Carolina a jury polled itself; poll books were opened in city halls and taverns; and a Mississippi Riverboat poll showcased the geographical differences in preference based on a person’s state of origin. The polls were reported by newspapers whose political position favored the candidate leading in each poll, and circulated to other newspapers with the same political stance, giving the poll results even more exposure. The winner of many of those straw polls, Andrew Jackson, lost the 1824 race, but four years later, as the franchise expanded – and as fewer states ignored the public in presidential voting (in only two states
did state legislatures select electors in 1828), he was elected President (Frankovic, 1998). Public opinion polls were often linked to democratization, even in the twentieth and twenty-first centuries. After World War II, as Japan adopted democracy under US tutelage, it also adopted polling (Sigeki, 1983). Today, polling is part of every emerging democracy. The political need for a statement of public opinion in elections, which had emerged when much of the US public did not actively participate in choosing the nation’s highest office, disappeared as the presidential electoral process incorporated more citizens. Other needs would produce the growth in public opinion measurement that began at the end of the nineteenth century. These would take place both in the commercial world and in academia. The era of straw polls began in the late nineteenth century. In 1896, the Chicago Record sent postcard ballots to every registered Chicago voter, and to a sample of 1 in 10 voters in eight surrounding states. The Record mailed a total of 833,277 postcard ballots, at a cost of $60,000; 240,000 of those sample ballots were returned. The Record poll found Republican William McKinley far ahead of the Democrat, William Jennings Bryan. McKinley won; and in the city of Chicago, the Record’s pre-election poll results came within four one-hundredths of a percent of the actual election-day tally (Frankovic et al., 2009). Throughout the nineteenth century, there also were advances in statistics which, though not necessarily adopted by early survey researchers, would eventually make their way into the discipline. In the mid-nineteenth century Francis Galton developed correlation and regression; in 1890 the Hollerith card, which could be machine-read and which speeded up the analysis of large amounts of data, was first used for the United States Census. The American Statistical Association was formed, and statistics were taught at universities (Bernard and Bernard, 1943). Market research also has a nineteenth-century beginning: although not a survey, N.W. Ayer and
Son collected crop statistics in order to plan marketing strategies (Blankenship, 1943).
THE FIRST HALF OF THE TWENTIETH CENTURY

The straw polls that were established in 1896 continued to grow in size and cost. Most were conducted in the Midwestern states. In its 1923 mayoral election poll, the Chicago Tribune tabulated more than 85,000 ballots. In the month before the election, interviews were conducted throughout Chicago, and results were published reporting preferences by ethnic group (including the ‘colored’ vote), with special samples of streetcar drivers, moviegoers (noting the differences between first and second show attendees), and white-collar workers in the Loop. The cost of these straw polls was astronomical. Between 1900 and 1920 there were nearly 20 separate news ‘straw polls’ in the United States. None was as expensive as that of the Literary Digest, which established its poll in 1916. In 1920, it mailed ballot cards to 11,000,000 potential voters, selected predominantly from telephone lists. In later years, car registration lists and some voter registration lists were added. Newsweek (1935) estimated that the cost of the Literary Digest’s poll was three cents per ballot (including printing, mailing and tabulating costs: 20,000,000 ballots equaled a total cost of $600,000). That cost could not be justified only by the journalistic use of the results; a mailed ballot was accompanied by a subscription form for the Digest (Frankovic, 2012). Political polling grew along with its journalistic uses. The Presidency of Franklin Delano Roosevelt may have marked a turning point. FDR’s tenure coincided with the rise of more scientific approaches to public polling. Although previous administrations required public opinion information, sources were restricted to active participants. Susan Herbst (1993) points out that the collection
of information about public opinion moved from noting the ‘active’ expression of opinion (demonstrations, marches, letter-writing, news editorials) to the more passive collection of information that utilized polling. Questions were determined by those who wrote the polls, and measurement was of issues that the polltakers determined were important, not issues generated by the public. The presidential search for information prior to the 1930s consisted of the collection of information from newspapers and other political leaders (Jacobs and Shapiro, 1995; Geer, 1996; Eisinger, 2003). Straw polls that appeared in newspapers were also of use. But the real emergence of pre-election polls as we now know them came during the presidency of Franklin Roosevelt. When Roosevelt began his term in 1933, the Literary Digest was still the most prominent poll, and it was respected for having predicted presidential outcomes correctly. FDR received advice from Emil Hurja, a statistician who worked for the Democratic National Committee. Hurja collected ballot information and straw poll results (even collecting some of his own) and made predictions about election outcomes. Using his own knowledge, Hurja made adjustments and statistical corrections to the Literary Digest and other straw polls. Hurja met Gallup in 1935 and was said to have trusted his polls (TIME, 1936). But by the 1936 election, George Gallup, Archibald Crossley, and Elmo Roper began to perfect the sample survey, which would cost significantly less and take less time to conduct than the newspaper and magazine straw polls. Their stories show the influence of advertising and marketing research on the development of polling. They came from the business world and not from politics or academia (though Gallup had earned a PhD). Gallup began his career doing newspaper readership measurement before attending graduate school. He was an interviewer in 1922 for the St. Louis Post-Dispatch, measuring the stories that had been read (Smith, 1997);
he measured readership by showing respondents the stories and the paper itself (Gallup, 1930), and he developed a radio ratings method (the ‘coincidental method’) while working at Young and Rubicam, an advertising agency in New York (Gallup, 1982). The telephone calls, asking people what they were listening to, cost 10 cents each, and could be completed quickly. Interestingly, Gallup noted that radio ownership coincided with telephone ownership, therefore those early phone calls could reach the bulk of the radio audience (Frankovic, 1983a; Smith, 1997). In 1934, Gallup conducted a test election poll for the head of the Publisher’s Syndicate. That led to the creation of the American Institute of Public Opinion – the Gallup Poll. Gallup continued working for Young and Rubicam while conducting the Gallup Poll. In 1935, Elmo Roper, who also came from the commercial world, began a collaboration with Paul Cherington, from another advertising agency, J. Walter Thompson. The two established the Fortune Poll, which promised democracy to businessmen and to journalists, and the editors justified the poll as an offshoot of the advertising research business: … if Mr. H.G. Weaver of General Motors can discover by a survey what trend in automobile design is most welcome to General Motors customers, why cannot the editor of a magazine ascertain by the same method the real state of public opinion on matters that vitally concern the readers? (Fortune, 1935)
Roper had emerged from the jewelry business. In 1933, he asked buyers questions about what was most critical to them: quality, style or price. His consultancy with Cherington included work for companies that came very close to opinion polling: learning whether the public supported the policies of their client companies (Carlson, 1968). Like Gallup, Archibald Crossley, who had been working at Curtis Publishing in the 1920s, also established a radio measurement system in order to measure the reach of radio ads. He once wrote how advertising
measurement dated back to 1912, when Roy Eastman needed to measure the reach of advertisements for breakfast cereals (Crossley, 1957). In 1936, Crossley worked for the Hearst papers, measuring the public’s response to that year’s presidential campaign. The methods of opinion measurement in the pivotal 1936 election, which established the dominance of sample surveys and the demise of the ‘straw poll’, were not at all like those used later. The Literary Digest’s massive mailout of ballots had several weaknesses obvious to modern-day researchers: the sampling frame, composed mainly, but not completely, of telephone, automobile and subscription lists, had a class bias in a Depression-era year when economic class would determine the vote. The editors made no effort to adjust results to match the reported vote for the previous election, even though the 1936 ballot card included a question about how the respondent had voted in 1932, and that tabulation was far more Republican than the actual vote. There was also differential non-response: those who might have been more pro-Roosevelt were less likely to return the Digest’s ballots (Squire, 1988). And the Digest counted the ballots cumulatively. Since they were received over multiple weeks, there was no way to track any change in opinion, and there seemed to have been movement towards Roosevelt. While the Digest produced estimates for every state, its national total was not weighted by state population (Crossley, 1937). Gallup, Roper, and Crossley relied on quota sampling and much smaller samples. Gallup even used some mail ballots, along with personal interviews, at least during that first presidential election (the Fortune polls, on the other hand, were all conducted in person). The Gallup Poll charged newspaper subscribers $1 per week for each 1,000 of circulation; it maintained a mailing list of 100,000 potential respondents and a field staff of about 1,000 part-time interviewers, paid 65 cents an hour, to conduct personal interviews. The average cost for these in-person interviews was 40 cents each (O’Malley, 1940).
But the samples were constructed from the best information the pollsters had at the time, though truly random selection would not take place until years later. Crossley, for example, selected sampling points based on the dominant occupation in the district. He also checked for machine and party activities in the district. Weighting was done by past vote (for example, while New York City comprised 56% of the state population in 1936, it was only 45% of voters) (Crossley, 1937). Basically, a small sample was preferable, using a combination of mail and personal interviews or exclusively personal interviews, with fixed quotas. Changes in registration might require changes in weighting (Crossley, 1937). The ten Crossley polls for the Hearst papers in 1936 each sampled 15,000 respondents over a two-week period, and made estimates state by state (the smallest state sample was 180). Within the larger states, weights for Congressional Districts were used. Crossley even used matched samples to include the military vote. Although, unlike the Literary Digest, the 1936 polls predicted the winner correctly, they were significantly off in estimating the final results. Gallup was off by 6.8% on each candidate that year. And the polls that were closer to the final outcome never were published. The Fortune final pre-election poll estimate was off by just 1.2% from the actual results, but since the editors were convinced (probably by a combination of belief in the accuracy of the Literary Digest and their own political desires) that Roosevelt would lose, Fortune did not publish the final poll (O’Malley, 1940). Roper, too, relied heavily on quotas for geography, size of place, sex, age, occupation, and economic levels. He also claimed that he never asked directly who people planned to vote for in 1936 (Roper, 1939). Gallup became the best-known of the opinion pollsters, though he and the other 1936 pollsters may not have been the first (Henry C. Link, founded the Psychological Corporation in 1932, collecting data about the psychological state of the public before them).
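To make the past-vote weighting logic described above concrete (a simplified illustration built only from the New York figures quoted here, not a reconstruction of Crossley’s actual worksheets): if interviews in New York City had been allocated in proportion to population (56% of the state) while the city was expected to cast only 45% of the state’s vote, each New York City interview would carry a weight of roughly

w_{NYC} = \frac{\text{expected share of the vote}}{\text{share of the sample}} = \frac{0.45}{0.56} \approx 0.80,

down-weighting the city relative to the rest of the state so that the state-level estimate reflects likely voters rather than raw population.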
Gallup’s books, including The Pulse of Democracy (1940), written with Saul Forbes Rae, a Canadian researcher, and his tireless evangelism for polling gave him a prominence with the public that neither of the others reached. A ‘Poll on the Polls’, reported by Eric Goldman, asked what polls Americans had heard of: 60% volunteered the Gallup Poll, 11% named the Fortune Poll, 8% the Literary Digest, and only 7% Crossley (Goldman, 1944–1945). There were a number of other commercial advances in the 1930s. A.C. Nielsen founded the Nielsen market research company in 1923. In 1939 the company began its Audimeter research, and offered it to clients in 1942, with a 1,000-listener sample in nine Eastern and Central states (Nielsen, 1944–1945). Roosevelt became a consumer of polling. And he had help. He became the first President to poll privately. Hadley Cantril, a professor at Princeton University, established the Office of Public Opinion Research at Princeton in 1940, receiving funding from the Rockefeller Foundation. He was active in the public opinion research community, which by 1940 included academics and commercial operations, and served as the associate editor of Public Opinion Quarterly, which published its first issue in 1937. Cantril’s work focused at first on the war in Europe. He attempted to develop a methodology which would use small samples and collect data unobtrusively, without a questionnaire. Cantril conducted a 200-interview case study in Canada to simulate polling in enemy territory, and worked to measure the impact of Nazi propaganda in Latin America. The Cantril methodology was used by the Office of War Information in Morocco in 1942. That project, conducted before the Allied invasion of North Africa, suggested that British troops not be part of the land invasion, and they were not used (Cantril, 1967). The ‘hidden’ polling technique was also used in Sicily in 1943 (Reuband, 2012). Cantril’s path to Roosevelt’s ear came through the Rockefeller Foundation and
James W. Young, FDR’s Undersecretary of Commerce and another advertising executive who had been the former President of the J. Walter Thompson agency. In 1940, FDR’s primary poll interest was in whether Americans wanted to stay out of war or help England win. In July 1940, Cantril found that Americans wanted to stay out of war 59% to 37% (Cantril, 1967). That question would be repeated several times. Cantril asked other questions to satisfy Roosevelt’s worries about public opinion as the US moved closer to war, including 1940 and 1941 items about US aid to England. In February 1941, even before Roosevelt signed the bill into law, 61% of Americans had heard of Lend Lease, the program to ship and ‘lend’ war materiel to Britain, Free France, and China. Those that knew of the program approved. Beginning in 1942, after the United States was engaged in World War II, the private polls Cantril did for Roosevelt were funded by Gerald Lambert, of Lambert Pharmaceuticals, who made a fortune marketing Listerine, convincing Americans that bad breath was a menace. Lambert, though a Republican, supported the US war effort. The two created ‘The Research Council, Inc.’ and provided the White House with trend charts, public relations and speech recommendations, as well as data (Lambert, 1956; Cantril, 1967). Cantril’s reports went officially to the Office of Facts and Figures in the Office of War Information and to the Office of Special Consultant at the Department of State (Steele, 1974; Leigh, 1976). The more public pollsters also offered their results to the President. In 1940, Elmo Roper reported calling FDR with tabulations on American attitudes towards Lend Lease at 1 a.m. (Carlson, 1968). World War II also stimulated the academic community, and researchers became involved in war work. Rensis Likert aided the Division of Program Surveys in the Department of Agriculture, while Elmo Wilson (later the Director of Research at CBS) and Hazel Gaudet were part of OWI, the Office of War
Information, which conducted research on the effectiveness of war information campaigns. The Research Branch of the War Department’s Information and Education Division included Samuel Stouffer, who led the effort to study the war’s impact on the military, in work that was later to be published as The American Soldier (Stouffer et al., 1949). The work provided opportunities for methodological experimentation (Williams, 1983). Herbert Hyman also was part of that effort. He wrote of that experience: Our prime concern was practical, not scholarly, and we believed from past experience that we could be quick without being hasty, thoughtful without becoming obsessive, thorough without engaging in interminable analysis. It never would have occurred to us back then that a survey researcher need six months to a year to design a survey and another year or two to analyze the findings. (Hyman, 1991: 109)
The interaction between academic researchers and commercial researchers strengthened during the Roosevelt Administrations. Although Arch Crossley noted how ‘political scientists thought we were invading their turf’ (Carlson, 1968), many academics consulted for commercial firms. Paul Lazarsfeld, Samuel Stouffer, and statistician Raymond Franzen were consultants on the Roper payroll (McDonald, 1962). Cantril worked with Gallup. Frank Stanton, the Chairman of CBS, chaired the Board of Directors for the Bureau of Applied Social Research, Lazarsfeld’s American home at Columbia University. The two worked on the Program Analyser to measure response to radio programs (it would later be used by the Office of War Information to analyze propaganda films) (Remi, 2012). By 1946, interactions between academics and commercial researchers found an institutional home with the foundation of the American Association for Public Opinion Research in 1947, which codified the relationship with By-Laws that required the rotation of some offices by commercial and academic researchers. At the end of World War II, other
associations were also established. WAPOR, The World Association for Public Opinion Research, was established in the same year as AAPOR. ESOMAR, which was first an acronym for the European Society for Opinion and Market Research, and now a world organization with members in more than 130 countries, was formed the following year, in 1948. European researchers, including some from Czechoslovakia and Hungary, had met in Paris the year before (Downham, 1997). War work using surveys provided intelligence to Roosevelt. But the new tool would also directly provide campaign advice as well as policy help. In 1944, Gerald Lambert organized pollsters Harry Field, founder of the National Opinion Research Center (NORC), and Jack Maloney from Gallup, to help the campaign of Republican Wendell Willke. Roosevelt easily defeated Willke that year, and Lambert later complained that Willke did not take his advice not to waste time campaigning in Southern states. In 1952, Lambert financed Crossley Surveys to conduct a poll in New Hampshire, which he claimed he used to convince Dwight Eisenhower to run. Lambert worked for Eisenhower in that election (Lambert, 1956). Lambert also asked Crossley to work for Nelson Rockefeller’s unsuccessful campaign for the 1964 Republican presidential nomination (Frankovic, 1983b). Elmo Roper worked for candidates in 1946, but after that election he concentrated on the market research that made up more than 85% of his business in election years and 98% in nonelection years (McDonald, 1962). The early polls, especially the Gallup polls, created a number of questions that would continue to be used. By 1940, Gallup began asking whether people approved of the way Franklin Roosevelt was handling his job, and which problem facing the country Americans believed was most important. The very first Fortune survey asked a question that would later be adapted for the American National Election Studies: ‘Do you believe that the government should see to it that every man who wants to work has a job?’ (Frankovic, 1998).
The integration of polling into government, and the acceptance of news polls continued after the war. State polls were founded in Minnesota, Iowa, and Texas. Joe Belden set up a poll of college students in 1937 while he was still a student at the University of Texas (having written Gallup for information about how to conduct a poll and receiving a two-page instruction sheet in return) (Belden, 1939; Frankovic, 1983c). The Washington Post, the first Gallup Poll subscriber, began its local polling in 1945, stating this would ‘implement the democratic process in the only American community in which residents do not have the right to express their views in the polling booth’ (Frankovic, 1998).
1948

The polling successes also brought unattainable expectations. Roper once noted the public looked on polls as a ‘miracle drug thing’, with many researchers ‘overclaiming’ the merits of polling. The gains of the war years, to him, were not in methodological advances, but in ‘acceptance’ (Carlson, 1968). There were also polling successes in the 1940 and 1944 elections, with the polls correctly predicting that Roosevelt would win. Consequently, the 1948 polling mispredictions caused a huge wave of criticism. The final public polls that year all said that Republican Thomas Dewey would defeat Democratic incumbent Harry Truman. Instead, Truman scored a nearly 5-point victory over Dewey. The pollsters all admitted after the election that they had made several mistakes, the first of which was assuming that campaigns mattered little. All stopped polling too soon. Roper did his last presidential election polling in early September; Crossley and Gallup continued, but all had ended their interviewing before the last two weeks of the campaign. And it wasn’t just the pollsters who believed Truman would lose, though many journalists were certainly
influenced by the polls. A week before the election, LIFE magazine captioned a picture of Truman as ‘our next President’. After the election, Roper noted that in his polling, Truman received 81% of those who had said they were undecided in September. One woman in Pittsburgh wrote him after the election: ‘I told you I was going to vote for Dewey … I changed my mind. I have been feeling badly ever since I read the polls were wrong’ (Carlson, 1968). Roper also spent the days following the election calling his market research clients to reassure them about his work. The Washington Post sent a conciliatory telegram to Truman: ‘You are hereby invited to attend a crow banquet to which the newspaper proposes to invite newspaper editorial writers, political reporters and editors, including our own, for the purpose of providing a repast appropriate to the appetite created by the late election’. Academics, many of whom had been consulting with the public pollsters, and had joined with them to form AAPOR, now found themselves critical of polling methods. By 1948, some of the most important academic survey centers were established or about to be created. The National Opinion Research Center (NORC) had been established by Harry Field in 1941; Paul Lazarsfeld’s Bureau of Applied Social Research was set up at Columbia University in 1944, but grew out of Princeton University’s Radio Research Project; the Institute for Social Research (ISR) would be founded by Rensis Likert in 1949. Ten days after the election, the Social Science Research Council, established in 1923, appointed a committee on Analysis of Pre-Election Polls and Forecasts. Its report was completed before the year ended and was then published as a book in the spring of 1949 (Mosteller et al., 1949). The committee called for the use of better techniques of sampling, interviewer training and supervision, and more methodological research, as well as greater efforts to educate the public about polls (Frankovic, 1992). AAPOR pledged cooperation with the committee.
The Committee’s report concluded: The pollsters overreached the capabilities of the public opinion poll as a predictive device … The evidence indicates that there were two major causes of error: (a) errors of sampling and interviewing and (b) errors of forecasting, involving failure to assess the future behavior of the undecided voters and to detect shifts of voting intention near the end of the campaign … These sources of error were not new.
The report also criticized ‘the manner in which the pre-election polls were analyzed, presented and published for public consumption’, stating that the result was to give the public ‘too much confidence’ in polls before the election and ‘too much distrust in them afterwards’. At the 1949 conference, academics severely criticized the pollsters for their methodology (using quotas instead of probability samples, having samples that were too small, not paying enough attention to the undecided). Gallup promised to refine the filters he used to determine who was likely to vote. All the pollsters agreed on the need to continue polling close to the election. There were reasons for the academics to criticize the pollsters. Random sampling methods had been developed by 1948. In the late nineteenth century, mathematicians had debated the ways one could obtain a representative sample, with some favoring purposive selection and others random sampling. Arthur L. Bowley, an English statistician and economist known as the father of modern statistics, is credited with the development of modern sampling methods, using simple random sampling – and cluster sampling – in a social survey of working-class households in five English towns in 1912. There were earlier efforts in Norway and Japan (Martin, 1984), and as early as 1937 Claude Robinson, the founder of Opinion Research Corporation, critiqued the quota sampling used in the 1936 polls. Some of the advances came from the American government, particularly the Department of Agriculture, which worked on sampling and methodology. In the 1930s the
Bureau of Agricultural Economics sponsored methodological conferences, including lectures by Jerzy Neyman on probability sampling (Converse, 1987).
POST-WAR AND POST-1948

The events of 1948 would cast a ‘long shadow’ in the US and even overseas. But despite the failure of the polls, polling would continue to increase. Prior to World War II, Gallup conducted polls in Great Britain, which were published in the News Chronicle. Gallup’s British Institute of Public Opinion, which was modeled on the US organization, conducted its first polls in 1938 (Hodder-Williams, 1970). Gallup also established polling operations in Canada, Sweden, Australia, and France before World War II. The post-war polling landscape expanded in Europe and elsewhere. In some cases, like Japan, the American influence was directly responsible for the growth of polling (Sigeki, 1983). In 1945, Cantril received $1 million from Nelson Rockefeller to create an Institute for International Social Research (Cantril, 1967: 131).1 In 1952, Cantril was asked to study Japan, Thailand, Italy, and France (Cantril, 1967). And in the mid-1950s, he tested the impact of various anti-Communist appeals in France and Italy, noting that Secretary of State John Foster Dulles’s appeals were the lowest rated (ibid.: 122–123). Gallup and other market research firms established operations in post-war Europe. In 1946 in Britain, Research Services, Ltd grew out of the research department of the London Press Exchange Group. NOP (‘National Opinion Poll’) was established in 1958. A year later Marplan, Ltd grew out of McCann Erickson. Opinion Research and the Lou Harris Poll followed in the 1960s (Hodder-Williams, 1970). The Australian Gallup Poll conducted its first survey in September 1941. It remained the only public opinion poll there for 30 years (Goot, 2012).
In Poland, just before World War II, sociologists utilized surveys of students (with one study collecting questionnaires from 10,000 Polish students). Those surveys were destroyed during the war, but academic studies resumed in the short interlude between the Allied victory and the Communist assumption of power (Sulek, 1992). The Czechoslovak Institute of Public Opinion was founded in 1946, and survived until the Communist takeover in 1948 (Srbt, 2012). Though there were journalistic attempts to measure opinion in Japan prior to World War II, it took the American Occupation to make polling a major part of Japanese government and journalism (Sigeki, 1983). There is a record of a newspaper poll conducted in Nationalist China in World War II (Henderson, 1942), but this was mainly a curiosity; public opinion surveys, even for academic purposes, were severely limited there after the Communist revolution in 1949. After World War II, the US government’s polling efforts were concentrated in the Office of Public Affairs in the Department of State, which produced ‘American Opinion Reports’ from 1944 to 1947. However, although the State Department would continue to conduct polls in other parts of the world, its US polling operation was shut down in 1957, after Congress conducted an investigation of a polling leak, accusing the department of using polls for ‘propaganda and publicity’ (Leigh, 1976). President Eisenhower had access to polls, but presidential polling started in earnest in 1961, more than a decade after the 1948 debacle. Not only did the Kennedy Administration collect public polling data, it also ran its own private polls. Lyndon Johnson and Richard Nixon were even more interested in polling data, though Johnson became less interested after his 1964 landslide election win (Geer, 1996: 92). Nixon met with his pollster, David Derge, to discuss question wording. However, like so many presidents, Nixon did not want to make it appear that he was using polling data in decision making (Jacobs and Shapiro, 1995).
International Advances

The use of the Gallup methodology in Great Britain was not the only way opinion was measured there. The competition was something called ‘Mass-Observation’, which has been described as a ‘hybrid of British anthropology, American community studies and French surrealism’ (Goot, 2008: 93). It bore some resemblance to the British social surveys of the nineteenth century, as it collected enormous amounts of information about beliefs and behaviors. It ran from 1937 until 1949. Mass-Observation used mostly unpaid volunteers, many of whom were teachers and clerks (although ‘observers’ were drawn from all social classes), and asked them to set down ‘all that happened to them’ one day every month. Reports occasionally were produced from the material collected by the founders Tom Harrisson and Charles Madge, an anthropologist and a journalist (Albig, 1956). Although there was no attempt at sampling, the amount of qualitative information collected was staggering. Mass-Observation could give reasons to explain Churchill’s defeat and why women might not be willing to undertake war work. Created in the run-up to World War II, Mass-Observation lasted only a few years after the war’s end (Goot, 2008). Latin American survey research also began during the war. In 1940, Nelson Rockefeller, the head of the Office of Inter-American Affairs, wanted information from Latin America. Cantril created a fake company to do that research in 1941. The first of those projects to be completed was in Brazil. A month earlier, J. Walter Thompson had completed a survey of short-wave radio listening in Argentina (Ortiz-Garcia, 2012). Seven years after the Philippines became independent in 1946, polling began there, although early polls, conducted both by universities and private companies, were limited to urban samples. But just as World War II restricted survey research in Europe, in the Philippines polling effectively stopped during the Marcos regime, resuming after his
downfall in 1986 (Guerrero and Mangahas, 2004). Joe Belden, who had been born in Mexico, set up a firm there in 1947. He developed a radio audience measurement system. But a wartime Hungarian immigrant from Germany, Laszlo Radvanyi, may have been more important to the early development of the Mexican opinion polling industry. He founded the Scientific Institute of Mexican Public Opinion about as soon as he arrived; he polled through 1952. Then, as the PRI (the Institutional Revolutionary Party) continued its one-party dominance, he moved back to East Germany. Radvanyi, also known as Johann Lorenz Schmidt, had been trained in Germany and worked in 1942 in France. But he still modeled his institute on the Gallup model. Radvanyi had been a founding member of WAPOR, and until 1951 edited the official WAPOR journal (Moreno and Sanchez-Castro, 2009). By the twenty-first century, survey research had spread to places like Myanmar and Mongolia (Hessler, 2001). Just as the Rockefeller Foundation encouraged research in Latin America in the 1940s the Soros Foundation has been involved in present day expansion of democratic techniques. American strategists were engaged in election polling in many places from the former Soviet space, like Ukraine and the Republic of Georgia, and they have been joined by the International Republican Institute (IRI) and the National Democratic Institute (NDI), non-profits loosely connected to the American political parties who also encouraged party growth and elections. Another American invention that has spread to the rest of the world is the exit poll. The concept of discovering opinion after people voted appears to have been created by Harry Field and NORC, with a poll on issues conducted after people voted in Boulder, Colorado in 1940 (Albig, 1956). But that work had nothing to do with what exit polls became famous for: projecting elections. In 1964, Ruth Clark, who later became a well-known newspaper researcher, was an interviewer for Lou Harris. She was conducting door-to-door interviews of voters in Maryland on primary election day,
and realized she could find more voters at polling places. As she put it: ‘I told Lou what I had done and by the [Republican] California primary in June, the exit poll was put to full use, with Barry Goldwater voters dropping blue beans into a jar, while Nelson Rockefeller voters dropped red beans’. Harris and statistician David Neft used this information as consultants for CBS News to project election outcomes (Frankovic, 2008a, accessed from http://www.cbsnews.com/news/its-all-about-the-sample/). Warren Mitofsky formalized the exit polling process in 1969, and his first goal was to collect vote information in places where vote counting was especially slow. For the next election the CBS newsman added information on issues and demographics to the exit poll questionnaires, in order to explain why people voted as they did. The process was adopted by other news organizations and modified as elections changed. Interviewers were stationed at polling places selected through stratified random sampling. The technique spread to other countries. The first exit poll in Great Britain was conducted for ITN by the Harris Organization in 1974. Social Weather Stations conducted the first Election Day poll in the Philippines in 1992; Mexican exit polls began the same year. The first in Russia was conducted by Mitofsky and the Russian firm CESSI in 1993 (Frankovic, 2008a).
THE MODERN ERA AND THE TWENTY-FIRST CENTURY

In the 1970s, public polls would become even more important. The increased prevalence of telephones in American households and the rise of random-digit-dialing sampling methods made polling quicker and cheaper. Although television networks and newspapers used – and sometimes conducted – their own polls along with their audience research before, in the 1970s television networks created their own polling units, increasing the number of polls. The Mitofsky–Waksberg method for random
sampling of telephones was widely adopted. It meant telephone surveys did not have to rely on selecting individuals from listed numbers in telephone directories, and it provided efficiencies through its two-stage cluster selection. By 1990, after the increase in the processing ability of personal computers, the cost of processing polls dropped, and even more institutions, including colleges and universities, engaged in polling. Errors could be reduced by direct data entry, and interviewers were able to use computer-assisted telephone interviewing systems, reducing the data entry and interviewer instruction errors that were always present in paper-and-pencil formats. The rise in the number of polls meant that there were more of them to report. And there were more news operations as well: 24-hour news channels, beginning with CNN in 1980, reported some stories on a minute-by-minute basis. Elections gave the media polls the opportunity to experiment, while ongoing news stories meant that polling on news stories could be almost continuous. The shorter interviewing time periods that came with telephone polling were pushed to the extremes. Polls of the public’s reaction to the jury verdict in the murder trial of O.J. Simpson in 1995 were reported within hours of the verdict. Nights of poll tracking by media polling organizations had taken place in the days before and during the 1991 Persian Gulf War, and they would be used again during the impeachment of President Bill Clinton and later military actions as well. The proliferation of polls during the Clinton saga was so great that (according to data from the Roper Center Archive at the University of Connecticut) Monica Lewinsky became the most frequently asked-about woman in polling history. More questions were asked about her than about Eleanor Roosevelt. Lewinsky retained that position until the middle of the 2008 presidential election contest, when she was surpassed by then-Senator Hillary Clinton (Frankovic, 2008b).
By the twenty-first century, survey research was facing the impact of its own success: the large number of surveys stressed the willingness of Americans to cooperate, and response rates plunged below 10%. The changing nature of telephone ownership, with Americans switching from landlines to cellphones, changed sampling and weighting procedures (cellphones were primarily an individual, not a household-based, device). Simple Mitofsky–Waksberg methods shifted to list-assisted and similar cellphone sampling procedures. The rise of the internet had researchers struggling with ways to use the new technology without deviating from the goal of probability sampling, which had been the sampling method of choice since the quota sample was discredited after 1948. Stresses on the news industry, especially the financial difficulties of the print media, reduced news polling budgets, while the demand for polling information increased. And the individual poll, whose results had been so relied on in the twentieth century, was frequently devalued to being just one among many, as survey aggregators produced election estimates incorporating hundreds of available national and state polls.
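For readers who want to see the sampling logic rather than the history, the two-stage Mitofsky–Waksberg idea mentioned above can be sketched in a few lines of code. This is a simplified teaching illustration only, not the original specification or any production system: the area codes, the 100-number bank size, the 25% ‘hit rate’, and the classify_number() helper are all assumptions introduced for the example.

    import random

    # Stage-1 unit: area code + exchange + first two digits of the line number,
    # which together define a 'bank' of 100 possible telephone numbers.
    def random_prefix():
        area = random.choice(["202", "312", "415"])   # hypothetical area codes
        exchange = random.randint(200, 999)
        bank = random.randint(0, 99)
        return f"{area}{exchange:03d}{bank:02d}"

    # Placeholder for dialing a number and classifying it; a working residential
    # line is simulated here with an arbitrary 25% probability.
    def classify_number(number):
        return random.random() < 0.25

    def mitofsky_waksberg(n_primary, k_per_bank):
        """Stage 1: retain a bank only if its first randomly dialed number is a
        working residential line. Stage 2: keep dialing within retained banks
        until k residential numbers (counting the first) are found in each."""
        selected = []
        retained = 0
        while retained < n_primary:
            prefix = random_prefix()
            first = prefix + f"{random.randint(0, 99):02d}"
            if not classify_number(first):
                continue                               # bank rejected at stage 1
            retained += 1
            bank_numbers = [first]
            while len(bank_numbers) < k_per_bank:
                candidate = prefix + f"{random.randint(0, 99):02d}"
                if candidate not in bank_numbers and classify_number(candidate):
                    bank_numbers.append(candidate)
            selected.extend(bank_numbers)
        return selected

    print(len(mitofsky_waksberg(n_primary=5, k_per_bank=4)), "numbers drawn")

The efficiency the chapter refers to comes from the second stage: once a bank has yielded one working residential number, further numbers drawn from the same bank are far more likely to be residential than numbers generated completely at random.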
RECOMMENDED READINGS

Converse (1987) gives a comprehensive history of the development of survey research, with special reference to academic developments. A methodological history of poll development is given in Frankovic et al. (2009).
NOTE

1 The Rockefeller family’s interest in opinion research spanned generations. In the 1920s, Nelson Rockefeller’s father, John D. Rockefeller, Jr, sponsored a World Survey through the Interchurch World Movement, in order to ‘undertake a scientific survey of the world’s needs from the standpoint of the responsibility of evangelical Christians’ (Converse, 1987: 29). Although there were many publications from the study, the project went out of existence in 1924.
REFERENCES

Albig, W. (1956) Modern Public Opinion. New York: McGraw-Hill.
Amann, A. (2012) ‘Rediscovering the Prehistory of Social Research in Austria,’ in H. Haas, H. Jerabek, and T. Petersen, The Early Days of Survey Research and Their Importance Today. Vienna: Braumüller, pp. 119–128.
Belden, J. (1939) ‘Measuring College Thought,’ Public Opinion Quarterly 3, 458–462.
Bernard, L.L. and J. Bernard (1943) Origins of American Sociology: The Social Science Movement in the United States. New York: Thomas Y. Crowell Co.
Blankenship, A.B. (1943) Consumer and Opinion Research. New York: Harper and Bros.
Cantril, H. (1967) The Human Dimension: Experiences in Policy Research. New Brunswick, NJ: Rutgers University Press.
Carlson, R. (1968) Interview with Elmo Roper.
Converse, J.M. (1987) Survey Research in the United States: Roots and Emergence, 1890–1960. Berkeley: University of California Press.
Crossley, A.M. (1936) ‘Measuring Public Opinion,’ The Journal of Marketing 1, 272–294.
Crossley, A.M. (1937) ‘Straw Polls in 1936,’ Public Opinion Quarterly 1, 24–35.
Crossley, A.M. (1957) ‘Early Days of Survey Research,’ Public Opinion Quarterly 21, 159–164.
Downham, J. (1997) ESOMAR 50: A Continuing Record of Success. Amsterdam: ESOMAR.
DuBois, W.E.B. (1976 [1899]) The Philadelphia Negro. Available online at https://archive.org/details/philadelphianegr001901mbp (accessed on April 4, 2016).
Eisinger, R.M. (2003) The Evolution of Presidential Polling. Cambridge: Cambridge University Press.
Fortune (1935) ‘A New Technique in Journalism,’ July, 65–66.
Frankovic, K.A. (1983a) Interview with George Gallup, Princeton, NJ.
Frankovic, K.A. (1983b) Interview with Archibald Crossley, Princeton, NJ.
Frankovic, K.A. (1983c) Interview with Joe and Eugenia Belden.
Frankovic, K.A. (1992) ‘AAPOR and the Polls,’ in P.B. Sheatsley and W.J. Mitofsky, A Meeting Place: The History of the American Association for Public Opinion Research, AAPOR, pp. 117–154.
Frankovic, K.A. (1998) ‘Public Opinion and Polling,’ in D. Graber, D. McQuail, and P. Norris, The Politics of News: The News of Politics. Washington, DC: Congressional Quarterly Press, pp. 150–170.
Frankovic, K.A. (2008a) ‘Exit Polls and Pre-Election Polls,’ in SAGE Handbook of Survey Research 2008. Thousand Oaks, CA: SAGE, pp. 570–579.
Frankovic, K.A. (2008b) ‘Women and the Polls: Questions, Answers, Images,’ in L.D. Whitaker, Voting the Gender Gap: Investigating How Gender Affects Voting. Champaign, IL: University of Illinois Press, pp. 33–49.
Frankovic, K.A. (2012) ‘Straw Polls in the U.S.: Measuring Public Opinion 100 Years Ago,’ in H. Haas, H. Jerabek, and T. Petersen, The Early Days of Survey Research and Their Importance Today. Vienna: Braumüller, pp. 66–84.
Frankovic, K.A., C. Panagopoulos, and R.Y. Shapiro (2009) ‘Opinion and Election Polls,’ in D. Pfeffermann and D.R. Rao, Handbook of Statistics – Sample Surveys: Design, Methods and Applications, Volume 29A. North-Holland: Elsevier B.V., pp. 567–595.
Gallup, G. (1930) ‘Guesswork Eliminated in New Method for Determining Reader Interest,’ Editor and Publisher 62(8): 1, 55.
Gallup, G. (1939) The New Science of Public Opinion Measurement. Princeton: American Institute of Public Opinion.
Gallup, G. (1982) ‘My Young and Rubicam Years,’ Young and Rubicam Incorporated Alumni Newsletter 4.
Gallup, G. and S.F. Rae (1940) The Pulse of Democracy: The Public Opinion Poll and How it Works. New York: Greenwood Press.
Geer, J.G. (1996) From Tea Leaves to Opinion Polls: A Theory of Democratic Leadership. New York: Columbia University Press.
Goldman, E. (1944–1945) ‘Poll on the Polls,’ Public Opinion Quarterly 8, 461–467.
Goot, M. (2008) ‘Mass-Observations and Modern Public Opinion Research,’ in W. Donsbach and M. Traugott, The SAGE Handbook of Public Opinion Research 2008. Thousand Oaks, CA: SAGE, pp. 93–103.
Goot, M. (2012) ‘The Obvious and Logical Way to Ascertain the Public’s Attitude Toward a Problem,’ in H. Haas, H. Jerabek, and T. Petersen, The Early Days of Survey Research and Their Importance Today. Vienna: Braumiller, pp. 166–184. Guerrero, L.L.B and M. Mangahas (2004) ‘Historical Development of Public Opinion Polls in the Philippines,’ in J.G. Geer, Public Opinion Around the World: A Historical Encyclopedia, Vol. 2. Santa Barbara, CA: ABC-CLIO, pp. 689–696. Henderson, H.W. (1942) ‘An Early Poll,’ Public Opinion Quarterly 6, 450–451. Herbst, S. (1993) Numbered Voices: How Opinion Polls Has Shaped American Politics. Chicago: University of Chicago Press. Hessler, P. (2001) ‘Letter from Mongolia: The Nomad Vote,’ The New Yorker, July 16. Hodder-Williams, R. (1970) Public Opinion Polls and British Politics. London: Routledge and Kegan Paul. Hyman, H. (1991) Taking Society’s Measure: A Personal History of Survey Research. New York: Russell Sage. Jacobs, L. and R.Y. Shapiro (1995) ‘The Rise of Presidential Polling: the Nixon White House in Historical Perspective,’ Public Opinion Quarterly 59, 163–195. Jahoda, M., P.F. Lazarsfeld, and H. Zeisel (2002 [1933]) Marienthal: the Sociology of an Unemployed Community. Transaction Books. Lambert, G. (1956) All Out of Step. New York: Doubleday. Leigh, M. (1976) Mobilizing Consent: Public Opinion and American Foreign Policy, 1937– 47. Westport, CT: Greenwood Press. Martin, L.J. (1984) ‘The Geneology of Public Opinion Polling,’ Annals 472, 12–23. McDonald, D. (1962) ‘Opinion Polls: Interviews with Elmo Roper and George Gallup,’ Santa Barbara, CA: Center for the Study of Democratic Institutions. Moreno, A. and M. Sanchez-Castro (2009) ‘A Lost Decade? Laszlo Radvanyi and the Origins of Public Opinion Research in Mexico, 1941– 1952,’ International Journal of Public Opinion Research 21, 3–24. Mosteller, F., H. Hyman, P.J. McCarthy, E.S. Marks, and D.B. Truman (1949) The
101
Pre-Election Polls of 1948. Washington, DC: American Statistical Association. Newsweek (1935) ‘Poll: Dr. Gallup to Take the National Pulse and Temperature,’ October 26, 23–24. Niedermann, A. (2012) ‘One of the Earliest Practical Applications of Survey Research: Surveys for Legal Defense,’ in H. Haas, H. Jerabek, and T. Petersen, The Early Days of Survey Research and Their Importance Today. Vienna: Braumiller, pp. 203–212. Nielsen, A.C. (1944–1945) ‘Two Years of Commercial Operation of the Audimeter and the Nielsen Radio Index,’ Journal of Marketing 9, 239–255. Noelle-Neumann, E. (2001) ‘My Friend, Paul Lazarsfeld,’ International Journal of Public Opinion Research 13, 315–321. O’Malley, J.J. (1940) ‘Profiles: Black Beans and White Beans,’ The New Yorker, March 20, 20–24. Ortiz-Garcia, J.L. (2012) ‘The Early Days of Survey Research in Latin America,’ in H. Haas, H. Jerabek, and T. Petersen, The Early Days of Survey Research and Their Importance Today. Vienna: Braumiller, pp. 150–165. Petersen, T. (2005) ‘Charlemagne’s Questionnaire: A Little-Known Document from the Very Beginnings of Survey Research,’ Public Opinion Pros, www.publicopinionpros. norc.org/features/2005/feb/Petersen.asp. Remi, J. (2012) ‘Lazarsfeld’s Approach to Researching Audiences,’ in H. Haas, H. Jerabek, and T. Petersen, The Early Days of Survey Research and Their Importance Today. Vienna: Braumiller, pp. 49–64. Reuband, K-H. (2012) ‘Indirect and “Hidden” Surveys: An Almost Forgotten Survey Technique From the Early Years,’ in H. Haas, H. Jerabek, and T. Petersen, The Early Days of Survey Research and Their Importance Today. Vienna: Braumiller, pp. 186–202. Roper, E. (1939) ‘Sampling Public Opinion,’ Paper presented at the American Statistical Association meetings, Philadelphia, PA. Sigeki, N. (1983) ‘Public Opinion Polling in Japan,’ in R. Worcester (ed.), Political Opinion Polling: An International Review. London: The Macmillan Press Ltd., 152–168. Smith, R.D (1997) ‘Letting America Speak,’ Audacity 5, Winter, 50–61.
102
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
Smith, T.W. (1990) ‘The First Straw? A Story of the Origins of Election Polls,’ Public Opinion Quarterly 54, 21–36. Squire, P. (1988) ‘Why the Literary Digest Poll Failed,’ Public Opinion Quarterly 52, 125–133. Srbt, J. (2012) ‘Cenek, Adamic and the Early Stages of Public Opinion Research in the Czsech Lands’, in H. Haas, H. Jerabek, and T. Petersen, The Early Days of Survey Research and Their Importance Today. Vienna: Braumiller, pp. 129–149. Steele, R.W. (1974) ‘The Pulse of the Public: FDR and the Gauging of American Public Opinion,’ Journal of Contemporary History 9, 195–216. Stouffer, S.A., A.A. Lumsdaine, M. Harper, R.M. Lumsdaine, R.M. Williams, Jr., I.L. Janis, S.A. Star, L.S. Cottrell, Jr. (1949) Studies in Social
Psychology: The American Soldier, Vols 1–4. Princeton, NJ: Princeton University Press. Sulek, A. (1992) ‘The Rise and Decline of Survey Sociology in Poland,’ Social Research: An International Quarterly 59(2): 365–384. Thomas, W.I. and F. Znaniecki (1918–1920) The Polish Peasant in Europe and America. 2 vols. New York: Dover. TIME (1936) ‘Emil Hurja,’ March 2, 18–19. Wallace, H.A. and J.L McCasey (1940) ‘Polls and Public Administration,’ Public Opinion Quarterly 4, 221–223. Wauchope, R. (ed.) (1972) Handbook of Middle American Indians, Vol. 12. Austin, TX; University of Texas Press. Williams, R. (1983) ‘Field Observations in Surveys in Combat Zones,’ Paper presented at the American Sociological Association Meeting, Detroit, MI.
PART III
Planning a Survey
9 Research Question and Design for Survey Research Ben Jann and Thomas Hinz
TYPES OF RESEARCH QUESTIONS Surveys can generally be used to study various types of research questions in the social sciences. One important precondition is that researchers translate their research questions into corresponding survey questions that measure the concepts of interest. This process of operationalization is critical and needs thorough consideration (see Part III in this Handbook). In addition, respondents are assumed to provide meaningful and correct answers to questions presented in a survey format. Whenever these very basic criteria are met, a survey can help answer the given research question. Research questions can be general or specific. They can refer to attitudes or behaviors, to general or specific populations, to short or long time periods, to univariate or to multivariate distributions, and the required data will vary depending on the given research question. The amount and the structure of data in any research project (no matter whether collected in a survey)
is determined by the research design. This chapter distinguishes between several types of research questions and relates them to various research designs, and, thus, to different types of data structures. For the purpose of illustration, we will highlight some research questions on deviant behavior in schools and classrooms. One example of a general research question would be ‘How high is the prevalence of deviant behavior in classrooms?’ To answer such a question, researchers could collect data on deviant behavior using a survey among students or teachers – and then aggregate the individual answers to the classroom level. An example of a more specific research question is whether a high proportion of students from low-status backgrounds increases deviant behavior of students from high-status backgrounds in the same class. Such a research question focuses on a (causal) relation between variables. Survey research must meet the challenge of providing data that can be used to assess such causal claims – and
this has important implications for research design. As such, we first distinguish between different types of research questions before examining different designs.
Exploration The goal of exploratory survey research is a basic and preliminary understanding of a social phenomenon. For example, researchers might wish to learn what kinds of classroom behaviors are considered deviant, and when these occur. Thus, exploratory survey research helps to clarify the characteristics of a social phenomenon. To extend the example, we might wish to get an idea about whether verbal or non-verbal violence, physical or psychological violence, or violence against persons or property are seen as ‘deviant behavior’. Exploratory surveys often use convenience samples with arbitrarily selected respondents, as they are not intended to be representative of a given population. Rather, exploratory studies are pre-tests and pilots that help researchers design sensible questions for use in later representative surveys.
Description Description means that research focuses on ‘how’, but not ‘why’. Descriptive findings are reported as relative or absolute frequencies, or averages of one or more variables. Descriptive analysis can also include contingency tables and correlation analyses. The questions ‘How high is the prevalence of deviant behavior in the classroom?’ and ‘Is there a lower prevalence of deviant behavior among older students?’ are strictly descriptive. They do not suggest causality, or why there is more deviant behavior among certain groups. To answer descriptive questions in an informative way, it is crucial that the population to which the description refers is well-defined. Furthermore, descriptive research generally requires a random sample that can
be generalized to the population of interest (e.g. to estimate the prevalence rate of deviant behavior in all classrooms). The precision of the descriptive point estimates is determined – apart from the variation of the variables under study – by the quality of the random sample (e.g. number of observations, whether the sample is stratified or clustered). Moreover, it is crucial to consider the mode of aggregation for individual survey data. One might calculate the ‘average’ prevalence rate in various ways: aggregating to the classroom level and then calculating the ‘average of averages’ (thus representative of the average classroom), summing up individual answers among all respondents (thus representative of the average across individuals), etc. One might also wish to specifically examine a given subgroup, for example students aged 13 to 17 in public schools in a given city or over a given time span. Given the research question, population, and time frame of interest, a sample of students is drawn from the target population. Survey questions reflecting the topic of interest are then administered, such as ‘Did you bully a fellow student in your class?’ or ‘Have you been physically attacked in your class?’. With respect to time frame, in the simplest case, descriptive findings are obtained for one point in time only. However, one might also examine changes over time, in which case longitudinal data must be collected.
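To make the two modes of aggregation concrete, the following small Python sketch (with invented numbers, not data from any actual study) contrasts the mean of classroom means with the pooled mean across all respondents.

import pandas as pd

# Hypothetical answers: 1 = reports deviant behavior, 0 = does not.
df = pd.DataFrame({
    "classroom": ["A"] * 10 + ["B"] * 30,
    "deviant": [1] * 5 + [0] * 5 + [1] * 3 + [0] * 27,
})

# (a) aggregate to the classroom level first: the 'average of averages'
mean_of_class_means = df.groupby("classroom")["deviant"].mean().mean()
# (b) pool all individual answers: the average across individuals
pooled_mean = df["deviant"].mean()

print(mean_of_class_means)  # 0.30 -> representative of the average classroom
print(pooled_mean)          # 0.20 -> representative of the average student

The two figures differ because classroom A is much smaller than classroom B; which of the two is the estimate of interest depends on the research question.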
Explanation Explanatory research questions refer to causal mechanisms and are the most demanding with respect to research design and data structure. Researchers must have a comprehensive theoretical understanding of the social process under study. In the field of deviant behavior in the classroom, one might investigate Hirschi’s theory of social control, asking ‘Does the involvement of students into pro-social activities reduce their risk of delinquency?’ (Hirschi 1969). Conducting a
survey to address this question would again require reliable and valid questions for all relevant concepts (involvement in pro-social activities, risk of delinquency). As already mentioned, the operationalization of concepts is not trivial, particularly if surveys include questions on sensitive behavior. To address an explanatory research question, researchers have to develop their research design to test the hypothesized causal mechanism. A limiting aspect of using a survey design is that data collection follows an ex-post-facto format. Survey researchers ask respondents about their (past and planned) behaviors and attitudes. By doing so, survey researchers usually do not control any stimuli as experimental researchers do. The challenge of making sense of survey data is that researchers have to handle potentially spurious or suppressed correlations between the variables that measure the theoretical concepts. For example, an observed correlation between pro-social activities and delinquency does not provide an answer to the explanatory research question as long as different constellations for further variables determining the correlation (for example the type of school) cannot be ruled out. This is a fundamental problem when using observational (in contrast to experimental) data. Consequently, survey research often cannot isolate ‘causes’ and ‘effects’, because not all potential covariates can be included and problems such as omitted variable bias, selection bias, and endogeneity might occur. These problems should be carefully addressed in designing the survey study and in choosing appropriate statistical models for data analysis.
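The problem of spurious correlation can be illustrated with a small simulation; the snippet below is only a schematic sketch (all variable names and effect sizes are invented) in which school type drives both pro-social involvement and delinquency, so that the two appear related even though neither affects the other.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 5000
school_type = rng.integers(0, 2, n)                   # confounder (two school types)
involvement = 2 * school_type + rng.normal(size=n)    # driven by school type only
delinquency = -3 * school_type + rng.normal(size=n)   # driven by school type only
df = pd.DataFrame({"involvement": involvement,
                   "delinquency": delinquency,
                   "school_type": school_type})

naive = smf.ols("delinquency ~ involvement", data=df).fit()
adjusted = smf.ols("delinquency ~ involvement + school_type", data=df).fit()
print(naive.params["involvement"])     # clearly negative: a spurious 'effect'
print(adjusted.params["involvement"])  # close to zero once school type is held constant

With real observational survey data the difficulty is, of course, that the relevant confounders are not always known or measured.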
TYPES OF RESEARCH DESIGNS There are diverse potential research designs. As outlined above, different research questions have different data demands. That is, data collection needs to take into account the research questions which the data will be
used to answer. In this section of the chapter we portray various research designs along several dimensions. Our goal is to provide a systematic overview of design considerations that can be used for planning a new data collection project or for finding existing data that may be suitable to answer a given research question. The main dimensions are: measurement units, that is, how units of observation relate to the units of analysis; the experimental dimension, that is, whether a research design contains manipulation and randomization or not; and the temporal dimension, that is, whether and how research units are observed over time. In addition, we will also briefly touch on a number of further design aspects that are difficult to subsume under the three mentioned dimensions.
Units of Analysis and Units of Observation Typically, the units of observation in survey research are individuals. At a fundamental level, this is what we are restricted to in survey research as survey research is about obtaining self-reported information from humans. The units of observation, however, may not be equal to the units of analysis for a given research question. We can distinguish at least four ways in which units of analysis may deviate from units of observation. First, the units of analysis may be at the aggregate level, that is, a unit of analysis may contain multiple observations. For example, we may be interested in how rates of youth delinquency vary across cities. In this case our units of analysis are cities and the research design could collect data on delinquent behavior from samples of pupils from different cities, then aggregating the data within cities (or using multilevel methods for data analysis; e.g. Snijders and Bosker 2012). Second, the units of analysis might be at a lower conceptual level than the individual. For example, we might be interested in how situational factors affect delinquency. In this
case, our units of analysis would be single acts of delinquent behavior and data might be obtained by asking a sample of students about all incidents of delinquent behavior across the past year and the situations in which they occurred. Similarly, if we are interested in factors that affect the duration of unemployment spells or the likelihood of divorce, the units of analysis are single episodes (spells) of unemployment or marriage. A single individual could have multiple spells of unemployment or marriage. Third, the units of analysis can consist of social interactions or ties between individual actors in a social network. Again we would obtain data from individuals, for example, in the context of collecting data on the social network in a school, but then use information on interactions such as romantic relationships for analysis. In the simplest case, the units of analysis are dyadic relationships between the actors. In more complicated cases, whole network structures could be regarded as the units of analysis. Fourth, units of analysis might be social actors other than individuals (e.g. organizations). For example, we might be interested in how schools deal with delinquent behavior. In this case, the units of analysis are schools and information (e.g. on a school’s measures against delinquent behavior among students) would be collected by surveying one or more representatives of the school (e.g. teachers or the principal). The distinction between units of analysis and units of observation or measurement is important because sometimes it is not immediately evident what the relevant units of analysis are. This may lead survey research to be biased towards treating the individual as the natural unit of analysis. Furthermore, data collection in surveys may need to be suitably adjusted if the unit of analysis is not the individual. When designing a research project it is therefore essential that the researchers clearly identify the units of analysis and the population to which results should be generalized, carefully considering how the required
data can be obtained such that measurement is meaningful, reliable, and valid. This includes identifying the appropriate respondents (e.g. students vs. teachers), asking the right questions, and the timing of interviews. Furthermore, depending on the research questions it might be sensible to augment the survey data with additional information from other sources (e.g. type of school and number of pupils per school from administrative records). Finally, transparency about the units of analysis is also important for evaluating whether results might be jeopardized by ecological fallacy (invalid conclusions about lower level units based on data from upper level units; see Robinson 1950) or reductionism (overly simplistic description of upper level units based on data from lower level units; see, e.g., Chapter 11 in Nagel 1961).
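As a purely hypothetical illustration of the ecological fallacy (the numbers below are simulated, not taken from any study), the association among classroom averages can point in the opposite direction from the association among individual students:

import numpy as np

rng = np.random.default_rng(1)
classrooms = []
for class_mean in [0.0, 1.0, 2.0, 3.0]:              # four hypothetical classrooms
    x = class_mean + rng.normal(scale=0.3, size=50)
    # within each classroom, y decreases with x ...
    y = class_mean - 0.8 * (x - class_mean) + rng.normal(scale=0.3, size=50)
    classrooms.append((x, y))

within = [np.corrcoef(x, y)[0, 1] for x, y in classrooms]
x_means = [x.mean() for x, _ in classrooms]
y_means = [y.mean() for _, y in classrooms]

print(np.mean(within))                      # negative: the individual-level association
print(np.corrcoef(x_means, y_means)[0, 1])  # close to +1: the classroom-level association

Concluding from the positive classroom-level correlation that the individual-level association is also positive would be exactly the kind of invalid inference Robinson (1950) warned against.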
Experimental and Non-Experimental Designs With respect to the goal of explanation, an important distinction is whether a research design is experimental. Rosenbaum (2002) distinguishes between experiments and observational studies. In experiments the researcher controls assignment to the treatment, while in observational studies, the researcher cannot (fully) control treatment. Both types of studies, however, are about ‘treatments, interventions, or policies and the effects they cause’ (Rosenbaum 2002: 1), that is, the goal of both types of studies is explanation. We may thus add a third type of study, called descriptive, whose goals lie in description or exploration. Naturally, a single study may contain experimental, observational, and descriptive elements. There is a close link between the research question and whether an experimental design is necessary and possible. Ideally, explanatory studies would always use experimental data, but this is seldom feasible. Observational studies as defined above therefore provide an important set of research designs to inform
explanatory research, even if not without qualification. However, if research interests are only descriptive or exploratory, an experimental design is not required. Experimental design must meet the following conditions: (1) there are at least two experimental groups; (2) the researcher controls the (differential) treatment of the experimental groups; and (3) assignment of subjects to the experimental groups is randomized. Stated differently, in an experiment there are at least two treatments (e.g. presence or absence of a specific treatment) and treatment is randomly assigned. Random assignment guarantees that systematic differences in outcomes between the experimental groups can be interpreted as the (average) causal effects of the difference in treatment. The most basic experimental design involves a treatment group and a control group and can be depicted as follows:
R   X   O   treatment group
R       O   control group
There is a treatment group in which the treatment is present (X) and a control group without treatment. O stands for observation, R for randomization. The distinction between treatment group and control group is just a matter of labeling. In fact, both groups are treatment groups because the absence of treatment X is just another treatment. Hence, a more general experimental design is:
R   X1   O   treatment group 1
R   X2   O   treatment group 2
…   …    …   …
R   XK   O   treatment group K
That is, there are K experimental groups and K levels of the experimental factor. In practice, experimental designs might be more complicated and involve multiple factors, pre-treatment and post-treatment measurements (so-called within-subject designs), and stratified randomization (see, e.g., Alferes 2012). But the basic principle, identification
of causal effects through manipulation and randomization, remains the same. As an example consider the evaluation of a youth delinquency prevention program. We could randomly divide a sample of schools into a treatment group, in which the prevention program will be administered to all students, and a control group without the program. We could then conduct a survey among students about delinquent behavior after students participated in the program (or obtain administrative data from local juvenile delinquency records). A systematic difference in delinquent behavior between the two groups could then be attributed to the treatment, that is, whether the school participated in the prevention program or not. Of course, if juvenile delinquency is measured through a survey, the program’s effect could just be that self-reporting of delinquent behavior by students changed, but not the behavior per se. Observational or quasi-experimental designs cover a broad range of approaches that are used when researchers wish to investigate causal questions, but experimental data is not available. The term quasi-experimental refers to studies in which there is manipulation of the treatment, but randomization cannot be realized or can be realized only partially. For example, for the evaluation of a youth delinquency prevention program, schools might choose whether they want to take part in the program or not, that is, schools might self-select into the program. The result of such a study might be that delinquency rates do not depend on participation in the program, leading to the conclusion that the program is useless. However, it might be that schools that are affected by high delinquency rates, in general, self-select into the program more often, which would mask the effect of the program. To deal with such selection effects, quasi-experimental designs often employ pre-treatment and post-treatment measurement for both the treatment group and the control group, leading to a difference-in-difference design (which yields valid conclusions about causal effects as long as
self-selection is based only on pre-treatment outcome levels, but not on outcome trends, and treatment effects are homogeneous or self-selection is not based on anticipated treatment effects). The difference-in-difference design is a special case of an interrupted time-series design with a control group and just two time points of observation. More credible inference may be possible if longer time series are observed. Note that an interrupted time-series design can also be informative without a control group, especially if there are repeated treatments. Furthermore, observational studies can also be based on an ex-post-facto design – probably the most common approach in survey research – in which there is no manipulation of the treatment and all variables are measured after the fact. In such studies, one identifies causal effects by controlling for all confounding factors using statistical techniques such as multiple regression analysis or matching. In essence, the goal in such studies is to compare only the treated and untreated who are alike with respect to all other variables that might affect the outcome and therefore bias the relation between treatment and outcome. Strong theory specifying potential confounders and detailed causal mechanisms is needed for such studies to be credible. Furthermore, high-quality measurements of all relevant confounders are required. An example of an ex-post-facto observational study would be an estimate of the effects of parents’ socio-economic status (SES) on children’s delinquency, based on a survey in which interviews of students on their delinquent behavior and interviews of parents regarding SES are combined. A relation between parents’ SES and children’s delinquency may come about by different mechanisms, not all of which include a causal link between the two variables. For example, it might be that high SES families live in different neighborhoods and it is the neighborhood characteristics that determine delinquency, not the family’s SES. In such a study it may therefore be sensible to control for neighborhood characteristics. Other types
of observational studies are natural experiments, instrumental-variable designs, and regression-discontinuity designs, in which there is a specific treatment-assignment variable that leads to a situation that is as-good-as randomized. For example, allocation of students to different schools might occur by lottery or by eligibility cut-offs on test scores. In such a case, effects of school characteristics on delinquent behavior could be explored by comparing students who were chosen by the lottery to those who were not chosen or by comparing students who were just above the eligibility cut-off to those who were just below the cut-off. Finally, especially in epidemiological research, case-control studies are a frequently employed observational design. In a case-control design there is a sample of respondents for whom a specific – and often rare – outcome was observed. For example, think of a sample of adolescents who were caught committing a violent crime. Researchers would then collect data from a different sample on a suitable control group of respondents who did not commit the crime but are otherwise comparable (same age and sex distribution, same grade distribution, etc.). The two groups are then compared with respect to their exposure to different factors that are expected to affect delinquent behavior (e.g. exposure to violence in the family). Descriptive designs are designs that are neither experimental, observational, nor quasi-experimental. The primary goal of studies based on descriptive designs is to give an account of the state of social reality at a specific point in time or its development over time. This does not necessarily mean that description only focuses on marginal distributions of characteristics (e.g. the rate of youth delinquency at a certain point in time). Conditional distributions and bivariate or even multivariate associations among variables (e.g. the association between social status, gender, and youth delinquency) can also be part of descriptive analysis. The primary difference to experimental or observational studies is that no
claims about causality are made. While the data requirements of experiments and other studies are distinct, descriptive and observational studies are often based on the same data sources. A common situation is that statistical offices collect survey data or administrative data for descriptive purposes, but then academic scientists use the data for causal analysis. A relevant distinction with respect to the evaluation of experimental studies is between internal and external validity. Internal validity is given if it is possible to interpret results, within the given sample and situation, as causal. Internal validity may be violated when problems such as incomplete randomization or placebo effects arise. External validity is achieved if results can be generalized from the sample to a broader population or from the specific experimental situation to a class of related situations. Traditionally, the focus of experimental and observational studies has been on internal validity since the detection of causal mechanisms is the primary goal. For descriptive studies, in contrast, external validity in the sense of generalizability from the sample to the population is of greater concern. As the primary goal of such studies is an accurate description of a population, the studied sample should be representative of the population, and the reduction of undercoverage and non-response is an important challenge. External validity can also be a problem for experimental designs if causal effects are heterogeneous (i.e. not the same for all population members), if there are general equilibrium effects (i.e. if causal effects change in response to the prevalence of the treatment), or if experimental situations are highly artificial with only weak links to real-life situations. At least with respect to the first aspect, heterogeneous causal effects, the inclusion of experimental designs in population surveys can be a good supplement to laboratory and field experiments, as this allows the estimation of average causal effects in the entire population and facilitates the analysis of effect heterogeneity across subpopulations
(see ‘Recent developments and examples’ below).
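As a back-of-the-envelope illustration of the difference-in-difference logic described above (all rates are invented), consider a prevention program into which schools self-select:

import pandas as pd

# Mean delinquency rates before and after the program (hypothetical numbers).
rates = pd.DataFrame({
    "group": ["program schools", "other schools"],
    "pre":  [0.30, 0.15],   # program schools start off worse (self-selection)
    "post": [0.24, 0.14],
})
rates["change"] = rates["post"] - rates["pre"]

# DiD: change among program schools minus change among other schools.
did = rates.loc[0, "change"] - rates.loc[1, "change"]
print(did)  # -0.05: a five-percentage-point larger drop in the program schools

As noted above, this comparison is only credible if, in the absence of the program, both groups of schools would have followed parallel trends.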
The Temporal Dimension With respect to the temporal dimension we may distinguish between cross-sectional and longitudinal designs. Furthermore, we may distinguish between retrospective and prospective studies. A cross-sectional design is one in which a sample of respondents is surveyed only once at a specific point in time. Cross-sectional studies are typically based on a representative sample so that descriptive statements about a clearly defined population at the time of the survey are possible. For example, to estimate the prevalence of delinquent behavior among 9th graders in a specific year in a specific country, a random sample of 9th-grade classes could be drawn from the country’s list of classes and then, in each selected class, all students could be interviewed. Using such a sample, one can, of course, also calculate descriptive statistics for subgroups, for example, investigating how behavior differs by gender, family background, or school district. Furthermore, such analyses can be used to explore potential causal relationships as well. For example, we could analyze how delinquent behavior varies by parents’ education, grades, and the composition of the class. However, the explanatory power of cross-sectional studies is rather limited because all variables are measured at the same point in time, leaving the temporal order of factors unspecified. For example, if a (negative) association between grades and delinquency is found, it remains unclear whether delinquency increased due to bad grades or whether grades deteriorated due to delinquency. The problem can be addressed in part by including retrospective questions in a cross-sectional survey, that is, questions about past behaviors and characteristics. For example, respondents could be asked about their grades and
delinquent behavior in past years, so that the timing of changes in the variables could be analyzed. However, retrospective questions often yield data that is neither valid nor reliable due to various well-documented sources of recall error (selective memory, telescope effects, reduction of cognitive dissonance, etc.; see Part VII in this Handbook for concerns about measurement quality; for a basic understanding of recall error see Tourangeau et al. 2000). Longitudinal designs include trend designs, panel designs, and cohort studies. A trend design comprises a series of cross-sectional studies in which the same questions are administered to independent samples from a population at different points in time (repeated cross-sections). That is, in each repetition of the survey, different respondents are interviewed. For a trend design to be viable it is important that the questions in the different waves of the survey are as comparable as possible, the definition of the population remains the same, and the properties of the sample (such as the degree of under-coverage) do not change. Furthermore, due to mode effects, it may also be important that the interviews are carried out in the same way across waves. Trend designs are particularly well-suited for the descriptive goal of monitoring changes in a population over time. For example, to study how the overall prevalence of delinquency or how the relation between delinquency and gender, family background, or school district changes over time, we could repeat our survey of 9th graders each year using a new sample of 9th-grade classes. Repeated cross-sections are useful for understanding aggregate or macro trends. However, at the micro or individual level, explanatory or causal analyses are more difficult, because, as with a standard cross-sectional study, information for each individual is only collected at one point in time. Yet, trend designs can be very useful for studying effects of policy changes or other ‘shocks’ at the macro level, especially if a policy change or shock only applies to specific parts of the
population. At the level of the subpopulations we then have an interrupted time-series design and can employ various methods such as difference-in-difference estimation to identify the effects of the policy change or shock. A special case of a trend design is a so-called rolling cross-section in which the interviews of a cross-sectional sample are randomly distributed across a specified period of time so that the collected data represents the population day-by-day or week-by-week. Such a design is sometimes used to analyze campaign effects in election studies or, in general, to study media coverage effects on opinions in the population (see Johnston and Brady 2002). Trend designs do not allow the study of changes in variables at the individual level (except through retrospective questions) and are thus of limited value in causal analyses. In addition, it is difficult to disentangle cohort, period, and age effects based on data from cross-sections or trend designs. Panel designs are better suited to address these issues, as they observe the same individuals over time. In a panel design, a sample of respondents is interviewed repeatedly, typically on a yearly or biannual basis. Panel studies are prospective insofar as after the initial interview they follow respondents into the future. Many countries, for instance, have household panels such as the Socio-Economic Panel in Germany (www.diw.de/soep). The German Socio-Economic Panel started in 1984 with a sample of around 6,000 households. Subsequently these households have been interviewed on an annual basis, augmented by refreshment and extension samples. Data from a panel study allow the analysis of age effects, as the development of variables over age can be observed for each individual. They also allow the analysis of cohort effects by comparing age trajectories of respondents born in different years. Finally, period effects can be studied based on changes that are observed at specific times across all age groups. Data from panel studies are also well-suited for explanatory analyses
as causes and effects are observed over time (with a given temporal resolution of typically one year) without having to resort to retrospective questions, and unobserved (time-persistent) heterogeneity can be controlled for by incorporating individual fixed-effects into the analyses (Allison 2009). Note that most yearly panel studies also use a monthly event calendar based on retrospective questions to obtain a more fine-grained time resolution (e.g. with regard to changes in labor force participation). Providing accurate data for descriptive analyses across time, however, can be a challenge. Although panel studies usually start with a representative sample, the sample tends to become biased over time. One problem is panel attrition, the fact that with each wave some respondents are lost due to refusal of further participation in the study. Weighting schemes are used to correct for this loss of respondents, but it is not always clear whether all relevant information has been included in the weighting models to successfully counterbalance attrition bias (the accuracy of a given correction may strongly depend on the topic). A second problem is that the composition of the population changes over time due to births and deaths, migration, and, in the case of household panels, the dissolution and formation of households. To take into account the changing population, panels must refresh or extend the sample and follow individuals who leave the original sample households. One final problem is that repeated questioning can lead to panel conditioning. For example, a respondent’s opinions on certain topics might change because the respondent was already interviewed on these topics in prior years. To deal with the trade-off between analytic power and descriptive accuracy, some studies employ a hybrid design that combines elements of trend and panel designs. One example is a split panel design in which a regular panel survey is complemented by yearly cross-sections. Comparing results from the panel sample with the results from
the cross-sectional sample provides information about biases that accumulated over the years in the panel sample. Another example is a rotating panel design in which a fraction of the sample is replaced by a new representative sample in each wave. For example, each respondent may stay in the panel for five waves, so that in each wave 20 percent of the sample is replaced. An advantage of rotating panels is that the accumulation of sample bias is less pronounced than in a regular panel and, at the same time, panel effects can be studied by comparing data from the different panel cohorts. On the other hand, longitudinal analysis is limited because respondents only remain in the panel for a short time and age trajectories are incomplete. A final design is the cohort study. In a cohort study, the goal is to study specific cohorts, defined as groups of people who experienced a certain event at the same time (e.g. birth cohorts, study cohorts, marriage cohorts, etc.). A cohort study can be retrospective or prospective. In a retrospective cohort study, respondents from one or several cohorts are interviewed some time after the cohort-defining event occurred, and data on the development between the event and the time of interview are collected using retrospective questions (a retrospective cohort study is thus a special case of a cross-sectional study). Life-history studies usually employ such a design. The main advantage of retrospective cohort studies is that data covering long periods can be collected in a short period of time; however, depending on the topic, answers to retrospective questions can be unreliable. In a prospective cohort study (often employed in epidemiological research) a sample of cohort members is followed from the cohort-defining event into the future. A prospective cohort study is thus a special case of a panel study. An example of a prospective cohort study is the National Educational Panel Study in Germany (www.neps-data.de) in which several cohorts of children (early childhood, Kindergarten, 5th grade, 9th grade), a cohort of first-year students, and a cohort of adults
have been interviewed on a yearly basis since 2010.
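The following minimal sketch (a fabricated three-wave panel, not SOEP or NEPS data) shows the within transformation that underlies the individual fixed-effects models mentioned above: demeaning each variable within persons removes all stable person characteristics. In practice one would rely on a dedicated panel estimator that also adjusts the standard errors.

import pandas as pd
import statsmodels.formula.api as smf

panel = pd.DataFrame({
    "person": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "wave": [1, 2, 3] * 3,
    "grades": [2.0, 2.5, 3.0, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0],
    "delinquency": [5, 6, 7, 1, 2, 3, 8, 9, 10],   # invented scores
})

# Demean within persons (the 'within' or fixed-effects transformation).
for var in ["grades", "delinquency"]:
    panel[var + "_dm"] = panel[var] - panel.groupby("person")[var].transform("mean")

fe = smf.ols("delinquency_dm ~ grades_dm - 1", data=panel).fit()
print(fe.params["grades_dm"])  # within-person association, net of stable person traits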
Further Design Aspects Depending on context, various other aspects of the implementation of a survey can be relevant for the validity of the collected data and the types of questions that can be answered. For example, a survey on delinquent behavior might include questions of different types and degrees of sensitivity (Tourangeau and Yan 2007). Therefore, a good research design requires careful choice of the survey mode and fieldwork setting (see Chapters 11 and 17 in this Handbook) and, possibly, the application of special techniques for asking sensitive questions (Jann 2015). Furthermore, the sample design (see Part IV in this Handbook) is of utmost importance. For descriptive research questions, a sample that provides an unbiased representation of the defined population is required. This rules out many commonly used sample designs such as quota samples, in which the selection probabilities of the population members are not known. A sample is suitable for descriptive purposes if each population member has a strictly positive selection probability and the selection probabilities (and the dependencies among them) are known for all population members. A significant source of bias in this respect is unit non-response (see Chapters 29 and 38 in this Handbook). For descriptive studies, therefore, it is important to choose a design that mitigates the effects of non-response as far as possible (see Chapter 30 in this Handbook). In the social sciences many research questions focus on how context affects individual behaviors and opinions. For studying context effects it can be very effective to use a research design that is tailored to the relevant geographic or social units. For example, if school class context is suspected to influence youth delinquency, it may be sensible to
use a research design in which the sample is clustered by classes, with data collected from all students in each selected class, as well as, potentially, the teachers. A second example is a study in which school children from different cities are sampled because certain policy changes for delinquency prevention have been implemented in some of the cities but not in others, providing an opportunity to study policy effects in a quasi-experimental setting. In general, research designs that employ hierarchically structured samples and incorporate data from units at different levels are called multilevel designs. This includes cross-country studies based on internationally comparable data collections such as the European Social Survey (www.europeansocialsurvey.org/). Finally, the study of social networks and special (hard-to-reach) populations usually requires specific research designs. Studies on social networks are difficult to implement with standard surveys because independent sampling of respondents is not well-suited for recovering relational structures among population members. Social network studies therefore often focus on a clearly defined social group and administer a full census within that group (common are, for example, studies in which all students in a given school are asked about their relations to each other), or they use some form of adaptive sampling in which social networks are recovered by branching out from an initial set of random seeds. Similarly, for hard-to-reach populations standard sampling techniques are not feasible, so that alternative research designs such as Respondent Driven Sampling (see Chapter 22 in this Handbook) have to be used.
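To connect back to the requirement of known selection probabilities stated earlier in this section, the following toy computation (made-up figures) shows how each respondent can be weighted by the inverse of his or her selection probability, in the spirit of a Horvitz-Thompson-type estimate.

import numpy as np

# Two strata sampled at very different rates, e.g. small vs. large schools.
y = np.array([1, 0, 1, 1, 0, 0, 0, 1])                        # reported delinquency (0/1)
p_sel = np.array([0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1])    # known selection probabilities
w = 1.0 / p_sel                                               # design weights

print(y.mean())                    # 0.50, the unweighted sample mean
print(np.sum(w * y) / np.sum(w))   # 0.33, the weighted estimate of the population prevalence

The weighted figure is the one that generalizes to the population, because respondents from the undersampled stratum receive correspondingly more weight.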
RECENT DEVELOPMENTS AND EXAMPLES There are several recent developments that have increased the value of survey data. First
is the integration of experimental designs into surveys in order to enable causal conclusions. Second is the collection of paradata (which provide information about the process of data collection). Third, survey data can be linked to population data from administrative sources, such as tax records, making it possible to address a wider range of research questions. Fourth, standardized international surveys, such as the ESS (European Social Survey), allow researchers to answer international comparative research questions. We will briefly discuss these developments in this section and also provide some examples of how survey data can be used to address explanatory research questions, for instance in connection with natural experiments.
Survey Experiments In survey experiments, the key idea of conducting experiments (randomized assignment of controlled stimuli) is implemented into a survey format. There are several forms of such experimental designs in surveys. Research on survey methodology uses experimental set-ups to study a variety of effects: priming, anchoring, order effects, etc. These studies are implemented as split-ballot experiments where a randomly selected group of respondents receives a given treatment (e.g. a questionnaire with a certain modified order of questions) while the control group receives a baseline questionnaire (Fowler Jr. 2004). The comparison of the results between the groups suggests whether method effects exist. Typically, such experiments vary only one factor, such as the order of survey questions (Sudman and Bradburn 1974; McFarland 1981), and are rarely used to address substantive research questions (an exception is Beyer and Krumpal 2010). One way to focus on substantive questions by means of a survey experiment is to use a factorial survey. Survey participants rate descriptions of hypothetical objects or
situations (vignettes). Within these descriptions, factors (attributes) are experimentally varied in their levels (Jasso 2006; Auspurg and Hinz 2015). The idea of factorial surveys was developed originally by Peter H. Rossi (1979) as a technique to assess judgment principles underlying social norms, attitudes, and definitions. Normative judgments, definitions, and subjective evaluations of situations (such as conflicts in the classroom) require many characteristics to be integrated into one coherent judgment. In the classroom example, researchers might wish to learn about the social situation in which students evaluate norm-violating behavior as appropriate. What factors are relevant for the perception of a ‘legitimized’ use of norm-violating behavior? Is the number of bystanders important? What kind of action (e.g. verbal attack) legitimizes what kind of reaction? Is there a social consensus among students or do judgment principles diverge between groups of students (e.g. migration background, experience with violence in family context)? Researchers construct the stimuli (vignettes) according to an experimental design plan – with the number of dimensions and the number of levels as crucial parameters. In factorial survey experiments, each respondent is asked to rate one or several of these descriptions. If respondents evaluate more than one vignette, researchers have to consider the hierarchical data structure in analysis (Hox et al. 1991). As a recent review demonstrates (Wallander 2009), factorial survey experiments are frequently used in survey research to study causal theories. One evident restriction of this method is that researchers ask questions about hypothetical situations. However, this critique applies to many survey questions. In contrast to other formats, factorial survey experiments seem to be less prone to social desirability bias (Auspurg et al. 2015). Similar to the factorial survey design, choice experiments are another format that implements experiments in surveys. While factorial survey experiments ask respondents
to evaluate a hypothetical situation, choice experiments present – as the name suggests – various hypothetical ‘choices’. Choice experiments are widely used in economics when researchers wish to study the preferred mode of transportation (Hensher 1994; Louviere et al. 2000), medical treatment (Ryan 2004), or different scenarios in environmental research (Liebe 2015). The goal is to estimate the utility of different choice options and their attributes. Often, choice experiments are applied when researchers focus on preferences for options that do not yet exist or on the willingness to pay for non-market products for which no price information or only subsidized prices are available. Respondents are simultaneously confronted with several descriptions of options (choice alternatives) which are jointly presented within a single choice set. All options are composed of the same attributes which are experimentally varied in their levels. Respondents are asked which of the options they would prefer most and are confronted with several choice sets. Data analysis of choice experiments is more complicated than data analysis of factorial survey experiments because each single choice must be evaluated within a given choice set. Generally, conditional logit models are used for this task (Hensher et al. 2005). Another way to combine experiments and surveys is to integrate game-theoretical set-ups in surveys. A situation motivated by game-theoretical reasoning is presented, and then respondents are asked for a decision. For instance, an application of the trust game has been successfully implemented in a nationwide survey (Fehr et al. 2002; Naef and Schupp 2009). A key limitation is that a survey is not interactive, in contrast to the standard experimental approach where players react to the decisions of other (real) players. As such, the survey approach is limited to describing an ‘imagined’ co-player in the explanation of the situation. In psychology, some scholars propose to conduct experiments with subjects online (Reips 2002) – often integrated into web survey studies. In principle, this
enables the researchers to implement interactive games into surveys with more than two players, as in collective good games. Furthermore, the ‘strategy method’, in which potential conditional choices of subjects for each possible information set are recorded, can be employed in a survey setting (Brandts and Charness 2011). Survey studies are rarely used for game-theoretical experiments. However, survey-experiment integration does have significant advantages. Typically survey respondents are a random sample of the total population while lab studies use convenience samples. As such, using surveys to administer experiments increases the external validity of experiments (Auspurg and Hinz 2015). In all three forms of combining experiments with surveys, researchers must thoroughly plan the experimental design. With multi-factorial designs (factorial survey and choice experiments) researchers must consider all expected interaction terms of their factors before implementing the survey. Moreover, the efficiency of the design (orthogonality of experimental factors with a maximum of variance) determines the power to identify causal impacts. In addition, experimental stimuli must be randomized across respondents.
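A minimal sketch of how a factorial survey might be set up is given below; the dimensions and levels are invented for the classroom example and, unlike a carefully constructed D-efficient design, the snippet simply uses the full vignette universe.

import itertools
import random

dimensions = {
    "provocation": ["verbal insult", "physical attack"],
    "bystanders": ["none", "a few", "many"],
    "reaction": ["verbal response", "pushing", "hitting"],
}

# Full factorial universe: 2 x 3 x 3 = 18 vignettes.
universe = [dict(zip(dimensions, combo))
            for combo in itertools.product(*dimensions.values())]

random.seed(7)
random.shuffle(universe)

# Split the universe into decks of six vignettes; each respondent rates one deck,
# and decks are allocated to respondents at random.
decks = [universe[i:i + 6] for i in range(0, len(universe), 6)]
print(len(decks), "decks of", len(decks[0]), "vignettes each")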
Paradata Paradata contain all data concerning the process of data collection: when respondents reacted to the survey invitation, how many times an interviewer tried to make contact in order to realize the interview, how long it took to complete the questionnaire, whether a third person was present during the interview, and so on. In interview situations (face-to-face, CATI, CAPI), some paradata stem from evaluations by the interviewers, but an increasing amount of paradata is collected during the technical process of conducting interviews (number of calls needed to reach a respondent in CATI studies, response times in CATI/CAPI
studies). Due to the increasing use of web surveys, much more paradata for self-administered surveys have become available and can now be subjected to further analysis. There are three possible ways to make use of paradata. First, they can help to monitor and manage data collection (Kreuter 2013). Second, paradata are interesting sources of information in survey research when data quality issues are discussed. The number of interviews per interviewer in a face-to-face survey could be seen as critical because interviewer effects might bias the results (Kreuter 2010). The number of contacts needed to reach a respondent might be used as a weighting factor to control for hard-to-reach respondents in the samples. The time needed to answer questions can be used to evaluate processes of learning or fatigue of respondents (for an application: Savage and Waldman 2008). Third, paradata might be relevant to measure theoretical concepts as well. Therefore, paradata are interesting with regard to research questions. For instance, the cognitive ability of respondents might correlate with their response times to different survey questions (Sauer et al. 2011). The structure of item non-response might be analyzed as an indicator of cooperativeness (Stocké 2006). While, technically, an immense amount of paradata accompanies modern CATI or CAPI surveys, it is a challenging task to use these data as a reliable source of information. For instance, response times correlate not only with cognitive ability but also with the seriousness of respondents, who differ in their willingness to answer precisely, and with other factors such as the interviewer’s behavior (Cooper and Kreuter 2013).
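Two of the uses of paradata mentioned above can be sketched in a few lines; the data frame below is entirely fictitious.

import pandas as pd

answers = pd.DataFrame({
    "resp_id": [1, 2, 3],
    "q1": [4, None, 5],
    "q2": [3, None, None],
    "q3": [5, 2, 4],
})
# Response-time paradata: mean seconds per item, indexed by respondent.
seconds_per_item = pd.Series([6.2, 1.1, 4.8], index=[1, 2, 3])

item_nonresponse = answers.set_index("resp_id").isna().sum(axis=1)
speeders = seconds_per_item[seconds_per_item < 2.0].index   # threshold chosen arbitrarily

print(item_nonresponse.to_dict())   # {1: 0, 2: 2, 3: 1}
print(list(speeders))               # [2]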
Record Linkage Once survey data is collected, it can be merged with other data sources (Lane 2010; Schnell 2013). It is quite common to link data from surveys with additional information from administrative sources. Often data
collection in survey studies follows a clustered sampling strategy. Thus, additional data on the macro-level clusters (electoral districts, communities, neighborhoods) can be easily merged. Whenever geocoded information is available for the respondents (e.g. for home address, place of work, school, etc.) various contextual data can be matched. Returning to the example of deviant behavior in schools, researchers use the crime rates of surrounding neighborhoods to characterize the social context of schools (Garner and Raudenbush 1991). Researchers studying the labor market will often link individual data from labor market surveys to firm-level data in order to add detailed firm information such as turnover, profit, assets, or number of employees. Data from other sources can be merged with the data of the survey respondents as long as the appropriate identifiers are available. These might be standardized geographic codes or other identifiers such as firm identification numbers. In some cases, researchers link records at the individual level. Again, identifying information is needed to facilitate the match (social security number, names, and birth dates, etc.). Often a combination of several variables is necessary for linkage. In labor market research, for instance, survey data are matched with information from the employment register (Trappmann et al. 2013). Record linkage can be used to validate or augment survey data. For example, in school classes one might validate the students’ answers about their grades by linking to the administrative records of grades. In labor market studies, the precise duration of unemployment spells can be merged from administrative data on unemployment insurance. Provided that data privacy issues can be solved, combining survey data with data from other sources obviously expands the spectrum of possible research questions, such as the impact of social contexts on individual behavior and attitudes. Furthermore, sometimes the inclusion of external data may provide opportunities for quasi-experimental analyses.
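As a schematic illustration of such linkage (all identifiers and figures are fabricated), survey records can be joined to context data via a shared geographic code:

import pandas as pd

survey = pd.DataFrame({
    "resp_id": [101, 102, 103],
    "neighborhood_code": ["N01", "N02", "N01"],
    "delinquency_score": [3, 7, 2],
})
context = pd.DataFrame({
    "neighborhood_code": ["N01", "N02"],
    "crime_rate": [12.4, 30.1],   # e.g. offences per 1,000 inhabitants
})

linked = survey.merge(context, on="neighborhood_code", how="left")
print(linked)

Linkage at the individual level works in the same way, except that the join key is a personal identifier and data protection requirements are correspondingly stricter.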
International Standardization of Surveys Over the last two decades, more and more national survey programs have begun implementing comparable sample designs and identical survey instruments (e.g. Fitzgerald et al. 2013 for the European Social Survey). In educational research, the Programme for International Student Assessment (PISA) study is the most prominent example. PISA is a triennial survey that aims to evaluate educational systems by testing the skills and knowledge of 15-year-old students worldwide. To date, students representing around 70 countries have participated. The availability of this data has motivated significant comparative research on a wide array of topics. A standardized research design is clearly a basic prerequisite of international comparative research. Ideally, researchers from all countries participating in an international survey program are involved in the initial development of the research design. Often researchers who are using the data wish to test hypotheses on institutional differences between groups of countries that can be classified by macro characteristics such as welfare regimes. Such institutional differences between countries or groups of countries are then themselves part of the research question. The standardization of data collection, data storage, and data access (through international data archives) has significantly improved the potential for such cross-national research based on surveys (Gornick et al. 2013).
Innovative Examples for Causal Analysis with Survey Data
Despite substantial limitations, survey data are regularly used to test causal hypotheses (for an overview of causal inference in sociology see Gangl 2010). There are various ways in which researchers can analyze observational data to make causal inferences. One option is to use panel data and control for 'unobserved
heterogeneity' by estimating fixed-effects regressions or difference-in-differences models. Another option is to artificially create treatment and control group situations using specific statistical tools (such as propensity score matching). A third option is to implement experimental set-ups within surveys (with a considerable gain in external validity). And finally, researchers can use additional sources of data that allow comparisons across different social contexts (often an important aspect of research design). We provide a small selection of survey studies that address causal inference. Survey data from different countries can be used to study the causal effects of institutions on behavior. Based on data from the Survey of Income and Program Participation (for the US) and the German Socio-Economic Panel, the effects of unemployment benefits on workers' post-unemployment jobs are examined in a comparative design by Gangl (2004). Natural experiments could be used much more often to draw causal conclusions. A standard approach is to search for changes in legislation that do not affect all respondents but only a (quasi-random) selection of them. In the sociology of education, for example, tuition fees for students in Germany were introduced in different states at different points in time. This variability was used by Helbig et al. (2012) to study how tuition fees affect the intention to enroll in university. With a similar design, Gangl and Ziefle (2015) studied the effect of a recent change in parental leave policy using data from the SOEP survey. Another study using survey data in combination with a natural experiment is an analysis of terrorist events and their impact on attitudes towards immigrants (Legewie 2013). The study exploits the fact that a terror attack (with worldwide media coverage) occurred during the fieldwork period of the European Social Survey. Another example is the study by Kern and Hainmueller (2009), which combines data from historic surveys in the former German Democratic Republic with geographic information on television reception to identify the effect of foreign media exposure on political
opinions. Finally, Beyer and Krumpal (2010) use randomized question ordering to test hypotheses about anti-Semitic attitudes in a survey among adolescents in German schools.
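Many of the natural experiments just described amount to a difference-in-differences comparison between groups affected and unaffected by a change, before and after it occurred. A generic two-period specification (the notation is generic and not taken from any of the cited studies) is

y_{it} = \alpha + \beta\,\mathrm{Treat}_i + \gamma\,\mathrm{Post}_t + \delta\,(\mathrm{Treat}_i \times \mathrm{Post}_t) + \varepsilon_{it},

where Treat_i marks respondents exposed to the change (e.g. students in states that introduced tuition fees), Post_t marks observations collected after the change, and \delta is the difference-in-differences estimate of the causal effect, valid under the usual parallel-trends assumption.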
RECOMMENDED READINGS
Detailed information on research questions and research designs can be found in textbooks and handbooks on social science research methods. Examples are Babbie (2010), Bernard (2013), Miller and Salkind (2002), or Alasuutari et al. (2008); for German-speaking readers we suggest Diekmann (2007). The design of experiments is discussed in Dean and Voss (2008), Alferes (2012), or Montgomery (2013). For the use of experimental techniques in population surveys see Mutz (2011). Classic references for quasi-experiments are Campbell and Stanley (1966) and Cook and Campbell (1979) (see also Shadish et al. 2002). For a modern overview of research designs for causal inference from experimental and observational data see, e.g., Murnane and Willett (2011).
REFERENCES
Alasuutari, P., Bickman, L., and Brannen, J. (eds) (2008). The SAGE Handbook of Social Research Methods. Thousand Oaks, CA: Sage.
Alferes, V. R. (2012). Methods of Randomization in Experimental Design. Thousand Oaks, CA: Sage.
Allison, P. D. (2009). Fixed Effects Regression Models. Thousand Oaks, CA: Sage.
Auspurg, K. and Hinz, Th. (2015). Factorial Survey Experiments. Thousand Oaks, CA: Sage.
Auspurg, K., Hinz, T., Liebig, S., and Sauer, C. (2015). The factorial survey as a method for measuring sensitive issues. In U. Engel, B. Jann, P. Lynn, A. Scherpenzeel, and P. Sturgis (eds), Improving Survey Methods: Lessons from Recent Research (pp. 137–149). New York: Routledge/Taylor & Francis.
Babbie, E. (2010). The Practice of Social Research, 12th edn. Belmont, CA: Wadsworth.
Bernard, H. R. (2013). Social Research Methods: Qualitative and Quantitative Approaches, 2nd edn. Thousand Oaks, CA: Sage.
Beyer, H. and Krumpal, I. (2010). 'Aber es gibt keine Antisemiten mehr': Eine experimentelle Studie zur Kommunikationslatenz antisemitischer Einstellungen ['But there are no longer any anti-Semites': An experimental study on the communication latency of anti-Semitic attitudes]. Kölner Zeitschrift für Soziologie und Sozialpsychologie 62(4), 681–705.
Brandts, J. and Charness, G. (2011). The strategy versus the direct-response method: a first survey of experimental comparisons. Experimental Economics 14(3), 375–398.
Campbell, D. T. and Stanley, J. C. (1966). Experimental and Quasi-Experimental Designs for Research. Dallas, TX: Houghton Mifflin.
Cook, T. D. and Campbell, D. T. (1979). Quasi-Experimentation: Design & Analysis Issues for Field Settings. Boston, MA: Houghton Mifflin.
Cooper, M. and Kreuter, F. (2013). Using paradata to explore item level response times in surveys. Journal of the Royal Statistical Society A 176(1), 271–286.
Dean, A. and Voss, D. (2008). Design and Analysis of Experiments (6 pr.). New York: Springer.
Diekmann, A. (2007). Empirische Sozialforschung: Grundlagen, Methoden, Anwendungen, 18th edn. Reinbek bei Hamburg: Rowohlt.
Fehr, E., Fischbacher, U., von Rosenbladt, B., Schupp, J., and Wagner, G. G. (2002). A nation-wide laboratory: Examining trust and trustworthiness by integrating behavioral experiments into representative surveys. Schmollers Jahrbuch 122(4), 519–542.
Fitzgerald, R., Harrison, E., and Ryan, L. (2013). A cutting-edge comparative survey system: The European Social Survey. In B. Kleiner, I. Renschler, B. Wernli, P. Farago, and D. Joye (eds), Understanding Research Infrastructures in the Social Sciences (pp. 100–113). Zürich: Seismo.
Fowler, Jr., F. J. (2004). The case for more split-sample experiments in developing survey instruments. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 173–188). Hoboken, NJ: John Wiley & Sons.
Gangl, M. (2004). Welfare states and the scar effects of unemployment: a comparative analysis of the United States and West Germany. American Journal of Sociology 109(6), 1319–1364.
Gangl, M. (2010). Causal inference in sociological research. Annual Review of Sociology 36, 21–47.
Gangl, M. and Ziefle, A. (2015). The making of a good woman: extended parental leave entitlements and mothers' work commitment in Germany. American Journal of Sociology 121(2), 511–563.
Garner, C. L. and Raudenbush, S. W. (1991). Neighborhood effects on educational attainment: a multilevel analysis. Sociology of Education 64(4), 251–262.
Gornick, J., Ragnarsdóttir, B. H., and Kostecki, S. (2013). Cross-national data center in Luxembourg, LIS. In B. Kleiner, I. Renschler, B. Wernli, P. Farago, and D. Joye (eds), Understanding Research Infrastructures in the Social Sciences (pp. 89–99). Zürich: Seismo.
Helbig, M., Baier, T., and Kroth, A. (2012). Die Auswirkung von Studiengebühren auf die Studierneigung in Deutschland. Evidenz aus einem natürlichen Experiment auf Basis der HIS-Studienberechtigtenbefragung [The effect of tuition fees on enrollment in higher education in Germany. Evidence from a natural experiment]. Zeitschrift für Soziologie 41(3), 227–246.
Hensher, D. A. (1994). Stated preference analysis of travel choices: the state of practice. Transportation 21(2), 107–133.
Hensher, D. A., Rose, J. M., and Greene, W. H. (2005). Applied Choice Analysis. Cambridge: Cambridge University Press.
Hirschi, T. (1969). Causes of Delinquency. Berkeley, CA: University of California Press.
Hox, J. J., Kreft, I. G., and Hermkens, P. L. (1991). The analysis of factorial surveys. Sociological Methods and Research 19(4), 493–510.
Jann, B. (2015). Asking sensitive questions: overview and introduction. In U. Engel, B. Jann, P. Lynn, A. Scherpenzeel, and P. Sturgis (eds), Improving Survey Methods: Lessons from Recent Research (pp. 101–105). New York: Routledge/Taylor & Francis.
Jasso, G. (2006). Factorial survey methods for studying beliefs and judgments. Sociological Methods and Research 34(3), 334–423.
Johnston, R. and Brady, H. E. (2002). The rolling cross-section design. Electoral Studies 21(2), 283–295.
Kern, H. L. and Hainmueller, J. (2009). Opium for the masses: how foreign media can stabilize authoritarian regimes. Political Analysis 17(4), 377–399.
Kreuter, F. (2010). Interviewer effects. In P. J. Lavrakas (ed.), Encyclopedia of Survey Research Methods (pp. 369–371). Thousand Oaks, CA: Sage.
Kreuter, F. (2013). Improving Surveys with Paradata: Analytic Uses of Process Information. Hoboken, NJ: John Wiley & Sons.
Lane, J. (2010). Linking administrative and survey data. In P. V. Marsden and J. D. Wright (eds), Handbook of Survey Research, 2nd edn (pp. 659–680). Bingley: Emerald.
Legewie, J. (2013). Terrorist events and attitudes toward immigrants: a natural experiment. American Journal of Sociology 118(5), 1199–1245.
Liebe, U. (2015). Experimentelle Ansätze in der sozialwissenschaftlichen Umweltforschung [Experimental approaches in environmental research in the social sciences]. In M. Keuschnigg and T. Wolbring (eds), Experimente in den Sozialwissenschaften. 22. Sonderband der Sozialen Welt (pp. 132–152). Baden-Baden: Nomos.
Louviere, J. J., Hensher, D. A., and Swait, J. D. (2000). Stated Choice Methods: Analysis and Application. Cambridge: Cambridge University Press.
McFarland, S. G. (1981). Effects of question order on survey responses. Public Opinion Quarterly 45(2), 208–215.
Miller, D. C. and Salkind, N. J. (2002). Handbook of Research Design and Social Measurement, 6th edn. London: Sage.
Montgomery, D. C. (2013). Design and Analysis of Experiments, 8th edn. Hoboken, NJ: John Wiley & Sons.
Murnane, R. J. and Willett, J. B. (2011). Methods Matter: Improving Causal Inference in Educational and Social Science Research. New York: Oxford University Press.
Mutz, D. C. (2011). Population-Based Survey Experiments. Princeton, NJ: Princeton University Press.
Naef, M. and Schupp, J. (2009). Measuring trust: experiments and surveys in contrast and combination. IZA Discussion Paper No. 4087, Bonn.
Nagel, E. (1961). The Structure of Science: Problems in the Logic of Scientific Explanation. London: Routledge & Kegan Paul.
Reips, U. D. (2002). Standards for Internet-based experimenting. Experimental Psychology 49(4), 243–256.
Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review 15(3), 351–357.
Rosenbaum, P. R. (2002). Observational Studies, 2nd edn. New York, NY: Springer.
Rossi, P. H. (1979). Vignette analysis: uncovering the normative structure of complex judgments. In R. K. Merton, J. S. Coleman, and P. H. Rossi (eds), Qualitative and Quantitative Social Research: Papers in Honor of Paul F. Lazarsfeld (pp. 176–186). New York, NY: Free Press.
Ryan, M. (2004). Discrete choice experiments in health care. British Medical Journal 328(7436), 360–361.
Sauer, C., Auspurg, K., Hinz, Th., and Liebig, S. (2011). The application of factorial surveys in general population samples: the effects of respondent age and education on response times and response consistency. Survey Research Methods 5(3), 89–102.
Savage, S. J. and Waldman, D. M. (2008). Learning and fatigue during choice experiments: a comparison of online and mail survey modes. Journal of Applied Econometrics 23(3), 351–371.
Schnell, R. (2013). Linking Surveys and Administrative Data. German Record Linkage Center (Duisburg), Working Paper 2013-03.
Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin.
Snijders, T. A. B. and Bosker, R. J. (2012). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, 2nd edn. London: Sage.
Stocké, V. (2006). Attitudes towards surveys, attitude accessibility and the effect on respondents' susceptibility to nonresponse. Quality and Quantity 40(2), 259–288.
Sudman, S. and Bradburn, N. M. (1974). Response Effects in Surveys: A Review and Synthesis. Chicago, IL: Aldine.
Tourangeau, R., Rips, L. J., and Rasinski, K. (2000). The Psychology of Survey Response. Cambridge: Cambridge University Press.
Tourangeau, R. and Yan, T. (2007). Sensitive questions in surveys. Psychological Bulletin 133(5), 859–883.
Trappmann, M., Beste, J., Bethmann, A., and Müller, G. (2013). The PASS panel survey after six waves. Journal for Labour Market Research 46(4), 275–281.
Wallander, L. (2009). 25 years of factorial surveys in sociology: a review. Social Science Research 38(3), 505–520.
10
Total Survey Error Paradigm: Theory and Practice
Paul P. Biemer
INTRODUCTION
Total survey error (TSE) refers to the accumulation of all errors that may arise in the design, collection, processing, and analysis of survey data. In this context, a survey error can be defined as any error arising from the survey process that contributes to the deviation of an estimate from its true parameter value. Survey errors may arise from frame deficiencies, sampling, questionnaire design, translation errors, missing data, editing, and many other sources. Survey error diminishes the accuracy of inferences derived from the survey. A survey estimator is more accurate when the sum of its squared bias and its variance (i.e., its mean squared error) is minimized, which occurs only if the influences of TSE on the estimate are also minimized.
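In symbols, for an estimator \hat{\theta} of a parameter \theta, this accuracy criterion is the familiar mean squared error decomposition

\mathrm{MSE}(\hat{\theta}) = E\big[(\hat{\theta}-\theta)^2\big] = \mathrm{Bias}^2(\hat{\theta}) + \mathrm{Var}(\hat{\theta}),

in which, under the TSE perspective, both terms accumulate contributions from sampling as well as the nonsampling error sources listed above.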
TSE can also inflate the true standard errors of the estimates, leading to false inferences and erroneous research findings. Because survey data underlie many public policy and business decisions, a thorough understanding of the effects of TSE on data quality is needed to reduce the risk of poor policies and decisions.
The TSE paradigm is a valuable tool for understanding and improving survey data quality because it summarizes the ways in which a survey estimate may deviate from the corresponding parameter value. Sampling error, measurement error, and nonresponse error are the most widely recognized sources of survey error, but the TSE framework also encourages researchers to consider less commonly studied error sources, such as coverage error, processing error, modeling error, and specification error. The framework highlights the relationships between errors and the ways in which reducing one type of error can increase another, resulting in an estimate with more total error. As an example, efforts to reduce nonresponse error may lead to increased measurement errors. The TSE paradigm focuses on the Accuracy dimension of survey quality. Although Accuracy may impinge on a number of
other quality dimensions such as Timeliness, Comparability, Coherence, Relevance, and so on, these other dimensions will not be considered in much detail in this chapter. Rather, the chapter considers the TSE paradigm's implications for the three phases of the survey lifecycle: (a) design; (b) implementation; and (c) evaluation. With regard to (a), the paradigm prescribes that surveys should be designed to maximize data quality and/or estimator accuracy subject to budgetary constraints by minimizing the cumulative effects of error from all sources. This also applies to survey redesign. For example, rather than simply cutting the sample size to meet a budgetary shortfall for a survey, alternatives that achieve an equivalent cost reduction with less impact on the TSE should be considered. These might include using mixed modes, more cost-effective methods for interviewer training, or less expensive sampling frames with somewhat reduced coverage properties. Implementation (b) includes data collection, processing, file preparation, weighting, and tabulations. Here the paradigm suggests that adaptive designs should be employed that monitor the major sources of error, with the strategy of minimizing TSE within budgetary constraints through real-time interventions to the extent possible. Paradata should be monitored to simultaneously reduce the effects of all major sources of error on the estimates. For example, during nonresponse followup, CARI (Computer Assisted Recorded Interviewing) monitoring should be employed to continually evaluate interviewer performance and mitigate any adverse effects of followup operations on measurement error. The implications of the TSE paradigm for survey error evaluations (c) are that, in addition to the main effects of error sources on estimation, the interactions among multiple error sources should also be considered. Some examples include the following:
• the effects of nonresponse followup on both the nonresponse bias and other error sources such as measurement error and sampling error;
• contributions of interviewer-related nonresponse to interviewer variances; and
• the effects of more complete frame coverage on interviewer errors and nonresponse.
This chapter examines how the TSE paradigm can be applied in practice in survey design, implementation, and evaluation. The next section describes the implications of the paradigm for survey design. It is followed by sections that focus on implementation and on evaluation, respectively, and the final section provides a brief summary of the TSE paradigm for effective use in practice.
IMPLICATIONS FOR SURVEY DESIGN
Survey design is more challenging today for several reasons: budgets are severely constrained, pressures to provide timely data are greater in the digital age, public interest in participating in surveys has been declining for years and is now at an all-time low, and when cooperation is obtained from reluctant respondents, responses may be less accurate. In addition, traditional surveys are being challenged by less conventional and more cost-effective approaches such as opt-in panels and the use of 'found' data or Big Data. More and more, it is becoming difficult to convince survey clients that traditional probability surveys are worth the extra cost. Further, surveys are becoming much more complex as they struggle to meet the ever-increasing demands for more timely, detailed, and relevant information, pushing the limits of the survey approach. Surveys of special and rare populations, such as transients and sex workers, and panels lasting 20 years or more are becoming more commonplace. The main advantage of traditional probability surveys is the ability to customize the data to address an ever-expanding, broad range of essential research questions to improve society, the economy, and, more generally, the human condition. Survey data
can also be of very high quality at reasonable cost, provided that survey practitioners use the best methods available to optimize quality at all phases of the survey lifecycle, particularly the design phase. Survey designers face a number of difficult issues, including:
• How should costs be allocated across the stages of the survey process to maximize accuracy? (A stylized formulation of this allocation problem is sketched after this list.)
• What costs and error components should be monitored during data collection?
• What design modifications or interventions should be available to address costs and quality concerns during the data collection process?
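One stylized way to write the first of these questions as a formal design problem is given below; this is an illustrative formulation only, not one used later in the chapter, and it assumes that error and cost contributions can be attributed to identifiable design components.

\min_{d \in D}\ \mathrm{MSE}(\hat{\theta}; d) \approx \mathrm{Bias}^2(d) + \mathrm{Var}(d) \quad \text{subject to} \quad \sum_{k} c_k(d) \le B,

where d ranges over the feasible designs D, c_k(d) is the cost of design component k (frame, sample size, mode, training, followup, and so on) under design d, and B is the total budget. The remainder of this section explains why the inputs needed to solve such a problem exactly are rarely available, and why approximate solutions are often good enough.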
These questions are difficult to address in practice because the information required for achieving optimal accuracy is seldom available. To maximize accuracy, the survey designer needs to know how much total error is contributed by each of the major sources of error. In addition, choosing between alternative survey designs and implementation methods requires information on how these error sources are affected by the various design choices that may be considered for the survey. As an example, Peytchev and Neely (2013) considered several options for random-digit-dial (RDD) surveys: a landline frame only, landline plus cell phone frames, and a cell-phone-only frame. They conclude that, as cell phone penetration grows, a cell-phone-only RDD design may be optimal in terms of minimizing the TSE. Survey designers might ask: 'Where should additional resources be directed to generate the greatest improvement in data quality: extensive interviewer training for nonresponse reduction, greater nonresponse followup intensity, or larger incentives to encourage sample members to participate?' Or, 'Should a more expensive data collection mode be used, even if the sample size must be reduced significantly to stay within budget?' Even when some data on the components of nonsampling error are known for a survey, this information alone may be insufficient. Identifying an optimal allocation of survey resources to reduce nonsampling error requires information not only on the magnitudes of the key nonsampling error components but also on how the errors are affected by the various allocation alternatives under consideration. Such information rarely exists for a survey, and it may be unreasonable or impractical to expect that it could ever be made available for a single survey with any regularity. Fortunately, detailed knowledge of costs, errors, and methodological effects of design alternatives is not needed for every survey design, for two reasons: (a) design robustness and (b) effect generalizability (Biemer 2010). Design robustness refers to the idea that the mean squared error (MSE) of an estimator may not change appreciably as the survey design features change. In other words, the point at which the MSE is minimized is said to be 'flat' over a fairly substantial range of designs. For example, it is well known that the optimum allocation of the sample to the various sampling stages in multistage sampling is fairly robust to suboptimal choices (see, for example, Cochran 1977). Effect generalizability refers to the idea that design features found to be optimal for one survey are often generalizable to other similar surveys – that is, surveys on similar topics with similar target populations, cultures, data collection modes, and survey conditions. Dillman's tailored design method (Dillman et al. 2009) exploits this principle for optimizing mail surveys. Similar approaches are now being developed for Internet surveys (Couper 2008; Dillman et al. 2009). Through meta-analyses involving hundreds of experiments on surveys spanning a wide range of topics, survey methodologists have identified what appear to be the 'best' combinations of survey design and implementation techniques for maximizing response rates, minimizing measurement errors, and reducing survey costs for these survey modes. Dillman's tailored design method prescribes the best combination of survey design choices to achieve an optimal design for mail and Internet
surveys that can achieve good results across a wide range of survey topics, target populations, and data collection organizations. Standardized and generalized optimal design approaches have yet to be developed for interviewer-assisted data collection modes or for surveying most types of special populations, regardless of the mode. One problem is that the number of 'moving parts' in interviewer-assisted surveys is much greater than in mail and Internet surveys, making the control of process variation through standardization much more difficult. Nevertheless, there exists a vast literature covering virtually all aspects of survey design for many applications. As an example, there is literature on the relationship between length of interviewer training, training costs, and interviewer variance (see, for example, Fowler and Mangione 1985). Whether these relationships are transferable from one survey to another will depend upon the specifics of the application (e.g., survey topic, complexity, target population). There is also a considerable amount of literature relating nonresponse reduction methods, such as followup calls and incentives, to response rates and, in some cases, nonresponse bias (see Singer and Kulka 2002 for a review of the literature). Perhaps the TSE paradigm that led to a theory of optimal design for mail and Internet surveys may one day be employed in the development of a theory and methodology for optimal face-to-face or telephone survey design. Another set of issues that should be considered at the design stage is what costs and error components should be monitored during data collection. Besides specifying the objectives of cost and error monitoring, the plan should identify the metrics that will serve as indicators of bias and variance components of the key survey estimates, the paradata and other data that will be collected to compute these metrics, the decision criteria that will be used to determine whether the processes being monitored are in statistical control, and the possible interventions that will be applied to return processes to a stable state and correct process deficiencies.
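To make the elements of such a monitoring plan concrete, they might be recorded per error source in a simple structure like the sketch below; the error sources, metric names, decision rules, and interventions shown are hypothetical and are not drawn from any particular survey or system.

# Illustrative structure for a cost and error monitoring plan; all names,
# rules, and interventions are invented placeholders.
monitoring_plan = {
    "nonresponse": {
        "metrics": ["weekly_response_rate", "subgroup_response_rate_gap"],
        "paradata": ["call_records", "contact_outcomes"],
        "decision_rule": "flag when the weekly response rate falls below its lower control limit",
        "interventions": ["increase incentive", "switch mode for low-propensity cases"],
    },
    "measurement": {
        "metrics": ["cari_protocol_breach_rate", "item_missing_rate"],
        "paradata": ["cari_recordings", "keystroke_data"],
        "decision_rule": "flag interviewers whose breach rate exceeds its upper control limit",
        "interventions": ["targeted retraining", "closer supervision"],
    },
}

In practice, each entry would be tied to the survey's own CTQs and to the paradata its systems actually capture.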
A number of strategies are available for designing surveys so that costs and quality can be closely monitored and optimized in real time. In previous literature, this approach has been given several different labels, including active management (Laflamme et al. 2008), responsive design (Groves and Heeringa 2006), and adaptive design (Wagner 2008; Schouten et al. 2013). The next section describes some of these approaches and provides a full discussion of the planning that should precede data collection and data processing to ensure that quality and costs can be effectively monitored and controlled in real time.
IMPLICATIONS FOR SURVEY IMPLEMENTATION
Despite careful planning, and even under ideal circumstances, surveys are seldom executed exactly as they were designed, for several reasons. First, the survey sample itself is random, which introduces a considerable amount of unpredictability into the data collection process. There are also numerous other sources of random 'shocks' during the course of a survey, such as personnel changes, especially among field interviewers (FIs), the weather in the data collection sites, staffing issues, catastrophic events, and other unforeseen complications. Data costs may be considerably higher than expected in some areas, or data quality indicators, such as response rates, frame coverage rates, missing data rates, and interviewer performance measures, may suggest that survey quality is inadequate. It may be necessary to change the data collection mode for some sample members or to introduce other interventions to deal with problems as they arise. A proactive, dynamic, and flexible approach to survey implementation is needed to deal with these uncertainties. Thus, an essential ingredient of an optimal survey design is a plan for continuously
monitoring key cost and error metrics that will allow survey managers to control costs and reduce errors as processes are in progress. Of course, actively monitoring and managing quality and production data has always been an essential and integral part of survey processes. But, CAI and computerized field management systems have considerably increased the availability of detailed data on the data collection and other survey activities as well as the speed with which these data can be compiled, analyzed or visualized, and reported. With that, more structured and systematic strategies for data quality monitoring have been devised that take advantage of these important advances. Next we describe a few strategies for controlling costs and improving data quality in real time.
Continuous Quality Improvement
An approach that can be applied to virtually any survey operation is the continuous quality improvement (CQI) approach (Biemer and Caspar 1994; Morganstein and Marker 1997). A number of statistical organizations have adopted at least some aspects of CQI to control costs and errors in their surveys, including the US Census Bureau (US Department of Labor, Bureau of Labor Statistics and US Department of Commerce, Bureau of the Census 2002), Statistics Sweden (Lyberg 1985; Biemer et al. 2014), Statistics Canada (Statistics Canada 2002), and Eurostat (Eurostat 2009). CQI can be viewed as an adaptation of Kaizen (Imai 1986), the Japanese philosophy of continual improvement, to the survey process. Like Kaizen, CQI uses a number of standard quality management tools, such as the workflow diagram, the cause-and-effect (or fishbone) diagram, Pareto histograms, statistical process control methods, and various production efficiency metrics (see, for example, Montgomery 2009). The CQI approach consists essentially of six steps (Biemer 2010), as follows:
1. Prepare a workflow diagram of the process to be monitored and controlled.
2. Identify characteristics of the process that are 'critical to quality' – sometimes referred to as CTQs.
3. Develop metrics that can be used to reliably monitor key costs and quality characteristics of each CTQ in real time.
4. Verify that the process is stable (i.e., in statistical control) and capable (i.e., can produce the desired results).
5. Continually monitor costs and quality metrics during the process.
6. Intervene as necessary to ensure that quality and costs are within acceptable limits.
The process workflow diagram (Step 1) is a graphical representation of the sequence of steps required to perform the process, from the initial inputs to the final output. In addition to the steps required, the flowchart can include a timeline showing durations of activities, as well as annotations regarding inputs, outputs, and activities that are critical to quality (Step 2). As an example, Figure 10.1 shows a high-level workflow diagram for the Current Employment Statistics (CES) survey data collection process conducted by the Bureau of Labor Statistics. The CES collects monthly employment data from a sample of 140,000 nonfarm businesses and government agencies. To encourage participation in this voluntary program, the CES offers a variety of reporting methods. Initial contact is established via Computer Assisted Telephone Interviewing (CATI), with a goal of gradually switching to self-administered modes, such as Electronic Data Interchange (EDI), the Internet, Touchtone Data Entry (TDE), fax, or mail. The survey faces many challenges related to the constantly changing survey environment, increasing nonresponse in both household and establishment surveys, and unpredictable costs. The top of Figure 10.1 shows the major steps associated with the CATI process for establishments that have just begun data collection (i.e., Months 1–6 of data collection).
Figure 10.1 A high-level process flow diagram for the CES data collection process.
The bottom of the figure shows the self-administered collection process for establishments that have been active for 7 months or more. The letters A through H that appear above the arrows in the diagram identify the potential CTQs, i.e., processing steps that could have important effects on quality or costs (as specified in Step 2 of the CQI process). Table 10.1 describes these CTQs in a little more detail and provides an initial assessment of their risks to costs and data quality. For example, CTQ 'A' refers to the ability of interviewers to obtain valid establishment contact information and then to verify with the point of contact (POC) that the establishment meets the CES eligibility criteria. As shown in Table 10.1, the risks to data quality and costs are expected to be low for this CTQ, owing primarily to the low frequency of occurrence expected for these issues. The last column provides the potential metrics and/or paradata that could be used to form metrics to monitor the cost and error characteristics associated with each CTQ (see Step 3 of the CQI process). Other paradata and their corresponding metrics could
also be identified as the process runs. To the extent there is confidence in these initial assessments of the risks to errors and costs of the CTQs, it may be prudent to focus attention solely on CTQs that have moderate to high risk impacts on either costs or errors according to these preliminary assessments. Note that the CTQs in Table 10.1 are focused primarily on the response rate and nonresponse bias. CTQs that focus on sampling error, frame error, and measurement error can also be identified, with appropriate metrics to monitor these errors during data collection. For example, paradata arising from the editing process can be used to track the frequency of edit failures, their causes, potential effects on reporting bias, characteristics of establishments responsible for failed edits, and so on. Likewise, the effects of nonresponse on the final weights and sampling error can also be tracked and controlled during data collection using paradata and an appropriate set of metrics.
Table 10.1 CTQs by process step, potential impacts, and monitoring metrics

A. Unable to obtain establishment contact information or verify that the establishment is eligible for the CES.
   Potential impacts: TSE Low; Cost Low.
   Paradata/metrics: Number of establishments where contact information cannot be obtained or eligibility could not be verified; frame characteristics of all such units.

B. Establishment point of contact (POC) refuses upon contact; further contacts with the establishment are terminated.
   Potential impacts: TSE Moderate; Cost Low.
   Paradata/metrics: Number of establishments refusing at this stage; frame characteristics of all such units.

C. POC refuses after receiving enrollment package.
   Potential impacts: TSE High; Cost Moderate.
   Paradata/metrics: Number of establishments refusing at this stage; frame characteristics of all such units.

D. POC refuses or cannot be contacted for CATI interview.
   Potential impacts: TSE Moderate; Cost Moderate.
   Paradata/metrics: Number of establishments refusing or non-contacted at this stage; frame characteristics of all such units.

E. POC refuses to convert to self-administered mode.
   Potential impacts: TSE Low; Cost High.
   Paradata/metrics: Number of establishments refusing; reasons for refusing; frame characteristics of all such units.

F. POC fails to send in report by touchtone data entry (TDE), or TDE report delayed past closing date.
   Potential impacts: TSE Moderate; Cost High.
   Paradata/metrics: Number of establishments failing to send or delaying report; reasons for refusal/delay; frame characteristics of all such units.

G. Nonresponse followup calls do not convert POC.
   Potential impacts: TSE High; Cost Moderate.
   Paradata/metrics: Number of establishments requiring nonresponse followup (NRFU); number refusing at NRFU; frame characteristics of all such units.

H. TDE missing critical data or failing edit; contacts with POC necessary to resolve.
   Potential impacts: TSE Low; Cost Moderate.
   Paradata/metrics: Number of establishments missing critical data; reasons for missing data or failed edit; types of data missing or edit failures; frame characteristics of all such units.

A detailed cost model can be used that decomposes data collection costs by each step of the process and allows analysts to explore the impact on costs
of various alternative data collection modifications. As described later in this section, this total error focus is a defining feature of the Adaptive Total Design (ATD) strategy. Step 4 is often an optional step, but can be important if there is some question as to whether a particular process step is capable of producing outputs with acceptable quality at an affordable cost. This step essentially verifies that recent history regarding process performance is a reasonable indicator of future performance. For example, early performance of interviewers may reflect a learning period that will rapidly change as the interviewers gain experience. During this period, performance metrics may behave erratically. An erratic metric could be an indication that the metric is unreliable as a measure of output quality or costs and is therefore useless
for cost or quality monitoring and improvement. Regardless of the reason, some action may be required to stabilize the metric before quality improvements can proceed, because unstable processes cannot be easily improved until they are stabilized and capable of producing consistent results. Verifying process capability can be done fairly effectively using process control tools and principles such as the control charts, scatter diagrams, stem-and-leaf plots, etc. described in Breyfogle (2003). As an example, these tools can be used to detect a change in the variation of the key process metrics over time, which is an indicator of process capability. Determining whether the variation in one or more process metrics represents common or special cause variation (Montgomery 2009) is an essential component of any quality improvement strategy.

Figure 10.2 Simulated dashboard for monitoring production, costs, and interview quality.
Common cause variations are often the result of minor disturbances that occur frequently and naturally during the normal course of the process. Such variations are inherent in the process and can only be reduced by redesigning the process so that it produces results with smaller variation. Specific actions to address common causes are not advisable because, rather than reduce variation, these actions may actually increase it. For example, period-to-period variations in interviewer response rates are a natural phenomenon caused by a multitude of uncontrollable events, including sample composition and fluctuations in the general survey conditions (weather, traffic, seasonal variations, etc.). Interviewers may feel threatened if they are questioned about every dip in their response rates. This may lead some of them to overcompensate by using more aggressive tactics to achieve a higher response rate, which could lead to greater response rate variation. Indeed, some interviewers may leave the workforce, leading to disruptions in data collection. Recruitment of new and possibly less experienced interviewers could cause even greater variations in period-to-period response rates. By contrast, special cause variations are much larger deviations that are traceable to a single, specific cause. As an example, the CES interviewer enrollment rates normally fluctuate from period to period as a result of many uncontrollable variables such as workload size, types of establishments, establishment location, the capability of the establishment point of contact (POC) to report accurately, variations in the abilities of the interviewers to enroll the POC, and so on. Such fluctuations are part of the common cause variation in enrollment rates. However, an interviewer enrollment rate that is substantially lower than the mean rate could indicate a problem with that particular interviewer's performance – i.e., a special cause deviation.
The quality control literature provides a number of tools for distinguishing between special and common cause variation. One important tool is the process control chart, which defines the boundaries (called control limits) for common cause variation for a given metric. Control limits are bounds within which metrics can vary solely as a result of the common causes. Thus, values of the metric that fall outside these control limits signal special cause variation (requiring intervention), while values within the control limits suggest common cause variation (no intervention required).
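A minimal sketch of how such control limits might be computed and applied to a daily production metric follows; the baseline counts are invented, and a production system would typically base the limits on an expected trajectory (as in Figure 10.2) rather than on a constant baseline mean.

import statistics

# Hypothetical daily counts of completed interviews during a stable baseline period.
baseline = [52, 48, 55, 50, 47, 53, 49, 51, 46, 54]

mean = statistics.mean(baseline)
sd = statistics.pstdev(baseline)   # population SD of the baseline period
ucl = mean + 3 * sd                # upper control limit (three-sigma rule)
lcl = mean - 3 * sd                # lower control limit

# New observations are flagged as potential special-cause variation when they
# fall outside the control limits; values inside the limits are treated as
# common-cause variation and left alone.
for day, completed in enumerate([50, 49, 31, 52, 70], start=1):
    status = "possible special cause" if (completed > ucl or completed < lcl) else "in control"
    print(f"day {day}: {completed} completed interviews -> {status}")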
For Step 5, the project team determines the frequency for computing and reviewing metrics (for example, weekly) and the format of the data displays (e.g., tabular, graphical, descriptive). Prior experience suggests that it is often both informative and practicable to display related metrics together as a 'dashboard' (see Figure 10.2). The dashboard organizes and displays critical information on costs, production, and data quality for a variety of metrics whose variations may be correlated. This aids in detecting possible unintended consequences of interventions intended to induce favorable changes in one or more metrics. For example, efforts to improve response rates are usually accompanied by an increase in costs. Thus, it could be informative to monitor components of the response rates, interviewer effort, and field costs simultaneously in a display or dashboard. An example of this is shown in Figure 10.2 using simulated data. This figure is divided into four quadrants. Production (completed cases per day) can be monitored in the upper left quadrant, which graphs daily production along with upper and lower control limits. These control limits are based upon three-sigma deviations from an expected production trajectory based upon data from prior quarters of data collection. The graph shows that production is following the expected downward trajectory. Suppose that, to boost production, an intervention such as an increased incentive or a more effective but more costly interviewing strategy is implemented at Collection Day 40. Note that the surge in productivity
due to this ‘special cause’ crosses the upper control limit as intended. However, it is clear from the upper right quadrant that, concurrently with this intervention, data collection costs also increase, but total, cumulative costs still hold closely to their planned trajectory. Unfortunately, as shown in the lower left quadrant, interviewer deficiencies (for example, question delivery breaches and other deviations from survey protocols) become much more erratic indicating that interviewing quality has slipped out of statistical control. Although the data in Figure 10.2 are simulated, they illustrate the importance of monitoring metrics across multiple quality and costs dimensions to gather a full understanding of the impact of interventions. Project managers should also be able to interact with dashboards; for example, to modify the data displays to reveal different perspectives or views of the same data source in order to search for causalities. This is particularly useful for explaining both common and special cause variations. Ideally, these dashboards should provide managers with the functionality to ‘drill-down’ into the data in order to look for root causes and to investigate the effects of prior interventions and remedial actions. For example, by viewing the data underlying the results in the lower left quadrant of Figure 10.2, managers can determine which interviewers, if only a few, are responsible for the erratic behavior of the graph after Day 40. Likewise, the productivity and cost data can be examined by region or other geography to determine whether the increases in interviews and costs are related to geography.
Responsive Design
Responsive design (Groves and Heeringa 2006) is a strategy originally applied to face-to-face data collection that includes some of the ideas, concepts, and approaches of CQI while providing several innovative strategies that use paradata as well as survey data to
monitor nonresponse bias and followup efficiency and effectiveness. Like CQI, responsive design seeks to identify features of the survey design that are critical to data quality and costs and then to create valid indicators of the cost and error properties of those features. These indicators are closely monitored during data collection, and interventions are applied, as necessary, to reduce survey errors (primarily nonresponse bias) and costs. A unique feature is that responsive design organizes survey data collection around three (or more) phases: (1) an experimental phase, during which alternate design options (e.g., levels of incentives, choice of modes) are tested; (2) the main data collection phase, where the bulk of the data is collected using the design option selected after the first phase; and (3) the nonresponse followup phase, which is aimed at controlling costs and minimizing nonresponse bias, for example, by subsampling the nonrespondents from the second phase (i.e., double sampling), shifting to a more effective mode, and/or using larger incentives. Responsive design recognizes that, in many cases, the survey designer is unable to choose the optimal data collection approach from among several promising alternatives without extensive testing. In fact, it is not uncommon in survey work for data collection to be preceded by a pretest or pilot survey designed to identify the best data collection strategy. Responsive design formalizes this practice and includes it as an integral part of the survey design which is referred to as Phase 1. Unlike traditional pretesting, primary data collection (Phase 2) follows immediately on the heels of Phase 1 rather than pausing to analyze the pretest results and setting the design features for the main survey. In addition, responsive design aims to incorporate Phase 1 data in the overall survey estimates unless the data quality metrics suggest otherwise. For this reason, responsive design relies heavily on real-time monitoring of costs and quality metrics across experimental variations so that decisions regarding main data collection design
features, including sample sizes, can be made before Phase 1 is terminated. Another key concept of responsive design is the notion of a phase capacity. The main data collection phase (Phase 2) is said to have reached its phase capacity when efforts to reduce nonresponse and its biasing effects on selected survey estimates are no longer cost-effective. For example, after many attempts to follow up with nonrespondents, the key survey estimates remain unchanged, and the data collection phase is said to have reached its phase capacity. According to Groves and Heeringa (2006), a phase capacity condition signals the ideal point at which the main data collection phase should be terminated and the third phase should begin. Phase 3 intensifies the nonresponse followup operation from the second phase. However, to control costs, Groves and Heeringa propose that only a subsample of the Phase 2 nonrespondents be pursued in Phase 3. Nonrespondents that are not selected for the subsample are not followed up or further pursued. A weight adjustment is applied to the nonrespondents who eventually respond to represent the nonsampled nonrespondents. For example, if a simple random sample of half of the nonrespondents is followed up, their weights would be at least doubled to account for nonrespondents who were not selected. In practice, the random subsample may not be selected using simple random sampling. Instead, the selection probabilities may be a function of predicted response propensities, costs per followup attempt, and the original case selection weights. Groves and Heeringa discuss a number of innovative metrics based upon paradata that can be used for CQI in all three phases, as well as approaches for signaling when phase capacity has been reached. Although responsive design primarily focuses on nonresponse error, it can be combined with the TSE reduction strategies of CQI to provide a more comprehensive strategy for controlling costs and error. For example, as shown in Kreuter et al. (2010) and
Kaminska et al. (2010), nonresponse reduction efforts can increase measurement errors. This might occur, for example, as a result of respondent satisficing (Krosnick and Alwin 1987) or interviewers who sacrifice data quality to avoid breakoffs (Peytchev et al. 2010). Likewise, subsampling nonrespondents in Phase 3 may reduce the nonresponse bias, but it can also substantially reduce the precision of the estimates as a consequence of increased weight variation (i.e., the unequal weighting effect) (see, for example, Singh et al. 2003). The usual method for controlling this variation is to trim the weights, but this can increase the estimation bias (see, for example, Potter 1990). Cumulatively, TSE could be substantially increased even as the bias due to nonresponse is reduced. These risks to TSE are ameliorated by monitoring and controlling multiple error sources simultaneously, which is the intent of the strategy described in the next section.
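To make the weighting logic concrete (in generic notation rather than that of the cited authors): if m of the n_{nr} Phase 2 nonrespondents are subsampled with equal probability, the weight of each subsampled case that responds is inflated by the inverse of the subsampling rate, and the resulting loss of precision is often summarized by the unequal weighting effect (UWE):

w_i^{(3)} = w_i \cdot \frac{n_{nr}}{m}, \qquad \mathrm{UWE} = 1 + \mathrm{cv}^2(w) = \frac{n \sum_i w_i^2}{\left(\sum_i w_i\right)^2},

where w_i is the case's weight prior to subsampling, cv(w) is the coefficient of variation of the final weights, and the sums run over the n responding cases. With m = n_{nr}/2, as in the example above, the adjustment factor is 2, i.e., the weights are doubled.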
Adaptive Designs
Several other strategies for real-time control of costs and errors have been proposed in the literature under the rubric 'adaptive design'. Some of these strategies are simply variations on the CQI concept applied to data collection activities – primarily to increase response rates (see, for example, Wagner 2008). However, Schouten et al. (2013) present adaptive design as a strategy for tailoring key features of the survey design to different types of sample members in order to maximize response rates and reduce nonresponse selectivity. They further propose a mathematical framework to decide which features of the survey to change at specific time points during data collection. Schouten's approach works on the premise that different people or households in the sample should receive different treatments (for example, interview modes) that are predetermined at the survey design stage. Paradata such as interviewer observations of the
neighborhood, the dwelling or the respondents, or the performance of interviewers themselves provide input to an algorithm that determines the ideal design feature to apply to different population subgroups to optimize costs, response rates, and sample balance. Schouten views responsive design as a special case of adaptive design in that responsive designs are used in settings where little is known about the sample members beforehand and/or little information is available about the effectiveness of the design features. Adaptive designs are appropriate when substantial prior information about sample units is available (from a population register or frame, for example) or for continuing or repeating surveys. To briefly illustrate this approach, suppose there are four design features that can be changed during data collection according to the characteristics of the sample members. For example, these may be: (a) which of two types of advance letters should be sent prior to contact; (b) whether the data collection mode should be web or telephone (CATI); (c) whether initial nonrespondents should be sent a reminder letter; and (d) whether 6 or 15 call attempts should be made. Assume further that the survey designer knows (1) the incremental costs of each of these four design options and (2) the effect of each option on response rates for key population domains defined for the study. Thus, Schouten provides an approach for achieving an optimal assignment of design features (a)–(d) to the population domains subject to the restriction that the total cost is no larger than the budget available for data collection.
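The brute-force sketch below illustrates the flavor of such an optimization for the four features (a)–(d); it is not Schouten et al.'s algorithm, and the domains, per-case costs, and response-rate effects are entirely invented stand-ins for the two pieces of information the designer is assumed to know.

from itertools import product

# Illustrative-only inputs: domain sample sizes, incremental costs, and
# response-rate effects are invented, not estimates from any real survey.
domains = {"urban": 400, "suburban": 350, "rural": 250}
features = list(product(["letter_A", "letter_B"],   # (a) advance letter
                        ["web", "cati"],             # (b) mode
                        [False, True],               # (c) reminder letter
                        [6, 15]))                    # (d) call attempts
budget = 12000.0

def cost_per_case(f):
    letter, mode, reminder, attempts = f
    return {"web": 5.0, "cati": 22.0}[mode] + (2.0 if reminder else 0.0) + 0.8 * attempts

def expected_response_rate(domain, f):
    letter, mode, reminder, attempts = f
    rr = {"urban": 0.35, "suburban": 0.42, "rural": 0.50}[domain]
    rr += (0.02 if letter == "letter_B" else 0.0) + (0.08 if mode == "cati" else 0.0)
    rr += (0.03 if reminder else 0.0) + 0.004 * attempts
    return min(rr, 0.95)

best = None
# Brute force over every assignment of one feature combination to each domain,
# keeping the assignment with the highest expected overall response rate that
# stays within the data collection budget.
for assignment in product(features, repeat=len(domains)):
    total_cost = sum(n * cost_per_case(f) for n, f in zip(domains.values(), assignment))
    if total_cost > budget:
        continue
    expected_respondents = sum(n * expected_response_rate(d, f)
                               for (d, n), f in zip(domains.items(), assignment))
    overall_rr = expected_respondents / sum(domains.values())
    if best is None or overall_rr > best[0]:
        best = (overall_rr, total_cost, dict(zip(domains, assignment)))

print(best)

A real application would replace the exhaustive search with a formal optimization and would typically bring an explicit measure of sample balance (mentioned above) into the objective rather than maximizing the overall response rate alone.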
Adaptive Total Design
ATD (Biemer 2010; Eltinge et al. 2013) is an adaptive design strategy that combines the ideas of CQI and the TSE paradigm to reduce costs and error across multiple survey processes including frame construction, sampling, data collection, and data processing.
Like the above optimization strategies, the goal of ATD is to monitor the survey process in real time through a system of metrics and dashboards, intervening as necessary to achieve favorable costs and data quality metrics. However, there are several differences. First, ATD explicitly minimizes the risks of errors across multiple error sources by simultaneously monitoring and controlling costs and quality metrics across survey processes. Second, ATD attempts to identify interactive error sources where attempts to reduce the error in one source may cause unintended consequences in other error sources. As examples, efforts to reduce nonresponse bias by intensive followup may increase the risk of interviewer and other response errors. On the other hand, efforts to monitor interviewer performance using CARI or similar recorded interviewing approaches could increase refusals. Intensive counting and listing procedures to reduce coverage error may cause delays in fielding the sample in some areas, curtailing the nonresponse followup process. Two-phase sampling for nonresponse (Hansen and Hurwitz 1946) may reduce the followup effort and increase weighted response rates but may also reduce precision due to increased weight variation. Many other error interactions are possible. Finally, ATD incorporates a control phase whose purpose is to ensure that the actions taken to improve critical costs and quality metrics are effective, that survey performance has improved, and improvements are sustained for the duration of the processes. The control phase also monitors potentially unintended consequences caused by process interventions on other parts of the survey process. The methods of process control and other Six Sigma tools play an important role in achieving the objectives of the control phase as described in the next section. Particularly for the control phase, dashboards can be created based upon paradata to monitor sampling error, nonresponse, measurement errors, and frame coverage errors during data collection. By combining time series
graphs of performance measures and error rates for interactive error sources within the same dashboard display, the effects on all these metrics of interventions designed to change them can be monitored. For example, the dashboard considered in Figure 10.2 suggests an interactive effect between interviewer production and CARI monitoring results at the point when the intervention designed to increase response rates was applied. One hypothesis that could be investigated is whether the intervention to increase response rates is having an adverse effect on interview quality while increasing production costs. Likewise, during the data processing stage, additional metrics can be developed and continuously monitored to improve the data capture, editing, coding, and data file preparation processes. This would allow the survey designer to be responsive to costs and errors throughout the survey process and across all major sources of TSE. Here, an even richer set of strategies for ATD is needed and can be found in the literature of Six Sigma (see, for example, Breyfogle 2003).
Six Sigma
Developed at Motorola in the early 1980s, Six Sigma embodies a set of principles and strategies for improving any process. Like CQI, Six Sigma emphasizes decision making based on reliable data that are produced by stable processes, rather than intuition and guesswork. It seeks to improve quality by reducing error risks and process variations through standardization. An important distinction between CQI and Six Sigma is the latter's emphasis on providing verifiable evidence that quality improvements have in fact improved quality and reduced costs, and that these gains are being held or further improved, as described above for the ATD control phase. Similar to the six steps outlined above for CQI, Six Sigma operates under the five-step process referred to as DMAIC: define the problem, measure
key aspects of the process (i.e., CTQs) and collect relevant data, analyze the data to identify root causes, improve or optimize the current process using a set of Six Sigma tools designed for this purpose, and control and continue to monitor the process to hold the gains and effect further improvements. The ultimate goal of Six Sigma is to reduce the number of errors in a process to about 3.4 per million opportunities which may be achievable in a manufacturing process but would seem an unachievable goal for surveys. Still, Six Sigma provides a disciplined strategy, a catalog of approaches and tools, and a general philosophy that can be used to substantially improve survey processes. Many of the Six Sigma concepts as they apply to survey data collection are embodied in the ATD approach. Thus, ATD controls the critical components of TSE through the application of effective interventions at critical points during the process to address special causes (Step 6 in the CQI strategy). Likewise, process improvements will be implemented to reduce common cause variation as well as to improve the average value of a metric for the process. The error control and process improvement aspects of ATD are the most challenging, requiring considerable knowledge, skill, and creativity to be effective. Interventions should be both timely and focused. In some cases, it may be necessary to conduct experiments so that the most effective methods can be chosen from among two or more promising alternatives. All the while, costs and timeliness are held to constraints to avoid overrunning the budget or delaying the schedule.
IMPLICATIONS FOR THE EVALUATION OF SURVEYS AND SURVEY PRODUCTS Post-survey evaluations of at least some components of the total MSE (the MSE that reflects all major sources of error) are an
essential part of the TSE paradigm. Standard errors for the estimates have been routinely reported for surveys for decades and are now considered essential documentation. Evaluations of nonsampling error components of the MSE are conducted much less frequently. One exception is the analysis of nonresponse bias required by the US Office of Management and Budget (OMB) for government-sponsored surveys that achieve response rates below 80% (OMB 2006). While this focus on nonresponse bias is welcome, the field could benefit from guidelines and standards that encourage the evaluation of other components of the total MSE that are potentially more problematic for many uses of the data. Nonsampling error evaluations address several dimensions of total survey quality. First, they are essential for optimizing the allocation of resources in survey design to reduce the error contributed by specific processes. Also, in experimentation, error evaluations are needed to compare the accuracy of data from alternative modes of data collection or estimation methods. Estimates of nonsampling errors (e.g., nonresponse bias analyses, measurement reliability studies) also provide valuable information to data users about data quality. Such evaluations can be important for understanding the uncertainty in estimates, for interpreting the results of data analysis, and for building confidence and credibility in the survey results. Here we show how the TSE paradigm is being applied in one national statistical organization to annually review, evaluate and, hopefully, improve the quality of a number of statistical products. Only a brief summary of this approach is possible because of space limitations. However, a full description of the program can be found in Biemer et al. (2014).
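As an illustration of one such evaluation, the bias of an unadjusted respondent mean due to nonresponse can be written as the nonrespondent share of the sample multiplied by the difference between respondent and nonrespondent means. The sketch below uses this standard expression with invented numbers; in practice the nonrespondent mean would come from frame data, administrative records, or a nonrespondent follow-up study.

# Nonresponse bias of the unadjusted respondent mean (hypothetical inputs):
# bias = (1 - response_rate) * (mean_respondents - mean_nonrespondents)
def nonresponse_bias(response_rate, mean_respondents, mean_nonrespondents):
    return (1 - response_rate) * (mean_respondents - mean_nonrespondents)

# A 65% response rate, with respondents averaging 0.42 on some characteristic
# and nonrespondents 0.30, implies a bias of about +0.04 in the naive estimate.
print(round(nonresponse_bias(0.65, 0.42, 0.30), 3))  # 0.042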
The ASPIRE Approach Since 2010, Statistics Sweden has been applying a process for the annual review,
evaluation, and improvement of a number of their key statistical products which they refer to as A System for Product Improvement, Review and Evaluation (ASPIRE). Statistics Sweden has used ASPIRE for managing quality improvements across a wide range and variety of statistical products (Biemer et al. 2014). ASPIRE involves assembling information on data quality from reports, memoranda, interviews with staff, data users, and producers and then assessing the risks to data quality for all relevant dimensions of total error for a particular product. Then each product is rated according to criteria designed to assess the extent to which error risks are mitigated through current quality improvement efforts. This general approach can be extended to all dimensions of quality; however, it is currently used solely to assess the accuracy (TSE) of ten important statistical products within Statistics Sweden. Inherent and residual risks to accuracy (described later) are assessed separately for each error source that may affect product quality. Error sources may not be the same for all products. For example, sampling error does not apply to products that do not employ sampling. Likewise, if revised estimates are not issued for a product, then there would be no risk of revision error. As shown in Table 10.2, three sets of error sources have been identified for the ten products considered. Note that the error sources associated with the two registers – Business and Total Population – are somewhat different than the error sources for the other products. Likewise, the error sources associated with Gross Domestic Product are somewhat different from those for the survey products. Each error source is also assigned a risk rating depending upon its potential impact on the data quality for a specific product. In this regard, it is important to distinguish between two types of risk referred to as ‘residual’ (or ‘current’) risk and ‘inherent’ (or ‘potential’) risk. Residual risk reflects the likelihood that a serious, impactful error might occur from the source despite the current efforts that are
Table 10.2 Sources of error considered by product Product
Error sources
Survey products Foreign Trade of Goods Survey (FTG) Labour Force Survey (LFS) Annual Municipal Accounts (RS) Structural Business Survey (SBS) Consumer Price Index (CPI) Living Conditions Survey (ULF/SILC)
Specification error Frame error Nonresponse error Measurement error Data processing error Sampling error Model/estimation error Revision error Registers Specification error Business Register (BR) Frame error: Total Population Register (TPR) Overcoverage Undercoverage Duplication Missing data Content error Compilations Input data error Quarterly Gross Domestic Product Compilation error (GDP) Data processing error Annual GDP Modelling error Balancing error Revision error
in place to reduce the risk. Inherent risk is the likelihood of such an error in the absence of current efforts toward risk mitigation. In other words, inherent risk reflects the risk of error from the error source if the efforts that maintain the current, residual level of error were to be suspended. As an example, a product may have very little risk of nonresponse bias as a result of current efforts to maintain high response rates and ensure representativity in the achieved sample. Therefore, its residual risk is considered to be low. However, should all of these efforts be eliminated, nonresponse bias could then have an important impact on the TSE and the risk to data quality would be high. As a result, the inherent risk is considered to be high although the current, residual risk is low. Thus, inherent risk reflects the effort required to maintain residual risk at its current level. Consequently, residual risk can change over time depending upon changes in activities of the product to mitigate error risks or when those activities no longer mitigate risk in the same way due to changes in inherent
risks. However, inherent risks typically do not change all else being equal. Changes in the survey taking environment that alter the potential for error in the absence of risk mitigation can alter inherent risks, but such environmental changes occur infrequently. For example, the residual risk of nonresponse bias may be reduced if response rates for a survey increase substantially with no change in inherent risk. However, the inherent risk may increase if the target population is becoming increasingly unavailable or uncooperative, even if response rates remain the same due to additional efforts made to maintain them. Inherent risk is an important component of a product’s overall score because it determines the weight attributed to an error source in computing a product’s average rating. Residual risk is downplayed in the evaluation. Originally, its primary purpose was to clarify the meaning and facilitate the assessment of inherent risk. However, it was found that change in residual risk is an indicator of the success of risk mitigation efforts and ASPIRE has begun to consider such changes in the improvement ratings.
Assessing Quality ASPIRE involves the rating of quality efforts for the products according to the following quality criteria: •• knowledge (with producers of statistics) of the risks affecting data quality for each error source; •• communication of these risks to the users and suppliers of data and information; •• available expertise to deal with these risks (in areas such as methodology, measurement, or IT); •• compliance with appropriate standards and best practices relevant to the given error source; and •• plans and achievements for mitigating the risks.
Explicit guidelines have been developed for each criterion to aid the assessment of current quality, an assessment that is facilitated by the use of checklists. Based upon these guidelines and checklists, a rating is assigned
to each quality criterion for each applicable error source. A product's error-level score is then simply the sum of its ratings (on a scale of 1 to 10) for an error source across the five criteria listed under 'Assessing Quality' above, divided by the highest score attainable (which is 50 for most products) and then expressed as a percentage. A product's overall score, also expressed as a percentage, is computed by the following formula:

Overall Score = [ Σ over all error sources of (error-level score × error source weight) ] / [ 50 × Σ over all error sources of (weight) ]
where the ‘weight’ is either 1, 2, or 3 corresponding to an error source’s risk; i.e., low, medium, or high, respectively.
Findings Based on the Assessments Table 10.3 provides a typical ASPIRE 'score card' for the Labour Force Survey. A number of ratings are 'Very Good' and one error source (sampling error) earned 'Excellent' ratings for two criteria: Communication and Compliance with Standards and Best Practices. Two error sources have been identified as high risk within this product: nonresponse error and measurement error. It is quite evident in the score card that work on measurement error is progressing. The staff's knowledge, communication, and expertise have improved between the second and third applications of ASPIRE (i.e., Rounds 2 and 3) due to an excellent study of measurement error in the LFS. Biemer et al. (2014) provide many more details of the ASPIRE process. As Biemer et al. (2014) explain, ASPIRE is comprehensive in that it covers all the important error sources of a product and examines criteria that address the highest risks to product quality. It also identifies areas where improvements are needed, ranked in terms of priority among competing risk areas: e.g.,
high-risk areas with lower ratings should be prioritized.
Quality Profiles A quality profile is a report that provides a comprehensive picture of the quality of a survey, addressing each potential source of error. The quality profile is characterized by a review and synthesis of all the information on a survey that has accumulated over the years the survey has been conducted. The goals of the survey quality profile are: •• to describe in some detail the survey design, estimation, and data collection procedures for the survey; •• to provide a comprehensive summary of what is known for the survey for all sources of error – both sampling as well as nonsampling error; •• to identify areas of the survey process where knowledge about nonsampling errors is deficient; •• to recommend areas of the survey process for improvements to reduce survey error; and •• to suggest areas where further evaluation and methodological research are needed in order to extend and enhance knowledge of the total mean squared error of key survey estimates and data series.
The quality profile also allows external evaluators to learn and understand the important work that has preceded the evaluation process, thus leading them to more informed and more accurate assessments. In that regard, the quality profile is an important component of the ASPIRE process. The quality profile is supplemental to the regular survey documentation and should be based on information that is available in many different forms, such as survey methodology reports, user manuals on how to use microdata files, and technical reports providing details on specific aspects of the survey. A continuing survey allows accumulation of this type of information over time and, hence, quality profiles are almost always restricted to continuing surveys.
Table 10.3 Quality evaluation for the Labour Force Survey (LFS): Accuracy (control for error sources)

Error source             Average score round 2   Average score round 3   Risk to data quality
Specification error      58                       70                      L
Frame error              70                       58                      L
Non-response error       52                       52                      H
Measurement error        56                       68                      H
Data processing error    62                       62                      M
Sampling error           78                       80                      M
Model/estimation error   60                       64                      M
Revision error           N/A                      N/A                     N/A
Total score              60.9                     64.3

Note: In the full score card each error source is also rated on the five quality criteria (knowledge of risks, communication, available expertise, compliance with standards and best practices, and plans or achievements towards mitigation of risks) using the scale Poor, Fair, Good, Very good, Excellent; risk levels are Low (L), Medium (M), High (H); and improvements and deteriorations since round 2 are flagged.

In the US, quality profiles have been developed for the Current Population
Survey (Brooks and Bailar, 1978), the Survey of Income and Program Participation (Jabine et al. 1990), US Schools and Staffing Survey (SASS; Kalton et al. 2000), American Housing Survey (AHS; Chakrabarty and Torres 1996), and the US Residential Energy Consumption Survey (RECS; US Energy Information Administration 1996). Kasprzyk and Kalton (2001) review the use of quality profiles in US statistical agencies and discuss their strengths and weaknesses for survey improvement and quality declaration purposes. Quality profiles are being promoted in some European countries under the rubric ‘quality declarations’ (Biemer et al. 2014).
SUMMARY The TSE paradigm is a useful framework for the design, implementation, evaluation, and analysis of survey data. Surveys that adopt TSE principles attempt to minimize total survey error in the final estimates without increasing costs. This is extremely important at a time when survey organizations are being asked to mount surveys of greater complexity while maintaining high quality for the lowest possible costs. The TSE concept has changed our way of thinking about all aspects of survey work, from design to data analysis. For survey design, it provides a conceptual framework for optimizing surveys that is quite useful even if information on the relative magnitudes of the errors is lacking. As an example, knowing that a specified data collection mode is likely to produce biased data may be sufficient motivation to search for a less biasing mode. For survey implementation, the paradigm has led to innovations in actively managing field work using paradata under ATD for minimizing errors from multiple error sources. For survey evaluation, the TSE framework provides a useful taxonomy for evaluating sampling and nonsampling error. For example, the quality profile that is based on this taxonomy is useful for gathering all that is
known about the relevant sources of error for a specific survey and pointing out important gaps in our knowledge of TSE. For example, recent applications of the ASPIRE system, which makes extensive use of quality profiles, suggest that specification errors, measurement errors, and some data processing errors have been understudied at Statistics Sweden and likely at most NSOs (Biemer et al. 2014). Future research on TSE is needed in three critical areas: (1) innovative uses of paradata and metrics for monitoring costs and quality in ATDs, (2) research on highly effective intervention strategies for real-time cost and error reduction, and (3) cost-effective methods for evaluating survey error, particularly error interaction effects such as the effects of nonresponse reduction strategies on measurement error.
RECOMMENDED READINGS Biemer (2010) provides additional background on the TSE paradigm and Adaptive Total Design which will help the reader better understand the ideas in my chapter. Other papers in this same special issue of Public Opinion Quarterly (74 (5)) also provide valuable background on the TSE paradigm and the reader is encouraged to take a look at them. Finally, readers should be aware that Total Survey Error in Practice (to appear) is an edited volume to be published in early 2017 based upon papers from the TSE15 conference that was held in Baltimore. The papers in this book will be particularly relevant to the topics discussed in this chapter.
REFERENCES Biemer, P. (2010). ‘Total Survey Error: Design, Implementation, and Evaluation.’ Public Opinion Quarterly, 74 (5): 817–848. Biemer, P. and Caspar, R. (1994). ‘Continuous Quality Improvement for Survey Operations: Some General Principles and Applications.’ Journal of Official Statistics, 10: 307–326.
Biemer, P. and Lyberg, L. (2003). Introduction to Survey Quality. Hoboken, NJ: John Wiley & Sons. Biemer, P., Trewin, D., Bergdahl, H., and Japec, L. (2014). ‘A System for Managing the Quality of Official Statistics.’ Journal of Official Statistics, 30 (4): 381–415. Breyfogle, F. (2003). Implementing Six Sigma: Smarter Solutions Using Statistical Methods. Hoboken, NJ: John Wiley & Sons. Brooks, C. and Bailar, B. (1978). An Error Profile: Employment as Measured by the Current Population Survey. Statistical Working Paper 3. Washington, DC: US Office for Management and Budget. Chakrabarty, R.P. and Torres, G. (1996). American Housing Survey: A Quality Profile. Washington, DC: US Department of Housing and Urban Development and US Department of Commerce. Cochran, W.G. (1977). Sampling Techniques, 3rd edn. New York: John Wiley & Sons. Couper, M. (2008). Designing Effective Web Surveys. New York: Cambridge University Press. Dillman, D., Smyth, J., and Christian, L. (2009). Internet, Mail and Mixed-Mode Surveys: The Tailored Design Method, 3rd edn. Hoboken, NJ: John Wiley & Sons. Eltinge, J., Biemer, P., and Holmberg, A. (2013). ‘A Potential Framework for Integration of Architecture and Methodology to Improve Statistical Production Systems.’ Journal of Official Statistics, 29 (1): 125–145. Eurostat (2009). ‘Regulation (EC) No 223/2009 of the European Parliament and of the Council of 11 March 2009,’ Eurostat General/Standard report, Luxembourg, April 4–5, http://eur-lex.europa.eu/legal-content/EN/ ALL/?uri=CELEX:32009R0223 (accessed June 18, 2014). Fowler, F. and Mangione, T. (1985). The Value of Interviewer Training and Supervision. Final Report to the National Center for Health Services Research, Grant #3-R18-HS04189. Groves, R. and Heeringa, S. (2006). ‘Responsive Design for Household Surveys: Tools for Actively Controlling Survey Errors and Costs.’ Journal of the Royal Statistical Society, Series A, 169 (3): 439–457. Hansen, M.H. and Hurwitz, W.N. (1946). ‘The Problem of Non-Response in Sample
Surveys.’ Journal of the American Statistical Association, 41: 517–529. Imai, M. (1986). Kaisen: the Key to Japan’s Competitive Success. New York, NY: McGraw-Hill Education. Jabine, T., King, K., and Petroni, R. (1990). Quality Profile for the Survey of Income and Program Participation (SIPP). Washington, DC: US Bureau of the Census. Kalton, G., Winglee, M., Krawchuk, S., and Levine, D. (2000). Quality Profile for SASS: Rounds 1–3: 1987–1995. Washington, DC: US Department of Education, National Center for Education Statistics (NCES 2000–308). Kaminska, O., McCutcheon, A. L., and Billiet, J. (2010). Satisficing among reluctant respondents in a cross-national context. Public Opinion Quarterly, 74(5), 956–984. Kasprzyk, D. and Kalton, G. (2001). ‘Quality Profiles in U.S. Statistical Agencies.’ Paper presented at the International Conference on Quality in Official Statistics, Stockholm, Sweden. Kreuter, F., Müller, G., and Trappmann, M. (2010). ‘Nonresponse and Measurement Error in Employment Research.’ Public Opinion Quarterly, 74 (5): 880–906. Krosnick, J. and Alwin, D. (1987). ‘An Evaluation of a Cognitive Theory of Response Order Effects in Survey Measurement.’ Public Opinion Quarterly, 51: 201–219. Laflamme, F., Pasture, T., Talon, J., Maydan, M., and Miller, A. (2008). ‘Using Paradata to Actively Manage Data Collection,’ in Proceedings of the ASA Survey Methods Research Section, Denver, Colorado. Lyberg, L. (1985). ‘Quality Control Procedures at Statistics Sweden.’ Communications in Statistics – Theory and Methods, 14 (11): 2705–2751. Montgomery, D. (2009). Introduction to Statistical Quality Control, 6th edn. Hoboken, NJ: John Wiley & Sons. Morganstein, D. and Marker, D. (1997). ‘Continuous Quality Improvement in Statistical Agencies,’ in Survey Measurement and Process Quality. eds. Lars E. Lyberg, Paul Biemer, Martin Collins, Edith D. de Leeuw, Cathryn Dippo, Norbert Schwarz, and Dennis Trewin, pp. 475–500. New York: John Wiley & Sons.
Office of Management and Budget (2006). Questions and Answers When Designing Surveys for Information Collections. Washington, DC: Office of Information and Regulatory Affairs, OMB. http://www.whitehouse. gov/omb/inforeg/pmc_survey_guidance_2006.pdf [accessed on 14 June 2016]. Peytchev, A. and Neely, B. (2013). ‘RDD Telephone Surveys: Toward a Single Frame Cell Phone Design.’ Public Opinion Quarterly, 77 (1): 283–304. Peytchev, A., Peytcheva, E., and Groves, R. M. (2010). ‘Measurement Error, Unit Nonresponse, and Self-reports of Abortion Experiences.’ Public Opinion Quarterly, 74 (2): 319–327. Potter, F. (1990). ‘A Study of Procedures to Identify and Trim Extreme Survey Weights,’ in Proceedings of the American Statistical Association, Survey Research Methods Section, Anaheim, CA. Schouten, B., Calinescu, M., and Luiten, A. (2013). ‘Optimizing Quality of Response through Adaptive Survey Designs.’ Survey Methodology, 39 (1): 29–58. Singer, E. and Kulka, R. (2002). ‘Paying Respondents for Survey Participation,’ in Improving Measurement of Low-Income Populations. Washington, DC: National Research Council, National Academies Press.
Singh, A., Iannacchione, V., and Dever, J. (2003). ‘Efficient Estimation for Surveys with Nonresponse Followup Using Dual-Frame Calibration,’ in Proceedings of the American Statistical Association, Section on Survey Research Methods, pp. 3919–3930. Statistics Canada (2002). Statistics Canada’s Quality Assurance Framework – 2002. Catalogue no. 12-586-XIE. Ottawa, Ontario: Statistics Canada. US Department of Labor, Bureau of Labor Statistics and US Department of Commerce, Bureau of the Census (2002). Current Population Survey: Design and Methodology. Technical Paper 63RV. Washington, DC: US Department of Labor, Bureau of Labor Statistics and US Department of Commerce, Bureau of the Census. http://www.census. gov/prod/2002pubs/tp63rv.pdf (accessed June 2014). US Energy Information Administration (1996). Residential Energy Consumption Survey Quality Profile. Washington, DC: US Department of Energy. Wagner, J. (2008). ‘Adaptive Survey Design to Reduce Nonresponse Bias.’ Unpublished PhD Dissertation, University of Michigan, Ann Arbor, MI.
11 Survey Mode or Survey Modes? Edith de Leeuw and Nejc Berzelak
FROM SINGLE-MODE TO MIXED-MODE The single-mode paradigm, which implies that one data collection method fits all respondents equally well, no longer applies in the twenty-first century. In the first half of the twentieth century, the gold standard and main survey mode was the face-to-face interview. Although the high costs are challenging the applicability of face-to-face interviews today, the face-to-face interview is still seen as 'the queen of data collection' and is used as the preferred reference mode for the comparison and evaluation of alternative data collection methods. In addition to the early face-to-face interviews, postal (paper mail) surveys were used, but they were seen as a fall-back method for researchers with very limited budgets, and it took decades and the pioneering work of Dillman (1978) before postal surveys were accepted as a respectable and valid data collection mode. Recently, paper mail surveys
experienced a revival in the USA with the introduction of the Postal Delivery Sequence File (PDSF). This is an official list, provided by the US Postal Service, with address information for all households to whom postal mail is delivered. Similar lists have been available in several European countries for years (e.g., Dutch telecom postal delivery list). These lists provide an excellent sampling frame for household probability samples of a country's population, and consequently postal surveys now have a good coverage of the general population. Furthermore, mail surveys reach higher response rates than comparable cross-sectional telephone and web surveys (Lozar Manfreda et al., 2008; Dillman et al., 2014). Telephone surveys took over in the 1980s and soon became the main data collection method in market and opinion research both in the USA and Europe. High telephone penetration in many countries, coupled with random digit dialing, made it possible to survey the general population. Furthermore, computer-assisted interviewer-administration
from a central location allowed for complex questionnaires at much lower costs than a comparable face-to-face survey (e.g., Nathan, 2001). At present, growing noncoverage and nonresponse pose a serious threat to the validity of telephone surveys. Regarding coverage problems: landline telephone subscriptions are declining and mobile (cell) phone-only households are increasing at a vast pace both in the USA and all over Europe. In addition, there is a 'mobile divide' on key demographics, like age, gender, and education, with cell phone-only owners being younger, more often male, and having either a high or a low education (e.g., Mohorko et al., 2013a). Furthermore, telephone response rates are now approaching single digits and, even with increased fieldwork efforts and inclusion of mobile phone numbers in the sampling frame, response rates are at a historical low (e.g., Pew Research Center, 2012). The beginning of the twenty-first century marked the rise of a new data collection method, web surveys. The use of the Internet as a data collection medium is very cost-effective and makes surveying very large groups affordable. However, Internet or web surveys offer additional advantages, such as the potential for using complex questionnaires and having a quick turn-around time. Internet surveys come in many varieties. Usually they are based on nonprobability samples, but if a good sampling frame is available or a different contact mode is used to reach potential respondents, probability-based sampling methods can be used too. The flexibility of modern programming and Internet technology makes Internet surveys very versatile and much depends on the design and implementation; web surveys can be designed to resemble a paper mail survey, while it is also possible to emulate an interviewer-administered approach (Couper, 2011). A problem with Internet surveys is undercoverage and the resulting digital divide due to different rates of Internet access among different demographic groups, such
as, an unequal distribution regarding age and education for those with and those without Internet access. For an overview of Internet access across Europe and undercoverage, see Mohorko et al. (2013b); for the USA see, for instance, Smyth and Pearson (2011) and Dillman et al. (2014). Closely following the introduction of web surveys, Internet panels emerged. Usually, these panels are nonprobability opt-in panels (AAPOR, 2010), but scientific probability-based Internet panels do exist. A good example is the Dutch LISS-panel, which was established in 2007; it is based on a probability sample of Dutch households drawn from the population registry by Statistics Netherlands. To avoid undercoverage, non-Internet households were provided with a computer and an Internet connection (Scherpenzeel and Das, 2011). Similar panels now are the Gfk-Knowledge networks panel in the USA, the GESIS Panel in Germany as well as ELIPSS in France. In sum: survey modes differ in their abilities to reduce different survey errors (for an overview, see de Leeuw, 2008). Face-to-face interviews are the most versatile with respect to coverage, interviewer assistance, and questionnaire complexity, but need a huge investment in time, personnel, and money. Also, response rates have been dropping steadily over the years; part of the problem is increased non-contact, part of the problem is an increased unwillingness to participate (de Leeuw and de Heer, 2002). Telephone surveys are less costly and share the flexibility of face-to-face interviews regarding interviewer assistance. They are also faster to complete and offer an increased timeliness, but are more limited in possibilities of question presentation format, as the telephone mode is auditory only. Severe challenges for the telephone mode are a reduced coverage when standard RDD (landline phones) is used, due to an increase of mobile phone-only households and very low response rates. Internet surveys and web panels offer great flexibility regarding question format and answer
options (e.g., Couper, 2008), have low costs and data are available very quickly. However, they suffer from coverage and (non)response problems. Paper mail surveys have the potential of reaching all intended respondents, but are more costly than web surveys and it takes longer to get the data in. Although response rates in paper mail surveys are higher than in web and telephone surveys (e.g., Dillman et al., 2014), there is a tendency for younger people, those in transition, and those less literate to respond less. Although single-mode surveys have their associated problems, they will still be used in the future, especially in special cases (cf. Stern et al., 2014). Prime examples are health surveys or surveys of special populations, like the elderly, where trained face-to-face interviewers also collect additional information (e.g., data from health tests). However, when taking into account the growing coverage problems associated with individual single survey methods, the increase in nonresponse, and the rising costs of interviews, a carefully planned mixed-mode design seems to be the way to go for general quality surveys (de Leeuw, 2005). Or in the words of Blyth (2008): 'Mixed-mode is the only fitness regime'!
TYPES OF MIXED-MODE DESIGNS Using multiple modes in a survey is not new and survey practitioners have used different combinations of communication strategies with respondents for some time (Couper, 2011; de Leeuw, 2005). Main reasons to consider mixed-mode approaches are to reduce coverage and nonresponse error, improve timeliness, and reduce cost. Dillman and Tarnai (1988: 510) trace back mixed-mode surveys to the early 1960s. Early applications of mixed-mode designs include mail surveys with a telephone follow-up to reduce nonresponse, and face-to-face and telephone mixes to achieve high response rates at affordable
costs. The onset of web surveys brought a renewed interest in mixed-mode designs and their potential to reduce coverage and nonresponse error associated with online surveys (de Leeuw and Hox, 2011), while still benefiting from lower data collection costs. Two main approaches can be discerned in mixed-mode designs. The first uses different modes of communication to contact potential respondents (e.g., an advance notification letter announcing an interview): a contact phase mode change. The second uses multiple modes for the actual collection of survey data (de Leeuw, 2005; de Leeuw et al., 2008), thereby involving mode changes in the response phase. Both strategies are discussed below.
Mixing Contact Strategies: Contact Phase Mode Change Paper Advance Letters in Interviews and Web Surveys Contacting respondents in a different mode than the one used for actual data collection is not new. Paper advance letters have been used in both face-to-face and telephone surveys for a long time. Advance letters have many functions; they underscore the legitimacy of the surveys, may take away initial suspicion of respondents, and communicate the importance of the survey. Furthermore, interviewers appreciate advance letters and gain professional confidence from it. Many studies showed that paper advance letters increase response rates; for an overview and meta-analysis, see de Leeuw et al. (2007). When email addresses of the sample are known, for instance, in an online panel, or for special groups of respondents, an email invitation to a web survey is very cost-effective. It also reduces the respondent burden to start the survey as they only have to click on the provided link. However, in many cases email addresses are not known for all sample members and a paper mail invitation to complete a web survey is sent to potential respondents.
Even when email addresses are known for part of the sample, a paper advance letter may be useful to establish legitimacy and trust. This may be followed up by an email invitation with a link provided to those whose email addresses are known. This procedure, coined email augmentation by Dillman, not only serves as a reminder; it also emphasizes that the researcher is taking the respondent seriously and is willing to invest time and effort, and reduces the respondent burden considerably, thereby increasing the response (e.g., Dillman et al., 2014: 23–24). Another important aspect of advance letters is that they can be employed to deliver unconditional (prepaid) incentives in telephone and web surveys. In general, prepaid incentives have a larger positive effect on response than post-paid or contingent incentives (Singer and Ye, 2013). However, when the initial contact mode is by telephone or email, as in a regular telephone or web survey, prepaid incentives are not possible. Dillman and colleagues successfully experimented with sending prepaid incentives for web surveys by postal mail. Paper mail advance letters asking for response over the Internet were sent out, and letters which included a small monetary incentive almost doubled the response. In these cases the paper mail letter not only emphasizes legitimacy and increases trust, but including an incentive also serves as a token of appreciation and encourages respondents to make the extra effort and respond over the web. For an overview, see Dillman et al. (2014: 421–422).
Telephone Calls to Screen Respondents Sometimes special groups of respondents are needed and the sampling frame does not provide the information necessary for addressing these groups directly. In those cases, often brief telephone interviews of randomly selected members of the general population are used to select the intended respondents, who are subsequently asked to participate in a survey through another mode than the
phone. This method of selection is cost-effective and fast. Examples are screening by telephone for the elderly, young parents, single-parent households, or special groups of patients, for a more costly face-to-face health survey. An advance telephone call is also common in business surveys to identify the most knowledgeable or targeted respondent in the establishment and get past 'gatekeepers'.
Inviting and Selecting Respondents for Panel Research To establish a probability-based online panel, it is necessary to select and invite potential panel members through another mode. No sampling frames of email addresses of the general population are available and one has to rely on traditional sampling frames for the general population and their associated contact strategies. This has the consequence that the initial request to become a panel member is done through another mode (e.g., RDD sampling and telephone interviewing, address sampling and face-to-face interviewing). A prime example of the use of another mode to contact and invite respondents for an online panel is the Dutch LISS panel. A probability sample of Dutch households, drawn by Statistics Netherlands, was contacted face-to-face and over the phone by professional interviewers and invited to become panel members. This approach proved to be successful as over 48% of the total gross sample ended up as active panel members (Scherpenzeel and Das, 2011) and similar approaches are now being used in other countries (e.g., Germany, France).
Reminding Respondents to Complete a Survey Sending out reminders to non-responders is a successful way to raise response rates in mail surveys, as numerous studies have shown (e.g., Dillman et al., 2014). Reminders are also used in cross-sectional web surveys and online panels. To be effective, reminders should fit the situation and address
respondents in different ways at different time points. Examples are a special last reminder by telephone or special postal delivery in a mail survey (e.g., de Leeuw and Hox, 1988). Changing contact modes proved very effective in the Nielsen media study, where mail and telephone contacts were used both in the prenotification phase, where a media diary was announced, and in the reminder phase (see de Leeuw, 2005 for more details). Continuing to shout and knock at a closed door obviously did not work for Fred Flintstone and will certainly not work for modern survey methodologists. Thus, when a series of email reminders are sent to nonrespondents in a web survey, there is a severe danger that after five reminders the respondent considers these contacts as unwanted mail and the reminders will disappear in the spam folder. Changing mode of contact will avoid this and an old-fashioned paper mail letter or card therefore has novelty value in this electronic world.
Mixing Methods for Collecting Data: Response Phase Mode Change There are five common mixed-mode designs for collecting survey data. The first two concern cross-sectional studies, in which different persons within one sample are surveyed with different modes. The first is a concurrent mixed-mode design, where respondents are offered two or more modes at the same time and are given a choice. This design has been used to overcome coverage problems, for example by sending a letter to the whole sample offering them a choice between an online survey via a given URL or a telephone survey by calling a toll free number. An example is the asthma awareness study by the American lung association, where school nurses were sent a postcard inviting them to complete the questionnaire either by web or by phone.
Concurrent mixed-mode designs are frequently used in establishment surveys, where a business is asked to provide vital statistics in whatever mode is most convenient for it. The second cross-sectional design is a sequential mixed-mode design, where one mode of data collection is implemented after another. Usually one starts with the least expensive mode (e.g., mail or web) and follows up with more expensive ones. A sequential mixed-mode design is frequently used to improve response rates and reduce nonresponse bias. A good example is the American Community Survey (starts with mail/web, with telephone and face-to-face nonresponse follow-ups). Sequential mixed-mode designs are also commonly used in longitudinal studies. Cost reduction and practical considerations are the main reasons for this design, where one starts with an expensive but flexible mode to guarantee a high response rate for the baseline study and the next waves are conducted with less expensive modes. Examples are the Labour Force Survey in many European countries and the US Current Population Survey. A special case of a longitudinal sequential mixed-mode design is building and maintaining probability-based online panels. As there are no good sampling frames of email addresses of the general population, another sampling frame is used (e.g., addresses, telephone numbers), a probability-based sample is drawn and then approached and recruited for the online panel via an interview mode. An example is the already mentioned Dutch online LISS panel, where a probability sample of Dutch households drawn by Statistics Netherlands was first approached face-to-face.
self-administered mode was unsuitable for specific low-literacy countries and face-to-face interviews were used in those countries. Of course, to make the situation in cross-national surveys even more complicated, a specific country may adopt a mixed-mode approach, while other countries use single-mode approaches, or different countries may adopt different mixed-mode strategies (e.g., one country an Internet-mail strategy, another country face-to-face surveys, and a third country web-telephone). The fifth and last use of mixed-mode is a special concurrent design, where different modes are used for different parts of the questionnaire and all respondents get the same mix. This is a common strategy when one section of a questionnaire contains questions on a sensitive topic and the interviewer switches over to a more private self-administered mode to administer these questions. Examples are handing the respondents a paper questionnaire during a face-to-face interview, changing to (audio)-CASI and handing over the laptop to the respondent in a computer-assisted (CAPI) face-to-face interview, or switching to interactive voice response (IVR) during a telephone survey. In these mixes the optimum method is used for each situation, offering interviewer assistance but also privacy to the respondent when needed. Very often multiple modes in both the contact phase and in the response phase are used in one mixed-mode design to raise response rates. For a concise overview, see de Leeuw (2005).
HOW MODES DIFFER Tourangeau et al. (2013) point out that mode differences reflect the major sources of error, that is, sampling, coverage, nonresponse, and measurement error. In a mixed-mode design, researchers focus on total survey error and combine modes in such a way as to reduce the three major sources (coverage, sampling,
nonresponse). However, this may influence measurement error, and we therefore now turn to how modes may affect measurement. Survey modes differ on two main dimensions: (1) interviewer-administered vs self-administered questionnaires, and (2) information transmission and communication (de Leeuw, 1992, 2005). The absence or presence of an interviewer has consequences for the degree of privacy experienced by the respondent, which may influence self-disclosure and social desirability in answers. However, interviewers also have beneficial influences: they may motivate respondents, explain procedures, and help out a respondent when problems occur during the question–answer process. Furthermore, face-to-face and telephone interview modes differ in the degree of interviewer involvement and contact with the respondent. Interviewers in a face-to-face situation are in direct contact with the respondents and have far more communication tools available than interviewers in a telephone situation. For instance, telephone interviewers lack nonverbal communication to identify potential problems or to motivate respondents (e.g., an encouraging nod or smile to the respondent). A main distinction between modes is how information is transmitted. The first difference is how the questions are presented to the respondent; this can be visually, aurally, or both. Telephone interviews are aural only, while face-to-face interviews may use visual stimuli (i.e., showcards with response options), although they are mainly auditory. When information is only auditory, this demands more of the memory capacity of the respondent, while visual presentation of information relieves the cognitive burden and may lead to fewer response effects. The second difference is how the final answer is conveyed by the respondent. This can be aurally (spoken word) in interviews, written in paper self-administered questionnaires, or electronic (e.g., typing via keyboard, mouse click, touch screen).
Related to this is the way communication is transferred between sender and receiver. There is verbal communication (e.g., words, question text), which can be presented visually or aurally, nonverbal communication (e.g., gestures), which can only be seen, and paralinguistic communication (e.g., tone of voice, emphasis), which is typically auditory. In a face-to-face situation all communication forms can be used and offer tools to convey and check information. In telephone surveys nonverbal communication is lacking and experienced interviewers try to compensate for this by using explicit verbal or paralinguistic cues, such as an explicit 'thank you' or 'hmmm' instead of a nod. A fourth way of communication is through graphical language, which is extremely important in the visual design of self-administered paper and web questionnaires. These modes lack both nonverbal and paralinguistic information and graphical language (e.g., type and size of fonts, italics, shading, colours, arrows and pointers) can be used to compensate for this lack (Dillman et al., 2014: Chapter 6). A final distinction of modes is whether they are computer-assisted or paper and pencil modes. Berzelak (2014) makes an extremely useful distinction between mode inherent factors, context-specific characteristics, and implementation-specific characteristics. Mode inherent factors are given and cannot be negated by survey design; they are stable and do not change across survey implementations, respondents or situations. Examples are interviewer involvement, such as absence of interviewers in self-administered modes, presentation of stimulus and response, such as aural presentation of questions and responses in telephone interviews, and use of computer technology. Context-specific characteristics depend on the social and cultural aspects of the population for which the survey is implemented. Familiarity with the medium used for surveying and its associated expectations play an important role. These may change over time. For instance, telephone calls from unknowns
are now often mistrusted and unwanted, as the rise of 'do-not-call-me' registers in the Western world illustrates. Furthermore, people communicate differently now than twenty years ago, using mobile phones and their facilities (e.g., texting, online browsing) more. This does not mean that one cannot do worthwhile telephone interviews, but to do quality telephone interviews one has to do more to establish trust and one has to adapt the survey design (e.g., Dillman et al., 2014: Chapter 8). Another example of a context-specific characteristic is the technology used by the respondent. Different devices and applications used by respondents in online surveys can introduce differences in experience even if the same input and design is used. Respondents may use different Internet-enabled devices (e.g., PC, laptop, tablet, smartphone) and input methods (e.g., mouse, keyboard, touch screen). Again, this is not a mode inherent factor and researchers are now developing and investigating ways to accommodate this in their online and mobile surveys (e.g., Callegaro, 2013). In sum, contrary to mode inherent factors, context-specific characteristics may change over time and may also be countered by clever survey design. Implementation-specific characteristics depend on the way a mode is implemented in a specific survey. These are characteristics that a researcher can control and exploit to come to an optimal survey design. For example, in paper and web surveys visual design and graphical language can be used to convey extra information in addition to the text, thereby compensating for the lack of interviewer help and intonation (for examples, see Dillman et al., 2014: Chapter 6). Related to implementation-specific characteristics is questionnaire design. One should be aware that the way a questionnaire is designed and implemented may differ over modes. In a single-mode design, specific traditions of how questions are structured and presented exist. These have been adopted to fit the needs of a single mode best and reflect
current best practices. For example, when a respondent has to choose from a long list of potential responses, this does not pose many problems in mail and web surveys, where these are visually presented; in a face-to-face survey, which is mainly aural, visual show-cards are typically used in these cases. Another example is the do-not-know option. In interview surveys, do-not-know is often not explicitly offered, but recorded when spontaneously given by the respondent, while in online surveys, do-not-know is more often explicitly presented as a separate response category. These different traditions of questionnaire design can be seen as an implementation-specific characteristic. When different question formats for the same question are used in different modes, this has the consequence that respondents are presented with a different stimulus in each mode, which may lead to unwanted question wording and mode effects. Therefore, in mixed-mode surveys a unified (uni) mode design, in which a unified stimulus across modes is presented, is preferable (for more details and examples, see Dillman and Edwards, Chapter 18 this Handbook). We come back to this in the final section.
DOES A MIXED-MODE STRATEGY IMPROVE QUALITY? Mixed-mode strategies are recommended to reduce coverage and nonresponse error at reasonable survey costs and with improved timeliness. How well do mixed-mode surveys reach these goals and reduce total survey error?
Response Rates An often-cited success story of how a mixed-mode strategy led to high response rates is the American Community Survey (ACS), where the first phase is by mail or Internet
(including prenotification letters and reminders), and nonrespondents are followed up through computer-assisted telephone interviews. If no telephone number is available or if a household refuses to participate, a subsample is interviewed in person. Response rates are today still above 97%.1 While the ACS is a mandatory survey, a similar sequential mixed-mode approach may also be successful in voluntary surveys. Switching to a second, or even a third, mode in a sequential mixed-mode design has been shown to raise response rates in studies of the general population as well as in studies of different ethnic groups, and in studies of special populations, such as professionals and businesses (for an overview, see de Leeuw et al., 2008: 307). In the most common sequential mixed-mode designs nonrespondents to the first, least expensive, mode are followed up by a more expensive mode, thereby balancing costs and nonresponse. Another approach is the concurrent mixed-mode design, where two or more modes are offered at the same time and potential respondents are given the freedom to choose in which mode they want to complete the survey. Although giving potential respondents a choice has an intuitive appeal, because respondents can decide which method of completing a survey is most suitable to them, empirical evidence shows that this approach does not work and even carries the danger of lowering response rates. In a meta-analysis of 19 experimental comparisons of concurrent mixed-mode choice designs involving web and paper mail options, Medway and Fulton (2012) showed that a concurrent web–mail mix resulted in lower response rates than mail-only surveys. Offering a choice may seem respondent-friendly, but actually it increases the burden for the respondent, who now has to make two decisions instead of one. From 'will I participate?' the respondent now has to decide on 'will I participate' plus 'what survey method do I want to use in participating'. Confronted with this harder task, the easiest way out is of
course to postpone and not do anything at all. Furthermore, being confronted with a choice, a respondent may concentrate on the choice dilemma, instead of on the survey, the survey topic, and its potential usefulness. In other words, offering a choice will distract from the researcher's carefully drafted positive arguments on why to respond, and may weaken the saliency of the survey. In sum: sequential mixed-mode strategies improve response rates. Do not give potential respondents a choice, but rather offer one mode at a time. From a cost perspective, it pays to start with the most cost-effective method and reserve more expensive approaches for nonresponse follow-ups.
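The cost logic behind this advice can be made concrete with a small calculation. The sketch below uses entirely invented response rates and per-case costs for a web-first design with mail and face-to-face follow-ups; the point is only that approaching nonrespondents with progressively more expensive modes yields more completes at a far lower cost per completed interview than fielding the most expensive mode alone.

# Hypothetical cost comparison of a sequential web -> mail -> face-to-face
# design against a single-mode face-to-face survey. All response rates and
# unit costs are invented for illustration.

SAMPLE = 10_000

# (mode, response rate among cases approached in this phase, cost per approached case)
sequential_phases = [
    ("web",          0.25,  2.0),
    ("mail",         0.20, 10.0),
    ("face-to-face", 0.40, 60.0),
]

def sequential_design(sample, phases):
    remaining, completes, cost = sample, 0.0, 0.0
    for _mode, rate, unit_cost in phases:
        cost += remaining * unit_cost       # every remaining case is approached
        new_completes = remaining * rate
        completes += new_completes
        remaining -= new_completes          # only nonrespondents move to the next mode
    return completes, cost

completes, cost = sequential_design(SAMPLE, sequential_phases)
print(round(completes), round(cost / completes, 2))  # 6400 completes at about 71 per complete
print(SAMPLE * 0.40, round(60.0 / 0.40, 2))          # single-mode face-to-face: 4000 completes at 150 each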
Coverage and Representativity There is a limited number of studies that investigated whether mixed-mode strategies do improve the coverage of the intended population and especially the representativity of the realized sample. A study of Dutch immigrants (Kappelhof, 2015) found that a sequential mixed-mode design did improve response rates and that different modes brought in different types of respondents: for instance, more young and second-generation immigrants responded through the web, and more elderly and first-generation immigrants in interview modes. Messer and Dillman (2011) investigated the effectiveness of a sequential web–paper mail survey for the general population in the USA. Using paper advance letters to mail a web request to the sampled addresses and using a paper mail survey as sequential follow-up, they obtained responses from nearly half of the households. Furthermore, they showed that respondents to the web and respondents to the mail survey are demographically dissimilar, thereby underscoring the importance of a sequential mixed-mode approach both for improving response rates and representativity. Similarly, Bandilla et al. (2014) found that including a mail survey for those who did not have Internet access and as a
follow-up for web nonrespondents, not only increased response rates, but also improved representativity on demographic and attitudinal variables in Germany. Finally, Klausch et al. (2016) found that a face-to-face follow-up raised the overall representativeness of mail, telephone, and web surveys to the level of a single-mode face-to-face survey, but not beyond.
Data Quality and Measurement Error Mode Differences and Measurement Error There is a long history of mode effects studies (e.g., Groves, 1989; de Leeuw, 1992; de Leeuw, 2010; Tourangeau et al., 2013). Past research has shown that there is a dichotomy between modes with interviewers, such as face-to-face and telephone interviews, and without interviewers, such as mail and Internet surveys. Self-administered forms perform better when more sensitive questions are asked and both paper mail surveys and web surveys result in more open and less socially desirable answers than interview surveys. For a narrative overview of both European and US studies see de Leeuw and Hox (2011); for a meta-analysis of US studies, see Tourangeau et al. (2013: 140–142). On the other hand, interview surveys perform better in motivating respondents to answer as is reflected in response rates, missing data, and answers to open questions (e.g., meta-analysis, de Leeuw, 1992: Chapter 3). These differences in data quality may influence the overall measurement error in a mixed-mode design. For example, if nonrespondents in a self-administered survey on friendship and loneliness receive a follow-up via an interview, lower reported loneliness among the interview group may arise because early respondents are really lonelier, because of social desirability in the follow-up interview, where respondents do not want to admit feeling lonely to an interviewer, or
even because talking to an interviewer temporarily modulates the feelings of loneliness. If only the first is the case, different modes bring in different types of respondents, and as there is no difference in measurement error between the groups, there is no problem at all. The differences in selection effects are desired and are indeed one of the main reasons for using mixed-mode designs in nonresponse follow-ups. They restore the balance and improve representativity of the sample. If the latter two are the case, that is, if respondents in different modes give different answers because of the mode they are surveyed in, then the modes differ in measurement error. In other words, mode differences reflect desired differences in coverage and nonresponse as well as undesired differences in measurement error; only the potential differences in measurement errors are a cause for concern when using mixed-mode designs.
Common Mixed-mode Designs and the Potential for Measurement Error If mixed-mode strategies are limited to the contact phase, either prenotification, screening and selection, or reminders, and one single mode is used for the response phase in data collection ('Mixing contact strategies' above), then there is no danger of differential measurement error. A design with different modes in the contact phase has only advantages as nonresponse and coverage error are reduced and data integrity is not threatened (de Leeuw, 2005). If mixed-mode strategies are also used for the response phase and data are collected through more than one mode, the situation is more complicated. There is only one clear win–win situation, that is, when a self-administered mode is used for all respondents within an interview for a specific part of the questionnaire that contains sensitive topics, in such a way that responses to sensitive questions are unknown to the interviewer. By changing the context and thereby allowing the respondent more privacy when needed,
optimal data quality is ensured. In all other cases, both sequential and concurrent mixed-mode designs may improve response and representativity, but at the risk of increased measurement error. For an overview, see de Leeuw (2005: Fig. 1).
EFFECTIVE USE OF MIXED-MODE STRATEGIES

How to Mix

To mix is to design. In designing a mixed-mode survey one should carefully assess the potential effects of mode-specific measurement error and try to minimize these differences as far as possible. Tourangeau (2013) advises that when an overall population estimate is needed one should try to minimize measurement error by implementing each mode as optimally as possible, even at the cost of mode differences. This approach could result in different question formats being used in different modes and should preferably be done for factual questions only, as attitudinal questions are often more susceptible to question format effects. However, when the goal of the study is to make comparisons across groups, one should try to prevent mode and measurement effects by design; for instance, by following a unified mode design to produce equivalent questionnaires. This is especially important when one uses attitudinal questions and when one wants to compare groups. Examples are cross-national studies where different modes are used in different countries, and multi-site studies in health research, where different hospitals use different data collection modes or nonresponse follow-ups in different modes. Also in simple cross-sectional mixed-mode studies, subgroups are often compared (e.g., different educational groups, different age groups), and if certain subgroups are over-represented in a specific mode (e.g., younger respondents online), non-equivalent questionnaires
implemented over the modes may threaten a valid comparison.
Unified (Uni) Mode Design

In a unified design for mixed-mode surveying the goal is to present a unified stimulus across modes. The stimulus presented to the respondent is more than the question text alone and also includes response categories and instructions. Dillman and Edwards (this Handbook, Chapter 18) point out that successful construction of unified (uni) mode questionnaires requires thinking about three different tasks: question structure, question wording, and visual versus aural presentation of questions. When modes are very disparate, it is harder to create equivalent questionnaires than when modes are more similar. For instance, web and paper self-administered questionnaires both use visual presentation (instructions, questions, and response categories are all read by the respondent). Furthermore, graphical language (fonts, colours, etc.) can be used to enhance communication, and the respondent, who has locus of control over the questionnaire, may determine the time and place of responding. Dillman and Edwards (this Handbook, Chapter 18) and Dillman et al. (2014: Chapter 11) give some excellent examples and guidelines for the construction of equivalent questionnaires. When self-administered questionnaires and interviews are mixed, the situation is more complicated: self-administered forms lack the interviewer’s help and interviews often lack the possibility of visual presentation of stimuli. In face-to-face interviews some limited use of the visual channel for the presentation of questions can be made (e.g., use of showcards), but in telephone interviews this is not possible. These differences may influence question design choices. For instance, should one offer an explicit do-not-know option for online self-administered questions or not? Recent research (e.g., de Leeuw et al., 2015)
shows that explicitly offering do-not-know, increases the selection of this nonsubstantive answer, resulting in more item missing data. In mixed-mode studies it is best to follow the custom of interviews and not offer an explicit do-not-know answer. When the mix includes an online survey, one could use the potential of computer-assisted data collection and successfully program in a polite probe, just as in the CAPI or CATI interview. Another example of overcoming a major difference between interview and online surveys and successfully designing for mixed-mode is given by Berzelak (2014). In interviews the questions are posed sequentially one-by-one, while in online questionnaires they are often presented together in a grid (matrix) format. Grouping questions together may help respondents to quickly grasp that all questions have the same response categories, thereby relieving the response burden, but major disadvantages of grid questions are that respondents often use satisficing behaviour such as straightlining or non-differentiation, instead of paying attention to each question separately. Presenting questions one at a time on the screen could counteract these negative effects and resemble the interview process more. However, a standard one-by-one presentation of questions on the screen would add to the response burden (e.g., a respondent has to press the next-button all the time) and to the overall duration of the interview which may increase break-offs. A new online question format, the auto-advance question, presents questions one-by-one, but using auto-advance procedures the next question slides in automatically after an answer has been given, thereby not adding to the respondent burden and mimicking the interview situation. The format of an auto-advance carousel question is (1) question body, (2) response categories, (3) navigation/overview bar. When an answer is selected by the respondent, after a couple of seconds the response categories are cleared and the text of the next question appears. The general question text, instructions, and the list
of response categories stay in place, and the navigation bar provides information where the respondent is in the series of questions and allows the respondent to go back when desired. An example is given in Figure 11.1. When compared with traditional grid questions the auto-advance or carousel question performed better in an online survey. A series of experiments showed that this new format resulted in less straightlining and was more positively evaluated by respondents than a standard grid question in an online survey (Roberts et al., 2013). Furthermore, in an experimental mode comparison Berzelak (2014) shows that when using this question type mode differences (e.g., in satisficing) between online and interview modes are low.
Estimation and Adjustment

Prevention, through careful design, of measurement effects in mixed-mode surveys is of course extremely important. However, even after carefully designing a mixed-mode study and paying attention to questionnaire design and implementation strategies, it is possible that differences in measurement errors between modes do exist. To cope with this, two steps are needed: (1) estimate mode differences in measurement, and (2) adjust for these. To be able to do this, a researcher needs auxiliary information (de Leeuw,
2005). This could be information from a reference sample (e.g., Vannieuwenhuyze, 2013), from a random subsample, or through experimental manipulation and special designs (e.g., Jaeckle et al., 2010; Klausch, 2014). When estimating and adjusting for mode differences, it is important to distinguish between wanted selection effects and unwanted measurement effects. Remember that one of the main reasons for doing mixed-mode designs is to reduce coverage and nonresponse error, and that the researcher wants to reach different parts of the population with different modes to achieve this; these are the wanted selection effects. However, to compare different groups, one assumes that both modes measure with the same reliability and validity; in other words, one assumes that there are no differences in measurement between the modes. Statistical adjustment of measurement differences in mixed-mode surveys is still in a developmental phase and needs further research. A good introduction is given by Kolenikov and Kennedy (2014) and by Hox et al. (2016).
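As a minimal illustration of these two steps, the sketch below (in Python) simulates a simple mixed-mode data set, estimates an additive mode difference in measurement by regressing the outcome on a mode indicator plus auxiliary covariates, and then subtracts that estimate from the web responses to place them on the face-to-face measurement scale. All variable names and values are hypothetical, and the approach rests on the strong assumptions that the covariates fully capture selection into mode and that the measurement effect is a constant shift; it is a teaching sketch, not the specific procedures evaluated by Kolenikov and Kennedy (2014) or Hox et al. (2016).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1000

# Hypothetical mixed-mode data: a face-to-face group and a web group.
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "educ": rng.integers(8, 20, n),
    "mode": rng.choice(["f2f", "web"], n),
})
df["web"] = (df["mode"] == "web").astype(int)
# Synthetic outcome with a built-in additive web measurement effect of -0.5.
df["y"] = (3 + 0.01 * df["age"] + 0.05 * df["educ"]
           - 0.5 * df["web"] + rng.normal(0, 1, n))

# Step 1: estimate the mode difference conditional on covariates that are
# assumed (a strong assumption) to capture selection into mode.
fit = smf.ols("y ~ web + age + educ", data=df).fit()
mode_effect = fit.params["web"]

# Step 2: adjust web responses onto the face-to-face measurement scale.
df.loc[df["web"] == 1, "y"] -= mode_effect
print(f"Estimated web vs face-to-face measurement difference: {mode_effect:.2f}")

In practice, the auxiliary information would come from a reference sample, a random subsample, or an experimental design, as noted above, and the assumed constant additive effect would itself need to be tested.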
ACKNOWLEDGEMENTS

The authors thank Don Dillman for his stimulating support and comments.
Figure 11.1 Example of auto-advance or carousel question format.
NOTE

1 http://www.census.gov/acs/www/methodology/response_rates_data/ [accessed on 14 June 2016]
RECOMMENDED READINGS

Couper (2011) provides a good overview of survey modes in terms of their development, key characteristics, and sources of measurement differences between them. Dillman et al. (2014) is a comprehensive handbook for data collection practitioners and survey methodologists with detailed recommendations for the implementation of surveys using various modes, including mixed-mode designs. de Leeuw (2005) provides a clear taxonomy and a comprehensive review of mixed-mode data collection methods, including consequences from a total survey error perspective. Finally, de Leeuw et al. (2008) give an informed overview of the theory and practice of mixed-mode surveys, with practical examples and many references.
REFERENCES

AAPOR (2010). AAPOR report on online panels. Public Opinion Quarterly, 74(4), 711–781. Bandilla, W., Couper, M. P., and Kaczmirek, L. (2014). The effectiveness of mailed invitations for web surveys and the representativeness of mixed-mode versus Internet only samples. Survey Practice, 7(4). Retrieved from http://www.surveypractice.org/index.php/SurveyPractice/article/view/274/html_11 [accessed on 14 June 2016]. Berzelak, J. (2014). Mode Effects in Web Surveys (Doctoral dissertation). University of Ljubljana, Slovenia. Retrieved from http://dk.fdv.uni-lj.si/doktorska_dela/pdfs/dr_berzelakjernej.pdf [accessed on 14 June 2016]. Blyth, B. (2008). Mixed-mode: The only fitness regime? International Journal of Market Research, 50(2), 241–266.
Callegaro, M. (2013). Conference note: From mixed-mode to multiple devices: Web surveys, smartphone surveys and apps: Has the respondent gone ahead of us in answering surveys? International Journal of Market Research, 55(2), 107–110. Couper, M. P. (2008). Designing Effective Web Surveys. New York: Cambridge University Press. Couper, M. P. (2011). The future of modes of data collection. Public Opinion Quarterly, 75(5), 889–908. de Leeuw, E. D. (1992). Data Quality in Mail, Telephone, and Face-to-face Surveys. Amsterdam: TT-publikaties. Retrieved from http://edithl.home.xs4all.nl/pubs/disseddl.pdf [accessed on 13 June 2016]. de Leeuw, E. D. (2005). To mix or not to mix data collection modes in surveys. Journal of Official Statistics, 21(2), 233–255. Retrieved from http://www.jos.nu/Articles/abstract.asp?article=212233 [accessed on 14 June 2016]. de Leeuw, E. D. (2008). Choosing the method of data collection. In Edith D. de Leeuw, Joop J. Hox, and Don A. Dillman (eds), International Handbook of Survey Methodology (pp. 113–135). New York: Taylor & Francis. de Leeuw, E. D. (2010). Mixed-mode surveys and the Internet. Survey Practice, 3(6). Retrieved from http://www.surveypractice.org/index.php/SurveyPractice/article/view/150/html [accessed on 14 June 2016]. de Leeuw, E. D., Callegaro, M., Hox, J., Korendijk, E., and Lensvelt-Mulders, G. (2007). The influence of advance letters on response in telephone surveys. Public Opinion Quarterly, 71(3), 413–443. de Leeuw, E. D. and de Heer, W. (2002). Trends in household survey nonresponse: A longitudinal and international comparison. In R. M. Groves, D. A. Dillman, J. L. Eltinge, and R. J. Little (eds), Survey Nonresponse (pp. 41–54). New York: Wiley. de Leeuw, E. D. and Hox, J. (1988). The effects of response stimulating factors on response rates and data quality in mail surveys. A test of Dillman’s Total Design Method. Journal of Official Statistics, 4, 241–249. de Leeuw, E. D. and Hox, J. J. (2011). Internet surveys as part of a mixed-mode design. In M. Das, P. Ester, and L. Kaczmirek (eds),
Social and Behavioral Research and the Internet (pp. 45–76). New York: Taylor & Francis. de Leeuw, E. D., Hox, J., and Boevee, A. (2015). Handling do-not-know answers: Exploring new approaches in online and mixed-mode surveys. Social Science Computer Review, advance online publication. de Leeuw, E. D., Hox, J. J., and Dillman, D. A. (2008). Mixed-mode surveys: When and why. In Edith D. de Leeuw, Joop J. Hox, and Don A. Dillman (eds), International Handbook of Survey Methodology (pp. 229–316). New York: Taylor & Francis. Dillman, D. A. (1978). Mail and Telephone Surveys: The Total Design Method. New York: John Wiley & Sons. Dillman, D. A. and Tarnai, J. (1988). Administrative issues in mixed mode surveys. In R. M. Groves, P. P. Biemer, L. E. Lyberg, J. T. Massey, W. L. Nicholls II, and J. Waksberg (eds), Telephone Survey Methodology (pp. 509–528). New York: Wiley. Dillman, D. A., Smyth, J. D., and Christian, L. M. (2014). Internet, Phone, Mail and Mixed-mode Surveys: The Tailored Design Method (4th edn). Hoboken, NJ: John Wiley & Sons. Groves, R. M. (1989). Survey Errors and Survey Costs. New York: Wiley. Hox, J., de Leeuw, E., and Klausch, T. (2016). Mixed mode research: Issues in design and analysis. Invited presentation at the International Total Survey Error conference, Baltimore 2015, session III, mixed mode surveys. Available at www.tse15.org. (Full chapter to appear in P. Biemer et al. (2016). Total Survey Error in Practice. New York: Wiley.) Jaeckle, A., Roberts, C., and Lynn, P. (2010). Assessing the effect of data collection mode on measurement. International Statistical Review, 78, 3–20. Kappelhof, J. W. S. (2015). Face-to-face or sequential mixed-mode survey among non-western minorities in the Netherlands: The effect of differential survey designs on the possibility of nonresponse bias. Journal of Official Statistics, 31(1), 1–3. Klausch, T. (2014). Informed Design of Mixed-mode Surveys (Doctoral dissertation). Utrecht University, Netherlands. Klausch, T., Hox, J., and Schouten, B. (2016). Selection error in single and mixed mode surveys of the Dutch general population.
Journal of the Royal Statistical Society: Series A, forthcoming. Kolenikov, S. and Kennedy, C. (2014). Evaluating three approaches to statistically adjust for mode effects. Journal of Survey Statistics and Methodology, 2(2), 126–158. Lozar Manfreda, K., Bosnjak, M., Berzelak, J., Haas, I., and Vehovar, V. (2008). Web surveys versus other survey modes: A meta-analysis comparing response rates. International Journal of Market Research, 50(1), 79–104. Medway, R. L. and Fulton, J. (2012). When more gets you less: A meta-analysis of the effect of concurrent web options on mail survey response rates. Public Opinion Quarterly, 76(4), 733–746. Messer, B. J. and Dillman, D. A. (2011). Surveying the general public over the Internet using address-based sampling and mail contact procedures. Public Opinion Quarterly, 75(3), 429–457. Mohorko, A., de Leeuw, E., and Hox, J. (2013a). Coverage bias in European telephone surveys: Developments of landline and mobile phone coverage across countries and over time. Survey Methods: Insights from the Field. Retrieved from http://surveyinsights.org/?p=828 [accessed on 14 June 2016]. Mohorko, A., de Leeuw, E., and Hox, J. (2013b). Internet coverage and coverage bias in Europe: Developments across countries and over time. Journal of Official Statistics, 29(4), 609–622. Retrieved from http://dx.doi.org/10.2478/jos-2013-0042 [accessed on 14 June 2016]. Nathan, G. (2001). Telesurvey methodologies for household surveys – A review and some thoughts for the future? Survey Methodology, 27(1), 7–31. Pew Research Center for the People and the Press (2012). Assessing the representativeness of public opinion research. Methodology report. Washington, DC: Pew Research Center. Retrieved from http://www.people-press.org/2012/05/15/assessing-the-representativeness-of-public-opinion-surveys/ [accessed on 14 June 2016]. Roberts, A., De Leeuw, E. D., Hox, J., Klausch, T., and De Jongh, A. (2013). Leuker kunnen we het wel maken. Online vragenlijst design: standard matrix of scrollmatrix? [In Dutch: Standard grid or auto-advance matrix
format?] In A. E. Bronner, P. Dekker, E. de Leeuw, L. J. Paas, K. de Ruyter, A. Smidts, and J. E. Wieringa, (2013). Ontwikkelingen in het Marktonderzoek 2013. 38eJaarboek van de MOA [In Dutch: Developments in Market Research] Jaarboek 2012. (pp. 133–148). Haarlem: Spaaren-Hout. Scherpenzeel, A. C. and Das, M. (2011). True longitudinal and probability-based Internet panels: Evidence from the Netherlands. In M. Das, P. Ester, and L. Kaczmirek (eds), Social and Behavioral Research and the Internet (pp. 77–104). New York: Taylor & Francis. Singer, E. and Ye, C. (2013). The use of and effects of incentives in surveys. Annals of the American Academy of Political and Social Science, 645(1), 112–141. Retrieved from http://ann.sagepub.com/content/645/1/112 [accessed on 14 June 2016]. Smyth, J. D. and Pearson, J. E. (2011). Internet survey methods: A review of strengths,
weaknesses, and innovations. In M. Das, P. Ester, and L. Kaczmirek (eds), Social and Behavioral Research and the Internet (pp. 11–44). New York: Taylor & Francis. Stern, M. J., Bilgen, I., and Dillman, D. A. (2014). The state of survey methodology: challenges, dilemmas, and new frontiers in the era of the tailored design. Field Methods, 26, 284–301. Tourangeau, R. (2013). Confronting the challenges of household surveys by mixing modes. Keynote address at the 2013 Federal Committee on Statistical Methodology Research Conference, Washington, DC. Tourangeau, R., Conrad, F. G., and Couper, M. P. (2013). The Science of Web Surveys. New York: Oxford University Press. Vannieuwenhuyze, J. (2013). Mixed-mode Data Collection: Basic Concepts and Analysis of Mode Effects (Doctoral dissertation). KU Leuven, Belgium.
12

Surveying in Multicultural and Multinational Contexts

Beth-Ellen Pennell and Kristen Cibelli Hibben
INTRODUCTION

The scope, scale, and number of multicultural and multinational surveys continue to grow as researchers and policy-makers seek data to inform research and policy across numerous substantive domains. This chapter addresses some of the key methodological and operational challenges facing surveys that are specifically designed to achieve comparability across cultures or nations. We begin with a discussion of how multicultural and multinational surveys differ from surveys carried out in a single culture or nation. We define a set of contextual dimensions and discuss how each of these dimensions may pose methodological challenges for survey design and implementation. In light of the many challenges faced, we explore the issues of standardization, coordination, and quality standards for multicultural and multinational research. We also address related topics such as ethical considerations particular to multicultural or multinational surveys, as well as
mixed methods, new technologies, and other promising approaches to survey research in this space.
BACKGROUND

This chapter focuses on comparative surveys that are designed to collect data and compare findings from two or more populations cross-culturally or cross-nationally. The terms multicultural and multinational are used to refer to surveys conducted among multiple cultural groups or in multiple countries or nations, respectively. Surveys planned with comparability as the goal differ from single population surveys in that the questionnaires and other components of the survey design and implementation must be considered more broadly to enable comparison of findings across the included populations. In comparative research, the data must be valid and reliable for the given
cultural or national context as well as be comparable across these contexts. As Harkness et al. (2010a) discuss, comparability should drive design as well as the assessment of data quality in cross-national research. Harkness (2008) argues that, indeed, the pursuit of data quality is simultaneously the pursuit of comparability in cross-national research. As discussed by Harkness (2008) and elsewhere (Hantrais and Mangen, 1999; Øyen, 1990), a notable feature of the literature on multicultural and multinational research is a longstanding debate about whether this type of research truly presents a particular case and should be considered a separate field. Although it has been argued that all social science research is inherently comparative (Jowell, 1998), others contend that the complexity and challenges associated with multicultural and multinational research merit singular attention (Harkness et al., 2003a; Lynn et al., 2006). It is clear that multicultural and multinational survey research methodology has emerged as a recognizable subfield of survey methodology, including professional conferences devoted to the topic – annual meetings of the International Workshop on Comparative Survey Design and Implementation (http://csdiworkshop.org/), the International Conference on Survey Methods in Multicultural, Multinational, and Multiregional Contexts (http://www.3mc2008.de/) and resulting monograph, with another conference and book in this series planned for 2016; an increase in articles focused on cross-cultural and cross-national survey methods in such mainstream journals as the Public Opinion Quarterly, the Journal of Official Statistics, the Journal of Cross-cultural Psychology, as well as several books, e.g., Harkness et al. (2003a, 2010a); graduate courses such as those at the University of Illinois at Chicago (http://www.uic.edu/cuppa/pa/srm/), the Gesis Summer School in Survey Methodology (http://www.gesis.org/en/events/gesis-summer-school/), and the University of Michigan’s Summer Institute in Survey Research Techniques
(http://si.isr.umich.edu/); and finally, online resources such as the Cross-cultural Survey Guidelines (http://www.ccsg.isr.umich.edu/). As more survey methods research is devoted to issues in multicultural, multinational surveys, the unique and complex nature of this work is further elaborated. For example, research on how culture may influence cognition and behavior and how cultural context may influence survey response explores the many sources of potential bias (Schwarz et al., 2010; Uskul and Oyserman, 2006; Uskul et al., 2010). Research on the presence of a third party in the interview setting (in addition to the interviewer and respondent) concludes that this is a complex relationship that varies by context, country, and culture (Mneimneh, 2012). Translation science continues to inform survey research methods and the importance of this step in the survey lifecycle (Harkness et al., 2004, 2010b, 2015). The focus of the following discussion is largely about multinational research, which presents the added complexity of harmonizing survey design and implementation across countries. However, many of the methodological features and challenges described also apply to within-country multicultural research.
SURVEY CONTEXT

In this section, we discuss the role of survey context across cultures and nations. In Table 12.1, we distinguish the dimensions of survey context, which may affect one or more of the phases of the survey lifecycle.
Challenges in multicultural and multinational survey research
Table 12.1 Dimensions of survey context

Social and cultural context: Societal and cultural characteristics including language(s) (e.g., mono-lingual, multilingual, multiple languages, etc.), behavioral and communication norms, education and literacy levels, typical household composition, housing style, crime and security.

Political context: Characteristics of the political situation including the type of political system, views and behavior of the authorities, and the presence of political tension or violence.

Economic conditions and infrastructure: Current economic climate (e.g., widespread poverty or deprivation, substantial wealth or income disparity) and labor market conditions (e.g., the availability of skilled interviewers), the availability and quality of the transportation, communications, and technology infrastructure. Also included are the availability of goods and services (e.g., printing services, paper, computers) and the penetration of communication technologies (i.e., telephones, cell phones, internet access).

Physical environment: The climate, seasonality, weather conditions (extreme heat or cold, storms, monsoons, and natural disasters) and geography (e.g., dispersion of the population, terrain, urbanicity).

Research traditions and experience: The availability, coverage properties and quality of sampling frames, the availability of population registers, the presence and capacity of survey research organizations (e.g., availability of skilled staff, computer-assisted interviewing (CAI) capacity, quality assurance/quality control capacity, etc.), and established methodological practices.
In this section, we discuss key challenges associated with each of the dimensions outlined in Table 12.1 in achieving appropriately harmonized study designs and maximizing quality at the implementation stage. We limit our discussion to three key aspects of surveys in multicultural or multinational contexts: sample design; questionnaire development (including translation and adaptation); and data collection.
Sample Design

Factors stemming from each of the contextual dimensions present challenges for sample design. We discuss the challenges associated with each contextual dimension in turn below.

Social and Cultural Context

Most large societies have linguistic minorities such as immigrant populations, native or indigenous populations, small culturally distinct peoples living within the borders of the country with a different majority culture and language, and tribal or ethnic groups in culturally and linguistically diverse countries (Harkness et al., 2014). This means that in most multicultural and multinational research, data must be collected across multiple languages. However, countries vary widely in the number of languages spoken,
the extent to which members of the population are proficient in the official or dominant language or languages, and the extent to which language groups are included at the sampling stage. For example, countries like Canada have two official languages but have large immigrant communities with varying levels of fluency in the official language(s). Other countries like Switzerland have four official languages but most of the population is fluent in at least one of the official languages. There are many countries that are highly linguistically diverse such as India, Nigeria, Liberia, and Sierra Leone, as a few examples (Harkness et al., 2014). Language barriers or multiple languages can present a barrier to coverage of particular linguistic groups and segments of the population. Some countries may exclude language groups at the sampling stage, thereby introducing non-coverage error. Others may exclude these populations at the data collection stage, thereby introducing nonresponse error (Lepkowski, 2005). Differences in how the members of language groups are handled can result in sample designs with highly divergent properties.
Many contexts will also have significant ‘hidden’ or hard-to-reach populations such as the homeless, street children, sex workers, undocumented guest workers or migratory peoples, which may pose additional coverage considerations and data collection challenges (Harkness et al., 2014).

Political Context

Some countries have laws and regulations that restrict survey practice. As noted by Heath et al. (2005), countries such as North Korea officially prohibit survey research, while others severely restrict data collection on certain topics or allow collection but restrict the publication of results (e.g., Iran). The political context may affect sample design and coverage if areas need to be excluded in a particular country due to violence or unrest. Exclusion or replacement of such areas could contribute to coverage error (see Mneimneh et al. (2014) and Chapter 13 in this Handbook for further discussion and examples). Minority groups may also be excluded from a survey for political reasons. For example, in the aftermath of the Indian Ocean tsunami, official governmental agencies in Indonesia and Sri Lanka restricted researcher access to areas occupied by ‘rebel’ groups. Further, the governments of Indonesia and Sri Lanka instructed research agencies conducting evaluation studies to limit these efforts to ‘natural disaster victims’, thus effectively leaving out ‘conflict victims struck by natural disaster’ (Deshmukh, 2006; Deshmukh and Prasad, 2009; Pennell et al., 2014).

Economic Conditions and Infrastructure

Survey cost structures and the financial resources available to carry out a survey play a key role in determining an optimal sample design. As Heeringa and O’Muircheartaigh (2010) discuss, cost structures vary substantially from one country to another based on the available survey infrastructure (government or private sector survey organizations), the availability and costs for databases or
map materials needed for sample frames, labor rates for field interviewers and supervisors, and transportation costs for interviewers to go to sampled areas. As a result, a key challenge for a multinational survey is to determine minimum precision standards given the diverse economic conditions and available infrastructure among participating countries. The availability and quality of the transportation, communications, and technology infrastructure and the penetration of communication technologies (e.g., telephones, cell phones, internet access) factor greatly into the choice of data collection mode and clearly vary considerably cross-nationally. While cell phone penetration has increased dramatically throughout the world in recent years (reaching an average global penetration rate of 96.8% – meaning that on average, there are 96.8 mobile-cellular telephone subscriptions per 100 citizens throughout the world as of 20151), access to cell phones and reliable ways of sampling them remain a challenge in many parts of the world. The rise in cell phone use has had other effects as well. For example, the United States has seen substantial declines in the use of landline telephones in recent years (Blumberg et al., 2013), greatly limiting the coverage of Random Digit Dialing telephone samples unless supplemented with cell phone samples. Large disparities remain between developed and less developed countries’ access to the internet. As of 2015, 34% of households in developing countries had access to the internet at home.2 As Lynn (2003) notes, standardizing the mode of data collection might force some nations to use relatively poor or inefficient sampling methods. However, standardization of the sampling frame might only be possible if different data collection modes are allowed. Yet, allowing different data collection modes invites potential mode effects. Physical Environment Some countries may exclude geographically remote or sparsely-populated areas from
general population surveys that use face-to-face interviewing. In the UK, the highlands and islands of Scotland north of the Caledonian Canal are usually excluded (Lynn, 2003). In the US, the states of Alaska and Hawaii are often excluded. If these areas are included, securing access to these areas can pose challenges at the data collection phase. Problems with coverage tend to be encountered more frequently in developing countries where the physical environment, existing infrastructure, and resources to reach remote areas are often limited (Lepkowski, 2005). The effects of these types of exclusions on coverage may vary among countries and, as Lynn (2003) suggests, may need to be controlled or limited by the establishment of standards.

Research Traditions and Experience

Heeringa and O’Muircheartaigh (2010) outline three properties of sample designs that affect sampling variance and bias, and coverage properties: 1) the survey population definition; 2) procedures used for sample screening and respondent selection; and 3) the choice of sampling frame. In this section, we discuss how research traditions and local experience related to these aspects of sample design vary across countries, thereby creating challenges at the sample design stage for multinational studies. Differences in research traditions and experience frequently stem from one or many of the underlying conditions in a country’s social, cultural, political, or economic context or its infrastructure and physical environment. These dimensions are discussed further below. Survey population definition. The target population is the collection of elements for which the survey designer seeks to produce survey estimates (Lepkowski, 2005). Established practices often vary across countries with regard to how the target population is defined and the restrictions placed on who is eligible for inclusion from among the target population. This subset of the target population is known as the survey population.
Access to the survey population may be restricted by geography, language, or citizenship. Special populations such as those living in institutions (e.g., on military quarters or group homes), or those institutionalized at the time of the survey (e.g., hospital patients) may also pose varying degrees of access (Heeringa and O’Muircheartaigh, 2010). For example, some countries exclude the very elderly population from general population surveys (e.g., Sweden) or people resident in particular types of institutions such as prisons, hospitals, and military bases (Lynn, 2003). The greater the difference in how the target population is defined, the more problematic will be comparisons across target populations. Therefore, a key challenge for multinational surveys is the standardization of the target population to the extent possible given budgetary and logistical constraints. However, as Lynn (2003) notes, standardizing the segments of the population to be excluded, such as categories of institutions, does not necessarily result in equivalent coverage because the nature of the institutions and the population they account for can vary greatly among countries (e.g., countries with mandatory military service versus those without). Appropriate standardization, therefore, must take local conditions into account in order to achieve equivalent coverage. Sample screening and respondent selection. Sample screening and respondent selection refers to the specific definitions, procedures, and operational rules used to determine which sampled units (i.e., housing units, households, and residents) are eligible for inclusion in the survey. For example, what defines a housing unit? Should temporary residents or guest workers be included? What about adults with cognitive limitations or other impairments who are not able to participate in a survey interview? The selection process plays a critical role in determining the ultimate composition and coverage properties of the resulting sample population.
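To make the screening and selection step concrete, the short Python sketch below applies illustrative eligibility rules to a hypothetical household roster and then selects one eligible adult at random. The field names, age threshold, and rules are assumptions for the purpose of illustration; in practice this step is governed by the study's formal definitions and by procedures such as Kish-type selection tables. The point is simply that different eligibility definitions produce different selected samples.

import random
from dataclasses import dataclass

@dataclass
class HouseholdMember:
    name: str
    age: int
    usual_resident: bool       # e.g., excludes temporary residents or guest workers
    able_to_participate: bool  # e.g., excludes members unable to complete an interview

def select_respondent(roster, min_age=18, seed=None):
    # Apply the (illustrative) eligibility rules, then pick one eligible adult at random.
    eligible = [m for m in roster
                if m.age >= min_age and m.usual_resident and m.able_to_participate]
    if not eligible:
        return None  # household contains no eligible respondent
    return random.Random(seed).choice(eligible)

roster = [
    HouseholdMember("Member A", 45, True, True),
    HouseholdMember("Member B", 17, True, True),    # below the age threshold
    HouseholdMember("Member C", 72, True, False),   # unable to participate
    HouseholdMember("Member D", 30, False, True),   # temporary resident
]
print(select_respondent(roster, seed=7))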
Countries may have different established practices and may also vary in the amount of experience and rigor with which procedures are applied and monitored. Furthermore, it is difficult to develop common definitions suitable for the diverse living situations across countries, or regions or cultures within a country. The level of challenge and complexity in identifying housing units may also vary widely from context to context. For example, as Lepkowski (2005) notes, housing units may be particularly difficult to find and classify in urban slum areas where structures are built from recycled or scrap materials. Housing units may be easier to identify in rural areas, but complex living arrangements in which multiple families may occupy a single dwelling such as a long-house or compound, may make it more difficult to classify households in a uniform way. Differences in key definitions such as what is a housing unit, household, or resident, and how consistently these are operationalized during fieldwork can have important effects on survey coverage. Sampling frame choice. Common sampling frame choices for household surveys include population registries, new or existing area probability sampling frames, postal address lists, voter registration lists, and telephone subscriber lists (Heeringa and O’Muircheartaigh, 2010). Important limitations and differences may exist in the type and quality of available sampling frames (i.e., how up-to-date, level of coverage, etc.). A key difference is that some countries have complete population registers of individuals, which may be used as a sampling frame (e.g., many European countries have such frames including Sweden, Finland, and Germany [see Lynn (2003)] for further examples). Other countries such as the United Kingdom only have registers of addresses or dwellings; others have no type of registry and must rely on area sampling methods as is the case in many developing countries. According to Heeringa and O’Muircheartaigh (2010), among the 44 developing countries participating in the
World Fertility Survey, no country had a reliable register of addresses or individuals. Sampling frame choice has important implications for related features of the sample design – such as clustering, stratification, and weighting. Differences in clustering, stratification, and weighting affect the efficiency (or level of precision) of the individual sample designs and will vary from one country to another (see Heeringa and O’Muircheartaigh (2010) for a detailed discussion). The features of the probability sample design should be adapted to optimize statistical efficiency and cost based on the nature of the population and geography. Heeringa and O’Muircheartaigh (2010) recommend establishing precision standards for the sample designs of participating countries for the key statistics that will be compared, an approach practiced by the European Social Survey (Lynn et al., 2004). Frame limitations also often translate into mode limitations. As Lynn (2003) notes, some countries have complete listings of telephone numbers (e.g., Sweden) or can generate reasonably efficient Random Digit Dialing (RDD) samples (e.g., UK, Germany), while others may only have good frames of addresses without telephone numbers, restricting possible contact via mail or a faceto-face contact. Further, a mode that works well in one country/culture may not have good coverage properties in others (Pennell et al., 2010), a topic that we discuss in more detail below. Some countries have research traditions that routinely employ quota sampling or sample unit or respondent substitution at the last stages of selection rather than probability sampling methods at all stages. Substitution is when interviewers substitute a sampled unit or respondent that they are unable to contact or is nonresponsive. While substitution can reduce interviewer time and travel spent attempting to recontact nonresponding units or respondents, the practice is not widely accepted because substantial evidence shows that this method leads to biased samples (Lepkowski, 2005).
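One simple way to make the precision standards mentioned above operational across countries is through the familiar approximation for the design effect due to clustering, deff = 1 + (b - 1) * rho, where b is the average number of interviews per cluster and rho is the intraclass correlation; the effective sample size is the nominal sample size divided by deff. The Python sketch below, with purely illustrative country names and values, converts a common target effective sample size into the nominal sample size each design would require. It ignores stratification gains and weighting losses and is therefore a planning heuristic rather than a full variance computation of the kind discussed by Heeringa and O'Muircheartaigh (2010).

import math

def design_effect(cluster_size: float, icc: float) -> float:
    # Kish-style approximation of the design effect due to clustering.
    return 1 + (cluster_size - 1) * icc

def required_nominal_n(target_effective_n: int, cluster_size: float, icc: float) -> int:
    # Nominal sample size needed to reach a target effective sample size.
    return math.ceil(target_effective_n * design_effect(cluster_size, icc))

# Two hypothetical countries aiming for the same effective sample size of 1,500.
for country, b, rho in [("Country A", 10, 0.02), ("Country B", 25, 0.05)]:
    print(f"{country}: deff = {design_effect(b, rho):.2f}, "
          f"nominal n needed = {required_nominal_n(1500, b, rho)}")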
Questionnaire Development

Questionnaire development presents a further set of challenges for multicultural and multinational surveys. In this section, we focus on challenges stemming from the social, cultural, and political context, as well as research traditions and experience.

Social and Cultural Context

As noted above, data must often be collected across multiple languages in most multicultural and multinational research. Researchers must identify the language needs and questionnaire design issues pertinent to the language groups to be included in the study. Other research materials such as consent forms may also need to be taken into account. Often, the questionnaire and other related materials will need both adaptation and translation. Adaptation is a process that changes aspects of a questionnaire and other study materials so that they are culturally relevant and appropriate for the study population. This process can involve changes to question content, format, response options, and the visual presentation of the instrument (Harkness et al., 2010b). Team translations are currently considered best practice in cross-cultural survey research. This is described in the TRAPD team translation model: Translation, Review, Adjudication, Pretesting, and Documentation. This process is described in detail in Harkness et al. (2010b) and in the online Cross-cultural Survey Guidelines at http://ccsg.isr.umich.edu/ (see also Chapter 19 in this Handbook). While relatively limited compared to the extensive literature on questionnaire design in noncomparative contexts, resources exist to guide survey researchers in the development of comparative instruments. For example, Harkness et al. (2003b) present advantages and disadvantages of major comparative design models and a general framework for design decisions. Harkness et al. (2010c) further developed this framework and also addressed basic considerations for
comparative questionnaire design. Smith (2003, 2004) reviews aspects of general questionnaire design that should be considered when developing instruments for use in multiple languages and provides extensive references. The online Cross-cultural Survey Guidelines (http://ccsg.isr.umich.edu/qnrdev. cfm) also provide a comprehensive overview. Key issues that emerge from this literature include the effect of cultural mindset, cognition, and response styles on survey response and the application of pretesting techniques for comparative surveys. We discuss each of these areas briefly below. A growing literature demonstrates the effect of cultural mindset on cognition. Cultural mindset – also referred to as cultural frames or dimensions – has been found to inform fundamental aspects of culture such as self-concept as well as what is salient and thus more likely to be encoded in memory and recalled, and what may be considered as sensitive to discuss. The effect of Western European and North American (individualist) cultures and East Asian (collectivist) cultures has received the most attention in the survey methodology literature (Schwarz et al., 2010). A small but growing literature explores the types of effects cultural frames may have on survey response. Schwarz et al. (2010) provide an overview of the basic differences between collectivist and individualist cultures and the cultural and linguistic effects that may arise at various stages of the survey response process (Harkness et al., 2014). Uskul et al. (2010) offer a similar discussion but include another form of collectivism – honor-based collectivism – prevalent in the Middle East, Mediterranean, and Latin American countries. Uskul and Oyserman (2006) propose a process model for hypothesizing whether members of different cultures will differ systematically in key areas including question comprehension, inferences made from the research context (the sponsor, question order, etc.), and response editing. The effect of individualism and collectivist cultures are well informed by an established
conceptual framework and body of experimental evidence (Schwarz et al., 2010). However, cultural psychologists have identified and demonstrated that a number of other cultural dimensions contribute important sources of variance in survey response such as independence versus interdependence, masculinity–femininity, power distance, and tightness–looseness. We are at the early stages of understanding how culture may affect survey response but the picture emerging is one of increased complexity. The effect of cultural frame has not been widely tested in survey research. However, cultural background has been shown to affect how respondents understand questions and constructs, as well as how they are influenced by the research context. For example, Haberstroh et al. (2002) found that individuals from collectivist cultures are more likely to be affected by the content of previous questions and take more care to provide nonredundant answers to redundant questions (Schwarz et al., 2003, 2010). Recent studies have also shown cultural effects for particular types of questions. Cultural frames can be invoked or ‘primed’, which is of particular relevance with bilingual or multilingual respondents whose cultural mindset may shift depending on the language used for survey administration. For example, Lee and Schwarz (2014) found that Hispanics, particularly those interviewed in Spanish, experience a higher level of difficulty answering the subjective life expectancy question resulting in higher rates of nonresponse and reports of zero percent probability. Lee speculates that these results are due to culture-related time orientation and a tendency to have a fatalistic view on life and health. Fundamental theoretical debate continues among cultural psychologists about how culture should be conceptualized, the dimensions of culture and, the extent to which culture can be viewed as an explanatory variable or variables. As Wyer (2013), in a review of this literature explains, cultures have often
been understood as static and equated with nation-states but newer theories view culture as having a dynamic, changing character not restricted by geographical boundaries. For example, Oyserman and Sorensen (2009) posit a theory of culture as situated cognition. In this view, no one person has a singular cultural mindset. Instead, we are all socialized to have a diverse set of overlapping and contradictory mindsets that are cued based on what is salient in the moment. This model implies that different cultural mindsets can be cued based on the situation producing sharply different ‘situated’ or momentary realities and corresponding perceptions and behaviors. According to Oyserman and Sorensen (2009), differences among societies reflect the relative likelihood that a mindset is cued. This means that the patterns associated with a particular dimension are not unique to any given country or society but may exist to a greater or lesser extent across societies. Any observed differences across cultural or national groups may reflect true differences in behaviors or attitudes, differences in the response process, or an unknown combination of both (Schwarz et al., 2010). If culture is situational or context dependent, it is important to consider how differences in the response process may be influenced in the moment by aspects of the research context (i.e., survey topic, sponsor, preceding questions, question features, the interviewer, etc.), the broader context, or a combination of both. Further research is needed that goes beyond global country comparisons and takes into account the complex interplay among factors stemming from the individual, the survey context and the broader context, and examines a broader range of cultural dimensions. Key features of the measurement context that have received increased attention in the questionnaire design literature are response scales and cultural differences in response styles. Challenges are often noted in replicating the features of Likert-type response scales in different languages (Harkness et al., 2003b; Villar, 2009). Response styles are
consistent and stable tendencies in response behavior not explainable in terms of question content or the measurement aims of a question (Yang et al., 2010). These tendencies are viewed as sources of bias or measurement error because they are unrelated to question content and measurement. The response styles most frequently discussed in the literature include acquiescence (the tendency to endorse statements regardless of content); extreme responding (the tendency to select the endpoints of answer scales); and middle category responding (the tendency to choose the middle option on an odd-numbered scale (i.e., 1–5) or the middle portion of an even-numbered scale (i.e., 1–6)). In the context of multicultural or multinational research, response styles may compromise comparisons among cultural or country populations if observed differences in responses at the individual or group level do not reflect true differences on a particular construct. Researchers have advanced a range of theories linking culture to response styles including differences in cultural norms, values, and experiences (Johnson et al., 1997), cultural heritage and experience among historically marginalized groups like African Americans (Bachman and O’Malley, 1984), Confucian philosophy in East Asian cultures, which values moderation, modesty, and cautiousness (Chen et al., 1995; Dolnicar and Grün, 2007), and Hofstede’s (2001) dimensions of individualism and collectivism (Chen et al., 1995; Harzing, 2006; Johnson et al., 1997). Numerous studies examine cultural differences in acquiescence, extreme, and middle category response styles across different cultural groups in North America (Latino or Hispanic populations and African Americans) and in terms of Hofstede’s (2001) cultural dimensions (individualism, collectivism, uncertainty avoidance, power distance, and masculinity) in Europe and across the world. We discuss briefly some examples of the patterns that have emerged for acquiescence and extreme responding (please see Yang et al. (2010) for an extensive discussion; for response styles in general see Chapter 36 in this Handbook).
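To show how such tendencies are often operationalized in analysis (the exact definitions vary across the studies cited), the Python sketch below computes simple per-respondent indices from a hypothetical battery of ten items scored 1 to 5: the share of agreement responses (4 or 5) as an acquiescence index, the share of endpoint responses (1 or 5) as an extreme-response index, and the share of midpoint responses (3) as a middle-category index. The item battery, scoring, and cut-offs are illustrative assumptions rather than those of any particular study, and an acquiescence index of this kind is interpretable only when the underlying item set is balanced.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
items = [f"q{i}" for i in range(1, 11)]  # hypothetical 10-item battery scored 1-5
df = pd.DataFrame(rng.integers(1, 6, size=(200, len(items))), columns=items)

# Acquiescence: share of 'agree' responses (4 or 5) across the battery; meaningful
# only if the battery is balanced (items kept in their administered, unrecoded direction).
df["acquiescence"] = df[items].isin([4, 5]).mean(axis=1)

# Extreme response style: share of scale endpoints (1 or 5).
df["extreme"] = df[items].isin([1, 5]).mean(axis=1)

# Middle-category response style: share of midpoint responses (3) on this odd-numbered scale.
df["midpoint"] = df[items].eq(3).mean(axis=1)

print(df[["acquiescence", "extreme", "midpoint"]].describe().round(2))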
While the research discussed above is often cited as evidence of cultural differences in response styles, it is important to keep in mind the limitations of the existing research. Many of these studies have methodological weaknesses. For example, as Yang et al. (2010) find in their review, some studies rely on an insufficient number of items or focus on items related to essentially one construct. Others fail to include balanced scales for detecting acquiescence (endorsing both items of a conceptually opposing pair in a balanced scale is considered a good indicator of acquiescence) or lack documentation about how answer scales were translated and adapted across languages. Secondly, multiple confounding effects limit the extent to which definitive conclusions can be reached across the body of research. Yang et al. (2010) report much variation within and between studies on aspects of answer scales including the length of the response scale (i.e., 4, 5, 7, and 10 points), agreement versus other types of scales (importance, true–false, etc.), and labeling (e.g., fully versus partially labeled, labels used for the midpoint). Studies also vary greatly in terms of their operationalization of culture (limited to country-level or individual-level racial or ethnic identity) and the type and variety of cultural dimensions explored. The role of culture in response styles remains an important area for future research. Greater attention is needed on the operationalization of culture, including further development of hypothesis-driven frameworks and measures of cultural dimensions, and multi-level approaches that can better reflect the complexity of the many factors that may influence the response process. An interactive view, such as that posited by Yang et al. (2010), is consistent with newer conceptualizations of culture. As Harkness et al. (2014) note, when designing a questionnaire and other survey materials, it is important for researchers
to attempt to identify and be informed by ways in which members of different cultures may differ systematically in how questions are understood and answered. Pretesting is essential to identify potential problems in survey design and instrumentation to minimize measurement error and other spurious outcomes. The use of qualitative or mixed methods, which we discuss below, can be particularly informative.

Political Context

In some countries, there may be political controversy surrounding the use of minority languages such that they are not officially recognized or accommodated with official translations. This may be because retaining one’s native language is seen as a refusal or failure to assimilate into the host or dominant country’s language and culture or because the minority language may be associated with separatist movements or groups resisting the influence and control of the majority culture or population (Harkness et al., 2014).

Research Traditions and Experience

Research traditions and experience in some countries may factor into the willingness and capacity of survey organizations to provide interviewers or questionnaires in a range of languages. Established practices or experience may also vary regarding translation approaches. Researchers may have varying levels of experience with translation or have established practices regarding how translation is carried out. While team translation, as described above, is accepted as current best practice, best practices are seldom employed and translation assessment techniques such as back-translation are still widely used, despite the growing evidence that this approach is inadequate (Harkness et al., 2010b; Harkness, 2011). Practices may also differ regarding what should be done in the absence of an official translation. One approach is to ask bilingual interviewers to provide ad hoc (or ‘on the fly’) translation of a written questionnaire.
Another technique involves the use of interpreters. In this latter case, the interviewer reads aloud the interview script in one language, which the interpreter is expected to translate into the language of the respondent. The interpreter also translates responses from the respondent into the interviewer’s language. Some might argue that conducting an interview via translation ‘on the fly’ is better than no interview at all. However, evidence suggests that these forms of ad hoc translation should be avoided due to the inevitable variance in translation performance (http://ccsg.isr.umich.edu/translation.cfm). Concerns about the cost of producing multiple written translations will also inevitably arise. Decisions about the cost and quality of such undertakings will need to be made in the context of the goals of the survey and weighed against competing constraints. Establishing consensus and agreed-upon standards in this area presents significant challenges in a multinational project.
Data Collection

Data collection challenges faced in multicultural and multinational surveys are numerous and have been discussed in some detail in the literature (Pennell et al., 2010, 2011). Here we discuss challenges stemming from each of the contextual dimensions: the social and cultural context; political context; the physical environment; economic conditions and infrastructure; and research traditions and experience.

Social and Cultural Context

Factors stemming from the social and cultural context can greatly affect contact and nonresponse patterns, as well as communication issues and measurement error during the interview process for those sample members who are contacted and are willing to participate. In some countries, populations move from home locations for extended vacation or holidays. For example, in China, the New Year is a major holiday period. In Australia and New Zealand, the summer Christmas holiday
is an extended vacation period. In Europe, the months of July and August often involve long periods of time away from home. Crime and security concerns in some countries or regions may put interviewers at risk or limit contact. Projects may also need to anticipate and plan for a variety of populations being inaccessible because of migration patterns (e.g., long-haul fishermen or nomadic populations) or for access to be restricted (miners or loggers in camps, women in some contexts). Even in regions with fairly comparable geographical, topographical, and political features, access impediments such as gated and locked communities and differential levels of willingness to participate will result in significantly different response rates across and within nations (Billiet et al., 2007; Couper and de Leeuw, 2003; de Leeuw and de Heer, 2002). Although response rates do not measure nonresponse bias (Bethlehem, 2002; Groves, 2006; Koch, 1998; Lynn et al., 2008), significant differences in response rates across countries in a given survey give cause for concern that nonresponse bias may differ across these countries (Couper and de Leeuw, 2003). Such variation in response rates may also be an indirect measure of the overall abilities of the data collection organizations and the rigor with which it implements the survey. This is one reason response rate goals are still used as a benchmark in cross-national surveys. Despite the ongoing growth in survey research all over the world, populations can still be expected to differ in their familiarity with surveys, their understanding of the goal of surveys and their role as respondents, all of which can affect data collection outcomes in terms of nonresponse and measurement error. While novelty or political aspirations may encourage response in some countries (Chikwanha, 2005; Kuechler, 1998), the willingness of developed country populations to participate in surveys has been declining (Couper and de Leeuw, 2003; de Leeuw and de Heer, 2002).
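The earlier point that response rates do not by themselves measure nonresponse bias can be made explicit with the standard deterministic decomposition of the bias of a respondent mean; this is textbook material rather than a result from any one of the studies cited:

\operatorname{bias}(\bar{y}_r) \;=\; \bar{Y}_r - \bar{Y} \;=\; (1 - \mathrm{RR})\,(\bar{Y}_r - \bar{Y}_{nr}), \qquad \mathrm{RR} = N_r / N,

where \bar{Y}_r and \bar{Y}_{nr} are the population means of respondents and nonrespondents and RR is the response rate. A given response rate therefore translates into very different biases depending on how respondents and nonrespondents differ on the survey variable, and that difference can itself vary across the countries in a comparative survey.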
As mentioned previously, in most multicultural and multinational research, data must be collected across multiple languages, some possibly without a standard written form. If, in accordance with what is viewed as general good practice, interviewers are expected to work from written questionnaires, countries with multiple languages may confront a daunting, if not insurmountable, task. Nigeria, for example, has more than 500 living languages and dialects; Indonesia has more than 700 (Gordon and Grimes, 2005). Even when a written form of a language is available, there are obvious limits to the number of translations that can be provided for a given location, and for some languages alternative procedures will be needed (Harkness et al., 2004). Even when a translation is provided, such a translation may exclude speakers of less widely spoken dialects. As an example, Harkness and colleagues point to a 2010 government study interviewing Chinese immigrants to the US, which failed to take into account that many Chinese living in San Francisco spoke Cantonese rather than Mandarin (Zahnd et al., 2011). A number of structural and design aspects of the survey protocol, including data collection mode, the interviewer and respondent interaction, the presence of third parties during the interview, and issues related to respondent burden (such as literacy and familiarity with structured surveys) may also contribute to measurement error. Education and literacy levels in some countries may restrict mode choice. Low literacy may be more pronounced in some targeted populations and negligible in others. The UNESCO Institute for Statistics (2013), for example, reports near-universal adult literacy in Central and Eastern Europe and Central Asia, but rates of only 63% in South and West Asia and 59% in sub-Saharan Africa in 2011. Social desirability bias, a source of measurement error well-documented in one-country surveys (e.g., Groves et al., 2009; Tourangeau and Yan, 2007), can also be expected to be differentially important in
cross-national research (Johnson and Van de Vijver, 2003). In developed countries, self-completion modes are often used to reduce social desirability bias (cf. Groves et al., 2009). However, self-completion procedures based on written questionnaires will be unsuitable for populations with low literacy rates. Pennell et al. (2010) note older literature that recognizes that interviewer-related bias may differ widely across countries (e.g., Bulmer, 1998; Bulmer and Warwick, 1993; Ralis et al., 1958; Warwick and Osherson, 1973). Interviewer-respondent matching can improve data quality and response, as examples in Ross and Vaughan (1986) and Cleland (1996) reflect. Although the scope and technical nature of surveys have changed radically in the intervening decades, cultural aspects have remained unchanged in many populations and regions and must be taken into account. For example, cultural norms in some contexts may require that interviewers be matched with respondents on gender, age, religion, or other factors such as caste (Pennell et al., 2010). Interviewer matching on one or more attributes can easily become costly and complex. Gender matching is essential in most Muslim countries. Unexpected challenges may arise even when a project has planned and budgeted for such matching. The Saudi National Health and Stress survey (SNHS) in the Kingdom of Saudi Arabia has found, for instance, that male interviewers must make the initial contact with a household and seek permission to interview both an eligible male and an eligible female household member. Once cooperation has been secured, a female interviewer must arrange to visit the household to interview the female respondent. In this context, cultural norms also preclude the recording of a female voice, limiting the use of this method for quality monitoring. Supervisors are therefore randomly assigned to observe interviews and must also be gender-matched (personal correspondence).
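To illustrate how quickly such matching constraints can accumulate, the following sketch filters a purely hypothetical interviewer roster against the attributes required for a single sampled case. Every identifier and attribute below is invented for illustration and is not drawn from the SNHS or any other study discussed here.

```python
# Purely hypothetical illustration: filtering an interviewer roster against
# the matching requirements of a single sampled case. All data are invented.

interviewers = [
    {"id": "I01", "gender": "F", "languages": {"ar", "en"}, "region": "west"},
    {"id": "I02", "gender": "M", "languages": {"ar"},       "region": "west"},
    {"id": "I03", "gender": "F", "languages": {"ar", "ur"}, "region": "east"},
    {"id": "I04", "gender": "M", "languages": {"en"},       "region": "east"},
]

def eligible(pool, **required):
    """Keep interviewers who meet every required attribute: set-valued
    requirements (e.g., languages) must overlap, scalar requirements
    (e.g., gender, region) must match exactly."""
    matches = []
    for person in pool:
        ok = True
        for key, value in required.items():
            if isinstance(person[key], set):
                ok = ok and bool(person[key] & value)
            else:
                ok = ok and person[key] == value
        if ok:
            matches.append(person)
    return matches

# A sampled case requiring a female, Arabic-speaking interviewer in the east.
print(len(eligible(interviewers, gender="F")))                                   # 2
print(len(eligible(interviewers, gender="F", languages={"ar"})))                 # 2
print(len(eligible(interviewers, gender="F", languages={"ar"}, region="east")))  # 1
```

Each additional constraint shrinks the eligible pool, which is one way matching requirements drive up cost and scheduling complexity in the field.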
Many studies instruct interviewers to conduct their interviews in a private setting. However, establishing interview privacy is not always possible due to close or crowded living situations, the presence of children, or other interested third parties, particularly where a survey interview may be a novel experience. Women in some cultures may be precluded from speaking alone with a stranger. The presence of third parties can be a source of measurement error. A cross-national analysis of data from nine countries that participated in the World Mental Health Survey demonstrates an effect of third-party presence on the reporting of sensitive information. Here, context is found to be critical: measurement error can be increased or decreased depending on the interview questions, the relationship of the third party to the respondent, the respondent’s need for social conformity, and the prevailing cultural background (Mneimneh, 2012).
Political Context
Even if surveys are not prohibited, an authoritarian regime may create conditions under which citizens are unaccustomed to expressing personal opinions or may feel threatened by doing so (Kuechler, 1998). Countries vary widely both in official requirements and in unwritten rules and customs pertaining to data collection and data access. Regulations pertaining to informed consent also vary greatly. American Institutional Review Boards (IRBs) state quite precisely the conditions to be met to ensure that respondent consent is informed and has been secured. IRB specifications of this kind are still unusual in parts of Europe, although as Singer (2008) indicates, European regulations on ethical practice can be quite rigorous. Best practice standards acknowledge that refusals must be respected, but the definition of what counts as a reluctant respondent or a refusing respondent is also fluid. Multinational panels or longitudinal studies are a further special case. Challenges met here include variation across countries in access to documents and data that would enable a respondent to be tracked from wave to wave.
A large-scale cross-national survey may also expect to face specific local situations such as conflict more frequently than in one-country contexts. In these cases, infrastructure may be badly damaged, roads destroyed, and maps rendered useless. Possible material challenges include shortages of supplies and food, and curfews or travel restrictions may be enforced. Upheavals may mean that populations are internally displaced, dispersed, or in refugee camps and that the safety of both interviewers and respondents is at risk. Survey researchers in areas of ongoing conflict and tension such as Israel may have established survey procedures and routines for dealing with their local risks. In other contexts, this will not be the case. Procedures to deal with the local disruption may have to be developed, as in research specifically designed to assess the impact of a disaster or war, or the project may have to be abandoned or postponed (see Mneimneh et al., 2014; Pennell et al., 2014; and Chapter 13 in this Handbook).
Economic Conditions and Infrastructure
Economic conditions such as extreme poverty, substantial wealth, or disparities in wealth may affect the living situations of a population, which may in turn affect contact with sampled households and respondents. For example, access to villages or communities in low-development contexts may require permission from village elders or other ‘gatekeepers’. In higher-income countries or areas, secure compounds, gated communities, or high-rise buildings can impede contact with sampled households (Pennell et al., 2010). The economic situation and related labor market conditions may dictate the availability of skilled interviewers and other personnel, and the availability of goods and services needed for large-scale data collection (e.g., paper, photocopies, computers, transportation). In strained economic environments, research organizations may be flooded with applicants for interviewer and other positions. In a saturated labor
market, qualified applicants may be difficult to recruit. Areas affected by conflict may suffer from an overabundance of applicants but a lack of applicants with the required skills or characteristics (e.g., religious affiliation, race, or ethnicity) due to conflict-related migration or security concerns (Salama et al., 2000). The availability of skilled staff is also often related to the research infrastructure present in the country, discussed further below.
Physical Environment
Locating and engaging some populations may also be more difficult in cross-national research than in one-country surveys. The effort required to reach people living on remote islands, in mountainous regions, or in very rural areas for face-to-face interviews can be considerable. The possibility of revisiting such locations is also clearly constrained, thus affecting the number of contact attempts possible. While such conditions are clearly not restricted to surveys in developing countries or multinational research, the magnitude of effort involved may be much greater, as may be the frequency, across countries, with which such effort is required. In both rural Brazil and rural Scotland, for example, sparsely populated areas are often excluded from sampling frames. At the same time, the smaller territory involved and the much easier access available to rural areas in Scotland cannot properly be compared with the challenge of accessing populations in the rural regions of northwestern Brazil. In similar fashion, countries may have numerous islands, desert, or mountainous areas, which are sparsely populated. The Filipino population, for example, is concentrated on the 11 largest of the more than 7,000 islands making up the republic, while the Norwegian population is concentrated on its mainland territory. Reaching such populations calls for more effort and expense, and fielding in remote areas may expose interviewers to more risks. Even if funds are available to include such locations, the number of contact attempts may need to be truncated for logistical or cost reasons.
Weather and seasonal conditions can pose barriers to accessing populations in any country. Winter or monsoon conditions, for example, can impede accessibility at certain times of the year. In regions with heavy snowfall, entire areas may be inaccessible for many months of the year. In comparative projects, therefore, a variety of conditions occurring at different times across targeted populations may need to be accommodated in the project planning. Finally, conducting research after a natural or man-made disaster will pose many additional challenges. If infrastructure was already compromised, conducting surveys after such an event will be even more challenging. For example, conducting fieldwork in Haiti after the 2010 earthquake posed very different challenges than conducting fieldwork in Japan after the complex event in 2011 (earthquake, tsunami, nuclear accident). For a more complete discussion of conducting research in post-disaster areas, see Pennell et al. (2014) and Chapter 13 in this Handbook.
Research Traditions and Experience
The organizational structure of a cross-national project has a major impact on how a study is designed and carried out (see Pennell et al. (2010) for a discussion). Nonetheless, the presence and capacity of survey research organizations – the availability of skilled and experienced staff, computer-assisted interviewing capacity, quality assurance/quality control capacity, etc. – greatly influence the types of challenges likely to be encountered at the data collection stage. For example, interviewer recruitment and training should be expected to take longer in contexts with little or no research infrastructure. Features of the new study may be unfamiliar, more complex, or require matching interviewers and respondents on different attributes, which may strain even an existing pool of experienced interviewers. In multilingual contexts, it may also be necessary to recruit and assess interviewers for adequate language skills. Lack of a research tradition may also mean that more oversight will be needed
throughout the survey lifecycle, but particularly during data collection. Even in countries with a rich tradition of survey research, the introduction of a new mode (computer-assisted interviewing) or technology (internet-based sample management) can pose significant challenges.
STANDARDIZATION, COORDINATION, AND QUALITY STANDARDS
It has long been recognized that unnecessary variations in design and implementation can negatively affect survey outcomes (Groves et al., 2009). In cross-national surveys, all such variations in design and implementation can potentially affect comparability. A strategy commonly employed to ensure comparability in cross-national surveys is to impose some measure of standardization – keeping the study designs and data collection protocols as similar as possible. However, the challenges of requiring similar implementation strategies across very diverse contexts are considerable. Indeed, strict standardization is neither always possible nor always desirable because of the many differences in survey context (Harkness, 2008; Pennell et al., 2010). Attempts to impose a ‘one size fits all’ approach can threaten data comparability (Harkness, 2008; Harkness et al., 2003a; Lynn et al., 2006; Skjåk and Harkness, 2003). Further, even if designs have been standardized or harmonized as much as possible, differences will frequently arise at the execution or implementation stage, contributing differentially to elements of total survey error and, ultimately, to data quality. Therefore, further challenges arise for surveys in multicultural and multinational contexts in identifying and setting suitable implementation standards and procedures to achieve comparability. Despite the continued growth in cross-national survey research, we still know little about the appropriate balance between standardization and local adaptation. Wide
variations in approach still persist across large-scale cross-national studies, often dictated by the level of funding available for quality monitoring. Most of these studies have adopted some form of written standards that may describe a sampling approach (e.g., number of cases, probability sampling), response rate goals, ethics reviews, translation protocols, mode (face-to-face, for example), interviewer training, pretesting, data collection monitoring (often in the form of ‘back-checking’), and data delivery timing and standards (e.g., levels of missing data, various consistency checks). The wide variation comes in how conformity to these standards is monitored and whether there are consequences for not meeting these standards. Approaches tend to vary depending on whether the research involves a one-time multinational survey or a longitudinal or repeated cross-sectional survey. Surveys that are to be repeated over time often allow more leeway in deviations from protocols under the assumption that improvements can be made in subsequent waves or rounds of the survey. Important challenges also arise in the coordination of multinational projects. Individuals working at the level of project development and data collection management in different locations can have varying degrees of proficiency. Descriptions and understanding of technical terms can vary widely. Collaborators may also be located across many time zones, further complicating communication. Although Web access, free phone services such as voice-over-internet-protocol (VOIP), and tools for document sharing and project collaboration are common, constraints on how and when direct communication can occur are inevitable.
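One way to make written standards of the kind listed above easier to monitor is to record them in a machine-readable form that country-level fieldwork reports can be checked against. The sketch below is purely illustrative: the targets, thresholds, and report fields are invented, not drawn from any of the studies discussed in this chapter.

```python
# Hypothetical written-standards specification for a cross-national study.
# Every target and threshold below is invented for illustration.

STUDY_STANDARDS = {
    "sampling": {"design": "probability", "min_completed_interviews": 1000},
    "response_rate_goal": 0.70,          # a benchmark, not a bias measure
    "ethics_review": "required in every participating country",
    "translation": "team translation with documented adjudication",
    "mode": "face-to-face (CAPI where infrastructure allows)",
    "interviewer_training": {"min_hours": 16, "certification_test": True},
    "pretesting": {"required": True, "min_cases": 30},
    "fieldwork_monitoring": {"back_check_share": 0.10},
    "data_delivery": {"max_item_missing_share": 0.05},
}

def check_country(report, standards=STUDY_STANDARDS):
    """Return a list of standards a country report fails to meet."""
    problems = []
    if report["response_rate"] < standards["response_rate_goal"]:
        problems.append("response rate below goal")
    if report["back_check_share"] < standards["fieldwork_monitoring"]["back_check_share"]:
        problems.append("insufficient back-checking")
    if report["item_missing_share"] > standards["data_delivery"]["max_item_missing_share"]:
        problems.append("too much item missing data")
    return problems

# Example check against a fictional country report.
print(check_country({"response_rate": 0.62,
                     "back_check_share": 0.12,
                     "item_missing_share": 0.03}))
# ['response rate below goal']
```

As noted above, the harder questions in practice are how conformity to such a specification is monitored and what the consequences of falling short are.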
Ethical Considerations when Undertaking Research in Multiple Contexts
Most countries and research institutions have regulations or legislation that addresses
human subject research. In a multinational framework, this will often mean complying with multiple (and sometimes conflicting) regulations. This can occur both across and within countries if multiple organizations are involved. Having said that, there is considerable overlap in the principles covered in ethics codes and governmental regulations, although views of ethical practices and their application may vary across contexts (Hesse-Biber, 2010). These principles include: avoiding undue intrusion; keeping respondent burden low; obtaining voluntary consent (avoiding undue coercion); and respecting privacy and maintaining confidentiality. On the face of it, these principles appear to be relatively straightforward. However, what constitutes an intrusion, what is coercive, and what is considered a burden or is sensitive in nature can vary widely across contexts and populations. For example, administering sensitive questions in the presence of others may be a particular concern, especially where the presence of others may be required or is more frequently encountered and where low literacy makes self-administered modes more difficult to implement. The collection of health measures such as biomarkers and physical measurements during the survey interview process is on the rise. In particular, recent developments have facilitated the collection and analysis of DNA at lower cost and by less specialized personnel, including survey interviewers. This has led to increased interest in collecting DNA as part of survey data collection, and to a need for knowledge of regulations regarding the collection and handling of, and subsequent access to, biological and DNA data (Sakshaug et al., 2014). Most regulations also recognize that vulnerable populations require additional protections. These may include children, pregnant women, the elderly, prisoners, the mentally impaired, and members of economically or otherwise disadvantaged groups. Groups such as forced migrants (Bloch, 2006), refugees
(Jacobsen and Landau, 2003), and populations unfamiliar with surveys may also require additional consideration (Bulmer and Warwick, 1993; Randall et al., 2013). For a more comprehensive discussion of ethical considerations in survey research, see the Cross-cultural Survey Guidelines (http://ccsg.isr.umich.edu/ethics.cfm), Singer (2008), and Chapter 7 in this Handbook.
FUTURE DIRECTIONS
It is increasingly recognized that studies involving multiple cultures, countries, regions, or languages may benefit from the use of mixed methods. A mixed methods study involves the collection or analysis of both quantitative and qualitative data, in which the data are collected concurrently or sequentially, each is given a priority, and the two are integrated at one or more stages in the process of research (Creswell et al., 2003). The different approaches used in qualitative and quantitative data collection methods can be complementary in studies of cross-cultural comparison (Van de Vijver and Chasiotis, 2010). Further, as Bamberger et al. (2010) note, mixing qualitative and quantitative methods can address the complexity of sensitive topics or cultural factors more fully than can quantitative methods alone. Mixing qualitative and quantitative methods has been found to be particularly helpful for validating constructs (Hitchcock et al., 2005; Nastasi et al., 2007) and designing instruments (Nastasi et al., 2007). Axinn and Pearce (2006) detail data collection methods that combine elements of survey methods, unstructured interviews, focus groups, and others that can, in combination, reduce measurement error and provide insights into causal processes. Researchers can choose from a range of different mixed-methods designs depending on the needs and objectives of a particular study. See Clark et al. (2008) for an overview
of the major mixed methods design types. Van de Vijver and Chasiotis (2010) provide an in-depth discussion and a conceptual framework for mixed methods studies and provide examples of their application in multicultural and multinational studies. Further examples and references for mixed methods approaches are provided in the online Cross-cultural Survey Guidelines (Levenstein and Lee, 2011) and in the pretesting, questionnaire design, and data collection chapters (see also Chapter 42 in this Handbook). New literature exploring methodological issues across nations and cultures reinforces that we have much to learn and understand and that the issues are complex and nuanced. New work in cross-cultural psychology is just one example. Technological changes are also bringing new insights and challenges. The literature and experience drawn from one-country studies can provide some guidance but is not always relevant to new contexts. For example, when many developed countries transitioned to automated data collection in the late 1980s and early 1990s, the technological options and approaches were very limited. With the profusion of technical options now available to countries ready to make this transition, the lessons learned from past technological and methodological innovations may not tell us how these new approaches will work or perform in these new contexts. We are also seeing the use of disruptive technologies in new contexts that are pushing the field forward. For example, the national China Health and Retirement Longitudinal Study (CHARLS) not only employs state-of-the-art data collection systems and centralized paradata monitoring but has also innovated in tracking respondents across waves through the use of digital photography and digital fingerprinting. In these contexts, innovative solutions are often not constrained by old paradigms. We are currently seeing a proliferation of affordable data collection platforms, including mobile ‘smart phones’ with increased
functionality, storage capacity able to handle more and more complex instrumentation, and multiple platforms with longer battery life being employed in resource-poor settings across a wide range of countries. This, combined with the use of ‘air cards’ for data transmission, global positioning system (GPS) units that can track an interviewer’s movements, and noninvasive digital audio recording, provides rich paradata that bring a much greater level of control over data collection monitoring and quality control. In addition to receiving the survey data on a real-time or near-real-time basis, paradata may include keystroke or audit files, which, combined with digital recording, can provide deep insights into the question and answer process (as well as making it much more difficult to falsify data). Receipt of detailed call records allows for responsive design and active management during data collection; a minimal sketch of such centralized monitoring is given at the end of this section. With the use of these technologies, cross-national data collection can be actively monitored from a central location, even when the central location is many time zones removed from the fieldwork. Despite these methodological and technological advances, there is much we still do not know and need to understand. To achieve this understanding, we need better and more detailed documentation. Some progress has been made (see the European Social Survey as an example [http://www.europeansocialsurvey.org/]) but for many, if not most, cross-national surveys, many aspects of design and implementation continue to be opaque. Detailed documentation will help us better design interventions and methodological experimentation. This, in turn, will inform best practices, ultimately improving the quality, comparability, and appropriate use of the resulting data. The International Organization for Standardization’s ISO 20252:2012 and the American Association for Public Opinion Research’s transparency initiative are two examples of calls for increased documentation and ‘transparency’ of methods. Such approaches are critical if we are to move the
field of cross-cultural and cross-national survey methods toward empirically informed best practices.
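As a minimal, purely hypothetical sketch of the centralized monitoring described earlier in this section, the fragment below tallies interviewer call outcomes by country from daily-transmitted call records. The record layout, outcome codes, and figures are invented for illustration.

```python
# Hypothetical sketch of central monitoring from interviewer call records
# transmitted daily. Field names and figures are invented.
from collections import defaultdict

call_records = [
    {"country": "A", "interviewer": "A-03", "outcome": "complete"},
    {"country": "A", "interviewer": "A-03", "outcome": "noncontact"},
    {"country": "A", "interviewer": "A-07", "outcome": "refusal"},
    {"country": "B", "interviewer": "B-11", "outcome": "complete"},
    {"country": "B", "interviewer": "B-11", "outcome": "complete"},
    {"country": "B", "interviewer": "B-14", "outcome": "noncontact"},
]

def field_summary(records):
    """Tally call outcomes by country so the coordinating centre can spot
    countries (or interviewers) that need attention during fieldwork."""
    tally = defaultdict(lambda: defaultdict(int))
    for rec in records:
        tally[rec["country"]][rec["outcome"]] += 1
    summary = {}
    for country, outcomes in tally.items():
        attempts = sum(outcomes.values())
        summary[country] = {
            "attempts": attempts,
            "completion_share": outcomes["complete"] / attempts,
            "noncontact_share": outcomes["noncontact"] / attempts,
        }
    return summary

for country, stats in field_summary(call_records).items():
    print(country, stats)
```

In a production system, summaries of this kind would feed dashboards and responsive-design decisions rather than simple printouts.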
NOTES
1 Source: ITU World Telecommunication/ICT Indicators database.
2 Source: ITU World Telecommunication/ICT Indicators database.
RECOMMENDED READINGS
We recommend Harkness (2008) and Harkness et al. (2010a), which provide essential background on multinational and multicultural survey research, and the online resource the Cross-cultural Survey Guidelines (available at http://www.ccsg.isr.umich.edu/), which offers a practical guide to all aspects of the survey lifecycle for multinational and multicultural surveys.
REFERENCES
Axinn, W. G., and Pearce, L. D. (2006). Mixed Method Data Collection Strategies. Cambridge, UK: Cambridge University Press. Bachman, J. G., and O’Malley, P. M. (1984). Yea-saying, Nay-saying, and Going to Extremes: Black-white Differences in Response Styles. Public Opinion Quarterly, 48(2), 491–509. Bamberger, M., Rao, V., and Woolcock, M. (2010). Using Mixed Methods in Monitoring and Evaluation: Experiences from International Development. In A. Tashakkori and C. Teddlie (eds), SAGE Handbook of Mixed Methods in Social and Behavioral Research (2nd edn, pp. 613–642). Thousand Oaks, CA: Sage Publications, Inc. Bethlehem, J. (2002). Weighting Nonresponse Adjustments Based on Auxiliary Information. In R. M. Groves, D. A. Dillman, J. Eltinge, and R. Little (eds), Survey Nonresponse (pp. 275–288). New York, NY: Wiley-Interscience.
Billiet, J., Philippens, M., Fitzgerald, R., and Stoop, I. (2007). Estimation of Response Bias in the European Social Survey: Using Information from Reluctant Respondents in Round One. Journal of Official Statistics, 23(2), 135–162. Bloch, A. (2006). Emigration from Zimbabwe: Migrant Perspectives. Social Policy & Administration, 40(1), 67–87. Blumberg, S. J., Ganesh, N., Luke, J. V., and Gonzales, G. (2013). Wireless Substitution: State-level Estimates from the National Health Interview Survey, 2012 (No. 70). National Center for Health Statistics. Washington, D.C. Bulmer, M., and Warwick, D. P. (1993). Social Research in Developing Countries: Surveys and Censuses in the Third World. London, UK: Routledge. Bulmer, M. (1998). Introduction: The problem of exporting social survey research. American Behavioral Scientist 42: 153–167. Chen, C., Lee, S., and Stevenson, H. W. (1995). Response Style and Cross-cultural Comparisons of Rating Scales Among East Asian and North American Students. Psychological Science, 6(3), 170–175. Chikwanha, A. B. (2005). Conducting Surveys and Quality Control in Africa: Insights from the Afrobarometer. In WAPOR/ISSC Conference. Ljubljanna, Slovenia. Clark, V. L. P., Creswell, J. W., Green, D. O. N., and Shope, R. J. (2008). Mixing Quantitative and Qualitative Approaches. In S. N. HesseBiber and P. Leavy (eds), Handbook of Emergent Methods (pp. 363–387). New York, NY: Guilford Press. Cleland, J. (1996). Demographic Data Collection in Less Developed Countries 1946– 1996. Population Studies, 50(3), 433–450. Couper, M. P., and de Leeuw, E. D. (2003). Nonresponse in Cross-cultural and Crossnational Surveys. In J. A. Harkness, F. J. R. Van de Vijver, and P. P. Mohler (eds), Crosscultural Survey Methods (pp. 157–178). Hoboken, NJ: John Wiley & Sons. Creswell, J. W., Plano Clark, V. L., Gutmann, M. L., and Hanson, W. E. (2003). Advanced Mixed Methods Research Designs. In A. Tashakkori and C. Teddlie (eds), Handbook of Mixed Methods in Social and Behavioral Research (pp. 209–240). Thousand Oaks, CA: Sage Publications, Inc.
de Leeuw, E. D., and de Heer, W. (2002). Trends in Household Survey Nonresponse: A Longitudinal and International Comparison. In R. M. Groves, D. A. Dillman, J. Eltinge, and R. Little (eds), Survey Nonresponse (pp. 41– 54). New York, NY: John Wiley & Sons. Deshmukh, Y. (2006). LRRDI Methodology: Long Term Perspectives on the Response to the Indian Ocean Tsunami 2004: A Joint Evaluation of the Links Between Relief, Rehabilitation, and Development (LLRD). Stockholm, Sweden: SIDA. Deshmukh, Y., and Prasad, M. G. (2009). Impact of Natural Disasters on the QOL in Conflict Prone Areas: A Study of the Tsunami Hit Transitional Societies of Aceh (Indonesia) and Jaffna (Sri Lanka). Presented at the IX ISQOLS Conference, Florence, Italy. Dolnicar, S., and Grün, B. (2007). Cross-cultural Differences in Survey Response Patterns. International Marketing Review, 24(2), 127–143. Gordon, R. G., and Grimes, B. F. (2005). Ethnologue: Languages of the World (Vol. 15). Dallas, TX: SIL International. Groves, R. M. (2006). Nonresponse Rates and Nonresponse Bias in Household Surveys. Public Opinion Quarterly, 70(5), 646–675. Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., and Tourangeau, R. (2009). Survey Methodology. Hoboken, NJ: John Wiley & Sons Inc. Haberstroh, S., Oyserman, D., Schwarz, N., Kühnen, U., and Ji, L. J. (2002). Is the Interdependent Self More Sensitive to Question Context Than the Independent Self? SelfConstrual and the Observation of Conversational Norms. Journal of Experimental Social Psychology, 38(3), 323–329. Hantrais, L., and Mangen, S. (1999). Crossnational Research. International Journal of Social Research Methodology, 2(2), 91–92. Harkness, J. A. (2008). Comparative Survey Research: Goals and Challenges. In E.D. de Leeuw, J. J. Hox, and D. A. Dillman (eds), International Handbook of Survey Methodology (pp. 56–77). New York, NY: Psychology Press Taylor & Francis Group. Harkness, J. A. (2011). Translation. Cross- cultural Survey Guidelines. Retrieved from http://ccsg.isr.umich.edu/translation.cfm Harkness, J. A., Braun, M., Edwards, B., Johnson, T. P., and Lyberg, L. E., Mohler, P.Ph.,
Pennell, B.-E., Smith, T.W. (2010a). Survey Methods in Multicultural, Multinational, and Multiregional Contexts. Hoboken, NJ: John Wiley & Sons. Harkness, J. A., Edwards, B., Hansen, S. E., Miller, D. R., and Villar, A. (2010c). Designing Questionnaires for Multipopulation Research. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P. Mohler, B.-E. Pennell, T. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 31–57). Hoboken, NJ: John Wiley & Sons. Harkness, J. A., Mohler, P. P., and Dorer, B. (2015). Understanding Survey Translation. Hoboken, NJ: John Wiley & Sons. Harkness, J. A., Pennell, B.-E., and SchouaGlusberg, A. (2004). Survey Questionnaire Translation and Assessment. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 453– 473). Hoboken, NJ: John Wiley & Sons, Inc. Harkness, J. A., Stange, M., Cibelli, K. L., Mohler, P. P., and Pennell, B.-E. (2014). Surveying Cultural and Linguistic Minorities. In R. Tourgangeau, B. Edwards, T. P. Johnson, K. Wolter, and N. Bates (eds), Hard-to-Survey Populations (pp. 245–269). Cambridge, UK: Cambridge University Press. Harkness, J. A., Van de Vijver, F. J. R., and Johnson, T. P. (2003b). Questionnaire Design in Comparative Research. In J. A. Harkness, F. J. R. Van de Vijver, and P. Ph. Mohler (eds), Cross-cultural Survey Methods (pp. 19–34). Hoboken, NJ: John Wiley & Sons. Harkness, J. A., Van de Vijver, F. J. R., and Mohler, P. P. (eds). (2003a). Cross-cultural Survey Methods. Hoboken, NJ: John Wiley & Sons. Harkness, J. A., Villar, A., and Edwards, B. (2010b). Translation, Adaptation, and Design. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P. Ph. Mohler, B.-E. Pennell, T.W. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 115– 140). Hoboken, NJ: John Wiley & Sons. Harzing, A.-W. (2006). Response Styles in Cross-national Survey Research: a 26-Country Study. International Journal of Cross Cultural Management, 6(2), 243–266.
Heath, A., Fisher, S., and Smith, S. (2005). The Globalization of Public Opinion Research. Annual Review of Political Science, 8, 297–333. Heeringa, S. G., and O’Muircheartaigh, C. (2010). Sample Design for Cross-cultural and Cross-national Survey Programs. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P. Mohler, B.-E. Pennell, T.W. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 251–267). Hoboken, NJ: John Wiley & Sons. Hesse-Biber, S. N. (2010). Mixed Methods Research: Merging Theory with Practice. New York, NY: Guilford Press. Hitchcock, J. H., Nastasi, B. K., Dai, D. Y., Newman, J., Jayasena, A., Bernstein-Moore, R., Sarkar, S., Varjas, K. (2005). Illustrating a Mixed-Method Approach for Validating Culturally Specific Constructs. Journal of School Psychology, 43(3), 259–278. Hofstede, G. H. (2001). Culture’s Consequences: Comparing Values, Behaviors, Institutions and Organizations Across Nations (2nd edn). Thousand Oaks, CA: Sage Publications, Inc. Institute for Statistics (2013). Adult and Youth Literacy: National, Regional, and Global Trends, 1985–2015 (UIS Information Paper). Montreal, Canada: United Nations Educational, Scientific and Cultural Organization. Retrieved from http://www.uis.unesco.org/Education/Documents/literacy-statistics-trends-1985-2015. pdf. Accessed on 13 June 2016. Jacobsen, K., and Landau, L. B. (2003). The Dual Imperative in Refugee Research: Some Methodological and Ethical Considerations in Social Science Research on Forced Migration. Disasters, 27(3), 185–206. Johnson, T. P., O’Rourke, D., Chavez, N., Sudman, S., Warnecke, R., Lacey, L., and Horm, J. (1997). Social Cognition and Responses to Survey Questions Among Culturally Diverse Populations. In L. E. Lyberg, P. P. Biemer, M. Collins, and E.D. de Leeuw (eds), Survey Measurement and Process Quality (pp. 87–113). Canada: John Wiley & sons. Johnson, T. P., and Van de Vijver, F. J. R. (2003). Social Desirability in Cross-cultural Research. In J. A. Harkness, F. J. Van de Vijver, and P. Ph. Mohler (eds), Cross-cultural Survey
Methods (pp. 195–204). Hoboken, NJ: Wiley-Interscience. Jowell, R. (1998). How Comparative is Comparative Research? American Behavioural Scientist, 42, 168–177. Koch, A. (1998). Wenn ‘mehr’ nicht gleichbedeutend mit ‘besser’ ist. Ausschöpfungsquoten und Stichprobenverzerrung in Allgemeinen Bevölkerungsumfragen. Zentrum Für Umfragen, Methoden Und Analysen, 42(22), 66–88. Kuechler, M. (1998). The Survey Method. American Behavioral Scientist, 42(2), 178–200. Lee, S., and Schwarz, N. (2014). Question Context and Priming Meaning of Health: Effect on Differences in Self-rated Health Between Hispanics and non-Hispanic Whites. American Journal of Public Health, 104(1), 179–185. Lepkowski, J. M. (2005). Non-observation Error in Household Surveys in Developing Countries. In Household Sample Surveys in Developing and Transition Countries (pp. 149–169). New York, NY: United Nations. Levenstein, R, and Lee, H.J. (2011). Cross-cultural survey guidelines. Retrieved from http:// ccsg.isr.umich.edu/ [accessed on 14 June 2016]. Lynn, P. (2003). Developing Quality Standards for Cross-national Survey Research: Five Approaches. Int. J. Social Research Methodology, 6(4), 323–336. Lynn, P., de Leeuw, E. D., Hox, J. J., and Dillman, D. A. (2008). The Problem of Nonresponse. Journal of Statistics, 42(2), 255–70. Lynn, P., Hader, S., Gabler, S., and Laaksonen, S. (2004). Methods for Achieving Equivalence of Samples in Cross-national Surveys: The European Social Survey Experience. Journal of Official Statistics, 23(1), 107–124. Lynn, P., Japec, L., and Lyberg, L. (2006). What’s So Special About Cross-national Surveys? In J. A. Harkness (ed.), Conducting Cross-national and Cross-cultural Surveys (pp. 7–20). Mannheim, Germany: ZUMA. Mneimneh, Z. (2012). Interview Privacy and Social Conformity Effects on Socially Desirable Reporting Behavior: Importance of Cultural, Individual, Question, Design and Implementation Factors. University of Michigan, Ann Arbor: Unpublished doctoral dissertation. Mneimneh, Z., Cibelli, K.L., Axinn, W. G., Ghimire, D., and Alkaisy, M. (2014). Conducting Survey Research in Areas of Armed Conflict. In R.
Tourgangeau, B. Edwards, T. Johnson, K. Wolter, and N. Bates (eds), Hard-to-Survey Populations (pp. 134–156). Cambridge, UK: Cambridge. Nastasi, B. K., Hitchcock, J. H., Sarkar, S., Burkholder, G., Varjas, K., and Jayasena, A. (2007). Mixed Methods in Intervention Research: Theory to Adaptation. Journal of Mixed Methods Research, 1(2), 164–182. Øyen, E. (1990). Comparative Methodology: Theory and Practice in International Social Research. London, UK: Sage Publications, Inc. Oyserman, D., and Sorensen, N. (2009). Understanding Cultural Syndrome Effects on What and How We Think: A Situated Cognition Model. In R. S. Wyer, C. Chiu, and Y. Hong (Eds.), Understanding Culture: Theory, Research, and Application (pp. 25–52). New York, NY: Psychology Press. Pennell, B.-E., Deshmukh, Y., Kelley, J., Maher, P., and Tomlin, D. (2014). Disaster Research: Surveying Displaced Populations. In R. Tourangeau, B. Edwards, T. Johnson, K. Wolter, and N. Bates (eds), Hard-to-Survey Populations (pp. 111–133). Cambridge, UK: Cambridge University Press. Pennell, B.-E., Harkness, J. A., Levenstein, R., and Quaglia, M. (2010). Challenges in Crossnational Data Collection. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P.Ph. Mohler, B.-E. Pennell, T.W. Smith (eds), Survey Methods in Multicultural, Multinational, and Multiregional Contexts (pp. 269– 298). Hoboken, NJ: John Wiley & Sons. Pennell, B.-E., Levenstein, R., and Lee, H. J. (2011). Data Collection. Cross-cultural Survey Guidelines. Retrieved from http:// ccsg.isr.umich.edu/datacoll.cfm Ralis, M., Suchman, E. A., and Goldsen, R. K. (1958). Applicability of Survey Techniques in Northern India. Public Opinion Quarterly, 22(3), 245–250. Randall, S., Coast, E., Compaore, N., and Antoine, P. (2013). The Power of the Interviewer: A Qualitative Perspective on African Survey Data Collection. Demographic Research, 28(27), 763–792. Ross, D. A., and Vaughan, J. P. (1986). Health Interview Surveys in Developing Countries: A Methodological Review. Studies in Family Planning, 17(2), 78–94. Sakshaug, J. W., Ofstedal, M. B., Guyer, H., and Beebe, T. J. (2014). The Collection of
Biospecimens in Health Surveys. In T. P. Johnson (ed.), Health Survey Methods (pp. 383– 419). Hoboken, NJ: John Wiley & Sons, Inc. Salama, P., Spiegel, P., Van Dyke, M., Phelps, L., and Wilkinson, C. (2000). Mental Health and Nutritional Status Among the Adult Serbian Minority in Kosovo. JAMA: The Journal of the American Medical Association, 284(5), 578–584. Schwarz, N., Harkness, J. A., Van de Vijver, F. J. R., and Mohler, P. P. (2003). Culture-Sensitive Context Effects: A Challenge for Cross- cultural Surveys. In J. A. Harkness, F. J. R. Van de Vijver, and P. P. Mohler (eds), Cross- cultural Survey Methods (pp. 93–100). Hoboken, NJ: John Wiley & Sons. Schwarz, N., Oyserman, D., and Peytcheva, E. (2010). Cognition, Communication, and Culture: Implications for the Survey Response Process. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P. Ph. Mohler, B.-E. Pennell, T.W. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 175– 190). Hoboken, NJ: John Wiley & Sons, Inc. Singer, E. (2008). Ethical Issues in Surveys. In E.D. de Leeuw, J. J. Hox, and D. A. Dillman (eds), International Handbook of Survey Methodology (pp. 78–96). New York, NY: Lawrence Erlbaum Associates. Skjåk, K. K., and Harkness, J. A. (2003). Data Collection Methods. In J. A. Harkness, F. J. R. Van de Vijver, and P. Ph. Mohler (eds), Crosscultural Survey Methods (pp. 179–194). Hoboken, NJ: John Wiley & Sons. Smith, T. W. (2003). Developing Comparable Questions in Cross-national Surveys. In J. A. Harkness, F. J. R. Van de Vijver, and P. Ph. Mohler (eds), Cross-cultural Survey Methods (pp. 69–92). Hoboken, NJ: John Wiley & Sons. Smith, T. W. (2004). Developing and Evaluating Cross-national Survey Instruments. In S. Presser, J. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 431–452). Hoboken, NJ: John Wiley & Sons. Tourangeau, R., and Yan, T. (2007). Sensitive Questions in Surveys. Psychological Bulletin, 133(5), 859. Uskul, A. K., and Oyserman, D. (2006). Question Comprehension and Response: Implications of Individualism and Collectivism.
National Culture and Groups, Research on Managing Groups and Teams, 9, 173–201. Uskul, A. K., Oyserman, D., and Schwarz, N. (2010). Cultural Emphasis on Honor, Modesty, or Self-enhancement: Implications for the Survey-response Process. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P.Ph. Mohler, B.-E. Pennell, T.W. Smith (eds), Survey Methods in Multicultural, Multinational, and Multiregional Contexts (pp. 191–201). Hoboken, NJ: John Wiley & Sons, Inc. Van de Vijver, F., and Chasiotis, A. (2010). Making Methods Meet: Mixed Designs in Cross-cultural Research. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P.Ph. Mohler, B.-E. Pennell, T.W. Smith (eds), Survey Methods in Multicultural, Multinational, and Multiregional Contexts (pp. 455–473). Hoboken, NJ: Wiley. Villar, A. (2009). Agreement Answer Scale Design for Multilingual Surveys: Effects of Translation-related Changes in Verbal Labels on Response Styles and Response Distributions. ETD collection for University of Nebraska – Lincoln. Paper AAI3386760, January 1. Retrieved from http://digitalcommons.unl.edu/dissertations/AAI3386760 Warwick, D. P., and Osherson, S. (1973). Comparative Research Methods: An Overview. Englewood Cliffs, NY: Prentice-Hall. Wyer, R. S. (2013). Culture and Information Processing: A Conceptual Integration. In R.S. Wyer, C. Chiu, and Y. Hong (eds), Understanding Culture: Theory, Research, and Application (pp. 431–456). New York, NY: Psychology Press. Yang, Y., Harkness, J. A., Chin, T., and Villar, A. (2010). Response Styles and Culture. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P.Ph. Mohler, B.-E. Pennell, T.W. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 203–223). Hoboken, NJ: John Wiley & Sons. Zahnd, E. G., Holtby, S., and Grant, D. (2011). Increasing Cultural Sensitivity as a Means of Improving Cross-cultural Surveys: Methods Utilized in the California Health Interview Survey (CHISS) 2001–2011. Presented at the American Association of Public Opinion Research (AAPOR) Annual Conference, Phoenix, AZ.
13
Surveys in Societies in Turmoil
Zeina N. Mneimneh, Beth-Ellen Pennell, Jennifer Kelley and Kristen Cibelli Hibben
INTRODUCTION
Turmoil caused by armed conflicts and natural disasters afflicts many countries around the world, resulting in tremendous demographic, social, economic, and health consequences. Measuring these impacts, assessing community needs, and guiding and evaluating aid and other interventions require the collection of timely and reliable information about the affected populations. One of the most commonly used methods for collecting such information is survey research. However, conducting surveys in such contexts is fraught with challenges throughout the survey lifecycle, including sample design and selection, questionnaire development and pretesting, data collection, ethical considerations, and data dissemination. This chapter provides researchers with a set of principles that address some of the challenges faced at these different stages of the survey lifecycle. These principles are based on observations from the literature as well as
the authors’ experience conducting surveys in areas affected by natural disaster and armed conflict. Although these principles are not all based on experimental evidence, they do reflect current best practices in this field. The chapter begins by defining areas of turmoil, followed by a discussion of the set of research principles that are applicable to planning and conducting studies in these settings. We discuss commonly found challenges under each of these principles and provide examples. Lastly, we address specific ethical considerations and the importance of collecting detailed documentation when conducting this type of research.
DEFINITION OF AREAS IN TURMOIL
We define areas in turmoil as any geographic area (country or part of a country) exposed to armed conflict, more specifically political unrest (and the threat of violence) or active
armed conflict, or affected by any of five distinct types of natural disaster: geophysical (e.g., earthquake, volcano), meteorological (e.g., storm, blizzard), hydrological (e.g., flood, storm surge), climatological (e.g., extreme temperature, drought), and biological (e.g., epidemic). Armed conflicts and disasters each have unique aspects that differentially affect the types of challenges faced in the design and implementation of survey research. A detailed discussion of the challenges encountered during armed conflict and natural disasters can be found in Mneimneh et al. (2014) and Pennell et al. (2014), respectively. While the next section discusses survey research principles that address common challenges encountered when conducting face-to-face survey research in such settings,1 researchers are encouraged to assess the applicability of each principle and to adapt it appropriately to the circumstances and context of the setting in which they are working.
SURVEY DESIGN AND IMPLEMENTATION PRINCIPLES IN AREAS OF TURMOIL
This section describes five main principles that could be adopted when conducting surveys in areas of turmoil.
Adaptive and Flexible Approach
Although dependent on the specific research objectives, survey research in areas of turmoil is generally characterized by an urgency to collect information quickly, increased mobility of the survey population in search of shelter or security, limited availability of field resources (e.g., interviewers) for research (often competing with relief work), rapidly changing events, and insecurity. To address many of these challenges, survey research practitioners are encouraged to
embrace an adaptive and flexible approach by changing the study design and protocols for all or a subgroup of survey participants as new information becomes available. For example, during the sample design phase, increased population mobility, deaths and injuries, as well as compromised infrastructure, cause frames and population data (if existent in the first place) to quickly become outdated, and accessibility of the population to be restricted (Barakat et al., 2002; Daponte, 2007; Deshmukh, 2006). Several methods have been used to address these challenges,2 including the use of Geographic Information System (GIS) data to stratify geographical areas based on their exposure to the event and their population displacement. Once stratified, areas where the population has moved can be oversampled. Since population movement could continue during the course of the survey implementation, a good predictive mobility model is needed and might require evaluation and revision during fieldwork as researchers learn more about the mobility of the affected population and as new events unfold. Such an adaptive sampling strategy has been used widely in survey research to address other sampling challenges, but can be effectively applied in areas of turmoil (Thompson and Seber, 1996); a minimal allocation sketch is given at the end of this subsection. Questionnaire development is another area that can benefit from an adaptive and flexible approach. Although pretesting is standard practice in survey research, when working in areas of turmoil there is often a great urgency to start collecting data, sometimes leading researchers to view pretesting as an added burden on the timeline (Mneimneh et al., 2008; Zissman et al., 2010). Pretesting, specifically sequential pretesting, however, can be particularly important for topics typically examined in areas of turmoil. Often, these topics tend to be context-specific (Bird, 2009) and are difficult to specify ahead of time or may be too sensitive to ask about without careful testing (Lawry, 2007). Sequential cognitive interviewing is one example of this approach, in which an initial pretest is used to identify
problems, the questionnaire is revised, and then an adapted cognitive protocol is used for another round of testing (Ghimire et al., 2013). Since elapsed time is a major constraint in such settings, smaller (e.g., 15 cases) sequential rounds of cognitive interviewing are recommended. Even conducting a few in-depth interviews with community members who are involved in or affected by the turmoil can uncover many potential problems or issues relating to the questionnaire content or survey implementation that the researchers may not have envisioned or anticipated. The unique and evolving nature of turmoil events can mean that important details about the event are not apparent until some data have been collected. In such situations, researchers might need to adapt the content of the questionnaire as they acquire new information. For example, Pham et al. (2005) note that respondents’ answers related to peace and justice might have been affected by the intensification of armed attacks during data collection in Northern Uganda. In such instances, adding new questions for respondents or local key community members might have captured the cause of some of these variations. Of course, deviation from standardized approaches and changing questionnaires midway through data collection could introduce context effects (Schwarz and Sudman, 1992). Such effects can be mitigated by adding any new questions to the end of the existing instrument or in a separate self-administered component. Flexibility in the structure of the field interviewing team and the timing of the data collection is another area of consideration. Different approaches may be needed for certain geographic segments or selected units to address safety concerns, logistical challenges, or lack of human resources. For example, employing teams of interviewers that visit a selected segment, conduct interviews over a short period of time, and then travel to another selected segment is often used as a strategy to address shortages of field staff or to limit repeated visits in insecure areas (Mneimneh
et al., 2008). Another team structure that can be used is to allocate multiple interviewers to the same selected unit. This approach can enhance interview privacy when interviewing respondents who have moved to crowded shelters or camps where concerns about privacy are common (Amowitz et al., 2002; Roberts et al., 2004). Multiple interviewers assigned to such settings are useful in occupying other members of the selected unit while the respondent is being interviewed or when multiple simultaneous interviews from the same unit are desired (Axinn et al., 2012). Such team approaches work best when interviewing is not constrained by matching interviewers to respondents based on a shared language. The timing of the data collection and mode of transportation might also need to be adapted across geographic areas to accommodate possible travel insecurities or access challenges in these areas. For example, all interviewing teams for the Salama et al. (2000) survey in Kosovo had to be accompanied by an international staff member and interviews had to be completed before a 4 p.m. curfew to ensure interviewer safety. In Indonesia after the 2004 tsunami and earthquake, helicopters and boats were used to reach a group of displaced populations who were temporarily housed on hilltops (Gupta et al., 2007). While these examples might lead to differential survey error across geographic segments, in situations of turmoil, difficult trade-offs often must be made. These trade-offs are not simply between survey error and cost, but also between survey error and the security and burden of interviewers and respondents.
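The adaptive, exposure-based stratification and oversampling idea described at the beginning of this subsection can be sketched as follows. The strata, displacement shares, and blending rule below are entirely hypothetical; the point is only to show how an allocation can be revised as the mobility model is updated.

```python
# Hypothetical sketch: allocate an oversample to strata with high estimated
# displacement, then revise the allocation as new mobility information arrives.

strata = {
    # stratum: estimated share of the population and of displaced persons
    "high_exposure":   {"pop_share": 0.20, "displaced_share": 0.60},
    "medium_exposure": {"pop_share": 0.30, "displaced_share": 0.30},
    "low_exposure":    {"pop_share": 0.50, "displaced_share": 0.10},
}

def allocate(strata, n_total, weight_on_displacement=0.5):
    """Blend a proportional-to-population allocation with a proportional-to-
    displacement allocation; weight_on_displacement = 0 gives a proportional
    design, 1 concentrates the sample where displacement is estimated highest."""
    alloc = {}
    for name, s in strata.items():
        share = ((1 - weight_on_displacement) * s["pop_share"]
                 + weight_on_displacement * s["displaced_share"])
        alloc[name] = round(n_total * share)
    return alloc

print(allocate(strata, n_total=2000))
# {'high_exposure': 800, 'medium_exposure': 600, 'low_exposure': 600}

# As fieldwork proceeds and the mobility model is updated, the same function
# can be re-run with revised displacement shares for the remaining sample.
```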
Mixed Method Approach
One of the most frequent challenges that survey researchers face in turmoil settings is the inability to access some geographic areas because they are regulated by certain groups (e.g., militia or humanitarian agencies) (Axinn et al., 2012; Lawry, 2007; Mneimneh
et al., 2008; Zissman et al., 2010), because the transportation infrastructure leading to them has been destroyed, or because travel to them is unsafe (Henderson et al., 2009). Such areas typically have higher exposure to the conflict or disaster event. Researchers sometimes deal with these challenges by excluding such areas from their sample or replacing them with other areas (Assefa et al., 2001; Mullany et al., 2007). However, such exclusions or replacements could lead to sampling bias (Beyrer et al., 2007), especially when the survey topic is related to the turmoil or its effects (as is commonly the case). One possible approach to address these limitations is to supplement survey data with other available data (Barakat et al., 2002). Such data sources could include media resources (Oberg and Sollenberg, 2011), hospitals, burial sites, battlefield eyewitness accounts, artillery power projections (Tapp et al., 2008), graveyard census data (Silva and Ball, 2008), demographic data (Guha-Sapir and Degomme, 2005), satellite imagery (Checchi et al., 2013), or humanitarian sources. Hoglund and Oberg (2011) discuss the importance of ‘source criticism’ as a method to evaluate data sources and identify their possible biases. While relying solely on such data sources is not recommended because of their own biases (Oberg and Sollenberg, 2011; Spagat et al., 2009), combining them with survey data could capitalize on their specific strengths while minimizing the possible survey biases. One possible source that could supplement survey data is social media data. Social media data can be very cost-effective and may provide more honest reporting of certain events as there is no interviewer involved. Moreover, these firsthand reports and comments may provide a safer environment for people to voice their opinions or report on certain sensitive events. However, social media data have their own limitations and biases, including coverage error and self-selection. The American Association for Public Opinion Research has recently published a report titled Social Media in
Public Opinion Research, in which the authors discuss the advantages and disadvantages of social media and how to design studies to better integrate survey data and social media data (Murphy et al., 2014).
Technological Approach
The destructive nature of armed conflicts and natural disasters renders frames and population data (if existent in the first place) quickly outdated (Barakat et al., 2002; Daponte, 2007). Recently, researchers have capitalized on technological advances to address this challenge. Geographic Information System (GIS), satellite imagery (including Google Earth), and mobile technology have been used to create frames and to track mobile populations including refugees and internally displaced populations (IDP) (Bengtsson et al., 2011; Galway et al., 2012). Satellite imagery is becoming relatively easy and inexpensive to acquire and, when it is used in combination with Global Positioning System (GPS) devices, the time and effort needed to locate selected units are drastically reduced. This is critical in turmoil settings where the availability of up-to-date and accurate maps is limited (Kolbe and Muggah, 2010; Lawry, 2007) and the safety of interviewers is a concern (Herrmann et al., 2006; Lawry, 2007). However, the use of such technology and devices has its own challenges. First, when using GPS devices, security issues need to be carefully assessed. In areas with tight security or under the control of certain militia groups, sending interviewers into these areas with such devices may jeopardize their safety. Second, when using GIS information to construct frames and select samples, assumptions about grid-cell occupation probabilities need to be made that may not necessarily reflect the true population distribution (Galway et al., 2012); a simple occupancy-weighted cell-selection sketch is given at the end of this subsection. Third, the resolution, quality, and age of the imagery might vary from one location to another, which could introduce bias if this variation is associated
with the outcome of interest. Fourth, rater error could be introduced if different raters use different standards when reading and enumerating satellite images. Fifth, in turmoil settings characterized by continuous and rapid population mobility, the satellite images themselves might be outdated, especially for highly affected areas. To address this, Adams and Friedland (2011) discussed the use of unmanned aerial vehicles (UAVs) in a disaster setting to acquire up-to-date images. Though UAVs allow for rapid data acquisition, their use still requires more research and needs to be carefully evaluated, especially in armed conflict settings where security issues are common. In addition to UAVs, mobile phone network data have also been used to address the high population mobility in disaster settings. Bengtsson et al. (2011) tracked the daily positions of Subscriber Identity Module (SIM) cards from 42 days before the earthquake in Haiti until 158 days after. However, the use of mobile technology to track populations in turmoil has its own limitations. While some limitations are common to non-disaster settings, such as geographic and demographic coverage differences, clustering, and duplication of sampling units (Blumenstock and Eagle, 2010), others are unique to turmoil settings. The most concerning challenge is the destruction of mobile towers, especially in heavily affected areas that witness the greatest population mobility. In spite of these limitations, the use of technological advances to address sampling challenges in turmoil settings seems promising, especially when combined with and supplemented by other sources of information. Another area where technological advances can greatly aid researchers is during data collection. Researchers conducting surveys in areas of turmoil may consider using computer-assisted interviewing (CAI) for data collection, as it is often critical to receive data in a timely manner, but may be leery of equipping field workers with cumbersome (and often expensive) laptops that could jeopardize the
safety of interviewers. Using mobile devices, like tablets or smartphones, for data collection in these settings has become feasible over the past several years. First, mobile technology has advanced, in terms of processing power and ease of use (i.e., touchscreens) that make it easier to deploy and use complex survey software or applications on mobile devices. Second, mobile devices are often less expensive, have longer battery life, and are smaller in size. The latter can be important for interviewer safety; smaller devices can be hidden from view, therefore less conspicuous than laptops, possibly reducing the risk of theft. Third, unlike paper surveys, the data on a mobile device can be encrypted both on the device and during transmission. Researchers have employed innovative approaches for securing data such as sending data via secure short message service (SMS) (Simons and Zanker, 2012). Fourth, data collection software and applications have been and continue to be developed specifically for mobile devices. Companies like Global Relief Technology offer services for data collection via satellite-linked mobile devices and web-interfaced data monitoring (see http:// grt.com) and have used these across a number of disaster settings, such as after the Haiti earthquake. Open Data Kit (ODK) Collection is an open-source (i.e., free) suite of software tools that helps organizations author, field, and manage mobile data collection. These products were developed by the University of Washington’s Department of Computer Science and Engineering department. ODK has been used by many NGOs and academic organizations for data collection in both conflict and disaster settings (http://opendatakit. org/about/deployments/).
Local Partnership and Community Engagement Approach Local partnership and community engagement are critical during the preparation, implementation, and post production phases
Surveys in Societies in Turmoil
of a survey in areas of turmoil. In fact, in certain conflict and disaster situations, having a local partner on the research team who is familiar with the political and social dynamics and is well-connected to the community may be necessary to obtain permission to conduct a survey (Eck, 2011; Hoglund, 2011). Gaining access to certain geographical areas that otherwise would be excluded from the sample (possibly introducing bias) sometimes requires engaging and getting approval from local groups such as government authorities, faction leaders, peace- keeping groups, aid agencies, or village leaders (Axinn et al., 2012; Deshmukh, 2006; Mneimneh et al., 2008; Zissman et al., 2010). In a long term longitudinal study in rural Nepal, leaders of groups in armed conflict toured the study’s research facilities and audited the field procedures to ensure that the appropriate steps were taken for respondent confidentiality (Axinn et al., 2012). Local partners can also provide advice on the best way to obtain appropriate permissions and who to approach (Eck, 2011). Hoglund and Oberg (2011) recommend identifying multiple ‘gatekeepers’ to have multiple access points and ensure that the sample is not biased by relying on few individuals. However, in extremely polarized conflict settings this might be difficult to achieve since getting approval from one group might exclude contact with another group because of political allegiance. Local partners often also have better contextual understanding of the setting, the population, and the dynamics of the conflict. Collaborating with such partners early in the survey process can help in developing new instruments or adapting existing instruments to fit the local conditions. This is quite important for many conflict and disaster events that are unique and require new instrumentation (Yamout and Jabbour, 2010). To provide important contextual background, researchers often recommend conducting interviews with community leaders, key informants, and experts or officers in charge of disaster response, and using
183
this information to supplement survey data from community samples (King, 2002). During the data collection phase, engaging the community and specifically recruiting local trusted members as interviewers or guides can be important in establishing rapport and trust with survey respondents and in gaining their cooperation (Haer and Becher, 2012; Henderson et al., 2009; Lawry, 2007; Mullany et al., 2007). However, engaging community members as interviewers can possibly increase interviewer error (Haer and Becher, 2012), necessitating special trainings on the importance of maintaining neutrality in the interview process. We discuss this further below. Establishing relationships with local partners and ‘gatekeepers’ becomes also important to assure the safety of the interviewers. In situations where safety is a concern, it is advisable to provide interviewers with local contacts in all of the areas they are visiting in case they encounter problems. When conducting interviews in highly controlled and insecure areas in Pakistan and Colombia, researchers hired local interviewers and collaborated with the gatekeepers of those areas to provide access and to guarantee the safety of the interviewers. In one case, those gatekeepers were asked to sign a document guaranteeing that the interviewers were under their protection.3 Building and maintaining local relationships are also important for facilitating the dissemination of results and policy impact (Zissman et al., 2010). Hoglund and Oberg (2011) discuss how peace research should move beyond the principle of ‘doing no harm’ to ‘doing good’, by seeking policyrelevant perspectives that benefit the community under study. Finally, before making decisions with whom to partner, researchers are encouraged to gather information on the public perception of the potential local partners (Eck, 2011; Hoglund, 2011). Public perceptions of local partners could affect the measurement and nonresponse error properties of the survey.
184
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
Neutrality Focused Approach Armed conflicts and disasters will create an emotionally charged atmosphere. People’s perceptions and interpretations of events and interactions become highly susceptible to contextual factors (Goodhand, 2000; Yamout and Jabbour, 2010). Affiliating the study with specific partners (such as political groups or aid agencies) may affect respondents’ propensity to respond and influence respondents’ answers affecting the validity of the data (Barakat et al., 2002; Larrance et al., 2007; Mneimneh et al., 2008). Respondents might underreport or exaggerate events depending on their perception of the relative cost and benefit of participation (Amowitz et al., 2002; Larrance et al., 2007; Mullany et al., 2007). Adopting a neutral approach in all study introductions, scripted materials, or media campaigns related to the survey is encouraged. Emphasizing the importance of neutrality when hiring and training interviewers is also essential (Checchi and Roberts, 2008; Mneimneh et al., 2008). Interviewer allegiance to certain groups or their experience with community or relief work might hamper their objectivity in the recruitment or interviewing of respondents. Kuper and Gilbert (2006) reported how some interviewers may seek out highly affected members of the population resulting in selection biases. Training protocols need to highlight how personal preferences, even when motivated by potentially good intentions, could lead to possible survey bias. Special training on how to address sensitive topics related to the conflict or disaster situation with respondents is often needed. For example, to avoid possible retraumatization of respondents, Amowitz et al. (2002) provided interviewers in Sierra Leone with ‘extensive sensitization training’ by experts in sexual violence. In certain situations, it might be difficult for the researcher to control the hiring and training of interviewers. Lawry (2007) reports being ‘given’ several unqualified data
collectors by the local authorities in Sudan; these were obvious ‘minders’ and could not be fired. In such situations, researchers are encouraged to find other field or office tasks for such individuals and avoid having them do any interviews, if feasible.
ETHICAL CONSIDERATIONS AND DOCUMENTATION Researchers conducting surveys in areas of turmoil need to give special attention to certain ethical considerations related to both the target population and the research team. The trade-off between the potential value of the survey and potential risks to those involved in the survey need to be weighed carefully before deciding whether a survey is even warranted. Approval from local (if existent) or non-local ethics review committees should always be sought (see Cross-cultural Survey Guidelines at http://ccsg.isr.umich.edu for a list of resources on ethical guidelines for human subject research).
Ethical Considerations for the Target Population Populations affected by conflict or disaster often experience significant amounts of stress because of direct or indirect exposure to the traumatic event and its repercussions. When interviewing such vulnerable populations, researchers need to evaluate possible risks of undue stress brought on by asking respondents about the trauma. Some researchers argue that asking individuals questions about a traumatic event may cause harm to the individual (Goodhand, 2000), others assert that respondents can have an overall positive gain from participating in the survey (Griffin et al., 2003), and yet others maintain that there are varying levels of risk dependent on the individual’s experiences and demographic background (Newman and Kaloupek, 2004).
Surveys in Societies in Turmoil
While the literature does not provide a clear answer, one can conclude that asking individuals who have experienced trauma details about the traumatic events can be problematic and care must be taken. In these cases, ethical review boards will often require researchers to develop protocols on handling possible respondent distress. Such protocols include providing psychological support for respondents showing signs of significant distress. However, researchers need to be careful about how respondents perceive such support and they need to ensure that respondents do not view their participation as an exchange for mental or physical services (Goodhand, 2000; Yamout and Jabbour, 2010). Conversely, it must also be made clear that any type of relief aid provided will not be jeopardized by participation in research (Larrance et al., 2007). For example, Swiss et al. (1998) found that some respondents refused to initially participate in their study out of fear of losing access to humanitarian aid. The mere participation in surveys might also cause undue respondent burden if the same respondents are inadvertently selected for multiple surveys. This might happen especially in settings that receive wide media coverage (Fleischman and Wood, 2002; Richardson et al., 2012). Researchers are encouraged to inquire about the survey climate and collaborate with other interested researchers to avoid duplication of work and undue burden to respondents. In addition, researchers employing interviewers who belong to selected professions (e.g., community health, peace workers, nurses) that could be providing needed relief work are highly encouraged to consider whether diverting these human resources away from such relief work is well-founded.
Ethical Considerations for the Researchers Ethical concerns also extend to the research team itself. In both conflict and disaster
185
areas, the field staff can be at an increased risk for harm (Brownstein and Brownstein, 2008; Goodhand, 2008; Henderson et al., 2009; Lawry, 2007) and ongoing evaluations of safety must be made throughout the data collection period. It is not only physical harm but also psychological distress that needs to be monitored. Field staff are likely to face stressful conditions, witness death and wide spread destruction, or be exposed to the traumatic event itself (Lawry, 2007; Richardson et al., 2012). Counseling services that are made available to respondents should also be made available to researchers and field staff. Potential harm to respondents and the research team does not necessarily end with data collection. Researchers need to be aware of how and to whom their findings will be disseminated and used. Publically available results can be misused, manipulated, or censored by various organizations such as those involved in the conflict or controlling relief support (Thoms and Ron, 2007).
Documentation Many surveys conducted in areas of turmoil suffer from the lack of complete, transparent, and detailed documentation of the methods used, challenges faced, and what methods were found to be effective. Researchers conducting such surveys are encouraged to adopt a standard practice of documentation that will allow for better interpretation of the results and guide future researchers. If not published, such documentation could be made available in a report format and archived so that they are accessible to other researchers.
CONCLUSION Armed conflicts and natural disasters often create an urgent need for data to measure the demographic, social, economic, and health
186
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
consequences to plan needed interventions. One of the common methods for collecting such data is survey research. The conditions that make collecting data so critical also make designing and implementing surveys more challenging. While researchers might not be able to fully control all the challenges encountered, they are encouraged to adopt a number of research principles that can minimize and adjust for possible survey errors introduced by those challenges. These principles include: (1) maintaining an adaptive and flexible approach; (2) adopting a mixed method approach; (3) implementing technological advances; (4) establishing local partnership and engaging the community; and (5) reinforcing the notion of neutrality among all members of the research team and in all the study materials. Ethical concerns add another layer of complexity in such settings. The trade-offs that survey researchers often need to make are not just between survey error and cost, but also between survey error and the security and burden of interviewers and respondents. Researchers conducting surveys in areas of turmoil need to give special attention to such ethical considerations related to both the target population and the research team. Lastly, researchers need to document the details of their methods, the challenges encountered, and the ways these were addressed to guide the interpretation of results and future research in areas of turmoil.
NOTES 1 We focus on face-to-face interviewing as the main data collection mode as it is the most commonly used mode in conflict and disaster settings due to the damage in communication networks and lack of electricity that hinders the use of other modes. 2 Among the methods discussed in the literature is the use of multiple frames including UNICEF immunization lists (Scholte et al., 2004), food distribution lists (Salama et al., 2000), United Nations agencies and NGOs lists (Husain et al., 2011), school records (La Greca, 2006), and timelocation sampling (Quaglia and Vivier, 2010).
Readers could refer to Mneimneh et al. (2014) and Pennell et al. (2014) for a more detailed description. 3 Personal communication with field staff and principal investigators of surveys conducted in Pakistan and Colombia.
RECOMMENDED READINGS Henderson et al. (2009) provide an overview of research approaches used to conduct postdisaster research. The authors also specifically discuss advantages and disadvantages of survey design and how traditional survey methods often did not apply in the post Hurricane Katrina context. The book Understanding Peace Research: Methods and Challenges (Hoglund and Oberg, 2011) provides a comprehensive overview of several methods used to collect and capture information in peace and conflict settings including surveys, focus groups, in-depth interviews, and news media. It also discusses the challenges encountered by the different methods when used in such contexts. The chapter by Mneimneh et al. (2014) in Conducting Surveys in Areas of Armed Conflict describes in detail the challenges encountered when designing and conducting surveys in conflict settings. The chapter also discusses design and implementation principles that address these challenges and ends with recommended methodological research and practical directions. The chapter by Pennell et al. (2014) in Disaster Research: Surveying Displaced Populations describes the difficulties encountered when designing and conducting survey research in areas affected by disasters. The chapter also provides practical and research recommendations for conducting survey research in this context.
REFERENCES Adams, S. M., and Friedland, C. J. (2011). A survey of unmanned aerial vehicle (UAV) usage for imagery collection in disaster research and management. In 9th International
Surveys in Societies in Turmoil
Workshop on Remote Sensing for Disaster Response, September 2011. Amowitz, L. L., Reis, C., Lyons, K. H., Vann, B., Mansaray, B., Akinsulure-Smith, A. M., Taylor, L., and Iacopino, V. (2002). Prevalence of war-related sexual violence and other human rights abuses among internally displaced persons in Sierra Leone. JAMA: The Journal of the American Medical Association, 287(4), 513–521. Assefa, F., Jabarkhil, M. Z., Salama, P., and Spiegel, P. (2001). Malnutrition and mortality in Kohistan District, Afghanistan, April 2001. JAMA: The Journal of the American Medical Association, 286(21), 2723–2728. Axinn, W. G., Ghimire, D., and Williams, N. (2012). Collecting survey data during armed conflict. Journal of Official Statistics, 28(2), 153–171. Barakat, S., Chard, M., Jacoby, T., and Lume, W. (2002). The composite approach: research design in the context of war and armed conflict. Third World Quarterly, 23(5), 991–1003. Bengtsson, L., Lu, X., Thorson, A., Garfield, R., and Von Schreeb, J. (2011). Improved response to disasters and outbreaks by tracking population movements with mobile phone network data: a post-earthquake geospatial study in Haiti. PLoS Medicine, 8(8), e1001083. Beyrer, C., Terzian, A., Lowther, S., Zambrano, J. A., Galai, N., and Melchior, M. K. (2007). Civil conflict and health information: The Democratic Republic of Congo. In C. Beyrer and H. Pizer (eds), Public Health and Human Rights: Evidence-based Approaches (pp. 268–288). Baltimore, OH: Johns Hopkins University Press. Bird, D. K. (2009). The use of questionnaires for acquiring information on public perception of natural hazards and risk mitigation – a review of current knowledge and practice. Natural Hazards and Earth System Sciences, 9, 1307–1325. doi:10.5194/nhess-9-13072009. Blumenstock, J., and Eagle, N. (2010). Mobile divides: gender, socioeconomic status, and mobile phone use in Rwanda. In Proceedings of the 4th ACM/IEEE International Conference on Information and Communication Technologies and Development, December (p. 6). New York: ACM.
187
Brownstein, C. A., and Brownstein, J. S. (2008). Estimating excess mortality in post-invasion Iraq. New England Journal of Medicine, 358(5), 445–447. Checchi, F., and Roberts, L. (2008). Documenting mortality in crises: what keeps us from doing better? PLoS Medicine, 5(7), e146. Checchi, F., Stewart, B. T., Palmer, J. J., and Grundy, C. (2013). Validity and feasibility of a satellite imagery-based method for rapid estimation of displaced populations. International Journal of Health Geographics, 12(1), 4. Daponte, B. O. (2007). Wartime estimates of Iraqi civilian casualties. International Review of the Red Cross, 89(868), 943–957. Deshmukh, Y. (2006). LRRD1 Methodology: Long Term Perspectives on the Response to the Indian Ocean Tsunami 2004: A Joint Evaluation of the Links Between Relief, Rehabilitation and Development (LRRD). Stockholm: SIDA. Eck, E. (2011). Survey research in conflict and post-conflict societies. In K. Hoglund and M. Oberg (eds), Understanding Peace Research Methods and Challenges (pp. 165–182). Routledge, New York. Fleischman, A. R., and Wood, E. B. (2002). Ethical issues in research involving victims of terror. Journal of Urban Health, 79, 315–321. Galway, L. P., Bell, N., Al S, S. A. E., Hagopian, A., Burnham, G., Flaxman, A., Weiss, W.M., Rajaratnam, J., … and Takaro, T. K. (2012). A two-stage cluster sampling method using gridded population data, a GIS, and Google EarthTM imagery in a population-based mortality survey in Iraq. International Journal of Health Geographics, 11(1), 12. Ghimire, D.J., Chardoul, S., Kessler, R. C., Axinn, W. G., and Adhikari, B. P. (2013). Modifying and validating the Composite International Diagnostic Interview (CIDI) for use in Nepal. International Journal of Methods in Psychiatric Research, 22(1), 71–81. Goodhand, J. (2000). Research in conflict zones: ethics and accountability. Forced Migration Review, 8(4), 12–16. Griffin, M. G., Resick, P. A., Waldrop, A. E., and Mechanic, M. B. (2003). Participation in trauma research: Is there evidence of harm? Journal of Traumatic Stress, 16(3), 221–227. Guha-Sapir, D., and Degomme, O. (2005). Darfur: counting the deaths. Report, Center
188
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
for Research on the Epidemiology of Disasters, 26, Mortality estimates from multiple survey data. Brussels, CRED, 2005. Gupta, S. K., Suantio, A., Gray, A., Widyastuti, E., Jain, N., Rolos, R., et al. (2007). Factors associated with E. Coli contamination of household drinking water among tsunami and earthquake survivors, Indonesia. American Journal of Tropical Medicine and Hygiene, 76(6), 1158–1162. Haer, R., and Becher, I. (2012). A methodological note on quantitative field research in conflict zones: get your hands dirty. International Journal of Social Research Methodology, 15(1), 1–13. Henderson, T. L., Sirois, M., Chen, A. C., Airriess, C., Swanson, D. A., and Banks, D. (2009). After a disaster: lessons in survey methodology from Hurricane Katrina. Population Research and Policy Review, 28, 67– 92. doi: 10.1007/s11113-008-9114-5. Herrmann, M. J., Brodie, M., Morin, R., Blendon, R., and Benson, J. (2006). Interviewing in the face of disaster: conducting a survey of Hurricane Katrina evacuees. Public Opinion Pros. Retrieved from http://www. publicopinionpros.norc.org/from_field/2006/ oct/herrmann.asp Hoglund, K. (2011). Comparative field research in war-torn societies. In K. Hoglund and M. Oberg (eds), Understanding Peace Research Methods and Challenges (pp. 114–129). Hoglund, K. and Oberg, M. (eds). New York: Routledge. Hoglund, K., and Oberg, M. (eds) (2011). Understanding Peace Research: Methods and Challenges. London: Taylor & Francis. Husain, F., Anderson, M., Lopes Cardozo, B., Becknell, K., Blanton, C., Araki, D., and Kottegoda Vithana, E. (2011). Prevalence of war-related mental health conditions and association with displacement status in postwar Jaffna District, Sri Lanka. JAMA: The Journal of the American Medical Association, 306(5), 522–531. King, D. (2002). Post disaster surveys: experience and methodology. Australian Journal of Emergency Management, 17(3), 1–13. Kolbe, A., and Muggah, R. (2010). Surveying Haiti’s post-quake needs: a quantitative approach. Humanitarian Practice Network, (48). Retrieved from http://www.odihpn.org/ humanitarian-exchange-magazine/issue-48/
surveying-haitis-post-quake-needs-a- quantitative-approach. Accessed on 13 June 2016. Kuper, H., and Gilbert, C. (2006). Blindness in Sudan: is it time to scrutinize survey methods? PLoS Medicine, 3(12), e476. La Greca, A.M. (2006). School populations. In F. Norris, S. Galesto, D. Reissman, & P. Watson (eds), Research methods for studying mental health after disasters and terrorism: Community and public health approaches. New York: Guilford Press. Larrance, R., Anastario, M., and Lawry, L. (2007). Health status among internally displaced persons in Louisiana and Mississippi travel trailer parks. Annals of Emergency Medicine, 49(5), 590–601. Lawry, L. (2007). Maps in the sand. Public Health and Human Rights: Evidence-based Approaches (pp. 243–247). Baltimore: John Hopkins University. Mneimneh, Z.N. Axinn, W.G., Ghimire, D., Cibelli, K.L., and Alkaisy, M.S. (2014). Conducting Surveys in Areas of Armed Conflict. Hard-to-Survey Populations. Cambridge: Cambridge University Press. Mneimneh, Z., Karam, E. G., Karam, A. N., Tabet, C. C., Fayyad, J., Melhem, N., and Salamoun, M. (2008). Survey design and operation in areas of war conflict: the Lebanon wars surveys experience. Paper presented at the International Conference on Survey Methods in Multinational, Multiregional, and Multicultural Contexts, Berlin, Germany. Mullany, L. C., Richards, A. K., Lee, C. I., Suwanvanichkij, V., Maung, C., Mahn, M., Beyrer, C. Lee, T. J. (2007). Population-based survey methods to quantify associations between human rights violations and health outcomes among internally displaced persons in eastern Burma. Journal of Epidemiology and Community Health (1979-), 61(10), 908–914. Murphy, J., Link, M., Hunter Childs, J., Langer Tesfaye, C., Dean, E., Stern, M., Pasek, J., Cohen, J., Callegaro, M., and Harwood, P. (2014). American Association for Public Opinion Research. Social Media in Public Opinion Research: Report of the AAPOR Task Force on Emerging Technologies in Public Opinion Research. Retrieved from: http:// www.aapor.org/Social_Media_Task_Force_ Report.htm
Surveys in Societies in Turmoil
Newman, E., and Kaloupek, D. G. (2004). The risks and benefits of participating in traumafocused research studies. Journal of Traumatic Stress, 17(5), 383–394. Oberg, M., and Sollenberg, M. (2011). Gathering conflict information using news resources. In K. Hoglund and M. Oberg (eds), Understanding Peace Research Methods and Challenges (pp. 47–73). Routledge, New York. Pennell, B., Deshmukh, Y., Kelley, J., Maher, P., Wagner, J., and Tomlin, D. (2014). Disaster Research: Surveying Displaced Populations. Hard-to-Survey Populations. Cambridge: Cambridge University Press. Pham, P., Vinck, P., and Stover, E. (2005). Forgotten voices: a population-based survey of attitudes about peace and justice in Northern Uganda. SSRN eLibrary. Retrieved from http://papers.ssrn.com/sol3/papers. cfm?abstract_id=1448371 Quaglia, M., and Vivier, G. (2010). Construction and field application of an indirect sampling method (time-location sampling): an example of surveys carried out on homeless persons and drug users in France. Methodological Innovations Online, 5(2), 17–25. doi: 10.4256/mio.2010.0015. Richardson, R. C., Plummer, C. A., Barthelemy, J. J., and Cain, D. S. (2012). Research after natural disasters: recommendations and lessons learned. Journal of Community Engagement and Scholarship, 2(1), 3–11. Roberts, L., Lafta, R., Garfield, R., Khudhairi, J., and Burnham, G. (2004). Mortality before and after the 2003 invasion of Iraq: cluster sample survey. The Lancet, 364(9448), 1857–1864. Salama, P., Spiegel, P., Van Dyke, M., Phelps, L., and Wilkinson, C. (2000). Mental health and nutritional status among the adult Serbian minority in Kosovo. JAMA: The Journal of the American Medical Association, 284(5), 578–584. Scholte, W. F., Olff, M., Ventevogel, P., de Vries, G.-J., Jansveld, E., Cardozo, B. L., and Crawford, C. A. G. (2004). Mental health symptoms following war and repression in eastern Afghanistan. JAMA: the Journal of the American Medical Association, 292(5), 585–593. Schwarz, N., and Sudman, S. (1992). Context Effects in Social and Psychological Research. New York: Springer.
189
Silva, R., and Ball, P. (2008). The demography of conflict related mortality in Timor-Leste (1974–1999): reflections on empirical quantitative measurement of civilian killings, disappearances, and famine-related deaths. In J. Asher, D. L. Banks, and F. Scheuren (eds) Statistical Methods for Human Rights (pp. 117–140). New York: Springer. Simons, C., and Zanker, F. (2012). Finding the Cases That Fit: Methodological Challenges in Peace Research (No. 189). GIGA Paper No 189. Retrieved from: http://ssrn.com/ abstract=2134886 or http://dx.doi. org/10.2139/ssrn.2134886. Accessed on 13 June 2016. Spagat, M., Mack, A., Cooper, T., and Kreutz, J. (2009). Estimating war deaths. Journal of Conflict Resolution, 53(6), 934–950. Swiss, S., Jennings, P. J., Aryee, G. V., Brown, G. H., Jappah-Samukai, R. M., Kamara, M. S., Schaack, R.D., Turay-Kanneh, R. S. (1998). Violence against women during the Liberian civil conflict. JAMA: The Journal of the American Medical Association, 279(8), 625–629. Tapp, C., Burkle, F. M., Wilson, K., Takaro, T., Guyatt, G. H., Amad, H., and Mills, E. J. (2008). Iraq War mortality estimates: a systematic review. Conflict and Health, 2(1), 1. Thompson, S. K., Seber, G. A. F., and others (1996). Adaptive Sampling. Wiley New York. Retrieved from http://www.mathstat. helsinki.fi/msm/banocoss/2011/Presentations/Thompson_web.pdf Thoms, O. N. T., and Ron, J. (2007). Public health, conflict and human rights: toward a collaborative research agenda. Conflict Health, 1(11). Yamout, R., and Jabbour, S. (2010). Complexities of research during war: lessons from a survey conducted during the summer 2006 war in Lebanon. Public Health Ethics, 3(3), 293–300. Zissman, M. A., Evans, J. E., Holcomb, K. T., Jones, D. A., Schiff, A. C., et al. (2010). Development and Use of a Comprehensive Humanitarian Assessment Tool in Post- Earthquake Haiti. Lexington, MA: MIT Lincoln Laboratory. Retrieved from http://oai. dtic.mil/oai/oai?verb=getRecord&metadata Prefix=html&identifier=ADA534969.
PART IV
Measurement
14 What Does Measurement Mean in a Survey Context? Jaak Billiet
INTRODUCTION I still remember very well my first contact with the term ‘measurement’ in the context of surveying opinions, beliefs, values, and attitudes. This was in one of the earliest lectures in ‘Methods of sociological research’ in the first bachelors’ year in 1967. Because of my previous training in theology and philosophy I could not imagine how it was possible to measure subjective states, or not observable phenomena. I did not realize at the time that measuring theoretical concepts, and assessing the validity of these, would fill a big part of my life. The idea of measurement in (social) science becomes reasonable if it is defined as rules for assigning symbols to objects in two ways: not only as to represent quantities of attributes numerically (e.g. Stevens, 1946: 677–678), but also to define whether objects fall in the same or in a different categories with respect to a given attribute (Nunnally and Bernstein, 1994).
Stevens’ distinction between four levels of measurement is crucial in survey measurement. It specifies what assumptions about measured attributes are made when a measurement scale is applied. Is the meaning of an attribute (e.g. age), and thus its level of measurement, fixed or is it dependent of the theoretical meaning of the underlying concept that it intends to measure? After all, the measure of a concept is not the concept itself, but one of several possible error-filled ways of measuring it. I am still amazed how measurement of theoretical concepts as such is possible. This amazement is not at all strange. Diverging opinions about measuring concepts circulate among philosophers of science and social methodologists (Hox, 1997: 50–53). These ideas are not without consequences for the way theoretical concepts are operationalized, i.e. translated into observable variables, and for the validity of inferences made from the variables measured and
194
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
relationships between these. This chapter therefore starts with some reflections on the operationalization of theoretical concepts and on the distinction between conceptual (or theoretical) validity and measurement validity. Theoretical concepts are embedded in research questions. The translation of these concepts into constructs, observable variables by means of measurable empirical indicators, and latent variables, is not a typical feature of survey measurement. It is a feature of all empirical research, even of the socalled ‘qualitative’ approaches (King et al., 1994: 151–153). The step from operationalized construct to measurement is thoroughly discussed in the second part of this chapter. It is built around a concrete application of operationalization of a theoretical construct in a typical survey context. Typical features of survey measurement are that the measures are performed within large samples of (mostly) individuals by means of standardized instruments (e.g. questionnaires). The elaborated example used in the section ‘From operationalized construct to measurement’ below is taken from the ‘social legitimacy of the welfare state’ module in Round 4 of the European Social Survey (ESS) in 2008. The idea that there is no measurement without error (or effects) coming from the methods used is basic in survey measurement. Method effects, whether or not identified as error, are always a threat to the validity of inferences based on survey measures. An overview of strategies used to reduce (Sudman and Bradburn, 1982; Foddy, 1993) or to measure errors (Groves, 1989: 4; Scherpenzeel and Saris, 1997) should naturally complement the two parts of this chapter. It is, however, impossible to deal with this within the scope of this chapter. I therefore refer for these aspects of survey measurement to Chapters 3, 16, and 17 in this Handbook, and to other specific work on it (Billiet and Matsuo, 2012; Saris and Gallhofer, 2007).
FROM CONCEPT TO MEASUREMENT: CONCEPTUAL AND MEASUREMENT VALIDITY The way from theoretical concepts to appropriate survey questions as valid indicators for the measurement of these concepts is long. This process of operationalization starts with the formulation of research questions about interconnected concepts that are inspired by existing theory, by previous research, or simply by practical concerns. After agreement between researchers on the intended meaning of a concept and the relevant subdomains of its meaning, researchers can move to the next stage. This is the search for empirical indicators that are appropriate and potentially internally valid for the intended concept (Hox, 1997: 47–48). Compared with the very extensive study of question wording effects and measurement error (see Billiet and Matsuo, 2012), very little attention has been paid to the problem of translating concepts into questions (Saris and Gallhofer, 2007: 15). The chapter of Hox (1997: 47–69) in the edited volume Survey Measurement and Process Quality of Lyberg et al. (1997), and the three chapters of Saris and Gallhofer (2007) devoted to a three way procedure to design requests for an answer in the more recent work Design, Evaluation, and Analysis of Questionnaires for Survey Research are among the rare exceptions.
Distinguishing Terms: Concept, Construct, Observed Variable, and Latent Variable Social scientists operate on two levels between which they ‘shuttle back and forth’: the theory–hypothesis–construct level and the observation (or measurement) level (Kerlinger, 1973: 28). Theoretical concepts and constructs have to be operationalized before they can be measured. The terms ‘concept’ and ‘construct’ are often used interchangeably since both are theoretical
What Does Measurement Mean in a Survey Context?
abstractions, but there is an important distinction. A concept expresses an abstraction formed by generalizations from particulars. A construct is a concept that has been deliberately and consciously invented or adopted for special scientific purpose, and is thus a more systematically and formally defined concept (Hox, 1997: 48–49; Kerlinger, 1973: 29). A construct refers to a ‘theorized psychological construct’. Constructs are often somewhat loosely called ‘variables’ indicating that it is something that varies. More precisely, a construct is a symbol to which (numerical or categorical) values are assigned as a result of a measurement operation (Kerlinger, 1973: 29). Both concepts and constructs must be connected to observed or to latent variables that specify or clarify to which empirical variables they are linked, and how the values of these empirical variables are assigned (Hox, 1997: 49). This process is often seen as a translation process in which theoretical constructs are translated into observable variables that are assumed to be valid representations of the constructs. The distinction between concept and construct made by Kerlinger relies on the context in which it functions: a theoretical account or an empirical research context. Saris and Gallhofer see it somewhat differently in their work on the design, evaluation, and analysis of questions for survey research (Saris and Gallhofer, 2007). Following Blalock (1990) and Northrop (1947), they propose a three step procedure for designing ‘requests for an answer’. These authors directly focus on the aim of survey questions by so naming them, i.e. measurement. These steps are (1) specification of a concept-by-postulation in concepts-by-intuition; (2) transformation of concepts-by-intuition in statements indicating the requested concept, and (3) transformation of each statement into a question (or request for an answer). The focus in this part is on the first two steps. Concepts-byintuition are simple concepts whose meaning is immediately obvious while concepts-bypostulation are less obvious and require
195
explicit definitions in a research context. In this terminology, constructs are thus concepts-by-postulation (Saris and Gallhofer, 2007: 15–16). Constructs are composed of intuitive concepts that are convertible into observable indicators which are suitable for the construction of empirical variables. These two kinds of concepts are linked by epistemic correlations (Northrop, 1947: 119),1 in a way analogous to Carnap’s (1956: 39, 47–48) correspondence rules for connecting the two languages. In the context of measurement models for concepts by postulation, the specified observable variables function as indicators for multiple indicator latent variables. Should we assume that latent variables are real entities or conceive these as useful fictions, constructed by the human mind (Borsboom et al., 2003: 204)? There are several ways to distinguish observable variables from latent variables. I propose to consider latent variables as empirical realizations of theoretical constructs. These are thus outcomes of measurement operations on observed variables (indicators). One can distinguish between nonformal and formal definitions of latent variables (Bollen, 2002: 607–615). Nonformal definitions of latent variables refer to the fact that these variables are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). This is why some scholars name them ‘hidden’ variables.2 Some authors consider latent variables ‘hypothetical’ constructs. A latent variable is then something that a scientist puts together in his or her imagination (Nunnally, 1978: 96). It is then something not real that comes out of the mind of the researcher. This idea contrasts with the view of latent variables as real traits. Edwards and Bagozzi (2000: 157) conciliate these two (philosophical) views by considering constructs not as real but as attempts to measure real phenomena (Bollen, 2002: 607– 608). I would rather specify that latent variables are outcomes of these attempts based
196
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
on mathematical models. Formal definitions of latent variables are the local independence definition (Borsboom et al., 2003: 203), the expected value definition, the definition of a latent variable as a nondeterministic function of observed variables, and the sample realization definition (see Bollen, 2002: 609–613).3 Most of the formal definitions apply to reflective indicators, and not to formative indicators of latent variables. This is not surprising since the standard measurement model in the measurement of latent variables is the model with reflective indicators. The latent variable is defined as a causal variable that affects a number of observable measures of its indicators (Saris and Gallhofer, 2007: 278–279). It assumes that a latent variable is more important than the responses to the items (indicators) that it reflects (Borsboom et al., 2003: 209).4 It is clearly stated by Borsboom et al. (2003: 209) that ‘the choice for the reflective model expresses entity5 realism with respect to the latent variable’. Conversely, latent variables measured by formative indicators are not conceptualized as determining the measurements of the indicators (i.e. item responses) but as a summary of these measurements. In a formative measurement model, the observed indicators (variables) determine (or define) the latent variable (Saris and Gallhofer, 2007: 291–292).6 A (entity) realist interpretation of a latent variable implies a reflective model, whereas constructivist or operationalist interpretations are more compatible with a formative model (Borsboom et al., 2003: 209–210).
The Link Between Theoretical Concepts/Constructs and Observed/Latent Variables Operationalization can be seen as a translation process in which theoretical constructs and constructs are translated into observable variables by specifying empirical indicators. In survey research, these indicators are mostly – but not always – survey questions
within the context of a questionnaire. There are in the philosophy of science several views on the link between a theoretical concept/construct and its operationalization into an observable, and the measurement of latent variables. These views are now shortly considered because of their relevance for our distinction between conceptual (theoretical) validity and measurement validity. The scientific/philosophical approach known as ‘operationism’ requires that every construct completely coincide with one single observed variable (Hox, 1997: 50). The implication is that there is no theoretical concept apart from its measurement. The measured construct is the concept. This has as a consequence that a different theoretical concept is measured whenever one term in the operational definition has been changed. In other words, if we change one word in a question text, then we are measuring another construct. This leads to hopeless problems if researchers attempt to generalize their results (Hox, 1997: 50). Moreover it prevents from the use of congeneric measures7 for measuring multiple indicator latent variables. Multiple indicators are evaluated solely for their capacity to provide an alternative for repeated measurements of the same construct. Repeated questions used in the context of multitraitmultimethod designed to assess validity and reliability are required to be strictly identical in wording. Only response scales are permitted to vary (De Wit and Billiet, 1995: 54). In this view a distinction between conceptual validity and measurement validity is meaningless since the theoretical construct is the measured operationalized variable. In (social) research it is more convenient to adapt a view in which the scope of a theoretical construct/concept is larger than the measurement of the operationalized construct. This is, as noted earlier, in line with Kerlinger’s (1973: 28) idea that the empirical researcher operates on two levels. Measurement is guided by theoretical constructs but these are in turn enriched through reflection on the obtained measures. The
What Does Measurement Mean in a Survey Context?
science/philosophical approach known as ‘logical positivism’ precisely distinguishes between a theoretical language and an observation language. This view originated in the late 1940s in the Vienna school, and is i.a. defended by Carnap (1956). The theoretical language contains assumptions about the theoretical constructs and their interrelations, while the observation language only consists of concepts that are operationally defined by formal operations, or that can be linked to these by formal logical or mathematical operations (Hox, 1997: 50). The terms in the theoretical language often refer to unobservable events and are thus much richer than the terms in an observation language (Carnap, 1956: 38). This opens the way to distinguish between conceptual validity and measurement validity. This view is adapted by famous scholars in social research that guide our work, like Lazarsfeld (1972), and lies also behind the previously discussed distinction between concepts-by-postulation and concepts-by-intuition (Northrop, 1947). The latter are (by convention among scientists) easily understandable terms in the observational language. Concepts-by-postulation are terms of a theoretical language and are embedded in a nomological network from which they derive their meaning. Contrary to operationalism, the sophisticated falsificationism of Popper (1934) accepted the idea of the two languages, but the view on the relation between both differs from logical positivism. First, one never can prove that a theory and the derived hypotheses are true. One only can, under specified conditions, falsify theories and hypotheses. Second, observations are always made in the language of some theory. The latter idea made falsification nearly impossible. Therefore, not only theories are subjected to criticism and testing but also observation statements. Observations that have survived critical examination are accepted by the scientific community – for the time being (Hox, 1997: 51–52). Applied to the relation between theoretical constructs and observed
197
(measured) variables, this vision supports the idea that reflection on measurement validity might affect the meaning of a theoretical construct. Negative results, or the rejection of measurement validity of an observed variable, can lead to the conclusion that the measurement instrument is invalid and should be renewed. It can also mean that the construct must be abandoned, or that it must be subdivided or merged with other constructs (De Groot, 1969). Validity assessment thus has implications at both the theoretical level and the measurement level.
Reflections on Conceptual and Measurement Validity The distinction between ‘conceptual validity’ and ‘measurement validity’ is not commonly used among social scientists. An internet search on the term ‘validity’ offers many thousands of hits for measurement validity in combination with (sub)types or kinds of validity that are all specified by a different adjective. The adjective ‘conceptual’ remains virtually invisible. The most commonly used kinds of validity are: content validity, predictive validity (concurrent or predictive), and construct validity (convergent, concurrent). These terms originated in the period where famous psychologists (Thurnstone, Likert, and Guttman) used these to describe measurement characteristics of newly developed attitude scales (O’Muircheartaigh, 1977: 10). These kinds of validity assessment appear in handbooks of social science methodology in general, and in specific handbooks of survey methodology (see, for example, Aday and Cornelius, 2006: 62–68). Directly searching online for ‘conceptual validity’ suggests primarily the term ‘construct validity’. This kind of validity commonly refers to the ability of a measurement tool (e.g. a test, etc.) to actually measure a psychological concept being studied (Cronbach and Meehl, 1955). In the introductory chapter of International Handbook
198
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
of Survey Methodology, the volume editors define construct validity as ‘the extent to which a measurement instrument measures the intended construct and produced an observation distinct from that produced by a measure of a different construct’ (Hox et al., 2008: 16). This definition has a double meaning: one substantial and one operational. The first part repeats the already mentioned general definition of validity to whether the measurement of a variable actually reflects the true theoretical meaning of a concept. The second part refers to discriminant validity, i.e. a specific design to validate the measurement of a construct via the analysis of a multitrait-multimethod (MTMM) matrix (Campbell and Fiske, 1959: 82). A direct reference to the ‘theoretical validity’ of the measure of a variable appears in the work of Lord and Novick (1968: 261) on a statistical theory of test scores. In this measurement tradition, theoretical validity refers to the correlation between an observed item (x) and some latent, theoretical construct at interest (τ). The theoretical validity of x is distinguished from the empirical validity of it, which can only assessed in relation to measures of several indictor variables within the framework (assumptions) of the true score theory (Bohrnstedt, 1983: 73–74). In this framework, the notion of construct validity implies two types of hypotheses. First, it implies that items (indicators) of a construct (or latent variable) correlate together because they all reflect the same underlying construct or ‘true’ score. Second, these items may correlate with items (indicators) of other constructs, to the extent that constructs themselves are correlated. The first type of validation has been called ‘theoretical validity – an assessment of the relationship between the items and an underlying, latent unobserved construct. The second involves confirming that the latent variables themselves correlate as hypothesized’ (Bohrnstedt, 1983: 101–102). If either or both sets of hypotheses fail, then construct validity is absent. In other words, construct validity examines whether,
and how many, relationships predicted by the theory are not rejected by an analysis of the collected data (Aday and Cornelius, 2006: 66–68). This is clearly measurement validity, and not what I mean by conceptual validity since the validation in construct validity is largely based on the analysis of empirical relations between (observed) measured indicators, e.g. a MTMM-matrix. The development of Structural Equation Modeling facilitated the empirical testing of measurement models and provides numerical estimates of the reliability and validity of measurement instruments when the indictors of several constructs are repeatedly measured by different methods (Andrews and Withey, 1976; Andrews, 1984; Saris and Andrews, 1991). One of the few appearances of the ‘conceptual validity’ is in a paper by Jackson and Maraun (1996: 103). It is argued that there is little justification for the claim that selected scale items measure the intended concept ‘unless conceptual clarification of the meanings of scale items is undertaken prior to the study of empirical relationships’. It is further argued that the psychometric properties of measurement instrument (e.g. high correlations between the items) are not a sufficient condition to conclude that the measurement instrument reflects the intended construct. This idea goes back to the view of Guttman (1977: 99) who claims that ‘correlation does not imply content’. One cannot go that far, however, and assert that inter-item correlations have nothing to do with the meaning of items. The more similar the meaning the more likely that correlations between items are higher (Zuckerman, 1996: 111–114). Statistical parameters need additional interpretation from a theoretical point of view. In this (philosophical) view on conceptual validity, statistical evidence can never serve as the sole criterion for establishing the meaning of a theoretical construct. This is what I want to highlight with the twin concept of conceptual and measurement validity.
What Does Measurement Mean in a Survey Context?
Contrary to construct validity, ‘content validity’ cannot be expressed by a statistical parameter estimated as some specific value. It refers to the general quality of correspondence rules between the theoretical language (concepts-by-postulation) and the observational language (concepts-by-intuition) (Hox, 1997: 51). It is therefore surprising that the search for the term ‘conceptual validity’ delivers so many hits to construct validity, and not to content validity. The latter relies on judgments about whether the indicators chosen are representative for the domain of meaning of the theoretical constructs that a researcher intends to measure (Bohrnstedt, 1983: 98). In other words, it refers to how good the samples of empirical measures are for the theoretical domains they are presumed to represent (Aday and Cornelius, 2006: 62). Both kinds of validation, content and construct are complementary. Construct validity is mainly geared to a statistical analysis after the data is available. Content validation, however, is primarily an activity that takes place in early stages of the operationalization process wherein decisions are made on samples of observable indicators to measure the theoretical constructs. In the context of survey research, the procedures typically applied in this stage include: relying on the researcher’s personal experience and insights; thorough literature review, consultation of experts in the field; conversations with key witnesses from the field; and dimension/indicator analysis, semantic analysis, and systematic facet analysis (Aday and Cornelius, 2006: 64; Bohrnstedt, 1983: 98, 100; Hox, 1997: 53–54). These operations undertaken in content validation are in a survey context less formalized than is the case with construct validation (Bohrnstedt, 1983: 99; Heath and Martin, 1997). At this point the question arises why do I use in this paper ‘conceptual validity’ versus ‘measurement validity’ and not the usual concepts of ‘content validity’ and ‘construct validity’? Content and construct validity are usually considered as two subsequent and
199
complementary stages in the operationalization process leading to measurement instruments, i.e. a sequence of operations with an inductive stage followed by a deductive one (Boesjes-Hommes, 1970: 206). With the twin pair of ‘conceptual validity’ versus ‘measurement validity’ I want to stress an approach that reflects Kerlinger’s (1973: 28) idea that social scientists ‘shuttle back and forth’ between the theoretical (abstract) level and observation (measurement) level. This is more than subsequent stages in the operationalization process, these are two levels of reflection. The aforementioned view on conceptual validity means that it is not possible to obtain certainty about conceptual validity on the basis of measurement validity. The psychometrical qualities of a measurement instrument required to obtain measurement validity are no sufficient conditions to conclude conceptual validity because the meaning of a theoretical construct of interest is always larger than its measurement. However, this does not mean that measurement validity is unrelated to conceptual validity. It is a necessary condition for it. The main advantage of the distinction between measurement and conceptual validity is that it draws attention to the importance of conceptual clarification, and theoretical reflection on the assumptions behind the measures, not only in the stage of development of concepts, constructs, and indicators but also during and after the measurement stage. This is crucial for the dynamic nature of concepts/constructs and their operationalization, and may avoid that researchers implicitly assume that the measured (latent) variable is the construct. Figure 14.1 clearly shows that the operationalized construct does not completely cover the theoretical construct (CV: conceptual validity). It also shows that a measured variable is smaller in scope than the operationalized constructs (MV: measurement validity). This is not only because of research technical restrictions, but as was mentioned, because not all theoretical nuances, aspects,
200
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
Theoretical concept/construct Operationalized construct CV
MV Observed (measured) variable
Figure 14.1 Schematic overview of the operationalization process.
or dimensions of a construct are observable. The figure not only shows differences in the size of scope, but also the presence of invalidity. Operationalized constructs and measured variables not only measure incompletely the intended constructs/concepts; they also measure something what is not intended, i.e. method effects, response style, and even content that does not belong to the scope of the intended concept. This idea is further developed in the second part with the aid of a detailed example.
FROM OPERATIONALIZED CONSTRUCT TO MEASUREMENT
Parts I and II of the previously mentioned work of Saris and Gallhofer (2007: 13–169) provide an extensive and concise account of bridging the distance between theoretical constructs and survey questions. It is, within the scope of this chapter, impossible to even summarize all crucial ideas, principles, concepts, and guidelines that are discussed in this standard work. Saris and Gallhofer's work is strongly recommended for everyone who wants to understand what operationalization means in the context of survey measurement, and wants to apply it. It is written by an expert in survey measurement and one in linguistics, which makes it rather unique. It originated in the context of the preparation of ESS questionnaire modules. The instructions to teams that prepared a rotating module
proposal covered the operationalization process through which the proposed constructs were operationalized. The end product was a list of well-documented survey questions. I illustrate the three main steps of operationalization with the development of the social welfare module (Round 4 of the ESS in 2008). The main reason for using this example is that, after acceptance of the proposal by the Scientific Advisory Board, I was a member of the ESS central coordination team involved in the core module questionnaire design. An additional reason is that several publications based on the welfare state module are available in international journals (e.g. Roosma et al., 2013; Van Oorschot et al., 2012).
Step 1: Theoretical Elaboration of the Concepts
The operationalization and questionnaire design process always starts with a thorough conceptual clarification in which the meaning of the concepts, the dimensions and the theoretical relations between these are elaborated. This is based on the state of the theory, on previous research, and on research questions that guide a project. 'Social legitimacy of the welfare state' is the core theoretical construct in the ESS social welfare module. This is a complex multi-dimensional concept that contains both a macro and a micro dimension. The concept functions at the level of a society (support by institutions, effects of policy, etc.), at the level of groups within society, and at the level of beliefs, attitudes, and perceptions of the citizens. 'Social' refers to the legitimacy as perceived in the population, and is the dimension that can be measured via public opinion research. In the conceptual clarification stage, during proposal preparation, seven individual-level dimensions of the general concept were distinguished (see Roosma et al., 2013: 235–255). To what extent these dimensions are driven by a basic attitude
of 'welfarism' was a central research question. Our illustration is restricted to one dimension, named 'popular perceptions of welfare state consequences'. This theoretical concept was further elaborated into a number of sub-dimensions, namely economic, moral, and social consequences. It was presumed that 'the legitimacy of European welfare states would be seriously threatened if citizens only had an eye for its unfavorable unintended (economic and moral) consequences without simultaneously having strong perceptions of favorable intended consequences' (Van Oorschot et al., 2012: 182). By intuition, one can easily imagine what the perceived economic and social consequences of the welfare state are. The 'moral' consequences of the welfare state refer to its consequences for the sense of responsibility among citizens.
Step 2: Operationalization of the Construct
The complexity and length of the operationalization process depend on the (theoretical) distance between concepts-by-intuition and concepts-by-postulation. Many decisions impose themselves given that the measurement context is fixed: a cross-national survey by face-to-face interviews using standardized questionnaires in large random country samples. These decisions also relate to characteristics of the construct: the number of indicators needed per sub-dimension of the construct (single or multiple indicators); whether direct or indirect measures are suitable; whether behavioral or attitude measures are required; how the relations between the general concept and sub-concepts are imagined; and the kind of basic assertion structures needed (Saris and Gallhofer, 2007: 15–59). Further decisions depend on the structure of the concepts-by-postulation (see Chapter 17 in this Handbook). Are the indicators conceived as formative, in the sense that they determine the meaning of the constructs, or are the indicators reflective? The latter means that the
variation in the indicators of a concept-by-postulation is conceived as explained (or 'caused') by the variation in a latent variable that reflects the construct (Saris and Gallhofer, 2007: 277–280). This all has important consequences for the assessment of measurement models after measurements are obtained (see below). Crucial during the operationalization stage are decisions that relate to measurement validity concerns. The researcher should not only reflect on alternative choices concerning question wording and response scales for the indicators (i.e. questions and assertions), but also include additional measures in view of validity and data quality assessment (see Saris and Gallhofer, 2007: 171–153). The construction of a measurement instrument – a part of a questionnaire in our illustration – is the final product of the operationalization stage (see Chapters 16 and 17 in this Handbook). In the operationalization stage of the dimension 'popular perceptions of welfare state consequences', indicators for each of the sub-dimensions of the construct were selected, and the relations between these were specified. Decisions in this stage are guided by the theoretical elaboration (step 1) and by findings of previous research (van Oorschot, 2010; van Oorschot et al., 2012: 182). Each of the sub-dimensions of 'popular perceptions of welfare state consequences' is operationalized as a latent variable measured by multiple reflective indicators: perceived (negative) economic consequences (D21, D25); perceived social consequences (D22, D23, D26); and perceived (negative) moral consequences (D27, D28, D29). The selected indicators are all Likert items of the form 'to what extent do you agree or disagree that social benefits and services in [country] …'
D21 … place too great strain on the economy?
D22 … prevent widespread poverty?
D23 … lead to a more equal society?
D25 … cost business too much in taxes and charges?
D26 … make it easier for people to combine work and family life?
D27 … make people lazy?
D28 … make people less willing to care for one another?
D29 … make people less willing to look after themselves and their family?
Response scale offered: completely agree (1), agree (2), neither agree nor disagree (3), disagree (4), completely disagree (5).
As was indicated, the operationalization in Figure 14.2 is restricted to the operationalization of the sub-concept ‘popular perceptions of welfare state consequences’, which is an aspect of the broader ‘welfarism’ concept that is not covered here. Only the relations between the dimensions and indicators used to measure these, and the mutual relations between the dimensions, are made explicit at this stage of operationalization. The indicators are defined as reflective indicators of latent variables. The relations between the three dimensions and the more general concept behind are not specified in the diagram (Figure 14.2), and thus not operationalized. Dotted lines are used to indicate that these relations remained purely conceptual.
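To make the reflective specification in Figure 14.2 explicit, it can be written as a set of congeneric factor equations. The sketch below is my own illustration of this standard specification, not a reproduction of the module documentation: the loadings (λ), intercepts (τ) and unique terms (δ) are generic symbols, and the subscripts simply follow the indicator numbering above.

\begin{align*}
x_{D21} &= \tau_{21} + \lambda_{21}\,\xi_{\text{ECON}} + \delta_{21}, \qquad
x_{D25} = \tau_{25} + \lambda_{25}\,\xi_{\text{ECON}} + \delta_{25},\\
x_{D22} &= \tau_{22} + \lambda_{22}\,\xi_{\text{SOC}} + \delta_{22}, \qquad
x_{D23} = \tau_{23} + \lambda_{23}\,\xi_{\text{SOC}} + \delta_{23}, \qquad
x_{D26} = \tau_{26} + \lambda_{26}\,\xi_{\text{SOC}} + \delta_{26},\\
x_{D27} &= \tau_{27} + \lambda_{27}\,\xi_{\text{MOR}} + \delta_{27}, \qquad
x_{D28} = \tau_{28} + \lambda_{28}\,\xi_{\text{MOR}} + \delta_{28}, \qquad
x_{D29} = \tau_{29} + \lambda_{29}\,\xi_{\text{MOR}} + \delta_{29},
\end{align*}

with the unique terms δ mutually uncorrelated (local independence, see note 4) and the three latent variables ξ allowed to covary freely. Reversing the direction of the arrows, so that the indicators jointly determine the latent variable, would give the formative alternative mentioned above, which requires a different estimation and assessment strategy.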
Step 3: Validity Assessment of the Measured Construct
Assessment of measurement validity of the operationalized constructs is only possible after the observations are done. There are two different, but interrelated, phases in this activity: the assessment of measurement validity, and the assessment of conceptual validity. Judgment on conceptual validity is powered by judgment of the measurement quality of the measured variables in the operationalized construct. As was already suggested, measurement validity is a necessary but not sufficient condition for conceptual validity. Depending on the kind of measures and the relations between variables and indicators, several questions are appropriate (see, for example, Alwin, 2011: 265–293). Do the indicators sufficiently reflect the variables that they intend to measure? Are the indicators not (too strongly) related to variables other than those they are intended to measure? Is the residual variation in a set of indicators small enough to consider it as random variation and not as a substantive
source of variation behind? Are the empirical relations between the variables in line with what is expected in the theoretical model? There are also questions guided by the meaning of the intended concept: e.g. are the measured variables not too partial (see Figure 14.1)? Partial measurement of complex concepts may lead to theoretical conclusions that are not covered by the empirical observations, because only specific dimensions of the construct are measured ('error of generalization'). It is also possible that the distance between a concept and its operationalization is too large. This happens when abstract concepts are insufficiently reflected in the measured variables ('error of abstraction').
Figure 14.2 Operationalization of the perception of welfare state consequences. [Diagram: the three sub-dimensions, economic consequences (D21, D25), social consequences (D22, D23, D26) and moral consequences (D27, D28, D29), are modeled as latent variables with reflective indicators; dotted lines connect them to the overarching concept 'perceptions of welfare state consequences'.]
Assessment of the measurement quality of the measured construct is a first step after the data are collected. For this, two sub-samples
of the Belgian population are used. Part of the data was collected in Dutch (N = 981), the other part in French (N = 641). The test of the measurement model for the three dimensions of perceived welfare state consequences therefore needs additional specifications: equivalent measurement between the two language groups is required. Without going into much detail, this requirement is operationalized as a claim for scalar invariance, since one should be able to compare not only relations between latent variables but also latent means (see Davidov et al., 2014).8 The estimated parameters of the relations between the indicators and the latent variables (the measurement part), and of the relations between the latent variables (the structural part), are shown in the upper and lower parts of Table 14.1.
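The claim for scalar invariance can be stated compactly (see also note 8). Writing the equation for indicator i in language group g, a hedged restatement in generic notation rather than the exact parameterization behind Table 14.1, the nested constraints are:

\[
x_i^{(g)} = \tau_i^{(g)} + \lambda_i^{(g)}\,\xi^{(g)} + \delta_i^{(g)}, \qquad g \in \{\text{Dutch-speaking}, \text{French-speaking}\};
\]
\[
\text{configural: the same pattern of zero and non-zero loadings in both groups;}
\]
\[
\text{metric: } \lambda_i^{(1)} = \lambda_i^{(2)} \text{ for all } i; \qquad
\text{scalar: } \lambda_i^{(1)} = \lambda_i^{(2)} \text{ and } \tau_i^{(1)} = \tau_i^{(2)} \text{ for all } i.
\]

Only under the scalar constraints can differences between the two language groups in latent means be interpreted substantively rather than as artefacts of the measurement instrument.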
Table 14.1 Construct 'popular perceptions of welfare state legitimacy': Scalar invariant measurement model and structural relations in Flemish and Walloon samples in Belgium (ESS Round 4, 2010)

Scalar invariant measurement model in Flemish and Walloon samples (λ with t-value; R² = proportion of explained variance)
ECONOMIC: D21 λ = 0.677 (fixed), R² = 0.458; D25 λ = 0.650 (10.873), R² = 0.423
SOCIAL: D22 λ = 0.540 (fixed), R² = 0.292; D23 λ = 0.977 (7.733), R² = 0.955; D26 λ = 0.376 (10.326), R² = 0.141
MORAL: D27 λ = 0.756 (fixed), R² = 0.572; D28 λ = 0.753 (25.443), R² = 0.567; D29 λ = 0.821 (28.214), R² = 0.674

Structural relations (between-factor variances and covariances, with t-values)
Flemish sample (N = 958): ECONOMIC–ECONOMIC 0.946 (8.438); SOCIAL–ECONOMIC −0.097 (−1.594); MORAL–ECONOMIC 0.526 (9.615); SOCIAL–SOCIAL 0.913 (6.154); MORAL–SOCIAL −0.095 (−2.057); MORAL–MORAL 0.950 (15.656)
Walloon sample (N = 641): ECONOMIC–ECONOMIC 1.080 (7.453); SOCIAL–ECONOMIC −0.146 (−2.130); MORAL–ECONOMIC 0.369 (9.159); SOCIAL–SOCIAL 1.130 (6.300); MORAL–SOCIAL −0.149 (−2.530); MORAL–MORAL 1.075 (16.854)

Model fit: Chi-square = 86.301; df = 44; RMSEA = 0.035; p[close fit] = 0.991; CFI = 0.978
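Assuming that the loadings in Table 14.1 are reported in a standardized metric (an assumption on my part; the table does not state this explicitly), the proportions of explained variance follow directly as squared loadings, R_i² = λ_i². For example:

\[
R^2_{D21} = 0.677^2 \approx 0.458, \qquad
R^2_{D26} = 0.376^2 \approx 0.141, \qquad
R^2_{D29} = 0.821^2 \approx 0.674.
\]

This also makes visible why D26 is singled out below as a weak indicator of the social consequences dimension: the latent variable accounts for only about 14% of its variance.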
If one compares the measurement model with the operationalized construct (Figure 14.2), one can observe that the measurement model largely behaves as suggested by the operationalized construct. The indicators 'load' substantially on the variables they are supposed to measure, and are not related to other latent variables. A scalar invariant model – i.e. invariant intercepts and slopes over the two samples – is supported by the data (an excellent fit according to the criteria). As expected, the relations between the latent variables differ in direction. The indicators of economic and moral consequences are all worded in a negative direction: endorsing the assertions does not support the legitimacy of the welfare state (bad for the economy, too expensive, makes people lazy …). The indicators of social consequences are all worded in the opposite direction, and support welfare state legitimacy. This structure is clearly expressed in the correlations between the latent variables (see the bottom part of Table 14.1).9 Economic and moral consequences, both being negative towards the attitude object, are positively related with each other, but negatively with the social consequences, which are worded in favor of legitimacy of the welfare state. However, these correlations are very weak, and the relation between economic and social consequences is not even significantly different from zero. The structure of the correlations is parallel in the two samples. A second observation concerns the low amount of explained variance (R²) in the indicators by the latent variables. This is in particular the case for indicators D22 and D26 of the social consequences dimension. More than 50% of the variance is explained by the latent variable in only half of the indicators (D23, D27, D28, and D29). This means that a large proportion of the variance in the other indicators is not explained by the variables they are intended to measure. In other words, these indicators largely measure something other than what they are intended to measure (see Figure 14.1). What can be the reason? There
are several possible explanations, which are not mutually exclusive. First, there is possibly an additional construct that explains a lot of the residual variance. Although not specified in the construct, one can try to model this by introducing an additional factor. A second explanation relates to the measurement of social consequences. Especially D26, about the combination of work and family, does not measure this dimension, which tapped social consequences at the societal level, very well. This indicates that the variation in D26 is 'caused' by another, specific source (see Alwin, 2011: 277). This might also be an indication that the general concept theoretically contains at least one additional dimension that cannot be covered, since there are no other measured indicators for it. A third explanation for the large amount of unexplained variance in the indicators is the presence of acquiescence, which is a response style (a tendency to agree with all items). It is, however, strictly speaking not possible to measure it completely independently from content. This is because the dimensions are not each measured by a (quasi-)balanced set of positively and negatively worded items (see Billiet and McClendon, 2000). Van Oorschot et al. (2012: 185–186) solved this problem by including an extra concept, 'principle of equality', that was measured by two positively and one negatively worded item concerning economic equality. In a sample of 25 European countries they found an agreeing response style factor and stated that comparisons of means and of relations between countries therefore need adjustment (van Oorschot et al., 2012: 185–186). I applied the same model with four content variables and a style factor to the two sub-samples of Belgium. The variances of the style factor are rather large in the Flemish (t = 11.410) and Walloon (t = 12.180) samples when compared with the variances of the substantive variables (see the variances of the latent variables in the structural part of Table 14.1). The rather large slope parameters for the style factor (λ = 0.357)10 also
show that the indicators (agree–disagree items) that are used to measure perceptions of welfare state consequences are very sensitive to endorsement. Since the indicators were not specifically designed to measure the response style acquiescence, it is reasonable to conclude that the 'style' variable is a mix of style and of agreeing because of a moderate attitude towards welfare state consequences. Such a moderate position is then expressed by endorsing seemingly contrasting assertions. Our finding that the negative correlations are much higher after controlling for style (and moderation) is in line with this interpretation. This illustration clearly shows that screening the measurement model after the data are collected, together with additional assessment on the basis of the intended theoretical construct, may reveal insufficiencies in the operationalization stage. It may indicate that the operationalized concepts only partially cover the theoretical construct, or that something of theoretical relevance was measured that was not clearly defined in the theoretical construct. Since measurement always depends on the instrument used, it is likely that one finds indications of a systematic source of variation in the measurement that falls outside the semantic scope of the theoretical construct that was operationalized. In our example it was a combination of both: something undefined that might fall within the scope of the theoretical construct, and a response style. The advice for dealing with it at the level of measurement validity was clear: take extra measures to estimate it, i.e. include a mix of positively and negatively worded items within each sub-dimension (Billiet and Matsuo, 2012). It is in general possible to include extra measures in the operationalization step in order to adjust the estimates for measurement error (see Saris and Gallhofer, 2007). This is in line with the measurer's approach to measurement error: if you cannot avoid error, then measure it (Groves, 1989: 4).
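The adjustment for acquiescence described here follows the logic of Billiet and McClendon (2000): every agree–disagree item loads both on its substantive factor and on a common style factor. The following is a minimal sketch of such a specification in generic notation; the equal-loading constraint on the style factor is the usual identification device in this literature, not a parameter value reported in this chapter.

\[
x_i = \tau_i + \lambda_i\,\xi_{k(i)} + \gamma\,\eta_{\text{style}} + \delta_i, \qquad \gamma \text{ constrained equal for all items } i,
\]

where ξ_{k(i)} is the content factor to which item i belongs and η_style is specified as uncorrelated with the content factors. With a (quasi-)balanced set of positively and negatively worded items per dimension, the style variance can be separated cleanly from content variance; in the unbalanced design discussed above, the estimated 'style' factor remains a mixture of acquiescence and moderate substantive positions.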
CONCLUSIONS AND DISCUSSION
An extensive literature on measurement error flourishes in the field of survey methodology (for an overview, see Billiet and Matsuo, 2012: 158–174), in which the reliability and the quality of indicators are estimated by means of confirmatory factor analysis (CFA) within the context of structural equation models. The construct validity of a measure xi of a latent trait ξj is the magnitude of the direct structural relation between ξj and xi (Bollen, 1989: 197). The definition of construct validity employed in this approach is purely operational, and is narrower, with a more specific meaning, than in the work of Cronbach and Meehl (1955), for whom validity refers to the extent to which an observed measure reflects the underlying theoretical construct it is intended to measure. Are these two conceptions contradictory? No, when a latent trait or variable is not conceived as the intended theoretical construct or concept; but yes, if the measured variable is taken to be the intended theoretical concept, as seems to be the case in a purely operationalist view on survey measurement. Although this science-philosophical viewpoint behind an operational definition of validity is not openly acknowledged, the impression exists that it is often implicitly assumed. In this chapter, another view on validity is proposed in which conceptual (or substantial) validity is clearly distinguished from measurement validity. This distinction is based on Carnap's (1958) science-philosophical idea of two languages, a theoretical language and an empirical language. It is argued that this view on validity is useful – without denying the contribution of the commonly used validity concepts – whenever the quality of obtained survey measures is reviewed. This approach to valid survey measurement is in line with Kerlinger's viewpoint (1973: 28) that empirical researchers operate on the theory–hypothesis–construct level and the observation (or measurement) level, between
which they shuttle back and forth. The principal idea that we want to defend is that measurement is guided by theoretical constructs, and that the meaning of these concepts may become richer through review of, and reflection on, the obtained measures. The operationalization process is seen as a translation process in which theoretical constructs are translated into observable variables that are assumed to be valid representations of the constructs. In the first part of the chapter, it is argued why the unusual twin concepts of 'conceptual validity' versus 'measurement validity' are preferred over the commonly used concepts of 'content validity' and 'construct validity', although the latter distinction is still usable and even irreplaceable. The commonly used concepts refer to sequences of operations in the operationalization process leading to measurement instruments. The twin pair 'conceptual validity' versus 'measurement validity' does not refer to a sequence of operations. It refers to subsequent validity levels that cover all stages in the lifecycle of a survey research project, and it involves two levels of reflection. The main advantage of the proposed distinction between measurement and conceptual validity is that it draws attention to the importance of conceptual clarification and theoretical reflection on the assumptions behind the planned and obtained measures through all stages of a survey research project. The second part of this chapter is an illustration of the proposed view on the meaning of survey measurement, and is devoted to a screening of an operationalized construct on the basis of the obtained measures. It is a reflection on an empirical measurement model after the data are collected and analyzed. Can additional assessment of the extent to which an empirical measurement model expresses the operationalized construct lead to conclusions about the conceptual validity of the construct? We have tried to answer that question with the aid of a concise example of an operationalized and measured construct, 'popular perceptions of welfare
state consequences'. Recommendations for the improvement of measurement validity are well developed in the survey research literature and in practice. It is less clear, however, whether insufficiencies in the empirical measurement model are in turn relevant for the improvement of the conceptual model. I suppose that the better a concept is elaborated and documented prior to operationalization, the more it will be possible, after measurement, to take advantage of this in view of conceptual validity. Increased attention to this is, I presume, possibly the main merit of a distinction between measurement validity and conceptual validity.
NOTES
1 An epistemic correlation is a relation joining an unobserved component of anything designated by a concept-by-postulation to its directly inspected component denoted by a concept-by-intuition (Northrop, 1947: 119).
2 See http://en.wikipedia.org/wiki/Latent_variable
3 Most of these formal definitions apply to structural equation modeling of latent variables (see Lazarsfeld, 1959; Lord and Novick, 1968; Bentler, 1982).
4 A measurement model of a latent variable measured by a set of reflective indicators specifies that the pattern of covariation between the indicators can be fully explained by a regression of the indicators on the latent variable. It implies that the indicators are mutually independent after conditioning on the latent variable (this is the assumption of local independence) (Boomsma et al., 2001: 208).
5 Concerning the ontological status of latent variables, Boomsma et al. (2001: 208–209) distinguish entity realism from theory realism, the former referring to a weaker form of realism. In entity realism it is assumed that at least some theoretical entities in a theory exist, without claiming that the whole theory is false or true.
6 In graphical presentations of a measurement model with formative indicators, the arrows between the indicators and a latent variable run in the reversed direction (from indicators to latent variable) compared with the direction for reflective indicators (from latent variable to indicators) (see Figure 1 in Boomsma et al., 2001: 208). For concrete examples of both reflective and
formative indicators, see Saris and Gallhofer (2007: 278–301).
7 Congeneric measures are defined as measures of the same underlying construct in a possibly different unit of measurement or with a possibly different degree of precision (Lord and Novick, 1968: 47–50). Congeneric measures are indicators of different characteristics that are all simple linear functions of a single construct (Groves, 1989: 18).
8 There are several levels of measurement equivalence of multiple-indicator latent variables. The lowest level is configural equivalence, which only requires that the factor structures across groups are the same. A second and higher level of equivalence, called metric equivalence, requires that the factor loadings between the observed items and the latent variable are equal across the compared groups. Metric equivalence is required if one wants to compare regression coefficients and covariances. The third level of equivalence, scalar equivalence, is the strongest form of measurement equivalence. It requires not only the factor loadings but also the indicator intercepts to be equal across groups. Scalar equivalence is required if one wants to compare latent means across groups (Davidov et al., 2014: 63–64).
9 The reasons why the diagonal elements of a standardized covariance matrix in multigroup structural equation modeling can deviate somewhat from 1.00 are explained by R.B. Kline (2011: 175).
10 One may say this since style is not a construct that was planned and measured by specific style indicators, but is instead identified by the direction of wording of indicators that are designed to measure substantive variables. The slope parameters for style are much larger than in previous studies we have done (e.g. Billiet and McClendon, 2000).
RECOMMENDED READINGS The epistemological background of the distinction between conceptual and measurement validity as proposed in this chapter relates to Rudolf Carnap’s chapter ‘The methodological character of theoretical concepts’, which appeared in 1956 in an edited volume entitled The Foundations of Science and the Concepts of Psychology and Psychoanalysis (Herbert Feigl and Michael Scriven, University of Minneapolis Press, pp. 38–75). The core of this chapter is the distinction in science
between two languages, a theoretical language and an observational language, both related by correspondence rules. It is stated that the terms in a theoretical language are much richer, and not completely covered by the terms in an observation language. This view originated in the late 1940s in the Vienna school. Distinction is made between three kinds of terms: theoretical terms, observational terms, and dispositional terms in between. Most examples in the paper originate from physics, but at the end it is argued that it also applied to psychological concepts. The concept of latent variables, assumptions about their theoretical status, and their relation to empirical indicators, are crucial considerations whenever unobservable attributes are related to observable (or measurable) entities. What philosophical position is implied by latent variable theory? Should we assume that latent variables are real entities or conceive these as useful fictions constructed by the human mind? Are there other ways to distinguish observable variables from latent variables, and the latter from theoretical concepts? These questions are extensively discussed in the paper by Denny Borsboom, Gideon Mellenbergh, and Jaap van Heerden, ‘The theoretical status of latent variables’, which appeared in 2003 in Psychological Review, 110 (2): 203–219. Given the scope of this chapter where the focus is on the operationalization stage in which theoretical concepts are translated to constructs and empirical indicators, there was no room to elaborate fully the strategies used to reduce or to measure measurement error. For an extensive and excellent treatment of these aspects of survey measurement I refer to Willem Saris’ and Irmtraud Gallhofer’s Design, Evaluation, and Analysis of Questionnaires for Survey Research (2007, Wiley & Sons; 2nd edn, 2014). This book has several outstanding features in the domain of evaluation of the quality of survey measures. The first part of it (pp. 13–63), which describes a three step procedure in the translation of theoretical concepts (by intuition or by postulation) into observable indicators, is most closely related to the current chapter on the meaning of survey measurement.
REFERENCES Aday, L.A. and Cornelius, L.J. (2006). Designing and Conducting Health Surveys: A Comprehensive Guide (Third Edition). San Francisco: Jossey-Bass. A Wiley Imprint Alwin, D.F. (2011). Evaluating the reliability and validity of survey interview data using the MTMM approach. In J. Madans, K. Miller, A. Maitland and G. Willis (eds), Question Evaluation Methods: Contributions to the Science of Data Quality (pp. 265–293). Hoboken, NJ: John Wiley & Sons. Andrews, F.M. and Withey, S.B. (1976). Social Indicators of Well-being. Americans’ Perceptions of Life Quality. New York: Plenum Press. Andrews, F.M. (1984). Construct validity and error components of survey measures: a structural modelling approach. Public Opinion Quarterly, 48 (2): 409–442. Bentler, P.M. (1982). Linear systems with multiple levels and types of latent variables. In K.G. Jöreskog and H. Wold (eds), Systems under Indirect Observation (pp. 101–130). Amsterdam: North Holland. Billiet, J. and McClendon, J. McKee. (2000). Modeling acquiescence in measurement models for two balanced sets of items. Structural Equation Modeling: An Interdisciplinary Journal, 7 (4): 608–629. Billiet, J. and Matsuo, H. (2012). Non-response and measurement error. In Lior Gideon (ed.), Handbook of Survey Methodology for the Social Sciences (pp. 149–178). New York: Springer. Blalock, H.M. Jr (1990). Auxiliary measurement theories revisited. In J.J. Hox and J. de JongGierveld (eds), Operationalization and Research Strategy (pp. 33–49). Amsterdam: Swets and Zeitlinger. Boesjes-Hommes, R.W. (1970). De geldige operationalisering van begrippen. Boom: Meppel. Bohrnstedt, G.W. (1983). Measurement. In P.H. Rossi, J.D. Wright and A.B. Anderson (eds), Handbook of Survey Research (pp. 69–121). New York: Academic Press. Bollen, K.A. (1989). Structural Equations with Latent Variables. Hoboken, NJ: Wiley & Sons. Bollen, K.A. (2002). Latent variables in psychology and the social sciences. Annual Review of Psychology, 53: 605–634. Borsboom, D, Mellenbergh, G.J. and van Heerden, J. (2003). The theoretical status of
latent variables. Psychological Review, 110 (2): 203–219. Campbell, D.T and Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56: 81–106. Carnap, R. (1956). The methodological character of theoretical concepts. In H. Feigl and M. Scriven (eds), The Foundations of Science and the Concepts of Psychology and Psychoanalysis (pp. 38–75). Minneapolis, MN: University of Minneapolis Press. Carnap, R. (1958). Beobacthungssprache und theoretische Sprache. Dialectica, XII (3–4): 236–248 (English translation: Observation language and theoretical language, in Rudolf Carnap, Logical Empiricist. Dordrecht, Holland: D. Reidel Publishing Company, 1975). Cronbach, L.J. and Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52: 82–302. Davidov, E., Meuleman, B., Cieciuch, J., Schmidt, P. and Billiet, J. (2014). Measurement equivalence in cross-national research. Annual Review of Sociology, 40: 55–76. De Groot, A.D. (1969). Methodology: Foundations of Inference and Research in the Behavioral Sciences. The Hague: Mouton. De Wit, H. and Billiet, J. (1995). The MTMM design: back to the Founding Fathers. In: W.E. Saris and A. Münnich (eds), The Multitrait-Multimethod Approach to Evaluate Measurement Instruments (pp. 39–60). Budapest: Eötvös University Press. Edwards, J.R. and Bagozzi, R.P. (2000). On the nature and direction of relationships between constructs and measures. Psychological Methods, 5 (5): 155–174. Foddy, W. (1993). Constructing questions for interviews and questionnaires. Theory and Practice in Social Research. Cambridge: Cambridge University Press. Groves, R.M. (1989). Survey Errors and Survey Costs. New York: John Wiley & Sons. Guttman, G.G. (1977). What is not what in statistics. Statistician, 26: 81–107. Heath, A. and Martin, J. (1997). Why are there so few formal measuring instruments in social and political research? In L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz and D. Trewin (eds), Survey Measurement and Process Quality (pp. 71–86). New York: John Wiley & Sons.
Hofstee, W.K.B., Boomsma, A., and Ostendorf, F. (2001). Trait structure: Abridged-circumplex versus hierarchical conceptions. In R. Riemann, F.M. Spinath and F. Ostendorf (eds), Personality and temperament: Genetics, evolution, and structure (pp. 207–215). Lengerich: Pabst Science Publishers. Hox, J.J. (1997). From theoretical concept to survey question. In L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz and D. Trewin (eds), Survey Measurement and Process Quality (pp. 47–70). New York: John Wiley & Sons. Hox, J.J., de Leeuw, E. and Dillman, D.A. (2008). The cornerstones of survey research. In E. de Leeuw, J.J. Hox and D.A. Dillman (eds), International Handbook of Survey Methodology (pp. 1–17). New York: Lawrence Erlbaum Associates. Jackson, J.S.H. and Maraun, M. (1996). The conceptual validity of empirical scale construction: the case of the sensation seeking scale. Personality and Individual Differences, 21 (1): 103–110. Kerlinger, F.N. (1973). Foundations of Behavioral Research (2nd edn). New York: Holt, Rinehart and Winston, Inc. King, G., Keohane, R.O. and Verba, S. (1994). Designing social inquiry. Scientific Inference in Qualitative Research. Princeton, NJ: Princeton University Press. Kline, R.B. (2011). Principles and Practices of Structural Equation Modeling (3th edn). New York: Guilford Press. Lazarsfeld, P.L. (1959). Latent structure analysis. In S. Koch (ed.), Psychology: A Study of Science (pp. 362–412). New York: McGraw Hill. Lazarsfeld, P.L. (1972). Qualitative Analysis. Historical and Critical Essays. Boston, MA: Allyn & Bacon. Lord, F.M. and Novick, M.R. (1968). Theories of Mental Test Scores. Reading, MA: Addison-Wesley. Lyberg, L., Biemer, P., Collins, M., de Leeuw, E., Dippo, C., Schwarz, N. and Trewin, D. (eds) (1997). Survey Measurement and Process Quality. New York: John Wiley & Sons. Northrop, F.S.C. (1947). The Logic of the Sciences and the Humanities. New York: World Publishing Company. Nunnally, J.C. (1978). Psychometric Theory. New York: McGraw Hill.
Nunnally, J.C. and Bernstein, I.H. (1994). Psychometric Theory (3rd edn). New York: McGraw Hill. O’Muircheartaigh, C. (1997). Measurement error in surveys: a historical perspective. In L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz and D. Trewin (eds), Survey Measurement and Process Quality (pp. 1–25). New York: John Wiley & Sons. Payne, S.L. (1941). The Art of Asking Questions. Princeton, NJ: Princeton University Press. Popper, K. (1934). Logic der Forschung. Wien: Springer. Reprinted as: Popper, K.R. (1968). The Logic of Scientific Discovery. New York: Harper & Row. Roosma, F., Gelissen, J. and van Oorschot, W. (2013). The multidimensionality of welfare state attitudes: a European cross-national study. Social Indicators Research, 113: 235–255. Saris, W.E. and Andrews, F.M. (1991). Evaluation of measurement instruments using a structural modeling approach. In P.P. Biemer, R.M. Groves, L.E. Lyberg, N. Mathiowetz and S. Sudman (eds), Measurement Errors in Surveys (pp. 575–598). New York: Wiley & Sons. Saris, W.E. and Gallhofer, I.N. (2007). Design, Evaluation, and Analysis of Questionnaires for Survey Research. Hoboken, NJ: John Wiley & Sons. Scherpenzeel, A.C. and Saris, W.E. (1997). The validity and reliability of survey questions: a meta-analysis of MTMM studies. Sociological Methods & Research, 25: 341–383. Stevens, S. (1946). On the Theory of Scales of Measurement. American Association for the Advancement of Science, 103: 677–680. Sudman, S. and Bradburn, N.M. (1982). Asking Questions: A Practical Guide to Questionnaire Design. San Francisco, CA: Jossey Bass. Van Oorschot, W. (2010). Public perceptions of economic, moral, social and migration consequences of the welfare state: an empirical analysis of welfare state legitimacy. Journal of European Social Policy, 20 (1): 19–31. Van Oorschot, W., Reeskens, T. and Meuleman, B. (2012). Popular perceptions of welfare state consequences: a multilevel, cross-national analysis of 25 European countries. European Journal of Social Policy, 22 (2): 181–197. Zuckerman, M. (1996). Conceptual clarification or confusion in the study of sensation seeking by J.S.A. Jackson and M. Maraun. Personality and Individual Differences, 21 (1), 111–114.
15 Cognitive Models of Answering Processes Kristen Miller and Gordon B. Willis
INTRODUCTION Understanding the various aspects of the survey process that can impact data quality is key to effective survey research. One of the most salient factors related to data quality is the process by which respondents interpret survey questions and formulate answers. Examining question response provides a basis for determining whether or not questions capture the intended construct and also provides direction as to how resulting data should be analyzed. This chapter describes two theoretical approaches for understanding the question response process: a psychological approach and a socio-cultural approach. We first briefly recount the history of cognitive theory as it applies to the question response process, and then discuss the ways in which it has evolved. Finally, we explore a primary method used to study the question response process – cognitive interviewing – and explain how the
method itself has shifted as notions of question response have developed.
THE QUESTION RESPONSE PROCESS
The ways in which survey methodologists have come to understand the question response process have taken shape through the concerted efforts of academics and those in the applied setting of survey research. A significant advancement for question evaluation occurred, for example, in the 1980s with the introduction of cognitive psychology and the creation of a new field: the study of the cognitive aspects of survey methodology (CASM). The CASM movement not only brought attention to the issue of measurement error but also established the idea that individual processes – specifically, respondents' thought processes – must be understood to assess validity and potential sources of
error (Jobe and Mingay, 1991; Schwarz, 2007; Willis, 2004). As asserted by Willis (2005), 'the respondent's cognitive processes drive the survey response, and an understanding of cognition is central to designing questions and to understanding and reducing sources of response error' (p. 23). Notions of what constitutes 'cognition' vary among academics and survey methodologists in different disciplines. Some conceptualize 'cognition' as a purely psychological, even biological, phenomenon that is subconscious and exists outside of social interaction (Luca and Peterson, 2009). Others argue that it cannot be conceptualized outside of socio-cultural context and that examining cognitive processes requires studying them within this context (Berger and Luckmann, 1966; D'Andrade, 1995; Zerubavel, 1999). In the survey methodology field, conceptualizations of cognition and question response theory have shifted over time, particularly as cognitive anthropology and cognitive sociology have contributed to the discussion. While there is some overlap, views on question response can be understood as falling within two general approaches: a psychological approach and a socio-cultural approach.
THE PSYCHOLOGICAL APPROACH The idea that thinking processes are vital to answering survey questions predates the CASM movement: in particular, Cantril and Fried (1944) proposed a model emphasizing what would now be termed as cognitive processes in survey response. This notion developed in Europe and the US, but was brought into major focus by the CASM I conference (for a history, see Jobe and Mingay, 1991). Additionally, models including cognitive components had been introduced by Cannell and his colleagues (Cannell et al., 1981), and also included motivational effects beyond purely mechanistic cognitive processing. The sentinel keystone of CASM, however, has
been the four-stage model of question response introduced by Tourangeau (1984): (1) Comprehension; (2) Recall; (3) Judgment/Estimation; and (4) Response Mapping. Several extensions of this basic model were subsequently introduced, and are reviewed by Jobe and Herrmann (1996). Cognitive theorizing has been embellished by the inclusion of other cognitive mechanisms, in particular 'cognitive subroutines/algorithms' that may influence responding under specified circumstances. For example, Schwarz (2007) has argued that respondents use information contained in response categories to deduce normative behavior and respond in that context. These more limited approaches are not generalized models, but special-case applications. From the time of the CASM I conference, cognitive theory has been applied with respect to two different but complementary objectives:
1 Understanding the processes used when respondents answer survey questions, with the intent of contributing to the science of questionnaire design;
2 Serving as the basis for cognitive interviewing, an applied method for empirically-based questionnaire pretesting and evaluation.
The former objective can be exemplified by an early study by Loftus (1984), who applied this approach to examine the question of whether respondents recalled event-based autobiographic information (doctor visits) in a forward or backward chronological order, with the expectation that questionnaires could be designed to optimize respondent cognition. The latter purpose, using theory as a basis for evaluating survey questionnaires, is typified by the widespread use of the four processes within the Tourangeau model to guide cognitive interviews. For example, the fact that concepts such as ‘net worth’, ‘health insurance premiums’, or ‘dental sealants’ are understood by respondents in different ways, and often erroneously, can be seen as a failure in the comprehension stage of the processing chain (Willis, 2005).
THE SOCIO-CULTURAL APPROACH The explicit inclusion of a cultural interpretation was introduced by Gerber and Wellens (1997), who made the claim that the question response process is best understood within an anthropological perspective, where the phenomenon occurs in a cultural – and not only psychological – context. The extension of this idea has given rise to a more sociological/anthropological perspective, termed here as the socio-cultural approach. While some discussion of culture occurred as the CASM movement grew, attention to the relationship between socio-cultural context and the question response process has become much more pronounced, particularly in the last decade. For the most part, this attention has arisen out of concern for comparability (Harkness et al., 2010a): Do questions mean the same to all socio-cultural and linguistic groups represented in a survey? Are data elicited from questions capturing the same phenomena across all groups of respondents? Multinational surveys, out of practical necessity, began to consider these issues and started to develop methods for examining comparability (Edwards et al., 2004; Jowell et al., 2007; Miller et al., 2010). With this renewed focus, other disciplines, such as linguistics, anthropology and sociology, have become significant contributors to the understanding of the question response process. Translation has been at the forefront of the comparability literature, specifically, how questionnaires in various languages may differ as well as methods for producing and ensuring accurate translations (Harkness et al., 2010b; Schoua-Glusberg and Villar, 2014; Willis et al., 2011). Translations, particularly for multinational surveys, became a focal point for practical reasons: translations had to be produced in order to field multilingual surveys. With survey researchers taking a closer look at translation quality, the field of linguistics became especially relevant to survey methodology; socio-linguists, in particular, argued that consideration of
socio-cultural factors was required to produce comparable translations (Pan, 2007). This new perspective also marks the shift toward improvements in the way multi- lingual surveys produce translated questionnaires. For example, it is now argued that the technique of ‘back translation’ – a technique that produces, more or less, literal translations – should be replaced by the TRAPD (Translation, Review, Adjudication, Pretesting and Documentation) method – a method that forgoes literal translation and is culturally-adaptive (CSDI, 2013; Harkness, 2003). Focus on the evaluation of translations has also led to an important understanding: outside of translated materials, respondents’ life experience as well as socio-economic and cultural context can impact comparability (Fitzgerald et al., 2011; Willis and Zahnd, 2007). It is this latter point concerning nonlinguistic factors that has compelled survey methodologists to rethink the question response process, expanding it to better address question comparability. While psychological models, and in particular the four-stage model (Comprehension, Retrieval, Judgment and Response), tend to emphasize individual thought processes, proponents of the socio-cultural approach argue that sociocultural context must also be a focal point. This perspective does not run counter to psychological models, but rather emphasizes that the interpretation of a question depends on the context of respondents’ lives. Meanings and thought patterns do not spontaneously occur within the confines of a respondent’s mind, but rather those meanings and patterns are inextricably linked to the social world (Miller, 2011; Miller et al., 2014). To fully address comparability, then, it is necessary to understand how and why respondents with different backgrounds and life experiences process questions differently (Willis and Miller, 2011). For example, Ainsworth et al. (2000) suggest that in order to assess levels of physical activity, surveys traditionally ask respondents about leisure-time activities
such as jogging and swimming – activities that are more relevant to well-educated office-workers than to less-educated, lowincome respondents who are likely to engage in work-related physical activity. As a result, leisure-time activity questions do not pertain to all respondents in the same way, and the resulting survey data is likely to misrepresent activity levels. As indicated previously, consideration of culture in survey research was incorporated early in the CASM movement in the 1980s, and it is fair to say that these early writings helped to lay the groundwork for more advanced theorization. For example, borrowing from Grice’s theory of conversation (1975), Schwarz and Strack (1988) argued that respondents infer the meaning of questions through assumptions generated by systems of conversational logic that emerge from cultural interaction. Later, Schober (1999) makes a similar argument, but expands the theory (what he terms ‘the interactional approach’) by considering how respondents infer meaning through the context of the survey interview itself. Underscoring both the importance of conversational logic as well as the context of the survey interview, Sudman et al. (1996) argue that ‘only when the social and cognitive complexities of the response process are understood will we have a good understanding of the quality of survey data’ (p. 2). Eleanor Gerber was perhaps the first survey methodologist to formulate a cultural theory that could serve as the basis for developing and evaluating survey questions. Drawing from cognitive anthropology, Gerber (1999; also Gerber and Wellens, 1997) conceptualized the question response process as being driven by cognitive ‘schemas’ – conceptual frameworks constructed through cultural context and social interaction that allow respondents to assume the meaning of survey questions. Gerber cites D’Andrade’s definition of a schema: A schema is a conceptual structure which makes possible the identification of objects and events ….
Basically, a schema is … a procedure by which objects or events can be identified on the basis of simplified pattern recognition. (1997: 28–29)
Importantly, Gerber suggests that schemas also inform the way in which respondents formulate answers, and not simply how they interpret questions, in that they influence the memories that respondents recall. As a simple example, Gerber and de la Puenta (1996), in their research to develop race and ethnic origin questions for the 2000 Decennial Census, found that the concept of ‘national origin’ connotes ‘country of birth’ as well as ‘country of citizenship’. Given that respondents could base their answers on either interpretation, Gerber and de la Puenta determined that the term ‘national origin’ should be avoided. In her work, The Language of Residence: Respondent Understandings and Census Rules, Gerber (1994) lays out more complex schemas that pertain to the ways in which respondents conceptualize what it means to be a resident and whom they should include in a resident roster. She writes, ‘Residence rules impinge on many cultural schemas, and it is not always clear to respondents which one ought to apply’ (p. 229). A respondent might, for example, answer within the framework of a ‘family’ schema, including only those related by blood, including children who are college students living outside the home, while excluding boarders. Others, answering outside the ‘family’ schema, could include boarders who pay rent or long-term guests. Sociologists, who have more recently entered this theoretical discussion, approach the issue in much the same way: interpretive meaning forms the basis of the question response process, and meaning construction occurs through the relationship between the individual and the social world (Berger and Luckmann, 1966). That is, individuals’ knowledge and the meaning that they assign to objects and events are linked to social context and social interaction. The sub-field of cognitive sociology, similarly to cognitive
anthropology, focuses on the categories, schemes and codes that individuals use to organize their thoughts and make sense of the world around them (Zerubavel, 1999). As it pertains to question response, this approach emphasizes the interpretive value and the fluidity of meaning within the context of a questionnaire as well as (and perhaps more significantly) within the socio-cultural context of respondents’ lives. How respondents go about the four cognitive stages of comprehending, recalling, judging and responding is informed by respondents’ social location – that is, their position in history and society, as organized by and made meaningful by socialstructural factors such as gender, race and class (Chepp and Gray, 2014). Consequently, not all respondents will necessarily process questions in the same manner (Miller et al., 2014). For example, respondents with little access to adequate health care, whether they live in a developing country or in a poor rural area of the United States, may be less able to accurately report having chronic conditions, such as COPD or emphysema, because they may have never been diagnosed or are unaware of the medical terms. At this point, there is not a clear delineation between how a sociological approach to understanding the question response process differs from an anthropological approach, except that sociologists tend to emphasize how social factors, such as a respondent’s age, education level, cultural background, health insurance status and health condition, impact question response (Miller, 2003). As such, a sociological approach emphasizes comparability between groups, although this may be more of a reflection of the broader field’s current emphasis on comparability. Miller and Ryan (2011), for example, show how sexual minorities understand sexual identity questions differently from nonminorities in that the concept ‘sexual identity’ is not equally salient to the two groups; it is much more salient to sexual minorities. Consequently, respondents who would otherwise be categorized as ‘heterosexual’ are
more likely to be misclassified or counted as missing cases because they are not always familiar with the identity categories (Ridolfo et al., 2012).
THE QUESTION RESPONSE PROCESS AND COGNITIVE INTERVIEWING Whereas cognitive interviewing was once seen as a method for studying cognitive response processes generally, as well as a tool to test a specific set of survey questions (Willis, 2005), it has come to mainly be thought of as the latter: a pretesting mechanism. In fact, rather than informing question response theory, existing theories of the question response process – particularly Tourangeau’s four-stage model – were often used to inform the structure of the cognitive interview. For example, to test whether questions create comprehension problems, definition-related probes such as ‘What does “household resident” mean to you?’ could be asked. To assess the ‘Judgment Stage’, probes such as ‘how hard was this to answer?’ could be asked. Despite the distinct influence of cognitive psychology, however, there is a growing argument against the strict reliance on this approach for questionnaire design and evaluation (Gerber and Wellens, 1997; Willis, 2004, 2005). Probing exclusively on the four stages, it is argued, prevents interviewers from collecting relevant information that reveals the impact of socio-cultural factors on question response (Willson and Miller, 2014). Because of these limitations, Gerber has argued that ethnographic methodologies must also be used to design and evaluate questions. For example, in her study to design household rosters, Gerber incorporated ethnography into the cognitive interview to determine the various culturally-bound conceptualizations of ‘household residence’. Because these more interpretive methodologies that capture social context are better able to ‘represent
complexity’, it is argued that the findings are more reflective of the actual processes used by respondents when answering survey questions. (Gerber, 1999: 219; see also Miller, 2011). Currently, ideas of what constitutes cognitive interviewing are shifting as notions around the question response process expand. While in the 1990s ethnography was described as a separate method from cognitive interviewing, those lines have become blurred. For example, Willis (2005) suggests that cognitive interviews should not be driven only by the four-stage cognitive model and argues that cognitive interviews should include ‘reactive probing’, whereby interviewers freely formulate probes based on contextually-relevant information that has emerged in the interview. Willis also suggests that probes should not only pertain to the cognitive process, but also to ‘the characteristics of the survey questions that interact with respondent cognition to produce problems’ (2005: 79). Beatty (2004) has acknowledged a similar trend, pointing out that probing appears more oriented toward generating explanations that take into account respondent circumstances. Finally, Miller (2011) has argued that the cognitive interview cannot fully capture cognitive processes, but rather ‘respondents’ interpretations of their cognitive processes’ (p. 60). Instead, she describes the interview as collecting a story – or ‘narrative’ – that details how and why respondents answer questions the way they do (see also Willson and Miller, 2014). Through analysis of the interviews it is possible to determine the specific interpretations and question response patterns. Because the goal of the cognitive interview is to capture as much of the narrative as possible, it is important for the interviewer to ask whatever questions are necessary to fill in gaps or to address apparent contradictions. With shifting conceptualizations of the question response process and of the cognitive interview itself, the utility of cognitive testing has grown. While cognitive interviewing continues to be a primary method
for pretesting survey questions, it has also proved to be an important methodology for understanding the complexity of the question response process, as well as the role of socio-cultural influence on those processes. To the extent that cognitive interviewing studies identify the phenomena that respondents consider and include in their answers, the methodology can be used to study measurement validity. To the extent that such studies compare those phenomena across socio-cultural and linguistic groups, they can also provide a means to study questionnaire comparability.
RECOMMENDED READINGS

Tourangeau et al. (2000) present a psychological approach to the question response process. Willis (2005) provides the first comprehensive discussion of cognitive interviewing methodology, set within a psychological framework. Miller et al. (2014) present a sociological approach to understanding the question response process and lay out cognitive interviewing methodology within this framework.
REFERENCES

Ainsworth, B.E., Bassett, D.R., Jr., and Strath, S.J. (2000) Comparison of three methods for measuring the time spent in physical activity. Medicine & Science in Sports & Exercise, 32: S457–S464.
Beatty, P. (2004) The dynamics of cognitive interviewing. In Presser, S., Rothgeb, J., Couper, M., Lessler, J., Martin, E., Martin, J., and Singer, E. (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 45–66). Hoboken, NJ: John Wiley & Sons.
Berger, P. and Luckmann, T. (1966) The Social Construction of Reality: A Treatise in the Sociology of Knowledge. Garden City, NY: Doubleday.
Cannell, C.F., Miller, P.V., and Oksenberg, L. (1981) Research on interviewing techniques. In Leinhardt, S. (ed.), Sociological Methodology (pp. 389–437). San Francisco, CA: Jossey-Bass.
Cantril, H. and Fried, E. (1944) The meaning of questions. In Cantril, H. (ed.), Gauging Public Opinion (pp. 3–22). Princeton, NJ: Princeton University Press.
Chepp, V. and Gray, C. (2014) Foundations and new directions. In Miller, K., Willson, S., Chepp, V., and Padilla, J.L. (eds), Cognitive Interviewing Methodology (pp. 7–14). Hoboken, NJ: John Wiley & Sons.
The Comparative Survey Design and Implementation (CSDI) Guidelines Initiative (2013) Best practices for survey translation. http://ccsg.isr.umich.edu/translation.cfm.
D'Andrade, R. (1995) The Development of Cognitive Anthropology. Cambridge: Cambridge University Press.
Edwards, S., Zahnd, E., Willis, G., Grant, D., Lordi, N., and Fry, S. (2004) Behavior coding across multiple languages: The 2003 California Health Interview Survey as a case study. Paper presented at the annual meeting of the American Association for Public Opinion Research, Pointe Hilton Tapatio Cliffs, Phoenix, Arizona, 5 November. Abstract at http://www.allacademic.com/meta/p116049_index.html
Fitzgerald, R., Widdop, S., Gray, M., and Collins, D. (2011) Identifying sources of error in cross-national questionnaires: application of an error source typology to cognitive interview data. Journal of Official Statistics, December, 27 (4): 1–32.
Gerber, E.R. (1994) The Language of Residence: Respondent Understandings and Census Rules. US Bureau of the Census. https://www.census.gov/srd/papers/pdf/rsm2010-17.pdf.
Gerber, E.R. (1999) The view from anthropology: ethnography and the cognitive interview. In Sirken, M., Herrmann, D., Schechter, S., Schwarz, N., Tanur, J., and Tourangeau, R. (eds), Cognition and Survey Research (pp. 217–234). New York: Wiley.
Gerber, E.R. and de la Puente, M. (1996) The Development and Cognitive Testing of Race and Ethnic Origin Questions for the Year 2000 Decennial Census. US Bureau of the Census. http://www.census.gov/prod/2/gen/96arc/iiiagerb.pdf.
Gerber, E.R. and Wellens, T.R. (1997) Perspectives on pretesting: cognition in the cognitive interview? Bulletin de Methodologie Sociologique, 55: 18–39.
Grice, H.P. (1975) Logic and conversation. In Cole, P. and Morgan, J. (eds), Syntax and Semantics, Vol. 3. New York: Academic Press.
Harkness, J.A. (2003) Questionnaire translation. In Harkness, J.A., van de Vijver, F., and Mohler, P.Ph. (eds), Cross-cultural Survey Methods (pp. 35–56). New York: John Wiley & Sons.
Harkness, J., Braun, M., Edwards, B., Johnson, T., Lyberg, L., Mohler, P., Pennell, B., and Smith, T. (2010a) Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: John Wiley & Sons.
Harkness, J., Villar, A., and Edwards, B. (2010b) Translation, adaptation, and design. In Harkness, J., Braun, M., Edwards, B., Johnson, T., Lyberg, L., Mohler, P., Pennell, B., and Smith, T. (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 117–140). Hoboken, NJ: John Wiley & Sons.
Jobe, J.B. and Herrmann, D. (1996) Implications of models of survey cognition for memory theory. In Herrmann, D., Johnson, M., McEvoy, C., Herzog, C., and Hertel, P. (eds), Basic and Applied Memory Research: Volume 2: Practical Applications (pp. 193–205). Hillsdale, NJ: Erlbaum.
Jobe, J.B. and Mingay, D.J. (1991) Cognition and survey measurement: history and overview. Applied Cognitive Psychology, 5: 175–192.
Jowell, R., Roberts, C., Fitzgerald, R., and Eva, G. (2007) Measuring Attitudes Cross-Nationally: Lessons from the European Social Survey. London: Sage.
Loftus, E. (1984) Protocol analysis of responses to survey recall questions. In Jabine, T.B., Straf, M.L., Tanur, J.M., and Tourangeau, R. (eds), Cognitive Aspects of Survey Methodology: Building a Bridge Between Disciplines (pp. 61–64). Washington, DC: National Academy Press.
Tommasi, L. and Peterson, M. (2009) Cognitive Biology: Evolutionary and Developmental Perspectives on Mind, Brain, and Behavior. Cambridge, MA: MIT Press.
Miller, K. (2003) Conducting cognitive interviews to understand question-response limitations among poorer and less educated respondents. American Journal of Health Behavior, 27 (S3): 264–272.
Miller, K. (2011) Cognitive interviewing. In Madans, J., Miller, K., Maitland, A., and Willis, G. (eds), Question Evaluation Methods: Contributing to the Science of Data Quality (pp. 51–75). Hoboken, NJ: John Wiley & Sons.
Miller, K. (2014) Introduction. In Miller, K., Willson, S., Chepp, V., and Padilla, J.L. (eds), Cognitive Interviewing Methodology (pp. 1–5). Hoboken, NJ: John Wiley & Sons.
Miller, K. and Ryan, J.M. (2011) Design, Development and Testing of the NHIS Sexual Identity Question. Hyattsville, MD: National Center for Health Statistics. Q-Bank (online access to question evaluation reports). http://wwwn.cdc.gov/qbank/report/Miller_NCHS_2011_NHIS%20Sexual%20Identity.pdf.
Miller, K., Mont, D., Maitland, A., Altman, B., and Madans, J. (2010) Results of a cross-national structured cognitive interviewing protocol to test measures of disability. Quality & Quantity, 45 (4): 801–815.
Miller, K., Willson, S., Chepp, V., and Padilla, J.L. (eds) (2014) Cognitive Interviewing Methodology. Hoboken, NJ: John Wiley & Sons.
Office of Management and Budget (2006) Standards and Guidelines for Statistical Surveys. http://www.whitehouse.gov/sites/default/files/omb/inforeg/statpolicy/standards_stat_surveys.pdf
Pan, Y. (2007) The Role of Sociolinguistics in Federal Survey Development. US Bureau of the Census. http://beta.census.gov/srd/papers/pdf/rsm2007-01.pdf.
Ridolfo, H., Miller, K., and Maitland, A. (2012) Measuring sexual identity using survey questionnaires: how valid are our measures? Sexuality Research and Social Policy. doi: 10.1007/s13178-011-0074-x.
Schober, M. (1999) Making sense of questions: an interactional approach. In Sirken, M., Herrmann, D., Schechter, S., Schwarz, N., Tanur, J., and Tourangeau, R. (eds), Cognition and Survey Research (pp. 77–94). New York: Wiley.
Schoua-Glusberg, A. and Villar, A. (2014) Assessing translated questions via cognitive interviewing. In Miller, K., Willson, S., Chepp, V., and Padilla, J.L. (eds), Cognitive Interviewing Methodology (pp. 51–68). Hoboken, NJ: John Wiley & Sons.
Schwarz, N. (2007) Cognitive aspects of survey methodology. Applied Cognitive Psychology, 21: 277–287.
Schwarz, N. and Strack, F. (1988) The Survey Interview and the Logic of Conversation: Implications for Questionnaire Construction. Mannheim: ZUMA-Arbeitsbericht. http://nbn-resolving.de/urn:nbn:de:0168-ssoar-66536
Sudman, S., Bradburn, N., and Schwarz, N. (1996) Thinking about Answers: The Application of Cognitive Processes to Survey Methodology. Hoboken, NJ: John Wiley & Sons.
Tourangeau, R. (1984) Cognitive science and survey methods. In Jabine, T., Straf, M., Tanur, J., and Tourangeau, R. (eds), Cognitive Aspects of Survey Methodology: Building a Bridge Between Disciplines (pp. 73–100). Washington, DC: National Academy Press.
Tourangeau, R., Rips, L., and Rasinski, K. (2000) The Psychology of Survey Response. Cambridge: Cambridge University Press.
Willis, G. (2004) Cognitive interviewing revisited: a useful technique, in theory? In Presser, S., Rothgeb, J., Couper, M., Lessler, J., Martin, E., Martin, J., and Singer, E. (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 23–43). Hoboken, NJ: John Wiley & Sons.
Willis, G. (2005) Cognitive Interviewing: A Tool for Improving Questionnaire Design. Thousand Oaks, CA: SAGE.
Willis, G. and Miller, K. (2011) Cross-cultural cognitive interviewing: seeking comparability and enhancing understanding. Field Methods, 23 (4): 331–341.
Willis, G. and Zahnd, E. (2007) Questionnaire design from a cross-cultural perspective: an empirical investigation of Koreans and Non-Koreans. Journal of Health Care for the Poor and Underserved, 18 (4): 197–217.
Willson, S. and Miller, K. (2014) Data collection. In Miller, K., Willson, S., Chepp, V., and Padilla, J.L. (eds), Cognitive Interviewing Methodology (pp. 15–34). Hoboken, NJ: John Wiley & Sons.
Zerubavel, E. (1999) Social Mindscapes: An Invitation to Cognitive Sociology. Cambridge, MA: Harvard University Press.
16
Designing Questions and Questionnaires
Jolene D. Smyth
INTRODUCTION

In theory, designing questions and questionnaires seems rather simple: make a list of things you want to know and write questions about each one. A quick online search of the term 'web survey' will produce a number of online survey software providers using words like 'fast', 'easy', 'free', and 'within minutes' that suggest this is true. The result is a strong sense that anyone can design questions and questionnaires; it is easy. But anyone who has ever thoughtfully designed a survey will know that, in reality, it requires one to make countless important decisions. For this reason, entire books have been written on how to do it well, and even seemingly simple decisions, such as how large to make an answer box, have become the subject of multiple scientific conference presentations and peer-reviewed articles. When faced with many design decisions, it is helpful to have a framework or perspective from which to approach the decision-making
process. For questionnaires, it is useful to use a respondent-centered perspective; that is, to consider how the respondent will experience the questions, what they need to be able to answer accurately, and what might go wrong for them. This chapter will present a respondent-centered approach to question and questionnaire design by centering many design issues squarely within the framework of what respondents have to do to answer questions. After some brief background on the goals and contents of a questionnaire, it describes a common model of the respondent response process and then provides discussion and examples of how various questionnaire design elements such as question wording, visual design, and question order can impact respondents throughout that process. The chapter will also discuss how a response process framework can help with pretesting questionnaires. Throughout the chapter, examples and questionnaire design guidance will be given, but due to the sheer size of the
questionnaire design literature, these will not be exhaustive. For additional examples and guidance, the reader should consult one of the many currently available questionnaire design texts (e.g., Bradburn et al., 2004; Dillman et al., 2014; Fowler, 1995; Salant and Dillman, 1994; Saris and Gallhofer, 2014).
WHAT ARE WE TRYING TO ACCOMPLISH WITH QUESTION AND QUESTIONNAIRE DESIGN?

The obvious answer is that we are trying to accurately measure constructs that allow us to answer our broader research question. But this is short-sighted. We are also trying to accomplish other, less obvious, but important goals. One is to convince sample members to respond. This is why questionnaire design texts suggest avoiding starting with boring, embarrassing, or sensitive questions and instead making the first questions simple, interesting, and applicable to all sample members, with a professional design and visual layout that makes the questionnaire look simple to complete. Moreover, questionnaire titles should be informative, interesting, and meaningful to respondents (Dillman et al., 2014).

Another goal is to encourage respondents to optimize rather than satisfice in their response behaviors. Optimizing involves proceeding carefully and completely through the steps of answering each question; satisficing involves shortcutting that process. Satisficing behaviors can range from weak, like not doing the work to remember all instances of a behavior, to strong, like making up answers, answering randomly, or skipping questions (Krosnick, 1991). Questionnaires need to be designed to encourage optimizing.

Our final goal is to help respondents be able to answer the questions and to do so in the correct order (Dillman et al., 2014). This means the questionnaire has to be designed so that answer spaces are evident and usable
and the order in which questions should be answered is obvious. That is, it needs to have a clear navigational path. Thus, while it seems like questionnaire design is about asking the right questions well, there is really much more to it than that. The questionnaire needs to collect needed information, but it also needs to encourage sample members to become optimizing respondents and help guide them along. Writing questions is only one part of the whole of questionnaire design.
WHAT GOES IN THE QUESTIONNAIRE?

The research question(s) will drive what constructs need to be measured in the questionnaire and how they are measured. Imagine a researcher wants to know how many people in a state are engaging in risky behaviors and if those behaviors are associated with health outcomes. The researcher will first need to define what constitutes a risky behavior (e.g., smoking, drinking, fighting, etc.). The idea of 'engaging' in these behaviors will then need to be clarified. For example, does he want to know whether sample members have done each behavior (ever or over a specific time period) or how often or how much they have done each behavior? The answers will drive the types of questions to be asked and thus the types of data collected and the types of variables available for analysis. Likewise, the concept of health outcomes would need further specification with important domains and subdomains identified for operationalization. For further discussion of operationalization, see Hox (1997) and Saris and Gallhofer (2014), and for examples of the use of focus groups and semi-structured interviews in this process, see Joseph et al. (1984) and Schaeffer and Thomson (1992).

Early on, one should also develop an analysis plan to help make sure that how each construct is measured will yield the type of data
needed. For example, open-ended narrative responses would not allow one to produce a smoking prevalence estimate. Likewise, asking whether one smokes would not allow one to examine the association between frequency of smoking and health outcomes. The research question and analysis plan should be used to help ensure all necessary constructs are asked about and operationalized appropriately to produce the needed data.

An analysis plan can also help one refine the initial research question(s) and list of constructs to measure. For example, while age, race, and gender are not explicitly identified in the aforementioned research question, the researcher would likely want to measure them because they may be related to both engagement in risky behaviors and health outcomes, making them necessary for subgroup analyses or as control variables in a regression, both of which should be expressed in the final research questions but are sometimes initially overlooked. These and other constructs may also need to be measured for weighting and adjustment purposes, which should also be planned carefully upfront in the context of the research question(s). Finally, an analysis plan can help avoid the mistake of asking extra questions that have no place in the analyses but that add respondent burden.

Once the constructs have been identified, one has to figure out how to write questions that will actually measure them. A good first step is always to see how others have measured them by checking existing question banks, data archives, and surveys. Two notable question banks in the US are the Inter-University Consortium for Political and Social Research (ICPSR – http://www.icpsr.umich.edu/) and the Roper Center for Public Opinion Research iPOLL Databank (http://ropercenter.cornell.edu/ipoll-database/). Existing questions for many constructs can be found in these sources; however, that they were previously used does not necessarily mean they are good measures or that they will work well in a different population.
While question banks can be very helpful for generating ideas, a healthy level of criticism is advised. Whether using preexisting questions or writing from scratch, it is helpful to view the question through a respondent-centered lens. This entails considering how respondents will experience the question, what design features will help them answer accurately, and what design features will cause problems.
HOW DO RESPONDENTS EXPERIENCE QUESTIONS? A COMMON MODEL OF THE RESPONSE PROCESS

A common model of the response process says that respondents must complete several steps to answer a question. They can complete these linearly or they can circle back and forth through them as needed. First, they must comprehend the question, including key terms. They then must retrieve the information needed to formulate a response and then use that information to make a judgment. Finally, they have to provide their answer (Tourangeau et al., 2000). Jenkins and Dillman (1997) have argued that before any of these steps can be undertaken, respondents first have to perceive the question (i.e., see or hear it).

In designing questionnaires, a primary goal has to be to make each step of the response process as easy as possible so that respondents can provide accurate answers with minimal burden. The rest of this chapter describes how question wording, visual design, and question order can facilitate this. But first, it is important to recognize that target populations and respondents vary in many ways that will also impact the response process. Age might impact their cognitive processing and memory; nationality and culture might impact the social norms they follow; their social position might impact what they consider a sensitive question, how they react to sensitive questions, or even what questions make sense to
ask them; their language might impact what concepts can be used and how they can be used; etc. Some of these issues are discussed in Chapters 12, 13, 19, and 43. Because of length restrictions, this chapter is limited to emphasizing the importance of considering these contexts when designing questions and questionnaires without going into more depth on these important exigencies.
QUESTION WORDING TO FACILITATE RESPONSE

Question Wording for Comprehension

Many of the most common prescriptions for writing questions are intended to avoid problems with the comprehension stage of the response process by ensuring that respondents can understand what all the words in the question mean and what is being asked of them. These include: ask only one question at a time, use simple and familiar words, use specific and concrete words (i.e., avoid vague words), use as few words as possible, use complete sentences with simple structures, and avoid double negatives. In addition to these strategies, definitions, instructions, or examples of certain concepts can be provided.

For example, in an experiment to examine how best to clarify a vague concept, Redline (2013) asked respondents how many pairs of shoes they owned. Some were instructed to exclude boots, sneakers, athletic shoes, and bedroom slippers but to include sandals, other casual shoes, and dress shoes. Others received no clarifying instructions. Because she defined shoes in this limited way, she expected the mean number of shoes to be lower when respondents used the clarifying instructions. Indeed, significantly fewer pairs of shoes were reported when the instructions were provided than when they were not, especially when they preceded rather than trailed the question. Respondents
did use the instructions to gain a better understanding of the concept of shoes. However, Redline found that a decomposition strategy worked even better (i.e., fewer pairs of shoes reported, reflecting proper exclusions) than providing instructions with the question stem. Here the question was broken up into three specific questions, first asking how many pairs of shoes they owned, then how many of them were boots, sneakers, athletic shoes, or bedroom slippers, and finally how many were sandals, other casual shoes, or dress shoes. The decomposition strategy also worked better for other constructs. Decomposition has been found to be particularly useful when definitions and clarifications are long and complex and when the event or item in question is distinctive and can be enumerated (Belli et al., 2000; Fowler, 1992; Menon, 1997). With nondistinctive events or behaviors, decomposition can increase overreporting (Belli et al., 2000).

Providing examples can also help clarify ambiguous terms (Schaeffer and Presser, 2003) by helping respondents better define the concepts and understand the specificity needed. For example, Martin (2002) found that providing examples of specific Hispanic origins (e.g., Argentinean, Colombian, etc.) increased the percentage of respondents reporting specific origins (even unlisted ones), and decreased the percentage reporting general categories like 'Hispanic', 'Latino', or 'Spanish'. However, the examples one displays can also have unintended impacts on respondent comprehension. For example, Redline (2011) found that respondents reported eating more of a food item when the examples were broad compared to narrow (e.g., meat versus poultry) and when they were frequently consumed compared to rarely consumed (e.g., beef, pork, and poultry versus lamb, veal, and goat). While research has shown that examples do impact comprehension, it has not supported the common notion that recall will be limited to the specific examples provided at the expense
of others (Martin, 2002; Martin et al., 2007; Tourangeau et al., 2009).
Question Wording for Retrieval

The next step in the response process is retrieval. In some cases, retrieval may involve searching one's memory, such as when asked how many times in the last week they went shopping. In other cases, it may involve searching records like tax records, college transcripts, or personal medical records. Often, the way information is stored in the record-keeping system itself can greatly complicate the retrieval process. For example, farmers may be asked how many pounds or tons of a crop they produced when their record-keeping system stores production information in bushels. For discussion of the impact of records on survey responses in business or establishment surveys, see Snijkers et al. (2013).

Questions should be written to accommodate the retrieval process, which means accounting for how people recall information. People are better able to recall events that happened recently, are distinctive (e.g., buying a new car versus buying groceries), or that are important (e.g., a wedding celebration versus a birthday celebration) (Tourangeau et al., 2000). People can also more accurately recall events when the recall period is shorter (e.g., the last week versus last year). When people are not able or willing to accurately recall information to enumerate, they often resort to estimation strategies in which they recall a rate for the event. They can then use that rate in the judgment stage of the response process to produce an estimate of the total number of times the event occurred (i.e., rate-based estimation) (Blair and Burton, 1987; Conrad et al., 1998). For example, one might recall grocery shopping about twice a week (i.e., the rate) and thus estimate that he went grocery shopping 104 times in the last year (2 × 52 = 104). Other such estimation strategies are discussed in detail
by Tourangeau et al. (2000). Respondents are more likely to use estimation strategies when the events asked about are frequent or regular in nature because they are more difficult to recall and enumerate; often a rate is easier to determine (Tourangeau et al., 2000). Thus, in writing questions, it is important to determine whether exact enumeration of events is needed or whether an estimate is good enough. With very frequent and mundane events, estimation might just have to be good enough. However, for events respondents are expected to be able to recall and enumerate, a shorter recall period will help. Slowing down the survey to give more time to think can also improve retrieval, and can be done by asking longer (not more complex) questions (Bradburn and Sudman, 1979; Tourangeau et al., 2000). Retrieval is also important for attitude and opinion questions. When the attitude has already been well formulated (i.e., preexisting and stable), retrieval can be straightforward. It is more complicated when there is no preexisting attitude. In this case, respondents have to formulate an evaluation on the spot. To do this, they may retrieve relevant impressions or stereotypes (e.g., a candidate’s political party) rather than specific information (e.g., the candidate’s issue positions) or they may retrieve more general attitudes or values related to an issue (e.g., drug use is bad) and formulate a specific opinion based on these (e.g., marijuana should not be legalized) (Tourangeau et al., 2000). Respondents are more likely to rely on general impressions and values when time and motivation are low. They will do the more difficult work of retrieving specific relevant information when they are motivated, have the time to do so, and when the specific information is readily retrievable (e.g., previously asked about).
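To make the arithmetic behind rate-based estimation concrete, the following is a minimal illustrative sketch; it is not from the chapter, and the function names and figures are hypothetical. It simply contrasts episodic enumeration with applying a recalled rate to the reference period, using the grocery-shopping example above.

```python
# Minimal illustrative sketch (not from the chapter): contrasting episodic
# enumeration with rate-based estimation for a behavioral frequency question.
# Function names and figures are hypothetical.

def enumerate_episodes(recalled_episodes):
    """Count the episodes a respondent can actually recall
    (plausible for rare, distinctive events)."""
    return len(recalled_episodes)

def rate_based_estimate(rate_per_week, weeks_in_period=52):
    """Apply a recalled rate to the reference period
    (what respondents tend to do for frequent, regular events)."""
    return rate_per_week * weeks_in_period

# The grocery-shopping example above: 'about twice a week' over one year.
print(rate_based_estimate(2))                      # 104
print(enumerate_episodes(['wedding', 'new car']))  # 2
```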
Question Wording for Judgment

Respondents then need to combine retrieved information into a single judgment. In some
cases, this is straightforward, such as when a running tally of an event has been kept (e.g., number of birthdays) or a preexisting and stable attitude has been retrieved (Tourangeau et al., 2000). In other cases additional mental calculations need to be made. These can become quite complex when multiple considerations have been retrieved and must be synthesized (e.g., I got a promotion, but my dog just died, so overall, how would I say things are going these days?). The multiple considerations are often weighted and aggregated in complex ways. Some considerations may even be disregarded altogether if they are determined to be irrelevant, invalid, or redundant (Tourangeau et al., 2000). Most of the ways that questionnaire design can impact this judgment process are through question order effects, a topic we return to below.
Question Wording for Reporting

Reporting is the final step of the response process in which the respondent actually provides her answer. For the reporting stage to go well, the respondent has to have the ability and willingness to report her answer, both of which are impacted by a number of questionnaire design features. One such design feature is the format of the question. Open-ended questions require considerably different skill sets (i.e., writing, typing, or speaking) than closed-ended questions (i.e., checking, clicking, or speaking). As such, they sometimes require more motivation, especially when considerable detail needs to be provided. In this case, respondent motivation to provide quality open-ended responses can be increased by emphasizing the importance of the question. Statements like, 'This question is very important to this research' can increase the length and quality of open-ended responses (Oudejans and Christian, 2011; Smyth et al., 2009). For closed-ended questions, it is imperative that there be a response option that clearly and
adequately represents the response and that it is as straightforward as possible for respondents to map their response onto that option. It is for this reason that most questionnaire design texts emphasize that response options should be exhaustive (i.e., all possible options should be provided) and mutually exclusive (i.e., no overlap between response options). In addition, vagueness in response options should be minimized. For example, with ordinal scales, one can verbally label all of the categories rather than just the first and last to make the meaning of each category clearer. Doing so increases the reliability and validity of the scales (Krosnick and Fabrigar, 1997; Menold et al., 2013). Using construct-specific rather than vague quantifier labels can also make reporting more straightforward and reduce measurement error (e.g., asking 'How would you rate the quality of your new car?' instead of 'How much do you agree or disagree that your new car is high quality?') (Saris et al., 2010).

Ensuring there are enough scale points for respondents to place themselves into a category, but not so many that the difference between any two adjacent points cannot be discerned, can also help ease reporting on ordinal scales. Reliability and validity seem to be maximized at 5 to 7 scale points for bipolar scales (i.e., scales that measure both direction and magnitude, like very satisfied/very dissatisfied) and 4 to 5 scale points for unipolar scales (scales that measure only magnitude, like never to always) (Krosnick and Fabrigar, 1997). In addition, ensuring that the scale points are conceptually equidistant can also help. For example, reporting might be difficult on the scale very satisfied, slightly satisfied, neither satisfied nor dissatisfied, etc., because there is so much conceptual distance between the 'very' and 'slightly' categories and so little between the 'slightly' and 'neither' categories.

Finally, for scales with nominal response options, it is important that the response options be clearly stated, mutually exclusive, and exhaustive as mentioned previously. In
addition, it is also necessary to be very clear about whether only one response option should be selected or whether multiple answers are acceptable. This can be done with question wording, additional instructions, or visual design (see below).

Questionnaire designers also have to make sure respondents are willing to provide accurate answers at the reporting stage. Some respondents may be hesitant because of social or normative concerns. Social desirability is a tendency to answer questions in a way that makes one look good (or not look bad) rather than providing the most accurate answer (Tourangeau and Yan, 2007). It can take the form of underreporting negative behaviors like illicit drug use, abortion, and poor college performance or overreporting positive behaviors like voting and church attendance (Bernstein et al., 2001; Hadaway et al., 1993; Kreuter et al., 2008; Tourangeau and Smith, 1996). Respondents are more susceptible to social desirability in interviewer-administered surveys (Aquilino, 1994; de Leeuw, 1992; Dillman et al., 1996), but it is a concern in all modes.

Aside from increasing response privacy by asking sensitive questions in a private, self-administered mode, question design strategies to combat social desirability sometimes focus on making it either more acceptable or safe to answer honestly. Question wording is commonly changed to make the question less threatening, such as asking 'Have you happened to …', 'Some people believe … and others believe …', 'Do you believe … or …?', and 'There are many reasons people might not do …, such as not having time, not having transportation, or being sick. How about you, did you do …?' (Bradburn et al., 2004). While these strategies make intuitive sense, the few empirical evaluations that have been done suggest they are not more effective than direct inquiries (Bradburn et al., 2004; Schuman and Presser, 1981; Yeager and Krosnick, 2012).
Another strategy for asking sensitive questions is the Randomized Response Technique (RRT) (Fox and Tracy, 1986; Warner, 1965). RRT keeps interviewers from knowing whether respondents are instructed to give a truthful or a specific artificial answer, which is determined by a simple probability mechanism. Thus, RRT provides respondents with confidentiality. The researcher can produce population estimates because she knows what proportion of the sample was instructed to give the artificial answer. Research assessing RRT has produced mixed results. Some studies show that it increases reports of socially undesirable behaviors, but either has no effect on or increases reports of socially desirable behaviors (sometimes to untenable levels; Holbrook and Krosnick, 2010a). These results suggest RRT may simply increase all types of reports rather than producing more accurate reports and that many respondents do not implement the technique correctly.

A similar technique is the item count technique (ICT), which involves respondents reporting the number of behaviors they have done from either a list of nonsensitive behaviors or the same list plus one sensitive behavior. The difference between the mean number of behaviors for the two groups reflects the prevalence of the sensitive behavior in the population (Holbrook and Krosnick, 2010b). This method provides confidentiality regarding engagement in the sensitive behavior unless a respondent engages in either all or none of the behaviors in the list. ICT produces higher reports of undesirable and similar reports of desirable behaviors or attributes than direct questions (for a review, see Holbrook and Krosnick, 2010b), but reduces statistical power because estimates are based on only a subset of the target population. Neither RRT nor ICT works with continuous items. For these, a similar technique, the item sum technique (IST), has been proposed (Trappmann et al., 2013).
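The chapter does not spell out the estimators themselves, but the logic of both techniques reduces to simple algebra. The sketch below is a minimal illustration, assuming a forced-response RRT design (respondents answer truthfully with a known probability and otherwise give an artificial 'yes') and a basic two-group list experiment for ICT; all function names, parameter values, and data are hypothetical.

```python
# Minimal illustrative sketch (not from the chapter): back-of-the-envelope
# prevalence estimators for a forced-response RRT design and a two-group
# item count (list experiment) design. All parameters and data are hypothetical.
from statistics import mean

def rrt_prevalence(prop_yes_observed, p_truth):
    """Forced-response RRT: with probability p_truth the respondent answers
    truthfully; otherwise they give the artificial answer 'yes'. Solving
    observed = p_truth * prevalence + (1 - p_truth) for prevalence."""
    return (prop_yes_observed - (1 - p_truth)) / p_truth

def ict_prevalence(counts_with_sensitive_item, counts_without):
    """ICT: estimated prevalence is the difference in mean item counts between
    the group whose list includes the sensitive item and the group whose list
    does not."""
    return mean(counts_with_sensitive_item) - mean(counts_without)

# Hypothetical figures: 40% 'yes' observed when 75% of respondents answer truthfully.
print(round(rrt_prevalence(0.40, 0.75), 2))                        # 0.2
# Hypothetical item counts from the two randomly assigned groups.
print(round(ict_prevalence([2, 3, 1, 4, 2], [1, 3, 1, 3, 2]), 2))  # 0.4
```

The familiar design tension applies here: the more protection the design gives respondents (a lower truth probability, a longer nonsensitive list), the noisier the resulting prevalence estimate.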
to agree rather than to disagree. In one experiment 60% agreed with the statement, ‘Individuals are more to blame than social conditions for crime and lawlessness in this country’ and 57% agreed with the exact opposite statement, ‘Social conditions are more to blame than individuals for crime and lawlessness in this country’ (Schuman and Presser, 1981). Without acquiescence, the sum of the percentage agreeing with the two statements should not exceed 100%. Clearly, some respondents were arbitrarily agreeable. There are several explanations for acquiescence. One is that it is related to a personality or cultural trait in which agreeing is experienced as more pleasant and polite than disagreeing (Javeline, 1999; Krosnick, 1991, 1999). Another is that acquiescence results from respondents being deferential to higher status surveyors (Lenski and Leggett, 1960). Both of these are normative and thus might lead respondents to edit their answer at the reporting stage. A third possibility is that respondents engage in only superficial retrieval, which would likely result in more reasons the statement would be true and thus their ultimate agreement with it (Krosnick, 1991). Regardless of the cause, it is clear that acquiescence leads to lower quality responses and is more common on agree/disagree, true/ false, and yes/no questions (Krosnick, 1991, 1999). Methods to minimize acquiescence include switching to self-administered modes (Dillman and Tarnai, 1991; Jordan et al., 1980; but see de Leeuw, 1992 for an exception), switching from agree/disagree to constructspecific scales as described above (Dillman et al., 2014; Saris et al., 2010; Schuman and Presser, 1981), and asking questions in a forced choice format (e.g., asking which of two opposing statements respondents agree with) (Schuman and Presser, 1981). As all of these examples illustrate, there are question wording design strategies that can help respondents through each stage of the response process. The questionnaire design literature contains many others. But in
As all of these examples illustrate, there are question wording design strategies that can help respondents through each stage of the response process. The questionnaire design literature contains many others. But in all cases, when making decisions about question wording, it is helpful to think through how wording might impact each stage of the response process to assess whether it is likely to work well.
VISUAL DESIGN OF QUESTIONS AND QUESTIONNAIRES TO FACILITATE RESPONSE

Good question wording can do a great deal to help respondents understand and answer questions, but alone is not enough. Questions also need to be visually designed to support question answering. Considerable research on visual design of questions and questionnaires has now been conducted (see Couper, 2008; Dillman et al., 2014; Tourangeau et al., 2013 for overviews). Because most of this research has focused on visual design for self-administered modes (except see Edwards et al., 2008), that will be the focus of this section; however, many of the visual design strategies covered here should also help interviewers.

Good visual design has several purposes. First, it can help establish a navigational path that discourages answering questions out of order. Second, it can help ensure that respondents see each question, reducing item nonresponse. Third, it can help respondents proceed through the steps of the response process for each question and thus encourage quality responses. To understand how visual design can be used to accomplish each of these goals, a brief summary of relevant concepts is given here. For more in-depth treatment, see Palmer (1999), Hoffman (2004), and Ware (2004) for the vision sciences concepts and Dillman et al. (2014) and Smyth et al. (2014) for their application to surveys.

First, in self-administered surveys, respondents encounter four visual elements: the written word, numbers, symbols, and graphics (Christian and Dillman, 2004;
Jenkins and Dillman, 1997). Each of these can be visually modified by altering properties such as size, font, brightness, color, shape, orientation, location, and motion. For example, an important word may be made more visible by increasing its font size or bolding or italicizing it. Likewise, nonessential information (e.g., office use only, data entry codes, etc.) can be deemphasized by making it smaller and lighter in color and placing it in areas respondents are less likely to look. Applying properties allows designers to impact how respondents understand the relationship between elements on a page. When respondents first encounter a page in a paper or web survey, they have to make sense of the visual elements and properties. The sense-making process starts at a broad level where they are processing the entire page at once and then narrows to a focus on individual elements on the page. They start by subconsciously scanning the entire page and noticing certain visual properties such as size, shape, color, and enclosure to gain a basic understanding of the scene. Here they are relying on sight alone, not applying previous knowledge or experience to make sense of the scene (i.e., bottom-up processing). They then begin to visually segment the page into multiple parts based on shared visual properties. In this step, contours, boundaries, and colors help them distinguish between figure (i.e., information to be further processed) and ground (i.e., information that will fade to the background). They will then start to look for patterns or relationships between elements perceived as figure. At this point, they move from preattentive (i.e., subconscious) to attentive (i.e., conscious) processing and they start to use top-down processing, meaning they use existing cultural knowledge, expectations, and previous experiences to start to make sense of the scene. At this stage, they are trying to get a sense of the overall organization of the page, recognition of the major elements on the page, and an understanding of how related elements are grouped together and separate
from unrelated elements. With this information, they can enter the final stage where their focus is narrowed to specific elements on the page. It is in this final task completion stage that they start reading words and focusing on images as would be needed to process individual questions. When they do this, their visual focus is substantially narrowed such that they are looking at only about 15 degrees of visual angle and are most focused on only about 2 degrees (i.e., 8–10 characters) at a time. Good visual design will give respondents a strong sense of the navigational path and where answers are needed at the broader page level before they start the more focused attentive processing of individual questions. This allows them to focus their conscious effort on answering questions. This is achieved through the strategic application of visual properties to visual elements on the page so that they help respondents recognize the patterns, groupings, and subgroupings among the elements. Gestalt psychology principles of pattern perception can be used to figure out how properties can be strategically applied. These principles say that items that are located in close proximity to one another, share a property (e.g., similarity in size, shape, contrast, color, etc.), are enclosed in a common region, are connected by other elements, continue smoothly from one to the other, or have a common fate will be perceived as belonging together in a group. The grouping of letters into words on this page provides a simple example of the principle of proximity at work. The close proximity of related letters and more distant proximity of unrelated letters creates boundaries between individual words (i.e., grouping). As with question wording, these visual design concepts can be applied to help respondents through each stage of the response process, although visual design tends to be most helpful at the perception, comprehension, judgment, and reporting stages.
Visual Design for Perception

To help respondents perceive individual questions, the questions should be located in expected places on the page. The first should typically be located in the upper left side of the page and questions thereafter should be located on the left margin under the preceding question (Dillman et al., 2014). Examples of questions that break from these conventions are follow-up questions placed to the right of the filter question, a second column of questions in a web design, or double question grid formats. These questions are less likely to be noticed and more likely to be left blank (Couper et al., 2013; Dillman et al., 2014). In addition, visual cues such as question numbers in front of the question stem or a bolder or larger font for the question stem can help respondents more easily recognize where each new question begins (Dillman et al., 2014).
Visual Design for Comprehension

There are several ways visual design can impact respondents' comprehension. For example, the use of graphical images can impact how respondents understand vague concepts. Couper et al. (2004) randomly assigned respondents answering a question about frequency of shopping in the last month to one of three experimental web survey versions: (1) no picture, (2) a low frequency picture (i.e., department store), and (3) a high frequency picture (i.e., grocery store). Unsurprisingly, the image impacted how respondents understood the vague concept of 'shopping' and thus their responses. The mean number of instances of shopping among those who received the department store picture was 7.73 compared to 9.07 among those who received the grocery store picture and 8.72 among those who got no picture.

Numbers with scales can also impact comprehension. In one well-known study,
respondents were asked how successful they had been in life so far. They were shown a card with an 11-point scale labeled ‘not at all successful’ on the low end and ‘extremely successful’ on the high end, and corresponding numbers of either 0 to 10 or –5 to 5. Thirty-four percent placed themselves in the lowest six points on the 0 to 10 scale compared to only 13% on the –5 to 5 scale (Schwarz et al., 1991a). Additional research indicated that respondents understand labels such as ‘not at all successful’ to be much more extreme when they are paired with a negative number (e.g., explicit failure) than when paired with zero (e.g., absence of success) (Schwarz et al., 1991a).
Visual Design for Judgment

Considerable research has concerned how to ask general and specific questions. For example, if one wants to ask about satisfaction with a series of life domains such as relationship, work, and leisure as well as satisfaction with life overall, how should the questions be ordered? The literature suggests that if the more specific questions are asked first, the general life satisfaction question will be treated as a summary measure (Schwarz et al., 1991b). If one wants to avoid this, the overall question should be asked first and/or instructions should be added – 'excluding your relationship, work, and leisure, how satisfied are you currently with your life as a whole'. Visual design might be helpful in a similar way. Placing the specific items in the same enclosed space as the overall item should encourage respondents to view these items as conceptually related and use the specific domains in their overall judgment, especially if the overall item comes last. Placing the overall item in a separate enclosure, especially on a different page, should encourage respondents to exclude the specific items from their judgment for the overall item.
Visual Design for Reporting

Visual design seems particularly important at the reporting stage. Several examples from the visual design survey literature speak to this stage. For example, there is clear evidence that better answers are obtained when answer spaces for open-ended items are designed to accommodate the type of answer the question asks for. For descriptive open-ended items, larger answer boxes obtain longer answers containing more information than smaller answer boxes (Christian and Dillman, 2004; Smyth et al., 2009). Likewise, a single answer box works well for descriptive open-ended questions, but multiple answer boxes stacked like a list increase the number of items listed and result in cleaner, more focused answers on list-style questions (Dillman et al., 2014; Smyth et al., 2007). For numeric open-ended questions, templates and labels (i.e., 'masks') can be used to help respondents properly format their answers (Christian et al., 2007; Couper et al., 2011; Dillman et al., 2014). In one study, 44% of college undergraduate respondents provided their college start date using the desired two-digit month and four-digit year when the answer boxes for month and year were the same size and labeled 'month' and 'year', but 94% did so when the month box was half the size of the year box and they were labeled with the symbols 'MM' and 'YYYY' (Dillman et al., 2014). The design of the answer boxes greatly improved respondents' ability to provide the desired format.

Whether only one or multiple responses are acceptable can also be communicated with visual design. Generally, circle-shaped answer spaces mean only one answer is allowed while square-shaped answer spaces mean multiple are allowed. This is true by definition on the web, where radio buttons work as a group and html check boxes work individually (Couper, 2008), but it also seems to carry over to paper questionnaires. Respondents receiving circle answer spaces are more likely to give single
answers while those receiving square answer spaces are more likely to provide multiple answers (Witt-Swanson, 2013). As a final example of the importance of visual design at the reporting stage, Tourangeau et al. (2004) found that response distributions on an ordinal scale differed significantly depending on whether the nonsubstantive response options (i.e., ‘don’t know’ and ‘no opinion’) were visually grouped with substantive response options. Respondents seemed to assume that the conceptual and visual midpoints of the scale were aligned, but when nonsubstantive response options were grouped with substantive response options, the visual midpoint fell below the conceptual midpoint, leading more respondents to choose lower points on the scale. That is, the reporting process was impacted by the visual design, not simply the lexical meaning of the category labels. Additional discussion of how visual design can be used for individual questions and whole questionnaires is provided in Dillman et al. (2014) and Smyth et al. (2014).
ORDERING QUESTIONS

While we tend to think of survey questions as independent measurement devices, each question appears within the context of the other questions. That context can substantially impact how respondents proceed through the response process.
Ordering Questions for Comprehension

When question meaning is unclear, respondents often seek clarification from other features of the questionnaire (Schwarz, 1996) such as previous questions. The following agree/disagree items illustrate the importance of context:
You had trouble paying the bills
You felt burden from having too much debt
You were able to do almost everything you needed to do
You were able to take time for yourself
You had too little time to perform daily tasks
In this order, the item, 'you were able to do almost everything you needed to do' appears to be referring to finances. But consider this new ordering:

You had trouble paying the bills
You felt burden from having too much debt
You were able to take time for yourself
You had too little time to perform daily tasks
You were able to do almost everything you needed to do
Now it appears to be asking about time availability. Reordering gives this item a totally different meaning. Thus, question order can have significant impacts on comprehension.
Ordering Questions for Retrieval

There are several ways question ordering can help with retrieval. For example, grouping topically similar questions gains efficiencies because respondents can use retrieved information to answer all questions on a topic before moving to a different topic (Dillman et al., 2014). However, one also has to be watchful for unintentional priming effects, which occur when information retrieved for an early question is used to answer a later question simply because it is more easily accessible. For example, Todorov (2000) found that asking respondents about specific chronic conditions early in the National Health Interview Survey on Disability increased the likelihood that they would identify one of the chronic conditions asked about as a cause of their disability in a later question.

Question ordering can also impact retrieval when asking about a series of events that occurred over time. In this case, the events
should be asked about in chronological or reverse-chronological order because autobiographical memories are often linked such that remembering one event can help remember the next (or previous) event in the sequence (Belli, 1998).
Ordering Questions for Judgment

Question order can also impact judgment once the relevant information has been retrieved, as illustrated in the example of general and specific questions above. To illustrate, in one study 52% of respondents selected 'very happy' when asked how 'things are these days' after having been asked about their marriages, but only 38% selected 'very happy' when they had not previously been asked about their marriages. Those who were first asked about their marriages appeared to carry their happy thoughts over to the more general question (Schuman and Presser, 1981).

The opposite, where respondents deliberately leave information out of a judgment because they do not want to be repetitive (i.e., a subtraction effect), can also happen (Grice, 1975; Schwarz, 1996). In one example, Mason et al. (1994) found respondents were less optimistic about economic conditions in their state over the next five years if they were first asked about economic conditions in their community. This was because they were less likely to count new industry as a positive factor for their state if they had already counted it as a positive for their community. Those asked about their state first did not subtract out the new industry factor.
Ordering Questions for Reporting

Question order can also reduce editing of responses. For example, Dillman et al. (2014) suggest placing sensitive questions near the end of the questionnaire so respondents are more committed to finishing the task, have
better rapport with the researcher, and understand the full context of the questionnaire by the time they reach them. Clancy and Wachsler (1971) found that respondents are more likely to acquiesce when they are fatigued, suggesting acquiescence-prone questions ought to be placed earlier. Respondents might also edit their responses in other socially normative ways linked to question order. For example, in a classic experiment, Hyman and Sheatsley (1950) found that respondents were much more likely to concede that communist reporters should be able to report on visits to the United States if they had already answered a question about whether US reporters should be able to report on visits to the Soviet Union. Respondents were applying the norm of even handedness or fairness. It is difficult to say communist reporters should not be allowed to do something one just said US reporters should be allowed to do. The same effect has been found among students asked about penalties for professors and students plagiarizing (Sangster, 1993). Respondents might also edit later answers to appear more consistent with previous answers (Dillehay and Jernigan, 1970).
HOLISTIC DESIGN

The examples in this chapter only scratch the surface of what we know about how to design questions and questionnaires. However, they demonstrate that wording, visual design, and question order can have substantial impacts on responses. In fact, in some cases, visual design can even impact the way question wording is interpreted, such as when visual subgrouping of response options along with an instruction to 'select the best answer' leads respondents to select one answer from each subgroup rather than one answer overall (Smyth et al., 2006). In reviewing questionnaires, one commonly comes across questions like this that are not
designed holistically (e.g., yes/no question stems with ordinal response options, open-ended question stems with closed-ended response options, check-all-that-apply question stems with forced-choice response options, descriptive open-ended questions with tiny answer boxes, etc.). The contradictions in these designs make it more difficult for respondents to get through the response process and undermine measurement.
PRETESTING WITH THE RESPONSE PROCESS IN MIND

Survey designers have several pretesting methods for detecting questionnaire design problems (Dillman et al., 2014; Presser et al., 2004), as discussed in Chapter 24 of this Handbook. I will not repeat that discussion here, but will point out that keeping the response process in mind during pretesting is potentially beneficial in two important ways. First, each pretesting method is stronger at detecting certain types of problems than others. For example, expert reviews have been shown to be more effective at identifying retrieval and judgment problems that affect data quality than comprehension and reporting problems (Olson, 2010). Cognitive interviews can be designed to detect problems at all stages of the response process (see Dillman et al., 2014; Willis, 2004, 2015), as can experiments (Fowler, 2004). Pilot studies and eye tracking, however, are more useful for detecting problems with perception and reporting (Bergstrom and Schall, 2014; Dillman et al., 2014; Galesic et al., 2008). Knowing what type of problem one wants to detect can help determine what pretesting method(s) should be used. Likewise, knowing the strengths of pretesting methods for detecting different types of problems can help ensure that the pretesting plan will cover all of the bases. Schaeffer and Dykema (2004) and Fowler (2004) provide examples of using multiple pretesting techniques to
improve design. Second, once pretest findings are in hand, one can use the response process to help make sense of them and think through possible solutions to problems. For example, identifying at what stage(s) of the response process problems seem to be occurring can help one identify solutions to the problems.
CONCLUSION Despite the impression sometimes created by software advertisers, good questionnaire design is not fast, easy, or free. It takes considerable time and knowledge. There is considerable scientific research regarding different aspects of design. However, in designing a questionnaire, one will quickly realize that our knowledge is still incomplete. Many decisions need to be made without empirical evidence. To some degree, this is because each questionnaire is different and as such sets a different context for individual questions. Further complicating matters, target populations and their skills, abilities, and cultures also vary widely, meaning the way questions and other design features are understood can vary widely, which is particularly challenging for cross cultural or cross national surveys. Orienting oneself to how respondents will experience the questionnaire and their response process can provide a useful framework for making these design decisions. This includes keeping in mind the goals of encouraging response, promoting optimizing, providing a clear navigational path, and, of course, collecting high quality measurements. It also requires one to think about how their design will impact each stage of the response process. In addition, understanding the response process and where it can break down can help one determine what pretesting method(s) are most appropriate for a given questionnaire and troubleshoot problems.
RECOMMENDED READINGS Dillman et al. (2014) provides in-depth discussion of questionnaire design issues for the major survey modes, including many examples of common mistakes and ways to fix them. Its companion website provides further survey examples. Schwarz (1996) describes assumptions research participants and survey respondents bring to the research task and how these should be taken into account by researchers during research design.
REFERENCES Aquilino, W. S. (1994). Interview mode effects in surveys of drug and alcohol use: a field experiment. Public Opinion Quarterly, 58, 210–240. Belli, R. F. (1998). The structure of autobiographical memory and the event history calendar: potential improvements in the quality of retrospective reports in surveys. Memory, 6 (4), 383–406. Belli, R. F., Schwarz, N., Singer, E., and Talarico, J. (2000). Decomposition can harm the accuracy of behavioral frequency reports. Applied Cognitive Psychology, 14, 295–308. Bergstrom, J. R., and Schall, A. (2014). Eye Tracking in User Experience Design. Waltham, MA: Morgan Kaufmann. Bernstein, R., Chadha, A., and Montjoy, R. (2001). Over-reporting voting: why it happens and why it matters. Public Opinion Quarterly, 65, 22–44. Blair, E., and Burton, S. (1987). Cognitive processes used by survey respondents to answer behavioral frequency question. Journal of Consumer Research, 14, 280–288 Bradburn, N. M., and Sudman, S. (1979). Improving Interview Method and Questionnaire Design. San Francisco, CA: Jossey-Bass. Bradburn, N., Sudman, S., and Wansink, B. (2004). Asking Questions. San Francisco, CA: Jossey-Bass. Christian, L. M., and Dillman, D. A. (2004). The influence of graphical and symbolic
language manipulations on responses to self-administered questions. Public Opinion Quarterly, 68 (1), 58–81. Christian, L. M., Dillman, D. A., and Smyth, J. D. (2007). Helping respondents get it right the first time: the influence of words, symbols, and graphics in web surveys. Public Opinion Quarterly, 71 (1), 113–125. Clancy, K.J., and Wachsler, R.A. (1971). Positional effects in shared-cost surveys. Public Opinion Quarterly, 35, 258–65 Conrad, F., Brown, N., and Cashman, E. (1998). Strategies for estimating behavioral frequency in survey interviews. Memory, 6, 339–366. Couper, M. P. (2008) Designing Effective Web Surveys. Cambridge: Cambridge University Press. Couper, M. P., Kennedy, C., Conrad, F. G., and Tourangeau, R. (2011). Designing input fields for non-narrative open-ended responses in web surveys. Journal of Official Statistics, 27 (1), 65–85. Couper, M. P., Tourangeau, R., Conrad, F. G., and Zhang, C. (2013). The design of grids in web surveys. Social Science Computer Review, 31, 322–345. Couper, M. P., Tourangeau, R., and Kenyon, K. (2004). Picture this! Exploring visual effects in web surveys. Public Opinion Quarterly, 68, 255–266. de Leeuw, E. D. (1992). Data Quality in Mail, Telephone, and Face-to-Face Surveys. Amsterdam: TT Publications. Dillehay, R. C., and Jernigan, L. R. (1970). The biased questionnaire as an instrument of opinion change. Journal of Personality and Social Psychology, 15 (2), 144–150. Dillman, D. A., and Tarnai, J. (1991). Mode effects of cognitively-designed recall questions: a comparison of answers to telephone and mail surveys. In P. P. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, and S. Sudman (eds), Measurement Errors in Surveys (pp. 73–93). New York: Wiley. Dillman, D. A., Sangster, R. L., Tarnai, J., and Rockwood, T. (1996). Understanding differences in people’s answers to telephone and mail surveys. In M. T. Braverman and J. K. Slater (eds), New Directions for Evaluation Series: Vol. 70. Advances in Survey Research (pp. 45–62). San Francisco, CA: Jossey-Bass.
Dillman, D. A., Smyth, J. D., and Christian, L. M. (2014). Internet, Phone, Mail, and MixedMode Surveys: The Tailored Design Method. Hoboken, NJ: John Wiley & Sons. Edwards, B., Schneider, S., and Brick, P.D. (2008). Visual elements of questionnaire design: experiments with a CATI establishment survey. In J. M. Lepkowski, C. Tucker, J. M. Brick, et al. (eds), Advances in Telephone Survey Methodology (pp. 276–296). Hoboken, NJ: John Wiley & Sons. Fowler, F. J. Jr. (1992). How unclear terms affect survey data. Public Opinion Quarterly, 56, 218–231. Fowler, F. J. Jr. (1995). Improving Survey Questions. Thousand Oaks, CA: Sage. Fowler, F. J. Jr. (2004). The case for more splitsample experiments in developing survey instruments. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 173– 188). Hoboken, NJ: John Wiley & Sons. Fox, J. A., and Tracy, P. E. (1986). Randomized Response: A Method for Sensitive Surveys. Beverly Hills, CA: Sage. Galesic, M., Tourangeau, R., Couper, M. P., and Conrad, F. G. (2008). Eye-tracking data: new insights on response order effects and other cognitive shortcuts in survey responding. Public Opinion Quarterly, 72 (5), 892–913. Grice, H. P. (1975). Logic and conversation. In P. Cole and J. L. Morgan (eds), Syntaxt and Semantics, 3: Speech Acts (pp. 41–58). New York: Academic Press. Hadaway, K., Marler, P., and Chaves, M. (1993). What the polls don’t show: a closer look at U.S. church attendance. American Sociological Review, 58, 741–752. Hoffman, D. D. (2004). Visual Intelligence. New York: Norton. Holbrook, A. L., and Krosnick, J. A. (2010a). Measuring voter turnout by using the randomized response technique: evidence calling into question the method’s validity. Public Opinion Quarterly, 74, 328–343. Holbrook, A. L., and Krosnick, J. A. (2010b). Social desirability bias in voter turnout reports: tests using the item count technique. Public Opinion Quarterly, 74, 37–67. Hox, J. J. (1997). From theoretical concept to survey question. In L. E. Lyberg, P. Biemer,
M. Collins, E. D. de Leeuw, C. Dippo, N. Schwarz, et al. (eds), Survey Measurement and Process Quality (pp. 47–69). New York: Wiley-Interscience. Hyman, H. H., and Sheatsley, P. B. (1950). The current status of American public opinion. In J. C. Payne (ed.), The Teaching of Contemporary Affairs: Twenty-first Yearbook of the National Council for the Social Studies (pp. 11–34). New York: National Education Association. Javeline, D. (1999). Response effects in polite cultures: a test of acquiescence in Kazakhstan. Public Opinion Quarterly, 63 (1), 1–28. Jenkins, C., and Dillman, D. A. (1997). Towards a theory of self-administered questionnaire design. In L. E. Lyberg, P. Biemer, M. Collins, E. D. de Leeuw, C. Dippo, N. Schwarz, et al. (eds), Survey Measurement and Process Quality (pp. 165–196). New York: Wiley-Interscience. Jordan, L. A., Marcus, A. C., and Reeder, L. G. (1980). Response styles in telephone and household interviewing: a field experiment. Public Opinion Quarterly, 44, 210–222. Joseph, J. G., Emmons, C-A, Kessler, R. C., Wortman, C. B., O’Brien, K., Hocker, W. T., and Schaefer, C. (1984). Coping with the threat of AIDS: an approach to psychosocial assessment. American Psychologist, 39 (11), 1297–1302. Kreuter, F., Presser, S., and Tourangeau, R. (2008). Social desirability bias in CATI, IVR and web surveys. Public Opinion Quarterly, 72, 847–65. Krosnick, J. A. (1991). Response strategies for coping with the cognitive demands of attitude measures in surveys. Applied Cognitive Psychology, 5, 213–236. Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50, 537–567. Krosnick, J. A., and Fabrigar, L. R. (1997). Designing rating scales for effective measurement in surveys. In L. Lyberg, P. Biemer, M. Collins, L. Decker, E. de Leeuw, C. Dippo, et al. (eds), Survey Measurement and Process Quality. New York: Wiley-Interscience, pp. 141–162. Lenski, G. E., and Leggett, J. C. (1960). Caste, class, and deference in the research interview. American Journal of Sociology, 65, 463–467. Martin, E. (2002). The effects of questionnaire design on reporting of detailed Hispanic
origin in Census 2000 mail questionnaires. Public Opinion Quarterly, 66, 582–593. Martin, E., Sheppard, D., Bentley, M., and Bennett, C. (2007). Results of the 2003 National Census Test of Race and Hispanic Questions. Research Report Series (Survey Methodology 2007–34). Washington DC: US Census Bureau. Mason, R., Carlson, J. E., and Tourangeau, R. (1994). Contrast effects and subtraction in part-whole questions. Public Opinion Quarterly, 58 (4), 569–578. Menold, N., Kaczmirek, L., Lenzner, T., and Neusar, A. (2013). How do respondents attend to verbal labels in rating scales? Field Methods. Advanced Access DOI: 10.1177/1525822X13508270 Menon, G. (1997). Are the parts better than the whole? The effects of decompositional questions on judgments with frequent behaviors. Journal of Marketing Research, 34, 335–346. Olson, K. (2010). An examination of questionnaire evaluation by expert reviewers. Field Methods, 22, 295–318. Oudejans, M., and Christian, L. M. (2011). Using interactive features to motivate and probe responses to open-ended questions. In M. Das, P. Ester, and L. Kaczmirek (eds), Social and Behavioral Research and the Internet: Advances in Applied Methods and Research Strategies (pp. 215–244). New York: Routledge. Palmer, S. E. (1999). Vision Science: Photons to Phenomenology. London: Bradford Books. Presser, S., Rothgeb, J. M., Couper, M. P., Lessler, J. T., Martin, E., Martin, J., and Singer, E. (2004). Methods for Testing and Evaluating Survey Questionnaires. Hoboken, NJ: John Wiley and Sons. Redline, C. (2011). Clarifying survey questions. Unpublished dissertation, The Joint Program in Survey Methodology, University of Maryland, College Park, MD. Redline, C. (2013). Clarifying categorical concepts in a web survey. Public Opinion Research, 77, 89–105. Salant, P., and Dillman, D. A., (1994). How to Conduct Your Own Survey. Hoboken, NJ: John Wiley & Sons. Sangster, R. L. (1993). Question order effects: are they really less prevalent in mail surveys?
Unpublished doctoral dissertation, Washington State University, Pullman, WA. Saris, W. E., and Gallhofer, I. N. (2014). Design, Evaluation, and Analysis of Questionnaires for Survey Research. Hoboken, NJ: John Wiley & Sons. Saris, W. E., Revilla, M., Krosnick, J. A., and Schaeffer, E. M. (2010). Comparing questions with agree/disagree response options to questions with item-specific response options. Survey Research Methods, 4 (1), 61–79. Schaeffer, N. C., and Dykema, J. (2004). A multiple-method approach to improving the clarity of closely related concepts: distinguishing legal and physical custody of children. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 475– 502). Hoboken, NJ: John Wiley & Sons. Schaeffer, N. C., and Presser, S. (2003). The science of asking questions. Annual Review of Sociology, 29, 65–88. Schaeffer, N. C., and Thomson, E. (1992). The discovery of grounded uncertainty: developing standardized questions about strength of fertility motivation. Sociological Methodology, 22, 37–82. Schuman, H., and Presser, S. (1981). Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. New York: Academic Press. Schwarz, N. (1996). Cognition and Communication: Judgmental Biases, Research Methods, and the Logic of Conversation. Mahwah, NJ: Lawrence Erlbaum Associates. Schwarz, N., Knauper, B., Hippler, H. J., NoelleNeumann, E., and Clark, L. (1991a). Rating scales: numeric values may change the meaning of scale labels. Public Opinion Quarterly, 55 (4), 570–582. Schwarz, N., Strack, F., and Mai, H-P. (1991b). Assimilation and contrast effects in partwhole question sequences: a conversational logic analysis. Public Opinion Quarterly, 55, 3–23. Smyth, J. D., Dillman, D. A., and Christian, L. M. (2007). Improving response quality in liststyle open-ended questions in web and telephone surveys. Presented at the American Association for Public Opinion Research. Anaheim, CA, May 17–20, 2007.
Smyth, J. D., Dillman, D. A., and Christian, L. M. (2014) Understanding visual design for questions and questionnaires. [Video Presentation] http://bcs.wiley.com/he-bcs/Books? action=index&itemId=1118456149& bcsId=9087 [accessed on 14 June 2016]. Smyth, J. D., Dillman, D. A., Christian, L. M., and McBride, M. (2009). Open-ended questions in web surveys: can increasing the size of answer spaces and providing extra verbal instructions improve response quality? Public Opinion Quarterly, 73, 325–337. Smyth, J. D., Dillman, D. A., Christian, L. M., and Stern, M. J. (2006). Effects of using visual design principles to group response options in web surveys. International Journal of Internet Science, 1(1), 6–16. Snijkers, G., Haraldsen, G., Jones, J., and Willimack, D. K. (2013). Designing and Conducting Business Surveys. Hoboken, NJ: John Wiley & Sons. Todorov, A. (2000). The accessibility and applicability of knowledge: predicting context effects in national surveys. Public Opinion Quarterly, 64 (4), 429–451. Tourangeau, R., and Smith, T. W. (1996). Asking sensitive questions: the impact of data collection mode, question format, and question context. Public Opinion Quarterly, 60, 275–304. Tourangeau, R., and Yan, T. (2007). Sensitive questions in surveys. Psychological Bulletin. 133 (5), 859–883. Tourangeau, R., Conrad, F. G., and Couper, M. P. (2013). The Science of Web Surveys. Oxford: Oxford University Press. Tourangeau, R., Conrad, F., Couper, M., Redline, C., and Ye, C. (2009). The effects of providing examples: questions about frequencies and ethnicity background. Paper presented at the American Association of Public Opinion Research. Hollywood, FL, May 14–17, 2009. Tourangeau, R., Couper, M., and Conrad, F. (2004). Spacing, position, and order: interpretive heuristics for visual features of survey questions. Public Opinion Quarterly, 68, 368–393. Tourangeau, R., Rips, L. J., and Rasinski, K. (2000). The Psychology of Survey Response. New York: Cambridge University Press.
Trappmann, M., Krumpal, I., Kirchner, A., and Jann, B. (2013). Item sum: a new technique for asking quantitative sensitive questions. Journal of Survey Statistics and Methodology, 2 (1), 58–77. Ware, C. (2004). Information Visualization: Perception for Design. San Francisco: Morgan Kaufmann. Warner, S. L. (1965). Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60, 63–69. Willis, G. B. (2004). Cognitive Interviewing: A Tool for Improving Questionnaire Design. Thousand Oaks, CA: Sage.
Willis, G. B. (2015). Analysis of the Cognitive Interview in Questionnaire Design. New York: Oxford University Press. Witt-Swanson, L. (2013). Design decisions: can one reduce measurement error on paper/ pencil surveys by using boxes or ovals? Paper presented at the International Field Directors & Technologies Conference. Providence, RI, May 19–22, 2013. Yeager, D. S., and Krosnick, J. A. (2012). Does mentioning ‘some people’ and ‘other people’ in an opinion question improve measurement quality? Public Opinion Quarterly, 76 (1), 131–141.
17 Creating a Good Question: How to Use Cumulative Experience Melanie Revilla, Diana Zavala and Willem Saris
INTRODUCTION Creating a good question for survey research is a difficult task given the many decisions that are involved. The fundamental part of a question is what we will call the request for an answer. This part always has to be available. However, besides the request for an answer, many different components can be added such as an introduction, a motivation statement, extra information regarding the content or definitions, instructions to the respondents or the interviewer, a show card and finally answer categories. All these components can be formulated in many different ways. The effects that the wording of survey questions can have on the responses have been studied in the tradition of survey research, for example by Sudman and Bradburn (1983), Schuman and Presser (1981), Andrews (1984), Alwin and Krosnick (1991), Költringer (1993), Scherpenzeel and Saris (1997), and Saris and Gallhofer
(2007). By contrast, little attention has been given to the problem of translating the concepts one wants to measure into the basic component of a question, the request for an answer (De Groot and Medendorp, 1986; Hox, 1997). However, if this step goes wrong, little improvement can be made afterwards. Therefore, we first discuss how to design theoretically valid requests for an answer. This can be achieved following the three-step procedure proposed by Saris and Gallhofer (2014):
1 Specification of the concept-by-postulation (complex concepts like attitudes) in concepts-by-intuition (simpler concepts for which questions can be directly formulated);
2 Transformation of concepts-by-intuition into assertions indicating the requested concepts;
3 Transformation of assertions into requests for an answer.
The three-step procedure, if properly applied, should lead to a measurement
instrument that measures what is supposed to be measured and nothing else. However, there are many more decisions to be taken while designing questions that can affect the results. There are many possible formulations of questions that are theoretically valid for a given concept. But not all of them are equally good. Therefore, it makes sense to study the quality of the question. This quality is defined as the strength of the relationship between the latent variable of interest (the concept-by-intuition) and the observed answers to a specific formulation of the question. A quality of 1 would mean that there are no measurement errors at all: the question would perfectly measure the concept-by-intuition. Nevertheless, in practice, the quality is always lower than one. Indeed, the quality of a question is determined by the amount of random and systematic errors, which are estimated, respectively, as 1 minus the reliability of the question and as the variance in the observed responses explained by the method used (Saris and Andrews, 1991). Over the last 25 years, a lot of experiments have been done to estimate the random and systematic errors for many questions (Andrews, 1984; Költringer, 1993; Scherpenzeel and Saris, 1997; Saris et al., 2011; Saris and Gallhofer, 2014). This cumulative knowledge has led to the development of the program SQP 2 that can be used to predict the quality of questions on the basis of the characteristics of these questions. In this chapter, we start by discussing the design of theoretically valid questions. Then, we give an overview of the other decisions that have to be taken in order to get the final formulation of survey questions and we show how the SQP program, based on a meta-analysis of thousands of quality estimates, can help researchers in evaluating whether their questions are good enough. SQP also provides suggestions for improvement. We propose a concrete illustration of how to code a question in SQP in order to check its quality and improve it. Finally, we explain how SQP can also be used in cross-national
research to create good questions over different languages.
HOW TO DESIGN THEORETICALLY VALID REQUESTS FOR AN ANSWER In the first step, we should determine if we consider that the concept of interest is a complex concept that requires more than one indicator to be measured, or if it is a simple concept about which we can ask directly. If our concept is a concept-by-postulation, then it has to be operationalized in terms of several simpler concepts (concepts-by-intuition), using reflective or formative indicators (for more details, we refer to Saris and Gallhofer, 2014). In this chapter, we concentrate on the second and third steps of the three-step procedure, once we have determined the different concepts-by-intuition needed. The transformation of concepts-by-intuition into questions is not so simple if we want to provide a procedure that leads to a question representing the concept-by-intuition with a very high probability. One of the reasons for this is that so many different concepts exist that one cannot make rules for each of them. Another reason is that there are many possible ways of formulating questions. We try to simplify the task by classifying the different concepts used in social sciences into general classes of basic concepts. For these basic concepts, we can formulate valid questions. First, assertions can be formulated that represent them quite certainly. After that, these assertions can be transformed into questions.
Basic Concepts and Concepts-by-Intuition There is a nearly endless list of possible concepts in social sciences. We cannot specify how to formulate questions for each of them. However, if we can reduce the number of concepts by classifying them into a limited
number of classes of basic concepts, then this problem may be solved. Table 17.1 illustrates that by presenting a list of concepts-by-intuition measured in round 1 of the European Social Survey (ESS) and indicating the classes of basic concepts to which they belong.

Table 17.1 The classification of concepts-by-intuition from the ESS into classes of basic concepts of the social sciences

Question ID | ESS concepts | Basic concept
B33 | Evaluation of services | Evaluation
C7 | Health | Evaluation
A8 | Social trust | Feeling
B7 | Political trust | Feeling
B27 | Satisfaction with … | Feeling
C1 | Happiness | Feeling
B1 | Political interest | Importance
S | Value benevolence | Importance
B2 | Political efficacy | Judgment
B28 | Left/right placement | Judgment
C5 | Victimization of crimes | Evaluative belief
C16 | Discrimination | Evaluative belief
B44 | Income equality | Policy
B46 | Freedom of lifestyle | Policy
C9 | Religious identity | Similarity or association
C10 | Religious affiliation | Similarity or association
C14 | Church attendance | Behavior
C15 | Praying | Behavior
B13-B19 | Political action | Behavior
A1 | Media use | Frequency or amount
C2 | Social contacts | Frequency or amount
C20 | Country of origin | Demographic
F1 | Household composition | Demographic
F2 | Age | Demographic

Table 17.1 does not give examples of all possible basic concepts. Instead, it shows that there are many different concepts-by-intuition that can be seen as specific cases of basic concepts. For example, evaluation of services and evaluation of one's own health have in common that both are evaluations (good or bad) while the subject in this case
is very different. If we know how to formulate sentences that express an evaluation, we can apply this rule to both concepts to formulate questions that measure what we want to measure. Other examples can be interpreted in the same way. It illustrates that even if the number of concepts-by-intuition is nearly unlimited, they can be classified in a limited number of basic concepts. For each basic concept, an assertion can be formulated.
The Basic Elements of Assertions In linguistics, a simple assertion can be decomposed into main components. A first relevant basic structure for assertions expressing evaluations, feelings, importance, demographic variables, values, and cognitive judgments is Structure 1. Structure 1: subject + predicator + subject complement.
For example: Q1a – Clinton was a good president. subject + predicator + subject complement.
‘Clinton’ functions as the subject that indicates what is being discussed in the sentence. The predicator or verb ‘was’ connects the subject with the remaining part of the sentence, which expresses what the subject is and is therefore called a subject complement. It contains a noun (‘president’) with an adjective (‘good’) as modifier of the noun. Predicators that indicate what a subject is/ was or becomes/became are called link verbs. A second relevant linguistic structure used to formulate e.g. relations, preferences, duties, rights, actions, expectations, feelings, and behavior, is Structure 2. Structure 2: subject + predicator + direct object.
It is illustrated in example Q1b: Q1b – The president likes his job. Subject + predicator + direct object.
This example has a subject (‘the president’), the predicator ‘likes’, and a direct object ‘his job’. Koning and Van der Voort (1997: 52) define a direct object as the person, thing, or animal that is ‘affected’ by the action or state expressed by the predicator. By changing the predicator in this structure, we change the concept-by-intuition the assertion refers to. There is a third linguistic structure relevant to present behaviors, behavioral intentions, and past and future events. Structure 3: subject + predicator
For example: Q1c – The position of the president has changed. subject + predicator.
Example Q1c has a subject (‘the position of the president’) and a predicator (‘has changed’). In linguistics, these verbs which are not followed by a direct object are called intransitive. The meaning of the sentences is easily changed by modifying the predicator as in structure 2. However, the number of possibilities is much more limited because of the reduced number of intransitive verbs. The basic components of these three possible linguistic structures of assertions can be extended with other components (Saris and Gallhofer, 2014). These structures are not only used in English but in most European languages. We will treat further the topic of cross-national questions in the final section.
Basic Concepts-by-Intuition Now, we will describe how assertions that are characteristic of the basic concepts-by-intuition employed in survey research can be generated. For more details about the different points discussed in this section, we refer to Saris and Gallhofer (2014).
We distinguish between subjective and objective variables. By subjective variables, we understand variables for which the information can only be obtained from a respondent because the information exists only in his/her mind. The following basic concepts-by-intuition are discussed: evaluations, importance judgments, feelings, cognitive judgments, perceived relationships between the X and Y variables, evaluative beliefs, preferences, norms, policies, rights, action tendencies, and expectations of future events. By objective variables, we mean non-attitudinal variables, for which, in principle, information can also be obtained from a source other than the respondent.1 Commonly these variables concern factual information such as behaviors, events, time, places, quantities, procedures, demographic variables, and knowledge. Table 17.2 presents the assertions for the different basic subjective and objective variables. Imagine that we want to know the degree of importance that the respondents place on the value 'honesty'. Then, Table 17.2 indicates that the structure we should use is structure 1 (vIi), where 'v' refers to the value of interest (honesty), 'I' is the link verb, and 'i' refers to the basic concept of importance. Following this structure, we can formulate several assertions: Q2.1a – Honesty is very important. Q2.1b – Honesty is important. Q2.1c – Honesty is unimportant. Q2.1d – Honesty is very unimportant.
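The step from structure code to concrete wording is mechanical enough that it can be sketched in a few lines of code. The following minimal sketch (not part of the original chapter; all names and wordings are illustrative) expands structure 1 for the importance of a value (vIi) into graded assertions such as Q2.1a–Q2.1d.

```python
# Minimal sketch (illustrative names): expanding structure 1 for the
# importance of a value (v I i) into graded assertions such as Q2.1a-Q2.1d.
def importance_assertions(value_subject):
    link_verb = "is"                                   # 'I': the link verb
    gradations = ["very important", "important",
                  "unimportant", "very unimportant"]   # 'i' with gradation
    return [f"{value_subject} {link_verb} {g}." for g in gradations]

print(importance_assertions("Honesty"))
# ['Honesty is very important.', 'Honesty is important.',
#  'Honesty is unimportant.', 'Honesty is very unimportant.']
```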
For some concepts, different structures can be used to formulate the assertions. Until now, we have focused on the basic structure of assertions. However, in reality, assertions have many variations. They are expressed in sentences much longer than those studied so far. Often indirect objects, modifiers, or adverbials are added to the simple sentences. We have indicated how the most commonly applied basic concepts-by-intuition in survey
Table 17.2 The basic structures of assertions

Basic concepts | Structure 1 (xIsc) | Structure 2 (xPy) | Structure 3 (xP)
Subjective variables
Evaluation (e) | xIe | – | –
Importance (i) | xIi | – | –
Values (v) | vIi | – | –
Feelings (f) | xIf | xFy or xPf | –
Cognitive judgment (c) | xIc | – | –
Causal relationship (ca) | xIca | xCy | –
Similarity relationship (s) | xIs | xSy | –
Preference (pr) | xIpr | xPRy (z…) | –
Norms | – | oH(+I)y | oH(+I)
Policies | – | gH(+I)y | gH(+I)
Rights (ri) | xIri | xHRy | –
Action tendencies | – | rFDy | rFD
Expectations of future events | – | xFDy | xFD
Evaluative belief | – | xPey or xPye | xPe
Objective variables
Behavior | – | rDy | rD
Events | – | xDy | xD
Demographics (d) | xId | – | –
Knowledge | xIsc | xPy | xP
Time | – | – | xDti
Place | – | – | xDpl
Quantities | – | xDqu | –
Procedures | – | – | xDpl, pro
Notations: x denotes the grammatical subject; P, the predicator; I the link verb. Frequently occurring subjects are the government (g); anyone (o); and the respondent him/herself (r). Frequently employed lexical verbs for predicators are: C, which indicates relationships where the subject causes the object; D, which indicates deeds; E, which indicates expectations; F, which specifies feelings; FD, which indicates a predicator referring to future deeds; H(+I), which specifies a predicator which contains words like ‘has to’ or ‘should’ followed by an infinitive; HR, which specifies predicators like ‘has the right to’; J, which specifies a judgment; PR, which indicates predicators referring to preferences; S, which indicates relationships of similarity or difference between the subject and the object.
research can be expressed in assertions specifying these concepts. These rules are summarized in Table 17.2. This table can be used to specify an assertion for a certain type of concept according to the criteria specified there. For example, if we want to specify an evaluation about immigrants, the structure of the sentence recommended is (xIe). Therefore, we can formulate a statement such as ‘immigrants are good people’. If we want a feeling (xIf), we can write ‘immigrants are in general
friendly’. If we want a cognitive judgment (xIc), the statement can be: ‘immigrants are hard-working’. If we want to formulate a cognition concerning the reasons why immigrants come here, the structure is (xRy), and a possible assertion would be ‘Problems in their own country cause emigration to Europe’. In the same way, assertions can be formulated for any other concept. Now that standard assertions have been specified for the basic concepts of the social
sciences, the task of the researcher is to determine what type of basic concept his/her specific concept-by-intuition is. If that is done, sentences that represent this concept with very high probability can be formulated.
From Assertion to Request for an Answer The term 'request for an answer' is employed, because the social science research practice and the linguistic literature (Harris, 1978; Givon, 1984; Weber, 1993; Graesser et al., 1994; Huddleston, 1994; Ginzburg, 1996; Groenendijk and Stokhof, 1997; Tourangeau et al., 2000) indicate that requests for an answer are formulated not only as questions (interrogative form) but also as orders or instructions (imperative form), as well as assertions (declarative form) that require an answer. Even in the case where no request is made and an instruction is given or a statement is made, the text implies that the respondent is expected to give an answer. Thus, the common feature of the formulation is that an answer is expected. If an assertion is specified for a concept, a simple way to transform it into a request for an answer is to add a pre-request in front of the assertion. This procedure can be applied to any concept and assertion. Using Q2.1a–Q2.1d, to make a request from these assertions, pre-requests can be added in front of them, for example: Q2.2a – Do you think that honesty is very important? Q2.2b – Do you think that honesty is important? Q2.2c – Do you think that honesty is unimportant? Q2.2d – Do you think that honesty is very unimportant?
Using such a pre-request followed by the conjunction ‘that’ and the original assertion creates a request called an indirect request. The choice of one of these possible requests for a questionnaire seems rather arbitrary as this specific choice of the request can lead
the respondent in that direction. Therefore, a more balanced approach can be used. Q2.2e – Do you think that honesty is very important, important, unimportant or very unimportant?
In order to avoid such a sentence with too many adjectives, it is advisable to substitute them with a so-called 'WH word' like 'how': Q2.2f – How important, do you think, is honesty?
This is also an indirect request with a pre-request and a sub-clause that starts with a WH word and allows for all the assertions Q2.1a–Q2.1d as an answer and other variations thereof. Instead of indirect requests, direct requests can also be used. The most common form is an interrogative sentence. In this case the request can be created from an assertion by the inversion of the (auxiliary) verb with the subject component. The construction of direct requests by the inversion of the verb and subject component is quite common in many languages but also other forms can be used. Let us illustrate this by the same example using only two of the four assertions mentioned: Q2.1b – Honesty is important. Q2.1c – Honesty is unimportant.
One can transform these assertions into direct requests by inverting the auxiliary verb and the subject: Q2.3b – Is honesty important? Q2.3c – Is honesty unimportant?
Here, the requests can be seen as ‘leading’ or ‘unbalanced’ because they have only one possible answer option. It could be expected that a high percentage of respondents would choose this option for this reason. Therefore, the requests can be reformulated as follows: Q2.3e – Is honesty important or unimportant?
Thus, two basic choices have to be made for formulating a request for an answer: the use
of direct or indirect requests and whether to use WH words. The combination of these two choices leads to four different types of requests. Besides the interrogative form, the second possible grammatical form of a request for an answer is the imperative form. In its basic form the request consists only of an instruction to the respondent, as for example: Q2.4 Indicate how important honesty is for you:
Example Q2.4, colloquially known as an instruction or in grammatical terms as an 'imperative', is another example of a direct request for an answer. The third grammatical form, a declarative request, is only possible as an indirect request. Illustrations are examples Q2.5 and Q2.6. Both examples have a declarative pre-request, followed by a WH word and an embedded interrogative query: Q2.5 I would like to ask you how important honesty is for you. Q2.6 Next we ask you how important honesty is for you.
Although these are statements from a grammatical perspective, it is commonly understood that an answer to the embedded interrogative part of the sentence is required. To conclude, different assertions and even more different requests for an answer can be formulated to measure concepts like ‘the importance of the value of honesty’, as Figure 17.1 summarizes. However, it is important to note that whatever the request form used, all these requests measure what they are supposed to. Therefore, there is no real difficulty with making an appropriate request for a concept if the assertions represent the concept of interest well. For more details, we refer to Saris and Gallhofer (2014).
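As a hedged illustration of the two choices just described (direct versus indirect request, with or without a WH word), the following sketch generates the four resulting request types for the assertion 'honesty is important'; the exact wordings are illustrative, not prescribed by the chapter.

```python
# Minimal sketch (illustrative wording): the four request types obtained by
# crossing direct/indirect requests with the use of a WH word, applied to
# the assertion 'honesty is important'.
def requests_for_answer(subject, adjective):
    return {
        "indirect, no WH word": f"Do you think that {subject} is {adjective}?",
        "indirect, WH word": f"How {adjective}, do you think, is {subject}?",
        # the 'un-' prefix only works for this particular example adjective
        "direct, no WH word": f"Is {subject} {adjective} or un{adjective}?",
        "direct, WH word": f"How {adjective} is {subject}?",
    }

for form, text in requests_for_answer("honesty", "important").items():
    print(f"{form}: {text}")
```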
HOW TO DEAL WITH ALL THE OTHER DECISIONS? Many More Decisions to Complete the Questions So far we have discussed the most fundamental decisions that have to be made in order to
Figure 17.1 The different steps applied to the importance of the value honesty.
create a valid request for an answer, i.e. in order to be as sure as possible that the request formulated measures the concept-by-intuition that it is supposed to measure and nothing else. There are, however, many more decisions to be made to create a complete survey question, about the introduction of other components of a question and the way they can be formulated. All these decisions lead to different forms of the final question and potentially to different results. Table 17.3 provides an overview of the characteristics of the questions that have to be taken into account. The table shows that a considerable number of decisions are made in the formulation of one question. Continuing with the example of the importance of the value honesty, we can formulate quite different complete questions, even if we start from the same assertion 'honesty is important':
Q3a – To what extent do you think that honesty is important? Not important 1 2 3 4 5 Very important; Don't Know
Q3b – Now we want to ask you questions about different values. Do you agree or disagree with the following statement: 'honesty is important'. Please use this card to answer. 1 Agree strongly 2 Agree 3 Neither agree nor disagree 4 Disagree 5 Disagree strongly 6 Don't Know
Both questions differ at many levels: the presence of an introduction, of an instruction, the formulation of the request, scale characteristics, etc. Because of the huge number of decisions and the potential interactions between them, one cannot easily evaluate the consequences of all these decisions on the quality of the question. In order to evaluate how good a question is, we need a more suitable tool: the program Survey Quality Predictor (Saris et al., 2011; Saris and Gallhofer, 2014), available for free at http://sqp.upf.edu/.
The Program SQP 2 What is Behind SQP? On Which Knowledge is the Program Based? One of the most common procedures to evaluate the quality of questions is the Multitrait-Multimethod (MTMM) approach. Indeed, Lance et al. (2009), searching a citation database of the seminal article by Campbell and Fiske (1959), found up to 4,338 citations spanning several disciplines with special emphasis on psychology. They conducted a literature review of applied research that used the MTMM approach, which included, among others, personality traits, labor studies, organizational research, mental and social health, and social psychology. Campbell and Fiske (1959) proposed to repeat several 'traits' (i.e. questions) using different 'methods' (in our case it will be formulations using different characteristics) in order to study the convergent and discriminant validity. Later, the approach was developed and formalized by using Confirmatory Factor Analysis models to analyze the MTMM correlation matrices (Werts and Linn, 1970; Jöreskog, 1970, 1971; Althauser et al., 1971; Andrews, 1984). Alternative models have been proposed and compared, with the conclusion (Corten et al., 2002; Saris and Aalberts, 2003) that the model of Alwin (1974) and the one of Saris and Andrews (1991) are the two that best fit several data sets. SQP is based on thousands of quality estimates of questions obtained in more than 25 European countries and languages by MTMM analyses using the True Score model by Saris and Andrews (1991). This True Score model allows decomposing the quality into validity and reliability. Most of these MTMM experiments have been done in the ESS. In each ESS round, four to six MTMM experiments are included in almost all the participating countries. In each experiment, three traits are measured using three or four different methods. All respondents get a common method in the main ESS questionnaire
Table 17.3 The characteristics of the questions to be taken into account

Group | Specific characteristic
The trait | Domain; Concept
Associated to the trait | Social desirability; Centrality of the topic; Time specification
Formulation of the request for an answer | Trait requested indirectly, directly or no request, and presence of stimulus (battery); WH word and what type of WH word; Type of the request (interrogative, imperative question-instruction, declarative or none (batteries)); Gradation; Balance of request or not; Encouragement to answer; Emphasis on subjective opinion; Information about the opinion of other people; Absolute or comparative judgment
Characteristics of the response scale | Categories, yes/no answer scale, frequencies, magnitude estimation, line production and more steps/procedures; Amount or number of categories; Full or partial labels; Labels with long or short text; Order of labels; Correspondence between labels and numbers; Theoretical range of scales (bipolar or unipolar); Range of scales used; Fixed reference points; Don't know option
Instructions | Respondent instructions; Interviewer instructions
Additional information about the topic | Additional definitions, information or motivation
Introduction | Introduction and whether the request is in the introduction
Linguistic complexity | Number of sentences; Number of subordinated clauses; Number of words; Number of nouns; Number of abstract nouns; Number of syllables
Method of data collection | –
Language of the survey | –
Characteristics of the show cards | Categories in horizontal or vertical layout; Text clearly connected to categories or overlapping; Numbers or letters shown before answer categories; Numbers in boxes; Start of the response sentence shown on the show card; Question on the show card; Picture provided
and a repetition with a different method at the end in a supplementary questionnaire (split-ballot MTMM design as proposed by Saris et al., 2004). More than 20 minutes of
similar questions separate the first question from its repetition in order to avoid potential memory effects (van Meurs and Saris, 1990). Reliability, validity, and quality have been
estimated for all these traits, methods, and countries. However, even if thousands of quality estimates are available (they can be consulted in the SQP program), there are still many more questions for which MTMM experiments have not been implemented. It is not possible to repeat all the questions of every questionnaire twice. The costs and cognitive burden would be too high. Only a small subset of questions can be repeated each time that a survey is done. Therefore, the MTMM model can be used to estimate the quality of only a limited number of questions. What about the quality of all the other questions that were not part of an MTMM experiment? What about new questions? In order to get information about the quality of all possible questions, Saris and Gallhofer (2007) proposed to do a meta-analysis over the thousands of quality estimates available from all the MTMM experiments done in the past. The idea is to explain the quality estimates as a function of all the characteristics presented in Table 17.3. This was originally done using a regression model (Saris and Gallhofer, 2007), but later the Random Forest approach (Breiman, 2001) was applied to the existing data for prediction (Oberski et al., 2011; Saris and Gallhofer, 2014). Once the impact of these characteristics on the reliability and validity is known and the prediction is good enough (R² around 0.8), this information can be used to produce a prediction of the quality of new questions. This is what the program SQP does in a user-friendly way as illustrated in the next section.
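A minimal sketch of this kind of meta-analysis prediction is given below, assuming the MTMM quality estimates and the coded question characteristics are available in a flat table; it is not the actual SQP implementation, and the file and column names are hypothetical.

```python
# Minimal sketch of the meta-analysis idea behind SQP: predict MTMM quality
# estimates from coded question characteristics with a random forest.
# NOT the SQP implementation; file and column names are hypothetical, and
# the characteristics are assumed to be numerically coded.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

mtmm = pd.read_csv("mtmm_quality_estimates.csv")   # one row per quality estimate
X = mtmm.drop(columns=["quality"])                 # coded characteristics (cf. Table 17.3)
y = mtmm["quality"]                                # MTMM quality estimates (0-1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)
print("R^2 on held-out questions:", forest.score(X_test, y_test))

# Quality prediction for a new, not-yet-fielded question, given its codes
print("Predicted quality:", forest.predict(X_test.iloc[[0]])[0])
```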
What Can You Do With SQP? The program SQP makes predictions of the quality of questions on the basis of information about the choices that have been made with respect to the question characteristics mentioned in Table 17.3. If you have to create new questions, in order to evaluate how good the questions you proposed are and how you could
improve them, you can go to the SQP program. There, you have to code the questions within the program with respect to all the characteristics mentioned in Table 17.3. When you finish, the program provides a prediction of the reliability, validity,2 and quality of the questions, together with confidence intervals for the predictions. Besides, it provides suggestions about how changing one or another of the characteristics would affect the quality. In that way, questions can be improved. Each time that a user codes a question and obtains a quality prediction, this information is stored and becomes accessible to other users. Therefore, SQP will provide a growing database of survey questions with quality information over the years. If one does not trust the prediction by another user, one can recode the question to get a new prediction. But if one trusts the prediction, he/she could use it directly. The predictions of reliability, validity, and quality can then be of help to improve the questions before going to the field, but also to control the comparability of questions translated into different languages in a cross-national or cross-cultural setting as will be discussed in the final section. Finally, they can be used to correct for measurement errors in studies of relationships between variables (Saris and Revilla, 2015; De Castellarnau and Saris, 2014).
Limits of the Scope of the Program SQP 2 covers a large number of countries (more than 25). However, the program is mainly based on data from European countries, collected by face-to-face or self-completion modes, and on attitudinal questions. Thus, it is more appropriate to use the SQP predictions when one is interested in these countries or in culturally similar countries (SQP may not give a good prediction for Asian countries), in these modes of data collection, and in these kinds of questions. Besides, the program considers only some aspects of the context (level of social desirability and
sensitivity), so if one is interested in very specific contexts, this might not be taken into account in the prediction. Nevertheless, we should note that the context usually affects the answers themselves more than the quality of these answers.
GETTING QUALITY PREDICTIONS OF YOUR QUESTIONS USING SQP: AN ILLUSTRATION At the end of the three-step procedure, we had formulated theoretically valid requests for answers. Then, we had to make decisions about how to complete these requests in order to get complete questions. Using the example of the importance of the value honesty, we proposed two forms of complete questions that should be theoretically valid. However, in order to know if they are good questions, we need to estimate their quality. In this section, we illustrate how this can be done using the program SQP to predict their quality. Both Q3a and Q3b are measuring the same concept-by-intuition. Table 17.4 repeats the two forms in a way that underlines the main differences across them: for instance, Q3b has an introduction but Q3a does not, Q3b has an instruction for the respondent whereas Q3a does not, Q3a has a horizontal scale whereas Q3b has a vertical one, the scale in Q3a has only the end points labeled with text whereas in Q3b it is fully
labeled. Besides, in Q3a the scale is specific to the item measured whereas in Q3b it is an Agree-Disagree scale. Overall, these two formulations differ on many characteristics that may affect the way respondents answer the question and therefore the quality. SQP asks the coder for information about all the characteristics listed in Table 17.3. Appendix 1 provides the code for the two questions. Once all the characteristics are coded, SQP shows the quality predictions and information about the uncertainty of the predictions (standard errors and interquartile ranges). English is used for this example but the program allows users to get quality predictions of survey questions in more than 20 languages. Table 17.5 gives the prediction for Q3a and Q3b. This table shows that even though both questions vary on many characteristics, the overall quality is very similar. It is slightly higher in Q3a but if we consider the interquartile ranges, we see that the uncertainty about the prediction estimates is much larger than the observed difference between Q3a and Q3b. In this illustration, both questions are therefore similarly good. However, it also shows that questions Q3a and Q3b are not perfect: the measurement errors are still quite large (quality much lower than 1). Therefore, we can try to improve them. For that, SQP also provides suggestions for potential improvements to the questions that will increase the measurement quality.
Table 17.4 Two survey questions for a concept-by-intuition

Question | Introduction | Request for answer | Answer options
Q3a | – | To what extent do you think that honesty is important? | Not important 1 2 3 4 5 Very important; Don't Know
Q3b | Now we want to ask you questions about different values | Do you agree or disagree with the following statement: 'honesty is important'? Please use this card to answer | 1 Agree strongly; 2 Agree; 3 Neither agree nor disagree; 4 Disagree; 5 Disagree strongly; Don't Know
Table 17.5 Quality predictions in SQP

 | Q3a prediction | Q3a (IR) | Q3b prediction | Q3b (IR)
Reliability coefficient (r) | 0.805 | (0.734, 0.866) | 0.798 | (0.718, 0.850)
Validity coefficient (v) | 0.964 | (0.897, 0.990) | 0.962 | (0.906, 0.989)
Quality coefficient (q) | 0.776 | (0.656, 0.817) | 0.767 | (0.647, 0.808)

Total quality | Q3a | Q3b
Reliability (r²) | 0.649 | 0.636
Validity (v²) | 0.928 | 0.925
Quality (q²) | 0.602 | 0.588
Note: IR: interquartile range
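As a small worked check of Table 17.5, under the true score decomposition used here the quality coefficient is the product of the reliability and validity coefficients (q = r × v), and the reported reliability, validity, and total quality are their squares; small discrepancies with the table come from rounding.

```python
# Worked check of Table 17.5 for Q3a, assuming q = r * v in the true score model.
r, v = 0.805, 0.964                      # reliability and validity coefficients
q = r * v
print(round(q, 3))                       # 0.776 -> quality coefficient
print(round(r**2, 3), round(v**2, 3), round(q**2, 3))
# 0.648 0.929 0.602 -> reliability, validity and total quality
# (the table reports 0.649 and 0.928; the small difference is due to rounding)
```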
In SQP, users get suggestions for the 20 characteristics that have had the largest impact on the quality predictions over all questions in the database. It is also possible to ask the program to evaluate not only these first 20 variables but all the 60 characteristics that are included in the model to predict measurement quality. If we concentrate on suggestions for Q3a, the variables that potentially lead to the largest change in quality are the number of categories (from 7 categories onward, the increase in quality is 0.049) and the number of fixed reference points. The program also suggests not showing the 'Don't know' option explicitly. Following these suggestions, we propose an improved question, presented in Table 17.6. However, one should realize that each suggestion assumes that all the other characteristics remain the same. In general this is not the case. Changing characteristics may affect the linguistic complexity of the items and this will have an effect on the quality prediction. For instance, by introducing here fixed reference points using 'extremely', the number of syllables for the labels changes, and this increase in syllables can have a negative effect on the quality. In order to check whether introducing one suggested improvement really increases the quality of the new question,
Table 17.7 Quality predictions for Q3a and Q3a-bis

 | Q3a | Q3a-bis
Reliability (r²) | 0.649 | 0.707
Validity (v²) | 0.928 | 0.932
Quality (q²) | 0.602 | 0.659
one has to reformulate the question and code the new question in SQP in the same way as before. The results are shown in Table 17.7. The quality of Q3a-bis is higher than that of Q3a by 0.057. The improvement mainly comes from an increase in the reliability. This shows that the suggestions proposed by the program for the response scale helped to formulate a better question. However, we still did not obtain a very high quality. Therefore, correction for measurement errors after the data collection will be crucial to draw correct conclusions about standardized relationships across variables.
GOOD QUESTIONS FOR CROSS-NATIONAL RESEARCH In order to create a good question, we started by formulating theoretically valid requests
Table 17.6 An improved question for the same concept-by-intuition

Question | Introduction | Request for answer | Response scale
Q3a-bis | – | To what extent do you think that honesty is important? | Not at all important 0 1 2 3 4 5 6 7 8 9 10 Extremely important
for an answer. Then, we made complete questions based on these simple requests for an answer and used SQP to evaluate how good different complete questions were. We got to the point where we have a theoretically valid question with acceptable quality and we have improved it as much as we could using SQP. But in the context of cross-national research, a good question also needs to be comparable across countries. Therefore, there is one more level to consider in this case. This is the topic of this final section. Cross-national research requires that the questions used are functionally equivalent across countries, meaning that the message embedded in a text is received by the receptor in the same way as it would be received in the source language (Nida, 1964). Survey questions are not comparable with respect to quality if formal characteristics of the questions are different. Functional equivalence in cross-cultural survey research is only confirmed by formally testing invariance. Several studies have identified translation deviations as a source of non-equivalence in assessments of survey data (Hambleton et al., 2005; Harkness et al., 2010a; Mallinckrodt and Wang, 2004; Oberski et al., 2007; Saris and Gallhofer, 2007; Van de Vijver and Leung, 1997). Unfortunately, non-invariance was detected only once the data had been collected and survey organizations had already spent a lot of resources on data collection. Empirical methods are mostly used for detecting flaws once data is already collected (Horn and McArdle, 1992; Meredith, 1993; Steenkamp and Baumgartner, 1998; Vandenberg and Lance, 2000; Braun and Johnson, 2010; Byrne and Van de Vijver, 2010; Saris and Gallhofer, 2014).
properties (Harkness, 2003; Harkness et al., 2003; Harkness et al., 2010b). In order to preserve the same meaning of concepts across languages, the state of the art suggests a multi-step committee approach to translate survey items (Harkness et al., 2003; Harkness et al., 2010b). However, it is difficult to verify systematically that, throughout a questionnaire, item characteristics are the same in different languages (such as domain, concept, wording, response scale, polarity, labeling, symmetry, balance of the request, introduction, instructions, and linguistic complexity). As these characteristics define the measurement properties of survey items, a way to prevent non-equivalence is to compare them systematically before data collection. As SQP asks users to code a large set of properties of a survey item, we propose a procedure aiming to detect deviations in translations by comparing the codes of a source questionnaire and target languages. This procedure is complementary to the translation committee approach (which focuses on semantic similarity). As one cannot be familiar with all languages participating in a cross-cultural project, the coding scheme in SQP allows trained individuals in survey research who are proficient in the respective languages to provide information about item characteristics. Once characteristics of the source and translated versions are coded, comparing the codes allows systematic detection of deviations across language versions. Characteristics of translated items can be compared using SQP in a five-step procedure. Step 1: Introducing questions in SQP
Each question in the source and target languages should be entered into the program SQP. This can be done by any user at no cost after signing up and logging in to the program at the sqp.upf.edu webpage. When coding, the program displays a help option on each screen, indicated by a yellow box, which
defines each item characteristic that is asked about and gives examples.
Step 2: Coding the source questionnaire
The information regarding the item characteristics of the source questionnaire must be accurate because the target versions will be compared against it. It should be coded independently by two individuals with deep knowledge of questionnaire design; differences should be reconciled in collaboration with a third individual who plays the role of a reviewer.
Step 3: Coding a target questionnaire
The translated questionnaire should be coded by a proficient speaker of the target language, preferably someone involved in the translation process.
Step 4: Comparison of measurement properties
The codes of the characteristics of the source items should be compared with those in the target language. Any difference should first be clarified with the coders in order to rule out coding errors in the target questionnaire. True differences in the codes should be reported to the translation team.
Step 5: Interpretation of deviations and actions taken in the target text
The translation team should clarify any difference in the codes in terms of the definition of the features. In other words, it should justify the reasons behind a deviation in the item characteristics. The differences may fall into one of the three categories shown in Table 17.8. For each category, an action is suggested for the translated text. This procedure was tested for the first time in ESS Round 5 and has become part of the specifications of the survey design from Round 6 onwards. This process has helped to prevent deviations such as: different formulations for a repetition, missing introductions,
extra explanations making the item more complex, missing definitions of the scale, deviations in the scale formulation, fixed reference points, labels about bipolar concepts defined as unipolar concepts, complete sentences instead of short texts for labels, among many others.
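Step 4 of this procedure, the comparison of the coded characteristics, can easily be automated once the codes have been exported from SQP. The following is a minimal sketch of such a comparison in Python, not SQP’s own interface: the export format, the characteristic names and the example codes are hypothetical and only illustrate the kind of deviations listed above.

def compare_codes(source, target):
    """Return (characteristic, source code, target code) for every
    characteristic that is coded differently in the two language versions."""
    deviations = []
    for characteristic, source_code in source.items():
        target_code = target.get(characteristic)
        if target_code != source_code:
            deviations.append((characteristic, source_code, target_code))
    return deviations

# Hypothetical codes for one item in the source and in a translated questionnaire.
source_item = {
    "Number of categories": 5,
    "Labels of categories": "Fully labeled",
    "Don't know option": "DK option present",
    "Respondent instruction": "Present",
}
target_item = {
    "Number of categories": 4,                # one response category dropped
    "Labels of categories": "Fully labeled",
    "Don't know option": "DK option absent",  # 'don't know' option left out
    "Respondent instruction": "Present",
}

for name, src, tgt in compare_codes(source_item, target_item):
    # As in Step 4, differences are first checked with the coders (to rule out
    # coding errors) and only then reported to the translation team.
    print(f"Deviation in '{name}': source = {src!r}, target = {tgt!r}")

Deviations of type (A) in Table 17.8, such as the dropped response category and the missing ‘don’t know’ option in this hypothetical example, would then lead to an amended translation.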
CONCLUSION To conclude, creating good questions requires a lot of effort. First, the researchers have to make sure that the questions really measure the concepts of interest, i.e. that the questions are theoretically valid. This can be done by following the three-step procedure proposed by Saris and Gallhofer (2014). In this way, requests for answers that measure the concepts of interest can be formulated for any concept. However, there are many more decisions to take in order to formulate a complete question: whether or not to use an introduction or an instruction, how to define the answer categories, etc. Each of these decisions can influence the quality of the question in different ways, and without an appropriate tool it is difficult to evaluate how good a complete question really is, given the choices made. Such an evaluation is nevertheless essential, before data collection in order to collect the highest quality data possible, and after data collection in order to have the information needed to correct for the remaining measurement errors. The tool we propose for evaluating the quality of survey questions is the program SQP. This program is based on a meta-analysis of thousands of quality estimates obtained by estimating MTMM models. It uses the cumulative information of past MTMM research to predict the quality of new questions on the basis of their characteristics. The program is available for free and gives researchers a powerful tool for evaluating how good their questions are. It also gives suggestions to improve the questions.
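As an illustration of how such quality estimates are used after data collection, the observed correlation between the answers to two questions (y1 and y2) can be corrected for measurement error to approximate the correlation between the underlying variables of interest (f1 and f2). The following is a sketch of the standard correction, in the spirit of the approach described by De Castellarnau and Saris (2014) rather than a reproduction of their formulas, where q1 and q2 denote the quality coefficients of the two questions and shared method variance is assumed to be negligible:

\[
  \rho(f_1, f_2) \approx \frac{r(y_1, y_2)}{q_1\, q_2}
\]

For example, an observed correlation of 0.30 between two questions with quality coefficients of 0.8 and 0.7 corresponds to a corrected correlation of roughly 0.30 / (0.8 × 0.7) ≈ 0.54.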
Table 17.8 Categories for differences in the SQP codes for two languages

(A) Type of deviation found (source vs. translation): A difference that cannot be warranted, for instance a different number of response categories, leaving out a ‘don’t know’ option and/or an instruction for the respondent.
    Action taken: The translation should be amended.

(B) Type of deviation found: A difference that may or may not be warranted, e.g. use of complete sentences in the scales instead of short texts. In some languages this is necessary, in others it may be a matter of stylistic choice.
    Action taken: Amendments in the translation are recommended to keep the principle of functional equivalence in translation if the language structure allows keeping the item characteristic the same.

(C) Type of deviation found: A difference in the linguistic characteristics that may be warranted, e.g. a different number of words or syllables. Also, a difference in the codes of linguistic characteristics that may not be warranted, e.g. a different number of sentences or nouns, or extreme deviations in the number of words.
    Action taken: Amendments in the translation are recommended to keep the principle of functional equivalence in translation if the language structure allows it. If the differences are unavoidable due to linguistic characteristics, no change is recommended.
Source: Zavala-Rojas, D. and Saris, W. (2014)
Finally, in the frame of cross-national research, a good question is also a question that can be used in different languages and give comparable answers, such that observed differences indicate true differences across countries and not differences in the questions. One way to facilitate measurement equivalence across languages is to make sure that no unnecessary changes are made during the translation process to characteristics that are not directly language related. For instance, if fixed reference points are used in one language, they should also be used in the other ones. The SQP program can be, and has been, used to compare the characteristics of a given question in different languages and to detect potential deviations introduced during the translation process. One limitation of this approach is that not all existing languages are covered so far. Still, the program currently covers around 20 languages, including almost all the European ones. Therefore, it is a powerful tool for creating good questions, but it should be used only once the theoretically valid requests for answers have been formulated.
Appendix 17.1 SQP prediction codes

Characteristic | Choice Q3a | Choice Q3b
Domain | Other beliefs | Other beliefs
Domain: other beliefs | Yourself | Yourself
Concept | Importance of something | Importance of something
Social Desirability | A lot | A lot
Centrality | Rather central | Rather central
Reference period | Present | Present
Formulation of the request for an answer: basic choice | Indirect requests | Indirect requests
WH word used in the request | WH word used | Request without WH word
‘WH’ word | How (extremity) | –
Request for an answer type | Interrogative | Interrogative
Use of gradation | Gradation used | Gradation used
Balance of the request | Balanced or not applicable | Balanced or not applicable
Presence of encouragement to answer | No particular encouragement present | No particular encouragement present
Emphasis on subjective opinion in request | Emphasis on opinion present | No emphasis on opinion present
Information about the opinion of other people | No information about opinions of others | No information about opinions of others
Use of stimulus or statement in the request | No stimulus or statement | Stimulus or statement is present
Absolute or comparative judgment | An absolute judgment | An absolute judgment
Response scale: basic choice | Categories | Categories
Number of categories | 5 | 5
Labels of categories | Partially labeled | Fully labeled
Labels with long or short text | Short text | Short text
Order of the labels | First label negative or not applicable | First label positive
Correspondence between labels and numbers of the scale | High correspondence | Low correspondence
Theoretical range of the scale bipolar/unipolar | Theoretically unipolar | Theoretically unipolar
Number of fixed reference points | 1 | 1
Don’t know option | DK option present | DK option present
Interviewer instruction | Absent | Absent
Respondent instruction | Absent | Present
Extra motivation, info or definition available? | Absent | Absent
Introduction available? | Not available | Available
Number of sentences in introduction | – | 1
Number of words in introduction | – | 10
Number of subordinated clauses in introduction | – | 0
Request present in the introduction | – | Request not present
Number of sentences in the request | 1 | 3
Number of words in request | 10 | 18
Total number of nouns in request for an answer | 2 | 3
Total number of abstract nouns in request for an answer | 2 | 1
Total number of syllables in request | 14 | 26
Number of subordinate clauses in request | 1 | 0
Number of syllables in answer scale | 8 | 13
Total number of nouns in answer scale | 0 | 0
Total number of abstract nouns in answer scale | 0 | 0
Show card used | Yes | Yes
Horizontal or vertical scale | Horizontal | Vertical
Overlap of text and categories? | Text clearly connected to category | Text clearly connected to category
Numbers or letters before the answer categories | Numbers | Numbers
Scale with numbers or numbers in boxes | Only numbers | Only numbers
Start of the response sentence on the showcard | No | No
Question on the showcard | No | No
Picture on the card provided? | No picture provided | No picture provided
Computer assisted | Yes | Yes
Interviewer | Yes | Yes
Visual presentation | Oral | Oral
Position | 15 | 15
NOTES 1 Although we refer to objective variables, that does not imply that they are exempt from measurement error. For instance, administrative records may have low quality as measurement instruments if their categories are poorly defined.
2 Note that in this case the validity is defined as 1 minus the method variance. We assume that the question measures what it is supposed to measure; here we only look at the invalidity due to the effect of people’s reactions to the chosen method on the observed responses.
RECOMMENDED READINGS For more detail on the issues considered in this chapter, we recommend: Andrews (1984), Saris et al. (2011), Saris and Gallhofer (2014), and Zavala-Rojas and Saris (2014).
REFERENCES Althauser, R.P., T.A. Heberlein, and R.A. Scott (1971). A causal assessment of validity: The augmented multitrait-multimethod matrix. In H.M. Blalock Jr. (ed.), Causal Models in the Social Sciences. Chicago, IL: Aldine, pp. 151–169. Alwin, D.F. (1974). An analytic comparison of four approaches to the interpretation of relationships in the multitrait-multimethod matrix. In H.L. Costner (ed.), Sociological Methodology. San Francisco, CA: JosseyBass, pp. 79–105. Alwin, D.F., and J.A. Krosnick (1991). The reliability of survey attitude measurement: The influence of question and respondent attributes. Sociological Methods and Research, 20: 139–181. Andrews, F. (1984). Construct validity and error components of survey measures: A structural modeling approach. Public Opinion Quarterly, 46: 409–442. Braun, M., and T.P. Johnson (2010). An illustrative review of techniques for detecting inequivalences. In Survey Methods in
Multinational, Multiregional, and Multicultural Contexts. Chichester: John Wiley & Sons, Inc., pp. 73–393. doi:10.1002/9780470609927.ch20 Breiman, L. (2001). Random forests. Machine Learning, 45 (1): 5–32. Byrne, B.M., and F.J.R. van De Vijver (2010). Testing for measurement and structural equivalence in large-scale cross-cultural studies: Addressing the issue of nonequivalence. International Journal of Testing, 10 (2): 107–132. Campbell, D.T., and D.W. Fiske (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 6: 81–105. Corten, I.W., W.E. Saris, G.M. Coenders, W. van der Veld, C.E. Aalberts, and C. Kornelis (2002). Fit of different models for multitrait– multimethod experiments. Structural Equation Modeling: A Multidisciplinary Journal, 9: 213–232. De Castellarnau, A., and W.E. Saris (2014). A simple way to correct for measurement errors. European Social Survey Education Net (ESS EduNet). Available at: http://essedunet. nsd.uib.no/cms/topics/measurement/ De Groot, A.D., and F.L. Medendorp (1986). Term, Begrip, Theorie: Inleiding tot Signifische Begripsanalyse. Meppel: Boom. Ginzburg, J. (1996). Interrogatives: Questions, facts and dialogue. In S. Lappin (ed.), The Handbook of Contemporary Semantic Theory. Cambridge, MA: Blackwell, pp. 385–421. Givon, T. (1984). Syntax. A Functional–Typological Introduction Vol. I–II. Amsterdam: J. Benjamin. Graesser, A.C., C.L. McMahen, and B.K. Johnson (1994). Question asking and answering. In M. Gernsbacher (ed.), Handbook of Psycholinguistics. San Diego, CA: Academic Press, pp. 517–538. Groenendijk, J., and M. Stokhof (1997). Questions. In J. van Benthem and A. ter Meulen (eds), Handbook of Logic and Language. Amsterdam: Elsevier, pp. 1055–1124. Hambleton, R.K., P.F. Merenda, and C.D. Spielberger (2005). Adapting Educational and Psychological Tests for Cross-cultural Assessment. Hillsdale, NJ: Lawrence Erlbaum Publishers. Harkness, J.A. (2003). Questionnaire translation. In J.A. Harkness, F.J.R. van de Vijver,
and P.P. Mohler (eds), Cross-cultural Survey Methods. Hoboken, NJ: John Wiley & Sons, pp. 35–56. Harkness, J.A., B. Edwards, S.E., Hansen, D.R., Miller, A. Villar (2010a). Designing questionnaires for multipopulation research. In Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: John Wiley & Sons, Inc., pp. 31–57. doi:10.1002/9780470609927.ch3 Harkness, J.A., F.J.R. van de Vijver, and P.P. Mohler (2003). Cross-cultural Survey Methods. Hoboken: Wiley & Sons. Harkness, J.A., A. Villar, and B. Edwards (2010b). Translation, adaptation, and design. In Survey Methods in Multinational, Multiregional, and Multicultural Contexts. Hoboken, NJ: John Wiley & Sons, Inc. pp. 115–140. doi:10.1002/9780470609927.ch7 Harris, Z. (1978). The interrogative in a syntactic framework. In H. Hiz (ed.), Questions. Dordrecht: Reidel, pp. 37–89. Horn, J.L., and J.J. McArdle (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18 (3–4):17–44. doi:10.1080/ 03610739208253916 Hox, J.J. (1997). From theoretical concept to survey questions. In L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz, and D. Trewin (eds), Survey Measurement and Process Quality. New York: John Wiley & Sons, pp. 47–70. Huddleston, R. (1994). The contrast between interrogatives and questions. Journal of Linguistics, 30: 411–439. Jöreskog, K.G. (1970). A general method for the analysis of covariance structures. Biometrika, 57:239–251. Jöreskog, K.G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36: 109–133. Költringer, R. (1993). Gültigkeit von Umfragedaten. Wien: Bohlau. Koning, P.L., and P.J. van der Voort (1997). Sentence Analysis. Groningen: Wolters– Noordhoff. Lance, C.E., L.E. Baranik, A.R. Lau, and E.A. Scharlau (2009). If it ain’t trait it must be method. In Charles E. Lance, Robert J. Vandenberg (eds), Statistical and Methodological Myths and Urban Legends. London: Taylor & Francis, pp. 337–360.
Mallinckrodt, B., and C.-C. Wang (2004). Quantitative methods for verifying semantic equivalence of translated research instruments: a Chinese version of the experiences in close relationships scale. Journal of Counseling Psychology, 51: 368–379. Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58 (4): 525–543. doi:10.1007/ BF02294825 Nida, E.A. (1964). Toward a Science of Translating: With Special Reference to Principles and Procedures Involved in Bible Translating (p. 331). Brill Archive. Oberski, D., T. Grüner, and W.E. Saris (2011). The prediction procedure the quality of the questions, based on the present data base of questions. In W.E. Saris, D. Oberski, M. Revilla, D. Zavalla, L. Lilleoja, I. Gallhofer, and T. Grüner, The Development of the Program SQP 2.0 for the Prediction of the Quality of Survey Questions. RECSM Working paper 24, Chapter 6. Available at: http://www.upf. edu/survey/_pdf/RECSM_wp024.pdf Oberski, D., W.E. Saris, and J. Hagenaars (2007). Why are There Differences in Measurement Quality Across Countries: Measuring Meaningful Data in Social Research. Leuven: Acco. Saris, W.E., and C. Aalberts (2003). Different explanations for correlated disturbance terms in MTMM studies. Structural Equation Modeling: A Multidisciplinary Journal, 10: 193–213. Saris, W.E., and F.M. Andrews (1991). Evaluation of measurement instruments using a structural modeling approach. In P.P. Biemer, R.M. Groves, L. Lyberg, N. Mathiowetz, and S. Sudman (eds), Measurement Errors in Surveys. New York: John Wiley & Sons, pp. 575–597. Saris, W.E. and I. Gallhofer (2007). Design, Evaluation, and Analysis of Questionnaires for Survey Research. New York: Wiley. doi:10.1111/j.1751-5823.2008.00054_20.x Saris, W.E., and I. Gallhofer (2014). Design, Evaluation, and Analysis of Questionnaires for Survey Research, 2nd edn. New York: John Wiley & Sons. Saris, W.E., and M. Revilla (2015). Correction for measurement errors in survey research: necessary and possible. Social Indicators Research. First published online: 17 June 2015.
DOI: 10.1007/s11205-015-1002-x Available at: http://link.springer.com/article/10.1007/ s11205-015-1002-x Saris, W.E, D. Oberski, M. Revilla, D. Zavalla, L. Lilleoja, I. Gallhofer, and T. Grüner (2011). The Development of the Program SQP 2.0 for the Prediction of the Quality of Survey Questions. RECSM Working paper 24. Available at: http://www.upf.edu/survey/_pdf/ RECSM_wp024.pdf Saris, W.E., A. Satorra, and G. Coenders (2004). A new approach to evaluating the quality of measurement instruments: The split-ballot MTMM design. Sociological Methodology, 34: 311–347. Scherpenzeel, A.C., and W.E. Saris (1997). The validity and reliability of survey questions: A meta-analysis of MTMM studies. Sociological Methods and Research, 25: 341–383. Schuman, H., and S. Presser (1981). Questions and Answers in Attitude Survey: Experiments on Question Form, Wording and Context. New York: Academic Press. Steenkamp, J.-B.E.M., and H. Baumgartner (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25 (1): 78–107. Sudman, S., and N.M. Bradburn (1983). Asking Questions: A Practical Guide to Questionnaire Design. San Francisco: Jossey-Bass.
Tourangeau, R., L.J. Rips, and K. Rasinski (2000). The Psychology of Survey Response. Cambridge, MA: Cambridge University Press. Vandenberg, R.J., and C.E. Lance (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3 (1): 4–70. doi:10.1177/109442810031002 Van Meurs, L., and W.E. Saris (1990). Memory effects in MTMM studies. In W.E. Saris and L. van Meurs (eds), Evaluation of Measurement Instruments by Meta-analysis of Multitrait– Multimethod Studies. Amsterdam: North Holland, pp. 89–103. Vijver, F.J.R. van de, K. Leung, (1997). Methods and Data Analysis for Cross-cultural Research. Thousand Oaks, CA: Sage Publications, Inc. Weber, E.G. (1993). Varieties of Questions in English Conversation. Amsterdam: J. Benjamin. Werts, C.E., and R.L. Linn (1970). Path analysis: Psychological examples. Psychological Bulletin, 74: 194–212. Zavala-Rojas, D., and W.E. Saris (2014). A Procedure to Prevent Differences in Translated Survey Items Using SQP. RECSM Working Paper 38. Available at: http://www.upf.edu/ survey/_pdf/RECSM_wp038.pdf
18 Designing a Mixed-Mode Survey Don A. Dillman and Michelle L. Edwards
INTRODUCTION A recent meeting with a survey sponsor on how they might change a single-mode survey to a mixed-mode survey began with the sponsor’s assertion: ‘Changing to mixed-mode is not something we want to do; it makes our work more difficult, but we don’t think we have a choice’. Surveyors are now considering changing from single-mode to mixed-mode designs for a variety of reasons. One reason for considering a transition to mixed-mode procedures is to improve coverage, through using a second or third mode to increase the likelihood that all members of a random sample can be contacted with a request to complete the questionnaire. A second reason for the use of multiple modes of contact and/or response is to improve the expected response rate and reduce the likelihood of respondents being different from nonrespondents. Yet another reason for a mixed-mode design, though observed only rarely, is to improve survey measurement, such as using a self-administered set of
questions at the end of a face-to-face interview to obtain answers to especially sensitive items. However, as expressed by the above survey sponsor, there are also multiple reasons for not wanting to utilize mixed-mode surveys. Mixed-mode surveying means that sponsors who are used to designing and implementing surveys in one mode are going to have to learn to deal with the intricacies of a different mode or modes. In addition, the coordination of one mode with another requires adding new survey implementation requirements. But, perhaps the most compelling objection is that measurement will change when a second or third mode is added to the data collection process. The exact nature of the measurement challenge differs greatly depending upon the population with which one is conducting the proposed survey, the mode or modes of surveying the researcher has proposed, and the purpose behind using each mode. In this chapter we consider the measurement consequences associated with going mixed-mode, and how to minimize them.
WHY SINGLE-MODE STUDIES ARE DECLINING Fewer single-mode studies are now being conducted than at any time in history. The use of single modes, whether face-to-face visits, telephone interviews, paper questionnaires, or Internet responses, has a substantial impact on how survey questions are asked in efforts to measure opinions, behaviors, and other sample unit characteristics. Understanding these consequences is important for articulating the measurement challenges associated with switching to multiple-mode designs for collecting survey data.
Face-to-face Interviews Until the late 1960s, face-to-face interviews were the standard practice for most household as well as individual-person surveys. Data collection by telephone had rarely been attempted, and mail was thought incapable of producing reasonable response rates with most questions being answered. Certain more or less standard data collection practices had evolved to improve face-to-face interviewing. For example, interviewers were trained to look for respondent fatigue and, when it seemed appropriate, to introduce conversational aides to stimulate rapport. Interviews were typically long, and mid-interview break-offs rare. There was little incentive for survey sponsors to keep interviews short. In household surveys during this time period, proxy interviewing was common, and even encouraged, inasmuch as households were more likely to be multi-person with someone at home when the interviewer visited. These factors resulted in high response rates for most interview studies. The prevailing cultural norms also influenced the structure of interviews, which emphasized interaction with potentially useful information being withheld unless it was deemed important. For example, survey designers usually restricted response
categories for opinion questions to desired response categories, but if a respondent had difficulty answering they would then offer a ‘no opinion’ or ‘prefer not to answer’ category. Interviewers also sometimes opted to use categories that the respondent never saw, e.g., ‘refused’. Great emphasis was focused on obtaining some kind of answer to every question, so that there were no missing items. Sometimes this entailed giving respondents ‘show cards’ that visually repeated long and complex questions or provided numerous answer choices. Though item nonresponse was low, face-to-face interviews also had a significant downside from the perspective of measurement. Respondents often provided socially desirable answers that fit prevailing social norms and reflected their evaluation of the interviewer’s beliefs as a representative of that culture (DeMaio, 1985; de Leeuw and van der Zouwen, 1988). In addition, interviews were often conducted in the presence of children or other adults, so that people’s responses were influenced by others. The desire to replace face-to-face interviews started to grow in the late 1960s. Researchers faced increased difficulties finding people at home and sought an alternative to the high costs of returning to a residence again and again to complete an interview. In addition, the part-time, predominately female, interview labor force became harder to obtain, as the societal trend towards employment of all adults continued.
Telephone Interviews As the presence of the telephone in households increased to 80% and beyond, surveyors began serious attempts to use telephone interviews as a replacement for face-to-face interviews (Blankenship, 1977; Dillman, 1978; Groves and Kahn, 1979). In particular, telephone’s random digit sampling frame made it possible to randomly contact and interview individuals in most households,
replacing face-to-face interviews for most surveys. The computerization of survey processes – sampling, calling, recalling unanswered numbers, recording answers, and data summarization – also produced large cost savings. The measurement consequences of the shift to telephone were considerable. Telephone surveying could no longer rely on show cards as a way of clarifying what was being asked of respondents. Instead, the telephone required adapting to the delivery of shorter verbal utterances. This led to the development of shorter fully labeled scales (e.g., fewer than seven answer choices) and even more intense branching systems (e.g., ‘If yes to this question, go to question x’) than those used in face-to-face interviews. Computerization also made it possible to achieve that branching without the errors that often occurred in face-to-face interviews. At the same time, the issues of social desirability and acquiescence (the tendency to agree) seen in face-to-face interviewing also existed for telephone interviewing (de Leeuw, 1992). Telephone interviewing now faces serious problems throughout the world, including issues with sampling (Lepkowski et al., 2007). Household landlines are rapidly giving way to personal cellular phones, and it is not possible to achieve adequate coverage of household populations without including such phones. To do this requires accommodating to new problems, such as the possibility of children having phones or people answering the phone while involved in other activities (e.g., driving) that are not conducive to answering survey questions. In addition, the transportability of household numbers to different places means that it is a challenge to be sure the recipient of the call is eligible by place of residence for the survey. As one surveyor put it, ‘I now have to spend a third of my interview getting information about who I have reached and whether I can interview them’. A secondary effect of telephone interviewing is that a person’s tendency to cut-off in
the middle of the interview is much, much greater than for face-to-face interviews. Thus, it is not surprising that a second problem with telephone interviewing is plummeting response rates. For example, the Pew Research Center (2012) reported declines in response rates from 36% in 1997 to only 9% in 2012. While high response rates may not correlate well with low nonresponse error (differences in the characteristics of respondents and nonrespondents), it is difficult to know when low response is a problem and affects the overall credibility of surveys. Perhaps the largest problem facing telephone interviews is that talking to strangers over the telephone is no longer normative. Communication has become increasingly asynchronous, where messages are delivered in one-way rapid fire bursts between participants (Dillman et al., 2014). Longer substantive exchanges are also becoming asynchronous, allowing individuals to deliver multiple ideas and obtain thoughtful response at a later time. Most such conversations have transitioned to electronic exchanges. Thus, being interviewed by telephone now runs counter to current societal norms, which makes self-administration a better fit with preferred interaction styles.
Mail Questionnaires Self-administered paper questionnaires have been used for surveys for nearly as long as face-to-face interviews. Yet, mail has often been considered a mode of last resort (Dillman, 1978). In the early years of surveying, mail response rates tended to be lower than those obtained by face-to-face or telephone interviewing. In addition, some believed that without an interviewer to guide the respondent, misinterpretations by the respondent could not be corrected. The inability to read and write also made surveying by mail difficult with some populations. Others were concerned that the explanations provided for encouraging people to complete mail questionnaires may
bias the results, compared to modes in which respondents are recruited without prior knowledge of the topics or specific questions to be included in the survey. Other concerns about mail surveying are related to measurement, such as the potential for higher item nonresponse, difficulties in following branching instructions, and the lack of completeness and clarity of open-ended responses. It is also more likely that mail questionnaire respondents will look at the questions before starting to answer and complete questions out of order, thus producing unexpected and undesired question order effects. In addition, whenever surveyors are required to use mail, for cost or other reasons, there has been a tendency to change mail questionnaires in ways that create additional problems, such as converting a series of forced-choice items to a check-all-that-apply format. Or, to avoid the need to branch, some mail surveyors have turned to more complex question wording. For example, instead of asking whether one owned or rented the home in which they lived and following up to get more detail, the 2000 Decennial Census mail questionnaire asked, ‘Is this house, apartment or mobile home – owned by you or someone in this household with a mortgage or loan, owned by you or someone in this household free and clear (without a mortgage or loan), rented for cash rent, or occupied without payment of cash rent?’ For most of the twentieth century, there was also a problem with getting complete coverage of households, inasmuch as there were no lists of households available for drawing samples or established random sampling algorithms as there was for telephone random digit dialing or face-to-face sampling methods (Dillman, 1991). However, in recent years, there has been a significant change with regard to being able to access households through postal contact. The US Postal Service now provides address information for all households in the US that receive postal mail delivery, making it possible to contact 95–97% of all residential households throughout the United States. In
a striking reversal of fortunes, mail surveys have gone from being the least adequate mode for obtaining samples of the general public to the best mode. In addition, mail-only surveys consistently produce response rates much greater than can usually be obtained by telephone-only or email/web- only surveys (Dillman et al., 2014). The coverage and response-inducing capabilities of mail make it particularly useful when used in support of obtaining responses over the Internet and/or telephone.
Internet Questionnaires Internet surveying was seen as the heir apparent to telephone interviewing just as the latter had been seen in the 1970s as the expected replacement for face-to-face interviews. However, that transition has not occurred as smoothly or as rapidly as desired. A case can now be made that surveying college-educated, working-age populations can now be done entirely by the Internet in some countries. However, less educated, lower income, and older people are less likely to have Internet access and be able to use it effectively. In addition, for many individuals, the Internet is not the preferred mode of response (Stern et al., 2014). The Internet has allowed many new features to be used for survey measurement. Pictures and maps, slider scales, drop-down menus, complicated branching similar to that used on the telephone, special instructions, fill-ins from previous answers, and recorded voices that ask questions are only a few of the measurement additions that can be made. Questions can also be broken into multiple steps, such as asking whether people are satisfied or not satisfied, followed by, ‘Would that be very, somewhat, or only slightly satisfied?’ The measurement capabilities for Internet surveying continue to expand (Christian et al., 2007). However, Internet surveying of households continues to be limited by coverage.
Although 85% of US adults report using the Internet, only 70% have broadband Internet access in their homes. The lack of broadband access significantly limits the ability of many individuals to be surveyed. In addition, there is no sample frame available for contacting random samples of households. It has also been observed that response rates to Internet surveys using email-only contacts tend to be low, similar to those obtained by telephone (Dillman et al., 2014).
MIXED-MODE SURVEYS AS A POTENTIAL SOLUTION One potential solution for the coverage and response problems that now characterize single-mode survey designs is mixed-mode surveying. There are multiple ways in which modes can be mixed, with some designs producing greater measurement concerns than others (de Leeuw, 2005; Dillman et al., 2009). The most common differences in mixed-mode designs are: (1) implementation strategies that use multiple modes of contact and (2) those that use multiple modes of response. It’s also possible to use both features in the same design, i.e., (3) multiple modes of contact and multiple modes for responding. In the first form of mixed-mode surveys, more than one mode of contact, researchers typically use a different mode or modes to contact, recruit, or screen sample members than the mode used to collect data. Considerable research summarized by Dillman et al. (2014) now shows that doing so can improve response rates. In particular, mixing modes of contact may help to legitimize the response mode, as well as to deliver rewards to respondents. For example, incentives sent by postal mail with a request to respond by telephone or by the Internet can produce response rates much higher than those obtained by using telephone contact only to obtain a phone response, or email
only to obtain a web response (Dillman et al., 2014). Mixing modes in this way can potentially reduce coverage and nonresponse error, while posing no risk of mode effects on measurement, since data collection is restricted to one mode (de Leeuw, 2005). For example, Millar and Dillman (2011) have shown that using both mail and email contacts with a request for only a web response, can produce a higher response rate, with better population representation, than a design that uses only one contact mode, whether postal or email. Another important use of mixing contact methods occurs in the building of probability Internet panels, which is occurring world-wide with increased frequency. For example, face-to-face interviews have been used in European countries to recruit individuals from population registers for panels that are then surveyed only over the Internet (e.g., the Netherlands’ Longitudinal Internet Studies for the Social Sciences [LISS] panel). In the United States, probability household panels have been recruited by mail from address-based sampling, which are then surveyed only over the Internet. In these instances recruitment by one mode and response by another mode is considered an important part of data collection success (Das et al., 2011). The second form of mixed-mode surveys, mixing modes of response, poses greater risks for mode effects on measurement, especially when visual and aural modes of response are used by different respondents. For example, some surveys with contacts made only by the telephone offer individuals the opportunity to respond over the phone or via the Internet. A second, and particularly common example, involves surveys using addressbased sampling, which use postal contacts to give recipients the option of responding by either mail or the Internet. The motivation for offering different response modes is in part the belief that certain respondents have a clear preference for a particular response mode and are more likely to respond when that mode is offered. However, research has
shown that mode response preference is not especially powerful as a determinant of whether people respond (Olson et al., 2012). A meta-analysis has also shown that mixing modes of response can negatively impact response rates compared to studies using a single mode of response when sample members are offered a simultaneous or concurrent choice of response modes (Medway and Fulton, 2012). By far the most common mixed-mode design is to use both multiple contact modes and multiple response modes. These designs capture the benefits of multiple contacts and multiple response modes. Multiple modes of contact make it possible to stimulate response from sample units that cannot be reached by other contact modes. Research has shown that offering both postal and web response options, supported by both postal and email contacts, can improve response rates compared to using only one contact mode while seeking response by multiple modes (Millar and Dillman, 2011). While using multiple response modes may alleviate some coverage and response problems now associated with single-mode surveys, these designs potentially increase measurement error due to mode differences. The use of multiple contact and response modes may also bring into play varied respondent selection procedures that have different coverage limitations. Such effects may occur both at the household and the individual respondent levels. These effects and potential methods for resolving those differences have been described by others (Vannieuwenhuyze et al., 2010; Lugtig et al., 2011).
ACHIEVING COMMON MEASUREMENT ACROSS SURVEY RESPONSE MODES There are various reasons why different measurement is obtained across modes.
They include: the structure of the social setting, social desirability bias, acquiescence bias, interviewer presence and bias, response choice order (e.g., recency vs primacy effects), recall effects, the length of verbal responses, the sensitivity of information, respondent mode preferences, and the visual vs aural presentation of questions (Bowling, 2005; Dillman et al., 2014). Thus, obtaining common measurement across the various survey response modes is far from simple. The single-mode tradition of surveying has strongly influenced the way questions are structured, the exact wording of those questions, and how they appear visually or aurally to respondents and/or interviewers. Not only are certain formats favored by users of a particular mode, these formats have developed over time as best practices based on their fit with a particular mode of administration. For example, those trained in face-to-face interviewing are understandably reluctant to let go of traditional formats and wordings that are typically supported with show cards. Similarly, telephone designers who have learned to shorten utterances to achieve quicker and better understanding through the aural channel without the use of show cards are reluctant to use other formats. In addition, mail and web designers have each evolved towards the use of favored formats for self-administration. In mixed-mode surveying, it is desirable to present a unified stimulus across modes to collect data for a particular study. To do so requires thinking about three different tasks: question structure, question wording, and visual versus aural presentation of questions. We refer to efforts to achieve the same or closely similar question structures, question wordings, and visual layouts (whether on computer screens or paper) as unified mode construction (Dillman et al., 2009). Below, we provide recommendations for how to integrate unified mode construction into the development of mixed-mode designs.
Avoid Mode-specific Question Structures When Possible One of the main sources of measurement differences across modes is the use of different question structures. For example, self-administered questionnaires, such as mail and web surveys, commonly use check-all formats for questions, asking respondents to mark all answers that apply. In contrast, telephone and face-to-face interviews typically ask respondents to answer yes/no for each individual item. Experimental research has shown that check-all-that-apply formats in both mail and web surveys decrease the number of answers provided compared to forced-choice formats (Dillman and Christian, 2005; Smyth et al., 2006). Also, by using an alternative forced-choice format for each item in web and telephone surveys, research has shown that quite similar answers can be obtained (Smyth et al., 2008). Thus, researchers can reduce potential differences across modes by using similar question structures, or in this example, the forced-choice format. As mentioned earlier, face-to-face or telephone interviewers are typically instructed to hold back certain categories, such as ‘no opinion’ or ‘prefer not to answer’, in order to reduce their use. Instead, these survey modes rely on the interviewers’ judgment as to when to offer these answer categories, as a way of avoiding no answer at all. If, in self-administered web and mail surveys, these answer categories are normally presented to respondents, this could produce differences when interview and self-administered results are used in the same study. Again, this difference can be avoided simply by presenting the categories in all modes. As a third example, face-to-face and telephone interviewers sometimes ask questions in an open format that requires the interviewer to code the answer. For example, an interviewer might ask a respondent ‘What is your marital status?’ Web and mail surveys, on the other hand, frequently avoid these types of open-ended question formats,
typically providing respondents with specific answer categories, with the expectation that the respondent will select one of the provided response options. Research shows that these two question formats produce different results (Dillman and Christian, 2005), which could be prevented through the use of one consistent question structure across modes. In addition, the challenge that respondents face in remembering information delivered to them aurally has led some surveyors to limit the length of items to the extent possible, which leads to more extensive branching patterns. For example:
Were you employed full-time during July 2014?
  Yes
  No
(If no) Were you employed part-time during July 2014?
  Yes
  No
(If yes to either question) Were you paid on an hourly basis?
  Yes
  No
(If no) Were you paid a set salary instead?
  Yes
  No
(If no) How were you remunerated for your work?
Each of these items focuses on a single concept in an effort to keep the focus of the question clear to respondents. In self-administered surveys, a question structure similar to the one below seems more likely to be used:
During July 2014, were you employed full-time, part-time, in some other way, or not employed?
  Employed full-time
  Employed part-time
  Employed in some other way (please specify) ________________
  Not employed
(If employed) Were you paid on an hourly basis, a set salary, or in some other way?
  Hourly basis
  A set salary
  Some other way
Telephone or aural surveys are often quite effective in focusing successive questions on single concepts, while using automatic branching to make sure the next appropriate question appears in sequence. Mail surveys, however, face an enormous challenge in being able to get respondents to appropriately branch from one item to the next. Web surveys are somewhat intermediate in the sense that the same branching used for telephone can be used on the web, although there is some danger that complicated branching schemes can cause web respondents to lose track of where they are in the sequence of responding to individual screens. A goal of unified mode construction is to optimize construction across modes rather than maximizing construction for the benefit of one mode alone. Structuring questions similarly, to the extent feasible, is the first essential step in trying to reduce mode differences.
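To make the branching discussion more concrete, the employment items above can be specified once, together with explicit skip rules, and reused across modes: a web instrument evaluates the rules automatically, while a paper version prints the equivalent written instructions. The following minimal sketch is our own illustration; the item identifiers and the rule format are not features of any particular survey package.

# Hypothetical unified representation of the employment items with explicit skip rules.
QUESTIONS = {
    "employment": {
        "text": ("During July 2014, were you employed full-time, part-time, "
                 "in some other way, or not employed?"),
        "options": ["Employed full-time", "Employed part-time",
                    "Employed in some other way", "Not employed"],
        # Ask about pay only if the respondent reports any form of employment.
        "next": lambda answer: "pay" if answer != "Not employed" else None,
    },
    "pay": {
        "text": "Were you paid on an hourly basis, a set salary, or in some other way?",
        "options": ["Hourly basis", "A set salary", "Some other way"],
        "next": lambda answer: None,
    },
}

def administer(answers):
    """Walk through the items, applying the skip rules to the recorded answers,
    as a web survey engine would do automatically."""
    item = "employment"
    while item is not None:
        print(QUESTIONS[item]["text"])
        item = QUESTIONS[item]["next"](answers[item])

# A part-time respondent is routed to the pay question; a respondent who is not
# employed skips it.
administer({"employment": "Employed part-time", "pay": "Hourly basis"})
administer({"employment": "Not employed"})

On paper, the same rule would instead appear as a written branching instruction next to the ‘Not employed’ category, which is where the graphical branching aids discussed later in this chapter become important.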
Reduce Wording Differences In addition to structural differences, the exact wording used in questions often varies, with the objective of making questions and questionnaires more suitable to the mode. For example, in telephone interviews, conversational inserts are often included in questions to personalize the items, as shown below: ‘Next I would like to ask you to think about where you have eaten meals during the last week.’ (Followed by) ‘Have you eaten any meals in a fast food restaurant?’
In a self-administered survey the question might simply be stated:
During the past week have you eaten any meals in a fast food restaurant?
Another kind of wording difference between aural and self-administered questionnaires involves the insertion of response categories into questions as reminders. In self-administered questionnaires, a question might appear like this when the respondent is asked about satisfaction with various aspects of life: Thinking about the neighborhood where you live, would you say that you are:
  Completely satisfied
  Mostly satisfied
  Somewhat satisfied
  Slightly satisfied
  Not at all satisfied
In an aural survey the response categories are likely to be included in the query itself: ‘Thinking about the neighborhood where you live, would you say that you are completely satisfied, mostly satisfied, somewhat satisfied, slightly satisfied, or not at all satisfied?’
These differences are unlikely to affect answers. However, in other instances, such as when respondents are asked a series of such questions, a telephone interviewer is often instructed to repeat the answer categories for the first few items until they ‘think’ the respondent will remember them, with the interviewer eventually switching to abbreviated queries like: ‘The next item is the police protection in your community’. The dependence on interviewers to use their judgment often leads to different styles of interviewing as they attempt to build rapport and adapt their style to respondent interests. Use of this process on the telephone helps to overcome the memory limitation associated with providing a lot of information over the telephone, but also results in those respondents receiving a different set of words that comprise the question stimulus, with potential effects that are as of yet not well understood. At the same time, a series of experiments comparing
seemingly small wording variations for opinion questions found many significant variations in answers (Christian et al., 2009). When designing a mixed-mode survey it is desirable to use the same wording to the extent practical, while making reasonable accommodations to deal with different memory capabilities and avoiding redundancy (such as repeating the same category choices in the question stem and as response categories).
Reduce Differences from Aural Versus Visual Presentation of Questions The aural presentation of questions in telephone, Interactive Voice Response (IVR) and face-to-face surveys, and the visual presentation of questions in web and mail surveys, can impact measurement. For example, a comparison of two aural modes of collecting data, telephone and IVR, and two visual modes, mail and web, for which data were collected in 1999 found that responses to visual modes (mail and web) were quite similar, while responses to aural modes (IVR and telephone) were also similar but quite different than the visual responses (Dillman et al., 2009). In particular, answers to satisfaction questions about long distance telephone service tended to be more extreme in a positive direction for a series of five telephone and IVR modes than the mail and web modes, as shown in Figure 18.1. It can be seen here that the percent of respondents responding in the most satisfied category on aurally delivered surveys was nearly twice as high as the percent responding that way in the visually administered format. This study raised the critical question of whether it is possible to obtain the same responses to opinion questions asked in aural and visual surveys. Extensive experimentation carried out for multiple types of question formats (i.e., unipolar and bipolar, polar point only and fully labeled, 5 point to 11 point, and one vs two-step formats) demonstrated that telephone respondents provided significantly more positive answers than did web respondents for all of these variations (Christian et al., 2008). This research suggests that even with consistent question structures and wording, the complete reduction of mode measurement differences between aural and visual surveys may not be possible for certain types of questions.
IVR
Mail
Web
60 50
Percent
50 40
43
39 39
36 27
30 21
29 29
26 21
33
19 18
20 11
10
21 22
18
16
9
0 Q2
Q3
Q4
Q5
Q6
Question Number
Figure 18.1 Percent of respondents choosing the most positive endpoint category for five long distance satisfaction questions in a survey, by assigned response mode. Source: Dillman et al. (2009).
264
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
Thus, combining visual self-administered modes (mail or web) may be less of a measurement challenge than combining an interview mode (telephone or face-to-face) with either of these modes. Other research has shown that answers to survey questions in the visual mode vary significantly based upon specific features of the question layout (Jenkins and Dillman, 1997; Christian and Dillman, 2004; Smyth et al., 2006). For example, the size of open-ended answer spaces has been found to influence the number of words entered, as well as the completeness of answers, for both mail (Christian and Dillman, 2004) and web questionnaires (Smyth et al., 2009). The size of answer spaces for numerical answers has also been found to impact people’s responses (Christian et al., 2007; Couper et al., 2011). In addition, research has shown that the use of graphical symbols and visual layouts can significantly improve responses to branching instructions in paper questionnaires (Redline and Dillman, 2002), bringing answers into line with those provided to web surveys. Comparisons of item nonresponse for mail and web questionnaires using those design principles reveal only minor differences (Messer et al., 2012; Millar and Dillman, 2012). The many consequences of visual design are summarized elsewhere (Dillman et al., 2014), but generally suggest that the effects in web and mail questionnaires are quite similar. These findings encourage us to try to achieve common visual formats for web and mail mixed-mode surveys as part of our effort to achieve unified mode design.
A PRACTICAL APPLICATION: TESTING A WEB+MAIL MIXED-MODE DESIGN TO OVERCOME COVERAGE AND RESPONSE PROBLEMS, WHILE ACHIEVING COMMON MEASUREMENT We end this chapter by describing a potential mixed-mode design for improving survey
coverage and response to household surveys now undergoing testing that utilizes the unified mode design ideas discussed here to minimize measurement differences across survey modes. Random samples of residential addresses obtained from the US Postal Service, which include more than 95% of all US residential addresses, were used to contact households. Over a period of about five years, five large-scale experiments were conducted to determine whether households could be effectively pushed from only mail contacts to respond by the web (Smyth et al., 2010; Messer and Dillman, 2011; Messer, 2012; Edwards et al., 2014). The push-to-web treatment groups in these experiments mixed modes in two ways. First, they utilized only one mode (mail) to contact sample members throughout the research process; and second, while they encouraged response by one mode (web), they provided an opportunity through later contacts to respond by a second mode (mail follow-up) for those unable or unwilling to go to the web. The push-to-web treatment groups in these experiments resulted in response rates that ranged from about 31% to 55%, with an average of 43%. While these response rates were lower than those obtained by single-mode mail-only treatment groups in the same studies (which had an average response rate of 53%), they were higher than response rates commonly obtained by other surveys of the American general public (Dillman et al., 2014). Overall, the push-to-web design was able to obtain an average of almost two-thirds of responses by the web. However, these experiments also showed that web respondents within the push-to-web group tended to differ demographically from the mail follow-up respondents, particularly on variables such as age, education level, and income level. Thus, in studies of the general public, it remains crucial that researchers provide an alternative mode of response for those who are unwilling or unable to respond by web in order to obtain a representative sample.
Figure 18.2 Example of unified design question format, using the same question structures, question wording and visual appearance in both the mail (on the left) and web (on the right) versions of the question; next and back buttons on web screens are not shown here. Source: Lewiston and Clarkston Quality of Life Survey. For more explanation, see Smyth et al. (2010) and Dillman et al. (2014).
Across each of these experiments, researchers designed web and paper questionnaires that largely shared the same question structures, question wordings, and visual layouts. Figure 18.2 illustrates this unified mode construction using the first set of questions in both the web and mail version of the Lewiston and Clarkston Quality of Life Survey (Smyth et al., 2010). The visual layout provides a quite similar visual stimulus in both web and mail. However, there were certain aspects in which researchers departed from unified mode construction techniques in order to adjust for the differences between web and mail modes. For example, in paper questionnaires, respondents can easily see how many questions remain in the survey by simply turning the pages. However, in web questionnaires, this is far more difficult. As a result, while items were numbered normally in the paper questionnaires (e.g., 1, 2, 3), a numbering system (e.g., 1 of 52, 2 of 52) was used in the web questionnaires to simultaneously indicate progress. The layout of web pages, which are typically viewed horizontally rather than vertically, also meant that the same number of questions could not be displayed on a single web screen as on a piece of paper without requiring scrolling, which could potentially increase respondent burden. As a result, researchers displayed only one question per screen on the web. To create as parallel an experience as possible, researchers utilized lightly colored visual fields in the paper questionnaires to discourage eye movement across questions. In addition, while paper respondents were directed to branch or continue using graphical layouts that included arrows and special wording, web respondents were automatically skipped to the correct next question. Aside from these differences, web questionnaires did not use other features that could not be replicated on paper, such as pop-up explanations of items or drop-down menus. These examples demonstrate that while these
experiments primarily relied on unified mode construction techniques to reduce potential measurement error, there are still situations in which we need to allow for differences in order to improve overall measurement.
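As a small illustration of the numbering adjustment described above, both labels can be generated from a single unified item list so that only the presentation differs between modes; the item wordings below are invented for the example and are not taken from the Lewiston and Clarkston questionnaire.

# Hypothetical unified item list; only the pagination and numbering differ by mode:
# several plainly numbered items per paper page versus one item per web screen
# with 'k of N' progress numbering.
items = ["Satisfaction with your neighborhood",
         "Feeling of safety in your community",
         "Police protection in your community",
         "Condition of streets and roads"]

def web_screens(item_list):
    total = len(item_list)
    return [f"Screen {i + 1}: {text} (progress shown as '{i + 1} of {total}')"
            for i, text in enumerate(item_list)]

def paper_pages(item_list, per_page=3):
    # On paper the items keep plain numbers (1, 2, 3, ...) and are grouped into pages.
    numbered = [f"{i + 1}. {text}" for i, text in enumerate(item_list)]
    return [numbered[i:i + per_page] for i in range(0, len(numbered), per_page)]

for screen in web_screens(items):
    print(screen)
for page in paper_pages(items):
    print(page)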
CONCLUSION

Increasingly, surveyors are turning to mixed-mode surveys as a means of improving response rates and survey representativeness. One unfortunate consequence of this tendency is that measurement differences sometimes negate the effect of those gains. In this chapter we have described steps surveyors can take to reduce measurement differences. Specifically, we suggest that, instead of maximizing the fit of questions to each mode as has traditionally been done, unified mode design principles be followed. This means using, to the extent possible, the same question structure and wording across all modes. It also means understanding how aurally versus visually presented questions are likely to produce different answers. Our application of these principles shows how a push-to-web (web+mail) design for household surveys in the United States is currently evolving; this design has resulted in greatly improved response rates and reduced measurement differences.
RECOMMENDED READINGS

For additional information, Dillman and Christian (2005) describe how mode differences are the result of multiple causes, ranging from preferred wording for particular modes to visual versus aural differences in comprehension across survey modes. De Leeuw (2005) provides an interpretation of the growing literature on why surveying has been trending towards greater use of mixed-mode survey designs. Dillman et al. (2014) provide research-based guidance on alternative ways
of designing mixed-mode surveys for greater effectiveness.
REFERENCES

Blankenship, A.B. (1977). Professional Telephone Surveys. New York: McGraw Hill.
Bowling, A. (2005). Mode of Questionnaire Administration Can Have Serious Effects on Data Quality. Journal of Public Health, 27, 281–291.
Christian, L.M., and Dillman, D.A. (2004). The Influence of Graphical and Symbolic Language Manipulations on Responses to Self-Administered Questions. Public Opinion Quarterly, 68, 57–80.
Christian, L.M., Dillman, D.A., and Smyth, J.D. (2007). Helping Respondents Get it Right the First Time: The Influence of Words, Symbols, and Graphics in Web Surveys. Public Opinion Quarterly, 71, 113–125.
Christian, L.M., Dillman, D.A., and Smyth, J.D. (2008). The Effect of Mode and Format on Answers to Scalar Questions in Telephone and Web Surveys. In J.M. Lepkowski, C. Tucker, M. Brick, E.D. de Leeuw, L. Japec, P.J. Lavrakas, M.W. Link, and R.L. Sangster (eds), Advances in Telephone Survey Methodology (pp. 250–275). Hoboken, NJ: Wiley-Interscience.
Christian, L.M., Parsons, N.L., and Dillman, D.A. (2009). Designing Scalar Questions for Web Surveys. Sociological Methods and Research, 37, 393–425.
Couper, M.P., Kennedy, C., Conrad, F.G., and Tourangeau, R. (2011). Designing Input Fields for Non-Narrative Open-Ended Responses in Web Surveys. Journal of Official Statistics, 27, 65–85.
Das, M., Ester, P., and Kaczmirek, L. (2011). Social and Behavioral Research and the Internet: Advances in Applied Methods and Research Strategies. New York: Routledge.
de Leeuw, E.D. (1992). Data Quality in Mail, Telephone, and Face-to-Face Surveys. Amsterdam, Netherlands: TT Publications.
de Leeuw, E.D. (2005). To Mix or Not to Mix Data Collection Modes in Surveys. Journal of Official Statistics, 21, 233–255.
de Leeuw, E.D., and van der Zouwen, J. (1988). Data Quality in Telephone and Face-to-Face Surveys: A Comparative Analysis. In R.M. Groves, P.P. Biemer, L.E. Lyberg, J.T. Massey, W.L. Nicholls II, and J. Waksberg (eds), Telephone Survey Methodology (pp. 283–299). New York: John Wiley & Sons, Inc.
DeMaio, T.J. (1985). Social Desirability and Survey Measurement: A Review. In C.F. Turner and E. Martin (eds), Surveying Subjective Phenomena (Volume 2) (pp. 257–282). New York: Russell Sage Foundation.
Dillman, D.A. (1978). Mail and Telephone Surveys: The Total Design Method. New York: John Wiley & Sons, Inc.
Dillman, D.A. (1991). The Design and Administration of Mail Surveys. Annual Review of Sociology, 17, 225–249.
Dillman, D.A., and Christian, L.M. (2005). Survey Mode as a Source of Instability across Surveys. Field Methods, 17, 30–52.
Dillman, D.A., Phelps, G., Tortora, R., Swift, K., Kohrell, J., Berck, J., and Messer, B.L. (2009). Response Rate and Measurement Differences in Mixed-Mode Surveys Using Mail, Telephone, Interactive Voice Response (IVR) and the Internet. Social Science Research, 38, 1–18.
Dillman, D.A., Smyth, J.D., and Christian, L.M. (2014). Internet, Phone, Mail and Mixed-Mode Surveys: The Tailored Design Method (4th edn). Hoboken, NJ: John Wiley & Sons, Inc.
Edwards, M.L., Dillman, D.A., and Smyth, J.D. (2014). An Experimental Test of the Effects of Survey Sponsorship on Internet and Mail Survey Response. Public Opinion Quarterly, 78, 734–750.
Groves, R.M., and Kahn, R.L. (1979). Surveys by Telephone: A National Comparison with Personal Interviews. New York: Academic Press.
Jenkins, C.R., and Dillman, D.A. (1997). Towards a Theory of Self-Administered Questionnaire Design. In L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz, and D. Trewin (eds), Survey Measurement and Process Quality (pp. 165–196). Hoboken, NJ: Wiley Series in Probability and Statistics.
Lepkowski, J.M., Tucker, C., Brick, M., de Leeuw, E.D., Japec, L., Lavrakas, P.J., Link, M.W., and Sangster, R.L. (eds) (2007). Advances in Telephone Survey Methodology. Hoboken, NJ: Wiley-Interscience.
Lugtig, P., Lensvelt-Mulders, G.J.L.M., Frerichs, R., and Greven, A. (2011). Estimating Nonresponse Bias and Mode Effects in a Mixed-Mode Survey. International Journal of Market Research, 53, 669–686.
Medway, R.L., and Fulton, J. (2012). When More Gets You Less: A Meta-Analysis of the Effect of Concurrent Web Options on Mail Survey Response Rates. Public Opinion Quarterly, 76, 733–746.
Messer, B.L. (2012). Pushing Households to the Web: Experiments of a 'Web+Mail' Methodology for Conducting General Public Surveys (Unpublished doctoral dissertation). Washington State University, Pullman, WA.
Messer, B.L., and Dillman, D.A. (2011). Surveying the General Public over the Internet Using Address-Based Sampling and Mail Contact Procedures. Public Opinion Quarterly, 75, 429–457.
Messer, B.L., Edwards, M.L., and Dillman, D.A. (2012). Determinants of Item Nonresponse to Web and Mail Respondents in Three Address-Based Mixed-Mode Surveys of the General Public. Survey Practice, 5(2), http://surveypractice.org/index.php/SurveyPractice/article/view/45/pdf [accessed 14 June 2016].
Millar, M.M., and Dillman, D.A. (2011). Improving Response to Web and Mixed-Mode Surveys. Public Opinion Quarterly, 75, 249–269.
Millar, M.M., and Dillman, D.A. (2012). Do Mail and Internet Surveys Produce Different Item Nonresponse Rates? An Experiment Using Random Mode Assignment. Survey Practice, 5(2), http://surveypractice.org/index.php/SurveyPractice/article/view/48/pdf [accessed 14 June 2016].
Olson, K., Smyth, J.D., and Wood, H.M. (2012). Does Giving People Their Preferred Survey Mode Actually Increase Survey Participation Rates? An Experimental Examination. Public Opinion Quarterly, 74, 611–635.
Pew Research Center (2012). Assessing the Representativeness of Public Opinion Surveys. May 15. Retrieved from http://www.people-press.org/2012/05/15/assessing-the-representativeness-of-public-opinion-surveys/ [accessed 14 June 2016].
Redline, C.D., and Dillman, D.A. (2002). The Influence of Alternative Visual Designs on Respondents' Performance with Branching Instructions in Self-Administered Questionnaires. In R.M. Groves, D.A. Dillman, J.L. Eltinge, and R.J.A. Little (eds), Survey Nonresponse (pp. 179–193). New York: John Wiley & Sons, Inc.
Smyth, J.D., Christian, L.M., and Dillman, D.A. (2008). Does 'Yes or No' on the Telephone Mean the Same as 'Check-All-That-Apply' on the Web? Public Opinion Quarterly, 72, 103–111.
Smyth, J.D., Dillman, D.A., Christian, L.M., and McBride, M. (2009). Open-Ended Questions in Web Surveys: Can Increasing the Size of Answer Boxes and Providing Extra Verbal Instructions Improve Response Quality? Public Opinion Quarterly, 73, 325–337.
Smyth, J.D., Dillman, D.A., Christian, L.M., and O'Neill, A.C. (2010). Using the Internet to Survey Small Towns and Communities: Limitations and Possibilities in the Early 21st Century. American Behavioral Scientist, 53, 1423–1448.
Smyth, J.D., Dillman, D.A., Christian, L.M., and Stern, M.J. (2006). Comparing Check-All and Forced-Choice Question Formats in Web Surveys. Public Opinion Quarterly, 70, 66–77.
Stern, M.J., Bilgen, I., and Dillman, D.A. (2014). The State of Survey Methodology: Challenges, Dilemmas, and New Frontiers in the Era of the Tailored Design. Field Methods, 26, 284–301.
Vannieuwenhuyze, J., Loosveldt, G., and Molenberghs, G. (2010). A Method for Evaluating Mode Effects in Mixed-Mode Surveys. Public Opinion Quarterly, 74, 1027–1045.
19 The Translation of Measurement Instruments for Cross-Cultural Surveys
Dorothée Behr and Kuniaki Shishido
INTRODUCTION

Cross-national surveys typically collect data by using an (almost) identical set of questions across different countries. The goal of cross-national surveys, that is, comparing countries or regions on various dimensions, requires that these questions are equivalent. Otherwise methodological artefacts might be taken as real similarities or differences between countries. Equivalence needs to be addressed ex-ante by adequate source questionnaire development, translation, and related testing; furthermore, it needs to be addressed ex-post by quantitative and/or qualitative assessments of questions. This chapter concentrates on the role of translation in the endeavor to produce equivalent questions in cross-national studies; it thus sheds light on what can be done ex-ante to address and ensure equivalence. The focus of this chapter will be on cross-national, multilingual surveys that are designed for the purpose of cross-national
comparisons. Despite this focus, much of the chapter also applies to questionnaire translation beyond the survey context, such as when personality inventories are translated; to questionnaire translation within a single country, such as when questionnaires are translated for different linguistic groups or migrant populations; or to questionnaire translation in the context of adopting a questionnaire originally developed for one country for use in another country. In addition, while questionnaires will be the main concern, many findings and good practice approaches also apply to the translation of assessment instruments, such as those assessing numeracy or literacy skills. In fact, all these specific fields contribute to a large extent to the literature on questionnaire translation. The chapter is set up as follows. First, the importance of good questionnaire design for high-quality translation is addressed. Second, various translation and translation assessment methods are introduced. Third, the concepts of translation and adaptation are delineated.
Fourth, the challenges of translating attitude and opinion items, with a special focus on response scales, will be presented. This will be done using the example of cross-national survey research in Asia. Finally, further developments, research desiderata as well as recommended readings complete the chapter. Following a naming convention in translation studies, the original questionnaire, language, or culture will be called source questionnaire, language, or culture. Its counterpart will be the target questionnaire, language, or culture. The terms questionnaire and instrument will both be used to refer to a question–answer-based measurement form.
GOOD QUESTIONNAIRE DESIGN AS A PRECONDITION FOR TRANSLATION QUALITY

Producing comparable questionnaire translations is no longer discussed only in the context of appropriate translation and assessment methodology. It is now accepted that the production of comparable questionnaires presupposes adequate source questionnaire design that incorporates different layers of cross-cultural collaboration and input (Smith, 2004; see also Chapter 4 by Johnson and Braun, this Handbook). These different layers include, taking the example of the European Social Survey (2014a), cross-cultural questionnaire development teams, involvement of all national teams at various stages throughout the process, qualitative and quantitative pretesting, as well as advance translation and piloting in several countries. This is meant to ensure that, on the conceptual level, questions are equally relevant and valid for the participating countries. Moreover, increased cross-cultural cooperation is intended to make sure that linguistic particularities do not prevent or unnecessarily impede good translations later on. Translation itself has become a valuable part of cross-cultural questionnaire design:
So-called advance translations are now carried out on fairly advanced though pre-final source questionnaires (Dept, 2013; Dorer, 2011). The goal is to identify issues of concern for cross-cultural implementation and translation early on, such as culturally inappropriate assumptions or linguistic problems (ambiguous terms, overly complex wording, etc.). The background to advance translation is that many problems related to source questionnaire concepts and formulations only become apparent – even to experienced cross-cultural researchers – if a concrete attempt is made to translate a questionnaire (Harkness et al., 2003). The feedback received from advance translation – similar to all other feedback received from commenting and testing – may either lead to modifications of source questions or to annotating the questionnaire specifically for translation. Translation annotations provide necessary background information to translators, such as background information on concepts, explanations of terms or phrases, or specific instructions for translators (Behr and Scholz, 2011). In general, translation annotations have become a valuable tool for ensuring comparable translations that measure what they are supposed to measure. Remarkably, this type of documentation in view of translation was already suggested in the late 1940s (Barioux, 1948).
TRANSLATION AND TRANSLATION ASSESSMENT

Ensuring comparable translations is most often discussed in the context of choosing the appropriate translation and assessment methodology. Even though there are a multitude of approaches in social science survey research, cross-cultural psychology, and the health sciences, good practice in these various disciplines shares a set of common features. These are summed up in Table 19.1 and include a multi-step approach, the
involvement of various persons with different skill sets and expertise, and documentation of both the overall process and individual decisions and findings (e.g., Acquadro et al., 2008; Harkness, 2003; Hambleton, 2005). A multi-step approach includes (1) the process of producing a translation, including judgmental assessments, and (2) the process of testing the translation as a measurement instrument among members of the target group. Each of these processes can further be broken down.

Table 19.1 Core features of good practice translation and assessment methodology

Production of the translation, including first-hand versions and judgmental assessment:
• Parallel translations. Persons involved: skilled translation practitioners. Documentation: problems, comments.
• Team-based review and adjudication. Persons involved: skilled translation practitioners; survey and subject matter expert(s); other person(s) of relevance. Documentation: problems, decisions, adaptations.

Testing the translation as a measurement instrument, including empirical assessment:
• Qualitative testing, such as cognitive interviewing. Persons involved: cognitive interviewers; members of the target group. Documentation: problems, decisions, adaptations.
• Quantitative testing, such as a pilot study. Persons involved: (interviewers); survey researchers/statisticians; members of the target group. Documentation: problems, decisions, adaptations.

In addition, the entire process is documented.

Production of a Translation

Good practice includes, as a first step, the production of parallel translations, that is, two independently produced translation versions (e.g., Harkness, 2003). Parallel translations help to uncover idiosyncratic wording or different interpretations (e.g., 'feeling anxious' either in the sense of worry or in the sense of anxiety). Furthermore, they offer stylistic variants (e.g., different syntactical structures) and help identify clear-cut errors that inevitably occur in translation, even among experienced translators (e.g., omission of important elements). It should be self-evident, however, that the parallel translation approach should not be taken as a remedy against a weak translator but rather
draw on two persons with (potentially complementary) expert translation skills. In a subsequent team-based review session, the two translation versions are then reconciled into a final translation. Reconciling translations can include choosing one or the other translation for a given question, combining the two translations, modifying a given translation, or generating a completely new one. Arriving at a final translation always involves thorough decision-making and should thus not be limited to just selecting the 'better' of the two translations on offer. Aspects that play a role include meaning, conciseness, fluency, questionnaire design conventions, and consistency, amongst others (Behr, 2009). Thus, translation is always a multi-dimensional decision process. In what has been called one of the 'major misunderstandings in the field' (Hambleton and Zenisky, 2011: 66), translations have all too often been made by a friend, a colleague, or a partner simply because they happen to be 'bilingual'. Nowadays there is a broad consensus that the skills and background of the personnel employed are crucial in determining the outcome. Translators should have an excellent command of both source and target language and culture, and typically they translate into their mother tongue. Other requirements include a combination of (questionnaire) translation experience, knowledge
of the study topic and of questionnaire design principles. In social science survey research the view is taken that the ‘people most likely to be good questionnaire translators are people who are already good translators and who learn or are trained to become questionnaire translators’ (Harkness et al., 2004: 463). Supporting this position, research from the field of translation studies suggests that translation practitioners, when compared to translation novices, act more sense-oriented, take into account larger segments of text, are more likely to attend to the needs of prospective users of a text and more so exploit their cultural and world knowledge in the process of translation (Jääskeläinen, 2010; Shreve, 2002). Nevertheless, if translation practitioners are not cognizant of do’s and don’ts in questionnaire translation, they will need to be briefed and trained on what it means to retain measurement properties and design principles in translation (e.g., characteristics of response scales, balanced wording, etc.). Translators may in fact also be trained on the job in team review sessions, as described below. Team-based review, the follow-up to parallel translation, is a recommended way to spread needed skills and expertise among several people (Harkness, 2003). A team-based review brings the translators together with survey and subject matter experts as well as other persons that are deemed helpful for the task. A team approach can thus ensure that linguistic and measurement aspects are taken on board when making decisions. While the pooling of expertise is the great advantage of the team-based approach, there are factors in this team set-up that might negatively impact on the outcome, such as defending a version for personal reasons or not wanting to criticize each other (Brislin, 1980). These challenges are likely to be mastered by providing information on the chosen translation approach and on review ‘rules’ (the quality of the translation is the focus, not any personal assessment) to all parties prior to team selection. Thus, the participants of a team
approach can ‘mentally’ prepare for an interdisciplinary exchange. Moreover, Harkness et al. (2010a) recommend testing translators in terms of their review performance and team suitability prior to hiring, which is certainly easier to implement for longer instruments than for shorter ones. In addition, the actual review process can be supported by allocating enough time for the process as well as involving skilled personnel to chair the review session. Review leaders are crucial for the overall outcome; they should be knowledgeable both of questionnaire translation and of study and measurement characteristics. Often, they have good knowledge of the subject matter and/or of survey methods and as such they guarantee that the measurement perspective is adequately taken into account during the translation process. After review, a so-called adjudication step may be necessary for adding further expertise, clarifying remaining uncertainties, and for final decision-making (Harkness, 2003). In addition, enough time should always be set aside for copy-editing, including consistency or fluency checks, as well as proofreading in terms of spelling, grammar, and completeness.
Special Case: Back Translation

In the above descriptions of good practice in questionnaire translation, the method of back translation has deliberately been omitted. Back translation, in widespread use since about the 1970s (Brislin, 1970), remains controversial in the research community (Harkness, 2003; Leplège and Verdier, 1995). Essentially, it involves translating the 'actual' translation of a questionnaire back into the source language and subsequently comparing the two source-language versions with a view to identifying discrepancies. Even though (gross) errors can be detected using back translation, the method itself is no guarantee that the 'actual' translation is indeed comparable, working as intended, intelligible
or fluent. In one study, for instance, back translation did not identify that 'feeling downhearted and depressed' had been translated as 'clinically depressive', since the back translation came back as 'depressed' and as such suggested no problem. Additionally, if wrongly implemented, the use of back translation fosters a translation that stresses equivalence of form over equivalence of meaning, and thus it can even be detrimental to translation quality. Furthermore, much of the detection capability of a back translation depends on the skills of and instructions given to the back translator, and also on the background of the person who eventually compares the back translation to the original questionnaire. Especially if back translation is relied on as the sole quality check, which Brislin (1970) in fact already warned against, a low-quality questionnaire translation is likely. While major survey programs or centers have discarded back translation altogether, focusing on target-text-centered appraisals of the questionnaire instead (European Social Survey, 2014b; Ferrari et al., 2013), back translation is still a recommended method in many fields in the health sciences and cross-cultural psychology. However, here too efforts are underway to critically evaluate the method, notably by comparing psychometric properties and user preferences for questionnaires produced according to different translation methods. Even though it needs to be acknowledged that the methodological set-ups of these studies differ in more than the inclusion of back translation, first results suggest that, at least in terms of user preference, the 'back translation version' falls behind other methods (e.g., Hagell et al., 2010).
Empirical Assessment of a Translation

The recommended multi-step approach to questionnaire translation is not only reflected
in the subsequent steps of translation, review, and adjudication but also in the supplementary empirical assessment of the translated questionnaire. The particularity of this additional layer lies in the assessment of the questionnaire as a measurement instrument, which is, after all, its ultimate goal. Empirical assessment brings in the intended target group. Qualitative assessment, typically in the form of cognitive interviews, is a way to gain in-depth knowledge of how respondents understand individual questions and arrive at their answers. Based on the respondent explanations, conclusions can be drawn on whether the translated instrument measures what it is intended to measure (see Chapter 24 by Willis in this Handbook). Quantitative assessment is typically based on quantitative pretests or larger pilot studies, which may test pre-final questionnaires as such but also different question versions in split-ballot manner. The exact nature of statistical analyses heavily depends on the sample size, on how many items are used to measure a construct or on research traditions in the different disciplines (see Chapter 39 by Cieciuch et al. in this Handbook). The types of analyses that are possible also depend on the timing of translation and empirical assessment. If cognitive interviewing or pretest or pilot studies are implemented simultaneously in different languages and cultures, equivalence across countries and cultures can be assessed in addition to other ‘national’ testing routines. The question of when translation and its empirical assessment takes place within the survey cycle also has an effect on whether or not the source questionnaire can be modified based on the results. Once a source questionnaire has been finalized in a cross-national study, any feedback received from empirical assessment of the translated version can solely contribute to improving the translation itself (such as replacing a word by another) but not to improving the source questionnaire in general. If, however, translation and
its empirical assessment take place as part of comparative questionnaire design, modifications of both the translation itself and the pre-final source questionnaire are possible (Fitzgerald et al., 2011). Even though qualitative and quantitative assessment of the translation can also identify clear-cut translation mistakes (e.g., ‘waiting’ translated as ‘wanting’), these types of errors should ideally be eradicated at earlier steps. After all, this is the rationale of thoroughly implemented parallel translation and team review/adjudication. Empirical assessment should primarily tackle more subtle issues of translation, such as connotations or misunderstandings, as well as cultural problems or generic design problems.
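As a concrete illustration of the quantitative side, a split-ballot pretest can be summarized by comparing the response distributions obtained under two translated versions of the same question. The sketch below uses a chi-square test from SciPy on invented counts; the figures and version labels are hypothetical, and such a test is only one of many analyses a pretest of this kind might support.

```python
# Minimal sketch: comparing response distributions from a split-ballot pretest
# of two translation versions of the same item. All counts are invented.
from scipy.stats import chi2_contingency

# Rows: translation version A / version B; columns: response categories 1..5.
counts = [
    [40, 85, 60, 25, 10],   # version A
    [22, 70, 88, 28, 12],   # version B
]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
# A small p-value would suggest that the two wordings elicit different
# distributions, a signal to revisit the translation (or the source question)
# before fielding the main study.
```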
Translation Documentation

The general translation and assessment approach should be documented, not only in terms of the steps implemented but also in terms of the personnel involved. Furthermore, the individual steps should be documented: translator documentation may include problems or alternative formulations (e.g., a word used in the translation may be difficult to understand). Reviewer and/or adjudicator documentation may include problems and (adaptation) decisions (e.g., 'spending a school term abroad' translated as 'spending several months abroad' to take into account different lengths of exchanges in different countries). Documentation pertaining to empirical testing may include information on identified problems with the translation and/or the source questionnaire and decisions based upon these findings (e.g., 'romantic partner' translated too closely; it is perceived as odd and should be rendered in a more sense-oriented way). Documentation on both the general and the specific level gives future instrument and data users a first indication as to the quality of the translation and also a source to turn to in case of unexpected statistical results. During the translation process
itself, documentation helps to inform later steps in the process, thus making the entire process more efficient.
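Documentation of this kind is easiest to keep consistent when each finding or decision is stored as a structured record rather than as free text scattered across emails. The sketch below shows one possible shape for such records; the field names and the example entry (which echoes the 'school term abroad' case above) are our own illustration, not a prescribed standard.

```python
# Minimal sketch of structured translation documentation.
# Field names and the example entry are illustrative, not a prescribed standard.
from dataclasses import dataclass

@dataclass
class TranslationNote:
    item_id: str              # question identifier in the source questionnaire
    step: str                 # e.g. "translation", "review", "adjudication", "pretest"
    language: str             # target language of the version being documented
    problem: str              # what was unclear, ambiguous, or culturally problematic
    decision: str = ""        # wording chosen, adaptation made, or action agreed
    adaptation: bool = False  # flag changes that go beyond translation proper

log: list[TranslationNote] = []
log.append(TranslationNote(
    item_id="Q12",
    step="review",
    language="de",
    problem="'school term abroad' assumes a term-based exchange system",
    decision="rendered as 'several months abroad' to cover different exchange lengths",
    adaptation=True,
))

# A simple export keeps the documentation usable for later steps and for data users.
for note in log:
    print(f"{note.item_id} [{note.step}/{note.language}] {note.problem} -> {note.decision}")
```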
Harmonization

Harmonization is receiving increased attention in cross-national studies. On the one hand, it refers to the process of developing a common version for different varieties of a 'shared language'. This would apply, for instance, to a common French version for France and Belgium. Differences between shared-language questionnaires should only occur where this is culturally or linguistically needed. The rationale is to remove any unnecessary variation that might impact on the comparability of data (Harkness, 2010b; Wild et al., 2009). On the other hand, harmonization may refer to the process of fostering consistency in translation decisions, and thus comparability, independent of the respective language. Harmonization of this type is warranted in countries that simultaneously need to produce several language versions of a questionnaire, such as Switzerland, which needs to translate cross-national source questionnaires into French, German, and Italian. Additional efforts are needed at the adjudication, that is, finalization stage to harmonize the different linguistic versions, possibly by members of staff mastering the different languages. Furthermore, the call for consistent translation decisions should be extended to all countries in a study. This wider type of harmonization can be helped by (in-person) meetings in which representatives from each target language take part to discuss problematic issues (Acquadro et al., 2012). Alternatively, FAQ lists with country queries on the meaning or scope of terms and developer feedback, regularly updated and circulated among translation teams, may serve the purpose (Furtado and Wild, 2010). In the same vein, proactive translation assessment of selected languages at an early stage
can identify issues that should be clarified for all countries (Wild et al., 2005). All these efforts show that questionnaire translation methodology in a cross-national study is increasingly shifting from a vertical perspective, which only looks at one target questionnaire in relation to the source, to a horizontal perspective, which looks at several target questionnaires simultaneously. In-person meetings in particular require additional time and money and may also be logistically difficult to implement. Moreover, countries need to be more or less at the same stage within the process to make them work. Ongoing harmonization efforts through updated FAQ lists are thus a powerful alternative; these lists can also be accessed by countries which join a survey at a later stage.
Quality in the Hands of the Translation Commissioner

While the quality of a questionnaire translation is usually linked to the aforementioned factors, that is, source questionnaire quality and related documentation, suitable personnel, and appropriate translation and assessment methods, the role of the translation commissioner should not be ignored (Chesterman, 2004). Translation commissioners, who are often identical with the national translation project managers, determine or provide the production conditions which then impact on the translation (quality). Production conditions include deadlines, overall process planning, payment, translation files, translation tools, briefing, and training. Briefing in terms of study goals, target group, and implementation mode is vital to producing good translations. After all, translation involves decision-making that takes these factors into account. Beyond briefing, training may become particularly important if hired personnel need to be trained on the particularities of measurement instruments. By providing an adequate financial, temporal, and content-wise framework,
the commissioner can thus improve translation quality.
TRANSLATION VS ADAPTATION

When it comes to the use of questionnaires in cross-national or cross-cultural contexts, the terms translation and adaptation dominate the relevant literature. The following is an attempt to shed light on what is meant when people refer to adaptation rather than translation. On the one hand, adaptation may be used as a generic term for the overall process of transferring an instrument from one language and culture to another language and culture. On the other hand, the term may be used to describe deliberate changes to specific questions. In the following, these two different perspectives – the generic and the specific one – will be presented.
Adaptation in a Generic Sense

In particular in cross-cultural psychology and the health sciences, using the term adaptation for the overall process is very popular (Acquadro et al., 2012; van de Vijver and Leung, 2011). Adaptation in this sense signals that 'pure' translation may not be sufficient and that cultural adaptation at various levels and for various questions may be needed to make an instrument suitable for a new context. Given that in these disciplines many questionnaires that were originally developed with only one culture in mind are now transferred to a new culture, the preference for the term adaptation becomes understandable. Using the term adaptation in a generic sense may also stress the need for psychometric testing to ensure that the new instrument works as intended. Rigorous statistical testing as part of the 'adaptation' process of an instrument is more common in cross-cultural psychology and the health sciences than in the social sciences; this may be
due to the different types and numbers of items measuring a construct as well as to issues of copyright and commercial distribution (Harkness et al., 2004). In the social sciences, it is rather the researcher working with the final data set who is responsible for statistical testing. Ultimately, one may wonder whether the term adaptation has also become so immensely popular as an overall term because translation itself is often reduced to a mere word-by-word replacement or a 'literal' translation. Such an understanding testifies to a misconception of what translation involves and also to a lack of knowledge of where translation researchers have been heading since the 1950s (Bolaños-Medina and González-Ruíz, 2012).
Adaptation in a Specific Sense

Apart from adaptation in the generic sense, adaptation in the specific sense is used to refer to changes to specific questions. Harkness et al. refer to these changes as 'deliberate changes to source material in order to meet new needs of various kinds' (2010b: 133). One can look at these changes from different angles: (1) domains of changes; (2) type of 'material' affected; (3) type of changes; and (4) topics potentially triggering changes. Following a slightly modified classification approach by van de Vijver and Leung (2011), one can differentiate between adaptations in the domains of culture, measurement, and language (see Table 19.2). A clear-cut distinction between these domains is not always possible, though, so that some adaptation types may concurrently be assigned to different domains. Adaptations in the domain of culture can be subdivided into terminological-/factual-driven adaptations and norm-driven adaptations. The former deal with the 'hard' aspects of culture, whereas the latter cover its 'soft' aspects (van de Vijver and Leung, 2011). Terminological-/factual-driven adaptations accommodate factual, often obvious
Table 19.2 Overview of adaptation domains and types
1. Culture: Terminological/factual-driven
2. Culture: Norm-driven
3. Measurement: Familiarity-driven
4. Measurement: Format-driven
5. Language: Comprehension-driven
6. Language: Language-driven
7. Language: Pragmatics-driven
Note: classification slightly modified from van de Vijver and Leung (2011)
differences between countries. For instance, references to political systems (American president vs British prime minister) or school systems (British A-levels vs German Abitur) will have to be adapted. Norm-driven adaptations cater for less tangible, often less obvious differences between countries, notably as regards norms, practices, or values. Asking respondents whether they have recently worn a campaign badge or sticker in countries where badge and sticker are not elements of political participation is certainly not suitable. Also aspects of social desirability or sensitivity in a given culture need to be considered. In Japan, for instance, questions involving the assessment of one’s own or others’ earnings as just or unjust are socially inappropriate and thus cannot be asked in a general survey (Harkness et al., 2003). Adaptations in the domain measurement can be subdivided into familiarity-driven adaptations and format-driven adaptations. Familiarity-driven adaptations are needed to accommodate different familiarity with measurement instruments. Surveys may be carried out in populations that have had no (or hardly any) prior exposure to surveys. In these cases, the survey experience may need to be brought to these populations by adding explanations or instructions on how to use the survey instrument. In addition, verbal scales may be adapted or supplemented with pictorial aids to make measurement more accessible to these survey respondents (Harkness
et al., 2010c). Format-driven adaptations take into account differential response effects or styles. A Japanese agreement scale may thus label the extreme categories of an agree-disagree scale as 'agree' and 'disagree', whereas the source scale uses the labels 'strongly agree' and 'strongly disagree'. This modification would take into account the Japanese predisposition to avoid response categories with strong labels (Smith, 2004; see also 'Attitude and opinion items in translation' below). Adaptations in the domain of language pertain to comprehension-driven adaptations, such as when certain concepts or wordings may need to be supplemented by definitions to help adequate understanding. For populations with lower levels of education compared to source-text respondents, the wording and vocabulary of the target instrument as a whole may also need to be simplified. Apart from these adaptations, various authors also list language-driven and pragmatics-driven adaptations among the adaptation types (Harkness, 2008; Harkness et al., 2003; van de Vijver and Leung, 2011). Language-driven adaptations include the array of changes that inevitably happen in translation, such as changes in the sequence of information, in sentence structure, or in word class. Van de Vijver and Leung (2011) illustrate language-driven adaptations with the gender-neutral English word 'friend' and its gender-specific counterparts in German (Freund/Freundin) or French (ami/amie). Pragmatics-driven adaptations take into account that language use in social contexts differs between languages and cultures. Different discourse norms may call for modifications. For instance, the required degree of explicitness of a request may differ, or the way deference or politeness is expressed. Neglecting the peculiar discourse norms and instead rendering the source text too closely may mean that questions come across as strange or even inappropriate, with potential effects for measurement. We would argue that language- and pragmatics-driven adaptations in particular are the backbones of
‘translation’ itself. This does not make them less difficult or important, but this view would free the activity of ‘translation’ from a mere mechanical replacement activity and highlight the changes that are necessarily inherent in ‘translation’ (Baker, 2011). In this view, a meaningful distinction between translation and adaptation can be made and the term adaptation be reserved for activities that change the stimulus in more significant ways, notably in the domains culture and measurement. If all changes were called adaptation, even the most basic ones as required by different language systems (e.g., change of word order), the distinction between adaptation and translation would become futile; moreover, it would be difficult to inform data users on how a translation compares to the source version and what this could mean for statistical analyses. Of course, there will always be grey areas in the domain language of what should be called a translation and what should rather be called an adaptation. In terms of necessary documentation of decisions, it seems advisable in any case to document the more significant types of changes, especially where impact on comparable measurement can be expected, no matter whether these changes should technically be called a translation or an adaptation. Potential candidates for adaptation are individual questions and their answer scales. However, also the visual presentation of a questionnaire may be affected, such as when colors or pictures need to be changed, modes of emphasis switched from capitalization to underlining, or answer text boxes regrouped or resized in view of typical answer patterns or conventions. Furthermore, also the layout and direction of response scales may be affected when different writing systems (left–right/right–left/up–down) are involved. The various types of changes include addition or omission of questions or parts thereof as well as substitution of different kinds (content, pictures, and colors). Topics potentially triggering adaptations include socio-economic topics (such
as income, housing), religion, sports, foods and drinks, drugs, activities, holidays, music, family ties, school system, health system, political system, history, name, address, and date formats, knowledge questions, and measuring units (Dean et al., 2007; Harkness et al., 2010b). In sum, the transfer of a questionnaire from one language and culture to another requires a great deal of sensitivity towards cultural issues. The extent to which adaptations are needed will largely depend on whether a source questionnaire has deliberately been designed for cross-cultural use, and thus tries to avoid cultural particularities right from the start, or whether it was originally developed for one particular culture. Adaptation needs, notably in the domains of culture and measurement, are typically anticipated and circumvented in deliberate cross-national questionnaire design or, to a lesser degree, accommodated.
ATTITUDE AND OPINION ITEMS IN TRANSLATION – EXPERIENCES FROM CROSS-CULTURAL SURVEYS IN ASIA

This section discusses issues regarding attitude and opinion items that require special attention during translation. The discussion is based on the experience of the East Asian Social Survey (EASS), a cross-cultural survey conducted exclusively in Asian countries and regions. It is more difficult to obtain translation equivalence for survey items which ask about attitudes and opinions than it is for survey items which ask about the respondent's background characteristics (gender, age, education, etc.) and behaviors (number of hours of TV watched per day, frequency of exercise per month, etc.) (Behling and Law, 2000; Tasaki, 2008). Survey items which ask about attitudes and opinions gauge individual values and are of high interest for psychological and
sociological survey researchers. Generally, these survey items measure respondents' answers using scales with 2–5 categories on dimensions such as 'good-bad' or 'agree-disagree'. Highly abstract concepts tend to be included among attitude and opinion items, and these often cause problems for translation. The response will change depending on how the highly abstract concept is translated. In addition, the design and translation of response scales have a more direct impact on the response than the translation of the question itself. If response scales are adopted from other surveys and 'merely' translated, this may have an especially significant impact on survey data.
Harmonization in View of Conceptual Equivalence

The most basic and important requirement in comparative survey research is to measure the same concept across countries. When the concept to be measured deviates between cultures, conceptual equivalence is impaired. Failure in translation is one of the causes that reduce conceptual equivalence. In 2008, an East Asian Social Survey (EASS), based on the theme of Culture and Values in East Asia, was carried out in Japan, South Korea, China, and Taiwan. Among the most difficult survey items to translate were the survey items regarding 'preferred qualities of friends' as shown in Table 19.3. The reasons why these survey items were difficult to translate included (1) the concepts to be measured being highly abstract, (2) multiple appropriate translations with different nuances being available, and (3) a lack of useful information that could clarify meaning (for example, context of preceding and following questions or notes to the translator). In the EASS, meetings to develop questionnaires are carried out in English, and the source questionnaire is developed in English.
Table 19.3 Survey item for preferred qualities of friends
Q: When you associate with your personal friend, how important is each of the following qualities?
Response scale: (1) Very important; (2) Important; (3) Neither important nor unimportant; (4) Not important; (5) Not important at all
Items rated on this scale: a. Honest; b. Responsible; c. Intelligent; d. Cultured; e. Powerful; f. Wealthy; g. Loyal; h. Warm-hearted
However, the actual survey is carried out in Chinese, Japanese, Formosan, and Korean in the respective regions. Discussions regarding the translation of cross-cultural surveys largely concern how to appropriately translate the English source questionnaire. There is an assumption that if the translation from English into the languages of the multiple survey target regions can be done appropriately, comparability between the survey target regions is established. However, this assumption does not apply to the translation of highly abstract survey items in particular. For example, even if the translation from English to Japanese or from English to Chinese is carried out appropriately, this does not guarantee that there is comparability between the Japanese and the Chinese version. There are multiple appropriate translations with different nuances for the Japanese translation of the term 'honest', with meanings such as shouzikina (sincere), seizituna (faithful), zicchokuna (trustworthy), socchokuna (straightforward), honmonono (genuine), zyunseino (pure), kouheina (fair), kouseina (impartial), nattokudekiru (satisfactory), ukeirerareru (acceptable), sobokuna (unembellished), mie-wo-haranai (non-ostentatious), and kazarinonai (unvarnished). Likewise, there are multiple appropriate translations with different nuances for the Chinese translation of the term 'honest'.
For concepts which are highly abstract and can have multiple appropriate translations, it is not only necessary to determine the appropriateness of the translation from English to Japanese or from English to Chinese, but it is also vital to establish conceptual equivalence between the languages of the survey target regions and to carry out harmonization among these languages (Figure 19.1).

Figure 19.1 Harmonization between survey target regions: the source questionnaire [English] and the target questionnaires 1 [Chinese], 2 [Japanese], 3 [Formosan], and 4 [Korean]

Researchers and translators who understand two languages such as Japanese and Chinese are necessary for this harmonization. In East Asia, communication using kanji characters, which are moderately common in these regions, has been effective in increasing conceptual equivalence in addition to English. In cases where there are numerous survey target regions, harmonization of all survey items becomes difficult and is not realistic. Harmonization is recommended, if not for the
entire questionnaire, at least for some items that require special attention in translation. In addition, item-specific translation annotations which specify the intended meaning of these highly abstract concepts can support harmonization.
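Item-specific annotations of this kind can be circulated among all target-language teams in a machine-readable form so that the intended meaning and the agreed renderings stay in one place. The sketch below is one possible representation; the item code, the meaning gloss, the example renderings, and the open query are illustrative inventions, not the actual EASS translation materials.

```python
# Minimal sketch of an item-specific annotation shared across target languages.
# Item code, gloss, renderings, and open question are illustrative only.

annotation = {
    "item": "friend_quality_a",
    "source_wording": "Honest",
    "intended_meaning": "sincere and truthful in dealings with others "
                        "(not: frank or blunt; not: financially honest)",
    "agreed_renderings": {
        "ja": "誠実な",   # example rendering recorded after cross-team discussion
        "zh": "诚实的",
        "ko": "성실한",
    },
    "open_questions": ["Does the Korean rendering lean too far towards 'diligent'?"],
}

def needs_follow_up(note: dict) -> bool:
    """Return True if any team still has an unresolved query on this item."""
    return bool(note["open_questions"])

print(needs_follow_up(annotation))  # True: one rendering is still under discussion
```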
Response Styles Generally, survey items which measure opinions and attitudes base their measurement on graded scales. The scales may range, for instance, from ‘1 = very important’ to ‘5 = not important at all’ or from ‘1 = strongly agree’ to ‘5 = strongly disagree’. Depending on the target region and culture, there will be a disposition towards response patterns such as midpoint responding and extreme responding. These are called response styles and they create methodological artefacts which jeopardize the comparability between cultures (Tasaki, 2008; van de Vijver and Poortinga, 1997). Cross-cultural surveys should design response scales which take response styles of the target region into consideration in order to improve comparability. East Asia is located in the Confucian cultural sphere. The standard of conduct among the teachings of Confucianism is moderation, and value is placed on being neutral and not changing. In addition, the idea of taking a moderate course is taught in Buddhism, which is followed in parts of East Asia, with value being placed on distancing oneself from too extreme ways of thinking. Added to this is collectivism: According to Hofstede (1995), Western countries such as the US, Australia, England, and Canada have strong individualistic ways of thinking while countries and regions in East Asia such as Japan, Hong Kong, South Korea, and Taiwan have strong collectivistic ways of thinking. According to Triandis (1995), collectivism aspires for everyone to think, feel, and act in the same manner while individualism prefers people to clarify their position through discussion.
The collectivistic ways of thinking in Confucianism and Buddhism are thought to have an impact on response styles in social surveys. This way of thinking creates a tendency to avoid extreme responses and instead choose midpoint responses. Si and Cullen (1998) found that East Asian people (China, Japan, Hong Kong) have a greater tendency than Western people (United States, Germany, United Kingdom) to choose middle response categories when offered an explicit midpoint response category. In addition, East Asian people are less likely than Western people to select the end-point categories. Midpoint responding is especially notable in Japan (Hayashi and Hayashi, 1995). Japanese people tend to value a group-oriented culture. Group-oriented culture means favorably maintaining personal relationships within a group and placing importance on the order and harmony of the group. Likewise, Japanese people tend to regard expressing individual opinions and emotions as shameful, and they voice their individual opinions depending on the situation. There is a strong tendency to remain vague rather than to express one's opinion and disturb the situation. In social surveys which take place only in Japan, scales for opinion items are therefore intentionally created as '1: agree, 2: somewhat agree, 3: somewhat disagree, 4: disagree', which takes the response style of Japanese people into consideration. By avoiding adverbs such as 'strongly' as well as midpoint categories such as 'neither agree nor disagree', the scale ensures a spread of responses and further clarifies whether the respondent agrees with the opinion or not.
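Response-style tendencies of this kind can be quantified before settling on a scale, for example by computing the share of midpoint and end-point answers per country in pilot data. The sketch below does this for invented seven-point-scale counts; the numbers are hypothetical and only illustrate the computation, not the actual EASS or pilot results.

```python
# Minimal sketch: share of midpoint and extreme responses on a seven-point scale.
# The counts per country are invented and serve only to illustrate the computation.

pilot_counts = {
    # categories 1 ("strongly agree") ... 7 ("strongly disagree")
    "Country A": [40, 110, 180, 420, 160, 70, 20],
    "Country B": [90, 160, 190, 280, 150, 90, 40],
}

for country, counts in pilot_counts.items():
    total = sum(counts)
    midpoint_share = counts[3] / total                # category 4 is the midpoint
    extreme_share = (counts[0] + counts[6]) / total   # categories 1 and 7
    print(f"{country}: midpoint {midpoint_share:.1%}, extremes {extreme_share:.1%}")
```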
Design of Response Scales

Shishido et al. (2009) compared the questionnaires from the World Value Survey (WVS) and the International Social Survey Programme (ISSP), which are cross-cultural surveys carried out on a global scale, with the
East Asia Value Survey (EAVS), the East Asia Barometer Survey (EABS), and the Asia Barometer Survey (ABS), which are all cross-cultural surveys carried out only in Asian regions. These comparisons clarified the characteristics of response scales of attitudinal items. The characteristics common to all survey projects are (1) the frequent use of a verbal and bipolar scale, and (2) the infrequent use of a scale with more than five points. The ISSP project, which started in the 1980s as a cross-cultural survey among Western countries, very frequently uses a five-point scale that includes categories with a strong adverb and a midpoint category. In contrast to the ISSP, the WVS, which since its inception extensively covers heterogeneous cultural zones, is characterized by its frequent use of two-point and four-point scales that have no midpoint. In surveys that focus only on Asia, scales without a midpoint are used relatively frequently. There has been a dispute as to which scale should be adopted. Smith (1997) discussed that the bipolar scale including a midpoint has a smaller risk of mistake in terms of positioning one’s opinion on a response scale, and is therefore more desirable for crossnational comparison than the unipolar scale or scales without a midpoint. Klopfer (1980) and Krosnick et al. (2008) also suggested that offering a midpoint is desirable because omitting the middle alternative leads respondents to randomly select one of the moderate scale points closest to where a midpoint would appear. On the other hand, Converse and Presser (1986) suggested that a middle alternative should not be explicitly provided because providing a midpoint leads to the loss of information about the direction in which people lean. It is better not to offer the middle point in response scales if the direction in which people are leaning on the issue is the type of information wanted (Payne, 1951; Presser and Schuman, 1980). Cross-cultural surveys which target Western countries with individualistic
281
cultures, where people often clearly express their opinions, include midpoint categories in the response scales and use strong adverbs (strongly, absolutely, etc.) on both extremes of the response scales. This is not considered too much of a problem. However, in crosscultural surveys which include East Asia with countries such as Japan which prefer vague responses, responses often seem to concentrate on the midpoint or the area around midpoint responses. Therefore, incorporating midpoint categories and strong adverbs into the scale should be considered carefully. It would also be necessary to consider whether off-scale options (such as ‘Can’t choose’, ‘It depends on the situation’, ‘I don’t know’) are incorporated into the scale or not, because response patterns to midpoint categories are similar to response patterns to off-scale options. Figure 19.2 shows results of the EASS 2006 family module. Response distribution of 18 attitudinal items regarding family in Japan, South Korea, China, and Taiwan were compared. The response categories were on a seven-point scale which included a midpoint category. The Japanese team was against including a midpoint category in the scale, but a scale with a midpoint category was eventually adopted based on the request of other teams who placed importance on comparability with the ISSP. There is a notable proportion of midpoint responses in all regions, but midpoint responses were especially high in Japan. Taking these striking results into account, it seems advisable for researchers conducting global surveys to take into consideration also the response styles of non-Western regions when designing response scales.
Translation of Response Scales

The issues of how to design response scales and how to translate response scales are closely related. Shishido et al. (2009) compared the target questionnaires of the ISSP
Figure 19.2 Response distribution of 18 survey items (v202–v219) from the EASS 2006 family module, shown separately for Japan (n=2,107), China (n=3,208), South Korea (n=1,595), and Taiwan (n=2,097). The horizontal axis runs from 'Strongly agree', 'Fairly agree', 'Somewhat agree', 'Neither agree nor disagree', 'Somewhat disagree', 'Fairly disagree' to 'Strongly disagree'; the vertical axis shows the percentage of respondents (0–80%).
and the WVS and found the following: (1) there are multiple regions where the translation of the two surveys differed even though labels of response categories in English were identical; and (2) the difference in translations of response categories had an impact on response distribution. The two surveys produced different results even though the survey items themselves were identically translated. When looking at a number of cross-cultural surveys which are carried out in Japan, there are multiple translations for a category such as ‘strongly agree’ (Figure 19.3). The translations of this response category are different in the ISSP and the WVS in Japan, which produces different response distributions for the
same survey items in different surveys. The WVS translates 'strongly agree' literally as tsuyoku sansei ('strongly approve') while the ISSP translates it freely as sou omou ('I think so' or merely 'agree'). As a result of these translations, few respondents answered with 'strongly agree' in the WVS, whereas many respondents did so in the ISSP. Difficulties in translating response categories are unlikely to be confined to Japan, since many regions presumably have divergent translations of response categories. Given the differences in response distributions possibly caused by translation
[Figure 19.3 orders Japanese renderings of 'strongly agree' from a strong to a weak sense of 'strongly': 強く賛成 ('strongly approve of'), まったくその通りだと思う ('absolutely I think so'), とてもそう思う ('fairly I think so'), and そう思う ('I think so').]
Figure 19.3 Examples of Japanese translations of 'strongly agree'.
differences in response categories, identical English response scales should be translated identically across cross-national surveys (ISSP, WVS, etc.) in order to be able to compare results across these surveys. It is desirable to seek a way to harmonize translations by sharing information among local research agencies. Countries and regions participating in cross-national survey projects should make an effort to examine the intensity of adverbial expressions and to enhance the equivalence of these expressions across participating countries and regions. As for research into adverbial expressions in agreement scales, the Research into Methodology of Intercultural Surveys (MINTS) project deserves mention as a reference. This project examined the equivalence of the agreement scale based on a direct rating approach that quantified and measured the impressions of respondents when confronted with different response categories (Mohler et al., 1998). A continuation of this work, including more languages and cultures, is highly desirable.
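To illustrate the logic of such a direct rating approach, the following sketch compares mean perceived-intensity ratings for alternative renderings of 'strongly agree'; the labels, the 0–10 rating scale, and all numbers are invented for illustration and are not results from the MINTS project.

```python
import statistics

# Hypothetical respondent ratings of how strong each label sounds
# (0 = very weak, 10 = very strong).
ratings = {
    "strongly agree (EN)": [9, 8, 9, 10, 8],
    "tsuyoku sansei (JP)": [9, 9, 8, 9, 10],
    "sou omou (JP)":       [5, 6, 5, 4, 6],
}

# Mean intensity per label; labels intended to be equivalent should score similarly.
for label, values in ratings.items():
    print(f"{label:22s} mean intensity = {statistics.mean(values):.1f}")
```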
FUTURE DEVELOPMENTS While the research community continues to refine translation and assessment methodology as well as the integration of translational
aspects into questionnaire design, surprisingly little research has been done so far on the impact of different translation or adaptation versions on data comparability. Which differences matter and which do not? For which types of concepts or items do such findings hold, and which items are particularly robust? Here, further research is urgently needed to inform translation and adaptation practices. A related area of heightened interest is corpus linguistics and in particular the question of how large language corpora may help substantiate translation decisions or illuminate questionnaire design principles in different languages. In the latter regard, the research community should evaluate to what extent design guidelines and principles developed predominantly on the basis of the English language and Anglo-Saxon culture really apply to other languages and cultures, or whether some re-thinking is urgently needed. Furthermore, while much research, especially in the context of the US, deals with translation for migrant groups, systematic work on the differences between questionnaire translation for migrant groups and questionnaire translation for different countries is missing. What particular requirements and challenges need to be addressed when translating a questionnaire for a migrant group, whose particularity is that it brings two cultures and two languages to the response process? The
growing importance of including migrants in surveys, or even of focusing on them, makes this topic particularly pressing. The role of computer-aided translation tools will also become more important, especially in light of increased electronic data collection and the potential to produce the translation in a format that is directly usable by a given survey software or a data archive. The DASHISH project (2014), which aims, amongst other things, at the development of a cross-cultural questionnaire design, translation, and documentation tool, is a case in point. Furthermore, closer cooperation and exchange of information between all parties concerned is likely to shape comparative projects more than before. The harmonization of translation decisions is a key factor in this regard. Last but not least, findings, theories, and applications from translation research are slowly but steadily finding their way into cross-cultural survey research (Behr, 2009; Bolaños-Medina and González Ruíz, 2012; Chidlow et al., 2014). This interdisciplinary exchange needs to grow to help substantive and survey researchers better understand the complexities, possibilities, and limits of translation and to improve cooperation between the fields. The same learning process, of course, is needed among translation practitioners and translation researchers when it comes to survey research and measurement issues.
RECOMMENDED READINGS To conclude this chapter, the following works are recommended for further reading: two seminal works by Harkness, who tremendously influenced the field of questionnaire translation (Harkness, 2003; Harkness et al., 2010b); a chapter on cross-cultural questionnaire design by Smith (2004); an article on cross-cultural cognitive interviewing (Fitzgerald et al., 2011); and finally a general book
on translation, written by an acknowledged translation researcher (Baker, 2011).
ACKNOWLEDGMENTS East Asian Social Survey (EASS) is based on Chinese General Social Survey (CGSS), Japanese General Social Surveys (JGSS), Korean General Social Survey (KGSS), and Taiwan Social Change Survey (TSCS), and distributed by the EASSDA.
REFERENCES Acquadro, C., Conway, K., Giroudet, C., and Mear, I. (2012). Linguistic Validation Manual for Health Outcome Assessments. Lyon: Mapi Institute. Acquadro, C., Conway, K., Hareendran, A., and Aaronson, N. (2008). Literature review of methods to translate health-related quality of life questionnaires for use in multinational clinical trials. Value in Health, 11, 509–521. Baker, M. (2011). In Other Words: A Coursebook on Translation (2nd edn). London & New York: Routledge. Barioux, M. (1948). Techniques used in France. Public Opinion Quarterly, 12, 715–718. Behling, O., and Law, K.S. (2000). Translating Questionnaires and Other Research Instruments: Problems and Solutions. Thousand Oaks, CA: Sage. Behr, D. (2009). Translationswissenschaft und international vergleichende Umfrageforschung: Qualitätssicherung bei Fragebogenübersetzungen als Gegenstand einer Prozessanalyse. GESIS-Schriftenreihe, 2, Bonn: GESIS. Behr, D., and Scholz, E. (2011). Question naire translation in cross-national survey research: on the types and value of annotations. Methoden – Daten – Analysen, 5, 157–179. Bolaños-Medina, A., and González Ruíz, V. (2012). Deconstructing the translation of psychological tests. Meta, 57, 715–739.
Brislin, R. W. (1970). Back-translation for crosscultural research. Journal of Cross-Cultural Psychology, 1, 185–216. Brislin, R. W. (1980). Translation and content analysis of oral and written materials. In H. C. Triandis and J. W. Berry (eds), Handbook of Cross-Cultural Psychology: Methodology, Vol. 2 (pp. 389–444). Boston, MA: Allyn and Bacon. Chesterman, A. (2004). Functional quality. Retrieved from http://www.youtube.com/ watch?v=IJW1Y6rAB1I [accessed 2014-04-08]. Chidlow, A., Plakoyiannaki, E., and Welch, C. (2014). Translation in cross-language international business research: beyond equivalence. Journal of International Business Studies, 45(5), 1–21. Converse, J. M., and Presser, S. (1986). Survey Questions: Handcrafting the Standardized Questionnaire. London: Sage Publication. DASHISH – Data Service Infrastructure for the Social Sciences and Humanities (2014). Activities: Data Quality. Retrieved from http:// dasish.eu/activities/ [accessed 2014-04-08]. Dean, E., Caspar, R., McAvinchey, G., Reed, L., and Quiroz, R. (2007). Developing a low-cost technique for parallel cross-cultural instrument development: the Question Appraisal System (QAS-04). International Journal of Social Research Methodology, 10, 227–241. Dept, S. (2013). Translatability assessment of draft questionnaire items. Paper presented at the conference of the European Survey Research Association (ESRA), Ljubljana, SI. Dorer, B. (2011). Advance translation in the 5th round of the European Social Survey (ESS). FORS Working Paper Series 2011, 4. European Social Survey (2014a). ESS pre-testing and piloting. Retrieved from http://www. europeansocialsurvey.org/methodology/ pre-testing_and_piloting.html [accessed 2014-04-08]. European Social Survey (2014b). ESS Round 7 Translation Guidelines. London: ESS ERIC Headquarters, Centre for Comparative Social Surveys, City University London. Ferrari, A., Wayrynen, L., Behr, D., and Zabal, A. (2013). Translation, adaptation, and verification of test and survey materials. In OECD (eds), Technical Report of the Survey of Adult Skills (PIAAC) 2013 (pp. 1–28). PIAAC: OECD Publishing.
Fitzgerald, R., Widdop, S., Gray, M., and Collins, D. (2011). Identifying sources of error in cross-national questionnaires: application of an error source typology to cognitive interview data. Journal of Official Statistics, 27, 569–599. Furtado, T., and Wild, D. J. (2010). Harmonisation as Part of the Translation and Linguistic Validation Process: What is the Optimal Method? Retrieved from http://www.oxfordoutcomes. com/library/conference_material/posters/ translations/TRANSLATION%20Furtado%20 Wild%20Optimal%20Methods%20handout. pdf [accessed 2014-04-08]. Hagell, P., Hedin, P.-J., Meads, D. M., Nyberg, L., and McKenna, S. P. (2010). Effects of method of translation of patient-reported health outcome questionnaires: a randomized atudy of the translation of the Rheumatoid Arthritis Quality of Life (RAQoL) Instrument for Sweden. Value in Health, 13, 424–430. Hambleton, R. K. (2005). Issues, designs, and technical guidelines for adapting tests into multiple languages and cultures. In R. K. Hambleton, P. F. Merenda, and C. D. Spielberger (eds), Adapting Educational and Psychological Tests for Cross-Cultural Assessment (pp. 3–38). Mahwah, NJ: Lawrence Erlbaum Associates. Hambleton, R. K., and Zenisky, A. (2011). Translating and adapting tests for cross-cultural assessments. In D. Matsumoto, and F. van de Vijver (eds), Cross-Cultural Research Methods (pp. 46–70). Cambridge: Cambridge University Press. Harkness, J. (2003). Questionnaire translation. In J. Harkness, F. J. R. van de Vijver, and P. Ph. Mohler (eds), Cross-Cultural Survey Methods (pp. 35–56). Hoboken, NJ: Wiley. Harkness, J. A. (2008). Comparative survey research: goal and challenges. In E. de Leeuw, J. J. Hox, and D. A. Dillman (eds), International Handbook of Survey Methodology (pp. 56–77). New York: Lawrence Erlbaum Associates. Harkness, J., Behr, D., Bilgen, I., Córdova Cazar, A. L., Huang, L., Lui, A., Stange, M., and Villa, A. (2010a). VIII. Translation: finding, selecting, and briefing translation team members. In Survey Research Center, Guidelines for Best Practice in Cross-Cultural
Surveys. Ann Arbor, MI: Survey Research Center, Institute for Social Research, University of Michigan. Retrieved from http://www. ccsg.isr.umich.edu/ [accessed 2015-05-22]. Harkness, J. A., Edwards, B., Hansen, S. E., Miller, D. R., and Villar, A. (2010c). Designing questionnaires for multipopulation research. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B-E. Pennell, and T. W. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 33–57). Hoboken, NJ: Wiley. Harkness, J., Pennell, B.-E., and Schoua-Glusberg, A. (2004). Survey questionnaire translation and assessment. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 453–473). Hoboken, NJ: Wiley. Harkness, J., van de Vijver, F. J. R., and Johnson, T. P. (2003). Questionnaire design in comparative research. In J. Harkness, F. J. R. van de Vijver, and P. Ph. Mohler (eds), Cross-Cultural Survey Methods (pp.19–34). Hoboken, NJ: Wiley. Harkness, J. A., Villar, A., and Edwards, B. (2010b). Translation, adaptation, and design. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B-E. Pennell, and T. W. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 117–140). Hoboken, NJ: Wiley. Hayashi, C., and Hayashi, F. (1995). Comparative study of national character. Proceedings of the Institute of Statistical Mathematics, 43, 27–80. Hofstede, G. (1995). Cultures and Organizations: Software of the Mind. London. McGraw Hill. Jääskeläinen, R. (2010). Are all professionals experts? Definitions of expertise and reinterpretation of research evidence in process studies. In G. M. Shreve and E. Angelone (eds), Translation and Cognition (pp. 213– 227). Amsterdam: John Benjamins. Klopfer, F. J. (1980). The middlemost choice on attitude items: ambivalence, neutrality, or uncertainty? Personality and Social Psychology Bulletin, 6, 97–101.
Krosnick, J. A., Judd, C. M., and Wittenbrink, B. (2008). The measurement of attitudes. In C. Roberts and R. Jowell (eds), Attitude Measurement, Volume 2: Designing Direct Measures (pp. 1–85). London: Sage Publications. Leplège, A., and Verdier, A. (1995). The adaptation of health status measures: methodological aspects of the translation procedure. In S.A. Shumaker and R. A. Berzon (eds), The International Assessment of Health-Related Quality of Life: Theory, Translation, Measurement and Analysis (pp. 93–101). Oxford: Rapid Communications. Mohler, P. P., Smith, T. W., and Harkness, J. A. (1998). Respondents’ ratings of expressions from response scales: a two-country, twolanguage investigation on equivalence and translation. ZUMA-Nachrichten Spezial No. 3: Cross-Cultural Survey Equivalence, 159–184. Payne, S. (1951). The Art of Asking Questions. Princeton, NJ: Princeton University Press. Presser, S., and Schuman, H. (1980). The measurement of a middle position in attitude surveys. Public Opinion Quarterly, 44, 70–85. Shishido, K., Iwai, N., and Yasuda, T. (2009). Designing response categories of agreement scales for cross-national surveys in East Asia: the approach of the Japanese General Social Surveys. International Journal of Japanese Sociology, 18, 97–111. Shreve, G. (2002). Knowing translation: Cognitive and experiential aspects of translation expertise from the perspective of expertise studies. In A. Riccardi (ed.), Translation Studies: Perspectives on an Emerging Discipline (pp. 150–171). Cambridge: Cambridge University Press. Si, S. X., and Cullen, J. B. (1998). Response categories and potential cultural bias: effects of an explicit middle point in cross-cultural surveys. The International Journal of Organization Analysis, 6(3), 218–230. Smith, T. W. (1997). Improving cross-national survey research by measuring the intensity of response categories. GSS Cross-National Report, 17. Chicago, IL: National Opinion Research Center, University of Chicago. Smith, T. W. (2004). Developing and evaluating cross-national survey instruments. In S.
Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 431–452). Hoboken, NJ: Wiley. Tasaki, K. (2008). Cross-Cultural Research Methods in Social Sciences. Kyoto: Nakanishiya Press. Triandis, H. C. (1995). Individualism & Collectivism. Boulder, CO. Westview Press. van de Vijver, F.J.R. and Leung, K. (2011). Equivalence and bias: A review of concepts, models, and data analytic procedures. In D. Matsumoto, and F.J.R. van de Vijver (eds), Cross-Cultural Research Methods in Psychology (pp. 17–45). New York: Cambridge University Press. van de Vijver, F. J. R., and Poortinga, Y. H. (1997). Towards an integrated analysis of bias in cross-cultural assessment. European
Journal of Psychological Assessment, 13, 21–29. Wild, D., Eremenco, S., Mear, I., Martin, M., Houchin, C., Gawlicki, M., et al. (2009). Multinational trials – recommendations on the translations required, approaches to using the same language in different countries, and the approaches to support pooling the data: the ISPOR Patient Reported Outcomes Translation and Linguistic Validation Good Research Practices Task Force Report. Value in Health, 12, 430–440. Wild, D., Grove, A., Martin, M., Eremenco, S., McElroy, S., Verjee-Lorenz, A., and Erikson, P. (2005). Principles of good practice for the translation and cultural translation process for Patient-Reported Outcomes (PRO) measures: Report of the ISPOR Task Force for Translation and Cultural Adaptation. Value in Health, 8, 94–104.
20
When Translation is not Enough: Background Variables in Comparative Surveys1
Silke L. Schneider, Dominique Joye and Christof Wolf
INTRODUCTION Background or socio-demographic variables like age, education, or occupation are often referred to as ‘objective variables’, reflecting ‘hard facts’. Under this assumption, the measurement of these variables should not pose any difficulties in comparative perspective. Of course, it is true that, for example, age can be measured in years, days, or even seconds. However, we are not ‘young’ or ‘old’ at the same ‘age’ in different societies. It is therefore important to realize that those constructs and measures, even if they seem to reflect ‘natural’ or ‘objective’ states, are always socially constructed and consequently are linked to institutional structures and context-dependent interpretations. With respect to background variables, theorizing is oftentimes limited, resulting in variables not optimal for testing specific hypotheses. For example, education is frequently included in models as a ‘control variable’ without considering its substantive meanings. This is even more
a problem because countries have different systems of education, including for example vocational training paths. Other problems may arise from treating education as a continuous variable, ignoring that, in some cases, its categorical characteristics could be of importance. Another aspect jeopardizing comparability of background variables arises when a category of such a variable is strongly linked to the institutional structure of a given society. An example with respect to occupation is the position of ‘Cadre’ in the French context or ‘Professional’ in the Anglo-Saxon one. A literal translation of such terms does not produce comparability in the occupations measured. A second case arises when seemingly identical categories actually refer to different phenomena. For example, the nominally same level of schooling, e.g. completed primary, may have very different meanings and consequences in different countries, because in one country, primary education lasts four years, and in another, it lasts nine years. Again, translation may be possible, but does likely
not create comparable measurements. On a national level, coding of background variables will often be straightforward, but using either categories biased towards one specific institutional setting or using one label for different social realities will endanger comparability. Consequently, the meaning of a given measure can vary by institutional context. In short, this is a general challenge for comparative research: specific institutional settings, historical developments, and political culture shape the meaning and significance of the social and demographic structure we are interested in and want to quantify. Sound measurement and interpretation of measures always has to take the specific institutional, historical, and cultural context into account, even more when considering that the use of categories can have a formative effect: they may have consequences for social functioning. This will often require more than a good translation, but the classification of complex social phenomena. However, background variables generally are not a prominent research topic in comparative survey methodology. Even the ‘Intersectional perspective’, trying to consider simultaneously ‘Class, Race and Gender’ has not considered a comparative perspective as a priority. Instead of theorizing these variables in a cross-national perspective and investigating the challenges arising when striving for comparative measurement of these indicators, we often are encountering a ‘naïve’ enumeration of categories. In this chapter we will discuss three examples of social characteristics strongly shaped by institutions, history, and customs: ethnicity, education, and social position. We argue that these three variables exemplify the difficulty of comparative measures and their anchoring in specific social contexts. They are among the most often used variables in contemporary empirical social research and therefore exploring the challenges and implications posed by measuring them in comparative research seems more than warranted.
ETHNICITY: A FIGHT FOR CLASSIFICATION At the beginning of his book Ethnic Boundary Making, Andreas Wimmer (2013) mentions the debate around the definition of ethnicity as a given characteristic at the time of birth or as group characteristics ‘chosen’ according to the situation or the social configuration. It seems now that most researchers agree that ethnicity, as most classifications, is a social construction (Bancel et al., 2014) and therefore subject to a ‘fight for classification’ (Starr, 1992). In other words, these categorizations, especially in a highly debated field such as ethnicity, may have a ‘performative’ effect (Felouzis, 2010). This means that the very usage of a classification creates a social reality, modulated according to the public salience of the question and the way a particular society is structured. Similarly, there is also the danger to reify or naturalize ethnic categories and for the wider public, the media and/or specific political groups to misinterpret ethnicity as an ‘objective’ biological or instrumental characteristic (Zuberi and Bonilla-Silva, 2008). The historical experience with such risks may explain some differences between countries in how ethnicity is handled in survey research: for example the French model postulates that the immigrant integration process ends with the acquisition of citizenship, implying, at a theoretical level, that racial or ethnic categories are not needed in the French census (Simon, 2010). This might partly result from the impact ‘racial’ categorization had during German occupation in France in the Second World War. From the decision not to elaborate statistics on ethnicity for those being ‘French’, it follows that – at least based on official data – it is impossible to study discrimination or social inequality that can be linked to ethnic or migrant origin in France. Though we have no room to dwell further on these aspects – the interested reader is referred to the references mentioned
here – we nevertheless want to underline four points:
• Classifications of ethnicity always have political and social consequences and thus we have to anticipate a 'classification struggle'.2 As Wimmer (2013: 205) mentions, 'Ethnicity is more than an "imagined community", a cognitive classification, or a discourse of identity. Ethnic boundary making is driven by hierarchies of power and prestige and is meant to stabilize and institutionalize these hierarchies.'
• Classifications in general and the measurement of ethnicity in particular are therefore closely related to political debates. For example, the decision to measure 'race' in the US census was linked to the representation of the different states in Congress (see e.g. Schor, 2009: 30).
• As a consequence, the debate on ethnicity and its measurement is a function of historical and national contingencies as well as power relations. The following examples highlight this point:
○ For Latin America, Ferrandez and Kradolfer (2012: 2) write: 'A decade ago, seventeen out of nineteen Latin American states were incorporating categories to identify ethnic or racial belonging in their first census round of the century'. As the authors underline, these changes take place in a context of affirmation of human rights but also of an ethnically very unequal distribution of wealth. In consequence, the debate on ethnic/racial categories is clearly to be related to a general discussion of social inequalities and distribution of resources in a country, at a specific time.
○ In the United States Census there is a long tradition of measuring race/ethnicity, sometimes even mixing these two concepts, as in the 2010 census for example (Prewitt, 2013). But this is not established forever and, for each edition of the census, some modifications are made in the questionnaire, reflecting the difficulty of constructing objective social classifications. Since 2000 multiple belongings are considered (Nobles, 2000: 82). This is one more argument for the contingent character of this question.
○ If, as mentioned, some countries do not use ethnic or racial categories in their censuses, this is nevertheless now a dominant practice: nearly two thirds of the countries have employed such a measure in their census around the millennium turn. Ann Morning (2008, p. 1) for example writes: 'I find that 63% of the national censuses studied incorporate some form of ethnic enumeration, but their question and answer formats vary along several dimensions that betray diverse conceptualizations of ethnicity (for example, as "race" or "nationality")'.
• And finally, at least when thinking of the 'ancestry dimension' of ethnicity, we may face, depending on the countries, two different situations as extremes. The first one is the case where the 'minority population' is the 'native' one, like for example American Indians in America or Aborigines in Australia. In the second case, the 'ethnic minority' is an immigrant population, as is the case in most European countries (and there may be additional established minorities, such as Roma, or country-specific cultural/language groups in certain countries such as the Catalans in Spain, too). Of course, the way to express interest and legitimate representation or rights could be very different in these situations. This is another way to link ethnicity and migration.
That means that the ‘ethnic boundaries’ we have spoken about have a political function in many countries and are part of political claims. In order to go further, we have to first present a definition of ethnicity, and from there see how this concept can be measured empirically in different contexts. Wimmer (2013: 7) mentions, following in this Max Weber: ‘ethnicity is understood as a subjectively felt belonging to a group that is distinguished by a shared culture and by common ancestry… In this broad understanding of ethnicity, “race” is treated as a subtype of ethnicity, as is nationhood’. This means also that there is always a tension between ascription and achievement as well as between external attribution and selfidentification. Often, in particular for censuses, the state defines the relevant categories, often based on external criteria rather than based on feelings of belonging. For example, Morning (2008: 248) mentions that only 12
out of 87 states employing ethnic enumeration use the subjective facet of identity. A measure of distance between ethnic groups can also be considered, in order to know how distinct the different groups are. One example of such a measure is the frequency of intermarriage, as discussed in New Zealand (Callister, 2004). Such an idea of social proximity based on intermarriage is also applied in other fields, such as the CAMSIS scale used to define social position (see the section on 'Occupation and Social Position' further below in this chapter). Scientific associations have also proposed standards, sometimes at the price of an intense debate about the very question of proposing norms. See for example the American Anthropological Association (1997) response to OMB Directive 15: Race and Ethnic Standards for Federal Statistics and Administrative Reporting, or the discussion inside the American Sociological Association (2003). More generally, several books are dedicated to the (subjective) measurement of ethnicity in a survey context and to the strength of belonging to one or another group (see for example Davis and Engel, 2010; Haller and Eder, 2015). In any case, the level of diversity observed in different countries, as well as the way ethnicity organizes itself in their social structures, underlines the importance of considering the social setting in which ethnicity is discussed. This is perhaps one more argument for an 'intersectional perspective', that is, a perspective integrating 'Race, Class and Gender' and their relations and interactions (e.g. Andersen and Collins, 2012; Dorlin, 2009). International comparative surveys also illustrate the way in which the scientific community has operationalized these notions. The ISSP participating countries were traditionally free to choose to report the ethnic dimension most 'useful' to measure, according to the local context. This situation leads to very diverse measures which are difficult to
fully document for data users, and difficult to analyze in a comparative fashion. In the last revision of its background variables, the ISSP has adopted an open question to measure subjective belonging.3 However, the instruction for implementation is less clear, mixing ancestry and belonging: ‘The country-specific list of ethnic groups should be based on the core concept of ancestry which, deviating in different countries, can be founded on genetic, cultural or historical roots. The list may capture one specific dimension of ethnicity, such as nationality, citizenship, race, language or religion, depending on which aspect is particularly relevant in the country’.
Furthermore, in some countries like Switzerland, intensive cognitive testing has shown that the idea of 'belonging to a group' is not clear at all for respondents and can give a misleading impression of the validity of the measure.4 In addition, country of birth of both parents has also been added to the ISSP, making reference to the 'ancestry dimension' already mentioned, even if referring to the second generation only. Although it is not totally explicit, this goes in the direction of capturing sub-dimensions of ethnicity and of eventually using the one that is relevant in the context of a particular research question: nationality in some cases, language in others, migration experience in still others, etc. The European Social Survey has done an interesting job of theoretical construction of the core indicators (see Billiet, 2001). The first dimension is the subjective one of belonging to a minority ethnic group5 and shows that less than 10% feel part of an ethnic minority, the exceptions being Israel, Bulgaria, Russia, and Estonia, where it was still less than 20%. Even more interestingly, the relationship between the subjective variable 'belonging to an ethnic minority group in the country' and ethnic minority ancestry (where respondents state their ancestry and researchers define whether this is a minority in the country) is actually surprisingly weak, at least in the UK (European Social Survey, 2014a). In other words, the question asked in
the former way shows ethnicity as a relatively marginal question in most European countries, at least in the context of general social surveys. The concept of ‘ethnic minority group’ is likely too abstract and scientific – in addition to not being well understood in many countries – to elicit identification by respondents in a direct question. At the time of the inception of the ESS, however, this was considered the only way to get some directly cross-nationally comparable information on respondents’ subjective ethnic minority status. Of course, this is not the only way to explore the belonging to a cultural or national group and the ESS also measures citizenship, country of birth, country of birth of the parents, and the language(s) spoken at home (including dialects), giving to the researcher a set of variables to build different typologies based on belonging to language groups or migration background. In ESS round 7, a dual-response measure of ancestry with an underlying classification of cultural and ethnic groups was included for the first time, in order to improve the analytic potential of the immigration module implemented in the same round (European Social Survey, 2014b). The idea of multiple indicator measurement as used by the ESS and ISSP was already mentioned by Smith (1980) and was also discussed in Hoffmeyer-Zlotnik and Warner (2014: 212). From this line of reasoning, it appears difficult to propose one single measure for ethnicity for comparative survey research, but a set of variables including citizenship and legal status, language, migration background could be a common basis, letting space to add an indicator of subjectively belonging to an ethnic group, and/or ancestry, in the case where it makes sense. This multiplication of perspectives and resulting flexibility is probably the best way to take into account the specific settings in which indicators like this make sense and avoid imposing one unique logic, be it academic or political. In conclusion, even though ethnicity seems impossible to measure with one simple
indicator in a truly comparative frame, multiple indicators can nevertheless be used in different contexts, knowing that their importance can vary according to national structures and power relations. In this sense, this is a fascinating case of a 'struggle for classification', and this dimension must be considered carefully, keeping in mind its social and political implications. But this political dimension can also be found in other social classifications. In the case of education, which we consider next, the relative importance of general and vocational training, as well as the question of which qualifications are 'equivalent' across countries, has always been a serious challenge for international comparison, but also for political authorities trying to compare their own educational system with that of other countries.
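As a minimal sketch of how such a set of indicators might be combined, the following example derives a simple migration-background typology from own and parental country of birth; the variable names and the three-category scheme are illustrative assumptions rather than an ESS or ISSP standard.

```python
import pandas as pd

# Hypothetical respondent-level data with the kind of indicators discussed above.
df = pd.DataFrame({
    "country":              ["CH", "CH", "CH", "CH"],
    "country_of_birth":     ["CH", "IT", "CH", "CH"],
    "mother_birth_country": ["CH", "IT", "IT", "CH"],
    "father_birth_country": ["CH", "IT", "CH", "CH"],
})

def migration_background(row):
    """Illustrative three-category typology based on countries of birth."""
    if row["country_of_birth"] != row["country"]:
        return "first generation"
    if (row["mother_birth_country"] != row["country"]
            or row["father_birth_country"] != row["country"]):
        return "second generation"
    return "no migration background"

df["migration_background"] = df.apply(migration_background, axis=1)
print(df[["country_of_birth", "migration_background"]])
```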
EDUCATION Two elements of an individual’s education are important in survey research: the individual’s educational attainment, i.e. how much education an individual has achieved, and the chosen field of education and training, i.e. the specific subject area or substance matter the individual has specialized in. The first is referred to as vertical and the second as horizontal educational differentiation (Charles and Bradley, 2002; Jonsson, 1999). Both are presented here in turn.
Educational Attainment Educational attainment is either of interest in itself, as an outcome of an individual’s ‘career’ in formal education, or an indicator for more distant theoretical concepts such as human capital, cognitive skills, literacy, capacity for learning, cultural capital, or even social status. The measurement of educational attainment, or in brief just ‘education’, seems to be a self-evident concept in
most national contexts. Only when authority of the educational system is devolved to regional entities such as ‘Länder’ in Germany or constituent countries in the United Kingdom, issues similar to those very apparent in cross-national surveys surface: How can researchers compare the levels of education achieved by individuals in historically grown, institutionally bound, and symbolically enriched – in short, idiosyncratic and thus incomparable – educational systems? First of all, the understanding of the concept of education itself differs across countries. While in some countries, the term ‘education’ is mostly understood to refer to general, theoretical, or academic learning (‘Bildung’), it includes practical, vocationally specific, or professional training (‘Ausbildung’) in others. In the first group of countries, both functions of education tend to be rather distinct in the institutional setup so that one can be measured separately from the other.6 When researching individuals’ success in the labor market, it is rather clear that all countries need to cover both aspects of education. For cross-national research, however, it is recommended to always use the wider definition of education because in many countries, both are so closely entwined that it is almost impossible to measure general or academic education only. If some countries decide to measure education only narrowly, their data will not be comparable with other countries. Two indicators and directly related measures for educational attainment are commonly in use: years of education and educational qualifications. More rarely, education is conceptualized and measured in a positional, i.e. relative, way. Such measures are always derived from the other two indicators.
Years of Education
The indicator 'years of education' was the first indicator of education to be used in (mostly US-American) social stratification research (see e.g. Blau and Duncan, 1967). It seems very simple at first sight, and can be modeled in a linear fashion. Importantly, it uses a unit of measurement that looks cross-nationally comparable because a year is a year everywhere. A questionnaire item for this indicator could thus be input harmonized in principle (see Chapter 33 in this Handbook) with potential adaptation in the respondent or interviewer instructions, highlighting which educational programs to count or to omit in the country in question. This hints at one difficulty with this indicator: since there is no show card clarifying the relevant 'universe' of education to the respondent, the instructions need to clarify, in a cross-nationally comparable manner, the scope of education, especially at the beginning and end of educational careers. In particular, the in- or exclusion of early childhood education, pre-schooling, vocational training that is entirely or partly work based, continuing education and PhD studies need to be clarified. With respect to part-time education, respondents also need to know how to take this into account. One consequence of this is that the instructions may become lengthy and will not be read by respondents, and that summing up all relevant years may be cognitively demanding. Another difficulty with respect to years of education is the fact that a year is not in fact the same everywhere so that its cross-national and system-wide comparability may be more imagined than real. Hours of instruction per school day, days of schooling per school year as well as pacing vary widely across countries, levels of education, and even between school types and classrooms. However, cross-national research has shown that instructional time in a subject is unrelated to achievement scores in this subject (Baker et al., 2004).
Educational Qualifications
With the advancement of social stratification theories and expansion of this research to Europe with its reliance on educational certificates and differentiated education systems, the indicator of years of education was increasingly questioned. If from a theoretical point of view, the signaling value of
educational qualifications (Spence, 1973) or qualitative differences between different types of education (with similar duration) are important, years of education will not be the optimal indicator. Countries differ with respect to how much symbolic power educational qualifications assume in society and the labor market. Education questions in surveys are almost always closed questions. Since educational qualifications are country-specific and respondents cannot reliably relate their specific qualification to abstract international categories, response categories have to be country-specific, too. Only the question stem, e.g. ‘What is the highest educational qualification that you have achieved so far?’ and general instructions can be input harmonized to make sure countries actually use the same indicator. The country-specific response categories need to be coded into a cross-national coding scheme during data processing (output harmonization, see Chapter 33 in this Handbook). There are two main cross-national coding schemes for educational attainment in comparative research. The CASMIN scheme was developed in the project ‘Comparative Analysis of Social Mobility in Industrial Nations’ in the 1970s and 1980s (Brauns et al., 2003; König et al., 1988) to harmonize largely European cross-national educational attainment data ex post (see Chapter 33 in this Handbook). It has become the paradigmatic coding scheme in social stratification and mobility research since then and was used in several seminal studies (e.g. Breen et al., 2009; Shavit and Blossfeld, 1993). It holds a high reputation amongst scholars (for evaluations see e.g. Kerckhoff et al., 2002; Schneider, 2010). Its main advantages compared to ISCED (see below) are that it is theoretically based on social stratification research and therefore recognizes horizontal institutional distinctions between general or academic and vocational qualifications. Also, it classifies education in a relative way at the lowest level, thus remaining rather stable
when an education system expands. (See the section on 'Fields of Specialization' in this chapter.) However, it has not been updated since 2003, which is becoming problematic given the increasing differentiation of higher education in Europe. It also lacks world-wide coverage, which is one likely reason for it not being used for ex-ante output harmonization in large cross-national surveys. Table 20.1 gives an overview of the CASMIN categories.

Table 20.1 The CASMIN education scheme
1a        Inadequately completed general elementary education
1b        General elementary education
1c        Basic vocational qualification or general elementary education and basic vocational qualification
2a        Intermediate vocational qualification or intermediate general education plus basic vocational qualification
2b        Intermediate general qualification
2c voc    Full vocational maturity certificate
2c gen    Full general maturity certificate
3a_voc    Lower tertiary certificate, vocational
3a_gen    Lower tertiary certificate, general
3b_low    Higher tertiary certificate, lower level
3b_high   Higher tertiary certificate, higher level

The International Standard Classification of Education (ISCED), developed and maintained by UNESCO (2006, 2012), is a multidimensional, multi-purpose cross-classification for harmonizing national educational programmes and qualifications into a cross-national framework for levels and fields of education (see section on 'Fields of Specialization' in this chapter for the latter). It is mostly used for international statistical reporting by UNESCO, OECD, and Eurostat and in official comparative surveys like the European Union Labour Force Survey or the Programme for the International Assessment of Adult Competencies (PIAAC) (OECD, 2015). The classification for educational levels consists of main levels and a number of sub-dimensions like program orientation,
destination, and duration. The version most commonly found in data today is ISCED 1997, which consists of seven main levels (plus sub-dimensions, which are only rarely used in survey data). The successor, ISCED 2011 (for details, see Schneider, 2013), consists of nine levels (plus sub-dimensions) to better differentiate tertiary education after the Bologna reforms, as shown in Table 20.2. While ISCED covers most countries of the world and is by now well documented, it lacks a theoretical basis and is implemented by countries’ education ministries and statistical offices, which makes it vulnerable to political exploitation. The European Social Survey (ESS) has for round 5 (2010) introduced a detailed cross-national education variable (edulvlb) adapted from the fully detailed three-digit ISCED 2011 to better distinguish within certain heterogeneous categories and to purposefully deviate from the official classification of some qualifications, with the aim to improve the comparability, validity, and coding flexibility of the ESS education variable. A simplified variable, the European Survey version of ISCED (see Schneider, 2010), with higher comparative validity than the five-level ISCED-97 based variable (edulvla) provided in earlier ESS waves was also introduced (European Social Survey, 2012). The International Social Survey Programme (ISSP) has in 2011 introduced
a similar variable (called DEGREE, ISSP Demographic Methods Group, 2012). Table 20.3 shows how the different coding schemes relate to one another.
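A minimal sketch of this output-harmonization step is given below, assuming a hypothetical lookup table from country-specific response categories to a harmonized code loosely modelled on an ES-ISCED-style scale; the categories and target codes shown are illustrative and do not reproduce the official correspondence tables.

```python
from typing import Optional

# Hypothetical lookup from country-specific answer categories to a harmonized
# education code (the assignments here are illustrative only).
LOOKUP = {
    ("DE", "Hauptschulabschluss"): 2,
    ("DE", "Abitur"): 4,
    ("DE", "Master/Diplom"): 7,
    ("FR", "Baccalauréat"): 4,
    ("FR", "Licence"): 6,
}

def harmonize(country: str, national_category: str) -> Optional[int]:
    """Return the harmonized education code, or None if the category is not yet mapped."""
    return LOOKUP.get((country, national_category))

for country, answer in [("DE", "Abitur"), ("FR", "Licence"), ("FR", "CAP")]:
    print(country, answer, "->", harmonize(country, answer))
```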
Positional Education Measures A problem affecting both years of education and educational qualifications is that with educational expansion, the distribution of education ‘moves up’ across cohorts. If somebody had a university entrance qualification in 1950, he was highly educated in comparison to his peers. Today, the same level of education would have to be regarded as mediocre at best. Therefore, depending on the research question, it may make sense to conceptualize and measure education in a relative rather than an absolute way. This idea was first brought forward by labor economists in the 1970s (Thurow, 1975), but is increasingly picked up in sociological research also. Various ways to code positional education measures have been proposed, both ordinal/continuous ones (Bol, 2015; Rotman et al., 2016; Tam, 2005; Triventi et al., 2016; Ultee, 1980) and, more rarely, categorical ones (Bukodi and Goldthorpe, 2016).7 In addition, one can distinguish between univariate, bivariate, and multivariate measures. Univariate positional education measures transform respondents’ education by taking the attainment of other cohort members into
Table 20.2 ISCED 1997 and 2011 main levels
ISCED 1997
0   Pre-primary education
1   Primary education
2   Lower secondary education
3   Upper secondary education
4   Post-secondary non-tertiary education
5   First stage of tertiary education
6   Second stage of tertiary education
ISCED 2011
0   Early childhood education (attainment: < primary education)
1   Primary education
2   Lower secondary education
3   Upper secondary education
4   Post-secondary non-tertiary education
5   Short cycle tertiary education
6   Bachelor level education and equivalent
7   Master level education and equivalent
8   Doctoral level education
Table 20.3 Detailed educational attainment categories and their coding in the ESS (edulvlb), ES-ISCED, ISSP (DEGREE), ISCED 2011 and 1997
[The table lists the detailed ESS educational attainment categories (edulvlb codes, described using ISCED 1997 terminology); for each category it additionally reports the corresponding ESS edulvla, ES-ISCED, ISSP DEGREE, ISCED 2011 and ISCED 1997 codes.]
000   Not completed primary education
113   Achieved certificate from an ISCED 1 programme, or completed an ISCED 1 programme that does not provide any certificate
129   Achieved certificate from a short vocational ISCED 2 programme
211   Achieved certificate from a general/pre-vocational ISCED 2 programme not giving access to ISCED 3
212   Achieved certificate from a general/pre-vocational ISCED 2 programme giving access to ISCED 3 (vocational only)
213   Achieved certificate from a general ISCED 2 programme giving access to ISCED 3 (general or all)
221   Achieved certificate from a long vocational ISCED 2 programme not giving access to ISCED 3
222   Achieved certificate from a vocational ISCED 2 programme giving access to ISCED 3 (vocational only)
223   Achieved certificate from a vocational ISCED 2 programme giving access to ISCED 3 (general or all)
229   Achieved certificate from a short vocational ISCED 3 programme
311   Achieved certificate from a general ISCED 3 programme without access to tertiary considered as level 3 completion
312   Achieved certificate from a general ISCED 3 programme preparing for lower tier ISCED 5A or 5B, but not upper tier 5A
313   Achieved certificate from a general ISCED 3 programme preparing for upper/single tier ISCED 5A
321   Achieved certificate from a long vocational ISCED 3 programme not giving access to ISCED 5
322   Achieved certificate from a vocational ISCED 3 programme giving access to ISCED 5B or lower tier 5A, but not upper tier 5A
323   Achieved certificate from a vocational ISCED 3 programme giving access to upper/single tier ISCED 5A
411   Achieved certificate from a general ISCED 4 programme not giving access to ISCED 5
412   Achieved certificate from a general ISCED 4 programme giving access to lower tier ISCED 5A or ISCED 5B, but not upper tier 5A, without prior completion of 3B/3C
413   Achieved certificate from a general ISCED 4 programme giving access to upper/single tier ISCED 5A, without prior completion of 3B/C voc
421   Achieved certificate from a vocational ISCED 4 programme not giving access to ISCED 5
422   Achieved certificate from a vocational ISCED 4 programme giving access to lower tier ISCED 5A or ISCED 5B, but not upper tier 5A, or general ISCED 4 after completing ISCED 3B/C programme
423   Achieved certificate from a vocational ISCED 4 programme giving access to upper/single tier ISCED 5A, or general ISCED 4 after completing vocational ISCED 3B programme
510   Achieved general/academic tertiary certificate below bachelor's level (level 6xx) after 2–3 years of study
520   Achieved vocational tertiary certificate below bachelor's level (level 6xx) after 2–3 years of study
610   Achieved 1st polytechnic/applied/lower tier college degree after 3–4 years
620   Achieved 1st upper/single tier university degree after 3–4 years of study
710   Achieved 1st polytechnic/applied/lower tier college degree after more than 4 years of study or 2nd or further lower tier college degree
720   Achieved 1st upper/single tier university degree after more than 4 years of study or 2nd or further upper/single tier university degree below the doctoral level
800   Doctoral degree
Notes: 1 In ISCED 97 at level 2, A programs give access to 3A and 3B; B programs give access to 3C only. 2 In ISCED 97 at levels 3 and 4, A programs give access to 5A and B programs to 5B. 3 While OECD countries usually classified such programs as 'A', other countries often classified them as 'B'. 4 The ISCED 2011 codes for 'partial completion' in a level (242, 252, 342, 352) are omitted since such qualifications are classified at the next lower level in the ESS.
account (Bol, 2015; Bukodi and Goldthorpe, 2016). Bivariate positional education measures scale education categories, often within birth cohorts using some other cross-nationally comparable quantitative variable such as years of education, social status, or income (Ganzeboom and Treiman, 1993; Rotman et al., 2016; Treiman and Terrell, 1975; Triventi et al., 2016). A multivariate scaling approach to education was proposed by Schröder and Ganzeboom (2013).
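As a minimal sketch of one univariate positional measure, the example below computes the percentile rank of years of education within birth cohorts; the column names and data are hypothetical, and this transformation is only one of several variants proposed in the literature cited above.

```python
import pandas as pd

# Hypothetical data: birth cohort and years of education per respondent.
df = pd.DataFrame({
    "cohort":   [1950, 1950, 1950, 1980, 1980, 1980],
    "educ_yrs": [8, 12, 16, 12, 16, 18],
})

# Positional measure: percentile rank of a respondent's education within his or
# her birth cohort (close to 0 = lowest, 1 = highest in the cohort). The same
# years of schooling receive a lower rank in more recent, more educated cohorts.
df["educ_position"] = (
    df.groupby("cohort")["educ_yrs"]
      .rank(pct=True, method="average")
)
print(df)
```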
Fields of Specialization Fields of specialization refer to subjects in education, training, or study. The concept is complementary to educational attainment and represents educational differences in kind rather than grade or level. The substance matter in which people specialize in the later stages of their educational career is a background variable that is much more rarely measured than educational attainment though, resulting in calls for better and more international data for analyzing horizontal stratification in education (Charles and Bradley, 2002). It is typically a mediating variable when analyzing labor market opportunities and work conditions (see e.g. the special issue of the International Journal of Comparative Sociology, van de Werfhorst, 2008), educational inequality and social class mobility (Jackson et al., 2008; van de Werfhorst and Luijkx, 2010), or gender inequalities in labor market outcomes or occupational sex segregation (e.g. Bobbitt-Zeher, 2007; Bradley, 2000).8 Often it is only considered for tertiary graduates. ISCED 2011 includes a detailed threedigit classification for Fields of Education and Training (ISCED-F; UNESCO Institute for Statistics, 2013), which partly adopted ideas from an operational manual for use in European statistics including an elaboration for vocational education and training (Andersson and Olsson, 1999). ISCED 1997 only used two digits for fields (UNESCO,
2006). As an alternative to ISCED, there is the classification of Fields of Science and Technology (OECD, 2007), which applies to higher education only. There is, especially when looking at national studies, a great variety of ways to aggregate specializations into a manageable number of categories, which ‘has received only scattered attention’ (Gerber and Cheung, 2008: 308). As with other categorical variables, aggregation of categories may create substantial within-category (subfield) heterogeneity (Jacobs, 1995) and lead to non-comparability across studies (Reimer et al., 2008). Surveys measure fields of education and training directly, i.e. they give respondents a number of response options – usually broad fields of education9 – and ask them to classify themselves into the categories provided. In cross-national surveys, the question on field of education and training is input harmonized (see Chapter 33 in this Handbook), i.e. question and response options are translated into the different survey languages. While it will usually be easy for respondents to remember and report their field of education or training, given the increasingly high degree of differentiation and interdisciplinarity of fields, self-assignment to abstract categories may be difficult for respondents. Eurostat (2013) thus recommends, for the purpose of Labour Force Surveys, to allow open answers and use office post-coding for difficult cases.
OCCUPATION AND SOCIAL POSITION Occupation and position on the labor market are among the most important variables for the analysis of social inequalities and labor market processes, as well as of outcomes such as income. Comparative measures in this area were first designed by statistical offices. They also raise the question of how far such a variable relates to the individual or to the household, with all the discussion of how to define a household in the different countries
(Hoffmeyer-Zlotnik and Warner, 2014). We are not able to give a full review of all dimensions and tools related to these concepts here but will restrict ourselves to the International Standard Classification of Occupations (ISCO) and measures of social position derived from ISCO codes.
Occupation as Measured Through the ISCO Coding

The development of a comparative measure of what kind of tasks people perform on their jobs was a considerable challenge. This is the goal of the ISCO, which is now in its fourth edition, also known as ISCO-08. The idea of having a common code to classify occupations goes back to the middle of the last century. A first version was developed by the International Conference of Labour Statisticians (ICLS) in 1958 and a second edition in 1968. With increasing use and some important transformation of traditional occupations, a new classification was published in 1988 and has had a very strong influence. ISCO-88 was rather different from its predecessor, being more strongly related to the level of skills needed to work in a given occupation. Though the International Labour Organization, as curator of ISCO, has published a crosswalk relating ISCO-68 to ISCO-88 and vice versa, transforming these codes into each other is problematic (Wolf, 1997). As ISCO-88 and ISCO-08 are based on the same principles, comparison between these two classifications is less problematic, but some particular categories could be difficult to match in a one-to-one procedure. Table 20.4 gives information on the structure of this hierarchical classification (see also International Labour Organisation, 2012). At the most detailed level ISCO-08 distinguishes between 436 unit groups designated by a four-digit code. At the next highest level, 130 minor groups are denoted by three digits. This level is followed by 43 submajor groups, and on the highest level 10 major groups of occupations are distinguished. As an example, hotel receptionist is classified as 4224, which belongs to the minor group 422 'Client information workers', which in turn is part of the submajor group 42 'Customer services clerks' located in major group 4 'Clerical support workers'.

Table 20.4 Structure of ISCO-08

Major group                                             Number of submajor groups   Number of minor groups   Number of unit groups
1  Managers                                                          4                        11                      31
2  Professionals                                                     6                        27                      92
3  Technicians and associate professionals                           5                        20                      84
4  Clerical support workers                                          4                         8                      29
5  Service and sales workers                                         4                        13                      40
6  Skilled agricultural, forestry and fishery workers                3                         9                      18
7  Craft and related trades workers                                  5                        14                      66
8  Plant and machine operators, and assemblers                       3                        14                      40
9  Elementary occupations                                            6                        11                      33
10 Armed forces occupations                                          3                         3                       3
Sum                                                                 43                       130                     436

From a technical point of view, an ISCO code is determined after careful coding of a set of three open questions. Using such a detailed schema and open questions means that further recoding is needed. Probably each national institute of statistics or survey organization has a computer program or relies on the expertise of especially trained personnel in order to do the coding. Often the concrete routines employed remain a black box and are considered as 'in-house secrets' which are not shared in international publications. This lack of transparency can harm the quality of data, all the more so when national decisions
are made on the coding of specific categories, for example nurses or primary-level teachers, to mention the most obvious cases. Available tools include the CASCOT system developed by a team around Peter Elias at the Warwick Institute for Employment Research, which in the last few years has started moving towards an international occupation coding tool (http://www2.warwick.ac.uk/fac/soc/ier/software/cascot/internat/, accessed July 28, 2015). In order to measure social position without going through the ISCO system, which implies an open question format and a posteriori codification, it is possible to ask a question with a limited number of response options. This approach was used regularly in the Eurobarometers, for respondents' parents in the ESS, or as a double measurement, as for the parents in the ISSP modules on inequality in 1999 and 2009, for example. However, there is currently neither a consensus on the scale to use nor an established comparative value in different contexts. Furthermore, there is no straightforward way to compare different scales used in different surveys. With such indicators, there is also the problem that 'classical' measures of social class or status, described below, cannot be used. This means that this kind of schema is probably problematic, if used as the only measure of social position, when trying to harmonize international surveys a posteriori, and it is thus not the most useful tool for measuring social position. Some researchers argue nevertheless that a double measurement, even if far from perfect, is always better than a single one. In this perspective a crude measure, as mentioned here, could be useful as a complementary one (see for example Schröder and Ganzeboom, 2013).
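Because ISCO is strictly hierarchical, the higher levels of aggregation can be derived from a four-digit unit group simply by truncating the code, as in the hotel receptionist example above (4224, minor group 422, submajor group 42, major group 4). The short Python sketch below illustrates only this truncation step; the function name and return structure are hypothetical, and the hard part in practice, assigning the four-digit code from open answers, is what systems such as CASCOT are designed for.

```python
def isco08_levels(unit_group: str) -> dict:
    """Derive the higher ISCO-08 aggregation levels from a four-digit unit group
    code by truncation: minor group = 3 digits, submajor group = 2, major group = 1."""
    if len(unit_group) != 4 or not unit_group.isdigit():
        raise ValueError("expected a four-digit ISCO-08 unit group code")
    return {
        "unit group": unit_group,          # e.g. 4224, hotel receptionist
        "minor group": unit_group[:3],     # 422 'Client information workers'
        "submajor group": unit_group[:2],  # 42 'Customer services clerks'
        "major group": unit_group[:1],     # 4 'Clerical support workers'
    }

print(isco08_levels("4224"))
```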
Social Position

Often we are not directly interested in measuring occupation per se, but more generally in determining the social position of individuals. In nearly all sociological
traditions, the relation to the production of resources, and/or the activity in which people are involved during the production process, is the basis for measuring social position, i.e. social class or another index. In this context, the contractual character is also important, in order to define independent or salaried people in the Marxian tradition, or the precise dependency between employees and employers as in the model of Erikson and Goldthorpe (1992). In many cases, people not working are classified either by reference to their previous work, or by reference to the persons acquiring resources for the household or the family. This is an old debate for national statistical institutes when trying to define a reference person inside the household, but also among the main authors in the field, such as Wright (1997). That means that information about occupation is generally crucial to define a social position, even if other types of resources or 'capital' can be taken into account. In this context, at least two traditions are worth mentioning: one considering a continuous social status index and another considering discrete socio-economic groups or social classes.
Socio-economic Indexes and One-dimensional Continuous Measures

Without going into too much detail in the discussion between prestige and socio-economic indexes, a strong argument has been developed that it is possible to have one common index for nearly every country and period, at least since the second half of the twentieth century. This was advocated by Treiman (1977) for occupational prestige, and by Ganzeboom et al. (1992) for socio-economic status, giving birth to the International Socio-Economic Index (ISEI), and is seen as a major achievement of the sociology of stratification (Hout and DiPrete, 2006). However, strong criticisms of the pertinence of a uni-dimensional schema have been published (Coxon et al., 1986; Lorenzi-Cioldi and Joye, 1988); and the possibility of having the same schema across countries or even across social groups inside
a country was also questioned (Hauser and Warren, 1997; Joye and Chevillard, 2013). Partly in reaction to this last line of criticism, and partly also for more conceptual reasons arguing that social position has to be inferred from social proximity in a given context rather than based on a priori positioning, some alternative scales have been developed on the basis of the social distance inside couples or between friends: these are the ideas behind the CAMSIS scale (Lambert, 2012; Prandy and Lambert, 2003) or the index proposed by Chan (2010). Once again, without discussing the sociological literature here, the advantage of these constructions is that they allow for national or cohort variations, and even, at least from a theoretical point of view, for variations between social groups. As with positional education measures, this is also an interesting example of how a tool allowing differentiation according to national contexts can, in the end, be the best comparative measure. A second tradition of conceptualizing and measuring social position is also important, at least in the field of social stratification and mobility research: the choice of a class schema allowing for multiple dimensions of differentiation in one single measure. We will turn to this now.
Socio-economic Groups and Social Class Models

From an historical point of view, social class schemes are not only the result of Marxist thought but were also developed at the beginning of the twentieth century in Great Britain, partly as an empirical tool: the first classifications were mainly designed in order to predict mortality. More generally, there were historically strong links between health analysis and social stratification research. Since then, many countries have built a national classification, commonly used in national statistics and survey analysis, even if the nature and criteria used for this can be rather different (Joye et al., 1996). Without going too deep into national idiosyncrasies, three points are worth mentioning.
Firstly, some social class categories are above all nationally or culturally specific and very difficult to translate or to use with the same pertinence in different contexts. That means that national categories are deeply rooted in a national way of conceiving the organization of society. In a really fascinating contribution to this debate, Desrosières (1987) and Desrosières and Thevenot (1988) have shown that, in the French case, the position of particular occupations in a socio-professional category was heavily negotiated with the trade unions. The result of this is that the French categories are rather well known inside France and could be used for self-placement, while an equivalent is much more difficult to obtain, for example, in Germany (Schultheis et al., 1996). Secondly, in the field of social mobility, some scientists have tried to describe the common structure of different countries in order to derive a tool allowing comparison. This was notably the case for Erikson et al. (1979) who, comparing Sweden, Great Britain, and France, carried out rather empirical comparative work in order to design a Neo-Weberian (so-called EGP) class schema, which was popularized and refined in the famous Erikson and Goldthorpe (1992) book The Constant Flux. At the same time, the work of Wright (1997) was another attempt at building a theoretically founded social class schema in order to measure social stratification in different western countries. Thirdly, this multiplication of empirically and theoretically founded schemes has of course consequences for the international adoption of standards. The (academic) European Socio-economic Classification (ESeC) project, developed in a Goldthorpian tradition by Rose and Harrison (2010), was proposed as an international standard to Eurostat. Finally, the European Statistical System seemed more ready to adopt another model called 'ESeG' (for 'European Socio-economic Groups'), developed within the realm of official statistics with an active role played by INSEE and three other national
statistical institutes (see http://www.cros-portal.eu/content/eseg; Eurostat, 2015). This example shows that the best way to measure social position, and the categories needed to do so, is still heavily debated. Even if we now have some tools that can be used in order to compare social position between countries, at least the western ones, careful use implies looking at their origin, their goals, and the way they were implemented. In fact, for measuring social position, we are once more caught in a tension between the best adaptation to local conditions and the possibility of defining a common frame, as well as depending on the theoretical considerations underlying these constructions. In conclusion we would like to stress that, as in the case of ethnicity and education, there does not exist one ideal measure. Unidimensional tools, like ISEI, are very popular but their universal validity, whether with respect to time, space, or social groups, is still debated. The problems lie less in the valuation of the extreme categories than in the middle ones, which are more sensitive to social change and differences in structure. Similarly for social class: although the EGP schema is the one used most often in the context of social mobility studies, other schemas, such as the one by Oesch (2006) that better reflects women's position in the modern service economy, or the discussion on the heterogeneity of the self-employed categories in some countries (Maloutas, 2007), show that we need alternatives. However, as long as the coding of occupation is done according to ISCO, and some additional information is available, as is the case in the main international surveys,10 multiple social classifications can be implemented. This is a condition for a transparent advancement of scientific work.
CONCLUSION AND OUTLOOK

In summary, this presentation and discussion has shown the importance of linking the categories used with the local context, and of being aware of their political underpinnings. In this sense, socio-demographic variables are not so different from variables measuring subjective perceptions. Their validity must also be discussed by considering the theoretical perspective of the researchers, and the scientific work on this must be developed. Finally, as in translation (see Chapter 19, this Handbook), there is a constant tension between finding the best adaptation to the local structure and ensuring comparative validity. This discussion is not just an academic game, since the choice of one measure or another can have substantive consequences. In other words, and this is also a result of this chapter, concepts, classification criteria, and category definitions are at the heart of political debate and functioning, as well as of empirical measurement and modeling. To name something and to give it a place in a system can be an act of power and institution. It is no accident that the way of dealing with ethnic categories is so different in the US and France, to take two examples, and so difficult to transpose from one to the other, or that vocational education and training is largely ignored in some surveys or countries. Finally, we want to stress that even if ethnicity, education, and social position are very different variables, they share this relation to political and social questions. In this sense, we hope that the discussion of these three concepts in one chapter can help in understanding the challenges of their conceptualization and measurement, which are perhaps less 'objective' than often presented.
NOTES

1 We want to thank Sabine Kradholfer and Yannick Lemel for comments on an earlier version of this text.
2 This point can be discussed in the same way for education or social position.
3 The wording could be, for example, 'Please indicate which of the following group or groups you consider yourself to belong to' (ISSP Demographic Methods Group, 2012b: 20).
4 Oral communication by Marlène Sapin, who is part of the FORS team in charge of fielding ISSP in Switzerland.
5 The question was 'Do you belong to a minority ethnic group in [country]?' with a translation note insisting on the subjective side of belonging.
6 In Germany, for example, both elements are measured with two separate questionnaire items, the first focusing on general schooling and the second on specialized post-school education and training (of both academic and professional types).
7 See also the other contributions in the special issue of Research in Social Stratification and Mobility, edited by Park and Shavit, of which at the time of writing three were in advance access.
8 For a review of research on causes and consequences of horizontal stratification in postsecondary education, see Gerber and Cheung (2008).
9 For example, in tertiary education these could be: 1. Education, 2. Humanities and Arts, 3. Social Science, Business and Law, 4. Natural Science, 5. Math and Computer Science, 6. Medicine and Health, and 7. Engineering (Charles and Bradley, 2002).
10 Eurobarometer is an exception in this case.
RECOMMENDED READINGS

As a general frame for socio-demographic variables in comparative research see Hoffmeyer-Zlotnik and Wolf (2003). For the debate on ethnic categories see Felouzis (2010) and the book of Wimmer (2013) for a more theoretical discussion. For education, discussing in particular the case of vocational training, see Braun and Müller (1997). See also Smith (1995). For an introduction to the measurement of social position see Lambert et al. (2012). And for an in-depth discussion of socio-economic indexes, their interest and weaknesses, see Hauser and Warren (1997).
REFERENCES

American Anthropological Association (1997). American Anthropological Association Response to OMB Directive 15: Race and Ethnic Standards for Federal Statistics and Administrative Reporting. Retrieved August 20, 2015, from http://www.aaanet.org/gvt/ombdraft.htm American Sociological Association (2003). The Importance of Collecting Data and Doing Social Scientific Research on Race. Washington, DC: American Sociological Association. Retrieved August 20, 2015, from http://www.asanet.org/images/press/docs/pdf/asa_race_statement.pdf Andersen, M., and Collins, P. H. (2012). Race, Class, & Gender: An Anthology. Belmont, CA: Wadsworth. Andersson, R., and Olsson, A.-K. (1999). Fields of education and training manual. Fields of Education and Training Manual, December, Luxembourg: Eurostat. Baker, D. P., Fabrega, R., Galindo, C., and Mishook, J. (2004). Instructional time and national achievement: cross-national evidence. Prospects, 34(3), 311–334. doi:10.1007/s11125-004-5310-1 Bancel, N., David, T., and Thomas, D. (2014). L'invention de la race. Paris: La Découverte. Billiet, J. (2001). Questions about national, subnational and ethnic identity. In European Social Survey: Core Questionnaire Development. Blau, P. M., and Duncan, O. D. (1967). The American Occupational Structure. New York, London: Wiley. Bobbitt-Zeher, D. (2007). The gender income gap and the role of education. Sociology of Education, 80, 1–22. Bol, T. (2015). Has education become more positional? Educational expansion and labour market outcomes, 1985–2007. Acta Sociologica, 58(2), 105–120. doi:10.1177/0001699315570918 Bradley, K. (2000). The incorporation of women into higher education: paradoxical outcomes? Sociology of Education, 73(1), 1–18. Braun, M., and Müller, W. (1997). Measurement of education in comparative research. Comparative Social Research, 16, 163–201.
Brauns, H., Scherer, S., and Steinmann, S. (2003). The CASMIN educational classification in international comparative research. In J. H. P. Hoffmeyer-Zlotnik and C. Wolf (eds), Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables (pp. 221– 244). New York; London: Kluwer Academic/ Plenum. Breen, R., Luijkx, R., Müller, W., and Pollak, R. (2009). Nonpersistent inequality in educational attainment: evidence from eight European countries. American Journal of Sociology, 114(5), 1475–1521. Bukodi, E., and Goldthorpe, J. H. (2016). Educational attainment – relative or absolute – as a mediator of intergenerational class mobility in Britain. Research in Social Stratification and Mobility, 43, 5–15. doi:10.1016/j. rssm.2015.01.003 Callister, P. (2004). The classification of individuals. Social Policy Journal of New Zealand, 23, 109–140. Chan, T. W. (2010). The social status scale: its construction and properties. In T.W. Chan (ed.) Social Status and Cultural Consumption (pp. 28–56). Cambridge: Cambridge University Press. doi:http://dx.doi.org/10.1017/ CBO9780511712036.002 Charles, M., and Bradley, K. (2002). Equal but separate? A cross-national study of sex segregation in higher education. American Sociological Review, 67(4), 573–599. Coxon, A. P. M., Davies, P. M., and Jones, C. L. (1986). Images of Social Stratification: Occupational Structures and Class. London: Sage. Davis, L. E., and Engel, R. J. (2010). Measuring Race and Ethnicity. New York: Springer Science & Business Media. Desrosières, A. (1987). Eléments pour l’histoire des nomenclatures socioprofessionnelles. In INSEE (ed.), Pour une histoire de la statistique (Vol. 1, pp. 155–232), Paris: Economica. Desrosières, A., and Thevenot, L. (1988). Les catégories socio-professionnelles. Paris: Éditions la Découverte. Dorlin, E. (2009). Sexe, race, classe: pour une épistémologie de la domination. Paris: Presses Universitaires de France. Erikson, R., and Goldthorpe, J. H. (1992). The Constant Flux: A Study of Class Mobility in Industrial Societies. Oxford: Clarendon Press.
Erikson, R., Goldthorpe, J. H., and Portocarero, L. (1979). Intergenerational class mobility in 3 Western European societies – England, France and Sweden. British Journal of Sociology, 30(4), 415–441. European Social Survey (2012). Appendix A1 – Education, ESS6 – 2012 ed. 2.0. ESS6 Data Documentation Report. Bergen. Retrieved August 2, 2015, from http://www.europeansocialsurvey.org/docs/round6/survey/ESS6_ appendix_a1_e02_0.pdf European Social Survey (2014a). ESS Round 7 Pilot Report Overview. London: ESS ERIC Headquarters, Centre for Comparative Social Surveys, City University London. European Social Survey (2014b). ESS Round 7 Source Questionnaire. London: ESS ERIC Headquarters, Centre for Comparative Social Surveys, City University London. Retrieved August 20, 2015, from http://www. europeansocialsurvey.org/docs/round7/ questionnaire/ESS7_source_main_ questionnaire_final_alert_03.pdf Eurostat (2013). EU labour force survey explanatory notes (to be applied from 2014q1 onwards). Retrieved August 21, 2015, from http://ec.europa.eu/eurostat/documents/1978984/6037342/EU-LFS-explanatory-notes-from-2014-onwards.pdf Eurostat (2015). ESSnet on the harmonisation and implementation of an European socioeconomic classification. Collaboration in Research and Methodology for Official Statistics. Retrieved August 21, 2015, from http://www.cros-portal.eu/content/eseg Felouzis, G. (2010). The use of ethnic categories in sociology: a coordinated presentation of positions. Revue Française de Sociologie, 51, 145–150. doi:10.3917/rfs.515.0145 Ferrández, L. F. A., and Kradolfer, S. (2012). Everlasting Countdowns: Race, Ethnicity and National Censuses in Latin American States. Newcastle upon Tyne: Cambridge Scholars. Ganzeboom, H. B. G., and Treiman, D. J. (1993). Preliminary results on educational expansion and educational achievement in comparative perspective. In H. A. Becker and P. L. J. Hermkens (eds), Solidarity of Generations: Demographic, Economic, and Social Change, and its Consequences (pp. 467– 506). Amsterdam: Thesis Publishers.
Ganzeboom, H. B. G., de Graaf, P. M., and Treiman, D. J. (1992). A standard international socio-economic index of occupational status. Social Science Research, 21(1), 1–56. Gerber, T. P., and Cheung, S. Y. (2008). Horizontal stratification in postsecondary education: forms, explanations, and implications. Annual Review of Sociology, 34(1), 299–318. Haller, M., and Eder, A. (2015). Ethnic Stratification and Economic Inequality around the World: The End of Exploitation and Exclusion? Farnham: Ashgate. Hauser, R. M., and Warren, J. R. (1997). Socioeconomic indexes for occupations: a review, update, and critique. Sociological Methodology, 27(1), 177–298. Hoffmeyer-Zlotnik, J. H. P., and Warner, U. (2014). Harmonising Demographic and Socio-Economic Variables for Cross-National Comparative Survey Research. Dordrecht: Springer Netherlands. doi:10.1007/978-94007-7238-0 Hoffmeyer-Zlotnik, J.H.P., and Wolf, C. (eds) (2003). Advances in Cross-National Comparison. New York: Kluwer Academic. Hout, M., and DiPrete, T. A. (2006). What we have learned: RC28’s contributions to knowledge about social stratification. Research in Social Stratification and Mobility, 24(1), 1–20. International Labour Organisation (2012). International Standard Classification of Occupations, ISCO-08. Geneva: International Labour Organisation. ISSP Demographic Methods Group (2012). ISSP Background Variables Guidelines – Version of 2012-06-26. Retrieved August 20, 2015, from http://www.gesis.org/fileadmin/upload/ dienstleistung/daten/umfragedaten/issp/ members/codinginfo/BV_guidelines_for_ issp2013.pdf Jackson, M., Luijkx, R., Pollak, R., Vallet, L.-A., and van de Werfhorst, H. G. (2008). Educational fields of study and the intergenerational mobility process in comparative perspective. International Journal of Comparative Sociology, 49(4–5), 369–388. Jacobs, J. A. (1995). Gender and academic specialties: trends among recipients of college degrees in the 1980s. Sociology of Education, 68(2), 81–98.
Jonsson, J. O. (1999). Explaining sex differences in educational choice: an empirical assessment of a rational choice model. European Sociological Review, 15(4), 391–404. Joye, D., and Chevillard, J. (2013). Education, prestige, and socioeconomic indexes in Switzerland. In R. Becker, P. Bühler, and T. Bühler (eds), Bildungsungleichheit und Gerechtigkeit: wissenschaftliche und gesellschaftliche Herausforderungen (pp. 163–178). Bern: Haupt. Joye, D., Schuler, M., Meier, U., and Sayegh, R. (1996). La structure sociale de la Suisse: catégories socio-professionnelles. Bern: Office fédéral de la statistique (OFS). Kerckhoff, A. C., Ezell, E. D., and Brown, J. S. (2002). Toward an improved measure of educational attainment in social stratification research. Social Science Research, 31(1), 99–123. König, W., Lüttinger, P., and Müller, W. (1988). A Comparative Analysis of the Development and Structure of Educational Systems. University of Mannheim: CASMIN Working Paper No. 12. Lambert, P. (ed.) (2012). Social Stratification: Trends and Processes. Farnham: Ashgate. Lambert, P. S., Connelly, R., Gayle, V., and Blackburn, R. M. (2012). Introduction. In P. S. Lambert (ed.), Social Stratification: Trends and Processes (pp. 1–10). Farnham: Ashgate. Lorenzi-Cioldi, F., and Joye, D. (1988). Représentations sociales de catégories socioprofessionnelles: aspects méthodologiques. Bulletin de Psychologie, 60(383), 377–390. Maloutas, T. (2007). Socio-economic classification models and contextual difference: the ‘European socio-economic classes’ (ESeC) from a south European angle. South European Society and Politics, 12(4), 443–460. doi:10.1080/13608740701731382 Morning, A. (2008). Ethnic classification in global perspective: a cross-national survey of the 2000 census round. Population Research and Policy Review, 27(2), 239–272. doi:10.1007/s11113-007-9062-5 Nobles, M. (2000). Shades of Citizenship: Race and the Census in Modern Politics. Stanford: Stanford University Press. OECD (2007). Revised Field of Science and Technology (Fos) Classification in the Frascati Manual. Paris: OECD.
OECD (2015). Programme for the International Assessment of Adult Competencies. Retrieved August 20, 2015, from http:// www.oecd.org/site/piaac/ Oesch, D. (2006). Redrawing the Class Map: Stratification and Institutions in Britain, Germany, Sweden, and Switzerland. Basingstoke: Palgrave Macmillan. Prandy, K., and Lambert, P. (2003). Marriage, social distance and the social space: an alternative derivation and validation of the Cambridge Scale. Sociology, 37(3), 397–411. doi:10.1177/00380385030373001 Prewitt, K. (2013). What is Your Race?: The Census and Our Flawed Efforts to Classify Americans. Princeton, NJ: Princeton University Press. Reimer, D., Noelke, C., and Kucel, A. (2008). Labor market effects of field of study in comparative perspective: an analysis of 22 European countries. International Journal of Comparative Sociology, 49(4–5), 233–256. Rose, D., and Harrison, E. (2010). Social Class in Europe: An Introduction to the European Socio-Economic Classification. London: Routledge. Rotman, A., Shavit, Y., and Shalev, M. (2015). Nominal and positional perspectives on educational stratification in Israel. Research in Social Stratification and Mobility, 43, 17–24. doi:10.1016/j.rssm.2015.06.001 Schneider, S. L. (2010). Nominal comparability is not enough: (in-)equivalence of construct validity of cross-national measures of educational attainment in the European Social Survey. Research in Social Stratification and Mobility, 28(3), 343–357. doi:10.1016/j. rssm.2010.03.001 Schneider, S. L. (2013). The International Standard Classification of Education 2011. In G. E. Birkelund (ed.), Class and Stratification Analysis (Comparative Social Research, Vol. 30, pp. 365–379). Bingley: Emerald. doi:10.1108/ S0195-6310(2013)0000030017 Schor, P. (2009). Compter et classer: Histoire des recensements américains. Paris: EHESS. Schröder, H., and Ganzeboom, H. B. G. (2013). Measuring and modelling level of education in European societies. European Sociological Review, 30(1), 119–136. doi:10.1093/esr/jct026 Schultheis, F., Bitting, B., Bührer, S., Kändler, P., Mau, K., Nensel, M., Pfeuffer, A., Scheib, E.,
Voggel, W. (1996). Repräsentationen des sozialen Raums -Zur Kritik der soziologischen Urteilskraft. Berliner Journal für Soziologie, 1, 97–119. Shavit, Y., and Blossfeld, H.-P. (eds) (1993). Persistent Inequality: Changing Educational Attainment in Thirteen Countries. Boulder, CO: Westview. Simon, P. (2010). Statistics, French social sciences and ethnic and racial social relations. Revue Française de Sociologie, 51, 159. doi:10.3917/rfs.515.0159 Smith, T. W. (1980). Ethnic measurement and identification. Ethnicity. An Interdisciplinary Journal of the Study of Ethnic Relations Chicago, Ill., 7(1), 78–95. Smith, T. W. (1995). Some aspects of measuring education. Social Science Research, 24(3), 215–242. Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355–374. Starr, P. (1992). Social categories and claims in the liberal state. Social Research, 59(2), 263–295. doi:10.2307/40970693 Tam, T. (2005). Comparing the seemingly incomparable: a theoretical method for the analysis of cognitive inequality. Paper Presented at the Conference ‘Welfare States and Social Inequality’, RC28 of the International Sociological Association, Oslo. Unpublished. Thurow, L. C. (1975). Generating Inequality: Mechanisms of Distribution in the U.S. Economy. New York: Basic Books. Treiman, D. J. (1977). Occupational Prestige in Comparative Perspective. New York; London: Academic Press. Treiman, D. J., and Terrell, K. (1975). The process of status attainment in the United States and Great Britain. American Journal of Sociology, 81(3), 563–583. Triventi, M., Panichella, N., Ballarino, G., Barone, C., and Bernardi, F. (2016). Education as a positional good: implications for social inequalities in educational attainment in Italy. Research in Social Stratification and Mobility, 43, 3–52. doi:10.1016/j.rssm.2015. 04.002 Ultee, W. C. (1980). Is education a positional good? An empirical examination of alternative hypotheses on the connection between
education and occupational level. Netherlands’ Journal of Sociology, 16(2), 135. UNESCO (2006). International Standard Classification of Education: ISCED 1997 (re-edition). Montreal: UNESCO Institute for Statistics. UNESCO Institute for Statistics (2012). International Standard Classification of Education – ISCED 2011. Montreal: UNESCO Institute for Statistics. UNESCO Institute for Statistics (2013). International Standard Classification of Education: Fields of Education and Training 2013 (ISCED-F 2013). Montreal: UNESCO Institute for Statistics. van de Werfhorst, H. G. (2008). Educational fields of study and (European) labor markets: introduction to a special issue. International Journal of Comparative Sociology, 49(4–5), 227–231.
van de Werfhorst, H. G., and Luijkx, R. (2010). Educational field of study and social mobility: disaggregating social origin and education. Sociology, 44(4), 695–715. Wimmer, A. (2013). Ethnic Boundary Making: Institutions, Power, Networks. New York: Oxford University Press. Wolf, C. (1997). The ISCO-88 International Standard Classification of Occupations in Cross-National Survey Research. Bulletin de Méthodologie Sociologique: BMS 54, 23–40. Wright, E. O. (1997). Class Counts: Comparative Studies in Class Analysis. Cambridge: Cambridge University Press. Zuberi, T., and Bonilla-Silva, E. (2008). White Logic, White Methods: Racism and Methodology. Lanham, MD: Rowman & Littlefield.
PART V
Sampling
21 Basics of Sampling for Survey Research

Yves Tillé and Alina Matei
INTRODUCTION

Sampling consists of selecting a subset of units or elements from a finite population for the purpose of extrapolating the results obtained on this subset to the entire population. Sampling is mainly intended to reduce survey costs. The methodology that enables us to extrapolate has been the subject of many debates and is still controversial. This chapter offers a view of the main probabilistic sampling schemes and statistical inference approaches used in survey sampling theory. The chapter is structured as follows. The next section introduces the basic concepts and notation. It also shows the difference between probabilistic and non-probabilistic samples, including a discussion of the representativeness concept. Then we introduce the concept of statistical inference and show its main approaches, the design-based and the model-based. While the focus is on the design-based approach, the model-based approach to statistical inference is also
presented. Next the two main inferential approaches are discussed from the estimation perspective, underlining the error sources. This is followed by an introduction of the main sampling schemes and a description of the cases where they may be used effectively. The chapter ends with a discussion on how to choose a sampling design and some recommended reading.
BASIC CONCEPTS

Population, Sample, Sampling Design

A finite population is a set of units of limited size, such as persons, households, companies, or establishments. The population can be described by intentional or extensional definitions. An intentional definition, also called the target population, is a semantic description of the population, e.g. the population
living in Switzerland at a particular date. An extensional definition, also called the sampling frame, is a list of all units in the population. There is always a gap between the target population and the sampling frame. For instance, for the population living in Switzerland, problems can arise because of illegal residents, residents that are temporarily abroad, or coding errors. The gap between the target population and the sampling frame is known as the coverage problem. It is possible that units that are not in the target population are in the sampling frame; this is known as overcoverage. In contrast, undercoverage means that units exist in the target population, but not in the sampling frame. In what follows, we refer to the extensional definition of the population. Consider U to be a finite population of size N. Each unit in U is identified by a label, which is an integer number between 1 and N. The population is usually described as the list of unit labels U = {1, …, k, …, N}. Subsets of units called samples are drawn from U. A sample is denoted by s. The notation s ⊂ U indicates that s is a part of U. Samples can be drawn from the population using probabilistic or non-probabilistic methods. The non-probabilistic methods are also called deterministic or purposive methods. Only the units of the sample are interviewed in the survey, whether probabilistic or deterministic methods are employed. The main goal of using a sample instead of the entire population is to reduce the survey costs. The methodology consisting of using a sample, i.e. a subset of the population, was unanimously rejected during the nineteenth century. During this period a statistical survey was considered valid only if all the units of the population had been observed. This dogma was clearly expressed in official statistics by Quetelet (1846) and remained the prominent paradigm up until the end of the nineteenth century. At the end of the nineteenth century, Anders Nicolai Kiær, director of Statistics Norway (Kiaer, 1896, 1899, 1903, 1905), proposed
the use of representative samples instead of a census to collect information on a population. This proposal led to a lively debate in the International Statistical Institute. A committee was elected to evaluate the interest of representative samples; its members were Adolph Jensen (from Denmark), Arthur Bowley (from the UK), Corrado Gini (from Italy), Lucien March (from France), Coenraad Alexander Verrijn Stuart (from Holland), and Frantz Zizek (from Germany). This committee produced a very well documented report in which two methods are clearly identified: purposive selection and random sampling. Purposive selection is now called quota sampling (see Jensen, 1926). This report recognizes the interest of using samples and marks the beginning of the research and development of survey sampling theory (for a historical review of survey research, see also de Heer et al., 1999).
Probabilistic Samples

Probabilistic samples are drawn from a given population using a sampling scheme and a sampling design. A sampling scheme is a method of selecting a sample from a population using algorithms based on random numbers. The result of applying a sampling scheme is a random sample; thus, each application can result in a different sample. A sampling design is a probability measure p(.) assigned to each possible sample drawn from the population. These probabilities are known before drawing a sample and give the chance of each sample being selected. A larger p(s) means that it is more likely that sample s will be selected from U. They are constructed in such a way that

p(s) \ge 0, \qquad \sum_{s \subset U} p(s) = 1.
The sampling theory is thus not based on methods used to select a particular population unit, but on methods used to select a set of units, the sample.
Example 1. Consider the population U = {1, 2, 3}. All possible samples without replacement drawn from U are

∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}.

Consider also the following sampling design on U:

p(∅) = 0, p({1}) = 0, p({2}) = 0, p({3}) = 0,
p({1, 2}) = 1/2, p({1, 3}) = 1/4, p({2, 3}) = 1/4, p({1, 2, 3}) = 0.

This sampling design has a fixed sample size because only the samples of size two have a non-null probability of being selected.

Since each sample s has a probability of being drawn from U, each unit k in s also has a probability of being selected. This probability is called the first-order inclusion probability of unit k and is denoted by π_k. However, unit k could be selected in different samples. Consequently, this probability is obtained by summing the probabilities of all samples that contain unit k (denoted below in the sum by s ∋ k), i.e.

\pi_k = \sum_{s \ni k} p(s). \qquad (1)
It is possible that different sampling designs defined on the same population fulfill Equation (1) and thus provide the same inclusion probabilities π_k.

Example 2. From Example 1, we have

π_1 = p({1}) + p({1, 2}) + p({1, 3}) + p({1, 2, 3}) = 0 + 1/2 + 1/4 + 0 = 3/4,
π_2 = p({2}) + p({1, 2}) + p({2, 3}) + p({1, 2, 3}) = 0 + 1/2 + 1/4 + 0 = 3/4,
π_3 = p({3}) + p({1, 3}) + p({2, 3}) + p({1, 2, 3}) = 0 + 1/4 + 1/4 + 0 = 1/2.
Similarly to the first-order inclusion probabilities, the joint or second-order inclusion probability π_{kℓ} is the probability of jointly selecting two distinct units k and ℓ in the sample. This probability is obtained by summing the probabilities of all samples that jointly contain units k and ℓ (denoted below in the sum by s ⊃ {k, ℓ}), i.e.

\pi_{k\ell} = \sum_{s \supset \{k, \ell\}} p(s). \qquad (2)

Both probabilities π_k and π_{kℓ} are very important in different estimations computed on the sample level, as shown in the section on estimation. They have to be known before the sample selection. Nevertheless, the computation of π_{kℓ} can be difficult or even impossible for some sampling designs. For such cases, the variance estimation is also difficult to compute. The section on sampling designs and schemes shows the relationship between sampling methods and inclusion probabilities.
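For small, fully enumerated designs such as the one in Example 1, Equations (1) and (2) can be evaluated directly by summing over the samples that contain the relevant units. The following Python sketch is a purely didactic illustration (real designs are never enumerated like this); it reproduces the probabilities of Example 2 and also checks that they sum to the fixed sample size n = 2, a property shown later in the section on fixed size sampling designs.

```python
from itertools import combinations

U = [1, 2, 3]
# Sampling design of Example 1: only the samples of size two have a non-null probability
design = {frozenset({1, 2}): 1/2, frozenset({1, 3}): 1/4, frozenset({2, 3}): 1/4}

# First-order inclusion probabilities, Equation (1): sum p(s) over the samples s containing k
pi = {k: sum(p for s, p in design.items() if k in s) for k in U}
print(pi)            # {1: 0.75, 2: 0.75, 3: 0.5}, as in Example 2

# Second-order inclusion probabilities, Equation (2): sum p(s) over samples containing both k and l
pi_joint = {(k, l): sum(p for s, p in design.items() if {k, l} <= s)
            for k, l in combinations(U, 2)}
print(pi_joint)      # {(1, 2): 0.5, (1, 3): 0.25, (2, 3): 0.25}

# For a fixed size design the inclusion probabilities sum to the sample size n = 2
assert abs(sum(pi.values()) - 2) < 1e-12
```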
Deterministic Samples

In many practical cases, the sample is not selected randomly, but deterministically. This kind of sample is extensively covered in Chapter 24. However, in this chapter, we consider some aspects of deterministic samples to discuss the problem of sample representativeness and the abusive use of statistical and econometric methods. A common method used to draw deterministic samples is the quota method. It consists of selecting a sample in such a way that some categories of units are represented in the same proportions in the sample and in the population. Several variants exist. For instance, the method of marginal quotas consists of satisfying the proportions of several categorical variables. With the quota method, the interviewers may freely choose the respondents provided that the quotas are satisfied. Thus, it is not possible to control the unit inclusion probabilities, because the sample is not random.
Quota samples often provide biased estimators because of the self-selection bias of the respondents. An interesting discussion of the quota method can be found in Deville (1991). Deville's conclusions are drawn with regard to official statistics and to some questionable statistical and econometric models. We share his conclusions for survey research as well, since drawing inference from quota samples is problematic. Deville's conclusions are as follows:

In a survey, the use of any speculative model represents methodological risk-taking. This may be perfectly reasonable if the users are aware of it, and if they have ratified the speculations leading to the specification of the model. This is typically what happens, at least implicitly, in marketing surveys: an organization, company, administration, or association requests a sampling survey from a polling company. A contract marks the agreement between the two parties respecting the implementation of the survey, its price, the result delivery schedule, and the methodology used. In this methodology, models are used to formalize the sampling or behavior of the population. Thus, from this point of view, the use of the quota method may be quite proper. Official statisticians, on the other hand, are responsible for generating data that can be used by the entire society; and that can be used, in particular, in the arbitration of disputes between various groups, parties, and social classes. The use of statistical models, particularly econometric models that describe the behavior of economic agents, may turn out to be very dangerous, partial, or affected by a questionable or disputed economic theory. Official statistics should not tolerate any uncontrollable bias in its products. It should carry out sample surveys using probabilistic methods. ('A Theory of Quota Surveys', p. 177)
For probabilistic sampling, the inference can be conducted with respect to the random mechanism used to select the sample (see next section). For quota samples, the selection of the sample is not random. Statistical analysis of quota samples is thus intrinsically linked to a formalization of the population and thus to population modeling.
Sample Representativeness

Sample representativeness has different meanings in survey research. According to a popularly held one, the sample is a 'mock-up' or a 'mirror image' of the population. Thus, representativity is assured by this simple relationship between sample and population. Quota sampling is often justified using this argument. In this paradigm of representativity, the statistical inference is not valid. The argument of representativity is fallacious here not only because of the self-selection bias of sample respondents, but also because it can be of interest to misrepresent some population categories in surveys. In his seminal publications, Neyman (1934, 1938) already showed the interest of overrepresenting the most dispersed population categories with the goal of obtaining more efficient estimates. Another point of view about sample representativeness is given by Davern (2008, p. 720):

A representative sample is one that has strong external validity in relationship to the target population the sample is meant to represent. As such, the findings from the survey can be generalized with confidence to the population of interest. There are many factors that affect the representativeness of a sample, but traditionally attention has been paid mostly to issues related to sample design and coverage. More recently, concerns have extended to issues related to nonresponse. When using a sample survey to make inferences about the population from which the sampled elements were drawn, researchers must judge whether the sample is actually representative of the target population. The best way of ensuring a representative sample is to (a) have a complete list (i.e. sampling frame) of all elements in the population and (b) know that each and every element (e.g. people or households) on the list has a nonzero chance (but not an equal chance) of being included in the sample; (c) gather complete data from each and every sampled element.
In this definition, sample representativeness is closely related to the notions of probabilistic sample and inclusion probability, thus reducing the gap in thinking between survey sampling practitioners and a part of the community of social
scientists. A lot of other definitions of a representative sample have been proposed (see Langel and Tillé, 2011a, for a discussion), but this term remains confusing and should be avoided in survey sampling theory.
STATISTICAL INFERENCE

Inference describes the statistical framework for extrapolating the results obtained on the selected sample to the whole population. Consider a survey item y that takes the value y_k on unit k of the population. The aim is to estimate a parameter, the basic one being the population total of the variable of interest y,

Y = \sum_{k \in U} y_k.
There are several paradigms or frameworks in survey sampling that enable us to conduct statistical inference:

1 In the design-based framework, y_k is supposed to be non-random. The only source of randomness is thus the random sample s selected using a sampling design.
2 In the model-based framework, y_k is supposed to be random. The inference is conducted conditionally on the realized random sample. This approach is equivalent to supposing that the sample s is not random. A model is then assumed on y_k. The unique source of randomness is the model. For instance, consider the following parametric linear model M: y_k = x_k'β + ε_k, where x_k is a column vector of p auxiliary variables for unit k, β is a column vector of unknown regression coefficients, and ε_k is the random error term, i.e. a random variable with null expectation that can include heteroscedasticity. The model randomness is determined by the error term.
3 In the model-assisted framework, y_k is supposed to be random but the inference is conducted on the basis of both the sampling design and the model. The aim is to conduct an inference that can be helped by modeling the population but that remains approximately unbiased with respect to the sampling design in case of model misspecification. Thus, this approach, developed by
Särndal et al. (1992), is a compromise between the first two.
The model-based and design-based approaches reflect very different views on statistical inference. A careful reading of the seminal papers of Neyman (1934, 1938) shows that at the beginning of the theory of survey sampling there was confusion between the two approaches. In the papers of Horvitz and Thompson (1952), Yates and Grundy (1953), and Sen (1953), the variable of interest is clearly considered as non-random and the only source of randomness is the sampling design. An important clarification was given by Cornfield (1944), who introduced the use of indicator random variables for the presence or absence of a unit in the sample. The use of model-based inference was proposed by Brewer (1963), but it was clearly formalized by Royall (1970a,b, 1971, 1976a,b, 1992). A complete presentation of this approach is given in Valliant et al. (2000). Mathematically, each of these approaches is correct. The choice of an approach depends on the point of view one has on the sampling problem. The design-based framework is only applicable if the sample is selected randomly. The only possible approach with a purposive selection (or quota sampling) is the model-based framework. Nevertheless, with quota sampling, it is not unusual to see publications where the inference is conducted as if the quota sample had been selected randomly. The quota sample is then considered as a simple random sample or even as a stratified sample, which is obviously misleading. The question of the inferential foundation has been the subject of many debates (Godambe, 1970; Rao, 1975; Godambe and Sprott, 1971; Cassel et al., 1993). Without venturing too far into the debate, the model-based approach implies a certain level of confidence in the population modeling. In the model-based framework, however, great consideration is given to the robustness of the estimation in the case of model misspecification.
Most official statisticians prefer the design-based approach. It is also the main inferential approach used in survey research. The use of population modeling can thus appear to be in conflict with this practice. However, modeling cannot be avoided in all estimation problems, especially in the handling of nonresponse.
ESTIMATION

Design-based Estimation

The expectation and variance of the estimators are computed differently in the two main inferential approaches. To underline this difference, in what follows we use a different notation for these measures. Let E_p(.), var_p(.), E_M(.) and var_M(.) be the expectations and variances under a given sampling design p and under an assumed model M, respectively. In the design-based framework, the Horvitz-Thompson estimator (also called the estimator by expansion or π-estimator; see Horvitz and Thompson, 1952; Sen, 1953) is defined by

\hat{Y}_\pi = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in U} \frac{y_k}{\pi_k} I_k,

where I_k is a Bernoulli random variable that takes the value 1 if unit k is selected in the random sample s and 0 otherwise. Variable I_k is a function of the random sample s. We have E_p(I_k) = π_k and E_p(I_k I_ℓ) = π_{kℓ} for k ≠ ℓ. Provided that π_k > 0 for all k ∈ U, the Horvitz-Thompson estimator is design unbiased, i.e. E_p(\hat{Y}_\pi) = Y. The proof derives from the fact that y_k is not random and E_p(I_k) = π_k for all k ∈ U. The Horvitz-Thompson estimator is the basic estimator, but it is rarely used in practice. Indeed, unbiasedness is quite a weak property. Generally, the knowledge of auxiliary information enables us to improve the accuracy of the estimations.

The variance of the Horvitz-Thompson estimator is given by

var_p(\hat{Y}_\pi) = \sum_{k \in U} \frac{y_k^2}{\pi_k^2} \pi_k (1 - \pi_k) + \sum_{k \in U} \sum_{\ell \in U, \ell \neq k} \frac{y_k}{\pi_k} \frac{y_\ell}{\pi_\ell} (\pi_{k\ell} - \pi_k \pi_\ell). \qquad (3)

This variance can be unbiasedly estimated by

\widehat{var}_p(\hat{Y}_\pi) = \sum_{k \in s} \frac{y_k^2}{\pi_k^2} (1 - \pi_k) + \sum_{k \in s} \sum_{\ell \in s, \ell \neq k} \frac{y_k}{\pi_k} \frac{y_\ell}{\pi_\ell} \frac{\pi_{k\ell} - \pi_k \pi_\ell}{\pi_{k\ell}}, \qquad (4)

provided that all π_{kℓ} > 0. However, the last expression is rarely used as given here, because simplified expressions of the estimated variance occur for the most usual sampling designs.

In many cases, not only totals must be estimated. Most of the important parameters of interest can be written as functions of totals. For instance, suppose that we want to estimate the population mean

\bar{Y} = \frac{1}{N} \sum_{k \in U} y_k = \frac{Y}{N}.

We already have an estimator of the numerator Y. The population size can be written as a total because

N = \sum_{k \in U} 1,

and thus it can be estimated by the Horvitz-Thompson estimator

\hat{N}_\pi = \sum_{k \in s} \frac{1}{\pi_k}.

The plug-in estimator (also called the substitution estimator) consists of estimating \bar{Y} by substituting the totals by their estimators, which gives

\hat{\bar{Y}} = \frac{\hat{Y}_\pi}{\hat{N}_\pi}.
Plug-in estimators are generally slightly biased. Their variances can be computed using the linearization method (Binder, 1996; Deville, 1999; Demnati and Rao, 2004). However, not all parameters can be written as functions of totals (see, for example, poverty indexes), and particular treatments must be applied to them (see for instance Osier, 2009; Langel and Tillé, 2011b, 2013).
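As a concrete illustration of the design-based estimators above, the following Python sketch computes the Horvitz-Thompson estimator of a total, a function implementing the variance estimator (4), and the plug-in estimator of the mean. The sample values and inclusion probabilities are invented for the example, and the variance function is only defined, not called, since a realistic matrix of joint inclusion probabilities would also have to be invented.

```python
import numpy as np

def ht_total(y, pi):
    """Horvitz-Thompson estimate of a population total from sample values y
    and their first-order inclusion probabilities pi."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    return np.sum(y / pi)

def ht_variance_estimate(y, pi, pi_joint):
    """Variance estimator (4); pi_joint is the n x n matrix of joint inclusion
    probabilities of the sampled units (diagonal entries are not used)."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    v = np.sum(y**2 / pi**2 * (1 - pi))
    n = len(y)
    for k in range(n):
        for l in range(n):
            if k != l:
                v += (y[k] / pi[k]) * (y[l] / pi[l]) \
                     * (pi_joint[k, l] - pi[k] * pi[l]) / pi_joint[k, l]
    return v

# Invented sample of n = 3 units with known inclusion probabilities
y = np.array([12.0, 7.0, 30.0])
pi = np.array([0.3, 0.3, 0.6])
Y_hat = ht_total(y, pi)                    # estimated total
N_hat = ht_total(np.ones_like(y), pi)      # estimated population size
print(Y_hat, N_hat, Y_hat / N_hat)         # plug-in estimate of the mean
```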
Model-based Estimation

In the model-based framework, a model is assumed on the population. The most common one is the linear model

M: y_k = x_k' \beta + \varepsilon_k,

where x_k is a column vector of p auxiliary variables for unit k, β is a column vector of the p unknown regression coefficients, and ε_k is a random variable such that E_M(ε_k) = 0, var_M(ε_k) = σ_k², and cov_M(ε_k, ε_ℓ) = 0 when k ≠ ℓ, where cov_M(ε_k, ε_ℓ) denotes the covariance between ε_k and ε_ℓ. In the model-based framework, the unknown values of y_k, k ∉ s, are predicted using the model as follows:

\hat{y}_k = x_k' \hat{\beta}, \quad \text{where} \quad \hat{\beta} = \left( \sum_{\ell \in s} \frac{x_\ell x_\ell'}{\sigma_\ell^2} \right)^{-1} \sum_{\ell \in s} \frac{x_\ell y_\ell}{\sigma_\ell^2}.

An estimator of Y uses the observed and the predicted unobserved values of y_k:

\hat{Y}_{BLU} = \sum_{k \in s} y_k + \sum_{k \notin s} \hat{y}_k = \sum_{k \in s} w_k y_k,

where

w_k = 1 + \left( \sum_{j \notin s} x_j' \right) \left( \sum_{\ell \in s} \frac{x_\ell x_\ell'}{\sigma_\ell^2} \right)^{-1} \frac{x_k}{\sigma_k^2}.

This estimator is model-unbiased in the sense where E_M(\hat{Y}_{BLU} - Y) = 0. Its sample error is defined as \hat{Y}_{BLU} - Y. Royall (1970b) has shown that \hat{Y}_{BLU} is the Best Linear Unbiased (BLU) estimator under model M (its sample error has an expectation of zero). The model mean-squared error of the estimator is

E_M(\hat{Y}_{BLU} - Y)^2 = E_M\left( \sum_{k \in s} y_k + \sum_{k \notin s} x_k' \hat{\beta} - \sum_{k \in U} y_k \right)^2
= E_M\left( \sum_{k \notin s} (x_k' \hat{\beta} - y_k) \right)^2
= E_M\left( \sum_{k \notin s} \left( x_k' (\hat{\beta} - \beta) - \varepsilon_k \right) \right)^2
= \sum_{k \notin s} x_k' \, var_M(\hat{\beta}) \sum_{\ell \notin s} x_\ell + \sum_{k \notin s} var_M(\varepsilon_k) - 2 \sum_{\ell \notin s} \sum_{k \notin s} x_k' \, cov_M(\hat{\beta}, \varepsilon_\ell).

Since

var_M(\hat{\beta}) = \left( \sum_{j \in s} \frac{x_j x_j'}{\sigma_j^2} \right)^{-1}, \quad var_M(\varepsilon_k) = \sigma_k^2, \quad \text{and} \quad cov_M(\hat{\beta}, \varepsilon_\ell) = 0 \text{ if } \ell \notin s,

we obtain

E_M(\hat{Y}_{BLU} - Y)^2 = \sum_{k \notin s} \sigma_k^2 + \sum_{k \notin s} x_k' \left( \sum_{j \in s} \frac{x_j x_j'}{\sigma_j^2} \right)^{-1} \sum_{\ell \notin s} x_\ell.

The mean-squared error of an estimator is also equal to its variance plus the squared value of its bias. Because the bias of \hat{Y}_{BLU} is 0, we obtain E_M(\hat{Y}_{BLU} - Y)^2 = var_M(\hat{Y}_{BLU}). A particular case is interesting. When x_k = 1, σ_k² = σ², and β = μ, model M reduces to a simple mean μ plus a homoscedastic error term ε_k, k ∈ U:

y_k = \mu + \varepsilon_k, \quad \text{with } E_M(\varepsilon_k) = 0 \text{ and } var_M(\varepsilon_k) = \sigma^2.
Then, we can compute

\hat{\beta} = \hat{\mu} = \frac{1}{n} \sum_{k \in s} y_k = \bar{y},

\hat{Y}_{BLU} = \sum_{k \in s} y_k + \sum_{k \notin s} \bar{y} = N \bar{y},

var_M(\hat{\beta}) = var_M(\bar{y}) = \frac{\sigma^2}{n},

and

var_M(\hat{Y}_{BLU}) = (N - n)\sigma^2 + (N - n)^2 \frac{\sigma^2}{n} = N^2 \frac{N - n}{N} \frac{\sigma^2}{n}.

This variance is very similar to the design-based variance obtained under simple random sampling presented below in Expression (5). Thus, the model-based and design-based approaches can provide similar results, even if the foundation of inference is completely different.
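The prediction form of the model-based estimator is straightforward to compute once the model is chosen. The sketch below is a minimal illustration assuming the homoscedastic case (all σ_k² equal), so that β̂ is simply the least-squares fit on the sampled units; the data are invented, and the final lines check that with x_k = 1 the estimator reduces to N ȳ, as derived above.

```python
import numpy as np

def blu_total(y_s, X_s, X_r):
    """Model-based (BLU) estimator of a total under a homoscedastic linear model:
    sum the observed y over the sample and add the model predictions x'beta_hat
    for the non-sampled units (rows of X_r)."""
    y_s = np.asarray(y_s, dtype=float)
    X_s = np.asarray(X_s, dtype=float)          # auxiliary data, sampled units
    X_r = np.asarray(X_r, dtype=float)          # auxiliary data, non-sampled units
    beta_hat, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)  # least-squares fit on s
    return y_s.sum() + (X_r @ beta_hat).sum()

# Special case x_k = 1 for every unit: the estimator reduces to N * sample mean
y_s = np.array([4.0, 6.0, 5.0])                 # invented sample values, n = 3
N = 10
print(blu_total(y_s, np.ones((3, 1)), np.ones((N - 3, 1))),  # 50.0
      N * y_s.mean())                                         # 50.0
```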
Sampling Error

In the absence of coverage problems and nonresponse, in the design-based approach the sample is the unique source of error. Sampling error refers to the variability of an estimator over samples. The precision of an estimator is related to its variance: better precision means smaller variance. Several sampling designs have been developed with the goal of improving the precision of the estimators. The next section describes different sampling designs and schemes, indicating the gain in precision from using them. On the other hand, if disturbances occur in the sampling process, such as undercoverage/overcoverage and nonresponse, coverage error and nonresponse error are added, respectively, to the sampling error, thus defining the total survey error in a complex way. For a discussion on the total survey error, see Groves (1989, p. 15). In the model-based approach, the source of error is given by the errors in modeling the
variable of interest. Thus, the bias and variance are computed with respect to the model, as described in the previous subsection. As shown in the same section, the model-based approach considers the bias and variance computation conditional on the outcome of the sample selection process (the sample is constant, not random). It follows that the source of error in the model-based approach is close to that in deterministic samples. Thus, the sample error is defined in this approach as the difference between the estimator computed on the sample and the true value of the parameter, which is, of course, unknown.
MAIN SAMPLING DESIGNS AND SCHEMES In the description of the following sampling designs, we distinguish between fixed size and random size sampling designs and between equal and unequal probability sampling designs. We also present the following main sampling designs and schemes: simple random sampling without replacement, systematic sampling, Poisson/Bernoulli sampling, conditional Poisson sampling, stratified sampling, balanced sampling, cluster sampling, and two-phase sampling.
Fixed Size Sampling Design A sampling design has a fixed sample size n if p(s) = 0 when the size of s is not equal to n. In fixed size sampling designs, the sum of the inclusion probabilities is equal to the sample size. Indeed,
∑π
k∈U
k
= ∑ E p ( I k ) = E p ∑ I k = E p ( n ) = n. k∈U k∈U
The main advantage of using fixed sample size is the ability to control the sampling cost. A fixed size design can be implemented using equal or unequal inclusion probabilities.
319
Basics of Sampling for Survey Research
Simple Random Sampling Without Replacement
where S y2 =
Simple random sampling without replacement is the simplest sampling scheme. It provides equal probability of selection to all possible samples of a given size. The number of all possible samples of size n is N! , n !( N − n)!
if the sample size of s is equal to n otherwise.
Consequently, based on Expressions (1) and (2)
πk =
n n(n − 1) , for all k ≠ ∈ U . , π k = N N ( N − 1)
Simple random sampling without replacement is usually used in cases where auxiliary information correlated to the variable of interest is not available in the sampling frame. The inclusion of auxiliary information in the sample selection process is intended to improve the estimation precision, thereby decreasing its variance. Consequently, the variance of the Horvitz-Thompson estimator in simple random sampling without replacement is usually considered as a baseline for variance estimates provided by other sampling designs. Under simple random sampling without replacement, the variance of the Horvitz-Thompson estimator given in Expression (3) reduces to
2
N − n Sy varp Yˆπ = N 2 (5) N n
( )
and the estimator of the variance given in Expression (4) reduces to 2
p Yˆ = N 2 N − n s y , var π N n
( )
Y = s y2 =
1 N
∑y ,
k∈U
k
(
)
2 1 yk − Yˆ , ∑ n − 1 k∈s
and
where n ! = 1 × 2 … × n. Their probability of selection is n !( N − n)! p(s ) = N! 0
1 ∑ ( yk − Y ) 2 , N − 1 k∈U
1 Yˆ = ∑ yk . n k∈s Simple random sampling without replacement can be implemented using different methods, all based on the use of random numbers (see Tillé, 2006, p. 45–50). A very simple method to draw a sample of size n using a computer is to generate N independent random numbers following the uniform distribution between 0 and 1; the sample will be formed by the first n units in the list of N ordered random numbers.
Unequal Probability Sampling Designs with Fixed Sample Size There are many sampling designs with unequal inclusion probabilities and fixed sample size (see for instance Brewer and Hanif, 1983, and Tillé 2006). All the sampling designs included in this class use auxiliary information to draw the samples. This auxiliary information should be well correlated with the main variable of interest y. The auxiliary information x is used in the computation of inclusion probabilities by means of the following formula:
πk =
nxk , k ∈U, ∑ x ∈U
where n is the fixed sample size and xk is the value of x on unit k. Values πk larger than 1 are fixed to 1 and the corresponding units are
320
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
provisionally retracted from the population. The values πk are again computed for the remaining units in the population using the previous formula. This algorithm stops when all πk are between 0 and 1. Sample selection is done afterwards using the values π k , k ∈ U and one or more random numbers generated by a computer. The strengths of the relationship between the variables y and x usually induce a variance reduction. Thus, sampling designs with unequal inclusion probabilities and fixed sample size usually produce less variance than simple random sampling without replacement. However, the problem of variance estimation is still complicated in these sampling designs, since they generally use second-order inclusion probabilities, which can be difficult to compute or approximate. A popular sampling design with unequal inclusion probabilities and fixed sample size is systematic sampling. This sampling design can be carried out using the following algorithm. One generates a random number u uniformly distributed between 0 and 1, and computes the quantity k
Vk = ∑ π , k ∈ U , =1
with V0 = 0 . The first selected unit in the sample will be unit k1 if Vk1 −1 ≤ u < Vk1 , the second one k2 if Vk2 −1 ≤ u + 1 < Vk2 , … , the jth one kj if Vk j −1 ≤ u + j < Vk j , and so on. The algorithm will select exactly n units, because ∑ π k = n.
k∈U
One drawback of this sampling design is the small number of possible samples compared to other sampling designs. It was shown that this number does not exceed N, the population size (Pea et al., 2007). Another drawback of systematic sampling is that most of the joint inclusion probabilities are equal to 0. Consequently, many pairs of different units are not selected in the sample. In order to try to avoid the problem of null joint inclusion probabilities, we can sort the population before selecting the sample. Indeed, with
systematic sampling, the joint inclusion probabilities depend on the order of the units in the population. When this method is applied to a population which was randomly sorted beforehand, the result is relatively similar to the conditional Poisson sampling design (see below).
Poisson Sampling Poisson sampling is an unequal sampling design, but with random sample size. Each unit is selected with inclusion probability π k . b 2 − 4 ac The values πk can be computed as shown in the previous subsection using auxiliary information. The selection of each unit is provided independently from the selection of other units. Thus, the joint inclusion probability is π k = π k π for k ≠ . The sample size is random and has a Poisson-Binomial distribution (Hodges and Le Cam, 1960). When all the πk are equal the Poisson sampling is called a Bernoulli sampling design. Poisson sampling is rarely used in practice because the sample size is random. There are, however, two important applications where this sampling design is used. The first one is in sample cooordination, which consists in increasing/decreasing the overlap of two or more samples drawn on different time occasions or simultaneously from overlapping populations. Poisson sampling provides samples that are easy to coordinate and a rotation of the units can easily be organized in repeated samplings (Brewer et al., 1972). It is actually used in the Swiss Federal Statistical Office to coordinate samples of establishments and households (Graf and Qualité, 2014). The second application is in treatment of nonresponse. Nonresponse can be viewed as a second phase of sampling where the units randomly decide to respond or not to the survey. It is assumed that each unit responds independently from the other ones. The sampling design is thus a Poisson one. In this case, however, the inclusion probabilities
321
Basics of Sampling for Survey Research
(which are in this case the response probabilities) must be estimated. Under Poisson sampling, the variance of the Horvitz-Thompson estimator has a simplified form and is easy to compute, because the units are selected independently in the sample and thus π k = π k π . It is given by y2 varp Yˆπ = ∑ k2 π k (1 − π k ), k∈U π k
( )
and can be estimated by
( )
2 k 2 k
y p Yˆ = var (1 − π k ). ∑ π k∈s π A Poisson sample can be easily drawn. One generates N independent random numbers uniformly distributed between 0 and 1 : u1 ,..., uN . One selects the unit k in the sample if uk < π k .
is useful in the construction of confidence inter vals (Berger, 1998).
When the first-order inclusion probabilities are equal, conditional Poisson sampling reduces to simple random sampling without replacement.
Stratified Sampling Stratification is a very simple idea. The population is split into H non-overlapping subsets called strata U h , h = 1, … , H . Next, a random sample is selected in each stratum independently from the other strata. The final sample is the union of the samples drawn in each stratum. Suppose that the goal is to estimate the population total Y = ∑ yk . Let Nh be k∈U
Conditional Poisson Sampling Conditional Poisson sampling or fixed size maximum entropy sampling design is a fixed size sampling design with unequal probabilities (for an overview, see Tillé, 2006, p. 79). Its name is due to one of the methods to implement it: one draws Poisson samples until the sample size equals n. Other methods to draw a sample are available in the sampling literature. The inclusion probabilities provided by this sampling design are different from those used in the sample selection. However, they can be computed using different methods. Conditional Poisson sampling designs has interesting properties: • it maximizes the entropy in the class of sampling designs with the same first-order inclusion prob abilities (see next section for a discussion on the entropy); • the variance estimation can be approximated using only the first-order inclusion probabilities (without using the second-order inclusion prob abilities); • the convergence of the Horvitz-Thompson esti mator to the normal distribution is higher than in the case of other sampling designs; this property
the population size in stratum Uh and sh be the random sample of size nh selected in Uh by means of simple random sampling without replacement. The inclusion probability of unit k ∈ U h is π k = nh /N h . The HorvitzThompson estimator is H H N Yˆπ = ∑ ∑ h yk = ∑ N hYˆh , h =1 h =1 k∈sh nh
where 1 Yˆh = nh
∑y
k∈sh
k
is the unbiased estimator of Yh =
1 Nh
∑y.
k∈Uh
k
The variance of Yˆπ can easily computed b 2 −be4 ac because the samples are drawn independently in each stratum
where
( )
H
( )
varp Yˆπ = ∑ N h2 varp Yˆh h =1 H
= ∑ N h2 h =1
N h − nh Sh2 , N h nh
(6)
322
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
Sh2 =
2 1 yk − Yh ) . (7) ( ∑ N h − 1 k∈Uh
It can be shown that the precision of the stratified estimator of the population total is improved compared to the estimator computed using simple random sampling without replacement when the variable of interest y is very homogeneous in the strata. There exist mainly two ways of defining the sample sizes in the strata: 1 Proportional allocation consists of selecting nh = nNh /N units in stratum Uh. Obviously nNh /N is not necessarily an integer and must be rounded to an adjacent integer value. If nh = nNh /N, then the first-order inclusion prob ability is equal to n/N for all k ∈U . This sampling design thus provides equal inclusion probabili ties. Each stratum has the same proportion as in the population nh /n = Nh /N, which is similar to the idea of quota sampling. In stratification with proportional allocation, the variance given in Expression (6) simplifies to H
( )
varp Yˆ π = N
2
NS N −n ∑ h=1
N
2 h h
nN
. (8)
When we compare (8) to (5), we can show that
∑
H h=1
Nh Sh2 is almost always less than Sy2 .
Thus, stratification with proportional allocation is preferable to simple random sampling without replacement, particularly when the strata are very homogeneous with respect of y. 2 Optimal allocation is due to Neyman (1934). It consists of searching the allocation n1, ... , nh , ... , nH that minimize the variance of the stratified design given in Expression (6) subject to a fixed sample size H
n = ∑ nh . h=1
After solving this optimization problem, one obtains
nh =
nNh Sh
∑
H i =1
nNi Si
,
(9)
where Sh is the standard deviation of y computed in stratum Uh and given by taking the square root in Expression (7). This result shows that the simple idea of representativity is fallacious and that the strata with a large dispersion must be overrepresented in the sample. Sometimes, Expression (9) can result in nh > Nh, which is not possible. In this case, all the units of these strata are selected (they are called take-all strata) and the optimal allocation is recomputed on the remaining strata. To apply the optimal allocation we must know Sh or an estimate of Sh computed on a pilot survey. Thus, a prior knowledge of Sh is necessary before applying this method.
Balanced Sampling The idea of balanced sampling regroup a large set of methods that restore some population characteristics to the sample. Intuitively, a balanced sampling design provides samples that have sample means that are equal to the population means for some auxiliary variables that are known from external sources. This idea is very old. It was studied by Gini and Galvani (1929), Thionet (1953, pp. 203–207), Yates (1946, 1949) and (Hájek, 1981, p. 157). Some rejective methods have been proposed in the sampling literature. They consist of repeating the sample selection until obtaining one whose balancing constraints are satisfied. These methods are not efficient because they are slow and do not respect the inclusion probabilities πk . Indeed, the units with values far from the mean ultimately have inclusion probabilities that are smaller than expected, which can be problematic if one wants to estimate quantiles or dispersions (see the simulations in Legg and Yu, 2010). Consider that we know the values of p auxiliary variables for all the population units. Let x k be the column vector of the p values taken by these variables on unit k ∈ U. The values of x k are known for all units k ∈ U from external sources, like a population register. The aim is to estimate the population total Y. A balanced sampling design randomly selects a random sample s such that
323
Basics of Sampling for Survey Research
xk
∑π k ∈s
k
≈ ∑ xk . k ∈U
That means that the Horvitz-Thompson estimator of the auxiliary variables is equal or almost equal to the true population values. The sample s is called a balanced sample. Due to numerical problems a strict equality is often not possible. A balanced sampling design can be selected with equal or unequal inclusion probabilities πk. The cube method proposed by Deville and Tillé (2004) enables the selection of balanced samples with equal or unequal inclusion probabilities and avoids a purely rejective implementation. Simple random sampling without replacement, fixed size unequal probability sampling, stratified sampling, and Poisson sampling are particular cases of the cube method.
Cluster Sampling In cluster sampling, the population is divided into several clusters. The clusters are homogeneous on particular survey variables. The sampling units are now the clusters, and no longer the population units. Several stages of clusters can be used. The sample is formed by all units nested in the selected clusters. Cluster sampling can be implemented using equal or unequal probabilities. Different types of cluster sampling are available: • One-stage cluster sampling: a single stage of clusters is used (called primary sampling units) and these are involved in the sample selection; a typical application is those of students nested in classrooms and one selects classrooms and all students inside of them to estimate the mean student competence. • Two-stage cluster sampling: one assumes the existence of primary sampling units and second ary sampling units. A sample is selected in two steps. First, a sample of primary sampling units is selected; from this sample a new sample of second sampling units is drawn. The sample
including all the units nested in the second selected sampling units is the final one. For instance, we suppose that the survey goal is to sample households in order to estimate the population mean income. The households are nested in blocks. The primary sampling units could be the neighborhoods, while the secondary sampling units are the blocks. Inside the selected blocks, all households are interviewed. • Multi-stage cluster sampling: one assumes sev eral stages of clusters, primary, secondary, … , ultimate sampling units.
Cluster sampling is used for efficiency costs (for instance, when the population units are geographically spread and it takes time and money to sample using simple random sampling). It is also applied when it is difficult to construct the sampling frame (for instance, when we do not have a list of people living in households, but it is easy to construct the sampling frame of their households). While it is very convenient to use cluster sampling, there is, however, a drawback: the precision of the estimates in cluster sampling can be lower than in the case of using element samples of the same size. For example, assume a one-stage cluster sampling with simple random sampling without replacement of size m and clusters of equal size. To measure what we call the ‘cluster effect’ one considers the intercluster correlation, given by the covariance among the pairs of individuals of the same cluster Ci , i = 1, … , M for the variable of interest y M
∑ ∑ ∑ (y ρ=
1 N −1
i =1 k∈Ci ∈Ci ≠k
− Y )( yi , − Y )
i,k
M
∑ ∑ (y i =1 k∈Ci
i,k
, −Y )
2
where M is the number of clusters in the population, N = N /M , n is the size of each cluster, and yi , k is the value of y taken on unit k from the ith cluster, i = 1, … , M One can show that (without considering the sampling rate m/M)
324
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
( )
varp Yˆ = N 2
S y2 mn
(1 + ρ (n − 1)), (10)
where Yˆ = ∑ yk / (m /M ) . Compared to the k∈s
variance given by simple random sampling without replacement for n = mn (without considering the sampling rate n/N; see Expression (5)), varp Yˆ given in Expression (10) differs by 1 + ρ (n − 1) . Thus, the cluster effect can increase the variance, especially if ρ > 0 and n is large.
are the administrative units (the ‘communes’) and the second sampling units are the households. Complex sampling designs are sometimes employed to simplify the sampling process. In such cases, it is not sure that the precision of estimators is improved compared to a simpler sampling scheme.
( )
Two-phase Sampling It was mentioned previously that auxiliary information can improve the precision of an estimator. Suppose that auxiliary information is not available at the moment of sampling. It is possible, however, to reveal it using twophase sampling. In the first phase of this sampling scheme, one selects a sample of n1 units from the population and collects the auxiliary information on the sampled units. Thus, in the second phase, this information is available for the n1 selected units. Next, a second sample is selected from the first one with a size smaller than n1 . On the second sample, the auxiliary information can be used for estimation purposes. An important application of two-phase sampling is in the handling of nonresponse. As also shown above, the set of survey respondents can be considered as a second phase of sampling. The problem consists then of estimating the response probabilities.
Complex Surveys Large-scale surveys can combine several types of sampling presented earlier. For example, a household survey conducted in Switzerland can use a stratification of cantons. In each canton, a two-stage cluster can be applied, where the primary sampling units
THE CHOICE OF SAMPLING DESIGN The choice of sampling design is a delicate operation that requires a good knowledge of survey sampling theory, but also a clear definition of the population, of the parameters to be estimated and of the available auxiliary information. Several major ideas are generally used: 1 The sampling should be of fixed size. One can impose a fixed sample size or a fixed sample size in population categories. The use of unequal inclusion probabilities proportional to auxiliary variable xk > 0 is appropriate when xk is approxi mately proportional to yk. In this case, it can be shown that the variance of the Horvitz-Thompson estimator is considerably reduced. The use of a fixed sample size enables us to control the cost of the survey. Moreover, a fixed sample size generally improves the precision of the HorvitzThompson estimator compared to designs with random sample sizes. 2 Without prior knowledge on the population, the sample should be as random as possible. The random selection ensures impartiality, but is also useful for the variance estimation and for the determination of asymptotic results of estima tors needed for inference. The usual measure of randomness of a random variable is the Shannon entropy. It can also be applied to measure the randomness of a sampling design p: H( p) = − ∑ p(s )log p( s ). s⊂ ∪
Large entropy of a sampling design implies more randomness in sample selection. A list of main sampling designs which maximize the entropy in the class sampling designs with the same first-order inclusion probabilities is presented in
325
Basics of Sampling for Survey Research
Table 21.1. These designs are Bernoulli sampling, Poisson sampling, simple random sampling with out replacement and conditional Poisson sam pling. For more details on the entropy of sampling designs, see among others Grafström (2010). 3 The sampling design should be submitted to constraints other than fixed sample size. More generally, one can impose that the sample is bal anced on auxiliary variables that are known in the entire population. Thus, a balanced sampling should be used. 4 First-order inclusion probabilities should be opti mized in such a way as to minimize the variance of estimators. The first result concerning this topic is the optimal allocation proposed by Neyman (1934). Neyman’s result has been gen eralized (see among others Nedyalkova and Tillé, 2008), by overrepresenting the more dispersed units in order to reduce the variance. Thus, the sampling error is minimized. 5 Several stages can be used. Multi-stage sampling is an efficient way of reducing the travel costs of interviewers and thus the survey cost.
The main difficulty consists in simultaneously applying these principles. For instance, one might want to select a sample with fixed sample size, maximum entropy and unequal inclusion probabilities, thus using conditional Poisson sampling. This list of principles contradicts several preconceived notions of how samples should be selected. The most important one is the idea of representative samples according to which a good sample should replicate the population category proportions. Another false conception is the idea of biased or unbiased sample. Indeed, in statistics, unbiasedness is a property applied to an estimator. An estimator is unbiased for a parameter if its expectation is equal to the parameter. A sample cannot be biased or unbiased. A sample is just a tool to enable more efficient estimations of the parameters and can be a distorted picture of the population. However, some sampling designs can be problematic when certain first-order inclusion probabilities are null, which means that some units cannot be selected in the sample. One generally refers to this situation as a coverage problem and not as a biased sample.
Table 21.1. Main sampling designs with maximum entropy in the class of sampling designs with the same first-order inclusion probabilities
Random sample size Fixed sample size
Equal inclusion probabilities
Unequal inclusion probabilities
Bernoulli sampling Simple random sampling without replacement
Poisson sampling Conditional Poisson sampling
The choice of a sampling design is particularly difficult because it implies a prior knowledge of the population in order to optimize the design. The population knowledge can be formalized using a theoretical model, showing the relationship between the auxiliary variables and variable(s) of interest. A recent work dedicated to this question is Nedyalkova and Tillé (2008). These authors show that if the variable of interest can be modeled by means of a linear regression using the auxiliary variables as predictors, the sample must be balanced on these auxiliary variables. Moreover, the first-order inclusion probabilities must be proportional to the standard deviation of the error terms of the model. In this case, the model-based and design-based strategies are often the same and the Horvitz-Thompson estimator is equal to the BLU estimator. Apart from the prior knowledge of the population, the success of a survey is also assured by a good knowledge of the response process and of the context (international, national or regional), involving thus the participation of an important number of experts from different domains.
RECOMMENDED READING We give below a list of recommended books covering the contents of this chapter:
326
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
•• for an overview of sampling methods and esti mation: Cochran (1977), Lohr (2010), Valliant et al. (2013), •• for unequal probability sampling designs: Brewer and Hanif (1983), Tillé (2006), •• for the design-based approach: Cochran (1977), •• for the model-based approach: Valliant et al. (2000), Chambers and Clark (2012), •• for the model-assisted approach: Särndal et al. (1992).
Cornfield, J. (1944). On samples from finite populations. Journal of the American Statistical Association, 39:236–239. Davern, M. E. (2008). Representative sample. In Lavrakas, P. J., editor, Encyclopedia of Survey Research Methods, pages 720–723. Washington DC: SAGE Publications, Inc. de Heer, W., de Leeuw, E. D., and van der Zouwen, J. (1999). Methodological issues in survey research: A historical review. Bulletin de méthodologie sociologique, 64:25–48. Demnati, A. and Rao, J. N. K. (2004). Linearization variance estimators for survey data ACKNOWLEDGEMENTS (with discussion). Survey Methodology, 30:17–34. Deville, J.-C. (1991). A theory of quota surveys. The authors thank Dominique Joye for positive Survey Methodology, 17:163–181. and constructive comments that allowed them to improve the quality of this chapter. Deville, J.-C. (1999). Variance estimation for complex statistics and estimators: Linearization and residual techniques. Survey Methodology, 25:193–204. REFERENCES Deville, J.-C. and Tillé, Y. (2004). Efficient balanced sampling: The cube method. Biometrika, 91:893–912. Berger, Y. G. (1998). Rate of convergence for asymptotic variance for the Horvitz- Gini, C. and Galvani, L. (1929). Di una applicazione del metodo rappresentativo al censiThompson estimator. Journal of Statistical mento italiano della popolazione 1. dic. Planning and Inference, 74:149–168. 1921. Annali di Statistica, Series 6, Binder, D. A. (1996). Linearization methods for 4:1–107. single phase and two-phase samples: A cookbook approach. Survey Methodology, Godambe, V. P. (1970). Foundations in survey 22:17–22. sampling. The American Statistician, Brewer, K. R. W. (1963). Ratio estimation in 24:33–38. finite populations: Some results deductible Godambe, V. P. and Sprott, D. A., editors from the assumption of an underlying sto(1971). Foundations of Statistical Inference. chastic process. Australian Journal of StatisHolt, Rinehart et Winston, Toronto. tics, 5:93–105. Graf, E. and Qualité, L. (2014). Sondage dans Brewer, K. R. W., Early, L. J., and Joyce, S. F. un registre de population et de ménages en (1972). Selecting several samples from a Suisse: coordination d’échantillons, pondérasingle population. Australian Journal of tion et imputation. Technical report, UniverStatistics, 14:231–239. sité de Neuchâtel et Office Fédéral de la Brewer, K. R. W. and Hanif, M. (1983). SamStatistique, Suisse. pling with Unequal Probabilities. Springer, Grafström, A. (2010). Entropy of unequal probNew York. ability sampling designs. Statistical MethodCassel, C.-M., Särndal, C.-E., and Wretman, J. H. ology, 7(2):84–97. (1993). Foundations of Inference in Survey Groves, R. M. (1989). Survey Errors and Survey Sampling. Wiley, New York. Costs. Wiley, New York. Chambers, R. L. and Clark, R. G. (2012). Hájek, J. (1981). Sampling from a Finite PopulaModel-Based Methods for Sample Survey tion. Marcel Dekker, New York. Design and Estimation. New York: Oxford Hodges, J. L. J. and Le Cam, L. (1960). The PoisUniversity Press. son approximation to the Poisson binomial Cochran, W. G. (1977). Sampling Techniques. distribution. Annals of Mathematical StatisWiley, New York. tics, 31:737–740.
Basics of Sampling for Survey Research
Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663–685. Jensen, A. (1926). Report on the representative method in statistics. Bulletin of the International Statistical Institute, 22:359–380. Kiaer, A. (1896). Observations et expériences concernant des dénombrements représentatifs. Bulletin de l’Institut International de Statistique, 9:176–183. Kiaer, A. (1899). Sur les méthodes représentatives ou typologiques appliquées à la statistique. Bulletin de l’Institut International de Statistique, 11:180–185. Kiaer, A. (1903). Sur les méthodes représentatives ou typologiques. Bulletin de l’Institut International de Statistique, 13:66–78. Kiaer, A. (1905). Discours sans intitulé sur la méthode représentative. Bulletin de l’Institut International de Statistique, 14:119–134. Langel, M. and Tillé, Y. (2011a). Corrado Gini, a pioneer in balanced sampling and inequality theory. Metron, 69:45–65. Langel, M. and Tillé, Y. (2011b). Statistical inference for the quintile share ratio. Journal of Statistical Planning and Inference, 141:2976–2985. Langel, M. and Tillé, Y. (2013). Variance estimation of the Gini index: Revisiting a result several times published. Journal of the Royal Statistical Society, A176(2):521–540. Legg, J. C. and Yu, C. L. (2010). Comparison of sample set restriction procedures. Survey Methodology, 36:69–79. Lohr, S. L. (2010). Sampling: Design and Analysis. Belmont, CA: Duxbury Press, second edition. Nedyalkova, D. and Tillé, Y. (2008). Optimal sampling and estimation strategies under linear model. Biometrika, 95:521–537. Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97:558–606. Neyman, J. (1938). Contribution to the theory of sampling human population. Journal of the American Statistical Association, 33:101–116. Osier, G. (2009). Variance estimation for complex indicators of poverty and inequality
327
using linearization techniques. Survey Research Methods, 3:167–195. Pea, J., Qualité, L., and Tillé, Y. (2007). Systematic sampling is a minimal support design. Computational Statistics & Data Analysis, 51:5591–5602. Quetelet, A. (1846). Lettres à S. A. R. le Duc régnant de Saxe-Cobourg et Gotha, sur la théorie des probabilités appliquées aux sciences morales et politiques. M. Hayez, Bruxelles. Rao, J. N. K. (1975). On the foundations of survey sampling. In Srivastava, J. N., editor, Statistical Design and Linear Models, pages 489–506. Elsevier/North-Holland, Amsterdam. Royall, R. M. (1970a). Finite population sampling - On labels in estimation. Annals of Mathematical Statistics, 41:1774–1779. Royall, R. M. (1970b). On finite population sampling theory under certain linear regression models. Biometrika, 57:377–387. Royall, R. M. (1971). Linear regression models in finite population sampling theory. In Godambe, V. P. and Sprott, D. A., editors, Foundations of Statistical Inference, pages 259–279, Holt, Rinehart et Winston, Toronto. Royall, R. M. (1976a). Likelihood functions in finite population sampling theory. Biometrika, 63:605–614. Royall, R. M. (1976b). The linear least squares prediction approach to two-stage sampling. Journal of the American Statistical Association, 71:657–664. Royall, R. M. (1992). The model based (prediction) approach to finite population sampling theory. In Ghosh, M. and Pathak, P. K., editors, Current Issues in Statistical Inference: Essays in honor of D. Basu, volume 17 of Lecture Notes-Monograph Series, pages 225–240. Institute of Mathematical Statistics. Särndal, C.-E., Swensson, B., and Wretman, J. H. (1992). Model Assisted Survey Sampling. Springer, New York. Sen, A. R. (1953). On the estimate of the variance in sampling with varying probabilities. Journal of the Indian Society of Agricultural Statistics, 5:119–127. Thionet, P. (1953). La théorie des sondages. Institut National de la Statistique et des
328
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
Études Économiques, Études théoriques vol. 5, Imprimerie nationale, Paris. Tillé, Y. (2006). Sampling Algorithms. Springer, New York. Valliant, R., Dorfman, A. H., and Royall, R. M. (2000). Finite Population Sampling and Inference: A Prediction Approach. Wiley, New York. Valliant, R., Dever, J. A., and Kreuter, F. (2013). Practical Tools for Designing and Weighting Survey Samples. Springer, New York.
Yates, F. (1946). A review of recent statistical developments in sampling and sampling surveys. Journal of the Royal Statistical Society, A 109:12–43. Yates, F. (1949). Sampling Methods for Censuses and Surveys. Griffin, London. Yates, F. and Grundy, P. M. (1953). Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society, B 15:235–261.
22 Non-probability Sampling V a s j a V e h o v a r , V e r a To e p o e l a n d Stephanie Steinmetz
A sample is a subset of a population and we survey the units from the sample with the aim to learn about the entire population. However, the sampling theory was basically developed for probability sampling, where all units in the population have known and positive probabilities of inclusion. This definition implicitly involves randomization, which is a process resembling lottery drawing, where the units are selected according to their inclusion probabilities. In probability sampling the randomized selection is used instead of arbitrary or purposive sample selection of the researcher, or, instead of various self-selection processes run by respondents. Within this context, the notion of non-probability sampling denotes the absence of probability sampling mechanism. In this chapter we first reflect on the practice of non-probability samples. Second, we introduce probability sampling principles and observe their approximate usage in the non-probability setting and we also discuss some other strategies. Third, we provide a
closer look at two contemporary – and perhaps also the most exposed – aspects of non-probability sampling: online panels and weighting. Finally, we summarize recommendations for deciding on probability–nonprobability sampling dilemmas and provide concluding remarks.
TYPOLOGY, PRACTICE AND PROBLEMS OF NON-PROBABILITY SAMPLES We initially defined non-probability sampling as a deviation from probability sampling principles. This usually means that units are included with unknown probabilities, or, that some of these probabilities are known to be zero. There are countless examples of such deviations. The most typical ones received their own labels and we present below a non-exhaustive list of illustrations:
330
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
•• Purposive sampling (also judgemental sampling) – the selection follows some judgement or arbi trary ideas of the researchers looking for a kind of ‘representative’ sample, or, researchers may even explicitly seek for diversity (deviant case samplings); sometimes units are added sequen tially until researchers satisfy some criteria. •• Expert selection – subject experts pick the units, for example when, say, only two most typical settlements are to be selected from a region or when the units with highest level of certain char acteristics are selected (modal instance sampling). •• Case study – an extreme example of expert selec tion where the selected unit is then profoundly studied, often with qualitative techniques. •• Convenience sampling – the prevailing nonprobability approach where units at hand are selected; the notion roughly overlaps also with accidental, availability, opportunity, haphazard or unrestricted sampling; most typical formats are recruitment at events (e.g. sports, music, etc.) and other locations (e.g. mall intercept, where customers are approached in shopping malls, street recruitment, where people on the street are invited into a survey) or at virtual venues (e.g. web intercept, where the web visitors are exposed to pop-up invitations); particularly fre quent is also the recruitment of social ties (e.g. friends, colleagues and other acquaintances). •• Volunteer sampling – a type of convenience sam pling, where the decision to participate strongly relies on respondents due to the non-individualized nature of invitations (e.g. general plea for par ticipation appears in media, posters, leaflets, web, etc.). •• Mail-in surveys – a type of volunteer sampling with paper-and-pencil questionnaires, which are exposed as leaflets at public locations (e.g. hotels, restaurants, administration offices) or enclosed to magazines, journals, newspapers, etc. •• Tele-voting (or SMS voting) – a type of vol unteer sampling where people are invited to express their vote by calling-in some (toll-free or chargeable) number or by sending an SMS; most typically this is related to TV shows (particularly entertainment and reality) and various contests (music, talents, performance), but also to politi cal polling. •• Self-selection in web surveys – a type of volunteer sampling, where a general invitation is posted on the web, such as invitations on web pages (particularly on social media), question-of-the-day
polls and invitations to online panel, which all is typically done via banner recruitment; a consider able overlap exists here also with the notion of river sampling, where users are recruited with various approaches while surfing on the web. •• Network sampling – convenience sample where some units form the starting ‘seeds’, which then sequentially lead to additional units selected from their network; a specific subtype is snowball sampling, where the number of included network ties at each step is further specified. •• Quota sampling – convenience sample can be ‘improved’ with some socio-demographic quotas (e.g. region, gender, age) in order to reflect the population.
In addition, we may talk about non-probability sampling also when we have probability sampling, but: (a) we previously omitted some (non-negligible) segments of the target population, so that the inclusion probabilities for corresponding units are zero (i.e. we have non-coverage); (b) we face very low response rates, which – among others – occurs with almost any traditional public media recruitment, including the corre sponding advertising, from TV and radio spots to printed ads and out-of-home-advertising, i.e. OOH media, being digital (e.g. media screens in public places) or non-digital (e.g. billboards).
The bordering line when a certain part of the population is non-negligible (a) or when the response rate is very low (b) is a practical question, which is conceptually much blurred, disputable and depends on numerous circumstances. In particular, it depends on whether we know the corresponding missing data mechanism and control the probabilities of inclusion, as well as on whether the missing segments substantially differ in their characteristics. The opposite is also true: whenever we have low non-coverage and high response rates – especially with relatively small populations (e.g. hotel visitors, attendants of an event) – some of the above types of nonprobability sampling can turn to probability sampling, particularly in the case of volunteer samples (e.g. mail-in, web self-selection).
Non-probability Sampling
331
There are countless examples of disasters, where results from non-probability sampling were painfully wrong. There also exist some historical well-documented evidences, which have already become textbook cases. One is related to early purposive sampling attempts at the end of the nineteenth century from official statistics in Norway (Kiaer, 1997 [1897]), which ended in serious errors. Two cases from the US elections have also become popular warning against deviations from probability sampling. The first one is the 1936 Literary Digest Poll, where around two million mailin polls failed to predict the winner in the US presidential election. The failure was basically due to the fact that disproportionally more republicans were in the sampling frame. The second example refers to the US 1948 election prediction failure, which was – among other reasons – due to a lack of sufficient randomization in quota sampling. We should underline that wrong results based on non-probability samples – including the above three examples – are typically linked to situations where the standard statistical inference assuming probability sampling was used with non-probability samples. For this reason we first overview the probability sampling principles and their approximate implementation in non-probability settings.
was huge cost saving, because the characteristics (e.g. attitudes) of a population with millions of units could be measured with a sample of just a few hundreds. Correspondingly, the sampling practice expanded enormously in the commercial and public sectors.
PRINCIPLES OF PROBABILITY SAMPLING1
Statistical Inference Principles
It was only when the corresponding sampling theory was implemented in the 1930s that a major breakthrough in the development of modern surveys occurred. We should also recall that only a few decades before, at the end of the nineteenth century, the statistical profession (i.e. the International Statistical Institute) rejected the practice from the Norwegian statistical office (Kiaer, 1997 [1897]) of using parts of the population to infer about the entire population. One direct consequence of the changes from the 1930s
Sampling Design Principles We can select a probability sample using a rich array of sampling techniques. Most textbooks start with a simple random sample (SRS), which fully reflects lottery drawing. In SRS, all units in the population have the same probability of selection, but also all samples of a certain size have the same probability to appear. We can also use complex designs with strata, clusters, stages, phases etc. The general overview on probability sampling designs is provided already in this Handbook (see Tillé and Matei, Chapter 21). For further reading we also point to general survey (e.g. Bethlehem, 2009; Fowler, 2014; Groves et al., 2009) or statistical textbooks (e.g. Agresti and Frankin, 2013; Healey, 2011). Rich sampling literature also exists, either as a general introduction (e.g. Kalton, 1983; Lohr, 2010) or as an advanced statistical treatment (Särndal et al., 2003).
When we infer from a probability sample to the entire population, we face risks that due to randomization the selected sample is by chance different from the population. For example, it is theoretically possible, although extremely unlikely, that we drew a SRS of the general population of a country (where men have roughly a 50% share), which contains only men, but no women. The advantage of probability sampling is that this risk (e.g. selecting such extreme samples) can be quantified, which is then often evaluated at a level of 5% using so-called confidence intervals. A risk of 5% means that the corresponding
332
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
confidence interval contains the population value in 19 out of 20 randomly selected samples of certain type. The width of the confidence interval communicates the precision of the estimate, which strongly depends on the sample size; larger samples produce more precise estimates. Another advantage of probability samples – which is of extreme practical value – is that the confidence interval can be estimated from the sample itself. However, the construction of the confidence interval is only one aspect of the process of statistical inference, which is a set of procedures for generalizing results from a sample to a population, but it also deals with other inferential diagnostics, as well as with statistical hypothesis testing. Various inferential approaches exist. We referred here only to the most popular frequentist one, which (a) takes into account the sampling design, (b) assumes that unknown population values are fixed and (c) builds on a sampling distribution of the estimates across all possible samples. Other inferential approaches can differ in one or more of the above-mentioned aspects; however, for the majority of practical situations they provide very similar results. More differences between approaches may appear with unusual distributions and with specific modelling requests. The related inferential procedures can be very simple, as in the case of estimating the sample mean in a SRS setting, while the estimation of some other parameters (e.g. ratios, various coefficients, etc.) in complex samples (e.g. with stages, strata, phases, etc.) or with complicated estimators (e.g. regression estimator) may require additional statistical resources.
APPROACHES AND STRATEGIES IN NON-PROBABILITY SAMPLING We now turn to the non-probability context, which is, in principle, very problematic,
because sampling theory was developed for probability sampling. Consequently, we first discuss approaches where probability sampling principles are used in nonprobability settings, although – due to unfulfilled assumptions (i.e. lack of known probabilities) – formally this could (and should) not be done. For this reason we will call it as an approximate usage of probability sampling principles. Next, we provide an overview of some other strategies, developed to deal specifically with non-probability samples.
Approximations of Standard Probability Sampling Principles The approximate usage of probability sampling in a non-probability setting is sometimes understood in a sense that we first introduce certain modelling assumptions, e.g. we assume that there is actually some randomization in the non-probability sample. We will discuss these assumptions and corresponding implementations separately for sampling design and for statistical inference.
Approximations in Sampling Design The very basic idea here is to approximate and resemble, as much as possible, the probability sampling selection of the units. This will then hopefully – but with no guarantee – also assure other features of probability sampling, particularly the randomization. This goal can be achieved indirectly or directly. Indirect measures to approximate probability sample designs The most essential recommendation is to spread the non-probability sample as broadly as possible. In practice this predominantly means we need to combine various recruitment channels, which may appear also as media planning strategies. A successful example is the WageIndicator survey (see WageIndicator.org), implemented in over
Non-probability Sampling
85 countries worldwide, which uses an elaborated recruitment strategy by posting invitations to their volunteer web survey across a wide range of websites (Steinmetz et al., 2013). With this strategy they approximate the randomization spread of recruited units. Similarly, the Vehovar et al. (2012) study of online hate speech, based on a non-probability web survey, successfully applied well-spread recruitment procedures, ranging from offline strategies (e.g. delivering leaflets with survey URL at public places) to systematic publishing of the invitation posts across numerous online forums (covering a broad array of topics, from cooking to cycling). The systematic attempt to include a variety of recruitment channels assured in this study a considerable level of spread, which approximated a good level of randomization. Of course, there was no guarantee in advance that this approach would work, but the results showed that corresponding estimates were very close to those from a control probability sample. Another indirect strategy is to shape nonprobability samples, so they would reflect, as much as possible, the structure of the survey population. If possible, of course, the match need to be established for control variables, which are related to the target ones. Quota sampling is a typical approach here, usually based on controls for socio-demographics (e.g. gender, age, education, region, etc.). The quotas can be applied immediately at the recruitment or at the screening phase of a survey. For example, if we use gender quotas, this implies that once we have reached the quota for women, we then reject to further survey any women. Alternatively, we may also intensify the recruitment channels, which attract more men (e.g. advertising on sports media). In the case of a sampling frame, where we have additional information about the units (e.g. as in online panels), we can use this same strategy already at sample selection. A fine selection can be then used, not only according to the target population structure of socio-demographics (age, gender, education,
333
region, etc.), but also according to attitudes, media consumptions, purchase behaviour, etc. Sophisticated algorithms can be used to select an optimal sample for a certain purpose. An extreme case is individual matching, which is fully effective only when we have large databases of units available (as in online panels), where we first design a ‘normal’ sample from some ideal or high-quality control source. We then search for units in our database, using some distance measures (e.g. nearest neighbour), which most closely resemble the ideally desired units (Rivers and Bailey, 2009). Direct incorporation of probability sampling design principles Whenever possible, we can also directly include probability sampling principles. Sometimes components of probability sample selection can be incorporated already into the early steps of the sampling process. For instance, in probability sampling with quotas we may first select a probability sample of schools, where we then recruit pupils with unrestricted banner invitations on the web according to some quotas (e.g. age, gender, year of schooling). Another example is a direct inclusion of randomization, which can be incorporated into various ‘random walk’ selections related to the recruitment across streets and buildings, or in randomization across time (e.g. days, hours). In specific types of convenience sampling (mall intercept, street recruitment, invitation on the web) we can introduce randomization also by including every ith (e.g. every tenth) passenger, customer or visitor. Of course, none of the above-described approaches, direct or indirect, assures with some known accuracy that the corresponding statistical inference will be equivalent to the situation with probability samples. However, it is also true that these approximations usually contribute to certain improvements, while they rarely cause any serious damage. In the meanwhile, practitioners have developed elaborated skills, which often work well
334
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
for their specific circumstances, research problems and target variables. Again, there is no formal guarantee (in advance) that their solutions will also work in modified circumstances or in some other situations.
Approximations in Statistical Inference As already mentioned, with non-probability samples, by definition, the inclusion probabilities are unknown or zero, so without further assumptions this very fact formally prevents any statistical inference calculations (e.g. estimates, variances, confidence intervals, hypothesis testing, etc.). Uncontrolled inclusion probabilities can also introduce a certain selection bias, for which Bethlehem and Biffignandi (2012: 311) showed that it depends on (a) the variability of inclusion probabilities and on (b) the correlation between target variable and inclusion probabilities. For example, the Internetsavvy respondents typically participate more frequently in self-selected web surveys. As they differ from other Internet users not only in socio-demographics (e.g. younger) but also in web skills or altruism (Couper et al., 2007; Fricker, 2008; Malhotra and Krosnick, 2007), the results from such samples may show a bias towards the characteristics of these users. Against this background, in survey practice the same statistical inference procedures – developed in probability sampling – are routinely applied also into a non-probability setting, although there is no guarantee whether and how this will work. For instance, if we apply for a certain non-probability sample a standard procedure to calculate confidence intervals, we implicitly assume that probability sample selection (e.g. SRS) was used. However, at this time we are no longer confident that the interval actually contains the population value with the pre-assumed, say, 5% risk. It is very likely, that the actual risk is way above 5%, but we have no possibility to calculate it from the sample. Of course, we can still pretend and use the same interpretation as in a probability sampling context,
although this is not true. Applying standard statistical inference approaches as an approximation in non-probability samples can thus ‘seduce’ the users into believing that they have ‘real’ confidence intervals. Therefore, it seems reasonable to consider using a distinct terminology to formally separate a nonprobability from a probability setting. For example, as proposed by Baker et al. (2013) we might rather use the term ‘indications’ instead of ‘estimates’. This is also a direction to which the hardcore statistical approach against any calculation of confidence interval in non-probability setting is slowly melting in the recent AAPOR (2015) Code on Professional Ethics and Practices. In practice the situation is not that bad and this approximate usage often provides reasonable results, which is mainly due to the following reasons: (a) In almost all non-probability samples there still exists a certain level of ‘natural’ randomization, which, of course, varies across circumstances and from sample to sample. (b) This can be further accelerated with measures researchers undertake for improving the nonprobability sampling designs (see above discussion on spread, randomization, quotas and matching). (c) Powerful techniques also exist in post-survey adjustment steps, such as imputations, weight ing, complex estimators, and statistical models (see discussion further below).
All these three streams (a, b, c) together can be very effective, at least according to the perceptions of users and clients, who judge the results by using some external controls and their knowledge. The above-mentioned study on online hate speech (Vehovar et al., 2012) is a typical example here, because comparisons with a parallel telephone probability sample confirmed the results from non-probability samples: •• Estimated means on rating scales fully matched between the two samples. •• A similar result was true for the correlations, ranks and subgroup analysis.
Non-probability Sampling
•• The estimates for the shares (percentages) were skewed towards more intensive Internet users, but the ranks and relations between categories were still usually preserved.
The above pattern is very typical for many studies with ‘well-spread’ and ‘well-weighted’ non-probability samples. Successful results from non-probability samples are thus often reported, particularly for online panels, as shown in Rivers (2007) and in Callegaro et al. (2014). However, the latter reference also points to many exceptions. Similarly, scholars often compare the two approaches and as a result they typically expose advantages of the probability setting (e.g. Yeager et al., 2011). We may add that advocates (e.g. Rivers, 2010) of using probability sampling principles in a non-probability setting also emphasise that papers in leading scholarly journals often calculate confidence intervals for nonprobability samples, as well as that the nonprobability sampling approach is already acceptable in many neighbouring fields (e.g. experimental and quasi-experimental studies, evaluation research and observational studies). On the other hand, many statisticians basically insist that without sound and explicit assumptions or models – which are usually missing in non-probability sampling – there are no solid grounds to infer from any sample. In any case, whenever we use, run or discuss non-probability samples, we should clearly differentiate among them, because huge differences exist. While some carefully recruit and spread across the entire spectrum of the population, using elaborated sampling design procedures (e.g. strata, quotas, matching), others may simply self-select only within one narrow sub-segment. Similar differences appear also with (non)using advanced procedures in post-survey adjustment (imputation, weighting, calibration, estimation, modelling). Moreover, the specifics of the target variables and research problem are usually also very important here. Some variables are simply more robust in a sense that they are less
affected by deviations from probability sampling principles. Consequently, they can be handled successfully even with low-quality non-probability samples; this is often the case in marketing research. On the other hand, certain variables, such as employment status, health status, poverty measures or the study of rare characteristics (e.g. diseases, minorities, sensitive behaviour), are less robust to such deviations and require a strict probability sampling design and an elaborate estimation treatment.
Specific Modelling Approaches
Being very formal and strict, we should acknowledge that randomization – as well as the related probability sampling – is not a necessary precondition for valid statistical inference (Valliant et al., 2000: 19). If we have clear and valid assumptions, a specific modelling approach can also provide corresponding statistical inference. In such a situation, we first assume that the data – or at least specific relations among the variables – are generated according to some statistical model. Additional external data are then particularly valuable (e.g. socio-demographic background variables in the case of online panels), so that certain model-based approaches to statistical inference can be used to build the model and then estimate the values of interest. However, the corresponding statistical work (i.e. assumption testing, model development, data preparation, etc.) can be complicated and time-consuming, particularly because it typically needs to be done for each variable separately. Another stream of approaches is related to sample matching, mentioned above in relation to quota selection (which is a type of aggregated matching) and to individual matching (where we may use 'nearest neighbour' techniques). However, we can also use more advanced models related to propensity score matching. The latter is closely related to causal inference
techniques (Rubin, 1974), developed for quasi-randomized and non-randomized settings, and in particular to propensity score procedures developed for causal inference in observational studies (Rosenbaum and Rubin, 1983). An extensive overview of sample matching implementations for non-probability survey settings is provided in Baker et al. (2013: 34). The achievements of some modelling approaches are in fact admirable. For example, various voting predictions based on results from non-probability samples have been notably successful – sometimes perhaps even shockingly so for traditional statisticians who believe that no statistical inference exists outside probability sampling. It is true, however, that weak documentation of the corresponding procedures (particularly in the commercial sector) may violate the principles of research transparency (AAPOR, 2014). Nevertheless, due to their specific and narrow targeting, these approaches are often too demanding in skills and resources to be generally used. As a consequence, whenever we need a straightforward practical implementation of statistical inference, such as running descriptive statistics for all variables in the sample, we usually still run standard statistical inference procedures (which assume probability-based samples). Of course, as indicated above, this is formally correct only if we can show that the necessary assumptions hold, which is usually not the case. Otherwise, we run them as approximations or indications, together with the related risks, which then need to be explicitly stated, warned about and documented.
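As an illustration of the individual-level ('nearest neighbour') matching mentioned above, the following is a minimal sketch in Python. The data frames, column names and the use of scikit-learn are assumptions made for illustration; they are not part of the procedures described in this chapter.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def match_panel_to_reference(reference, panel, covariates):
    """For each unit of a (probability-based) reference sample, pick the closest
    volunteer-panel member on standardized covariates. Matching is done with
    replacement, so a panel member may be used more than once."""
    X_panel = pd.get_dummies(panel[covariates])
    X_ref = (pd.get_dummies(reference[covariates])
               .reindex(columns=X_panel.columns, fill_value=0))
    scaler = StandardScaler().fit(X_panel)

    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(X_panel))
    _, idx = nn.kneighbors(scaler.transform(X_ref))   # nearest panel member per reference unit
    matched = panel.iloc[idx.ravel()].copy()
    matched["matched_reference_id"] = reference.index.to_numpy()
    return matched

# Hypothetical usage, assuming both data sets share these covariates:
# matched_sample = match_panel_to_reference(
#     reference_df, panel_df, ["age", "gender", "education", "region"])
```

Matching without replacement, or on propensity scores instead of raw covariates, would be natural variations on the same idea.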
SELECTED TOPICS IN NON-PROBABILITY SAMPLING
When discussing non-probability sampling, two issues deserve special attention: non-probability online panels, because they are
perhaps the most frequent and most advanced example of contemporary non-probability sampling, and weighting, because high expectations are placed on it for solving the problems arising from non-probability samples.
Non-probability Online Panels
When the same units are repeatedly surveyed at various points in time, we talk about longitudinal survey designs. Important advantages over independent cross-sectional studies are efficiency gains in recruiting and reduced sampling variation, because we can observe change at the individual level (Toepoel et al., 2009). In traditional longitudinal designs, all units in the sample are repeatedly surveyed with questions belonging to a certain topic (e.g. the labour market). However, we will use the notion of panel somewhat differently here, in line with the understanding that prevails in the non-probability sampling context: panels are large databases of units that initially agreed to cooperate, provided background information and were recruited for occasional or regular selection into specific sample surveys addressing related or unrelated topics. With the rise of the Internet, so-called online panels of general populations – sometimes also labelled access panels – emerged in developed countries, and they are rapidly becoming the prevailing survey data collection method, particularly in marketing research (Macer and Wilson, 2014). Our focus here is on non-probability online panels, which are the dominant type of online panel, while probability online panels are relatively rare. We discuss general population panels, but the same principles can be extended to other populations, e.g. customers, professionals, communities, etc. In general, non-probability online panels are subject to the issues, challenges and solutions (i.e. approximations) discussed in the previous sections. Conceptually, therefore, there is not much to add. On the
other hand, the practices of online panels are very interesting and illustrative for the non-probability sampling context. One specific feature is that recruitment (i.e. sample selection) actually occurs at two points: first, with the recruitment of units into the panel from, say, the general population, where we need to focus on a good approximation of probability sampling (i.e. spread and coverage); second, when we recruit units from our panel into a specific sample, where we need to focus on selecting an optimal structure by using quotas, stratification, matching, etc. Careful management is required to ensure the representation of essential subgroups in each sample, but also to control the workload of each unit. A panel management system thus needs to record and optimize when and how units are recruited, how often they participate in surveys, and how much time it takes them to complete a survey. It also has to detect various types of 'cheaters', among them 'professional respondents' who may maintain fake profiles in order to participate in online panels. After the data are collected, another step of optimization occurs through editing, imputation and weighting, which can be specially tailored to the circumstances. Results from the NOPVO study (Van Ossenbruggen et al., 2006), in which 19 non-probability online panels in the Netherlands were compared, showed that panels use a broad array of recruitment methods, with the main approach relying on online recruitment and various network procedures. Sometimes a subset of panel members – particularly hard-to-reach subgroups – is also recruited via probability-based sampling using traditional modes (telephone, face-to-face). The incentives also vary, from lotteries, charity donations, vouchers or coupons for books or other goods, to the prevailing method of collecting points, which can then be monetized. Given these variations, it is not surprising that considerable differences were also found with respect to the estimates. Some panels in the NOPVO study performed much better
than others, but sometimes serious discrepancies from benchmark values were found for all panels (e.g. political party preference). A similar study, in which 17 online panels in the US contributed a sample of respondents (Thomas and Barlas, 2014), showed that online panels actually provided consistent estimates on a range of general topics, such as life satisfaction and purchasing behaviour. The corresponding variations often resembled variations in estimates from probability samples. However, there were also specific topics, particularly those related to small population shares, where larger variations were found, such as the occurrence of binge drinking (i.e. drinking more than 4–5 glasses of alcohol in one evening). An exhaustive overview of studies evaluating results from non-probability samples can be found in Callegaro et al. (2014). The overview reveals that professional online panels often provide results that rarely differ significantly from the corresponding benchmarks, although exceptions do exist. At the same time, the overview demonstrates that estimates from probability samples consistently outperform those from non-probability online panels. Nevertheless, in practice, the dramatic differences in costs between the two alternatives usually (but not always) outweigh the differences in the estimates. We may add here that cost–error optimization is a very complex and much under-researched issue even for probability-based samples, where we can minimize the product of total survey error (usually approximated by the mean squared error) and the costs (see Vehovar et al., 2010; Vannieuwenhuyze, 2014). In principle, cost–error optimization in non-probability samples could approximate the optimization used in the probability-based approach, although in reality these judgements are often much more arbitrary. With non-probability online panels one has to be careful with statistical inferences, as noted previously. In contrast to probability
samples, in non-probability online panels – as in non-probability samples in general – the corresponding quality relies on a successful mix of approximations. The quality of a certain panel can be checked by monitoring its documentation and past results. Both the ISO 20252 (Market, Opinion and Social Research) and ISO 26362 (Access Panels in Market, Opinion and Social Research) standards, for instance, specify key requirements to be met and reported. ESOMAR (2015), the global organization for market, opinion and social research, also regularly issues practical guidelines to help researchers make fit-for-purpose decisions about selecting online panels. We can summarize that contemporary non-probability online panels successfully serve the study of numerous research topics. However, here too, we cannot escape the usual limitations of non-probability samples, and for certain research problems these panels may not be the right approach. Still, they are perhaps the best example of how the limitations of a non-probability setting can be overcome, or at least minimized, through carefully applied methodological approaches.
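To illustrate the second selection step described above – drawing a study sample from the panel database so that it meets pre-specified quota targets – here is a minimal sketch. The panel data frame, the quota variables and the target counts are hypothetical and serve only to make the two-step logic concrete.

```python
import pandas as pd

def quota_sample(panel, targets, quota_vars, seed=1):
    """Draw panel members cell by cell so the sample matches target counts.

    panel      -- DataFrame of panel members containing the quota variables
    targets    -- DataFrame with one row per quota cell and a 'target_n' column
    quota_vars -- list of quota variables, e.g. ['gender', 'age_group']
    """
    pieces = []
    for _, row in targets.iterrows():
        cell = panel
        for var in quota_vars:
            cell = cell[cell[var] == row[var]]
        take = min(int(row["target_n"]), len(cell))   # a cell may be smaller than its target
        pieces.append(cell.sample(n=take, random_state=seed))
    return pd.concat(pieces)

# Hypothetical usage with gender x age-group quotas:
# targets = pd.DataFrame({"gender": ["f", "f", "m", "m"],
#                         "age_group": ["15-39", "40+", "15-39", "40+"],
#                         "target_n": [250, 250, 250, 250]})
# study_sample = quota_sample(panel_df, targets, ["gender", "age_group"])
```

In practice a panel management system would additionally balance respondent workload and exclude flagged 'cheaters' before sampling each cell.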
Weighting
In probability samples, so-called base weights are used first to compensate for unequal inclusion probabilities. In a second step, specific non-response weights are applied to reduce the bias – the difference between the estimate and the true value – resulting from non-response. In a similar way, specific weights can be developed for other methodological problems (e.g. non-coverage). In addition, with so-called population weighting we usually correct for any discrepancies in available auxiliary variables. Typically, these are socio-demographic variables (e.g. age, gender, region), but when available, other variables are also considered (e.g. media consumption). A good overview of the general treatment of weights can be
found in Bethlehem (2009), Bethlehem and Biffignandi (2012) and Valliant et al. (2013). A core requirement is that the weighting variables have to be measured both in the sample and in the population. Moreover, for weighting to be effective, they also need to be correlated with the target variables. Different weighting approaches exist. One is based on generalized regression estimation (linear weighting), where linear regression models are used (Kalton and Flores-Cervantes, 2003). Poststratification is a special, and also the most frequently used, example of this approach: auxiliary variables simply divide the population into several strata (e.g. regions), and all units in a certain stratum are then assigned the same weight. Problems appear if strata are too small and we have insufficient or no observations in a stratum, which hinders the computation of weights for that stratum; we may also lack the corresponding population control data. The second stream of approaches is based on raking ratio estimation. This approach is closely related to iterative proportional fitting (Bethlehem and Biffignandi, 2012), where we sequentially adjust for each auxiliary variable separately (e.g. age, gender, etc.) and then repeat the process until convergence is reached. This avoids the above-described problems of small numbers of units in strata and of missing joint population controls, and is thus very convenient when more auxiliary variables are considered. However, we are limited here to categorical auxiliary variables, and problems also appear with variance estimation. In most situations linear weighting and raking provide asymptotically similar results. Still, differences may appear, depending on whether the underlying relation between the auxiliary (control) and target variables is linear or loglinear (as assumed in raking).
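The raking procedure just described can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions (complete categorical data, every sample category present in the population margins); the data frame and the margin dictionaries are illustrative, not a reference implementation.

```python
import pandas as pd

def rake(sample, margins, max_iter=100, tol=1e-6):
    """Iteratively adjust weights so that the weighted sample margins match
    known population margins, one auxiliary variable at a time.

    sample  -- DataFrame with one categorical column per auxiliary variable
    margins -- dict: variable -> {category: population share}
    """
    w = pd.Series(1.0, index=sample.index)            # start from equal weights
    for _ in range(max_iter):
        max_change = 0.0
        for var, target in margins.items():
            current = w.groupby(sample[var]).sum() / w.sum()
            factors = sample[var].map(lambda c: target[c] / current[c])
            max_change = max(max_change, (factors - 1).abs().max())
            w = w * factors
        if max_change < tol:                          # adjustments have become negligible
            break
    return w * len(sample) / w.sum()                  # normalize to a mean weight of 1

# Hypothetical usage with two auxiliary variables:
# weights = rake(df, {"gender": {"f": 0.51, "m": 0.49},
#                     "agegrp": {"15-29": 0.22, "30-49": 0.34, "50+": 0.44}})
```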
We should also mention general calibration approaches, which make it possible to include auxiliary information simultaneously, as well as restrictions on the range of the weights. In addition, they produce the estimates together with corresponding variance calculations. Linear weighting and raking can then be treated as special cases of these general calibration approaches. We should be aware that weights may (or may not) remove the biases in target variables, but they certainly increase the sampling variance. The underlying expectation is, of course, that the gains from reducing the bias outweigh the corresponding loss due to the increased variance. This is not necessarily true, particularly in the case of weak correlations between auxiliary and target variables, where weights may have no impact on bias removal. The bias–variance trade-off is usually observed in the context of the so-called mean squared error (MSE), which is typically reduced to the sum of the sampling variance and the squared bias. In addition, the sampling variance, and potentially also the bias, might further increase when the auxiliary information cannot be obtained from the population but only from some reference survey. With probability samples, weighting typically removes only a minor part of the bias (Tourangeau et al., 2013), while the differences between weighting approaches themselves are usually rather small (as long as they use the same amount of auxiliary information, of course). The above-described procedures (and their characteristics), developed for probability sampling, can also be used in a non-probability setting. Again, as we do not know the probabilities of inclusion, this usage is conditional and should be treated only as an approximation. Correspondingly, the potential of weighting in non-probability sampling hardly surpasses its benefits in probability sampling. Still, in survey practice basically the same procedures are applied in a non-probability setting, in the hope that they will somehow work. With non-probability samples we often have only one step, because we cannot separate the sampling and non-response mechanisms. The weighting process thus simply assigns weights to units in the sample, so that the underrepresented ones get a
weight larger than 1, while the opposite happens to the overrepresented ones, similar to population weighting. For example, if the share of units from a certain segment is 25% in our sample but 20% in the population, these units receive a weight proportional to 20/25 = 0.8. Within this framework, propensity score adjustment (PSA) has been proposed as a specific approach to overcome the problems of non-probability samples (e.g. Lee, 2006; Steinmetz et al., 2014; Terhanian et al., 2000; Valliant and Dever, 2011), particularly in situations where weighting based on standard auxiliary variables (socio-demographics) fails to improve the estimates. One reason for this is the fact that the attitudes of Internet users might differ substantially from those of non-users (Bandilla et al., 2003; Schonlau et al., 2007). Therefore, additional behavioural, attitudinal or lifestyle characteristics are considered. Observational and process data (so-called paradata) can also be useful (Kreuter et al., 2011). PSA then adjusts for the combined effects of selection probabilities, coverage errors and non-response (Bethlehem and Biffignandi, 2012; Schonlau et al., 2009; Steinmetz et al., 2014). To estimate the conditional probability of response in the non-probability sample, a small probability-based reference survey is required, ideally conducted in the same mode using the same questionnaire. After combining both samples, a logistic (or probit) model estimates the probability of participating in the non-probability sample using the selected set of available covariates. There are several ways of using propensity scores in estimation (Lee, 2006). For instance, while Harris Interactive uses a kind of poststratified estimator based on propensity scores (Terhanian et al., 2000), YouGovPolimetrix uses a variation of matching (Valliant and Dever, 2011). Again, we should repeat that differences between weighting approaches are usually very small, which also means that PSA rarely significantly outperforms alternative weighting procedures, given that both
approaches use the same auxiliary information (Tourangeau et al., 2013). To summarize, weighting is by no means a magic wand that solves the problems of non-probability sampling. Still, it usually does provide a certain improvement; at the very least it ensures a cosmetic resemblance to population controls. In practice it is thus routinely implemented, although it may sometimes also do some damage to certain estimates.
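A minimal sketch of the propensity score adjustment step described above, using one common variant (inverse-odds pseudo-weights) rather than the specific estimators of Harris Interactive or YouGov; the combined data set, the column names and the use of statsmodels are illustrative assumptions.

```python
import pandas as pd
import statsmodels.api as sm

def psa_weights(combined, covariates, source_col="volunteer"):
    """Estimate the propensity of being in the non-probability (volunteer)
    sample rather than the probability-based reference survey, and turn it
    into pseudo-weights for the volunteer cases.

    combined   -- reference and volunteer cases stacked in one DataFrame
    covariates -- predictor columns measured in both samples
    source_col -- 1 for volunteer cases, 0 for reference-survey cases
    """
    combined = combined.reset_index(drop=True)
    X = sm.add_constant(pd.get_dummies(combined[covariates], drop_first=True)
                          .astype(float))
    y = combined[source_col].astype(float)
    fit = sm.Logit(y, X).fit(disp=0)
    propensity = pd.Series(fit.predict(X), index=combined.index)

    volunteers = combined[combined[source_col] == 1].copy()
    p = propensity.loc[volunteers.index]
    # Inverse-odds pseudo-weight: down-weights cases that are over-represented
    # in the volunteer sample relative to the reference survey.
    volunteers["psa_weight"] = (1 - p) / p
    return volunteers

# Hypothetical usage:
# adjusted = psa_weights(
#     pd.concat([reference_df.assign(volunteer=0), panel_df.assign(volunteer=1)]),
#     ["age_group", "education", "internet_use"])
```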
DECIDING ABOUT PROBABILITY SAMPLING APPROXIMATIONS
When we implement approximations from probability samples in a non-probability setting, there is, by definition, little theoretical basis for sound statistical inference. The justification and validation of these procedures therefore rely solely on the accumulation of anecdotal practical evidence, which is sometimes ironically called 'faith-based' sampling. Only when a certain non-probability approach has repeatedly worked sufficiently well in specific circumstances can this become a basis for its justification. We may reflect here on two comments on the initially mentioned Literary Digest poll failure. First, despite a long series of successful election forecasts (from 1916 onwards), the applied non-probability sampling approach unexpectedly failed in 1936. This is an excellent illustration of the risk of relying on anecdotal evidence of past success. Second, with today's knowledge and approaches it is very likely that this failure could have been prevented. In other words, an elaborate non-probability sampling strategy would have been less likely to fail so unexpectedly, although it perhaps still could not provide the same assurance as a probability sample. In this context, we should also point out again that in some circumstances, certain non-probability sampling approaches may
repeatedly work well, while in others they do not. Similarly, sometimes even a very deficient probability sample (e.g. one with a very low response rate) outperforms high-quality non-probability samples, but the opposite can also happen. In any case, only repeated empirical evidence within specific circumstances can gradually justify a certain non-probability sampling approach. Within this framework, the justification given by the New York Times (NYT) for why it switched from probability to non-probability sampling is very informative (Desilver, 2014): it simply observed the repeated empirical evidence of the new supplier (which relied on non-probability sampling) and was persuaded by its past success (and, perhaps, also by lower costs). Of course, with this decision it also took the above-mentioned risk that past success does not guarantee success in the future. When deciding between probability and non-probability sampling, the first stream of considerations is thus to thoroughly check all past evidence on approaches suitable for the type of survey research at hand. Whenever possible, we should inspect, for past non-probability samples, the distributions of relevant target variables, together with the effects of the accompanying weights, and compare them with external controls, e.g. administrative data or data from official statistics. This will give us an impression of how closely the non-probability samples resembled probability ones in terms of randomization and deviations from population controls (i.e. 'representativeness'). Another stream of considerations is to systematically evaluate the broader survey setting. We are rarely in a situation where we decide only on the probability vs non-probability dilemma; rather, we choose between whole alternative packages, each containing an array of specifics related to non-coverage, non-response, mode effects and other measurement errors, as well as to costs, timing and comparability. Extreme care is often required to separate the genuine problem of non-probability sampling
(e.g. selection bias, lack of randomization) from other components. For example, it would be of little help to reject a non-probability sample if the problems actually stem from a mode effect. In such a case, replacing the non-probability web survey with a probability web survey brings no improvement, only a lot of additional costs. We thus rarely decide separately about the probability–non-probability component alone, but rather judge simultaneously over a broad spectrum of features. For instance, in a survey on youth tourism we may choose between a probability-based telephone, mail or web survey, and a non-probability online panel or a sample recruited through online social media. The decision involves the simultaneous consideration of many criteria, where all the above-mentioned components (non-response, non-coverage, selection bias, mode effects, costs, timing, comparability, accuracy, etc.) are often inseparably embedded in each alternative (e.g. the social media option perhaps comes with high non-coverage and selection bias, but also with low costs and convenient timing). Within this context, the selection bias itself, arising from the use of a non-probability setting, is only one specific element in the broad spectrum of factors and cannot be isolated and treated separately. Sometimes, combining the approaches can be a good solution. We may thus obtain population estimates for key target variables with an (expensive) probability survey, while in-depth analysis is done via (inexpensive) non-probability surveys. Again, the study by Vehovar et al. (2012) on online hate speech may serve as a good example: key questions were put in a probability-based omnibus telephone survey, while extensive in-depth questioning was implemented within a broadly promoted non-probability web survey.
CONCLUDING REMARKS
When introducing sampling techniques for probability samples we recalled a major
breakthrough from the 1930s, which brought spectacular changes to survey practice. Having elaborated non-probability sampling approaches, we can perhaps say something similar about the gains arising from modern non-probability sampling. The advent of non-probability web surveys, in particular, revolutionized the survey landscape. Numerous survey projects that otherwise could not have been done were made possible, and many others were able to run dramatically faster, more simply and less expensively. Throughout this chapter we have indicated – explicitly or implicitly – that in a non-probability setting statistical inference based on probability sampling principles formally cannot be applied. Still, practitioners routinely use the corresponding 'estimates', which should rather be labelled 'indications' or even 'approximations'. Of course, the evaluation of such 'estimates' cannot be done in advance (ex-ante), as in probability settings, but only afterwards (ex-post). Still, such evaluations are very valuable, as we can improve the related procedures or at least recognize their limitations and restrict their future use to the narrow settings in which they work. After many repetitions, with the corresponding trial and error, practitioners often develop very useful practical solutions. Further discussion of these so-called 'fit-for-purpose' and 'fitness-to-use' strategies can be found in the AAPOR report on non-probability samples (Baker et al., 2013), which systematically and cautiously addresses selection, weighting, estimation and quality measures for non-probability sampling. We should thus be clearly aware that applying standard statistical inference procedures in a non-probability setting implicitly assumes that inclusion probabilities actually exist (which is of course not true). For example, calculating the mean of a target variable in a non-probability sample of size n automatically assumes that the inclusion probabilities are all equal (e.g. SRS). Consequently, the inclusion
probabilities n/N are implicitly assigned to all units in the sample, where N is the population size. It then falls to the corresponding ex-post evaluation procedures to judge the quality of such an 'estimate', which, as mentioned, should perhaps better be called an 'indication' or 'approximation'. By abandoning probability sampling principles we usually also abandon the science of statistical inference and enter instead into the art and craft of shaping optimal practical procedures. Experience based on trial and error is thus essential here, as is the intuition of the researcher. Correspondingly, with complex problems, expert advice or a panel of experts with rich experience can be extremely valuable. Of course, hardcore statisticians may feel doubts, or even strong professional opposition, towards procedures whose inferential risks cannot be quantified in advance. On the other hand, is it not also an essential characteristic of applied statisticians – one which separates them from 'out-of-reality' mathematical statisticians – to accept procedures that bring them close enough for all practical purposes? We should also add that the issues discussed above are nothing really new. More than a hundred years ago researchers were already trying various inexpensive alternatives to replace studying the entire population or to replace expensive probability sampling procedures. Wherever these alternatives worked sufficiently well, they were simply retained. Therefore, on one (conceptual) side we have serious objections to non-probability samples, summarized in the profound remarks from Leslie Kish (1998) in his reply to repeated requests from the marketing research industry for a review of his attitudes to quota samples:
My section 13.7 in Survey Sampling expresses what I thought then (1965) and I still think. I cannot add much, because quota methods have not changed much, except that it is done by many more people and also by telephone. Quota sampling is difficult to discuss precisely, because it is
not a scientific method with precise definition. It is more of an art practiced widely with different skills and diverse success by many people in different places. There exists no textbook on the subject to which we can refer to base our discussion. This alone should be a warning signal against its use.
On the other (practical) side, it is also true that it is hard to deny the advances in modern approaches for dealing with non-probability samples. It is also hard to object to the countless examples of successful and cost-effective implementations in research practice. We thus recommend more openly accepting the reality of using standard statistical inference as an approximation in non-probability settings. However, this comes with two warnings. First, the sample selection procedure should be clearly described, documented, presented and critically evaluated. We thus join the AAPOR recommendation that the methods used to draw the sample, collect the data, adjust it and make inferences should be documented in even more detail than for probability samples (Baker et al., 2013). The same applies to reporting on standard data quality indicators. Within this context, the AAPOR statement related to the NYT's introduction of non-probability sampling is very illustrative (AAPOR, 2014). Second, we should also elaborate on the underlying assumptions (e.g. the models used) and provide explanations about conceptual divergences, dangers, risks and limitations of the interpretations (AAPOR, 2015). When standard statistical inference is applied to any non-probability sample, the minimum requirement is thus to clearly acknowledge that estimates, confidence intervals, model fitting and hypothesis testing may not work properly, or may not work at all. Of course, the entire approach should be accompanied by corresponding professional integrity, which means reporting exhaustively and transparently. Even though these concluding remarks predominantly relate to the prevailing situation, where we use probability sampling principles as approximations in a non-probability setting, they are also highly relevant to dedicated
statistical modelling approaches, which have particular potential for future developments in this area.
NOTE
1 Some parts of this chapter rely on Section 2.2, which Vasja Vehovar wrote for Callegaro et al. (2015).
RECOMMENDED READING
Report of the AAPOR Taskforce on Non-Probability Sampling by Baker et al. (2013) provides an exhaustive and systematic insight into issues related to non-probability sampling.
Web Survey Methodology by Callegaro et al. (2015) provides an extensive treatment of non-probability sampling approaches in web surveys.
Online Panel Research: A Data Quality Perspective by Callegaro et al. (2014) provides a state-of-the-art insight into the non-probability nature of contemporary online panels.
REFERENCES AAPOR (2014). AAPOR Response to New York Times /CBS News Poll: The Critical Role of Transparency and Standards in Today’s World of Polling and Opinion Research. Retrieved 1 July 2015 from http://www.aapor.org/ AAPORKentico/Communications/PublicStatements.aspx AAPOR (2015). The Code of Professional Ethics and Practices. Retrieved 1 July 2015 from http://www.aapor.org/AAPORKentico/ Standards-Ethics/AAPOR-Code-of-Ethics.aspx Agresti, A. and Franklin, C. (2013). Statistics: The Art and Science of Learning from Data (3rd edn). Oak View, CA: Pearson. Baker, R. P., Brick, J. M., Bates, N., Battaglia, M. P., Couper, M. P., Dever, J. A., Gile, K. J. and Tourangeau, R. (2013). Report of the AAPOR Taskforce on Non-Probability Sampling. Retrieved from https://www.aapor.org/ AAPORKentico/AAPOR_Main/media/
MainSiteFiles/ NPS_TF_Report_Final_7_ revised_FNL_6_22_13.pdf. Accessed on 13 June 2016. Bandilla, W., Bosnjak, M. and Altdorfer, P. (2003). Survey administration effects? A comparison of web-based and traditional written self-administered surveys using the ISSP environment module. Social Science Computer Review, 21, 235–243. Bethlehem, J. (2009). Applied Survey Methods. A Statistical Perspective. (Wiley Series in Survey Methodology). Hoboken, NJ: Wiley. Bethlehem, J. and Biffignandi, S. (2012). Handbook of Web Surveys. Hoboken, NJ: Wiley. Callegaro, M., Baker, R., Bethlehem, J., Göritz, A., Krosnick, J. and Lavrakas, P. (eds). (2014). Online Panel Research: A Data Quality Perspective. Chichester: Wiley. Callegaro, M., Lozar Manfreda, K. and Vehovar, V. (2015). Web Survey Methodology. London: Sage. Couper, M., Kapteyn, A., Schonlau, M. and Winter, J. (2007). Noncoverage and nonresponse in an Internet survey. Social Science Research, 36, 131–148. Desilver, D. (2014). Q/A: What the New York Times’ polling decision means. Retrieved 1 July 2015 from http://www.pewresearch. org/fact-tank/2014/07/28/ qa-what-the-new-york-times-pollingdecision-means/ ESOMAR (2015). ESOMAR/GRBN Guideline for Online Research. Retrieved 25 June 2015 from: https://www.esomar.org/knowledgeand-standards/codes-and-guidelines/esomargrbn-online-research-guideline-draft.php Fowler, F. J. (2014). Survey Research Methods (5th edn). Thousand Oaks, CA: Sage. Fricker, R. (2008). Sampling methods for web and E-mail surveys. In N. Fielding, R. Lee and G. Blank (eds), The SAGE Handbook of Online Research Method (pp. 195–216). London: Sage. Groves, R., Fowler, F., Couper, M., Lepkowski, J., Singer, E. and Tourangeau, R. (2009). Survey Methodology. Hoboken, NJ: Wiley. Healey, J. F. (2011). Statistics: A Tool for Social Research. Belmont, CA: Wadsworth Publishing. Kalton, G. (1983). Introduction to Survey Sampling. SAGE University Paper series on Quantitative Applications in the Social
Sciences, series no. 07–035. Beverly Hills and London: Sage. Kalton, G. and Flores-Cervantes, I. (2003). Weighting methods. Journal of Official Statistics, 19(2), 81–97. Kiaer, A. N. (1997). The Representative Method of Statistical Surveys (English translation by Christiania Videnskabsselskabets Skrifter), Oslo, Norway: Statistics Norway, (Reprinted from Den repräsentative undersökelsesmetode, II. Historiskfilosofiske klasse, No. 4, 1897). Kish, L. (1965). Survey Sampling. New York: Wiley. Kish, L. (1998). Quota sampling: Old Plus New Thought. Personal communication. Retrieved 25 June 2015 from http://www.websm.org/db/12/11504/Web%20Survey%20Bibliography/Quota_sampling_Old_Plus_New_Thought/ Kreuter, F., Olson, K., Wagner, J., Yan, T., Ezzati-Rice, T., Casas-Cordero, C., Lemay, M., Peytchev, A., Groves, R. and Raghunathan, T. (2011). Using proxy measures and other correlates of survey outcomes to adjust for non-response: examples from multiple surveys. Journal of the Royal Statistical Society, Series A, 173(2), 389–407. Lee, S. (2006). Propensity Score Adjustment as a weighting scheme for volunteer panel web surveys. Journal of Official Statistics, 22, 329–349. Lohr, S. L. (2010). Sampling: Design and Analysis. Pacific Grove, CA: Brooks/Cole Publishing Company. Macer, T. and Wilson, S. (2014). The 2013 Market Research Technology Survey. London: Meaning Ltd. Malhotra, N. and Krosnick, J. (2007). The effect of survey mode and sampling on inferences about political attitudes and behaviour: comparing the 2000 and 2004 ANES to Internet surveys with nonprobability samples. Political Analysis, 15, 286–324. Rivers, D. (2007). Sampling for web surveys. In: Proceedings of the Joint Statistical Meeting, Survey Research Methods Section. Salt Lake City, UT: AMSTAT. Rivers, D. (2010). AAPOR report on online panels. Presented at the American Association for Public Opinion Research (AAPOR) 68th Annual Conference, Chicago, Illinois and at the PAPOR 2010 Annual Conference, San Francisco.
Rivers, D. and Bailey, D. (2009). Inference from matched samples in the 2008 U.S. national elections. Presented at the American Association for Public Opinion Research (AAPOR) 64th Annual Conference, Hollywood, Florida. Rosenbaum, P. and Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. Särndal, C.-E., Swensson, B. and Wretman, J. (2003). Model Assisted Survey Sampling. (Series in Statistics). New York, Heidelberg, Dordrecht, London: Springer. Schonlau, M., van Soest, A. and Kapteyn, A. (2007). Are ‘Webographic’ or attitudinal questions useful for adjusting estimates from web surveys using propensity scoring? Survey Research Methods, 1, 155–163. Schonlau, M., van Soest, A., Kapteyn, A. and Couper, M. (2009). Selection bias in web surveys and the use of propensity scores. Sociological Methods Research, 37, 291–318. Steinmetz, S., Bianchi, A., Biffignandi, S. and Tijdens, K. (2014). Improving web survey quality – potentials and constraints of propensity score weighting. In M. Callegaro, R. Baker, J. Bethlehem, A. Göritz, J. Krosnick and P. Lavrakas (eds), Online Panel Research: A Data Quality Perspective (chapter 12, pp. 273–298). (Series in Survey Methodology), New York: Wiley. Steinmetz, S., Raess, D., Tijdens, K. and de Pedraza, P. (2013). Measuring wages worldwide – exploring the potentials and constraints of volunteer web surveys. In N. Sappleton (ed.), Advancing Social and Business Research Methods with New Media Technologies. Hershey, PA: IGI Global, pp. 110–119. Terhanian, G., Bremer, J., Smith, R. and Thomas, R. (2000). Correcting data from Online Survey for the Effects of Nonrandom Selection and Nonrandom Assignment. Research paper, Harris Interactive, Rochester, NY. Thomas, R. and Barlas, F. M. (2014). Respondents playing fast and loose? Antecedents and Consequences of Respondent Speed of
Completion. Presented at The American Association for Public Opinion Research (AAPOR) 69th Annual Conference, Anaheim, California. Toepoel, V., Das, M. and van Soest, A. (2009). Relating question type to panel conditioning: a comparison between trained and fresh respondents. Survey Research Methods, 3(2), 73–80. Tourangeau, R., Conrad, F. and Couper, M. (2013). The Science of Web Surveys. New York: Oxford University Press. Valliant, R. and Dever, J. (2011). Estimating propensity adjustments for volunteer web surveys. Sociological Methods, 40(1), 105–137. Valliant, R., Dever, J., and Kreuter, F. (2013). Practical Tools for Designing and Weighting Survey Samples. (Statistics for Social and Behavioral Sciences Series): New York, Heidelberg, Dordrecht, London: Springer. Valliant, R., Dorfman, A. H. and Royall, R. M. (2000). Finite Population Sampling and Inference: A Prediction Approach. New York: John Wiley.
Van Ossenbruggen, R., Vonk, T. and Willems, P. (2006). Dutch Online Panel Comparison Study (NOPVO). Retrieved 1 July 2015 from http://www.nopvo.nl/ Vannieuwenhuyze, J. (2014). On the relative advantage of mixed-mode versus single-mode surveys. Survey Research Methods, 8(1), 31–42. Vehovar, V., Berzelak, N. and Lozar Manfreda, K. (2010). Mobile phones in an environment of competing survey modes: applying metric for evaluation of costs and errors. Social Science Computer Review, 28(3), 303–318. Vehovar, V., Motl, A., Mihelič, L., Berčič, B. and Petrovčič, A. (2012). Zaznava sovražnega govora na slovenskem spletu [Perception of hate speech on the Slovenian web]. Teorija in Praksa, 49(1), 171–189. Yeager, D., Krosnick, J., Chang, L., Javitz, H., Levendusky, M., Simpser, A. and Wang, R. (2011). Comparing the accuracy of RDD telephone surveys and Internet surveys conducted with probability and nonprobability samples. Public Opinion Quarterly, 75(4), 709–747.
23
Special Challenges of Sampling for Comparative Surveys
Siegfried Gabler and Sabine Häder
HISTORY AND EXAMPLES
Comparative social surveys have a long history, going back to the middle of the last century with surveys such as 'How nations see each other' (Buchanan and Cantrill 1953) in 1948/1949. An impressive overview of such surveys based on population samples is given on the website of the research data center for international survey programs at GESIS (http://www.gesis.org/en/institute/competence-centers/rdc-international-survey-programmes/). In recent decades a long process of improvement and quality control of those surveys has taken place. In this chapter we document the current state of the art concerning sampling for comparative surveys. According to Lynn et al. (2006: 10) there are three goals for cross-national surveys:
(a) comparisons of estimates of parameters for different countries;
(b) rankings of countries on different dimensions such as averages or totals;
(c) estimates for a supra-national region such as the European Union aggregated from estimates of different countries.
Sampling strategies have to ensure the equivalence and/or combinability of these estimates. For this, both sample designs and estimation strategies have to be chosen carefully. Kish (1994: 173) provides basic ideas for the application of sample designs in cross-cultural surveys:
Sample designs may be chosen flexibly and there is no need for similarity of sample designs. Flexibility of choice is particularly advisable for multinational comparisons, because the sampling resources differ greatly between countries. All this flexibility assumes probability selection methods: known probabilities of selection for all population elements.
Following this statement, an optimal sample design for cross-national surveys should consist of the best random sampling practice used in each participating country. According to Hubbard and Lin (2011: V.1), 'best' means in this context an optimal design 'that maximizes the amount of information obtained per monetary unit spent within the allotted time and meets the specified level of precision' (see also Heeringa and O'Muircheartaigh 2010). The choice of a specific sample design depends on the availability of frames and on experience, but mainly on the costs in the different countries. Once the survey has been conducted and adequate estimators are chosen, the resulting values become comparable. To ensure this comparability, design weights have to be computed for each country. For this, the inclusion probabilities of every respondent at each stage of selection must be known and recorded. Furthermore, the inclusion probabilities for non-respondents must also be recorded at every stage where the information to calculate them is available. Over the last decades, cross-national surveys have become increasingly common practice. In particular, regular and repeated surveys are of high importance for the observation and analysis of social change in different nations. In the following, we describe how the sample designs of some of these surveys have developed over time. We will show that the trend has gone from more or less naive approaches to more rigorous applications in the present. For example, one of the first series of cross-national surveys was the Standard Eurobarometer, which is conducted between two and five times per year in personal interviews. Starting in 1973, different sampling methods were used, which varied between countries. The sampling designs were either multi-stage national probability samples or national stratified quota samples. With Eurobarometer 32 (October 1989), the basic sampling design in all member states changed to a multi-stage, random (probability) one, i.e. quota sampling was no longer
accepted. The sampling is now based on a stratified random selection of sampling points. In the second stage, a cluster of addresses is selected from each sampling point. In most countries addresses are chosen using standard random route procedures. In each household, a respondent is selected by a random procedure, such as the next-birthday method. Of course, these sampling strategies have several disadvantages too. In particular, random route procedures are difficult to control if the interviewer who selects the addresses of the households is also the one in charge of conducting the interviews. Nevertheless, this development is an example of two typical trends: (a) the abandonment of quota sampling and (b) the strategy of having widely similar designs in all participating countries. Abandoning quota sampling is also typical for other large-scale survey programs such as the European Value Study (Face-to-face) or the International Social Survey Programme (Face-to-face or Self-administered), which both changed to strict probability sampling. This is consistent with the findings of Zheng (2013: 3652), even though he analyzed only a small number of comparative surveys:
Although various non-probability sampling techniques have been proposed for cross-cultural research, it may be said that the establishment of sampling approaches in the framework of probability sampling is an important subject in order to obtain the responses with high reliability and validity.
Another example of comparative surveys conducted in face-to-face mode is the European Social Survey (ESS). From the first round of the ESS in 2002 onwards, only designs in which the inclusion probabilities at each stage were known have been applied. Furthermore, to ensure the highest methodological standard for this series of surveys, the best sampling designs had to be found (see Häder and Lynn 2007). This means that, in contrast to the Standard Eurobarometer, a variety of designs is applied – which is in accordance with Kish – but is in many countries
far more time- and money-consuming. In Germany, for instance, a register-based sample is the design with the highest quality of the gross sample. However, it usually takes around half a year to receive the addresses of the individuals from the registers. In addition, the addresses cost (in some municipalities an excessive amount of) money and the travel costs for the interviewers are much higher than in random route designs. Let us now turn to a different survey mode: telephone studies. The Flash Eurobarometer is a prominent example of this type of survey. It is conducted in the EU member states via telephone – in most cases on landline phones. However, there are also exceptions where parts of the population are interviewed on mobile phones:
All interviews were carried [out] using the TNS e-Call center (our centralized CATI system). In every country respondents were called both on fixed lines and mobile phones. The basic sample design applied in all states is multi-stage random (probability). In each household, the respondent was drawn at random following the 'last birthday rule'. TNS has developed its own RDD sample generation capabilities based on using contact telephone numbers from responders to random probability or random location face to face surveys, such as Eurobarometer, as seed numbers. The approach works because the seed number identifies a working block of telephone numbers and reduces the volume of numbers generated that will be ineffective. The seed numbers are stratified by NUTS2 region and urbanization to approximate a geographically representative sample. From each seed number the required sample of numbers are generated by randomly replacing the last two digits. The sample is then screened against business databases in order to exclude as many of these numbers as possible before going into field. This approach is consistent across all countries (European Commission 2013).
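The seed-number idea described in the quotation can be illustrated with a minimal sketch. The function name, the number format and the exclusion of the seed itself are illustrative assumptions about one possible implementation, not a description of the actual TNS or BIK systems.

```python
import random

def numbers_from_seed(seed_number: str, k: int, rng: random.Random) -> list:
    """Generate k distinct candidate telephone numbers from one working 'seed'
    number by randomly replacing its last two digits. The seed identifies a
    working 100-number block, which keeps the share of ineffective numbers low.
    Assumes k < 100."""
    block = seed_number[:-2]
    candidates = set()
    while len(candidates) < k:
        candidate = block + "{:02d}".format(rng.randrange(100))
        if candidate != seed_number:       # do not simply re-dial the seed itself
            candidates.add(candidate)
    return sorted(candidates)

# Hypothetical usage: 20 generated numbers per seed, to be screened against
# business databases before fieldwork, as described in the quotation.
# generated = numbers_from_seed("03069912345", k=20, rng=random.Random(1))
```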
Interestingly, the Flash Eurobarometer is one of the very rare multi-nation studies conducted via telephone. The fact that telephone interviews are seldom applied is surprising, because BIK Aschpurwis+Behrens provide – at least for European survey institutes – modified RDD
samples for fixed-line phones and mobile phones (Heckel and Wiese 2012). These samples are of high quality and are developed in consideration of the latest research results in this field. The design is comparable to the above-described frame from TNS, although it uses telephone books and information from the Internet as a source of seed numbers. For telephone surveys, two trends can also be observed: (a) the use of randomized last-digits techniques (Lavrakas 1993) has become less frequent than in earlier surveys, as it has become clear among survey statisticians that these techniques lead to unequal inclusion probabilities and that adjusting for these unequal probabilities would be inefficient; (b) telephone surveys have had to be extended from landline phones alone to mobile phones in order to continue to reflect the opinions and attitudes of the general population. The reason for this is the growing group of people who cannot be reached via fixed-line phones, i.e. the 'Mobile-onlys'. Table 23.1 shows the development of telephone access in the European Union. It has become apparent that whilst the proportion of households with only landline phones is declining, there is an increasing number of households that can only be contacted by mobile phone (see Häder et al. 2012) – a substantial proportion of the population that cannot be left out of general social surveys.
Table 23.1 Telephone access in Europe

                                                        2007 in %   2009 in %   2011 in %
Households with both landline and cell phone access        58          62          62
Households with only landline phone access                 14          11           9
Households with only cell phone access                     25          25          27
Households without phone access                              3           2           2

Source: Eurobarometer 2007, 2009, 2011
This problem becomes even more serious when, for example, the range of Mobile-onlys across Europe is taken into consideration: from 11% in Germany to 71% in Finland. As a result of this development, there is a strong demand for dual frame approaches in which both landline and mobile phone numbers are selected and contacted.
MAIN REQUIREMENTS
In the following section we will discuss the main requirements concerning sampling for comparative surveys. For a comprehensive overview of those requirements see also Hubbard and Lin (2011).
Unique Definition of the Target Population and Suitable Sampling Frames
An important step in planning a survey is the definition of the population under study (target population). In the case of the ESS, within each country it contains persons 15 years or older residing in private households, regardless of nationality and citizenship, language1 or legal status. This definition applies to all participating countries. Thus, every person with these defined characteristics should have a non-zero chance of being selected. This implies that the more completely the frame covers the persons belonging to the target population, the better the resulting sample will be. However, the quality of the frames – e.g. coverage, updating, and access – differs from country to country. Therefore, frames have to be evaluated carefully. The results of these evaluations are documented and have to be taken into account when the data are analyzed. Among others, the following kinds of frames can be found:
(a) countries with reliable lists of residents that are available for social research, such as the Swedish population register, which has good coverage of persons residing in Sweden;
(b) countries with reliable lists of households that are available for social research;
(c) countries with reliable lists of addresses that are available for social research, such as the postal delivery points from 'PostaalAfgiftenpuntenbestand' in the Netherlands;
(d) countries without reliable and/or available lists, such as Portugal.
Drawing a sample is more complicated if no registers (lists) are available (group d). In this instance, multi-stage designs are usually applied, in which the selection of municipalities forms the first stage and the selection of households within these municipalities the second stage. Because no sampling frames are available, the crucial problem is the selection of households. There are two main ways to go about this. The first is to list all the addresses within certain areas of each selected community; the target households are then drawn from these lists. This procedure can be regarded as one way of drawing a random sample, albeit a fairly strongly clustered one. Another frequently used way to find target households is the application of random route elements. The question here, however, is the extent to which random routes can be judged to be 'strictly random'. That depends both on the definition of the rules for the random walk and on the control of the interviewers by the fieldwork organization in order to minimize the interviewer's influence on the selection of respondents. In ESS rounds 1–6 in Austria, for example, there was a design with a random route element. The survey institute, along with the National Co-ordinator, operationalized the general rules for various household types (e.g. large apartment buildings, small houses within densely populated areas, houses in the countryside). Moreover, all selected households were checked by the supervising team. This approach was convincing for the sampling expert panel. Even in countries where reliable frames exist, some problems had to be solved. For
example, in Italy there is an electoral register available, but it contains, of course, only persons 18 years or older. Therefore, it had to be used as a list of addresses – not of electors. The same situation can be found in Ireland. The problem arising in this context is that addresses where no elector lives are not included in the sampling frame, which can lead to a biased sample with regard to foreigners. People with illegal status will be underrepresented because they are not registered. Such systematic losses due to undercoverage cannot be ruled out in practice; however, they must be documented carefully. To maximize coverage of the target population, according to Heeringa and O'Muircheartaigh (2010) it is advisable to:
1 avoid major geographic exclusions;
2 develop strategies to include also hard-to-reach population elements in the sample;
3 translate the questionnaire into the major spoken languages in each country;
4 consider seasonal absences due to vacations, work patterns and so on.
Best Sampling Designs
The next step is the design of the samples, given the frames in each country. Usually a regional stratification scheme is recommended to ensure a good representation of the different geographical areas of the country. If further characteristics of the frame population are available, countries should also use them for stratification. In many survey programs, such as SHARE, the ESS or PIAAC, a guiding principle is to design sampling plans which yield minimum variation in inclusion probabilities and a minimum amount of clustering. This is necessary because both of these design characteristics directly influence the precision of estimates based on the underlying samples. Finding a sampling frame which allows for such a design is, however, not always possible. Such a scenario applies, for example, if a country team only has access to a list of
households and an eligible person has to be selected from all eligible target persons of a sampled household. In this case, variation in inclusion probabilities cannot be avoided. This procedure introduces a so-called 'design effect due to unequal inclusion probabilities' (Deffp). Studies (e.g. in the ESS) have shown that Deffp usually ranges between 1.20 and 1.25, depending on the variation of household sizes in a country (Ganninger 2010). This variation in inclusion probabilities has to be taken into account by a design weight, which is basically the inverse of the inclusion probability. Fortunately, many countries have access to population registers, e.g. the Scandinavian countries, Slovenia, Switzerland, and Germany. In these countries, sample designs can be implemented which yield equal inclusion probabilities for all elements. In Germany, however, researchers have to use a two-stage clustered sample design, as the population registers are administered locally by the municipalities. Therefore, a number of municipalities have to be selected at the first stage and persons at the second stage. In such a case, an additional component of the design effect emerges: the 'design effect due to clustering' (Deffc). Usually, Deffc is larger than 1, and both the mean cluster size of the primary sampling units and the intra-class correlation determine its magnitude. Therefore, by design, the smallest possible mean cluster size should be chosen and the largest possible number of primary sampling units should be selected. This is at odds with the interests of the survey agencies, for whom an increase in the number of primary sampling units is equivalent to increased costs.
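As a reminder of the quantities involved, a sketch using the standard Kish-type approximations (these formulas are added here for orientation and are not quoted from this chapter; $w_i$ denotes the design weight of respondent $i$, $\bar{b}$ the mean cluster size and $\rho$ the intra-class correlation):

\[
\mathrm{Deff}_p = n \,\frac{\sum_{i=1}^{n} w_i^2}{\bigl(\sum_{i=1}^{n} w_i\bigr)^2},
\qquad
\mathrm{Deff}_c = 1 + (\bar{b} - 1)\rho,
\qquad
\mathrm{Deff} \approx \mathrm{Deff}_p \cdot \mathrm{Deff}_c,
\qquad
n_{\mathrm{eff}} = \frac{n}{\mathrm{Deff}}.
\]

Minimizing the variation of the $w_i$ and the mean cluster size $\bar{b}$ is precisely what the design recommendations above aim at, since a smaller $\mathrm{Deff}$ means a larger effective sample size for the same gross sample size.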
By definition, a landline sample and a mobile phone sample overlap for respondents in households with both landline and mobile phones. As extensively debated in the AAPOR Cell Phone Task Force report (2010), two different approaches have been used to handle the overlap: the screening approach and the overlap approach.
•• In the screening approach the interviewers terminate interviews with mobile phone respondents who have at least one landline phone in the household and thus could potentially be reached through the landline sample frame. These individuals will be eligible only when sampled from the landline frame.
•• In the overlap approach the interview is conducted regardless of the frame from which the respondent is selected (landline or mobile phone) and regardless of their phone ownership status, and information about each respondent's phone status is collected from both frames (a weighting sketch follows below).
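A minimal sketch of the weighting logic behind the overlap approach, assuming independent selection from the two frames (the selection probabilities and the helper function are purely illustrative and are not taken from the AAPOR report):

```python
def union_inclusion_prob(p_landline: float, p_mobile: float) -> float:
    """Inclusion probability under independent sampling from a landline frame
    and a mobile phone frame; for single-frame units one probability is 0."""
    return p_landline + p_mobile - p_landline * p_mobile

# Illustrative per-unit selection probabilities in each frame.
pi_landline_only = union_inclusion_prob(0.002, 0.0)
pi_dual_user = union_inclusion_prob(0.002, 0.003)  # reachable via both frames

# The design weight accounting for multiplicity is the inverse of the combined
# inclusion probability, so dual users receive a smaller weight than they
# would obtain from either frame alone.
w_dual_user = 1.0 / pi_dual_user
```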
At this current stage of knowledge there is no general consensus on which method is preferable (AAPOR Cell Phone Task Force 2010: 6). However, in countries where the mobile-only population is still so small that screening for Mobile-onlys is deemed too time- and money-consuming, a dual frame approach is advisable. To combine the samples from both frames, the inclusion probability of each household, which is necessary to account for multiplicity, has to be computed.
For the documentation of the sampling designs in comparative surveys, the ESS may serve as an example, because the described procedure has meanwhile been applied successfully for seven rounds. According to the ESS Sampling Guidelines, the following questions have to be answered in order to describe the design comprehensively (http://www.europeansocialsurvey.org):
•• Is the design single- or multi-stage?
•• What is the assumed design effect?
•• For each sampling stage of the design the following specifications are needed:
1 Sampling frame. What is the definition of the sampling units that are listed in the sampling frame? How is the sampling frame of the stage constructed, i.e. which registers or lists are used to enumerate sampling units, and what is their information content to construct the sampling frame? Is there any known under- or overcoverage in the frame used?
2 Sample size. How many units will be selected?
3 Stratification. Are the sampling units grouped, i.e. stratified, and is sampling performed independently within each stratum? If yes, what are the variables used to stratify the sampling units?
4 The allocation to strata. If stratification is used, then how is the sample size distributed, i.e. allocated, to the strata?
5 Allocation to the selected preceding sampling units. If the stage is not the first one, how is the sample size of the stage distributed, i.e. allocated, to the sampling units of the stage prior to this stage?
6 How many sampling units are in the selected clusters of the previous stage?
7 Sampling method. What method (ideally the algorithm and the software of its implementation) is used to select the sampling units?
For this description, it is advisable to use forms that ensure a standardized description. Examples of such forms can be found at the ESS (http://www.europeansocialsurvey.org/docs/round6/methods/ESS6_sampling_guidelines.pdf) and SHARE (http://www.share-project.org/uploads/tx_sharepublications/Methodology_Ch5.pdf). A detailed description of the standards and guidelines for sampling procedures in all participating countries is given in Chapter 4 of the PIAAC Technical Standards and Guidelines (http://www.oecd.org/site/piaac/PIAACNPM%282014_06%29PIAAC_Technical_Standards_and_Guidelines.pdf). These documentations can serve as examples for other cross-cultural surveys.
Prediction of Effective Sample Size and Design Effects
In the planning phase of a design it is important to have a measure for comparing different sampling designs. The benchmark for a sample design is simple random sampling or, to be more precise, the variance of
the sample mean under simple random sampling. Following Kish (1965: 88), the so-called design effect Deff is defined as the ratio of the variance of an estimator under the complex sample design used in the survey to the variance of the sample mean under simple random sampling. If cluster sampling is used in a survey and the cluster sizes are all equal, say $b$, it is well known that the design effect is $\mathrm{Deff} = 1 + (b-1)\rho$, where $\rho$ is the intra-class correlation coefficient. In practice, $\rho$ is often positive and the design effect is larger than 1, as already mentioned above. The effective sample size $n_{\mathrm{eff}}$ is defined as $n_{\mathrm{eff}} = n/\mathrm{Deff}$, where $n$ is the net sample size, i.e. the actual number of achieved interviews. The effective sample size is the size of a simple random sample that would produce the same precision (standard errors) as the design actually used. If Deff is larger than 1, which is the usual case because of clustering and/or differing selection probabilities, the concept of the effective sample size makes the precision of an estimator comparable across sample designs and allows samples in different countries to be conducted with the same effective sample size. $n_{\mathrm{eff}}$ is usually smaller than $n$ and serves as a basis for comparing different sampling designs. For example, in ESS round 6 Finland used a single-stage equal-probability systematic sample (no clustering); thus Deff = 1 and $n_{\mathrm{eff}} = n/\mathrm{Deff} = n$, so for Finland the effective sample size and the net sample size are identical. What is needed if there is a design effect of 2 in a different country? In that case a net sample size twice as large as Finland's is required in order to secure the same precision as in Finland. Since the variances in the numerator and denominator of the design effect are difficult expressions, Deff is rarely predicted in this way. Instead, in the ESS, a model based on simplifying assumptions about the characteristics determining Deff is used. Let $b_i$ be the number of observations in the $i$-th cluster ($i = 1, \ldots, I$), so that $\bar{b} = \frac{1}{I}\sum_{i=1}^{I} b_i$.
Let $y_{ij}$ and $w_{ij}$ be the observation and the weight for the $j$-th sampling unit in the $i$-th ultimate cluster ($i = 1, \ldots, I$; $j = 1, \ldots, b_i$). The design weights $w_{ij}$ are the inverse of the first-order inclusion probabilities and must be determined via the selection probabilities of each stage in the survey. The usual design-based estimator for the population mean is defined by

$$\bar{y}_w = \frac{\sum_{i=1}^{I}\sum_{j=1}^{b_i} w_{ij}\, y_{ij}}{\sum_{i=1}^{I}\sum_{j=1}^{b_i} w_{ij}}.$$

The following model M1 can be assumed:

$$\operatorname{Var}(y_{ij}) = \sigma^2 \quad \text{for } i = 1, \ldots, I;\; j = 1, \ldots, b_i,$$

$$\operatorname{Cov}(y_{ij}, y_{i'j'}) = \begin{cases} \rho\sigma^2 & \text{if } i = i',\; j \neq j', \\ 0 & \text{otherwise.} \end{cases}$$

The (model-based) design effect is defined as $\mathrm{Deff}_M = \operatorname{Var}_{M1}(\bar{y}_w) / \operatorname{Var}_{M2}(\bar{y})$, where $\operatorname{Var}_{M1}(\bar{y}_w)$ is the variance of $\bar{y}_w$ under model M1 and $\operatorname{Var}_{M2}(\bar{y})$ is the variance of the overall sample mean $\bar{y} = \sum_{i=1}^{I}\sum_{j=1}^{b_i} y_{ij} \big/ \sum_{i=1}^{I} b_i$, computed under the following model M2:

$$\operatorname{Var}(y_{ij}) = \sigma^2 \quad \text{for } i = 1, \ldots, I;\; j = 1, \ldots, b_i,$$

$$\operatorname{Cov}(y_{ij}, y_{i'j'}) = 0 \quad \text{for all } (i, j) \neq (i', j').$$

Note that model M2 is appropriate under simple random sampling and provides the usual formula $\sigma^2 / \sum_{i=1}^{I} b_i$ for $\operatorname{Var}_{M2}(\bar{y})$. Under these two models,

$$\mathrm{Deff}_M = \frac{\sum_{i=1}^{I} b_i \sum_{i=1}^{I}\sum_{j=1}^{b_i} w_{ij}^2}{\left(\sum_{i=1}^{I}\sum_{j=1}^{b_i} w_{ij}\right)^2} \times \left(1 + (b^* - 1)\rho\right), \quad \text{with} \quad b^* = \frac{\sum_{i=1}^{I}\left(\sum_{j=1}^{b_i} w_{ij}\right)^2}{\sum_{i=1}^{I}\sum_{j=1}^{b_i} w_{ij}^2}.$$
DeffM can thus be split into the product of two design effects, DeffM = Deffp × Deffc (see Gabler et al. 1999 and Kish 1995). Deffp is the design effect with respect to the selection of units with unequal probabilities. If all $w_{ij}$ are equal and all cluster sizes $b_i$ are equal, then Deffp = 1 and $b^* = \bar{b}$. There is an important difference between Deffp and Deffc: while Deffp depends only on the weights $w_{ij}$ and the cluster sizes, Deffc additionally depends on the model parameter $\rho$, which usually varies from variable to variable. For a binary variable the intra-class correlation coefficient will differ from that of a Likert-scale variable.
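To make these quantities concrete, the following sketch evaluates Deffp, $b^*$, Deffc, and DeffM directly from the formulas above for a small set of invented design weights grouped by cluster and an assumed value of $\rho$ (the numbers are illustrative only):

```python
# Sketch: splitting the model-based design effect into Deffp (unequal weights)
# and Deffc (clustering), following the formulas given above.
w = [[1.0, 2.0, 1.0], [2.0, 2.0], [1.0, 3.0, 1.0, 1.0]]  # design weights by cluster
rho = 0.036  # assumed intra-class correlation

n = sum(len(cluster) for cluster in w)                  # net sample size, sum of b_i
sum_w = sum(wij for cluster in w for wij in cluster)
sum_w2 = sum(wij ** 2 for cluster in w for wij in cluster)

deff_p = n * sum_w2 / sum_w ** 2                        # weighting component
b_star = sum(sum(cluster) ** 2 for cluster in w) / sum_w2
deff_c = 1.0 + (b_star - 1.0) * rho                     # clustering component
deff_m = deff_p * deff_c

n_eff = n / deff_m                                      # effective sample size
print(round(deff_p, 3), round(deff_c, 3), round(n_eff, 1))
```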
Estimation
In the design-based case the variances are not known and have to be estimated from the sampled data. Since this is very complicated for most complex designs used in practice, the estimation is done for the model-based design effect DeffM. The model-based approach is in this respect more flexible than the design-based approach, as it is possible to estimate both components of the design effect separately. Deffp can be computed directly as the first factor in the DeffM formula above. In many countries the only available sampling frame is a list of addresses, and these addresses are selected with equal probabilities. The next step in such a design is to randomly select one person for an interview at each address. This last sampling step implies differing selection probabilities. Population statistics on the distribution of household sizes are then used to estimate the number of respondents in each selection probability class; it is assumed that an address usually contains only one household. The necessary estimation of $\rho$ can be done with the classical ANOVA estimator
$$\hat{\rho} = \frac{MSB - MSW}{MSB + (K - 1)\,MSW},$$

where

$$K = \frac{1}{I - 1}\left(\sum_{i=1}^{I} b_i - \frac{\sum_{i=1}^{I} b_i^2}{\sum_{i=1}^{I} b_i}\right),$$
and MSB and MSW are the mean squares between and within clusters, respectively. For other estimators of $\rho$, see Ganninger (2010: 52). Unfortunately, in many countries the cluster ID is not contained in the data, which implies that the user is not able to estimate $\rho$. As already mentioned above, $\rho$ also depends on the variable of interest. Since in most surveys a variety of variables are of interest, a different estimate of $\rho$ is obtained for each variable. In the ESS, about 70 variables are used and the mean over all the resulting values of $\rho$ is taken. In this way $\rho$ = 0.036 and a mean Deffc of 1.73 are computed for Germany. The estimates of Deffp and Deffc are the basis for computing the net sample size for a country in the next round. The design weights are important for obtaining an unbiased estimator. All weights are rescaled in such a way that the sum of the final weights equals the net sample size n. To avoid overly large design weights it is common to truncate them; in the ESS, design weights are truncated at 4. Often there is an interest in estimating the mean of a variable of interest for a multi-country region such as the Scandinavian countries. Since the sampling designs of the countries are independent of each other, the stratified estimator can be used. However, we have to correct for the different inclusion probabilities in the countries. For this, population size weights are needed, i.e. the shares of the population sizes (aged 15 years and over) of the countries relative to the population size of the whole region. The population size weight corrects for population size when combining the data of two or more countries.
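Where the cluster ID is available in the data, the ANOVA estimator of $\rho$ given above can be computed directly, as in the following minimal sketch (the clustered responses are invented for illustration):

```python
# Sketch: classical one-way ANOVA estimator of the intra-class correlation rho,
# following the MSB/MSW formula given above.
y = [[2, 3, 3], [5, 4], [1, 2, 2, 1]]            # responses grouped by cluster

n_clusters = len(y)                              # I in the formulas above
b = [len(cluster) for cluster in y]              # cluster sizes b_i
n = sum(b)
grand_mean = sum(v for cluster in y for v in cluster) / n
cluster_means = [sum(cluster) / len(cluster) for cluster in y]

msb = sum(bi * (m - grand_mean) ** 2
          for bi, m in zip(b, cluster_means)) / (n_clusters - 1)   # between clusters
msw = sum((v - m) ** 2
          for cluster, m in zip(y, cluster_means) for v in cluster) / (n - n_clusters)

K = (n - sum(bi ** 2 for bi in b) / n) / (n_clusters - 1)
rho_hat = (msb - msw) / (msb + (K - 1) * msw)
print(round(rho_hat, 3))
```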
In large comparative surveys such as the ESS, several rounds now exist, possibly using different sample designs even within one country. A more complex problem arises if a mean of a variable of interest has to be estimated by combining two or more countries over several rounds. An overview of how weights are combined across countries and/or rounds can be found at http://www.europeansocialsurvey.org/docs/methodology/ESS_weighting_data.pdf. From a statistical point of view, the use of weights is of high importance. However, it must be noted that design weights do not adjust for non-response in the sample. Additional information, e.g. from official statistics, should be incorporated in the estimation of an average or total. This is usually carried out by adjusting the estimates with respect to known variables such as age, gender, region, or education. Post-stratification or raking methods are often applied for this purpose. A well-known estimator for such calibration is the generalized regression estimator (Deville and Särndal 1992). Of course, such weighting for non-response only 'improves' the estimators, i.e. decreases deviations between the sample distribution and the true population distribution, if the adjustment variables and the variables of interest are not independent – which may differ from one variable of interest to another. Therefore, the ESS sampling panel was reluctant for a long time to provide official adjustment weights. However, due to the strong need for such weights – because of the high non-response in various countries – it was agreed to provide raking weights based on the adjustment variables age, gender, region, and education. These variables seem to improve estimates in a variety of cases and thus represent a compromise.
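As an illustration of the raking idea, the following sketch iteratively adjusts design weights to hypothetical population margins for age group and gender (the data, margins, and variable names are invented for the example and do not correspond to the official ESS weights):

```python
# Sketch: raking (iterative proportional fitting) of design weights to known
# population margins for two adjustment variables. All values are illustrative.
respondents = [  # (age_group, gender, design_weight)
    ("15-34", "f", 1.0), ("15-34", "m", 2.0), ("35-64", "f", 1.0),
    ("35-64", "m", 1.5), ("65+", "f", 1.0), ("65+", "m", 1.0),
]
pop_margins = {
    0: {"15-34": 0.30, "35-64": 0.50, "65+": 0.20},  # age-group shares
    1: {"f": 0.52, "m": 0.48},                       # gender shares
}

weights = [w for _, _, w in respondents]
for _ in range(25):  # repeat the margin adjustments until they converge
    for idx, margins in pop_margins.items():
        total = sum(weights)
        factors = {
            cat: share * total / sum(w for w, r in zip(weights, respondents) if r[idx] == cat)
            for cat, share in margins.items()
        }
        weights = [w * factors[r[idx]] for w, r in zip(weights, respondents)]
```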
FUTURE DEVELOPMENTS
In summary, it can be said that in comparative surveys 'we should distinguish the needs
of the survey methods (definitions, variables, measurements), which must be harmonized, standardized, from sample designs, which can be designed freely to fit national (even provincial) situations, provided they are probability designs' (Kish 2002: 32). This means that in each country the best probability design in terms of frame coverage, design effect, experience, and economic efficiency should be found – there is no need for similarity. Apart from this general guideline, several trends concerning sampling for comparative surveys can clearly be observed.
•• There is a trend towards multiple frame designs. They are used when a single frame has coverage deficiencies. For instance, in sampling for telephone surveys, dual frame approaches are becoming more common in order to also include solely mobile phone users.
•• Abandoning quota sampling is a clear trend which was established in the 1990s. Only a few survey programs (such as the World Values Survey, http://www.worldvaluessurvey.org) still allow quota elements in certain cases.
•• The most suitable probability design should be used in each country. Being most suitable means that the design is workable and efficient with regard to the resources of time and money. There is thus not just one sample design that is best for a country; the best design can also differ from survey to survey, and sample designs can even differ between countries within a single survey.
•• It is essential to record the inclusion probabilities for all elements of the gross sample. This information is needed to calculate design weights.
•• The concept of meeting the target of the same effective sample size in each participating country ensures the same precision of estimates and thus allows the results of analyses to be compared.
•• A detailed documentation of the sampling procedures is necessary to allow researchers to comprehend the process of data collection. Forms as used in the ESS or SHARE are helpful for standardized documentation and probably also help users to properly apply the design weights.
NOTE
1 In countries where any minority language is spoken as a first language by 5% or more of the population, the questionnaire will be translated into that language.
RECOMMENDED READINGS
Hubbard and Lin (2011) – these freely available online guidelines give an overview of sampling for cross-cultural surveys.
Häder et al. (2012) give methodological background on how multipopulation telephone surveys in Europe should be conducted.
Kish (1994) explains the statistical theory behind estimation from multipopulation samples.
REFERENCES
AAPOR Cell Phone Task Force (2010). New considerations for survey researchers when planning and conducting RDD telephone surveys in the US with respondents reached via cell phone numbers. https://www.aapor.org/AAPOR_Main/media/MainSiteFiles/2010AAPORCellPhoneTFReport.pdf
Buchanan, W. and Cantril, H. (1953). How Nations See Each Other. Urbana, IL: University of Illinois Press.
Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association 87, 376–382.
European Commission (2013). Flash Eurobarometer 379. Attitudes towards Biodiversity. Summary. http://ec.europa.eu/public_opinion/flash/fl_379_sum_en.pdf
Gabler, S., Häder, S. and Lahiri, P. (1999). A model based justification for Kish's formula for design effects for weighting and clustering. Survey Methodology 25, 105–106.
Ganninger, M. (2010). Design Effects: Model-based versus Design-based Approach, Vol. 3. Bonn: GESIS Schriftenreihe.
Häder, S. and Lynn, P. (2007). How representative can a multi-nation survey be? In R. Jowell et al. (eds), Measuring Attitudes Cross-nationally (pp. 33–52). London: SAGE.
Häder, S., Häder, M. and Kühne, M. (eds) (2012). Telephone Surveys in Europe: Research and Practice. Heidelberg: Springer.
Heckel, C. and Wiese, K. (2012). Sample frames for telephone surveys in Europe. In S. Häder, M. Häder and M. Kühne (eds), Telephone Surveys in Europe: Research and Practice (pp. 103–119). Heidelberg: Springer.
Heeringa, S. G. and O'Muircheartaigh, C. (2010). Sampling design for cross-cultural and cross-national studies. In J. Harkness, B. Edwards, M. Braun, T. Johnson, L. Lyberg, P. Mohler, B.-E. Pennell, and T. Smith (eds), Survey Methods in Multicultural, Multinational, and Multiregional Contexts (pp. 251–267). Hoboken, NJ: John Wiley & Sons.
Hubbard, F. and Lin, Y. (2011). Sample design. In Cross-Cultural Survey Guidelines. Ann Arbor, MI: Survey Research Center, Institute for Social Research, University of Michigan. Retrieved from http://ccsg.isr.umich.edu/Sampling.cfm [accessed 14 June 2016].
Kish, L. (1965). Survey Sampling. New York: John Wiley & Sons.
Kish, L. (1994). Multipopulation survey designs. International Statistical Review 62, 167–186.
Kish, L. (1995). Methods for design effects. Journal of Official Statistics 11, 55–77.
Kish, L. (2002). New paradigms (models) for probability sampling. Survey Methodology 28, 31–34.
Lavrakas, P. (1993). Telephone Survey Methods. Newbury Park: SAGE.
Lynn, P., Japec, L. and Lyberg, L. (2006). What's so special about cross-national surveys? In J. Harkness (ed.), Conducting Cross-National and Cross-Cultural Surveys (pp. 7–20). ZUMA-Nachrichten Spezial Band 12. http://www.gesis.org/fileadmin/upload/forschung/publikationen/zeitschriften/zuma_nachrichten_spezial/znspezial12.pdf
Zheng, Y. (2013). Comparative studies on survey sampling bias in cross-cultural social research. Proceedings of the 59th ISI World Statistics Congress, August 25–30, 2013, Hong Kong (Session CPS015).
PART VI
Data Collection
24
Questionnaire Pretesting
Gordon B. Willis
DEFINING PRETESTING
Survey pretesting can be conceptualized as product testing prior to the start of production fielding of the survey – whether fielding consists of administration in households, of business entities, over the telephone, via the Web, or in some other manner. Pretesting is conducted in order to detect and remediate problems before a standardized set of procedures is finalized, and is therefore distinguished from subsequent evaluation methods devoted to quality assessment within the fielded survey environment, as described by Esposito and Rothgeb (1997). Survey pretesting encompasses a range of empirical methods that are targeted towards either (a) the questionnaire instrument, or (b) other components of survey administration such as advance materials, respondent selection procedures, interviewing procedures, and other operational features. This chapter is restricted to the former – questionnaire pretesting – and will describe the major methods used to
test the questionnaire instrument, subsequent to its initial development, but prior to full-scale field administration.
BACKGROUND AND SCOPE
Key Literature Sources in Questionnaire Pretesting
Although questionnaire pretesting has been carried out for several decades, it is only relatively recently that methodologists have endeavored to treat it as an applied science, to systematically describe its component activities, and to critically evaluate these activities. Two books in particular stand out, and serve as key background sources: Methods for Testing and Evaluating Survey Questionnaires, by Presser et al. (2004a), and Question Evaluation Methods, by Madans et al. (2011). Both are edited volumes containing detailed descriptions of methods by experts, in sufficient depth that the reader can
largely rely on these as a compendium of the current approaches. Below, I synthesize much of the information presented in those books, and beyond this, discuss an integrative approach to pretesting methodology that considers the methods in combination.
Scope of Chapter
Referring to the overall instrument development sequence depicted in Box 24.1, questionnaire pretesting exists within a developmental continuum consisting of three major activities: (1) formative methods used to develop the questionnaire; (2) questionnaire pretesting; and (3) quality assessment methods that are conducted through the course of production fielding. My focus is on the methods within Step 2 of Box 24.1, which are generally reparative in nature (Willis, 2015a). That is, they are oriented
towards the repair of defects in survey questions that are detected as a function of the pretesting process, as opposed to the descriptive methods in Step 3 which are carried out during the production survey to enable quality assessment. Further, I consider pretesting in contexts that involve both interviewer-administration (IAQ) and self-administration (SAQ), while recognizing that large-scale surveys increasingly involve self-administration (internet/web modes, in particular). I also pay special attention to surveys that can be labeled multi in nature: i.e., they are multi-cultural, multi-lingual, or multi-national (Harkness et al., 2010). Further, with respect to scope, I regard the focus group as a developmental rather than as a pretesting method, as it is normally applied in the initial, conceptual development of a survey questionnaire, rather than as a means for pretesting survey items per se (Fowler, 1995; Krueger, 1994; Willis, 2005).
Box 24.1 Questionnaire development, pretesting, and evaluation sequence.
Step 1: Questionnaire development activities
(a) Develop research questions, analytic objectives
(b) Background research into concepts/constructs to be covered: review of conceptual areas to be covered by survey
(c) Conduct focus groups or ethnographic interviews: to understand topic areas, applicability to multiple population groups, appropriate terminology and language
(d) Translate concepts into initial draft survey questions
(e) Conduct expert review: appraisal of questions for common design pitfalls
Step 2: Empirical pretesting of the survey questionnaire (Reparative)
(a) Stand-alone pretesting methods
(1) Cognitive Interviewing (Qualitative Interviewing)
(2) Usability Testing/Eye Tracking (computerized instrument)
(b) Embedded pretesting methods (conducted within the Field Test/Pilot Test)
(3) Interviewer Debriefing
(4) Behavior Coding (interviewer-administration)
(5) Psychometric/Item Response Theory (IRT) Analysis
(6) Planned Experiments
(7) Field-based Probing (Respondent Debriefing and Embedded Probing/Web Probing)
Step 3: Evaluation/Quality Assessment within the Production survey (Descriptive)
(a) Methods that may also be used in pretesting: Interviewer and Respondent Debriefing, Psychometric Analysis
(b) Methods that are too resource or data intensive for pretesting, and so are applied within the production survey: statistical modeling, including Latent Class Analysis, Multi-Trait Multi-Method (MTMM)
Similarly, although expert review procedures often straddle the gap between development and pretesting, I consider them as also mainly developmental, and therefore out of scope. For each of the seven pretesting methods listed in Box 24.1, Step 2, I will review its: (a) purpose and the context in which it is applied, (b) key procedures, (c) analysis procedures and implications of findings for questionnaire modification, and (d) resource requirements, benefits, limitations, and practical issues related to its use. I first discuss two methods – cognitive interviewing and usability testing – that I label Stand-Alone Pretesting Procedures. These are generally conducted relatively early in the pretest stage, and are characterized by their focus on intensively evaluating the instrument through procedures that make little effort towards replication of the field environment, and therefore stand apart from more integrated evaluations such as the field or pilot test to be described later.
COGNITIVE INTERVIEWING
Purpose and Context in Which it is Applied
Cognitive interviewing is the pretesting method that has received the most attention in the literature, and is the subject of several books (Collins, 2015; Miller et al., 2014; Willis, 2005; 2015a). Over the past twenty years, cognitive testing has evolved to the point that it is the approach normally assumed (as by Federal clients of contract research organizations) to be included in a comprehensive pretesting plan. Further, it is the only method to have spawned dedicated entities – cognitive laboratories – devoted to its practice, for both population surveys as well as those involving businesses and other establishments (Willimack et al., 2004). In classic form, cognitive testing of draft survey questionnaires is intended to identify the cognitive processes associated with answering the
survey questions. Although several cognitive models of the survey response process have been proposed (see Chapter 15 by Miller and Willis, this Handbook), the model having the most influence through the years – perhaps largely due to its simplicity – has been the four-stage model introduced by Tourangeau (1984). The Four-Stage Cognitive Model focuses on the serial operation of (a) comprehension; (b) retrieval; (c) judgment/estimation; and (d) response mapping processes. Originally, cognitive interviewing was viewed by advocates such as Loftus (1984) as a means for understanding how these processes operate when survey questions are administered. For example, Loftus assessed the nature of retrieval processes for individuals who were asked questions concerning medical visits, to determine whether these reports were produced in a forward or backward chronological order. A signal event in the history of pretesting was the realization that cognitive interviewing could be used not only to describe the cognitive processes associated with answering survey questions, but as a means for evaluating and modifying draft survey questions for evidence of problems prior to fielding – that is, for pretesting purposes. To this end, the Questionnaire Design Research Laboratory (QDRL) at the National Center for Health Statistics (NCHS) was established as a test case, and has persisted as an organizational entity to the present. The attractiveness of a laboratory devoted to cognitive pretesting has resulted in the establishment of laboratories at the US Census Bureau, at the Bureau of Labor Statistics, at several contract research organizations, and within several European statistical agencies and research institutions. Although the initial, and nominal focus of cognitive interviewing has been cognitive processes, there has been an evolution in perspective concerning its nature and application. In brief, researchers who applied cognitive testing appreciated very early that many of the issues, defects, and limitations of survey questions cannot easily be fit within
the standard four-stage cognitive model. Rather, a significant proportion of the relevant findings appeared to represent a category initially labeled as logical/structural problems (Willis et al., 1991), and increasingly recognized as socio-cultural and anthropological in nature (Gerber, 1999; Miller, 2011). As a simple illustration, a question asking about the respondent’s ‘usual source of care in the past 12 months’ makes the critical assumptions that the respondent has a usual source, and has even visited a practitioner during that period. The fault in the question is in its underlying logic, rather than strictly in the way that it is comprehended or otherwise cognitively processed. Today, cognitive interviewing is characterized by several related perspectives, deriving from sociologically based qualitative research (e.g., the Interpretivist Perspective of Miller, 2011), as much as cognitive theory.
Key Procedures
As a stand-alone method, cognitive interviewing is normally carried out either in the cognitive laboratory or at another site – such as a home, clinic, or business – in which specially recruited participants (or subjects) can be interviewed. Recruitment is accomplished through a variety of means, often involving provision of incentives or remuneration, and relying on convenience or quota sampling to maximize the variability of demographic, behavioral, and other characteristics relevant to the survey. Cognitive testing is best done subsequent to initial questionnaire drafting, and with sufficient time to make changes prior to field testing or other subsequent steps. In brief, cognitive testing involves the supplementation of the participant's answers to the survey questions, through the collection of verbal reports obtained by either think-aloud or probing procedures. Thinking aloud is based on the notion, advanced by Ericsson and Simon (1980), that individuals can provide access to their cognitive processes by
spontaneously talking through their activities as they complete them. As an alternative, or as a complement to the think-aloud, verbal probing is much more interviewer-directed, and relies on targeted queries ('What does the term "vacation home" mean to you?'; 'Can you tell me more about that?') that are administered after the participant answers the tested, or target, survey questions. Much more detail concerning the activities involved in cognitive probing is provided by Willis (2005). Overall, it appears that the passage of time has been accompanied by increased emphasis on probing, rather than reliance on pure think-aloud. Further, investigators have tended to make use of concurrent probing (probe questions administered immediately after each target survey question is answered), especially for interviewer-administered questionnaires. On the other hand, retrospective probing (debriefing, in which probes are asked as a group, subsequent to main survey completion) is often used for self-administered instruments such as paper-based questionnaires, or whenever there is an emphasis on not interrupting the individual in the course of responding to the target questions. For both concurrent and retrospective interviews, participants are typically asked to answer the target questions with respect to their own behavior or life situation – but in some cases they are asked to respond to vignettes, or 'brief, imaginary stories' (Goerman and Clifton, 2011: 363) that describe a hypothetical survey respondent. Cognitive testing is normally conducted using small samples, spread across several iterations (rounds) in which a set of perhaps 8–12 individuals are tested, the results are considered, modifications to questions are made, and a further round of testing is carried out of the revised items. Increasingly, investigations that make use of multiple groups, such as English, Spanish, and Asian language speakers, call for much increased sample size (Chan and Pan, 2011; Miller, 2011; Willis, 2015b). Cognitive interviewing is generally carried out as a flexible activity in which the
interviewer reacts to the ongoing interaction with the participant, and is encouraged to modify and develop probe questions as appropriate during the course of the interview. Although more standardized and even rigid approaches to probing are sometimes used, this can produce problems where fully scripted probes are found to be insufficient in adequately investigating the targeted items (Beatty and Willis, 2007). On the other hand, allowing latitude in probing, and in the general conduct of the interview, leads to variation across interviewers that may not be well-controlled or characterized, and to discrepancies in the nature of findings that are a function of interviewer approach as opposed to inherent item functioning.
Analysis Procedures and Implications for Question Modification
Data to be analyzed from cognitive interviews vary, depending on whether think-aloud or probing procedures are used, and normally consist of either qualitative, written comments recorded by the cognitive interviewer, or (less frequently) transcripts of the interview. When data consist of written comments (akin to ethnographic field notes) concerning observations of question function, analysis is largely interpretive. The analyst – who is often the original cognitive interviewer – must consider the objectives of the target item, the participant's answer to it, and the detailed, text-based notes from testing of that item, in order to make judgments concerning question function. A more formalized manner in which to analyze cognitive interviews is to make use of coding systems that characterize problems found (Willis, 2015a), or the ways in which questions are interpreted (Miller et al., 2014). Such coding systems may reflect the Tourangeau Four-Stage model (that is, explicit codes for comprehension, retrieval, judgment/estimation, and response mapping). Alternatively, the coding
system may relate to a range of problematic question features, such as the comprehensive coding set described by Lee (2014) which captures cognitive problems, issues with survey question construction, and cultural elements, within the same coding system. In all cases, the analytic objective is to understand the ways in which items function, and to identify problems they present, such that they can be modified to rectify the observed difficulties. For instance, if the term ‘groundwater’ is found to exhibit a code of ‘Difficult Terminology’, a simpler substitute may be attempted – e.g., ‘The water beneath the surface of the ground’. Increasingly, analysis of cognitive interviews focuses on the variation in comprehension and interpretation of key concepts and of target items, across cultural groups. For example, Thrasher et al. (2011) reported that interpretations of tobacco ‘addiction’ varied across participants from the United States, Australia, Uruguay, Mexico, Malaysia, and Thailand.
Resource Requirements, Benefits, Limitations, and Practical Issues
On a per-interview basis, cognitive testing is a fairly resource intensive activity, requiring significant attention by specially trained staff to participant recruitment, design of the cognitive test, analysis, interpretation, and extensive discussion with collaborators or clients concerning testing objectives and question modifications. As such, the total sample sizes for cognitive interviews devoted to pretesting tend to be modest, although extensive cross-cultural investigations in particular may involve several hundred interviews (Willis, 2015b). Due to the heavy reliance on a small number of participants for insights concerning question function, it is especially important to carefully recruit a wide variety of individuals to cover a range of possible reactions and interpretations, and to carry out testing in an unbiased, nondirective manner.
Given sufficient coverage of the relevant population, cognitive interviewing is often regarded as the method that is the most sensitive to underlying dynamics of the survey response process, and particularly useful for diagnosis of variation in item comprehension across individuals, or sub-populations (Beatty and Willis, 2007). However, it is unclear whether the strongly interpretive, and even subjective, results of cognitive interviewing are dependent on variation in the conduct of the interviews, or to the presence of ‘house effects’ (systematic variation across practitioner groups or laboratories) that may result in a lack of uniformity in the conclusions made with respect to a set of target survey items. Empirical tests of the consistency and reliability of cognitive interviewing results have been mixed, sometimes suggesting consistency, but other times finding that different observers fail to come to similar conclusions, based on the same set of evaluated items (see Willis, 2005; 2015a). With respect to the evaluation of method validity, as opposed to reliability, Yan et al. (2012) found that substantive results from cognitive interviews correlated with item validity estimates, suggesting that the method has measurable utility.
USABILITY TESTING
Purpose and Context
Usability testing focuses on questionnaires that have a strong visual rather than auditory element – where there is some type of instrumentation for the respondents to 'use' in the sense that they need to meaningfully interact with it in order to produce data. Although one might talk about the usability of a paper questionnaire, such as whether the skip patterns are clear and easy to follow, the term is mainly reserved for computerized instruments, and normally those that involve some degree of problem-solving to navigate, especially as administration options include handheld and
mobile devices. Couper (1999) describes usability-based pretesting of instruments as an extension of cognitive testing, and argues that techniques used in cognitive interviewing can be applied to usability testing as well. Although testing once focused mainly on usability of the instrumentation by interviewers rather than the respondent (as it was the former who relied on the computer within IAQ surveys), it appears that there has been a shift to usability by the respondent, as questionnaires increasingly require self-administration (e.g., Web surveys). Like cognitive testing, usability testing is conducted apart from the field pretest environment, usually prior to final programming of an operational instrument.
Key Procedures
Interestingly, although there are several recent volumes on the design of Web surveys (e.g., Tourangeau et al., 2013), and despite the importance of usability testing of survey questionnaires as an emerging activity, there appears to be a paucity of published literature describing the methodological details. However, despite the somewhat independent historical developments of the methods, general procedures are similar to those of cognitive testing, and involve participant recruitment, identification of specific testing objectives, development of a testing procedure, one-on-one conduct of the interviews, and collection of mainly qualitative data. Research questions include general concerns such as 'Can the user easily complete the instrument?'; but also may be more specific, such as whether users can determine how to stop and restart at a later point, or whether they know where to look for online help. Sample sizes are similar to those used for cognitive interviewing, and testing is also generally iterative in nature. Baker et al. (2004) describe the use of a one-on-one interviewing technique that is analogous to retrospective debriefing in
cognitive interviewing, in which the user first completes the questionnaire, and the researcher then revisits each question screen, probing the participant concerning ambiguities or difficulties. Similarly to cognitive testing, usability testers rely on both think-aloud and verbal probing procedures – and there is a parallel ongoing debate concerning which of these is most efficient and productive (Killam, undated). Because it can be time-consuming and resource intensive to fully program operable Web surveys, it is not always practical to program one version to be tested, and to then fully re-program for production fielding. An alternative is to produce simpler – even paper-based – mock-ups of the Web survey, for evaluation of dominant trends related to usability. In such cases, testing can be done relatively early in the testing process, where major changes are still possible. However, paper prototypes, or those that rely on limited functionality, cannot reproduce the full operating environment, and are therefore restricted to investigating the overall appearance of component Web pages or other elements, and the general structure of the task to be completed. Because web surveys in particular vary tremendously in complexity and cognitive demands, those that present a relatively linear, streamlined analog to a simple paper questionnaire may present few usability issues. In particular, given that the computer can take control of question sequencing (skip patterns), there may be few demands on the navigational behaviors of respondents. On the other hand, data-collection instruments that involve flexible, complex actions, such as a 24-hour dietary recall task (Subar et al., 2012), present an array of screens, boxes to click, search activities, and other features that can require significant problem-solving and navigational behavior. Whereas the former, simple model may effectively require only the testing of a set of questionnaire items, akin to cognitive interviewing, the latter poses cognitive demands that can only
be studied and minimized through an intensive program of usability investigation that involves the functioning system. Techniques for adapting to this range of requirements, achieving an adequate understanding of user difficulties, and then resolving them, are generally developed on an ad hoc basis, and to date no clearly articulated methodology has emerged. A further, technologically sophisticated adjunct to usability testing involves eye-tracking, where researchers can observe eye fixations and movements as the user completes the task (Galesic and Yan, 2011); again, with the purpose of identifying problematic system user requirements. It remains to be seen whether this procedure has sufficient utility that it can be incorporated routinely into ongoing pretesting and evaluation.
Analysis Procedures and Implications of Findings
Similar to cognitive interviewing, analysis of usability data is mainly qualitative – and based on interpretation by the interviewers concerning dominant problems, and potential modifications. However, usability testing can produce quantitative metrics, especially related to well-defined problem-solving behaviors; e.g., the proportion of users who are able to successfully move between screens to select and complete a data entry task. Further, sophisticated analysis procedures may involve a simultaneous review of the user's hands on the keyboard/mouse, the display they are viewing at any time, and the direction of their eyes through use of tracking software. Although the nominal focus of usability-based pretesting is, as the term implies, on system design (e.g., literally making the system more usable), the results tend to pertain to questionnaire design as well. If a user remarks that a Web-based questionnaire is easy to use, but that the questions are confusing, this finding will need to be addressed just as it would within a cognitive interviewing investigation.
Resource Requirements, Benefits, Limitations, and Practical Issues
Like cognitive testing, usability testing is resource intensive on a per-interview basis, requiring significant expertise and cost, and so usually leads to the use of small sample sizes. The presumed benefits and drawbacks are somewhat parallel. On the one hand, researchers are able to intensively investigate the underlying nature of the reactions of participants to our presented materials. On the other hand, the use of small and possibly non-representative samples, along with the artificial, laboratory-like environment and testing procedures, calls into question the generalizability of the results to the production survey environment.
PROCEDURES WITHIN THE FIELD PRETEST
In contrast to the stand-alone pretesting approaches of cognitive and usability testing, researchers may rely on procedures that are embedded within a field pretest, also referred to as a pilot test, dress rehearsal, or conventional pretest (Presser et al., 2004a). Such tests are often carried out on a scale that allows quantitative data, such as frequencies of responses to each survey question, to be collected, although qualitative data are also relied upon. With respect to questionnaire pretesting, Harris-Kojetin and Dahlhamer (2011: 322) state that field tests are 'not so much a method for performing survey questionnaire evaluations as they are mechanisms or settings that allow the utilization of other evaluation methods' – those I label embedded. I therefore next discuss methods best suited to inclusion within a field environment, and thus typically embedded within or piggy-backed onto a field pretest of a survey questionnaire: interviewer debriefing, behavior coding, psychometric-IRT modeling, planned experiments, and field-based probing.
INTERVIEWER DEBRIEFING
Purpose and Context
I cover interviewer debriefing first, because it is the default procedure used for decades as part of a field pretest, and as Presser et al. (2004a) point out, has sometimes been viewed as sufficient for evaluation of survey questions. The driving assumption is that interviews of interviewers – almost always as a type of focus group or group discussion – are effective for identifying problematic survey questions.
Key Procedures
The interviewer debriefing is normally conducted at the close of the pretest, either in person or via telephone, where a leader induces the interviewers to discuss problems they encountered through field test administration, sometimes relying on a set of notes, but often dependent only on memory.
Analysis Procedures and Implications of Findings for Questionnaire Modification
Analysis is typically qualitative in nature, often depending on notes taken during a group debriefing of interviewers who have recently completed a set of field test interviews. This discussion may be facilitated through the use of auxiliary information, such as the raw frequencies of responses to the tested items, or paradata like rates of missing data at the item level. Based on this accumulated information, the designers make judgments concerning modification to features such as question wording, format, and ordering.
Resource Requirements, Benefits, and Limitations
Because it is conducted at the close of the field test, where it is efficient to conduct a group
review, interviewer debriefing poses few additional resource requirements. However, such discussions can be unsystematic, as where particular interviewers dominate the discussion. Even if such factors can be controlled, a fundamental limitation is that interviewers tend to focus on problems that exist for themselves when administering the questions, as opposed to problems experienced by respondents. Hence, survey administrators have developed methods that strive to negate or offset these drawbacks, especially behavior coding.
BEHAVIOR CODING
Purpose and Context
Behavior coding relies on logic similar to that supporting the interviewer debriefing: that observation and review of field pretest interviews provides information useful for diagnostic purposes. However, behavior coding relies on the systematic observation and coding of behaviors of both the interviewer, and of the respondent, by a third party labeled the behavior coder. Similarly to cognitive interviewing, behavior coding is intended to systematically ferret out problems on a question-level basis, using a set of prescribed methods. Whereas cognitive interviewing metaphorically emphasizes the part of the iceberg under the water, behavior coding focuses on the visible part, but across a larger area of the sea. That is, behavior codes focus on the observable, by effectively eavesdropping on a relatively large set of interviews, allowing the researcher to obtain quantitative estimates of the frequencies of coded behaviors. Behavior coding is in some cases referred to as interaction coding (van der Zouwen, 2002), a label that reflects the interactive focus of this pretesting method as it emphasizes the behaviors of both interviewers and survey respondents. Unlike cognitive interviewing, behavior coding is amenable only to interviewer-administered
questionnaires (it is certainly possible to develop behavioral codes for self-completion questionnaires, but this would constitute a major departure worthy of a different label). Fowler (2011) describes behavior coding as a mechanism for systematically capturing cases in which the verbal exchange does not go according to plan – or what Schaeffer and Maynard (1996) refer to as a paradigmatic sequence. The assumptions underlying behavior coding are that (a) observable deviations from the paradigmatic sequence signal threats to data quality; and (b) these deviations can be minimized through adjustments to question design. These two features lead to the use of behavior coding in a field pretest – where interactions between interviewers and respondents can be observed in an ecologically valid environment, and where the observed deviations can be used as evidence of problems to be repaired, prior to production survey administration.
Key Procedures
Behavior coding is normally accomplished by use of trained coders who listen to recordings of field interviews and apply a set of predetermined codes to each interaction involving a set of targeted survey questions. Interviews may be recorded over the telephone, or for in-person interviews by making use of laptop computers through Computer Audio Recorded Interviewing (CARI) (Thissen et al., 2008). No standard set of behavior codes exists, but codes are normally divided into one set pertaining to the interviewer, and another to the respondent. Coding systems vary in complexity and number of codes used, but a typical system is the one in Box 24.2 below, modeled after Fowler (2011). Procedurally, coders may rely either on a paper coding form, or increasingly, on the use of a computerized system that allows them to review particular exchanges, enter codes, and type in additional notes. Several coders may be used, as the sample size for behavior coding
Box 24.2 Typical set of behavior codes used in questionnaire pretesting (based on Fowler, 2011).
Interviewer behaviors:
(1) Reads question as worded
(2) Minor modification in reading that does not affect meaning
(3) Major modification in reading that affects meaning
(4) Verifies information rather than reading question
(5) Skip error
Respondent behaviors:
(1) Interrupts question reading
(2) Requests repeat of question reading
(3) Requests clarification of question meaning
(4) Provides qualified response indicating uncertainty
(5) Provides an uncodeable response
(6) Answers with Don't Know/Refusal
is often 50–100 interviews, so that quantitative coding frequencies are produced for each coded item, for each behavior code. Beyond conducting quantitative coding activities, coders may also engage in a qualitative coder debriefing at the end of the process – similar to an interviewer debriefing, but guided by the more systematic experience of coding – in which they review their interpretations of the dominant patterns they have observed, as well as their views of why particular items may have produced high frequencies of codes indicating interactional difficulties.
Analysis Procedures
A common approach is to tabulate, for each item, the frequency with which each behavior code was applied – for example, an item may have been assigned a Major Misread in 30% of the 40 times it was administered, and Respondent Requests for Clarification for 20% of those 40. A summary coding table can be produced listing the code frequencies for each evaluated survey item, which is then used to identify items that appear to be particularly problematic. It is possible to set either an absolute threshold for problem identification – such as 15% (Fowler, 2011), or this can be
done relatively, where items that produce coding frequencies that are higher than average receive special scrutiny (Willis, 2005). Depending on the use of skip patterns in the questionnaire, it may also be necessary to set a floor for item administration – e.g., that an item was asked a minimum of ten times, in order for the effective sample size to support the use of quantitative counts. As indicated above, a coder debriefing, guided by frequency tables that highlight items receiving high frequencies of the various coding categories, may be used to augment quantitative data with qualitative information. Although behavior codes in themselves may indicate how often problems occur, they do not reveal the nature of these problems, or point to means for remediation. Behavior coders who have listened to a large number of interviews, and repeatedly observed the same type of problem, are likely to have developed a good understanding of what has gone wrong, and of revisions that may alleviate the problems.
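A sketch of this tabulation and flagging logic, using hypothetical coded records (the code labels, threshold handling, and administration floor are illustrative, not a standard implementation):

```python
from collections import Counter

# Sketch: per-item behavior-code rates and threshold-based flagging.
coded = [  # (item_id, behavior_code) -- one record per coded administration
    ("Q1", "major_misread"), ("Q1", "exact_reading"), ("Q1", "request_clarification"),
    ("Q2", "exact_reading"), ("Q2", "exact_reading"), ("Q2", "uncodeable_response"),
]
PROBLEM_CODES = {"major_misread", "request_clarification", "uncodeable_response"}
THRESHOLD = 0.15   # absolute problem-identification threshold
MIN_ADMIN = 2      # floor for item administrations (about 10 in practice)

administered = Counter(item for item, _ in coded)
problem_counts = Counter(item for item, code in coded if code in PROBLEM_CODES)

flagged = sorted(
    item for item, n_admin in administered.items()
    if n_admin >= MIN_ADMIN and problem_counts[item] / n_admin > THRESHOLD
)
print(flagged)  # items whose problem-code rate exceeds the threshold
```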
Resource Requirements, Benefits, and Limitations
Because behavior coding is almost always conducted within the context of a field
pretest, it presents relatively modest logistical challenges, beyond the request for respondent permission to record. Resource requirements mainly consist of coder training, the development of a system for recording selected interviews and perhaps of selected items within those interviews, and the time taken for coders to review recordings, conduct coding activities, and then participate in later activities such as coder debriefing. Behavior coding can be conducted across a fairly large range of interviews, so in departure from typical cognitive interviewing procedures, it provides a quantity of interviews sufficient for determining the extent of problems likely to occur in the field. Further, the use of coding as an empirical basis for identifying problematic questions provides a diagnostic measurement criterion arguably superior to those based only on opinions gathered via an interviewer debriefing. On the other hand, compared to cognitive testing in particular, behavior coding does not feature an intensive investigation of subtle problems that may involve respondent cognitive processes. Because behavior coding involves simply listening to the exchange between interviewer and respondent, it is prevented from identifying hidden problems that emerge only as a result of the conduct of the active probing that characterizes cognitive testing. Further, it has been argued that even if discrete non-paradigmatic behaviors of interviewers and respondents can be reliably coded, it is not necessarily clear which behaviors are reflective of problems with the data obtained within a field survey interview (Schaeffer and Dykema, 2011). In some cases, achievement of shared meaning between the interviewer and the respondents may be accomplished through an interactional sequence that does not necessarily follow a strictly scripted protocol. Therefore, departure from the script may not serve as a marker of error, if a complex verbal interaction instead serves to establish common ground (Grice, 1989), ultimately reducing the likelihood of error in the obtained response.
PSYCHOMETRICS/ITEM RESPONSE THEORY MODELING
Purpose and Context
Psychometric analysis has an established history within the science of item development and evaluation. As a generic term, psychometrics – or mental measurement – includes a range of procedures that rely on mathematical-statistical techniques applied to responses to survey questionnaires, in order to evaluate the measurement properties of the items. Specific procedures include test-retest assessment of reliability over multiple administrations, factor analysis to determine which items conceptually cluster, or computation of metrics like Cronbach's alpha for internal consistency reliability (the degree to which items within the same scale relate to one another). Increasingly, survey researchers have made use of Item Response Theory (IRT), a psychometric technique which is described by Reeve (2011) as assessing 'the relationship between an individual's response to a survey question and his or her standing on the construct being measured by the questionnaire'. In principle, pretesting of this type also includes other statistical techniques applied in order to ascertain the measurement properties of the evaluated items (as opposed to traditional statistical analysis to produce substantive analyses of the outcome measures), such as Latent Class Analysis (LCA) and Multi-Trait Multi-Method (MTMM) models. However, as I will discuss, there are severe limits to employing such sample-size intensive methods in survey pretesting.
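As a simple illustration of the kind of computation involved, Cronbach's alpha for a k-item scale can be obtained from the item variances and the variance of the total score. A minimal sketch with hypothetical responses to a three-item scale:

```python
# Sketch: Cronbach's alpha for a three-item scale (hypothetical 5-point responses).
def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

responses = [  # rows = respondents, columns = items of the same scale
    [4, 5, 4], [2, 2, 3], [5, 4, 4], [3, 3, 2], [4, 4, 5],
]
k = len(responses[0])
item_variances = [sample_variance([row[j] for row in responses]) for j in range(k)]
total_variance = sample_variance([sum(row) for row in responses])

alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 2))
```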
Key Procedures

The computation of any estimate, parameter, or index as a product of a psychometric or IRT modeling technique depends on the collection of responses to survey questions, across a range of respondents. As such, these techniques differ from cognitive/usability
testing and behavior coding, in that they do not depend on the collection of auxiliary data, but operate directly on the answers that respondents provide. Because of this, the use of these procedures for pretesting purposes generally imposes little additional respondent burden, or additional procedural requirements at the data-collection stage of a field pretest. A notable exception is test-retest reliability, which requires the literal replication of data collection at two time points.
Analysis Procedures and Implications of Findings for Questionnaire Modification

Each psychometric method requires its own analysis procedures and algorithms, often dictated by a particular formula (as for Cronbach's alpha) or software program (for IRT modeling). Notably, analysis and interpretation are often dependent on the satisfaction of key assumptions, which may be robust to minor deviation (Reeve, 2011), but cannot be assumed to be satisfied. The simplest example would be the assumption that the responses provided at the two time points of a test-retest reliability study are independent, such that responses to the item at the second administration are uninfluenced by the initial answers. As for any pretesting method, questionnaire items may be modified or deleted as a function of the obtained results; for example, the finding that two scale items are metrically redundant may lead to deleting one.
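As an illustration of the kind of computation involved, the following sketch applies the standard Cronbach's alpha formula – alpha = k/(k - 1) * (1 - sum of item variances / variance of the summed scale) – to a small, entirely hypothetical matrix of pretest responses to a four-item scale. It is a minimal sketch of the calculation only, not of any particular survey's analysis plan.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency reliability for an (n_respondents x k_items) matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    scale_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / scale_variance)

# Hypothetical responses from six pretest respondents to a four-item Likert scale
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
    [4, 4, 4, 5],
])
print(round(cronbach_alpha(responses), 2))
```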
Resource Requirements, Benefits, and Limitations

Although psychometric methods impose non-trivial requirements in terms of statistical expertise, and sometimes extensive sample size demands, the most critical requirement – and associated limitation – relates not to these, but to the sometimes underappreciated fact that they are mainly appropriate for the study of latent constructs. In theoretical terms, such constructs are unobservable, and must be inferred from the use of survey measures – e.g., psychological or attitudinal traits such as anxiety, as opposed to potentially observable phenomena such as the number of cigarettes smoked in the past seven days. In practical terms, such measures are normally ascertained through multi-item scales rather than single-item measures (Tucker et al., 2011). Due to this requirement, psychometric methods like IRT in particular are generally applicable only to a subset of types of survey questions – those that assess latent constructs through the use of multi-item scales – and cannot be applied to single-item measures of the type commonly used within government surveys (as examples of the latter, Tucker et al. list consumer expenditures, job search behaviors, and crime victimization). Due to this critical limitation, most psychometric methods cannot serve as general pretesting procedures, but rather as special-purpose tools which are applicable to specific contexts (Harkness and Johnson, 2011: 206). A further, serious limitation is that psychometric techniques may be so demanding of sample size, especially within multiple-group comparisons, that they are practical to implement only within a production survey, and therefore have utility mainly as descriptive, quality-assessment methods.

PLANNED EXPERIMENTS

Purpose and Context

The conduct of a planned (rather than a natural) experiment within a field pretest shares a number of commonalities with psychometric pretesting: mainly, that it involves statistical analysis of the responses given by survey respondents to survey items, and
is embedded within the conduct of a field pretest that provides sample size adequate for purely quantitative assessment. Features that distinguish experiments from psychometrics, however, are as follows:

•• Experiments can be conducted with any variety of item – e.g., factual, autobiographical, attitudinal, multi-item, single-item – although DeMaio and Willson (2011) discuss the distinction between experiments based on attitudinal/latent measures, versus objective measures.
•• Experiments are generally focused on particular a priori hypotheses concerning item function, as opposed to the application of a more general method for evaluating item function (Moore et al., 2004). Where there is uncertainty, say, between two approaches to asking questions about health insurance coverage, these can be compared through a split-sample experiment that randomly allocates each version to half the pretest sample (see the sketch following this list).
•• Finally, experiments may rely on dependent measures other than responses to the items themselves, including paradata such as indices of item administration difficulty. For example, Bassili and Scott (1996) and Draisma and Dijkstra (2004) studied the amount of time taken by respondents to answer survey questions as an index of question performance, under the presumption that questions taking longer to answer may be those that present some type of difficulty.
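The following sketch illustrates the logic of the last two points: a split-sample allocation that randomly assigns each pretest case to one of two question versions, and a comparison of mean response latencies between the two versions as a rough indicator of relative question difficulty. The case counts, latency values, and version labels are hypothetical, and the simple descriptive comparison stands in for whatever formal test a real pretest would specify.

```python
import random
from statistics import mean, stdev

random.seed(42)  # reproducible allocation for the illustration

# Hypothetical pretest sample: 20 case IDs randomly split between two wordings
cases = list(range(1, 21))
random.shuffle(cases)
allocation = {case: ("Version A" if i % 2 == 0 else "Version B")
              for i, case in enumerate(cases)}

# Hypothetical response latencies (seconds) captured as paradata for each case
latencies = {case: random.gauss(12 if allocation[case] == "Version A" else 16, 3)
             for case in cases}

for version in ("Version A", "Version B"):
    times = [latencies[c] for c in cases if allocation[c] == version]
    print(version, "mean latency:", round(mean(times), 1),
          "sd:", round(stdev(times), 1), "n:", len(times))
```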
Key Procedures

Although experimental manipulations may be done within-respondent (e.g., two versions of a question within the same instrument, as described by Krosnick, 2011), a common design feature is to include multiple instrument versions within the pretest, each administered to a sub-sample. An important consideration, however, is whether to rely on a factorial design that maintains strict control over the number of variables and their interactions, versus a 'package' approach which compares versions differing across a number of dimensions – e.g., comparison of a standard versus redesigned approach to the survey
interview (Tourangeau, 2004). The former is typically conducted in the context of hypothesis testing concerning the effects of, at most, a few variables (e.g., the influence of adding an introduction intended to soften the perceived impact of sensitive questions, relative to no introduction). The latter, package approach, would be more appropriate for a bridging study, in which a new version of a questionnaire is to be introduced, and a split-sample experiment is used to ascertain the impact of the revision on numerical estimates.
Analysis Procedures and Implications of Findings for Questionnaire Modification

Analysis procedures depend on the nature of the experiment conducted: for factorial experiments that incorporate careful control of independent variables and assignment to condition, standard statistical methods (such as ANOVA) may be used to estimate both main effects and interactions. For package studies, it is common to compute simple parameter estimates for key variables, and to make use of auxiliary data and paradata (such as questionnaire administration time, or skip pattern errors) to also assess the impact of the new version on operational indicators. A further issue in the analysis of experiments concerns whether to analyze the data as though they were produced under simple random sampling, versus taking into account the survey's complex sample design (see Krosnick, 2011, for a discussion of this issue). Finally, designers must make use of experimental results to decide which of the contrasted approaches is optimal for the production survey. Sometimes this decision can be made on the basis of the conclusion that one version produces a more plausible estimate (e.g., under a 'more is better' assumption), but at other times it can be difficult to know which estimate is closer to the true value.
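As a minimal illustration of the factorial case, the sketch below fits a two-way ANOVA to entirely fabricated pretest data with two manipulated factors (question wording and presence of a softening introduction), treating the data as if they came from simple random sampling. The variable names, sample size, and effect built into the fabricated outcome are invented for illustration, and the sketch deliberately ignores the complex-design issue noted above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 200  # hypothetical number of pretest cases

# Fabricated split-sample data: each case randomly assigned to one cell of a 2 x 2 design
df = pd.DataFrame({
    "wording": rng.choice(["A", "B"], size=n),
    "intro": rng.choice(["none", "softening"], size=n),
})
# Fabricated outcome: a reported count, with a small built-in effect of wording version B
df["response"] = rng.poisson(5, size=n) + (df["wording"] == "B").astype(int)

# Two-way ANOVA with interaction, analyzed as if under simple random sampling
model = ols("response ~ C(wording) * C(intro)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```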
Resource Requirements, Benefits, and Limitations

As Tourangeau (2004) points out, true factorial experiments can be demanding, because even three variables with three levels each would lead to 27 versions. Because it is often feasible to administer only two or perhaps three versions within a field pretest, it may be advisable to develop a simple design featuring a single independent variable. Or, one can include multiple main effects, but ignore potential interaction effects. A common strategy is to introduce multiple variations that are assumed to be independent, such as alternative wordings (Version A versus B) for each of a number of survey questions. A major benefit of experimental designs is that the estimates produced are amenable to standard statistical analysis procedures, can be assessed for statistical significance, are likely to be generalizable to the field context given the considerable ecological validity normally achieved in the field pretest environment, and are therefore compelling to clients or key stakeholders. On the other hand, these procedures can be difficult to implement, relative to a single pretest questionnaire – a reason that the US Census Bureau developed the Questionnaire Design Experimental Research Survey (QDERS) system (Rothgeb, 2007), which is conducted apart from the context of a survey's operational field pretest. Further, similarly to other purely quantitative measures such as behavior code frequencies, the interpretation of results may be difficult – especially within a package experiment, it can be unclear which of a number of confounded variables is responsible for varying estimates between experimental versions. Hence, the degree to which the experiment leads to a general understanding of questionnaire design effects can be very limited. As with any procedure that is implemented within a field pretest, the investigators must ensure that there is sufficient time to meaningfully analyze and interpret the findings – a
significant practical challenge is that the time interval between the field testing and full-scale production implementation is typically short.
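To make the combinatorial point concrete, the short sketch below enumerates the 27 versions implied by a full 3 × 3 × 3 factorial design and contrasts it with the main-effects-only strategy described above, in which each variation is randomized independently for every case. The factor names and levels are invented purely for illustration.

```python
from itertools import product
import random

# Three hypothetical design factors, three levels each
factors = {
    "wording": ["A", "B", "C"],
    "intro": ["none", "short", "long"],
    "order": ["original", "reversed", "grouped"],
}

# Full factorial: every combination is a distinct questionnaire version
versions = list(product(*factors.values()))
print(len(versions))  # 3 * 3 * 3 = 27

# Main-effects-only alternative: draw each factor independently per case,
# ignoring potential interaction effects as described in the text
def assign(case_id: int) -> dict:
    rng = random.Random(case_id)  # reproducible assignment per case
    return {name: rng.choice(levels) for name, levels in factors.items()}

print(assign(1), assign(2))
```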
FIELD-BASED PROBING (RESPONDENT DEBRIEFING/EMBEDDED PROBING)

Purpose and Context

This pretesting category includes several procedures that have key elements in common: (a) they involve the extraction of additional information from survey respondents from within the field pretest environment; and (b) in principle they can be regarded as cognitive interviewing conducted in this environment, as the queries asked of respondents consist of modifications of cognitive probes (DeMaio and Rothgeb, 1996).
Key Procedures

Respondent debriefing is similar to retrospective cognitive probing, in that the recipient of the survey questions, upon completion, is probed for further thoughts concerning the items. Embedded probing, on the other hand, can be viewed as an analog to concurrent probing, where probe questions are placed within the questionnaire, immediately subsequent to targeted items. Although probes may be placed literally at random (through the use of random probes as described by Schuman, 1966), they are more typically applied strategically, embedded in the context of items that the investigators have particular interest in evaluating. These probes can be implemented either within an interviewer-administered survey, when read by the interviewer, or through self-administration, especially in the form of Web-probing, where the respondent is asked to elaborate concerning his/her response to
one or more target items (Behr et al., 2014; Fowler et al., 2015; Murphy et al., 2014). For example, after presenting the target question ‘Since the first of May, have you or any other member of your household purchased any swimsuits or warm-up or ski suits?’, Murphy et al. (2014) later presented the open-ended probe: ‘What types of items did you think of when you read this question?’, for which respondents typed in an open-ended text response.
Analysis Procedures and Implications of Findings for Questionnaire Modification

When field-based probes make use of closed-ended response categories, analysis is straightforward, as these are tabulated as regular survey items. Where probes are open-ended, a coding system can be used to reduce the responses meaningfully. Interpretation of the results must also take into account an assessment of the utility of the information gathered. For the example above, if a majority of Web participants simply typed in the examples provided ('swimsuits, warm-up and ski suits'), this information might not be regarded as especially informative. However, if rich information is obtained (e.g., 'Any type of exercise clothing, like bicycle shorts'), the investigators can make use of it in a manner similar to cognitive testing results, to revise items otherwise containing hidden sources of misinterpretation or other difficulty.
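A minimal sketch of such a coding-and-tabulation step is given below: open-ended probe answers are mapped to a small set of interpretation codes via keyword matching and then counted, so that the share of respondents giving the dominant interpretation can be reported. The probe answers, keywords, and code labels are all invented for illustration; a real study would use a coding frame developed and applied by trained coders.

```python
from collections import Counter

# Hypothetical open-ended answers to the probe 'What types of items did you think of?'
answers = [
    "swimsuits and ski suits",
    "bathing suits for the kids",
    "any exercise clothing, like bicycle shorts",
    "warm-up suits",
    "gym clothes in general",
]

# Illustrative keyword-based coding frame mapping answers to interpretation codes
coding_frame = {
    "named examples only": ("swimsuit", "bathing", "ski", "warm-up"),
    "broader athletic wear": ("exercise", "bicycle", "gym"),
}

def code_answer(text: str) -> str:
    for code, keywords in coding_frame.items():
        if any(word in text.lower() for word in keywords):
            return code
    return "other/unclear"

counts = Counter(code_answer(a) for a in answers)
for code, n in counts.most_common():
    print(f"{code}: {n} of {len(answers)} ({100 * n / len(answers):.0f}%)")
```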
Resource Requirements, Benefits, and Limitations

Obtaining supplementary information through debriefing or embedded probes is a low-cost approach to obtaining the types of qualitative information that otherwise require a full cognitive testing study. The marginal costs of adding several probes
within a field pretest instrument are modest, as are the requirements for tabulating such data. Coding of open-ended responses may be more time-consuming than practical considerations allow, but the size of the investigation can be tailored such that these requirements remain manageable. Limitations of field-based probing approaches potentially involve (a) placement within the questionnaire development process, and (b) comprehensiveness of the information obtained. Because a field pretest is conducted relatively late in the questionnaire development process, there may be less time than is available within an early-pretest process like cognitive interviewing to carry out intensive discussions concerning interpretation and redesign. Further, a field pretest is almost always a 'one-shot' approach that precludes the possibility of iteration (i.e., continued testing of revisions made subsequent to the pretest). Finally, in terms of the nature of information gathered, only a few items can be probed, the scripted nature of the probes may render them problematic relative to the more flexible types generally used within cognitive interviews, and there is no opportunity for follow-up probing of unclear or particularly interesting responses. On the other hand, because the sheer number of responses that can be obtained for a handful of field-based target items far exceeds that of a normal cognitive interviewing study, it is possible to obtain information across a wider range of respondent types. Further, one can obtain a sample size sufficient for the production of useful quantitative information, such as the frequency with which particular item interpretations are in evidence (e.g., that 80% of respondents interpret the item in one dominant way). Finally, novel platforms for Web-based survey administration may allow for the incorporation of cognitive probes in a way that obviates the need to conduct a full field pretest (Fowler et al., 2015). Overall, field-based probing may serve as a useful adjunct to cognitive testing, rather than a replacement (Behr et al., 2014).
GENERAL/CROSS-CUTTING OBSERVATIONS AND RECOMMENDATIONS

Comparison Versus Combination of Pretesting Techniques

It seems clear that the survey methods field has moved away from evaluating survey pretesting methods by simply asking 'which is better?' in an absolute sense. Whereas earlier studies, such as those by Presser and Blair (1994) and Willis et al. (1999), relied at least partly on computing and counting quantitative metrics of quality, such as the number of problems identified in survey questions, more recent studies have taken an integrative approach which asks, instead: what does each method uniquely provide, and how do they fit together and complement one another? Further, there is considerable interest in the timing and resource implications involved in assembling a pretesting package that includes several methods, such as cognitive testing followed by behavior coding of field pretest interviews. The cumulative methodological studies that include qualitative comparisons of the results of survey pretesting methods share a common theme: these methods provide divergent, but complementary, types of information (Presser and Blair, 1994; Willis et al., 1999; Yan et al., 2012). Dillman and Redline (2004) examined the added value of incorporating cognitive testing as a complement to field tests, concluding that this combination is effective, as 'the conclusions drawn from the cognitive interviews were informative of why certain field test results were obtained' (p. 317). Further, in a relatively large multi-cultural investigation, Thrasher et al. (2011: 440) combined cognitive interviewing and behavior coding to evaluate tobacco survey items, and concluded that:

Coordinated qualitative pretesting of survey questions (or post-survey evaluation) is feasible across
cultural groups and can provide important information on comprehension and comparability. The CI (cognitive interviewing) appears to be a more robust technique than behavioral coding, although combinations of the two might be even better.
As a further promising proposal involving the combination of methods, Blair (2011) describes the implementation of experiments within cognitive interviewing studies, as opposed to a field pretest. Similarly to other enhancements of cognitive testing, this practice requires an increase in the number of interviews conducted, and calls for an investment of resources in excess of the minimal amount that might otherwise be expended simply to support a statement that 'the items were tested'. Further, rather than conducting comparisons between pretesting methods which assume each to constitute a static, fixed variable, some investigators have focused on the variability of procedures within each method, as a function of systematic variation of key parameters. DeMaio and Landreth (2004) conducted a methodological experiment contrasting three approaches to cognitive interviewing, involving the same questionnaire but different packages of procedures, and were unable to unambiguously recommend a particular approach. Further, Priede and Farrall (2011: 271) contrasted think-aloud (TA) versus verbal probing (VP) techniques, and in summary suggested that '… in terms of the eventual redrafting of the questions respondents were asked about, there was very little difference between the VP and TA interviews'. As researchers increasingly apply pretesting methods within complex multi-cultural, multi-lingual, and multi-national contexts, the complexities associated with these techniques also require attention to the myriad variables that define key procedural elements. For example, the nature of analysis of cognitive interviews when multiple laboratories or interviewing teams are used – whether these involve sequential, hierarchical forms of analysis, versus parallel analysis
at the lowest level – is a key parameter that varies across practitioners (Willis, 2015a). Empirical investigations have demonstrated that the effectiveness of alternative procedures depends largely on considerations such as previous interviewer experience and the methods used to control extraneous variation across laboratories (Miller et al., 2014; Willis, 2015a). Several researchers have also begun to examine the question of how particular methods, and cognitive interviewing in particular, may function differently across cultural groups (Pan et al., 2010; Willis, 2015b).
Evolution of Pretesting Methods as a Function of Survey Administration Mode

As mentioned above, the survey field in general seems to be moving away from interviewer-administered questionnaires (IAQ) towards self-administration (SAQ) – mainly for reasons of cost, and to maintain reasonably high response rates, especially as Random Digit Dial telephone surveys involving landline phones become increasingly infeasible, and the costs of in-person household visit interviews increase. In conjunction with this trend, pretesting methods useful for IAQ (e.g., interviewer debriefing) may come to be used less frequently, while the increasing uptake and adoption of Web surveys may promote the development of methods designed for SAQ, such as usability testing. Further, the development of quick and efficient Web-based platforms for survey administration may increasingly provide a mechanism for the pretesting of survey questions using methods, such as planned experiments, psychometric analysis, and embedded (Web) probing, that previously necessitated the conduct of a full field pretest due to sample size requirements (Fowler et al., 2015). The future of survey pretesting is therefore likely to be governed not so much by evaluation research that determines the effectiveness of each method, but rather by
the pragmatic considerations that drive the direction of the survey field as a whole.
Attention to a Total Survey Quality Framework

Stemming from the discussion above, which takes into account the heavy hand of budget and other practical considerations, there are increasing attempts to consider not only pretesting, but all of survey methodology, in terms of a Total Survey Quality framework that recognizes the existence of a larger but limited system in which resources must be optimally distributed (Biemer and Lyberg, 2003). As such, pretesting cannot be viewed as a separate activity divorced from the overall context of survey administration. As surveys become more challenging and costly, researchers need to focus on both quality and Fitness for Use (Juran and Gryna, 1980). A major challenge is to determine the relative amount of resources to be placed into pretesting, as opposed to other functions (e.g., follow-up of non-respondents to increase the response rate). Just as it does little good to expend resources such that many thousands of interviews are completed with an untested survey questionnaire, it also makes little sense to conduct pretesting to the point of creating an exceptionally high-quality instrument, only to complete a small number of interviews that provide insufficient statistical power, or at a distressingly low response rate that seriously calls into question the representativeness of the sample interviewed. Given that recent concerns about threats to validity have focused largely on declining response rates, survey researchers have responded by buttressing that part of the system, shifting resources towards the development of multiple-mode surveys, and generally devoting greater attention to 'getting the interview'. A potential cost, then, is the increasing complexity of pretesting, as multi-mode surveys may require cognitive testing of both paper and telephone-based versions, and usability testing of
computerized ones (Gray et al., 2014). Multi-cultural surveys, further, demand increased attention to varied sample composition. The unfortunate alternative to attending to these demands is to pay insufficient attention to questionnaire development, evaluation, and pretesting practices. How much pretesting one can afford must be assessed on a case-by-case basis, but perhaps incorporating at least the requirement that some form of pretesting should always be done (US Census Bureau, 2003) – and that, within the constraints of the total system, more is better.
Minimal Standards: Determining How Much Pretesting is Really Necessary

Beyond the question of which pretest methods to select, and what we can afford, is the persistent issue of how much pretesting is really necessary for every new questionnaire. The argument involves a theoretical – and even philosophical – debate concerning the optimal balance between conducting empirical testing in a tabula rasa manner, versus making use of cumulative design principles that obviate the need for extensive cycles and types of pretesting. The Interpretivist perspective (Miller et al., 2014) holds that all meaning contained within survey questions is developed at the time of the data-collection interaction. As an extreme expression of this view, there are no absolute or persistent truths concerning question functioning, as this is highly contextually sensitive, and must be ascertained for each new survey-based data-collection cycle, as measurement objectives, respondent characteristics, and survey contextual variables change. This stance leads to a strict empiricism which eschews the application of questionnaire-design rules or algorithms as determinants of question function, and emphasizes a heavy investment in the application of intensive methods like cognitive interviewing for every new investigation.
The alternative viewpoint, which has existed in some form ever since Payne (1951) explicated a set of questionnaire-design principles, is that a positive outcome of pretesting and other question development methods consists of a cumulative body of guidance in question design. Hence, the development of new questionnaires can be governed by a balance between the application of rules and the collection of empirical evidence. The model proposed by Saris et al. (2004) explicitly makes the case that previously collected empirical data on question reliability and validity can be relied upon as a basis for predicting the functioning of newly developed items, thereby reducing the need for purely empirical assessments. Less formally, question appraisal methods such as those by Lessler and Forsyth (1996) are designed to 'offload' some of the responsibility for question development and evaluation to experts. In its extreme form, this view would project that, over the history of questionnaire development and pretesting, the relative balance between the design of questions and their pretesting could shift to a point at which pretesting is used sparingly, or as a final quality check, as opposed to being the primary mechanism by which quality and function are assessed. Certainly, the researcher's view concerning where the field currently sits along this potential continuum affects the intensity of pretesting that is considered necessary in order to produce an instrument that is regarded as fit-for-use.
FUTURE DIRECTIONS

The observations above suggest several potentially fruitful directions that research into the science of pretesting might take. These fall into three areas: (a) extension to the multi-cultural/lingual domain, (b) improved documentation of pretesting practices, and (c) evaluation of methods. The effort to attain cross-cultural equivalence of survey questions in particular is a vital requirement (Johnson, 1998), calling for the application of methods,
such as cognitive interviewing and behavior coding, that are explicitly devoted to the study of question meaning and interpretation. Documentation of the details of pretests – including not only the actions taken and procedures used, but also the amount of resources expended – would be valuable in providing a body of information concerning usual-and-customary practices within the field, variations in pretesting procedures, and some measure of the relative amount of resources that actually end up being applied, beyond textbook exhortations concerning what one should do. To this end, Boeije and Willis (2013) have developed the Cognitive Interviewing Reporting Framework (CIRF), a checklist of elements to optimally include in a cognitive testing report – and this logic could be extended to any pretesting method. Further, in order to facilitate the wide sharing and distribution of the resultant testing reports, an inter-agency group of US-based investigators has developed the Q-Bank database, a repository of cognitive testing reports that is publicly available (http://wwwn.cdc.gov/qbank/home.aspx). Further developments related to the documentation of pretesting procedures and results will be invaluable as the base methods used to collect self-report data evolve – e.g., the use of Web panels, mobile devices, smartphones, and extensive mode-mixing – so that pretesting methodology is able to keep pace with these changes. As a final note, devotion of effort to evaluation studies of pretesting methods will help to address several of the issues raised above. Researchers will be better able to determine which variants of single pretest methods are the most efficient and effective in particular situations – for example, by assessing various models used to recruit non-English-speaking participants for cognitive interviewing studies (Park and Sha, 2014). Extensions of pretesting method comparisons can also be used to compare the results of fundamentally different methods, in order to address research questions such as which sequences and combinations of stand-alone and embedded
methods provide the greatest amount of non-overlapping, comprehensive information concerning item functioning. Finally, by establishing a program of methods evaluation that illustrates strengths, limitations, and best practices, researchers will be better able to confidently adapt pretesting methods to new data-collection methodologies, and approaches to survey measurement generally.
RECOMMENDED READINGS

Beatty and Willis (2007) – a review paper that outlines the state-of-the-science concerning cognitive interviewing as a major pretesting procedure.
Madans et al. (2011) – this volume compares and contrasts a range of survey pretesting and evaluation procedures, at a level of detail far greater than that of the current chapter.
Presser et al. (2004a) – a compendium of methods that are appropriate for questionnaire pretesting, based on the 2002 Questionnaire Development, Evaluation, and Testing (QDET) Conference.
Presser et al. (2004b) – the condensed, journal-article equivalent of the Presser et al. (2004a) book deriving from the QDET conference. The article discusses a range of questionnaire pretesting methods, with a critical analysis of each.
Willis (2005) – a detailed guide to the conduct of cognitive interviews, and especially the development of cognitive probes. The book is suitable as a training manual and as a comprehensive guide for students and survey researchers.
REFERENCES

Baker, R. P., Crawford, S., and Swinehart, J. (2004). Development and Testing of Web Questionnaires. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 361–384). Hoboken, NJ: John Wiley & Sons. Bassili, J. N., and Scott, B. S. (1996). Response Latency as a Signal to Question Problems in
Survey Research. Public Opinion Quarterly, 60(3), 390–399. Beatty, P. C., and Willis, G. B. (2007). Research Synthesis: The Practice of Cognitive Interviewing. Public Opinion Quarterly, 71(2): 287–311. Behr, D., Braun, M., Kaczmirek, L., and Bandilla, W. (2014). Item Comparability in CrossNational Surveys: Results From Asking Probing Questions in Cross-National Surveys About Attitudes Towards Civil Disobedience. Quality and Quantity, 48(1), 127–148. Biemer, P. P., and Lyberg, L. E. (2003). Introduction to Survey Quality. Hoboken, NJ: John Wiley & Sons. Blair, J. (2011). Response 1 to Krosnick’s Chapter: Experiments for Evaluating Survey Questions. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 239–251). Hoboken, NJ: John Wiley & Sons. Boeije, H., and Willis, G. (2013). The Cognitive Interviewing Reporting Framework (CIRF): Towards the Harmonization of Cognitive Interviewing Reports. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 9(3), 87–95. Chan, A. Y., and Pan, Y. (2011). The Use of Cognitive Interviewing to Explore the Effectiveness of Advance Supplemental Materials among Five Language Groups. Field Methods, 23(4), 342–361. Collins, D. (2015). Cognitive Interviewing Practice. London: SAGE. Couper, M. P. (1999). The Application of Cognitive Science to Computer Assisted Interviewing. In M. G. Sirken, D. J. Herrmann, S. Schechter, N. Schwarz, J. M. Tanur, and R. Tourangeau (eds), Cognition and Survey Research (pp. 277–300). New York: John Wiley & Sons. DeMaio, T., and Landreth, A. (2004). Do Different Cognitive Interview Techniques Produce Different Results? In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 89–108). Hoboken, NJ: John Wiley & Sons. DeMaio, T. J., and Rothgeb, J. M. (1996). Cognitive Interviewing Techniques: In the Lab and in the Field. In N. Schwarz and S. Sudman
(eds), Answering Questions: Methodology for Determining Cognitive and Communicative Processes in Survey Research (pp. 177–195). San Francisco, CA: Jossey-Bass. DeMaio, T., and Willson, S. (2011). Response 2 to Krosnick’s Chapter: Experiments for Evaluating Survey Questions. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 253–262). Hoboken, NJ: Wiley. Dillman, D. A., and Redline, C. D. (2004). Testing Paper Self-Administered Questionnaires: Cognitive Interview and Field Test Comparisons. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 299–317). Hoboken, NJ: John Wiley & Sons. Draisma, S., and Dijkstra, W. (2004). Response Latency and (Para)Linguistic Expressions as Indicators of Response Error. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 131–147). Hoboken, NJ: John Wiley & Sons. Ericsson, K. A., and Simon, H. A. (1980). Verbal Reports as Data. Psychological Review, 87(3), 215–251. Esposito, J. L., and Rothgeb, J. M. (1997). Evaluating Survey Data: Making the Transition From Pretesting to Quality Assessment. In L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz et al. (eds), Survey Measurement and Process Quality (pp. 541– 571). New York: John Wiley & Sons. Fowler, F. J. (1995). Improving Survey Questions: Design and Evaluation. Thousand Oaks, CA: SAGE. Fowler, F. (2011). Coding the Behavior of Interviewers and Respondents to Evaluate Survey Questions. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 7–21). Hoboken, NJ: John Wiley & Sons. Fowler, S. L., Willis, G., Ferrer, R., and Berrigan, D. (2015, December). Reliability Testing of the Walking Environment Module. Paper presented at the Federal Committee on Statistical Methodology Research Conference, Washington, DC.
Galesic, M. R., and Yan, T. (2011). Use of Eye Tracking for Studying Survey Response Processes. In M. Das, P. Ester, and L. Kaczmirek (eds), Social and Behavioral Research and the Internet: Advances in Applied Methods and Research Strategies (pp. 349–370). London: Routledge/Taylor & Francis Group. Gerber, E. R. (1999). The View from Anthropology: Ethnography and the Cognitive Interview. In M. Sirken, D. Herrmann, S. Schechter, N. Schwarz, J. Tanur, and R. Tourangeau (eds), Cognition and Survey Research (pp. 217–234). New York, NY: Wiley. Goerman, P. L., and Clifton, M. (2011). The Use of Vignettes in Cross-Cultural Cognitive Testing of Survey Instruments. Field Methods, 23(4), 362–378. Gray, M., Blake, M., and Campanelli, P. (2014). The Use of Cognitive Interviewing Methods to Evaluate Mode Effects in Survey Questions. Field Methods, 26(2), 156–171. Grice, H. P. (1989). Indicative Conditionals: Studies in the Way of Words. Cambridge, MA: Harvard University Press. Harkness, J. A., Braun, M., Edwards, B., Johnson, T. P., Lyberg, L., Mohler, P. P., Pennell, B. E., and Smith, T. W. (2010). Survey Methods in Multinational, Multiregional and Multicultural Contexts. Hoboken, NJ: John Wiley & Sons. Harkness, J. A., and Johnson, T. P. (2011). Response 2 to Biemer and Berzofsky’s Chapter: Some Issues in the Application of Latent Class Models for Questionnaire Design. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 199–212). Hoboken, NJ: John Wiley & Sons. Harris-Kojetin, B. A., and Dahlhamer, J. M. (2011). Using Field Tests to Evaluate Federal Statistical Survey Questionnaires. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 321–344). Hoboken, NJ: John Wiley & Sons. Johnson, T. P. (1998). Approaches to Equivalence in Cross-Cultural and Cross-National Survey Research. ZUMA Nachrichten Spezial, 3, 1–40. Juran, J. M., and Gryna, F. M. (1980). Quality Planning and Analysis, 2nd edn. New York: McGraw-Hill.
Killam, W. (undated). The Use and Misuse of the ‘Think aloud’ Protocol. Unpublished manuscript, User-Centered Design, Ashburn, VA. Krosnick, J. A. (2011). Experiments for Evaluating Survey Questions. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 215–238). Hoboken, NJ: John Wiley & Sons. Krueger, R. A. (1994). Focus Groups: A Practical Guide for Applied Research, 2nd edn. Thousand Oaks, CA: SAGE. Lee, J. (2014). Conducting Cognitive Interviews in Cross-National Settings. Assessment, 21(2), 227–240. Lessler, J. T., and Forsyth, B. H. (1996). A Coding System for Appraising Questionnaires. In N. Schwarz and S. Sudman (eds), Answering Questions (pp. 259–292). San Francisco, CA: Jossey-Bass. Loftus, E. (1984). Protocol Analysis of Responses to Survey Recall Questions. In T. B. Jabine, M. L. Straf, J. M. Tanur, and R. Tourangeau (eds), Cognitive Aspects of Survey Methodology: Building a Bridge Between Disciplines (pp. 61–64). Washington, DC: National Academies Press. Madans, J., Miller, K., Maitland, A., and Willis, G. (2011). Question Evaluation Methods. Hoboken, NJ: John Wiley & Sons. Miller, K. (2011). Cognitive Interviewing. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 51–75). Hoboken, NJ: Wiley. Miller, K., Willson, S., Chepp, V., and Padilla, J. L. (2014). Cognitive Interviewing Methodology. New York: John Wiley & Sons. Moore, J., Pascale, J., Doyle, P., Chan, A., and Griffiths, J. K. (2004). Using Field Experiments to Improve Instrument Design: The SIPP Methods Panel Project. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 189–207). Hoboken, NJ: John Wiley & Sons. Murphy, J., Edgar, J., and Keating, M. (2014, May). Crowdsourcing in the Cognitive Interviewing Process. Paper presented at the Annual Meeting of the American Association for Public Opinion Research, Anaheim, CA.
Pan, Y., Landreth, A., Park, H., Hinsdale-Shouse, M., and Schoua-Glusberg, A., (2010). Cognitive Interviewing in Non-English Languages: A Cross-Cultural Perspective. In Harkness, J.A., Braun, M., Edwards, B,. Johnson, T. P., Lyberg, L., Mohler, P. P., Pennell, B. E., and Smith T. W. (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 91–113). Hoboken, NJ: John Wiley & Sons. Park, H., and Sha, M. M. (2014). Evaluating the Efficiency of Methods to Recruit Asian Research Participants. Journal of Official Statistics, 30(2), 335–354. Payne, S. L. (1951). The Art of Asking Questions. Princeton, NJ: Princeton University Press. Presser S., and Blair, J. (1994). Survey Pretesting: Do Different Methods Produce Different Results? In P. V. Marsden (ed.), Sociological Methodology (pp. 73–104). San Francisco, CA: Jossey- Bass. Presser, S., Rothgeb, J. M., Couper, M. P., Lessler, J. T., Martin, E., Martin, J., and Singer, E. (2004a). Methods for Testing and Evaluating Survey Questionnaires. Hoboken, NJ: John Wiley & Sons. Presser, S., Couper, M. P., Lessler, J. T., Martin, E., Martin, J., Rothgeb, J. M., and Singer, E. (2004b). Methods For Testing and Evaluating Survey Questions. Public Opinion Quarterly, 68(1), 109–130. Priede, C., and Farrall, S. (2011). Comparing Results From Different Styles of Cognitive Interviewing: ‘Verbal Probing’ vs. ‘Thinking Aloud’. International Journal of Social Research Methodology, 14(4), 271–287. DOI:10.1080/13645579.2010.523187. Reeve, B. (2011). Applying Item Response Theory for Questionnaire Evaluation. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 105–123). Hoboken, NJ: John Wiley & Sons. Rothgeb, J. M. (2007). A Valuable Vehicle for Question Testing in a Field Environment: The US Census Bureau’s Questionnaire Design Experimental Research Survey. Research Report Series (Survey Methodology #2007– 17), US Census Bureau, Washington, DC: https://www.census.gov/srd/papers/pdf/ rsm2007-17.pdf Saris, W. E., van der Verd, W., and Gallhofer, I. (2004). Development and Improvement of
Questionnaires Using Predictions of Reliability and Validity. In Presser, S., Rothgeb, J. M., Couper, M. P., Lessler, J. T., Martin, E., Martin, J., and Singer, E. (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 275–297). Hoboken, NJ: John Wiley & Sons. Schaeffer, N. C., and Dykema, J. (2011). Response 1 to Fowler’s Chapter: Coding the Behavior of Interviewers and Respondents to Evaluate Survey Questions. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 23–39). Hoboken, NJ: Wiley. Schaeffer, N.C., and Maynard, D.W. (1996). From Paradigm to Prototype and Back Again: Interactive Aspects of Cognitive Processing in Standardized Survey Interviews. In N. Schwarz and S. Sudman (eds), Answering Questions: Methodology for Determining Cognitive and Communicative Processes in Survey Research (pp. 65–88). San Francisco, CA: Jossey-Bass. Schuman, H. (1966). The Random Probe: A Technique for Evaluating the Validity of Closed Questions. American Sociological Review, 31(2), 218–222. Subar, A. F., Kirkpatrick, S. I., Mittl, B., Zimmerman, T. P., Thompson, F. E., Bingley, C., Willis, G., Islam, N. G., Baranowski, T., McNutt, S., and Potischman, N. (2012). The Automated Self-Administered 24-Hour Dietary Recall (ASA24): A Resource for Researchers, Clinicians, and Educators from the National Cancer Institute. Journal of the Academy of Nutrition and Dietetics, 112(8), 1134–1137. Thissen, M. R., Fisher, C., Barber, L., and Sattaluri, S. (2008). Computer AudioRecorded Interviewing (CARI). A Tool for Monitoring Field Interviewers and Improving Field Data Collection. Proceedings of the Statistics Canada Symposium in Data Collection: Challenges, Achievements and New Directions. Ottawa: Statistics Canada. Thrasher, J. F., Quah, A. C. K., Dominick, G., Borland, R., Driezen, P., Awang, R., Omar, M., Hosking, W., Sirirassamee, B., and Boado, M. (2011). Using Cognitive Interviewing and Behavioral Coding to Determine Measurement Equivalence Across Linguistic and Cultural Groups: An Example From the International Tobacco Control Policy Evaluation Project. Field Methods, 23(4), 439–460.
Tourangeau, R. (1984). Cognitive Science and Survey Methods: A Cognitive Perspective. In T. Jabine, M. Straf, J. Tanur, and R. Tourangeau (eds), Cognitive Aspects of Survey Design: Building a Bridge Between Disciplines (pp. 73–100). Washington, DC: National Academies Press. Tourangeau, R. (2004). Experimental Design Considerations for Testing and Evaluating Questionnaires. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 209–224). Hoboken, NJ: John Wiley & Sons. Tourangeau, R., Conrad, F. G., and Couper, M. P. (2013). The Science of Web Surveys. New York: Oxford. Tucker, C., Meekins, B., Edgar, J., and Biemer, P. P. (2011). Response 2 to Reeve’s Chapter: Applying Item Response Theory for Questionnaire Evaluation. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods (pp. 137–150). Hoboken, NJ: John Wiley & Sons. US Census Bureau (2003). Census Bureau Standard: Pretesting Questionnaires and Related Methods for Surveys and Censuses. Washington, DC: US Census Bureau. van der Zouwen, J. (2002), Why Study Interaction in the Survey Interview? Response from a Survey Researcher. In D.W. Maynard, H. Houtkoop-Steenstra, N.C. Schaeffer, and J. van der Zouwen (eds), Standardization and Tacit Knowledge: Interaction and Practice in the Survey Interview (pp. 47–65). New York: John Wiley & Sons.
Willimack, D. K., Lyberg, L., Martin, J., Japec, L., and Whitridge, P. (2004). Evolution and Adaptation of Questionnaire Development, Evaluation, and Testing Methods for Establishment Surveys. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 385–407). Hoboken, NJ: John Wiley & Sons. Willis, G. (2005). Cognitive Interviewing: A Tool for Improving Questionnaire Design. Thousand Oaks, CA: SAGE. Willis, G. (2015a). Analysis of the Cognitive Interview in Questionnaire Design. New York: Oxford. Willis, G. (2015b). Research Synthesis: The Practice of Cross-Cultural Cognitive Interviewing. Public Opinion Quarterly, 79 (S1), 359–395, DOI: 10.1093/poq/nfu092. Willis, G. B., Royston, P., and Bercini, D. (1991). The Use of Verbal Report Methods in the Development and Testing of Survey Questionnaires. Applied Cognitive Psychology, 5, 251–267. Willis, G., Schechter S., and Whitaker, K. (1999). A Comparison of Cognitive Interviewing, Expert Review, and Behavior Coding: What do They Tell us? Proceedings of the Section of Survey Research Methods, American Statistical Association (pp. 28–37). Alexandria, VA: American Statistical Association. Yan, T., Kreuter, F., and Tourangeau, R. (2012). Evaluating Survey Questions: A Comparison of Methods. Journal of Official Statistics, 28 (4), 503–529.
25 Survey Fieldwork

Annelies G. Blom
INTRODUCTION

Broadly speaking, fieldwork is the collection of any kind of information about any kind of observational units, with the exception of research conducted in a laboratory and desk or library research. In survey research, fieldwork in its simplest form is the administration of a questionnaire to a group of target persons, where the target persons may respond on behalf of different types of entities such as individuals, households, or organizations. At first sight, conducting fieldwork seems straightforward. The researcher has developed a questionnaire, which is presented to a sample of units (typically persons) to be interviewed. In practice, you will soon discover that, if you are interested in obtaining valid data, planning, managing, and monitoring fieldwork is a science of its own. Fieldwork can be conducted across different types of modes, including face-to-face at people's homes or offices, by telephone, by mail,
or through the internet in the form of a web survey. The chosen mode largely determines the way fieldwork is conducted (see Chapters 11 and 18 in this Handbook for discussions of the impact of mode choice on the survey design and survey outcomes). While face-to-face and telephone surveys are implemented with the help of interviewers, who recruit respondents, ask the survey questions, and record the answers given, mail and web surveys are self-completion formats in which interviewers typically play no role. Overall, interviewer-mediated surveys demand a more complex fieldwork process than self-completion formats, as the interviewer adds an additional level of complexity. However, in circumstances in which respondents are unable to complete the questionnaire by themselves, interviewer-mediated surveys remain the most feasible form of survey data collection. Since the lessons learnt from complex interviewer-mediated surveys are more easily transferable to the other survey modes than vice versa, this chapter focuses on the processes of conducting fieldwork with the
help of interviewers, while remaining relevant for mail and web surveys as well. In addition to the survey mode, the sampling design largely determines the complexity of the survey fieldwork (see Chapters 21–23 in this Handbook for discussions of different types of sampling designs). Surveys based on probability samples are typically associated with far more complex fieldwork processes than those based on nonprobability samples. In probability samples, the interviewer can only interview the sampled units. If such a unit cannot be located, contacted, or refuses cooperation with the survey request, it cannot be replaced by a different sample unit, but instead becomes a nonrespondent to the survey. In order to limit nonresponse and the potential for nonresponse bias, fieldwork operations in probability sample surveys typically entail repeated contact attempts and refusal avoidance or refusal conversion techniques, thus increasing the fieldwork effort compared to nonprobability sample surveys (see also Chapter 27 for a discussion of unit nonresponse in probability sample surveys). To introduce the reader to the full potential complexity of fieldwork operations, this chapter thus focuses on fieldwork for surveys based on probability samples. Despite this focus, many sections are also relevant to researchers conducting surveys based on nonprobability samples. In a typical probability-based interviewer-mediated survey, the goals of fieldwork are: maximizing the sample size given a fixed gross sample (thus maximizing the response rate), minimizing potential biases due to nonresponse, and achieving interviews with low measurement error, in a timely manner within a set fieldwork period and budget (Figure 25.1). As this variety of goals shows, survey fieldwork always trades off many and sometimes opposing objectives.

Figure 25.1 Trading off fieldwork objectives: maximize sample size; minimize nonresponse error; minimize measurement error – all under budget and time constraints.

More often than not, researchers find themselves in the situation where they outsource the fieldwork of their survey. Setting up the infrastructure needed to conduct a sizeable number of interviews and to monitor the interviewers' work oneself for a single
survey would be inefficient and thus costly, especially in face-to-face surveys of the general population. Therefore, data collection is often outsourced to a (commercial) data collection agency, which employs interviewers to conduct the interviews. Conducting and monitoring fieldwork thus becomes a multi-tiered process, with the survey managers and fieldwork operations acting as intermediaries between the researcher designing the survey and the interviewers conducting the data collection. In such situations 'researchers are far removed from the interviewers and the actual survey operations' (Koch et al., 2009). Nonetheless, researchers will need to ensure that they have information about, and influence on, fieldwork progress and quality. The aim of this chapter is to make explicit the fieldwork processes that influence survey data and to provide researchers embarking on survey fieldwork with support regarding what to look out for and account for when planning and conducting their fieldwork. The next section on 'planning fieldwork' constitutes the core of this chapter. Here, the researcher is guided through the key issues to account for when planning a survey fieldwork operation or when subcontracting fieldwork to a (commercial) data collection organization. Its subsections provide details on how to announce the survey, the advantages and disadvantages
of various types of respondent incentives, on interviewer training, fieldwork monitoring, post-fieldwork checks, and on how to deal with situations where fieldwork targets are not met. The third section on 'responsive designs' and the fourth section on 'interviewer effects' are more technical sections, which may be of special interest to researchers with experience in conducting fieldwork and to data collection organizations. The section on 'responsive designs' describes the relatively recent innovation in fieldwork practice of turning the focus away from monitoring response rates towards monitoring nonresponse bias and measurement error. In the section on 'interviewer effects' I consider another technical and relatively under-reported phenomenon in large-scale survey data collections: the effect that interviewers have on the data collected, in terms of bias and variance. While the section on 'planning fieldwork' was written primarily for researchers new to conducting survey fieldwork, the later sections are more specialized discussions, which might inspire those with several years of fieldwork experience.
PLANNING FIELDWORK

Careful planning, monitoring, and reacting to deviations from the expected results are key to successful fieldwork. Many researchers running a survey for the first time will be surprised at the complexities thereof, and even seasoned survey specialists are faced with new challenges during each new data collection. Even when repeating exactly the same survey with the same survey organization in the same region, fieldwork outcomes can vary dramatically, since many seemingly small factors can change and affect the interacting fieldwork mechanisms. Such factors may, for example, be the political situation at the moment of fieldwork, which can influence people's sense of duty, willingness to be interviewed on a particular topic, trust in the
survey sponsor, or worries about data protection (e.g. Lyberg and Dean, 1992). Another factor may be the weather conditions during the fieldwork period. For example, in Northern Europe, rare beautiful summer days may well affect whether interviewers can find people at home, and dark winter seasons may influence whether sample units are likely to let an interviewer into their house in the evening. It may also be seemingly irrelevant details of the fieldwork design, such as the type of stamps used on the letter announcing the survey (e.g. Dillman et al., 2014) or how the survey was presented to interviewers during their training (e.g. Groves and McGonagle, 2001). Therefore, planning and managing fieldwork needs to be forward looking and – most importantly – fast at reacting to fieldwork realities. No fieldwork is like the previous one, but many aspects can be planned and thought through to prevent unpleasant surprises. When outsourcing, the planning of the fieldwork starts with writing and publishing a call for tender for conducting the survey fieldwork. This call for tender needs to stipulate the fieldwork design as well as the expectations the researcher has of the survey organization. Furthermore, any contract made with the survey organization needs to describe in detail which actions or repercussions follow if expectations are not met. But even if the survey is not contracted out and all fieldwork is conducted in-house, setting clear expectations early on enables researchers to check fieldwork progress against prior expectations. The checklist in Figure 25.2 gives an overview of key design aspects that anyone conducting an interviewer-mediated probability-based survey should have considered before embarking on fieldwork or before publishing a call for tender for outsourcing the fieldwork.
Figure 25.2 Checklist for fieldwork planning:
•• How will the survey be announced to the target persons?
•• Will respondent incentives be used? If so, which type(s) of incentives?
•• How will the interviewers be trained for this survey?
•• How will you monitor fieldwork progress?
•• How will you check for fraud by the interviewers or the survey organization?
•• What will the repercussions be if specifications or targets are not met?

Announcing the Survey

In interviewer-mediated surveys it can be wise to announce the call or visit of the
interviewer prior to the start of fieldwork. Research has shown that advance letters prior to the interviewer visit can increase response rates by decreasing people's likelihood to refuse (e.g. Link and Mokdad, 2005). This is likely due to the official appearance of advance letters legitimizing the survey request. At the same time, 'cold calling', i.e. calling or visiting a person unannounced, is typically associated with obscure organizations trying to sell something over-priced to gullible persons, an association that any sincere survey operation should aim to prevent. A good advance letter should be kept short, whilst mentioning the key aspects of the survey and conveying the respectability of the survey request. Key aspects that should be covered in the advance letter are:

1 The topic of the questionnaire. Care should be taken that mentioning the topic does not selectively attract different parts of the target population (nonresponse biases might occur), or that it influences respondents' opinions and thus answers during the survey (introducing measurement biases). Overall, however, giving sample units some idea of what to expect from the survey should encourage their participation.
2 The sponsor of the study. If the survey sponsor is a government department, a university, or other well-known official or respected body, mentioning them as sponsor in the advance letter and including their logo will lend the survey credibility (e.g. Groves et al., 2006). However, not every official sponsor is well-liked across (all parts of) the population, and nonresponse and measurement biases might occur. For example, a financial survey sponsored by the national tax office might well scare off respondents with undeclared sources of income or influence their answers to questions on income, benefits, or assets.
3 The length of the interview, even when long, gives respondents an idea of what to expect and may prevent annoyances later on. Research on web surveys shows that announcing the survey as slightly shorter than its actual length, but still giving a realistic estimate, results in the highest response rates and does not affect respondents' likelihood to break off the interview (Galesic and Bosnjak, 2009).
4 The envelope and stamps are the first impression that target persons will get of your survey. Consider printing the logo of the sponsor on the envelope. Furthermore, studies have shown that envelopes with real stamps are more likely to elicit response than envelopes on which the postage was printed automatically (e.g. Dillman et al., 2014). In addition, personally addressing the target persons increases response to a survey, as does personally signing the letter (see Scott and Edwards, 2006).
Respondent Incentives

The effects of respondent incentives on survey participation have been well-documented across survey modes and countries (see, for example, Singer et al., 1999; Singer and Ye, 2013; see also Chapter 28 on incentives in this Handbook). Summarizing this literature, incentives experiments have found that: (1) incentives increase response rates; (2) the larger the incentive, the more likely target
persons are to respond, however; (3) the returns of response to incentive size decrease, i.e. large incentives have little additional effect compared to small incentives; (4) incentives in cash are more effective at increasing response than vouchers or incentives in kind (e.g. small gifts) and lotteries have no significant effect on response rates compared to no incentive (Göritz, 2006); and (5) unconditional incentives, i.e. incentives that are handed over prior to the interview and independent of whether the sample unit actually participates in the interview (for example, when sent with the advance letter), are more effective in terms of response rates than conditional incentives that are promised to the respondent and handed over only if the interview is completed. Research findings on the effects of incentives on nonresponse bias are less clear-cut. While experiments show that incentives can affect the sample composition, this effect has typically been interpreted positively, as incentives ‘may induce participation on the part of groups that would otherwise be underrepresented in the survey’ (Singer et al., 1999: 225). In a literature review, Simmons and Wilmot (2004) summarize that persons and households with low income or low education, with dependent children, young respondents, and minority ethnic groups are more likely to be attracted to the survey through incentives than other respondents. Most studies, however, find no or negligible effects of incentives on the sample composition (e.g. Pforr et al., 2015). However, publication biases might well obscure negative effects of incentives on nonresponse bias, because the undesirable effects of incentives might well be published less frequently than the desirable effects. Whether researchers use respondent incentives during fieldwork and which type of incentives are chosen, will depend on the type of survey and the expectations and aims of the fieldwork. This chapter succinctly covers incentives effects found in the literature to support researchers in this decision and
provides references pointing to where more information on the effects of incentives can be found. Researchers deciding to incentivize survey participation are advised to read this literature in more depth.
Training Interviewers

The training of the interviewers conducting the fieldwork should form an important part of the fieldwork planning. In face-to-face surveys of the general population training interviewers can be costly, as training events may have to be hosted across the country and interviewers' time and travel expenses have to be paid. Since survey quality is always closely tied to costs (see Figure 25.1), researchers sometimes sacrifice interviewer training and instead spend the money on other aspects of the survey design, such as a larger sample. Yet, the importance of interviewers and the effect that they have on the data collected should not be underestimated (see also my section on 'interviewer effects'). Interviewers affect both the sample composition (primarily through nonresponse, for example by predominantly interviewing respondents that are easy to contact or that interviewers find more pleasant to communicate with) and the measurement (by directly or indirectly influencing respondents' answers and through their accuracy in coding and writing down the answers given). Thus taking interviewers' role lightly and saving on their training might well prove to be a false economy. Training interviewers can take several forms. The cheapest and arguably least effective way is written or video training, where interviewers are mailed a manual or a video that they have to study before starting fieldwork. The most expensive and arguably most effective training is in person. Depending on the questionnaire and the experience of the interviewers, in-person training might take half a day for a straightforward interview administered by interviewers who receive regular general interviewer training, or up to several days if the interview contains the administration of complex measurements and interviewers have a low general training level. In-person training is typically conducted by the survey organization that employs the interviewers. However, in-person contact between interviewers and the researchers of the survey can have a positive impact on interviewers' motivation, and thus on achieved response rates and measurement accuracy, because it makes interviewers feel a valuable part of the research process. Many survey organizations also offer a train-the-trainer system, where the researchers and managers at the survey organization personally train the regional lead or supervisor interviewers, who in turn train the interviewers employed on the survey. On the positive side, the train-the-trainer system allows for personal contact and monitoring of the interviewers on a study. However, it prevents personal contact between field interviewers and researchers, and the marginal cost savings might not make up for this deficiency.
Planning for Real-time Fieldwork Monitoring

When outsourcing their survey fieldwork, many researchers also outsource any monitoring of fieldwork progress. The questionnaire is handed over to the survey agency and a few weeks later the survey data are received back. In this way, what actually happens during fieldwork remains a black box and potential inconsistencies in the survey data remain unexplained. In addition, the researcher has no possibility of influencing fieldwork actions in case targets are not reached. To prevent this, researchers fielding questionnaires should closely monitor the fieldwork progress. The preparation of fieldwork monitoring starts with the fieldwork specifications in the call for tender, since the survey organization needs to be aware of the information required for monitoring.
In its simplest form, fieldwork monitoring entails the specification of a target response rate and the formulation of how many interviews are expected in each week of fieldwork to achieve this target. A reason for the popularity of response rates as a fieldwork target is that they are easy to calculate, monitor, and compare across surveys. Using the Standard Definitions published by the American Association for Public Opinion Research (for example, AAPOR, 2011) outcome codes can be standardized and thus response rates compared across similar surveys. In addition, response rates are considered an indicator of data quality, since they are related to the potential for nonresponse bias in the data. When determining your target response rate, finding out which response rates were achieved in previous surveys that were conducted in the same mode and have a similar target population, topic, sponsor, and fieldwork period can be helpful. Setting a target response rate together with expected numbers of interviews per week will provide an indicator, against which fieldwork progress can be regularly monitored. In probability sample surveys, the response rate is defined as the number of valid interviews divided by the number of eligible sample units: Response rate =
Σ interviews / Σ eligible sample units
However, the definition of an ‘eligible sample unit’ can differ across researchers and survey organizations. Therefore, when defining target response rates, care should be taken that the survey organization and the researcher operate with the same definitions. The AAPOR Standard Definitions give details of how response rates and final disposition codes may be defined (AAPOR, 2011) and may be helpful in specifying target response rates in a call for tender and prior to the start of fieldwork. In addition, Blom (2014) illustrates how final disposition codes may be derived from sequences of contact attempts in the field.
In order to monitor fieldwork progress by means of target response rates, the researcher needs to receive regular feedback from the survey organization regarding the number of interviews achieved, as well as the number of interview appointments, refusals, noncontacts, other nonproductive outcomes, and ineligible sample units. These numbers are best split up by interviewer to monitor the success of individual interviewers and to identify where additional efforts may be needed. Additionally, it may be prudent to specify and monitor the frequency and distribution of contact attempts that an interviewer has to make before a case can be reported back as unproductive. The European Social Survey (ESS), for example, specifies for each fieldwork operation in each country a minimum of four in-person contact attempts per sample unit, spread across at least two different weeks and containing at least one attempt in the evening and one on the weekend, in order to also reach those employed full-time with a busy social schedule (see, for example, ESS, 2011). In many ESS countries the national teams define even stricter contacting schedules to prevent low response rates and nonresponse biases. During fieldwork, researchers should aim to monitor the number and timing of contact attempts made by the interviewers. Where the survey organization has contact information available in real time, requesting detailed data should be possible. However, some organizations may still be operating with paper-based contact protocols, such that real-time fieldwork monitoring is impossible. The call for tender should thus define what kind of information is expected from the survey organization and at which intervals this information is to be provided during fieldwork. Finally, the researcher should define key substantive survey outcomes that need to be monitored during fieldwork. Outcomes that may well depend on interviewers' diligence in asking and recording complex answers should receive particular attention. For example,
if a questionnaire is heavily routed on a few questions and, depending on the answer to these questions, the interview is considerably shorter or longer, interviewers might be tempted to take shortcuts. A good example of such a situation is surveys that collect information on respondents’ social networks. Typically, the questionnaire first collects the size of a respondent’s social network. This is followed by a battery of questions that is looped for each network person. An interviewer recording a large social network thus has to work through a considerably longer interview than an interviewer who records a small social network. Therefore, monitoring the network size recorded by interviewers over the course of the fieldwork enables researchers to intervene, if some interviewers record network sizes that are inconceivably small or if during the course of the fieldwork the network sizes recorded decrease considerably (see, for example, Brüderl et al., 2013; Paik and Sanchagrin, 2013). Other examples of outcomes that should be monitored during fieldwork are add-on survey parts, such as drop-off questionnaires, alteri questionnaires (i.e. questionnaires for other household or family members), or questions that ask for consent to record linkage. Here, interviewers should be monitored as to their success at the add-on survey part. If interviewers, for example, show low performance at eliciting consent to record linkage they might need to be re-trained on how respondents should be approached with such a sensitive question.
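A simple way to operationalize this kind of monitoring is sketched below. It assumes that the interview data delivered during fieldwork are available as a pandas DataFrame with the hypothetical columns interviewer_id, interview_date and network_size; the thresholds are arbitrary illustrations rather than recommendations.

import pandas as pd

def flag_network_size_anomalies(interviews: pd.DataFrame,
                                min_mean_size: float = 1.5,
                                max_drop: float = 0.5) -> pd.DataFrame:
    """Flag interviewers whose recorded network sizes look implausibly small
    or decline sharply between the first and second half of their workload."""
    interviews = interviews.sort_values("interview_date")
    rows = []
    for interviewer, grp in interviews.groupby("interviewer_id"):
        half = len(grp) // 2
        first = grp["network_size"].iloc[:half].mean() if half else float("nan")
        second = grp["network_size"].iloc[half:].mean()
        rows.append({
            "interviewer_id": interviewer,
            "n_interviews": len(grp),
            "mean_network_size": grp["network_size"].mean(),
            "suspiciously_small": grp["network_size"].mean() < min_mean_size,
            "declining_over_fieldwork": half > 0 and first > 0 and second < max_drop * first,
        })
    return pd.DataFrame(rows)

# flagged = flag_network_size_anomalies(interview_data)
# flagged[flagged.suspiciously_small | flagged.declining_over_fieldwork]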
Planning for Ex-post Data Checks

Monitoring fieldwork and survey outcomes in real time is of key importance, since it enables the researcher to intervene and correct problems as soon as they emerge. Nonetheless, some additional checks, which take place after fieldwork has been completed or during its later stages, need to be borne in mind. The most important and common measures are
so-called back-checks, where respondents to the survey are contacted via mail or telephone and briefly interviewed about the interviewers' work. Questions that may be asked in a phone or mail back-check interview include:
•• Did an interviewer from [survey company] conduct an interview with you on [date of interview]? If such a question is answered negatively, the case may have been falsified.
•• How long did your interview take? If the respondents' impression of the length is considerably shorter than the length reported by the interviewer, the interviewer might have completed large parts of the interview themselves and only asked a few key questions to the respondent.
•• Did the interviewer introduce him-/herself with their company ID-card? Did he/she mention that the survey is conducted on behalf of [sponsor]? Did the interviewer explain to you that participation in the survey is voluntary, that all your answers will be recorded anonymously and that the official data protection regulations will be strictly adhered to? If the interviewer did not introduce the survey as prescribed by professional ethics and data protection regulations, interviewers should be re-trained.
•• During the interview, did the interviewer read out all questions and answers clearly and comprehensibly? Did he/she show you cards on which answer options were listed for you to choose from? Did he/she record [on the computer/on the paper questionnaire] all the answers that you gave him/her? If not, interviewers should be re-trained to follow the standardized interviewing protocol.
•• After the interview, did the interviewer leave you an additional paper questionnaire, for you to fill in in private and return to [survey agency] by mail?
If applicable, the back-checks should also ask how the interviewer conducted the more difficult survey sections and whether add-on survey parts were correctly administered.
Since back-checks suffer from nonresponse, just like any survey, they can only identify the most serious problems, and only if the sample size is sufficiently large. Given the relatively low cost of mailed back-checks and the low expected response rate, researchers opting for mailed back-checks should send these to all survey respondents. When opting for the more costly, but arguably also more informative, phone back-checks, researchers might select a random subsample to be back-checked and only check back all cases of an interviewer when errors are detected in the subsample. Back-checks should be conducted close to the interview date, i.e. during fieldwork, although researchers should expect to be able to act on the results only well into the fieldwork period, since they will only become apparent after a good proportion of cases have been checked. While many survey organizations offer to conduct back-checks in-house, researchers should take an active role in this process and, if possible, conduct the back-checks themselves. Researchers might have slightly different interests in back-checking than survey organizations. By conducting back-checks themselves, researchers will, therefore, put themselves in a stronger position for negotiating actions if the back-checks uncover problems. In addition, by conducting the back-checks oneself, one can also uncover whether the survey organization deviated from what was stipulated in the survey specifications and agreed upon in the contract. For example, during back-checks one might discover changes made to the gross sample that were not agreed upon, such as substituting nonresponding sample units to increase the response rate. Finally, while it is important to maintain a trustful relationship with the survey organization, it may be prudent to nonetheless check the data for falsifications and computational errors that may have arisen through opposing fieldwork pressures (see Figure 25.1 above) or differential interests in accuracy
and diligence between academic researchers and more economically oriented survey organizations. Researchers should thus check the interview data that they receive from the survey organization after fieldwork. Such checks may include consistency analyses and searches for duplicates, which may occur through accidental computational errors, but also through purposeful institutional falsification.
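A basic duplicate search of this kind can be sketched as follows, assuming the delivered interview data form a pandas DataFrame with one row per interview; identifier and paradata columns (the names below are illustrative) are set aside so that only the substantive answer patterns are compared.

import pandas as pd

def find_duplicate_interviews(data: pd.DataFrame,
                              non_answer_columns=("case_id", "interviewer_id",
                                                  "interview_date")) -> pd.DataFrame:
    """Return interviews whose substantive answers are identical to those of at
    least one other interview, a possible sign of copied or fabricated records."""
    answer_columns = [c for c in data.columns if c not in non_answer_columns]
    duplicates = data[data.duplicated(subset=answer_columns, keep=False)]
    return duplicates.sort_values(answer_columns)

# Exact duplicates are only the crudest check; near-duplicates (records that
# agree on almost all items) require pairwise comparisons of answer vectors.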
Defining Actions and Repercussions if Fieldwork Targets are Not Met

As stipulated above, predicting the success of fieldwork can be difficult as every fieldwork operation is unique. Nonetheless, fieldwork targets can and should be set. Together with such targets, researchers and survey organizations define what happens if these are not met within the efforts planned and budgeted for. As described above, typical fieldwork targets are the final response rate and the number of achieved interviews per week. Potential repercussions for not meeting response rate expectations can be the extension of the fieldwork period, the re-training or exchange of under-performing interviewers, the prioritization of cases with low response propensities such that more attention (for example, in terms of the number of call attempts) is paid to the difficult cases, and additional incentives for cases with low response propensities. Researchers can also contractually establish financial penalties for not meeting set targets together with bonuses for exceeding the targets. In many survey research projects a set number of interviews is needed at the end of fieldwork. If this number is not met and measures to increase the number of interviews by increasing the response rate prove unsuccessful, repercussions may include drawing additional sample units for the gross sample and expanding fieldwork to these additional units. The final achieved response rate will be lower; however, this measure might serve to
reach the necessary number of interviews. If, when planning fieldwork, the researcher feels unsure about what kind of response rate to expect, it may be wise to reserve an additional backup sample, which is only released to the field if response rates fall behind expectations. Care should be taken, however, to ensure the backup is fielded as soon as problems arise, to ensure that all sample units receive sufficient contact attempts and that these are spread over several weeks. If the extra sample is activated in a late phase, this may drastically affect the response rate for this and the total sample. Nonetheless, since low response rates increase the potential for nonresponse biases, one should always be careful with increasing the gross sample and first try to take alternative measures that increase both response rates and the sample size.
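The weekly comparison of achieved against expected interviews that such targets imply can be scripted along the following lines; the target figures and the 10 per cent threshold are placeholders for whatever is fixed in the contract or the fieldwork specification.

# Sketch: compare cumulative achieved interviews with contractual weekly targets.
weekly_targets = [120, 250, 380, 500, 600, 680]   # cumulative targets (hypothetical)
weekly_achieved = [115, 230, 330, 410]            # cumulative counts so far (hypothetical)

for week, achieved in enumerate(weekly_achieved, start=1):
    target = weekly_targets[week - 1]
    shortfall = (target - achieved) / target
    status = "on track" if shortfall <= 0.10 else "behind target"
    print(f"Week {week}: {achieved}/{target} interviews ({status})")

# A persistent shortfall would trigger the agreed repercussions, such as
# re-training interviewers, prioritizing low-propensity cases or releasing a
# reserved backup sample early enough to allow full contact schedules.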
RESPONSIVE DESIGNS

The introduction of this chapter stated that the aim of fieldwork should be 'maximizing the sample size given a fixed gross sample (thus maximizing the response rate), minimizing potential biases due to nonresponse, and achieving interviews with low measurement error, timely within a set fieldwork period and budget'. In survey reality, however, fieldwork during the past decades has been primarily concerned with maximizing response rates within a fixed budget and a set period of time. Minimizing nonresponse biases and measurement errors have been secondary concerns at most. The reason for this is likely the relatively easy calculation of response rates, which can be determined for the complete survey. In comparison, nonresponse bias and measurement errors are estimate-specific and difficult to determine when lacking knowledge about the true value of the estimate. In recent years, researchers and survey institutes at the forefront of survey management have increasingly turned towards
looking into nonresponse bias (and to a lesser extent also measurement errors) in an attempt to counter the potential biases introduced by the progressively low response rates in Western countries (e.g. de Leeuw and de Heer, 2002). The key method in monitoring and reducing nonresponse biases during fieldwork has become the so-called responsive design. According to the pioneers in this area, Groves and Heeringa: two phenomena have combined to prompt consideration of new design options for sample surveys of household populations in wealthier countries of the world: (a) the growing reluctance of the household population to survey requests has increased the effort that is required to obtain interviews and, thereby, the costs of data collection […] and (b) the growth of computer-assisted data collection efforts provides tools to collect new process data about data collection efforts. (Groves and Heeringa, 2006: 439)
These new developments enable us to monitor not only the response rate, but also process data (so-called paradata; Couper, 2005) and survey estimates in real-time. We can thus observe how continuing fieldwork efforts influence our estimates and whether varying the fieldwork protocols brings in different parts of the sample and thus affects the estimates: The ability to monitor continually the streams of process data and survey data creates the opportunity to alter the design during the course of data collection to improve survey cost efficiency and to achieve more precise, less biased estimates. (Groves and Heeringa, 2006: 439)
Responsive designs consist of several design phases. During each phase ‘the same set of sampling frame, mode of data collection, sample design, recruitment protocols and measurement conditions are extant’ (Groves and Heeringa, 2006: 439). Different phases can be implemented simultaneously for different random parts of the sample (i.e. as experiments on survey design features) or sequentially. Design features that may be varied across phases include the data collection mode,
number and timing of contact attempts, mode of contact attempts, incentive types and refusal conversion procedures. When setting up a responsive design, researchers need to consider four steps. First, they need to pre-identify a set of design features potentially affecting costs and errors of survey estimates. For example, different incentive sizes will affect costs and may attract different population groups. Second, researchers identify a set of indicators of the cost and error properties of those features and monitor those indicators in initial phases of data collection. For example, one may monitor the number of hours the interviewers spend attempting contact with sample households. Next, the design features of the survey are altered based on cost-error trade-off decision rules. With each change in the survey design a new phase commences. Finally, the data from the separate design phases are combined into a single estimator (Groves and Heeringa, 2006: 439). Since the phases aim at bringing in different parts of the gross sample, with different characteristics, the researcher is likely to observe changes in the monitored estimates as a result of design choices made for each phase. Key to responsive designs is the ability to track key survey estimates as a function of estimated response propensities (conditional on the fieldwork design during a specific phase): If survey variables can be identified that are highly correlated with the response propensity, and it can be seen that point estimates of such key variables are no longer affected by extending the field period, then one can conclude that the first phase of a survey (with a given protocol) has reached its phase capacity and a switch in recruitment protocol is advisable. (Kreuter, 2009: 5)
For example, one may monitor the nonresponse bias due to noncontact. Noncontact is typically associated with at-home patterns, which in turn correlate with age and whether the sample person is employed (Lynn, 2003), the household composition, household income and education (Goyder, 1987;
Durrant and Steele, 2009). Within one design phase with a specific mode and recruitment protocol, the phase is exhausted if the age structure, household composition, employment status, income structure, and education groups stabilize over fieldwork time. As this example demonstrates, a prerequisite for responsive designs is the availability of real-time survey data and paradata. In the case of face-to-face fieldwork this means that interviewers have to record survey data as well as contact form data electronically and upload new information on both productive and unproductive cases daily to a centralized case management system. From the survey organization, this demands that all case assignments are done electronically and that interviewers have modern devices to conduct interviews and record contact attempts. From the interviewers, it demands regular (preferably daily) uploads of new fieldwork actions to a centralized server. Only then can researchers closely monitor survey and fieldwork indicators and decide when a next design phase needs to be initiated. Responsive designs closely resemble dynamic adaptive designs in nature and implementation; and partial R-indicators may also be used as a tool to monitor and compare surveys. A substantive and statistical discussion of the differences and similarities between the three is beyond the scope of this chapter, but may be found in Schouten et al. (2011) and in Chapter 26 in this Handbook.
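One way to operationalize such a phase-capacity check is to recompute key estimates on the accumulating interview data after each fieldwork day and to treat the phase as exhausted once they stop moving by more than a small tolerance. The sketch below assumes a pandas DataFrame of completed interviews with the hypothetical columns completion_date and a key variable y; the window and tolerance are arbitrary illustrations.

import pandas as pd

def cumulative_estimate_by_day(interviews: pd.DataFrame, variable: str = "y") -> pd.Series:
    """Cumulative mean of a key variable at the end of each fieldwork day."""
    ordered = interviews.sort_values("completion_date")
    cum_mean = ordered[variable].expanding().mean()
    return ordered.assign(cum_mean=cum_mean.values).groupby("completion_date")["cum_mean"].last()

def phase_capacity_reached(daily_estimates: pd.Series,
                           window: int = 5, tolerance: float = 0.005) -> bool:
    """True if the cumulative estimate moved less than `tolerance` over the
    last `window` fieldwork days."""
    if len(daily_estimates) <= window:
        return False
    recent = daily_estimates.iloc[-window:]
    return (recent.max() - recent.min()) < tolerance

# if phase_capacity_reached(cumulative_estimate_by_day(completed_interviews)):
#     a change of recruitment protocol (a new design phase) may be warranted.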
INTERVIEWER EFFECTS

Figure 25.3 Interviewer effects in terms of the Total Survey Error framework. Source: Adapted from Groves et al. (2009).

This chapter focuses on fieldwork in interviewer-mediated surveys; therefore, it seems prudent to pay some more attention to the effect that interviewers have on survey data. This section focuses on these so-called interviewer effects, i.e. interviewers' influence on the variance and bias of survey estimates. Whether on purpose or by accident, interviewers in face-to-face and telephone surveys
affect almost all aspects of the survey process in various ways, including the coverage of the sampling frame during listing and screening operations, response rates and nonresponse biases, responses given and measurements recorded, and coding and editing processes (see Figure 25.3). On the positive side, regarding nonresponse, for example, interviewers encourage participation and completion of the interview. In terms of response rates, interviewer-mediated surveys tend to achieve higher response rates than surveys using self-completion questionnaires, and face-to-face surveys, where the interviewers have the strongest involvement in the recruitment process, tend to achieve higher response rates than telephone surveys (Groves et al., 2009). On the negative side, however, interviewers may impact on the representativeness of the achieved sample, for example, by avoiding dangerous neighborhoods or through low persistence when it comes to households that are seldom home or difficult to persuade (see Blom et al., 2011). When it comes to the measurement side of the survey process, many surveys rely on interviewers to conduct long and tedious interviews, where interviewers play a crucial role in encouraging respondents to complete the survey. However, interviewers can also introduce measurement biases, for example, by eliciting socially desirable responses or by leading the respondents’ answers through their own prejudices (for an early account of such interviewer effects on measurement see Rice, 1929). Interviewer effects on survey error can be of two kinds: effects on bias and variance. Where biases are concerned, all interviewers influence the survey in such a way that the estimates of interest are systematically biased. To observe and research interviewer biases we typically need information on the true value of the estimate of interest, to which the survey statistic can be compared. Since in survey research such validation data on the true value is typically absent, interviewer
effects are more frequently observed with respect to interviewer variances than biases. When interviewers introduce variance into the survey data, we observe that respondent answers cluster within interviewers, i.e. that the answers given by respondents interviewed by the same interviewer resemble each other. Such variability across interviewers in survey outcomes of interest is expressed in terms of high interviewer intra-class correlations (ICCs): ICCinterviewer =
σ²interviewer / (σ²interviewer + σ²sample unit)
where ICCinterviewer is the proportion of the total variance of an estimate that is due to the interviewer. It is a type of clustering effect, analogous to area clusters in sampling designs (see also Chapter 21 in this Handbook). In fact, since in face-to-face surveys the allocation of interviewers to areas
can be congruent, interviewer effects are frequently difficult to disentangle from area effects. Only in so-called interpenetrated designs, where an interviewer is randomly assigned to at least two areas or each area is randomly allocated to at least two interviewers, can we reliably distinguish between area and interviewer effects. Interviewer effects on variance indicate that different interviewers either generate different types of samples or elicit differential responses during the interview. So far, the literature has spawned many articles describing such interviewer effects. However, few studies have succeeded in explaining the effects found and a common framework describing which types of estimates are (in)vulnerable to interviewer effect is still lacking. This section aims to draw the researcher’s attention to the interviewer effects that occur when conducting a survey with the help of interviewers. While there are
many convincing reasons for opting for an interviewer-administered mode, such as the lack of a suitable sampling frame for mail or web surveys or a complex and heavily routed questionnaire that necessitates interviewers' support during the interview, interviewers may also have undesirable effects on both the representativeness and the measurement. For each particular survey, it is the researcher's responsibility to decide whether the positive effects of interviewers outweigh the negative ones.
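For a single survey estimate, the interviewer intra-class correlation defined above can be estimated from a random-intercept model with interviewers as the grouping factor. The sketch below uses the MixedLM implementation in statsmodels and assumes a DataFrame with a numeric outcome column and an interviewer identifier (the column names are illustrative); note that it does not separate interviewer from area effects, which only an interpenetrated design allows.

import pandas as pd
import statsmodels.formula.api as smf

def interviewer_icc(data: pd.DataFrame, outcome: str = "y",
                    interviewer: str = "interviewer_id") -> float:
    """Estimate the interviewer intra-class correlation for one outcome via a
    random-intercept (variance components) model."""
    result = smf.mixedlm(f"{outcome} ~ 1", data, groups=data[interviewer]).fit()
    var_interviewer = float(result.cov_re.iloc[0, 0])  # between-interviewer variance
    var_residual = float(result.scale)                 # residual (within) variance
    return var_interviewer / (var_interviewer + var_residual)

# icc = interviewer_icc(survey_data, outcome="attitude_item")
# Unusually large values for interviewer-sensitive items would justify a closer
# look at the interviewers concerned and at the interview protocol.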
CONCLUSION

Fieldwork lies at the heart of any survey data collection. Its influence on the quality of the data collected is immense; nonetheless, researchers often pay little attention to its course. To some extent, this may be due to the fact that fieldwork does not seem directly related to the substantive research questions that most academics are interested in. The questionnaire content, including the design of measurement scales, directly reflects substantive research interests. However, how the sample was realized or how the questions were actually asked and answers recorded may seem secondary. Furthermore, the influence of survey fieldwork on data quality can be difficult to grasp and quantify. While the questionnaire content, in the form of data points in the survey data file, is a directly observable outcome, the fieldwork processes leading to these outcomes – starting with interviewer training and continuing via in-field nonresponse processes and fieldwork monitoring to back-checks for interviewer falsifications – are often hidden and difficult to include in substantive analyses. This chapter aimed to make explicit the many steps in the fieldwork process that lead up to the data set an analyst may use and to point out that the decisions taken regarding fieldwork shape the survey data. I hope to have encouraged researchers to take an active role in shaping their fieldwork through careful planning and active monitoring of its various steps. Following Ronald A. Fisher, the famous statistician, I would like to conclude that any time and effort spent on planning and monitoring fieldwork is well spent. Trying to patch up data after the fact, on the contrary, is usually futile.

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. (Fisher, 1938)
RECOMMENDED READINGS

Blom et al. (2011): an example of a study on interviewer effects during fieldwork
Dillman et al. (2014): a comprehensive overview of survey design aspects when fielding internet, mail, and mixed-mode surveys
Groves and McGonagle (2001): key paper on interviewer training
Groves and Heeringa (2006): seminal paper on responsive designs
Pforr et al. (2015) and Singer and Ye (2013): good overviews of the theory of incentive effects on nonresponse and results of various incentives experiments
REFERENCES

American Association for Public Opinion Research (2011). Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys (7th edn). Retrieved from http://aapor.org/Content/NavigationMenu/AboutAAPOR/StandardsampEthics/StandardDefinitions/StandardDefinitions2011.pdf
Blom, A.G., de Leeuw, E.D., and Hox, J.J. (2011). Interviewer effects on nonresponse in the European Social Survey. Journal of Official Statistics, 27, 359–377.
Blom, A.G. (2014). Setting priorities: spurious differences in response rates. International Journal of Public Opinion Research, 26, 245–255.
Brüderl, J., Huyer-May, B., and Schmiedeberg, C. (2013). Interviewer behavior and the quality of social network data. In P. Winker, N. Menold, and R. Porst (eds), Interviewers' Deviations in Surveys: Impact, Reasons, Detection and Prevention (pp. 147–160). Frankfurt: Peter Lang Academic Research.
Couper, M.P. (2005). Technology trends in survey data collection. Social Science Computer Review, 23, 486–501.
Dillman, D.A., Smyth, J.D., and Christian, L.M. (2014). Internet, Mail and Mixed-Mode Surveys: The Tailored Design Method (4th edn). Hoboken, NJ: John Wiley & Sons.
Durrant, G.B., and Steele, F. (2009). Multilevel modelling of refusal and noncontact in household surveys: evidence from six UK government surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172, 361–381.
European Social Survey (ESS) (2011). Round 6 Specification for Participating Countries. London: Centre for Comparative Social Surveys, City University London. Retrieved from http://www.europeansocialsurvey.org/docs/round6/methods/ESS6_project_specification.pdf
Fisher, R.A. (1938). Presidential address to the First Indian Statistical Congress. Sankhya – Indian Journal of Statistics, 4, 14–17. Quote retrieved from https://en.wikiquote.org/wiki/Ronald_Fisher
Galesic, M., and Bosnjak, M. (2009). Effects of questionnaire length on participation and indicators of response quality in a web survey. Public Opinion Quarterly, 73, 349–360.
Göritz, A.S. (2006). Cash lotteries as incentives in online panels. Social Science Computer Review, 24, 445–459.
Goyder, J. (1987). The Silent Minority: Nonrespondents on Sample Surveys. Cambridge: Polity Press.
Groves, R.M., and McGonagle, K.A. (2001). A theory-guided interviewer training protocol regarding survey participation. Journal of Official Statistics, 17, 249–265.
Groves, R.M., and Heeringa, S.G. (2006). Responsive design for household surveys: tools for actively controlling survey errors and costs. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169, 439–457.
Groves, R.M., Couper, M.P., Presser, S., Singer, E., Tourangeau, R., Acosta, G.P., and Nelson, L. (2006). Experiments in producing nonresponse bias. Public Opinion Quarterly, 70, 720–736.
Groves, R.M., Fowler, F.J., Couper, M.P., Lepkowski, J.M., Singer, E., and Tourangeau, R. (2009). Survey Methodology (2nd edn). Hoboken, NJ: John Wiley & Sons.
Koch, A., Blom, A.G., Stoop, I., and Kappelhof, J. (2009). Data collection quality assurance in cross-national surveys: the example of the ESS. Methoden–Daten–Analysen, 3, 219–247.
Kreuter, F. (2009). Survey Methodology: International Developments. RatSWD Working Paper Series, No. 59. Berlin: German Council for Social and Economic Data (RatSWD).
de Leeuw, E.D., and de Heer, W. (2002). Trends in household survey nonresponse: a longitudinal and international comparison. In R.M. Groves, D.A. Dillman, J.L. Eltinge and R.J.A. Little (eds), Survey Nonresponse (pp. 41–54). New York, NY: John Wiley & Sons.
Link, M.W., and Mokdad, A. (2005). Advance letters as a means of improving respondent cooperation in random digit dial studies: a multistate experiment. Public Opinion Quarterly, 69, 572–587.
Lyberg, L., and Dean, P. (1992). Methods for Reducing Nonresponse Rates: A Review. Paper presented at the annual meeting of the American Association for Public Opinion Research, St. Petersburg, FL. Unpublished manuscript.
Lynn, P. (2003). PEDAKSI: methodology for collecting data about survey non-respondents. Quality and Quantity, 37, 239–261.
Paik, A., and Sanchagrin, K. (2013). Social isolation in America: an artifact. American Sociological Review, 78, 339–360.
Pforr, K., Blohm, M., Blom, A.G., Erdel, B., Felderer, B., Fräßdorf, M., Hajek, K., Helmschrott, S., Kleinert, C., Koch, A., Krieger, U., Kroh, M., Martin, S., Saßenroth, D., Schmiedeberg, C., Trüdinger, E.-M., and Rammstedt, B. (2015). Are incentive effects on response rates and nonresponse bias in large-scale face-to-face surveys generalizable to Germany? Evidence from ten experiments. Public Opinion Quarterly, 79(3), 740–768. DOI: 10.1093/poq/nfv014
Rice, S.A. (1929). Contagious bias in the interview: a methodological note. American Journal of Sociology, 35, 420–423.
Schouten, B., Shlomo, N., and Skinner, C. (2011). Indicators for monitoring and improving representativeness of response. Journal of Official Statistics, 27, 231–253.
Scott, P., and Edwards, Ph. (2006). Personally addressed hand-signed letters increase questionnaire response: a meta-analysis of randomised controlled trials. BMC Health Services Research, 6:111, 1–4.
Simmons, E., and Wilmot, A. (2004). Incentive payment on social surveys: a literature review. Social Survey Methodology Bulletin, 53, 1–11.
Singer, E., Van Hoewyk, J., Gebler, N., Raghunathan, T., and McGonagle, K. (1999). The effect of incentives on response rates in interviewer-mediated surveys. Journal of Official Statistics, 15, 217–230.
Singer, E., and Ye, C. (2013). The use and effects of incentives in surveys. The ANNALS of the American Academy of Political and Social Science, 645, 112–141.
26
Responsive and Adaptive Designs
François Laflamme and James Wagner
INTRODUCTION

Most computer-assisted surveys conducted by statistical agencies or organizations record a wide range of paradata that can be used to understand the data collection process to better plan, assess, monitor and manage data collection. One explicit objective of paradata research is to identify strategic data collection opportunities that could be operationally viable and lead to improvements in quality or cost efficiencies. Much research in this area has indicated that the same collection strategy does not work effectively through an entire collection cycle. This also points to the need to develop a more flexible and efficient data collection approach, called responsive design. Responsive design is an approach to survey data collection that uses information available before and during data collection to adjust the collection strategy for the remaining cases. The main idea is to constantly assess the data collection process using the most recent information available (active management), and to
adapt data collection strategies to make the most efficient use of the resources remaining (adaptive collection). This strategy aims to use information available before collection, and paradata accumulated during collection, to identify when changes to the collection approach are needed in response to the progress of collection. This can be judged against any number of error sources, but has been focused on nonresponse in most examples. For nonresponse error, responsive design monitors each phase of data collection. When a phase no longer produces changes in estimates, it is said to have reached 'phase capacity', and a new phase is initiated. Responsive design generally refers to a strategy that can be modified during collection, while the adaptive design approach takes advantage of the lessons learned from previous collection cycles to improve the next one. In the adaptive context, the range of possible interventions during collection is generally limited, often because of a short collection period – e.g., 10 days for Canada's monthly Labour Force Survey.
Responsive design was first discussed by Groves and Heeringa (2006) for computer-assisted personal interview (CAPI) surveys. Wagner et al. (2012) described a set of interventions used in a large CAPI survey. Mohl and Laflamme (2007) developed a responsive design conceptual framework for computer-assisted telephone interview (CATI) surveys. Laflamme and Karaganis (2010) presented the responsive design strategy implemented for CATI, including the tools and indicators that were used to monitor collection and identify data collection milestones. More recently, Calinescu et al. (2013) have applied methods from operations research to the problem of optimizing resource allocation for surveys. The next section provides an overview of paradata and other sources that can be used when implementing responsive design. The section on planning a responsive design survey discusses the key factors to consider when planning a responsive survey design. The next two sections present two examples of implementation of responsive design surveys: one for CATI and one for CAPI surveys. The last section discusses future development and research needed on theory and practice.
PARADATA AND OTHER DATA SOURCES

In the data collection context, the terms 'paradata' or 'process data' refer to any information that describes the collection process throughout the entire data collection cycle. This type of information is generally used to better understand the data collection process, to identify strategic opportunities, to evaluate new collection initiatives and to improve the way surveys are conducted and managed. Records of call attempts and interviewer observations about the neighbourhood, sampled unit or sampled person are among the most important paradata sources used for responsive design surveys. Records of attempts contain detailed information
about each call or visit made (e.g., the date, time, amount of production system time spent, result of the call or visit) during data collection. Records of call data accumulated throughout the collection period and interviewer observations are generally available for each sampled unit (i.e., respondents, nonrespondents and out-of-scope cases). Paradata, as well as other data sources such as interviewer pay claims recorded, budgeted production system time and budgeted interviewer pay claims hours, sample design and frame information, play a key role in the planning, development and implementation of responsive survey design. When available, paradata and other data sources can be efficiently used to derive key indicators (e.g., productivity and cost) that can help improve the survey-monitoring and decision-making processes.
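To make this more concrete, the sketch below shows one way in which call-record paradata of this kind might be represented and used to derive a simple productivity indicator (completed interviews per hour of production system time); the record layout and the figures are purely illustrative.

from dataclasses import dataclass

@dataclass
class CallRecord:
    case_id: str
    interviewer_id: str
    date: str        # date of the call or visit
    minutes: float   # production system time spent on the attempt
    outcome: str     # e.g. 'interview', 'appointment', 'refusal', 'noncontact'

calls = [
    CallRecord("A001", "INT07", "2015-03-02", 4.0, "noncontact"),
    CallRecord("A001", "INT07", "2015-03-04", 35.0, "interview"),
    CallRecord("B014", "INT12", "2015-03-04", 6.5, "refusal"),
]

hours = sum(c.minutes for c in calls) / 60
interviews = sum(1 for c in calls if c.outcome == "interview")
print(f"{interviews} interviews in {hours:.1f} production hours "
      f"({interviews / hours:.2f} interviews per hour)")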
PLANNING A RESPONSIVE DESIGN SURVEY

Many factors should be taken into account when planning a responsive survey design. Before developing the strategy, it is important to assess these elements and, especially, their potential influence on the planned survey. Some of these factors affect not only the feasibility of responsive survey design but also the development and the quality of monitoring tools and indicators. They also often limit the type and timeliness of potential interventions that can be performed during collection. The following key factors can directly impact the feasibility, choice, implementation and performance of the strategy:
•• objectives;
•• data collection modes;
•• type and quality of paradata and other data sources available during the collection;
•• practical and operational considerations.
The responsive design objectives are a main driver used to determine the way the survey is managed and the orientation of interventions.
Generally, a survey pursues one or more of these objectives: reducing cost, improving response rate, improving sample representativeness or reducing variance and/or nonresponse bias for key survey estimates. However, when no relevant auxiliary information about each sample unit is available before collection and when there are no interviewer observations made during data collection, then there is no basis for interventions aimed at balancing response rates across subgroups of the sample. The mode of collection influences the potential responsive design strategy and interventions. CATI surveys conducted from controlled environments, such as centralized call centers, provide different opportunities than CAPI surveys in which interviewers work independently in the field. The distribution of collection effort throughout the collection period and the ways of monitoring and managing the data collection process are not the same for CATI and CAPI surveys. The option of transferring cases to another collection mode in the multi-mode environment offers an additional strategic feature that can be used to reduce cost or improve quality. The type and the quality of paradata and other data sources are critical factors to consider when planning a responsive survey design. The timeliness of available paradata affects not only the feasibility of responsive survey design but also the ability to develop key indicators and actively monitor the survey process. It can also affect the capacity to take informed decisions and limit the type and range of possible interventions during collection. Determining when significant changes to the collection approach are required, which aspects of data collection need to be adjusted and how to adjust them are examples of responsive design decisions. In addition, the availability of other data sources before and during collection, such as budget and cost information accumulated throughout collection and sample design, can also limit the number and quality of key indicators that can be used to monitor and manage data collection.
Practical and operational constraints also need to be considered. Technical system limitations, the quality of communication channels between the parties involved in data collection, and human and organizational factors (e.g., in many organizations, interviewers need to know their schedule in advance) are examples of such potential constraints. Different survey modes face different challenges, and have different tools available with which to meet these challenges. Therefore, in the following sections, we examine responsive designs for face-to-face and telephone surveys separately.
RESPONSIVE DESIGN FOR CAPI SURVEYS

This section describes the use of responsive designs for CAPI using examples developed at the University of Michigan's Survey Research Center. Traditionally, in face-to-face interviewing, interviewers have a great deal of latitude in deciding when and how to do their work. A key question in the early work on responsive and adaptive designs has been to find design features that can be controlled or, at least, influenced by central office staff. In the Issues section, several examples of issues that CAPI surveys have been facing are outlined. The Indicators section describes indicators for these risks. The Interventions section discusses interventions designed to respond to these risks.
Issues

CAPI surveys have faced declining response rates over the past several decades. These declines have prompted concerns about the risk of nonresponse bias. CAPI surveys have been spending more effort to maintain the same or, in some cases, even lower response rates (de Leeuw and de Heer, 2002; Brick and Williams, 2013). One strategy for
controlling the risk of nonresponse bias is balancing the response (i.e., equalizing response rates) along key dimensions that are measured for the whole sample. There is an implicit model that the characteristics used to balance the sample are predictive of all the survey outcome variables, and an additional assumption that balancing the response on these dimensions will limit the nonresponse bias more than simply using these characteristics for nonresponse adjustments. Many surveys have very weak predictors of either the response probability or the survey outcome variables available on the sampling frame. Finding useful data in these situations may require that interviewers observe characteristics of sampled units that are related to the survey outcome variables. For surveys that screen for eligible persons, getting interviewers to focus on the task of screening households can be difficult. For many interviewers, the task of conducting interviews is more interesting than knocking on doors and explaining the survey to households in order to identify eligible persons. In the 2002 data collection for the National Survey of Family Growth (NSFG), getting screening interviews completed required careful monitoring and attention from supervisors. A final issue faced by many surveys – not just CAPI surveys – is expending great effort to recruit sampled units that yield only small gains in response rate. Although a large portion of the sample may be easily interviewed, there is often a small subset of the sample that can require a disproportionate effort. For example, in the NSFG 2006–2010, the 12% of the sample that required 12 or more calls received 37% of all calls made.
Indicators

In this section, we give examples of indicators that can be used to monitor data collection. The indicators can be used to trigger interventions, described in the next section. These interventions are aimed at controlling
the risks associated with each of the issues described in this section. Monitoring for the risk of nonresponse bias is difficult because there are no direct indicators of when this bias has occurred. Instead, proxy indicators are used. For example, subgroup response rates may be monitored. Respondents in groups with lower-thanaverage response rates may be suspected of not being representative of the entire group. Reduced variation in the response rates may show that data collection is improving the ‘representativeness’ of those who respond. Such reductions indicate that the response mechanism is not highly selective, at least with respect to the observed characteristics under consideration. Other indicators that can be monitored include the variation of subgroup response rates (Figure 26.1) and the R indicator (Schouten et al., 2009). Often, the variables available on the sampling frame are not very useful for predicting the particular variables collected by the survey. In this event, it may be useful to have interviewers observe characteristics of the sample that are related to the survey variables. The NSFG, for example, asks interviewers to estimate whether they believe the selected person is in a sexually active relationship with a member of the opposite sex (Wagner et al., 2012). This observation is highly associated with many of the key survey outcome variables collected by the NSFG. The balance of response with respect to this observation is monitored during data collection. Several indicators can be used to ensure that interviewers are working on screening households for eligibility. These include the ratio of main to screener calls, the screening response rate and the number of cases that have never been called. Further, each of these can be monitored at the interviewer level. When the ratio of main calls to screener calls is too high, interviewers can be asked to emphasize screener calls in their work. Several indicators can also be used to determine when the ‘tail’ of production has been
hit and the efficiency rapidly declines. A key indicator is the proportion of calls that result in completed interviews by call number. When this proportion becomes very small, data collection will become highly inefficient. The design may be changed for efficiency reasons at this point. For example, a subsample of the remaining 'difficult' cases could be taken and a new protocol applied (Groves and Heeringa, 2006; Wagner et al., 2012). An indicator for phase capacity is the change in estimates brought about by key design features. For example, the cumulative estimate for several key statistics from a survey by call number is one indicator of whether phase capacity has been reached. Once estimates stop changing, it is time to switch to a new design. There are also examples of 'stopping rules' for surveys (Rao et al., 2008; Wagner and Raghunathan, 2010) that may be adapted to the purpose of ending phases of data collection.

Figure 26.1 Subgroup response rates by day for the NSFG.

Identifying indicators is a function each survey must carry out. Prioritizing the objectives of the study is a key first step. Once the main objectives have been identified, survey designers should identify the key areas of uncertainty and risks faced by the survey. Indicators for each of these areas need to be developed.
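The call-number efficiency indicator described above can be computed directly from the record of call attempts. The sketch below assumes a pandas DataFrame of call attempts with the hypothetical columns case_id and outcome, ordered within each case by attempt.

import pandas as pd

def interviews_by_call_number(calls: pd.DataFrame) -> pd.Series:
    """Proportion of call attempts at each call number that end in a completed interview."""
    calls = calls.copy()
    calls["call_number"] = calls.groupby("case_id").cumcount() + 1
    return calls["outcome"].eq("interview").groupby(calls["call_number"]).mean()

# A long tail of near-zero proportions at high call numbers suggests the current
# phase is exhausted and that a design change (for example, subsampling the
# remaining difficult cases) may be more efficient than further calls.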
Interventions

Several interventions can be triggered by the indicators described in the previous section. To address the risk of nonresponse bias, cases can be prioritized using information about subgroup response rates, including response rates for subgroups defined by the interviewer observations. For example, during NSFG collection, cases that interviewers observed to be in sexually active relationships may have lower response rates than those who were observed to be in no such relationship. The cases in the lower-responding group can be prioritized. Wagner et al. (2012) show that interviewers called prioritized cases more, and that this higher rate of calling often led to greater response
rates for these subgroups. Figure 26.1 highlights the impact with a box around the time period of the intervention. A dramatic increase in the response rate of the lowest-responding group can be seen. A second intervention is designed to ensure that interviewers are keeping up with screening households for eligibility. This was accomplished in the NSFG by creating 'screener week'. During this week, usually the fifth week of a 12-week data collection period, interviewers were instructed to focus on screening households. They were told to complete a main interview if the selected person was there at the time of screening and ready to do the interview, but to schedule appointments for main interviews the following week whenever this could be efficiently done. This focus on screeners led to higher completion rates for screening interviews (Wagner et al., 2012). Figure 26.2 illustrates monitoring an indicator for relative emphases on screener and main calling (the ratio of screeners to main calls) and shows the increase in calls to
screener cases during the fifth week. The dotted line shows the daily ratio of screener to main calls. The thin line shows the smoothed seven-day average of this ratio. Since the survey is repeated on a quarterly basis (with four 12-week data collection periods each year), there are additional lines on this figure to show experience from previous years and quarters. A third intervention was the use of two-phase sampling accompanied by several design changes. A subsample of cases is selected from among the active cases. This sample is then subjected to a new protocol. The goal is to bring in a new kind of respondent that was less likely to respond to the previous protocol. The new design might include a larger incentive, or a new interviewer, perhaps selected from a subset of more successful interviewers. Further, with a smaller sample size, interviewers will be able to focus more attention on each case. This has been shown to be effective (Groves and Heeringa, 2006).
Figure 26.2 Ratio of screener to main calls by day for NSFG.
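The ratio plotted in Figure 26.2 can be monitored with a few lines of code. The sketch below computes the daily ratio of screener to main calls and a trailing seven-day average, assuming a simple list of per-day call counts; in practice these counts would be read from the call records.

```python
def screener_to_main_ratios(daily_counts, window=7):
    """daily_counts: list of (screener_calls, main_calls) per collection day.
    Returns (daily_ratio, smoothed_ratio), where the smoothed series averages
    the daily ratio over the trailing `window` days."""
    daily, smoothed = [], []
    for day, (screener, main) in enumerate(daily_counts):
        ratio = screener / main if main else float("inf")
        daily.append(ratio)
        recent = daily[max(0, day - window + 1): day + 1]
        finite = [r for r in recent if r != float("inf")]
        smoothed.append(sum(finite) / len(finite) if finite else float("inf"))
    return daily, smoothed

# Example: a spike in screener effort on days 4-5 (e.g., a 'screener week').
counts = [(40, 60), (35, 70), (30, 75), (90, 30), (95, 25), (40, 65), (38, 70)]
daily, smoothed = screener_to_main_ratios(counts)
print([round(r, 2) for r in daily])
print([round(r, 2) for r in smoothed])
```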
Lessons Learned and Future Research

Evidence is accumulating that responsive and adaptive designs can be used to control errors and costs in CAPI surveys. Research in this area is still in the early stages, but in this section several useful techniques are described, including examples of a small toolkit of design interventions that have been developed and tested. However, new interventions are needed for the diversity of survey settings across many survey organizations. Further, interventions aimed at controlling sources of error other than nonresponse are needed.
RESPONSIVE DESIGN FOR CATI SURVEYS

This section outlines the responsive design strategy for CATI surveys, presents the response propensity model, describes the active management tools used to manage the survey and highlights the results obtained and lessons learned.
Responsive Design Strategy

The responsive design (RD) strategy used for CATI social surveys at Statistics Canada (Laflamme and Karaganis, 2010) is presented in Figure 26.3. The first step, planning, occurs before data collection starts. During planning, data collection activities and strategies are developed and tested for the other three phases, including the development of response propensity models. The second step, initial collection, includes the first portion of data collection, from the collection start date until it is determined when RD Phase 1 should begin. An intermediate cap on calls has also been introduced to prevent cases from reaching the cap (i.e., the maximum number of calls allowed) before the last data collection phase. During this initial collection phase, many key indicators of the quality, productivity, cost and responding potential of in-progress cases are closely monitored to identify when the RD Phase 1 should be initiated. The third step (RD Phase 1) categorizes in-progress cases using information available before collection begins and paradata accumulated during collection to better target data collection effort. In particular, the ‘no-contact’ group consists of
Figure 26.3 Responsive design (RD) strategy for CATI surveys.
all cases for which no contact had been made since the start of the collection period; the ‘high-probability’ group includes in-progress cases with the highest probability of completion, as assigned by the response propensity model. The objective is to improve overall response rates. During this phase, monitoring of key indicators continues. In particular, the sample representativeness indicator provides information on the variability of response rates between domains of interest to help determine when the RD Phase 2 should begin. This last step aims to reduce the variance of response rates between the domains of interest, improving sample representativeness by targeting cases that belong to the domains with lower response rates.
Response Propensity Model

During RD Phase 1, a propensity model evaluates, at the beginning of each collection day, the likelihood that each sampled unit will be interviewed during the remaining data collection period. Three different data sources are used to identify the most important explanatory variables to be included in the model. In practice, sample design information (e.g., household), paradata from the current cycle accumulated since the beginning of the survey (e.g., the number of calls by time of day, an indicator of whether a roster is completed and the number of calls with specific outcome codes) can be used in the propensity model, as well as paradata from the previous data collection (e.g., the number of calls needed to complete the previous interview). The propensity model assigns a response probability to each outstanding case in the sample each day throughout the collection period. Interviewer observations can also be used in the model when available, as in the example for CAPI surveys. The model is generally developed and validated, when possible, during the planning stage to ensure that higher completion probabilities are assigned to cases that ended up as completed interviews as opposed to other cases.
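A deliberately simplified sketch of this idea is shown below: a logistic regression is fitted on a few illustrative predictors (calls so far, whether the roster is complete, calls needed in the previous cycle) and used each day to score the outstanding cases. The predictor names, data values and the use of scikit-learn are assumptions made for illustration; they do not reproduce the model actually used at Statistics Canada.

```python
from sklearn.linear_model import LogisticRegression

# Illustrative training data from earlier collection: one row per case with
# [calls_so_far, roster_completed (0/1), calls_in_previous_cycle] and whether
# the case ended as a completed interview (1) or not (0).
X_train = [[2, 1, 3], [8, 0, 10], [1, 1, 2], [12, 0, 15], [4, 1, 6], [9, 0, 12]]
y_train = [1, 0, 1, 0, 1, 0]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Each collection day, score the outstanding (in-progress) cases.
outstanding = {"case_017": [3, 1, 4], "case_042": [10, 0, 14]}
scores = {case: model.predict_proba([features])[0][1]
          for case, features in outstanding.items()}

# Cases above a chosen cut-off would form the 'high-probability' group.
high_probability = [case for case, p in scores.items() if p >= 0.5]
print(scores, high_probability)
```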
Active Management of Responsive Design for CATI Surveys

In this context, active management is used to provide timely information on survey progress and performance using key indicators, to decide on the right moment to initiate responsive design phases during collection (the adjustment strategy) and to identify the appropriate interventions to be used to meet responsive design objectives.
Key Indicators

Response rate is not the only key indicator that should be considered to monitor and assess data collection performance and progress. Instead, response rate should be considered in conjunction with other measures, which together constitute a data quality and cost model for responsive design. These other measures are survey productivity, cost and representativeness indicators, as well as the response potential of in-progress cases; they should be used to monitor responsive design surveys and to make best use of data collection resources while taking into account the trade-off between quality and cost. The productivity indicator is defined as the ratio of the production system time (i.e., the total time logged onto the system by the interviewers once a case is open) devoted to the interviews themselves to the total system time, which includes all unsuccessful and successful calls (Laflamme, 2009). Equivalently, productivity can be seen as the number of completed interviews over the total number of calls, where each call is weighted by its duration. The proportion of the budgeted interviewer claims (financial) and system time hours (production) spent from the beginning of the survey are both used as cost indicators. The correlation between production and financial information is very high throughout the collection period. If timely financial data are unavailable, the proportion of the budgeted system time spent is a very good cost indicator, especially for ongoing surveys. The selected representativeness indicator provides information on the variability of
response rates between domains of interest. It is defined as 1 minus 2 times the standard deviation of response rates between the domains of interest.1 Finally, the response potential of in-progress cases is based on two measures. The first measure is the proportion of ‘regular’ in-progress cases – i.e., cases with no special outcome such as refusal or tracing; the second measure gives an indication of the effort already put into these cases. The proportion of ‘more difficult cases’ – those that generally required more data collection effort to get an interview or to confirm a nonresponse – generally increases as the survey progresses. Figure 26.4 shows an example of how these key indicators progress through the collection period.
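Two of these indicators follow directly from the definitions just given and can be sketched in a few lines; the input values and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import pstdev

def productivity(interview_minutes, total_system_minutes):
    """Share of total system time devoted to the interviews themselves,
    following the definition above (e.g., 0.45 means 45% of system time
    is spent on interviews)."""
    return interview_minutes / total_system_minutes

def representativeness(domain_response_rates):
    """1 minus 2 times the standard deviation of response rates across the
    domains of interest, as defined above; larger values indicate a more
    balanced response across domains."""
    return 1 - 2 * pstdev(domain_response_rates)

print(productivity(interview_minutes=1800, total_system_minutes=4000))  # 0.45
print(representativeness([0.52, 0.47, 0.55, 0.41]))                     # about 0.89
```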
Adjustment Strategy

During the initial phase, survey progress is closely monitored and analyzed against a predetermined set of key indicators and conditions to decide when to begin RD Phase 1. As shown in Figure 26.4, the Households and the Environment Survey response rate increased at the same rate as the two cost indicators (percentage of budget spent) at the beginning of the survey. However, once average productivity over the previous five days
Figure 26.4 Key responsive design indicators.
reached about 45% (i.e., 45% of total system time is devoted to interviews), the gap between response rate and the two cost indicators started growing (the same effort yielded fewer interviews than at the beginning). Early on, the proportion of in-progress regular cases started to drop while the ratio of the average number of calls made for the regular in-progress cases divided by the global cap on calls for the survey continued to rise rapidly. This also suggested that effort was being spent on a smaller number of cases with less productivity. This is particularly obvious after the lines of these last two series crossed. Using a conservative approach, the window to initiate RD Phase 1 was identified to be between the 26th and 31st day of data collection. A more aggressive responsive design strategy would have identified the period between the 15th and 20th collection day as a more suitable window. During RD Phase 1, which aims at improving response rates, the same key indicators used in the initial phase are monitored along with two additional ones – the representativeness indicator and the average response rate increases over the previous five days – to determine when a given regional office should initiate RD Phase 2. The decision to begin each RD
phase is based on a set of conditions for these key indicators that determine the window for each RD phase: the window can be updated during the collection period if required. To facilitate interpretation and decision-making for survey managers, the indicators and the status of each condition at a given time during collection are consolidated onto a single dashboard.
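As a rough illustration of such a dashboard, the sketch below evaluates a set of conditions against current indicator values; the specific indicators and cut-offs are invented for the example and would in practice be survey-specific and set during planning.

```python
def rd_phase1_window_open(indicators, conditions):
    """Return the status of each condition for starting RD Phase 1, and
    whether all of them are currently met.

    indicators: current indicator values, e.g. {'productivity_5day': 0.44}.
    conditions: {'indicator_name': (comparison, cut_off)} with illustrative cut-offs.
    """
    ops = {"<=": lambda a, b: a <= b, ">=": lambda a, b: a >= b}
    status = {name: ops[op](indicators[name], cut_off)
              for name, (op, cut_off) in conditions.items()}
    return status, all(status.values())

indicators = {"productivity_5day": 0.43, "regular_cases_share": 0.35,
              "avg_calls_vs_cap": 0.62}
conditions = {"productivity_5day": ("<=", 0.45),    # productivity dropping
              "regular_cases_share": ("<=", 0.40),  # fewer 'regular' cases left
              "avg_calls_vs_cap": (">=", 0.60)}     # effort approaching the call cap
status, start_phase1 = rd_phase1_window_open(indicators, conditions)
print(status, start_phase1)
```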
Active Management Challenges

In the responsive design context, active management represents an essential feature but also faces several challenges. The first one is producing relevant, customized and manageable reports based on key indicators that can be easily analyzed and used by survey managers at different points during collection. Analyzing this information requires survey managers to develop and maintain new skills. A second challenge is to find a balance between the amount and level of detail of information needed to manage responsive design and the amount of effort required by survey managers to analyze it. A third challenge is to identify the information required for monitoring, managing and analytical purposes. Some analytical information can be seen as ‘good-to-know information’ that is often only needed at the end of collection.
Highlights and Some Lessons Learned

Responsive design is an important data collection feature that provides new opportunities to develop and implement more flexible and efficient data collection strategies for CATI surveys. At this point, many responsive design surveys have been successfully implemented at Statistics Canada. The technical and operational feasibility of this strategy has been proven. Recent experiences have also shown that this approach can generally improve response rates with less collection effort. Finally, the availability and accessibility of timely paradata that enable the evaluation of
survey progress through key indicators are critical to developing and implementing a relevant active management strategy. Without this information, it is almost impossible to build an effective active management strategy, which is at the heart of any responsive design strategy. However, additional investigations, experiments and research are needed to analyze, evaluate and improve the current strategy, especially in a multi-mode context.
CONCLUSION

Since Groves and Heeringa published their paper, evidence from many surveys has accumulated that responsive and adaptive designs can lead to improvements in quality or cost efficiency in both CAPI and CATI surveys. Since a number of learning processes are required to effectively implement a responsive survey design (e.g., better understanding data collection processes to identify strategic opportunities for improvement), the planning phase is essential to success. Many factors must be considered when choosing a specific approach: objectives, data collection mode (e.g., CAPI, CATI or multi-mode), duration of the collection period and availability of information (such as paradata from previous cycles) before collection, availability and timeliness of paradata during the collection period and other practical considerations. The institutional context also plays an important role. Centralized organizations with strong monitoring units have many advantages. However, this does not prevent other organizations from implementing responsive design, especially when paradata and additional relevant information are available on a timely basis during collection. At the planning stage, the plans, tools and the global active management strategy that will be used to closely monitor and manage data collection have to be developed and tested. Responsive design should not be seen as a remedy to all data collection issues. It
is only one important new feature for data collection managers. This strategy must be consolidated and used with other collection initiatives, especially for those particularly interested in improving the cost efficiency of data collection. For example, on CATI surveys it is crucial that the interviewer staffing composition (i.e., interviewer experience and skills) and levels (i.e., the number of interviewers) used throughout the data collection period should be aligned with the proposed strategy to take the greatest advantage of the entire period. There is fast-growing interest in responsive design strategies in the survey methodology community. In fact, the number of sessions on responsive design and paradata research in international conferences and workshops has steadily grown over the past few years. However, research is still in the very early stages. New research has looked at using methods from operations research to optimize survey designs. These new designs may include changing survey modes as a key design feature. Despite this new research, survey researchers and practitioners agree that more theoretical and empirical research is needed – in particular, on the impact of responsive design on nonresponse bias and variance.
NOTE

1 The implemented representativeness indicator for RD is conceptually different from the R-indicator proposed by Schouten et al. (2009). For example, no assumption is made about the response propensity of each sample unit prior to collection.
RECOMMENDED READING

Responsive design was first proposed by Groves and Heeringa (2006). The article gives an explanation of responsive design and includes several examples.
REFERENCES

Brick, J.M. and D. Williams. 2013. ‘Explaining rising non-response rates in cross-sectional surveys.’ Annals of the American Academy of Political and Social Science 645(1): 36–59.
Calinescu, M., S. Bhulai and B. Schouten. 2013. ‘Optimal resource allocation in survey designs.’ European Journal of Operational Research 226(1): 115–121.
de Leeuw, E. and W. de Heer. 2002. ‘Trends in household survey nonresponse: a longitudinal and international comparison.’ In Survey Nonresponse, eds R.M. Groves, D.A. Dillman, J.L. Eltinge and R.J.A. Little. New York: John Wiley & Sons, 41–54.
Groves, R.M. and S.G. Heeringa. 2006. ‘Responsive design for household surveys: tools for actively controlling survey errors and costs.’ Journal of the Royal Statistical Society: Series A 169(3): 439–457.
Laflamme, F. 2009. ‘Experiences in assessing, monitoring and controlling survey productivity and costs at Statistics Canada.’ Proceedings from the 57th International Statistical Institute Conference. http://zanran_storage.s3.amazonaws.com/www.statssa.gov.za/ContentPages/675681671.pdf
Laflamme, F. and M. Karaganis. 2010. ‘Development and implementation of responsive design for CATI surveys at Statistics Canada.’ Proceedings of European Quality Conference in Official Statistics, Helsinki, Finland. https://q2010.stat.fi/sessions/session-29
Laflamme, F. and H. St-Jean. 2011. ‘Highlights and lessons from the first two pilots of responsive collection design for CATI surveys.’ American Statistical Association, Proceedings of the Section on Survey Research Methods. https://www.amstat.org/sections/srms/proceedings/y2011/Files/301087_66138.pdf
Mohl, C. and F. Laflamme. 2007. ‘Research and responsive design options for survey data collection at Statistics Canada.’ American Statistical Association, Proceedings of the Section on Survey Research Methods. https://www.amstat.org/sections/srms/proceedings/y2007/Files/JSM2007-000421.pdf
Rao, R.S., M.E. Glickman and R.J. Glynn. 2008. ‘Stopping rules for surveys with multiple waves of nonrespondent follow-up.’ Statistics in Medicine 27(12): 2196–2213.
Schouten, B., F. Cobben and J. Bethlehem. 2009. ‘Indicators for the representativeness of survey response.’ Survey Methodology 35: 101–114.
Wagner, J. and T.E. Raghunathan. 2010. ‘A new stopping rule for surveys.’ Statistics in Medicine 29(9): 1014–1024.
Wagner, J., B.T. West, N. Kirgis, J.M. Lepkowski, W.G. Axinn and S.K. Ndiaye. 2012. ‘Use of paradata in a responsive design framework to manage a field data collection.’ Journal of Official Statistics 28(4): 477–499.
27
Unit Nonresponse
Ineke A.L. Stoop
THE NONRESPONSE CHALLENGE

In 1989 Bob Groves sent a letter to a number of Swedish statisticians proposing cooperation in the area of nonresponse research and advocating the setting up of an international workshop on household nonresponse. In 2015 the International Workshop on Household Survey Nonresponse (www.nonresponse.org) convened for the 26th time. Looking back on nonresponse studies over time allows us to assess whether the research agenda from the past is still relevant, which problems have been solved, and which new problems have arisen. The first part of this chapter introduces a number of background issues, such as the decline in response rates and why nonresponse is important. Several of the topics presented here are discussed in more detail in other chapters of this book. Subsequently an overview is given of evolutions in the research agenda. The final part of the chapter looks backward and forward: where are we now, which problems have
been solved and which remain; how has the situation changed and what can we do? According to Norman Bradburn (1992: 391) in his Presidential Address at the 1992 meeting of the American Association for Public Opinion Research, ‘we … all believe strongly that response rates are declining and have been declining for some time’. De Leeuw and De Heer (2002) confirmed this steady decline. Ten years later Kreuter (2013a) introduced the nonresponse issue in The Annals of the American Academy of Political and Social Science by referring to the high nonresponse rates in modern surveys, while Peytchev (2013), in the same journal, even spoke about the seemingly continual ascent of current unit nonresponse rates. Singer (2011) concluded that the default answer to a survey request is now ‘no’. The decline in response rates seems steady and universal. There is some evidence, however, that response rates are not declining everywhere. The biennial cross-national European Social Survey (ESS) allows comparison over time
and across countries, and shows that response rates are going down in some countries but remain stable in others (Stoop et al., 2010). Interestingly, a number of countries show remarkable increases in response rates. These positive outcomes result from increased field efforts and intensive monitoring of fieldwork. It is not always easy to compare response rates from different surveys, across countries and over time (see also De Leeuw and De Heer, 2002). Even in official statistical surveys, fieldwork procedures and response rate calculation can differ between countries. In the European Survey on Income and Living Conditions (EU-SILC), for instance, some countries use substitution to replace nonresponding sample units (Eurostat, 2012), and in the Labour Force Survey a number of countries calculate response rates based on the participants in previous waves (Eurostat, 2014). In addition, in many surveys nonresponse rates are not always calculated uniformly over time. A decrease in response rates in Dutch surveys, for example, was caused partly by the fact that ‘in the old days’ high response rates were partly due to less strict sampling procedures, including unrecorded substitution (Stoop, 2009). Why are high, or increasing, nonresponse rates a problem? Low response rates ‘… can also damage the ability of the survey statistician to reflect the corresponding characteristics of the target population’ (Groves et al., 2002: xiii). High response rates are universally regarded as a sign of survey quality (Biemer and Lyberg, 2003). Nonresponse can reduce the precision of a survey when the final number of respondents is smaller than intended. And nonresponse can result in bias when respondents differ from nonrespondents, or when the likelihood of participating in a survey is correlated with one of the core survey variables. Consequently, substantial efforts have to be made to keep response rates at least stable. Sample units being hard to reach, or needing increased efforts (such as large incentives) to be persuaded to participate, will increase
survey costs. On a practical point, nonresponse can be a problem because it makes data collection hard to plan and can result in long fieldwork periods. And even when these extra efforts are made, there is no guarantee that nonresponse will be kept at bay. Nonresponse rates are neither the only nor the main problem. The empirical relationship between response rates and nonresponse bias is not strong (see Groves and Peytcheva, 2008; and Brick, 2013 for an overview). The possible relationship between response rates and nonresponse bias has been formally explained by Bethlehem (2002). According to his model, nonresponse is not problematic if outcome variables are not related to response behavior. If they are, for example, because those interested in politics are more likely to vote and more likely to participate in a survey on political participation, we have nonresponse bias. Groves and Peytcheva (2008) presented several mechanisms to show how response behavior can be related to outcome variables, and Groves et al. (2006) experimented with creating nonresponse bias, trying to highlight the relationship between the topic of interest and response rates. These studies clearly show that the focus should be on nonresponse bias rather than nonresponse rates.
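A small simulation can make this mechanism concrete. Under a simple response propensity formulation consistent with the model described above, the bias of the respondent mean is approximately the covariance between response propensities and the survey variable, divided by the mean propensity. The population, propensities and variable below are entirely illustrative.

```python
import random

random.seed(1)
N = 100_000

# Illustrative population: politically interested people (interested = True)
# are more likely both to have y = 1 (e.g., to vote) and to respond.
interested = [random.random() < 0.5 for _ in range(N)]
y = [1 if random.random() < (0.9 if i else 0.5) else 0 for i in interested]
rho = [0.6 if i else 0.3 for i in interested]        # response propensities
responded = [random.random() < p for p in rho]

full_mean = sum(y) / N
resp_mean = sum(yi for yi, r in zip(y, responded) if r) / sum(responded)

rho_bar = sum(rho) / N
cov = sum((p - rho_bar) * (yi - full_mean) for p, yi in zip(rho, y)) / N
print(resp_mean - full_mean)   # realized nonresponse bias of the respondent mean
print(cov / rho_bar)           # approximation: Cov(rho, y) / mean(rho)
```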
Calculating Response Rates

Given the perceived importance of high response rates, it is crucial to define and document how response rates are calculated. AAPOR, the American Association for Public Opinion Research (2015), regularly provides guidelines on how to code individual field outcomes and how to calculate response rates. These guidelines are available for different survey modes. Individual field outcomes can be placed in one of the following four categories: ineligibles; cases of unknown eligibility; eligible cases that are not interviewed (nonrespondents); and interviews (respondents). This classification looks deceptively simple, but in practice
is rather complicated. An interview can be a complete interview or a partial interview (and when is a partial interview complete enough to be accepted?). In some cases, information from more than one person has to be collected, for example, an entire household. Here we can speak of household nonresponse or individual nonresponse. In some surveys, proxies can provide information for other household members (proxy interview). In other surveys this would not be allowed. Some surveys allow substitution. This means that when the original sample unit does not respond, a replacement can be sought. According to the AAPOR, substitution is allowed as long as it is clearly reported. When substitution is allowed, selection probabilities are hard to determine, and the effect of substitution on nonresponse bias is hard to assess. For this reason, many high-quality surveys do not allow for substitution. Assessing the eligibility of sample units, i.e., whether they belong to the target population, is an important part of the response rate calculation. This might be a simple procedure in general social surveys among the residents of a particular country, but could be difficult for specific surveys, e.g., members of minority ethnic groups. If screening questions have to be asked to identify members of particular subgroups, there is evidence that the membership of subgroups will be underestimated (Clark and Templeton, 2014). If possible, nonrespondents are subdivided into noncontacts, refusals, and others. In selfcompletion surveys, it is generally not possible to make these distinctions, but they can be made in face-to-face and telephone surveys. With respect to noncontact, the question remains of whether contact with another household member (who refuses any participation) should be counted as noncontact (with the target person) or refusal (by proxy). In the event of refusal, a distinction is often made between hard and soft refusal. In addition to final outcome codes, temporary classifications of call outcomes can be recorded, as shown by the sequence of call
outcomes in the first column of Table 27.1. Some priority coding is required to derive the final outcome codes from the temporary codes. Interview always takes priority over other intermediate outcomes, as does ineligibility. Also, a refusal followed by a noncontact is generally treated as a refusal. In these sample outcomes, the final disposition codes would be those in the second column of the table. In more complex situations, detailed priority rules are required to determine the final outcomes (see Blom et al., 2010; Blom, 2014).

Table 27.1 Temporary outcomes and final disposition codes

Sequence                                        Final outcome
refusal, interview                              interview
noncontact, noncontact, interview               interview
appointment, noncontact, refusal, noncontact    refusal
noncontact, not able                            not able
noncontact, ineligible                          ineligible
noncontact, appointment, noncontact             refusal (broken appointment)

The simplest formula (RR1) for calculating response rates (or the minimum response rate) is the number of complete interviews divided by the number of interviews (complete plus partial) plus the number of non-interviews (refusal and break-off plus noncontacts plus others) plus all cases of unknown eligibility (unknown if housing unit, plus unknown, other).
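A direct translation of this formula into code might look as follows; the argument names mirror the AAPOR category labels described above, and the disposition counts are invented for the example.

```python
def rr1(I, P, R, NC, O, UH, UO):
    """Minimum response rate (RR1): complete interviews (I) divided by
    interviews (complete I + partial P) plus non-interviews (refusals and
    break-offs R, noncontacts NC, others O) plus all cases of unknown
    eligibility (unknown if housing unit UH, unknown other UO)."""
    return I / ((I + P) + (R + NC + O) + (UH + UO))

# Illustrative counts for 2,000 issued cases, 100 of which proved ineligible.
print(round(rr1(I=900, P=50, R=400, NC=300, O=50, UH=150, UO=50), 3))
```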
Factors Behind Nonresponse

The factors behind nonresponse are context-dependent. One reason for this is that nonresponse mechanisms will differ according to the type of survey. Compare, for instance, business surveys where an informant has to report on their organization or establishment, with household surveys where people have to report on themselves; or compare value surveys where people in the privacy of their own home give detailed information about their personal values with travel surveys in which
people at airports or on the train quickly report on their travel experience; or multiagent surveys where information on elderly people is provided by children and different types of carers. Stoop and Harrison (2012) present a classification of surveys that is also relevant for nonresponse. With respect to household surveys, Groves and Couper (1998) distinguish between factors under the control of the researcher and factors not under the control of the researcher. The latter include the general survey climate (see Chapter 6 of this Handbook), and characteristics of neighborhood, households, and target persons. Under control of the researcher are factors such as the sampling frame used, the coverage of the sample, the mode of the survey, design elements such as translation and incentives, the experience and quality of interviewers, and the planned duration of fieldwork. In practice, factors that are supposedly under the control of the researcher may not really be under control. National statistical institutes, for instance, may have their own interviewer staff and are able to recruit and train interviewers. Most researchers, however, have to use the sampling frames that are available, rely on survey agencies with their own survey culture and particular fieldwork staff, and cope with severe cost and time constraints. The type of sampling frame can have a substantial impact on response rates. For example, when population registers are used as sampling frames, the interviewer knows exactly who to interview. A personalized advance letter can be sent, and the sampled persons ideally should be followed to a new address if they have moved. Persons not in the population register (illegal aliens, migrant workers) will be excluded. On the other hand, when no list is available and a random route procedure is followed, an interviewer usually has to follow strict rules to select a dwelling and a person, but this is not easy to control. Sending a (personalized) advance letter is more difficult. When a person has moved, following to a new address is not required, and migrant workers – if eligible – can be reached.
Exclusion and noncoverage also have an effect on response rates. It will be easier to obtain a high response rate if difficult groups are excluded in advance, such as people who do not speak the majority language, people with mental or physical disabilities, people in non-residential households, and people who live in isolated areas. Sampling frames that only cover Internet users, or people without a land line, could result in higher response rates, but this is not likely to result in a more representative sample. Although in many cases difficult groups are excluded, increasing attention is being paid to securing the inclusion in surveys of hard-to-survey groups (Tourangeau et al., 2014). The mode of the survey (see Chapter 11 of this Handbook) is directly related to response rates and nonresponse bias. Response rates are usually highest in face-to-face surveys (De Leeuw, 2008). This survey mode is also least likely to exclude respondents such as the functionally illiterate, the telephone-excluded (no land line, mobile only, ex-directory), the non-Internet population, etc. The hope was once that mixed-mode surveys would increase response rates, because different groups would be recruited using their preferred mode. In practice, however, the response rate is usually lower in mixed-mode surveys than in single-mode face-to-face surveys. Many other aspects of the survey design can have an impact on response rates, such as the position, training, experience of interviewers (e.g., Schaeffer et al., 2010), the type of sponsor (e.g., Stocké and Langfeldt, 2004; Groves et al., 2012), the topic of the survey (e.g., Groves et al., 2004) and the use of incentives (see Singer and Ye, 2013, and Chapter 28 of this Handbook).
Current Approaches to the Nonresponse Problem

There are several ways to deal with the nonresponse problem: ignoring, reducing, measuring, and adjusting (Groves, 1989). The easiest option is to ignore or circumvent the nonresponse problem. This can be done by
calculating response rates in inventive ways, e.g., ignoring initial nonresponse when respondents from a previous survey comprise the gross sample, by substituting nonrespondents with their family members or neighbors, by allowing family members to answer questions for one another (proxy interviews). It is also possible to abandon the paradigm of random sampling and conduct surveys among members of inexpensive self-selection convenience samples. According to Groves (2006: 667) some in the field wonder ‘What advantage does a probability sample have for representing a target population if its nonresponse rate is very high and its achieved sample is smaller than that of nonprobability surveys of equal or lower cost?’ The answer is that one should acknowledge selection effects on bias in all phases of a survey, and should be most concerned about the representativeness of survey outcomes when all kinds of selection mechanisms outside the researcher’s control determine who participates in the survey and who does not (see also AAPOR, 2011; Baker et al., 2013). Seeking the easy way out could result in a high response rate (assuming that a response rate can be calculated) but this is more akin to ‘response fetishism’ than improving representativeness. Rather than artificially trying to increase response rates, another option is to try to genuinely enhance response rates, potentially at high costs. This makes sense, as survey response is often seen as an important quality criterion of surveys (Biemer and Lyberg, 2003). As Brick (2013: 331) says: ‘Response rates are easy to compute, provide a single measure for an entire survey, and have face validity’. Measures to enhance response rates are partly generic and partly dependent in their implementation on the mode of the survey. In all modes, it is important to provide potential respondents with information on the aim of the survey, the topic, data use and data protection, to offer interesting questions that can be answered and are relevant, and to closely monitor fieldwork, keep track of who answered and who did not and send
reminders. In interviewer-mediated surveys (face-to-face, telephone) interviewers play an important role. Experienced, well-trained interviewers usually obtain a higher response rate than less experienced, untrained interviewers. In self-completion surveys (web, mail) one cannot rely on interviewers, and the mail or email invitation to participate is even more crucial than advance letters in interviewer-mediated surveys (Dillman et al., 2014). PIAAC (OECD, 2013) sets a target response rate of 70%. In this study a response rate of 50% or better is acceptable if the results of a nonresponse bias analysis (when necessary) determine that no significant bias is present. Similarly, the US Office of Management and Budget (OMB, 2006) prescribes that for a survey with an overall unit response rate of less than 80%, an analysis of nonresponse bias should be conducted, with an assessment of whether the data are missing completely at random. The obligation to analyze nonresponse bias would be appreciated by the nonresponse measurers. Underlying this approach is the fact that the response rate is in reality not a good quality indicator, as there is not necessarily a linear relationship between response rates and nonresponse bias (Groves and Peytcheva, 2008). According to Bethlehem (2002), nonresponse will not result in bias if response behavior is not related to target variables. If it is, auxiliary variables that are related to both target variables and response behavior can be used to adjust for nonresponse bias. These auxiliary data can come from rich sampling frames, from linking population records and administrative data to sampling frames, and from call records and interviewer observations. Nonresponse adjusters try to solve the problem in this way. Weighting for nonresponse is a common strategy aimed at making the final sample more representative. Bethlehem et al. (2011) distinguish between post-stratification, linear weighting, multiplicative weighting, and calibration as a general approach. According to Bethlehem (personal communication), the
difference between weighting strategies is small compared to the choice of variables that can be used for weighting. Brick (2013) provides a very useful critical review of weighting adjustments.
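As an illustration of the simplest of these approaches, the sketch below computes post-stratification weights from known population counts for a single auxiliary variable. The counts are invented, and real applications typically combine several variables or use calibration.

```python
def poststratification_weights(sample_counts, population_counts):
    """Weight for each stratum = (population share) / (sample share), so the
    weighted respondents reproduce the population distribution of the
    auxiliary variable."""
    n = sum(sample_counts.values())
    N = sum(population_counts.values())
    return {stratum: (population_counts[stratum] / N) / (sample_counts[stratum] / n)
            for stratum in sample_counts}

# Example: young respondents are underrepresented, so they receive weights above 1.
sample = {"18-34": 200, "35-54": 350, "55+": 450}        # respondents by age group
population = {"18-34": 300_000, "35-54": 350_000, "55+": 350_000}
print(poststratification_weights(sample, population))
```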
DEVELOPMENTS IN THE NONRESPONSE RESEARCH AGENDA

The collaborative projects suggested twenty-five years ago include multivariate analysis of correlates of nonresponse, collection of interviewer observations on correlates of nonresponse, construction of models of persuasion and nonresponse bias, sequential design for nonresponse reduction, relationships between nonresponse propensity and measurement error, assembly of arguments for and against survey participation, and a survey of interviewers. The next sections will show how these areas have progressed.
From the Correlates of Nonresponse to Theories on Survey Participation

Originally, the focus was on single correlates of nonresponse (e.g., age and gender), but this is no longer a key element in nonresponse research. Firstly, it has become clear that the correlates of contacting potential respondents will differ from the correlates of willingness (or ability) to participate in a survey. The survey design plays an important role here. Minority language speakers will not be able to participate in a survey that is fielded in the majority language only. The illiterate will be unable to answer a self-completion questionnaire. AAPOR (2015: 15) distinguishes between permanent conditions which render people physically and/or mentally unable to participate in a face-to-face interview (e.g., dementia, blindness, or deafness) and temporary conditions (e.g., pneumonia or drunkenness) which prevailed when attempts were made to conduct an
interview. In the latter case the interviewer could come back later in the fieldwork period. Surveys among specific groups often use tailored strategies to make participation possible for everybody, e.g., translated or adapted questionnaires (e.g., in sign language), braille showcards, specially trained interviewers, etc. (see Kappelhof, 2015). Secondly, when it comes to willingness to participate in a survey, individual characteristics of respondents are just one element of many. Other issues play a role, such as the topic and the sponsor of the survey and, in interviewer-mediated situations, the interaction between interviewer and proposed respondent. Over the years, the focus has shifted from correlates of nonresponse to theories on why people answer questions and participate in surveys. Dillman et al. (2014) review seven popular theories, including the leverage-salience theory (Groves et al., 2000) and the cost–benefit theory (Singer, 2011). In the former, the aim is that features of the survey that can have a positive effect should be identified and made more salient, and negative features should be de-emphasized. In the latter, the focus is explicitly on reducing the costs that people associate with responding to surveys, and on increasing the benefits of responding. Dillman et al. (2014) try to break down the response process into different stages. In a mail or web survey, for instance, target persons may be unaware of a response request (letter, email not received); they may not have taken immediate action and forgotten the request; the response request may not have been opened or read; and the questionnaire may not have been started, started but not completed, or completed but not returned. They argue for a more holistic approach rather than focusing on individualistic psychological appeals. Their social exchange theory posits that individuals respond to human requests on the basis of perceived rewards that they trust will be received by responding to the request, and a belief that the rewards will outweigh the perceived costs of doing so.
There are many ways of providing rewards, decreasing the costs of responding and establishing trust. Applying social exchange theory is different today from in the past, because social interaction is now more spontaneous, communication more likely to be asynchronous and occur in rapid-fire sequences, there is less trust in the source of communication, trust is more likely to be withheld, and trust is now more important for making connection between the ‘costs’ and ‘rewards’ of responding. Indeed, letters, emails, or phone calls from unknown persons or institutions, asking the recipient to provide information, are now often distrusted, even (or perhaps especially) when they tell the recipient that this is important or promise them a gift.
From the Collection of Interviewer Observations on Correlates of Nonresponse to the Use of Paradata

The collection of interviewer observations was rather new twenty-five years ago. Now it is standard practice in many face-to-face surveys. In addition, contact history information is collected in many telephone and face-to-face surveys. This comprises complete information on all calls and all contacts with the respondents. Interviewer observations, information on interviewer-respondent interaction, and call history information are collectively often called paradata (Couper, 1998). These paradata, or process data, are used to calculate response rates (using final and intermediary outcome codes, see ‘Factors behind nonresponse’ above), to improve fieldwork in future rounds of a survey, to adapt the design during fieldwork (see ‘From sequential design for nonresponse reduction to responsive design’ below) and to assess nonresponse bias. The European Social Survey (Stoop et al., 2010) was the first international study to routinely collect paradata in all participating countries and make this information freely available to researchers all over the world. Analyses of paradata, both related to the
response process and to the interview process (measurement error) are now emerging. Complete overviews of the present collection and use of paradata may be found in Durrant and Kreuter (2013) and Kreuter (2013b). In addition, a large number of studies have been published recently on errors in interviewer observations and other paradata, e.g., Casas-Cordero et al. (2013), West (2013), West and Kreuter (2013), and Sinibaldi et al. (2013). Other key issues in studies on the collection of paradata are the ethics of collecting paradata (Couper and Singer, 2013), and the usefulness of paradata information in assessing or adjusting for nonresponse bias. Olson (2013) gives a good overview of the use of paradata for nonresponse adjustment. If paradata information is related to the likelihood of response (there is a lot of evidence on this) and related to core variables of the survey (on which there is less evidence), paradata are useful auxiliary variables (see Bethlehem, 2002 and West, 2013). If not, paradata of course are still useful for monitoring and improving fieldwork and for adjusting the design of a survey. For the latter purpose, paradata are most useful if they are collected in real time.
Construction of Models of Persuasion and Nonresponse Bias: Are Difficult Respondents Different?

Groves (2006) describes five methods of studying nonresponse, one of which involves studying the variation between subgroups that may exhibit different nonresponse bias characteristics. One of the earliest studies in this area was that by Lin and Schaeffer (1995), who describe two simple, ad hoc methods for assessing the impact of nonparticipation on survey estimates: the classes of nonparticipants model and the continuum of resistance model. The first, qualitative, model distinguishes between classes of nonparticipants such as noncontacts and refusals. The second, more quantitative, model assumes a
continuum between easy respondents at one end, difficult respondents in the middle, and nonrespondents at the other end. Inherent in these methods is the ability to assess the difficulty of obtaining a response from all sample units, including nonrespondents. Measures of difficulty in persuading respondents could be the ease of establishing contact (with the final noncontacts as the most difficult cases) or the ease of obtaining cooperation (or the opposite, i.e., reluctance). The simplest indicator of reluctance is whether the respondent cooperated immediately (interview or appointment) or whether more visits or additional incentives were required. More complicated indicators measure the number of unsuccessful contacts with the respondent before obtaining cooperation, comments made by the sample person during the interaction with the interviewer, and the reason for the initial refusal (no time, wrong subject, surveys not useful, doesn’t like surveys). In most studies, the simplest indicator of reluctance is used, namely whether there was an initial refusal that was overcome at a later visit. The initial refusal is often called a ‘soft’ refusal. Whether a refusal is a soft or a hard refusal will only be clear if the interview is reissued and a new contact is established. The process of re-approaching an initial (hopefully temporary) refusal and asking them again to participate in the survey is generally called refusal conversion (for an overview of studies on refusal conversion, see Stoop et al., 2010). Different approaches are possible. Either all refusals can be reapproached, even the most adamant, something which might pose ethical problems. Or specific types of refusals may receive another visit, for instance those refusals who live on the route of a second interviewer, or those who refused to cooperate with an interviewer who delivered poor quality in the survey in question. In addition, only those refusals may be revisited for whom the interviewer believes there is an acceptable chance of their cooperating. The latter case might be the most efficient, but is probably less useful in
measuring reluctance, and in reducing nonresponse bias. Interviewer assessments of the chances of future success may also be diverse and open to differences in interpretation (Schnell, 1997). Interviewers may not really expect ‘their’ refusers to cooperate, which might explain the success of deploying a new interviewer who is not hampered by previous failures to obtain cooperation. Finally, a random sample of refusals might be approached. This might be the best approach. Studies of refusal conversion are sometimes combined with studies of follow-up surveys among final refusals (Stoop, 2005; Stoop et al., 2010). There is some evidence that converted refusals are rather like initial respondents, whereas final refusals (at least those who participate in a follow-up study) are dissimilar to both.
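A minimal sketch of deriving the simplest reluctance indicators from call records is shown below; the outcome codes and data layout are assumptions made for illustration.

```python
def reluctance_indicators(call_outcomes):
    """call_outcomes: chronological list of outcome codes for one sample unit,
    e.g. ['noncontact', 'refusal', 'appointment', 'interview'].
    Returns the simplest indicator (an initial refusal later converted to an
    interview) plus the number of calls needed before the interview."""
    converted_refusal = "refusal" in call_outcomes and call_outcomes[-1] == "interview"
    calls_to_interview = (call_outcomes.index("interview") + 1
                          if "interview" in call_outcomes else None)
    return {"converted_refusal": converted_refusal,
            "calls_to_interview": calls_to_interview}

print(reluctance_indicators(["noncontact", "refusal", "appointment", "interview"]))
print(reluctance_indicators(["noncontact", "refusal", "refusal"]))
```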
From Sequential Design for Nonresponse Reduction to Responsive Design

As Groves (2006: 668) stated, ‘Blind pursuit of high response rates in probability samples is unwise, informed pursuit of high response rates is wise’. Recent years have seen a wide range of studies on tailored, responsive, and adaptive designs (for an overview, see Schouten et al., 2013). These designs have in common that they aim to secure a proportional representation of all subgroups, or to optimize the quality of the response rather than the size of the response rate. Tailored designs develop different strategies for different subgroups. These could include translating questionnaires and matching interviewers and sample units on minority ethnic background, developing different call schedules for pensioners and people in the labor force, or giving incentives to people living in low-response areas only. Adaptive designs (Wagner, 2008) originate from medical research where treatments are varied beforehand across patient groups but also depend on the responses of patients, i.e., on measurements
made during data collection. Responsive survey designs were introduced by Groves and Heeringa (2006). They divide the data collection into multiple phases. After each phase, differential treatments may be used depending on the response rates in subgroups, i.e., mode switches, additional incentives, or the use of more experienced interviewers. More information on adaptive survey designs can be found in Chapter 26 of this Handbook. Related to the use of adaptive designs is the development of other indicators of representativeness besides the response rate (Schouten et al., 2013). The R-indicator identifies those response outcomes as representative if subgroups in the population have the same response rate. Basically, it assumes that if response rates are equal in subgroups based on auxiliary data – which thus have to be available for both respondents and nonrespondents – the survey is representative. For instance, if response rates are equal for the employed and the unemployed (even when those rates are low), this will produce an unbiased estimate of the unemployment rate. Another proxy indicator of nonresponse bias is the Fraction of Missing Information (FMI). FMI measures the level of uncertainty about the values one would impute for nonresponding cases (Wagner, 2010). Aiming for balanced response rates through tailored, adaptive, or responsive designs seems attractive, but is not always easy. One problem is that one needs current, detailed information on fieldwork progress to adapt designs. The second problem is that it is not always clear which subgroups should have the same response rates. If different strata respond proportionally, but those strata are not homogeneous with respect to survey outcomes, these efforts will not have an impact on nonresponse bias.
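In the spirit of the definition referred to above, a propensity-based R-indicator is straightforward to sketch once response propensities have been estimated for all sample units; here the propensities are simply supplied as a list, whereas in practice they would come from a model fitted on auxiliary data available for respondents and nonrespondents alike.

```python
from statistics import pstdev

def r_indicator(estimated_propensities):
    """R-indicator in the spirit of Schouten et al. (2009): 1 minus twice the
    standard deviation of estimated response propensities. A value of 1 means
    all units are equally likely to respond (fully 'representative' response)."""
    return 1 - 2 * pstdev(estimated_propensities)

balanced   = [0.50, 0.52, 0.48, 0.51, 0.49]   # similar propensities -> R near 1
unbalanced = [0.85, 0.20, 0.70, 0.15, 0.60]   # very unequal propensities -> low R
print(round(r_indicator(balanced), 3), round(r_indicator(unbalanced), 3))
```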
Relationships Between Nonresponse Propensity and Measurement Error

At present there are a number of studies that relate response-enhancing measures to
measurement error. One of the first examples was Couper (1997), who found, not surprisingly, that those who say they are not interested in participating in a survey are less likely to participate and – if they do participate – are less likely to provide ‘substantively meaningful’ data (i.e., more likely to produce ‘missing’ data). Olson (2006) found limited support for the suspicion that measurement error bias increases for estimates that include reluctant respondents. Sakshaug et al. (2010) present some evidence that hard-to-persuade or hard-to-contact respondents gave more satisficing answers, but this evidence was not entirely consistent across studies or measures of satisficing. They also provide results on the complicated relationship between nonresponse error and measurement error in a mixed-mode survey. Mixed-mode surveys are seen as one way of enhancing response rates, possibly in a responsive design. However, when people can self-select a mode, selection effects and measurement error (or mode effect) are hard to disentangle. An increasing number of studies focus on response composition, selection, and measurement error in mixed-mode surveys (see Vannieuwenhuyze et al., 2010).
Assembly of Arguments For and Against Survey Participation: From Survey Climate to Survey Culture

Survey climate can be seen as a container concept relating to general social and economic indicators, privacy and confidentiality, survey culture within subgroups, social trust, social isolation and alienation, and survey scarcity (or the overabundance of surveys resulting in respondent fatigue at an individual level) (see Brick and Williams, 2013, for a recent overview). Within surveys, Brick and Williams found a strong correlation between nonresponse rates and single-person households over time, an indicator of a change in survey culture. Other results were less convincing: the decrease in violent crime rates
over time correlated strongly with a decrease rather than an increase in response rates. Asking respondents direct questions is one way to explore survey attitudes (Singer et al., 1998; Loosveldt and Storms, 2008). Other researchers have studied the relationship between general culture and survey nonresponse (e.g., Goyder et al., 2006; Johnson et al., 2010), again in an attempt to find out how the social and cultural environment influences response behavior. More information on this topic may be found in Chapter 6 of this Handbook.
A Survey of Interviewers

Information on interviewers can be a key element in improving fieldwork procedures and enhancing response rates, as well as in studying nonresponse bias. From the early 1990s (Morton-Williams, 1993) to the present (for an overview, see Kreuter and Olson, 2013: 29–30) recordings of survey introductions and interviewer-respondent interactions have been analyzed. Many others have studied interviewer characteristics such as age, sex, educational level, race, and experience. Schaeffer et al. (2010) conclude with respect to observable characteristics (race, gender, age, and voice) of interviewers that effects on nonresponse are small. Experienced interviewers seem better at overcoming negative survey attitudes and delaying tactics (‘too busy’). Experience may be hard to measure, however, especially when interviewers work for different survey organizations and the effect can be confounded by selection effects (less successful interviewers drop out). One collaborative project that was fostered by the 2015 nonresponse workshop was a concerted effort to go beyond these observable variables and to collect information on interviewers’ espoused behavior, attitudes, and opinions and link them to survey nonresponse outcomes (Hox and De Leeuw, 2002). Schaeffer et al. (2010: 445) confirm the conclusions of Groves and Couper (1998) and
Hox and De Leeuw (2002), that ‘… interviewers who report more positive attitudes about persuasion strategies, express greater belief in confidentiality, ascribe to the importance of refusal conversion, and express “willingness to proceed as usual in the face of obstacles” tend to obtain higher response rates’. Three recent studies linked the results from interviewer questionnaires to survey outcomes: Blom et al. (2010), Durrant et al. (2010), and Jäckle et al. (2011) (for a more comprehensive overview, see also Stoop, 2012). They studied experience, resilience, persistence, confidence, ‘Big Five’ personality traits (extroversion, agreeableness, conscientiousness, neuroticism, and openness to experience), etc. The results of these studies do not always point in the same direction, and relationships are sometimes weak. Collecting information on interviewer attitudes and declared doorstep behavior can be helpful in understanding response processes and for training purposes. Nonetheless, the success of this approach is modest, especially since it is clear that the type, topic, and sponsors of the surveys differ, interviewers are expected to adapt to the type of sample person and the content of the interaction, and experienced interviewers may be given the difficult cases.
WHERE ARE WE NOW?

Brick (2013: 346, 347) concluded his review of adjustment techniques by stating that, although a lot of progress has been made in many areas of nonresponse research, the ‘… central problem, in our opinion, is that even after decades of research on nonresponse we remain woefully ignorant of the causes of nonresponse at a profound level’. And ‘Until we have methods to better understand the relationships between survey requests and response, we are unlikely to be able to structure survey designs, data collection protocols, and estimation schemes to minimize nonresponse bias’. Having said that, and given developments in the past, what can we
do now? In line with Groves’ proposal for lines of research, there are a number of issues that definitely merit our attention, though without any guarantee that the nonresponse problem will be solved.
Monitor and Control Fieldwork

Response rates are sometimes influenced by fairly trivial operational procedures and decisions. Sometimes interviewers drop out in particular regions without this being noticed. In other cases, fieldwork stops because the target number of interviews has been reached, funds have been spent, or a more important survey is prioritized. This will lower response rates and may result in nonresponse bias if the hard-to-survey groups are ignored. Thus, a ‘simple’ measure to improve fieldwork and enhance response rates is to closely plan, monitor, and control fieldwork. Analyses of nonresponse in the European Social Survey (Stoop et al., 2010) show that there is not a clear relationship between fieldwork efforts and response rates across countries, if only because countries in which people are more willing to participate in a survey have to make fewer efforts. Within countries, however, the analyses show that when survey efforts are intensified, response rates go up. Luiten (2013) tried to explain interviewer compliance (and noncompliance) with the rules, and the relationship between adherence to fieldwork protocols and nonresponse bias. Differences in adherence were large between interviewers, but the relationship with nonresponse bias was small. This may have been due to these interviewers all being paid employees of Statistics Netherlands, being fairly closely monitored and using a sample of individuals from the population register. Additional evidence from the European Social Survey (Koch et al., 2014) shows that nonresponse bias may be related to the leeway that interviewers have in selecting respondents within households.
Before implementing sophisticated responsive designs, therefore, clear rules on fieldwork must be in place, these rules must be adhered to, and current paradata must be available for monitoring, controlling (and possibly adapting) the process.
Auxiliary Data

There is increasing evidence that the relationship between response rates and nonresponse bias is not always linear. Nonresponse will be a problem when the likelihood of responding is related to core survey outcomes. Auxiliary data that correlate with both mechanisms can help adjust for nonresponse. For example, if socially isolated people cooperate less in surveys and do less voluntary work, and socially isolated people less often have a listed telephone number, information on having a listed telephone number from both respondents and nonrespondents can be an important tool in adjusting for nonresponse. National statistical institutes that draw samples from the population register to which a wide range of administrative records can be linked have a wealth of auxiliary data at their disposal. For other organizations, the situation is far more complicated. Neighborhood observations from interviewers and geo-registers can play a similar role, although they may contain errors (see Smith, 2011). The same holds for the collection of paradata. One possible answer is to build in at the beginning of every survey the possibility of collecting as many auxiliary data as possible.
Multiple Modes and Mixed Modes
Twenty-five years ago, most of the research on nonresponse was based on face-to-face surveys. Increasingly, however, telephone surveys are studied. These also make it much easier to schedule calls and collect call information. Web surveys (like the earlier mail surveys, which seem to be making a comeback) are increasingly attractive because of the lower costs and shorter fieldwork periods. These self-completion surveys, however, lack the intervention of an interviewer who can try to persuade potential respondents and collect information on the response process. Mixed-mode surveys have the pros and cons of all survey (or rather recruitment) modes, and also introduce mode effects combined with selection effects. Still, it seems inevitable that surveys will move to the web-based and mixed mode. Future nonresponse research will have to take this into account and acknowledge the presence of multiple modes.
Hard-to-Survey Groups and Exclusion Mechanisms
Nonresponse can be a problem when some groups in the target population are underrepresented, and if these groups differ from other groups on core variables of the survey. However, there are many other mechanisms for excluding (potentially difficult) subgroups (Stoop, 2014). The first is simply to exclude subgroups from the target population, e.g., exclude non-native speakers, or people living in non-private households. The second is that the sampling frame used can exclude subgroups. The homeless and illegal aliens are not in population registers; an increasing number of people do not have listed telephone numbers. Other survey design aspects can make it impossible for people to participate: the functionally illiterate in self-completion surveys, the deaf in interviewer surveys. The Roma in Central Europe, the Travelers in Ireland, and other hard-to-survey groups are rarely represented due to an aggregate of exclusion mechanisms (see Tourangeau et al., 2014). One of our tasks for the future is to broaden our view, reflect on what the aims of the survey are, and make sure that hard-to-survey groups can at least participate.
Survey Ethics and Data Protection
Striving for high response rates poses some ethical problems. Is it ethical to give people incentives to try to persuade them to participate (Singer and Couper, 2008)? What does informed consent mean, and do people understand it (Couper and Singer, 2009)? What is the difference between harassing people who refused to participate and politely trying to convert initially reluctant target persons? Ethical issues in survey research are discussed in Chapter 7 in this Handbook. What is certain is that data protection laws will become stricter, and that target persons will become more concerned about their privacy. This means that the ethics of survey research will have to play a major role also (or especially) in the area of nonresponse.
Advertise the Importance of High-Quality Survey Data
As costs of fieldwork go up, and budgets go down, designing good surveys becomes harder. One possible approach is to reflect on survey quality and survey costs and decide on how to optimally allocate funds (Biemer and Lyberg, 2003). High response rates are irrelevant if most difficult groups have been excluded in advance (and if characteristics of these groups are relevant for the topic), or poorly tested questions deliver unreliable and invalid answers. The need for good survey data is not always recognized. A recent example of surveys under threat is the decision by the Canadian government to scrap the mandatory long-form census (Chase and Grant, 2010), which caused a lot of unrest among delegates to the Joint Statistical Meeting 2010 in Vancouver. A similar uproar broke out in the US when the House passed a bill cutting more funds and eliminating the long-form American Community Survey entirely (Rampell, 2012). In other countries, too, high-quality surveys are struggling to convince funders of their value to science and to society.
Couper (2013: 145) asked himself (and the audience) in his keynote speech to the 2013 ESRA conference ‘Are we redundant?’, and answered:
I take a different view, and argue for the important role of surveys – and especially high quality surveys – in our understanding of people and the societies in which we live. I believe that surveys still play a vital role in society, and will continue to make important contributions in the future. However, this does not mean we can be complacent – we do need to adapt as the world around us changes.
Towards a New Research Agenda
After more than a quarter century of nonresponse research we need to realize that surveys are not the only means of collecting information on people and our societies, that good surveys play an important role in those societies, and that nonresponse can be a threat to the usefulness and accuracy of survey data. The US National Research Council (2013) published a comprehensive review addressing the core issues regarding survey nonresponse. They propose a research agenda covering three types of studies:
•• research that would deepen our understanding of the phenomenon of nonresponse and the causes of the decline in response rates over the past few decades;
•• research aimed at clarifying the consequences of nonresponse; and
•• research designed to improve our tools for boosting response rates or more effectively compensating for the effects of nonresponse statistically.
As will be clear from this chapter, the core issues have remained the same, although the knowledge of nonresponse issues has increased dramatically, and the tools and instruments for measuring and analyzing nonresponse have improved substantially. Let us see what happens in the next twenty-five years.
REFERENCES
AAPOR (2011). AAPOR Report on Online Panels. Prepared for the AAPOR Executive Council by a Task Force operating under the auspices of the AAPOR Standards Committee. The American Association for Public Opinion Research.
AAPOR (2015). Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. 8th edition. The American Association for Public Opinion Research.
Baker, R., Brick, J.M., Bates, N.A., Battaglia, M., Couper, M.P., Dever, J.A., Gile, K.J., and Tourangeau, R. (2013). Report of the AAPOR Task Force on Non-probability Sampling. The American Association for Public Opinion Research.
Bethlehem, J.G. (2002). Weighting nonresponse adjustments based on auxiliary information. In R.M. Groves, D.A. Dillman, J.L. Eltinge, and R. Little (eds), Survey Nonresponse (pp. 265–287). New York: Wiley.
Bethlehem, J.G., Cobben, F., and Schouten, B. (2011). Handbook of Nonresponse in Household Surveys. Hoboken, NJ: John Wiley & Sons.
Biemer, P.P., and Lyberg, L.E. (2003). Introduction to Survey Quality. Hoboken, NJ: John Wiley & Sons, Inc.
Blom, A.G. (2014). Setting priorities: spurious differences in response rates. International Journal of Public Opinion Research, 26(2), 245–255.
Blom, A., De Leeuw, E.D., and Hox, J. (2010). Interviewer Effects on Nonresponse in the European Social Survey. ISER Working Paper Series No. 2010–25, University of Essex.
Blom, A.G., Jäckle, A., and Lynn, P. (2010). The use of contact data in understanding cross-national differences in unit nonresponse. In J. Harkness, M. Braun, B. Edwards, T. Johnson, L.E. Lyberg, and P.Ph. Mohler (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 335–354). New York: John Wiley & Sons.
Bradburn, N.N. (1992). Presidential address: a response to the nonresponse problem. Public Opinion Quarterly, 56, 391–397.
Brick, J.M. (2013). Unit nonresponse and weighting adjustments: a critical review. Journal of Official Statistics, 29(3), 329–353.
Brick, M.J., and Williams, D. (2013). Explaining rising nonresponse rates in cross-sectional surveys. Annals of the American Academy of Political and Social Sciences, 645, January, 36–59.
Casas-Cordero, C., Kreuter, F., Wang, Y., and Babey, S. (2013). Assessing the measurement error properties of interviewer observations of neighbourhood characteristics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(1), 227–249.
Chase, S., and Grant, T. (2010). Statistics Canada chief falls on sword over census. The Globe and Mail, July 21, updated December 22, 2010. http://www.theglobeandmail.com/news/politics/statistics-canada-chief-falls-on-sword-over-census/article1647348/ [accessed August 5, 2014].
Clark, R.G., and Templeton, R. (2014). Sampling the Maori Population using proxy screening, the Electoral Roll, and disproportionate sampling in the New Zealand Health Survey. In R. Tourangeau, B. Edwards, T.P. Johnson, K.M. Wolter, and N. Bates (eds), Hard-to-Survey Populations (pp. 468–484). Cambridge: Cambridge University Press.
Couper, M.P. (1997). Survey introductions and data quality. Public Opinion Quarterly, 61, 317–338.
Couper, M.P. (1998). Measuring survey quality in a CASIC environment. Proceedings of the Survey Research Methods Section. American Statistical Association, 41–49.
Couper, M.P. (2013). Is the sky falling? New technology, changing media, and the future of surveys. Survey Research Methods, 7(3), 145–156.
Couper, M.P., and Singer, E. (2009). The role of numeracy in informed consent for surveys. Journal of Empirical Research on Human Research Ethics, 4(4), 17–26.
Couper, M.P., and Singer, E. (2013). Informed consent for web paradata use. Survey Research Methods, 7(1), 57–67.
De Leeuw, E.D. (2008). Choosing the method of data collection. In E.D. de Leeuw, J.J. Hox, and D.A. Dillman (eds), International Handbook of Survey Methodology (pp. 113–135). New York, NY: Taylor & Francis Group/Lawrence Erlbaum Associates.
De Leeuw, E.D., and De Heer, W. (2002). Trends in household survey nonresponse: a longitudinal and international comparison. In R.M. Groves, D.A. Dillman, J.L. Eltinge, and R.J.A. Little (eds), Survey Nonresponse (pp. 41–54). New York: Wiley.
Dillman, D.A., Smyth, J.D., and Christian, L.M. (2014). Internet, Phone, Mail and Mixed-Mode Surveys: The Tailored Design Method, 4th edition. New York: John Wiley.
Durrant, G.B., Groves, R.M., Staetsky, L., and Steele, G. (2010). Effects of interviewer attitudes and behaviors on refusal in household surveys. Public Opinion Quarterly, 74(1), 1–36.
Durrant, G., and Kreuter, F. (2013). The use of paradata in social survey research. Special issue of the Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(1).
Eurostat (2012). 2010 Comparative EU Intermediate Quality report. Doc. LC 77B/12 EN – rev.2.
Eurostat (2014). Quality Report of the European Labour Force Survey 2012. Luxembourg: Publications Office of the European Union.
Goyder, J., Boyer, L., and Martinelli, G. (2006). Integrating exchange and heuristic theories of survey nonresponse. Bulletin de Méthodologie Sociologique, 92, October, 28–44.
Groves, R.M. (1989). Survey Errors and Survey Costs. New York: John Wiley & Sons, Inc.
Groves, R.M. (2006). Nonresponse rates and nonresponse bias in household surveys. Public Opinion Quarterly, 70, 646–675.
Groves, R.M., and Couper, M.P. (1998). Nonresponse in Household Interview Surveys. New York: Wiley.
Groves, R.M., Couper, M.P., Presser, S., Singer, E., Tourangeau, R., Acosta, G.P., and Nelson, L. (2006). Experiments in producing nonresponse bias. Public Opinion Quarterly, 70, 720–736.
Groves, R.M., Dillman, D.A., Eltinge, J.L., and Little, R.J.A. (eds) (2002). Survey Nonresponse. New York: Wiley.
Groves, R.M., and Heeringa, S. (2006). Responsive design for household surveys: tools for actively controlling survey errors and costs. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(3), 439–457.
Groves, R.M., and Peytcheva, E. (2008). The impact of nonresponse rates on nonresponse bias: a meta-analysis. Public Opinion Quarterly, 72(2), 167–189.
Groves, R.M., Presser, S., and Dipko, S. (2004). The role of topic interest in survey participation decisions. Public Opinion Quarterly, 68, 2–31.
Groves, R.M., Presser, S., Tourangeau, R., West, B.T., Couper, M.P., Singer, E., and Toppe, C. (2012). Support for the survey sponsor and nonresponse bias. Public Opinion Quarterly, 76(3), 512–524.
Groves, R.M., Singer, E., and Corning, A. (2000). Leverage-saliency theory of survey participation: description and an illustration. Public Opinion Quarterly, 64, 299–308.
Hox, J., and de Leeuw, E. (2002). The influence of interviewers’ attitude and behavior on household survey nonresponse: an international comparison. In R.M. Groves, D.A. Dillman, J. Eltinge, and R.J.A. Little (eds), Survey Nonresponse (pp. 103–120). New York: Wiley.
Jäckle, A., Lynn, P., Sinibaldi, J., and Tipping, S. (2011). The Effect of Interviewer Personality, Skills and Attitudes on Respondent Co-operation with Face-to-Face Surveys. ISER Working Paper Series No. 2011-14, University of Essex.
Johnson, T.P., Lee, G., and Ik Cho, Y. (2010). Examining the association between cultural environments and survey nonresponse. Survey Practice, 3(3). Retrieved from http://www.surveypractice.org/index.php/SurveyPractice/article/view/134
Kappelhof, J. (2015). Surveying Ethnic Minorities. The Hague: The Netherlands Institute for Social Research/SCP.
Koch, A., Halbherr, V., Stoop, I.A.L., and Kappelhof, J.W.S. (2014). Assessing ESS Sample Quality by Using External and Internal Criteria. Mannheim: European Social Survey, GESIS.
Kreuter, F. (2013a). Facing the nonresponse challenge. Annals of the American Academy of Political and Social Sciences, 645, January, 23–35.
Kreuter, F. (ed.) (2013b). Improving Surveys with Paradata. Analytic Uses of Process Information. Wiley Series in Survey Methodology. Hoboken, NJ: John Wiley & Sons.
Kreuter, F., and Olson, K. (2013). Paradata for nonresponse error investigation. In F. Kreuter (ed.), Improving Surveys with Paradata. Analytic Uses of Process Information (pp. 13–42). Wiley Series in Survey Methodology. Hoboken, NJ: John Wiley & Sons.
Lin, I-F., and Schaeffer, N.C. (1995). Using survey participants to estimate the impact of nonparticipation. Public Opinion Quarterly, 59, 236–258.
Loosveldt, G., and Storms, V. (2008). Measuring public opinions about surveys. International Journal of Public Opinion Research, 20(1), 74–89.
Luiten, A. (2013). Improving Survey Fieldwork with Paradata. Dissertation. Heerlen: Statistics Netherlands.
Morton-Williams, J. (1993). Interviewer Approaches. Aldershot: Dartmouth Publishing.
National Research Council (2013). Nonresponse in Social Science Surveys: A Research Agenda. Roger Tourangeau and Thomas J. Plewes, Editors. Panel on a Research Agenda for the Future of Social Science Data Collection, Committee on National Statistics. Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
OECD (2013). Technical Report of the Survey of Adult Skills (PIAAC). Pre-publication copy.
Olson, K. (2006). Survey participation, nonresponse bias, measurement error bias, and total bias. Public Opinion Quarterly, 70(5), 737–758.
Olson, K. (2013). Paradata for nonresponse adjustment. Annals of the American Academy of Political and Social Sciences, 645, January, 142–170.
OMB (2006). Standards and Guidelines for Statistical Surveys. Office of Management and Budget.
Peytchev, A. (2013). Consequences of survey nonresponse. Annals of the American Academy of Political and Social Sciences, 645, January, 88–111.
Rampell, C. (2012). The beginning of the end of the census? The New York Times Sunday Review, May 19. http://www.nytimes.com/2012/05/20/sunday-review/the-debate-over-the-american-community-survey.html?_r=1 [accessed August 5, 2014].
Sakshaug, J.W., Yan, T., and Tourangeau, R. (2010). Nonresponse error, measurement error, and mode of data collection: tradeoffs in a multi-mode survey of sensitive and nonsensitive items. Public Opinion Quarterly, 74(5), 907–933.
Schaeffer, N.C., Dykema, J., and Maynard, D.W. (2010). Interviewers and interviewing. In P. Marsden and J. Wright (eds), Handbook of Survey Research, 2nd edition (pp. 437–470). Bingley, UK: Emerald Group Publishing Ltd.
Schnell, R. (1997). Nonresponse in Bevölkerungsumfragen: Ausmaß, Entwicklung und Ursachen. Opladen: Leske und Budrich.
Schouten, B., Calinescu, M., and Luiten, A. (2013). Optimizing quality of response through adaptive survey designs. Survey Methodology, 39(1), 29–58.
Singer, E. (2011). Towards a cost–benefit theory of survey participation: evidence, further test, and implications. Journal of Official Statistics, 27(2), 379–392.
Singer, E., and Couper, M.P. (2008). Do incentives exert undue influence on survey participation? Experimental evidence. Journal of Empirical Research on Human Research Ethics, 3, 49–56.
Singer, E., Van Hoewyk, J., and Maher, M.P. (1998). Does the payment of incentives create expectation effects? Public Opinion Quarterly, 62, 152–164.
Singer, E., and Ye, C. (2013). The use and effects of incentives in surveys. Annals of the American Academy of Political and Social Sciences, 645, January, 112–141.
Sinibaldi, J., Durrant, G.B., and Kreuter, F. (2013). Evaluating the measurement error of interviewer observed paradata. Public Opinion Quarterly, 77(S1), 173–193.
Smith, T. (2011). The report of the international workshop on using multi-level data from sample frames, auxiliary databases, paradata and related sources to detect and adjust for nonresponse bias in surveys. International Journal of Public Opinion Research, 23(3), 389–402.
Stocké, V., and Langfeldt, B. (2004). Effects of survey experience on respondents’ attitude towards surveys. Bulletin de Méthodologie Sociologique, 81, January, 5–32.
Stoop, I.A.L. (2005). The Hunt for the Last Respondent. The Hague: Social and Cultural Planning Office.
Stoop, I. (2009). A few questions about nonresponse in the Netherlands. Survey Practice, June. Retrieved from www.surveypractice.org
Stoop, I. (2012). Unit non-response due to refusal. In L. Gideon (ed.), Handbook of Survey Methodology for the Social Sciences (pp. 121–147). Heidelberg: Springer.
Stoop, I. (2014). Representing the populations: what general social surveys can learn from surveys among specific groups. In R. Tourangeau, B. Edwards, T.P. Johnson, K.M. Wolter, and N. Bates (eds), Hard-to-Survey Populations (pp. 225–244). Cambridge: Cambridge University Press.
Stoop, I., Billiet, J., Koch, A., and Fitzgerald, R. (2010). Improving Survey Response. Lessons Learned from the European Social Survey. Chichester: John Wiley & Sons.
Stoop, I., and Harrison, E. (2012). Classification of surveys. In L. Gideon (ed.), Handbook of Survey Methodology for the Social Sciences (pp. 7–21). Heidelberg: Springer.
Tourangeau, R., Edwards, B., Johnson, T.P., Wolter, K.M., and Bates, N. (eds) (2014). Hard-to-Survey Populations. Cambridge: Cambridge University Press.
Vannieuwenhuyze, J., Loosveldt, G., and Molenberghs, G. (2010). A method for evaluating mode effects in mixed-mode surveys. Public Opinion Quarterly, 74(5), 1027–1045.
Wagner, J. (2008). Adaptive survey design to reduce nonresponse bias. PhD thesis, University of Michigan.
Wagner, J. (2010). The fraction of missing information as a tool for monitoring the quality of survey data. Public Opinion Quarterly, 74(2), 223–243.
West, B.T. (2013). An examination of the quality and utility of interviewer observations in the National Survey of Family Growth. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(1), 211–225.
West, B.T., and Kreuter, F. (2013). Factors affecting the accuracy of interviewer observations. Public Opinion Quarterly, 77(2), 522–548.
28
Incentives as a Possible Measure to Increase Response Rates
Michèle Ernst Stähli and Dominique Joye
INTRODUCTION
Even if surveys are seen as a highly standardized process, there is still considerable scope for adaptation and tailoring in order to optimize quality. Measures to increase response rates, including incentives, are part of this and will be discussed in this chapter. When deciding if an incentive could be used and how it should be distributed, survey designers have to consider three characteristics of the survey process, related to diversity. First, surveys are diverse in their length, their modes, their aims, their sponsors, etc. Second, the contexts in which surveys take place are various, from urbanized areas to rather isolated places, from highly developed countries to very poor ones, in which the ‘Survey Climate’ (see Chapter 6 in this Handbook) can be very different. Third, respondents can be very different in terms of age, gender, social class, minority or majority group, values, opinions, knowledge, etc.
This diversity of surveys, contexts, and respondents, also emphasized in the leverage–salience model of Groves et al. (2000), has many consequences. The use, or not, of incentives, and also their types, will be very different according to the conditions. We give here just one example for each dimension of diversity: (1) if the survey implies a significant burden for the respondents, an incentive, in particular a higher one, will probably be more appropriate; (2) if the response rate in a specific context is expected to be low, the use of some means to increase participation will probably be more often considered; and (3) concerning the respondents in a particular survey, the question of giving the same incentive, or not, to every group of participants is open and needs to be discussed in depth. This chapter aims to give some general rules to survey practitioners about the use of incentives, without forgetting to take into account the specific conditions in which a survey takes place. This means that not only can practical guidance be given, but also
some theoretical elements are required, so that each survey designer can adapt the information to the specific conditions of his or her own survey. In the next section we present the rival theories explaining survey participation. The different types of incentives are then presented with their very general effects. This can be used as a list of general rules (third section). We then discuss the specific effects by mode (fourth section) and their impacts on final sample composition, data quality, and interviewers (fifth section). The final sections discuss ethical considerations, possible long-term effects of incentive practices, and the lack of knowledge in this field concerning countries in some parts of the world, and finally return to the necessary global survey-design approach. But let us start with two introductory remarks. The first concerns the relative need for high response rates. Of course, a hundred-percent response rate implies the absence of nonresponse bias. However, it does not guarantee the highest data quality, and, conversely, a low response rate does not imply a high non-response bias. In other words, while a high response rate is desirable, a low response rate is not necessarily a sign of poor quality and, most importantly, the response rate is not the only criterion of quality (Groves and Peytcheva, 2008; see also Chapter 35 by Bethlehem and Schouten in this Handbook). The second remark concerns ‘Total Survey Error’. As mentioned in other chapters, it is important to bear in mind the entire design of a survey and not to place too much emphasis on a single point. In other words, while we shall mainly discuss incentives here, the way they are used and implemented must be considered with the full design in mind. A strategy just aiming at maximizing response rate might lead to suboptimal solutions, while a more general perspective is probably the best to follow. Dillman et al. (2014) use the expression ‘holistic approach to design’ for the same idea. But before discussing the design, it is important to go back to the basic question: why do respondents answer our surveys?
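The first remark can be stated more formally with the standard decomposition used in the nonresponse literature (e.g., Groves and Peytcheva, 2008). For a sample of size n with r respondents and m = n − r nonrespondents,

\[ \bar{y}_r - \bar{y}_n \;=\; \frac{m}{n}\,\bigl(\bar{y}_r - \bar{y}_m\bigr), \]

where \( \bar{y}_r \), \( \bar{y}_m \) and \( \bar{y}_n \) are the respondent, nonrespondent, and full-sample means of a survey variable y. The bias of the respondent mean is thus the product of the nonresponse rate and the respondent–nonrespondent difference on y; in the stochastic formulation it is approximately \( \operatorname{cov}(\rho, y)/\bar{\rho} \), with \( \rho \) the response propensity. A low response rate is therefore not damaging in itself: what matters is how strongly the propensity to respond is related to the variables of interest.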
WHY DO RESPONDENTS ANSWER AND WHY DO SOME PEOPLE NOT ANSWER OUR SURVEYS?
The answers to these two questions are not exactly the same because the first one refers more to the general mechanism of participation while the second could be linked to particular conditions, such as the geographical or temporal circumstances of a contact attempt. Both aspects of the question have to be considered, however, the second having often been neglected. To be able to tailor the use of incentives to each survey, it is fundamental to understand why potential respondents might participate in a survey or not and through what mechanisms incentives can intervene in this process.
The Reasons for Participating: Some Theoretical Orientations
A short outline of the main theoretical orientations will help to explain the impact of incentives and we shall refer to them when discussing the potential effects. At the individual level, we can simplify the discussion to two approaches, presented here as opposite ideal-types: economic and social exchanges. The economic-exchange theory postulates that people make a rational cost–benefit calculation to decide whether to participate in a survey. A monetary incentive must, however, be sufficiently attractive. In such a framework, we can expect that the bigger the incentive, the higher the participation will be. A variant of this includes a relationship between the earnings or the ‘value of the time’ of the potential respondent and the value of the incentive: a poorer respondent is more likely to participate with a given incentive than a richer one. The incentive is seen as a compensation for the time devoted to answering the survey. But other benefits and costs, such as pleasure or perceived risk for confidentiality, can also enter into the cost–benefit
calculation, even if they are less easily translated into figures and numbers. The social-exchange theory (e.g., Dillman et al., 2014), on the other hand, takes into account the perception of the parties involved that the exchange in itself can be a benefit. A person’s action is motivated by what he or she is seeking from the interaction, such as the reward his or her action is expected to obtain. People seek something from interaction and give something in return, so that the interaction is based on mutual trust. An incentive therefore cannot be seen as a payment to the respondent, but as a term in a social interaction. When a prepaid incentive is given, the researcher indicates that she trusts the respondent, so that the respondent is prone to feel she has to honor this trust. The logic of this exchange is far removed from the idea of economic exchange. In a social-exchange perspective, the mechanism of incentives refers to the anthropological rule of receiving and reciprocating, which has been described by Mauss (1966 [1924]) as gift and counter-gift. When receiving a prepaid incentive, monetary or not, a potential respondent can feel an obligation to reply, at least in some way. Most people react positively to the incentive but some may also react quite negatively, particularly to higher incentives. Monetary incentives can thus be perceived in different ways in modern societies. So far, the reasons for participating, according to these two models, can be classified either as egoistic, based on self-interest, or as social, linked to a sense of global relations in a given society. A further distinction can be introduced between models stressing one main explanation or a combination: a person can decide to participate, or not, according to the weight he or she gives to different parameters, meaning that we have to consider a multitude of causes and the way they combine in order to explain participation. Let us take two examples, the first rooted in sociology, considering the respondents in their social position, the structural perspective, and the second anchored in social psychology,
considering the respondents on the individual level and emphasizing interactions. On a more structural level, among many authors, Bourdieu has addressed class position as an important explanation for answering surveys or not (Bourdieu, 1973). It is not only the idea that middle-class people generally have the highest probability of participating, because the upper classes are often difficult to reach while the lower ones refuse more often, but also the fact that surveys are seen as organized by official or intellectual organizations: government, universities, etc. This means that more educated people feel they have greater legitimacy to express their personal opinions and positions through such an instrument. For example, an electoral survey will have better acceptance among people interested in the political process than people who think they have no say in politics. An incentive, even of the same value, will not have the same meaning for these two categories of respondents, and could be seen as a means to impose the values of a particular social group. From an empirical point of view, the tendency toward higher participation by the most integrated people is generally true and can support such a hypothesis, even if it is not so easy to know whether it is linked to a life-style facilitating contacts for a survey or to social position strictly speaking. On the interactional level, Groves et al. (1992) have established a list of social-psychological motivators for participating in surveys that explain some of the mechanisms of incentives. The main motivator refers to the already mentioned principle of ‘reciprocation’. It points out the tendency to participate if a reward is given and to observe the norm of reciprocity, where incentives create a sense of obligation. The other compliance principles stressed by these authors are: ‘commitment and consistency’ (tendency to behave in a similar way over time and consistently with personal attitudes), ‘social validation’ (tendency to behave like others in one’s social group), ‘authority’ (tendency to comply if the request comes from a legitimate authority),
‘scarcity’ (tendency to participate when having the feeling of being ‘rare’), and ‘liking’ (tendency to be responsive to requesters people like). In the often-quoted leverage–salience theory (Groves et al., 2000), the authors emphasize the multiple combinations that these motivators and ‘demotivators’ can have. The decision to participate or not depends on the weight a person gives to the possible factors for and against participation, meaning that the different advantages and disadvantages of participation may be more or less relevant for each person as well as being dependent on the survey design. Measures to increase response rates can and should therefore be targeted at subgroups. It should be of importance what factors are made relevant when requesting people to answer a survey. In interviewer-mediated surveys, the emphasis on the different factors can be adapted to the potential respondents by skilled and experienced interviewers. In fact, probably one single explanation cannot account for all the complexity of explaining participation. But the approaches mentioned here also have the virtue of recalling that participation is not only the result of an exchange between individuals but is anchored in a social context. When respondents are directly asked why they participate, they indicate reasons that can be categorized into two dimensions (Singer and Ye, 2013): altruistic responses, such as those based on norms of cooperation (for example helping the researcher or feeling that the survey is important and useful to all), and egoistic responses, which are based on self-interest (such as enjoying responding, learning from surveys, and obtaining the incentive). These dimensions refer directly to Groves’ socio-psychological compliance principles, and they also remind us that not only material rewards can be used as incentives. Enjoyment, interest, civic duty, and altruism represent alternative means to valorize survey participation that we should not underestimate and are related to the place of surveys and respondents in a society. It is of
course important to keep these characteristics in mind when discussing the different types of incentives and their use. But before doing so, we should also consider the possible reasons for not answering a survey.
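One schematic way to summarize the leverage–salience idea discussed above (a convenient shorthand of ours, not the formal model of Groves et al., 2000) is to think of the participation propensity of person i as depending on a weighted combination of the attributes j of the survey request:

\[ p_i \;\propto\; \sum_j l_{ij}\, s_{ij}, \]

where \( l_{ij} \) is the leverage of attribute j for person i (how much it matters, positively or negatively) and \( s_{ij} \) its salience (how prominently it is made in the request). In this reading, an incentive is one attribute among others: its effect depends both on the weight a given respondent attaches to it and on how visibly it is presented.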
Why People Do Not Answer
The decision not to participate is not only related to a low level of motivation, but is also linked to different circumstances. The usual distinction between contact and cooperation is particularly important in this context. In order to obtain participation, contact is the first pre-condition. This is not only a problem of coverage, which occurs when some people are excluded because they do not have the technical tool allowing contact (for example, a telephone in a CATI survey or Internet access for a CAWI survey). This could also be linked to social characteristics of respondents such as their life-style: for example, people with a very intense social life could be difficult to reach by interviewers in a face-to-face survey, even if the attempts are made at different times of the day and week – hence the expression in the literature, ‘hard to reach’. Good-quality surveys envisage multiplying the contacts in order to correctly cover this kind of population, but a carefully designed use of incentives could also help to obtain an appointment. Even if a contact is obtained, refusal can occur. The reasons for refusing cooperation are manifold, with different implications. Some refusals are linked to a problem of timing, particularly in the case of interviewer-driven surveys. In these cases, use of incentives can help at the time of fixing an appointment and reduce this kind of problem. Other types of refusal, stemming from fear for privacy or a general attitude against surveys, often grouped under the general idea of ‘reluctance’ toward the survey process, will probably not be overcome by incentives but perhaps by other elements of the survey design.
In addition, while there will always be a part of the sample that is impossible to involve in a survey – because of non-contact, systematic refusal, or other problems – a substantial part of decision-making about survey participation might be based on ephemeral factors, depending on the momentary perception of the context. Being solicited in a specific moment, mood, and personal context, by a specific channel, style, and arguments, is not neutral. Changing one of these parameters can modify the result of the cost–benefit evaluation (Dillman et al. 2014: 25–27). This is why multiple contacts and contact strategies or even re-approaching refusers are shown to be effective. These distinctions between reasons for not answering are important, either to adapt the process to what is going on in the field (adaptive design, see Chapter 26 in this Handbook) or at least to make efforts in different directions so as not to select only one kind of respondent and to ensure maximum diversity in the final sample. Once again, this shows the complexity of the survey process and the need to integrate the different elements in a coherent design, but also the fact that incentives can impact contact and cooperation probabilities.
TYPES AND VALUES OF INCENTIVES
Many different incentives have been used in surveys. This diversity is perhaps also related to the change in the survey models described by Dillman et al. (2009: 1), showing the transition from surveys conducted by well-educated interviewers in the fifties or sixties to the more depersonalized forms of recent years.
Types of Incentives
Incentives can be of very different types. Stoop (2005: 147), for example, cites a Dutch follow-up survey among refusers
where the interviewers could freely customize the incentive, ranging from money to gifts such as flowers, wine, etc., and the moment to offer it. Nevertheless, in general, we can distinguish incentives along two dimensions: timing and nature.
The first, timing, refers to the moment in the survey process at which incentives are given. Unconditional or ‘prepaid’ incentives are given at the beginning of the process, at the first contact or with an advance letter, before the interview. As mentioned, following the theory of social exchange, a first gesture (‘gift’) is meant to initiate a process of giving in return (‘counter-gift’), where the value is not so important. In fact, it rather should be modest in order to appear as a gift and not as a payment. By contrast, following the ‘economic-exchange model’, the value has to be adapted to the burden because the incentive is intended to be a sort of payment. We shall see that while the first model is confirmed in most cases, the second model can also be effective in some specific circumstances.
Conditional or ‘promised’ incentives are given at the time of or after the interview and are linked to its completion. Particularly in the case of surveys with a low response rate, this is seen as a fair and correct procedure, avoiding wasting some resources on potential non-respondents. We shall see that this line of reasoning is not always valid in terms of cost–benefit analysis. In reality, this distinction is not totally determined by timing alone, because the way the incentives are presented by letter or by the interviewer and the way they are distributed matter as well.
The second distinction is linked to the nature of the incentive. In particular, the relation to ‘real money’ is a practical criterion.
Monetary: cash is given either by an interviewer at the first contact or later, or sent by postal mail before or after the interview. The value can be very different; the literature gives examples ranging from less than one US dollar to more than a hundred dollars. Some countries do not allow sending and receiving money by post. In other countries,
it could be seen as suspicious to receive such an amount in a normal letter.
Quasi-monetary: checks. The advantage is that they seem less intrusive than ‘real money’. Checks are seen as less problematic to send and only cost something to the survey organization if cashed. However, checks have generally been found to be less effective than cash. In Switzerland, a couple of experiments have shown that checks are less than optimal in a cost–benefit perspective and can have a negative impact on the sample composition, if the reliability of checks is differentially perceived by subgroups of the population (Ernst Stähli et al., 2016).
Semi-monetary: vouchers for purchases (including books, cinemas, flowers, or transport). As with checks, vouchers are less intrusive than cash, but the impact is generally also lower. Vouchers also have a less universal value because they are linked to a type of object or service. Furthermore, it could be detrimental for the survey to be associated with a commercial enterprise or even to give the impression that it is sponsored.
Non-monetary gifts: such as a pen or pencil or any kind of object. In general, they are not very efficient. As with checks, part of the problem is that the same object does not necessarily have the same value for every person in the sample, and this is of course very difficult to control. In some cases, such an in-kind incentive can be linked to the goal of the survey, thereby being adapted and more efficient.
Charities: offering the respondent the opportunity to make a donation to a charity organization is generally not efficient. But, if we can let the respondent choose between different incentives, it can be the incentive of choice for a non-negligible number of respondents (Ernst Stähli et al., 2016).
Lotteries: the argument is that the prospect of winning a large prize is a strong factor for participation. However, as pointed out by Singer et al. (1999), in an economic model the value corresponding to the expectation of gain is, in fact, small, which will reduce its attractiveness (a short worked example follows the list of general rules below). In many countries also, legal precautions have to be observed when a lottery is organized, even in a survey context. Finally, and more importantly, the impact is generally negligible, according to the literature.
In summary, cash seems the most convenient form, in particular because it is unconditional in terms of use: everyone can do what they want with cash money, such as buying something, giving a gift to someone else or even donating to charity. The other types of incentives can be fine if we have some information on the respondents’ preferences, which normally are unavailable, except possibly in panel studies. And finally, such a choice is also related to the general design of the survey. Just to give an example, it is of course easier to use unconditional cash incentives if we have a sample based on a list of individuals rather than a list of households in which we first have to select an individual. This is summarized in the literature by the title of a famous paper in the field, ‘Charities, no; lotteries, no; cash, yes’ (Warriner et al., 1996), to which ‘prepaid, yes’ should be added.
Based on the clearest trends revealed by the literature (see among others the central readings proposed at the end), we can put forward a few general rules:
•• Prepaid incentives are far more effective than conditional, promised ones, because of the norm of reciprocity. Even with low value, participants get significantly involved in a social exchange.
•• All experiments, whatever the interview mode, show an increased response rate with unconditional prepaid cash incentives. There is, however, a declining effect of the amount. In some cases, a higher amount was shown to be counterproductive.
•• The effectiveness of promised incentives is at the very least debated in the literature. Such incentives might perhaps influence the interviewers more than interviewees. But we shall return to this later.
•• Monetary incentives (especially in the form of cash) are more effective than non-monetary ones, especially for people who have no other reason to participate (no topic interest, altruism,
authority, etc.) or low time availability (according to leverage–salience theory).
•• Basically, the effect of incentives is higher when the response rate without incentives is particularly low. Yet, especially in this case, we should also work on other dimensions, such as making the survey more interesting.
•• The effect of incentives used during refusal conversion is similar to the effect during first contact, so that, depending on sample size and cost structure, it can be cost-effective to concentrate incentives on this step (Brick et al., 2005). However, with the differential use of incentives, ethical problems also arise, which are discussed later.
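To make the expected-value argument against lotteries mentioned above concrete, consider a purely hypothetical example: a lottery with a single prize of 1,000 offered to 10,000 sample members corresponds to an expected value per person of

\[ \frac{1{,}000}{10{,}000} = 0.10, \]

far below even a very modest prepaid cash incentive, which helps explain why lotteries are generally found to have a negligible effect.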
Values
According to the economic-exchange theory, the greater the value, the stronger the effect. By contrast, in a social-exchange model, a high-value incentive can have the reverse effect. People have the feeling of being ‘bought’ in an economic market and do not feel as though they are in a social relationship. Dillman et al. (2014: 370) argue that a one-dollar solution is often the best (or at least a one to five dollar range), taking into account that for practical reasons, coins are not easily usable. While most studies highlight an effect growing with the value, more recent studies show that this effect has diminishing returns (Cantor et al., 2008). An experiment in Switzerland even showed a reverse result: comparing unconditional incentives of 10 and 20 CHF (approximately 10 and 20 US$) for a one-hour face-to-face interview among the general population, the smaller value proved to be slightly more effective, lending support to the social-exchange theory (Ernst Stähli et al., 2016). The value should also be related to the type of incentive. According to another Swiss survey, it seems that a promised incentive has to have a higher value in order to be efficient: the effect of the promised incentive was much stronger in the case of 30 CHF than in the case of 20 CHF (ibid.). Of course, the value has to be related to the context. First, the burden associated with the
survey: the more demanding a survey, the higher the incentive could be. This holds even within a social-exchange model. Second, the value has to be adapted to the target population: Dillman et al. (2014: 370) mention a study proposed to physicians where the incentive could be as high as one hundred dollars. Third, it has to be in line with the national context and ‘survey culture’. Dillman et al. (2014) discuss, for example, a Russian study where the incentive was rather low but still effective.
INCENTIVES AND SURVEY MODES: WHERE ARE THE SPECIFICITIES?
The mode of contact and of data collection has a clear impact on the usability and efficacy of incentives. It determines not only the feasibility of offering certain types of incentives, but also interferes with the process of decision-making for survey participation. Among other reasons, mode matters because the burden is also a function of the presence, or not, of an interviewer, and of the mode itself, as face-to-face interviews generally last longer, for example. Moreover, some people can accept answering questions just to comply with the request of the interviewer and not for the survey itself. More generally, we shall argue that thinking about incentives is necessarily linked to the way that the total survey is implemented. In this sense, the mode is not a question by itself; what matters is rather the presence, or not, of an interviewer who manages the incentives and the process of the interviews. Historically, incentives were initially used in mail surveys, and one of the first reviews of the literature on incentives (Armstrong, 1975; Church, 1993) was dedicated to this mode. This is one of the modes where the impact of the incentives is the most important, partly because of the absence of an interviewer. In mail surveys, increases of up to nearly 20% were observed in some American studies (Church, 1993). But of course such a value
also depends on the response rate in the control group, the interest and importance of the survey, the necessary effort and burden of responding, as well as the context for example. Similar results have been observed for Web surveys, which are seen very often as sharing the same model as mail surveys in terms of length, effort, use of visual support, and non-intervention of an interviewer – the exception being of course the case of online Internet panels, which will be discussed later with other panel studies. Because of their specific recruitment and contact procedures, lotteries are widely used in Internet surveys (Göritz, 2006), with, however, limited effects as already mentioned. For telephone surveys (Singer et al., 1999, 2000 and Cantor et al., 2008) the literature mentions the same kind of results. Cantor et al. (2008) nevertheless stress the complexity induced by RDD (random digit dialing) where it is not always possible to have the name of the participant before calling. In the same paper, the authors also mention that if a minimal value of a prepaid incentive is adapted to the screening part of a survey, a promised incentive could be adapted to later phases, but with a greater value in this case. The higher value needed for promised incentives was also mentioned in the Swiss case (Ernst Stähli et al., 2016). In face-to-face surveys, interviewers have an even more central role in the recruitment and persuasion of participants. The effects of incentives are therefore generally smaller in interviewer-mediated surveys, but in the same direction as in mail surveys. The influence of interviewers is discussed later in this chapter. In the case of mixed-mode surveys, the results of the use of incentives are the same as in the other modes, even if the question of ideal timing can be complicated. The problem is that in sequential modes, it is probably difficult to add incentives step after step unless this is organized so as to raise the incentives in order to convert refusals. But it will then be seen more in the logic of economic
perspective than as social exchange. Some experiences in Switzerland implementing high-value incentives at the end of the process were not efficient. Panel surveys, whatever their interview mode, are a particular case because of their intrinsic sequential design. Compared to mixed-mode surveys, however, the sequences are designed in larger time spans, making the recall of past incentives less crucial. The effect of incentives in panels of big surveys is quite difficult to isolate because incentives are always implemented along with other measures to increase response rate and minimize attrition. But, generally, incentives given at the recruitment stage retain at least some effect over waves, even if not repeated. On the other hand, if distributed in subsequent waves, the impact is higher on previous refusers. Overall, they minimize attrition, although the optimal design, the long-term effects and impact on accuracy of measurement are not clear (Laurie and Lynn, 2009). Internet panels add the challenge of finding incentives appropriate to the electronic channel and feature some specific effects. For example, promised incentives tend to be more effective here than in other settings, probably because a trust relationship between the survey organization and the respondents has already been established (Göritz, 2006). Mixed-mode and panel surveys show in a salient way the interactions between the different aspects of the whole survey design. In any case, whatever the interview mode, the use of incentives should be planned with the whole survey design in mind. In general, and this is true for every mode or combination of modes, it is important to keep in mind the interaction between the use of incentives and the other aspects of the survey design. Here are two examples showing this link. We have already mentioned that it is easier to use unconditional cash incentives if we have a sample of individuals rather than households. Likewise, the integration of this kind of incentive in a random route sampling implies a preliminary selection of addresses
which can be extremely costly. The same is of course true for RDD in the case of CATI surveys (Cantor et al., 2008). It is not enough to give an incentive. It is essential to present it in the frame of the survey. In general, it is important to decide how to present it in the advance letter. There are also many examples in the literature of adapting and presenting in-kind incentives according to the aim and the target of the survey (for example Ryu et al., 2005). In some specific settings, incentives can even prove counterproductive, namely in the case of a very strong intrinsic motivation to participate: for example, if the topic concerns a health problem at the center of the respondent’s life. Being offered an incentive could then be perceived as demeaning (Coogan and Rosenberg, 2004 cited by Singer and Ye, 2013).
CONSEQUENCES OF USING INCENTIVES
The general interest of using incentives is of course to increase response rates in most cases, but two dimensions can be distinguished – one linked to the recruitment process and the other to the results.
Impact on the Recruitment Process
At the time of recruitment, the first consequence of using incentives, particularly prepaid cash, is to increase the response rate. The proportion is of course dependent on the type of survey, in most cases higher in mail than in interviewer-mediated ones. The effect of the incentive is also higher if the response rate is low and probably also higher if the burden is high. Part of the story is that incentives help particularly if there are no other reasons to answer. But two other consequences are important to keep in mind. On the positive side, the contact is facilitated for the interviewer: after receiving such an incentive, many people are
keen to answer. In fact, this is also an interesting point in terms of cost–benefit analysis: diminishing the number of necessary contacts is of course a gain, which is particularly important in the case of an unclustered national sample in face-to-face surveys. On a less positive side, some people may refuse because of the feeling of being forced to answer by the incentive and the logic of exchange linked to it. A good number of them will send back the incentive even if it is of a certain value. This last category of people will be very difficult to convert and persuade to participate. Once again, we have to balance this drawback with the advantages of using incentives, particularly in terms of sample composition. According to Swiss experiences at least, the demographic composition of the sample improves when using incentives. The literature mentioned so far also identifies either no effect or positive effects on traditionally biased groups such as extreme incomes, minorities, low education, low community involvement, and lack of interest in the topic. However, a few studies have shown some undesired effects, so that this conclusion is not systematic and has to be handled carefully. The literature on nonresponse bias induced by incentives in Web surveys is still scarce (Callegaro et al., 2014).
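The cost–benefit point made above can be illustrated with a back-of-the-envelope calculation; all figures below are hypothetical and simply show how fewer contact attempts and a higher response rate can offset the per-case cost of a prepaid incentive.

```python
# Back-of-the-envelope comparison of cost per completed interview
# with and without a prepaid incentive. All figures are hypothetical.
def cost_per_complete(sample_size, response_rate, contacts_per_case,
                      cost_per_contact, incentive_cost=0.0):
    total_cost = sample_size * (contacts_per_case * cost_per_contact + incentive_cost)
    return total_cost / (sample_size * response_rate)

no_incentive = cost_per_complete(2000, response_rate=0.45,
                                 contacts_per_case=4.0, cost_per_contact=30.0)
with_incentive = cost_per_complete(2000, response_rate=0.55,
                                   contacts_per_case=3.0, cost_per_contact=30.0,
                                   incentive_cost=10.0)
print(round(no_incentive), round(with_incentive))  # 267 vs. 182 per complete
```

Whether the incentive actually pays for itself in this way depends, of course, on the survey's own cost structure and on how much the incentive really reduces the number of contact attempts; the sketch only shows how the trade-off can be made explicit at the design stage.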
Impact on Data Quality
The other element of impact concerns data quality. But how does one define data quality in such a case (Biemer and Lyberg, 2003)? If it is in terms of socio-demographic representativeness, we have already seen that the result is mostly positive: incentives bring into the survey some people who would not participate otherwise. If it is in terms of the number of missing values, there is no difference according to the literature. In other words, people who would have been less motivated to answer in the absence of incentives do not answer less accurately. The limited literature on other
indicators of the quality of answers, such as straight-lining or satisficing, shows rather positive, neutral, or inconsistent results (e.g., Medway, 2012; Medway and Tourangeau, 2015; or Lipps and Pekari, 2013). All in all, once again the balance is more in favor of a good use of incentives because the data quality seems as good, or even better, if we think in terms of characteristics of the sample. But we also have to mention that this last aspect has received less attention in the literature, and even less in a comparative perspective.
Incentives and Consequences for Interviewers
Interviewers can have an important role to play, not only in trying to persuade potential respondents but also in managing incentives, which can add some complexity to the management of the survey, in particular when they are free to allocate some money to the adaptation of the incentive as reported by Stoop (2005: 147). In some cases, the interviewers’ task can become very complex if they not only have to persuade and interview respondents but also to manage incentives, describe the environment of the place of interview as in the ESS (European Social Survey), and code the interaction, even in the case of refusal. In relation to incentives, some experiments have shown that knowing or not knowing the value or even the presence of an incentive does not change the result of the interaction between interviewer and interviewee (Singer and Ye, 2013). However, the experience of many survey agencies shows that interviewers like incentives and can feel themselves in a better position to ask the respondent to answer if they have something to offer. Some interviewers in some situations even decided by themselves to give a gift to the respondents, just to show that the survey situation is a case of social exchange between persons (personal communication with interviewers in Britain and Switzerland).
ETHICAL AND PRACTICAL CONSIDERATIONS
Ethics seems to be an increasingly frequent topic in discussions of survey practices in general and incentives in particular. It is for example interesting to note that in some books dedicated to survey methodology (Alasuutari et al., 2008 and even Groves et al., 2013) incentives are discussed alongside other ethical issues. One reason, put forward by Singer and Bossarte (2006), is the growing importance of ethics committees, which now also operate outside the medical research field, not only in the US but also in some European countries. We shall focus on two aspects in this section, discussing differential incentives as well as general social acceptance.
Differential Incentives
We have mentioned several times the tension between a standardized process, identical for everybody, and adaptation to different targets. If we use the leverage–salience model, adaptation is of course even more attractive, as the motivations to participate can vary across respondents. At the same time, as in all model-based approaches, it is important to know exactly the limitations of the model before implementing it. Without going into a discussion of data quality and the implications of targeting one type of respondent more than another, there is a twofold problem to be mentioned. First, there is a clear question of fairness. If the incentives are higher for people who refused the first time, is this fair in relation to the persons who accepted the survey solicitation at the first call? Along the same lines, how fair is it to pay more to people with higher earnings, arguing that the ‘value’ of a given amount is a function of what could be earned in a given amount of time? If we follow the logic of social rather than purely economic exchange, this is clearly questionable.
Second, if we follow the argument of Pierre Bourdieu in ‘Public opinion does not exist’ (1973) and postulate a power structure underlying the responses to surveys, we can examine whether an incentive providing the same participation level for all social categories will reflect the asymmetry inherent in the power distribution in a society. In other words, is it meaningful to ‘buy’ some responses from some social categories given that, by definition, the probability of response and the centrality of the subject are not the same?
Social Acceptance
Another problem is the social acceptance of incentives, in particular those given as cash in an unconditional setting. They are sometimes seen as a waste of money, even more so if it is public money. Of course, incentives can be efficient in cost–benefit terms, but this is a difficult argument to use with respondents or the public. In this context, it is not surprising, as also mentioned by Dillman et al. (2009), that many government surveys prefer to rely on the legitimacy given by their public role and avoid financial incentives. In the case of surveys supported by the Swiss National Science Foundation, some people who are approached write directly to the Foundation, questioning the validity of cash incentives and the misuse of public money in such a case (Ernst Stähli et al., 2016). Even if these reactions are not frequent, it is important to keep in mind the importance of communication about surveys and their social utility, which is also part of the construction of the survey climate.
Long-term Consequences
Some researchers, such as Singer and Ye (2013), question the long-term consequences of using monetary incentives in creating expectations. Nevertheless, this does not seem to be a big problem, for at least two reasons.
First, as we have seen in the case of panel surveys, it is not the act of receiving an incentive that negatively conditions the probability of answering without an incentive at a later stage of the process. Second, probably only a small proportion of the surveys conducted in a country are of very high quality and offer incentives that are well integrated in the process. Incentives could even be a way to distinguish ‘serious surveys’, where quality is crucial, from more commercial ones that do not represent such an investment. Such a distinctive feature could be an advantage for high-quality surveys using incentives. But it could also have consequences for other surveys involving interviewers, if interviewers expect less cooperation when incentives are not used.
WHEN AND WHERE TO USE INCENTIVES?
All the literature examined so far suffers from some form of universalistic fallacy. Most of the experiences were North American, some were European, and very few lay outside these contexts. Nevertheless, the results are often presented as being valid everywhere. There are a few exceptions: the question of geographic limitation and cultural differences is mentioned by Singer et al. (1999), as well as Singer and Ye (2013), as a reason to limit data to the US and Canada. That leaves some space for research in other countries. For example, the use of monetary incentives applied without differentiation in a country like Turkey could lead to an overrepresentation of the poorer categories, already very present in samples (Yilmaz Esmer, personal communication), a different result than in most Western countries. In Germany most of the effects found in US studies are confirmed, at least for large-scale face-to-face surveys (Pforr et al., 2015). For a more general picture, we can look at international programs. For example, in 2013, a survey was conducted among
member countries of the International Social Survey Programme (ISSP) to investigate their survey-design practices. Thirty-nine countries, those currently most active in the ISSP, participated in the survey. In 2005, a similar survey was conducted, with answers from all 37 active member countries (Smith, 2007). Concerning incentives, 10 of the 39 responding countries used them for the last ISSP survey (about the same proportion as in 2005), and nine other countries reported that they would like to use them if they had enough resources. Finally, half of the ISSP countries said they do not use and do not want to use incentives.
All ISSP countries which use cash as an incentive reported that their population is frequently solicited for surveys and also quite difficult to persuade to participate. Cash is probably used as an effective, but perhaps also aggressive, means to raise participation. The prepaid incentive (offered to all sample units) was used at the time only in self-completion surveys. Figure 28.1, which plots per capita GDP against the mean response rate to the last ISSP surveys, shows the different situations that can be encountered: for example, in countries with a relatively low per capita GDP, only face-to-face surveys
[Figure 28.1 plot area: countries plotted by survey mode and use of incentives, with the mean response rate 2008–2012 on the vertical axis and GDP per capita in 2012 (USD) on the horizontal axis.]
Figure 28.1 Modes and use of incentives in last ISSP survey by per capita GDP and response rate.
Source: 2013 internal survey about methodology among ISSP member countries.
Legend: Mail: self-completion questionnaire; FtF: face-to-face interview; No: no incentive offered; Would like: no incentive but would like to use some; Yes: incentive offered.
AU: Australia, BE: Belgium-Flanders, BG: Bulgaria, CA: Canada, CH: Switzerland, CL: Chile, CR: Czech Republic, DE: Germany, DK: Denmark, ES: Spain, F: France, FI: Finland, GB: Great Britain, HR: Croatia, HU: Hungary, IR: Ireland, IS: Israel, JP: Japan, KR: South Korea, LI: Lithuania, LV: Latvia, NL: Netherlands, NO: Norway, NZ: New Zealand, PH: Philippines, PL: Poland, PT: Portugal, RU: Russia, SL: Slovenia, SK: Slovakia, SW(=>SE): Sweden, UA: Ukraine, UR: Uruguay, US: United States of America, ZA: South Africa (N = 35) (Argentina, India, Italy, and Taiwan were excluded because of missing data).
are conducted. The level of literacy and also the cost of surveys are probably low in such a context. Incentives are also not used where GDP is low. Where GDP is higher than the median value, mail surveys can be used, even if some countries like Switzerland and the US have kept a face-to-face format. Also, in this same category, surveys with incentives generally achieve higher response rates. But the examples of Germany and Norway show that there are exceptions to such a general rule. This very cursory picture of an international example shows that incentives are considered in different contexts and are generally effective. However, local regulations and traditions make it impossible to implement a real experimental design and to validate such a conclusion. Furthermore, the use of incentives is somewhat paradoxical because they are used above all where the response rate would otherwise be low. This explains the diversity of practices in ISSP surveys, and probably in most comparable international surveys, even if the general principles, like social exchange and reciprocity, remain valid in very different contexts.
CONCLUSION
While in this chapter we emphasize incentives as a central means to increase response rates, the need to integrate them into the survey process shows that other measures should not be underestimated. First, the whole communication with the respondents has to be carefully conceived and prepared, from the different contact and follow-up letters, with their style, content, highlighted topic, legitimacy of the organizer, etc., to additional communication channels such as websites, flyers, personalized feedback, image work in the media, and training of interviewers for the first contact and for persuasion. Second, the whole contact procedure, through the definition of the modes of contact, timing, reminders, number of contact attempts,
refusal conversions, etc., has a direct impact on response rates. International programs, like the European Social Survey and the ISSP survey just mentioned, also underscore the importance of training interviewers in the case of interviewer-mediated studies and the role of trust in the sponsors and field agencies active in the survey process. This includes the elements discussed in Chapter 6 (this Handbook), ‘Defining and assessing survey climate’. In this sense, it is important not only to keep the number of surveys to a minimum and to reduce the burden as much as possible, but also to convince respondents that it is important to have good indicators. In other words, we not only have to be convinced of the usefulness of surveys but also to convince the potential respondents of the quality of the survey process. In conclusion, we again emphasize the need for careful integration of the incentives in the survey design. Incentives, particularly unconditional cash, are efficient, but they only become meaningful after analysis of the setting, the value, the burden, and the context. Even the way interviewers will deal with this reward must be carefully evaluated. Incentives are not the only way to obtain a higher response rate, and they can be even more efficient in conjunction with other measures. In other words, it is really a ‘Total Survey Error’ perspective that has to be considered. Likewise, it has often been argued that incentives are efficient in cost–benefit terms. This is generally true, but we have to keep in mind that while the costs are easy to compute, the benefits are often more difficult to estimate, because it is then necessary to quantify qualitative concerns. For example, the ease of making contact can be taken into account in a face-to-face model, as well as the reduced cost of recalls in the case of a mail survey, but an increase in quality has to be assessed in relation to the goals and importance of the survey. The more important a survey is in terms of results or scientific use, the more important are high-quality data and the more useful are carefully chosen and implemented incentives.
We have also drawn attention to the potential problems that can arise from adapting too much to some specific categories of respondents, in particular when following a strictly economic perspective and focusing on reluctant respondents. This reasoning applies to surveys carried out in a national framework, and it is important to discuss it at the design stage. When transferred to a multinational comparative setting, it becomes even more important: we must also consider the risk to comparability induced by the optimization of results in a particular country; a good comparative design is not necessarily obtained by country-specific optimization processes (Stoop et al., 2010). In this sense too, a Total Survey Error perspective implies looking at the whole process of data production across the complete set of participating countries. This is also an invitation to develop research on incentives within a comparative perspective. Since response rates, as well as the general acceptance of surveys, have been declining for several years (Smith, 1995), at least in Western countries, incentives will probably be used more often in the future. As we have seen, this most often has valuable consequences in terms of response rate, composition of the sample, and quality of response. However, more complex designs, with the integration of register or auxiliary data, will probably increase the temptation to adapt incentives to different targets, which is problematic in terms of ethics and endangers the idea that surveys are ‘fair’ and based on a social-exchange model. Furthermore, for analysts, integrating surveys or data produced under different conditions will be challenging. This is one more call to position the question of incentives in the global perspective of surveys and the production of knowledge about a society.
RECOMMENDED READINGS
See Singer and Ye (2013) for a systematic review of recent articles, as well as Singer (2002) or Cantor et al. (2008) for earlier reviews. In addition, see Church (1993) for a meta-analysis of experiences with incentives for mail surveys, Göritz (2006) for Web surveys, and Singer et al. (1999) for interviewer-mediated surveys; and see Dillman et al. (2014, 4th edition) for a discussion of the ‘tailored design’.
NOTES
1 Computer Assisted Telephone Interviewing
2 Computer Aided Web Interviewing
REFERENCES
Alasuutari, P., Bickman, L., and Brannen, J. (eds) (2008). The SAGE Handbook of Social Research Methods. London: SAGE.
Armstrong, J.S. (1975). Monetary incentives in mail surveys. Public Opinion Quarterly, 39, 111–116.
Biemer, P.P. and Lyberg, L.E. (2003). Introduction to Survey Quality. Hoboken, NJ: Wiley.
Bourdieu, P. (1973). L’opinion publique n’existe pas. In Les temps modernes, 318, January, 1292–1309 (also http://www.hommemoderne.org/societe/socio/bourdieu/questions/opinionpub.html). Translated as: Public opinion does not exist. In Sociology in Question. London, Thousand Oaks, New Delhi: SAGE, 1993, pp. 149–157.
Brick, J.M., Montaquila, J., Hagedorn, M.C., Brock Roth, S., and Chapman, C. (2005). Implications for RDD design from an incentive experiment. Journal of Official Statistics, 21(4), 571–589.
Callegaro, M., Baker, R.P., Bethlehem, J., Göritz, A.S., Krosnick, J.A., and Lavrakas, P.J. (eds) (2014). Online Panel Research: A Data Quality Perspective. UK: John Wiley & Sons.
Cantor, D., O’Hare, B., and O’Connor, K. (2008). The use of monetary incentives to reduce non-response in random digit dial telephone surveys. In Advances in Telephone Survey Methodology, eds Lepkowski, J.M. et al. New York: Wiley, pp. 471–498.
Church, A.H. (1993). Estimating the effect of incentives on mail survey response rates: a meta-analysis. Public Opinion Quarterly, 57, 62–79.
Dillman, D.A., Smyth, J.D., and Christian, L.M. (2009). Internet, Mail and Mixed-mode Surveys: The Tailored Design Method, 3rd edition. New York: Wiley.
Dillman, D.A., Smyth, J.D., and Christian, L.M. (2014). Internet, Phone, Mail and Mixed-Mode Surveys: The Tailored Design Method, 4th edition. New York: Wiley.
Ernst Stähli, M., Joye, D., Pollien, A., Sapin, M., and Ochsner, M. (2016). Over 10 years of incentive experiments within international surveys in Switzerland: an overview. FORS Working Paper Series, paper 2016-X. Lausanne: FORS.
Göritz, A.S. (2006). Incentives in Web surveys: methodological issues and a review. International Journal of Internet Science, 1, 58–70.
Groves, R.M., Cialdini, R.B., and Couper, M.P. (1992). Understanding the decision to participate in a survey. Public Opinion Quarterly, 56, 475–495.
Groves, R.M., Singer, E., and Corning, A.D. (2000). Leverage-salience theory of survey participation: description and illustration. Public Opinion Quarterly, 64(3), 299–308.
Groves, R.M. and Peytcheva, E. (2008). The impact of nonresponse rates on nonresponse bias. Public Opinion Quarterly, 72, 1–23.
Groves, R.M., Fowler, F.J., Couper, M.P., Lepkowski, J.M., Singer, E., and Tourangeau, R. (2013). Survey Methodology, 2nd edition. Hoboken, NJ: Wiley.
Laurie, H. and Lynn, P. (2009). The use of respondent incentives on longitudinal surveys. In Methodology of Longitudinal Surveys, ed. Lynn, P. New York: Wiley, pp. 205–234.
Lipps, O. and Pekari, N. (2013). Mode and incentive effects in an individual register frame based Swiss election study. FORS Working Paper Series, paper 2013-3. Lausanne: FORS.
Mauss, M. (1966 [1924]). Essai sur le don. Forme et raison de l’échange dans les sociétés archaïques. Translated by Ian Cunnison, The Gift: Forms and Functions of Exchange in Archaic Societies. London: Cohen & West.
Medway, R. (2012). Beyond Response Rates: The Effect of Prepaid Incentives on Measurement Error. PhD Thesis, University of Maryland.
Medway, R. and Tourangeau, R. (2015). Response quality in telephone surveys: do prepaid cash incentives make differences? Public Opinion Quarterly, 79(2), Summer, 524–543.
Pforr, K., Blohm, M., Blom, A.G., Erdel, B., Felderer, B., Fräßdorf, M., Hajek, K., Helmschrott, S., Kleinert, C., Koch, A., Krieger, U., Kroh, M., Martin, S., Saßenroth, D., Schmiedeberg, C., Trüdinger, E.-M., and Rammstedt, B. (2015). Are incentive effects on response rates and nonresponse bias in large-scale, face-to-face surveys generalizable to Germany? Evidence from ten experiments. Public Opinion Quarterly, first published online June 2, 2015. doi:10.1093/poq/nfv014
Ryu, E., Couper, M., and Marans, R. (2005). Survey incentives: cash vs in-kind; face-to-face vs mail; response rate vs nonresponse error. International Journal of Public Opinion Research, 18(1), 89–106.
Singer, E., Van Hoewyk, J., Gebler, N., Raghunathan, T., and McGonagle, K. (1999). The effect of incentives on response rates in interviewer-mediated surveys. Journal of Official Statistics, 15(2), 217–230.
Singer, E., Van Hoewyk, J., and Maher, M.M. (2000). Experiments with incentives in telephone surveys. Public Opinion Quarterly, 64(2), 171–188.
Singer, E. (2002). The use of incentives to reduce nonresponse in household surveys. In Survey Nonresponse, eds Groves, R.M., Dillman, D.A., Eltinge, J.L., and Little, R.J.A. New York: Wiley, pp. 163–178.
Singer, E. and Bossarte, R.M. (2006). Incentives for survey participation: When are they ‘coercive’? American Journal of Preventive Medicine, 31(5), 411–418.
Singer, E. and Ye, C. (2013). The use and effects of incentives in surveys. ANNALS AAPSS, 645, 112–141.
Smith, T.W. (1995). Trends in non-response rates. International Journal of Public Opinion Research, 7(2), 157–171.
Smith, T.W. (2007). Survey non-response procedures in cross-national perspective: the 2005 ISSP Non-Response Survey. Survey Research Methods, 1(1), 45–54.
Stoop, I. (2005). The Hunt for the Last Respondent. The Hague: SCP Press.
Stoop, I., Billiet, J., Koch, A., and Fitzgerald, R. (2010). Improving Survey Response: Lessons Learned from the European Social Survey. Chichester: Wiley.
Warriner, K., Goyder, J., Gjersten, H., and Hohner, P. (1996). Charities, no; lotteries, no; cash, yes. Public Opinion Quarterly, 60, 542–562.
PART VII
Preparing Data for Use
29 Documenting Survey Data Across the Life Cycle
Mary Vardigan, Peter Granda and Lynette Hoelter
INTRODUCTION
Just as expectations around open access to research data have risen in recent years, so too have expectations for robust metadata and documentation that provide a complete picture of a dataset from its inception through data publication and beyond. Researchers increasingly want access to detailed information about what happens in the early phases of survey design and implementation: the sampling process, questionnaire development, fieldwork, post-survey data processing, weighting – basically all elements of the survey data life cycle. It is clear that knowing more about what occurs along the life course of a survey sheds light on the quality of the resulting dataset. Open access to data itself is not sufficient – we need intelligent openness for science to be effective (Royal Society Science Policy Centre 2012). In this new context of rising expectations for data and metadata quality, the documentation of survey data can be seen as an integral aspect
of the total survey error approach to estimating data quality. Groves and Lyberg (2010) proposed several elements of the future research agenda for the total survey error approach. In addition to discussing many potential sources of error emanating from the conduct of a survey itself, they highlighted ‘the role of standards for measuring and reporting sources of error’ (p. 875). The integration of documentation standards into the data life cycle and the new emphasis on research transparency enable users to assess data quality better than ever before and actually make comprehensive documentation an essential component of the total survey error approach. How do we ensure that data producers provide high-quality documentation to facilitate accurate and effective data analysis? In this chapter we provide a set of recommendations to guide data producers in developing documentation that is not only rich in information but optimally usable and able to be repurposed for new types of data-driven exploration. We supplement these recommendations with examples
of best practices and tools that can facilitate the documentation process. We close with a look at future developments, including innovative uses of documentation not likely to have been envisioned by the original data producers.
Figure 29.1 Research data life cycle.
RECOMMENDATION: INCLUDE AS MUCH CONTENT AS POSSIBLE TO ENSURE THAT RESEARCHERS CAN USE THE DATA WITH CONFIDENCE NOW AND INTO THE FUTURE
For data to be successfully used in secondary analysis, robust metadata are necessary. When principal investigators are not available for consultation, documentation is often the only communication possible between the data creator and the data user, and documentation needs to bridge the gap between these two parties. This means that documentation must be produced with care and not hastily assembled at the end of a project. Tacit knowledge, including knowledge about the lineage and provenance of all variables, must be captured in order to assure independent understandability of the data for future users.
Data Life Cycle
In thinking about producing comprehensive documentation, data producers should consider the full life cycle of research data (see Figure 29.1) from the conceptualization stage through data collection, processing, archiving, dissemination, and possible repurposing. Metadata should ideally accumulate along this ‘data pipeline’ with little or no metadata loss during data and metadata handoffs. Standards and tools discussed later in this chapter can assist in this documentation task. Of course, each of these phases may involve many steps and processes themselves, adding to the challenge of documenting data thoroughly. Cross-cultural and cross-national surveys in particular can be difficult to document because of their complexity. The Survey Life Cycle Diagram (Figure 29.2) from the Cross-Cultural Survey Guidelines (Survey Research Center 2011) includes processes like translation and data harmonization, highlighting the challenges of capturing high-quality documentation for comparative research. The Cross-Cultural Survey Guidelines lay out best practices for each phase of the process and encourage the use of standards to document the many dimensions of these complex surveys.
Figure 29.2 The survey life cycle for cross-cultural surveys.
Example: Rich Metadata
The European Social Survey (ESS), available at: http://www.europeansocialsurvey.org/data/, is a good example of a long-running study with comprehensive, rich documentation.1 Through seven rounds of data collection since 2002, the ESS has regularly provided users with voluminous documentation, not only describing the ESS public-use data files but also documenting the entire data life cycle including sampling, translation, questionnaire development, pretesting and piloting, efforts to improve question quality, response and non-response issues, experiments with mixed-mode data collection, and measuring the importance of national context through the entire process (Matsuo and Loosveldt 2013). This cross-national project has set a gold standard for reporting different aspects of total survey error and documenting data quality.
RECOMMENDATION: USE STRUCTURED MACHINE-ACTIONABLE METADATA STANDARDS AND STANDARDS-BASED TOOLS
Data Documentation Initiative (DDI)2
Over the past two decades, the DDI has emerged as a de facto standard for documenting data in the social and behavioral sciences across the full life cycle. DDI actually has two specifications. The first, DDI Codebook, enables the generation of thorough documentation that reflects the content of a traditional social science codebook. The DDI Alliance also makes available a more ambitious specification, DDI Lifecycle, which describes metadata as they are
created and used through the data production process. DDI is currently rendered in eXtensible Markup Language (XML), the lingua franca of the Web, making it machine-actionable.
Metadata-Driven Survey Design
Having standardized, structured, machine-actionable metadata means that programs can be written against the metadata, producing efficiencies and cost savings. In fact, using an approach called Metadata-Driven Survey Design (Iverson 2009), it is possible to use metadata to drive survey production. With metadata-driven survey design, the actions taken to define a survey are the same actions that create the survey instrument. Using specialized software instead of a text processor, a researcher can develop a questionnaire and at the same time prepare it for fielding with computer-assisted interviewing (CAI). By using a metadata-driven approach to survey design, the documentation work is done at a project’s outset when it is likely to be most accurate. Taken to its full potential, metadata-driven survey design can facilitate the implementation of subsequent waves of data collection by feeding the metadata produced out of the CAI system after the first wave back into the system to serve as a foundation for the second wave.
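To make the idea of machine-actionable metadata a little more concrete, the sketch below shows, in Python, how a small structured description of two questionnaire items could drive both codebook text and a rudimentary routing check. It is only an illustration under simplifying assumptions: the element names (var, question, category, universe) are invented for the example and are not the official DDI tag set, and real CAI systems are of course far richer.

# A minimal sketch, assuming a simplified, DDI-like XML description of two items.
# The element and attribute names are illustrative, not the official DDI tag set.
import xml.etree.ElementTree as ET

DDI_LIKE = """
<codeBook>
  <var name="EMPSTAT" label="Current employment status">
    <question>Last week, were you working for pay?</question>
    <category value="1" label="Yes"/>
    <category value="2" label="No"/>
  </var>
  <var name="HOURS" label="Usual hours worked per week">
    <question>How many hours do you usually work per week?</question>
    <universe>EMPSTAT == 1</universe>
  </var>
</codeBook>
"""

def print_codebook(xml_text):
    """Generate simple codebook entries from the structured metadata."""
    root = ET.fromstring(xml_text)
    for var in root.findall("var"):
        print(f"{var.get('name')}: {var.get('label')}")
        print(f"  Question: {var.findtext('question')}")
        universe = var.findtext("universe")
        if universe:
            print(f"  Universe: respondents where {universe}")
        for cat in var.findall("category"):
            print(f"  {cat.get('value')} = {cat.get('label')}")

def route(xml_text, answers):
    """Return the items a respondent should be asked, given the universe rules."""
    root = ET.fromstring(xml_text)
    asked = []
    for var in root.findall("var"):
        universe = var.findtext("universe")
        # eval() is acceptable here only because the expressions are our own metadata.
        if universe is None or eval(universe, {}, answers):
            asked.append(var.get("name"))
    return asked

print_codebook(DDI_LIKE)
print(route(DDI_LIKE, {"EMPSTAT": 2}))   # -> ['EMPSTAT']  (HOURS is skipped)

The point is simply that once the instrument is described as structured metadata, the same description can be reused to generate documentation and to check routing, rather than maintaining the two by hand.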
Standards-based Documentation Tools
Tools are key to a metadata standard’s success: if markup cannot be produced efficiently, a standard may not find an audience. Several tools have emerged recently that make DDI easier to use.
DDI Codebook Tools
• Nesstar – http://www.nesstar.com/
Norwegian Social Science Data Services (NSD) distributes a DDI markup and data analysis tool called Nesstar, created for use with DDI
Codebook. Nesstar Publisher is also part of a suite of tools designed to assist data producers in emerging countries in documenting and disseminating data (see http://www.ihsn.org/home/software/ddi-metadata-editor).
• Dataverse Network – http://thedata.org/
Developed by Harvard-MIT, this tool uses DDI as the standard for the study-level metadata entered at deposit into Dataverse and as a foundation for variable-level analysis.
• Survey Documentation and Analysis (SDA) – http://sda.berkeley.edu/
Support for DDI Codebook is also incorporated into the SDA online analysis system created at the University of California, Berkeley.
DDI Lifecycle Tools
• Michigan Questionnaire Documentation System (MQDS) – http://inventions.umich.edu/technologies/4975_michigan-questionnairedocumentation-system
This tool, created by the Survey Research Operations group at the University of Michigan, exports DDI Lifecycle from Blaise Computer-Assisted Interviewing software.
• StatTransfer – https://www.stattransfer.com/
StatTransfer is a widely used commercial product that transfers data across software packages and now also exports metadata in DDI Lifecycle format.
• Colectica – http://www.colectica.com/
Colectica has developed a suite of tools that includes software to produce, view, and edit DDI Lifecycle metadata with an interface to CAI tools. Colectica for Excel (http://www.colectica.com/software/colecticaforexcel) is a free tool to document spreadsheet data in DDI, permitting researchers working in Excel to create study descriptions and document their data at the variable level.
• Sledgehammer – http://www.mtna.us/#/products/sledgehammer1
Developed by Metadata Technology, this tool facilitates the transformation of data across formats and enables the extraction and generation of DDI metadata.
• DdiEditor – https://code.google.com/archive/p/ddieditor/
Developed at the Danish Data Archive (DDA), this tool produces documentation in DDI Lifecycle format, with an emphasis on documenting data processing for curation purposes.
RECOMMENDATION: CAPTURE METADATA AT THE SOURCE
As mentioned earlier, metadata ‘leakage’ can occur across the life cycle of research data at critical junctures along the path to publication of a dataset. For example, archives find that links between variables and question text and the detailed universe information that indicates routing patterns through questionnaire items are often not documented fully in deposited documentation, even when the data were produced through computer-assisted interviewing. When metadata are captured early on, rather than assembled at the end of a project, it is possible to realize the greatest benefits and efficiencies. This is the premise of the MQDS tool mentioned above. MQDS draws upon Blaise metadata to produce codebooks and instrument documentation that mimics what the interviewer saw during administration of the survey.
Examples: Metadata Capture in National Statistical Institutes (NSIs)
Capturing metadata at the source is also a goal of several NSIs from around the world. In an effort to standardize and modernize the production of official statistics, the NSIs are collaborating on standards-based models and tools. After establishing a High-Level Group (HLG) for the Modernization of Statistical Production and Services, the NSIs developed the Generic Statistical Business Process Model (GSBPM),3 which describes the steps in the production of official statistics data undertaken by NSIs (see Figure 29.3). To
complement this high-level model, the HLG then spearheaded an initiative to provide a more detailed conceptual model – Generic Statistical Information Model (GSIM).4 Together these initiatives have the potential to greatly enrich the documentation and quality of official statistics as information is captured and harvested at the source across the life cycle.
Examples: Metadata Capture in Key Data Series
Another good example of a project encouraging the capture of metadata at the source is the Metadata Portal Project, funded by the National Science Foundation in the US with partners at the National Opinion Research Center at the University of Chicago, the National Election Studies at the University of Michigan, and the Inter-university Consortium for Political and Social Research (ICPSR), with technical support from Metadata Technology North America. In this project the workflows of the American National Election Study (ANES) and the General Social Survey (GSS) were analyzed using the GSBPM model to determine how more structured metadata might be generated further ‘upstream’ and carried through the system without any metadata loss. This project has also done something innovative with variable-level universe statements. These statements are captured when a survey is administered, but often they are exported in a form that is difficult to interpret, and they are generally not leveraged later on to recreate the routing through the instrument in an automated way. For this project, a tool was built that identifies the system missing code on each variable (flagged in DDI) and then connects that variable to all of the other variables to find matches between the system missing code and other codes on the searched variables. Then an XSLT stylesheet acts on the data to produce a visualization of the paths through
Figure 29.3 Generic Statistical Business Process Model (GSBPM).
the instrument (see Figure 29.4). This is an example of the potential inherent in structured metadata especially if it is captured at the source and not added after the fact.
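The matching logic described above can be sketched in a few lines of hypothetical Python/pandas code; this is not the project’s actual tool, and the variable names and data are invented, but it shows the underlying idea: a variable whose system-missing cases are perfectly explained by particular codes of another variable carries the footprint of a skip instruction.

# A hedged sketch of the matching logic described above, not the project's actual tool.
# It looks for variables whose codes exactly predict the system-missing cases of another
# variable -- the typical footprint of a questionnaire skip pattern.
import pandas as pd

def infer_skips(df):
    """Return (skipped_var, routing_var, codes) triples suggested by the data."""
    findings = []
    for target in df.columns:
        missing = df[target].isna()
        if not missing.any() or missing.all():
            continue                       # no skip pattern to explain
        for source in df.columns:
            if source == target:
                continue
            codes_when_missing = set(df.loc[missing, source].dropna().unique())
            codes_when_present = set(df.loc[~missing, source].dropna().unique())
            # A clean skip: some codes of `source` occur only among the missing cases
            # and together account for all of them.
            skip_codes = codes_when_missing - codes_when_present
            if skip_codes and df.loc[missing, source].isin(skip_codes).all():
                findings.append((target, source, sorted(skip_codes)))
    return findings

# Toy data: HOURS is asked only of respondents with EMPSTAT == 1.
toy = pd.DataFrame({
    "EMPSTAT": [1, 1, 2, 2, 1],
    "HOURS":   [40, 35, None, None, 20],
})
print(infer_skips(toy))   # e.g. [('HOURS', 'EMPSTAT', [2])]

On real files the matching has to cope with item nonresponse and multi-variable routing conditions, which is why universe information flagged in structured metadata is so valuable as a starting point.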
RECOMMENDATION: RE-USE EXISTING METADATA AND DATA
As noted, high-quality documentation is critical in making data accessible and usable by secondary researchers. Knowing the context of the data collection, sample, and results allows others to evaluate whether the data might be appropriate for their own projects and how confident they can be about the conclusions drawn from it. Moreover, combining datasets
in new and interesting ways – such as adding contextual information based on geography or harmonizing data – is only possible when the underlying data are fully documented. The current funding environment is challenging in that it is increasingly difficult for researchers to obtain research funds to conduct large-scale surveys, to design their own survey questions, and to collect their own data. At the same time funding agencies are placing a new emphasis on sharing data that have been collected with federal funds. For all these reasons, it is clear that re-use and repurposing of existing data will play an ever greater role in moving social science forward. Funding agencies have recently provided increased support for data harmonization, which involves merging data from several
Figure 29.4 Visualizing the path through an instrument based on metadata about skip patterns.
datasets based on common variables or concepts. The goal of such work is to provide analysts with new data resources to study social and economic issues from different perspectives. Combining variables from many similar surveys into a new integrated data file makes it easier to examine a research question over a longer period of time and/or with a much larger number of respondents.
Example: Re-using Data and Metadata
One instructive example involves the harmonization of 10 cross-sectional surveys of US family and fertility behaviors spanning the 1955–2002 period, including the 1955 and 1960 Growth of American Families (GAF)5; the 1965 and 1970 National Fertility Surveys (NFS)6; and the 1973, 1976, 1982, 1988, 1995, and 2002 National Surveys of Family Growth (Cycles 1–6 of the NSFG).7 The resulting harmonization product, the Integrated Fertility Survey Series (IFSS), provides vital benchmarks for documenting and understanding transformations in fertility and the family and permits social scientists to study changes in these basic components of social life over a five-decade period. Such projects require substantial amounts of work in devising harmonization rules and creating the programming to execute them. But they can only succeed if there is a foundation of good documentation on which to build. The IFSS project used such documentation to report to users the individual survey sources and specific variables used to produce a single harmonized variable, the programming code developed for this task, and comparability notes that summarize the results (see Figure 29.5).
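To give a flavor of what a harmonization rule can look like in practice, the following sketch uses hypothetical Python/pandas code with invented variable names, codes, and labels; it does not reproduce the actual IFSS programming. Each source survey’s codes are mapped onto one harmonized variable, and the mapping itself is kept so that it can be reported in the documentation.

# A hypothetical sketch of a harmonization rule; the variable names, codes, and
# category labels are invented for illustration and do not reproduce the IFSS files.
import pandas as pd

# Mapping from each source survey's codes to the harmonized categories.
HARMONIZATION_RULES = {
    "survey_1970": {"varname": "MARSTAT", "recode": {1: 1, 2: 1, 3: 2, 4: 3}},
    "survey_2002": {"varname": "MARRIED", "recode": {1: 1, 2: 2, 3: 2, 4: 3, 5: 3}},
}
HARMONIZED_LABELS = {1: "Currently married", 2: "Previously married", 3: "Never married"}

def harmonize(source_name, df):
    """Add a harmonized MARSTAT_H column and keep a note of its provenance."""
    rule = HARMONIZATION_RULES[source_name]
    out = df.copy()
    out["MARSTAT_H"] = out[rule["varname"]].map(rule["recode"])
    out.attrs["MARSTAT_H_source"] = f"{source_name}:{rule['varname']} via {rule['recode']}"
    return out

s1970 = pd.DataFrame({"MARSTAT": [1, 3, 4]})
s2002 = pd.DataFrame({"MARRIED": [2, 5, 1]})
combined = pd.concat([harmonize("survey_1970", s1970),
                      harmonize("survey_2002", s2002)], ignore_index=True)
print(combined["MARSTAT_H"].map(HARMONIZED_LABELS).value_counts())

Recording the recode dictionaries alongside the harmonized file is what allows comparability notes like those in Figure 29.5 to be generated systematically rather than written from memory.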
Example: Re-using Metadata
Tools, working in conjunction with good metadata, can make the harmonization process go much more smoothly. One such tool, produced
by ICPSR and based on variable-level DDI markup, queries a large database of variables using keyword and phrase searching parameters. It produces results that list individual variables from all datasets in the database or from those grouped around common themes or series of data. Once the user finds a set of variables she wants to investigate in more detail, the tool enables her to see up to five variables on a single page, which makes for quick examination to assess how variables from related surveys may be compared and possibly harmonized. Figure 29.6 shows an example of variables reporting on happiness.
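A stripped-down version of such a variable-level query can be sketched as follows; the table of variables below is invented for illustration, whereas the real tool works against ICPSR’s full DDI-based database.

# A stripped-down sketch of keyword search over variable-level metadata.
# The table of variables is invented; the real tool queries ICPSR's DDI database.
import pandas as pd

variables = pd.DataFrame([
    {"study": "Survey A", "name": "HAPPY",   "label": "General happiness",
     "question": "Taken all together, how happy would you say you are these days?"},
    {"study": "Survey B", "name": "SATLIFE", "label": "Life satisfaction",
     "question": "All things considered, how satisfied are you with your life?"},
    {"study": "Survey B", "name": "HAPMAR",  "label": "Happiness of marriage",
     "question": "How happy would you say your marriage is?"},
])

def search_variables(keyword):
    """Return variables whose label or question text mentions the keyword."""
    mask = (variables["label"].str.contains(keyword, case=False) |
            variables["question"].str.contains(keyword, case=False))
    return variables.loc[mask, ["study", "name", "label"]]

print(search_variables("happy"))   # matches HAPPY and HAPMAR, across two studies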
RECOMMENDATION: THINK BEYOND THE CODEBOOK AND USE WEB TECHNOLOGY TO BEST ADVANTAGE
Over the long course of survey research, now spanning well over a half-century, documentation has evolved alongside survey methodology and data collection methods. In terms of substantive content, early documentation was often quite rich and carefully produced. For example, the documentation for the Census of Population and Housing, 1960 Public Use Sample: One-in-One Thousand Sample (US Department of Commerce 1999 [1973]; see the Table of Contents in Figure 29.7) contains not only important technical information about the structure of the data file and instructions to users about accessing the data from the reel of magnetic tape on which it was stored but also considerable information on sampling design and variability including an early description of ‘total survey error’. Page 61 of the codebook (see References for the full citation) states:
Sampling error is only one of the components of the total error of a survey. Further contributions may come, for example, from biases in sample selection, from errors introduced by imputations for nonreporting, and from errors introduced in the coding and other processing of the questionnaires. For estimates of totals representing relatively small proportions of the population, the major component of the total survey error tends to be the sampling error.
Figure 29.5 Rich variable-level metadata in the IFSS harmonized file.
Figure 29.6 Variable comparison tool based on DDI metadata. Source: ICPSR Social Science Variables Database
Figure 29.7 Table of Contents from 1960 US Census Codebook. Source: US Department of Commerce (1999 [1973])
Emerging From the Paper Era
This set of technical documentation for the Census, still popularly referred to as a ‘codebook’, was, of course, created as a paper document, painstakingly typed by hand and assembled manually. The world of paper documentation presented a compartmentalized and
static relationship between data and documentation for the end user. Even when codebooks were ‘born digital’, this relationship did not dramatically change. The user could search voluminous codebooks more easily, but the codebook and the data remained as two separate entities, the former used as a reference when analyzing statistical output from the latter.
Data producers now have many more tools at their disposal to document surveys all through the planning, collecting, and analysis stages of projects. Computer-assisted interviewing software captures every element of the data collection instrument and stores this information in databases. This collection of metadata can readily produce significant portions of the documentation required when data files are prepared for dissemination to other researchers.
Example: Interactive Documentation
It is important to take advantage of the potential for metadata re-use and interactivity offered by new technologies and not be constrained by the paper codebook model. A good example is the Collaborative
Psychiatric Epidemiology Surveys (CPES) (see Figure 29.8), a set of three nationally representative surveys: the National Comorbidity Survey Replication (NCS-R), the National Survey of American Life (NSAL), and the National Latino and Asian American Study (NLAAS). These surveys were harmonized in recognition of the need for contemporary, comprehensive epidemiological data regarding the distributions, correlates, and risk factors of mental disorders among the general population with special emphasis on minority groups. This example showcases the potential to integrate documentation and data through Web interfaces that not only provide access to both seamlessly but also give users additional ways to interact with the surveys. We note that the concept of a static codebook is replaced by a dynamic set of diverse documentation elements all linked to this Web interface. Users
Figure 29.8 Interactive codebook for the Collaborative Psychiatric Epidemiology Surveys (CPES).
can browse an interactive codebook by sections of the questionnaires, search each of the three component surveys individually, or examine the contents of all three by subject matter. There is a link to an extensive user’s guide with additional links providing information on downloading files, notes about data processing, online analysis, and training resources as well as direct links to the websites of each of the three original surveys, which contain all of the original data collection instruments. The site features a series of enhancements not normally available with documentation of survey datasets. Variables in all three of the original surveys contain universe statements defining which respondents were eligible to answer the question. This information was directly extracted from the computer-assisted personal interviewing software. One of the component surveys, the National Latino and Asian American Study (NLAAS), included questionnaires in five different languages during the data collection process. Through the interface users can access the original text for each language through tabs near the top of the display page (see Figure 29.9). Users also have direct links to an online analysis system that will provide additional statistical information about the variable of interest. The central principle behind each of these enhancements is to provide users with multiple options for finding the information they want easily so that they can choose relevant variables and proceed with their analyses as quickly as possible. As a result of funding constraints, the CPES website is no longer supported as originally designed, but much of its functionality is preserved in a more sustainable format (see https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/20240?searchSource=findanalyze-home&sortBy=&q=cpes).
FUTURE DEVELOPMENTS
As shown in this chapter, documentation for social science research data has evolved to
meet new user expectations, to address changes in the research data landscape, and to leverage technological advances. We envision that this trend will continue and in this section we look ahead to metadata-related developments on the horizon.
Innovative Uses of Documentation
The repurposing of documentation can lead to interesting new resources and tools for researchers. Indeed, several social science data archives have built tools that allow users to explore data in new ways, leveraging variable-level documentation and accompanying metadata. ICPSR, for example, uses rich metadata as the basis for its Social Science Variables Database (SSVD), the driver behind the Search and Compare Variables functionality on the website (as illustrated in Figure 29.6). The SSVD becomes another way for users to search the collection for data appropriate to their research endeavors. In addition to beginning a search with high-level concepts in mind and drilling down once a dataset has been selected, researchers can now begin with specific variables or survey questions of interest, find the studies in which those appear, and discover how they are asked in each study – all before a traditional codebook is even opened. Users can even search on two (or more) different variables at the same time – the resulting display shows the list of questions/variables matching the search terms in each study so that a quick glance is all that is needed to see which study shows the most promise for operationalizing both concepts of interest. Figure 29.10 shows partial results from a search on ‘marital status’ and ‘drug abuse’. From here, the user can compare variables across studies, view codebook information for specific variables, and explore a dataset further by clicking through to the study’s homepage. Beyond functioning as an enhanced search tool, the SSVD makes the underlying data
Figure 29.9 Sample variable from the NLAAS, which is part of the harmonized CPES.
appealing to new audiences who wish to use the data in different ways. A researcher creating his own survey might view the SSVD as a ‘question bank’, searching for questions others have used when collecting data about the same topics. Not only will he be presented with question wording, and therefore not have to reinvent the wheel, but the detailed documentation will also allow him to examine the effects of differences in wording or answer categories, how questions may be answered differently by different samples of respondents, or even whether different modes of survey administration appear to affect responses.
Similarly, an instructor who is teaching a research methods course for undergraduates might use the SSVD to spice up what otherwise tends to be a dull lecture on conceptualization and operationalization. She could, in the moment, ask students for concepts of interest and they can explore together the different ways in which those concepts have been operationalized. In a lecture about survey design, that same instructor might find and compare response distributions for attitude questions with different numbers of answer options (e.g., a 5-point Likert scale compared to a 7-point scale or the effect of
Figure 29.10 Variable discovery using the ICPSR Social Science Variables Database.
having an odd number of response options, and thus a ‘neutral’ midpoint, versus an even number). Without good documentation, down to the variable level, it is unlikely that real data could be used as the foundation for either of these lessons.
Documentation and Research Transparency
Discussions of documentation across the data life cycle often consider data sharing – either formally through deposit with a trusted archive or informally between colleagues – and subsequent re-use as the goals toward which metadata creation is aimed. Recently, however, a movement within the social sciences has pushed the goal of good documentation through to the ability to replicate published results. That is, journals and professional associations have begun to require authors to submit the data and/or statistical
code used in producing specific publications, so that others may reproduce the results and further assess their validity. Standardized documentation included with the original data is now just a starting point. In addition to the metadata provided to aid in finding and evaluating the data, careful documentation about any manipulations of the data during analysis must be maintained. A variable that is recoded, for example, should be clearly noted so that a user can identify the original source variable, understand the ways in which the categories were collapsed or reordered, and view a frequency distribution for the new variable to compare with the original. Likewise, decisions about retaining or dropping cases, dealing with missing or incomplete data, and options chosen for given statistical tests should be apparent to those attempting the replication. While most researchers would agree that this is good practice even if no one else were to see the resulting files, it is easy to become lax and
assume that one will always remember why something is coded the way it is or who was excluded from the analyses. Efforts that teach undergraduate students to carefully document each step in their research projects from day one, such as Project TIER (Teaching Integrity in Empirical Research) led by Richard Ball and Norm Medeiros (2012) at Haverford College, offer promise that the culture of documenting every phase in the survey process, from data collection to published analysis, may still take hold. Another good example is J. Scott Long’s work on designing and implementing efficient workflows for data analysis and data management in Stata, which provides guidance to ensure replication of research results (Long 2009).
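A minimal sketch of this kind of documented recode, in hypothetical Python/pandas code (Long (2009), cited above, describes the Stata equivalent of such workflows), might look like the following: the recode rule is written out once, the source variable is named, and the before and after distributions are printed so they can be kept with the analysis files.

# A minimal sketch of documenting a recode for replication purposes.
# Variable names and categories are invented for illustration.
import pandas as pd

df = pd.DataFrame({"MARITAL": [1, 2, 3, 4, 5, 1, 3, 9]})  # 9 = refused

# Recode rule, written out once so a reader can see exactly how categories collapse.
RECODE_MARITAL3 = {
    1: 1,              # married                    -> currently married
    2: 2, 3: 2, 4: 2,  # widowed/divorced/separated -> previously married
    5: 3,              # never married              -> never married
    9: None,           # refused                    -> missing
}
df["MARITAL3"] = df["MARITAL"].map(RECODE_MARITAL3)

# Keep the before/after distributions with the analysis files.
print("Source variable: MARITAL")
print(df["MARITAL"].value_counts(dropna=False).sort_index())
print("Derived variable: MARITAL3 (see RECODE_MARITAL3 above)")
print(df["MARITAL3"].value_counts(dropna=False).sort_index())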
Quality Indicators
As noted, improving survey quality continues to be a central focus for the field, and good documentation is essential to assessing quality. There is a recent push for US government surveys to provide evaluation reports as ‘evidence’ that questions were tested and are indeed capturing what surveys purport to measure. A Federal Committee on Statistical Methodology (FCSM) workgroup recently submitted recommendations regarding question evaluations to the Office of Management and Budget (OMB), and documentation was an important component of the recommendations. Question evaluation studies can improve question design and the validity of survey data, and also support data users by providing information to help them interpret data used in their research. A collaborative resource for evaluating questions – Q-Bank8 – has been set up by an inter-agency group to collect question evaluation studies from federal statistical agencies and research organizations. Q-Bank allows one to search for tested survey questions and access those evaluation studies. An important theme in the Q-Bank effort is that question evaluation studies should be in a central
repository so that they can be easily accessed by all to ensure that knowledge can be shared and built upon. Similar efforts are under way in Europe, where a tool called the Survey Quality Predictor9 is available.
Broader Audience for Data and Metadata
Data-enabled inquiry is making its way into all aspects of the human endeavor, and looking ahead, we see data becoming useful for an increasingly broad and diverse audience, one that is not necessarily trained in data use. We already see rudimentary data analysis being taught in elementary school classrooms, popular media incorporating statistics and charts into their stories, and cell phone apps delivering data of various kinds at a record pace. When dealing with novice audiences, metadata take on greater importance to ensure that data are well understood. For example, faculty teaching undergraduates often wish to (or are required to with recent quantitative literacy initiatives) incorporate data into their teaching to expose students to the kinds of critical evaluation required by today’s data-rich society. The barriers to doing so are large given the demands on time, software needed, and abundance of data from which to choose. Here, documentation is key. Data that are not well documented become too time-consuming for faculty to transform into something useful with novice users. In creating teaching exercises based on data, ICPSR staff have found that documentation provided by depositors and/or resulting from the typical processing methods are not necessarily sufficient to make the data usable in teaching. In a few instances, further information was requested from the principal investigator so that the data could be understood well enough by those creating the resources to ensure they were accurately represented in the resulting exercise. On the other hand, strong documentation and
variable-level metadata have made it possible to incorporate data from a wide variety of surveys in teaching which is a big step from the time when students might be exposed only to the General Social Survey or the American National Election Study.
Documentation for ‘Big Data’ and Other Types of Data
Social scientists are increasingly turning to new data sources to supplement survey data. Many of these new sources fall under the heading of ‘big data’ – for example, social media data, climate data, video data, and so forth. Wikipedia (2016) defines big data as ‘data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying and information privacy’. We respectfully add the challenge of documentation to this list. While the term ‘big data’ is increasingly deprecated because it has myriad meanings and connotations, we note that these new types of data characterized by volume, velocity, and variability will have a huge impact on research. When data on, for example, cash register receipts are collected, documentation about the provenance of the data is critical to interpreting the data and analyzing them effectively. Where and when were the data captured, from whom and by whom, and was any validation of their accuracy performed? We could go on to list many other questions whose answers are key to using the data in a way that is scientifically sound. Even traditional survey data collection has become more complex in ways that demand equally rich metadata. More and more, survey efforts involve interdisciplinary teams of scientists such that biological specimens, video observations, and diary data are being appended to survey responses. In many cases, these data are stored in files (or locations, for
physical specimens) apart from those corresponding responses. Accurate documentation is required if the data are to be used to their fullest advantage now and in the future. As we move away from traditional survey data to these new ways of conducting science, our community will need to create data quality standards and benchmarks to ensure independent understandability.
NOTES
1 European Social Survey (2016). ESS Methodological Overview. Retrieved from http://www.europeansocialsurvey.org/methodology/ [accessed 2016-02-29].
2 Data Documentation Initiative (DDI). Retrieved from http://www.ddialliance.org/ [accessed 2016-02-29].
3 Generic Statistical Business Process Model (GSBPM) (2013). Retrieved from http://www1.unece.org/stat/platform/display/GSBPM/Generic+Statistical+Business+Process+Model [accessed 2016-02-29].
4 Generic Statistical Information Model (GSIM) (2013). Retrieved from http://www1.unece.org/stat/platform/display/metis/Generic+Statistical+Information+Model [accessed 2016-02-29].
5 Growth of American Families Series. Retrieved from http://www.icpsr.umich.edu/icpsrweb/ICPSR/series/221 [accessed 2016-02-29].
6 National Fertility Survey Series. Retrieved from http://www.icpsr.umich.edu/icpsrweb/ICPSR/series/220 [accessed 2016-02-29].
7 National Survey of Family Growth Series. Retrieved from http://www.icpsr.umich.edu/icpsrweb/ICPSR/series/48 [accessed 2016-02-29].
8 Q-Bank: Improving Surveys Through Sharing Knowledge. Retrieved from http://wwwn.cdc.gov/QBANK/Home.aspx [accessed 2016-02-29].
9 Survey Quality Prediction 2.0 (2012). Retrieved from www.upf.edu/survey/actualitat/19.html# [accessed 2016-02-29].
RECOMMENDED READINGS
Mohler, P., Pennell, B., and Hubbard, F. (2008). Survey Documentation: Toward Professional Knowledge Management in Sample Surveys, in Edith D. de Leeuw, Joop J. Hox, and Don
A. Dillman (eds) International Handbook of Survey Methodology. Lawrence Erlbaum Associates, pp. 403–420.
Niu, J., and Hedstrom, M. (2008). Documentation Evaluation Model for Social Science Data. Proceedings of the American Society of Information Science and Technology, 45: 11. doi: 10.1002/meet.2008.1450450223.
Vardigan, M., and Granda, P. (2010). Chapter 23: Archiving, Documentation, and Dissemination, in Handbook of Survey Research, 2nd revised edition, Peter Marsden and James Wright (eds). Emerald Group Publishing Limited.
Vardigan, M., Granda, P., Hansen, S. E., Ionescu, S., and LeClere, F. (2009). DDI Across the Life Cycle: One Data Model, Many Products. Invited paper presented at the 57th Session of the International Statistical Institute (ISI), Durban, South Africa.
Vardigan, M., Heus, P., and Thomas, W. (2008). Data Documentation Initiative: Toward a Standard for the Social Sciences. International Journal of Digital Curation, 3 (1): 107–113. doi:10.2218/ijdc.v3i1.45.
REFERENCES
Alegria, Margarita, James S. Jackson, Ronald C. Kessler, and David Takeuchi. Collaborative Psychiatric Epidemiology Surveys (CPES), 2001–2003 [United States]. ICPSR20240-v8. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2015-12-09. http://doi.org/10.3886/ICPSR20240.v8.
Ball, R., and Medeiros, N. (2012). Teaching Integrity in Empirical Research: A Protocol for Documenting Data Management and Analysis. The Journal of Economic Education, 43 (2): 182–189.
Groves, R., and Lyberg, L. (2010). Total Survey Error: Past, Present, and Future. Public Opinion Quarterly, 74 (5): 849–79.
ICPSR Social Science Variables Database. Retrieved from http://www.icpsr.umich.edu/icpsrweb/ICPSR/ssvd/index.jsp [accessed 2016-02-29].
459
Integrated Fertility Survey Series. Retrieved from http://www.icpsr.umich.edu/icpsrweb/ IFSS/harmonization/variables/263440001 [accessed 2016-02-29]. Iverson, J. (2009). Metadata-Driven Survey Design. IASSIST Quarterly, 33 (Spring/ Summer), p: 7. Retrieved from http://www. iassistdata.org/downloads/iqvol3312iverson. pdf [accessed 2016-02-29]. Long, J. S. (2009). The Workflow of Data Analysis Using Stata. ISBN-13: 978-1597180474. StataCorp L. P.College Station, TX: Stata Publishing. Matsuo, H., and Loosveldt, G. (2013). Report on Quality Assessment of Contact Data Files in Round 5: Final Report 27 Countries. Working Paper Centre for Sociological Research (CeSO) Survey Methodology CeSO/ SM/2013-3. Retrieved from http://www. europeansocialsurvey.org/docs/round5/ methods/ESS5_response_based_quality_ assessment_e01.pdf [accessed 2016-02-29]. The Royal Society Science Policy Centre (2012). Science as an Open Enterprise – Final Report. The Royal Society, London, England. Retrieved from https://royalsociety.org/∼/ media/policy/projects/sape/2012-06-20saoe.pdf [accessed 2016-02-29]. Survey Research Center (2011). Guidelines for Best Practice in Cross-Cultural Surveys, 3rd ed. ISBN 978-0-9828418-1-5. Survey Research Center, Institute for Social Research, University of Michigan. Retrieved from http:// ccsg.isr.umich.edu/pdf/FullGuidelines1301. pdf [accessed 2016-02-29]. US Department of Commerce, Bureau of the Census (1999). Census of Population and Housing, 1960 Public Use Sample: One-in-One Thousand Sample. ICPSR version. Washington, DC: US Department of Commerce, Bureau of the Census [producer], 1973. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor]. http://doi. org/10.3886/ICPSR00054.v1. Wikipedia (2016). Definition of Big Data. Retrieved from http://en.wikipedia.org/wiki/ Big_data [accessed 2016-02-29].
30 Weighting: Principles and Practicalities

Pierre Lavallée and Jean-François Beaumont
INTRODUCTION

Weighting is one of the major components in survey sampling. For a given sample survey, each unit of the selected sample is assigned a weight (also called an estimation weight) that is used to obtain estimates of population parameters of interest, such as the average income of a certain population. In some cases, the weight of a given unit may be interpreted as the number of units from the population that are represented by this sample unit. For example, if a random sample of 25 individuals has been selected from a population of 100, then each of the 25 sampled individuals may be viewed as representing four individuals of the population. The weighting process usually involves three steps. The first step is to obtain design weights (also called sampling weights), which are the weights that account for sample selection. For some sampling methods, the sum of the design weights corresponds to the population size. Coming back to the
previous example, the sum of the design weights (that are all equal to 4) over the 25 sampled individuals gives 100, the total size of the population. For any survey, nonresponse is almost inevitable. Because nonresponse reduces the effective size of the sample, it is necessary to adjust the weights (in fact, increasing them) to compensate for this loss. The adjustment of the design weights for the nonresponse is the second step of the weighting process. The third step of weighting is to adjust the weights to some known totals of the population. For example, the number of men and women of our population of 100 individuals might be known. Because of the way the sample has been selected, it is not guaranteed that the estimated number of men obtained by summing the weights of the sampled men will be equal to the true number of men in the population. The same applies to the estimated number of women. It might then be of interest to adjust the design weights (or the design weights adjusted for nonresponse) to make
the estimates agree to the known population totals. This process is called calibration. The special case of calibration that consists of adjusting the weights to population counts (as in the above example) is referred to as post-stratification. In this chapter, some context is first given about weighting in sample surveys. The above three weighting steps are formally described in a section on standard weighting methods. A section follows that describes indirect sampling and the generalized weight share method, which are used when the sampling frame from which the sample is selected does not correspond to the target population. Sometimes, the sample is selected from more than one sampling frame. The next section is devoted to proper weighting methods that account for multiple frames. Finally, the last section provides a brief discussion of some weighting issues that occur in practice.
CONTEXT

In survey sampling, it is often of interest to estimate descriptive parameters of a finite population $U$ of size $N$. A common type of population parameter is the population total $Y = \sum_{k \in U} y_k$, where $y_k$ is the value of a variable of interest $y$ for the population unit $k$. For a survey on tobacco use, for example, the variable of interest $y_k$ could be the number of cigarettes smoked by individual $k$ during a given day. The total $Y$ represents the total number of cigarettes smoked during the day in the population $U$. An important special case of a population total is the domain total $Y_d = \sum_{k \in U} y_{dk}$, where $y_{dk} = I_{dk} y_k$ and $I_{dk}$ is the domain indicator variable; i.e., $I_{dk} = 1$ if unit $k$ is in the domain of interest $d$, and $I_{dk} = 0$ otherwise. Continuing the tobacco survey example, an analyst might be interested in estimating the total number of cigarettes smoked during the day by all high-school students in the population. In this example, an individual of the population is in the
domain of interest only if he/she is a high-school student. Many descriptive parameters of interest can be written as a smooth function of a vector of population totals, $\mathbf{Y} = \sum_{k \in U} \mathbf{y}_k$, where $\mathbf{y}_k$ is a vector of variables of interest for unit $k$. In other words, the population parameter can be written as $\theta = f(\mathbf{Y})$ for some smooth function $f(\cdot)$. The most common example is the domain mean $\bar{Y}_d = \sum_{k \in U} y_{dk} / \sum_{k \in U} I_{dk}$. In the tobacco survey example, the analyst might be interested in estimating the average number of cigarettes smoked during the day by high-school students in the population.

It is usually too costly to collect information about the variables of interest for all population units. A sample is thus selected from the population and information used to derive the variables of interest is collected only for sample units. Sample selection is often done by randomly selecting certain units from a list that we call a sampling frame. Unless otherwise stated (see the sections on indirect sampling and combined sources), we will assume that the sampling frame is identical to the target population $U$. Formally, a sample $s$ of size $n$ is selected from the population $U$ using some probability sampling design $p(s)$. The estimation of descriptive parameters of the population $U$ is achieved by using the variable of interest $y$ measured for each unit in sample $s$. Each sample unit $k$ is then assigned an estimation weight $w_k$, which is used to obtain estimates of the parameters of interest. For instance, the estimator of the population total $Y = \sum_{k \in U} y_k$ is:

$$\hat{Y} = \sum_{k \in s} w_k y_k. \qquad (1)$$

The estimation of a more complex population parameter of the form $\theta = f(\mathbf{Y})$ can be done similarly by using the estimator $\hat{\theta} = f(\hat{\mathbf{Y}})$, where $\hat{\mathbf{Y}} = \sum_{k \in s} w_k \mathbf{y}_k$. A single set of estimation weights, $\{w_k; k \in s\}$, can be used to obtain estimates for any parameter, variable and domain of interest as long
as the estimation weight wk does not depend on the population values of the variables and domains of interest. However, it may depend on the sample. In the rest of this chapter, we discuss methods used to obtain the estimation weight wk under different scenarios. For simplicity, we focus on the estimation of a population total Y = ∑ k∈U yk .
STANDARD WEIGHTING METHODS

Design Weighting

Let $\pi_k = P(k \in s)$ be the probability that population unit $k$ is selected in the sample $s$. We assume that $\pi_k > 0$ for all units $k$ of population $U$; i.e., all population units have a nonzero chance of being selected in the sample $s$. For example, with the simple random sample of 25 individuals from 100, we have $\pi_k = n/N = 25/100 = 1/4$, for all $k \in U$. The use of unequal probabilities of selection is common in sample surveys. For instance, when a size measure is available for all population units, the population can be stratified and units in different strata may be assigned different selection probabilities. Another possibility is to select the sample with probabilities proportional to the size measure. These unequal probabilities of selection must be accounted for when estimating population parameters; otherwise, bias may result. The most basic estimator of $Y$ that accounts for unequal probabilities of selection is the Horvitz–Thompson estimator (Horvitz and Thompson, 1952), also called the expansion estimator:

$$\hat{Y}^{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} = \sum_{k \in U} \frac{y_k}{\pi_k} t_k, \qquad (2)$$

where $t_k$ is the sample selection indicator variable such that $t_k = 1$, if $k \in s$, and $t_k = 0$, otherwise. The estimator $\hat{Y}^{HT}$ is design-unbiased (or p-unbiased) for $Y$ in the sense that $E_p(\hat{Y}^{HT}) = Y$. The subscript $p$ indicates that the expectation is evaluated with respect to the sampling design, i.e., with respect to all possible samples that could have been drawn from the population. Note that only the sample selection indicators $t_k$, $k \in U$, are treated as random when taking a design expectation. The property of $\hat{Y}^{HT}$ to be p-unbiased can be shown by first noting that $E_p(t_k) = 1 \times P(k \in s) + 0 \times P(k \notin s) = P(k \in s) = \pi_k$. From (2), it follows that

$$E_p(\hat{Y}^{HT}) = E_p\Big(\sum_{k \in U} y_k t_k / \pi_k\Big) = \sum_{k \in U} y_k E_p(t_k)/\pi_k = \sum_{k \in U} y_k = Y.$$
The Horvitz–Thompson estimator $\hat{Y}^{HT}$, given by (2), can be rewritten as

$$\hat{Y}^{HT} = \sum_{k \in s} d_k y_k, \qquad (3)$$
where dk = 1/π k is the design weight of sample unit k, also called the sampling weight. In this set-up, the design weight dk = 1/π k of unit k can be used as an estimation weight in the absence of nonresponse. The design weight of unit k may be interpreted as the number of units from population U represented by this sample unit. In our previous example, each individual has one chance out of four (π k = 1/ 4) of being part of the sample and, therefore, each individual has a design weight of 4. Note that this interpretation is not always appropriate. For instance, consider a population of 100 units with one unit having its selection probability equal to 1/1000 and therefore its design weight equal to 1000. Note also that the design weight does not need to be an integer. To learn more about sampling theory, the reader may consult books such as Cochran (1977), Grosbras (1986), Särndal et al. (1992), Morin (1993), Tillé (2001), Thompson (2002), Ardilly (2006) and Lohr (2009a). It should be noted that Hidiroglou
et al. (1995) also present a good overview of weighting in the context of business surveys. Before discussing how to adjust design weights to account for nonresponse, we first describe the calibration technique.
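To make the design-weighting step concrete, here is a minimal Python sketch (hypothetical data generated with NumPy; not part of the chapter) that computes the design weights $d_k = 1/\pi_k$ and the Horvitz–Thompson estimate (3) for an equal-probability sample of 25 units from a population of 100, mirroring the example above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of N = 100 incomes and a simple random sample of n = 25.
N, n = 100, 25
y_pop = rng.gamma(shape=2.0, scale=20000.0, size=N)   # variable of interest y_k
sample_idx = rng.choice(N, size=n, replace=False)     # selected sample s

pi = np.full(n, n / N)      # inclusion probabilities pi_k = n/N = 1/4
d = 1.0 / pi                # design weights d_k = 1/pi_k (each equals 4)

y_s = y_pop[sample_idx]
Y_HT = np.sum(d * y_s)      # Horvitz-Thompson estimate of the total, equation (3)

print(f"Sum of design weights: {d.sum():.0f} (equals N = {N})")
print(f"HT estimate of total:  {Y_HT:,.0f}  (true total: {y_pop.sum():,.0f})")
```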
Calibration

Calibration arises from a generalization by Deville (1988), and then by Deville and Särndal (1992), of an idea by Lemel (1976). It uses auxiliary information to improve the quality of design-weighted estimates. An auxiliary variable, $x$, also called a calibration variable, must have the following two characteristics to be considered in calibration:
i. it must be available for all sample units $k \in s$; and
ii. its population total, $X = \sum_{k \in U} x_k$, must be known.

Often, a vector of auxiliary variables, $\mathbf{x}$, is available along with its associated vector of population totals, $\mathbf{X} = \sum_{k \in U} \mathbf{x}_k$. The vector of known population totals can be obtained from the sampling frame, an administrative file or a (projected) census. In practice, the vector $\mathbf{X}$ may be subject to errors, but we assume that they are small enough to be ignored. Examples of auxiliary variables are the revenue of a business or the age group of a person. Although $E_p(\hat{\mathbf{X}}^{HT}) = \mathbf{X}$ because of the p-unbiasedness of the Horvitz–Thompson estimator, the main issue with the use of design weights $d_k$ is that they usually lead to incoherence since

$$\hat{\mathbf{X}}^{HT} = \sum_{k \in s} d_k \mathbf{x}_k \neq \sum_{k \in U} \mathbf{x}_k = \mathbf{X}.$$

Calibration fixes this inequality by incorporating auxiliary information in the estimator. It consists of determining calibration weights $w_k^{Cal}$ that are as close as possible to the initial design weights $d_k$ while satisfying the following calibration equation:

$$\hat{\mathbf{X}}^{Cal} = \sum_{k \in s} w_k^{Cal} \mathbf{x}_k = \sum_{k \in U} \mathbf{x}_k = \mathbf{X}. \qquad (4)$$

The resulting calibration estimator is denoted by $\hat{Y}^{Cal} = \sum_{k \in s} w_k^{Cal} y_k$.

Calibration is not only used to remove inequalities but also to reduce the design variance of the Horvitz–Thompson estimator $\hat{Y}^{HT}$. The latter is expected to hold when calibration variables are correlated with the variable of interest $y$. To understand this point, let us take an extreme example and suppose that there is a perfect linear relationship between $y$ and $\mathbf{x}$; i.e., $y_k = \mathbf{x}_k^T \boldsymbol{\beta}$, for some vector $\boldsymbol{\beta}$. Then, it is straightforward to show that

$$\hat{Y}^{Cal} = \sum_{k \in s} w_k^{Cal} y_k = \sum_{k \in s} w_k^{Cal} \mathbf{x}_k^T \boldsymbol{\beta} = \sum_{k \in U} \mathbf{x}_k^T \boldsymbol{\beta} = \sum_{k \in U} y_k = Y.$$

In this case, the calibration estimator is perfect; i.e., $\hat{Y}^{Cal} = Y$. In general, a perfect linear relationship between $y$ and $\mathbf{x}$ is unlikely and, thus, $\hat{Y}^{Cal} \neq Y$. However, we may expect that the calibration estimator $\hat{Y}^{Cal}$ will be more efficient than $\hat{Y}^{HT}$ if there is a strong linear relationship between $y$ and $\mathbf{x}$. Note that calibration can also be used in practice to reduce coverage and nonresponse errors. Again, a linear relationship between $y$ and $\mathbf{x}$ is required to achieve these goals. More formally, calibration consists of determining calibration weights $w_k^{Cal}$, for $k \in s$, so as to minimize

$$\sum_{k \in s} G_k(w_k^{Cal}, d_k),$$

subject to the constraint (4), where $G_k(a, b)$ is a distance function between $a$ and $b$. See Deville and Särndal (1992) for examples of distance functions. The most popular distance function in practice is the chi-square distance

$$G_k(w_k^{Cal}, d_k) = \frac{1}{2} \frac{(w_k^{Cal} - d_k)^2}{q_k d_k}, \qquad (5)$$
where $q_k$ is a known constant giving the importance of each unit in the function to be minimized. The choice $q_k = 1$ dominates in practice, except when the ratio estimator is used (see below). With this distance function, we obtain the calibration weight

$$w_k^{Cal} = d_k (1 + q_k \mathbf{x}_k^T \boldsymbol{\lambda}),$$

with $\boldsymbol{\lambda} = \big(\sum_{k \in s} q_k d_k \mathbf{x}_k \mathbf{x}_k^T\big)^{-1} (\mathbf{X} - \hat{\mathbf{X}}^{HT})$. The resulting calibration estimator is the generalized regression estimator, denoted by $\hat{Y}^{Reg}$. It can be expressed as

$$\hat{Y}^{Cal} = \sum_{k \in s} w_k^{Cal} y_k = \hat{Y}^{HT} + (\mathbf{X} - \hat{\mathbf{X}}^{HT})^T \hat{\boldsymbol{\beta}} = \hat{Y}^{Reg}, \qquad (6)$$

where $\hat{\boldsymbol{\beta}} = \big(\sum_{k \in s} q_k d_k \mathbf{x}_k \mathbf{x}_k^T\big)^{-1} \sum_{k \in s} q_k d_k \mathbf{x}_k y_k$.

The generalized regression estimator has many important special cases (e.g., Estevao et al., 1995). In several surveys, and especially for business surveys, the ratio estimator is often used when there is a single auxiliary variable $x$. It is obtained by setting $q_k = 1/x_k$. This leads to the calibration weight $w_k^{Cal} = d_k (X/\hat{X}^{HT})$ and the calibration estimator $\hat{Y}^{Cal} = X (\hat{Y}^{HT}/\hat{X}^{HT})$. The ratio estimator is efficient if the relationship between $y$ and $x$ goes through the origin. The well-known Hájek estimator is itself a special case of the ratio estimator when $x_k = 1$, for $k \in U$. It is written as $\hat{Y}^{Cal} = N (\hat{Y}^{HT}/\hat{N}^{HT})$, where $\hat{N}^{HT} = \sum_{k \in s} d_k$.

The post-stratified estimator is another important special case of the generalized regression estimator. It is used when a single categorical calibration variable is available defining mutually exclusive subgroups of the population, called post-strata. Examples of post-strata are geographical regions or age categories. Let $\delta_{hk} = 1$ if unit $k$ belongs to post-stratum $h$, and $\delta_{hk} = 0$ otherwise, for $h = 1, \ldots, H$, with $H$ denoting the total number of post-strata. The post-stratified estimator is a special case of the generalized
regression estimator obtained by setting $\mathbf{x}_k^T = (\delta_{1k}, \ldots, \delta_{Hk})$ and $q_k = 1$. For instance, suppose there are $H = 3$ post-strata (three age categories, for example) and that unit $k$ belongs to the second post-stratum. Then, we have $\mathbf{x}_k^T = (0, 1, 0)$. The calibration equations are given by

$$\sum_{k \in s} w_k^{Cal} \delta_{hk} = \sum_{k \in s_h} w_k^{Cal} = N_h, \qquad h = 1, \ldots, H,$$

where $s_h$ is the set of sample units falling into post-stratum $h$ and $N_h$ is the population count of post-stratum $h$. The calibration weight of unit $k$ in post-stratum $h$ is given by $w_k^{Cal} = d_k (N_h/\hat{N}_h^{HT})$, where $\hat{N}_h^{HT} = \sum_{k \in s_h} d_k$. The calibration estimator reduces to

$$\hat{Y}^{Cal} = \sum_{k \in s} w_k^{Cal} y_k = \sum_{h=1}^{H} \frac{N_h}{\hat{N}_h^{HT}} \sum_{k \in s_h} d_k y_k.$$
The above post-stratified estimator reduces to the Hájek estimator when there is only one post-stratum ($H = 1$). In other words, the post-stratified estimator can simply be obtained by applying the Hájek estimator within each post-stratum and then summing over all post-strata. Similarly, we would obtain the post-stratified ratio estimator by applying the ratio estimator within each post-stratum and summing over all post-strata. Sometimes, two or more categorical calibration variables are available. Post-strata could be defined by crossing all these variables together. However, some resulting post-strata could have quite a small sample size or could even contain no sample units, which is not a desirable property. Also, the population count of post-strata may not be known even though the population count of each margin is known. Then, only calibration to the known marginal counts is feasible. Raking ratio estimation can then be used in this scenario, which is quite common in social surveys. The reader is referred to Deville et al. (1993) for greater detail on raking ratio estimation. Many distance functions are possible but they all lead to the same asymptotic variance
as the one obtained using the distance function (5). Indeed, Deville and Särndal (1992) proved that under some mild conditions, the calibration estimator Yˆ Cal = ∑ k∈s wkCal yk obtained from some distance function Gk is asymptotically equivalent to the generalized regression estimator Yˆ Reg given by (6) and obtained from the specific distance function (5). The choice of calibration variables may have a much more profound effect than the choice of a distance. As mentioned earlier, the basic principle is to choose calibration variables that are correlated with the variables of interest. Modeling can be a useful tool for selecting such variables. The choice of calibration variables is also sometimes made by subject-matter specialists. It is important to be aware that the use of too many calibration variables, especially if some are weakly correlated with the variables of interest, can actually increase the design variance. Indeed, a point may be reached where the design variance of the calibration estimator is larger than the design variance of the Horvitz–Thompson estimator. Choosing proper calibration variables is not easy in general. The problem is accentuated in multipurpose surveys since calibration variables that are good for estimating the total of one variable of interest may not be good for another variable and a compromise must often be made (e.g., Silva and Skinner, 1997). Finally, if the estimation of some domain totals is important, it is advisable to consider including the domains of interest among the calibration variables; e.g., by using domains as post-strata. To calculate the calibration weights wkCal, Sautory (1991) and Le Guennec and Sautory (2004) developed a software program called CALMAR, which stands for CALage sur MARges (or Calibration to Margins). This program computes calibration weights for the different distance functions listed by Deville and Särndal (1992). CALMAR is used in most of the surveys at the Institut National de la Statistique et des Études Économiques (INSEE) in France such as the Modes de vie
(lifestyles) survey and the Budgets de famille (family budgets) survey. For more details, see Sautory (1992).
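As a rough illustration of how the closed-form chi-square solution can be computed in practice, the sketch below (a simplified, hypothetical Python example; it is not CALMAR, and the function name greg_calibrate and the data are ours) derives the calibration weights $w_k^{Cal} = d_k(1 + q_k \mathbf{x}_k^T \boldsymbol{\lambda})$ and verifies the calibration equation (4), using post-stratification by sex as the special case.

```python
import numpy as np

def greg_calibrate(d, X_sample, X_totals, q=None):
    """Calibration weights under the chi-square distance (5).

    d        : design weights d_k, shape (n,)
    X_sample : auxiliary variables x_k for sampled units, shape (n, p)
    X_totals : known population totals X, shape (p,)
    q        : optional q_k constants (defaults to 1)
    """
    if q is None:
        q = np.ones_like(d)
    X_ht = X_sample.T @ d                            # Horvitz-Thompson estimate of X
    M = (X_sample * (q * d)[:, None]).T @ X_sample   # sum_k q_k d_k x_k x_k^T
    lam = np.linalg.solve(M, X_totals - X_ht)        # lambda
    return d * (1.0 + q * (X_sample @ lam))          # w_k^Cal = d_k (1 + q_k x_k^T lambda)

# Hypothetical example: post-stratification by sex (x_k = post-stratum indicators, q_k = 1).
rng = np.random.default_rng(1)
n = 25
d = np.full(n, 4.0)                                  # design weights from the earlier example
sex = rng.integers(0, 2, size=n)                     # 0 = male, 1 = female
X_sample = np.column_stack([(sex == 0), (sex == 1)]).astype(float)
X_totals = np.array([48.0, 52.0])                    # known counts of men and women

w_cal = greg_calibrate(d, X_sample, X_totals)
print(X_sample.T @ w_cal)                            # reproduces [48, 52]: equation (4) holds
```

With indicator auxiliary variables and $q_k = 1$, the returned weights reduce to $d_k N_h/\hat{N}_h^{HT}$, the post-stratified weights described above.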
Nonresponse Weighting Adjustment

Most surveys, if not all, suffer from nonresponse. Two types of nonresponse can be distinguished: item nonresponse and unit nonresponse. Item nonresponse occurs when information is collected for some but not all the survey variables. Item nonresponse is often treated through imputation, but is outside the scope of this chapter. In the following, we will assume that no sample unit is subject to item nonresponse. Unit nonresponse occurs when no usable information has been collected for all the survey variables. It is typically handled by deleting the nonrespondents from the survey data file and by adjusting the design weight of respondents to compensate for the deletions. The resulting nonresponse-adjusted weights can then be calibrated, if some population totals are known. The main issue with nonresponse is the bias that is introduced when the respondents have characteristics different from the nonrespondents. An additional component of variance is added due to the observed sample size being smaller than the initially planned sample size, $n$. The key to reducing both nonresponse bias and variance is to use nonresponse weighting methods that take advantage of auxiliary information available for both respondents and nonrespondents. To fix ideas, let us denote by $s_r$ the set of respondents; i.e., the subset of $s$ containing all units for which we were able to measure the variable of interest $y$. The set of respondents $s_r$ is generated from $s$ according to an unknown nonresponse mechanism $\phi(s_r \mid s)$. The response probability $p_k = P(k \in s_r \mid s, k \in s)$ is assumed to be strictly greater than zero for all $k \in s$. Nonresponse can be viewed as a second phase of sampling (e.g., Särndal and Swensson,
1987) with the exception that the nonresponse mechanism $\phi(s_r \mid s)$ is unknown, unlike the second-phase sample selection mechanism in a two-phase sampling design. If the response probability could be known for all $k \in s_r$, the double expansion estimator of $Y$ could be used:

$$\hat{Y}^{DE} = \sum_{k \in s_r} \tilde{d}_k y_k,$$

where $\tilde{d}_k = (1/\pi_k)(1/p_k) = d_k/p_k$ is the adjusted design weight. The double expansion estimator is unbiased for $Y$ in the sense that $E_{p\phi}(\hat{Y}^{DE}) = Y$. This follows from $E_\phi(\hat{Y}^{DE} \mid s) = \hat{Y}^{HT}$ and the design unbiasedness of $\hat{Y}^{HT}$ as an estimator of $Y$. The subscript $\phi$ indicates that the expectation is evaluated with respect to the nonresponse mechanism. The proof of $E_\phi(\hat{Y}^{DE} \mid s) = \hat{Y}^{HT}$ is obtained by rewriting the double expansion estimator as

$$\hat{Y}^{DE} = \sum_{k \in s} \frac{r_k}{p_k} \tilde{y}_k, \qquad (7)$$

where $\tilde{y}_k = d_k y_k$ is the expanded $y$ variable and $r_k$ is the response indicator for unit $k$; i.e., $r_k = 1$ if sample unit $k$ responds, whereas $r_k = 0$ otherwise. The result follows from the fact that $E_\phi(r_k \mid s, k \in s, \tilde{y}_k) = P(r_k = 1 \mid s, k \in s, \tilde{y}_k) \equiv p_k$.

The response probability $p_k$ is unknown in practice, unlike the second-phase selection probability in a two-phase sampling design. To circumvent this difficulty, the response probability $p_k$ can be estimated using a nonresponse model. A nonresponse model is a set of assumptions about the multivariate distribution of the response indicators $r_k$, $k \in s$. The estimated response probability of unit $k$ is denoted by $\hat{p}_k$ and the nonresponse-adjusted design weight by $\hat{d}_k = d_k/\hat{p}_k$. The resulting nonresponse-adjusted estimator is $\hat{Y}^{NA} = \sum_{k \in s_r} \hat{d}_k y_k$. It is typically not unbiased anymore but, under certain conditions, the bias is small in large samples. Most methods for handling nonresponse simply differ in the way the response probability $p_k$ is estimated. In the rest of this section, we focus on the modeling and estimation of $p_k$.

Ideally, the expanded variable $\tilde{y}_k = d_k y_k$ would be considered as an explanatory variable in the response probability model. In other words, the response probability would be defined as $p_k \equiv P(r_k = 1 \mid s, k \in s, \tilde{y}_k)$, $k \in s$. As pointed out above, this would ensure that the double expansion estimator $\hat{Y}^{DE}$ is unbiased. Note that if there are many domains and variables of interest, the number of potential explanatory variables could become quite large. Unfortunately, $\tilde{y}_k$ is not known for the nonrespondents $k \in s - s_r$ and can thus not be used as an explanatory variable. The most commonly used approach to deal with this issue is to replace the unknown $\tilde{y}_k$ by a vector $\mathbf{z}_k$ of explanatory variables available for all sample units $k \in s$. The vector $\mathbf{z}_k$ must be associated with the response indicator $r_k$. Ideally, it must also be associated with $\tilde{y}_k$, as it is a substitute for $\tilde{y}_k$. This is in line with the recommendation of Beaumont (2005) and Little and Vartivarian (2005) that explanatory variables used to model nonresponse should be associated with both the response indicator and the variables of interest. Explanatory variables that are not associated with the (expanded) variables of interest do not reduce the nonresponse bias and may likely increase the nonresponse variance. Explanatory variables can come from the sampling frame, an administrative file and can even be paradata. Paradata, such as the number of attempts made to contact a sample unit, are typically associated with nonresponse but may or may not be associated with the variables of interest. So, such variables should not be blindly incorporated into the nonresponse model. Beaumont (2005) gave the example of the Canadian Labour Force Survey to illustrate a case where the number of attempts made to contact a sample unit may be an appropriate explanatory variable as it is associated with one of the main variables of interest, namely, the employment status of a
person. The rationale is that a person who is employed may be more difficult to reach and, thus, may require a larger number of attempts to be contacted on average than a person who is unemployed. The use of $\mathbf{z}_k$ as a replacement for $\tilde{y}_k$ implies that the response probability is now defined as $p_k \equiv P(r_k = 1 \mid s, k \in s, \mathbf{z}_k)$. The double expansion estimator (7) remains unbiased, provided that the following condition holds:

$$P(r_k = 1 \mid s, k \in s, \mathbf{z}_k, \tilde{y}_k) = P(r_k = 1 \mid s, k \in s, \mathbf{z}_k) \equiv p_k. \qquad (8)$$

Condition (8) implies that the nonresponse mechanism does not depend on any unobserved value and, thus, that the values of the variable of interest are missing at random (see Rubin, 1976). In addition to condition (8), it is typically assumed that sample units respond independently of one another. In order to obtain an estimate of $p_k$, one may consider a parametric model. A simple parametric response probability model is the logistic regression model (e.g., Ekholm and Laaksonen, 1991):

$$p_k = p(\mathbf{z}_k; \boldsymbol{\alpha}) = \frac{1}{1 + \exp(-\mathbf{z}_k^T \boldsymbol{\alpha})}, \qquad (9)$$

where $\boldsymbol{\alpha}$ is a vector of unknown model parameters that needs to be estimated. If we denote by $\hat{\boldsymbol{\alpha}}$ an estimator of $\boldsymbol{\alpha}$, then the estimated response probability is given by $\hat{p}_k = p(\mathbf{z}_k; \hat{\boldsymbol{\alpha}})$. The logistic function (9) ensures that $0 < \hat{p}_k < 1$. The maximum likelihood method can be used for the estimation of $\boldsymbol{\alpha}$. An interesting alternative was proposed by Iannacchione et al. (1991). They suggested determining $\hat{\boldsymbol{\alpha}}$ that satisfies:

$$\sum_{k \in s_r} \hat{d}_k \mathbf{z}_k = \sum_{k \in s} d_k \mathbf{z}_k, \qquad (10)$$

where $\hat{d}_k = d_k/\hat{p}_k$ and $\hat{p}_k = p(\mathbf{z}_k; \hat{\boldsymbol{\alpha}})$. Equation (10) can be viewed as a calibration equation and the nonresponse-adjusted weight $\hat{d}_k$ can be viewed as a calibration weight. The main distinction with the calibration equation (4) is that the sample $s$ is replaced with the set of respondents $s_r$ and the vector of known population totals is replaced with an estimate computed from the complete sample $s$. This observation suggests that calibration may be directly used to adjust for the nonresponse. This is the view taken by Fuller et al. (1994), Lundström and Särndal (1999) and Särndal and Lundström (2005), among others. Indeed, the adjusted weights $\hat{d}_k$, $k \in s_r$, could be determined so as to satisfy simultaneously both equation (10) and the equation $\sum_{k \in s_r} \hat{d}_k \mathbf{x}_k = \mathbf{X}$.

Although there is a close connection between calibration and weighting by the inverse of estimated response probabilities, we prefer the latter view because it states explicitly the underlying assumptions required for the (approximate) unbiasedness of $\hat{Y}^{NA}$ as an estimator of the population total $Y$, such as assumptions (8) and (9). Also, it stresses the importance of a careful modeling of the response probabilities. Once nonresponse-adjusted weights $\hat{d}_k$ have been obtained, nothing precludes using calibration to improve them further; i.e., we may want to determine the calibration weights $\hat{w}_k^{Cal}$, $k \in s_r$, that minimize $\sum_{k \in s_r} G_k(\hat{w}_k^{Cal}, \hat{d}_k)$ subject to the constraint $\sum_{k \in s_r} \hat{w}_k^{Cal} \mathbf{x}_k = \mathbf{X}$. This is a problem very similar to the one discussed in the previous section on calibration.

There are two main issues with using the logistic function (9): (i) it may not be appropriate, even though careful model validation has been made; and (ii) it has a tendency to produce very small values of $\hat{p}_k$, yielding large weight adjustments $1/\hat{p}_k$. The latter may cause instability in the nonresponse-adjusted estimator $\hat{Y}^{NA}$. A possible solution to these issues is obtained through the creation of classes that are homogeneous with respect to the propensity to respond. Ideally, every unit in a class should have the same true response probability. A method that aims at forming homogeneous classes is the so-called
score method (e.g., Little, 1986; Eltinge and Yansaneh, 1997; or Haziza and Beaumont, 2007). It can be implemented as follows:
Step 1: Obtain estimated response probabilities $\hat{p}_k^{LR}$, $k \in s$, from a logistic regression.
Step 2: Order the sample from the lowest estimated response probability computed in Step 1 to the largest.
Step 3: Form a certain number of classes that are homogeneous with respect to $\hat{p}_k^{LR}$, $k \in s$. Classes of equal size can be formed or a clustering algorithm can be used. The number of classes should be as small as possible but large enough to capture most of the variability of $\hat{p}_k^{LR}$, $k \in s$.
Step 4: Compute the final estimated response probability $\hat{p}_k$ for a unit $k$ in some homogeneous class $c$ as the (weighted) response rate within class $c$.
Forming homogeneous classes using the above procedure provides some robustness to model misspecifications and is less prone to extreme weight adjustments than using 1/pˆ kLR . If the creation of classes does not remove all the extreme weight adjustments, then weight trimming or collapsing classes are possible solutions. The above score method is one method of forming homogeneous classes. There are other methods such as the CHi-square Automatic Interaction Detection (CHAID) algorithm developed by Kass (1980). In stratified business surveys, classes are sometimes taken to be the strata for simplicity and because there may be no other explanatory variable available. Da Silva and Opsomer (2006) described a nonparametric regression method using kernel smoothing as an alternative to forming classes.
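The following hypothetical Python sketch puts the score method together: it fits the logistic model (9), forms a handful of propensity classes (Steps 1–3) and computes $\hat{p}_k$ as the weighted response rate within each class (Step 4). The simulated data and the use of scikit-learn's LogisticRegression are our own choices for illustration, not part of the chapter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any logistic fit would do

rng = np.random.default_rng(7)
n = 1000
d = np.full(n, 10.0)                          # design weights d_k (hypothetical)
z = np.column_stack([rng.integers(1, 6, n),   # e.g. number of contact attempts (paradata)
                     rng.integers(0, 2, n)])  # e.g. an indicator taken from the frame
true_p = 1 / (1 + np.exp(-(1.0 - 0.4 * z[:, 0] + 0.8 * z[:, 1])))
r = rng.binomial(1, true_p)                   # response indicators r_k

# Step 1: estimated response probabilities from a logistic regression.
p_lr = LogisticRegression(C=1e6).fit(z, r).predict_proba(z)[:, 1]

# Steps 2-3: order the sample by p_lr and form a few classes of roughly equal size.
n_classes = 5
cuts = np.quantile(p_lr, np.linspace(0, 1, n_classes + 1)[1:-1])
classes = np.searchsorted(cuts, p_lr)

# Step 4: final p_hat_k = weighted response rate within the unit's class.
p_hat = np.empty(n)
for c in range(n_classes):
    in_c = classes == c
    p_hat[in_c] = np.sum(d[in_c] * r[in_c]) / np.sum(d[in_c])

d_adj = d / p_hat                             # nonresponse-adjusted weights (used for respondents)
print("Sum of adjusted weights over respondents:", np.sum(d_adj[r == 1]).round())
```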
INDIRECT SAMPLING AND THE GENERALIZED WEIGHT SHARE METHOD

In the previous sections, we assumed that the target population was well represented by a
sampling frame, and we supposed therefore that the survey was undertaken for a population $U$ of size $N$, without distinguishing the sampling frame and the target population. Let us now suppose that no frame is available to represent the target population. We can then choose a sampling frame that is indirectly related to this target population. We can thus speak of two populations, the sampling frame $U^A$ and the target population $U^B$, that are related to one another. We wish to produce an estimate for $U^B$, but we only have a sampling frame for $U^A$. We can then select a sample from $U^A$ and produce an estimate for $U^B$ using the existing links between the two populations. For the survey on tobacco use, for example, we might not have a list of smokers (our target population). However, we might use a list of convenience stores selling cigarettes (sampling frame) to ultimately reach the smokers for interviewing them. This is what we can refer to as indirect sampling (see Lavallée, 2002; and Lavallée, 2007).

In the context of indirect sampling, a sample $s^A$ of $n^A$ units is selected from a population $U^A$ of $N^A$ units using a particular sample design. Let $\pi_j^A$ be the probability of selection of unit $j$. We assume that $\pi_j^A > 0$ for all $j \in U^A$. We also assume that the target population $U^B$ contains $N^B$ units. We are interested in estimating the total $Y^B = \sum_{k \in U^B} y_k$ in population $U^B$ for the variable of interest $y$. We assume that there is a link (or relationship) between the units $j$ of population $U^A$ and the units $k$ of population $U^B$. That link is identified by the indicator variable $l_{j,k}$, where $l_{j,k} = 1$ if there is a link between unit $j \in U^A$ and unit $k \in U^B$, and 0 if not. Note that there may be cases where there is no link between a unit $j$ of population $U^A$ and the units $k$ of target population $U^B$, i.e., $L_j^A = \sum_{k \in U^B} l_{j,k} = 0$. For each unit $j$ selected in $s^A$, we identify the units $k$ of $U^B$ that have a nonzero link with $j$, i.e., $l_{j,k} = 1$. If $L_j^A = 0$ for a unit $j$ of $s^A$, there is simply no unit of $U^B$ identified with that unit $j$, which affects the efficiency of sample $s^A$, but does not cause bias. For each unit $k$
identified, we measure a particular variable of interest $y_k$ and the number of links $L_k^B = \sum_{j \in U^A} l_{j,k}$ between unit $k$ of $U^B$ and population $U^A$. Let $s^B$ be the set of $n^B$ units of $U^B$ identified by units $j \in s^A$. For target population $U^B$, we want to estimate the total $Y^B$. Estimating that total is a major challenge if the links between the units of the two populations are not one-to-one. The problem is due primarily to the difficulty of associating a selection probability, or an estimation weight, to the units of the target population that is surveyed. The generalized weight share method (GWSM) assigns an estimation weight $w_k^{GWSM}$ to each surveyed unit $k$. The method relies on sample $s^A$ and the links between $U^A$ and $U^B$ to estimate total $Y^B$. To estimate the total $Y^B$ for target population $U^B$, we can use the estimator

$$\hat{Y}^B = \sum_{k \in s^B} w_k^{GWSM} y_k.$$
The GWSM is described in Lavallée (1995, 2002, 2007). It is an extension of the weight share method described by Ernst (1989), which was applied to longitudinal household surveys. The GWSM can be seen as a generalization of network sampling and adaptive cluster sampling. These two sampling methods are described in Thompson (2002) and Thompson and Seber (1996). In formal terms, the GWSM assigns an estimation weight $w_k^{GWSM}$ to each unit $k$ in $s^B$:

$$w_k^{GWSM} = \frac{1}{L_k^B} \sum_{j \in U^A} l_{j,k} \frac{t_j}{\pi_j^A},$$

where $t_j = 1$ if $j \in s^A$, and 0 if not. It is important to note that if unit $j$ of $U^A$ is not selected, we do not need to know its selection probability $\pi_j^A$, a key point in the GWSM. In addition, for the GWSM to be design-unbiased, we must have $L_k^B = \sum_{j \in U^A} l_{j,k} > 0$; in other words, each unit $k$ of $U^B$ must have at least one link with $U^A$. As shown in Lavallée (1995), $\hat{Y}^B$ can also be written as
$$\hat{Y}^B = \sum_{j \in U^A} \frac{t_j}{\pi_j^A} \sum_{k \in U^B} l_{j,k} \frac{y_k}{L_k^B} = \sum_{j \in U^A} \frac{t_j}{\pi_j^A} Z_j, \qquad (11)$$

where $Z_j = \sum_{k \in U^B} l_{j,k} y_k / L_k^B$. Using the expression (11), we see that the estimation weight $w_j^{GWSM}$ for each selected unit $j$ in $s^A$ is simply given by $w_j^{GWSM} = 1/\pi_j^A$. We can then easily show that the GWSM is design-unbiased.

The GWSM described above does not use auxiliary information to obtain estimation weights. It can be expected, however, that the use of auxiliary variables can improve the precision of estimates coming from the GWSM. For example, auxiliary information could come from the population $U^A$ from which the sample is selected, from the target population $U^B$, or both of the populations. Lavallée (2002, 2007) showed how to apply the calibration of Deville and Särndal (1992) to the GWSM. The estimation weights obtained from the GWSM are adjusted so that the estimates produced correspond to known totals associated to auxiliary information.

With indirect sampling, there are three types of unit nonresponse. First, unit nonresponse can be present within the sample $s^A$ selected from $U^A$. For the example on tobacco use, we first select a sample of convenience stores before we get the sample of smokers. Within the sample $s^A$ of convenience stores, there can be some that refuse to give the names of smoker customers, which creates some nonresponse within $s^A$. Second, some of the units of $s^B$ might refuse to answer the survey. For example, a smoker might decide not to answer the survey. This creates some nonresponse within $s^B$. Finally, with indirect sampling, there is another form of nonresponse that comes from the problem of identifying some of the links. This type of nonresponse is associated with the situation where it is impossible to determine whether a unit $k$ of $U^B$ is linked to a unit $j$ of $U^A$. This is referred to as the problem of link identification. For example, it is impossible to know if
a particular smoker went to buy his cigarettes at a particular convenience store. This kind of nonresponse problem was mentioned by Sirken and Nathan (1988) in the context of network sampling. Xu and Lavallée (2009) suggested some methods to deal with the problem of link identification. All three types of nonresponse require some adjustments to be made to the estimation weights obtained using the GWSM. Lavallée (2002, 2007), Xu and Lavallée (2009) and Messiani (2013) describe some of these possible adjustments.
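As a sketch of how the GWSM weight $w_k^{GWSM} = (1/L_k^B) \sum_{j \in U^A} l_{j,k}\, t_j/\pi_j^A$ can be computed from a link matrix, the hypothetical Python example below uses a made-up frame of stores and target population of smokers; the link structure, the Poisson-type selection and the variable values are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

NA, NB = 8, 12                    # sizes of the frame U^A and the target population U^B
# Hypothetical link matrix: links[j, k] = l_{j,k} = 1 if frame unit j is linked to target unit k.
links = (rng.random((NA, NB)) < 0.3).astype(float)
links[:, links.sum(axis=0) == 0] = 1.0   # ensure L^B_k > 0 for every k (needed for unbiasedness)

pi_A = np.full(NA, 0.5)           # selection probabilities pi_j^A in the frame
t = rng.binomial(1, pi_A)         # selection indicators t_j (Poisson sampling, for simplicity)

L_B = links.sum(axis=0)           # number of links L^B_k for each target unit k
# w_k^GWSM = (1 / L^B_k) * sum_j l_{j,k} * t_j / pi_j^A
w_gwsm = (links * (t / pi_A)[:, None]).sum(axis=0) / L_B

surveyed = w_gwsm > 0             # units of s^B: linked to at least one selected frame unit
y = rng.normal(20, 5, NB)         # variable of interest (e.g. cigarettes per day)
Y_B_hat = np.sum(w_gwsm[surveyed] * y[surveyed])
print("GWSM estimate of Y^B:", round(Y_B_hat, 1))
```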
COMPUTING WEIGHTS FROM COMBINED SOURCES

Combining data from different sources is often performed to increase the amount of available information. One purpose is to add some variables to an existing dataset. Another purpose is to increase the coverage of a given target population. Combining files is then performed with the goal of obtaining sampling frames that will cover as much of the target population as possible. Taking into account these different frames in a sound methodological way can be done using multiple-frame approaches. In the literature, numerous papers exist on how to deal with multiple frames. For an overview of multiple-frame methods, one can consult, for instance, Kott and Vogel (1995), Lohr and Rao (2006) and Lohr (2009b). In this section, three approaches will be briefly described in order to present the weighting process associated with each of them.

Let us assume that $L$ sampling frames $U^{A1}, U^{A2}, \ldots, U^{AL}$ are available, and they are not necessarily exclusive. That is, each of the frames covers a portion of the target population $U^B$, and these frames can overlap. The resulting sampling frame can then be seen as a population $U^A$ from which the sample is selected. We have $\bigcup_{\ell=1}^{L} U^{A\ell} = U^A$. For simplicity, we will assume here that $L = 2$, but the generalization to $L > 2$ is straightforward.

We suppose that samples $s^{A1}$ and $s^{A2}$ of $n^{A1}$ and $n^{A2}$ units are selected from the populations $U^{A1}$ and $U^{A2}$, respectively. Each sample is selected independently from the other. We have $s^A = s^{A1} \cup s^{A2}$. As in the previous section, we assume that the target population $U^B$ contains $N^B$ units. In most applications of multiple frames, it is assumed that the sampling frames and the target population coincide, i.e., they refer to the same type of units. In the example of the survey on tobacco use, the target population is the smokers of a given city, while the sampling frames could be partial lists of smokers. These lists could be obtained, for instance, by obtaining lists of patients from hospitals, or by getting lists of clients from convenience stores selling cigarettes, or any other kinds of sources. As we can see here, the units of the different frames are the smokers, whom we ultimately want to survey. Another situation is the one where the sampling frames and the target population do not refer to the same units, but instead to different units with some links between them. Coming back to the survey on tobacco use, the sampling frames could be lists of dwellings containing smokers or not, or lists of convenience stores, or both. In this situation, indirect sampling is needed to sample the smokers from the target population. Such applications can be found in Deville and Maumy (2006) and Ardilly and Le Blanc (2001). In this section, we will only refer to the simpler case where the units of the sampling frames and the target population coincide. That is, $\bigcup_{\ell=1}^{2} U^{A\ell} = U^A = U^B = U$.

We now describe three approaches to deal with multiple frames, together with the weighting process associated with each of the approaches. The three approaches are: (i) the domain-membership approach, (ii) the single-frame approach and (iii) the unit-multiplicity approach. It should be noted that multiple-frame approaches are not exempt from nonresponse, and therefore, some nonresponse adjustments may be required to the weights obtained. As well, calibration may be
used to improve the quality of the estimates produced from multiple frames. Nonresponse adjustments and calibration in the context of multiple frames are direct applications of the preceding material presented in this chapter.
The Domain-Membership Approach

This approach was first proposed by Hartley (1962, 1974). We assume that the target population is well covered by the two sampling frames $U^{A1}$ and $U^{A2}$. Because the two frames can overlap, we can define three domains: $a = U^{A1} \setminus U^{A2}$, $b = U^{A2} \setminus U^{A1}$ and $ab = U^{A1} \cap U^{A2}$. As mentioned earlier, we suppose that samples $s^{A1}$ and $s^{A2}$ of $n^{A1}$ and $n^{A2}$ units are selected from the populations $U^{A1}$ and $U^{A2}$, respectively. We have $s^A = s^{A1} \cup s^{A2}$. For each surveyed unit $k$, we measure a variable of interest $y$, and we want to estimate the total $Y = \sum_{k \in U} y_k$. From each sample $s^{A1}$ and $s^{A2}$, we can respectively compute an estimate $\hat{Y}^{A1}$ and $\hat{Y}^{A2}$ using estimator (2). For domain $a$, an estimate can be obtained from sample $s^{A1}$ as:

$$\hat{Y}_a^{A1} = \sum_{k \in s^{A1} \cap a} \frac{y_k}{\pi_k^{A1}} = \sum_{k \in s^{A1} \cap a} d_k^{A1} y_k.$$

For domain $b$, a similar estimate $\hat{Y}_b^{A2}$ can be obtained from sample $s^{A2}$. For the intersecting domain $ab$, two possible estimates can be obtained:

$$\hat{Y}_{ab}^{A1} = \sum_{k \in s^{A1} \cap ab} d_k^{A1} y_k \quad \text{and} \quad \hat{Y}_{ab}^{A2} = \sum_{k \in s^{A2} \cap ab} d_k^{A2} y_k.$$
To estimate the total $Y$, Hartley (1962, 1974) proposed the composite estimator:

$$\hat{Y}^{Hartley} = \hat{Y}_a^{A1} + \vartheta \hat{Y}_{ab}^{A1} + (1 - \vartheta) \hat{Y}_{ab}^{A2} + \hat{Y}_b^{A2}, \qquad (12)$$

where the parameter $\vartheta \in [0, 1]$. From (12), we can write

$$\hat{Y}^{Hartley} = \sum_{k \in s^A} w_k^{Hartley} y_k, \qquad (13)$$

where

$$w_k^{Hartley} = \begin{cases} d_k^{A1} & \text{if } k \in s^{A1} \cap a, \\ \vartheta d_k^{A1} & \text{if } k \in s^{A1} \cap ab, \\ (1 - \vartheta) d_k^{A2} & \text{if } k \in s^{A2} \cap ab, \\ d_k^{A2} & \text{if } k \in s^{A2} \cap b. \end{cases} \qquad (14)$$

The weights $w_k^{Hartley}$ are the estimation weights related to the use of the domain-membership (or Hartley's) estimator in the context of multiple frames. In its simplest form, we can use $\vartheta = 1/2$. Hartley (1974) proposed an optimal value $\vartheta_{opt}$ obtained by minimizing the variance of estimator (12). Fuller and Burmeister (1972) proposed to improve the estimator of Hartley (1962) by adding a term to (12) associated with the estimation of $N_{ab}$, the population size of domain $ab$. However, this estimator is not linear in $y$, in the sense that it cannot be written as (1).
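A minimal sketch, assuming a handful of hypothetical sampled units, of the Hartley weights (14) with $\vartheta = 1/2$ and the resulting estimator (13):

```python
import numpy as np

theta = 0.5   # the parameter in (12)-(14); theta_opt would minimize the variance of (12)

# Hypothetical sampled units: for each, the frame it was sampled from ('A1' or 'A2'),
# its design weight in that frame, and whether it belongs to the overlap domain ab.
frame = np.array(['A1', 'A1', 'A1', 'A2', 'A2', 'A2'])
d     = np.array([ 5.0,  5.0,  5.0,  8.0,  8.0,  8.0])   # d_k^{A1} or d_k^{A2}
in_ab = np.array([False, True, True, False, True, False])
y     = np.array([ 12.,  20.,   7.,  15.,   9.,  11.])

w = d.copy()                                  # d_k^{A1} if k in s^A1 ∩ a, d_k^{A2} if k in s^A2 ∩ b
w[(frame == 'A1') & in_ab] *= theta           # theta * d_k^{A1} in s^A1 ∩ ab
w[(frame == 'A2') & in_ab] *= (1 - theta)     # (1 - theta) * d_k^{A2} in s^A2 ∩ ab

Y_hartley = np.sum(w * y)                     # estimator (13)
print("Hartley estimate of Y:", Y_hartley)
```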
The Single-Frame Approach

Rather than considering each of the two frames $U^{A1}$ and $U^{A2}$ separately, one can view the frame $U^A = U^{A1} \cup U^{A2}$ as a whole. The samples $s^{A1}$ and $s^{A2}$ are then combined in a single dataset $s^A = s^{A1} \cup s^{A2}$, and estimation weights are calculated appropriately. Bankier (1986) proposed to treat the sample $s^A = s^{A1} \cup s^{A2}$ as a single sample, rather than treating each of the samples $s^{A1}$ and $s^{A2}$ separately. In a two-frame sampling design, each of the frames $U^{A1}$ and $U^{A2}$ usually contains units that are not included in the other frame, as well as units in common. Selecting sample $s^{A1}$ from the frame $U^{A1}$ can be viewed as equivalent to selecting a sample from the complete population $U^A$ in which one stratum (which contains those units not in frame $U^{A1}$) has no sample selected from it. A sample selected from frame $U^{A2}$ can be viewed similarly. Then, the two samples $s^{A1}$ and $s^{A2}$ selected independently from two separate frames $U^{A1}$ and $U^{A2}$ can be viewed as a
special case of the sampling design in which the two samples are selected independently from the same frame $U^A$. By combining the samples $s^{A1}$ and $s^{A2}$ into $s^A$, we have, for a unit $k \in U^A$,

$$\begin{aligned} \pi_k^A &= P(k \in s^A) \\ &= P\big((k \in s^{A1}) \cup (k \in s^{A2})\big) \\ &= P(k \in s^{A1}) + P(k \in s^{A2}) - P\big((k \in s^{A1}) \cap (k \in s^{A2})\big) \\ &= \pi_k^{A1} + \pi_k^{A2} - \pi_k^{A1} \pi_k^{A2}. \end{aligned} \qquad (15)$$

The last line holds because of the independence in the selection of $s^{A1}$ and $s^{A2}$. From (15), we directly obtain the estimation weight $w_k^{Bankier} = 1/\pi_k^A = (\pi_k^{A1} + \pi_k^{A2} - \pi_k^{A1}\pi_k^{A2})^{-1}$ to be used in the Horvitz–Thompson estimator (2).

Like Bankier (1986), Kalton and Anderson (1986) proposed a single-frame approach. Their approach differs however from Bankier (1986) in that the two samples $s^{A1}$ and $s^{A2}$ are still considered as separate. That is,

$$\hat{Y}^{KA} = \sum_{k \in s^{A1}} w_k^{KA,A1} y_k + \sum_{k \in s^{A2}} w_k^{KA,A2} y_k,$$

where the weights $w_k^{KA,A1}$ and $w_k^{KA,A2}$ are defined as the following. Referring to the domains $a$, $b$ and $ab$ defined in the previous section, a unit $k$ in domain $ab$ can be selected in $s^{A1}$ and in $s^{A2}$. Therefore, the expected number of times unit $k \in ab$ is selected is $\pi_k^{A1} + \pi_k^{A2}$. For a unit $k \in ab$ that is selected in $s^A$, we can then assign the weight $(\pi_k^{A1} + \pi_k^{A2})^{-1}$. Note that for a unit $k \in a$ that is selected in $s^{A1}$, we can directly assign the weight $(\pi_k^{A1})^{-1}$, and similarly for a unit $k \in b$ that is selected in $s^{A2}$. From this, we obtain the estimation weights

$$w_k^{KA,A1} = \begin{cases} 1/\pi_k^{A1} & \text{if } k \in a, \\ 1/(\pi_k^{A1} + \pi_k^{A2}) & \text{if } k \in ab, \end{cases}$$

and

$$w_k^{KA,A2} = \begin{cases} 1/\pi_k^{A2} & \text{if } k \in b, \\ 1/(\pi_k^{A1} + \pi_k^{A2}) & \text{if } k \in ab. \end{cases}$$
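For a quick numerical contrast between the two single-frame weightings, the snippet below (hypothetical selection probabilities) computes, for one overlap unit $k \in ab$, the Bankier weight implied by (15) and the corresponding Kalton–Anderson weight:

```python
# Selection probabilities of one unit k that belongs to the overlap domain ab (hypothetical values).
pi_A1, pi_A2 = 0.10, 0.25

w_bankier = 1.0 / (pi_A1 + pi_A2 - pi_A1 * pi_A2)   # 1 / pi_k^A, from (15)
w_ka      = 1.0 / (pi_A1 + pi_A2)                   # Kalton-Anderson weight for k in ab

print(round(w_bankier, 3), round(w_ka, 3))          # ~3.077 vs ~2.857
```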
The Unit-Multiplicity Approach
Mecatti (2007) proposed to construct a multiple-frame estimator by taking into account the multiplicity of the sampled units. First introduced by Sirken (1970) in the context of network sampling, the multiplicity of a unit $k$ refers here to the number of frames $m_k$ to which this unit belongs. With two frames, we clearly have $m_k = 1$ if $k \in a$ or $k \in b$, and $m_k = 2$ if $k \in ab$. Using multiplicity, the total $Y = \sum_{k \in U} y_k$ can then be represented as

$$Y = \sum_{k \in U^{A1}} \frac{y_k}{m_k} + \sum_{k \in U^{A2}} \frac{y_k}{m_k}.$$
Using each of the samples $s^{A1}$ and $s^{A2}$, the total $Y$ can then be estimated by

$$\hat{Y}^M = \sum_{k \in s^{A1}} \frac{y_k}{\pi_k^{A1} m_k} + \sum_{k \in s^{A2}} \frac{y_k}{\pi_k^{A2} m_k}. \qquad (16)$$
The weights $w_k^M$ obtained by using the unit-multiplicity approach are then given by $w_k^M = (\pi_k^{A\ell} m_k)^{-1}$ for $k \in U^{A\ell}$, $\ell = 1, 2$. It is interesting to note that estimator (16) can also be obtained by a direct application of the GWSM in the context of indirect sampling. Recall that we restricted the present theory to the case where $\bigcup_{\ell=1}^{2} U^{A\ell} = U^A = U^B = U$. This means that there is a one-to-one or zero-to-one correspondence between each of the populations $U^{A\ell}$, $\ell = 1, 2$, and the target population $U^B$. That is, each unit $j$ of $U^{A\ell}$ is linked to only one unit $k$ of $U^B$. Conversely, a unit $k$ of $U^B$ can be linked to at most one unit $j$ of a particular population $U^{A\ell}$, but this unit $k$ can be linked to the two populations $U^{A1}$ and $U^{A2}$. Thus, there exists a link between each unit $j$ of population $U^A = U^{A1} \cup U^{A2}$ and one unit $k$ of population $U^B$, i.e.,
LAj = ∑ k∈U B l j , k = 1 for all j ∈ U A . On the other hand, there can be one or two links for a unit k of population UB to the population U A = U A1 ∪ U A 2 , i.e., we have LBk = ∑ j∈U A l j ,k = 1 or LBk = 2. We can then see that the number of links LBk in estimator (11) corresponds to the multiplicity mk, and therefore, estimator (11) can be directly rewritten as estimator (16). Note also that estimator (16) can be written in the form of estimator (13) with weights obtained using (14) and ϑ = 1/ 2 .
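A small hypothetical sketch of estimator (16), where each sampled unit receives the weight $w_k^M = (\pi_k^{A\ell} m_k)^{-1}$:

```python
import numpy as np

# Hypothetical sampled units from the two frames; m_k = 2 for units in the overlap domain ab.
pi  = np.array([0.10, 0.10, 0.20, 0.25, 0.25])   # pi_k^{A1} or pi_k^{A2}, per the frame sampled from
m   = np.array([   1,    2,    1,    2,    1])   # multiplicities m_k
y   = np.array([ 14.,   9.,  22.,  30.,   5.])

w_M = 1.0 / (pi * m)                             # w_k^M = (pi_k^{A_l} m_k)^{-1}
Y_M = np.sum(w_M * y)                            # estimator (16)
print("Multiplicity estimate of Y:", round(Y_M, 1))
```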
DISCUSSION

We have described a number of common weighting methods that can be used in different situations. In this section, we discuss a few practical issues associated with their use. Let us start with design weighting. As pointed out in the section on design weighting, the Horvitz–Thompson estimator of a total is design-unbiased. This property is due to the use of design weights. However, although design-unbiased, it is well known (e.g., Rao, 1966) that the Horvitz–Thompson estimator may be highly inefficient when the design weights are weakly associated with the variables of interest and are widely dispersed. A simple solution to this problem is to trim the largest design weights (e.g., Potter, 1990). However, weight trimming often has only a limited impact on the design variance of the Horvitz–Thompson estimator. Beaumont (2008) proposed a weight smoothing method to reduce the possible inefficiency of the Horvitz–Thompson estimator. The main idea consists of replacing the design weight $d_k$ in (3) by the smoothed weight $d_k^S = E_m(d_k \mid s, k \in s, \mathbf{y}_k)$. The subscript $m$ indicates that the expectation is evaluated with respect to a model for the design weights. In general, the smoothed weight $d_k^S$ depends on unknown model parameters that need to be estimated to yield an estimated smoothed weight $\hat{d}_k^S$. Gains in efficiency can
be explained because the method only keeps the portion of the design weight that is associated with the variables of interest. Beaumont and Rivest (2009) illustrated through an empirical study the efficiency of weight smoothing to handle the problem of stratum jumpers and its superiority over weight trimming. The main issue with weight smoothing is that it relies on the validity of a model for the design weights. A careful model validation should thus be performed when implementing this method in practice.

Another issue that needs consideration is the possible occurrence of extreme calibration factors $F_k = w_k^{Cal}/d_k$ when using the chi-square distance given in the section on calibration. The chi-square distance may even lead to negative calibration weights, although this is not possible if the ratio or post-stratified estimator is used. There exist other distances that can be used to control the magnitude of the calibration factor $F_k$. For example, Deville and Särndal (1992) suggested the distance:

$$G_k(w_k^{Cal}, d_k) = (F_k - \varphi_{Low}) \log\!\left(\frac{F_k - \varphi_{Low}}{1 - \varphi_{Low}}\right) + (\varphi_{Up} - F_k) \log\!\left(\frac{\varphi_{Up} - F_k}{\varphi_{Up} - 1}\right),$$

where $\varphi_{Low}$ and $\varphi_{Up}$ are some pre-specified constants. This distance ensures that $\varphi_{Low} < F_k < \varphi_{Up}$. Unfortunately, a solution to the optimization problem may not always exist. Ridge calibration (e.g., Chambers, 1996; Rao and Singh, 1997; or Beaumont and Bocci, 2008) is an alternative to using more complex distance functions that always yields a solution satisfying $\varphi_{Low} < F_k < \varphi_{Up}$ at the expense of relaxing the calibration constraint (4).

It may sometimes be possible to divide unit nonrespondents into two types: the resolved and unresolved nonrespondents. A sample unit is resolved if we can determine whether it is in scope (IS) or out of scope (OOS) for the survey. Respondents to the
survey are always resolved and nonrespondents can be either resolved or unresolved. Note that resolved units that are determined to be OOS can be treated as respondents; they are simply outside the domain of interest. In this scenario, nonresponse weighting adjustment can be achieved through a two-step model. First, the probability $P(\rho_k = 1 \mid s, k \in s)$ that a sample unit $k$ is resolved is modeled. The indicator $\rho_k$ is equal to 1 if unit $k$ is resolved and is equal to 0, otherwise. Second, the probability $P(r_k = 1 \mid s, k \in s, \rho_k = 1)$ that unit $k$ responds given it is resolved is modeled. The response probability for a respondent $k$ is then the product:

$$p_k = P(\rho_k = 1 \mid s, k \in s) \times P(r_k = 1 \mid s, k \in s, \rho_k = 1).$$

Different explanatory variables can be used to model each probability. In particular, the IS/OOS status should be used to model $P(r_k = 1 \mid s, k \in s, \rho_k = 1)$ but cannot be used to model $P(\rho_k = 1 \mid s, k \in s)$ as it is not available for unresolved units.
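To illustrate the two-step adjustment, here is a tiny hypothetical sketch multiplying the two estimated probabilities for a few respondents (the probability values are invented for illustration):

```python
import numpy as np

# Hypothetical estimated probabilities for three respondents.
p_resolved             = np.array([0.95, 0.80, 0.90])   # P(rho_k = 1 | s, k in s)
p_respond_given_resolv = np.array([0.70, 0.60, 0.85])   # P(r_k = 1 | s, k in s, rho_k = 1)

p = p_resolved * p_respond_given_resolv                 # response probability, as in the text
d = np.array([10.0, 10.0, 10.0])                        # design weights
d_adj = d / p                                           # nonresponse-adjusted weights
print(np.round(d_adj, 2))
```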
RECOMMENDED READINGS

Survey sampling and weighting in general: Särndal et al. (1992), and Hidiroglou et al. (1995). Calibration: Deville and Särndal (1992). Nonresponse adjustment: Särndal and Swensson (1987), and Särndal and Lundström (2005). Indirect sampling: Lavallée (2002, 2007). Multiple-frame estimation: Hartley (1974) and Lohr (2009b).
REFERENCES

Ardilly, P. (2006). Les techniques de sondage, 2nd edn. Paris: Éditions Technip.
Ardilly, P., and Le Blanc, P. (2001). Sampling and weighting a survey of homeless persons: a French example. Survey Methodology, 7 (1): 109–118. Bankier, M.D. (1986). Estimators based on several stratified samples with applications to multiple frame surveys. Journal of the American Statistical Association, 81 (396): 1074–1079. Beaumont, J.-F. (2005). On the use of data collection process information for the treatment of unit nonresponse through weight adjustment. Survey Methodology, 31: 227–231. Beaumont, J.-F. (2008). A new approach to weighting and inference in sample surveys. Biometrika, 95: 539–553. Beaumont, J.-F., and Bocci, C. (2008). Another look at ridge calibration. Metron, LXVI: 5–20. Beaumont, J.-F., and Rivest, L.-P. (2009). Dealing with outliers in survey data. In Handbook of Statistics, Sample Surveys: Theory, Methods and Inference, Vol. 29, Chapter 11 (eds D. Pfeffermann and C.R. Rao). Amsterdam: Elsevier BV. Chambers, R.L. (1996). Robust case-weighting for multipurpose establishment surveys. Journal of Official Statistics, 12: 3–32. Cochran, W.G. (1977). Sampling Techniques, 3rd edn. New York: John Wiley & Sons. Da Silva, D.N., and Opsomer, J.D. (2006). A kernel smoothing method of adjusting for unit nonresponse in sample surveys. Canadian Journal of Statistics, 34: 563–579. Deville, J.-C. (1988). Estimation linéaire et redressement sur information auxiliaires d'enquêtes par sondage. In Essais en l'honneur d'Edmond Malinvaud (eds A. Monfort and J.J. Laffont). Paris: Economica, pp. 915–927. Deville, J. C., and Maumy, M. (2006). Extension of the indirect sampling method and its application to tourism. Survey Methodology, 32 (2): 177–185. Deville, J.-C., and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87 (418): 376–382. Deville, J.-C., Särndal, C.-E., and Sautory, O. (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88: 1013–1020.
Ekholm, A., and Laaksonen, S. (1991). Weighting via response modeling in the Finnish Household Budget Survey. Journal of Official Statistics, 7: 325–337. Eltinge, J. L., and Yansaneh, I. S. (1997). Diagnostics for formation of nonresponse adjustment cells, with an application to income nonresponse in the U.S. Consumer Expenditure Survey. Survey Methodology, 23: 33–40. Ernst, L. (1989). Weighting issues for longitudinal household and family estimates. In Panel Surveys (eds D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley & Sons, pp. 135–159. Estevao, V.M., Hidiroglou, M.A., and Särndal, C.-E. (1995). Methodological principles for a generalized estimation system at Statistics Canada. Journal of Official Statistics, 11 (2): 181–204. Fuller, W. A., and Burmeister, L.F. (1972). Estimators of samples selected from two overlapping frames. Proceedings of the Social Statistics Sections, American Statistical Association, pp. 245–249. Fuller, W.A., Loughin, M.M., and Baker, H.D. (1994). Regression weighting in the presence of nonresponse with application to the 1987–1988 Nationwide Food Consumption Survey. Survey Methodology, 20: 75–85. Grosbras, J.-M. (1986). Méthodes statistiques des sondages. Paris: Economica. Hartley, H.O. (1962). Multiple frame surveys. Proceedings of the Social Statistics Sections, American Statistical Association, pp. 203–206. Hartley, H.O. (1974). Multiple frame methodology and selected applications. Sankhya, C, 36: 99–118. Haziza, D., and Beaumont, J-F. (2007). On the construction of imputation classes in surveys. International Statistical Review, 75: 25–43. Hidiroglou, M.A., Särndal, C.-E., and Binder, D.A. (1995). Weighting and estimation in business surveys. In Business Survey Methods (eds B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott). New York: John Wiley & Sons. , pp. 477–502. Horvitz, D.G., and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47: 663–685.
Iannacchione, V.G., Milne, J.G., and Folsom, R.E. (1991). Response probability weight adjustments using logistic regression. Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 637–642. Kalton, G., and Anderson, D. W. (1986). Sampling rare populations. Journal of Royal Statistical Society, A, 149: 65–82. Kass, V.G. (1980). An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society, Series C (Applied Statistics), 29: 119–127. Kott, P.S., and Vogel, F.A. (1995), Multipleframe business surveys. In Business Survey Methods (eds B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott). New York: John Wiley & Sons, pp. 185–201. Lavallée, P. (1995). Cross-sectional weighting of longitudinal surveys of individuals and households using the weight share method. Survey Methodology, 21 (1): 25–32. Lavallée, P. (2002). Le sondage indirect, ou la Méthode généralisée du partage des poids. Éditions de l'Université de Bruxelles (Belgique) and Éditions Ellipses (France). Lavallée, P. (2007). Indirect Sampling. New York: Springer. Le Guennec, J., and Sautory, O. (2004). CALMAR 2: Une nouvelle version de la macro Calmar de redressement d'échantillon par calage. In Échantillonnage et méthodes d'enquêtes. Dunod. Lemel, Y. (1976). Une généralisation de la méthode du quotient pour le redressement des enquêtes par sondage. Annales de l'INSEE, 22–23: 272–282. Little, R. J. A. (1986). Survey nonresponse adjustments for estimates of means. International Statistical Review, 54: 139–157. Little, R.J., and Vartivarian, S. (2005). Does weighting for nonresponse increase the variance of survey means? Survey Methodology, 31: 161–168. Lohr, S. (2009a). Sampling: Design and Analysis, 2nd edn. Pacific Grove, CA: Duxbury Press. Lohr, S. (2009b). Multiple-frame surveys. In Handbook of Statistics, Vol. 29A, Sample Surveys: Design, Methods and Applications (eds D. Pfeffermann and C.R. Rao). Amsterdam: Elsevier, pp. 71–88.
Lohr, S.L., and Rao, J.N.K. (2006). Multiple frame surveys: point estimation and inference. Journal of American Statistical Association, 101: 1019–1030. Lundström, S., and Särndal, C.-E. (1999). Calibration as a standard method for the treatment of nonresponse. Journal of Official Statistics, 15: 305–327. Mecatti, F. (2007). A single frame multiplicity estimator for multiple frame surveys. Survey Methodology, 33 (2): 151–157. Messiani, A. (2013). Estimation of the variance of cross-sectional indicators for the SILC survey in Switzerland. Survey Methodology, 39 (1): 121–148. Morin, H. (1993). Théorie de l'échantillonnage. Ste-Foy: Presses de l'Université Laval. Potter, F. (1990). A study of procedures to identify and trim extreme sampling weights. In Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 225–230. Rao, J.N.K. (1966). Alternative estimators in PPS sampling for multiple characteristics. Sankhya, Series A, 28: 47–60. Rao, J.N.K., and Singh, A.C. (1997). A ridgeshrinkage method for range-restricted weight calibration in survey sampling. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 57–65. Rubin, D.B. (1976). Inference and missing data. Biometrika, 53: 581–592. Särndal, C.-E, and Lundström, S. (2005). Estimation in Surveys with Nonresponse. Chichester: John Wiley & Sons.
Särndal, C.-E., and Swensson, B. (1987). A general view of estimation for two-phases of selection with applications to two-phase sampling and nonresponse. International Statistical Review, 55, 279–294. Särndal, C.-E., Swensson, B., and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag. Sautory, O. (1991). La macro SAS: CALMAR (redressement d'un échantillon par calage sur marges). INSEE internal document, Paris. Sautory, O. (1992). Redressement d'échantillons d'enquêtes auprès des ménages par calage sur marges. Actes des journées de méthodologie statistique, March 13–14, 1991, INSEE Méthodes, 29-30-31 (December): 299–326. Silva, P.L.N., and Skinner, C. (1997). Variable selection for regression estimation in finite populations. Survey Methodology, 23 (1): 23–32. Sirken, M.G. (1970). Household surveys with multiplicity. Journal of the American Statistical Association, 65 (329): 257–266. Sirken, M.G., and Nathan, G. (1988). Hybrid network sampling. Survey Research Section of the Proceedings of the American Statistical Association, pp. 459–461. Thompson, S.K. (2002). Sampling, 2nd edn. New York: John Wiley & Sons. Thompson, S.K., and Seber, G.A. (1996). Adaptive Sampling. New York: John Wiley & Sons. Tillé, Y. (2001). Théorie des sondages – Échantilllonnage et estimation en populations finies. Paris: Dunod. Xu, X., and Lavallée, P. (2009). Treatments for links nonresponse in indirect sampling. Survey Methodology, 35 (2): 153–164.
31 Analysis of Data from Stratified and Clustered Surveys Stephanie Eckman and Brady T. West
INTRODUCTION
The data analysis techniques often taught in introductory statistics courses rely on the assumption that the data come from a simple random sample. However, many of the data sets that we use are based on samples that include stratification and/or cluster sampling. In addition, the cases may have unequal weights due to sample selection or adjustment for nonresponse and undercoverage. Data from such complex samples cannot be analyzed in quite the same way as data from simple random samples. This chapter discusses how these sample design elements affect estimates and provides some guidance on how to do proper analyses accounting for complex sample designs. At all points we provide references to other articles and books for additional information.
SAMPLING
Before we discuss how complex sample survey data should be analyzed, we first give a little background on the sampling techniques themselves, to provide some intuition about why some sampling techniques require special handling during the analysis step. In this section we give examples for the estimation of means and their standard errors. In later sections we discuss regression coefficients, subclass means, and so on. We start with the idea that there is a population mean that we do not know and want to estimate. It might be the average number of employees at a firm, or the average of a four-point rating of respondents' satisfaction with the government. We estimate the population mean, Ȳ, with the sample mean, ȳ. However, the sample mean usually does not equal the population mean exactly, and
we also estimate various measures of its precision: its variance, its standard error, and a confidence interval for the population mean given the sample information.
Simple Random Sampling
Simple random sampling, or SRS, is rarely used in practice, but it provides a frame of reference for the more complex sample designs that we will consider. When we do simple random sampling, we have a list of members of the population (such as telephone numbers in the United States, or businesses in Germany) and we select an equal probability sample. If the population is size N and the sample is size n, then the probability of selection of each unit is n/N. To estimate a mean when we have data from an SRS, we simply use the mean calculated on the n sampled cases:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
We select just one sample, but there are many other samples that we could have selected, and each one would give an estimate of the mean. The distribution of means from all the hypothetical simple random samples is shown in Figure 31.1. The horizontal axis is the estimated mean from each sample. The vertical axis is the frequency of each mean. We see that most simple random samples give estimates that are near the true population mean. However, there are some samples on the left and right ends of the distribution where the sample mean is far from the population mean. If we are unlucky and select one of these samples, ȳ will be a poor estimate of Ȳ.
Figure 31.1 Sampling distribution for simple random sampling, stratified simple random sampling using proportional allocation, and clustered simple random sampling.
The variance of the sample estimate of the mean is:
$$\mathrm{Var}(\bar{y}) = \frac{(1 - n/N)\,S^2}{n} \qquad (1)$$
where n/N is the sampling fraction and S² is the variability of the variable (Y) in the population. Equation (1) gives the true variance of the estimate of the mean, but we usually cannot calculate it, because S² is not known. However, we can estimate S² by s², the variability of Y in the sample:
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$
Also note, in Equation (1), that n/N is often very small, which means that 1 − n/N is near 1. For this reason, it is common to drop the 1 − n/N term, called the finite population correction, and simply estimate the variance of the mean as:
$$\mathrm{var}(\bar{y}) = \frac{s^2}{n} \qquad (2)$$
Here we use var(ȳ), rather than Var(ȳ), to indicate that we are estimating the variance, rather than calculating the true variance.
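As a concrete illustration of Equations (1) and (2), the short R sketch below (ours, not part of the chapter; the data are simulated and the object names are hypothetical) computes the sample mean and its estimated variance with and without the finite population correction:
set.seed(1)
N <- 1000000                              # hypothetical population size
y <- rnorm(1200, mean = 50, sd = 10)      # hypothetical SRS of n = 1,200 values
n <- length(y)
ybar <- mean(y)                           # sample mean, the estimate of the population mean
s2   <- var(y)                            # s^2, the sample variance (divisor n - 1)
v_fpc   <- (1 - n / N) * s2 / n           # Equation (1), with S^2 estimated by s^2
v_nofpc <- s2 / n                         # Equation (2): finite population correction dropped
se <- sqrt(v_nofpc)                       # estimated standard error of the mean
ci <- ybar + c(-1.96, 1.96) * se          # approximate 95% confidence interval
With a sampling fraction of only 0.12%, the two variance estimates are nearly identical, which is why the correction is so often dropped.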
Stratification
Stratification is a technique that can help to reduce the chance that we select a sample whose ȳ is far from Ȳ. We divide the population into H groups, called strata, and select separate simple random samples from each one. For example, in Germany it is very common to stratify the country into east and west, as the federal states that used to be in East Germany and those that were part of West Germany are still different from each other. With stratification, we guarantee that a sample will include cases from both parts of Germany, and thus we eliminate some of the bad simple random samples from the pool of possible samples and reduce the variance of the estimate of the sample mean. We are more confident in our estimate from a stratified SRS of size n than an unstratified SRS of size n, because the stratified sample is more likely to be close to the true value, as shown in Figure 31.1. The variance of a stratified sample is:
$$\mathrm{Var}(\bar{y}) = \sum_{h=1}^{H} (1 - f_h)\,\frac{W_h^2 S_h^2}{n_h} \qquad (3)$$
where H is the number of strata, and h indexes the strata. W_h is the proportion of the population in stratum h: W_h = N_h/N (S_h² is just like S² defined above, but within stratum h). 1 − f_h is the finite population correction: f_h = n_h/N_h. Note that the theoretical variance of the mean in this case is defined entirely by within-stratum variance for the variable of interest. Between-stratum variance in the mean is removed, because the H strata are fixed by design and will be included in every hypothetical sample. For this reason, estimates from a stratified SRS can have smaller variances than those from an SRS. We estimate the variance by replacing S_h² in Equation (3) with s_h², and drop the finite population correction:
$$\mathrm{var}(\bar{y}) = \sum_{h=1}^{H} \frac{W_h^2 s_h^2}{n_h} \qquad (4)$$
Equation (4) is the analog to Equation (2) for stratified simple random sampling. When we stratify, we must decide how many cases to select from each stratum. Proportional allocation distributes the n sample cases into the strata in proportion to their size in the total population: n_h = W_h n. Proportional allocation can reduce the variance of the estimate of the mean. Disproportional allocation distributes the n sample cases in another way. One form of disproportional allocation puts equal sample sizes in each stratum, regardless of how big the population is in each stratum. Disproportional allocation can also reduce the variance of the estimate of the mean relative to SRS, possibly resulting in even larger variance reduction than proportional allocation. However, it is also possible that the
variance with a disproportionate design is larger than that from an SRS design. An example may make this clearer. Table 31.1 shows a simple population. The first column contains information about the entire population, and the second and third columns split this population into two strata. (In this example, we presume that we know Ȳ and S² in each stratum, though in a real survey situation this would not be the case.) If we select an SRS of size 1,200 from this population, the variance of our estimate of Ȳ, using Equation (1), is 1,498.2. (If we ignore the finite population correction, it is 1,500, a very small difference.) If instead we select a stratified sample using proportionate allocation, we put 20% of the sample in the urban stratum (n1 = 240) and 80% in the rural stratum (n2 = 960). Then the variance of the estimate of the mean (Equation (3)) is lower: 1,199.2. Disproportionate stratification can reduce the variance further, but can also lead to variances larger than those from SRS. Table 31.2 summarizes the variances from different possible designs.
Table 31.1 Example population for stratified sampling
          Population     Urban stratum    Rural stratum
N          1,000,000           200,000          800,000
W_h               NA               0.2              0.8
ȳ              1,400             3,000            1,000
S²         1,800,000         4,000,000        1,000,000
n              1,200               240              960
f             0.120%            0.120%           0.120%
Table 31.2 Variance of different stratified designs
Design               n       n1     n2    Var(ȳ)    deff    deft
SRS                1,200      –      –    1,498.2     –       –
Proportionate      1,200     240    960   1,199.2   0.80    0.89
Disproportionate   1,200     400    800   1,039.2   0.69    0.83
Disproportionate   1,200     600    600   1,119.2   0.75    0.86
Disproportionate   1,200     800    400   1,798.4   1.20    1.10
The design effect captures the gain (or loss) in the efficiency of an estimate relative to an SRS. More formally, the design effect is the ratio of the variance from the complex design to the variance if we incorrectly assumed the sample came from a simple random sample:
$$\mathrm{deff}(\bar{y}) = \frac{\mathrm{Var}_{complex}(\bar{y})}{\mathrm{Var}_{SRS}(\bar{y})}$$
The deft is the ratio of the standard errors:
$$\mathrm{deft}(\bar{y}) = \sqrt{\frac{\mathrm{Var}_{complex}(\bar{y})}{\mathrm{Var}_{SRS}(\bar{y})}}$$
The last two columns of Table 31.2 give the design effects for each of the allocations. The proportionate allocation approach gives a variance that is 80% of the size of the SRS variance (deff = 0.8). Proportionate allocation has reduced variances by 20%. In the last column, deft = 0.89, which means that the standard error on our estimate of Ȳ has been reduced by 11%. These design effects can also be greater than one, as in the last row of the table, indicating that stratification has increased variances and standard errors relative to the SRS design. We can also interpret design effects the other way around: how incorrect would our estimates of the variance (or standard error) be if we were to ignore the sampling technique and pretend that our sample was selected with an SRS? A deff less than one means that when we ignore the complex nature of the sample we overestimate our variances and fail to capture the efficiencies
introduced by the sample design. A deff greater than one means that we underestimate our variances and standard errors, and thus reject null hypotheses more often than we should. Such an error can have dangerous consequences for our analyses.
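The design effect calculations above are easy to reproduce by hand. The following R sketch (ours; the stratum values are illustrative and deliberately not those of Table 31.1) implements Equations (1) and (3) and derives the deff and deft:
# Variance of the sample mean under SRS, Equation (1)
var_srs <- function(S2, n, N) (1 - n / N) * S2 / n
# Variance of the sample mean under stratified SRS, Equation (3)
var_strat <- function(Nh, nh, S2h) {
  Wh <- Nh / sum(Nh)                       # stratum shares W_h = N_h / N
  fh <- nh / Nh                            # stratum sampling fractions f_h
  sum((1 - fh) * Wh^2 * S2h / nh)
}
# Hypothetical two-stratum population (illustrative values only)
Nh  <- c(200000, 800000)                   # stratum population sizes
S2h <- c(4e6, 1e6)                         # within-stratum unit variances
S2  <- 1.8e6                               # overall unit variance, taken as given
n   <- 1200
v_srs  <- var_srs(S2, n, sum(Nh))
v_prop <- var_strat(Nh, nh = round(Nh / sum(Nh) * n), S2h = S2h)
deff <- v_prop / v_srs                     # design effect of the proportionate design
deft <- sqrt(deff)                         # ratio of the standard errors
Trying different allocations in the nh argument reproduces the qualitative pattern of Table 31.2: some allocations push the deff below one, others push it above one.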
Cluster Sampling
Cluster sampling is a technique that is often used to reduce the costs of data collection. If we were to select an SRS of households in the US, or Germany, or Australia, the selected households would be very spread out, and it would be costly to send interviewers to each address. For this reason, many in-person surveys are clustered: we first select geographical regions such as postal codes or counties, and then interview households only within the selected areas. The downside to clustered samples is that the variance of the sample mean is often larger than it is with an SRS. For example, if we want to measure the reading ability of 4th grade students in Germany, we could select an SRS of 2,000 4th graders and give each of them a reading test. Alternatively, we could select an SRS of 20 schools, and then select 100 4th graders within each school and test them. Both designs give us a data set of 2,000 test scores, from which we can calculate the average score. However, the clustered design has a higher variance – we should be less confident about the estimate from this survey than we are about the estimate from the simple random sample. This is because the 4th graders who all go to the same school are similar: they likely come from the same neighborhood, their parents may have similar backgrounds, and they are learning reading from the same set of teachers. For this reason, the clustered sample of 2,000 students produces less precise estimates of features of the entire population of German 4th graders than does an unclustered sample of the same size. Figure 31.1 also shows the sampling distribution for a clustered design. Notice that
it is shorter and has fatter tails than the SRS distribution, meaning that fewer samples give mean estimates that are close to the true value, and more samples lead to estimates that are far away. When selecting samples to represent the entire population of a country, it is often the case that some clusters are more important than others. Consider selecting a sample from the US, using counties as clusters. Some counties have larger populations than others, and it would be foolish to give them all the same probability of selection, as an SRS would. A better selection method gives higher probabilities of selection to large clusters: this method is called probability proportional to size (PPS) sampling. PPS sampling ensures that the counties in New York City have a higher chance to be selected than a rural county in Alaska, which sounds reasonable. We do not give formulas for SRS or PPS samples of clusters, but interested readers should see Cochran (1977: Chapters 10 and 11), and Kish (1965: Chapter 5). Design effects for clustered samples are usually greater than one, indicating that variances and standard errors are larger than they would be with an SRS, and that ignoring the complex sample design will cause our estimates of standard error and confidence intervals to be too small. The size of the design effect varies with each estimate that we calculate. In an in-person household survey that is geographically clustered, those variables with a high degree of homogeneity within the clusters will have large design effects. This design effect arises because the persons within each cluster are similar to each other on these measures, and thus sampling additional people within clusters does not tell us much about the population as a whole. However, other variables which are not homogenous within clusters will have lower design effects in the same survey. (It is also the case that some of what we think of as geographical clustering is really clustering introduced by interviewers; see Schnell and Kreuter (2005) and West et al. (2013).)
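In practice, design effects such as these are usually requested from survey-aware software rather than computed by hand. A minimal sketch using R's survey package (Lumley 2010), with simulated data and hypothetical variable names standing in for the school example above:
library(survey)
set.seed(2)
dat <- data.frame(
  school_id = rep(1:20, each = 100),                           # 20 sampled schools (clusters)
  score     = rnorm(2000,
                    mean = rep(rnorm(20, 500, 30), each = 100), # school-level differences
                    sd   = 50)
)
dat$wt <- 1                                                    # equal weights for this illustration
# Declare the one-stage cluster design and ask for the design effect
des <- svydesign(ids = ~school_id, weights = ~wt, data = dat)
svymean(~score, design = des, deff = TRUE)
Because pupils within the same school resemble one another in this simulation, the reported deff will typically be well above one.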
Stratified Multi-Stage Cluster Sampling
In practice, samples for in-person surveys are often both stratified and clustered. Stratification ensures that the sample is diverse and spread throughout the country and tends to reduce variances. Cluster sampling reduces data collection costs, but tends to increase variances. Often, clusters are selected using PPS and stratification is disproportionate, both of which can lead to unequal probabilities of selection across the cases in the sample. Design effects in such surveys are generally greater than one.
Table 31.3 gives design effects for 11 variables from the 2012 round of the General Social Survey, a large in-person household survey in the US. People with college degrees tend to live near others with college degrees, and vice versa, leading to a high deff on this variable. However, there is not a lot of geographic clustering on gender or happiness in marriage and thus we see lower deffs on these variables. All, however, are greater than one, meaning that failure to account for the complex sample design in analysis will lead us to estimate variances that are smaller than they should be.
Table 31.3 Design effects for selected estimates in the 2012 General Social Survey
Variable                                 ȳ      deff
College graduate                        36%     2.90
Premarital sex OK                       51%     2.50
Abortion OK                             43%     2.40
High school graduate                    49%     2.24
Support death penalty for murderers     65%     2.12
Afraid to walk at night                 34%     1.87
Marijuana should be legal               47%     1.57
Men are more suited to politics         20%     1.51
Lower class or working class            53%     1.41
Happy marriage                          98%     1.25
Female                                  54%     1.15
Authors' calculations, from Smith et al. (2013)
WEIGHTING
Weights are another common feature of complex samples: they adjust for unequal probabilities of selection, and also attempt to reduce the influence of nonresponse and undercoverage on survey estimates. Using weights in analysis reduces bias in our estimates if the weights are correlated with the variables we are interested in. In this section, we discuss a bit about how weights are calculated, and then describe how they complicate the estimation of variances and standard errors.
Design Weights
The first step in deriving weights is to take the inverse of the probability of selection for each unit. In an SRS, the probability of selection of each unit is simply n/N and thus the weight for each case is N/n. In stratified samples using SRS within strata, but disproportionate allocation to strata, the weight is different for each stratum, but constant within strata. In PPS samples, each unit may have a different probability of selection and thus a different weight. Weights that adjust for the probability of selection are called design weights.
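A toy R illustration (ours; the stratum sizes are hypothetical) of design weights for a stratified SRS with disproportionate allocation:
Nh <- c(200000, 800000)        # stratum population sizes (hypothetical)
nh <- c(600, 600)              # equal, and therefore disproportionate, allocation
prob <- nh / Nh                # probability of selection within each stratum
d_wt <- 1 / prob               # design weight, constant within a stratum
d_wt                           # about 333 in stratum 1 and about 1,333 in stratum 2:
                               # each sampled case 'represents' that many population units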
Weight Adjustments
Most surveys suffer from nonresponse, and weights are often adjusted to account for it. There are several different methods of adjusting weights for nonresponse using information that is available for respondents and nonrespondents. For example, we might know for all cases whether they are in East or West Germany. We could then adjust the weights by the inverse of the response rate within these two cells. If we have several variables available for each selected case, we can estimate a response propensity model
and use the predicted propensities based on the model to adjust the weights. There is also another class of techniques that adjust for nonresponse and undercoverage, while also reducing variances. Raking, post-stratification, and calibration are in this class: all involve adjusting the sample so that it matches the population based on known population information. For example, in many countries, women are more likely to respond to surveys than men and thus the set of respondents has more women than men. These adjustment techniques increase the weight of the responding men, and decrease the weight of the women, so that the weighted sample reflects the population. See Bethlehem et al. (2011), Valliant et al. (2013), or Chapter 30 of this book for more details on these weight adjustment techniques.
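As a hedged sketch of the response propensity adjustment described above, the following R code (ours; the frame variables region and age_group, and all values, are hypothetical) fits a logistic regression for response and divides the respondents' design weights by their predicted propensities:
set.seed(3)
samp <- data.frame(                        # all selected cases, respondents and nonrespondents
  region    = sample(c("East", "West"), 500, replace = TRUE),
  age_group = sample(c("under 40", "40 plus"), 500, replace = TRUE),
  d_wt      = 100,                         # design weight (constant here for simplicity)
  resp      = rbinom(500, 1, 0.6)          # 1 = responded, 0 = did not
)
fit  <- glm(resp ~ region + age_group, family = binomial, data = samp)
phat <- predict(fit, type = "response")    # estimated response propensities
# Nonresponse-adjusted weight, defined for the respondents only
samp$nr_wt <- ifelse(samp$resp == 1, samp$d_wt / phat, NA)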
Use of Weights in Analyses
Because of these adjustments, even samples which have equal design weights will often have unequal final weights. Unfortunately, variable weights increase the variance of our estimates. This effect of weights is called the weighting loss. The size of the weighting loss is called the weff and is approximately 1 + relvar(w_i), where relvar(w_i) is the relative variance (the squared coefficient of variation) of the weights (Kish 1965: Section 11.2C). The interpretation of the weff is similar to the deff, discussed above. A weff of 1.1 means that variances are 10% larger than they would be if the weights were constant, setting aside any effects of the sample design on the variance.
If these adjustment techniques increase the variance of our estimates, why do we use them? Because we believe that the point estimates (that is, the estimates of means, quantiles, correlation coefficients, regression coefficients, and so on) that we get with the final adjusted weights are more accurate (less biased) than those that we get when we use only the design weights. The cost of this reduction in bias is an increase in variance.
Such bias-variance trade-offs are common in statistics. We feel that the use of the final weights is generally a good idea for most analyses, but researchers may want to fit their models both with and without weights to see how the point estimates and standard errors change. If there is little difference in the point estimates when the weights are used, and a big increase in the standard errors, then the unweighted estimates may be preferred (meaning that the weights do not carry any information about the estimate of interest). However, if the weights change the point estimates a good deal, then the weighted estimates are probably better, despite their larger standard errors. Some researchers propose not to use the weights in estimation, but to include the variables that go into the weights as covariates in models of interest; see Korn and Graubard (1999: Section 4.5) or Pfeffermann (2011) for more discussion of this idea. However, Skinner and Mason (2012) caution that this approach changes the interpretation of the estimated model coefficients. They propose an alternative technique which adjusts the weights to reduce their variability while capturing their bias-reducing properties. Ultimately, the use of weights in estimation of regression models will ensure that estimates of model coefficients are unbiased with respect to the sample design, which provides analysts with more protection against bias than does ignoring the weights entirely. Appropriate specification of a given model is still quite important, but analysts seldom know the form of a 'true' model or specify a model in an exactly correct fashion. Weights in estimation can ensure that estimates of the parameters in the population model of interest are unbiased despite model misspecification; that is, we would probably prefer an unbiased estimate of a misspecified model, enabling unbiased inference about a larger population, to a biased estimate of the same misspecified model. As we suggest above, survey analysts would do well to fit models with and without weights and compare the resulting inferences.
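The comparison recommended here is straightforward to carry out. The sketch below (ours, with simulated data and hypothetical variable names) first computes Kish's approximate weighting loss, 1 + relvar(w), and then fits the same regression with and without the weights using R's survey package:
library(survey)
set.seed(4)
dat <- data.frame(                         # toy analysis file
  stratum = rep(1:2, each = 600),
  psu     = rep(1:60, each = 20),
  wt      = runif(1200, 0.5, 3),           # unequal final weights
  x       = rnorm(1200),
  y       = rnorm(1200)
)
# Approximate weighting loss: 1 + relative variance of the weights
weff <- 1 + (sd(dat$wt) / mean(dat$wt))^2
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                 data = dat, nest = TRUE)
m_wtd   <- svyglm(y ~ x, design = des)     # weighted, design-based standard errors
m_unwtd <- lm(y ~ x, data = dat)           # ignores weights and design
summary(m_wtd)
summary(m_unwtd)                           # compare point estimates and standard errors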
VARIANCE ESTIMATION
In complex surveys, weights, stratification, and cluster sampling interact to make the estimation of variances particularly difficult. When a sample contains these elements, the variance estimation formulas given above do not work, and we need to turn to alternative methods. There are two broad classes of techniques to estimate variances and standard errors from complex samples: Taylor-series linearization and replication, which includes several similar techniques. The Taylor-series linearization method approximates the nonlinear (possibly weighted) estimate of a population parameter with a linear one, and then estimates the variance of that approximation. Readers interested in the details should see Chapter 6 of Wolter (2007) or Section 3.6.2 of Heeringa et al. (2010). To implement this technique in software packages, the user must provide information about the sample design: the weights, the strata, and the clusters. Thus the data set one is working with must include these variables. While weights are often provided, some data producers do not wish to provide stratum and cluster identifiers, due to concerns about confidentiality. In these cases, data producers may provide replicate weights (more discussion below), but if only final weights are provided, variance estimation options are more limited. See West and McCabe (2012) for discussion of alternative approaches when only the final survey weight is provided for analysis.
Replication techniques work on a different principle. Replication methods estimate variance by choosing subsets from the sample itself, and using the variability across all these subsets to estimate the variance of the sampling distribution. How exactly these subsamples of the sample are selected depends on the exact replication method in use. There are three commonly used methods: Jackknife Repeated Replication (JRR), Balanced Repeated Replication (BRR), and the Bootstrap: see Chapters 3, 4, and 5 of Wolter (2007), Kolenikov (2010), or Section
3.6.3 of Heeringa et al. (2010) for more details on how they differ. When a data set is set up to use one of these methods, the user will see a set of additional weights on the data set. The software uses these weights to calculate many ‘replicate’ estimates of the quantity of interest, and then computes the variance based on these replicates. The choice of which method to use is not usually up to the researcher, but rather the data producers. When the data set includes replicate weights, the analyst should use the replication method that was used to form the weights. If no replicate weights are provided, but the data set contains indicators for clusters and strata, then Taylor-series linearization can be used. In comparison studies, the techniques produce nearly identical standard errors (Kish and Frankel, 1974; Heeringa et al., 2010: Section 3.6.4).
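A brief sketch of how the two approaches look in R's survey package (variable names hypothetical; Stata's svyset, SAS, and SPSS offer analogous declarations):
library(survey)
set.seed(5)
dat <- data.frame(stratum = rep(1:2, each = 600),
                  psu     = rep(1:60, each = 20),
                  wt      = runif(1200, 0.5, 3),
                  y       = rnorm(1200))
# Taylor-series linearization: supply strata, cluster, and weight variables directly
des_tsl <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                     data = dat, nest = TRUE)
svymean(~y, des_tsl)                       # linearized standard error
# Replication: create replicate weights from the design information
des_rep <- as.svrepdesign(des_tsl, type = "JKn")
svymean(~y, des_rep)                       # jackknife standard error
# If the data producer ships replicate weights instead of design variables,
# they are declared with svrepdesign() and then used in exactly the same way.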
STATISTICAL SOFTWARE
Stata, R, SAS, and SPSS all have built-in routines implementing these different types of variance estimation techniques. Because software development is so rapid, we do not provide a list here of what is possible in each software program. Heeringa et al. (2010) and the book's website, which provides examples of commonly performed analyses of complex sample survey data using all of these software packages, are excellent resources. For information on Stata's survey analysis techniques, we recommend Kreuter and Valliant (2007) and the software's SVY manual, available within the program. For R, we recommend Lumley (2010). There are also more specialized statistical software packages, each of which offers survey-robust analysis. Examples of such software are: HLM, MLwiN, Mplus, WesVar, SUDAAN, and IVEWare, among others. Many of these programs will also estimate design effects. One frustration that researchers often run into is that the analyses they want to conduct
are not available with survey-robust methods, either because the techniques have not yet been developed, or they have not yet been included in the software. Continued statistical advances and software releases will help solve this problem.
ADDITIONAL TOPICS
The above sections have summarized design-based analysis of complex sample survey data and the different variance estimation techniques that are available. With this information, researchers will be well equipped to analyze complex sample survey data and correctly estimate variances in most situations. In this section we briefly address a few more advanced topics. We do not have the space in this chapter to go into detail about these topics, but we point interested readers to books and papers that can help them learn more.
Subpopulation Analyses
Very often researchers are interested in analyzing only a portion of a data set: those with college degrees, those over the age of 40, or those who have experienced a heart attack. When using complex sample survey data, subpopulation analyses such as these are a bit more complicated than with other data sets. Unless the subpopulation one is interested in is also a stratum (or combination of strata) used in the sample design, variance estimates based on Taylor-series linearization must account for cases in the subpopulation as well as those outside of it (and their stratification and clustering information). The reason we need special analysis techniques is that the number of cases that are part of the subsample is random: if we repeated the survey, we would end up with a different number of cases with college degrees. Thus the cases that are not part of the subpopulation also
contribute to the variance, and we cannot ignore them. This additional source of variation must be accounted for when we use Taylor-series linearization to estimate the variance. Most survey-aware software programs offer commands to correctly estimate variances for subpopulation analysis. See Section 4.5 of Heeringa et al. (2010) and West et al. (2008) for more information.
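In R's survey package, for example, this is done by subsetting the design object rather than the data file, so that the cases outside the subpopulation keep contributing their design information; a sketch with simulated data and hypothetical variable names:
library(survey)
set.seed(6)
dat <- data.frame(stratum = rep(1:2, each = 600),
                  psu     = rep(1:60, each = 20),
                  wt      = runif(1200, 0.5, 3),
                  income  = rexp(1200, 1 / 30000),
                  college = rbinom(1200, 1, 0.3))
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                 data = dat, nest = TRUE)
# Subpopulation mean: subset the *design*, not the data frame
svymean(~income, subset(des, college == 1))
# Dropping the other cases before declaring the design (for example, building a
# design from dat[dat$college == 1, ]) discards their design information and can
# get the standard errors wrong.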
Multilevel Models
Clustered surveys are inherently hierarchical: cases are nested within clusters, which may themselves be nested within other clusters. In addition, cases can be considered as nested within interviewers. Hierarchical data sets such as these suggest to many researchers the use of multilevel models, which explicitly account for the different levels of observation in the data. These models can certainly be applied to survey data and are particularly useful when the variance components associated with each level are of substantive interest. For example, consider a data set of student test scores, where students are nested within schools. Survey-robust estimation of the average score will correctly estimate the standard error, accounting for the clustering of students. But if the intra-class correlation coefficient (a measure of how homogenous clusters are) is of interest, it is best estimated using a multilevel modeling approach. The downside to these models is that they make additional assumptions which may not hold in practice, and it is not easy to include weights. See Rabe-Hesketh and Skrondal (2006), Section 12.3 of Heeringa et al. (2010), or Section 2.9.6 of West et al. (2014) for more details.
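As a sketch of this multilevel alternative (ours; it uses the lme4 package, which is one possible implementation rather than one named in the chapter, with simulated data and weights ignored), a random-intercept model for the school example yields the intra-class correlation directly from the estimated variance components:
library(lme4)
set.seed(7)
dat <- data.frame(school = factor(rep(1:20, each = 100)))
dat$score <- 500 + rnorm(20, sd = 30)[dat$school] + rnorm(2000, sd = 50)
# Pupils nested within schools: random intercept for school
fit <- lmer(score ~ 1 + (1 | school), data = dat)
vc  <- as.data.frame(VarCorr(fit))         # between-school and residual variances
icc <- vc$vcov[vc$grp == "school"] / sum(vc$vcov)
icc                                        # intra-class correlation coefficient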
Cross-Country Analyses
There are quite a few survey data sets available today arising from studies that are carried out in many countries at once and thus allow researchers to combine or compare countries,
such as the European Social Survey, the Program for International Student Assessment, and the European Union Statistics on Income and Living Conditions Survey. In such cross-national surveys, different countries often use different sample designs, due to variations in population size, data collection budgets, and modes. Analysts should correctly account for the sample design in each country to get correct standard errors. To ensure that the software calculates correct standard errors, countries should be treated as strata. If a country uses a stratified design, then that country should be represented by several strata. Clustering should be handled carefully as well. When combining or comparing countries where some use a cluster sample design and others do not, the cluster indicator should reflect the clusters (when the design is clustered) and the individual cases (when it is not). See Lynn and Kaminska (2011) for more details.
The use of weights in analyzing cross-country surveys also requires careful thought. When combining data from several countries, observations from small countries tend to be overshadowed by those from large countries, because the weights in the large countries are bigger. If the goal of the analysis is to estimate what percentage of Europeans have attended university, the small countries should play a small role, and the weights do not require adjustment. However, if the goal is to explore what country-level factors are associated with higher likelihood to attend college, then we may want to adjust the weights so that countries contribute more or less equally. See Skinner and Mason (2012) for a discussion of weight adjustments in such situations. For more details on the analysis of cross-country data sets, see Hox et al. (2010).
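A sketch of the country-as-strata and cluster coding described above, again using R's survey package (all variable names and the tiny data frame are hypothetical):
library(survey)
dat <- data.frame(                      # toy two-country file
  country = rep(c("A", "B"), each = 3),
  stratum = c(1, 1, 2, 1, 1, 1),
  psu     = c(10, 10, 20, NA, NA, NA),  # country B has an unclustered design
  case_id = 1:6,
  wt      = 1
)
# Strata must be nested within countries, so combine the two identifiers
dat$strat_cc <- interaction(dat$country, dat$stratum, drop = TRUE)
# Cluster indicator: the PSU where the design is clustered, the case itself where it is not
dat$psu_cc <- ifelse(is.na(dat$psu),
                     paste(dat$country, "case", dat$case_id),
                     paste(dat$country, "psu", dat$psu))
des <- svydesign(ids = ~psu_cc, strata = ~strat_cc, weights = ~wt,
                 data = dat, nest = TRUE)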
CONCLUSION
This chapter has motivated why survey data collected from samples with complex designs should be handled carefully in analysis. Stratification, cluster sampling, and weights are used to make surveys more representative and cost-effective, but they each introduce complications when it comes to analysis. Stratification can reduce variances, but can also increase them. Clustering generally increases variances, due to the homogeneity that naturally occurs in clusters such as neighborhoods, classrooms, and schools. Weights are used to adjust for unequal probabilities of selection (such as in disproportionate allocation stratified sampling or PPS sampling). When a sample includes such features, the usual analysis techniques for simple random samples are not appropriate. Special variance estimation procedures, such as Taylor-series linearization or replication methods, are needed. Fortunately, most statistical software packages now support these variance estimation techniques, and the number of analysis methods that work with these estimation techniques is growing all the time.
RECOMMENDED READINGS
For readers wishing to learn more about how and why survey designs change the way we analyze data, we strongly recommend the Heeringa et al. (2010) book. Those who are interested in running regression models on survey data should consult Heeringa et al. (2014). Skinner and Mason (2012) discuss the use of survey weights in the analysis of survey data, and their article is also an important resource for data analysts.
REFERENCES
Bethlehem, J., Cobben, F. and Schouten, B. (2011). Handbook of Nonresponse in Household Surveys. Hoboken, NJ: Wiley. Cochran, W. G. (1977). Sampling Techniques (3rd edn). New York: Wiley.
Heeringa, S. G., West, B. T. and Berglund, P.A. (2010). Applied Survey Data Analysis. Boca Raton, FL: Chapman Hall/CRC Press. Heeringa, S. G., West, B. T. and Berglund, P.A. (2014). Regression with Complex Samples. In H. Best and C. Wolf (eds), Handbook of Regression Analysis and Causal Inference (pp. 229–252). London, UK: Sage. Hox, J. J., de Leeuw, E. and Brinkhuis, M. J. S. (2010). Analysis Models for Comparative Surveys. In J. Harkness, M. Braun, B. Edwards, T. Johnson, L. Lyberg, P. Mohler, B. E. Pennell and T. W. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 395–418). Hoboken, NJ: Wiley. Kish, L. (1965). Survey Sampling. New York: Wiley. Kish, L. and Frankel, M. R. (1974). Inference from Complex Samples. Journal of the Royal Statistical Society: Series B, 36, 1–37. Kolenikov, S. (2010). Resampling Variance Estimation for Complex Survey Data. The Stata Journal, 10(2), 165–199. Korn, E. L. and Graubard, B. G. (1999). Analysis of Health Surveys. New York: Wiley. Kreuter, F. and Valliant, R. (2007). A Survey on Survey Statistics: What is Done and Can be Done in Stata. The Stata Journal, 7(1), 1–21. Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. New York: Wiley. Lynn, P. and Kaminska, O. (2011). Standard Error of an Estimated Difference Between Countries When Countries Have Different Sample Designs: Issues and Solutions. Proceedings of the International Statistical Institute. http://2011.isiproceedings.org/ papers/950194.pdf [accessed 2014-06-18]. Pfeffermann, D. (2011). Modelling of Complex Survey Data: Why Model? Why is it a Problem? How Can we Approach it? Survey Methodology, 34(2): 115–136.
Rabe-Hesketh, S. and Skrondal, A. (2006). Multilevel Modelling of Complex Survey Data. Journal of the Royal Statistical Society, Series A, 169(4), 805–827. Schnell, R. and Kreuter, F. (2005). Separating Interviewer and Sampling-Point Effects. Journal of Official Statistics, 21(3), 389–410. Skinner, C. J. and Mason, B. (2012). Weighting in the Regression Analysis of Survey Data with a Cross-National Application. Canadian Journal of Statistics, 39, 519–536. Smith, T. W., Marsden, P., Hout, M. and Kim, J. (2013). General Social Surveys, 1972–2012 [Machine-readable data file]. Sponsored by National Science Foundation (NORC edn). Chicago, IL: National Opinion Research Center [Producer]; Storrs, CT: The Roper Center for Public Opinion Research, University of Connecticut [Distributor]. Valliant, R., Dever, J. and Kreuter, F. (2013) Practical Tools for Survey Sampling and Weighting. Berlin: Springer. West, B. T., Berglund, P. and Heeringa, S. G. (2008). A Closer Examination of Subpopulation Analysis of Complex Sample Survey Data. The Stata Journal, 8(3), 1–12. West, B. T., Kreuter, F. and Jaenichen, U. (2013). ‘Interviewer’ Effects in Face-to-Face Surveys: A Function of Sampling, Measurement Error or Nonresponse? Journal of Official Statistics, 29(2), 277–298. West, B. T. and McCabe, S. E. (2012). Incorporating Complex Sample Design Effects When Only Final Survey Weights are Available. The Stata Journal, 12(4), 718–725. West, B. T., Welch, K. B. and Galecki, A. T. (with Contributions from Brenda W. Gillespie) (2014). Linear Mixed Models: A Practical Guide Using Statistical Software (2nd edn). Boca Raton, FL: Chapman Hall/CRC Press. Wolter, K. (2007). Introduction to Variance Estimation (2nd edn). Berlin: Springer.
32 Analytical Potential Versus Data Confidentiality – Finding the Optimal Balance Heike Wirth
The need to protect respondents’ privacy and researchers’ need for the most detailed data possible are often in conflict. This is particularly true for data produced by statistical offices. Many statistical solutions have been proposed to overcome the trade-off between data confidentiality and data utility. However, this conflict cannot be resolved satisfactorily by statistical measures alone. Utility and privacy are linked, and, as long as data are useful, there will always remain some disclosure risk, even if it is miniscule. All statistical disclosure control methods lead to a loss of information either in the range of analyses or in the validity of estimations and thus limit the data utility for research purposes. Governments pay substantial amounts of money to statistical institutes to provide decision makers with high-quality data. Reducing the data quality ex post by disclosure control methods is counterproductive. The trade-off between confidentiality and utility can only be resolved if statistical institutes and researchers cooperate constructively to
address legislative, regulatory, and dissemination policies as well as practices.
INTRODUCTION
Over the last decades, research from many countries has shown that data collected by official statistics, i.e. official microdata, whether collected directly from respondents or indirectly through access to administrative data, are invaluable in addressing numerous research questions in the fields of demography, economics, epidemiology, and social policy (Doyle et al. 2001; National Research Council 2005; Lane and Schur 2009; Duncan et al. 2011). It is also evident that for many research purposes the analytical value of official data is lost if the data cannot be used as microdata, that is data consisting of individual units, such as persons or households. The researchers' need for the most detailed data possible often comes into conflict with
the need of statistical institutes to protect respondents’ privacy for legal, ethical, and practical reasons. Data collected by statistical institutes are generally based on the assurance of confidentiality, meaning that no third party will obtain access to identifiable data. However, unless no microdata are released at all, there is no such thing as a zero-risk scenario. All use of microdata, even in anonymized form, can imply a risk to confidentiality, even if it is miniscule. Statistical institutes are thus facing the challenge of providing access to detailed microdata for research purposes while at the same time safeguarding respondents’ privacy. This is not a new challenge. The issue of ensuring privacy as opposed to the analytical usefulness of official microdata has been discussed for several decades (Bulmer 1979; Duncan et al. 1993; Willenborg and de Waal 1996; National Research Council 2005). There is a multitude of literature on disclosure control methods and researchers specialized in this field have developed more and more sophisticated anonymization techniques over the last forty years. As Sundgren (1993: 512) noted ‘sometimes this research must give an external observer the impression that statistical publications, files, and databases, must be leaking like riddles, leaving statistical offices with no practical alternative but closing down its activities entirely’. All measures suggested by disclosure control research necessarily involve modifying the original data either by restriction of detail or perturbation. The more measures applied, the greater the loss of information, and the more restricted the range of analyses and the quality of estimates may become. Lately there is a growing body of scientific opinion that the trade-off between data confidentiality and data utility cannot be solved by statistical methods exclusively (Ritchie and Welpton 2011; Yakowitz 2011; Hafner et al. 2015). Useful microdata can only be released if the already existing broad range of legal and regulatory measures are taken seriously and become an integral part of the dissemination policy.
This implies that a clear distinction is made between data available to the general public and data available for research purposes only. Furthermore, it is vital to distinguish between a legitimate user who uses the data for research purposes with no intent to violate confidentiality and a criminal intruder who seeks to identify respondents (Cox et al. 2011). This article discusses disclosure issues associated with microdata on individuals produced by official statistics, regardless of whether these microdata are collected by surveys or through access to administrative sources. Microdata on establishments are not considered here, as these data have different disclosure issues in terms of e.g. coverage, skewness, publicly available information, and high visibility of large establishments. While the focus in the following is mainly on official microdata, the principal issues of data confidentiality and disclosure control methods are equally applicable to data collected by academia.
WHY IS DATA CONFIDENTIALITY IMPORTANT?
The concept of privacy and data confidentiality has a long-standing tradition in official statistics as well as in the social sciences (Bisco 1970; Bulmer 1979; Boruch and Cecil 1979; Sieber 1982; Duncan et al. 1993; Fienberg 1994) and is reflected in ethical standards for statisticians and codes of ethics in the social sciences. Ensuring data confidentiality is often also required by law or regulations as for example in the European Regulation on European Statistics (EC Regulation No 223/2009), the European Regulation on Access to Confidential Data for Scientific Purposes (EC Regulation No 557/2013), or the Data Protection Regulation (EC Regulation No 45/2001; EC Proposal No 15039/15). In the social sciences, as in all other research disciplines where personal data
are collected, it is essential to keep the data confidential and not to disclose identities of respondents. This is even more particularly the case if the research interest involves rather sensitive topics, e.g. sexual behavior, victimization, or deviant, delinquent, or criminal activities. The willingness of persons to participate in surveys and provide very private information is closely connected with their trust in researchers to keep the data confidential and not to give away any identifiable information. Any breach of confidentiality, whether accidental or intentional, could harm the participants, might reduce the willingness to participate in surveys and provide truthful answers, and in consequence might damage the empirical foundation of the social sciences. It should be added that research data is not protected by law that would grant professional secrecy to researchers. Irrespective of any confidentiality pledges made to respondents, researchers seldom have a legal statute comparable to that of doctors, notaries, and lawyers to back up such guarantees. Thus, research data could be and have been subject to court orders or subpoenas (Knerr 1982; Lee 1993; Fienberg 1994; Picou 2009; Bishop 2014). Social scientists, therefore, have every reason to care about data confidentiality and privacy of their respondents. The confidentiality of data collected by official statistics is not threatened by court orders, but ever since the public debate over confidentiality and privacy issues was triggered some decades ago, respondents are more concerned about who might have access to data collected by official statistics and are less willing to participate (Turner 1982; Duncan et al. 1993; Singer 2001). Even if privacy is guaranteed by law, respondents do not always trust in confidentiality pledges made by statistical offices (Fienberg 1994: 117). As a result, statistical institutes are concerned that any misuse of official data might further negatively affect the willingness to cooperate in surveys and have become very cautious regarding what data to release for research purposes.
DISCLOSURE
Disclosure refers to the identification of responding units in released statistical information (microdata or tabular data) and the revelation of information which these respondents consider as confidential. Identification takes place when a one-to-one relationship can be established between a particular record within a microdata file (not including direct identifiers) and a known individual in the population. This is called identity disclosure or re-identification. Other types of disclosure such as attribute disclosure and inferential disclosure do not require re-identification (Duncan et al. 1993: 23ff) and thus are not of interest in the following.
The process leading to re-identification can be described as follows: a user of microdata has external data at his or her disposal which not only include direct identifiers (names, addresses, social security numbers, etc.) but also information present in both external data and official microdata. Such overlapping information is also given the label key variables. By linking the key for a specific record within the external data to a record within the microdata in a one-to-one relationship, identification takes place if it can be established that the information pertains to the same individual (Paass 1988; Marsh et al. 1991).
Assessing the Disclosure Risk
Measured by the multitude of literature on disclosure control methods, any release of microdata seems to be a high-risk situation. The principal arguments put forth in support of the high-risk hypothesis, notably advances of computer technology and the increasing availability of information, have not changed much over time. While in the 1970s the high risk was accounted for by newly-founded databases and increasing computing capacities (Bisco 1970), some decades later a soaring risk is assumed due to rising computing power, an ever increasing volume of data
from public and private sources, and the availability of enormous amounts of information on the internet (Doyle et al. 2001; Sweeney 2001). While the perceived risk is regarded as very high, surprisingly little is known regarding the actual risk. Apart from a few examples demonstrating that re- identification is possible under specific conditions, up to now there are no known incidences of willful or accidental real-life disclosure of microdata released for research purposes (Ohm 2009; Yakowitz 2011).1 This is not a coincidence but a direct consequence of the social sciences’ subject matter. The focus of social scientists is on describing and explaining social behavior and social processes, not on individual persons. Statistical analyses do not require that the identities of the respondents are known. Research questions might be ‘how many’, ‘to what proportion’ or ‘why’, but not ‘who’. In terms of professional behavior, any attempt to disclose identities in anonymized microdata is a waste of resources which will get no credits in the scientific community. Social scientists are not necessarily more trustworthy than the average population but there is simply no professional rationalization for disclosure of anonymized microdata. Over the decades several measures of disclosure risk have been developed, but none has been widely accepted. Given the large variety of microdata in terms of topics, observation units, data collection processes or sample fraction, it is difficult to make informed statements regarding the general risk of disclosure. Even when focusing on one particular type of microdata, risk assessment is a challenge due to the manifold preconditions which might affect any disclosure attempt. Such preconditions include the availability of external data, motivations, statistical and computation skills, data inconsistencies, likelihood of success, consequences of a disclosure attempt, and whether the goal can be achieved by other means (Marsh et al. 1991; Elliot 2001; Federal Committee on Statistical Methodology 2005; Skinner 2012).
Research in statistical disclosure control often reduces the complexity of disclosure scenarios by applying a simplistic risk model (Domingo-Ferrer 2011). Non-statistical, legal, and subjective risk factors are not part of this model. In its most simple form the model assumes a perfect data world in which external data covering the entire population and perfectly matching official microdata are available at will. In addition, the data user is put at the same level as a hypothetical ‘intruder’ who not only has access to these perfect data but also uses them to re-identify anonymized units in official microdata (Hundepool et al. 2012: 40). Thus, no distinction is made between a legitimate user of microdata and a malicious, criminal intruder. To address this problem in the following, the term ‘intruder’ is not used. Instead I refer to microdata user or researcher.
Measures of Disclosure Risk
Most analytical risk measures are based on the concept of uniqueness. Given a set of key variables, a record is called unique if no other record shares its combination of values for the specified key. A distinction is made between population and sample uniqueness. Population uniqueness refers to the proportion of units that are unique in the population with regard to the key variables (Skinner and Elliot 2002). Let us assume that a user of microdata knows a person who is a population unique on specified key variables and there is precisely one record with these characteristics in the microdata file: in this case identity disclosure has taken place. However, if no census or administrative data covering the total population and including key variables are available, it is not easy to determine population uniqueness (Federal Committee on Statistical Methodology 2005: 86). Sample uniques are respondents who are unique with respect to the information in the key variables compared to all other records in the sample. Risk measures based on sample
uniqueness assume that a user will focus on unique records in the released microdata and then attempt to find corresponding units in the population. However, while every population unique is also unique in a sample, sample uniqueness need not be equal to population uniqueness. A variation of the uniqueness approach focuses on risky records or special units in the microdata file (Elliot et al. 1998). These are records in the data which represent respondents with rare characteristics such as for example a widowed stonemason, 35 years old, single father of two underage sons. The underlying idea is that such units are more likely to be also population uniques and therefore at a higher risk. If the characteristics of the above-mentioned stonemason are unique in the data and if the data also include detailed geographic information the data user could screen local registers (if available) in order to check whether more than one stonemason with these characteristics lives in the known region. If there is only one stonemason in the registers, it is very likely that the microdata record corresponds to this stonemason. However, the specific risk is not due to the rare characteristics of the record but to the availability of geographic details in the microdata. Contrary to the uniqueness approach, the risky record approach does not equate uniqueness with being at risk. Nevertheless the crucial question regarding how to identify risky records is not easy to answer (Elliot et al. 1998; Holmes and Skinner 1998; Skinner and Elliot 2002). Sample uniqueness can represent a high risk if a researcher knows that a specific person has participated in a survey (response knowledge) and that therefore the data of this person are included in the microdata file. Suppose the researcher knows that the above-mentioned stonemason has participated in the survey and he/she finds a unique record with precisely these characteristics in the microdata file. It can thus be logically concluded that this microdata record stems from the known stonemason.
Müller et al. (1995) and others have pointed out that analytical risk measures based on the uniqueness approach are flawed because they imply that the key variables needed for disclosure are represented in an identical way in both data files. Any inconsistencies which might be caused by measurement errors, response errors, missing values, coding, and differences in the mode or time of data collection are left aside. This leads to a considerable overestimation of the actual disclosure risk, actually indicating the upper limit of a worst case scenario in a perfect data world (Hundepool et al. 2012: 40). Since linking official microdata with external information is essential to re- identification, a better estimation of the real-life risk might be obtained by carrying out such a linkage; this is the so-called empirical approach. Given the long history of disclosure control research there are surprisingly few studies following this empirical approach and carrying out disclosure experiments using real data and with a verification of the matches by a third party (Blien et al. 1992; Müller et al. 1995; Elliot and Dale 1999; Bender et al. 2001). Though these studies used specific microdata files, some of the findings can be generalized: (1) the public availability of external, large-scale data sources with personal identifiers and matching key variables to official microdata is not as prevalent as it is suggested by disclosure control literature; (2) as compared to the perceived disclosure risk (measured by uniqueness in the key variables) the actual disclosure (measured by proportion of correct matches among the total number of cases) is low, indicating a high level of data inconsistencies between different data sources; and (3) the more key variables needed to achieve uniqueness the higher the probability of data inconsistencies and hence the number of nonmatches and mismatches (Blien et al. 1993). While the assumption ‘the more information the higher the risk’ is valid in an idealized data world, it is rather questionable under real data conditions. In fact, it is very specific
information, in particular response knowledge and fine geographic details, which entails a high disclosure risk. A crucial disadvantage of the empirical approach is its resource-consuming nature. Nevertheless, one should keep in mind that a data user attempting a real disclosure would have to make a similar effort. Recently the interest in estimating disclosure risks by using an empirical approach has been increasing (O’Hara 2011; Elliot 2011).
DISCLOSURE CONTROL METHODS
As mentioned above, statistical analyses do not require that the identities of the respondents are known. Thus, any direct identifiers can be removed before releasing the microdata. In addition, further disclosure control methods can be applied to safeguard data confidentiality. In principle there are two types of methods, which are referred to as non-perturbative methods and perturbative methods. Non-perturbative methods diminish disclosure risk by reducing the level of detail in the data. Perturbative methods reduce the risk by altering the data themselves and thus introducing uncertainty in the data. Both methods lead to a loss of information. While the reduction of detail could restrict the possible range of data analyses, perturbative methods may affect the validity of analyses.
Non-Perturbative Methods
Within the group of non-perturbative methods, different strategies are used to reduce the level of detail in the data. Most of them are suitable for categorical data but not necessarily for continuous variables.
Sampling
One of the most common strategies to protect confidentiality is sampling or subsampling.
Sampling reduces the disclosure risk by introducing an element of uncertainty. Units no longer included in the microdata file cannot be re-identified. Furthermore, sampling reduces the precision with which re-identification can be achieved. Suppose a microdata user has response knowledge, i.e. he/she knows that a particular person has participated in the survey. It might then be possible to identify that person by using known characteristics, given that the combination of these characteristics is unique in the microdata file. However, if only a (sub-)sample is released, there is always the possibility that a unique record in the sample corresponds to other individuals in the population (statistical twins) who may not be included in the sample. Thus, even if there is a unique record with the known characteristics in the sample, there is no certainty that this record corresponds to the known participant. It would only be possible to check whether sample uniqueness corresponds to population uniqueness if the external data covered the entire population.
Suppression of Information
There are different types of data suppression. Global suppression refers to instances in which specific variables are deleted for all records. Local suppression refers to instances in which only specific variable values or certain cases are suppressed. Some information is known to pose a specific threat to data confidentiality. Above all this is the case for highly detailed regional information, such as a detailed postal code. As Fienberg (1994: 121) puts it ‘after specific identifiers, geographic information poses one of the greatest risks for disclosure’. Detailed regional information not only makes it easier to search external data for information which can be directly linked to microdata, but also makes it easier to check whether sample uniqueness corresponds to population uniqueness as has been demonstrated in the case of the ‘re-identification’ of William Weld (Barth-Jones 2012).
As a consequence, global suppression of geographical details is the most widely applied method of disclosure control. One approach is to limit the degree of geographical information so that no information on areas with a population size below a certain specified limit is released. Other approaches use the hierarchical structure of administrative boundaries or use geographical classifications which take particular regional characteristics into account. Next to geographical information, the precise date of birth (day, month, year) is also very often subject to global suppression. In most cases day and month of birth are either suppressed or replaced by less precise information (e.g. quarter of year). Global suppression of information limits the range of analyses possible with the data. This seems to be particularly problematic for researchers needing geographical details. In the case of local suppression, only specific values of one or more variables, or values for specific cases, are suppressed. These could be values with a very low incidence in the data, such as a specific occupation, or values with a very low incidence in the population (Willenborg and de Waal 2001).
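To make the distinction concrete, a minimal sketch of both types of suppression is given below (Python with pandas); the variables, values and the incidence threshold are purely illustrative and not taken from any actual release practice.

```python
import pandas as pd

df = pd.DataFrame({
    "postal_code": ["10115", "80331", "50667"],
    "occupation":  ["teacher", "stonemason", "teacher"],
    "age":         [42, 35, 58],
})

# Global suppression: delete a variable that is too disclosive for all records.
df = df.drop(columns=["postal_code"])

# Local suppression: blank out only those values whose incidence in the data
# falls below a chosen threshold (here: fewer than 2 records).
counts = df["occupation"].value_counts()
rare_values = counts[counts < 2].index
df.loc[df["occupation"].isin(rare_values), "occupation"] = pd.NA
```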
Coarsening
A further non-perturbative method is coarsening, aggregating, or recoding by collapsing values of a variable so that values with low counts are merged. If variables are hierarchically structured, different degrees of coarsening can be achieved by going from the most detailed level of information to successively less detailed levels of information. The regional classification NUTS (Nomenclature of Territorial Units for Statistics), for example, is a hierarchical breakdown of the EU territory into major regions (NUTS1), basic regions (NUTS2), and small regions (NUTS3). This means NUTS3 is subsumed in NUTS2, and NUTS2 is subsumed in NUTS1. Similar to the NUTS approach for regional information, there are also hierarchical classifications for occupation, industry, or education, to mention just a
few. As long as the level of detail is still meaningful for research purposes, coarsening is an efficient and effective disclosure control method. However, if the level of aggregation becomes too coarse, the analytical value of the data may dwindle. Coarsening nominal variables may sometimes prove to be difficult. Country of birth is a typical example of this. One possibility would be to aggregate countries into geographical areas or political associations, such as ‘local’, ‘EU countries’ and ‘non-EU countries’. This type of coarsening could have serious repercussions on the analytical potential of the data, especially if rather heterogeneous groups are combined or the aggregation changes over time. An alternative approach is local recoding by setting specific thresholds for the categories, so that only values with a certain incidence in the population are released and those below a given limit are aggregated or recoded. When it comes to continuous variables such as age or income, information reduction can be achieved by grouping. The loss of information is at least twofold. First, for many research questions such information is needed as continuous data (e.g. age-wage profiles). Second, the specific groupings are often not appropriate for certain research questions.
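As a simple illustration of coarsening, hierarchical codes can be truncated to a higher level of the classification and continuous variables can be grouped into bands. The sketch below (Python with pandas) is only illustrative: the regional codes, the assumption that dropping trailing characters yields the next-higher level, and the age bands are not prescribed by any standard.

```python
import pandas as pd

df = pd.DataFrame({
    "region_detailed": ["DE212", "FRK26", "ITC4C"],  # illustrative detailed codes
    "age": [23, 47, 101],
})

# Coarsen region: keep only the leading characters, i.e. a higher level
# of the hierarchical classification.
df["region_coarse"] = df["region_detailed"].str[:3]

# Coarsen age: recode the continuous variable into broad groups;
# the band widths are a design choice of the data producer.
df["age_group"] = pd.cut(df["age"], bins=[0, 29, 49, 69, 120],
                         labels=["<30", "30-49", "50-69", "70+"])
```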
Top- and Bottom-Coding
Units at the top or the bottom of a distribution are perceived to be at a higher risk if their values are either extreme (e.g. persons with a wealth of 79 billion dollars) or if they represent a small group in the population (e.g. persons older than 100 years). One way of protecting these units is to censor all values above or below a certain threshold (top-coding/bottom-coding); this is a variation of coarsening. The censoring can be done by replacing the values with a given summary statistic or with the threshold value itself (Duncan et al. 2011: 120). Age and all types of income data are often routinely subjected to top-coding. The loss of information can be at least partly compensated by providing the mean, median, and variance of the censored values. However, for analyses of income inequality, top-coding may impact the findings because it suppresses income dispersion. Moreover, if the censoring is not done consistently over time, it may also impact trend analyses (Jenkins et al. 2009). Non-perturbative methods work by reducing the level of detail in the data. The resulting data are no less valid but less detailed. Over time, the problem of preserving data coherence may arise when coarsening boundaries are changed. Take for example a coarsened geographical variable including only the values ‘local’, ‘EU country’, and ‘non-EU country’. Over the last decade, many previously non-EU countries became EU countries. Thus this variable is not coherent over time and very likely not useful for time series analyses. However, non-perturbative methods are cost-efficient, and their effect on data quality is apparent and well understood by social scientists. They might somewhat limit the range of possible analyses, but they do not endanger the analytical validity of the data.
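To illustrate the recoding logic of top-coding described above, the following minimal sketch (Python with pandas; the income values and the threshold are invented) replaces all values above a threshold by the threshold itself and keeps the mean of the censored values, which could be released as accompanying metadata.

```python
import pandas as pd

income = pd.Series([18_000, 42_000, 55_000, 310_000, 2_500_000], name="income")

threshold = 150_000                                  # illustrative top-coding limit
censored_mean = income[income > threshold].mean()    # could be documented as metadata

# Replace every value above the threshold by the threshold value itself.
income_topcoded = income.clip(upper=threshold)
```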
Perturbative Methods
While from a data user’s perspective non-perturbative methods might be preferable, researchers specialized in disclosure control tend to have a preference for the more sophisticated and statistically challenging perturbative methods. These methods basically distort the original data in various ways. The aim is to block identification strategies both by introducing uncertainty as to whether or not a match is correct and by reducing the usefulness of any information revealed. Every now and then it is suggested that implementing perturbative methods allows the release of an entire microdata file (Hundepool et al. 2012: 53) and therefore might overcome limitations in analyses set by non-perturbative methods. However, up
to now there is no empirical evidence to support this claim. Over the last four decades research on perturbation has been extended and refined as documented by a vast body of literature. For the latest developments in this area of research see Duncan et al. (2011) and Hundepool et al. (2012).
Noise Addition
The protection of confidentiality by adding (additional) noise to data has a long-standing tradition in disclosure control (Sullivan and Fuller 1989). For continuous variables this can be achieved by adding random numbers to the values or multiplying them by random factors. For categorical variables noise addition is not a very practical method. There are different types of noise addition, such as adding correlated or uncorrelated noise and including linear or nonlinear transformations (Hundepool et al. 2012: 54ff). While (correlated or uncorrelated) noise can be added in such a way that some characteristics (e.g. means and covariances) are preserved for specified variables, it is very likely that characteristics of non-specified variables are not preserved. Moreover, the level of protection provided by simple noise addition is low. If noise addition and linear transformation are combined, the sample means and covariances are biased compared to the original data and the univariate distributions of the perturbed variables are not preserved. The estimates could be adjusted by the researcher if the parameters used for the linear transformation were known. However, this knowledge would make it possible to undo the linear transformation and would thus compromise data confidentiality. Multiplicative noise approaches offer a higher level of protection and preserve means and covariances for the specified variables, but once again the analytical validity of non-specified variables is not guaranteed.
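A minimal sketch of simple uncorrelated, additive noise is shown below (Python with numpy; the noise level is an arbitrary choice). As noted above, such simple noise addition offers only a low level of protection and distorts relationships with variables that are not explicitly controlled for.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
income = np.array([18_000.0, 42_000.0, 55_000.0, 310_000.0])

# Add uncorrelated normal noise whose standard deviation is a fraction
# of the standard deviation of the original variable.
noise_sd = 0.1 * income.std()
income_noisy = income + rng.normal(loc=0.0, scale=noise_sd, size=income.shape)
```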
Data Swapping – Rank Swapping
The method of data swapping consists in perturbing the values of key variables by switching them between individual records.
Basically records are first grouped according to predefined criteria. Next, the values of specific variables are exchanged among records in such a way that specific univariate or multivariate distributions are maintained. A variation of data swapping is rank swapping. First, the values of given variables are ranked. Next, the ranked values are exchanged among records within a certain specified range (Hundepool et al. 2012). Even though the low-order margins are maintained, multivariate statistics may be distorted, because in practice it will not be possible to control for all multivariate relationships. Empirical evidence that data swapping might extensively damage the validity of analyses is given by a study carried out by Alexander et al. (2010). When working with public use microdata of the 2000 Census, the 2003–2006 American Community Survey, and the 2004–2009 Current Population Survey, the researchers became aware of discrepancies in the distribution of age by gender. Estimations based on the public data differed up to 15 percent from corresponding counts in tables published by the Census Bureau. As a consequence, the employment rate was underestimated for certain age groups and overestimated for others. Evidently this bias was caused by a misapplication of the data swapping method. As Alexander et al. (2010: 565) note ‘users have spent nearly a decade using these data sources without being made aware of the problem’ and recommend that these data should not be used ‘to conduct analyses that assume a representative sample of the population by age and sex for people 65 years of age and older’ (2010: 568).
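A minimal sketch of the rank swapping variant described above is given below (Python with pandas/numpy). It is a simplified illustration only: records are ranked on a single continuous variable and each value is exchanged with a randomly chosen partner within a limited rank distance; the window size p is an arbitrary choice, not a recommended setting.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

def rank_swap(values: pd.Series, p: int = 2) -> pd.Series:
    """Swap each value with a partner at most p rank positions away."""
    order = values.sort_values().index.to_numpy()  # record labels ordered by rank
    swapped = values.copy()
    used = set()
    for pos, idx in enumerate(order):
        if idx in used:
            continue
        # candidate partners within the rank window that have not been swapped yet
        window = [order[j] for j in range(pos + 1, min(pos + 1 + p, len(order)))
                  if order[j] not in used]
        if window:
            partner = rng.choice(window)
            swapped[idx], swapped[partner] = values[partner], values[idx]
            used.update({idx, partner})
    return swapped

income = pd.Series([18_000, 42_000, 55_000, 61_000, 310_000])
income_swapped = rank_swap(income)
```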
Data Blurring and Microaggregation
Data blurring refers to the aggregation of values across a specified number of respondents for given variables, whereby individual values are replaced by the aggregate value. The aggregation may be across a fixed number of respondents, for example as determined by a given rule. The group size should
be at least three respondents. The term microaggregation is used when respondents are grouped based on proximity measures and these groups are used for calculating the aggregate values (Federal Committee on Statistical Methodology 2005: 91). In its simplest form, microaggregation is based on the ranking of values for selected variables. Groups are subsequently formed on the basis of predefined rules and then the individual values of all group members are replaced by an aggregate value (e.g. mean, mode, median) separately for each variable (Hundepool et al. 2012: 63). Multivariate approaches either aggregate all variables of interest at the same time, or form groups of two or more variables which are then independently aggregated. Although univariate microaggregation can be done in a way which preserves important features of a specific variable, it is not regarded as a very effective disclosure control method. And while multivariate microaggregation can reduce the disclosure risk, it endangers the analytical validity of the data. There is a wide range of additional perturbative disclosure control methods, including the transformation of original data into synthetic data (Duncan et al. 2011; Hundepool et al. 2012). From a user’s perspective all of these methods entail a drastic reduction in data utility and can cause serious data damage which is not easily visible to researchers. While various measures of the analytical validity of data anonymized by perturbative methods have been proposed, none of these has been widely accepted (Federal Committee on Statistical Methodology 2005: 85). The development of such measures is difficult because of the multi-purpose use of microdata by researchers. While the analytical validity might be ensured for one research question, this need not be true for other research questions. Even if researchers were aware that the microdata they are using have been disclosure-controlled, they would have to spend a lot of additional resources to ensure the validity of their findings. In
addition, statistical institutes applying perturbative methods are somewhat hesitant to make the methods transparent because this might compromise data confidentiality (Lane and Schur 2009).
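To make the univariate microaggregation procedure described above concrete, the following minimal sketch (Python with pandas) ranks the values, forms groups of at least k = 3 consecutive records, and replaces every value by its group mean; the example data and the choice of the mean as aggregate are illustrative only.

```python
import pandas as pd

def microaggregate(values: pd.Series, k: int = 3) -> pd.Series:
    """Univariate microaggregation: replace each value by the mean of its group
    of at least k neighbouring records in the ranking."""
    order = values.sort_values().index                  # record labels ordered by rank
    n_groups = max(len(order) // k, 1)                  # remainder merges into the last group
    group_ids = pd.Series(
        [min(pos // k, n_groups - 1) for pos in range(len(order))], index=order
    )
    return values.groupby(group_ids).transform("mean")

income = pd.Series([18.0, 25.0, 31.0, 40.0, 44.0, 52.0, 300.0])
income_agg = microaggregate(income)
```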
CONTROLLED DATA ACCESS
The challenge of providing access to official microdata for research purposes while protecting confidentiality continues to be a subject of high interest. On the one hand, the utility of disclosure-controlled microdata is sometimes debatable: while perturbation endangers the validity of research findings, information reduction limits the range of analyses. On the other hand, there is an increasing demand by social scientists for better access to old-style microdata as well as to new-style data, which combine different data sources produced by official statistics and administration. In order to address these needs, more and more statistical agencies provide access to microdata for research purposes through different modes, such as licensing agreements, remote execution, remote access, or secure sites. The precise arrangements for access to microdata for research purposes vary from statistical agency to statistical agency and from country to country, depending on national laws and regulations. Even within existing EU regulations some EU countries are more hesitant to release microdata than others. An overview of current release practices of national statistical institutes is provided by a United Nations report (2007). Licensing agreements imply a formal contract between researcher and statistical agency whereby the researcher agrees to protect the data according to the laws and other data confidentiality policies as defined by the statistical agency. In return, the researcher receives a scientific use file for a given research purpose, which is more detailed than public use data.
The use of microdata via remote execution, remote access, or secure sites also requires a formal agreement between statistical agency and researcher. Since the data do not leave the premises controlled by the agency, researchers often have access to the full range of the microdata file (typically excluding direct identifiers). In the case of remote execution, statistical programs are sent to the statistical agency and executed there. The output is reviewed for disclosure issues before it is sent back to the researcher. Remote access allows a researcher to work with the data as if they were on his or her own computer. However, the output is screened for disclosure issues before it is released to the researcher. In the case of secure sites, the data can be used in a particular location, such as the statistical institute itself or in a licensed research data center using a stand-alone computer. Again the statistical output is subject to an extensive disclosure control review. From a researcher’s perspective neither remote execution nor secure sites are preferable solutions. These modes of access are not only costly in terms of time and money but also do not allow for the exploratory character of the research process (Smith 1991). Remote access is more adapted to the nature of the research process and thus preferable. However, none of these approaches resolves the risk–utility trade-off. Instead, the perceived disclosure risk associated with microdata is merely transferred to the statistical outputs. Statistical outputs are not to be confused with published results. Instead, the term statistical output in this context refers to any finding based on analyses carried out by a researcher via remote access or execution. Any coefficient in an output which could be used to reproduce tabular data presents a disclosure risk if the corresponding tabular data imply a disclosure risk (Federal Committee on Statistical Methodology 2005: 85; Hundepool et al. 2012: 211ff). For example, a commonly implemented rule of thumb is not to release residuals or plots of residuals. In the case of linear and non-linear regression it
is widely recommended that at least one coefficient is withheld. Placing such limitations on the outputs to be released to the researcher can restrict the effective range of analysis.
CONCLUSION
It seems that the trade-off between data utility and data confidentiality cannot be resolved as long as researchers interested in using microdata are placed on the same level as malicious intruders. Laws and regulations protecting data confidentiality are binding for all, not only for statisticians or the staff of statistical agencies. In addition to laws and regulations, the licensing of researchers and their institutions adds a further safeguard by establishing a variety of restrictions and threats of punishment. Nevertheless, these measures apparently do not inspire sufficient confidence in the professional integrity of external researchers and their careful treatment of confidential data. More research into even more sophisticated disclosure control methods will not alleviate this lack of confidence. In order to solve this basic problem of the risk–utility trade-off, statistical institutes and researchers should cooperate more intensively to make sure that the existing legal and regulatory measures are taken seriously by both sides and become an integral part of the dissemination policies of official statistics. Comparative research faces the additional challenge of having to abide by national data protection regulations that apply different strategies to protect confidentiality. Where research is based on existing national data sources, different anonymization strategies may lead to a loss of comparability and, in the end, to lower data quality. Researchers interested in generating comparative data face the additional difficulty of having to deal with different rules. This increase in complexity may discourage researchers from engaging in comparative research if data confidentiality concerns are pushed to the extreme, and may therefore lead to a loss of valuable scientific knowledge.
NOTE
1 Apart from these academic demonstrations, there has been one incident where journalists linked court cases (public use files including personal identifiers) with a public use file of the National Practitioner Data Bank (not including personal identifiers) in order to identify doctors with a malpractice history (Wilson 2011).
RECOMMENDED READINGS
For a still insightful discussion of the ethical, political, and technical problems surrounding the conflict between data confidentiality and data access for research purposes see Duncan et al. (1993) as well as Doyle et al. (2001). For a general model of identification risks see Marsh et al. (1991). An excellent overview regarding current principles and practice in statistical confidentiality is provided by Duncan et al. (2011). A summary of perturbative disclosure methods and their implementation by statistical institutes is given by Hundepool et al. (2012). For a critical review of the risk–utility paradigm see Cox et al. (2011). Arguments in favor of open access to official microdata are given by Yakowitz (2011).
REFERENCES
Alexander, J. T., Davern, M., and Stevenson, B. (2010). The polls review: inaccurate age and sex data in the census PUMS files: evidence and implications. Public Opinion Quarterly, 74 (3): 551–569. Barth-Jones, D. C. (2012). The ‘Re-Identification’ of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now. (July 2012). Retrieved
from http://ssrn.com/abstract=2076397 [accessed 2015-01-20]. Bender, S., Brand, R., and Bacher, J. (2001). Reidentifying register data by survey data – an empirical study. Statistical Journal of the United Nations Economic Commission for Europe, 18 (4): 373–381. Bisco, R. L. (ed.) (1970). Data Bases, Computers, and the Social Sciences. New York: Wiley-Interscience. Bishop, L. (2014). Re-using qualitative data: a little evidence, on-going issues and modest reflections. Studia Socjologiczne, 3 (214): 167–176. Retrieved from http://www.dataarchive.ac.uk/media/492811/bishop_reusingqualdata_stsoc_2014.pdf [accessed 2015-03-20]. Blien, U., Müller, W., and Wirth, H. (1993). Needles in haystacks are hard to find: testing disclosure risks of anonymous individual data. In Europäische Gemeinschaften, Statistisches Amt (eds), Proceedings of the International Seminar on Statistical Confidentiality in Dublin (Ireland) (pp. 391–406). Brussels. Blien, U., Wirth, H., and Müller, M. (1992). Disclosure risk for microdata stemming from official statistics. Statistica Neerlandica, 46 (1): 69–82. Boruch, R. F., and Cecil, J. S. (1979). Assuring the Confidentiality of Social Research Data. Philadelphia: University of Pennsylvania Press. Bulmer, M. (ed.) (1979). Censuses, Surveys and Privacy. London and Basingstoke: MacMillan. Cox, L. H., Karr, A. F., and Kinney, S. K. (2011). Risk-utility paradigms for statistical disclosure limitation: how to think, but not how to act. International Statistical Review, 79 (2): 160–183. Domingo-Ferrer, J. (2011). Discussions. International Statistical Review, 79 (2): 184–186. Doyle, P., Lane, J., Theeuwes, J., and Zayatz, L. (eds) (2001). Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies. New York: Elsevier. Duncan, G. T., Elliot, M., and Salazar-González, J. J. (2011). Statistical Confidentiality: Principles and Practice. Statistics for Social and Behavioral Sciences. New York: Springer.
Duncan, G. T., Jabine, T. B., and de Wolf, V. A. (1993). Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Washington, DC: National Academy Press. Elliot, M. (2001). Disclosure risk assessment. In P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds), Confidentiality, Disclosure and Data Access (pp. 75–90). New York: Elsevier. Elliot, M. (2011). Report on the Disclosure Risk Analysis of the Supporting People Datasets. Administrative Data Liaison Service. Retrieved from http://www.cmist.manchester.ac.uk/ medialibrary/archive-publications/reports/ [accessed 2015-01-20]. Elliot, M., and Dale, A. (1999) Scenarios of attack: the data intruder’s perspective on statistical disclosure risk. Special issue on statistical disclosure control. Netherlands Official Statistics, 14 (special issue): 6–10. Elliot, M., Skinner, C. J., and Dale, A. (1998). Special uniques, random uniques and sticky populations: some counterintuitive effects of geographical detail on disclosure risk. Research in Official Statistics, 1 (2): 53–67. Federal Committee on Statistical Methodology (2005). Report on statistical disclosure limitation methodology. Statistical Policy Working Paper 22 (2nd version). Retrieved from http:// www.hhs.gov/sites/default/files/spwp22.pdf [accessed 2015-01-20]. Fienberg, S. E. (1994). Conflicts between the needs for access to statistical information and demands for confidentiality. Journal of Official Statistics, 10 (2): 115–132. Hafner, H.-P., Ritchie, F., and Lenz, R. (2015). User-Centred Threat Identification for Anonymized Microdata. Working papers in Economics no. 1503, University of the West of England, Bristol. March. Retrieved from http://www2.uwe.ac.uk/faculties/BBS/BUS/ Research/Economics%20Papers%20 2015/1503.pdf [accessed 2015-08-20]. Holmes, D. J., and Skinner, C. J. (1998). Estimating the re-identification risk per record in microdata. Journal of Official Statistics, 14 (4): 361–372. Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Schulte-Nordholt, E., Spicer, K., and de Wolf, P.-P. (2012). Statistical Disclosure Control. Chichester: John Wiley & Sons.
Jenkins, S., Burkhauser, R. V., Feng, S., and Larrimore, J. (2009). Measuring inequality using censored data: a multiple imputation approach (March 2009). DIW Berlin Discussion Paper No. 866. Retrieved from http:// dx.doi.org/10.2139/ssrn.1431352 [accessed 2015-01-20]. Knerr, C. R. (1982). What to do before and after a Subpoena of Data arrives. In J. E. Sieber (ed.), The Ethics of Social Research. Surveys and Experiments (pp. 191–206). New York: Springer Verlag. Lane, J., and Schur, C. (2009). Balancing Access to Data and Privacy: A Review of the Issues and Approaches for the Future. RatSWD Working Paper Series 113. Berlin. Lee, R. (1993). Doing Research on Sensitive Topics. London: SAGE. Marsh, C., Skinner, C., Arber, S., Penhale, B., Openshaw, S., Hobcraft, J., Lievesley, D., and Walford, N. (1991). The case for samples of anonymized records from the 1991 census. Journal of the Royal Statistical Society, 154 (2): 305–340. Müller, W., Blien, U., and Wirth, H. (1995). Identification risks of microdata: evidence from experimental studies. Sociological Methods and Research, 24 (2): 131–157. National Research Council (2005). Expanding Access to Research Data: Reconciling Risks and Opportunities. Washington, DC: National Academies Press. O’Hara, K. (2011). Transparent Government, Not Transparent Citizens: A Report on Privacy and Transparency for the Cabinet Office. London: Cabinet Office. Retrieved from http://eprints.soton.ac.uk/272769/ [accessed 2015-01-20]. Ohm, P. (2009). Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Review, 57: 1701–1777. Paass, G. (1988). Disclosure risk and disclosure avoidance. Journal of Business Economics and Statistics, 6 (4): 487–500. Picou, J. S. (2009). When the solution becomes the problem: the impacts of adversarial litigation on survivors of the Exxon Valdez oil spill. University of St. Thomas Law Journal, 7 (1): Article 5. Ritchie, F., and Welpton, R. (2011). Sharing risks, sharing benefits: data as a public good.
In Work Session on Statistical Data Confidentiality 2011: Eurostat, Tarragona, Spain, October 26–28, 2011. Retrieved from http:// eprints.uwe.ac.uk/22460 [accessed 2015-01-20]. Sieber, J. (ed.) (1982). The Ethics of Social Research. Surveys and Experiments. New York: Springer Verlag. Singer, E. (2001). Public perceptions of confidentiality and attitudes toward data sharing by federal agencies. In P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds), Confidentiality, Disclosure and Data Access (pp. 341– 370). New York: Elsevier. Skinner, C. J. (2012). Statistical disclosure risk: separating potential and harm. International Statistical Review, 80 (3): 349–368. Skinner, C. J., and Elliot, M. J. (2002). A measure of disclosure risk for microdata. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64 (4): 855–867. Smith, J. P. (1991). Data confidentiality: a researcher’s perspective. In Proceedings of the American Statistical Association, Social Statistics Section (pp. 117–120). Alexandria, VA: American Statistical Association. Retrieved from http://econwpa.repec.org/ eps/lab/papers/0403/0403006.pdf [accessed 2015-01-20]. Sullivan, G.R., and Fuller, W.A. (1989) The use of measurement error to avoid disclosure. In JSM Proceedings, Survey Research Methods Section (pp. 802–807). Alexandria, VA: American Statistical Association. Retrieved from: http:// www.amstat.org/sections/srms/Proceedings/ papers/1989_148.pdf [accessed 2015-01-20]. Sundgren, B. (1993). Discussion: computer security. Journal of Official Statistics, 9 (2): 511–517. Sweeney, L. (2001). Information explosion. In P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds), Confidentiality, Disclosure and Data Access (pp. 43–74). New York: Elsevier. Turner, A. G. (1982). What subjects of survey research believe about confidentiality. In J. E. Sieber (ed.), The Ethics of Social Research. Surveys and Experiments (pp. 151–166). New York: Springer Verlag. United Nations (2007). Managing Statistical Confidentiality & Microdata Access. Principles and Guidelines of Good Practice. New York: United Nations Publication.
Willenborg, L., and de Waal, T. (1996). Statistical Disclosure Control in Practice. Lecture Notes in Statistics 111. New York: Springer. Willenborg, L., and de Waal, T. (2001). Elements of Statistical Disclosure Control. Lecture Notes in Statistics 155. New York: Springer. Wilson, D. (2011). Database on Doctor Discipline is restored, with restrictions. The New
501
York Times. November, 9. Retrieved from http://nyti.ms/uMISI14 [accessed 2015-01-20]. Yakowitz, J. (2011). Tragedy of the data commons. Harvard Journal of Law and Technology, 25. Retrieved from http://dx.doi. org/10.2139/ssrn.1789749 [accessed 2015-01-20].
33 Harmonizing Survey Questions Between Cultures and Over Time
Christof Wolf, Silke L. Schneider, Dorothée Behr and Dominique Joye
INTRODUCTION
Comparing societies across time or space is an important research approach in the social sciences. This approach allows studying how the collective context, such as economic conditions, laws, educational systems or welfare state institutions, shapes the values, attitudes, behaviors and life chances of individuals. Obviously, the validity of this kind of research depends on the quality of the underlying data. If data are gathered using a survey, then sampling, survey mode, question wording, translation, fieldwork practice or coding all affect the quality and comparability of the resulting data (some of these issues are discussed in Chapters 4, 12, 19, 20 and 23 in this Handbook). In this chapter, we focus on strategies to obtain comparative measures across different contexts, e.g. countries or societies at different points in time. These strategies – also referred to as harmonization approaches – focus on questionnaire development, including translation, as well
as on the processing of the resulting data. We present different approaches to harmonizing measures and discuss several ways to assure and assess comparability of the resulting data. Questions about the comparability of survey data arise in many different situations. For example, if we aim to analyze social change in the United States and for that purpose draw on the currently available data from the General Social Survey (GSS) waves 1972 to 2014, we may wonder whether the data, or more specifically the variables we are interested in, are comparable over time. However, if we are interested in studying a topic over time that is not covered by the GSS or any other cumulative data set, we have to search for different surveys covering the years we wish to study that contain the variables of interest. Should we find such data, then we can attempt to ‘harmonize’ them, i.e. render them comparable. Obviously, such an exercise rests on the assumption that the variables being harmonized refer to the same theoretical construct.
Ideally, the questionnaire items used in the different surveys should be identical and their meaning should not have changed across time. Turning to the international level, a similar assumption has to be made when working with cross-national surveys, like the International Social Survey Programme (ISSP) or the European Social Survey (ESS), which aim to produce variables with identical meaning and understanding across countries. In this chapter, we will look at harmonization mainly from the angle of cross-national surveys, even though much of what we write is also applicable to comparisons over time using monocultural surveys. Our presentation in this chapter rests on a simple model of measurement, as depicted in Figure 33.1 (for more on measurement please see Chapters 14 and 16 in this Handbook). Measurement should begin by carefully defining a theoretical concept. Based on this definition, one or more empirical indicators should be selected that are observable and valid representations of the concept. Based on these indicators, a target variable should be defined, i.e. a definition of the variable that should result after data collection and potentially (re-)coding of the data. Only then should the questionnaire item(s) be formulated which cover the empirical indicator(s) and which either capture the data directly as intended for the target variable or can be recoded to do so. In comparative research, in addition to validity and reliability, the challenge of
creating comparative measures has to be met. Broadly speaking, we consider measures to be comparable if similarities or differences in measurements over time or across countries reflect similarities or differences in the measured trait and cannot be attributed to method, i.e. the measurement process or any other aspect of the survey. That is, to ascertain the comparability of measures we have to rule out that differences in method affect the measurement.1 For inter-temporal analysis, we additionally have to assume that the meaning of questionnaire items does not change over time (see Smith, 2005), and for cross-cultural analysis we equally have to ensure that the meaning of (translated) items does not differ across countries. A more in-depth discussion of these issues is presented in the remainder of this chapter. We first describe common approaches to harmonizing survey questions and survey data. Then we discuss how comparability is assured by input and output harmonization approaches respectively. This is followed by a presentation of different methods to assess the quality of harmonized survey measures and a general conclusion.
Figure 33.1 From theoretical construct to questionnaire item. [The figure depicts a sequence from theoretical concept/construct (definition and scope, subdimensions, relevance) via empirical indicator(s) (observable representations of the concept and its subdimensions) and target variable (operational definition, label or description, values and their codes) to questionnaire item(s)/instrument (question stem, instructions, response options), supported by coding specifications and supporting variables.]
HARMONIZATION APPROACHES
Survey methodologists have invested heavily in developing procedures to harmonize
survey data, i.e. procedures aimed at ensuring comparability of survey data from different countries or time points. As Depoutot (1999: 38) puts it: ‘Harmonisation is a remedial action to improve comparability’. Two main types of harmonization can be distinguished (see Figure 33.2):
•• ex-ante harmonization (with the subtypes input and output harmonization);
•• ex-post harmonization (which is always output harmonization).
The main distinction is whether harmonization is an aspect foreseen before data collection, i.e. ex ante, and thus gets its place in survey design, or if harmonization is done on pre-existing data, i.e. ex post. While ex post harmonization necessarily can only aim at harmonizing the ‘output’, using the existing measures, ex ante harmonization may aim at harmonizing measurement instruments and procedures, i.e. input harmonization, or aim at the optimal realization of a pre-defined comparable target variable, i.e. ex-ante output harmonization. We will describe these approaches to harmonization now in more detail (see also Ehling, 2003).
If a survey is to be conducted in several contexts and the aim is to produce comparative measures, ex-ante input harmonization seems to be the natural choice. In this approach to harmonization, the same questionnaire is used in all contexts. Although we restrict our discussion here to questions and questionnaires it is important to note that from the Total Survey Error perspective (e.g. Groves et al., 2009) all elements of a survey should be considered when planning a comparative survey. Ideally, all these elements should be chosen to be the same in all contexts, e.g. sampling, mode of data collection, fieldwork procedures, etc. At first sight, this seems to be clear and input harmonization straightforward to apply. However, one soon realizes that it is not always possible to follow exactly the same protocol for all elements of a comparative survey in all countries. Let us consider a cross-cultural survey2: in this case, the master questionnaire usually has to be translated. Over the last 20 years, survey methodologists have developed specific translation procedures aimed at obtaining questionnaire translations that result in equivalent measurement (Harkness et al.,
Figure 33.2 Overview of harmonization approaches. [The figure distinguishes ex-ante harmonization, comprising input harmonization (translation, adaptation) and ex-ante output harmonization (pre-designed recoding), from ex-post harmonization, which is always output harmonization (recoding of existing data).]
2010; see Chapter 19 in this Handbook for a more thorough discussion of this topic). This approach by and large works if the question to be translated does not refer to any issue strongly shaped by specific institutions, culture or history of a country.3 In the latter case translation is bound to fail. Grais (1999: 54) gives an ingenious example for this (itself an illustration of the historical boundedness of the social realm): Mrs. Clinton is the First Lady of the United States. Who is the First Lady of France? Madame Chirac? No doubt, but Madame Jospin might also be a valid candidate given the major role played by a prime minister in France. And who is the First Lady of the United Kingdom? This is where things become complicated. Prince Philip is certainly a possible candidate, but Mrs. Blair is, too, since the Queen of England is not a president and the political role of the prime minister is certainly more important and closer to that of a president. But the notion of First Lady is made up of two concepts: ‘first’ and ‘lady’, and from this point of view Queen Elizabeth herself might be a better candidate.
As this example demonstrates, some terms cannot be easily transferred from one country/culture to another. Instead, we have to find more abstract concepts underlying such notions that are relevant in a variety of cultures, find comparable empirical indicators for them and then word questions and response options accordingly. For this example, a possibility would be to find the best term in each language to describe the concept ‘spouse of the head of state’. This example also shows that an empirical referent for a concept may not necessarily be found in each country or cultural context. Just think about the Vatican where, for the time being, the existence of a spouse of the head of state is ruled out by ecclesiastical law. Therefore, in the first step of instrument design for cross-cultural surveys, a common understanding of the theoretical concept to be measured needs to be established and documented, including a working definition and specification of the ‘universe’ of manifestations for this concept (or its scope).
For example, with respect to the concept of ‘educational attainment’, survey designers need to decide whether they want to include vocational training and/or non-formal continuing education in the concept of education or whether they want to focus on schooling and academic higher education only (see also Hoffmeyer-Zlotnik and Wolf (2003) and Chapter 20 on the measurement of background variables, which are often strongly affected by national culture, history and institutions, in this Handbook). The (ideally) cross-cultural survey design team needs to make sure at this point that this theoretical concept is meaningful and relevant in all cultures to be covered by the study. In the second step, cross-culturally comparable empirical manifestations or indicators for the theoretical concept need to be specified. Theoretical arguments, prior research and knowledge of the countries in question will serve as important guidelines as to which specific indicators to choose (or whether several need to be envisaged and the responses then combined). Importantly, the cross-cultural ‘portability’ of the indicator(s) needs to be considered here. For example, in order to measure the theoretical concept of ‘political participation’, democratic countries without compulsory voting can use the indicator of ‘voting’, but this indicator cannot be used in non-democratic countries (where the indicator does not apply) or democratic countries where voting is mandatory (where the indicator has a different meaning). Participation in demonstrations may be an indicator that would be equally valid in both types of democratic countries but may not have the same meaning in nondemocratic societies. Finally, questionnaire items have to be chosen or designed that capture the indicators of interest. As with surveys in general, multiple (at least three) items should be envisaged whenever possible to facilitate reliability assessment and latent variable modeling, as well as assessment as to whether cross-national differences are
substantive or linguistic/methodological (Smith, 2004; see also Chapters 14, 34 and 39 in this Handbook). In order to maximize standardization and thus comparability, the same questionnaire items should be used across countries whenever possible (Ask the Same Question-approach). If this is impossible, cross-cultural equivalents need to be found that equally well represent the theoretical concept across cultures (Ask a Different Question-approach). Within the Ask a Different Question-approach it is useful to further distinguish between questions for which only the response categories have to be adapted – and often later recoded into a common code frame – and questions in which the question stem also has to be adapted to different context-specific circumstances. Input harmonization, i.e. asking the same question and using the same answering options, may not be possible with respect to all concepts and indicators of interest, such as political parties voted for, or educational qualifications obtained. An adequate approach for cases in which the question stem refers to the same concept and indicator, but response categories need to refer to country-specific manifestations, is ex-ante output harmonization. In this approach, one also first agrees on the theoretical concepts that should be measured, as well as comparable empirical indicators. Then, however, one defines a comparable target variable together with the values this variable can obtain, i.e. a code frame, before designing the country-specific questionnaire items. This additional process aims to inform the design of country-specific response categories, produced in the next step, so that they can eventually be coded into the target variable. How this information is collected is then left to the national teams, or the correspondence between country-specific responses and target variable codes is centrally coordinated to some degree. That is, one agrees on equivalent output measures but allows for variation of the survey questions, especially response options, used to produce them.
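What coding country-specific responses into a pre-defined target variable can look like in practice is sketched below (Python). Both the national response categories and the three-category target code frame are invented for illustration; they do not reproduce the coding scheme of any actual survey.

```python
# Hypothetical harmonized target variable: 1 = low, 2 = medium, 3 = high education.
TARGET_LABELS = {1: "low", 2: "medium", 3: "high"}

# Country-specific response options mapped onto the common code frame
# (categories and mappings are invented for illustration).
MAPPING = {
    "DE": {"Hauptschulabschluss": 1, "Mittlere Reife": 2, "Abitur": 2,
           "Hochschulabschluss": 3},
    "FR": {"Brevet": 1, "Baccalauréat": 2, "Licence ou plus": 3},
}

def harmonize_education(country: str, national_response: str) -> int:
    """Recode a country-specific answer into the harmonized target variable."""
    return MAPPING[country][national_response]

print(harmonize_education("DE", "Abitur"))  # -> 2 ("medium")
```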
Surveys coordinated by Eurostat on behalf of the European Union, such as the EU Labour Force Survey, are extreme cases of ex-ante output harmonization in that there is only minimal central coordination of questionnaire (or survey) design. For these surveys, there exist legally binding lists of target variables containing the concept behind the variables and their categories and codes. However, countries are free to collect these data according to their needs, customs and budgets. Thus the mode of data collection varies, there is no common questionnaire, the order, the wording and the answer categories of questions are country-specific, etc. It is even left to countries if they use surveys at all or if the data is obtained from registers and administrative records. Thus, this model optimizes the flexibility and budgets of national statistical institutes at the expense of comparability. For variables such as sex, age or marital status, the country-specific variability of data collection may not affect comparability much. However, the comparability for other measures, such as status of (un-) employment or supervisory status (see Pollack et al., 2009), and in particular subjective indicators, such as self-perceived health or deprivation, will be problematic. A contrasting example for ex-ante output harmonization, where more central coordination is used, is provided by the ISSP, which uses the ex-ante output harmonization approach for its ‘background variables’, i.e. the socio-demographic questions (a special treatment of background variables in crosscultural surveys can be found in Chapter 20 in this Handbook). For a few years now, the ISSP has offered a blueprint for this section of the questionnaire, which many members find useful.4 Even more centrally coordinated and thus input harmonized are the ESS, where the Central Scientific Team sets up a substantial set of specifications for countries to follow, and the Survey of Health, Ageing and Retirement in Europe (SHARE), which even employs a centralized CAPI system. The examples show that strict input and ex-ante
output harmonization may be considered to be extreme points on a continuum of more or less input- or output-oriented harmonization strategies that may be chosen according to the variables of interest, participating countries, available resources, etc. The second major approach – ex-post harmonization – is characterized by the fact that harmonization takes place only after, sometimes long after, data collection, usually in the context of a secondary data analysis project. The aim of this approach is to take existing data and build an integrated database with variables following a common definition. Typically, this is done to allow for either cross-national comparison or analysis over time (or both). The data sources used for ex-post harmonization are typically not produced with comparability in mind, that is to say, achieving comparability was not part of the original survey design. Usually ex-post harmonization is the only feasible way to obtain comparable data for the research question at hand. A well-known example is the Integrated Public Use Microdata Series (IPUMS) offering harmonized data from the US Census and the American Community Survey, from 1850 onwards (cf. https://www.ipums.org/). If our aim is to obtain comparative survey data, the most important message is that there is no generally applicable best strategy to reach cross-cultural equivalence of survey measures. Instead, the preferable strategy depends on the concept and indicator we want to measure as well as the basic set-up of the survey, i.e. to what extent the questionnaire can still be modified (ex-ante harmonization) or not (ex-post harmonization). Whether a questionnaire item can be harmonized ex ante, either by translation, adaptation or ex-ante output harmonization, depends on the theoretical concept, empirical indicator and the specific national, cultural and historical conditions prevailing in the contexts studied. Successful harmonization therefore depends on expert knowledge of these circumstances.
INPUT HARMONIZATION
In this section, we discuss some of the elements that are crucial in making an input harmonization approach successful.5 Frequently, comparative survey projects begin with the development of a master questionnaire, most often in English, which is subsequently translated into various languages. The major challenge is that the master questionnaire needs to ‘get it right’ for all multilingual questionnaire versions: that is, all decisions taken at the development stage – in terms of operationalization of concepts, wording of items, choice of answer scales, etc. – need to make sense for all languages and cultures (Harkness et al., 2003a). There is a drawback involved in the sense that questions may get quite general and decontextualized (and thus open to different interpretations) in order to be applicable to a diverse set of countries (Heath et al., 2005). Sometimes, some forms of permitted adaptation (replacement, addition or omission on the item level) are already anticipated in such a master questionnaire (e.g. replacement of examples), but other than that, translated versions should usually follow the master questionnaire when input harmonization is applied. Deviations from the master questionnaire, such as adding or collapsing answer categories, need to be explicitly documented as such6 – and often require prior approval from a coordinating party in a cross-national study because they may hint at the need to output-harmonize measurement for the indicator in question, or threaten the cross-cultural comparability of the resulting variable. Various procedures can be applied when producing a common master questionnaire for a cross-national study. Three approaches to master questionnaire development can be distinguished: sequential, parallel and simultaneous development (Harkness et al., 2003b, 2010). The sequential approach essentially does without much cross-national input during questionnaire development. A team of
more or less monocultural researchers develops a questionnaire and submits it for translation. Cross-national requirements and needs only receive attention when translation starts and data collection is often imminent. At this stage, however, it is often too late to correct for the lack of cultural appropriateness of questions (e.g. an item does not fit a country’s reality) or of translatability (e.g. heavily idiomatic wording, difficult to translate). Smith (2003) and others (e.g. Jowell, 1998) reject such ‘research imperialism’ (Smith, 2004: 443) and instead call for parallel questionnaire development, which includes different layers of cross-national research collaboration and collective development work to prevent cultural and linguistic bias. The most direct form of parallel questionnaire development in cross-national collaboration employs a multinational drafting group to decide on concepts, indicators, target variables and items. By bringing together persons from different cultural, institutional and linguistic communities, the hegemony of a single cultural, institutional and linguistic frame of reference can be avoided. However, some form of cultural dominance may (unintentionally) re-enter with the working language in multinational drafting teams (Granda et al., 2010; Harkness et al., 2010). In the same vein, the methodological and scientific background of country representatives may influence the outcome in favor of certain research or cultural traditions and at the expense of others (Bréchon, 2009). Carefully selected research teams, respect vis-à-vis others, intercultural awareness and a common ground in terms of project goals and processes are thus crucial factors to a successful collaboration. Collaboration between researchers can be organized differently. In small cross-national studies, country representatives may collectively contribute to each stage of development work. In larger studies, it is not uncommon to have smaller multinational core teams doing the actual development work but regularly reaching out to the entire multinational research group (see also Chapter 4 in this Handbook).
Bréchon (2009) compares the decision-making processes for the Eurobarometer, the ESS, the European Values Study (EVS) and the ISSP. While the Eurobarometer constitutes a survey type on its own due to its political origins – decisions are undertaken by the administrative team in charge, under the supervision of a European Commissioner – the other three surveys are academically run. The decisionmaking process in the ESS and the EVS is rather centralized and ultimately in the hands of a core team, even though study-wide feedback and collaboration is solicited. In the ISSP annual general meetings and formalized voting procedures provide the backbone of a highly democratic study, in which each member has the same voting power on drafting group composition or question selection, for instance (ISSP, 2012). Apart from joint discussions, proposition of items for consideration or feedback on others’ items, development work can include more empirically-driven forms of crossnational input, in particular advance translation and pretesting. Advance translation (Dorer, 2011), also called translatability assessment (Dept, 2013), has recently been added to the toolbox of cross-national questionnaire developers; although it was long called for by the research community (Harkness and Schoua-Glusberg, 1998). Advance translation involves producing a translation of a pre-finalized master questionnaire with the explicit goal of spotting potential harmonization problems. These can then be removed or tackled in the final master questionnaire before translation starts. The rationale behind this procedure is that many problems are identified only once an actual translation is attempted. An advance translation does not have to be as thoroughly conducted as a final translation since stylistic fine-tuning for a field study is not needed. Once a questionnaire is translated it should be subjected to a cognitive pretest (see Chapter 24 in this Handbook). Cognitive pretesting assesses to what extent the target group understands the items in the intended
way. Recent years have seen the advent of cross-national cognitive interviewing studies during the development phase of a questionnaire (Fitzgerald et al., 2011; Lee, 2014; Miller et al., 2011). The unique feature of these studies is that they are conducted in a comparable fashion (usually with specifications on what needs to be kept identical across countries and where some leeway in implementation is allowed) and that they are subsequently analyzed with a cross-national perspective in mind. For instance, researchers may be interested in whether ‘the system of public services’ is understood similarly across countries (Fitzgerald et al., 2011) or whether respondents across countries conceptualize pain in similar ways (Miller et al., 2011). Problems resulting from obvious translation errors or from difficult or ambiguous master question wording, issues of cultural portability and general design issues may all be discovered in these cross-national interviewing studies (Fitzgerald et al., 2011). The latter three types of problems point to deeper problems of cross-cultural implementation and should lead to the rewording of the master items or even reconsideration of empirical indicators for the concept in question. The diverse types of problems that can be found in cross-cultural interviewing show that monolingual pretesting is usually insufficient for cross-national studies. Despite its usefulness in detecting comparability flaws, cross-cultural cognitive interviewing may suffer from small case numbers (especially at the country level) and from the organizational challenges of setting up these studies. Against this backdrop, cross-national web probing has been developed. Cross-national web probing involves asking cognitive probes – similar to those typically used in cognitive interviews – in cross-national web surveys (Braun et al., 2014). For instance, Behr and Braun (2015) followed up on a questionnaire item on ‘satisfaction with democracy’ with a probe asking for the reasons for having chosen a certain answer value on the scale. Implemented in a cross-national
context, probes such as these allow researchers to understand which societal, personal or methodological aspects influence respondents’ answers. Answer patterns can then be established, coded, and assessed in terms of comparability. Among the advantages of cross-national web probing are large sample sizes, better country coverage, the possibility of quantifying answer patterns across countries, probe standardization across countries, anonymity of answers, and ease of implementation. Even though web probing is affected by probe non-response or mismatching answers, sample size can compensate for these to a great extent. The usefulness of cognitive interviewing and web probing as well as their respective strengths and limitations are discussed in Meitinger and Behr (2016). While cognitive interviewing or web probing allow for insights into the thought processes of respondents, pilot studies or field tests collect ‘real’ quantitative survey data and thus allow for preliminary statistical analyses, ranging from non-response distributions through means and correlations to sophisticated equivalence checks. The types and number of items representing a concept as well as available sample sizes will determine the types of analyses that are possible at the piloting stage (see also Chapter 24 in this Handbook). A finalized master questionnaire needs to be translated following specific translation and assessment procedures. Parallel translation, team-based review approaches, thorough documentation of problems and decisions, as well as pretesting all can contribute to a measurement instrument that is as comparable as possible to the master version (see Chapter 19 in this Handbook). There may still be (traces of) differences left, because connotations of words or slight shifts of meaning due to different semantic systems cannot fully be ruled out. This is but one of the reasons why Smith (2004), amongst others, calls for multiple indicators for the same concept to disentangle societal and linguistic differences.
OUTPUT HARMONIZATION

This section presents strategies for output harmonization. We first describe survey design procedures aimed at ensuring that ex-ante output harmonization results in comparable measures. Then the respective issues for ex-post output harmonization are presented.
Design of Ex-ante Output Harmonized Target Variables and Questionnaire Items

When a survey question stem can be meaningfully translated into different languages and cultures, but the empirical realizations, i.e. the number or even types of response categories, vary across countries, translation and adaptation will not suffice to render measures comparable. This is true, for example, for indicators such as ‘highest educational qualification obtained’, ‘political parties’ or ‘marital status’.7 In such cases, survey organizers need to employ procedures for ex-ante output harmonization. Successful ex-ante output harmonization relies on a well-structured survey design procedure that, as in the case of input harmonization, requires the collaboration of a cross-national survey design team with expertise in cross-national measurement of the concepts in question, and local experts for the same concepts. Although ex-ante output harmonization is best known for its application to demographic and socio-economic variables (see Chapter 20 in this Handbook), it may also be relevant to certain attitudinal and behavioral questionnaire items. For indicators requiring ex-ante output harmonization, there is often a tension between validity in a specific cultural context and cross-national comparability. While drafting questionnaire items, a country’s or culture’s specificities (such as institutions, history, legal constructs, concrete objects, symbols and products or customs
in the everyday conceptualizations of quantity, space and time) may have to be taken into account in order to accurately measure the concept in question at the national level. At the same time, the cross-cultural survey design team needs to ensure that harmonization into the pre-specified target variable after data collection is possible, in order to achieve comparable measurement. To achieve ‘harmonizability’, it is recommended to define both the comparable target variable (see Introduction) as well as the correspondence between country-specific response categories and target variable already during the phase of questionnaire design (i.e. ex ante), rather than only during data processing (i.e. ex post) – otherwise harmonization problems may be identified too late. However, over-adjustment towards the target variable, e.g. by implementing the target categories directly without consideration of potentially important country-specific variations, may lead to oversimplification of the country-specific measure and should thus also be avoided. Once the theoretical concept, an empirical indicator and target variables have been specified, it should be clear whether translation and adaptation are insufficient to render a measure cross-culturally comparable. If so, it should be determined which elements of the measure require output harmonization and which elements can be input harmonized. For those elements that cannot be input harmonized – usually response categories – specific steps for ex-ante output harmonization need to be followed:

1 Target variable specification (adoption or development of a comparative coding scheme or classification).
2 Questionnaire item design, especially response options and their mapping to the target variable.
3 Application of harmonization recodes (after data collection).
Firstly, a comparative target variable to represent the identified concept(s) has to be specified by the survey organizers, including an explicit coding scheme for the
response values. The coding scheme can range from simple scales without official status (e.g. for harmonizing marital status or family or household type across societies, see examples in the ESS, EVS or ISSP) to multi-digit standard classifications such as the International Standard Classification of Occupations (ISCO, see International Labour Organization, 2007) or the International Standard Classification of Education (ISCED, see UNESCO, 2006; UNESCO Institute for Statistics, 2012). When the underlying classification is complex, as in the case of ISCO or ISCED, survey organizers also need to decide whether they want to use and publish the most detailed coding (as is common for ISCO) or some simplified version (as is common – but problematic, see Schneider, 2010 – for ISCED). In the latter case there is a risk that ad-hoc variations of standards are chosen, which decrease standardization across surveys and the usefulness of the variable. In general social surveys, it is useful for researchers if such target variables are rather differentiated in order to allow flexible recoding according to the requirements of the research question. It would be helpful if the possibility of transformation into the official standard were ensured at this stage, especially if official data is to be used as a reference (e.g. for checking sample representativeness or developing adjustment weights). Often survey organizers provide aggregated or otherwise derived variables (either exclusively or additionally) that can be directly used for statistical analyses. These would also benefit from cross-survey standardization. When adopting an official standard classification, the official documentation (such as the classification document, operational manuals, glossaries and the like) should be made available to all involved in the questionnaire design and coding process. Additionally, it has proven helpful to provide country teams with a coding template and further available resources, such as dedicated short guidelines briefly explaining the concept and its relevance, the theoretical
rationale behind measuring it, an explanation of each code in the comparative coding scheme of the target variable, and a few examples of its measurement and coding that illustrate common errors or pitfalls. If only one example is used, it can be too influential and country teams may remain unaware of the degrees of variation required across country-specific measurement instruments. For this reason, examples should be chosen wisely, drawing not only on the simplest country cases but also on a range of more complex ones. In a second step, each country team develops the country-specific questionnaire item and response categories in such a way as to allow later recoding into the target variable specified in the previous step. Here it is advisable to start from questionnaire items already existing in the respective country. Newly developed or amended questionnaire items should, as usual, undergo pretesting. Especially measures that differ substantially across countries, as in the case of indicators requiring ex-ante output harmonization, need to be pretested in every country participating in a comparative survey. At the end of this step, there should be two core outputs: a country-specific questionnaire item (possibly with an input-harmonized question stem and country-specific response categories) and coding instructions for the data processing stage to convert the country-specific response categories to the comparative target variable. In some cases, centralized consultation and sign-off procedures, even though they are time-consuming and pose an extra burden on central and country teams, could be a useful strategy to assure comparability. The content and purpose of the survey will influence the decision on which target variables survey organizers want to design in this way. As a final step, after the data have been collected, the coding instructions developed in the previous step need to be executed. This is ideally a merely technical process that is then followed by a quality assessment (see Section
‘Assessing the Quality of Harmonized Measures’). When coding open-ended questions, as in the case of occupations or residual ‘other …’ options, it is strongly advisable to train those who code the information into the classification. Alternatively, using a dedicated service with established expertise for this difficult task may be an option, but even then one should check the rules and procedures followed by its coders. It is also advisable to assess the degree of inter-coder reliability. For documentation purposes, the measurement guidelines and templates should be made available to data users alongside the country-specific questionnaires, coding instructions and country-specific source variables. In practice, the process described here will often be less straightforward and more iterative, with feedback loops from the country-specific items to the definition of the target variable and the coding scheme. As long as this process is clearly documented, it should not jeopardize comparability of the resulting measure.
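To make the recoding step concrete, the following is a minimal sketch of how pre-agreed coding instructions can be executed after data collection, using Python and pandas. All variable names, response codes and the mapping itself are invented for illustration and are not taken from any actual survey; the point is only that the ex-ante agreed correspondence is applied mechanically and that unmapped values are flagged for review rather than silently dropped.

```python
import pandas as pd

# Hypothetical coding instructions agreed ex ante for one country:
# country-specific response codes -> codes of the comparative target variable.
MARITAL_RECODE = {
    1: 1,  # 'married, living with spouse'   -> 1 'married/in union'
    2: 1,  # 'registered partnership'        -> 1 'married/in union'
    3: 3,  # 'divorced'                      -> 3 'separated/divorced'
    4: 4,  # 'widowed'                       -> 4 'widowed'
    5: 2,  # 'never married'                 -> 2 'single/never married'
}

def apply_recode(df: pd.DataFrame, source: str, target: str, mapping: dict) -> pd.DataFrame:
    """Derive the harmonized target variable from the country-specific source variable."""
    df = df.copy()
    df[target] = df[source].map(mapping)
    # Values not covered by the coding instructions are flagged, not silently lost.
    unmapped = df.loc[df[source].notna() & df[target].isna(), source].unique()
    if len(unmapped) > 0:
        print(f"Unmapped source codes in {source}: {sorted(unmapped)}")
    return df

# Toy data standing in for one national file (code 7 is deliberately unmapped).
country_file = pd.DataFrame({"marital_src": [1, 5, 3, 7, 4]})
country_file = apply_recode(country_file, source="marital_src",
                            target="marital_target", mapping=MARITAL_RECODE)
print(country_file)
```

In practice the mapping table would come from the documented coding instructions of each country team, so that the recode script itself remains a purely technical step that can be checked and reproduced.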
Deriving Ex-post Output Harmonized Measures from Existing Data

For analyses of social change based on survey data covering a long or historical time span, or when pooling specialized national surveys on topics for which no cross-national survey exists, it is usually necessary to reconcile existing data sources, i.e. to harmonize ex post (Granda et al., 2010). Prominent examples of this approach are the Luxembourg Income Study (LIS, see Smeeding et al., 1990), the different projects of the Integrated Public Use Microdata Series (IPUMS, see www.ipums.org), the Cross-National Equivalent File of household panel surveys (CNEF, see Burkhauser et al., 2000, http://cnef.ehe.osu.edu/), the International File of Immigrant Surveys (IFIS, see van Tubergen, 2004) or the International Social Mobility File (ISMF, see Ganzeboom and
Treiman, 2012, http://www.harryganzeboom.nl/ismf/index.htm). Other extensive harmonization projects include the endeavor of Breen and colleagues, who have combined 117 national surveys for cross-national studies of social mobility (Breen et al., 2009, 2010), or the ‘Democratic Values and Protest Behavior’ project jointly carried out by the Institute of Philosophy and Sociology at the Polish Academy of Sciences and the Mershon Center for International Security Studies at Ohio State University, in which over 1,700 national survey files are pooled (see http://dataharmonization.org/). The founders of CNEF, a set of harmonized household panel surveys from around the world, point to two important aspects of ex-post harmonization, namely the importance of national laws, institutions, history and culture, and the lack of international standard instruments or coding schemes (which has, however, somewhat improved since 2000): Even the most sophisticated national surveys are unlikely to have cross-national comparability as a survey goal. Hence, while most national surveys use equivalent measures of age and gender, there is no international standard for measuring complex concepts like income, education, health or employment. Thus, researchers interested in doing cross-national work must investigate the institutions, laws and cultural patterns of a country in order to ensure that the variables they create for their analyses are equivalently defined across countries. (Burkhauser et al., 2000: 354)
What harmonizing data means in this context is nicely described by IPUMS International, an attempt to harmonize and integrate microdata from censuses: Integration – or ‘harmonization’8 – is the process of making data from different censuses and countries comparable. For example, most censuses ask about marital status; however, they differ both in their classification schemes (one census might recognize only a general category of ‘married’, while another might distinguish between civil and religious marriages) and in the numeric codes assigned to each category (‘divorced’ might be coded as a ‘4’ in one census and as a ‘2’ in another). To create an integrated variable for
marital status we recode the marital status variable from each census into a unified coding scheme that we design. Most of this work is carried out using correspondence tables …9
While in principle similar issues as in ex-ante output harmonization have to be considered in ex-post output harmonization, the process is driven more by the available data than by the desired target variables, reflecting general problems of secondary research (Dale et al., 2008). Therefore, ex-post harmonization has to live with the fact that survey questions concerning the same underlying concept or indicator may be worded (quite) differently across surveys. Also, it is impossible to change basic survey design features by which the various data sets may differ (e.g. survey mode, sample design, prevalence of proxy interviewing). For example, we know that sensitive questions – which are also not the same across cultures10 – generate different results depending on the survey mode. Therefore, responses to such questions should be combined with caution if data were collected using different modes in different countries. This does not mean, however, that surveys carried out with the same mode can always be easily combined; after all, the given mode may work differently in different countries (e.g. due to variations in literacy levels, see Smith, 2004). The degree of comparability that can be achieved using ex-post harmonization is therefore almost inevitably lower than for surveys designed to be comparative from the outset. Turning to the ex-post harmonization process step by step, the analyst first has to establish, across all included countries, comparable underlying theoretical concepts from existing questionnaires and data sets.11 Databases that document questionnaires for large numbers of surveys and offer search facilities based on key concepts and keywords can be of great value in this respect.12 Next, the analyst has to assemble the questionnaire items from the relevant questionnaires and analyze the respective variables from
the data sets to find a common denominator in order to code them into one harmonized, cross-nationally comparable target variable. In the case of ex-post output harmonization one thus needs to go from questionnaire item to target variable rather than the other way round, which requires a certain degree of pragmatism. For variables with ‘natural’ units measured on a ratio or interval scale, such as temperature, currency or distance, re-scaling to a common standard is possible without loss of information (e.g. converting measures from the imperial to the metric system). With scales in attitudinal measures, however, the situation is different. In this case, for ex-post harmonization we especially need to consider the number of scale points, whether the scale is bipolar or unipolar, scale labeling and the availability of a ‘don’t know’ option. The only case in which a technical transformation of scales is rather straightforward is when all scales have a similar – and consistently even or consistently odd – number of response options, and all are either unipolar or bipolar (not a mix of the two). In all other cases, respondents use different scales differently (method effects) and measurement error is also likely to differ, so that responses from a 10-point scale cannot simply be re-scaled to correspond to a 4-point scale. In this example, it is likely that the 10-point scale is more sensitive at the extreme ends than the 4-point scale, potentially leading to different conclusions. In a similar vein, responses from a bipolar 5-point scale cannot be equated with responses from a unipolar 5-point scale. There are several solutions to this problem, two of which are:

•• Dichotomizing items (see the sketch following this list). However, this carries the cost of losing all differentiation – and thus potential for association with other variables – within the two extreme categories.
•• Thinking about the items in terms of an underlying latent variable and using scaling techniques adapted to ordinal variables, or using external criteria in order to estimate the position of scale values in each context (see Clogg, 1984; Mohler
et al., 1998; Mair and de Leeuw, 2010). However, latent variable modeling requires multiple (at least three) questionnaire items for each concept in each of the surveys, which may be difficult to achieve.
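As a minimal illustration of the dichotomization option, the following sketch contrasts naive linear rescaling with dichotomization at an assumed midpoint. All response values are invented, and the cut points used here are assumptions that would need substantive justification in a real application.

```python
import numpy as np

# Hypothetical responses: the same attitude measured with a 10-point scale in
# survey A and a 4-point scale in survey B (codes are purely illustrative).
a_10pt = np.array([1, 2, 6, 7, 9, 10, 10, 3])
b_4pt  = np.array([1, 1, 3, 3, 4, 4, 2, 2])

# Naive linear rescaling maps both onto 0-1, but it pretends that respondents
# use the scale points of both formats in the same way, which they generally do not.
a_naive = (a_10pt - 1) / 9
b_naive = (b_4pt - 1) / 3

# Dichotomizing at an assumed midpoint is cruder, but at least creates categories
# with a broadly comparable interpretation ('rather high' vs 'rather low').
a_dich = (a_10pt > 5).astype(int)
b_dich = (b_4pt > 2).astype(int)

print(a_naive.round(2), b_naive.round(2))
print(a_dich, b_dich)
```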
The labeling of the scale categories or end points also often differs across surveys, as it necessarily does across languages in a cross-national setting: vague scale quantifiers such as ‘very’, ‘strongly’, ‘pretty’, ‘not too much’ or ‘probably’ combined with some adjective may not find exact equivalents across countries, or they may not tap equivalent cut points in the underlying latent continuum (Mohler et al., 1998). When one scale has a midpoint and/or a ‘don’t know’ option and the other does not, this problem is most obvious. It is therefore hard to judge which scale points are equivalent and how two different scales should be harmonized ex post. Analysis of the covariance structure between different versions of recoded scales might indicate to what degree certain common scales can be considered to produce equivalent data (see Section ‘Comparability of Meaning of Multi-item Measures’ below). For categorical variables, this is also a difficult and crucial step because of the risk of losing information when trying to achieve a common coding scheme across countries or data sources: while it may be possible to code standard variables (see Chapter 20 ‘When Translation is not Enough: Background Variables in Comparative Surveys’ in this Handbook) or whatever would be ideal for the research question at hand from existing sources, this is not always the case. On the one hand, a cross-nationally identical coding can often only be achieved by aggregating categories in the different data sources, usually following the data with the least differentiation (‘the lowest common denominator’), resulting in aggregation error (see also Section ‘Assessing Completeness and Comparative Validity of Output-harmonized Measures’ below). It may thus happen that relevant aspects of the concept in question
are ‘harmonized away’. Harmonizing data to a highly simplified scheme not based on theoretical considerations risks producing irrelevant or invalid data. Indeed, for certain concepts it may be impossible to arrive at any satisfactory ex-post harmonization that allows valid comparisons over time and/or across countries. On the other hand, existing data can sometimes be harmonized in various ways, and different research questions and theoretical backgrounds will result in differently harmonized variables. Then ex-post harmonization cannot be done ‘once and for all’ but data need to be re-examined for different research purposes. To solve both problems, at least for background variables, Granda et al. (2010: 319) describe the approach of hierarchical coding, as applied, for example, in the IPUMS project (see Table 33.1). Its aim is to preserve as much information and thus validity from the original data as possible, especially if the harmonized variables are designed to allow data users to derive more specific measures later on (e.g. when data centers produce time series/longitudinal data files for the community). By using multi-digit codes, differing amounts of detail across studies can be retained while offering cross-nationally equivalent categories. The first one or two digits contain information available in all sources, while further digits provide additional information available in some data sets only. Technically, it often helps to put the response categories of existing variables into a spreadsheet next to each other to establish common boundaries between categories, and then construct the desired target variable. The result is a correspondence table which is very useful also for documentation purposes (see an example from IPUMS concerning marital status, using a hierarchical coding system, Table 33.1). The outcomes of this process are thus – in the optimal case – the assembled source variables from different data sources, their mapping or recode to the comparable coding scheme, as well as the newly constructed, detailed harmonized variable.
Table 33.1 IPUMS Integrated Coding Scheme for Marital Status, slightly simplified

Code | Target variable                           | Survey 1                           | Survey 2                           | Survey 3
100  | SINGLE/NEVER MARRIED                      | 0 = Single                         | 5 = Single                         | 5 = Single
101  | Never married                             |                                    |                                    |
102  | Single, previously in a consensual union  |                                    |                                    |
103  | Single, previously in a religious marriage|                                    |                                    |
200  | MARRIED/IN UNION                          |                                    |                                    |
210  | Married (not specified)                   |                                    |                                    |
211  | Civil                                     | 7 = Only civil marriage            | 2 = Only civil marriage            | 2 = Civil marriage only
212  | Religious                                 | 8 = Only religious marriage        | 3 = Only religious marriage        | 3 = Religious marriage only
213  | Civil and religious                       | 6 = Civil and religious marriage   | 1 = Civil and religious marriage   | 1 = Civil and religious marriage
214  | Polygamous                                |                                    |                                    |
220  | Consensual union                          | 9 = Living maritally               | 4 = Consensual or other            | 4 = Other
300  | SEPARATED/DIVORCED/SPOUSE ABSENT          |                                    |                                    |
310  | Separated or divorced                     |                                    |                                    |
320  | Separated                                 | 1 = Separated                      | 6 = Separated                      | 6 = Separated
321  | Legally separated                         | 2 = Legally separated (desquitado) | 7 = Legally separated (desquitado) | 7 = Separated/left (desquitado)
322  | De facto separated                        |                                    |                                    |
330  | Divorced                                  | 3 = Divorced                       | 8 = Divorced                       | 8 = Divorced
340  | Married, spouse absent (n.s.)             |                                    |                                    |
350  | Consensual union, spouse absent           |                                    |                                    |
400  | WIDOWED                                   | 4 = Widower                        | 9 = Widower                        | 0 = Widower
999  | UNKNOWN/MISSING                           | 5 = Don’t know                     | 0 = No declaration                 | 9 = No answer/left blank
Source: https://international.ipums.org/international/examples/transtable_example.html
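To show how a correspondence table of this kind can be executed in practice, the following minimal Python/pandas sketch applies a simplified and partly invented version of the mappings in Table 33.1 to two toy data sets. It derives both the detailed integrated code and its first digit, i.e. the level of detail that all sources share; the survey labels and data frames are hypothetical.

```python
import pandas as pd

# Correspondence table in the spirit of Table 33.1:
# survey-specific codes -> integrated multi-digit codes (simplified selection).
corr = {
    "survey1": {0: 100, 7: 211, 8: 212, 6: 213, 9: 220, 1: 320, 2: 321, 3: 330, 4: 400, 5: 999},
    "survey2": {5: 100, 2: 211, 3: 212, 1: 213, 4: 220, 6: 320, 7: 321, 8: 330, 9: 400, 0: 999},
}

def harmonize(df: pd.DataFrame, survey: str) -> pd.DataFrame:
    out = df.copy()
    out["marstat"] = out["marstat_orig"].map(corr[survey])   # detailed integrated code
    out["marstat_gen"] = out["marstat"] // 100               # first digit: detail shared by all sources
    out["survey"] = survey
    return out

s1 = harmonize(pd.DataFrame({"marstat_orig": [0, 6, 9, 4]}), "survey1")
s2 = harmonize(pd.DataFrame({"marstat_orig": [5, 1, 8, 0]}), "survey2")
pooled = pd.concat([s1, s2], ignore_index=True)
print(pooled)
```

The multi-digit codes preserve whatever detail a source offers, while truncating to the first digit (or the first two digits) yields categories that are comparable across all pooled data sets.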
ASSESSING THE QUALITY OF HARMONIZED MEASURES
Reliability and validity are the two basic quality aims for survey measures in general. In cross-national research, comparability needs to be added to these two aspects of data quality. This does not only mean comparability in a nominal sense, but includes the requirement for reliability and validity to be comparable across countries (Smith, 2004). This section presents ways to assess the reliability, validity and comparability of harmonized measures in the quest for quality.
Assessing Process Quality

There are two basic approaches to quality assurance for harmonized survey measures: process quality and output quality assessment. To assess the process quality of harmonized measures – which is obviously only possible in the case of ex-ante harmonization, i.e. for comparative surveys – one looks for documentation of the questionnaire
design and harmonization process that is indicated as an output of the processes described in the Sections on Input Harmonization and Output Harmonization above. Only a well-documented, transparent harmonization process allows researchers to check the relationship between theoretical concepts and empirical measures across countries and, in the case of ex-ante output harmonization, between country-specific questionnaire items, country-specific variables and harmonized variables. If, for example, it is unclear which educational qualifications are mapped to which categories of a cross-national coding framework such as ISCED, it is impossible to say whether the resulting variable can be regarded as comparable across countries or not (assuming the cross-national coding scheme in principle ensures comparability). Such black boxes do not help when trying to interpret results from statistical analyses. This was, for example, the case with the education variables in the EU-LFS in the past, when Eurostat published ‘only’ general survey data quality reports (e.g. for the EU-LFS 2013: Eurostat, 2014). This has fortunately changed with the introduction of ISCED 2011 from 2014 onwards (Eurostat, 2015). Given the high ‘documentation burden’, lack of documentation of harmonization procedures can be expected to be the rule rather than the exception. The ISSP and ESS both provide detailed templates for documenting ex-ante output harmonization strategies. A final issue is that documentation is often difficult to find, as this information is typically not published using persistent identifiers. Based on Figure 33.1, questions to be considered in this harmonization process are:

•• Is there a common understanding of the underlying theoretical concept, supported by internationally accepted definitions and scope?
•• Is the underlying theoretical concept relevant in all countries studied?
•• Are the indicators directly comparable across cultures? If not, how was the cross-cultural equivalence of indicators established?
•• Is the translation approach and process documented? Was a suitable approach adopted?
•• Are all language versions of the questionnaire and (if the survey was interviewer-administered) show cards publicly available? The same applies to the material given to the interviewee and the instructions to interviewers.
•• For variables that were output-harmonized:
   ○ Is the implemented comparative coding scheme suitable and available (i.e. one that validly measures the underlying theoretical concept across countries)?
   ○ Is the link between country-specific and harmonized variables clear? Are country-specific variables publicly available, to check whether specifications were followed?
   ○ Were comparative definitions applied consistently across countries?
These questions are also useful when evaluating ex-post output-harmonized data, as they should be equally meticulously documented. However, often these questions are not easy to answer and require some degree of expert knowledge. Specifically for ex-post harmonization, the use of indicators reflecting the degree of comparability of variables across contexts has been proposed: … each variable is assigned a reliability code that represents the degree of cross-national comparability that the surveys permit. For example, a code of ‘1’ indicates that the variables are completely comparable, whereas a code of ‘4’ indicates that there is no comparable variable between the two surveys. These reliability codes are based on direct comparisons of the survey instruments as well as on knowledge of institutional differences across the countries (Burkhauser et al., 2000: 362).
Assessing Reliability

In contrast to process quality assessment, output quality assessment takes the available data and uses statistical methods to measure data quality in quantitative ways. In the remainder of this section, consistency or reliability, completeness, (comparative) validity and comparability will be examined as quality criteria, while acknowledging that there are other criteria that are less relevant
with respect to harmonization (e.g. timeliness, confidentiality or accessibility). Reliability of individual harmonized measures at the level of surveys or data sets can be conceptualized as data consistency. Consistency of data across data sources is a necessary but insufficient condition of comparability. Consistency of harmonized survey data across data sources can be checked by comparing descriptive statistics (mean and variation of interval-level variables; distributions of categorical variables) across data sources, or with external data, such as censuses or register data containing the same variables (i.e. also coded in the same way, which can sometimes only be achieved by aggregating categories). This can be done even when original country-specific variables are unavailable. In this case, however, no in-depth interpretation is possible beyond diagnosing the degree of (in-)consistency. For example, the chapters in Schneider (2008) tried to reconstruct the education distributions found in the European Labour Force Survey from national data sources and found many instances in which it was unclear how the harmonized data came about exactly. Ortmanns and Schneider (2015) compared education distributions, all coded in ISCED 97, across four cross-national public opinion surveys over five years using Duncan’s dissimilarity index and found major discrepancies that require closer investigation.
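The (Duncan) index of dissimilarity used in such consistency checks is straightforward to compute. The sketch below does so for two invented education distributions of the same country in two data sources; the category counts are purely illustrative and the categories must of course be ordered identically in both vectors.

```python
import numpy as np

def duncan_dissimilarity(counts_a, counts_b):
    """Index of dissimilarity: the share of cases that would have to change
    category for the two distributions to match (0 = identical, 1 = disjoint)."""
    p = np.asarray(counts_a, dtype=float)
    q = np.asarray(counts_b, dtype=float)
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

# Hypothetical counts for five harmonized education categories in two surveys.
survey_x = [120, 340, 410, 200, 80]
survey_y = [ 90, 300, 500, 180, 60]
print(round(duncan_dissimilarity(survey_x, survey_y), 3))
```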
Assessing Completeness and Comparative Validity of Output-harmonized Measures

With respect to completeness of output-harmonized data, two aspects can be distinguished: loss of cases or even of whole countries or survey waves due to insufficient ‘harmonizability’ of the collected data, and loss of information (i.e. variation) in harmonized variables compared to country-specific source variables. There is a trade-off between the two, i.e. analysts often have
to choose between more valid harmonized variables, but at the cost of losing countries where this coding cannot be achieved, or better country coverage, but with fewer valid harmonized variables. Both will be presented in sequence here. A simple measure of completeness is the proportion of respondents for whom the harmonized target variable can be derived relative to the number of respondents for whom the source variables have non-missing values. For good reasons, excluding groups of cases that cannot be harmonized is uncommon, as it would distort the sample. Usually, the whole sample is then excluded, e.g. a country or survey wave (or combination of both) with insufficient measurement quality. At the survey level, indicating the proportion of countries or survey waves that could be harmonized would provide a simple measure of completeness. However, harmonized data will usually be fairly complete because completeness in practice often takes priority over comparative validity when defining target variables in ex-post harmonization, reflecting a pragmatic approach to harmonization. In ex-ante harmonization, completeness should not be an issue if measurement instruments are carefully designed to allow deriving the comparative target variable. For assessing validity in terms of the loss of information occurring from the aggregation of response categories for the purpose of output harmonization, the reference data are typically the original country-specific variables. Detailed external data could, however, also be envisaged. Two ways of analyzing loss of information can be distinguished: the comparison of the original and harmonized variables with respect to a) their variability, and b) their explanatory power relative to a criterion variable (comparative construct validation). The idea behind both is that harmonization usually entails the aggregation of categories of country-specific variables. This necessarily leads to some loss of information, which may in turn lead to aggregation error in statistical analyses, such as attenuation
of correlations or regression coefficients (as well as confounding bias in coefficients of third variables). The questions for assessing comparative validity are: How much relevant information (and thus validity) is lost through harmonization, and how much does this differ across countries (or data sets)? If the loss of information differs strongly across countries, correlations with the harmonized variable will be attenuated by different degrees in different countries, which invalidates cross-national comparisons of correlation and regression estimates.13 There is a caveat though: if the country-specific source variable itself is measured with a low degree of differentiation, and thus may not be particularly valid, this analysis may not reveal any loss of information with respect to the harmonized variable for the affected country. Thus, it remains highly important that measures are valid within the survey country in addition to being harmonizable for cross-national comparison. The ‘pure’ loss of information or aggregation error is best assessed by comparing a measure of dispersion of the harmonized variable with the same measure of dispersion of the country-specific variable. Granda et al. (2010: 323) provide the following general equation for such a quality measure:

$$Q_{X_{hi}} = \frac{disp_{hi}}{disp_{oi}}$$

where $disp_{hi}$ is the dispersion of the harmonized variable and $disp_{oi}$ is the dispersion of the original (country-specific) variable, all in data set i. Such an analysis was, for example, conducted for educational attainment measures in Schneider (2009: Chapter 6) using the index of qualitative variation (Mueller et al., 1970) as the measure for dispersion, and Granda et al. (2010) provide an example using religious denomination. This method is especially advisable when the harmonized variable is supposed to be used as a ‘multi-purpose’ variable where all
information contained in the variable may be relevant for one analyst or another. When a measure is to be evaluated with a specific theoretical background and dependent variable in mind, comparative construct validation may be the more adequate procedure. This additionally allows for distinguishing relevant and irrelevant information, given a specific hypothesized relationship. With comparative construct validation, the loss of information is evaluated by comparing the predictive power of one or several differently harmonized variables with the predictive power of the country-specific source variable when predicting a criterion variable in a country-by-country regression analysis. Here the criterion variable needs to be comparable across countries. The analyst can then perform sensitivity analyses, checking how much explanatory power is lost by comparing the (adjusted) coefficient of determination R² from a regression model using the harmonized variable as a predictor with that from a regression model using the country-specific source variable as a predictor of the criterion variable. For categorical variables, dummy indicators should be constructed from the harmonized and country-specific variables, respectively. The equation for the respective quality measure would be the same as above, just replacing disp by (adjusted) R². This analysis may hint at problematic harmonized variables for the specific relationship in question. Such an analysis was, for example, conducted by Kerckhoff et al. (2002) and Schneider (2010) to compare the quality of differently harmonized education variables.
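As a minimal sketch of the dispersion-based variant of this quality measure, the following code computes the index of qualitative variation for an invented country-specific education variable and for a harmonized variable obtained by collapsing its categories, and then takes their ratio as in the equation above. The category counts are purely illustrative.

```python
import numpy as np

def iqv(counts):
    """Index of qualitative variation (Mueller et al., 1970):
    1 when cases are spread evenly over all categories, 0 when all fall into one."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    k = len(p)
    return (k / (k - 1)) * (1 - np.sum(p ** 2))

# Hypothetical example: a detailed national education variable (6 categories)
# collapsed into a harmonized variable with 3 categories.
original   = [150, 220, 310, 180, 90, 50]
harmonized = [370, 490, 140]   # aggregation of the six categories above

Q = iqv(harmonized) / iqv(original)   # share of dispersion retained after harmonization
print(round(Q, 3))
```

Computing this ratio separately for each country (or data set) indicates whether the loss of information introduced by harmonization differs strongly across contexts, which would threaten comparative analyses.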
Comparability of Meaning of Multi-item Measures

A number of statistical procedures are available to quantitatively test the equivalence of meaning when using multiple indicators to measure a latent construct. In comparative surveys, these methods are ideally used during the piloting stage of a comparative
survey, and not only in the analysis stage, to assess the quality of ex-ante harmonization, especially translation and adaptation. However, for ex-post harmonization, they can obviously only be used in the analysis stage and, given the lack of ex-ante procedures in this setting, are amongst the only procedures to empirically assure comparability of measurement. The most common method used to assess equivalence of a measurement instrument today is multigroup confirmatory factor analysis (MGCFA), presented in detail in Chapter 39 in this Handbook (see also the literature cited there). Depending on the type of latent variable of interest and on the link between the latent and observed variables, other methods, for example extensions of Latent Class Analysis or Item Response Theory, may be more appropriate (for an overview see Wirth and Kolb, 2012). In principle, all these methods test whether constraining certain parameters of a measurement model to be equivalent across groups, e.g. countries, still leads to satisfactory model fit. Depending on the number and kind of parameters that are constrained to be equal, different levels of equivalence can be distinguished. As mentioned, these methods are especially useful to test and develop instruments in a comparative survey design setting.
CONCLUSION

We have distinguished two major approaches to harmonization: ex-ante and ex-post harmonization. For the former, two subtypes were introduced, namely input and ex-ante output harmonization. With respect to most concepts, input harmonization results in the highest level of comparability. However, this approach is not always feasible. If the concepts we are interested in are shaped by national institutions, histories and cultures, strict input harmonization does not work because respondents do not think about these
things in the same way across countries. In these cases, we may have to consider varying numbers and types of answer categories, sometimes in addition to adaptations of question wording, and decide how these country-specific variables should be combined or recoded into a harmonized international measure, which we call the target variable. That is, we have to apply ex-ante output harmonization. Most comparative surveys show a mixture of these two approaches. Input harmonization is also less feasible if a comparative survey is connected with or built on pre-existing non-comparative surveys in some or all countries. The higher the proportion of ex-ante output harmonization in a comparative survey, the more national teams will determine its content by, for example, taking questionnaire items or data from existing sources, thereby challenging comparability. The concrete harmonization strategy followed by a comparative survey, like other aspects of design, affects its other quality dimensions, such as timeliness, response burden or cost. Thus, the concrete mixture of input and ex-ante output harmonization reflects the preferences and constraints of those responsible for the survey with respect to several relevant criteria. It should also be noted that the approach taken for harmonization should correspond to decisions on other design features and standardization measures applied in a comparative survey (for the latter see Lynn, 2003), in accordance with its aims. Whether theoretical concepts are appropriate across contexts and whether survey measures can be harmonized, either by translation or adaptation of questionnaire items or by recoding during data processing, depends on the specific national, cultural and historical conditions prevailing in the contexts in which we want to run a survey. The same is true for ex-post harmonization. Which target variables the harmonized data file will contain depends not only on the availability of data in the source files but also on our theoretical premises, our research questions and our perspective on the concrete historical and
national context. Thus, harmonization – both ex ante and ex post – is not just a ‘mechanical’ task of menial recoding work, but depends on expert knowledge of concrete historical, political, social and other national conditions met in the countries and times of interest, and it is not free of normative judgment (see also Chapter 20 in this Handbook). It is therefore paramount that any harmonization project is carried out in close collaboration with scholars from, or at least with in-depth knowledge about, the cultural contexts of interest, as well as about the theoretical concepts to be studied. The multi-cultural perspective established by such a group helps to prevent measures that are culturally biased, invalid or irrelevant in some country. For the same reason, all steps of a harmonization process should be closely documented. This has the additional advantage that successful solutions from one project may be carried over to other projects, thereby supporting cumulative, comparative research.
NOTES

1 Definitions of measurement equivalence are stricter and rely on statistical properties of measures (see Milfont and Fischer, 2010; Steenkamp and Baumgartner, 1998; van Deth, 1998).
2 Most readers probably think about a cross-national survey, but sometimes the cultural heterogeneity within a country (or another relevant target population) is so large that even within the same national context a cross-cultural approach has to be followed.
3 Differences of distributions also often need to be taken into account when adapting a questionnaire to different contexts. Think about a question on religious affiliation: In India, the answering categories should include ‘Hindu’, while this would not have to be the case in a continental European country.
4 See http://www.gesis.org/fileadmin/upload/dienstleistung/daten/umfragedaten/issp/members/codinginfo/BV_questionnaire_for_issp2014.pdf
5 Other, more general aspects of process quality are of great importance, too, but not covered here (see for example Lyberg and Biemer, 2008; Lyberg and Stukel, 2010).
6 See, for example, for the ESS: http://www.europeansocialsurvey.org/data/deviations_index.html, or for the Comparative Study of Electoral Systems: http://www.cses.org/datacenter/module4/data/cses4_codebook_part2_variables.txt
7 This assumes that in the countries surveyed there are political parties or regulations defining different categories of marital status or there is a formal educational system; assumptions that might be wrong.
8 Since comparative surveys by design, and thus ex-ante harmonization, are a more recent, extremely costly and thus special endeavor, the term ‘harmonization’ is often used synonymously with the term ‘ex-post harmonization’.
9 https://international.ipums.org/international-action/faq, What are integrated variables?
10 The issue of sensitive questions should also be taken into consideration during input harmonization; truly multi-cultural questionnaire design groups, advance translation and cognitive pretesting can help to identify such questions.
11 This would be much easier if the intended theoretical concepts were routinely documented alongside survey questionnaires, which is not (yet) common practice but highly recommended.
12 Such databases are offered by the Council for European Social Science Data Archives (CESSDA; http://www.cessda.net/catalogue/), the German survey data archive at GESIS – Leibniz Institute for the Social Sciences (http://zacat.gesis.org/), the UK data service (http://nesstar.ukdataservice.ac.uk/) or the Inter-university Consortium for Political and Social Research (http://www.icpsr.umich.edu/icpsrweb/ICPSR/).
13 If, conceptually, the analyst is only interested in certain aspects of a concept (reflected in single categories of a variable), loss of information may not be a concern. For example, when looking at the effects of being divorced, there is no need to keep all possible distinctions on the variable of marital status as predictor variables. In the end, the analyst needs to think carefully about the theoretical concept and target variable and subject the desired variable to quality checks.
RECOMMENDED READINGS

There are surprisingly few texts discussing general approaches to the harmonization of survey data (but see Granda et al., 2010). Some more papers detail the path taken to harmonization in specific surveys, in particular the International Social Survey Programme
(Scholz, 2005), the European Social Survey (Kolsrud and Kalgraff Skjak, 2005), the Cross-National Equivalent File (Lillard, 2013) or household surveys in European official statistics (Ehling, 2003; Körner and Meyer, 2005). Then there are those contributions focusing on a specific variable, such as education (Schneider, 2009), occupation (Ganzeboom and Treiman, 2003) or income (Canberra Group, 2011). Finally, we can recommend the practical advice gained from different projects at the Minnesota Population Center that can be found on the web at http://www.ipums.org/.
REFERENCES

Behr, D., and Braun, M. (2015). Satisfaction with the way democracy works: how respondents across countries understand the question. In P. B. Sztabinski, H. Domanski, and F. Sztabinski (eds), Hopes and Anxieties: Six Waves of the European Social Survey (pp. 121–138). Frankfurt/Main: Lang.
Braun, M., Behr, D., Kaczmirek, K., and Bandilla, W. (2014). Evaluating cross-national item equivalence with probing questions in web surveys. In U. Engel, B. Jann, P. Lynn, A. Scherpenzeel, and P. Sturgis (eds), Improving Survey Methods: Lessons From Recent Research (pp. 184–200). New York: Routledge.
Bréchon, P. (2009). A breakthrough in comparative social research: the ISSP compared with the Eurobarometer, EVS and ESS surveys. In M. Haller, R. Jowell, and T.W. Smith (eds), The International Social Survey Programme, 1984–2009 (pp. 28–44). London: Routledge.
Breen, R., Luijkx, R., Müller, W., and Pollak, R. (2009). Nonpersistent inequality in educational attainment: evidence from eight European countries. American Journal of Sociology, 114(5), 1475–1521.
Breen, R., Luijkx, R., Müller, W., and Pollak, R. (2010). Long-term trends in educational inequality in Europe: class inequalities and gender differences. European Sociological Review, 26(1), 31–48. doi:10.1093/esr/jcp001
Burkhauser, R. V., Butrica, B. A., Daly, M. C., and Lillard, D. R. (2000). The cross-national equivalent file: a product of cross-national research. In I. Becker, N. Ott, and G. Rolf (eds), Soziale Sicherung in einer dynamischen Gesellschaft. Festschrift für Richard Hauser zum 65. Geburtstag (pp. 354–376). Frankfurt/Main: Campus.
Canberra Group (2011). Handbook on Household Income Statistics (2nd edn). Geneva: United Nations.
Clogg, C. C. (1984). Some statistical models for analyzing why surveys disagree. In C.F. Turner and E. Martin (eds), Surveying Subjective Phenomena (Vol. 2, pp. 319–366). New York: Russell Sage Foundation.
Dale, A., Wathan, J., and Higgins, V. (2008). Secondary analysis of quantitative data sources. In P. Alasuutari, L. Bickman, and J. Brannen (eds), The SAGE Handbook of Social Research Methods (pp. 520–535). London: Sage.
Depoutot, R. (1999). Quality definition and evaluation. In European Commission (ed.), The Future of European Social Statistics: Harmonisation of Social Statistics and Quality (pp. 31–50). Luxembourg: Office of Official Publications of the European Communities.
Dept, S. (2013). Translatability assessment of draft questionnaire items. Paper presented at the conference of the European Survey Research Association (ESRA), Ljubljana, SI. Unpublished.
Dorer, B. (2011). Advance translation in the 5th round of the European Social Survey (ESS). FORS Working Paper Series 2011, 4.
Ehling, M. (2003). Harmonising data in official statistics: development, procedures, and data quality. In J. H. P. Hoffmeyer-Zlotnik and C. Wolf (eds), Advances in Cross-National Comparison. A European Working Book for Demographic and Socio-Economic Variables (pp. 17–31). New York: Kluwer Academic/Plenum Publishers.
Eurostat (2014). Quality report of the European Union Labour Force Survey 2013. Eurostat Statistical Working Papers. Luxembourg: Publications Office of the European Union.
Eurostat (2015). ISCED mappings 2014. https://circabc.europa.eu/w/browse/51bfe88f-cb68-4316-9092-0513c940882d [accessed 2015/12/29].
Fitzgerald, R., Widdop, S., Gray, M., and Collins, D. (2011). Identifying sources of error in cross-national questionnaires: application of an error source typology to cognitive interview data. Journal of Official Statistics, 27, 569–599.
Ganzeboom, H. B. G., and Treiman, D. J. (2003). Three internationally standardised measures for comparative research on occupational status. In J. H. P. Hoffmeyer-Zlotnik and C. Wolf (eds), Advances in Cross-National Comparison: A European Working Book for Demographic and Socio-Economic Variables (pp. 159–193). New York: Kluwer Academic/Plenum Publishers.
Ganzeboom, H. B. G., and Treiman, D. J. (2012). International Stratification and Mobility File: Conversions for Country-specific Occupation and Education Codes. http://www.harryganzeboom.nl/ISMF/ismf.htm [accessed 2015/12/29].
Grais, B. (1999). Statistical harmonisation and quality. In European Commission (ed.), The Future of European Social Statistics – Harmonisation of Social Statistics and Quality (pp. 51–114). Luxembourg: Office of Official Publications of the European Communities.
Granda, P., Wolf, C., and Hadorn, R. (2010). Harmonizing survey data. In J. A. Harkness, M. Braun, B. Edwards, T. Johnson, L. E. Lyberg, P. Ph. Mohler, B.E. Pennell, and T. W. Smith (eds), Survey Methods in Multinational, Multicultural and Multiregional Contexts (pp. 315–334). Hoboken, NJ: Wiley.
Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., and Tourangeau, R. (2009). Survey Methodology (2nd ed.). Hoboken, NJ: Wiley.
Harkness, J. A., and Schoua-Glusberg, A. (1998). Questionnaires in translation. In J. Harkness (ed.), ZUMA-Nachrichten Spezial 3: Cross-Cultural Survey Equivalence (pp. 87–127). Mannheim: ZUMA.
Harkness, J., Mohler, P. Ph., and van de Vijver, F. J. R. (2003a). Comparative research. In J. Harkness, F. J. R. van de Vijver, and P. Ph. Mohler (eds), Cross-Cultural Survey Methods (pp. 3–16). Hoboken, NJ: Wiley.
Harkness, J., van de Vijver, F. J. R., and Johnson, T. P. (2003b). Questionnaire design in comparative research. In J. Harkness, F. J. R. van de Vijver, and P. Ph. Mohler (eds), Cross-Cultural Survey Methods (pp. 19–34). Hoboken, NJ: Wiley.
Harkness, J.A., Edwards, B., Hansen, S.E., Miller, D.R., and Villar, A. (2010). Designing questionnaires for multipopulation research. In J. A. Harkness, M. Braun, B. Edwards, T. Johnson, L. E. Lyberg, P. Ph. Mohler, B-E. Pennell, and T. W. Smith (eds), Survey Methods in Multicultural, Multinational, and Multiregional Contexts (pp. 33–58). Hoboken, NJ: Wiley.
Heath, A. F., Fisher, S., and Smith, S. (2005). The globalization of public opinion research. Annual Review of Political Science, 8, 297–333.
Hoffmeyer-Zlotnik, J. H. P., and Wolf, C. (eds). (2003). Advances in Cross-National Comparison. A European Working Book for Demographic and Socio-Economic Variables. New York: Kluwer Academic/Plenum Publishers.
International Labour Organization (2007). Resolution Concerning Updating the International Standard Classification of Occupations. Geneva: International Labour Organization. http://www.ilo.org/public/english/bureau/stat/isco/docs/resol08.pdf [accessed 2015/12/29].
ISSP (2012). International Social Survey Programme (ISSP) Working Principles. http://www.issp.org/uploads/editor_uploads/files/WP_FINAL_9_2012_.pdf [accessed 2015/12/29].
Jowell, R. (1998). How comparative is comparative research? American Behavioral Scientist, 42(2), 168–177.
Kerckhoff, A. C., Ezell, E. D., and Brown, J. S. (2002). Toward an improved measure of educational attainment in social stratification research. Social Science Research, 31(1), 99–123.
Kolsrud, K., and Kalgraff Skjak, K. (2005). Harmonising background variables in the European Social Survey. In J. H. P. Hoffmeyer-Zlotnik and J. A. Harkness (eds), ZUMA Nachrichten Spezial 11: Methodological Aspects in Cross-National Research (Vol. 11, pp. 163–182). Mannheim: ZUMA.
Körner, T., and Meyer, I. (2005). Harmonising socio-demographic information in household surveys of official statistics: experiences from the Federal Statistical Office Germany. In J. H. P. Hoffmeyer-Zlotnik and J. A. Harkness (eds), ZUMA Nachrichten Spezial 11: Methodological Aspects in Cross-National Research (Vol. 11, pp. 149–162). Mannheim: ZUMA.
Lee, J. (2014). Conducting cognitive interviews in cross-national settings. Assessment, 21(2), 227–240.
Lillard, D. R. (2013). Cross-national harmonization of longitudinal data: the example of national household panels. In B. Kleiner, I. Renschler, B. Wernli, P. Farago, and D. Joye (eds), Understanding Research Infrastructures in the Social Sciences (pp. 80–88). Zürich: Seismo.
Lyberg, L. E., and Biemer, P. P. (2008). Quality assurance and quality control in surveys. In E. D. de Leeuw, J. J. Hox, and D. A. Dillman (eds), International Handbook of Survey Methodology (pp. 421–441). New York: Lawrence Erlbaum Associates.
Lyberg, L., and Stukel, D. M. (2010). Quality assurance and quality control in cross-national comparative studies. In J. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, Ph. Mohler, B.E. Pennell, and T. W. Smith (eds), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 227–249). Hoboken, NJ: Wiley.
Lynn, P. (2003). Developing quality standards for cross-national survey research: five approaches. International Journal of Social Research Methodology, 6, 323–336.
Mair, P., and de Leeuw, J. (2010). A general framework for multivariate analysis with optimal scaling: the R package aspect. Journal of Statistical Software, 32(9), 1–23.
Meitinger, K., and Behr, D. (2016). Comparing cognitive interviewing and online probing: do they find similar results? Field Methods. doi:10.1177/1525822X15625866
Milfont, T. L., and Fischer, R. (2010). Testing measurement invariance across groups: applications in cross-cultural research. International Journal of Psychological Research, 3(1), 111–121.
Miller, K., Fitzgerald, R., Padilla, J.-L., Willson, S., Widdop, S., Caspar, R., Dimov, M., Grey, M., Nunes, C., Prüfer, P., Schöbi, N., and Schoua-Glusberg, A. (2011). Design and analysis of cognitive interviews for comparative multinational testing. Field Methods, 23(4), 379–396.
Mohler, Ph., Smith, T., and Harkness, J. (1998). Respondent’s ratings of expressions from response scales: a two-country, two-language investigation on equivalence and translation. In J. Harkness (ed.), ZUMA-Nachrichten Spezial 3: Cross-Cultural Survey Equivalence (pp. 159–184). Mannheim: ZUMA.
Mueller, J. H., Costner, H. L., and Schuessler, K. F. (1970). Statistical Reasoning in Sociology (2nd edn). Boston, MA: Houghton Mifflin.
Ortmanns, V., and Schneider, S. L. (2015). Harmonization still failing? Inconsistency of education variables in cross-national public opinion surveys. International Journal of Public Opinion Research, online first, doi: 10.1093/ijpor/edv025.
Pollack, R., Bauer, G., Müller, W., Weiss, F., and Wirth, H. (2009). The comparative measurement of supervisory status. In D. Rose and E. Harrison (eds), Social Class in Europe: An Introduction to the European Socio-economic Classification (pp. 138–157). Abingdon: Routledge.
Schneider, S. L. (ed.) (2008). The International Standard Classification of Education (ISCED97). An Evaluation of Content and Criterion Validity for 15 European Countries. Mannheim: MZES. http://www.mzes.uni-mannheim.de/publications/books/isced97.html [accessed 2016/03/02].
Schneider, S. L. (2009). Confusing Credentials: The Cross-Nationally Comparable Measurement of Educational Attainment. Oxford: University of Oxford, Nuffield College. http://ora.ouls.ox.ac.uk/objects/uuid:15c39d54-f896-425b-aaa8-93ba5bf03529 [accessed 2015/12/29].
Schneider, S. L. (2010). Nominal comparability is not enough: (in-)equivalence of construct validity of cross-national measures of educational attainment in the European Social Survey. Research in Social Stratification and Mobility, 28(3), 343–357.
Scholz, E. (2005). Harmonisation of survey data in the International Social Survey Programme (ISSP). In J. H. P. Hoffmeyer-Zlotnik and J. A. Harkness (eds), ZUMA Nachrichten Spezial 11: Methodological Aspects in Cross-National Research (Vol. 11, pp. 183–200). Mannheim: ZUMA.
Smeeding, T. M., O’Higgins, M., and Rainwater, L. (eds) (1990). Poverty, Inequality, and Income Distribution in Comparative Perspective: The Luxembourg Income Study (LIS). Washington D.C.: The Urban Institute.
Smith, T. W. (2003). Developing comparable questions in cross-national surveys. In J. Harkness, F. J. R. van de Vijver, and P. Ph. Mohler (eds), Cross-Cultural Survey Methods (pp. 69–91). Hoboken, NJ: Wiley.
Smith, T. W. (2004). Developing and evaluating cross-national survey instruments. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, and E. Singer (eds), Methods for Testing and Evaluating Survey Questionnaires (pp. 431–452). Hoboken, NJ: Wiley.
Smith, T. W. (2005). The laws of studying social change. Survey Research, 36(2), 1–5.
Steenkamp, J.-B. E. M., and Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.
UNESCO (2006). International Standard Classification of Education: ISCED 1997 (re-edition). Montreal: UNESCO Institute for Statistics.
UNESCO Institute for Statistics (2012). International Standard Classification of Education ISCED 2011. Montreal: UNESCO Institute for Statistics.
Van Deth, J. W. (1998). Equivalence in comparative political research. In J. W. Van Deth (ed.), Comparative Politics: The Problem of Equivalence (pp. 1–19). London: Routledge.
Van Tubergen, F. (2004). International File of Immigration Surveys: Codebook and Machine Readable Data File. Utrecht: University of Utrecht, Department of Sociology, Interuniversity Center for Social Science Theory and Methodology.
Wirth, W., and Kolb, S. (2012). Securing equivalence: problems and solutions. In F. Esser and T. Hanitzsch (eds), The Handbook of Comparative Communication Research (pp. 469–485). New York: Routledge.
PART VIII
Assessing and Improving Data Quality
34 Survey Data Quality and Measurement Precision
Duane F. Alwin1
Measurement, as we have seen, always has an element of error in it. The most exact description or prediction that a scientist can make is still only approximate. If, as sometimes happens, a perfect correspondence with observation does appear, it must be regarded as accidental … (Kaplan, 1964: 215)
INTRODUCTION
Despite the potential for error, precise measurement is considered the sine qua non of research investigations in all sciences. Without it, empirical knowledge is meager and underdeveloped, regardless of the field of study. One principle of precision has to do with measurement validity, which requires, as classically defined, that one's measures correspond to the construct of interest and succeed in measuring what they purport to measure (Cronbach and Meehl, 1955; Kaplan, 1964: 198–201). A corollary to the principle of validity is that the reliability of
measurement is a necessary (although not sufficient) condition for valid measurement. Without reliable measurement one cannot achieve empirical validity because the reliability of measurement sets limits on magnitudes of correlation among variables (Lord and Novick, 1968: 61–63). Unfortunately, there is very little available empirical evidence that establishes the validity of social measurement – validity of measurement is often taken on faith and relies on the consensus of investigators studying a given problem. Similarly, there is only a limited empirical basis for concluding that social measurement is acceptably reliable, although the available knowledge is increasing. Certainly, quality concerns have been the focus of a substantial literature on survey methods (see, e.g., Biemer et al., 1991; Lyberg et al., 1997; Kaase, 1999; Schaeffer and Dykema, 2011), and there is growing recognition that measurement errors pose a serious limitation to the validity and usefulness of the information collected via survey
methods (see, e.g., Smith, 2011). In addition to the threats posed by coverage or sampling frame problems, sampling errors, and survey non-response, measurement errors can render the survey enterprise of little value when errors are substantial. These issues were emphasized in a recent book on Question Evaluation Methods (Madans et al., 2011) (hereafter QEM), which observed that in order for survey data to be 'viewed as credible, unbiased, and reliable', it is necessary to develop best practices regarding indicators of quality. They noted that indicators of the quality of sample designs, how the design was implemented, and other characteristics of the sample are well developed and accepted, but they stressed the fact that the same cannot be said for the quality of the measurement. QEM phrased the issues as follows (Madans et al., 2011: 2):
Content is generally evaluated according to the reliability and validity of the measures derived from the data. Quality standards for reliability, while generally available, are not often implemented due to the cost of conducting the necessary data collection. While there has been considerable conceptual work regarding the measurement of validity, translating the concepts into measurable standards has been challenging.
In basic agreement with these claims, I here argue that in order to understand how to evaluate the validity and reliability of measures obtained in surveys, there are several steps that need to be taken. First, it is necessary to conceptualize the nature of the measurement errors in surveys and bring theory and data to bear on a variety of different issues related to the quality of survey data. Second, it is essential to develop statistical models that specify the linkage between true and observed variables. Third, it is necessary to develop robust research designs that permit the estimation of the level of measurement error. And finally, it is important to link such estimates of measurement errors to several elements of survey design and practice, including (1) the characteristics of the population of interest, (2) the topic or topics linked to the
purpose of the survey, (3) the design of the questionnaire, including the wording, structure, and context of the questions, as well as the response formats provided, and (4) the specific conditions of measurement, including the mode of administration and the quality of interviewer training (see Alwin, 1989). This chapter sets forth the argument that, given available models and data resources, we increasingly are able to assess these issues and evaluate the link between these elements of survey design, especially question/questionnaire characteristics, and estimates of data quality. Following the argument of Madans et al. (2011), I propose that the reliability of measurement should be used as a major criterion for assessing differences in measurement quality associated with particular survey questions, interviewing practices, modes of administration, or types of questionnaires. Reliability is clearly not the only criterion for survey quality – questions of validity ultimately must be addressed – but without high reliability, other matters cannot be addressed because the reliability of measurement places limits on empirical associations. As already noted, other important indicators of the quality of survey data include the quality of the procedures for obtaining coverage of the population, methods of sampling, and levels of non-response (Biemer, 2010) – but minimizing measurement error is essential to the enterprise (Alwin, 2014). Elsewhere I have argued that the various sources of survey error are nested, and measurement errors exist on the 'margins of error' – that is, measurement precision represents a prism through which we view other aspects of survey data (Alwin, 2007: 2–8). In other words, having excellent samples representative of the target population, having high response rates, having complete data, etc., does us little good if our measurement instruments evoke responses that are fraught with error. Hence, improving survey measurement by scrutinizing the nature and level of measurement errors is an important issue on the agenda for survey methodologists.
The main objectives of this chapter, then, are to accept the challenge presented by QEM (Madans et al., 2011) concerning the reliability and validity of survey data and to raise several questions about the nature and extent of measurement error in surveys and their effects on the quality of the resulting data. I first provide a conceptual framework for discussing issues of measurement precision that has laid the groundwork for the recent development of programs of research for studying measurement errors, especially approaches to quantifying the extent of measurement error in single survey questions (see, e.g., Alwin, 2007; Saris and Gallhofer, 2007). Further, I review the major programs of research in this area, and provide a selective review of the available literature that allows an assessment of the conclusions reached about the quality of survey measurement and the efficacy of specific measurement approaches using the criteria of reliability and validity, as proposed by the authors of QEM (Madans et al., 2011).
CONCEPTUALIZING MEASUREMENT ERROR
In his treatise on social measurement, Otis Dudley Duncan (1984: 119) noted that 'measurement is one of the many human achievements and practices that grew up and came to be taken for granted before anyone thought to ask how and why they work'. The essence of measurement involves the assignment of values (numbers, or numerals, depending on the level of measurement) to population elements (households, persons, etc.) and represents the link between theory (theoretical concepts) and empirical indicators (observed variables) (Kaplan, 1964: 171–178). Almost everyone agrees that while 'measurement works', it is also the case that making errors in the assignment of values to units of observation is routine. Measurement error, as distinct from other
survey errors, is usually defined as follows (see Groves, 1989: vi): Error that occurs when the recorded or observed value is different from the true value of the variable.
The ‘true value’ of a variable is an abstraction because it is rarely, if ever, known. When applied to elements of a finite population, it serves the same purpose as the idea of there being a population mean, variance, or some other parameter of interest for that population, even though they are never observed. In this sense, measurement error is also an abstract concept and has no direct empirical referent. Thus, we can rarely, if ever, define measurement error for a person except as a hypothetical quantity. It should be noted that a ‘true value’ is not the same as a ‘true score’, but in the application of some theories of measurement, we use the idea of a ‘true score’ as one way to define, or give meaning to, the notion of a true value. Within particular statistical models, the concept of ‘true score’ has a very precise definition, if not an empirical one, but almost everyone agrees that the idea of a ‘true value’ is nothing more (or less) than an abstraction (see Groves, 1991; Alwin, 2007; Bohrnstedt, 2010).
Observed vs Latent Values
In order to understand how the properties of measurement error can be conceptualized and estimated, we begin with a fundamental distinction between observed (or manifest) variables and unobserved (or latent) variables, both of which may have either categorical or continuous metrics. Focusing on the linkage between these two levels of discourse – observed and unobserved variables – allows us to conceptualize reliability and validity in terms that are familiar to most researchers, although it is difficult for some researchers to accept the fact that what we are measuring is inherently unobserved, even though it may have a factual basis. The first step is to think
of a response to a survey question as an observed variable that reflects some latent variable or construct of interest. Hence we define the observed response variable (Y) as a random variable in some finite population of interest that has a latent counterpart (T), which is also a random variable in that population. We posit that T is in part responsible for the production of the response Y, and in a causal sense, T leads to Y, i.e., depicted as T → Y (see, e.g., Figure 34.1). With this background, we can begin to conceptualize reliability and validity. The observed variable Y is a 'survey response', while the 'latent' or 'unobserved' variable T is what the survey question is intended to measure, i.e., the true value. We consider T to be a part of (or a component of) Y, but the two are not necessarily identical because a number of types of 'survey measurement errors' are also part of Y. We define these globally using the notation E. Hence we conceptualize Y as having two major parts – what we are attempting to measure (T) and our inability to do so (errors of measurement E), i.e., T → Y ← E (see Figure 34.1). Apart from asserting that Y, T, and E are random variables in some population of interest, to this point we have imposed no constraints on the properties of and relationships among these three processes, except that E (or measurement error) reflects the difference between Y and T, i.e., E = Y – T (Groves, 1989: vi). While Y is observed, both T and E are unobserved, or latent, variables. This formulation attributes the variation observed in survey responses to two general considerations: variation in the unobserved phenomenon one is trying to measure, e.g., an
attitude, or a person's income or employment status, and unobserved variation contributed by errors of measurement. The formulation introduced here is very general, allowing a range of different types of Y and T random variables: (1) Y and T variables that represent discrete categories, as well as (2) Y and T variables that are continuous. If Y (observed) and T (latent) are continuous random variables in a population of interest, for example, and we assume that E is random with respect to the latent variable – that is, E is random with respect to levels of T – we have the case represented by classical true-score theory (CTST) (see Lord and Novick, 1968). CTST refers to the latent variable T as a 'true score' and defines measurement error simply as the difference between the observed and true scores, i.e., E = Y – T, consistent with the narrative introduced above. In the CTST tradition, a model for the g-th measure of Y is defined as Yg = Tg + Eg and the variance of Yg is written as VAR[Yg] = VAR[Tg] + VAR[Eg]. Note that here we assume that COV[Tg,Eg] = 0, or as formulated here, in the population model the true and error scores are independent, i.e., measurement error is random. Because it is the most familiar one and easiest to implement, I develop this model briefly below; but it is important to note that other combinations (not discussed here) are possible, including continuous latent variables measured by categorical observed variables (the typical item response theory [IRT] logistic model), or categorical latent variables represented by categorical observed variables (latent class models) (see, e.g., Alwin, 2007, 2009; Bohrnstedt, 2010; Biemer, 2011).
Figure 34.1 Path diagram of the classical true-score model for a single measure. Source: Alwin (2007)
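To make this variance decomposition concrete, a minimal simulation along these lines can generate data under the random-error model Yg = Tg + Eg and recover the reliability as VAR(T)/VAR(Y); the population size and variances below are illustrative assumptions, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(2016)
n = 100_000                                # hypothetical finite population of persons

# CTST assumptions: E has mean zero and is uncorrelated with T
T = rng.normal(loc=50, scale=10, size=n)   # latent true scores, VAR(T) = 100
E = rng.normal(loc=0, scale=5, size=n)     # random measurement errors, VAR(E) = 25
Y = T + E                                  # observed survey responses

print(round(T.var() / Y.var(), 3))             # VAR(T)/VAR(Y) ~ 100/125 = 0.80
print(round(np.corrcoef(Y, T)[0, 1] ** 2, 3))  # squared correlation of Y and T, same quantity
```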
The Concepts of Reliability and Validity
The concept of reliability is often conceptually defined in terms of the consistency of measurement, which is an appropriate characterization, indicating the extent to which 'measurement remains constant as it is repeated under conditions taken to be constant' (see Kaplan, 1964: 200). Consider a hypothetical thought experiment that observes some quantity of interest (Y1) using some measurement device – a child's height using a tape measure, the pressure in a bicycle tire using a tire pressure gauge, family income based on a question in a household survey, or an attitude or belief about some social object assessed using a Likert-type question in an opinion survey. Then imagine repeating the experiment, obtaining a second measure (Y2) under the assumption that neither the quantity being measured nor the measurement device has changed. If one obtains consistent results across these two replications, we say the measure of Y is reliable, but if they are inconsistent, we say it is unreliable. These notions are captured by Campbell and Fiske's (1959: 83) famous definition of reliability as the 'agreement between two efforts to measure the same thing, using maximally similar methods'. Of course, reliability is not an 'either-or' property; ultimately, we seek to quantify the degree of reliability or consistency in social measurement, and in defining the concept of reliability more formally, we give it a mathematical definition (see below).
The key idea here was expressed by Lord and Novick (1968) who state that ‘the correlation between truly parallel measurements taken in such a way that the person’s true score does not change between them is often called the coefficient of precision’ (Lord and Novick, 1968: 134). In this case the only source contributing to measurement error is the unreliability or imprecision of measurement. The assumption here, as is true in the case of cross-sectional designs, is that ‘if a measurement were taken twice and if no practice, fatigue, memory, or other factor affected repeated measurements’ the correlation between the measures reflects the precision, or reliability, of measurement (emphasis added) (Lord and Novick, 1968: 134). Figure 34.2 provides a path diagram of such a situation of two measures, Y1 and Y2, of the same underlying trait (denoted as T in this picture), illustrating the idea of two parallel measurements.2 Following the above definition, if Y1 and Y2 are truly parallel measurements, the correlation between them is referred to as the reliability of measurement with respect to measuring T. In practical situations in which there are in fact practice effects, fatigue, memory, or other spurious factors contributing to the correlation between repeated measures, the simple idea of the correlation between Y1 and Y2 is not the appropriate estimate of reliability. Indeed, in survey interviews of the type commonly employed it would be almost impossible to ask the same question twice without memory or other factors contributing to the
Figure 34.2 Path diagram of the classical true-score model for two tau-equivalent measures. Source: Alwin (2007)
correlation of repeated measures. Thus, asking the same question twice within the same interview would not be an appropriate design for estimating the reliability of measuring the trait T.3 By contrast, the validity of survey measurement is even more difficult to assess than reliability because within a given survey instrument, there is typically little available information that would establish a criterion for validation. As noted earlier, in the psychometric tradition, the concept of validity has mainly to do with the utility of measures with respect to getting at particular theoretical concepts. This concept of validity is difficult to adapt to the case of survey measures, given the absence of well-defined criteria representing the theoretical construct, but several efforts have been made. There is a well-known genre of research situated under the heading 'record check studies', which involves the comparison of survey data with information obtained from other record sources (e.g., Marquis, 1978; Traugott and Katosh, 1979, 1981; Bound et al., 1990; Kim and Tamborini, 2014). Although rare, such studies can shed light on survey measurement errors; however, as noted earlier, correlations among multiple sources of information are limited by the level of reliability involved in reports from either source (Alwin, 2007: 48–49). Whereas reliability may be conceptualized in terms of agreement between two efforts to assess the underlying value using maximally similar, or replicate, measures, the concept of validity is best conceptualized as agreement or consistency between two efforts to measure the same thing using maximally different measurements (Campbell and Fiske, 1959: 83). Of course, methodologists differ with regard to what is considered 'maximally different' (see, e.g., my critique of the multitrait-multimethod (MTMM) approach, Alwin, 2011). Certainly record check studies easily satisfy the Campbell–Fiske criterion of 'maximally different' measures, but as we note below, the
MTMM approach applied to cross-sectional surveys substitutes the notion of ‘trait validity’ for ‘construct validity’, and typically does not follow the dictum of Campbell and Fiske (1959: 83) that validity requires the use of ‘maximally different’ methods of measurement (see Alwin, 2011). Finally, it is important to understand the relationship between reliability and validity. Among social scientists, almost everyone is aware that without valid measurement, the contribution of the research may be more readily called into question. At the same time, most everyone is also aware that the reliability of measurement (as distinct from validity) is a necessary, although not sufficient, condition for valid measurement. In the CTST tradition, the validity of a measurement Y is expressed as its degree of relationship to a second measurement, some criterion of interest (Lord and Novick, 1968: 61–62). It can be shown that the criterion validity of a measure – its correlation with another variable – cannot exceed the reliability of either variable (Lord and Novick, 1968; Alwin, 2007). There is a mathematical proof for this assertion (Alwin, 2007: 291–292), but the logic that underlies the idea is normally accepted without formal proof: if our measures are unreliable, they cannot be trusted to detect patterns and relationships among variables of interest, including relationships aimed at establishing the validity of measurement.
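The limiting role of unreliability can be sketched with a small simulation (the true correlation and reliabilities below are illustrative assumptions): when two error-free constructs correlate at ρ, the correlation between their error-laden measures is attenuated by the square roots of the two reliabilities, so unreliable measurement places a ceiling on any observed association.

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho = 200_000, 0.6                  # assumed correlation between the latent constructs

# Correlated, standardized latent constructs T1 and T2
T1, T2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

rel1, rel2 = 0.8, 0.5                  # illustrative reliabilities of the two measures
Y1 = T1 + rng.normal(0, np.sqrt(1 / rel1 - 1), n)   # error variance set so VAR(T)/VAR(Y) = rel
Y2 = T2 + rng.normal(0, np.sqrt(1 / rel2 - 1), n)

print(round(np.corrcoef(Y1, Y2)[0, 1], 3))    # observed correlation between the two measures
print(round(rho * np.sqrt(rel1 * rel2), 3))   # rho attenuated by both reliabilities, ~0.379
```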
QUANTIFYING MEASUREMENT ERROR
Three things are necessary to properly assess the nature and extent of measurement error and to quantify the concepts of reliability and validity, as discussed by Madans et al. (2011). First, one needs a model that specifies the linkage between true and observed variables; second, one needs a research design that permits the estimation of the parameters of such a model; and third, one needs an
interpretation of these parameters that is consistent with the concept of reliability (or validity) (Alwin and Krosnick, 1991). In the following sections we focus on these three desiderata.
Models for Measurement Error
Classical true-score theory (CTST) provides a theoretical model for formalizing the basic ideas about measurement error and ultimately for the estimation and quantification of reliability of measurement, as discussed by Madans et al. (2011). This model was discussed in the foregoing section and is a useful place to begin for several reasons. First, this model historically connects our intuitive notions about reliability and validity with more formal statistical models that underlie their quantification. Second, this model forms the basis for most current available designs for assessing measurement errors in surveys. And third, there is a growing literature on the quality of survey measurement that employs tools based on these ideas. This presentation is by necessity brief, and more detailed discussions are available (see, e.g., Alwin, 2005, 2007). CTST provides the classical definitions of observed score, true score, and measurement error, as well as several results that follow from these definitions, including the definitions of reliability and validity. It begins with definitions of these scores for a fixed person, p, a member of a finite population (S) for which we seek to estimate the reliability of measurement of the random variable Y. Reference to these elements as persons is entirely arbitrary, as they may be households, organizations, work groups, families, counties, or any other theoretically relevant unit of observation. We use the reference to 'persons' because the classical theory of reliability was developed for scores defined for persons and because the application of the theory has been primarily in studies of persons. It is important to note that throughout
this chapter the assumption is made that there exists a finite population of persons (S) for whom the CTST model applies and that we wish to draw inferences about the extent of measurement error in that population. The classical psychometric approach defines the observed score as a function of a true score and an error score for a fixed person, i.e., as y = τ + ε, where E(ε) = E(τε) = 0 and E(τ) = E(y) (see Lord and Novick, 1968). Although the model is stated for a fixed person, p, it is not possible to estimate errors of measurement for that person – rather, the model only allows us to estimate properties of the population of persons for which the model holds. The model assumes that Y is a univocal measure of the continuous latent random variable T, and that there is a set of multiple measures of the random variable {Y1, Y2, …, Yg, …, YG} that have the univocal property, that is, each measures one and only one thing, in this case T.4 An observed score ygp for a fixed person p on measure g is defined as a (within person) random variable for which a range of values for person p can in theory be observed. Recall the 'repeated measurement' thought experiment from above, and imagine a hypothetical infinite repetition of measurements creating a propensity distribution for person p relating a probability density function to possible values of Yg. The true score τgp for person p on measure g is defined as the expected value of the observed score ygp, where ygp is sampled from the hypothetical propensity distribution of measure yg for person p. From this we define measurement error for a given observation as the difference between the particular score observed for p on Yg and the true score, i.e., εgp = ygp – τgp. Note that a different error score would result had we sampled a different ygp from the propensity distribution for person p, and an infinite set of replications will produce a distribution for εgp. Several useful results follow from these simple definitions (see Lord and Novick, 1968; Alwin, 2007), the technical details of which we do not present here (see Alwin,
2005, 2007). One of the most important is that the sample estimate of reliability is the squared correlation between observed and true scores, i.e., COR(Y,T)², which equals the ratio of true variance to the observed variance, that is, VAR(T)/VAR(Y). The challenge is to design surveys that will produce a valid estimate of this ratio (see below). To summarize, for our present purposes we take reliability to refer to the relationship between Y and T (as stated above), whereas validity has to do with the relationship between T and some criterion C, that is, between the latent variable being measured and a 'criterion' or the theoretical construct of interest. When one has a 'record' of the variable or a 'gold standard' for this theoretical construct, then one can examine the relationship between T and C. Such designs are relatively rare, but when available, they can be very useful for assessing validity (see, e.g., Alwin, 2009). Of course, both measures of 'gold standards' and of 'alternative methods' contain measurement error, so such correlations, or measures of consistency, must always be interpreted in light of the reliability of measurement in both variables.
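These definitions can be mimicked in a brief simulation (the population size and variances are illustrative assumptions): each person's true score is the expectation of his or her propensity distribution, and a single realized response per person reproduces the population reliability COR(Y,T)² = VAR(T)/VAR(Y).

```python
import numpy as np

rng = np.random.default_rng(42)
n_persons, n_reps = 5_000, 200

tau = rng.normal(0, 1, size=n_persons)   # hypothetical true scores for a finite population

# Propensity distribution: hypothetical repeated measurements for each fixed person
reps = tau[:, None] + rng.normal(0, 0.7, size=(n_persons, n_reps))
print(round(np.abs(reps.mean(axis=1) - tau).max(), 2))   # person means approximate tau

# One realized measurement per person yields the population reliability
y = reps[:, 0]
print(round(tau.var() / y.var(), 3))              # ~ 1 / (1 + 0.49) = 0.67
print(round(np.corrcoef(y, tau)[0, 1] ** 2, 3))   # COR(Y,T)^2, the same quantity
```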
Measurement Invalidity
Normally, we think of measurement error as being more complex than the random error model developed above. In addition to random errors of measurement, there is also
the possibility that Yg contains systematic (or correlated) errors. There is a great deal of literature within the field of psychology (e.g., Podsakoff et al., 2012; Eid and Diener, 2006), as well as the survey methods literature (e.g., McClendon, 1991; Tourangeau et al., 2000) which is concerned with non-random or systematic errors, otherwise known as 'method effects'. We consider such errors as sources of invalidity, which include 'agreeing response bias' (or acquiescence), social desirability, and consistencies due to the similarity or proximity of response scales, or method effects (Andrews, 1984; Alwin and Krosnick, 1985; Krosnick and Alwin, 1988). The relationship between random and systematic errors can be clarified if we consider the following extension of the classical true-score model: Yg = T*g + Mg + Eg, where Mg is a source of systematic error in the observed score, T*g is the true value, uncontaminated by systematic error, and Eg is the random error component discussed above (see Alwin, 1989, 2007). This model directly relates to the one shown in Figure 34.3, in that Tg = T*g + Mg. The idea, then, is that the variable portion of measurement error contains two types of components: a random component, Eg, and a non-random, or systematic, component, Mg, i.e., both T*g and Mg reflect reliable variance. It is frequently the case that systematic sources of error increase reliability. This is, of course, a major threat to the usefulness of classical true-score theory in assessing
Figure 34.3 Path diagram of the classical true-score model for a single measure composed of one trait and one method. Source: Alwin (2007)
the quality of measurement. It is important to address the question of systematic measurement errors, but that often requires a more complicated measurement design. Ideally, then, the goal would be to partition the variance in Yg into those portions due to T*g, Mg, and Eg. This can be implemented using a multitrait-multimethod (MTMM) measurement design along with confirmatory factor analysis, as discussed below (see Alwin, 1974, 1997, 2007; Saris and Andrews, 1991; Scherpenzeel, 1995; Scherpenzeel and Saris, 1997; Saris and Gallhofer, 2007), although this model is not without problems (Alwin, 2011).
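A minimal sketch of this situation, with hypothetical variance components, shows how a shared method factor inflates the apparent consistency of two measures even though the method component contributes nothing valid:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

T_star = rng.normal(0, 1, n)        # valid trait variance, VAR(T*) = 1
M = rng.normal(0, 0.6, n)           # method factor shared by both items, VAR(M) = 0.36

# Two items measuring the same trait with the same method (e.g., the same response scale)
Y1 = T_star + M + rng.normal(0, 0.8, n)   # VAR(E) = 0.64
Y2 = T_star + M + rng.normal(0, 0.8, n)

print(round(np.corrcoef(Y1, Y2)[0, 1], 3))   # ~0.68: consistency due to trait plus method
print(round(1.0 / (1.0 + 0.36 + 0.64), 3))   # 0.50: the 'valid' (trait) share alone
```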
DESIGNS FOR ASSESSING MEASUREMENT ERRORS
Building on the above ideas, several approaches have been used to assess the level of measurement error in survey data. I describe these here.
Internal Consistency Approaches
Traditionally, the main models for reliability estimation available to survey researchers came from the internal consistency approach to reliability estimation for composite scores (Bohrnstedt, 2010). For example, one internal consistency approach for composite scores, due to Cronbach (1951), is often referred to as 'internal consistency reliability' (ICR) – also called 'coefficient alpha' or 'Cronbach's alpha'. The ICR measure estimates the lower bound of reliability for a set of univocal measures that are tau-equivalent in structure. Let Y symbolize such a linear composite defined as the sum Y1 + Y2 + … + Yg + … + YG, where G is the number of measures, assumed to be tau-equivalent (see Figure 34.2). The formula for the ICR (denoted as α) is given as follows:
α = [G / (G − 1)] × [1 − (Σg VAR(Yg)) / VAR(Y)]
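For concreteness, the coefficient can be computed directly from an items-by-respondents data matrix; the following sketch applies the formula above to simulated tau-equivalent items (the sample size and error variances are illustrative assumptions):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: array of shape (n_respondents, G) holding the G item scores."""
    G = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()      # sum of VAR(Y_g)
    composite_var = items.sum(axis=1).var(ddof=1)       # VAR(Y) for Y = Y_1 + ... + Y_G
    return (G / (G - 1)) * (1 - sum_item_var / composite_var)

# Simulated tau-equivalent items: a common true score plus independent errors
rng = np.random.default_rng(0)
T = rng.normal(0, 1, size=10_000)
items = T[:, None] + rng.normal(0, 1, size=(10_000, 4))
print(round(cronbach_alpha(items), 2))   # ~0.80 for four items, each with reliability 0.5
```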
Jöreskog (1971) reformulated this coefficient for the case of congeneric measures, and Greene and Carmines (1979) generalized all of these models to the general case of linear composites (see also Heise and Bohrnstedt, 1970). Recent literature has focused on newer ICR models (see Raykov, 1997, 2012; Raykov and Shrout, 2002; Raykov and Marcoulides, 2015). It can be shown (although we do not do so here) that this formula derives from the classical true-score definition of reliability as the ratio of the true variance to the observed variance of a composite variable, i.e., VAR(T)/VAR(Y) (see Alwin, 2007: 51–53). Given any model, these two components can be estimated. The ICR methods assume a model in which there are multiple univocal measures of a single underlying variable and independence in the errors of measurement. Such ICR measures can be calculated for any set of G measures, whether they meet the assumptions of the model or not, and for our purposes the reliability of composite scores is of little interest. Despite their popularity, these ICR models are not technically estimates of reliability in models where the assumptions of univocity and random measurement errors do not hold. Given this relatively stringent set of requirements on the design of survey measures, they are often unrealistic for purposes of assessing reliability in survey measurement (see Sijtsma, 2009, for a similar argument). In short, it is difficult to obtain multiple measures of the same construct in the typical cross-sectional survey, even when researchers fashion their measures as suitable for the application of psychometric models for reliability estimation. Simply put, it is difficult to demonstrate that the assumptions required for these models are ever met, and although useful as an index of first-factor saturation,
ICR estimates are not very useful for establishing the reliability of survey measures (Alwin, 2007).
Common Factor Models
Given the difficulties discussed above with internal consistency estimates of reliability, many researchers have turned to the common factor model for estimating the error properties of their measures. It is a straightforward exercise to express the basic elements of CTST as a special case of the metric (unstandardized) form of the common factor model and to generalize this model to the specification of K sets of congeneric measures (see Jöreskog, 1971; Alwin and Jackson, 1979). Consider the following common factor model:

Y = ΛT + E

where Y is a (G × 1) vector of observed random variables, T is a (K × 1) vector of true score random variables measured by the observed variables, E is a (G × 1) vector of error scores, and Λ is a (G × K) matrix of regression coefficients relating true and observed random variables. The covariance matrix among measures under this model can be represented as follows:
ΣYY = ΛΦΛ′ + Θ²
where ΣYY, Φ, and Θ² are covariance matrices for the Y, T, and E vectors defined above, and Λ is the coefficient matrix as defined above. For purposes of this illustration we consider all variables to be centered about their means. Here we take the simplest case where K = 1, that is, all G variables in Y are measures of T; however, we note that the model can be written for the general case of multiple sets of congeneric measures. In the present case the model can be represented as follows:
(Y1, Y2, …, YG)′ = (λ1T, λ2T, …, λGT)′ T + (E1, E2, …, EG)′

For G congeneric measures of T there are 2G unknown parameters in this model (G λgT coefficients and G error variances, θ²g), with degrees of freedom (df) equal to 0.5G(G + 1) – 2G. In general this model requires that a scale be fixed for T, since it is an unobserved latent random variable. Two options exist for doing this: (1) the diagonal element in Φ can be fixed at some arbitrary value, say 1.0, or (2) one of the λgT coefficients can be fixed at some arbitrary value, say 1.0. For G measures of T the tau-equivalent model has G + 1 parameters with df = 0.5G(G + 1) – (G + 1). For a set of G parallel measures there are 2 parameters to be estimated, VAR[T] and VAR[E], with df = 0.5G(G + 1) – 2. Note that both the tau-equivalent and parallel measures forms of this model invoke the assumption of tau-equivalence, which is imposed on the model by fixing all λgT coefficients to unity. Observations on two measures, Y1 and Y2, are sufficient to identify the tau-equivalent or parallel measures model; for the congeneric model G must be ≥ 3. It should be clear that the congeneric measures model is the most general and least restrictive of these models, and that the tau-equivalent and parallel measures models simply involve restrictions on it (see Jöreskog, 1971; Alwin and Jackson, 1979). What has been stated for the model in the above paragraphs can be generalized to any number of G measures of any number of K factors. The only constraint is that the assumptions of the model – univocity and random measurement error – are realistic for the measures and the population from which the data come. Although there is no way of testing whether the model is correct, when the
model is overidentified, the fit of the model can be evaluated using standard likelihood-ratio approaches to hypothesis testing within the confirmatory factor analysis framework. There is, for example, a straightforward test for whether a single factor can account for the covariances among the G measures. Absent such confirming evidence, it is unlikely that a simple true score model is appropriate.
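As a minimal illustration (the loadings and error variances below are hypothetical), with exactly three congeneric measures the model is just-identified, and once the scale is fixed by VAR(T) = 1 the loadings and error variances can be recovered directly from the sample covariances:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 50_000

# Simulate three congeneric measures: one factor, different loadings, independent errors
T = rng.normal(0, 1, n)
loadings = np.array([1.0, 0.8, 0.6])
Y = T[:, None] * loadings + rng.normal(0, [0.5, 0.7, 0.9], size=(n, 3))

S = np.cov(Y, rowvar=False)          # sample covariance matrix

# Just-identified solution for K = 1, G = 3:
# cov(Y1,Y2) = l1*l2, cov(Y1,Y3) = l1*l3, cov(Y2,Y3) = l2*l3
l1 = np.sqrt(S[0, 1] * S[0, 2] / S[1, 2])
est = np.array([l1, S[0, 1] / l1, S[0, 2] / l1])
print(np.round(est, 2))                       # ~ [1.0, 0.8, 0.6], the λgT coefficients
print(np.round(np.diag(S) - est ** 2, 2))     # error variances θ²g
print(np.round(est ** 2 / np.diag(S), 2))     # implied item-level reliabilities
```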
Multiple Measures vs Multiple Indicators
While the above common factor model is basic to most methods of reliability estimation, including internal consistency estimates of reliability (e.g., Cronbach, 1951; Heise and Bohrnstedt, 1970; Greene and Carmines, 1979), the properties of the model very often do not hold. This is especially the case when multiple indicators (i.e., observed variables that are similar measures within the same domain, but not so similar as to be considered replicate measures) are used as opposed to multiple measures (i.e., replicate measures). Before discussing the multitrait-multimethod approaches in greater depth below, I here review the distinction between models involving multiple measures and those involving multiple indicators in order to fully appreciate their utility. Following this discussion, I then present the common factor model that underlies the MTMM approach and its relation to the CTST model. An appreciation of the distinction between multiple measures and multiple indicators is critical to an understanding of the difficulties of designing survey measures that satisfy the rigorous requirements of the MTMM extension of the CTST model for estimating reliability. Let us first consider the hypothetical, but unlikely, case of using replicate measures within the same survey interview. Imagine the situation, for example, where three measures of the variable of interest are obtained, for which the random measurement error model holds in each case, as follows:
Y1 = T1 + E1
Y2 = T2 + E2
Y3 = T3 + E3

In this hypothetical case, if it is possible to assume one of the measures could serve as a record or criterion of validity for the other, it would be possible to assess the validity of measurement, but we do not place this constraint on the present example. All we assume is that the three measures are replicates in the sense that they are virtually identical measures of the same thing. This model assumes that each measure has its own true score, and it is assumed that the errors of measurement are not only uncorrelated with their respective true scores, they are also independent of one another. This is called the assumption of measurement independence referred to earlier (Lord and Novick, 1968: 44). It is important to realize in contemplating this model that it does not apply to the case of multiple indicators. The congeneric measurement model (or the special cases of the tau-equivalent and parallel measures models) is consistent with the causal diagram in Figure 34.4a. This picture embodies the most basic assumption of classical true-score theory that measures are univocal, that is, that they each measure one and only one thing which completely accounts for their covariation and which, along with measurement error, completely accounts for the response variation. This approach requires essentially asking the same question multiple times, which is why we refer to it as involving multiple or replicate measures (we consider the multiple indicators approach below). We should point out, however, that it is extremely rare to have multiple measures of the same variable in a given survey. It is much more common to have measures for multiple indicators, that is, questions that ask about somewhat different aspects of the same things. The multiple indicator model derives from the traditional common factor model, a psychometric tradition that predates classical
Figure 34.4 Path diagram for the relationship between random measurement errors, observed scores and true scores for the multiple measures and multiple indicator models. Source: Alwin (2007)
true-score theory. In such models the K latent variables are called common factors because they represent common sources of variation in the observed variables. The common factors of this model are responsible for covariation among the variables. The unique parts of the variables, by contrast, contribute to the lack of covariation among the variables. Covariation among the variables is greater when they measure the same factors, whereas covariation is less when the unique parts of the variables dominate. Indeed, this
is the essence of the common factor model – variables correlate because they measure the same thing(s). The common factor model, however, draws attention not only to the common sources of variance, but to the unique parts as well. This model is illustrated in Figure 34.4b. A variable’s uniqueness is the complement to the common parts of the data (the communality) and is thought of as being composed of two independent parts, one representing specific variation and one representing random
measurement error variation. Using the traditional common factor notation for this, the variable's communality is the proportion of its total variation that is due to common sources of variation; in this case there is one source of common variation, whose variance is denoted VAR(Tg). The uniqueness of Yg, denoted VAR(Ug), equals the sum of specific and measurement error variance, i.e., VAR(Ug) = VAR(Sg) + VAR(Eg). In the traditional standardized form of the common factor model, the uniqueness is the complement of the communality, that is, VAR(Ug) = 1.0 – VAR(Tg). Of course, specific variance is reliable variance, and thus the reliability of the variable is not only due to the common variance, but to specific variance as well. In common factor analytic notation, the reliability of variable g is a function of both VAR(Tg) and VAR(Sg). Unfortunately, however, because specific variance is thought to be independent of (uncorrelated with) sources of common variance, it becomes confounded with measurement error variance. Because of the presence of specific variance in most measures, it is virtually impossible to use the traditional form of the common factor model as a basis for reliability estimation (see Alwin and Jackson, 1979; Alwin, 1989), although this is precisely the approach followed by the MTMM/confirmatory factor analytic approach to reliability estimation we discuss below (see Alwin, 2007; Saris and Gallhofer, 2007). The problem here – which applies to the case of multiple indicators – is that the common factor model typically does not permit the partitioning of VAR(Ug) into its components, VAR(Sg) and VAR(Eg). In the absence of specific variance (what we here refer to as the 'multiple measures' model), classical reliability models may be viewed as a special case of the common factor model, but in general it is risky to assume that VAR(Ug) = VAR(Eg) (Alwin and Jackson, 1979). The model assumes that while the measures may tap (at least) one common factor, it is also the case that there is a specific source of variation in each of the measures, as follows (returning to the notation used above):
Y1 = T1 + S1 + E1
Y2 = T2 + S2 + E2
Y3 = T3 + S3 + E3

Assume for present purposes that specific variance is orthogonal to both the true score and error scores, as is the case in standard treatments of the common factor model. Note that the classical true-score assumptions do not hold for the model depicted in Figure 34.4b. Here the true scores for the several measures are not univocal and therefore cannot be considered congeneric measures. In case (b) there are disturbances in the equations linking the true scores. Consequently the reliable sources of variance in the model are not perfectly correlated. In the language of common factor analysis, these measures contain more than one factor. Each measure involves a common factor as well as a specific factor, that is, a reliable portion of variance that is independent of the common factor. Unless one can assume measures are univocal, or build a more complex array of common factors into the model, measurement error variance will be over-estimated, and item-level reliability under-estimated (see Alwin and Jackson, 1979). This can be seen by noting that the random error components included in the disturbance of the measurement model in Figure 34.4b contain both random measurement error (the E's) and specific sources of variance (the S's). Consistent with the common factor model representation, which assumes that the unique part of Yg is equal to Ug = Sg + Eg, we can define the variance of Yg as VAR(Yg) = VAR(Tg) + VAR(Sg) + VAR(Eg) = VAR(Tg) + VAR(Ug). VAR(Ug) is known to equal the sum of specific and measurement error variance, and unfortunately the common factor model cannot separate specific variance from measurement error variance, i.e., the error variance in the common factor model
contains both reliable and unreliable sources of variance. A second issue that arises in the application of multiple indicators is that there may be correlated errors, or common sources of variance, masquerading as true scores. Imagine that respondents' use of a common agree–disagree scale is influenced by a 'method' factor, e.g., the tendency to agree vs disagree (called 'yea-saying' or 'acquiescence' in the survey methods literature). In this case there are two common factors at work in producing Tg. In such a case of non-random measurement error, we can formulate the model as follows: Yg = Tg + Eg, where Tg = T*g + Mg. This indicates that the covariances among measures, and therefore the estimates of reliability, are inflated due to the operation of common method factors. This was illustrated in the path diagram in Figure 34.3 presented earlier. The multiple measures model can, thus, be thought of as a special case of the multiple indicators model in which the latent true scores are linear combinations of one another, i.e., perfectly correlated. Unfortunately, unless one knows that the multiple measures model is the correct model, interpretations of the 'error' variances as solely due to measurement error are inappropriate. In the more general case (the multiple indicators case) one needs to posit a residual, referred to as specific variance in the factor analytic tradition, as a component of a given true score to account for its failure to correlate perfectly with other true scores aimed at measuring the same construct. Within a single cross-sectional survey, there is no way to distinguish between the two versions of the model, that is, there is no available test to detect whether the congeneric model fits a set of G variables, or whether the common factor model is the more appropriate model. To summarize our discussion up to this point, it is clear that two problems arise in the application of the CTST model to cross-sectional survey data. The first is that specific variance, while reliable variance, is allocated
to the random error term in the model, and consequently, to the extent specific variance exists in the measures, the reliability of the measure is underestimated. This problem could be avoided if there were a way to determine whether the congeneric (multiple measures) model is the appropriate model, but as we have noted, within a single cross-sectional survey, there is no way to do this. The second problem involves the assumption of a single common factor, which is a problem with either version of the model shown in Figure 34.4. In this case the problem involves the presence of common method variance, which tends to inflate estimates of reliability. The latter problem is also true of the multiple measures model, in that it is just as susceptible to the multiple factor problem as the multiple indicator model. The problem is actually more general than this, as it involves any source of multiple common factors, not simply common method factors.
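The direction of the first bias can be checked with a short simulation (all variance components are hypothetical): item-specific but reliable variance is absorbed into the uniqueness of a one-factor solution, so the factor-based estimate understates the item's actual reliability.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Common trait, item-specific (reliable) components, and random error:
# VAR(T) = 1, VAR(S_g) = 0.5, VAR(E_g) = 0.5 for each of three indicators
T = rng.normal(0, 1, n)
S = rng.normal(0, np.sqrt(0.5), size=(n, 3))   # specific variance (reliable)
E = rng.normal(0, np.sqrt(0.5), size=(n, 3))   # random measurement error
Y = T[:, None] + S + E

C = np.cov(Y, rowvar=False)
common_var_item1 = C[0, 1] * C[0, 2] / C[1, 2]   # one-factor (triad) estimate for item 1

print(round(common_var_item1 / C[0, 0], 2))      # ~0.50: communality, treated as 'reliability'
print(round((1.0 + 0.5) / 2.0, 2))               # 0.75: the item's actual reliability (T + S share)
```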
MEASUREMENT ERROR MODELS FOR SPECIFIC QUESTIONS
In the past few decades, the concepts of reliability and validity have increasingly been applied to single survey questions (e.g., Marquis and Marquis, 1977; Alwin, 1989, 1992, 1997, 2007, 2010; Andrews, 1984; Alwin and Krosnick, 1991; Saris and Andrews, 1991; Scherpenzeel and Saris, 1997; Saris and Gallhofer, 2007). This has been one of the most important advances in the assessment of data quality and measurement precision. Several approaches (discussed below) have proven useful as a measure of data quality (Groves, 1989; Saris and Andrews, 1991); however, there is a reluctance on the part of many survey methods experts to evaluate questions in terms of their reliability and/or validity, and they therefore ignore the valuable literature that evaluates survey questions in these terms (see, e.g.,
Krosnick and Presser, 2010; Schaeffer and Dykema, 2011). It is therefore important to clearly indicate what these concepts mean and how they can be useful as a tool for the evaluation of the quality of survey measurement, as proposed by Madans et al. (2011) and others. Due to restrictions of space, I limit myself to the case of measuring continuous latent variables. As we mentioned earlier, with respect to research designs, according to Campbell and Fiske (1959), both concepts of reliability and validity require that agreement between measurements be obtained – validity is supported when there is correspondence between two efforts to measure the same trait through maximally different methods; reliability is demonstrated when there is agreement between two efforts to assess the same thing using maximally similar, or replicate, measures (Campbell and Fiske, 1959: 83). Thus, both types of studies involve assessing the correspondence between measures – studies of reliability focus on the consistency of repeating the measurement using replicate measures, whereas research on validity is concerned with the correspondence of a given measure to some criterion of interest, taking into account the reliability of either measure.
Designs for Studying Validity
As noted earlier, in the psychometric tradition, the concept of validity has mainly to do with the utility of measures with respect to getting at particular theoretical concepts (Cronbach and Meehl, 1955). Establishing the validity of survey measurement is difficult because within a given survey instrument there is typically little available information that would establish a criterion for validation. There are several available designs that embody the principle articulated by Campbell and Fiske (1959), that the best evidence for validity is the convergence of measurements employing maximally different methods. First, there is a well-known
genre of research situated under the heading 'record check studies', which involves the comparison of survey data with information obtained from other record sources (e.g., Marquis, 1978; Bound et al., 1990; Kim and Tamborini, 2014). Although rare, such studies can shed light on survey measurement errors; however, as noted earlier, correlations among multiple sources of information are limited by the level of reliability involved in reports from either source (Alwin, 2007: 48–49). A second design that has been used in the study of non-factual content, e.g., attitudes, beliefs, and self-perceptions, involves the application of multitrait-multimethod (MTMM) measurement designs, where the investigator measures multiple concepts using multiple methods of measurement. In recent years, a great deal of attention has been paid to a model originally proposed by Campbell and Fiske (1959), involving multiple concepts or 'traits', each measured using multiple methods – hence, the multitrait-multimethod measurement design (see Alwin, 1974, 1997, 2007; Scherpenzeel, 1995; Scherpenzeel and Saris, 1997; Saris and Gallhofer, 2007). Using this design, methods have been developed for separating validity and invalidity in survey reports. Its use has been limited to the study of components of variance in measures of attitudes, beliefs, and self-descriptions, primarily because there are generally not multiple measurement formats for gathering factual data. In this framework reliability can be partitioned into 'valid' and 'invalid' components of consistency, and in the following discussion of reliability estimation for single survey questions I cover the MTMM methods that are available. A third type of validation study is what has been referred to as 'gold standard' studies, where an alternative approach to measurement, e.g., an event history calendar, is compared to data gathered on the same people using the method that represents the gold standard or the accepted or standard approach to measuring particular content
(see, e.g., Belli et al., 2001; Alwin, 2009). All of these qualify under the definition of validity studies, in that all three are aimed at examining the correspondence between responses to a particular survey question and the ‘true’ or ‘best’ indicator of the construct it is intended to measure.
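When such a criterion is available, the observed agreement can also be corrected for unreliability in both sources; the following sketch applies the standard correction for attenuation to purely hypothetical values:

```python
# Correction for attenuation: the correlation between the underlying quantities equals
# the observed correlation divided by the square roots of the two reliabilities.
# All values below are hypothetical, for a survey report checked against records.
r_observed = 0.55     # observed correlation between survey report and record value
rel_survey = 0.75     # estimated reliability of the survey measure
rel_record = 0.95     # administrative records are imperfect too

r_corrected = r_observed / (rel_survey * rel_record) ** 0.5
print(round(r_corrected, 2))   # ~0.65: agreement net of random measurement error
```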
Designs for Studying Reliability
There are several traditions for assessing the reliability of survey measures. From the point of view of any of these traditions, reliability estimation requires repeated measures, and in this context we can make several observations. First, I already mentioned the internal consistency approaches based on classical true-score models for composite scores (see Cronbach, 1951; Greene and Carmines, 1979), but I noted that these approaches are not helpful for studying the reliability of specific questions. Second, in the discussion of designs for assessing validity, I mentioned the Campbell and Fiske (1959) multitrait-multimethod (MTMM) model (Alwin, 1974; Andrews, 1984; Saris and Gallhofer, 2007), which provides a factor analytic model for assessing reliability. And while these are useful to some extent, they employ multiple indicators rather than multiple measures (i.e., replicate measures), and are therefore only approximations of estimates of reliability (see Alwin, 2007, 2011). Third, there are methods based on generalizability theory (Rajaratnam, 1960; Cronbach et al., 1963; O'Brien, 1991) that have been used to study the reliability of multiple reports for several different raters/respondents. Fourth, there are methods using latent class models for categorical measures (Clogg and Manning, 1996; Alwin, 2007, 2009; Biemer, 2011). Finally, the quasi-simplex model for longitudinal designs (Heise, 1969; Jöreskog, 1970; Wiley and Wiley, 1970; Alwin, 2007) is viewed as one of the most useful ways to study reliability because it is a design that involves repeated over-time measurement, but allows
for changes in the underlying latent variable and thus allows for the separation of reliability from stability. The vast majority of estimates of data quality in surveys have relied on two of these designs: (1) the MTMM approach (see Saris and Gallhofer, 2007), and (2) the longitudinal quasi-Markov simplex approach (Alwin, 2007). I consider each of these approaches in the following discussion.
MULTITRAIT-MULTIMETHOD MODELS
There is an increasing amount of support for the view that shared method variance inflates ICR estimates. One approach to dealing with this is to reformulate the CTST along the lines of a multiple-factor approach and to include sources of systematic variation from both true variables and method factors. With multiple measures of the same concept, as well as different concepts measured by the same method, it is possible to formulate a multitrait-multimethod model. In general, the measurement of each of K traits by each of M methods (generating G = KM observed variables) allows the specification of such a model. Following from our discussion of nonrandom measurement errors above, we can formulate an extension of the common factor representation of the CTST as follows:

Y = ΛT* T* + ΛM M + E

where Y is a (G × 1) vector of observed random variables, T* is a (K × 1) vector of ‘trait’ true score random variables, M is an (M × 1) vector of ‘method’ true score random variables, and E is a (G × 1) vector of error scores. The matrices ΛT* and ΛM are (G × K) and (G × M) coefficient matrices containing the regression relationships between the G observed variables and the K latent trait and M latent method variables. Note that
with respect to the CTST model given above, ΛT T = ΛT* T* + ΛM M. The covariance structure for the model can be stated as:

ΣYY = [ΛT* ΛM] ΦT [ΛT* ΛM]′ + Θ2

where [ΛT* ΛM] is the (G × (K + M)) matrix formed by placing the trait and method loading matrices side by side, and ΦT is the block-diagonal covariance matrix of the trait and method factors, ΦT = diag(ΦT*, ΦM), with ΦT* the (K × K) trait covariance matrix, ΦM the (M × M) method covariance matrix, and zero blocks off the diagonal.
Note that the specification of the model places the constraint that the trait and method factors are uncorrelated. The estimation of this model permits the decomposition of reliable variance in each observed measure into ‘valid’ and ‘invalid’ parts. There is an extensive literature to assist in the design and interpretation of these models (see Alwin, 1974, 1997, 2007; Scherpenzeel and Saris, 1997; Saris and Gallhofer, 2007).
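To make the decomposition concrete, the following minimal Python/NumPy sketch builds the covariance matrix implied by a hypothetical 3-trait × 3-method design and splits the reliable variance of each measure into its ‘valid’ (trait) and ‘invalid’ (method) parts. All loadings and variances are made-up illustrative values, not estimates from the literature cited above.

import numpy as np

K, M = 3, 3                      # number of traits and methods
G = K * M                        # observed measures, one per trait-method pair

lam_t = np.full(G, 0.8)          # assumed 'validity' (trait) coefficients
lam_m = np.full(G, 0.3)          # assumed method-factor coefficients
theta2 = np.full(G, 0.25)        # assumed random error variances

# Trait and method loading matrices of order (G x K) and (G x M).
Lambda_T = np.zeros((G, K))
Lambda_M = np.zeros((G, M))
for t in range(K):
    for m in range(M):
        g = t * M + m            # row for the measure of trait t by method m
        Lambda_T[g, t] = lam_t[g]
        Lambda_M[g, m] = lam_m[g]

Phi_T = 0.5 * np.eye(K) + 0.5    # trait factors: unit variances, correlations 0.5
Phi_M = np.eye(M)                # method factors: unit variances, uncorrelated

# Implied covariance matrix; trait and method factors are uncorrelated, so the
# block-diagonal form reduces to the sum of two quadratic forms plus error.
Sigma = (Lambda_T @ Phi_T @ Lambda_T.T
         + Lambda_M @ Phi_M @ Lambda_M.T
         + np.diag(theta2))

valid = lam_t ** 2               # trait ('valid') variance of each measure
invalid = lam_m ** 2             # method ('invalid') variance of each measure
reliability = (valid + invalid) / np.diag(Sigma)
validity_share = valid / (valid + invalid)
print(np.round(reliability, 3), np.round(validity_share, 3))

With these assumed values each measure has a reliability of about 0.74, of which roughly 88 per cent is valid (trait) variance and the remainder is method variance.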
QUASI-SIMPLEX MODELS
It can be argued that for purposes of assessing the reliability of survey data, longitudinal data provide an optimal design, given the difficulties in the assumptions needed in cross-sectional designs (see Alwin, 2007). Indeed,
the idea of replication of questions in panel studies as a way of getting at measurement consistency has been present in the literature for decades – the idea of ‘test-retest correlations’ as an estimate of reliability being the principal example of a longitudinal approach. The limitations of the test-retest design are well-known (see Moser and Kalton, 1972: 353–354), but they can be overcome by incorporating three or more waves of data separated by lengthy periods of time (see Alwin, 2007: 96–116). The multiple-wave reinterview design discussed in this chapter goes well beyond the traditional test-retest design; specifically, by employing models that permit change in the underlying true score (using the quasi-Markov simplex approach), it allows us to overcome one of the key limitations of the test-retest design.5 Hence, a second major approach to estimating the reliability of single items uses the re-interview approach within a longitudinal design, or what are often called ‘panel’ designs. To address the issue of taking individual-level change into account, Coleman (1964, 1968) and Heise (1969) developed a technique based on three-wave quasi-simplex models within the framework of a model that permits change in the underlying variable being measured. This same approach can be generalized to multi-wave panels. This class
Figure 34.5 Path diagram of the quasi-Markov simplex model – general case (P > 4). Source: Alwin (2007)
of auto-regressive or quasi-Markov simplex model specifies two structural equations for a set of P over-time measures of a given variable Yt (where t = 1, 2, …, P) as follows (see Figure 34.5):

Yt = Tt + Et
Tt = βt,t−1 Tt−1 + Zt

The first equation represents a set of measurement assumptions indicating that (1) over-time measures are assumed to be τ-equivalent, except for true score change, and (2) that measurement error is random. The second equation specifies the causal processes involved in change of the latent variable over time. Here it is assumed that Zt is a random disturbance representing true score change over time. This model assumes a lag-1 or Markovian process in which the distribution of the true variables at time t is dependent only on the distribution at time t – 1 and not directly dependent on distributions of the variable at earlier times. If these assumptions do not hold, then this type of simplex model may not be appropriate. In
order to estimate such models, it is necessary to make some assumptions regarding the measurement error structures and the nature of the true change processes underlying the measures. All estimation strategies available for such multi-wave data require a lag-1 assumption regarding the nature of the true change. This assumption in general seems a reasonable one, but erroneous results can result if it is violated. The various approaches differ in their assumptions about measurement error. One approach assumes equal reliabilities over occasions of measurement (Heise, 1969). This is often a realistic and useful assumption, especially when the process is not in dynamic equilibrium, i.e., when the observed variances vary with time. Another approach to estimating the parameters of the above model is to assume constant measurement error variances rather than constant reliabilities. Where P = 3 (see Figure 34.6) either model is just-identified, and where P > 3 the model is overidentified with degrees of freedom equal to 0.5[P (P + 1)] – 2P. The four-wave model has two degrees of freedom, which can be used to perform likelihood-ratio tests of the fit of the model.
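To see how the multi-wave design separates unreliability from true change, the short simulation below generates a standardized lag-1 true-score process with made-up parameter values and applies Heise’s (1969) three-wave estimator, which assumes equal reliabilities across occasions; it is an illustrative sketch rather than a reproduction of any published analysis.

import numpy as np

rng = np.random.default_rng(7)
n, rel, beta = 100_000, 0.8, 0.9   # sample size, reliability, true-score stability

# Standardized lag-1 (Markov) true-score process over three waves.
t1 = rng.standard_normal(n)
t2 = beta * t1 + np.sqrt(1 - beta ** 2) * rng.standard_normal(n)
t3 = beta * t2 + np.sqrt(1 - beta ** 2) * rng.standard_normal(n)

# Observed scores: true score plus random error, equal reliability at each wave.
err_sd = np.sqrt((1 - rel) / rel)
y1, y2, y3 = (t + err_sd * rng.standard_normal(n) for t in (t1, t2, t3))

r = np.corrcoef([y1, y2, y3])      # observed over-time correlation matrix
r12, r23, r13 = r[0, 1], r[1, 2], r[0, 2]

# Heise (1969): with equal reliabilities, r12*r23/r13 recovers the reliability,
# and r13/r12 the standardized wave 2 -> wave 3 stability coefficient.
print(round(r12 * r23 / r13, 3))   # approximately 0.80
print(round(r13 / r12, 3))         # approximately 0.90

With three waves this model is just-identified; with four waves it has 0.5[P(P + 1)] – 2P = 2 degrees of freedom, as noted above. Under the alternative specification one would instead constrain the measurement error variances, rather than the reliabilities, to be equal across waves.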
Figure 34.6 Path diagram for a three-wave quasi-Markov simplex model. Source: Alwin (2007)
We can write the above model more compactly for a single variable assessed in a multi-wave panel study as:

Y = ΛY T + E
T = B T + Z, so that T = [I − B]−1 Z

Here Y is a (P × 1) vector of observed scores; T is a (P × 1) vector of true scores; E is a (P × 1) vector of measurement errors; Z is a (P × 1) vector of disturbances on the true scores; ΛY is a (P × P) identity matrix; and B is a (P × P) matrix of regression coefficients linking true scores at adjacent timepoints. For the case where P = 4 we can write the matrix form of the model as:

\[
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{pmatrix}
+
\begin{pmatrix} E_1 \\ E_2 \\ E_3 \\ E_4 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix}
1 & 0 & 0 & 0 \\
-\beta_{21} & 1 & 0 & 0 \\
0 & -\beta_{32} & 1 & 0 \\
0 & 0 & -\beta_{43} & 1
\end{pmatrix}
\begin{pmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{pmatrix}
=
\begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{pmatrix}.
\]
The reduced-form of the model is written as:
Y = ΛY [I − B]−1 Z + E

and the covariance matrix for Y as:

ΣYY = ΛY [I − B]−1 ψ [I − B′]−1 ΛY′ + Θ2

where B and ΛY are of the form described above, ψ is a (P × P) diagonal matrix of variances of the disturbances on the true scores, and Θ2 is a (P × P) diagonal matrix of measurement error variances. This model is thoroughly discussed elsewhere (see Heise, 1969; Wiley and Wiley, 1970; Bohrnstedt, 1983; Saris and Andrews, 1991; Alwin, 2007: 104–110) and can be estimated using several different structural equation modeling approaches. The most common design using this model is a three-wave panel design, in which constraints are placed on the error structure in order to estimate the model (e.g., Alwin, 2007; Alwin et al., 2014, 2015; Alwin and Beattie, 2016). One of the main advantages of the re-interview design, then, is that under appropriate circumstances it is possible to eliminate the confounding of the systematic error component discussed earlier, if systematic components of error are not stable over time. In order to address the question of stable components of error, the panel survey must deal with the problem of memory, because in the panel design, by definition, measurement is repeated. So, while this overcomes one limitation of cross-sectional surveys, it presents problems if respondents can remember what they say and are motivated to provide consistent responses. If re-interviews are spread over months or years, this can help rule out sources of bias that occur in cross-sectional studies. Given the difficulty of estimating memory functions, estimation of reliability from re-interview designs makes sense only if one can rule out memory as a factor in the covariance of measures over time, and thus, the occasions of measurement must be separated by sufficient periods of time to rule out the operation of memory.
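The reduced-form covariance structure given above can also be evaluated numerically; the NumPy sketch below uses made-up coefficients for a four-wave case and reads off the implied wave-specific reliabilities as the ratio of true-score variance to observed variance.

import numpy as np

P = 4
betas = [0.9, 0.85, 0.8]                  # illustrative lag-1 coefficients
Psi = np.diag([1.0, 0.25, 0.30, 0.35])    # variances of the disturbances Z
Theta2 = np.diag([0.4, 0.4, 0.4, 0.4])    # measurement error variances
Lambda_Y = np.eye(P)                      # identity matrix, as in the model

B = np.zeros((P, P))                      # B[t, t-1] = beta_{t,t-1}
for t, b in enumerate(betas, start=1):
    B[t, t - 1] = b

inv = np.linalg.inv(np.eye(P) - B)        # [I - B]^{-1}
Sigma_T = inv @ Psi @ inv.T               # covariance matrix of the true scores
Sigma_Y = Lambda_Y @ Sigma_T @ Lambda_Y.T + Theta2

# Wave-specific reliabilities: true-score variance over observed variance.
print(np.round(np.diag(Sigma_T) / np.diag(Sigma_Y), 3))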
RESEARCH FINDINGS
There is no lack of expert judgment among survey methodologists on how to write good questions and develop good questionnaires. Vast amounts have been written on the topic of what constitutes a good question over the past century, from the earliest uses of surveys down to the present (e.g., Galton, 1893; Ruckmick, 1930; Belson, 1981; Converse and Presser, 1986; Krosnick and Fabrigar, 1997; Schaeffer and Presser, 2003; Krosnick
and Presser, 2010; Schaeffer and Dykema, 2011). Many ‘tried and true’ guidelines for developing good survey questions and questionnaires aim to codify a workable set of rules for question construction, but little hard evidence confirms that following them improves survey data quality. On some issues addressed in this literature, there is very little consensus on the specifics of question design, which leads some to view the development of survey questions as more of an art than a science. Still, efforts have been made to codify what attributes are possessed by ‘good questions’ and/or ‘good questionnaires’ (e.g., Sudman and Bradburn, 1974, 1982; Schuman and Presser, 1981; Tanur, 1992; Schaeffer and Dykema, 2011), and many seek certain principles that can be applied in the practical situation. In this section I provide a selective review of the available literature that employs the rigorous criteria of data quality discussed above and allows an assessment of the conclusions reached about the quality of survey measurement and the following sources of variation in the efficacy of specific measurement approaches: (1) attributes of populations; (2) content of survey questions; (3) the architecture of survey questionnaires; (4) the source of the information; and (5) the attributes of survey questions.
Attributes of Populations
In some circles, reliability is thought of as being primarily a characteristic of the measures and the conditions of measurement, but as the above discussion makes clear, it is also inherently conditional on the attributes of the population to which the survey instrument is applied. This was recognized early on by Converse (1964) in his classic discussion of the survey measurement of subjective phenomena. It is known that reliability (and validity) can only be defined with respect to a particular population and its attributes, e.g., education or age. In the case of schooling,
there is mixed support for the idea that less educated respondents produce a greater amount of response error (see Alwin and Krosnick, 1991; Alwin, 2007). In the case of age, there are strong theoretical reasons to expect cognitive aging to produce less reliability in survey reports, but studies comparing rates of measurement error across age groups in actual surveys have generally not been able to find support for the hypothesis that aging contributes to greater errors of measurement (see review by Alwin, 2007: 218–220). Of course, age and education are confounded, in that the earlier-born cohorts have less education than more recent ones, and although there are serious challenges to assessing subpopulation differences in measurement quality, this is an important area for further investigation.
Content of Survey Questions
There is a long-standing distinction in the survey methods literature between objective and subjective questions (e.g., Kalton and Schuman, 1982; Turner and Martin, 1984; Schuman and Kalton, 1985). Another way to capture this distinction is by reference to the measurement of ‘facts’ versus ‘non-facts’, in which the former refers to information ‘directly accessible to the external observer’, and the latter to phenomena that ‘can be directly known, if at all, only by persons themselves’ (Schuman and Kalton, 1985: 643). In the words of Schuman and Kalton (1985: 643), ‘the distinction is a useful one, since questions about age, sex, or education could conceivably be replaced or verified by use of records of observations, while food preferences, political attitudes, and personal values seem to depend ultimately on respondent self-reports’. Recent research confirms the commonly held view that factual material can be more precisely measured than content that is essentially subjective, although there is considerable overlap. Few survey questions are
perfectly reliable – but the typical factual question can be shown to be substantially more reliably measured on average than the typical non-factual question. Some factual questions produce highly reliable data, e.g., reports by women of the number of children they have had, self-reports of age, and self-reports of weight (Alwin, 2007: 327). Still, most factual survey content is measured with error, although it is perhaps less vulnerable to sources of error than are non-factual questions. Even variables considered to be relatively ‘hard’ social indicators, such as education, income, and occupation, have levels of reliability that are far from perfect (see Alwin, 2007: 302–304; Hout and Hastings, 2012). Variables that involve subjective content (including attitudes, beliefs, values, self-evaluations, and self-perceptions) have lower reliabilities, in part because of the difficulties of translating internal cues related to such content into the response framework offered to the respondent; however, there is little difference in average reliability across types of non-factual content (Alwin, 2007: 158–162; Alwin et al., 2015). Alwin and Krosnick (1991: 144–147; see also Alwin et al., 2015) proposed three reasons why subjective, or non-factual, variables are more difficult to measure. They reasoned, first, that it is more difficult to argue that subjective traits ‘exist’ relative to objective facts; second, the measurement of subjective phenomena is made difficult by the ambiguity of the respondent’s internal cues; and third, the ambiguity in response scale options for translating the respondent’s implicit states into expressed scale positions creates further difficulties.
The Architecture of Survey Questionnaires
When researchers put together questionnaires for use in surveys, it is typically believed that the organization of questions into subunits larger than the question affects the quality of data, e.g., where the question is placed in the interview, or whether a question is placed in a series of questions that pertain to the same specific topic, or in a series of questions that not only deal with the same topic, but also involve the exact same response format. The latter are referred to as batteries of questions. Results provide relatively strong evidence that questions in what we referred to as a ‘topical series’ are less reliable than ‘stand-alone questions’ (at least among factual questions) and questions in ‘batteries’ are less reliable than questions in series (among non-factual questions) (Alwin, 2007: 171–172; Alwin et al., 2015). Perhaps the best explanation is that the factors motivating the researcher to group questions together – contextual similarity – are the same factors that promote measurement errors (see Andrews, 1984: 431). Similarity of question content and response format may actually distract the respondent from fully considering what information is being asked for, and this may reduce the respondent’s attention to the specificity of questions. Thus, measurement errors may be generated in response to the ‘efficiency’ features of the questionnaire. Unfortunately, it appears that respondents may also be more likely to ‘streamline’ their answers when the investigator ‘streamlines’ the questionnaire. Another possibility is that the context effects result from other aspects of the questions, particularly question form (e.g., type of rating scale used, the length of the question, the number of response categories, the labeling of response categories, or the explicit provision of a ‘Don’t Know’ option). In other words, measurement reliability may have less to do with question context (i.e., placement as stand-alone questions, or questions in series and batteries) per se, and more to do with the formal properties of questions appearing in one or the other type of questionnaire unit. If we can account for the observed differences in reliability by reference to the formal attributes of questions rather than the context
in which they are embedded, then the focus of our attention can be simplified. That is, we can perhaps pay less attention to the way in which the survey questionnaire is organized and more attention to the internal characteristics of the questions themselves.
Sources of Information
The most common form of survey data involves self-reports, but it is also commonplace for respondents to surveys to be asked questions about other people, including their spouse, their parents and children, and sometimes, their friends and co-workers. Because of differences in the process of reporting about the characteristics of others and the more common self-report source of information, one would expect that the nature of measurement errors might be different for proxy vs self-reports (Blair et al., 1991). Some evidence, however, suggests that ‘for many behaviors and even for some attitudes, proxy reports are not significantly less accurate than self-reports’ (Sudman et al., 1996: 243). If true, this is an encouraging result because it is often not possible to obtain self-reports in all cases and the proxy is frequently the only source of information on the person in question, e.g., the head of household. We should note, however, that the broader literature on this topic presents inconsistent results regarding the quality of reporting by proxies (see Moore, 1988). Using measurement reliability as a criterion for comparison, Alwin (2007) and his colleagues compared several measures involving proxy reports with comparable measures involving self-reports, where the content of the questions was identical (e.g., respondent’s education compared to spouse’s education), and of course, the same person was reporting in each case (the measures were often placed in different locations in the questionnaire). In virtually every case he found that the estimated reliability was higher for self-reports compared to the proxy reports, averaging
across six measures a difference of approximately 0.90 versus 0.80 in estimates of reliability. He found in these comparisons that proxy reports were significantly less reliable than self-reports, a conclusion that is consistent with both theory and experience, but one that is not consistent with the conclusions of Sudman et al. (1996). He concluded that further research is needed to create a broader inferential basis for the findings, but that the difference was remarkable given the small number of cases (Alwin, 2007: 151–152).
Attributes of Survey Questions
There are a number of different approaches to the evaluation of the attributes of questions that affect data quality (see, e.g., Sudman and Bradburn, 1974; Converse and Presser, 1986; Tanur, 1992; Sudman et al., 1996; Madans et al., 2011; Schaeffer and Dykema, 2011). Indeed, as already noted, there is a large literature that makes an effort to provide practical guidelines for the ‘best practices’ of question and questionnaire design. Many of these approaches employ subjective criteria, and rarely do they employ rigorous methods for defining the desirable attributes of questions. Sudman and Bradburn (1974) pioneered an effort to quantify the ‘response effects’ of various question forms. Schuman and Presser (1981) continued this effort by performing split-ballot experiments that varied question wording, question forms, question order, question context, etc. This early work stimulated a vast amount of research focusing on the effects of these experimental variations involving form, response order, and the like (e.g., Alwin and Krosnick, 1985; Krosnick and Alwin, 1987, 1989; Schwarz et al., 1991; Rammstedt and Krebs, 2007; Krebs and Hoffmeyer-Zlotnik, 2010; Moors et al., 2014). For the most part, we ignore this voluminous literature, and here restrict our attention primarily to those studies that employ the concept of reliability as an empirical criterion for evaluating the quality of
survey data, especially those using the multitrait-multimethod (MTMM) approach to reliability and validity assessment (e.g., Andrews, 1984; Saris and van Meurs, 1990; Saris and Andrews, 1991; Scherpenzeel, 1995; Alwin, 1997; Scherpenzeel and Saris, 1997; Saris and Gallhofer, 2007), and the use of longitudinal methods of reliability assessment (see Alwin and Krosnick, 1991; Alwin, 1992, 1997, 2007). Following up on this possibility that the attributes of questions can have a direct effect on data quality, several investigators have examined the formal characteristics of survey questions that are believed to affect the quality of measurement, using measurement reliability and validity as criteria for evaluating these issues. With respect to ‘fixed-form’ or ‘closed’ questions, we usually consider the question and response format as one ‘package’. There is, however, some value in disentangling the role of the response format per se in its contribution to the likelihood of measurement error. There are several types of closed-form questions, distinguished by the type of response format: Likert-type or ‘agree-disagree’ questions, ‘forced-choice’ questions (with two or three options), ‘feeling thermometers’, and various kinds of other rating scales, as well as the number of categories and the order in which they are presented. There is little apparent difference in estimated reliability associated with these several types of closed-form question formats (Alwin, 2007: 185–191). Previous comparisons of various types of closed-form response formats have examined other variations in the formal characteristics of survey questions, specifically the number of response options provided in closed-form questions, the use of unipolar versus bipolar response formats, the use of visual aids, particularly the use of verbal labeling of response categories, the provision of an explicit Don’t Know option, and the length of questions. Isolating the ‘effects’ of formal attributes of questions is challenging, due to the confounding of question content, context and
question characteristics. However, in most cases it is possible to isolate critical comparisons of specific forms of questions by carefully controlling for other pertinent features of question content and context. One of the most serious of these confounding factors is that rating scales differ with respect to whether they are assessing unipolar versus bipolar concepts. The key distinction I make here is between ‘unipolar’ versus ‘bipolar’ rating scales. Typically, the ends of a bipolar scale are of the same intensity but opposite valence, while the ends of a unipolar scale tend to differ in amount or intensity, but not valence. Our discussion of this issue noted that one often finds that some types of content, such as attitudes, are always measured using bipolar scales, whereas others, such as behavioral frequencies, are always measured using unipolar scales. Unipolar scales rarely use more than five categories, and it can be readily seen when viewed from this perspective that three-, four-, or five-category scales used to measure unipolar concepts are quite different from similar scales intended to measure bipolar concepts. Clearly, the issue of what the ‘middle category’ means for three- and five-category scales is quite different depending upon whether the concept being measured is unipolar or bipolar. One clearly cannot evaluate the use of bipolar and unipolar scales separately from the issue of the number of categories and vice versa. Although the early research on question form essentially ignored the question of the number of response options used in survey measurement, there is a small literature developing around the issue of how many response categories to use in the gathering of information in survey interviews (see Maitland, 2012a). The overall expectation regarding the number of response categories was based on information theory, which argues that scales with a greater number of categories can carry more information and thereby enhance measurement accuracy (e.g., Alwin, 1992, 1997). Some recent research has cast some doubt on this
hypothesis. On the basis of an elaborate split-ballot MTMM comparison of five-, seven-, and eleven-category measures of attitudes using face-to-face interviews in 23 of the 25 countries in the European Social Survey (ESS), Revilla et al. (2014) found the most support for five-category response scales. A problem with assessing the reliability of measurement of questions with different numbers of response categories (not addressed in the Revilla et al., 2014 research) is that the results depend on the approach used to estimate reliability (Alwin, 2007). There are several conclusions that emerge from the most recent examination of these issues, which more adequately handles the statistical estimation issues (see Alwin, 2007: 191–196). First, there seems to be little if any support for the information-theoretic view that more categories produce more reliability. There is no monotonic increase in estimated reliability with increases in the number of categories. Indeed, if one were to leave aside the nine-category scales, the conclusion would be that more categories produce systematically less reliability. Similarly, there is little support for the suggestion that five-category response formats are less reliable than four- and six-category scales, nor that seven-category scales are superior to all others. One aspect of previous analyses which is upheld in Alwin’s (2007) recent research is the conclusion that nine-category scales are superior to seven-point scales in levels of assessed reliability. This research raises the general issue of whether questions with middle alternatives are more or less reliable, given that respondents often use the middle category for other purposes, rather than to simply express neutrality (see, e.g., Sturgis et al., 2014). Finally, there are relatively few differences between unipolar and bipolar measures, net of number of response categories, that cannot be attributed to the content measured. Evaluating the differences in reliability of measurement across scales with different numbers of categories reveals the superiority of four- and
five-category scales for unipolar concepts. For bipolar measures, the two-category scale continues to show the highest levels of measurement reliability, and following this the three- and five-category scales show an improved level of reliability relative to all others. There are many fewer substantive differences among the bipolar measures, although it is relatively clear that seven-category scales achieve the poorest results (see Alwin, 2007; Maitland, 2012a). Due to the vagueness of many non-factual questions and response categories and the perceived pressure on the part of respondents to answer such questions (even when they have little knowledge of the issues or have given little thought to what is being asked), there is also concern that respondents will behave randomly, producing measurement unreliability. The explicit offering of a Don’t Know option may forestall such contributions to error. Numerous split-ballot experiments have found that when respondents are asked if they have an opinion, the number of Don’t Know responses is significantly greater than when they must volunteer a no-opinion response (see review by Krosnick, 2002). There is not complete agreement among studies focusing on the quality of measurement, although most research supports the null hypothesis. Andrews (1984) found that offering respondents the option to say ‘don’t know’ increased the reliability of attitude reports. Alwin and Krosnick (1991) found the opposite for seven-point rating scales and no differences for agree-disagree questions. McClendon and Alwin (1993) and Scherpenzeel and Saris (1997) found no differences in measurement error between forms. Alwin (2007: 196–200) compared measures of non-factual content using questions without an explicit Don’t Know option and comparable questions that did provide such an option – within three-, five-, and seven-category bipolar rating scales – and found no significant differences. Labels of response options reduce ambiguity in translating subjective responses into
categories of the response scales (Maitland, 2012b). Based on simple notions of communication and information transmission, better labeled response categories may be more reliable. It is a reasonable expectation that the more verbal labeling that is used in the measurement of subjective variables, the greater will be the estimated reliability. Andrews (1984: 432) reports that data quality is below average when all categories are labeled. My own research suggests that a significant difference in reliability exists for fully- vs partially-labeled response categories, such that measures with fully-labeled categories were more reliable (see Alwin, 2007: 200–202). This supports the conclusions of prior research, which found that among seven-point scales, those fully labeled were significantly more reliable (Alwin and Krosnick, 1991). Finally, one element of survey question writing to which the majority (but not all) of researchers subscribes is that questions should be as short as possible. With regard to question length, Payne’s (1951: 136) early writings on surveys, for example, suggested that questions should rarely number more than 20 words. In this tradition, a general rule for formulating questions and designing questionnaires is that questions should be short and simple (e.g., Sudman and Bradburn, 1982; Brislin, 1986; Fowler, 1992; van der Zouwen, 1999). Other experts suggest that lengthy questions may work well in some circumstances, and have concluded, for example, that longer questions may lead to more accurate reports in some behavioral assessments (see e.g., Cannell et al., 1977; Marquis et al., 1972; Converse and Presser, 1986). Advice to researchers on question length has therefore been somewhat mixed, and there is a common belief among many survey researchers, supported in part by empirical estimates of measurement reliability, that authors of survey questions should follow the KISS principle, that is, ‘keep it short and simple’ (see Alwin, 2007; Alwin and Beattie, 2016).
CONCLUSION
Survey research plays an extraordinarily important role in contemporary societies. Some have even described survey research as the via regia for modern social science (Kaase, 1999: 253). Vast amounts of survey data are collected for many purposes, including governmental information, public opinion and election surveys, advertising and marketing research as well as basic scientific research. Given the substantial social and economic resources invested each year in data collected to satisfy these social and scientific information needs, questions concerning the quality of survey data are strongly justified. Measurement error has serious consequences in the study of social behavior using sample surveys, regardless of the characteristics of the subject population. Without reliable measurement, the quantitative analysis of data hardly makes sense; yet there is a general lack of empirical information about these problems. Although concepts of reliability and validity have been applied to survey data for several decades (e.g., Marquis and Marquis, 1977; Marquis, 1978; Andrews, 1984; Alwin, 1989, 1992; Saris and Andrews, 1991), systematic applications of these ideas to standard types of survey questions for representative populations are only recently becoming available (e.g., Alwin, 1992, 1997, 2007; Scherpenzeel and Saris, 1997; Hout and Hastings, 2012; Saris and Gallhofer, 2007). Although there are some advantages in the use of MTMM designs in cross-sectional surveys for the purpose of estimating sources of invalidity, the use of longitudinal designs is optimal for assessing reliability. The main advantages of the re-interview design for reliability estimation are two. First, the estimate of reliability obtained includes all reliable sources of variation in the measure, both common and specific variance. Second, under appropriate circumstances it is possible to eliminate the confounding of the systematic error component discussed earlier, if
systematic components of error are not stable over time. In order to address the question of stable components of error, the panel survey must deal with the problem of memory, because in the panel design, by definition, measurement is repeated. So, while this overcomes one limitation of cross-sectional surveys, it presents problems if respondents can remember what they say and are motivated to provide consistent responses. If re-interviews are spread over months or years, this can help rule out sources of bias that occur in crosssectional studies. Given the difficulty of estimating memory functions, estimation of reliability from re-interview designs makes sense only if one can rule out memory as a factor in the covariance of measures over time, and thus, the occasions of measurement must be separated by sufficient periods of time to rule out the operation of memory. The analysis of panel data in the estimation of reliability also must be able to cope with the fact that people change over time, so that models for estimating reliability must take the potential for individual-level change into account (see Coleman, 1964, 1968; Wiggins, 1973; Goldstein, 1995). Given these requirements, techniques have been developed for estimating measurement reliability in panel designs where there are three or more waves, wherein change in the latent variable is incorporated into the model. With this approach there is no need to rely on multiple indicators within a particular wave or cross-section in order to estimate the measurement reliability, and there is no need to be concerned about the separation of reliable common variance from reliable specific variance. That is, there is no decrement to reliability estimates due to the presence of specific variance in the error term; here specific variance is contained in the true score. This approach is possible using modern SEM methods for longitudinal data, and is discussed further in related literature (see Alwin, 1989, 1992, 2007; Alwin and Krosnick, 1991; Saris and Andrews, 1991). Despite the importance of these developments for understanding the quality of
survey data, there is a reluctance on the part of many survey methods experts to evaluate questions in terms of their reliability and/or validity (see, e.g., Schaeffer and Dykema, 2011; Krosnick and Presser, 2010). They therefore ignore important literature in which the evaluation of survey questions is based on the quantification of their reliability and/or validity. This may be due in part to a failure to understand these tools, or because the availability of such statistical estimates requires more rigorous designs than most people are willing to consider as relevant. This is why the QEM approach of Madans et al. (2011) reflects a positive development and why it is important to emphasize the inclusion of studies of reliability and validity of survey measures as one set of tools for the overall evaluation of the quality of survey measurement. It is toward this end that I have written the present chapter, and I hope it has helped further the development of the science of survey data quality.
NOTES
1 The author wishes to thank the editors and Peter Schmidt for their generous comments on an earlier draft of this chapter.
2 The definition of parallel measurement insists upon tau-equivalence, i.e., T1 = T2, and equality of error variances, or VAR(E1) = VAR(E2). This means, at a minimum, Y1 and Y2 are expressed in the same metric, or are standardized in some way into a common metric. If they are not, then one needs to be able to express the units of T1 as a linear function of T2 and vice versa, that is, there must be a known relation between the units of one measure and the units of the other. If not, then these become parameters of the model, which renders this model with two measures under-identified. A similar model with three measures would be just-identified, and with four measures, overidentified.
3 Neither is it likely to be an appropriate design for reliability, if the same question is repeated with different response scales, as in the application of the MTMM design (see below), but it is nonetheless commonly accepted as an approach to estimating reliability.
4 Note that we use upper case letters (e.g., Y, T, and E) for random variables, and lower case notation for properties of persons (e.g., y, τ, and ε) in order to distinguish levels of analysis.
5 The literature discussing the advantages of the quasi-Markov simplex approach for separating unreliability from true change is extensive (see, e.g., Humphreys, 1960; Heise, 1969; Jöreskog, 1970; Wiley and Wiley, 1970; Werts et al., 1980; Alwin, 1989, 1992, 2007; Alwin and Krosnick, 1991).
RECOMMENDED READINGS
Alwin (2007). A review of the application of the concept of reliability to the evaluation of survey data and an application of longitudinal methods to the estimation of reliability of measurement.
Kempf-Leonard et al. (eds) (2005). The most comprehensive review of social measurement available in the literature.
Madans et al. (2011). A discussion of the array of approaches to evaluating the quality of measurement in surveys.
Marsden and Wright (eds) (2010). A comprehensive compendium of information on survey methods.
Saris and Gallhofer (2007). A review of question evaluation using the multitrait-multimethod approach to measurement error evaluation.
REFERENCES
Alwin, D.F. (1974). Approaches to the Interpretation of Relationships in the Multitrait-Multimethod Matrix. In H.L. Costner (ed.), Sociological Methodology 1973–74 (pp. 79–105). San Francisco: Jossey-Bass.
Alwin, D.F. (1989). Problems in the Estimation and Interpretation of the Reliability of Survey Data. Quality and Quantity, 23, 277–331.
Alwin, D.F. (1992). Information Transmission in the Survey Interview: Number of Response Categories and the Reliability of Attitude Measurement. In P.V. Marsden (ed.), Sociological Methodology 1992 (pp. 83–118). Washington, DC: American Sociological Association.
Alwin, D.F. (1997). Feeling Thermometers vs. Seven-point Scales: Which are Better? Sociological Methods and Research, 25, 318–340.
Alwin, D.F. (2005). Reliability. In K. Kempf-Leonard and others (eds), Encyclopedia of Social Measurement (pp. 351–359). New York: Academic Press.
Alwin, D.F. (2007). Margins of Error – A Study of Reliability in Survey Measurement. Hoboken, NJ: John Wiley & Sons.
Alwin, D.F. (2009). Assessing the Validity and Reliability of Timeline and Event History Data. In R.F. Belli, F.P. Stafford, and D.F. Alwin (eds), Calendar and Time Diary Methods in Life Course Research (pp. 277–307). Thousand Oaks, CA: SAGE Publications.
Alwin, D.F. (2010). How Good is Survey Measurement? Assessing the Reliability and Validity of Survey Measures. In P.V. Marsden and J.D. Wright (eds), Handbook of Survey Research (pp. 405–434). Bingley, UK: Emerald Group Publishing, Ltd.
Alwin, D.F. (2011). Evaluating the Reliability and Validity of Survey Interview Data Using the MTMM Approach. In J. Madans, K. Miller, A. Maitland, and G. Willis (eds), Question Evaluation Methods: Contributing to the Science of Data Quality (pp. 265–293). Hoboken, NJ: John Wiley & Sons.
Alwin, D.F. (2014). Investigating Response Errors in Survey Data. Sociological Methods and Research, 43, 3–14.
Alwin, D.F., and Beattie, B.A. (2016). The KISS Principle – Survey Question Length and Data Quality. Sociological Methodology. Forthcoming.
Alwin, D.F., and Jackson, D.J. (1979). Measurement Models for Response Errors in Surveys: Issues and Applications. In K.F. Schuessler (ed.), Sociological Methodology 1980 (pp. 68–119). San Francisco, CA: Jossey-Bass.
Alwin, D.F., and Krosnick, J.A. (1985). The Measurement of Values in Surveys: A Comparison of Ratings and Rankings. Public Opinion Quarterly, 49, 535–552.
Alwin, D.F., and Krosnick, J.A. (1991). The Reliability of Survey Attitude Measurement: The Influence of Question and Respondent Attributes. Sociological Methods and Research, 20, 139–181.
Alwin, D.F., Beattie, B.A., and Baumgartner, E.M. (2015). Assessing the Reliability of
Measurement in the General Social Survey: The Content and Context of the GSS Survey Questions. Paper presented at the 70th Annual Conference of the American Association for Public Opinion Research, Hollywood, FL, May 14, 2015.
Alwin, D.F., Zeiser, K., and Gensimore, D. (2014). Reliability of Self-reports of Financial Data in Surveys: Results from the Health and Retirement Study. Sociological Methods and Research, 43, 98–136.
Andrews, F.M. (1984). Construct Validity and Error Components of Survey Measures: A Structural Modeling Approach. Public Opinion Quarterly, 48, 409–442.
Belli, R.F., Shay, W.L., and Stafford, F.P. (2001). Event History Calendars and Question List Surveys: A Direct Comparison of Methods. Public Opinion Quarterly, 65, 45–74.
Belson, W.A. (1981). The Design and Understanding of Survey Questions. Aldershot, England: Gower.
Biemer, P.P. (2010). Overview of Design Issues: Total Survey Error. In P.V. Marsden and J.D. Wright (eds), Handbook of Survey Research (pp. 27–57). Bingley, UK: Emerald Group Publishing, Ltd.
Biemer, P.P. (2011). Latent Class Analysis of Survey Error. Hoboken, NJ: John Wiley & Sons.
Biemer, P.P., Groves, R.M., Lyberg, L.E., Mathiowetz, N.A., and Sudman, S. (1991). Measurement Errors in Surveys. New York: Wiley-Interscience.
Blair, J., Menon, G., and Bickart, B. (1991). Measurement Effects in Self and Proxy Responses to Survey Questions: An Information-Processing Perspective. In P.P. Biemer, R.M. Groves, L.E. Lyberg, N.A. Mathiowetz, and S. Sudman (eds), Measurement Errors in Surveys (pp. 145–166). New York: Wiley-Interscience.
Bohrnstedt, G.W. (1983). Measurement. In P.H. Rossi, J.D. Wright, and A.B. Anderson (eds), Handbook of Survey Research (pp. 70–121). New York: Academic Press.
Bohrnstedt, G.W. (2010). Measurement Models for Survey Research. In P.V. Marsden and J.D. Wright (eds), Handbook of Survey Research (pp. 347–404). Bingley, UK: Emerald Group Publishing, Ltd.
Bollen, K.A. (1989). Structural Equations with Latent Variables. New York: John Wiley & Sons.
Bound, J., Brown, C., Duncan, G.J., and Rodgers, W.L. (1990). Measurement Error in Cross-sectional and Longitudinal Labor Market Surveys: Validation Study Evidence. In J. Hartog, G. Ridder, and J. Theeuwes (eds), Panel Data and Labor Market Studies (pp. 1–19). Amsterdam: Elsevier Science Publishers.
Brislin, R.W. (1986). The Wording and Translation of Research Instruments. In W.J. Lonner and J.W. Berry (eds), Field Methods in Cross-Cultural Research (pp. 137–164). Newbury Park, CA: SAGE.
Campbell, D.T., and Fiske, D.W. (1959). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. Psychological Bulletin, 56, 81–105.
Cannell, C.F., Marquis, K.H., and Laurent, A. (1977). A Summary of Studies of Interviewing Methodology. Vital and Health Statistics, Series 2, No. 69, March.
Clogg, C.C., and Manning, W.D. (1996). Assessing Reliability of Categorical Measurements Using Latent Class Models. In A. van Eye and C.C. Clogg (eds), Categorical Variables in Developmental Research: Methods of Analysis (pp. 169–182). New York: Academic Press.
Coleman, J.S. (1964). Models of Change and Response Uncertainty. Englewood Cliffs, NJ: Prentice-Hall.
Coleman, J.S. (1968). The Mathematical Study of Change. In H.M. Blalock, Jr., and A.B. Blalock (eds), Methodology in Social Research (pp. 428–478). New York: McGraw-Hill.
Converse, J.M., and Presser, S. (1986). Survey Questions: Handcrafting the Standardized Questionnaire. Beverly Hills, CA: SAGE.
Converse, P.E. (1964). The Nature of Belief Systems in the Mass Public. In D.E. Apter (ed.), Ideology and Discontent (pp. 206–261). New York: Free Press.
Cronbach, L.J. (1951). Coefficient Alpha and the Internal Structure of Tests. Psychometrika, 16, 297–334.
Cronbach, L.J., and Meehl, P.E. (1955). Construct Validity in Psychological Tests. Psychological Bulletin, 52, 281–302.
Cronbach, L.J., Rajaratnam, N., and Gleser, G.C. (1963). Theory of Generalizability: A Liberalization of Reliability Theory. British Journal of Statistical Psychology, 16, 137–163.
Duncan, O.D. (1984). Notes on Social Measurement. New York, NY: Academic Press.
Eid, M., and Diener, E. (2006). Handbook of Multimethod Measurement in Psychology. Washington, DC: American Psychological Association.
Fowler, F.J. (1992). How Unclear Terms Affect Survey Data. Public Opinion Quarterly, 56, 218–231.
Galton, F. (1893). Inquiries into the Human Faculty and Its Development. London: Macmillan.
Goldstein, H. (1995). Multilevel Statistical Models, 2nd edition. London: Arnold.
Greene, V.L., and Carmines, E.G. (1979). Assessing the Reliability of Linear Composites. In K.F. Schuessler (ed.), Sociological Methodology 1980 (pp. 160–175). San Francisco, CA: Jossey-Bass.
Groves, R.M. (1989). Survey Errors and Survey Costs. New York: John Wiley & Sons.
Groves, R.M. (1991). Measurement Error Across the Disciplines. In P.P. Biemer et al. (eds), Measurement Errors in Surveys (pp. 1–25). New York: John Wiley & Sons.
Heise, D.R. (1969). Separating Reliability and Stability in Test-Retest Correlation. American Sociological Review, 34, 93–101.
Heise, D.R., and Bohrnstedt, G.W. (1970). Validity, Invalidity, and Reliability. In E.F. Borgatta and G.W. Bohrnstedt (eds), Sociological Methodology 1970 (pp. 104–129). San Francisco, CA: Jossey-Bass.
Hout, M., and Hastings, O.P. (2012). Reliability and Stability Estimates for the GSS Core Items from the Three-wave Panels, 2006–2010. GSS Methodological Report 119. Chicago, IL: NORC. http://publicdata.norc.org:41000/gss/documents//MTRT/MR119
Humphreys, L.G. (1960). Investigations of the Simplex. Psychometrika, 25, 313–323.
Jöreskog, K.G. (1970). Estimating and Testing of Simplex Models. British Journal of Mathematical and Statistical Psychology, 23, 121–145.
Jöreskog, K.G. (1971). Statistical Analysis of Sets of Congeneric Tests. Psychometrika, 36, 109–133.
Kaase, M. (ed.) (1999). Quality Criteria for Survey Research. Berlin: Akademie Verlag GmbH.
Kalton, G., and Schuman, H. (1982). The Effect of the Question on Survey Response: A Review. Journal of the Royal Statistical Society, Series A, 145, 42–73.
Kaplan, A. (1964). The Conduct of Inquiry. San Francisco, CA: Chandler Publishing Company.
Kempf-Leonard, K., and others (eds) (2005). Encyclopedia of Social Measurement. New York: Academic Press.
Kim, C., and Tamborini, C.R. (2014). Response Error in Earnings: An Analysis of the Survey of Income and Program Participation Matched with Administrative Data. Sociological Methods and Research, 43, 39–72.
Krebs, D., and Hoffmeyer-Zlotnik, J. (2010). Positive First or Negative First? Effects of the Order of Answering Categories on Response Behavior. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6(3), 118–127.
Krosnick, J.A. (2002). The Causes of No-Opinion Responses to Attitude Measures in Surveys: They Are Rarely What They Appear to Be. In R.M. Groves, D.A. Dillman, J.L. Eltinge, and R.J.A. Little (eds), Survey Nonresponse (pp. 87–100). New York: John Wiley & Sons.
Krosnick, J.A., and Alwin, D.F. (1987). An Evaluation of a Cognitive Theory of Response-Order Effects in Survey Measurement. Public Opinion Quarterly, 51, 201–219.
Krosnick, J.A., and Alwin, D.F. (1988). A Test of the Form-Resistant Correlation Hypothesis: Ratings, Rankings, and the Measurement of Values. Public Opinion Quarterly, 52, 526–538.
Krosnick, J.A., and Alwin, D.F. (1989). Aging and Susceptibility to Attitude Change. Journal of Personality and Social Psychology, 57, 416–425.
Krosnick, J.A., and Fabrigar, L.R. (1997). Designing Rating Scales for Effective Measurement in Surveys. In L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz, and D. Trewin (eds), Survey Measurement and Process Quality (pp. 141–164). New York: John Wiley & Sons.
Krosnick, J.A., and Presser, S. (2010). Question and Questionnaire Design. In P.V. Marsden and J.D. Wright (eds), Handbook of Survey Research (pp. 263–313). Bingley, UK: Emerald Group Publishing.
Lord, F.M., and Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Lyberg, L.E., Biemer, P.P., Collins, M., de Leeuw, E., Dippo, C., Schwarz, N., and Trewin, D.
(1997). Survey Measurement and Process Quality. New York: John Wiley & Sons.
Madans, J., Miller, K., Maitland, A., and Willis, G. (2011). Question Evaluation Methods: Contributing to the Science of Data Quality. Hoboken, NJ: John Wiley & Sons.
Maitland, A. (2012a). How Many Scale Points Should I Include for Attitudinal Questions? Survey Practice, April. Retrieved from http://www.surveypractice.org/index.php/SurveyPractice
Maitland, A. (2012b). Should I Label All Scale Points or Just the End Points for Attitudinal Questions? Survey Practice, April. Retrieved from http://www.surveypractice.org/index.php/SurveyPractice
Marquis, K.H. (1978). Record Check Validity of Survey Responses: A Reassessment of Bias in Reports of Hospitalizations. Santa Monica, CA: The Rand Corporation.
Marquis, K.H., Cannell, C.F., and Laurent, A. (1972). Reporting Health Events in Household Interviews: Effects of Reinforcement, Question Length, and Reinterviews. Vital and Health Statistics, Series 2, No. 45.
Marquis, M.S., and Marquis, K.H. (1977). Survey Measurement Design and Evaluation Using Reliability Theory. Santa Monica, CA: The Rand Corporation.
Marsden, P.V., and Wright, J.D. (eds) (2010). Handbook of Survey Research. Bingley, UK: Emerald Group Publishing, Ltd.
McClendon, M.J. (1991). Acquiescence and Recency Response-Order Effects in Interview Surveys. Sociological Methods and Research, 20, 60–103.
McClendon, M.J., and Alwin, D.F. (1993). No-Opinion Filters and Attitude Measurement Reliability. Sociological Methods and Research, 21, 438–464.
Moore, J.C. (1988). Self/Proxy Report Status and Survey Response Quality. Journal of Official Statistics, 4, 155–172.
Moors, G., Kieruj, N.D., and Vermunt, J.K. (2014). The Effect of Labeling and Numbering of Response Scales on the Likelihood of Response Bias. Sociological Methodology, 44, 369–399.
Moser, C.A., and Kalton, G. (1972). Survey Methods in Social Investigation, 2nd edition. New York: Basic Books.
O’Brien, D.P. (1991). Conditional Reasoning: Development. In R. Dulbecco (ed.), Encyclopedia
of Human Biology. San Diego, CA: Academic Press.
Payne, S.L. (1951). The Art of Asking Questions. Princeton, NJ: Princeton University Press.
Podsakoff, P.M., MacKenzie, S.B., and Podsakoff, N.P. (2012). Sources of Method Bias in Social Science Research and Recommendations on How to Control It. Annual Review of Psychology, 63, 539–569.
Rajaratnam, N. (1960). Reliability Formulas for Independent Decision Data when Reliability Data are Matched. Psychometrika, 25, 261–271.
Rammstedt, B., and Krebs, D. (2007). Does Response Scale Format Affect the Answering of Personality Scales? Assessing the Big Five Dimensions of Personality with Different Response Scales in a Dependent Sample. European Journal of Psychological Assessment, 23(1), 32–38.
Raykov, T. (1997). Estimation of Composite Reliability for Congeneric Measures. Applied Psychological Measurement, 21(2), 173–184.
Raykov, T. (2012). Scale Construction and Development using Structural Equation Modeling. In R. Hoyle (ed.), Handbook of Structural Equation Modeling (pp. 472–492). New York: Guilford Press.
Raykov, T., and Marcoulides, G.A. (2015). A Direct Latent Variable Modeling Based Method for Point and Interval Estimation of Coefficient Alpha. Educational and Psychological Measurement, 75, 146–156.
Raykov, T., and Shrout, P.E. (2002). Reliability of Scales with General Structure: Point and Interval Estimation Using a Structural Equation Modeling Approach. Structural Equation Modeling: A Multidisciplinary Journal, 9, 195–212.
Revilla, M.A., Saris, W.E., and Krosnick, J.A. (2014). Choosing the Number of Categories in Agree-Disagree Scales. Sociological Methods and Research, 43, 73–97.
Ruckmick, C.A. (1930). The Uses and Abuses of the Questionnaire Procedure. Journal of Applied Psychology, 14, 32–41.
Saris, W.E., and Andrews, F.M. (1991). Evaluation of Measurement Instruments Using a Structural Modeling Approach. In P.P. Biemer, R.M. Groves, L.E. Lyberg, N.A. Mathiowetz, and S. Sudman (eds), Measurement Errors in
Surveys (pp. 575–597). New York: John Wiley & Sons.
Saris, W.E., and Gallhofer, I.N. (2007). Design, Evaluation, and Analysis of Questionnaires for Survey Research. Hoboken, NJ: John Wiley & Sons.
Saris, W.E., and van Meurs, A. (1990). Evaluation of Measurement Instruments by Meta-analysis of Multitrait-Multimethod Studies. Amsterdam: North-Holland.
Schaeffer, N.C., and Dykema, J. (2011). Questions for Surveys: Current Trends and Future Directions. Public Opinion Quarterly, 75, 909–961.
Schaeffer, N.C., and Presser, S. (2003). The Science of Asking Questions. Annual Review of Sociology, 29, 65–88.
Scherpenzeel, A.C. (1995). A Question of Quality: Evaluating Survey Questions by Multitrait-Multimethod Studies. PhD thesis, University of Amsterdam, The Netherlands.
Scherpenzeel, A.C., and Saris, W.E. (1997). The Validity and Reliability of Survey Questions: A Meta-Analysis of MTMM Studies. Sociological Methods and Research, 25, 341–383.
Schuman, H., and Kalton, G. (1985). Survey Methods. In G. Lindzey and E. Aronson (eds), Handbook of Social Psychology, 3rd edition (pp. 634–697). New York: Random House.
Schuman, H., and Presser, S. (1981). Questions and Answers: Experiments in Question Wording, Form and Context. New York: Academic Press.
Schwarz, N., Knauper, B., Hippler, H.-J., and Clark, L. (1991). Rating Scale Numeric Values May Change the Meaning of Scale Labels. Public Opinion Quarterly, 55, 570–582.
Sijtsma, K. (2009). On the Use, the Misuse, and the Very Limited Usefulness of Cronbach’s Alpha. Psychometrika, 74, 107–120.
Smith, T.W. (2011). Refining the Total Survey Error Perspective. International Journal of Public Opinion Research, 23, 464–484.
Sturgis, P., Roberts, C., and Smith, P. (2014). Middle Alternatives Revisited: How the Neither/Nor Response Acts as a Way of Saying ‘I
Don’t Know’? Sociological Methods and Research, 43, 15–38.
Sudman, S., and Bradburn, N.M. (1974). Response Effects in Surveys. Chicago, IL: Aldine.
Sudman, S., and Bradburn, N.M. (1982). Asking Questions: A Practical Guide to Questionnaire Design. San Francisco, CA: Jossey-Bass.
Sudman, S., Bradburn, N.M., and Schwarz, N. (1996). Thinking About Answers: The Application of Cognitive Processes to Survey Methodology. San Francisco, CA: Jossey-Bass.
Tanur, J.M. (ed.) (1992). Questions About Questions – Inquiries into the Cognitive Bases of Surveys. New York: Russell Sage Foundation.
Tourangeau, R., Rips, L.J., and Rasinski, K. (2000). The Psychology of Survey Response. Cambridge: Cambridge University Press.
Traugott, M., and Katosh, J.P. (1979). Response Validity in Surveys of Voting Behavior. Public Opinion Quarterly, 43, 359–377.
Traugott, M., and Katosh, J.P. (1981). The Consequences of Validated and Self-reported Voting Measures. Public Opinion Quarterly, 45, 519–535.
Turner, C.F., and Martin, E. (1984). Surveying Subjective Phenomena, Vol. 1. New York: Russell Sage.
van der Zouwen, J. (1999). An Assessment of the Difficulty of Questions Used in the ISSP-questionnaires, the Clarity of Their Wording, and the Comparability of the Responses. ZA-Information, 46, 96–114.
Werts, C.E., Breland, H.M., Grandy, J., and Rock, D.R. (1980). Using Longitudinal Data to Estimate Reliability in the Presence of Correlated Measurement Errors. Educational and Psychological Measurement, 40, 19–29.
Wiggins, L.M. (1973). Panel Analysis: Latent Probability Models for Attitude and Behaviour Processes. New York: Elsevier.
Wiley, D.E., and Wiley, J.A. (1970). The Estimation of Measurement Error in Panel Data. American Sociological Review, 35, 112–117.
35
Nonresponse Error: Detection and Correction
Jelke Bethlehem and Barry Schouten
INTRODUCTION

In this chapter, the focus is on unit nonresponse, which in the following is simply referred to as nonresponse. Methodology is described for the detection and for the correction of nonresponse bias. Both detection and correction may take place during survey data collection and afterwards in the estimation. Adjustment during data collection refers to adaptive and responsive survey designs that have emerged in recent years. Nonresponse detection may be viewed as a monitoring and analysis stage that supports methodology to minimize the impact of nonresponse in the data collection and/or in the estimation stage. In every stage, however, auxiliary variables play the central role. In the data collection stage, nonresponse error can be reduced by ensuring that nonresponse is unrelated to the survey variables of interest, or, in other words, that the survey response is as representative as possible with respect
to these variables. Since representativity with respect to the survey variables can typically not be assessed, the focus shifts to the level of the response rate and the impact of nonresponse on relevant auxiliary variables. Such variables may arise from the sampling frame, e.g. gender, size of a business, and urbanization of the area of residence, from linked administrative data, e.g. income and reported VAT, or from paradata, e.g. status of the dwelling and neighborhood. Traditionally, data collection design is uniform, i.e. the same strategies are applied to all selected population units, although interviewers usually do adapt their strategies to population units in a non-formalized way. If there is a strong difference in both costs and nonresponse error between strategies for subpopulations formed by relevant auxiliary variables, then adaptation becomes meaningful. Such adaptive designs essentially attempt to balance survey response and, in doing so, correct during data collection. Such balancing does not preclude correction in the estimation stage, which is usually based on a larger and
more accurate set of auxiliary variables. The two stages are discussed separately. The reader is referred to Chapter 27 in this Handbook for a more elaborate discussion of the causes and consequences of nonresponse, to Chapter 30 for a general description of weighting methods, i.e. not necessarily related to nonresponse, and to Chapter 26 for an extended discussion on the implementation of adaptive survey designs. Groves and Couper (1998), Groves et al. (2002), Särndal and Lundström (2005), and Bethlehem et al. (2011) contain extended discussions of nonresponse error detection and correction.
DETECTING NONRESPONSE ERROR

The detection of nonresponse error cannot be discussed without introducing some notation and a framework. An important role in the framework is played by response probabilities and response propensities.
Response Propensities and Nonresponse Error

Nonresponse in Surveys

There are various ways of selecting a sample for a survey, but over the years it has become clear that the only scientifically sound way to do this is by means of probability sampling. All objects in the surveyed population (persons, households, companies, etc.) must have a non-zero probability of selection, and all these selection probabilities must be known. If these conditions are satisfied, unbiased estimates of population characteristics can be computed. Moreover, the accuracy of these estimators can be quantified, for example by means of a margin of error or a confidence interval. In practice, the survey situation is usually not so perfect. There are many phenomena that may cause problems, one of the most important ones being nonresponse. One
obvious effect of nonresponse is that the sample size is smaller than planned. This leads to less precise, but still valid, estimates of population characteristics. This is not a serious methodological problem, as it can be taken care of by making the initial sample size larger. However, more budget is required per respondent. A far more serious effect of nonresponse is that estimates of population characteristics may be biased. This situation occurs if, due to nonresponse, some groups in the population are over- or under-represented, and these groups behave differently with respect to the characteristics to be investigated. Hence, the representativity of the survey is affected. In this section we describe the general theoretical framework for nonresponse. Such a framework allows us to gain insight into the possible effects of nonresponse. It also helps to explore possible correction techniques. An often-used approach is to introduce the concept of the response probability. It is assumed that every member of the target population of the survey has a certain probability of responding to the survey if asked to do so. These response probabilities are an abstract concept, and, therefore, their values are unknown. The idea is to estimate the response probabilities using the available data. Analysis of the estimated response probabilities may provide insight into the possible effects of nonresponse, and they can also be used to correct estimates for a bias due to nonresponse. Let the finite survey population U consist of a set of N identifiable elements, which are labeled 1, 2, …, N. The values of a target variable Y of the survey are denoted by Y1, Y2, …, YN. The objective of the sample survey is assumed to be the estimation of the population mean
\bar{Y} = \frac{1}{N} \sum_{k=1}^{N} Y_k .    (1)
To estimate this population parameter, a probability sample of size n is selected without replacement. The first order inclusion probabilities of the sampling design are
denoted by π1, π2, …, πN. A sample can be represented by the set of indicators a1, a2, …, aN, where the k-th indicator ak assumes the value 1 if element k is selected in the sample, and otherwise it assumes the value 0. So E(ak) = πk, for k = 1, 2, …, N. Horvitz and Thompson (1952) show in their seminal paper that it is always possible to construct an unbiased estimator in the case of full response. This estimator is defined by
\bar{y}_{HT} = \frac{1}{N} \sum_{k=1}^{N} \frac{a_k}{\pi_k} Y_k .    (2)

This is an unbiased estimator of the population mean. The situation is different in the case of nonresponse. It is assumed that each element k in the population has a certain, unknown probability ρk of response. If element k is selected in the sample, a random mechanism is activated that results with probability ρk in response and with probability 1 − ρk in nonresponse. Under this model, a set of response indicators R1, R2, …, RN can be introduced, where Rk = 1 if the corresponding element k responds, and where Rk = 0 otherwise. So, P(Rk = 1) = ρk, and P(Rk = 0) = 1 − ρk. The survey response only consists of those elements k for which ak = 1 (it is selected in the sample) and Rk = 1 (it responds). Hence, the number of available cases is equal to

n_R = \sum_{k=1}^{N} a_k R_k .    (3)

Note that this realized sample size is a random variable. The values of the target variable only become available for the nR responding elements. To obtain an unbiased estimator of the population mean, the Horvitz–Thompson estimator (2) must be adapted. In principle,

\bar{y}_{HT}^{*} = \frac{1}{N} \sum_{k=1}^{N} \frac{a_k R_k}{\pi_k\, \rho_k} Y_k    (4)

is an unbiased estimator. Unfortunately, it cannot be used in practice because the values of the response probabilities are unknown. Lacking any knowledge about the response behavior of the elements in the target population, the only solution is to replace each response probability ρk in (4) by the unbiased estimate

\bar{r}_{HT} = \frac{1}{N} \sum_{k=1}^{N} \frac{a_k R_k}{\pi_k}    (5)

of the mean response probability

\bar{\rho} = \frac{1}{N} \sum_{k=1}^{N} \rho_k .    (6)

The resulting estimator is

\bar{y}_{HT,R} = \frac{1}{N} \sum_{k=1}^{N} \frac{a_k R_k}{\pi_k\, \bar{r}_{HT}} Y_k .    (7)

This is not an unbiased estimator. Bethlehem (1988) shows that the bias is approximately equal to

B(\bar{y}_{HT,R}) = E(\bar{y}_{HT,R} - \bar{Y}) \approx \frac{\mathrm{cor}(Y, \rho)\, S(y)\, S(\rho)}{\bar{\rho}} ,    (8)
in which cor(Y, ρ) is the correlation between the target variable of the survey and the response probabilities, S(y) is the standard deviation of the target variable, S(ρ) is the standard deviation of the response probabilities, and ρ̄ is the mean response probability. By taking a closer look at this expression, three important observations can be made:

1 The stronger the correlation between the target variable and the response behavior, the larger the bias will be. If there is no correlation, there is no bias.
2 The larger the variation of the response probabilities, the larger the bias will be. If all response probabilities are the same, there is no bias.
3 The smaller the response probabilities, the larger the bias. A high response rate reduces the bias.
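To make expression (8) concrete, the following minimal Python sketch compares the approximation (8) with the bias computed directly from assumed response probabilities. The population values are invented for illustration and are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: target variable Y and response probabilities rho
# that are positively correlated, so nonresponse biases the response mean upward.
N = 100_000
z = rng.normal(size=N)
Y = 50 + 10 * z + rng.normal(scale=5, size=N)      # target variable
rho = 1 / (1 + np.exp(-(-0.5 + 0.8 * z)))          # response probabilities

# Approximate bias of the unadjusted response mean, expression (8):
# B(y_HT,R) ~ cor(Y, rho) * S(Y) * S(rho) / mean(rho)
cor = np.corrcoef(Y, rho)[0, 1]
approx_bias = cor * Y.std(ddof=1) * rho.std(ddof=1) / rho.mean()

# Expected bias under the response model: E(response mean) - population mean
expected_response_mean = np.sum(rho * Y) / np.sum(rho)
direct_bias = expected_response_mean - Y.mean()

print(f"approximate bias (8): {approx_bias:.3f}")
print(f"direct bias         : {direct_bias:.3f}")
```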
In the literature on nonresponse, three types of nonresponse mechanisms are distinguished. They are related to expression (8). For a recent discussion see Seaman et al. (2013). The first nonresponse mechanism is called Missing Completely At Random (MCAR). This is the case if nonresponse is completely unrelated to all variables in the survey. Consequently, cor(Y, ρ) = 0 for all Y, and all estimators will be unbiased. The second mechanism is called Not Missing At Random (NMAR). This is the case if there is a direct relationship between the target variable and response behavior. Since cor(Y, ρ) ≠ 0 and S(ρ) > 0, estimates for this target variable will be biased, and it will not be possible to reduce the bias. The third mechanism is called Missing At Random (MAR). This is the case if there is a relationship between some auxiliary variables X in the survey and response behavior. Estimates for target variables can be biased. It is possible to remove the bias by applying some correction technique like adjustment weighting (see subsection 'Adjustment by Estimation: Weighting' in section 'Correcting Nonresponse Error').
Response Propensities

Response probabilities play an important role in the analysis and correction of nonresponse. Unfortunately, these probabilities are unknown. The idea is now to estimate the response probabilities using the available data. To this end, the concept of the response propensity is introduced. Following Little (1986), the response propensity of element k is defined by
\rho_X(X_k) = P(R_k = 1 \mid X_k) ,    (9)
where Rk is the response indicator, and Xk = (Xk1, Xk2, …, Xkp) is a vector of values of, say, p auxiliary variables. So the response propensity is the probability of response conditional on the values of a set of auxiliary
variables. The response propensity is a special case of the propensity score introduced by Rosenbaum and Rubin (1983). The response propensities are also unknown, but they can be estimated provided the values of the auxiliary variables are available for both the respondents and nonrespondents. To be able to estimate the response propensities, a model must be chosen. The most frequently used one is the logistic regression model. It assumes the relationship between response propensity and auxiliary variables can be written as
\mathrm{logit}(\rho_X(X_k)) = \log\!\left( \frac{\rho_X(X_k)}{1 - \rho_X(X_k)} \right) = \sum_{j=1}^{p} X_{kj}\, \beta_j ,    (10)
where β = (β1, β2, …, βp) is a vector of p regression coefficients. The logit transformation ensures that estimated response propensities always lie in the interval (0, 1). Another model that can be used is the probit model, and even a linear regression model can be used, provided that response propensities are not too close to 0 or 1. Bethlehem (2012) shows that the values of estimated response propensities sometimes hardly differ between models. As an example, the logit model is applied to the General Population Survey (GPS). The GPS was a face-to-face survey. The target population consisted of persons of age 12 and older. All persons had the same selection probability. The GPS survey data set contains many auxiliary variables. Six of these variables were selected for the logit model. Of course, all these variables must have explanatory power. The variables were degree of urbanization, ethnic origin, type of household, job status (yes/no employment), age (in 13 categories), and average value of the houses in the neighborhood. The model was fitted, and subsequently used to estimate the response propensities.
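A sketch of how such propensities could be estimated in practice is given below. It assumes a hypothetical sample file with a 0/1 response indicator and categorical auxiliary variables; the file name and column names are illustrative, not the actual GPS data, and a standard logistic regression implementation stands in for the model in (10).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical sample file: one row per sampled person, with a 0/1 response
# indicator and categorical auxiliary variables (names are illustrative).
sample = pd.read_csv("gps_sample.csv")
aux = ["urbanization", "ethnic_origin", "household_type",
       "job_status", "age_class", "house_value_class"]

X = pd.get_dummies(sample[aux], drop_first=True)   # dummy-code the categories
y = sample["responded"]                            # R_k: 1 = response, 0 = nonresponse

# Logit model (10): response propensities conditional on the auxiliary variables
model = LogisticRegression(max_iter=1000).fit(X, y)
sample["propensity"] = model.predict_proba(X)[:, 1]

print(sample["propensity"].describe())                            # spread of propensities
print(sample.groupby("urbanization")["propensity"].describe())    # cf. Figure 35.2
```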
Figure 35.1 Distribution of the estimated response propensities.
Figure 35.1 shows the distribution of the estimated response propensities. It is clear that the response propensities are not equal; there is substantial variation. The propensities vary between 0.150 and 0.757. Apparently there are people with small response propensities, who are under-represented in the survey response, and people with large response propensities, who are over-represented. Insight into the relationship between auxiliary variables and response propensities can be obtained by making boxplots. Figure 35.2 shows an example. It shows the distribution of the response propensities within each degree of urbanization. One conclusion is that response is particularly low in very strongly urbanized areas (the big cities). Figure 35.2 also shows that the response propensities vary both between categories and within categories. The ideal situation would be that the response propensities do not vary within categories but only between categories. That would indicate MAR. Unfortunately this is not the case. Correction using degree of urbanization only will
probably decrease the bias but will not eliminate it. This single auxiliary variable is not capable of completely explaining nonresponse; more auxiliary variables are required.
Indicators for Detecting Nonresponse Error

In this section, various indicators are described and discussed that are indirect measures of nonresponse error in a survey. However, let there be no doubt about it: nonresponse error cannot be detected without information that is auxiliary to the survey. Such auxiliary information may be available from the sampling frame, from administrative data that can be linked to the sample, from (commercial) databases that provide aggregated statistics that can be linked to the sample, from paradata that are observed during data collection, or from population statistics produced from a government census or a reference survey. However, in all cases the variables that can be taken from these sources need to relate to the key survey variables.
Figure 35.2 Boxplots of response propensities by degree of urbanization.
The indicators in this section merely transform the multivariate auxiliary information into informative lower-dimensional statistics; without relevant auxiliary variables they are not meaningful. There are three levels of auxiliary information: (1) the auxiliary variable is available for all population units including respondents and nonrespondents; (2) the auxiliary variable is available for all sample units; and (3) the auxiliary variable is available for respondents and on an aggregated population level. The three options are referred to as population level, sample level, and aggregated population level auxiliary variables. The detection of nonresponse error for aggregated population level variables needs more care, because sampling causes random variation in the distributions; only the net effect of sampling error and nonresponse error can be observed, so that part of the observed bias is just random noise. This complication diminishes for large samples and does not exist for self-selection or non-probability-based samples, where effectively the whole population constitutes the sample.
Next, it should be noted that indicators are not needed per se; nonresponse error can simply be analyzed by standard multivariate techniques like logistic or probit regression models. In these models missingness is explained by the auxiliary variables, and the regression coefficients are informative about the direction and magnitude of error. For aggregated population level variables such analyses are less straightforward but may still be performed through logit models. However, in all cases regression coefficients cannot easily be compared from one survey to another, and they have no obvious translation to the variable level or to the overall level. This is where the utility of indicators comes in: they have been developed to allow for a quick, multivariate, top-down look at nonresponse error that allows for comparisons between surveys, between waves of surveys, and during data collection. In this chapter, the attention is restricted to indicators for representativity, or R-indicators, but there is a close relation to balance indicators; see Särndal (2011).
The website www.risq-project.eu contains SAS and R code for all indicators discussed in the following subsections, including analytic standard error approximations. See also De Heij et al. (2015).
Sample-based R-indicators and Partial R-indicators

R-indicators form a hierarchical set of indirect measures for nonresponse bias. The overall R-indicator, usually simply referred to as the R-indicator, gives a single overall view of the representativeness of response over a set of variables. Partial R-indicators detail that view to variables and to categories of variables. They come in unconditional and conditional versions, where conditional versions adjust for multi-collinearity between variables. The various R-indicators are defined and motivated in this subsection. R-indicators are based on the concept of representative response: The response of a survey is called representative with respect to X if the response propensity function ρX(x) is constant in all possible x that X can attain. The indicators measure deviations from representative response. The overall R-indicator, see Schouten et al. (2009) and Shlomo et al. (2012), for X is defined as the standard deviation S(ρX) of the response propensities transformed to the [0,1] interval by
R(X) = 1 - 2\, S(\rho_X) .    (11)
When all propensities are equal, the standard deviation is zero and, hence, fully representative response is represented by a value of 1 for the indicator. A value of 0 indicates the largest possible deviation from representative response. In order to locate the sources of deviations from representative response, Schouten et al. (2011) introduce partial R-indicators for categorical variables X. Partial R-indicators perform an analysis-of-variance decomposition of the total variance of response propensities into between and within variances,
where the between and within components follow population stratifications based on the variables in X. The between and within variance components help to identify variables that are responsible for a large proportion of the variance. The partial R-indicators are linked to a second definition called conditional representative response, defined as a lack of within variance. The resulting between and within components are termed unconditional and conditional partial R-indicators. Schouten et al. (2011) restricted themselves to categorical variables. The extension to continuous variables is relatively straightforward but is not discussed in the literature nor implemented in code. The reason is that continuous variables are often categorized before they are used as explanatory variables in models for nonresponse. For partial R-indicators, again some notation is introduced. Let Z be a categorical auxiliary variable not included in X. Let ρX,Z(x, z) be the probability of response given that X = x and Z = z. The response to a survey is called conditionally representative with respect to Z given X when conditional response propensities given X are constant for Z, i.e. when ρX,Z(x, z) = ρX(x) for all z. The square root of the between variance of the response propensities ρX,Z for a stratification based on Z,

S_B(\rho_{X,Z}) = \sqrt{ \frac{1}{N} \sum_{m=1}^{M} \sum_{k \in U_m} \left( \bar{\rho}_{X,Z,m} - \bar{\rho} \right)^{2} } ,

with \bar{\rho}_{X,Z,m} the average response propensity within category m of Z, is called the unconditional partial R-indicator. It is denoted by Pu(Z) and it holds that Pu(Z) ∈ [0, 0.5]. So values of Pu(Z) close to 0 indicate that Z does not produce variation in response propensities, while values close to 0.5 represent a variable with maximal impact on representativity. For categorical variables the between variance can be further decomposed to the category level in order to detect which categories contribute most. Let Z be a categorical variable with categories m = 1, 2, …, M and let Zm,k be the 0–1 dummy variable that indicates
whether unit k is in category m, i.e. Zk = m, or not. For example, Z represents the region of a country and Zm is the indicator for area m. The unconditional partial R-indicator for category m is defined as

P_u(Z, m) = \sqrt{\frac{N_m}{N}} \left( \frac{1}{N_m} \sum_{k=1}^{N} Z_{m,k}\, \rho_{X,Z}(x_k, Z_k) - \frac{1}{N} \sum_{k=1}^{N} \rho_{X,Z}(x_k, Z_k) \right) ,    (12)
with Nm being the number of population units in category m. We have that Pu(Z, m) ∈ [−0.5, 0.5]. So a value close to 0 implies that the category subpopulation shows no deviation from average response behavior, while values close to −0.5 and 0.5 indicate maximal underrepresentation and overrepresentation, respectively. The logical counterpart to the unconditional partial R-indicator is the conditional partial R-indicator. It considers the other variance component: the within variance. The conditional partial R-indicator for Z given X, denoted by Pc(Z|X), is defined as the square root of the within variance SW(ρX,Z) for a stratification based on X. Again it can be shown that Pc(Z|X) ∈ [0, 0.5], but now the interpretation is conditional on X. A value close to 0 means that the variable does not contribute to variation in response propensities in addition to X, while large values indicate that the variable brings in new variation. When X is type of economic activity and Z is region, then Pc(Z|X) = 0 means that one should focus on economic activity when improving response representativity, as region does not add any variation. Again for categorical variables Z, the within variance can be broken down to the category level. The category-level conditional partial R-indicator for category m is
P_c(Z, m \mid X) = \sqrt{ \frac{1}{N-1} \sum_{k=1}^{N} Z_{m,k} \left( \rho_{X,Z}(x_k, Z_k) - \rho_X(x_k) \right)^{2} } .    (13)
Unlike the unconditional indicators, the conditional indicators do not have a sign. A sign would have no meaning as the representation may be different for each category of X. For instance, in some categories a certain economic activity may have a positive effect on response while in others it may have a negative effect. The conditional partial R-indicator for Z is always smaller than the unconditional partial R-indicator for that variable; the impact on response behavior is to some extent removed by accounting for other characteristics of the population unit. To this point, it has been assumed that population level and/or sample level auxiliary variables are available. For these two levels, response propensities can be estimated directly and without bias asymptotically, given the right specification of the link function between auxiliary variables and nonresponse. Aggregated population level variables require a different estimation, as no information is available about nonrespondents. Bianchi et al. (2016) give an extensive discussion of how to estimate indicators for this level of auxiliary information. In evaluating representativeness of response, it is important to test statistically whether (partial) R-indicator values are different. For independent samples, statistical testing is standard and can be performed using approximated standard errors, see Shlomo et al. (2012), available in SAS and R code at www.risq-project.eu. For dependent samples such testing could be performed using resampling methods accounting for the sampling design.
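As a rough illustration of how these indicators can be computed from estimated propensities, the sketch below implements the overall R-indicator and simplified partial R-indicators in Python. It is not the official RISQ implementation: it assumes an equal-probability sample, ignores the sampling design and standard errors, and uses illustrative column names.

```python
import numpy as np
import pandas as pd

def r_indicator(propensities):
    """Overall R-indicator (11): R(X) = 1 - 2 * S(rho_X)."""
    return 1.0 - 2.0 * np.std(propensities, ddof=1)

def partial_unconditional(df, z, prop="propensity"):
    """Category-level unconditional partial R-indicators for variable z, as in (12):
    sqrt(share of category m) * (mean propensity in m - overall mean propensity).
    Simplification: equal inclusion probabilities, sample-based shares and means."""
    overall = df[prop].mean()
    return pd.Series({m: np.sqrt(len(g) / len(df)) * (g[prop].mean() - overall)
                      for m, g in df.groupby(z)})

def partial_conditional(df, z, x_vars, prop="propensity"):
    """Variable-level conditional partial R-indicator for z given the variables in
    x_vars: square root of the within variance of the propensities over the strata
    formed by x_vars (propensities assumed to be estimated on x_vars and z)."""
    within_mean = df.groupby(x_vars)[prop].transform("mean")
    return float(np.sqrt(np.mean((df[prop] - within_mean) ** 2)))

# Example, reusing the hypothetical `sample` data frame from the earlier sketch:
# print(r_indicator(sample["propensity"]))
# print(partial_unconditional(sample, "urbanization"))
# print(partial_conditional(sample, "urbanization", ["age_class", "job_status"]))
```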
Coefficient of Variation and Nonresponse Bias

R-indicators can be interpreted in terms of nonresponse bias through the variance of response propensities. Consider the standardized bias of the design-weighted, unadjusted response mean ŷr of an arbitrary variable Y. The standardized bias of the mean can be bounded from above by

\frac{|B(\hat{\bar{y}}_r)|}{S(y)} = \frac{|E(\hat{\bar{y}}_r) - \bar{Y}|}{S(y)} = \frac{|\mathrm{Cov}(y, \rho)|}{\bar{\rho}\, S(y)} = \frac{|\mathrm{Cov}(y, \rho_{\aleph})|}{\bar{\rho}\, S(y)} \le \frac{S(\rho_{\aleph})}{\bar{\rho}} = \frac{1 - R(\aleph)}{2\bar{\rho}} ,    (14)
with ρ̄ the unit response rate (or average response propensity) and ℵ some 'super' vector of auxiliary variables providing full explanation of nonresponse behavior. Clearly, the propensity function ρℵ is unknown. Since R-indicators are used for the comparison of the representativity of response in different surveys or the same survey over time, the interest lies in the general representativity of a survey, i.e. not the representativity with respect to single variables. Therefore, as an approximation to (14),
CV(X) = \frac{1 - R(X)}{2\bar{\rho}_X}    (15)

is used. CV is the coefficient of variation of the response propensities and is the maximal (standardized) bias for all variables that are linear combinations of the components of X. For other variables, (15) does not provide an upper bound to the bias. The choice of X, therefore, is very important, but even for relevant X, (15) cannot be extrapolated to all survey target variables. If the selected variables in X are correlated with the survey variables, then (15) is informative as a quality indicator. If they are not correlated with the survey variables, then the indicator has limited utility.

A useful graphical display of unit response rates and response representativity is given by so-called response-representativity functions. Ideally, one would like to bound the R-indicator from below, i.e. to derive values of the R-indicator that are acceptable and values that are not. If the R-indicator takes a value below some lower bound, then measures to improve response are paramount. Response-representativity functions can be used for deriving such lower bounds for the R-indicator. They are a function of a threshold γ and the unit response rate. The threshold γ represents a quality level. The functions are defined as

RR(\gamma, \bar{\rho}_X) = 1 - 2\, \bar{\rho}_X\, \gamma ,    (16)

and follow from demanding that the maximal bias given by (15) is not allowed to exceed the prescribed threshold γ, i.e. from taking CV(X) = γ. For the sake of brevity, the reader is referred to Bethlehem et al. (2011) for examples.
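The translation from an R-indicator to a maximal bias, and from a bias threshold back to a lower bound on the R-indicator, is simple arithmetic, as the following sketch shows; the numerical inputs are taken from Table 35.1 or chosen for illustration.

```python
def max_standardized_bias(r_value, response_rate):
    """Approximate upper bound (15) on the standardized bias of an unadjusted
    response mean: CV(X) = (1 - R(X)) / (2 * mean propensity)."""
    return (1.0 - r_value) / (2.0 * response_rate)

def r_lower_bound(gamma, response_rate):
    """Response-representativity function (16): the smallest acceptable R-indicator
    value when the maximal bias is not allowed to exceed the threshold gamma."""
    return 1.0 - 2.0 * response_rate * gamma

print(max_standardized_bias(0.806, 0.587))   # GPS after two months, cf. Table 35.1
print(r_lower_bound(0.05, 0.50))             # 50% response rate, 5% threshold -> 0.95
```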
Indicators and Auxiliary Variables

As explained in 'Nonresponse in Surveys' above, the auxiliary variables may be available at different levels. The most informative level is the population level, where auxiliary variables can be linked to all population units. There are, however, great differences between institutes and between countries in the levels of availability and in the options to link auxiliary variables. An option that always exists is the use of paradata (e.g. Kreuter, 2013), i.e. measurements about the data collection process and observations on the sample units. Clearly, there is a limit to what auxiliary information can be observed during data collection. As mentioned earlier, for different selections of X, the (partial) R-indicators attain different values, and they do not allow for statements about NMAR nonresponse outside the selected vector of variables. The selection is, therefore, a crucial and influential part of the analysis. The purpose of the indicator determines the selection of the auxiliary variables that are used. There are three purposes: comparing multiple surveys, comparing different waves or designs of the same survey, and monitoring and analyzing data collection of a single survey. When multiple surveys are compared, it is essential that
representativity is evaluated in terms of generally available and relevant characteristics. For the comparisons within a single survey, it is important to select characteristics that are as close as possible to the survey topics and key variables. For monitoring and analysis the same applies but additionally paradata may be added to break down the data collection process into its steps, i.e. screening cases for eligibility, making contact, obtaining participation, completion of a full interview, and no attrition after several time points.
Example: Detecting Bias in the GPS Survey

The example of the GPS is again used for illustration. The GPS is a general-purpose survey with a wide range of variables of interest. For this reason it is decided to consider representativity with respect to a general set of auxiliary variables: age (in five-year classes), ethnicity (native, first generation non-western non-native, first generation western non-native, second generation non-western non-native, second generation western non-native), household type (single, single parent, couple without children, couple with children, other), house value (in classes of 50 thousand euro), job status (no employment, employment), and urbanization (not, little, moderate, strong, very strong). These six auxiliary variables are used by Statistics Netherlands to compare R-indicator values across surveys. Data collection of the GPS consisted of two months, with a mode switch after the first month from face-to-face to telephone. Table 35.1 gives the response rate, R-indicator, coefficient of variation, and variable-level partial R-indicators after the first month and after the full two months. Remarkably, the R-indicator decreases in the second month while the response rate increases. Both changes are significant at the 5% level. As a result, the coefficient of variation decreases only modestly and non-significantly. The partial R-indicators increase for all variables, most notably for ethnicity
Table 35.1 Response rate, R-indicator, coefficient of variation, and partial R-indicators for the six selected auxiliary variables. Standard errors in brackets

Indicator                      Month 1           Month 1 and 2
Response rate                  46.4%             58.7%
R-indicator                    0.835 (0.005)     0.806 (0.005)
CV                             0.181 (0.006)     0.167 (0.005)
Age              Pu            0.024 (0.003)     0.030 (0.003)
                 Pc            0.018 (0.003)     0.020 (0.003)
Ethnicity        Pu            0.039 (0.003)     0.054 (0.003)
                 Pc            0.021 (0.003)     0.031 (0.003)
Household type   Pu            0.043 (0.003)     0.051 (0.003)
                 Pc            0.025 (0.003)     0.027 (0.003)
House value      Pu            0.043 (0.003)     0.056 (0.003)
                 Pc            0.011 (0.003)     0.017 (0.003)
Job status       Pu            0.002 (0.004)     0.018 (0.003)
                 Pc            0.002 (0.004)     0.011 (0.003)
Urbanization     Pu            0.068 (0.003)     0.074 (0.003)
                 Pc            0.047 (0.003)     0.046 (0.003)
and job status. Job status was the only variable with non-significant partial R-indicator values after the first month, but after two months its values are significantly different from zero. Hence, from the indicator values it is concluded that the second month did not reduce nonresponse error and that mostly more of the same type of responses came in. Table 35.2 gives the category-level partial R-indicator values for urbanization, the variable with the largest partial R-indicators. The values show a clear pattern: the stronger the urbanization of the area of residence, the stronger the underrepresentation. In the subsection 'Adaptive Survey Designs', the design is adapted in order to obtain a lower coefficient of variation for the six auxiliary variables.
CORRECTING NONRESPONSE ERROR

In this section, the correction of nonresponse error is discussed at the data collection stage and at the estimation stage.
Table 35.2 Category-level partial R-indicators for urbanization after one month and after two months. Standard errors in brackets

Category       Indicator    Month 1            Month 1 and 2
Very strong    Pu           −0.055 (0.002)     −0.062 (0.002)
               Pc            0.035 (0.002)      0.037 (0.002)
Strong         Pu           −0.014 (0.002)     −0.008 (0.001)
               Pc            0.014 (0.002)      0.011 (0.001)
Moderate       Pu            0.014 (0.002)      0.013 (0.002)
               Pc            0.012 (0.002)      0.011 (0.002)
Little         Pu            0.019 (0.002)      0.023 (0.002)
               Pc            0.012 (0.002)      0.013 (0.002)
Not            Pu            0.033 (0.002)      0.032 (0.002)
               Pc            0.024 (0.002)      0.021 (0.002)
Adjustment by Design: Adaptive Survey Design

Adaptive Survey Designs

Adaptive survey designs assign different data collection strategies to different population strata based on quality and cost considerations. They may do so based only on available frame and administrative data at the start of data collection, so-called static designs, but may also incorporate paradata stored during data collection, so-called dynamic designs. For extensive discussion on adaptive survey designs and the closely related responsive survey designs, see Groves and Heeringa (2006), Peytchev et al. (2010), Schouten et al. (2013), Wagner (2013), and Wagner et al. (2013). Figure 35.3 presents the general decision problem underlying adaptive survey designs. The data collection is divided into T phases and the population has M disjoint
strata. Membership of a sample unit in a stratum may be known beforehand but may also depend on paradata observations about that unit coming in during data collection. A set of candidate strategies, S, is selected, e.g. a set of survey modes or a set of time slots for calls. For each stratum m, a decision needs to be made about which strategy from S is applied at phase t. We call this strategy St, and s = (s1, …, sT) is the vector of selected strategies over all phases. The optimization problem consists of choosing allocation probabilities, p(s = (s1, …, sT) | m), for all strata m. These probabilities may be updated during data collection based on paradata that are recorded in previous phases. The optimization problem depicted in Figure 35.3 is often simplified by assumptions about the dependence of response in a phase conditional on strategies assigned in previous phases. The most extreme simplification is where phases are assumed to have independent response; an assumption that usually does not hold.
Figure 35.3 General optimization setting for adaptive survey designs.
The implementation of adaptive survey designs is marked by the following steps:

1 Choose (proxy) quality measures.
2 Choose a set of candidate design features.
3 Define cost constraints and other practical constraints.
4 Link available frame data, administrative data, and paradata.
5 Form strata with the auxiliary variables for which design features can be varied.
6 Estimate input parameters (e.g. contact and participation propensities, costs).
7 Optimize the allocation of design features to the strata.
8 Conduct, monitor, and analyze data collection.
9 In case of incidental deviation from anticipated quality or costs, return to step 7.
10 In case of structural deviation from anticipated quality or costs, return to step 6.
11 Adjust for nonresponse in the estimation.
Hence, the important ingredients of the designs are quality measures, candidate design features (i.e. strategies), constraints on costs and other constraints, and estimators for design performance and costs. Since estimators for design performance and costs may suffer from imprecision or bias themselves, it is imperative that designs are monitored and that some transitional learning period is implemented to anticipate deviations from expected quality and costs. A detailed discussion of these ingredients goes beyond the scope of this chapter; Chapter 26 in this Handbook treats them in depth. Schouten and Shlomo (2016) discuss the various options to optimize adaptive survey designs. All options share one objective: they attempt to balance response propensities over relevant population strata. The natural question about adaptive survey designs is whether tailoring the survey design to population subgroups based on a selection of auxiliary variables is still effective when nonresponse adjustment on these same variables is applied afterwards. Motivation for adaptive survey design comes from two observations. First, it is believed that stronger bias on given variables is a signal of a less perfect data collection process that may affect variables of interest even more strongly, even after adjustment. Second, under the assumption that auxiliary variables are random draws from the universe of all possible variables, it can be shown that more bias on these variables implies a larger expected bias on other arbitrary variables, even after adjustment. Schouten et al. (2014) discuss this important question in detail and perform an empirical study to find evidence that bias is smaller even after nonresponse adjustment.
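As a deliberately simplified illustration of step 7 above (optimizing the allocation of design features to strata), the sketch below enumerates all assignments of two candidate strategies to a handful of strata and keeps the assignment that stays within budget and has the smallest coefficient of variation of the expected response propensities. All stratum sizes, propensities, costs and the budget are invented.

```python
import itertools
import numpy as np

# Hypothetical inputs per stratum: size, expected response propensity under each
# candidate strategy, and expected cost per sampled unit under each strategy.
strata_size = np.array([400, 300, 200, 100])
propensity = {"single": np.array([0.35, 0.45, 0.55, 0.65]),
              "follow_up": np.array([0.50, 0.58, 0.66, 0.72])}
cost = {"single": np.array([20.0, 20.0, 20.0, 20.0]),
        "follow_up": np.array([35.0, 35.0, 35.0, 35.0])}
budget = 28_000

def cv_of_design(assignment):
    """CV of the response propensities implied by assigning strategy
    assignment[m] to stratum m, weighted by stratum sizes."""
    rho = np.array([propensity[s][m] for m, s in enumerate(assignment)])
    mean = np.average(rho, weights=strata_size)
    sd = np.sqrt(np.average((rho - mean) ** 2, weights=strata_size))
    return sd / mean

best = None
for assignment in itertools.product(["single", "follow_up"], repeat=len(strata_size)):
    total_cost = sum(cost[s][m] * strata_size[m] for m, s in enumerate(assignment))
    if total_cost <= budget:
        score = cv_of_design(assignment)
        if best is None or score < best[0]:
            best = (score, assignment, total_cost)

print("lowest CV within budget:", best)
```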
An Example: Adaptive Survey Designs for the GPS Survey

The GPS had two months of data collection, where the second month was a telephone follow-up. The subsection 'Example: Detecting Bias in the GPS Survey' concluded that the second month was not successful in improving the representativity of the response and only stabilized the coefficient of variation for the six generally important auxiliary variables: age, ethnicity, household type, house value, job status, and urbanization. Here, adaptation of the design is investigated by restricting the second-month follow-up to particular sample cases. Hence, there are two strategies, s ∈ {F2F, F2F → Phone}, i.e. with or without a telephone follow-up after the face-to-face month.

First, the adaptive survey design strata need to be formed. The optimization strategy is to remove subpopulations from the follow-up when their representation has grown after the second month. Based on the category-level partial R-indicators, this holds for four subpopulations: natives, couples with or without children, persons living in moderate to not urbanized areas, and persons with employment. These four subpopulations are crossed to obtain 16 strata: {1 = not to moderate urbanization, 0 = other} × {1 = native, 0 = non-native} × {1 = couples, 0 = other households} × {1 = employment, 0 = no employment}.

Table 35.3 gives the category-level unconditional partial R-indicator values for the 16 strata. There are four strata with (significantly) positive values. The strongest overrepresentation is for native persons, with or without employment, living in couples in non-urbanized areas (strata 15 and 16). These four strata are deselected and do not receive the follow-up.

Table 35.3 Category-level unconditional partial R-indicators for the 16 strata. Standard errors in brackets

Stratum         Pu                  Stratum         Pu
(0, 0, 0, 0)    −0.031 (0.003)      (1, 0, 0, 0)    −0.003 (0.002)
(0, 0, 0, 1)    −0.027 (0.003)      (1, 0, 0, 1)    −0.004 (0.003)
(0, 0, 1, 0)    −0.033 (0.003)      (1, 0, 1, 0)    −0.002 (0.002)
(0, 0, 1, 1)    −0.016 (0.003)      (1, 0, 1, 1)     0.009 (0.002)
(0, 1, 0, 0)    −0.026 (0.003)      (1, 1, 0, 0)    −0.004 (0.001)
(0, 1, 0, 1)    −0.026 (0.003)      (1, 1, 0, 1)    −0.003 (0.002)
(0, 1, 1, 0)    −0.009 (0.002)      (1, 1, 1, 0)     0.024 (0.002)
(0, 1, 1, 1)     0.002 (0.001)      (1, 1, 1, 1)     0.040 (0.002)

Table 35.4 contains the values of the various indicators for the response under the adaptive design. Approximately 40% of the nonrespondents after month 1 receive a telephone follow-up. The response rate increases by almost 5 percentage points and the R-indicator increases significantly.

Table 35.4 Values of the indicators for the adaptive survey design with restricted follow-up in month 2. Standard errors in brackets

Indicator        Month 1           Month 1 and 2     Adaptive design
Response rate    46.4%             58.7%             50.8%
R-indicator      0.835 (0.005)     0.806 (0.005)     0.872 (0.006)
CV               0.181 (0.006)     0.167 (0.005)     0.130 (0.006)
Adjustment by Estimation: Weighting

In this section, the focus is on relatively simple methods for nonresponse weighting. More advanced model-based methods exist and are, for instance, discussed in Bethlehem et al. (2011). See also Imbens (2004) for a useful overview of methods. Furthermore, nonresponse adjustment may also be performed by multiple imputation instead of weighting, which is especially advantageous when auxiliary information itself is subject to item-nonresponse and when there are many survey variables of interest. See Little and Rubin (2002) for a discussion. In this section, little attention is devoted to standard error approximations. Here, the reader is referred to Chapter 30 in this Handbook.
Weighting Adjustment

Weighting adjustment is a family of techniques that attempt to improve the accuracy of survey estimates by assigning weights to respondents based on auxiliary information. The techniques differ in the level of auxiliary information they need; for some, aggregated
population level information is sufficient, while others assume population level information. By comparing the response distribution of an auxiliary variable with its population (or complete sample) distribution, it can be determined whether the sample is representative of the population (with respect to this variable). If these distributions differ considerably, one must conclude that the response lacks representativity. To correct this, adjustment weights are computed. Weights are assigned to the records of the respondents. Estimates of population characteristics are then computed by using the weighted values instead of the unweighted values. Weighting adjustment is often used to correct surveys that are affected by nonresponse. If the survey response lacks representativity with respect to an auxiliary variable X, the Horvitz–Thompson estimator for the population mean of X will be biased:
E(\bar{x}_{HT,R}) \approx \frac{1}{N} \sum_{k=1}^{N} \frac{\rho_k}{\bar{\rho}} X_k \neq \bar{X} .    (17)
The idea behind weighting is to restore representativity with respect to X by assigning weights wk to respondents such that the weighted estimator for the population mean of X is exactly equal to this population mean:
\bar{x}_{HT,R,W} = \frac{1}{N} \sum_{k=1}^{N} \frac{w_k\, a_k R_k}{\pi_k\, \bar{r}_{HT}} X_k = \bar{X} .    (18)
If the weights are constructed such that the survey response becomes representative with respect to several auxiliary variables, and these auxiliary variables are correlated with the target variables of the survey, the survey
response will also become representative with respect to the target variables. As a result, the bias will disappear. This section describes a number of weighting adjustment techniques. To keep the exposition simple, we assume that the lack of representativity is only caused by nonresponse, so there are no under-coverage effects. We also assume that a probability sample has been selected without replacement, and with known first order selection probabilities. A much more detailed description of these weighting techniques can be found in e.g. Bethlehem et al. (2011).
Post-stratification

Post-stratification is a well-known and often-used weighting technique, see e.g. Cochran (1977) or Bethlehem (2002). To carry out post-stratification, qualitative auxiliary variables are needed. By crossing these variables, population and sample are divided into a number of non-overlapping strata (subpopulations). All elements in one stratum are assigned the same weight, and this weight is equal to the population proportion in that stratum divided by the sample proportion in that stratum. Suppose crossing the stratification variables produces L strata. The number of population elements in stratum h is denoted by Nh, for h = 1, 2, …, L. Hence, the population size is equal to N = N1 + N2 + … + NL. In the case of full response, the weight wk for an element k in stratum h is defined by
w_k = \frac{N_h / N}{n_h / n} ,    (19)
where nh is the sample size in stratum h, and n is the total sample size. If the values of the weights are taken into account, the result is the post-stratification estimator
\bar{y}_{PS} = \frac{1}{N} \sum_{h=1}^{L} N_h\, \bar{y}_{HT,h} ,    (20)
where \bar{y}_{HT,h} is the Horvitz–Thompson estimator of the mean in stratum h. So, the post-stratification estimator is equal to a weighted sum of Horvitz–Thompson estimators in the strata. In the case of nonresponse, only the survey response is available for estimation purposes. The weight wk for an element k in stratum h is now defined by

w_k = \frac{N_h / N}{n_{R,h} / n_R} ,    (21)
where nR,h is the number of respondents in stratum h, and nR is the total response. Using these weights leads to the post-stratification estimator under nonresponse:
\bar{y}_{PS} = \frac{1}{N} \sum_{h=1}^{L} N_h\, \bar{y}_{HT,R,h} ,    (22)
where \bar{y}_{HT,R,h} is the Horvitz–Thompson estimator of the response mean in stratum h. The bias of this estimator is approximately equal to

B(\bar{y}_{PS}) = \frac{1}{N} \sum_{h=1}^{L} N_h\, B(\bar{y}_{HT,R,h}) \approx \frac{1}{N} \sum_{h=1}^{L} N_h\, \frac{\mathrm{cor}_h(y, \rho)\, S_h(y)\, S_h(\rho)}{\bar{\rho}_h} ,    (23)

in which cor_h(y, ρ) is the correlation between the target variable and the response behavior in stratum h, S_h(y) is the standard deviation of the target variable in stratum h, S_h(ρ) is the standard deviation of the response probabilities in stratum h, and ρ̄_h is the average response probability in stratum h. The bias of this estimator is small if there is a strong relationship between the target variable and the stratification variables. The variation in the values of the target variable should manifest itself between strata but not within strata. In other words, strata should be homogeneous with respect to the target variables. In nonresponse correction terminology, this situation comes down to Missing At Random (MAR).
The bias of the estimator will also be small when the variation of the response probabilities is small within strata. This implies that there must be strong relationships between the auxiliary variables and the response probability. In conclusion, application of post-stratification will successfully reduce the bias of the estimator if proper auxiliary variables can be found. Such variables should satisfy the following conditions:

• they have to be measured in the survey;
• their population distribution must be known;
• they must be strongly correlated with all target variables;
• they must be strongly correlated with the response behavior.
Unfortunately, such variables are not very often available, or there is only a weak correlation. Post-stratification can only be used if (1) the strata obtained by crossing all stratification variables all have a sufficient number of observations, and (2) the population fractions are available for all strata. This is not always the case. A way out is presented by more advanced weighting adjustment techniques, like generalized regression estimation and raking ratio estimation. These techniques are described in detail in, for example, Bethlehem et al. (2011) and Särndal and Lundström (2005). Here we give an overview.
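A minimal sketch of post-stratification weighting as in (21), assuming an equal-probability sample and entirely hypothetical data, is given below.

```python
import numpy as np
import pandas as pd

def poststratification_weights(respondents, population_counts, stratum_var):
    """Post-stratification weights (21): w_k = (N_h / N) / (n_Rh / n_R),
    assuming an equal-probability sample (no design weights)."""
    N = sum(population_counts.values())
    n_R = len(respondents)
    n_Rh = respondents[stratum_var].value_counts()
    return respondents[stratum_var].map(
        lambda h: (population_counts[h] / N) / (n_Rh[h] / n_R))

# Hypothetical respondent file and known population counts per stratum.
resp = pd.DataFrame({"urbanization": ["strong", "strong", "little", "not", "not", "not"],
                     "owns_house":   [0, 1, 1, 1, 0, 1]})
pop_counts = {"strong": 5000, "little": 3000, "not": 2000}

resp["w"] = poststratification_weights(resp, pop_counts, "urbanization")
print(round(np.average(resp["owns_house"], weights=resp["w"]), 3))   # weighted estimate
```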
Generalized Regression Estimation

Generalized regression estimation is sometimes also called linear weighting. It assumes there is a set of auxiliary variables X1, X2, …, Xp that can be used to predict the values of a target variable Y. In the case of full response, the generalized regression estimator is defined by
\bar{y}_{GR} = \bar{y}_{HT} + (\bar{X} - \bar{x}_{HT})' b ,    (24)
in which the subscript GR denotes the generalized regression estimator, yHT is the Horvitz–Thompson estimator for the mean of the target variable, X is the vector of
population means of the auxiliary variables, and x HT is the vector of Horvitz–Thompson estimators for the population means of the auxiliary variables. Furthermore, b is the (estimated) vector of regression coefficients. The better the regression model fits the data, the smaller the variance of the estimator will be. In the case of nonresponse, only the response data can be used. The generalized regression estimator is now equal to,
\bar{y}_{GR,R} = \bar{y}_{HT,R} + (\bar{X} - \bar{x}_{HT,R})' b_R ,    (25)
where bR is the vector of regression coefficients obtained by fitting the model just using the survey response data. The subscript GR,R denotes the generalized regression estimator, but just computed for the respondents. The generalized regression estimator can be biased under nonresponse, but the bias vanishes if b = bR, i.e. the regression model fitted under complete response is equal to the model fitted under nonresponse. The conclusion can be that the better the regression model fits the data, the more the bias will be reduced. By rewriting expression (24) it can be shown that generalized regression estimation is a form of weighting adjustment. The weight wi for observed element i is equal to wi = v′Xi, and v is a vector of weight coefficients that is equal to
v = n \left( \sum_{i=1}^{n} x_i x_i' \right)^{-1} \bar{X} ,    (26)
see e.g. Bethlehem et al. (2011). The value of a weight for a specific respondent is determined by the values of the corresponding auxiliary variables. Post-stratification is a special case of generalized regression estimation. If the stratification is represented by a set of dummy variables, where each dummy variable denotes a specific stratum, expression (24) reduces to expression (20). Generalized regression estimation can be applied in more situations than
post-stratification. For example, post-stratification by age class and sex requires the population distribution of the crossing of age class by sex to be known. If just the marginal population distributions of age class and sex separately are known, post-stratification cannot be applied. Only one variable can be used. However, generalized regression estimation makes it possible to specify a regression model that contains both marginal distributions. In this way more information is used, and this will generally lead to better estimates. Generalized regression estimation has the disadvantage that some correction weights may turn out to be negative. Such weights are not wrong, but simply a consequence of the underlying theory. Usually, negative weights indicate that the regression model does not fit the data too well. Some analysis packages are able to work with weights, but they do not accept negative weights. This may be a reason not to apply generalized regression estimation. There are also methods that set lower and/or upper bounds to weights, see e.g. Särndal and Lundström (2005). It should be noted that generalized regression estimation will only be effective in substantially reducing the bias if Missing At Random (MAR) applies to the set of auxiliary variables used.
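The following sketch implements the respondent-based generalized regression estimator (25) for an equal-probability sample, with invented data. It is not a production implementation: there are no design weights, no calibration to marginal totals, and no variance estimation.

```python
import numpy as np

def greg_estimate(y_resp, x_resp, x_pop_means):
    """Generalized regression estimator under nonresponse (25), assuming an
    equal-probability sample: y_GR,R = y_R + (X_bar - x_bar_R)' b_R."""
    X = np.column_stack([np.ones(len(y_resp)), x_resp])   # add an intercept
    b = np.linalg.lstsq(X, y_resp, rcond=None)[0]          # b_R from respondents only
    x_bar_r = x_resp.mean(axis=0)
    return y_resp.mean() + (np.asarray(x_pop_means) - x_bar_r) @ b[1:]

# Hypothetical data: respondents' target values, their auxiliary values,
# and known population means of the auxiliary variables.
rng = np.random.default_rng(1)
x_resp = rng.normal(loc=[1.2, 0.4], scale=1.0, size=(500, 2))   # respondents over-represent high x
y_resp = 3 + 2 * x_resp[:, 0] - x_resp[:, 1] + rng.normal(size=500)
x_pop_means = [1.0, 0.5]

print(round(greg_estimate(y_resp, x_resp, x_pop_means), 3))
```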
Raking Ratio Estimation

Correction weights produced by generalized regression estimation are the sum of a number of weight coefficients, where each term in the regression model contributes a weight coefficient. It is also possible to compute correction weights in a different way, namely as the product of a number of weight factors. This weighting technique is usually called raking ratio estimation, iterative proportional fitting, or multiplicative weighting. Raking ratio estimation can be seen as a process in which several post-stratifications are applied simultaneously, where each post-stratification uses a different set of auxiliary variables. A graphical display of this process is shown in Figure 35.4.
Figure 35.4 Raking ratio estimation (flowchart: start; apply post-stratification 1, post-stratification 2, …, post-stratification p in turn; if the weights changed, repeat the cycle, otherwise end).
The process starts by conducting post-stratification 1. This results in weights being assigned to all respondents. Now the response is representative with respect to the stratification variables in this post-stratification. Next, post-stratification 2 is conducted. The weights are adjusted such that the response becomes representative with respect to the variables in this second post-stratification. Because of this adjustment the response is no longer representative with respect to the variables in post-stratification 1. This process continues until all post-stratifications have been conducted. Then the process restarts and all post-stratifications are repeated, the cycle continuing until the weights stop changing. The final weight of a respondent can be seen as a product of factors where each post-stratification contributes a factor. Raking ratio estimation has the advantage that the weights are always positive. It has
the disadvantage that there is no clear model underlying the approach. Moreover, there is no simple and straightforward way to compute standard errors of weighted estimates. Generalized regression estimation is based on a regression model, which allows for computing standard errors.
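A compact sketch of raking ratio estimation (iterative proportional fitting) for two sets of population margins is shown below; the data and variable names are hypothetical, and every category is assumed to be observed among the respondents.

```python
import numpy as np
import pandas as pd

def rake(df, margins, max_iter=100, tol=1e-8):
    """Raking ratio estimation: repeatedly post-stratify on each variable in turn
    until the weighted category shares match the population margins.
    `margins` maps a variable name to a dict {category: population share}."""
    w = np.ones(len(df))
    for _ in range(max_iter):
        max_change = 0.0
        for var, target in margins.items():
            shares = pd.Series(w).groupby(df[var].to_numpy()).sum() / w.sum()
            factors = {cat: target[cat] / shares[cat] for cat in target}
            w = w * df[var].map(factors).to_numpy()
            max_change = max(max_change, max(abs(f - 1.0) for f in factors.values()))
        if max_change < tol:
            break
    return w

# Hypothetical respondents with two categorical variables and known population margins.
resp = pd.DataFrame({"sex": ["m", "m", "f", "f", "f", "m"],
                     "age": ["young", "old", "young", "old", "old", "old"]})
margins = {"sex": {"m": 0.49, "f": 0.51},
           "age": {"young": 0.40, "old": 0.60}}

resp["w"] = rake(resp, margins)
print(resp.groupby("sex")["w"].sum() / resp["w"].sum())   # matches the sex margin
print(resp.groupby("age")["w"].sum() / resp["w"].sum())   # matches the age margin
```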
Weighting Adjustment with a Reference Survey

Correction techniques are effective provided auxiliary variables have a strong correlation with the target variables of the survey and with the response behavior. If such variables are not available, one might consider conducting a reference survey. This reference survey is based on a probability sample, where data collection takes place with a mode leading to high response rates and little bias, e.g. CAPI (Computer Assisted Personal Interviewing, with laptops) or CATI (Computer Assisted Telephone Interviewing). Such a survey can be used to produce accurate estimates of population distributions of auxiliary variables. These estimated distributions can be used as benchmarks in weighting adjustment techniques. The reference survey approach has been applied by several market research organizations, see e.g. Börsch-Supan et al. (2004) and Duffy et al. (2005). They used the reference survey approach to reduce the bias caused by self-selection of respondents. An interesting aspect of the reference survey approach is that any variable can be used for adjustment weighting as long as it is measured both in the reference survey and in the main survey. For example, some market research organizations use 'webographics' or 'psychographic' variables that divide the population into 'mentality groups'. See Schonlau et al. (2004) for more details about the use of such variables. It should be noted that the use of estimated population distributions will increase the variance of the estimators. The increase in variance depends on the sample size of the reference survey: the smaller the sample size, the larger the variance. So, using a reference
survey may come down to reducing the bias at the cost of increasing the variance.
Propensity Weighting

Propensity weighting was originally used by market research organizations to correct for a possible bias in their web surveys. Examples can be found in Börsch-Supan et al. (2004) and Duffy et al. (2005). The original idea behind propensity weighting goes back to Rosenbaum and Rubin (1983, 1984). They developed a technique for comparing two groups, attempting to make the two groups comparable by simultaneously controlling for all variables that were thought to explain the differences. In the case of a survey, there are also two groups: those who respond and those who do not. In the case of nonresponse, the response propensities (also called propensity scores) are obtained by modeling a variable that indicates whether or not someone participates in the survey. Usually a logistic regression model is used where the indicator variable is the dependent variable and attitudinal variables are the explanatory variables. These attitudinal variables are assumed to explain why someone participates or not. Fitting the logistic regression model comes down to estimating the conditional probability of participating, given the values of the explanatory variables. Once response propensities have been estimated, they can be used to reduce a possible response bias. There are two general approaches: response propensity weighting and response propensity stratification. Response propensity weighting is based on the principle of Horvitz and Thompson (1952) that an unbiased estimator can always be constructed if the selection probabilities are known. In the case of nonresponse, selection depends on both the sample selection mechanism and the response mechanism. Repeating expression (4), the unbiased estimator for the population mean would be
\bar{y}_{HT}^{*} = \frac{1}{N} \sum_{k=1}^{N} \frac{a_k R_k}{\pi_k\, \rho_k} Y_k .    (27)
This estimator cannot be used, because the response probabilities ρk are unknown. This problem can be solved by first replacing the response probabilities ρk by the response propensities ρX(Xk). Then, the response propensities are estimated using a logistic regression model. Hence, the estimated response propensities are substituted in expression (27), leading to the estimator
\bar{y}_{RPW}^{*} = \frac{1}{N} \sum_{k=1}^{N} \frac{a_k R_k}{\pi_k\, \hat{\rho}_X(X_k)} Y_k .    (28)
Table 35.5 Estimating the percentage having a PC

Weight variable           Estimate    Standard error
None                      57.4        0.36
Degree of urbanization    57.2        0.36
Ethnic origin             56.9        0.36
Has a job                 56.9        0.34
Age                       56.8        0.32
House value               56.4        0.35
Household type            56.1        0.34
The estimated response propensities can, for example, also be used to improve the estimators that are part of the generalized regression estimator. A somewhat different approach is response propensity stratification. It takes advantage of the fact that estimates will not be biased if all response probabilities are equal. In this case, selection problems will only lead to fewer observations, but the composition of the sample is not affected. The idea is to divide the sample into strata in such a way that all elements within a stratum have (approximately) the same response propensities. Consequently, (approximately) unbiased estimates can be computed within strata. Next, stratum estimates are combined into a population estimate. According to Cochran (1968), dividing the response propensities into five strata should be sufficient to reduce the nonresponse bias.
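The two approaches can be sketched as follows, assuming an equal-probability sample, already estimated propensities, and invented data. The stratification variant follows Cochran's five-strata suggestion; in practice the target variable is, of course, observed for respondents only.

```python
import numpy as np
import pandas as pd

def propensity_weighting(y, propensity):
    """Response propensity weighting (28) for an equal-probability sample:
    weight each respondent by the inverse of its estimated propensity."""
    return np.average(y, weights=1.0 / propensity)

def propensity_stratification(sample, y_col, prop_col, resp_col, n_strata=5):
    """Response propensity stratification: divide the sample into quintiles of the
    estimated propensity, estimate the mean from the respondents within each
    stratum, and combine using the stratum sample shares."""
    strata = pd.qcut(sample[prop_col], q=n_strata, labels=False)
    est = 0.0
    for _, grp in sample.groupby(strata):
        resp = grp[grp[resp_col] == 1]           # only respondents have observed y
        est += (len(grp) / len(sample)) * resp[y_col].mean()
    return est

# Hypothetical sample with propensities, a response indicator, and a target variable
# that is correlated with the propensity (so the unadjusted mean is biased upward).
rng = np.random.default_rng(2)
n = 2000
prop = rng.uniform(0.2, 0.8, n)
responded = (rng.random(n) < prop).astype(int)
y = 40 + 30 * prop + rng.normal(scale=5, size=n)

sample = pd.DataFrame({"propensity": prop, "responded": responded, "y": y})
resp = sample[sample["responded"] == 1]

print(round(resp["y"].mean(), 2))                                      # unadjusted
print(round(propensity_weighting(resp["y"], resp["propensity"]), 2))   # weighted
print(round(propensity_stratification(sample, "y", "propensity", "responded"), 2))
```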
An Example

We illustrate the various weighting techniques using the survey data set for the GPS survey. There are several auxiliary variables in this data set. For this illustration, the six most effective ones are selected. These are the variables degree of urbanization, ethnic origin, type of household, has a job (yes/no), age (in 13 categories), and average value of the houses in the neighborhood.

First, we explore the effects of simple post-stratification. For each auxiliary variable separately, we apply post-stratification and observe the effects on estimates for two target variables: the percentage of people having a PC at home, and the percentage of people owning a house.

Table 35.5 contains the results for the percentage having a PC. The variable degree of urbanization has almost no effect on the estimate. The variable household type has the strongest effect. This variable is able to reduce the bias. The estimate goes down from 57.4 to 56.1. Apparently there is a correlation between having a PC and the type of household of a person.

Table 35.6 contains the results for the percentage owning a house. The effect of the various auxiliary variables is different. The variable has a job has no effect at all. Not surprisingly, the variable house value has the strongest effect. The estimate reduces from 62.5 to 60.4. Note that the standard error of the estimate is also smaller. This is an indication that post-stratification by house value leads to more homogeneous strata.

Table 35.6 Estimating the percentage owning a house

Weight variable           Estimate   Standard error
None                      62.5       0.35
Has a job                 62.3       0.35
Age                       62.1       0.34
Ethnic origin             61.6       0.35
Household type            61.1       0.33
Degree of urbanization    60.9       0.34
House value               60.4       0.31

Why not apply post-stratification by crossing all auxiliary variables? This may turn out to be impossible because of empty strata. We can, however, choose a weighting technique that uses all variables without crossing them. Table 35.7 shows the results for the four advanced weighting techniques discussed in this chapter (standard errors are not displayed, but are all around 0.30).

Table 35.7 Weighting techniques using all six auxiliary variables

Weighting technique                    Estimate: has a PC   Estimate: owns a house
Generalized regression estimation      55.6                 58.8
Raking ratio estimation                55.6                 58.9
Response propensity weighting          55.5                 58.9
Response propensity stratification     55.8                 59.3

The estimates for the percentage having a PC are all lower in Table 35.7 than in Table 35.5, and the standard errors are also smaller. We can conclude that having more auxiliary variables in a weighting model gives a stronger bias correction. Note also that the type of weighting technique is less important than the set of auxiliary variables that is used. The same conclusion can be drawn for estimating the percentage of people owning a house.

Weighting adjustment is a powerful technique for reducing nonresponse bias, but only if two conditions are fulfilled: the auxiliary variables must be able to explain the behavior of both the target variables and the response. Only if all relevant variables are in the weighting model will the bias be removed completely.
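To illustrate what 'using all variables without crossing them' involves in practice, here is a bare-bones raking (iterative proportional fitting) loop for two auxiliary variables. The margins and data are invented, the number of iterations is fixed for simplicity, and a real application would use dedicated software with proper convergence checks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
resp = pd.DataFrame({
    "age_group": rng.choice(["18-34", "35-54", "55+"], 500, p=[0.2, 0.4, 0.4]),
    "urban": rng.choice(["yes", "no"], 500, p=[0.7, 0.3]),
})

# Known (or reference-survey) population margins for each variable separately.
margins = {
    "age_group": pd.Series({"18-34": 0.30, "35-54": 0.35, "55+": 0.35}),
    "urban": pd.Series({"yes": 0.60, "no": 0.40}),
}

# Raking: start with equal weights and repeatedly rescale them so that the
# weighted distribution of each variable matches its population margin.
w = np.ones(len(resp))
for _ in range(50):                      # fixed number of iterations for simplicity
    for var, target in margins.items():
        weighted_dist = pd.Series(w).groupby(resp[var].values).sum()
        weighted_dist /= weighted_dist.sum()
        adjust = (target / weighted_dist).reindex(resp[var]).to_numpy()
        w *= adjust
    w *= len(resp) / w.sum()             # keep the weights summing to the sample size

print(pd.Series(w).describe())
```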
DISCUSSION

This chapter discusses methodology to detect and correct nonresponse error during or after data collection. The landscape for surveys has, however, changed over the last decade and will continue to do so for the coming decade. Today’s research on survey nonresponse focuses on paradata, data collection monitoring, adaptive and responsive survey designs, mixed-mode surveys, indirect measures of nonresponse bias, and non-probability-based sampling. The reasons are manifold, e.g. the emergence of the web and smartphones as candidate survey modes, an increased potential to record process information or paradata, and continuously decreasing response rates. These recent developments are steering survey researchers in the direction of response processes and into a confrontation between probability-based and non-probability-based sampling designs. The methodology in this chapter is also applicable to non-probability-based samples, i.e. treating the population as the sample. Furthermore, the methodology can incorporate paradata in a straightforward way, either as auxiliary variables in forming adaptive survey design strata and weighting strata or in detailing the response process into sub-steps. However, with mixed-mode designs in mind, there is a clear need to have a combined look at nonresponse and measurement error, and to extend detection and correction to a total survey error perspective.
RECOMMENDED READINGS

There is an extensive literature on nonresponse reduction and nonresponse adjustment. The monographs by Groves and Couper (1998), Särndal and Lundström (2005), and Bethlehem et al. (2011) are recommended for further reading. Groves and Couper (1998) give in-depth conceptual models and discussion of causes for nonresponse and potential leads to reduce nonresponse. Särndal and Lundström (2005) give a detailed account of nonresponse adjustment methods. Finally, Bethlehem et al. (2011) provide a topical overview of the whole area of nonresponse research and best practices. We refer to the website www.risq-project.eu for a wide range of background papers on indicators and for code in R and SAS to compute the indicators.
REFERENCES

Bethlehem, J.G. (1988), Reduction of Nonresponse Bias through Regression Estimation, Journal of Official Statistics, 4, 251–260.
Bethlehem, J.G. (2002), Weighting Nonresponse Adjustments Based on Auxiliary Information. In: Groves, R.M., Dillman, D.A., Eltinge, J.L., and Little, R.J.A. (eds), Survey Nonresponse (pp. 275–288). New York: Wiley.
Bethlehem, J.G. (2012), Using Response Probabilities for Assessing Representativity. Discussion Paper 201212. The Hague, The Netherlands: Statistics Netherlands, available at www.cbs.nl.
Bethlehem, J.G., Cobben, F., and Schouten, B. (2011), Handbook of Nonresponse in Household Surveys. Hoboken, NJ: Wiley.
Bianchi, A., Shlomo, N., Schouten, B., Da Silva, D., and Skinner, C.J. (2016), Estimation of Response Propensities and Indicators of Representative Response using Population-Level Information. Survey Methodology, 35 (1), 101–113.
Börsch-Supan, A., Elsner, D., Faßbender, H., Kiefer, R., McFadden, D., and Winter, J. (2004), Correcting the Participation Bias in an Online Survey. Report, University of Munich, Munich, Germany.
Cochran, W.G. (1968), The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies, Biometrics, 24, 205–213.
Cochran, W.G. (1977), Sampling Techniques, 3rd edition. New York: Wiley.
De Heij, V., Schouten, B., and Shlomo, N. (2015), RISQ 2.1 Manual. Tools in SAS and R for the Computation of R-indicators and Partial R-indicators, available at www.risq-project.eu.
Duffy, B., Smith, K., Terhanian, G., and Bremer, J. (2005), Comparing Data from Online and Face-to-face Surveys, International Journal of Market Research, 47, 615–639.
Groves, R.M. and Couper, M.P. (1998), Nonresponse in Household Interview Surveys. New York: Wiley.
Groves, R.M., Dillman, D.A., Eltinge, J.L., and Little, R.J.A. (2002), Survey Nonresponse. New York: Wiley.
Groves, R.M. and Heeringa, S.G. (2006), Responsive Design for Household Surveys: Tools for Actively Controlling Survey Errors and Costs, Journal of the Royal Statistical Society: Series A, 169, 439–457.
Horvitz, D.G. and Thompson, D.J. (1952), A Generalization of Sampling Without Replacement from a Finite Universe, Journal of the American Statistical Association, 47, 663–685.
Imbens, G.W. (2004), Estimation of Average Treatment Effects Under Exogeneity: A Review, The Review of Economics and Statistics, 86 (1), 4–29.
Kreuter, F. (2013), Improving Surveys with Paradata: Analytic Uses of Process Information. Hoboken, NJ: Wiley.
Little, R.J.A. (1986), Survey Nonresponse Adjustment for the Estimates of Means. International Statistical Review, 54, 139–157.
Little, R.J.A. and Rubin, D.B. (2002), Statistical Analysis with Missing Data, Wiley Series in Probability and Statistics, 2nd edition. Hoboken, NJ: Wiley.
Peytchev, A., Riley, S., Rosen, J., Murphy, J., and Lindblad, M. (2010), Reduction of Nonresponse Bias in Surveys through Case Prioritization, Survey Research Methods, 4, 21–29.
Rosenbaum, P.R. and Rubin, D.B. (1983), The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika, 70 (1), 41–55.
Rosenbaum, P.R. and Rubin, D.B. (1984), Reducing Bias in Observational Studies Using Subclassification on the Propensity Score, Journal of the American Statistical Association, 79, 516–524.
Särndal, C.E. (2011), The 2010 Morris Hansen Lecture: Dealing with Survey Nonresponse in Data Collection, in Estimation, Journal of Official Statistics, 27 (1), 1–21.
Särndal, C.E. and Lundström, S. (2005), Estimation in Surveys with Nonresponse. Chichester: John Wiley & Sons.
Schonlau, M., Zapert, K., Payne Simon, L., Haynes Sanstad, K., Marcus, S., Adams, J., Kan, H., Turber, R., and Berry, S. (2004), A Comparison Between Responses from Propensity-weighted Web Survey and an Identical RDD Survey, Social Science Computer Review, 22, 128–138.
Schouten, B., Calinescu, M., and Luiten, A. (2013), Optimizing Quality of Response through Adaptive Survey Designs, Survey Methodology, 39 (1), 29–58.
Schouten, J.B., Cobben, F., and Bethlehem, J. (2009), Indicators for the Representativeness of Survey Response, Survey Methodology, 35 (1), 101–113.
Schouten, B., Cobben, F., Lundquist, P., and Wagner, J. (2014), Theoretical and Empirical Evidence for Balancing of Survey Response by Design. Discussion paper 201415. Den Haag: CBS, available at www.cbs.nl.
Schouten, B. and Shlomo, N. (2016), Selecting Adaptive Survey Design Strata with Partial R-indicators, forthcoming in International Statistical Review.
Schouten, J.B., Shlomo, N., and Skinner, C. (2011), Indicators for Monitoring and Improving Representativeness of Response, Journal of Official Statistics, 27 (2), 231–253.
Seaman, S., Galati, J., Jackson, D., and Carlin, J. (2013), What is Meant by ‘Missing at Random’?, Statistical Science, 28 (2), 257–268.
Shlomo, N., Skinner, C., and Schouten, J.B. (2012), Estimation of an Indicator of the Representativeness of Survey Response, Journal of Statistical Planning and Inference, 142, 201–211.
Wagner, J. (2013), Adaptive Contact Strategies in Telephone and Face-to-Face Surveys, Survey Research Methods, 7 (1), 45–55.
Wagner, J., West, B.T., Kirgis, N., Lepkowski, J.M., Axinn, W.G., and Kruger Ndiaye, S. (2013), Use of Paradata in a Responsive Design Framework to Manage a Field Data Collection, Journal of Official Statistics, 28 (4), 477–499.
36
Response Styles in Surveys: Understanding their Causes and Mitigating their Impact on Data Quality
Caroline Roberts
INTRODUCTION

It is well established that people’s answers to survey questions are influenced by a variety of factors, which interact to produce errors in the data (see e.g. Groves, 1989). Several decades of research in psychology, statistics, econometrics and survey methodology testify to a range of effects observable in respondents’ reported values causing them to deviate from the true values of interest to the researcher. This chapter discusses one such effect, which concerns the measurement of respondents’ evaluations of objects and ideas on rating scales. Specifically, it concerns the tendency for respondents to disproportionately select one response alternative on the rating scale more than others, independent of the content of the question (Paulhus, 1991). This tendency, usually referred to as ‘response style’, can be problematic because it can affect the conclusions of analyses combining multiple items with a shared response format to form indices and scales. Given the
ubiquity of rating scale measures in surveys, and the popularity of multi-item measures of theoretical constructs in social research, it is of interest to review what is currently known about the mechanism underlying this type of error, and what researchers should do to control its impact on the data. This chapter attempts such a review. The chapter is organised in three parts. In the next section I clarify the defining features of response styles and discuss the problems they can create for data analysts. Then, I discuss how changing definitions of response styles over time have contributed to confusion about whether researchers should be concerned about them, review research findings relating to the causes of response styles, and present a theoretical model for organising these research findings and consolidating what is understood about their antecedents. In the final section, I describe different approaches to the question of what to do about response styles, presenting different methods used for measuring and controlling
their influence on the conclusions of substantive research, and ways of reducing the likelihood of stylistic responding through optimal questionnaire design. The chapter concludes with a brief discussion of future developments in the field, and recommendations for data producers and users.
WHAT ARE RESPONSE STYLES AND WHY DO THEY MATTER?

Errors found in the answers respondents give to surveys, which cause sample estimates to deviate from the true population value for a construct of interest, are collectively referred to as measurement errors and, sometimes, as response errors (Groves, 1989). They include effects associated with single-item measures, which are generally driven by the content of the question and its answer categories, and errors affecting multiple questions sharing a common response format (including multi-item measures). Examples of the former include errors caused by inaccurate recall of factual information (recall errors), and over- and under-reporting in answers to questions on sensitive topics (social desirability bias). A number of other measurement problems affecting individual measures have been referred to as ‘question effects’ (Schuman and Presser, 1981) or ‘response effects’ (Groves et al., 2009) and include errors associated with how the question has been worded (question wording effects), the order in which the response options are presented (response order effects), and the order in which the actual questions are presented (question order or context effects) (see Tourangeau et al., 2000, for a comprehensive review). These types of errors similarly appear to be influenced by the content of the question, meaning the respondent must attend at least partially to what the question is asking and the available response alternatives for problems to arise. Other types of question and response effect may arise on individual items as a function
of the form of the question, or the types of response alternatives offered. For example, a number of studies have shown that offering a ‘Don’t Know’ or ‘No Opinion’ option causes more people to report that they do not know their answer to a question than when such an option is omitted (e.g. Krosnick et al., 2002). Similarly, questions presenting a long list of response alternatives asking respondents to ‘check all that apply’ tend to be more susceptible to response order effects than other question forms (primacy or recency, depending on the mode of presentation) (e.g. Krosnick and Alwin, 1987). Meanwhile, questions with dichotomous response options can provoke problems such as guessing (in the case of true/false options in knowledge tests – see e.g. Ebel and Frisbie, 1991) or acquiescence (in the case of yes/no formats – see e.g. Knowles and Condon, 1999). Errors associated with the form of the question become especially problematic when they affect answers to several questions throughout the questionnaire or parts of it, such as where single- and multi-item measures sharing a common response format are presented together in a ‘battery’. One question form that is ubiquitous in social surveys requires respondents to indicate the strength and/or valence of their evaluation of an object (often presented in the form of an evaluative statement) on a ‘rating scale’, and it is common practice to use the same rating scale for batches of questions on related topics, including purposely designed scales. Decades of experimental research manipulating different features of the design of the rating scale (e.g. number of scale points, extent and nature of labelling) have generated best-practice recommendations concerning the optimal form to use for different types of measurement constructs (see e.g. Fabrigar et al., 2005). Nevertheless, measurement problems associated with this type of question form persist. In particular, the form suffers from the fact that some respondents appear to be unequally attracted to particular scale points, selecting them more frequently than others apparently
independently of the content of the question (Baumgartner and Steenkamp, 2001). A variety of names have been used in the literature to refer to this type of error (discussed below), the most common of which is response style. The response styles that have received the most attention in the literature, according to a review by Van Vaerenbergh and Thomas (2013), are: (1) Acquiescence Response Style (ARS); (2) Disacquiescence Response Style (DARS); (3) Net Acquiescence Response Style (NARS); (4) Mid-point Response Style (MRS); (5) Extreme Response Style (ERS); (6) Response Range (RR); (7) Mild Response Style (MLRS); and (8) Noncontingent Responding (NCR). Table 36.1, which is based on similar summaries
by Van Vaerenbergh and Thomas (2013), Baumgartner and Steenkamp (2001), and Podsakoff et al. (2003), provides a brief definition for each of these eight response styles, and where appropriate, an illustration of the preferred response option(s) in a seven-point rating scale. It is important to note that this list of response styles is not, and can never be exhaustive, as the particular manifestation of response style depends on the design of the rating scale being used. Rather, the list represents the response styles that have been most frequently documented and theorised (to a greater and lesser degree of success). They are presented here as examples of the types of response errors that form the focus of this chapter.
Table 36.1 Description of eight common response styles (the dots illustrate the respondent’s use of a seven-point rating scale, where applicable)

Acquiescence Response Style (ARS): Tendency to agree with items regardless of content/use the positive end of the scale. Sometimes referred to as positivity or leniency bias, yea-saying, or agreement tendency. Only the highest response categories are used. Seven-point scale use: ●●● (the three highest points).

Disacquiescence Response Style (DARS): Tendency to disagree with items regardless of content/use the negative end of the scale. Sometimes called negativity bias, nay-saying, or disagreement tendency. Only the lowest response categories are used. Seven-point scale use: ●●● (the three lowest points).

Net Acquiescence Response Style (NARS): Tendency to show greater acquiescence than dis-acquiescence.

Mid-point Response Style (MRS): Tendency to use the middle response category of a rating scale, regardless of content. Seven-point scale use: ● (the mid-point).

Extreme Response Style (ERS): Tendency to select the most extreme response options regardless of content. Sometimes referred to as extremeness, extremity bias, or firm response style. The highest or lowest response categories of a rating scale are selected. Seven-point scale use: ●● (the two end-points).

Response Range (RR): Tendency to use a wide or narrow range of response categories around the individual’s mean response. Conceptually and empirically similar to, highly correlated with, but not identical to ERS (people who tend to give extreme positive responses also tend to give extreme negative responses). Sometimes called response polarisation or standard deviation.

Mild Response Style (MLRS): Tendency to avoid the highest and lowest response categories of a rating scale. This is the complement of ERS. Sometimes referred to as timidity. Seven-point scale use: ●●●●● (the five non-extreme points).

Noncontingent Responding (NCR): Tendency to respond to items carelessly, randomly or non-purposefully. Also referred to as random responding or mental coin-flipping. May be likened to ‘satisficing’ (Krosnick, 1991), though satisficing may produce response effects associated with other response styles (e.g. ARS, MRS).
Impact of Response Styles on Data Quality

Response errors affecting rating scales can be problematic because of their potential to affect the quality of measurement across multiple items in the questionnaire. If several questions share the same rating scale format, then response styles may simultaneously bias responses across multiple measures, and the observed relationships between those measures (Baumgartner and Steenkamp, 2001). The effects may be more concerning still, however, where multiple items sharing the same rating scale format are intended to measure a single underlying construct (as in a scale or index). In this case, response errors may be confounded with the measurement of the construct, with the potential to lead researchers to erroneous conclusions about that construct and its relationship with other constructs of interest (Van Vaerenbergh and Thomas, 2013). To consider the effects that response styles can have on the data, and, more importantly, to be able to evaluate whether or not they should be a cause for concern, it is helpful to consider the different forms that survey error may take. A key distinction can be made between error that is relatively consistent or systematic, resulting in biased estimates, and error that is inconsistent or random, which affects the variance of an estimate (Viswanathan, 2005; Groves, 1989). Whereas random error typically weakens the relations between variables, systematic error may strengthen the relations between variables, leading to false conclusions. Random error may result from unintended mistakes made by the respondent, while response styles or any stable behaviour on the part of respondents confronted with specific forms of a question (de Castellarnau and Saris, 2013) tend to result in systematic errors. Response styles affect the data in two important ways (Baumgartner and Steenkamp, 2001). Firstly, they can bias univariate distributions, which can be either inflated or deflated,
depending on the nature of the response style. Similarly, the means and variances of individual items can be affected, and this can affect comparative statistical tests, such as t-tests and F-tests (Cheung and Rensvold, 2000). This means that any between-group comparisons of responses to items using rating scales may be confounded by differences in response style between the groups, and this could weaken the validity of the conclusions drawn from such analyses. Secondly, response styles can affect the relationships between individual variables, and notably, the magnitude of correlations between variables. These correlations form the basis of many multivariate statistical techniques including Cronbach’s alpha, regression analysis, factor analysis and structural equation modelling, so the impact of the error may be far-reaching. Response styles can also affect the distribution of scores on multi-item measures, and conclusions about the relationship between multi-item measures by either inflating or deflating the correlation between respondents’ scale scores (Paulhus, 1991; Podsakoff et al., 2003). The confounding of content and method in the measurement of latent constructs can also bias estimates of construct reliability and validity, and reduce measurement invariance across groups (Weijters et al., 2008), for example, if groups surveyed with different modes of data collection, or in different languages adopt different types of response style. It is reasonable to assume that not all rating scale items are equally affected by response styles (Podsakoff et al., 2012). Nevertheless, there is sufficient agreement in the literature that response styles pose a threat to the validity of any conclusions of research based on the analysis of multi-item measures using rating scales to warrant some remedial action (e.g. Baumgartner and Steenkamp, 2001). Indeed, Alwin (2007) has argued that, ‘from a statistical point of view there is hardly any justification for ignoring survey measurement errors’ (p. 3). Unfortunately though,
empirical information has been lacking about the extent to which response styles are a problem in existing survey data, and about the optimal methods of dealing with them. As a consequence, relatively few analysts investigate response styles and control for their impact in their research. The general lack of clarity about whether survey analysts should worry about measurement error in general, and in particular, about response styles can, at least partly, be attributed to two problematic characteristics of the literature on the topic. The first is inconsistency in the labelling of different types of error in respondents’ answers. The second is the fact that attention across disciplines has been divided differentially between, on the one hand, understanding the causes of measurement errors with a view to reducing them in future data collections, and, on the other, attempting to measure their impact on the conclusions drawn from existing data (see Groves, 1989). In the remainder of this chapter, I attempt to address these problems. In the next section, I describe how confusion surrounding the nature and severity of response styles has come about. Then, I review what is known about their causes and about how to mitigate their negative impact, through statistical modelling and optimal questionnaire design.
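To make the mechanics of this confounding tangible, the short simulation below (an illustration only, in Python, with arbitrary parameter values) shows how a shared acquiescent tendency can create a sizeable positive correlation between the summed scores of two six-item scales even though the underlying traits are unrelated.

```python
import numpy as np

rng = np.random.default_rng(3)
n_resp, n_items = 1000, 6

# Two unrelated latent traits, one per six-item scale.
trait_a = rng.normal(size=n_resp)
trait_b = rng.normal(size=n_resp)

# A person-specific acquiescence tendency that shifts answers on BOTH scales
# towards the 'agree' end, independent of item content.
acquiescence = rng.normal(scale=0.8, size=n_resp)

def rate(trait, style):
    """Generate 1-7 agree-disagree ratings for one scale."""
    latent = (trait[:, None] + style[:, None]
              + rng.normal(scale=0.5, size=(n_resp, n_items)))
    return np.clip(np.round(4 + 1.5 * latent), 1, 7)

scores_a = rate(trait_a, acquiescence).sum(axis=1)
scores_b = rate(trait_b, acquiescence).sum(axis=1)
scores_a0 = rate(trait_a, np.zeros(n_resp)).sum(axis=1)
scores_b0 = rate(trait_b, np.zeros(n_resp)).sum(axis=1)

# Correlation between the two scale scores with and without the shared style.
print("with ARS:   ", round(np.corrcoef(scores_a, scores_b)[0, 1], 2))
print("without ARS:", round(np.corrcoef(scores_a0, scores_b0)[0, 1], 2))
```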
EXPLAINING RESPONSE STYLES

Changing Definitions of Response Style Over Time

Response styles have attracted interest from researchers across a variety of different backgrounds, resulting in a large and often unwieldy literature. This literature has been plagued by inconsistency in the terminology used to describe different types of error. The problem of response style was first identified in the psychometric literature, as a phenomenon affecting test scores in educational and
personality measurement. One of the earliest contributors to address the issue was Cronbach (1946), who used the term response set to describe ‘any tendency causing a person consistently to give different responses to test items than he would when the same content is given in a different form’ (p. 476). According to Cronbach, a defining characteristic of response set was the stability of individual differences in their application across measurement situations. Thus, although the particular manifestation of specific response sets was dependent on the format of a given question or set of questions making up a test, response sets seemed to have a stable component reflecting a consistent individual style or personality trait, causing respondents to apply them in the same way whenever confronted with a particular test format. This observation led Jackson and Messick (1958: 244) to propose an alternative labelling of Cronbach’s concept, suggesting response style replace response set, to emphasise the apparent dependence of these response tendencies on personality traits. Cronbach’s early work on this topic was the stimulus for a vast number of studies of response errors in psychometric testing, investigating the stability of individual styles of responding (especially, ARS), their relationship to personality and their potential detriment to and value for researchers. This work used the terms response set, response style, and response bias interchangeably to refer to the same, or related phenomena, with a lack of consistency in definition that caused confusion for contributors attempting to make sense of the literature (Rundqvist, 1966). An effort to distinguish response set from response style was made by Rorer (1965), who provided the following definitions: ‘the term “set” connotes a conscious or unconscious desire on the part of a respondent to answer in such a way as to produce a certain picture of himself’ (p. 133) – in other words, the respondent is motivated to respond in a particular way (e.g. a socially desirable way). By contrast, Rorer defined response ‘style’ as
a ‘tendency to select some response category a disproportionate amount of the time independently of the item content’ (Rorer, 1965: 134). In other words, unlike response sets, response styles were seen to be ‘contentless’, (or at least, to result from ambiguity surrounding question content), while response sets were content-driven. Rorer used the term response bias synonymously with response style, but Rundqvist (1966: 166) defined it as ‘any tendency that reduces the influence of content in controlling the response’ to provide a more general term to refer to both response set and response style. A major problem with the distinction between these alternative forms of error (if such a distinction can be discerned) is the fact that a particular response alternative on a rating scale may be ‘preferred’ and selected more often as a result of either or both a response style or a response set, making it difficult for interested researchers to identify the underlying cause of the effect. For Rorer (1965), the distinction was important, because, according to his analysis, once the effect of response style was separated from response set, the former was found to have a relatively trivial impact on the variability of scale scores (indeed, contrary to Cronbach, he claimed there was little evidence of consistency in the application of response styles across measurement occasions, and hence they posed little cause for concern). Thus, identifying the role of item content in producing biased response was key to determining the relative impact of response style compared to response set. By contrast, Rundqvist (1966) identified content as just one of many sources of variance in (personality) scale scores, including, in addition to ‘sets to create a definite impression’ (p. 166), the social desirability connotations of a particular statement; the form in which a statement is presented (whether positively or negatively worded) and the proportion of each form in the test; and the design of the response scale. In his view, response style should be considered
a product of these joint influences on response, and ‘disentangling any of these influences from the interaction of all’ is ‘exceedingly difficult’ (p. 167). More recent contributors to the field have continued to make the distinction between errors in responses to rating scales that arise independently of the question content, and errors that appear to arise, at least partially, as a result of the question content (see e.g. Paulhus, 1991). Furthermore, some contributors continue to emphasise the consistency with which individuals apply particular response tendencies across measurement contexts in their definitions. For example, Peer and Gamliel (2011: 2) describe response set as a more situational and temporary response pattern, while they see response style as a ‘more long-term traitlike quality that is assumed to remain similar across different questionnaires’. However, inconsistency in definitions of the two types of response error persists. Fabrigar and Krosnick (1995: 46), for example, define response set as ‘the tendency for an individual to respond to questions in a particular fashion as a result of the structural features of the questions or the data-gathering situation, independent of the content of the questions’. In contrast, they describe response styles as ‘response tendencies independent of content that are a function of dispositions of individual respondents, rather than a function of situational factors’ (p. 46). Similarly, Furr and Bacharach (2013) in a reference text on psychometrics provide the following distinction: ‘factors associated with aspects of the testing situation or the test itself (e.g. question wording, question format) may be referred to as response sets, while factors linked to more stable characteristics of respondents that produce biases (e.g. a tendency to respond in a socially desirable way) may be referred to as response styles’ (ibid.: 299).
These authors note that ‘psychologists are not consistent in their use of these terms’ (p. 299), and indeed this is reflected in the definitions they provide.
Inconsistency in the labelling of different types of response error not only reflects theoretical debate surrounding the nature of their underlying causes, but is also a function of the diverse disciplinary approaches that have investigated them. Researchers focused on the reduction of error (‘reducers’; Groves, 1989: 5) have concentrated on identifying and understanding the causes of error, with a view to developing best-practice principles of questionnaire design to minimise the likelihood of errors occurring in the first place (Groves, 1989: 484). Meanwhile, researchers focused on the measurement and correction of error (‘measurers’; ibid.) have concentrated on the development of statistical models of measurement error designed to estimate its magnitude. Their objective has been to manage the potential impact of error on the conclusions of research, lumping different types of measurement error – be they response sets, response styles, or any other type likely to affect multiple questionnaire items – into a more general category of error associated with using the same method for measuring different underlying constructs. Thus, in addition to research debating the subtle distinctions between response sets and response styles, the literature also includes contributions concerned with the problem of common method variance and common method bias (see Podsakoff et al., 2003, 2012 for reviews). The advantage of this approach is that it acknowledges that errors associated with the method of measurement may be largely unavoidable, despite the best efforts of reducers, as well as the inherent difficulty of separating out their various confounded causes (Blasius and Thiessen, 2012). The distinction between these divergent approaches to the problem of response styles helps to structure the remainder of the chapter. The following section reviews research into the causes of response styles, which, to clarify, are defined here as the tendency to use one point on the rating scale disproportionately more than others, independent of item content.
Research Findings Relating to the Causes of Response Styles

The preceding discussion highlights how definitions of response style, and efforts to distinguish response style from other types of error have been closely tied to theoretical explanations of their causes. The response style literature condenses these into just two categories: stimulus level and respondent level influences (Van Vaerenbergh and Thomas, 2013). While the stimulus level takes into account characteristics of the survey instrument and situational factors that might have an effect on questionnaire completion, respondent level influences refer to personal characteristics of respondents that account for variation in the tendency to adopt particular response styles across measurement situations. Stimulus-level sources of response styles include the scale format; the mode of data collection (including the interviewer); the cognitive demands imposed by the survey questions; the survey language; and the respondent’s degree of personal involvement in the questionnaire topic (ibid.). By contrast, respondent-level sources of response styles include demographic characteristics (notably education, age, income, employment and race), personality and culture, which together have received the greatest amount of research attention (ibid.). The overwhelming conclusion based on research relating to these two sources of influence is that the findings are mixed, leading to different (and often ambiguous) conclusions relating to different types of response style. This is clearly the case for research addressing stimulus-level variables. For example, regarding response scale format, the literature finds that ARS and ERS are less common in longer response scales (Kieruj and Moors, 2013; Weijters et al., 2010a), while MRS is more likely when longer scales are offered (Kieruj and Moors, 2010). Fully labelled scales tend to increase ARS (Weijters et al., 2010a), but help to reduce ERS (Kieruj and Moors, 2010). ARS and ERS are more
common in telephone surveys than in face-to-face and self-administered surveys (Holbrook et al., 2003; Ye et al., 2011), while MRS is correspondingly less common on the telephone (Weijters et al., 2008). Olson and Bilgen (2011) find higher levels of ARS in interviews carried out by more experienced interviewers, while Hox et al. (1991) do not. The findings relating to certain respondent-level influences on response styles are also inconsistent. For example, ARS seems to be more common among respondents with the lowest levels of education (Meisenberg and Williams, 2008). Some studies find more MRS and ERS among better-educated respondents, and others find the opposite. Similarly, ARS, ERS and MRS appear to be more common among older respondents (e.g. Billiet and McClendon, 2000; Greenleaf, 1992), though this has not been found to be true in all studies (e.g. Eid and Rauber, 2000; Austin et al., 2006), nor does it apply to other response styles, such as DARS. Contradictory results similarly pertain to respondent sex (though the weight of evidence suggests greater use of response styles among women), income and employment, and race (Van Vaerenbergh and Thomas, 2013). The conclusions that can be drawn based on research into personality and response style are no more definitive. Recall that the notion of response style, as proposed by Jackson and Messick (1958), and Rorer (1965), hinges on the idea that certain respondents appear to consistently respond in certain ways across measurement occasions, suggesting that stable characteristics may be responsible (e.g. Weijters et al., 2010b). In support of this conclusion, a number of studies find a relationship between the use of response styles and certain personality traits, though once again they have produced somewhat mixed findings. A major drawback of research investigating the relationship between personality and response style, however, is the fact that personality measures themselves may be affected by respondents’ use of response
styles (Bentler et al., 1971). Some researchers have attempted to incorporate this into their research, but it remains a considerable limitation in the measurement of personality correlates of response style, and restricts the certainty with which conclusions can be drawn about any causal relationship between the two. The increasing popularity of cross-national survey research in recent years, and the implicitly cross-cultural nature of many within-country studies in contemporary, multicultural societies, has increased the motivation to identify possible artefactual confounds affecting comparisons of interest. It is not surprising, therefore, that there has been a proliferation of comparative studies into response styles. Two important conclusions from this body of research are that certain response styles appear to be used to a greater or lesser degree by different cultural groups, and that country-level characteristics and culture seem to explain a greater proportion of the variance in response styles compared with socio-demographic variables (Van Vaerenbergh and Thomas, 2013) (although according to Baumgartner and Steenkamp (2001), variation across different scales may, in fact, be more important than cross-national differences). While various possible explanations for the differences observed in response styles across countries have been proposed (based, for example, on Hofstede’s (2001) dimensions of individualism, uncertainty avoidance, masculinity and power distance), once again, no clear picture emerges from the studies that have been carried out, and the conclusions appear to be sensitive to the research methods that have been used (Van Vaerenbergh and Thomas, 2013). In sum, the literature addressing response styles and their antecedents has produced a wealth of findings about the correlates of stylistic responding, but few firm conclusions about the most important predictors of different kinds of response style. Part of the challenge in digesting these mixed findings lies in the absence of a theoretical framework
for organising hypothesised relationships and observed findings. The survey methodological literature on response effects is better equipped in this regard, and in the next section, I briefly review cognitive theories about the survey response process and discuss their applicability to the specific problem of response style.
A Theoretical Framework for Understanding the Causes of Response Styles

Most survey methodologists trained in how to design questionnaires are now familiar with the response process model first developed by Cannell et al. (1981), and later elaborated by Tourangeau et al. (2000), which forms the theoretical basis for understanding a wide range of measurement errors observed in questionnaire data. The model postulates a series of cognitive processes involved in answering survey questions, including comprehension of what the question is asking, retrieval from memory of information needed to respond to the question, evaluation of or judgement based on retrieved information to decide how to answer, and finally, selection of an appropriate response category. Measurement errors in single-item survey measures may arise as a result of faulty processing at any one or more of these stages. For example, recall errors are likely to result from problems arising at the retrieval and judgement processes, while social desirability bias most likely occurs at the response selection stage, as a result of respondents deliberately misreporting their answers (see Tourangeau et al., 2000). An important extension of this theoretical approach – the theory of ‘satisficing’ (Krosnick, 1991) – has informed understanding of a variety of response effects observed in subjective questions with rating scales, including multi-item batteries, and, therefore, provides a helpful framework for understanding response styles.
Satisficing theory starts from the premise that an optimal response to a survey question involves careful, systematic processing at each of the stages of the response process, but that survey respondents sometimes fail to engage in sufficiently effortful processing, providing instead merely satisfactory answers based on only a cursory, or possibly biased, comprehension-retrieval-judgement-selection procedure. Krosnick (1991) identifies a number of response effects commonly observed in surveys with what he calls ‘strong’ and ‘weak’ forms of satisficing (depending on whether stages of processing are shortcut altogether, or merely executed more superficially), some of which correspond to the response styles listed in Table 36.1. Strong satisficing includes saying Don’t Know, failing to differentiate between items in a battery and rating all items at the same point on a rating scale (‘nondifferentiation’), endorsing the status quo in preference to social change, and so-called ‘mental coin-flipping’ (selecting responses at random) (Krosnick, 1991: 215). Weak satisficing includes selecting the first available and plausible response alternative (as typically occurs with response order effects) and acquiescence, which Krosnick defines as the tendency to agree with any assertion made by the interviewer. In terms of the response styles in Table 36.1, acquiescence corresponds to ARS (and its related effects DARS and NARS), and mental coin-flipping to NCR. Krosnick (1991) does not consider ERS, but it may share similarities with selecting the first available response alternative, depending on the mode of presentation; neither does he mention MRS, but a number of subsequent studies have investigated whether MRS is consistent with the predictions of satisficing theory. Finally, nondifferentiation, which is often measured as the sum or proportion of preferred identical responses given across a battery of items sharing the same response scale, is closely related to RR and can be considered as a kind of net response style. Indeed, Blasius and Thiessen (2012) define nondifferentiation as
such (they call it ‘limited response differentiation’ or LRD) and explain it in terms of a respondent’s strategy to simplify the task of responding to multiple items with a common rating scale format. For these authors, all response styles ‘share the feature that the respondent favours a subset of the available responses; the main difference between them is which particular subset is favoured. As a consequence, all forms of response tendencies result in less response differentiation between different items’ (p. 141). Krosnick (1991) does not talk about causes of satisficing, but rather talks about ‘conditions that foster satisficing’ (p. 220), that is, situations that are more or less likely to lead respondents to adopt these kinds of response strategies. Specifically, he argues that:

the likelihood that a given respondent will satisfice when answering a particular question is a function of three factors: the first is the inherent difficulty of the task that the respondent confronts; the second is the respondent’s ability to perform the required task; and the third is the respondent’s motivation to perform the task …. The greater the task difficulty, and the lower the respondent’s ability and motivation to optimize, the more likely satisficing is to occur. (p. 221)
This framework provides considerable scope for investigating a range of response errors (including those classified as response styles), and provides an alternative means of categorising the causes of those errors. Instead of simply distinguishing stimulus and respondent-level influences, causal influences can be organised in terms of how they contribute to increasing task difficulty, lowering respondent ability, and reducing respondent motivation to engage in effortful processing, enabling a more theory-driven analysis of direct effects on response quality. In Van Vaerenbergh and Thomas’s (2013) review, for example, topic involvement is considered as a stimulus-level factor with a positive relation with certain types of response style. In Krosnick’s model, however, it would be treated as a task characteristic directly influencing respondent motivation, and thus, as
having an indirect effect on the likelihood of response error. Likewise, the model implies that different factors present in the conditions that foster satisficing are likely to interact with one another to produce different types of response error (Krosnick, 1991: 225), though to date, relatively few studies have attempted to explicitly examine such multiplicative effects. For example, cognitive load, which Van Vaerenbergh and Thomas (2013) also treat as a task characteristic, and which is an aspect of task difficulty, appears to have a direct effect on certain response styles (e.g. Knowles and Condon, 1999), but is very likely to interact with respondent ability and motivation to influence response quality. The application of satisficing theory to research into response styles is not original. For example, Podsakoff et al.’s (2012) review of sources of method bias in social science research argues that ‘when respondents are satisficing rather than optimizing, they will be more likely to respond stylistically and their responses will be more susceptible to method bias’ (p. 560); and the authors identify a number of ability, motivation and task-related factors that may cause or facilitate biased responding. Their review highlights the utility of organising the literature on the causes of response styles in this way, as it facilitates the search for effective solutions to the problem. However, other attempts to connect these divergent literatures may unwittingly have added to existing confusion over terminology and causal mechanisms. For example, Engel and Köster (2015) and Schouten and Calinescu (2013) argue that satisficing itself can be viewed as a response style, which in turn is responsible for a series of different response effects. Given the lack of agreement in the past over the defining features of response style, it seems more helpful to consider the types of errors that have been referred to as response styles as merely examples of the various forms of response ‘error’ or ‘effect’ that may be more or less likely to occur under conditions that foster satisficing. In line with this, Blasius
and Thiessen’s (2012: 141) conclusion that all types of response tendency (including satisficing) can be seen as task simplification strategies seems to offer considerable scope for future thinking about the various single and joint influences on how respondents react when confronted with multiple items to be evaluated on rating scales. Given that understanding the causes of error is key to developing methods of error reduction, as we shall see in the next section, this theoretical framework is also likely to yield the most effective strategies for reducing the impact of response styles on data quality.
MITIGATING THE IMPACT OF RESPONSE STYLES

Detecting and Measuring Response Styles

A number of different approaches have been developed to detect and control for response styles on multi-item measures. Van Vaerenbergh and Thomas (2013; see their Table III) review nine commonly used methods (see also Baumgartner and Steenkamp, 2001). Some of these techniques are purely diagnostic tools, while others enable the simultaneous detection and control of response styles. Thus, the choice of method will depend on the researcher’s particular requirements, and also partly on the burden each places on the analyst: some of the techniques are relatively simple to apply; others are considerably more complex, requiring particular types of research design, and/or purposely developed statistical software. The main conclusion that emerges from reviewing these methods is that each one suffers from drawbacks, and as a result, there is some debate about which are the optimal measurement and correction strategies to adopt. Perhaps the principal limitation of these methods to note is that only a few make it possible to simultaneously identify and
address the potentially damaging effects of multiple response styles. Instead, particular methods have been developed and tested on specific response styles such as ARS, ERS or MRS. For example, specifying a method factor in Confirmatory Factor Analysis (CFA) can be used to control for ARS (Billiet and McClendon, 2000), while multi-trait-multimethod (MTMM) models can be used to account for ARS and DARS (e.g. Saris and Aalberts, 2003). Neither of these approaches can be used to assess ERS or MRS, however. Furthermore, both methods require specific types of research designs. In the case of CFA, the scale must contain a balanced mix of positive and negative items, while in the case of MTMM, an experimental design is needed in which multiple methods of measuring the same construct are compared, to enable the decomposition of observed variance into true and error variance (Andrews, 1984). Both these requirements entail risks of other potential methodological confounds, such as memory effects across the repeated measures in MTMM, or errors due to the cognitive demands of responding to the reversed items needed for a balanced scale. Item-response theory can be used for measuring and controlling for ERS, but places considerable burden on data users because of additional complex procedures that need to be implemented alongside it. Similarly, latent-class regression analysis (e.g. Moors, 2010), which can be used to assess whether a method factor emerges, can be hard to specify and requires specific software, while latent-class factor analysis (Kieruj and Moors, 2013), which can be used to specify method factors for ARS and ERS, cannot account for DARS or MRS (Van Vaerenbergh and Thomas, 2013). The two most comprehensive procedures in use (in terms of the possibility they provide to handle multiple response styles in one go) involve including additional measures in the item battery – so-called ‘representative indicators for response styles’ (RIRS) (Weijters, 2006). The basic RIRS method and its related representative indicators response styles
means and covariance structure (RIRMACS) method (Weijters et al., 2008) make it possible to test for various types of response style simultaneously (ARS, DARS, NARS, ERS and MRS) and to control for them in later analyses. However, these approaches require the inclusion in the questionnaire of at least 5, and up to 14 items per response style, which is unlikely to be feasible in most multi-purpose social surveys, and paradoxically, is likely to significantly compound burden on respondents, thus increasing the likelihood of stylistic responding and other forms of satisficing. In the literature on satisficing (see e.g. Holbrook et al., 2003), the most commonly used method of assessing the presence of response effects involves a simple count procedure by which the sum of responses given at each point of the scale or the proportion of items for which a particular response was given is calculated across the scale or battery (or across the questionnaire as a whole). This method has similarly been used to diagnose the effects of response styles (e.g. Reynolds and Smith, 2010). The approach can be used to identify respondents’ preferences for particular scale points, and for the purposes of comparing the prevalence of such preferences across comparison groups. To be informative, however, the procedure requires that the items across which responses are counted be maximally heterogeneous (and preferably include some reverse worded items), and ideally, uncorrelated (Greenleaf, 1992). A further disadvantage of the count procedure is that there are few recommendations about how to deal with any observed differences in the measured response style across comparison groups. One relatively simple option, however, is to include indicators of stylistic responding as control variables in multivariate analyses. For example, Reynolds and Smith (2010) used three relatively straightforward indicators of response style use across an entire questionnaire: (a) the use of particular response categories; (b) the spread of respondents’ answers (as measured by the index of dispersion); and
(c) the tendency to select options on the right-hand side of the scale to capture the direction of response categories (Reynolds and Smith, 2010). They then controlled for the effects of response styles on cross-national comparisons by including these three indicators as covariates in an ANCOVA. Though the effectiveness of this way of controlling for response styles has not been compared empirically with the more sophisticated methods in use, its simplicity, and the fact that no modification to the research design is needed make it particularly appealing to less experienced researchers. More recently, Blasius and Thiessen (2012) have recommended scaling methods such as multiple correspondence analysis (MCA) and categorical principal component analysis (CatPCA) for the purposes of screening data for various response tendencies (ARS, ERS, MRS and LRD) prior to conducting comparative analyses. These techniques provide a means to visualise in two-dimensional space the structure of respondents’ ‘cognitive maps’ (p. 12) produced by answers across multiple items in a battery, and thus reveal the quality of the data by virtue of the relative location of each response category vis-à-vis the others. The extent of overlap between the cognitive maps of different groups under comparison provides clues as to whether the points on the rating scale hold the same meaning across all respondents, and thus their approach can be viewed as a preliminary step in the assessment of measurement invariance across groups. If the data quality is good, then the first dimension accounting for the greatest proportion of variance should reflect a coherent pattern of positive and negative responses to the items. If the quality is bad, however, then the first dimension will instead reflect a methodological artefact (p. 12) – in other words, a response style. The principal appeal of these authors’ approach is that it obviates the need to check for all different types of response style or response effect and does not require the addition of extra items in the questionnaire. Instead, it focuses on the actual
pattern of responses across a battery of items and allows conclusions about quality to be drawn on the basis of observation, rather than a kind of ‘fishing expedition’ for different clues about response quality. Unfortunately though, once again, the effectiveness of the method appears to depend on certain data requirements – namely, that responses to multiple statements be measured on the same scale (the more statements, the better), that the items be heterogeneous, and that some be reverse-coded. As well as providing an alternative method for screening data quality, Blasius and Thiessen (ibid.) propose a single measure of response differentiation, which they argue captures all tendencies to disproportionately favour certain response options or combinations of responses, which, as mentioned previously, they see as manifestations of an overarching strategy by survey respondents to reduce task difficulty. Their ‘Index of Response Differentiation’ (IRD) is compellingly simple to compute (see Blasius and Thiessen, 2012: 145, for further details), and can be used as an indicator of the extent to which a respondent has engaged in task simplification as a control variable in later analyses to help to ‘minimise confusion between genuine substantive relationships and method-induced relationships’ (p. 163). For these reasons, and in contrast to the alternative methods discussed, it can be concluded that the IRD seems to offer researchers one of the most straightforward and pragmatic solutions to the measurement and control of different response styles in substantive analyses of multi-item measures.
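To make the count procedure described above concrete, the following Python sketch computes, for a single respondent, the proportion of acquiescent, disacquiescent, extreme and midpoint answers across a battery, together with a simple measure of spread. It is a minimal illustration only: the function name, the 5-point coding and the particular dispersion measure are assumptions for the example, not the exact indicators used by Reynolds and Smith (2010) or the IRD of Blasius and Thiessen (2012).

import numpy as np

def response_style_counts(battery):
    """Simple count-based indicators of stylistic responding for one
    respondent's answers to a battery of 5-point agree-disagree items
    coded 1 (disagree strongly) to 5 (agree strongly).

    Returns proportions of acquiescent (4 or 5), disacquiescent (1 or 2),
    extreme (1 or 5) and midpoint (3) responses, plus a simple measure of
    spread (the index of qualitative variation), used here only as a rough
    stand-in for the dispersion indicator mentioned in the text.
    """
    x = np.asarray(battery, dtype=float)
    ars = np.mean((x == 4) | (x == 5))   # acquiescence
    dars = np.mean((x == 1) | (x == 2))  # disacquiescence
    ers = np.mean((x == 1) | (x == 5))   # extreme responding
    mrs = np.mean(x == 3)                # midpoint responding
    k = 5
    props = np.array([np.mean(x == c) for c in range(1, k + 1)])
    dispersion = (1 - np.sum(props ** 2)) * k / (k - 1)
    return {"ARS": ars, "DARS": dars, "ERS": ers, "MRS": mrs,
            "dispersion": dispersion}

# Example: a respondent who almost always agrees
print(response_style_counts([5, 4, 5, 5, 4, 5, 3, 5]))

Averaging such indicators over respondents within comparison groups gives the kind of prevalence comparison discussed above, and the respondent-level values can be carried into later analyses as control variables.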
Designing Questionnaires to Reduce the Risk of Error
While the previous section highlighted methods that can be used to minimise the impact response styles can have on the conclusions drawn from affected data, this section considers what can be done at the
questionnaire design stage to enhance the quality of measurement. A substantial body of research into optimal questionnaire design principles (see e.g. Fowler, 1995) provides guidance to researchers about how to improve the respondent's experience of survey participation by increasing motivation and reducing burden, and thus mitigating the likelihood of various types of response error associated with task simplification and/or satisficing (including response styles). The empirical research on which these guidelines are based demonstrates, for example, the potential advantages of including specific instructions to encourage careful responding (Huang et al., 2012), the deleterious effects of excessive questionnaire length on response quality (Herzog and Bachman, 1981), the risks of grouping large numbers of questionnaire items into batteries or grids (Dillman et al., 2009), and the cognitive complexities of responding to attitude measures in the form of agree-disagree statements (Saris et al., 2010). This latter point is particularly relevant to the problem of response styles. For respondents faced with a series of statements to which they should indicate their level of agreement, the response task involves not only processes of comprehension, retrieval and evaluation to decide what the attitude object is and what their general position towards that object is, but also a complex process of mapping that attitudinal position onto the available response categories. This process will vary in difficulty depending on the wording of the statements and the available response alternatives – i.e. the number of scale points and whether or how they are labelled. Even if empirically based recommendations to use either a seven-point, fully labelled response scale for bipolar constructs (e.g. Krosnick and Fabrigar, 1997) or, more recently, five answer categories for agree-disagree rating scales (Revilla et al., 2014) are followed, this mapping task may still impose an excessive cognitive burden on respondents, particularly when deciding what disagreement to certain types of
statement means, and particularly if statements are worded in the negative. Given that response tendencies such as ARS are a direct product of asking respondents the extent to which they agree with something, the recommendations emanating from this research are to avoid this question format altogether, except when explicitly asking respondents to evaluate their agreement with ideas, when such a format is appropriate (Fowler, 1995). Instead, it is recommended to use 'construct-' or 'item-specific' response options that match the actual dimension of interest to the researcher (Saris et al., 2010). So if the researcher wishes to know how much respondents agree with the statement 'It is important to design questionnaires well', they should instead ask 'How important do you think it is to design questionnaires well?', with response alternatives ranging from, e.g., 'not at all important' to 'very important'. Construct-specific response formats reduce task difficulty for respondents by matching the response task to the question being asked and facilitating the mapping processes involved in providing an answer. More importantly, they physically remove the response options that invite satisficing to begin with (in this case, the 'agree' option, and the ambiguous 'neither/nor' middle alternative). Furthermore, the logical extension of writing questions in this way is that questionnaires end up with a greater variety of question formats, which should help to better maintain the motivation of the respondent to provide 'good' answers. Recommendations to avoid long batteries of statements with agree-disagree rating scales have yet to be fully taken on board by researchers responsible for designing questionnaires, however. Paradoxically, progress in this respect is likely to be hampered by the fact that the recommendations run counter to those of psychometricians and researchers concerned with the effects of response styles, who instead emphasise the need to include more items in scales to reduce random measurement error, improve reliability and validity,
and to ensure that systematic error can effectively be modelled and controlled for in subsequent analyses. This line of argument is regrettably circular, and unlikely to improve survey measurement quality in the future. Nevertheless, efforts to raise awareness about the fact that different methods of measurement (i.e. question formats) produce measurements of varying quality may well help to break this impasse. Specifically, the application of results from meta-analyses of MTMM experiments (see Saris and Gallhofer, 2014) comparing measurements based on questions with varied formats provides estimates of their relative quality (specifically, the product of the question format's reliability and validity), on which researchers designing questionnaires can base decisions about which methods to use in their own research, and which data users can employ to correct for measurement errors during analysis (de Castellarnau and Saris, 2013; Saris and Gallhofer, 2014). The most ambitious of these efforts is the Survey Quality Predictor (SQP) program (see http://sqp.upf.edu/), based on the results of extensive MTMM experimentation conducted in the European Social Survey across 20 languages, which are now publicly available to researchers seeking to maximise the quality of their survey questions by selecting optimal response formats. It is important to acknowledge, however, that these estimates of quality have been derived from questionnaires that placed considerable burden on respondents (who had to respond to multiple measures of the same trait/construct) and rest on assumptions that may, unfortunately, be untenable (e.g. regarding the 'purity' of repeated measures of the same construct) (Groves, 1989). These features of MTMM research have limited its widespread use by other researchers, but need not limit the utility of SQP.
Future Developments
The preceding section highlighted the paradox that many of the methods traditionally
recommended to deal with the problem of response styles are precisely those that are likely to give rise to them in the first place. Future developments in this field will need to move beyond this impasse to ensure that future surveys produce data of better quality, and that data users are equipped with the skills they need to appropriately handle the data and interpret their findings. The methodological advances already reviewed relating to the measurement and reduction of response styles are likely to play an important role in achieving this aim. Also key is the need to simplify and harmonise the nomenclature surrounding different types of error, so that researchers from different backgrounds and disciplines engaged in using survey data can share a common understanding of the problems that can affect response quality, their probable causes, and the optimal methods to control their impact on the data.

A key challenge concerns the notion of burden. To improve the quality of data, questionnaire designers must seek to reduce the burden placed on respondents, thereby reducing the need for them to engage in their own task simplification strategies. Equally, data producers arguably have a responsibility to reduce the burden on analysts, to minimise the likelihood that the conclusions they reach are marred by compromised data quality. The two, of course, go hand in hand, but a more clearly delineated division of labour would help to ensure that data are handled appropriately. Standard practice in reporting on survey quality already entails the assessment of sampling errors and design effects, evaluation of nonresponse and coverage errors, and at least some comment on items most likely to be affected by errors of observation. Such reports should ideally be extended to include the results of attempts to quantify measurement errors, and guidance to analysts about how to handle the most seriously affected data. This may require modifications to the survey and questionnaire design to facilitate the measurement of certain types of error, so the onus is on survey designers to develop
suitable research designs that will make such estimation possible. Such decisions must, of course, be taken in the context of the tradeoff between errors and costs inherent in the survey design process (Groves, 1989), while keeping concerns about burden paramount.
RECOMMENDED READINGS
Blasius and Thiessen (2012), Podsakoff et al. (2012) and Van Vaerenbergh and Thomas (2013). Podsakoff et al. (2012) and Van Vaerenbergh and Thomas (2013) provide excellent recent reviews of the literature on common methods biases and response styles. Blasius and Thiessen (2012) present methods for assessing the quality of survey data (including detecting the effects of stylistic responding) and some solutions for how to handle compromised data in analyses.
REFERENCES
Alwin, D. F. (2007). Margins of Error: A Study of Reliability in Survey Measurement. Hoboken, NJ: John Wiley & Sons. Andrews, F. M. (1984). Construct validity and error components of survey measures: A structural modeling approach. Public Opinion Quarterly, 48: 409–442. Austin, E. J., Deary, I. J., and Egan, V. (2006). Individual differences in response scale use: Mixed Rasch modeling of responses to NEO-FFI items. Personality and Individual Differences, 40: 1235–1245. Baumgartner, H., and Steenkamp, J. B. E. M. (2001). Response styles in marketing research: A cross-national investigation. Journal of Marketing Research, 38: 143–156. Bentler, P. M., Jackson, D. N., and Messick, S. (1971). Identification of content and style: A two-dimensional interpretation of acquiescence. Psychological Bulletin, 76: 186–204. Billiet, J. B., and McClendon, M. J. (2000). Modeling acquiescence in measurement models for two balanced sets of items. Structural Equation Modeling, 7: 608–628.
Blasius, J., and Thiessen, V. (2012). Assessing the Quality of Survey Data. London: SAGE Publications. Cannell, C. F., Miller, P. V., and Oksenburg, L. (1981). Research on interviewing techniques. Sociological Methodology, 12: 389–437. Cheung, G. W., and Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31: 187–212. Cronbach, L. J. (1946). Response sets and test validity. Educational and Psychological Measurement, 6: 475–494. De Castellarnau, A., and Saris, W. (2013). A simple procedure to correct for measurement errors in survey research. ESS Edunet module, http://essedunet.nsd.uib.no/cms/ topics/measurement/ [accessed 26/1/15]. Dillman, D. A., Smyth, J. D., and Christian, L. M. (2009). Internet, Mail and Mixed-Mode Surveys: The Tailored Design Method (3rd edn). Hoboken, NJ: John Wiley & Sons, Inc. Ebel, R. L., and Frisbie, D. A. (1991). Essentials of Educational Measurement (5th edn). Englewood Cliffs, NJ: Prentice Hall. Eid, M., and Rauber, M. (2000). Detecting measurement invariance in organizational surveys. European Journal of Psychological Assessment, 16: 20–30. Engel, U., and Köster, B. (2015). Response effects and cognitive involvement in answering survey questions. In U. Engel, B. Jann, P. Lynn, A. Scherpenzeel, and P. Sturgis (eds), Improving Survey Methods: Lessons from Recent Research (pp. 35–50). New York: Routledge/Taylor & Francis. Fabrigar, L. R., and Krosnick, J. A. (1995). Attitude measurement and questionnaire design. In A. S. R. Manstead and M. Hewstone (eds), Blackwell Encyclopedia of Social Psychology (pp. 42–47). Oxford: Blackwell Publishers. Fabrigar, L. R., Krosnick, J. A., and MacDougall, B. L. (2005). Attitude measurement: Techniques for measuring the unobservable. In T. Brock and M. C. Green (eds), Persuasion: Psychological Insights and Perspectives (pp. 17–40). Thousand Oaks, CA: SAGE Publications. Fowler, F. J. (1995). Improving Survey Questions. Applied Social Research Methods Series Volume 38, Thousand Oaks, CA: SAGE.
Furr, R. M., and Bacharach, V. R. (2013). Psychometrics: An Introduction (2nd edn). Thousand Oaks, CA: SAGE. Greenleaf, E. A. (1992). Improving rating scale measures by detecting and correcting bias components in some response styles. Journal of Marketing Research, 29: 176–188. Groves, R. M. (1989). Survey Errors and Survey Costs. Hoboken, NJ: John Wiley and Sons. Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., and Tourangeau, R. (2009). Survey Methodology (2nd edn). Wiley Series in Survey Methodology, Hoboken, NJ: John Wiley & Sons. Herzog, A. R., and Bachman, J. G. (1981). Effects of questionnaire length on response quality. Public Opinion Quarterly, 45, 549–559. Hofstede, G. H. (2001). Culture’s Consequences: Comparing Values, Behaviors, Institutions and Organizations Across Nations (2nd edn). Thousand Oaks, CA: SAGE Publications, Inc. Holbrook, A. L., Green, M. C., and Krosnick, J. A. (2003). Telephone vs. face-to-face interviewing of national probability samples with long questionnaires: Comparisons of respondent satisficing and social desirability response bias. Public Opinion Quarterly, 67, 79–125. Hox, J. J., De Leeuw, E., and Kreft, I. G. (1991). The effect of interviewer and respondent characteristics on the quality of survey data: A multilevel model. In P. P. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, and S. Sudman (eds), Measurement Errors in Surveys (pp. 439–461). New York: John Wiley & Sons. Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., and DeShon, R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business Psychology, 27: 99–114. Jackson, D. N., and Messick, S. (1958). Content and style in personality assessment. Psychological Bulletin, 55: 243–252. Kieruj, N. D., and Moors, G. (2010). Variations in response style behavior by response scale format in attitude research. International Journal of Public Opinion Research, 22: 320–342. Kieruj, N. D., and Moors, G. (2013). Response style behavior: Question format dependent
or personal style? Quality and Quantity, 47: 193–211. Knowles, E. S., and Condon, C. A. (1999). Why people say ‘yes’: A dual-process theory of acquiescence. Journal of Personality and Social Psychology, 77: 379–386. Krosnick, J. A. (1991). Response strategies for coping with the cognitive demands of attitude measures in surveys. Applied Cognitive Psychology, 5: 213–236. Krosnick, J. A., and Alwin, D. F. (1987). An evaluation of a cognitive theory of response order effects in survey measurement. Public Opinion Quarterly, 51: 201–219. Krosnick, J. A., and Fabrigar, L. R. (1997). Designing rating scales for effective measurement in surveys. In L. Lyberg, P. Biemer, M. Collins, L. Decker, E. DeLeeuw, C. Dippo, N. Schwarz, and D. Trewin (eds), Survey Measurement and Process Quality (pp. 141– 162). New York: Wiley-Interscience. Krosnick, J. A., Holbrook, A. L., Berent, M. K., Carson, R. T., Hanemann, W. M., Kopp, R. J., Mitchell, R. C., Presser, S., Ruud, P. A., Smith, V. K., Moody, W. R., Green, M. C., and Conaway, M. (2002). The impact of ‘no opinion’ response options on data quality: Non- attitude reduction or an invitation to satisfice? Public Opinion Quarterly, 66: 371–403. Meisenberg, G., and Williams, A. (2008). Are acquiescent and extreme response styles related to low intelligence and education? Personality and Individual Differences, 44: 1539–1550. Moors, G. (2010). Ranking the ratings: A latent-class regression model to control for overall agreement in opinion research. International Journal of Public Opinion Research, 22: 93–119. Olson, K., and Bilgen, I. (2011). The role of interviewer experience on acquiescence. Public Opinion Quarterly, 75 (1): 99–114. Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver, and L. S. Wrightman (eds), Measurement of Personality and Social Psychological Attitudes (pp. 17–59). San Diego, CA: Academic Press. Peer, E., and Gamliel, E. (2011). Too reliable to be true? Response bias as a potential source of inflation in paper-and-pencil questionnaire reliability. Practical Assessment, Research and Evaluation, 16 (9): 1–8.
Podsakoff, P. M., MacKenzie, S. B., Lee, J-Y., and Podsakoff, N. P. (2003). Common method biases in behavioural research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88 (5): 879–903. Podsakoff, P. M., MacKenzie, S. B., and Podsakoff, N. P. (2012). Sources of method bias in social science research and recommendations on how to control it. Annual Review of Psychology, 63: 539–569. Revilla, M. A., Saris, W. E., and Krosnick, J. A. (2014). Choosing the Number of Categories in Agree-Disagree Scales. Sociological Methods and Research, 43, 73–97. Reynolds, N., and Smith, A. (2010). Assessing the impact of response styles on cross-cultural service quality evaluation: A simplified approach to eliminating the problem. Journal of Service Research, 13: 230–243. Rorer, L. G. (1965). The great response-style myth. Psychological Bulletin, 63 (3): 129–156. Rundqvist, E. A. (1966). Item and response characteristics in attitude and personality measurement: A reaction to L.G. Rorer’s ‘The great response-style myth’. Psychological Bulletin, 66 (3): 166–177. Saris, W. E., and Aalberts, C. (2003). Different explanations for correlated disturbance terms in MTMM studies. Structural Equation Modeling, 10: 193–213. Saris, W. E., and Gallhofer, I. N. (2014). Design, Evaluation and Analysis of Questionnaires for Survey Research (2nd edn). Hoboken, NJ: John Wiley & Sons. Saris, W., Revilla, M., Krosnick, J. A., and Shaeffer, E. (2010). Comparing questions with agree/disagree response options to questions with item-specific response options. Survey Research Methods, 4: 61–79. Schouten, B., and Calinescu, M. (2013). Paradata as input to monitoring representativeness and measurement profiles: A case study of the Dutch Labour Force Survey. In F. Kreuter (ed.), Improving Surveys with Paradata (pp. 231–258). Hoboken, NJ: Wiley. Schuman, H., and Presser, S. (1981). Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording and Context. New York: Academic Press. Tourangeau, R., Rips, L. J., and Raskinski, K. A. (2000). The Psychology of Survey Response. Cambridge: Cambridge University Press.
Van Herk, H., Poortinga, Y. H., and Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35: 346–360. Van Vaerenbergh, Y., and Thomas, T. D. (2013). Response styles in survey research: A literature review of antecedents, consequences, and remedies. International Journal of Public Opinion Research, 25 (2): 195–217. Viswanathan, M. (2005). Measurement Error and Research Design. Thousand Oaks, CA: SAGE Publications. Weijters, B. (2006). Response Styles in Consumer Research. Doctoral dissertation, Ghent University, Ghent, Belgium.
Weijters, B., Schillewaert, N., and Geuens, M. (2008). Assessing response styles across modes of data collection. Journal of the Academy of Marketing Science, 36: 409–422. Weijters, B., Cabooter, E., and Schillewaert, N. (2010a). The effect of rating scale format on response styles: The number of response categories and response category labels. International Journal of Research in Marketing, 27: 236–247. Weijters, B., Geuens, M., and Schillewaert, N. (2010b). The stability of individual response styles. Psychological Methods, 15: 96–110. Ye, C., Fulton, J., and Tourangeau, R. (2011). More positive or more extreme? A metaanalysis of mode differences in response choice. Public Opinion Quarterly, 75 (2): 349–365.
37
Dealing with Missing Values
Martin Spiess
INTRODUCTION
Most data sets are affected by missing values, i.e., data values of scientifically interesting variables that are assumed to exist but are not observed and cannot be derived deterministically from observed values. This characterization is not a clear-cut definition of the phenomenon of missing values, but it includes the most common situations. Examples are unanswered questions or unreported reactions of statistical units in general, values which are not observed because units are exposed only to parts or blocks of a larger questionnaire ('missings by design'), impossible or implausible values that are deleted, or, in the context of causal inference and non-experimental settings, units that are observed only under one of two (or more) conditions. Depending on the context, 'don't know' or 'does not apply' answers may mask missing data. The transition from (single) missing data values or 'item nonresponse' to 'unit nonresponse', where entire units are not observed,
is vague, since unit nonresponse is just an extreme form of item nonresponse, and both versions of missing values could in principle be treated with the same techniques. However, although there is an overlapping area, like missings by design or panel attrition in longitudinal data sets, the treatment of item and unit nonresponse usually differs, due to different techniques as well as the types and sources of information available. The present chapter treats the handling of missing values. Usually, researchers in the empirical sciences are interested in inferences with respect to aspects of populations approximated by statistical models. These models, including the underlying assumptions, are based on theoretical considerations or at least some intuition and/or results from previous empirical research. For most researchers, missing values are just annoying, as they cause difficulties with standard software and may lead to invalid inferences if not properly accounted for. Thus, an important objective connected with a method to compensate for missing values is
to allow valid inferences or justify decisions as if there were no missing values. A secondary objective may be the possibility of applying easy-to-use (standard) analysis software. If compensating for missing values is inevitable, additional assumptions or knowledge is necessary depending on the affected variables and the mechanism that led to the missing values (‘missing mechanism’). A first decision is therefore whether the missing mechanism can safely be ignored for valid inferences. Unfortunately this is not an easy decision: among other things it mainly depends on the missing mechanism itself. In some situations the mechanism may be known, e.g., from empirical research on nonresponse or due to the adopted research design. In most cases, very little is known about the missing mechanism and inferences are either based on implicit or explicit assumptions about this mechanism. Therefore, another important issue is the robustness of inferences of scientific interest with respect to a possibly misspecified missing mechanism. If the missing mechanism cannot be ignored, then a strategy to compensate for the missing values has to be chosen. This strategy depends on the missing mechanism, the assumed model of scientific interest including all assumptions, the availability of resources (e.g., software, programming skills, time), but also on the adopted inferential concept (e.g., classical, likelihood, Bayes) and whether or not the inferences are evaluated based on frequentist or non-frequentist criteria.
CLASSIFICATION OF MISSING VALUES
One important step in deciding whether the missing mechanism can be ignored or not is the classification of missing values as being missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR) (Rubin, 1976). For a precise description of this classification, some notation needs to be introduced.
The $k = 1, \ldots, K$ variables intended to be surveyed for the $n$-th unit ($n = 1, \ldots, N$), like income, age or highest graduation, duration of the interview, but also variables on more aggregated levels like living area, will be denoted as $v_n = (v_{n,1}, \ldots, v_{n,K})'$, and all $N$ vectors $v_n$ will be collected in an $(N \times K)$ matrix $V$. If there are missing values, an $(N \times K)$ matrix $R$ of response variables $r_{n,k}$ may be introduced, which indicate if the value of $v_{n,k}$ is observed ($r_{n,k} = 1$) or not ($r_{n,k} = 0$). Let $V_{obs}$ be the part of $V$ which is observed and $V_{mis}$ be the part of $V$ which is not observed. The binary variables $r_{n,k}$ are considered to be random variables, and the probability mass function of $R$ conditional on $V$, the missing mechanism, will be denoted as $g(R \mid V; \gamma)$, where $\gamma$ is some unknown parameter. Missing values are MCAR if the probability of the observed pattern of missing and observed values does not depend on the matrix-valued variable $V$, for each possible value of $\gamma$, i.e., if
$$g(R; \gamma) = g(R \mid V; \gamma) \quad \text{for all } V.$$
If the missing values are MCAR, then the observed part of the data set is a simple random subsample from the complete sample without any missing value, and thus the observed part of the sample is non-selective. As an example, let us assume that the two variables, income and age, are being surveyed, and that age is always observed but income is not observed for some units. The missing incomes are MCAR if the probability of the observed pattern of observed and missing income values depends neither on age nor on income. Missing values are MAR if the probability of the observed pattern of missing and observed values given the observed data $V_{obs}$ does not depend on $V_{mis}$ for each possible value of $\gamma$ or, more formally, if
$$g(R \mid V_{obs}; \gamma) = g(R \mid V; \gamma) \quad \text{for all } V_{mis}.$$
If missing values are MAR, then conditional on the observed values in $V_{obs}$, the
missing values are MCAR. In the example, missing incomes are MAR if the probability of the observed pattern of observed and missing income values depends on observed age but not additionally on income itself. This would hold if older cohorts, as compared to younger people, tend to conceal their income, but the tendency to report one's income at any given age is completely random. If missing values are MAR, the observed data may or may not be selective with respect to the research question of interest. In the example, if income and age were unrelated, then even if the probability of observing income depended on age, the observed incomes would be a simple random sample from the incomes in the complete sample. On the other hand, if there is a positive relationship between income and age, the observed subsample of incomes would systematically differ from the complete sample, the direction depending on the relationship between age and the response indicator. Finally, missing values are NMAR if the probability of the observed pattern of missing and observed values depends on the unobserved values even after conditioning on the observed values in $V$. This would be the case if the probability of observing or not observing income depends on observed age but, in addition, on the amount of income itself. Hence, even for a given age, the probability of observing income would depend on income itself, for example, when the probability of reporting one's income decreases with higher income within each age cohort. If missing values are NMAR, then the observed part of the sample is selective with respect to the analyses of scientific interest in general, and strong external information is necessary to allow valid inferences.
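The income-and-age example can be illustrated with a small simulation. The following Python sketch generates income as positively related to age and then deletes income values under an MCAR, a MAR and an NMAR mechanism; the distributional and parameter choices are purely illustrative assumptions, not taken from the chapter.

import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# Positively related age and income, both always generated
age = rng.normal(45, 12, N)
income = 1_000 + 40 * age + rng.normal(0, 400, N)

def observed_mean(income, p_observe):
    """Mean of income among units whose income happens to be observed."""
    r = rng.random(N) < p_observe
    return income[r].mean()

# MCAR: observation probability constant
mcar = observed_mean(income, np.full(N, 0.6))

# MAR: older respondents conceal income more often (depends on age only)
mar = observed_mean(income, 1 / (1 + np.exp(0.08 * (age - 45))))

# NMAR: concealment increases with income itself, even at a given age
nmar = observed_mean(income, 1 / (1 + np.exp(0.004 * (income - income.mean()))))

print(f"true mean {income.mean():.0f}  MCAR {mcar:.0f}  "
      f"MAR {mar:.0f}  NMAR {nmar:.0f}")

Because income and age are positively related in this sketch, the observed-case mean is biased under the MAR and NMAR mechanisms but not under MCAR, in line with the discussion above.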
AD-HOC TECHNIQUES
To solve missing values problems, several techniques have been proposed. One is
complete case analysis (CCA), where only the completely observed cases enter the analysis of interest. This allows valid inferences if the missing values are MCAR, but generally not if missing values are MAR or NMAR. A related technique, which also allows valid inferences when missing values are MCAR, is available case analysis (ACA), which uses all cases for which the necessary values to calculate the statistic of interest have been observed. If various statistics have to be calculated, ACA avoids the problem of CCA of possibly ignoring a large amount of information by deleting all incompletely observed units. However, it may lead to technical problems, for example, if a variance-covariance matrix is calculated. Since each entry of this matrix may be calculated based on different numbers of observations, the resulting matrix may not be positive definite, which is a requirement for various statistical techniques. Hence, even if the missing values are MCAR, a technique may be desirable that avoids the loss of a large amount of information or technical problems. One such strategy is to replace each missing value by a predicted value. Many versions of this strategy, also called single imputation, have been used, from very simple ones, like replacing each missing value by a zero or another arbitrarily chosen value, or by the unconditional mean, median, or mode of the incompletely observed relevant variable, to more elaborate versions, like using the mean, median, or mode conditional on completely observed variables or using the observed value of a (randomly selected) similar unit as the predicted value. Obviously, a strategy that uses some arbitrarily chosen value, e.g. a zero, for each unobserved value and treats this value as being observed almost always leads to biased inferences, even if the missing values are MCAR. A related strategy in the regression context with missing predictor values is to include interaction terms of the variables with missing values and response variables indicating if the values of the predictor
variable are observed or not, and one minus these response variables, as additional predictors in the regression model. However, this strategy does not work with standard software in general, even in the linear regression model, as either the regression parameter estimators themselves or the variance estimator will be biased (Jones, 1996). If each missing value is replaced with the unconditional mean, median, or mode and the missing values are MCAR, estimators of location will not be biased but, e.g., estimators of spread will be downward biased. Hence, measures of dependence will be biased and standard errors as calculated by software for completely observed data sets will be downward biased. If the missings are not MCAR, this strategy will generally lead to invalid inferences. Using conditional means, medians, or modes to replace each missing value is a step in the right direction, because this strategy takes into account relationships between the variable to be imputed and the other variables in the data set. Hence, this strategy has the potential of allowing unbiased inferences. On the other hand, replacing each missing value by a conditional mean or any other single value is equivalent to replacing unobserved independent information by an estimate based on information from the observed part of the data set and some (usually weak) assumptions. Hence, although the imputed values carry (almost) no additional information which is not already in the observed part of the sample, once the data set is completed and is treated as being completely observed, the amount of information is systematically overestimated by standard software for completely observed data sets. This leads to a systematic underestimation of variances of estimators and to proportions of falsely rejected null hypotheses that are systematically too high. Thus, even if single imputations are properly generated, standard software cannot be used for inferences, but has to be adapted correspondingly. The main problem with ad-hoc methods is that an intuitive, seemingly simple justification replaces theoretical considerations.
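The consequences of single imputation sketched above can be seen in a few lines of code. The following Python sketch, under purely illustrative assumptions (a linear relationship and 40% MCAR missingness in the dependent variable), compares unconditional mean imputation with conditional mean imputation.

import numpy as np

rng = np.random.default_rng(2)
N = 50_000

x = rng.normal(0, 1, N)
y = 2 + 1.5 * x + rng.normal(0, 1, N)   # true slope 1.5, sd(y) about 1.8
r = rng.random(N) < 0.6                 # MCAR: 40% of y missing

# Unconditional mean imputation
y_mean_imp = np.where(r, y, y[r].mean())

# Conditional mean imputation: regress y on x among observed cases
b1, b0 = np.polyfit(x[r], y[r], 1)
y_cond_imp = np.where(r, y, b0 + b1 * x)

for label, yy in [("complete data", y),
                  ("mean imputation", y_mean_imp),
                  ("conditional mean imputation", y_cond_imp)]:
    print(f"{label:28s} mean={yy.mean():5.2f}  sd={yy.std():4.2f}  "
          f"cor(x,y)={np.corrcoef(x, yy)[0, 1]:4.2f}")

Mean imputation shrinks the spread and attenuates the correlation even though the values are MCAR; conditional mean imputation preserves the location but still distorts the variability, which is why standard software applied to a singly imputed data set overstates the amount of information.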
IGNORABILITY OF THE MISSING MECHANISM
Ignoring the missing mechanism is equivalent to analyzing $V_{obs}$ by (usually implicitly) fixing $R$ at the observed values and assuming that the values in $V_{obs}$ are realizations from the marginal distribution of $V$ with respect to $V_{mis}$,
$$f(V_{obs}; \theta) = \int f(V; \theta)\, dV_{mis},$$
where $\theta$ is the parameter of scientific interest (Rubin, 1976). Here and in the following, $V_{mis}$ will be assumed to consist only of continuous variables. In the case of categorical variables, integrals have to be replaced by sums. One important criterion for the validity of analyses based on this marginal distribution, which ignores the missing mechanism, is whether missings are MCAR, MAR, or NMAR. Another criterion, usually seen as being less important, is whether the parameters of the model of scientific interest and the parameters of the missing mechanism are distinct, i.e., whether their joint parameter space is the product of their respective parameter spaces in the case of non-random parameters, or whether they are a priori stochastically independent in case the parameters are assumed to be random variables. In both cases we will talk below about distinct parameters. Often these parameters can be assumed to be distinct. This would, however, not apply if, e.g., one has to assume, conditional on covariates, a correlation matrix of the response indicators and the random variables of scientific interest with common parameters. If missing values are MCAR, then, as noted in 'Classification of missing values' above, $V_{obs}$ is a simple random subsample of $V$ and the missing mechanism can safely be ignored regardless of the inferential concept or statistical model adopted. If missing values are MAR, then it depends on the chosen inferential concept, the evaluation criteria, and the model whether the missing mechanism is ignorable or not. Under
MAR, $g(R \mid V_{obs}; \gamma) = g(R \mid V; \gamma)$, and thus the joint distribution of $V_{obs}$ and $R$ is
$$f(V_{obs}; \theta)\, g(R \mid V_{obs}; \gamma) = \int f(V; \theta)\, g(R \mid V; \gamma)\, dV_{mis}. \quad (1)$$
Adopting a likelihood approach requires specification of the joint distribution of the variables treated as random, conditional on variables fixed at the observed values. The parameters of this probability density or mass function are fixed and unknown. The likelihood function, or likelihood for short, is just this probability density or mass function with reversed roles of variables and parameters. Thus it is a function of the unknown parameter where all variables are fixed at their observed values. A maximum likelihood (ML) estimate is the parameter value that maximizes the natural logarithm of the likelihood ('log-likelihood'). The corresponding function of the observed variables is the ML estimator. Thus if missing values are MAR, only $f(V_{obs}; \theta)$ needs to be considered to find the most plausible value of $\theta$. The likelihood plays an important role in inferential statistics whether inferences are evaluated from a frequentist point of view, which is usually conditional on $R$, or not, in which case likelihood analysis is evaluated at the observed variable values. It is essential in Bayes inference, where prior information about a parameter is transformed into posterior knowledge via the likelihood and inference is usually evaluated at the observed variable values. Maximum likelihood inferences are often evaluated in large samples from a frequentist point of view, i.e., properties of estimators are considered based on (hypothetical infinite) repetitions of the underlying random process. Then statements are still valid in general, and will be fully efficient if in addition $\theta$ and $\gamma$ are distinct. Nevertheless, there are some subtle additional requirements for valid analyses. From (1) it is clear that the distribution of any statistic as a function of
$V_{obs}$ is not independent from the missing mechanism. Hence, for example, calculating standard errors for ML estimators should be based on the observed rather than on the expected Fisher information, as this expectation ignores that there are missing values. Although the distribution of any function of $V_{obs}$ conditional on $R$ depends on the missing mechanism in general, it seems as if ignoring the dependence of higher moments than the mean and variance on the missing mechanism for hypothesis tests and confidence intervals still allows valid inferences. However, a limitation is that ML inferences may be less robust with respect to model misspecification if missing data are MAR instead of MCAR. Hence, if missing values are MAR, then the requirement that all necessary assumptions for ML estimation are met is stronger than in the MCAR case. Further, it should be noted that a frequentist evaluation requires stronger assumptions if analyses are not conditional on $R$. Further note that this discussion does not refer to non-ML estimation. The distinction between missing values being MCAR or MAR is not as clear-cut as it may seem, as the following example shows. If the model of scientific interest is a regression model, then $V$ is partitioned into a dependent part $Y$ and a covariate part $X$, and we are interested in the conditional distribution $f(Y \mid X; \theta)$. In the presence of missing values and based on a frequentist evaluation conditional on $R$, the conditional distribution to be considered is $f(Y \mid X, R; \theta, \gamma)$. However, if $g(R \mid X; \gamma) = g(R \mid Y, X; \gamma)$, i.e., if the probability of the observed pattern of observed and missing values depends only on variables conditioned upon, then it follows that $f(Y \mid X; \theta) = f(Y \mid X, R; \theta, \gamma)$. For example, given independence of observations and the same model structure for all units in a cross-sectional context, we may perform a CCA standard regression analysis and still arrive at valid inferences. In that case, if only covariates are affected by missing values, then a CCA analysis is equivalent to ignoring available and useful information
for the analysis of interest, and inferences – although valid – will not be fully efficient. On the other hand, if only the dependent variable is affected by missing values, then the incompletely observed cases do not convey information with respect to the ML regression analysis and thus inferences will be valid and fully efficient given the observed data. If missing values are NMAR, then the distribution of $V_{obs}$ does depend on $R$, which in turn depends in a usually not completely known way upon $V_{mis}$. Thus the integral or sum has to be taken over the joint distribution of $V$ and $R$, simplification of the right-hand side term in (1) is not possible and the missing mechanism cannot in general be ignored. In this case, information from other sources than the observed data set, like strong distributional assumptions or restrictions, is necessary for valid inferences. Note, however, that in the above regression example, $R$ may depend on $X$ for valid inferences, which may not even be observed for the excluded cases.

To illustrate the above arguments, consider the following example. Assume $r^*$ and $v = (v_1\ v_2)'$ to be multivariate normally distributed, with means $E(r^*) = \mu_r$, $E(v_1) = \mu_1$ and $E(v_2) = \mu_2$, variances $\mathrm{var}(r^*) = \mathrm{var}(v_1) = \mathrm{var}(v_2) = 1$, and correlations $\mathrm{cor}(r^*, v_1) = \rho_{r,1}$, $\mathrm{cor}(r^*, v_2) = \rho_{r,2}$ and $\mathrm{cor}(v_1, v_2) = \rho_{1,2}$. Let $r$ be the binary response indicator, with $r = 1$ if $r^* > 0$ and $r = 0$ else. The values of $r$ and $v_1$ are always observed, $r^*$ is never observed and the values of $v_2$ are only observed if $r = 1$. Given this setup, $r^* \mid v_1, v_2$ is normally distributed with $E(r^* \mid v_1, v_2) = \gamma_0 + v_1 \gamma_1 + v_2 \gamma_2$, where
$$\gamma_1 = \frac{\rho_{r,1} - \rho_{r,2}\,\rho_{1,2}}{1 - \rho_{1,2}^2} \quad \text{and} \quad \gamma_2 = \frac{\rho_{r,2} - \rho_{r,1}\,\rho_{1,2}}{1 - \rho_{1,2}^2}. \quad (2)$$
Missing values in $v_2$ are MCAR if $r$ does not depend on $v_1$ and $v_2$, which is the case if $\rho_{r,1} = 0$ and $\rho_{r,2} = 0$. Missing values are MAR if $\rho_{r,1} \neq \rho_{r,2}\,\rho_{1,2}$ and $\rho_{r,2} = \rho_{r,1}\,\rho_{1,2}$; they are NMAR if $\rho_{r,1} \neq \rho_{r,2}\,\rho_{1,2}$ and $\rho_{r,2} \neq \rho_{r,1}\,\rho_{1,2}$. Note that the assumptions of $r$ being independent from $v_1$ and $v_2$, or independent from $v_2$ given $v_1$, are stronger than assuming that the probability of the observed pattern of observed and missing values does not depend on $v_1$ and $v_2$ (MCAR) or that it does not depend on $v_2$ given $v_1$ (MAR), and are made to simplify the presentation. On the other hand, $v_2 \mid v_1, r^*$ is normally distributed with
$$E(v_2 \mid v_1, r^*) = \beta_0 + v_1 \beta_1 + r^* \beta_2,$$
where
$$\beta_1 = \frac{\rho_{1,2} - \rho_{r,1}\,\rho_{r,2}}{1 - \rho_{r,1}^2} \quad \text{and} \quad \beta_2 = \frac{\rho_{r,2} - \rho_{r,1}\,\rho_{1,2}}{1 - \rho_{r,1}^2}. \quad (3)$$
Upon comparing $\gamma_2$ and $\beta_2$, it is obvious that if $\rho_{r,2} = \rho_{r,1}\,\rho_{1,2}$, then the missing mechanism is ignorable with respect to the regression of $v_2$ on $v_1$, since $v_2 \mid v_1$ does not additionally depend on $r^*$ and thus on $r$. Furthermore, regression parameters may be estimated based on the completely observed cases only, if it is assumed that the same regression relationship holds for all units. Note that these arguments justify a complete case analysis and thus also hold if values of $v_1$ are missing according to the same mechanism. If the parameters $\theta$ of the joint distribution of $(v_1\ v_2)'$ are of interest, then given a sample $v_1 = (v_{1,1}, \ldots, v_{N,1})'$, $v_2 = (v_{1,2}, \ldots, v_{N,2})'$ and $r = (r_1, \ldots, r_N)'$, the relevant distribution is
$$f(v_{2,obs}, v_1 \mid r; \theta, \gamma) = \int f(v_2 \mid v_1; \theta_{2|1})\, f(v_1; \theta_1)\, \frac{g(r \mid v_1, v_2; \gamma_{r|1,2})}{g(r; \gamma_r)}\, dv_{2,mis},$$
where it is assumed that the variables $(v_{n,1}, v_{n,2}, r_n)$, $n = 1, \ldots, N$, are distributed as $(v_1, v_2, r)$, and $\theta_{2|1}$, $\theta_1$, $\gamma_{r|1,2}$ and $\gamma_r$ are the parameters governing the corresponding distributions. If the $(v_{n,1}, v_{n,2}, r_n)$'s are
independent and missingness depends only on $v_{n,1}$, then
$$f(v_{2,obs}, v_1 \mid r; \theta, \gamma) = \prod_{n=1}^{N} f(v_{n,2}, v_{n,1}; \theta)^{r_n}\, f(v_{n,1}; \theta_1)^{1-r_n}\, \frac{g(r_n \mid v_{n,1}; \gamma_{r|1})}{g(r_n; \gamma_r)}, \quad (4)$$
and the relevant part for estimating $\theta$ is $\prod_{n=1}^{N} f(v_{n,2}, v_{n,1}; \theta)^{r_n}\, f(v_{n,1}; \theta_1)^{1-r_n}$. Taking the logarithm and maximizing this function for given values $r$, $v_1$ and $v_{2,obs}$, with respect to $\mu_1$ and $\mu_2$ assuming normality, gives the ML estimators
$$\hat\mu_1 = \frac{1}{N}\sum_{n=1}^{N} v_{n,1}, \qquad \hat\mu_2 = \frac{1}{\sum_{n=1}^{N} r_n}\sum_{n=1}^{N} r_n\,\bigl(v_{n,2} - \rho_{1,2}(v_{n,1} - \hat\mu_1)\bigr).$$
Thus, $\mu_1$ can be estimated as usual from the complete sample and $\mu_2$ can be estimated from the observed part of the sample plus a correction term. Note that $\hat\mu_2$ can be rewritten as
$$\hat\mu_2 = \frac{1}{N}\sum_{n=1}^{N}\bigl(r_n v_{n,2} + (1 - r_n)(\beta_0 + \beta_1 v_{n,1})\bigr),$$
where the last term in brackets is the conditional mean of $v_{n,2} \mid v_{n,1}$. Thus, to calculate $\hat\mu_2$, each missing value is replaced by the (to be estimated) conditional mean of $v_{n,2} \mid v_{n,1}$ from (3).

LIKELIHOOD APPROACHES
The Ignorable Case
Basically, ML estimation proceeds in the same way whether or not there are missing values. That is, the joint probability density or mass function of the observed random variables, including $R$, conditional on all other variables has to be correctly specified. Maximizing the corresponding log-likelihood function with respect to the model parameter of scientific interest and the parameter governing the missing mechanism provides the ML estimators of both parameters. If all modeling assumptions are met, the ML estimator is asymptotically normally distributed, consistent, and asymptotically efficient in many cases. However, the model specification depends not only on the model of interest but also on the missing mechanism and the type of the affected variables. Its software implementation depends in addition on the pattern of the missing values. Thus even if the missing mechanism is ignorable, a situation we will consider in the first part of this section, software may not be available for many situations of scientific interest. Likelihood inference is comparatively simple if the missing mechanism is ignorable, the likelihood can be factored into easy-to-handle components and the parameters of these components are distinct. As an illustration consider the example from the last section. Ignoring the missing mechanism, the relevant density in (4) can be rewritten as $\prod_{n=1}^{N} f(v_{n,1}; \theta_1) \prod_{n=1}^{N} f(v_{n,2} \mid v_{n,1}; \theta)^{r_n}$, leading to the log-likelihood
$$l(\mu_1, \mu_2, \rho;\, v_1, v_{2,obs}) = \sum_{n=1}^{N} l(\mu_1; v_{n,1}) + \sum_{n=1}^{N} r_n\, l(\beta, \sigma^2; v_{n,2} \mid v_{n,1}),$$
where $\beta = (\beta_0\ \beta_1)'$ and $\sigma^2 = 1 - \rho_{1,2}^2$ is the error variance from the regression of $v_{n,2}$ on $v_{n,1}$. Since the parameters of the two components are distinct, we can maximize the first part to get an estimate for $\mu_1$ and, separately, the second component to get estimates for $\beta$ and $\sigma^2$ based on the completely observed part of the sample. Estimates of $\mu_2$ and $\rho$ can then easily be derived. Of course, factorizing the likelihood is a strategy that may be applied in situations with more than two variables.
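As a numerical illustration of the factored likelihood for the bivariate case just discussed, the following Python sketch estimates $\mu_1$ from all cases and the regression of $v_2$ on $v_1$ from the completely observed cases, and then forms $\hat\mu_2$ by filling each missing $v_2$ with its estimated conditional mean, as in the rewritten estimator above. The sample size, parameter values and the logistic form of the MAR response probability are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(3)
N = 200_000
rho = 0.6

# Bivariate normal with means (1, 2), unit variances, correlation rho
v1 = 1 + rng.normal(0, 1, N)
v2 = 2 + rho * (v1 - 1) + rng.normal(0, np.sqrt(1 - rho**2), N)

# MAR: the probability of observing v2 depends on v1 only
r = rng.random(N) < 1 / (1 + np.exp(-(v1 - 1)))

# Factored likelihood: mu_1 from all cases,
# regression of v2 on v1 from the completely observed cases
mu1_hat = v1.mean()
b1_hat, b0_hat = np.polyfit(v1[r], v2[r], 1)

# mu_2 as in the rewritten estimator: observed values plus
# estimated conditional means for the missing ones
mu2_hat = np.where(r, v2, b0_hat + b1_hat * v1).mean()

print(f"mu1_hat={mu1_hat:.3f} (true 1)  mu2_hat={mu2_hat:.3f} (true 2)")
print(f"naive mean of observed v2 only: {v2[r].mean():.3f}")

The naive mean of the observed $v_2$ values is biased because observation depends on $v_1$, whereas the factored-likelihood estimate recovers $\mu_2$.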
An example is a linear panel or longitudinal model where units are observed at least at the first time point and may drop out later. If the missing mechanism is ignorable, which gains plausibility if dropout is an absorbing state, then ML estimation is possible based on the observed variable values for each unit only, assuming multivariate normality of the dependent variables, by a factorization of the likelihood similar to the bivariate case considered above. Given distinctness of parameters, a decomposition of the log-likelihood into easy-to-maximize log-likelihoods also allows the simple calculation of a variance estimator for the ML estimators, so that likelihood or Bayes inference can easily be carried out. However, in applications this requires that corresponding software is available which is able to combine and transform the estimation results properly. Otherwise, standard software may be used and results need to be combined with additional programming effort. In the context of structural equation modeling, ML estimation with incompletely observed units is often called full information maximum likelihood (FIML) estimation. The difference between FIML and ML in this context is that in the latter case only the completely observed units are included. The pattern of missing values in the last two examples is called monotone. More generally, consider a data set given as a matrix in which the $k = 1, \ldots, K$ variables for each unit are given in a row vector and the row vectors of all units are stacked over each other to form the matrix of observations with $N$ rows. A missing pattern is monotone if it is possible to arrange the columns of this matrix in such a way that for each unit $n = 1, \ldots, N$, the values of variables $1, \ldots, k_n$ are observed and, if $k_n < K$, the values of variables $k_n + 1, \ldots, K$ are not observed. If the missing pattern is monotone, then methods to compensate for missing values are generally less complicated and easier to justify theoretically than for non-monotone missing patterns. It should be noted, however, that factorizing the log-likelihood
is sometimes possible even if the missing pattern is non-monotone, e.g., when different databases are combined ('file-matching'; see Little and Rubin, 2002). If a missing pattern is non-monotone, but close to being monotone, then one might adopt the strategy of creating a monotone missing pattern by either deleting units or imputing values based on a simple technique. However, both versions should be applied only if the number of units to be deleted or values to be imputed is small, so that the consequences of applying possibly inadequate techniques are minimized. A better solution would be the application of the expectation maximization (EM) algorithm offered by some software packages. The EM algorithm (Dempster et al., 1977) and its further developments simplify ML estimation in otherwise complicated problems by deconstructing a difficult task into a sequence of less complicated tasks which are simpler to solve and whose solution converges to the solution of the difficult task. Thus, the EM algorithm is particularly suited to deal with general missing data problems, but the field of applications is remarkably broader than standard missing data situations. Based on a log-likelihood function and starting values for the unknown parameters, the EM algorithm iterates over two steps, the expectation or E-step, and the maximization or M-step. In the E-step, the expectation of the log-likelihood is taken at some starting values of the parameters with respect to the variables with missing values, conditional on the observed values. In this step, missing values are replaced by estimated values or, in more general problems, missing sufficient statistics or even the log-likelihood function itself is estimated. In the M-step, the resulting function is maximized with respect to the parameters. These new parameter values are then used for the next E-step and so on. The sequence of parameter values generated in this way successively increases the log-likelihood. In many practical applications, the sequence of parameter values converges
to the ML estimator. Technical shortfalls of the EM algorithm are that it can still involve complicated functions, that it may converge very slowly and that standard errors are not automatically calculated. Hence, several extensions have been developed to overcome these deficiencies. An example of the EM algorithm is given in the next subsection.
The Nonignorable Case
If missing values are nonignorable then, in general, the relationships between $V_{obs}$, $R$ and $V_{mis}$ need to be considered explicitly. Likelihood inferences start with the mixed probability density and mass function of $V$, or of $V_{obs}$ and $R$, to find the most plausible values of $\theta$ and, if as in most cases the missing mechanism is unknown, of $\gamma$ for the given data. The joint probability density and mass function, e.g., of $V_{obs}$ and $R$, can be written either in the form $f(V_{obs}; \theta)\, g(R \mid V_{obs}; \gamma)$, which is a selection model representation, or as a pattern mixture model $f(V_{obs} \mid R; \vartheta)\, g(R; \xi)$. Through the term $f(V_{obs} \mid R; \vartheta)$, pattern mixture models allow inferences about differences between subpopulations defined by patterns of missing data. Selection models, in contrast, allow inferences for the whole population. One way to get ML estimates in this case is via the EM algorithm. The approach of iterating over the E- and the M-step is in principle the same as in the case of an ignorable missing mechanism, but now the parameter of the missing mechanism has to be estimated as well. As an example, consider univariate normally distributed variables $v_n$ with $E(v_n) = \mu$ and $\mathrm{var}(v_n) = 1$, which are only observed ($r_n = 1$) if $v_n \leq \gamma + \mu$, and not observed otherwise. Let the units be sorted such that the first $N_1$ values are observed, and the remaining $N - N_1$ values are not observed. Both the observed and the unobserved values are from a truncated normal distribution, and the $r_n$'s are from a Bernoulli distribution. The EM algorithm requires specification of the joint distribution for the
complete data, which in a selection model representation is, for each $n$, given by $\phi(v_n - \mu)\,\Phi(\gamma)^{r_n}\,(1 - \Phi(\gamma))^{1 - r_n}$, where $\phi(\cdot)$ is the density and $\Phi(\cdot)$ is the distribution function of the univariate standard normal distribution. For the E-step we need to take the conditional mean of the log-likelihood function of the whole sample with respect to the unobserved variables, $v_{n,mis}$, given the values of all observed variables. This function is maximized with respect to the unknown parameters $\mu$ and $\gamma$ in the M-step. Note that $\hat\gamma$ is simply the $N_1/N$-quantile of the standard normal distribution and needs to be calculated just once. The estimator for $\mu$ in the $j$-th iteration step is given by
$$\hat\mu^{j} = \Bigl(\sum_{n=1}^{N_1} v_n + (N - N_1)\,\hat\mu_{mis|obs}^{j-1}\Bigr)\Big/ N,$$
where $\hat\mu_{mis|obs}^{j-1} = \hat\mu^{j-1} + \phi(\hat\gamma)/(1 - \Phi(\hat\gamma))$ denotes the estimated conditional mean of the variables $v_{n,mis}$ given the observed values and the estimates from the $(j-1)$-th iteration step (M-step). In the E-step of the $j$-th iteration, $\hat\mu_{mis|obs}^{j}$ has to be calculated. Given a starting value $\hat\mu^{0}$, the E- and the M-step are repeated until the absolute change in the parameter estimates is below a predefined threshold. Generalizations of the above example lead to various selection models proposed in the literature, e.g., by Heckman (1976), to model selection of women into the labor force, or, in a panel data context, to model self-selection of respondents into a question with respect to party identification based on a survey questionnaire (Spiess and Kroh, 2010). These models can be estimated by ML, two-step, or semi-parametric approaches.
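A minimal implementation of the univariate example just described is given below, assuming the data are generated exactly as in the text ($v_n \sim N(\mu, 1)$, observed only if $v_n \leq \gamma + \mu$); the chosen true values, starting value and convergence threshold are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
N, mu_true, gamma_true = 100_000, 5.0, 0.5

v = rng.normal(mu_true, 1.0, N)
observed = v[v <= gamma_true + mu_true]       # values above the cut-off are lost
N1 = observed.size

# gamma is estimated once as the N1/N quantile of the standard normal
gamma_hat = norm.ppf(N1 / N)

# EM iterations for mu: replace the missing values by their conditional
# mean mu + phi(gamma)/(1 - Phi(gamma)), the mean of the upper tail
mu_hat = observed.mean()                      # starting value
for _ in range(200):
    mu_mis = mu_hat + norm.pdf(gamma_hat) / (1 - norm.cdf(gamma_hat))
    mu_new = (observed.sum() + (N - N1) * mu_mis) / N
    if abs(mu_new - mu_hat) < 1e-10:
        break
    mu_hat = mu_new

print(f"gamma_hat={gamma_hat:.3f} (true {gamma_true})  "
      f"mu_hat={mu_hat:.3f} (true {mu_true})  "
      f"mean of observed values only: {observed.mean():.3f}")

The mean of the observed values alone is clearly too small, while the EM sequence converges to (approximately) the true $\mu$.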
MULTIPLE IMPUTATION
General
Multiple imputation (MI) is a method proposed by Rubin (1996, 2004) to compensate for item nonresponse in situations with not more than 30% to 50% of missing information. The idea is to generate multiple
predictions ('imputations') for each missing value, thus generating multiple versions of a completed data set which can be analyzed using standard methods for completely observed data sets. The multiple results are then combined according to some simple rules to allow inferences about statistical models of scientific interest. MI was developed based on a Bayesian approach, and for the final inferences to be valid, the imputation method has to be proper. Let the data set not affected by missing values be denoted as the complete-data set, and the (unknown) estimates of the parameter of scientific interest and its variance – both of which would have been observed if there were no missing values – be denoted as complete-data statistics. Further note that the process of arriving at an incompletely observed data set can be interpreted as consisting of two steps: a first step in which the sample is selected from a population – this sample includes all the variable values of interest from all selected units (the complete-data set) – and, due to nonresponse, a second selection step leading to the observed subsample from the complete-data set. Then, for an MI method to be proper, the estimator of the parameter of scientific interest and its variance, based on a multiply imputed data set, must be approximately unbiased for the corresponding complete-data statistics, and an approximately unbiased estimator of the variance of the estimator of scientific interest across imputations must exist. If the multiple imputation method is proper and the inference based on the complete-data set is valid, then the analysis using the multiply imputed data set tends to be valid for the unknown population parameters even in the frequentist sense (Rubin, 1996, 2004). Within a Bayesian framework, $\theta$ is considered to be a random variable and inferences are based on the posterior distribution of $\theta$. This posterior distribution is a function of a likelihood and a prior distribution of $\theta$, the latter reflecting beliefs or knowledge about $\theta$ before the data are observed. The difference
between the prior and the posterior distribution represents the effect of the information carried by the observed data on knowledge about $\theta$. Notationally omitting dependence on parameters other than $\theta$, the key to understanding MI is the posterior distribution of $\theta$, i.e., the distribution of $\theta$ after $V_{obs}$ and $R$ have been observed,
$$p(\theta \mid V_{obs}, R) = \int p(\theta \mid V_{mis}, V_{obs}, R)\, p(V_{mis} \mid V_{obs}, R)\, dV_{mis}, \qquad (5)$$
where $p(\theta \mid V_{mis}, V_{obs}, R)$ is the completed-data posterior distribution of θ, which we would obtain for given values $V_{obs}$, $V_{mis}$ and R, and $p(V_{mis} \mid V_{obs}, R) = \int p(V_{mis} \mid V_{obs}, R, \theta)\, p(\theta \mid V_{obs}, R)\, d\theta$ is the posterior predictive distribution of $V_{mis}$. The posterior distribution $p(\theta \mid V_{obs}, R)$ is thus the mean of the completed-data posterior distribution of θ with respect to possible values of $V_{mis}$. Two moments of the posterior distribution of θ are of special interest: the mean and the variance of θ given the observed data,
$$E(\theta \mid V_{obs}, R) = E\big(E(\theta \mid V, R) \mid V_{obs}, R\big) \quad \text{and}$$
$$Var(\theta \mid V_{obs}, R) = E\big(Var(\theta \mid V, R) \mid V_{obs}, R\big) + Var\big(E(\theta \mid V, R) \mid V_{obs}, R\big).$$
Given (5), suppose we could simulate values for $V_{mis}$ from $p(V_{mis} \mid V_{obs}, R)$ and values for θ from $p(\theta \mid V_{mis}, V_{obs}, R)$. These would be values simulated from their conditional joint distribution. Repeating this simulation process M times would generate M predictions or imputations for the missing values in $V_{mis}$. Once a data set is multiply imputed and analyzed with standard software for completely observed data sets, combining the results is straightforward following Rubin’s (2004) combining rules: let $\hat{\theta}_m$ be the estimator of
scientific interest based on the m-th imputed data set (m = 1, …, M) and $\hat{V}(\hat{\theta}_m)$ be the variance estimator of $\hat{\theta}_m$; then the final estimator for θ is given by
$$\hat{\theta}_M = \frac{1}{M}\sum_m \hat{\theta}_m \qquad (6)$$
and its variance can be estimated by
$$\hat{V}(\hat{\theta}_M) = W_M + (1 + M^{-1})\,B_M, \qquad (7)$$
where $W_M = \sum_m \hat{V}(\hat{\theta}_m)/M$ is the within variability and $B_M = \sum_m (\hat{\theta}_m - \hat{\theta}_M)(\hat{\theta}_m - \hat{\theta}_M)'/(M-1)$ is the between variability, which reflects the information missing due to nonresponse. If the missing values could be derived from observed values without error, then $B_M$ would vanish. On the other hand, the diagonal elements of $B_M$ increase with decreasing information about the missing values available in the sample. Thus, for $M = \infty$, the fraction of information missing due to nonresponse is a function of the variances in $B_M$ relative to those in $\hat{V}(\hat{\theta}_\infty)$. Now, an imputation method is proper if, for infinite M, estimators (6) and $W_M$ are approximately unbiased for the corresponding estimates in the complete-data set, and $B_M$ is an approximately unbiased estimator for the true variance of the $\hat{\theta}_m$’s across the imputations. Given the imputation method is proper and the analysis method that would have been applied to the complete-data set is valid, then in large samples the estimator $\hat{\theta}_M$ is approximately unbiased and normally distributed with a variance that can be estimated by (7) in many cases, and inferences based on standard tests tend to be valid. These results are based on asymptotic (with respect to M and N) considerations, but they can be expected to hold in finite samples as well, as long as N and M are not too small. The number of imputations proposed varies between 5 and 20, but more than 20 have also been used, depending on the (estimated) fraction of missing information. In general, more is
better, as too few imputations lead to a loss of precision. However, test statistics have been derived for small N and M, so that tests for MI-based estimators are available even in these cases. According to (5), imputations should be generated based on the posterior predictive distribution of $V_{mis}$. However, besides Bayesian methods, other methods like bootstrap techniques that approximate the Bayesian approach (‘Approximate Bayesian Bootstrap’) can be proper as well. Thus, to generate MIs, several imputation methods could be adopted. An important point when choosing an imputation method is to make sure that the variation in the imputations reflects the whole uncertainty of the predictions, which can even include uncertainty about the posterior predictive distribution. Otherwise standard errors will be systematically too small. The posterior distributions of θ and $V_{mis}$ in (5) are too general to be practical. Thus simplifying assumptions are necessary. Among these are independence assumptions as well as the assumption of ignorability of the missing mechanism.
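For a scalar parameter, the combining rules (6) and (7) amount to only a few lines of code. The following is a minimal sketch; the function and argument names are our own.

```python
import numpy as np

def pool_mi(estimates, variances):
    """Combine M complete-data estimates and their variance estimates for a
    scalar parameter according to the combining rules (6) and (7)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    theta_M = estimates.mean()               # (6): average of the M estimates
    W_M = variances.mean()                   # within-imputation variability
    B_M = estimates.var(ddof=1)              # between-imputation variability
    return theta_M, W_M + (1 + 1 / M) * B_M  # (7): total variance estimate

# Example with M = 5 imputed data sets:
# pool_mi([2.1, 2.3, 2.0, 2.2, 2.4], [0.10, 0.12, 0.11, 0.09, 0.10])
```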
The Ignorable Case

If the missing mechanism is ignorable, the $v_n$’s are independent and identically distributed, and the parameters of the different distributional models are distinct, then the posterior predictive distribution for $V_{mis}$ and the posterior distribution for θ simplify considerably, one important simplification being their independence from R. A simple yet instructive and important case arises when the pattern of missing values is monotone. Then $p(V_{mis} \mid V_{obs}, \theta)$ can be represented by a sequence of conditional distributions of the variables with missing values and $p(\theta \mid V_{obs})$ is a function of easy-to-model conditional distributions depending only on the values of observed variables and the prior distribution of the parameters involved. Furthermore,
posterior independence of the parameters follows from prior distinctness. To create multiple imputations one may proceed as follows: Estimate a regression model of the variable with the smallest fraction of missing values on the completely observed variables, using only those cases for which this dependent variable is observed (note that the regression relationship is not biased if the missing values are MAR). Assuming a prior distribution for all the parameters of this model, derive their posterior distribution based on the estimation results from the previous step and randomly select parameter values from this posterior distribution. For the cases with missing values in this dependent variable, use these values and the observed variables to predict values of the systematic part of the model and complete the individual predictions by adding noise according to the assumptions about the stochastic part of the regression model. Repeat these steps over all variables with missing values, choosing at each step as dependent variable the variable with the smallest fraction of missing values among the remaining variables and as covariates all completed and completely observed variables. Several imputations can be generated either by repeating these steps M times, or by generating M imputations for each variable in each round. As a specific case, consider the estimator for $\mu_2$ in the example from the section ‘Ignorability of the missing mechanism’ above. Implicitly, the ML estimator imputes the conditional mean of $v_{2,n}$ given $v_{1,n}$. However, if we were to impute the (estimated) conditional mean for each case with missing $v_{2,n}$ and treat them as independent observations in our analysis, the standard errors would be systematically too small. To generate proper imputations, we would thus first estimate a regression of $v_{2,n}$ on $v_{1,n}$ based on the completely observed cases. Assuming a uniform prior distribution for the unknown regression parameters and, if we also have to estimate the error variance, for $\log \sigma$, we can simulate values for the regression parameters from a
normal distribution with mean and variance equal to the estimates from the last step, and a value for the error variance as a function of a chi-square random variable. To generate one imputation for each case with observed $v_{1,n}$ but missing $v_{2,n}$, we could estimate its value on the regression line using $v_{1,n}$ and the simulated regression parameters and add a value for the error, randomly selected from a normal distribution with mean zero and variance given by the simulated variance. If the missing pattern is non-monotone, then the generation of imputations is more difficult in general, since it is often not obvious which conditional distributions are needed. In this case, Markov chain Monte Carlo (MCMC) methods, or variants thereof such as the Gibbs sampler or the data augmentation algorithm, may be adopted. The idea underlying MCMC methods is to approximately calculate complex integrals, which can be written as means, by simulation. This is exactly what is needed given (5). In the multiple imputation context, however, the complicated distributions under the integral in (5) are replaced by simpler conditional distributions, and values for θ and $V_{mis}$ are generated by cycling back and forth between updating θ and $V_{mis}$ conditional on the last generated value of the other variable. A crucial point of this strategy is that the distribution of the generated values converges to a stationary distribution, which in turn requires that a joint distribution exists. Since the latter is not guaranteed, one either assumes that a joint distribution exists without explicitly specifying this distribution, or the values $V_{mis}$ are generated under a specific joint distribution like the multivariate normal. Even if a stationary joint distribution exists, and although statistical and graphical tools are usually provided by software packages to decide whether the distribution of the generated values has converged, this decision is not easy, as convergence may seem to have been reached according to some statistics but not according to others that were not considered. Further, since
imputations should be independent draws from the posterior predictive distribution, there should be a sufficient number of iterations between the selection of generated values as imputations. Corresponding to these two approaches, there are two main strategies to generate multiple imputations. The first is to generate imputations assuming a specific joint conditional distribution; for example, assume that all variables with missing values conditional on the observed values follow a multivariate normal distribution (e.g., Schafer, 1997). Since in real data sets variables are often neither normally distributed nor continuous, it has been proposed to transform variables or to round imputations to plausible values. However, both ideas are problematic: transforming variables with missing values requires knowledge about the conditional, not the unconditional, distribution of the variables with missing data, and rounding may lead to biased estimators of final interest (Horton et al., 2003). The other strategy is to sequentially estimate univariate conditional models as described for monotone missing patterns and, once the data set is completed, to cycle over all variables with missing values, generating new imputations that replace the imputations generated in the last cycle by regressing each variable with missing data on all other completely observed or completed variables in the same sequence as before (e.g., Raghunathan et al., 2001; van Buuren, 2012). This strategy allows univariate regression models specific to each variable to be adopted. Hence, e.g., for a binary variable, a logit or probit model may be adopted. It is also relatively easy to respect restrictions with regard to possible values of variables or deterministic dependencies on other variables, like imputing the cost of public transport only if the question ‘Do you use the public transport system?’ has been answered or imputed as ‘yes’. The downside is that there may not always exist a joint conditional distribution of $V_{mis}$, the consequences of which are not completely known.
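One building block of both the monotone procedure and the sequential strategy just described is the proper draw of imputations for a single continuous variable from its normal linear regression on observed covariates. The sketch below follows the standard Bayesian draws for that model (uniform prior on the coefficients and on $\log \sigma$); the function and variable names are our own, and packages such as mice in R implement the full sequential algorithm.

```python
import numpy as np

def draw_regression_imputations(v1_obs, v2_obs, v1_mis, rng):
    """One proper imputation of v2 for cases where only v1 is observed, based on
    the normal linear regression of v2 on v1 estimated from complete cases."""
    n = len(v1_obs)
    X = np.column_stack([np.ones(n), v1_obs])
    beta_hat, *_ = np.linalg.lstsq(X, v2_obs, rcond=None)     # OLS on complete cases
    resid = v2_obs - X @ beta_hat
    df = n - X.shape[1]
    s2 = resid @ resid / df
    sigma2 = s2 * df / rng.chisquare(df)                      # draw the error variance
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(X.T @ X))
    X_mis = np.column_stack([np.ones(len(v1_mis)), v1_mis])
    # predicted systematic part plus random noise from the error distribution
    return X_mis @ beta + rng.normal(0.0, np.sqrt(sigma2), len(v1_mis))

# rng = np.random.default_rng(42); calling the function M times yields M imputations
```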
The Nonignorable Case

If missing values are nonignorable, then (5) can still be simplified by assuming independence of units or monotone missing patterns, but not to the same degree as in the ignorable case, since R does not cancel out. To generate imputations in this case, the selection or the pattern-mixture representation can be adopted. For both modeling approaches, however, assumptions need to be made which are hard to justify. Once the modeling assumptions are made, generating imputations proceeds basically as in the ignorable case.
DISCUSSION

Whether or not missing data techniques allow valid inferences depends on the missing mechanism, the model involved including its assumptions, and the adopted inferential concept. If missing values are MCAR, then CCA would be acceptable, although not optimal, because incompletely observed units are deleted. If the missing mechanism is ignorable, then the models of scientific interest may be estimated based on likelihood functions using all observed data values or, more generally, by MI-based likelihood or non-likelihood estimation methods. Still, there may be situations in which compensation for nonresponse is not necessary, e.g., if only the dependent variable of a regression model is affected by missing values and the pattern of observed and missing values depends only on covariates. If a technique that compensates for nonresponse is adopted, then assumptions are made which may not be met and may thus lead to invalid inferences. Therefore, it may sometimes be wise to do without a compensation technique, although in most cases some method to compensate for missing values is needed. If the missing mechanism is ignorable, then besides likelihood methods or MI techniques,
weighting methods have been proposed (e.g., Robins et al., 1994; Wooldridge, 2007). Weighting is traditionally used to compensate for unit-nonresponse, but due to its complexity in the missing data context it seems not to be suitable for missing value problems. Further, most software solutions assume that the missing mechanism is ignorable. If the missing mechanism is nonignorable, additional strong assumptions are necessary to allow for valid inferences. For this situation, software is available only for very specific cases. Likelihood-based estimation leads to consistent, asymptotically efficient, and normally distributed estimators. Hence, if all necessary assumptions are met, the likelihood approach is strongly recommended, and it works even under a frequentist evaluation if missing values are MAR and the parameters are distinct. From an applied point of view, another advantage may be that compensation for missing values and the model of interest is done under one model. However, the advantageous properties of ML estimators can only be claimed if all assumptions are met, and they are strong even without missing values. With missing values, however, additional assumptions are necessary. A simple generalization of the example at the end of ‘Ignorability of the missing mechanism’ above to the case of three jointly normally distributed variables, one being affected by missing values, shows again that missing values are implicitly imputed by (estimated) conditional means. The regression coefficients are functions of variances, covariances, and means of the random variables of the complete-data model. Hence if, e.g., the structure of the covariance matrix of the complete-data model is misspecified, then the imputation model will be misspecified and the estimator of the mean tends to be biased. On the other hand, if a variable that enters the model of scientific interest as a covariate is affected by missing values, then an imputation model has to be specified. In software packages applying likelihood analyses this is
often done implicitly and, if the variables to be imputed are continuous, based on a homoscedastic linear model. However, this is only correct if the assumption of joint normality of these variables is correct. Otherwise this relationship cannot be homoscedastic and linear (Spanos, 1995). Hence, although it seems as if all modeling tasks are done under one model if a likelihood approach is adopted, there may be additional and strong implicit assumptions involved. Thus, assumptions that have to be made explicit when MIs are generated are often implicit in ML analyses. This also implies that if several different models are estimated based on the same data set, the implicit assumptions used to compensate for missing values may be in conflict with each other. If inferences are based on multiply imputed data sets, there is always the risk that the analysis and the imputation model are in conflict. In many practically relevant cases this may lead to conservative inferences, i.e., confidence intervals that are too wide and rejection rates of null hypothesis tests that are too low, although there may be cases where variances are underestimated. Generally, however, MI-based inferences seem to be rather robust with respect to slight to medium misspecifications of the imputation model and the missing mechanism in practically relevant situations (Liu et al., 2000; Schafer and Graham, 2002) as long as likelihood methods are used to generate the imputations. For example, if the imputation model is misspecified, then there seems to be a ‘self-correcting’ property or tendency of MI-based inferences to mask small biases by overly large standard errors (Little and Rubin, 2002). However, more research is needed to identify situations in which the MI method as described in this chapter allows valid inferences and when it does not. On the other hand, MI is also applicable if a non-likelihood approach is chosen to estimate the model of scientific interest. From an applied point of view, available software that generates MIs according to the conditional approach allows the generation
of imputations of very different types, such as continuous, binary, ordered or unordered categorical, or count data, rather easily. Furthermore, restrictions on data ranges can often easily be implemented. Additional variables or higher polynomial or interaction terms not considered to be important in the model of scientific interest can easily be used to generate imputations, which is important for making the MAR assumption more plausible. Most software allows MIs to be generated in a very flexible way – but still not flexibly enough. For example, to generate imputations for count data, only the very restrictive Poisson model may be available, or to generate imputations for continuous data, only the linear homoscedastic regression model with normally distributed errors may be provided. Many software packages provide MI techniques which are not based on explicit prediction models, like predictive mean matching or nearest neighbor imputation. Instead of using an explicit prediction model, imputations are in fact observed values from units close to the unit with a missing value, which is sometimes seen as an advantage of this technique. However, the validity of the final inferences depends on the missing mechanism: if there are not enough possible donors in the neighborhood of the incompletely observed unit, i.e., the incompletely observed unit is located in a sparsely populated region, the final inferences tend to be invalid. Of course, MI-based inferences with finite M are less efficient than ML-based inferences. However, thanks to fast computers and several easy-to-use software packages, generating a large number of MIs is no longer very time-consuming even in large data sets, so that differences in efficiency can be made arbitrarily small.
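As an illustration of the predictive mean matching idea mentioned above, the following is a minimal sketch with our own function and parameter names; note that proper MI implementations (e.g., the pmm method in mice) additionally draw the regression parameters, a step this simplified version omits.

```python
import numpy as np

def pmm_impute(y, X, k=5, rng=None):
    """Simple predictive mean matching: fit a linear regression on the observed
    cases, then replace each missing y with an observed value drawn from the k
    donors whose predicted values are closest to that case's prediction."""
    rng = rng or np.random.default_rng()
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), X])
    obs = ~np.isnan(y)
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta
    y_imp = y.copy()
    for i in np.where(~obs)[0]:
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]   # k nearest observed cases
        y_imp[i] = y[obs][donors[rng.integers(len(donors))]]   # draw one donor value
    return y_imp
```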
CONCLUDING REMARKS

Several techniques are available to compensate for missing data. Among them, likelihood methods are certainly preferred over other techniques if all necessary assumptions are met. However, as increasingly more robust non-likelihood methods are used and likelihood solutions realized in software packages are available only for very specific situations, MI based on univariate conditional models seems to be more promising, as it allows imputations to be generated for very general situations with different types of variables affected by nonresponse and complicated relationships between them. However, even within this technique further research and development are necessary. For example, more knowledge is needed about the consequences of non-existing joint distributions of variables with missing values, given the adopted prediction models. Hence, future research should consider more flexible imputation models to make MI-based analyses more robust. Further, it would probably be worthwhile to explore techniques that allow the use of additional external information of different types to generate imputations, and there is still a lot of theoretical work to do, e.g., to justify the use of conditional imputation models in very general situations based on flexible imputation models. Since there is a lack of software solutions to facilitate sensitivity analyses, i.e., to analyze data sets assuming different missing mechanisms, it seems worthwhile to develop tools that give a feeling of how robust inferences are in the light of different plausible missing mechanisms. Such tools could easily include software solutions for nonignorable situations. An interesting alternative approach, which should receive more attention in the future, is a kind of ‘worst-case’ scenario with respect to the parameters of interest, similar to what is proposed by Horowitz et al. (2003).

RECOMMENDED READINGS

A detailed yet understandable treatment of MI, supported by many examples, is given in van
Buuren (2012). At the same time it is a practical guide for handling missing data with MICE, a package that generates imputations adopting the conditional approach, and is available with the free software R. Familiarity with basic statistical concepts and multivariate methods is assumed. Little and Rubin (2002) give a thorough treatment of the topic with main focus on maximum likelihood estimation and a detailed and understandable description of the EM algorithm. It includes many examples, but requires statistical knowledge at least on a social science master level. An extensive treatment of the justification underlying MI can be found in Rubin (2004). Various ways of generating MIs and many examples are given. However, familiarity with statistical approaches and concepts is presupposed. Wooldridge (2007) gives a flavor of weighting ideas in the context of missing values. Very technical, requires good econometric knowledge.
REFERENCES

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B, 39(1), 1–22.
Heckman, J.J. (1976). The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for such Models. Annals of Economic and Social Measurement, 5(4), 475–492.
Horowitz, J.L., Manski, C.F., Ponomareva, M. and Stoye, J. (2003). Computation of Bounds on Population Parameters When the Data Are Incomplete. Reliable Computing, 9, 419–440.
Horton, N.J., Lipsitz, S.R. and Parzen, M. (2003). A Potential for Bias When Rounding in Multiple Imputation. The American Statistician, 57(4), 229–232.
Jones, M.P. (1996). Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression. Journal of the American Statistical Association, 91(433), 222–230.
Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis With Missing Data (2nd edn). Hoboken, NJ: John Wiley & Sons.
Liu, M., Taylor, J.M.G. and Belin, T.R. (2000). Multiple Imputation and Posterior Simulation for Multivariate Missing Data in Longitudinal Studies. Biometrics, 56, 1157–1163.
Raghunathan, T.E., Lepkowski, J.M., Van Hoewyk, J. and Solenberger, P. (2001). A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models. Survey Methodology, 27(1), 85–95.
Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1994). Estimation of Regression Coefficients when Some Regressors are not Always Observed. Journal of the American Statistical Association, 89(427), 846–866.
Rubin, D.B. (1976). Inference and Missing Data. Biometrika, 63(3), 581–592.
Rubin, D.B. (1996). Multiple Imputation After 18+ Years. Journal of the American Statistical Association, 91(434), 473–489.
Rubin, D.B. (2004). Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ: John Wiley & Sons.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Schafer, J.L. and Graham, J.W. (2002). Missing Data: Our View of the State of the Art. Psychological Methods, 7(2), 147–177.
Spanos, A. (1995). On Normality and the Linear Regression Model. Econometric Reviews, 14(2), 195–203.
Spiess, M. and Kroh, M. (2010). A Selection Model for Panel Data: The Prospects of Green Party Support. Political Analysis, 18, 172–188.
van Buuren, S. (2012). Flexible Imputation of Missing Data. Boca Raton, FL: CRC Press.
Wooldridge, J.M. (2007). Inverse Probability Weighted Estimation for General Missing Data Problems. Journal of Econometrics, 141, 1281–1301.
38

Another Look at Survey Data Quality

Victor Thiessen† and Jörg Blasius
INTRODUCTION

The factors contributing to poor survey data quality are numerous, some of which have been detailed in studies of total survey error (see Groves, 2004, as well as Chapters 3 and 10 in this Handbook). Some ingredients, such as producing equivalent meanings in cross-national research conducted in different languages, have technical remedies, such as back translations. Others, such as the optimal number of response options (see Alwin, 1997 for an example), or what factors increase the response rate (see Groves et al., 2004), are better resolved by empirical research than theoretical arguments. In this chapter we focus on the three actors involved in every survey conducted with face-to-face or telephone interviews, namely (1) the respondents, (2) the interviewers and (3) supervisor(s) at the survey research organization (in short, SRO) conducting the survey. While a lively discussion on respondent effects on data quality, especially with
respect to response styles (Baumgartner and Steenkamp, 2001) and satisficing behaviour (Krosnick, 1991) is ongoing, at best only an embryonic discussion on the effects from the other two actors can be found (Blasius and Thiessen, 2012). Blasius and Thiessen (2012, 2013) showed, for example, that interviewer fabrication of at least parts of their interviews occurs more frequently than one might anticipate, and that sometimes even employees in SROs fabricate data in the simplest way, namely via copy-and-paste procedures. Blasius and Thiessen (2012) summarize such misbehaviour on the part of all three actors as ‘task simplification’ procedures, which result in an increase in random error and/or systematic bias. These are especially problematic when their magnitudes differ across the groups to be compared. For example, valid conclusions in comparative research presuppose similar levels and directions of errors between the groups and/or countries being compared. Evidence is mounting that substantive findings in any comparative research
are confounded by variation in response rates and other methodological factors such as response styles (Hamamura et al., 2008; Heath et al., 2009). Much is known about protocols and techniques to minimize survey errors in the planning and execution of surveys. Less is known about how to detect data deficiencies once the data have been collected. This chapter is intended to partially rectify this shortcoming. We begin by outlining our conceptualization of the sources and reasons for poor data quality. This is followed by three case studies that exemplify our approach. Although the examples are selected to illustrate problematic issues in cross-national comparative research (PISA 2009, ESS 2010, ISSP 2006), the proposed techniques can be applied to secondary analyses of any survey. Consequently, we do not discuss strategies to improve new survey questions or to predict their quality (compare especially the articles in Part II of this Handbook); rather, we focus exclusively on well-known data sets that are often used for secondary data analysis. We conclude with a discussion of possible implications of our findings for performing sound within- and cross-national comparative analysis.
OUR APPROACH TO DATA QUALITY

Our approach to data quality incorporates many elements of current approaches. The specific contribution in this chapter is an attempt to synthesize them into a general theory of data quality. Our starting premise is that the same factors influence data quality at each of the three levels of actors (respondents, interviewers and SROs) that are active in the research process. We contend that at each level of actors, the likelihood of simplifying the task is a function of three factors: task difficulty, ability to perform the task and commitment to the task. In the following section we provide a brief review of the literature relevant to our contention.
Respondents

With respect to respondents, the task of answering questionnaire items is similar to any other linguistic task that takes a question–answer form. Tourangeau and his associates (Tourangeau and Rasinski, 1988; Tourangeau et al., 2000) developed a comprehensive cognitive process model to conceptualize the dynamics involved in such linguistic tasks: respondents must understand the question, retrieve the relevant information, synthesize that information into a summary judgment, and select the most appropriate response option. Krosnick (1991) adopted this conceptualization, but also recognized that the survey context is one in which some respondents may not invest the time and mental effort to perform the task of responding to the survey questions in an optimal manner, with measurement error being introduced at each of the four steps. He adopted Simon’s (1957) concept of satisficing for such situations. Satisficing may take on diverse forms, such as endorsing the status quo, non-differentiation of responses to a battery of items, resorting to ‘don’t know’ responses, or selecting the first reasonable response offered. Krosnick (1991) argued that the extent of satisficing behaviour is determined by the difficulty of the task, the respondents’ ability and their motivation. He considered task difficulty to be a function of problematic question construction, respondent ability to be a function of their cognitive sophistication, and motivation to be a function of interest in, or knowledge of, the topic and length of the survey. We agree that abstract or lengthy questions that contain unfamiliar words, which are phrased negatively or refer to topics on which the respondent has little knowledge, contribute to task difficulty (Blasius and Thiessen, 2001). However, in our view task difficulty is also a function of exposure to surveys or similar tasks, such as filling out various forms. Further, surveys and forms are more common in some countries than others.
The task of responding to survey items will be more challenging for some respondents, depending on their mental competencies and their experience with surveys or filling in forms. For those who find the task difficult, two options present themselves. The first is simply to decline the task, which would manifest itself in both unit and item non-response. The empirical evidence consistently shows that response rates increase with respondents’ cognitive sophistication and its correlates, such as education (Harris, 2000; Coelho and Esteves, 2007). The second is to simplify the task to make it more manageable. Note that what all forms of satisficing have in common is that they reduce the task difficulty. Respondents with little motivation to participate can either refuse to participate, or they can employ satisficing tactics to reduce their time and energy commitment. Many empirical findings are at first glance congruent with the expectations derived from satisficing theory. Satisficing theory simply assumes that some respondents put in less effort to provide optimal responses than others. While this may sometimes be the case, recent research suggests that this association is due to task simplification rather than satisficing (Kaminska et al., 2010); i.e., ability rather than effort is the underlying dynamic. A second strand of research focuses specifically on the tendency of some respondents to favour particular response options, known as response style or response set (see Chapter 36 in this Handbook). A variety of distinct response styles have been proposed, such as acquiescence (Bachman and O’Malley, 1984) and extreme response style (Greenleaf, 1992), with Baumgartner and Steenkamp (2001) defining seven different response styles. Attempts to isolate, measure and control for multiple varieties of response styles have met with limited success (Van Rosmalen et al., 2010). Additionally, response styles appear to be content-specific (De Beuckelaer et al., 2010) and therefore not inherently a characteristic of the respondent. For these reasons
we consider all types of response styles to be manifestations of task simplification. That is, respondents who experience difficulty with the survey items (as well as those who wish to minimize their time and mental effort) find it simpler to favour just a subset of the available response options. What response styles have in common is that they reduce the variability of respondents to batteries of items; i.e., they minimize item differentiation, which is one component of response quality. A further component of response quality is the crudeness of the response. In its simplest manifestation, this takes the form of providing rounded answers to questions such as age and income. A more important manifestation, however, resides in the respondent’s ability to make fine distinctions between the content of related but conceptually distinct survey items. This ability is fundamentally connected with the cognitive ability of the respondents and the resultant complexity of their mental maps. It is our contention that the greater the cognitive ability of respondents, the more complex their mental maps are on any domain. Empirically this would manifest itself in the number of dimensions required to describe their views on a given topic. For example, on a topic such as attitudes towards immigrants, a single evaluative dimension (such as favour or oppose) might be sufficient to capture the views of those with a low level of cognitive complexity. At higher levels of cognitive complexity, respondents might simultaneously see the benefits of immigrants in certain respects but also the disadvantages of them in other respects. For such respondents, two or more dimensions might be required to capture the multidimensionality of their mental maps. This aspect of data quality has received little attention to date. In cross-national comparative research, we would expect the dimensionality of mental maps to be lower in countries with less literacy and less familiarity in filling out forms or survey interviews. A major source of systematic measurement error is due to socially desirable responses
(SDRs). Just like other human interactions, social surveys invoke impression-management dynamics on the part of the respondent. These typically manifest themselves as SDRs on normatively sensitive topics (Kuncel et al., 2005; Tourangeau et al., 2010). Some respondents may also edit their responses based on interviewer characteristics, such as their gender, ethnicity and age, especially on sensitive topics relevant to those characteristics. Specifically, although the evidence is not entirely consistent, respondents tend to provide what they infer to be more socially desirable responses, given the assumed characteristics of the interviewer (see Davis et al., 2010 for a review of the empirical findings).
Interviewers

It has long been recognized that interviewers affect data quality (see Winker et al., 2013 for an in-depth overview). For example, research has documented large interviewer differences in non-contact and non-response rates (O’Muirchearteagh and Campanelli, 1999; Hansen, 2006). It is our contention that the same factors that affect the quality of the data obtained from respondents also affect the quality of data produced by interviewers, namely the difficulty of their task, their skill in contacting respondents and eliciting their participation, and the effort they expend in discharging their tasks. The tasks of contacting, obtaining respondent participation, and completing the survey as prescribed can be difficult. To a large extent this is a function of the study design. Some of the same factors that create problems for the respondent (long questionnaire, complex questions) also make it difficult for the interviewer. Interviewers differ in their ability to elicit respondent interest and commitment to the survey task. Despite relatively large interviewer effects, little success has been achieved in explaining these effects from easily measurable interviewer characteristics such as their age, interviewing experience, gender or education
(Campanelli and O’Muirchearteagh, 1999; Pickery et al., 2001; Hansen, 2006). Further, interviewers themselves likely differ in their effort and commitment to produce quality data. O’Muirchearteagh and Campanelli (1999) found that although the characteristics of those who refuse to participate in a survey differ from those who cannot be contacted, interviewers who were successful in contacting respondents were also successful in getting them to participate. They conclude that the types of skills required for the one task are identical to those necessary for the other task. Alternatively, some interviewers may put more effort into optimizing the contact and response rates. Thus, lack of ability or lack of effort on the part of the interviewer results in lower quality of data. Duration of interviews can be considered to be an overall measure of effort. This would account for Olson and Bilgen’s (2011) finding that interviewers reporting the shortest average interview time produced the highest proportion of acquiescent responses. It is our contention that the overall prevalence of task-simplifying short-cuts depends on two additional factors: the fear of getting caught (which increases with the extent of SRO monitoring) and the commitment to norms of professional/ethical conduct. All these factors may tempt some interviewers to take a variety of short-cuts that reduce the quality of the data (Gwartney, 2013). In its extreme, interviewer short-cuts can take the form of fabricating parts or all of a questionnaire. The consensus is that the incidence of fabrication is quite low (Gwartney, 2013; Slavec and Vehovar, 2013) with estimates typically ranging between 1 and 10 percent (Menold et al., 2013). Nevertheless, in some surveys a high proportion of faked or partially faked interviews have been found (Blasius and Thiessen, 2012, 2013; Bredl et al., 2012); for some countries a very strong interviewer effect accounting for up to 55 percent of explained variance (Blasius and Thiessen, 2016) has been documented.
When fabricating data, interviewers can be expected to balance their reduced time and effort with the fear of their fraudulence being discovered by the SRO. To reduce their time, fakers tend to avoid responses that lead to follow-up questions, for example. To minimize detection, skilled fakers impute what they consider plausible responses to their ‘respondents’ based to some extent on the stereotypes they have of respondents of a certain age, gender, ethnicity or socio-economic status. They also tend to generate less apparently contradictory responses than actual respondents do. These and other similar tactics produce multivariate distributions for the faked interviews that differ from those of genuine respondents (Blasius and Thiessen, 2012; Bredl et al., 2012; Menold et al., 2013).
Survey Research Organizations

What differentiates SROs is their ability to conduct a survey and the amount of resources they commit to producing data of high quality. The expertise required to conduct quality surveys is a function of the experience of SRO staff in executing the various components of surveys, such as sampling design and training of field workers. With respect to commitment of resources, we note only that SROs, like other organizations, are always under pressure to stay within budget and within the contracted time for delivering the product. This creates a tension between taking short-cuts to keep within the contracted time and estimated budget and their reputation as an organization that produces quality surveys. While the heads of the organizations are unlikely to give the order to copy-and-paste data, there will be pressure on the employees to keep within the budget and to perform the task in time. It must be kept in mind that a strong relation between the thoroughness of SROs’ monitoring procedures and interviewers’ data fabrication can be expected: the lower the control mechanisms are, the more easily interviewers can (partly) fabricate
their interviews without being detected. Some task simplifications may reduce the quality of the data, but within bounds acceptable to the client, such as back checks of only 10 percent of respondents; still other short-cuts violate professional norms which, if discovered, would severely jeopardize the reputation of the SRO, such as fabricating entire interviews through copy-and-paste procedures. The incidence of such violations is likely to be rather rare.
CASE STUDIES

Respondent Task Simplification

To illustrate respondent-level data quality issues, we present information on the 13 ‘learning strategies’ items taken from the student questionnaire in PISA 2009 (p. 17, question 27; https://pisa2009.acer.edu.au/downloads/PISA09_Student_questionnaire.pdf), for example ‘When I study, I try to memorize everything that is covered in the text’. The four response categories ranged from ‘almost never’, ‘sometimes’ and ‘often’ to ‘almost always’. Item wording and other technical information is contained in the Technical Report (OECD, 2012: 293–294). Since our main purpose here is to show that respondent behaviours vary systematically with their cognitive attainments, we divided the respondents into reading achievement quintiles. For reasons of space, we provide detailed information for just two countries, namely Australia and the USA. Table 38.1 provides the results. The ‘Percent missing’ column shows the percentage of respondents who left at least one of the learning strategy items unanswered (listwise deletion). In both countries, for the lowest reading achievement quintile a sizable percentage of 15-year-olds failed to provide all 13 responses (11 percent in Australia, 7 percent in the USA). In both countries, this percentage declines nearly monotonically by reading achievement quintile.
Table 38.1 Student response behaviours by reading achievement quintile, Australia and USA

Achievement    Percent              Eigenvalues              Cronbach's
quintile       missing    IRD      D1      D2      D3        α         Mean    SD

(A) Australia
1              11.01      0.41     7.36    0.98    0.69      0.94      2.21    0.67
2               2.18      0.48     6.52    1.20    0.83      0.92      2.34    0.62
3               1.36      0.53     5.55    1.41    0.93      0.89      2.40    0.56
4               0.99      0.57     5.15    1.58    0.96      0.87      2.52    0.53
5               0.38      0.61     4.25    1.82    1.22      0.83      2.65    0.47
η²              0.050     0.128                                        0.067

(B) USA
1               6.66      0.41     7.67    0.91    0.65      0.94      2.40    0.73
2               1.78      0.48     6.11    1.29    0.88      0.91      2.46    0.63
3               1.87      0.52     6.07    1.43    0.89      0.90      2.46    0.63
4               1.30      0.57     5.03    1.67    1.03      0.87      2.55    0.56
5               0.52      0.61     4.28    1.94    1.23      0.83      2.56    0.51
η²              0.023     0.102                                        0.008
This supports our hypothesis that the lower the cognitive ability, the greater the likelihood that the task will be declined. We assume that students in the lowest reading achievement group find the task of responding to the questionnaire items to be more difficult than those in higher achievement groups. The index of item response differentiation (IRD) calculates the extent to which respondents utilized the available response options, expressed as a proportion of the maximum possible variability of choice. The value ranges between zero and one; the higher it is, the better the respondents can differentiate between the items (for the formula see Blasius and Thiessen, 2012: 145). The response behaviours in this respect are remarkably similar in Australia and the USA, increasing monotonically from 0.41 in the bottom quintile to 0.61 in the top quintile. These are not trivial relationships, since between 10 and 13 percent of the variance in item discrimination is accounted for by reading achievement. This pattern supports our second contention, namely that the more difficult the task, the lower the quality of the performance.
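The exact IRD formula is given in Blasius and Thiessen (2012: 145); the sketch below is only a rough illustration of the underlying idea, namely the spread of a respondent's answers expressed as a proportion of the maximum attainable spread, and is our own simplification rather than the published index.

```python
import numpy as np

def response_differentiation(responses, k):
    """Illustrative differentiation index for one respondent: the standard
    deviation of the answers to a battery of k-point items, divided by the
    largest standard deviation attainable (half the answers at 1, half at k)."""
    responses = np.asarray(responses, dtype=float)
    max_sd = (k - 1) / 2          # SD when answers are split evenly between 1 and k
    return responses.std() / max_sd

# Identical answers give 0, maximally spread answers give 1:
# response_differentiation([2, 2, 2, 2], k=4)  -> 0.0
# response_differentiation([1, 4, 1, 4], k=4)  -> 1.0
```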
In each country, we performed a principal components analysis (PCA) separately for each quintile. The eigenvalues for the first three (unrotated) dimensions are presented in the columns labelled D1 to D3. If we rely on the common Kaiser’s eigenvalue criterion, then we would have to conclude that in both countries, only one dimension should be considered substantively interpretable among students in the lowest achievement quintile. In contrast, three dimensions or factors would be judged interpretable for the top achievement group. As a result, using Kaiser’s criterion, we should be able to distinguish between three latent learning strategies among the top reading achievers, but only between two for most of the middle quintile achievers, and we would be unable to create meaningful latent sub-scales among the lowest reading achievers. This pattern conforms to our expectation that the higher the quality of responses, the finer the distinctions that survey research methods can detect. Note that PISA considered these items to capture three dimensions (memorization, elaboration and control strategies) (OECD, 2012: 293–294) in all participating countries among all students.
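A minimal sketch of the quintile-wise PCA behind the eigenvalue columns of Table 38.1 is given below; the data frame, item and group column names are placeholders, since the PISA data are not reproduced here.

```python
import numpy as np

def eigenvalues_by_group(df, item_cols, group_col, n_dims=3):
    """For each group (e.g., reading achievement quintile), return the leading
    eigenvalues of the items' correlation matrix (unrotated PCA) and the number
    of eigenvalues above 1 (Kaiser's criterion)."""
    results = {}
    for g, sub in df.dropna(subset=item_cols).groupby(group_col):
        corr = np.corrcoef(sub[item_cols].to_numpy(dtype=float), rowvar=False)
        eig = np.sort(np.linalg.eigvalsh(corr))[::-1]
        results[g] = {"eigenvalues": eig[:n_dims].round(2),
                      "kaiser_dims": int((eig > 1).sum())}
    return results

# e.g. eigenvalues_by_group(pisa, [f"item{i}" for i in range(1, 14)], "quintile")
```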
By Cattell’s scree test, we would conclude that only one dimension should be interpreted in any of the achievement groups in either country, since there is a clear ‘elbow’ between the first and second dimension. This justifies creating a single summary scale for the 13 items. Cronbach’s alpha for the different achievement groups shows what many researchers would find to be a puzzling result: the lower the reading achievement, the higher the reliability of the scale. In both countries, the reliability ranges from a high of 0.94 in the lowest achievement quintile to 0.83 in the top group. In other words, the reliability of the responses decreases with cognitive attainment. This pattern is the exact opposite of what the methodological literature suggests, namely that higher quality responses are obtained from respondents with greater cognitive skills. Stated differently, the responses to the learning strategies items are more predictable for respondents with lower reading achievement. Greater internal consistency of responses can be due to either greater substantive coherence in the subject matter, or to greater methodologically induced repetitive responses, such as, in the extreme, providing identical responses to all (or almost all) 13 items. The latter dynamic seems more likely, especially since we have already shown that response differentiation between items is positively associated with reading achievement. The column of mean endorsement of learning strategy items is presented to illustrate that cross-national comparisons can be confounded by data quality issues. In Australia, a moderate association exists between reading achievement and endorsement of the learning strategy items, accounting for just under 7 percent of the variance, while only a trivial association is found in the USA, accounting for less than 1 percent of the variance. However, since item non-response is also associated with reading achievement, and this association is stronger in Australia than in the USA, the net result is that, under the common practice of listwise deletion of cases with missing values, the national mean
endorsement in Australia would be overestimated relative to the USA estimate. The final column shows that the within-quintile variability of mean endorsements decreases with reading achievement quintile in both countries. From measurement theory (on the assumption of only random measurement error) one would expect the reverse, since Cronbach’s alpha documented that the amount of random measurement error increased with reading achievement. So why would there be greater heterogeneity in learning strategies among the lower-achievement groups? The possibility that we favour concerns the greater likelihood of response style variability among those with less cognitive skills. That is, we argue that as task difficulty increases, respondents are more likely to simplify the task by typically selecting from only a subset of available responses. Different respondents happen to choose different subsets of responses, which increases the variance of the measured construct.
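The Cronbach's α values reported in Table 38.1 follow from the standard formula; a minimal sketch, applied group by group (again with placeholder data and column names):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x k_items) array without missing
    values: k/(k-1) * (1 - sum of item variances / variance of the sum score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Applied per achievement quintile:
# {q: round(cronbach_alpha(sub[item_cols].to_numpy()), 2)
#  for q, sub in pisa.dropna(subset=item_cols).groupby("quintile")}
```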
Interviewer Task Simplification

To assess the effects of interviewers simplifying their task, we use data from the European Social Survey 2010 (ESS 2010). An easy way for interviewers to simplify their task is to ask only one or two questions from a set of items and then to ‘generalize’ the given answer(s). Scaling these data via PCA or any other scaling technique will result in interviewer-specific mean values on the latent scale (using PCA, these would be the factor scores). To exclude sample point effects and variability caused by respondents’ own simplification behaviour, we restrict our analyses to interviewers who conducted 15 or more interviews. As the scaling technique we used categorical principal component analysis (CatPCA), since it is more appropriate for handling categorical data than simple PCA (Gower and Blasius, 2005). CatPCA can be understood as PCA for ordered categorical data; like PCA, it
provides factor scores, factor loadings and eigenvalues (Gifi, 1990). The underlying premise is that in countries where perceived corruption is high, interviewers are also more inclined to fabricate parts of the data than in countries with low perceived corruption. We chose Denmark and Switzerland as examples of countries with low perceived corruption and Greece and Russia as examples with high perceived corruption (see Transparency International, http://www.transparency.org/research/cpi/overview, accessed 4 August, 2014). If the interviewers produce interviewer-specific mean factor scores and if these are correlated with the perceived corruption in the country, the standard deviation of the means will be larger in countries with high perceived corruption, i.e., Greece and Russia. To illustrate the interviewers’ task simplification, we use two sets of questions: first, seven 11-point items on personal trust in institutions ([country’s] parliament, the legal system, the police, politicians, political parties, the European Parliament, the United Nations; English Questionnaire, questions B4 to B10, p. 7; https://www.europeansocialsurvey.org/docs/round5/fieldwork/source/ESS5_source_main_questionnaire.pdf) and second, six 11-point items on satisfaction (with one’s life as a whole, the current state of the country’s economy, how well their government is doing its job, the extent of democracy in the country, the state of education, and the health service; English Questionnaire, questions B24 to B29, pp. 10–11). For both item sets we performed CatPCA restricted to a one-dimensional solution (for details on this method, see Gifi, 1990; De Leeuw, 2006) over all respondents from all countries, saving the factor scores (applying listwise deletion). All seven trust variables load highly on the first dimension (factor loadings between 0.725 and 0.881, not shown), extracting 67.1 percent of the total variation; the higher the factor score, the greater the trust in institutions. A similar result was obtained for the six satisfaction variables, with all of
them having high loadings (between 0.611 and 0.820, not shown) on the first dimension; the higher the factor score, the greater the satisfaction. Table 38.2 shows, for both CatPCA factors and the four selected countries, the country mean, the standard deviation, and the distribution of the factor scores across interviewers with 15 or more interviews (country-specific, centred to zero). It shows that in Denmark and Switzerland the mean values for trust in institutions and satisfaction are clearly above average, in Russia these values are close to the overall mean, and in Greece they are clearly below average. These values are the ones usually used for country-specific interpretations, showing that people in Switzerland and Denmark have greater trust in their institutions and are more satisfied than those in Greece. However, the countries also differ considerably in the standard deviations across the individual interviewers. Take as an example trust in institutions: approximately 68 percent of the interviewers with 15 or more interviews in Denmark produce mean values between 0.63 and 0.99; in Switzerland the respective interval runs between 0.42 and 0.80. In contrast, approximately 68 percent of the interviewers in Greece produce mean values between –1.41 and –0.31, while in Russia the respective interval is even larger, ranging from –0.90 to 0.76 (compare also the percentages of interviewers in the centred intervals around the country-specific mean). Since sample point effects should be similar in all countries, the only plausible interpretation is that the interviewers in Russia and Greece have rather clear but very different assumptions about the Russian and the Greek population – in Russia, some of them are very positive and some very negative, resulting in a country-specific mean value close to the general mean; in Greece, some of them are close to the overall mean and some extremely negative, resulting in a low trust value for Greece. The same interpretation applies to the items on satisfaction (see Table 38.2).
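The interviewer-level comparison behind Table 38.2 can be approximated as follows. CatPCA itself is available, for example, as the CATPCA procedure in SPSS; the sketch below uses an ordinary first principal component score as a stand-in and hypothetical column names, so it illustrates the logic rather than reproducing the published analysis.

```python
import numpy as np

def interviewer_score_summary(df, item_cols, id_col="interviewer_id", min_n=15):
    """Score respondents on the first principal component of an item battery,
    then summarize the mean score per interviewer with 15+ interviews:
    returns the overall mean, the SD of the interviewer means, and the means."""
    sub = df.dropna(subset=item_cols).copy()
    X = sub[item_cols].to_numpy(dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize the items
    _, eigvec = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    sub["score"] = X @ eigvec[:, -1]                    # first-component score
    counts = sub[id_col].value_counts()
    kept = sub[sub[id_col].isin(counts[counts >= min_n].index)]
    means = kept.groupby(id_col)["score"].mean()
    return means.mean(), means.std(), means
```

Applied country by country, an unusually large standard deviation of the interviewer means is the warning sign discussed above.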
Table 38.2 Interviewer effects in ESS 2010 (interviewers with 15 or more interviews)

                            Denmark    Switzerland    Greece    Russia
Interviewers                N = 59     N = 37         N = 94    N = 28

Trust in institutions
  Mean                       0.81       0.61          –0.86     –0.07
  Standard dev.              0.18       0.19           0.55      0.83
  Low to –1.0                0          0              4.3       3.6
  –0.999 to –0.75            0          0              1.1      14.3
  –0.749 to –0.50            0          0             16.0      10.7
  –0.499 to –0.25           10.2       13.5            9.6      21.4
  –0.249 to 0.249           83.1       81.1           33.0      14.3
  0.25 to 0.499              6.8        5.4           21.3       0
  0.50 to 0.749              0          0             10.6      17.9
  0.75 to 0.999              0          0              0         3.6
  1.0 to high                0          0              4.3      14.3

Satisfaction
  Mean                       0.78       0.96          –1.13     –0.08
  Standard dev.              0.16       0.15           0.57      0.62
  Low to –1.0                0          0              8.5       0
  –0.999 to –0.75            0          0              2.1      10.7
  –0.749 to –0.50            0          0              6.4       7.1
  –0.499 to –0.25           10.2        5.4           12.8      32.1
  –0.249 to 0.249           88.1       91.9           37.2      17.9
  0.25 to 0.499              1.7        2.7           16.0       3.6
  0.50 to 0.749              0          0             10.6      10.7
  0.75 to 0.999              0          0              2.1      14.3
  1.0 to high                0          0              4.3       3.6
Data Fabrication and SRO Task Simplification

Our method of detecting possible instances of fraudulent behaviour on the part of either research institutes or interviewers relies on the fact that identical response patterns (IRPs) across any array of input variables will always produce identical latent scores in PCA or multiple correspondence analysis (MCA). If there is a significant number of IRPs when a large number of uncorrelated variables are entered into the analysis, this is a strong indicator that some interviews have been copied, either by the interviewers or by the employees of the institutes. Not all instances of IRPs are necessarily due to data fabrication, since there are several reasons why some instances of IRPs might occur. One possibility is that, by pure coincidence,
two or more respondents have identical views on the items subjected to the analysis. A second possibility is that errors are made in the data entry process. Specifically, the information from a given respondent might inadvertently be entered twice. This could account for instances where there are precisely two IRPs, since it is highly improbable that a given respondent’s information would be entered inadvertently more than two times. A third possibility is that some interviewers fabricated large parts of their interviews, or even worse, the entire interview. However, the possibility of the falsifications being discovered (through back checks or automatic record comparison programs) makes complete fabrications unlikely. From a task simplification point of view, a rational approach would be for the interviewer to contact the respondent, ask the demographic
questions (which back checks could easily verify), and then duplicate all or most of the responses to the time-consuming opinion questions (see Blasius and Friedrichs, 2012: 53, for the guide on ‘how to successfully fake an interview’). This would result in IRPs for the attitudinal/opinion items combined with variable demographic responses. A final possibility is that the employees of an SRO fabricated some data in order to meet their quota in the contracted time and within the estimated budget. This could be done by taking batches of completed interviews and duplicating them by some variant of a copy-and-paste procedure, perhaps randomly changing some fields to avoid detection. The most likely manifestation of this would be numerous instances of precisely two IRPs, with the demographic information being more likely to be variable than the opinion information. It is evident that the smaller the number of input variables, the fewer the response options, and the higher the association between the items subjected to the analyses, the greater the likelihood that IRPs are genuine opinions. On the other hand, since data fabricators are likely to change a few responses just to avoid detection, the larger the number of input variables, the less likely it is that instances of data fabrication will be detected, since a change of just one number in the input data produces distinct factor scores. To detect such kinds of fabrication, we typically start with a rather large (30 or more) number of input variables, with many of them being uncorrelated. This means that we will almost certainly fail to identify some instances of data fabrication should they exist, but with the advantage that there should be extremely few instances of multiple respondents genuinely having identical IRPs. As a second step, we then examine in greater detail the location and distribution of the IRPs to help decide between the various possible reasons for having IRPs. To illustrate task simplification via data fabrication, we use the ISSP 2006 data.1
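Before turning to the ISSP case, the screening idea can be sketched in a few lines. The published procedure works with MCA factor scores; because identical factor scores on a large set of variables correspond to identical response patterns, counting the raw patterns directly, as below, flags the same duplicates. The column names are hypothetical.

```python
import pandas as pd

def identical_response_patterns(df, item_cols, country_col="country"):
    """Count, per country, the response patterns across the selected items that
    occur more than once; clusters of identical patterns are candidates for
    copied (fabricated) interviews."""
    counts = (df.groupby([country_col] + item_cols, dropna=False)
                .size()
                .reset_index(name="n"))
    duplicated = counts[counts["n"] > 1]
    return duplicated.groupby(country_col)["n"].apply(
        lambda s: sorted(s.tolist(), reverse=True))

# identical_response_patterns(issp, item_cols) would return, per country, the
# sizes of the duplicate clusters (compare the counts reported for South Africa below).
```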
Our screening procedures (Blasius and Thiessen, 2012) suggested that in two countries – Latvia and South Africa – some of the data might have been fabricated. In this section we describe the procedures we used to detect anomalous data in South Africa and the resulting evidence. The ISSP data is generally considered to be of high quality (Scheuch, 2000), and the Study Monitoring Report provides detailed technical and methodological information on each participating country's implementation of the survey. According to the Study Monitoring Report (http://www.gesis.org/en/issp/issp-modules-profiles/role-of-government/2006/), the South African data appear to have undergone reasonably stringent quality control checks. Face-to-face interviews were conducted with 2,939 respondents from an eligible N of 3,384. Three call-backs were employed to increase the response rate, 10 percent of the interviews were supervised, and 15 percent were back-checked. Some measure of coding reliability was used and 100 percent of the keying of data was verified. Data checks/edits on filters, logic, or consistency were used. These procedures are consistent with professional survey standards. Nevertheless, these assertions were not critically examined, and we conclude that task simplification through data fabrication did occur. We began by selecting an array of 37 adjacent variables (question 7a to question 18, see English Questionnaire, ISSP 2006; http://www.gesis.org/issp/modules/issp-modules-by-topic/role-of-government/2006/). Included are ten 4-point variables on the government's responsibilities (running from 'definitely should be' to 'definitely should not be', with a fifth option being 'can't choose'), six 5-point variables on how successful their government is in certain areas (ranging from 'very successful' to 'very unsuccessful' with a 'can't choose' sixth category), three 4-point variables concerning what the government is expected to do against terrorist acts, personal interest in politics, six 5-point
variables on political efficacy and trust, and two questions on how many politicians and how many public officials engage in corrupt practices. No demographic or other factual information is included. To include the various missing options in the analyses, we used MCA as a scaling technique based on the 37 items. Considering all categories (including the 'can't choose' option but without 'no answer'), there are 6.69 × 10²⁶ possible response patterns. If we were to assume that the variables have equal marginal distributions and the responses to the items are independent of each other, the probability of any identical response pattern would be 1.49 × 10⁻²⁷. Although neither assumption holds, the value is far removed from usual values of statistical significance, such as 10⁻² or 10⁻³. Therefore, it would be most unlikely to find several respondents with precisely the same pattern of responses. Since no substantive interpretation of the resulting factors is necessary, there is no
need to perform MCA separately for each country; the resulting factor scores are used solely to identify multiple occurrences of identical response patterns. Once the MCA factor scores have been obtained and saved, a simple bar graph of the counts of the factor scores is constructed separately for each country included in the analysis. If there are no instances of the same factor score occurring more than once, then a perfectly rectangular bar graph would be produced (all outcomes occurred precisely one time). Response sequences that occur more than once appear as spikes in the bar graph. Figure 38.1 shows the bar graph for South Africa. It shows numerous instances of identical factor scores within the 2,939 cases: eight instances of exactly two identical scores, two instances of three identical scores, two of four, two of five, one of six, one of 13 cases, and one of 14 cases. The latter contains only missing cases from variable q7a to variable q18; including them in
Figure 38.1 South Africa: Bar graph of factor scores.
the data set is not meaningful but they can be easily excluded (and in most computations they will be excluded by definition). Except for Latvia (in which we found 13 instances of two identical scores, four instances of three identical scores and one instance of four identical instances out of 1,069 cases), and a few countries that also provided cases with missing values only in the given set of variables (for example, Australia: 7, Canada: 7, Chile: 2), most of the countries had no instances of IRPs, the others had one or two (Israel had five instances of two identical scores, Russia had six instances of two identical scores and one of four identical scores). It might happen that a very few questionnaires are typed in twice by mistake, but we doubt that this is the case for South Africa and Latvia. To shed more light on the possible reasons for the IRPs, we focussed on the 13 cases in South Africa with identical factor scores (–0.073951) by listing their respondent IDs. From this it emerged that the IDs for these 13 cases ranged from 4658 to 5192 (the overall range of IDs was from 8 to 7042), covering a range of 260 cases (out of 2,939 cases in the entire data set); excluding the highest and the lowest ID number, the remaining 11 identical cases ranged from 4801 to 5051, covering a range of 121 cases. Clearly, these IRPs were not randomly distributed across the whole range of IDs, decreasing the probability that they represent genuine convergence of opinion. Further, the factor score is close to the mean, i.e., there are many different response patterns providing values close to the factor score of –0.073951, which decreases the likelihood of similar response patterns just by coincidence. As a next step we listed all cases in the ID range of 4801 and 5051 with MCA factor scores between –0.06 and –0.08 to examine all cases with IRPs or near IRPs within that range. Table 38.3 presents the results. It shows that in addition to the 11 cases with identical IRPs (with shadowed ID), there were 20 instances of near-IRPs, and two additional cases that follow the last ID in the list
of the selected IRPs. Among the near-IRPs, the differences were minor (usually only one number differed and at most four, compared to the selected IRP), and occurred mainly in the first part of question 7. Finally, the demographic data clearly differ for these cases, as shown in the final three columns of Table 38.3 (S = sex, A = age, E = years of education). None of the characteristics of these 33 cases are consistent with either a data entry error interpretation or genuine respondent identical viewpoints. If we assume that interviewers were given interview forms with consecutive blocks of pre-coded respondent IDs, then the patterns for the 33 cases could be compatible with an interviewer fabrication interpretation. However, it is most unlikely that a single interviewer completed all 260 interviews ranging from ID 4658 to ID 5192, and if s/ he did, why use a single interview to copy it 33 times instead of using 33 different interviews, copying each of them once? From the copy-and-paste and simplification points of view s/he would not save time doing it either way, but the risk of detection would increase (the employee of the SRO who received the interviews would just have to scan the pile). If different interviewers fabricated the data, why would they work together and use the same master interview for copy-and-paste? Since all the interviews are from the region ‘Free State’ but with different community sizes (not shown in Table 38.3), it is more plausible that a pile of interviews from this region was missing and an employee of the SRO who had access to the electronic data file copied a case several times while making minor changes. To summarize our findings with respect to IRPs, we note that (a) they were located adjacent or nearly adjacent to each other at a few points when the data set is sorted by respondent ID, with a large majority of them contained within a narrow range of 260 cases, (b) in all instances, the demographic data for the IRPs varied, and (c) for the majority of the items the modal response across all respondents differed from that of the IRPs.
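The follow-up checks just described — do the duplicated cases sit in a narrow band of respondent IDs, and do their demographics nevertheless vary? — can be scripted in the same spirit. The helper below is again only a sketch with hypothetical column names (`ID`, `sex`, `age`, `educ_years`, corresponding to the S, A and E columns of Table 38.3); it summarizes one group of suspect cases, such as the 13 cases sharing a single factor score.

```python
import pandas as pd

def describe_suspect_group(df: pd.DataFrame, suspect_ids: list[int]) -> dict:
    """Summarize how a group of suspected copy-and-paste cases is distributed.

    A narrow ID range combined with varying demographics is the pattern the
    chapter attributes to fabrication inside the survey organization.
    """
    grp = df[df["ID"].isin(suspect_ids)]
    id_span = int(grp["ID"].max() - grp["ID"].min())
    return {
        "n_cases": len(grp),
        "id_min": int(grp["ID"].min()),
        "id_max": int(grp["ID"].max()),
        "id_span": id_span,                                  # e.g. a narrow band of IDs
        "share_of_file": id_span / int(df["ID"].max() - df["ID"].min()),
        "distinct_sex": grp["sex"].nunique(),                # demographics should vary
        "distinct_age": grp["age"].nunique(),
        "distinct_educ": grp["educ_years"].nunique(),
    }

# Hypothetical usage with the IDs of one spike from irp_report():
# print(describe_suspect_group(issp, suspect_ids=[4801, 4810, 4812]))
```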
Table 38.3 ISSP 2006 – partial listing of South African duplicated data (responses to Q7–Q18 for the 33 cases with respondent IDs 4801–5054; S = sex, A = age, E = years of education)
These features lead us to reject all possible explanations for the IRPs except that of data fabrication on the part of an employee of the SRO who had access to the electronic file.
CONCLUSION Our first case study documented that the quality of data obtained from respondents is systematically connected to their cognitive sophistication, especially their literacy skills. Consistent with the empirical literature, we found that the lower the cognitive ability of a respondent the higher the item non-response. Also consistent with the literature, we documented that response differentiation increases with cognitive ability. Surprisingly, however, we found that the lower the cognitive ability of respondents, the higher the reliability of their responses. Since reliability is one of the key criteria for data quality, we have the paradox that respondents traditionally considered to produce inferior data appear to produce superior data. Our contention is that this paradox is the result of structurally different cognitive maps among respondents of differing cognitive abilities. Specifically, respondents with high cognitive ability are able to make fine distinctions between different survey questions. This would account for the fact that we found that one dimension is sufficient to capture variation in learning strategies among respondents in the lowest quintile of reading achievement, whereas three dimensions are necessary for those in the top quintile. The implication of this finding for survey analysts is that the dimensionality of a set of items is not simply a function of the content of the items; it is also a function of the cognitive sophistication of the respondents. Respondents with less cognitive ability perceive greater similarity between items than do those with greater cognitive ability. As a result, measures of reliability such as Cronbach’s alpha are higher among those with less cognitive ability.
The three patterns documented in the first case study are also evident in all the other countries that participated in the PISA project and in the previous years in which PISA was conducted, and with cognitive abilities measured by mathematics achievement as well as reading achievement (Thiessen and Blasius, 2008; Blasius and Thiessen, 2012). Such consistency of findings constitutes powerful evidence of the strong links between the cognitive abilities of respondents and the quality of their survey responses. Clearly, data quality is affected by many factors that are not discussed in this paper. Cultural differences in the meanings of some questions are particularly likely in questions concerning religious matters or the appropriate division of household labour. With respect to the European Social Survey 2010 data, we found that interviewers have a pronounced impact on the substantive solution. With respect to trust in institutions and satisfaction with several aspects of life, we could show that in some countries the attitudes towards these items differ strongly by interviewer; in countries such as Greece and Russia, the interviewer-specific mean values cannot be explained away with sample-point effects, as could be argued for Switzerland and Denmark. We note in conclusion that much attention has been given to calculating correct standard errors and to developing statistical techniques that attempt to incorporate and correct for specific methodological inadequacies such as response styles. While these efforts are to be applauded, we feel that the survey methodology community has expended insufficient effort on screening data for possible manifestations of poor data quality and on what to do with particular forms of faulty data. In the presence of fabricated data, all calculations of standard errors are incorrect, even if for no reason other than that the number of cases is artefactually inflated. Perhaps we should add a further ingredient to the determinants of data quality, namely researcher practices. Researchers too have a vested interest in limiting the time and energy they expend in
analyzing their data. Hence they cut corners and simply take their data as given; i.e., they fail to screen their data to detect problematic interviewers or respondents, and sometimes even a problematic SRO. Partly this is also due to the lack of current guidelines on what to do when problematic cases have been found. Should one delete such cases altogether, or should one keep them in the analysis but with reduced weights? With respect to the ESS 2010, it is reasonable to conclude that in some countries a large minority of interviewers fabricated large parts of the data. Here it is probably best to exclude the respective countries from any comparative research. With respect to the ISSP 2006 data, we showed that in South Africa employees of the responsible SRO probably fabricated parts of their data via copy-and-paste. The respective cases are easily identified and one could delete them. At the same time, how can one have faith in the remaining data when parts are known to have been fabricated, leaving open the question of whether the other parts have simply been fabricated more carefully, for example by (randomly) changing some answers? Based on our analyses, we cannot give a more satisfying answer than that the research community needs to start a conversation on these topics.
NOTE

1 Version 1.0, downloaded 1 October 2009. Retrieved from http://www.gesis.org/issp/modules/issp-modules-by-topic/role-of-government/2006/ [accessed on 15 June 2016].

† Victor Thiessen passed away suddenly and unexpectedly at the age of 74 on the evening of February 6th, 2016, in the company of his wife Barbara and very close friends.
RECOMMENDED READINGS Krosnick (1991) for introducing the concept of satisficing and for providing empirical evidence of its correlates.
Tourangeau and associates (2000) for providing a model of cognitive functioning and its implication for respondent behaviours.
REFERENCES Alwin, D. F. (1997). Feeling thermometers versus 7-point scales: Which are better? Sociological Methods & Research, 25(3), 318–340. Bachman, G. G., and O’Malley, P. M. (1984). Yea-saying, nay-saying, and going to extremes: Black-white differences in response styles. Public Opinion Quarterly, 48(2), 491–509. Baumgartner, H., and Steenkamp, J.-B. E. M. (2001). Response styles in marketing research: A cross-national investigation. Journal of Marketing Research, 38(1), 143–156. Blasius, J., and Friedrichs, J. (2012). Faked interviews. In: S. Salzborn, E. Davidov and J. Reinecke (eds), Methods, Theories, and Empirical Applications in the Social Sciences. Festschrift für Peter Schmidt (pp. 49–56). Wiesbaden: Springer. Blasius, J., and Thiessen, V. (2001). Methodological artifacts in measures of political efficacy and trust: A multiple correspondence analysis. Political Analysis, 9(1), 1–20. Blasius, J., and Thiessen, V. (2012). Assessing the Quality of Survey Data. London: SAGE. Blasius, J., and Thiessen, V. (2013). Detecting poorly conducted interviews. In: P. Winker, N. Menold and R. Porst (eds), Interviewers´ Deviations in Surveys – Impact, Reasons, Detection and Prevention (pp. 67–88). Frankfurt: Peter Lang. Blasius, J., and Thiessen, V. (2016). Perceived corruption, trust and interviewer behavior in 26 European countries. Unpublished Manuscript. Bredl, S., Winker, P., and Kötschau, K. (2012). A statistical approach to detect interviewer falsification of survey data. Survey Methodology, 38(1), 1–10. Campanelli, P., and O’Muirchearteagh, C. (1999). Interviewers, interviewer continuity, and panel survey non-response. Quality & Quantity, 33(1), 59–76.
Coelho, P. S., and Esteves, S. P. (2007). The choice between a five-point and a ten-point scale in the framework of customer satisfaction measurement. International Journal of Market Research, 49(3), 313–339. Davis, R. E., Couper, M. P., Janz, N. K., Caldwell, C. H., and Resnicow, K. (2010). Interviewer effects in public health surveys. Health Education Research, 25(1), 14–26. De Beuckelaer, A., Weijters, B., and Rutten, A. (2010). Using ad hoc measures for response styles: A cautionary note. Quality & Quantity, 44(4), 761–775. De Leeuw, J. (2006). Nonlinear principal component analysis and related techniques. In: M. Greenacre and J. Blasius (eds), Multiple Correspondence Analysis and Related Methods (pp. 107–133). Boca Raton, FL: Chapman & Hall. Gifi, A. (1990). Nonlinear Multivariate Analysis. Chichester: Wiley. Gower, J., and Blasius, J. (2005). Multivariate prediction with nonlinear principal components analysis: Theory. Quality and Quantity, 39, 359–372. Greenleaf, E. A. (1992). Measuring extreme response style. Public Opinion Quarterly, 56, 328–351. Groves, R. M. (2004). Survey Errors and Survey Costs. Hoboken, NJ: John Wiley & Sons. Groves, R. M., Presser, S., and Dipko, S. (2004). The role of topic interest in survey participation decisions. Public Opinion Quarterly, 68(1), 2–31. Gwartney, P. A. (2013). Mischief versus mistakes: Motivating interviewers to not deviate. In: P. Winker, N. Menold and R. Porst (eds), Interviewers’ Deviations in Surveys: Impact, Reasons, Detection and Prevention (pp. 195–215). Frankfurt: Peter Lang. Hamamura, T., Heine, S. J., and Paulhus, D. L. (2008). Cultural differences in response styles: The role of dialectical thinking. Personality and Individual Differences, 44(4), 932–942. Hansen, K. M. (2006). The effects of incentives, interview length, and interviewer characteristics on response rates in a CATI-study. International Journal of Public Opinion Research, 19(1), 112–121. Harris, S. (2000). Who are the disappearing youth? An analysis of non-respondents to
the School Leavers Follow-up Survey, 1995. Education Quarterly Review, 6(4), 33–40. Heath, A., Martin, J., and Spreckelsen, T. (2009). Cross-national comparability of survey attitude measures. International Journal of Public Opinion Research, 21(3), 293–315. Kaminska, O., McCutcheon, A. L., and Billiet, J. (2010). Satisficing among reluctant respondents in a cross-national context. Public Opinion Quarterly, 74(5), 956–984. Krosnick, J. A. (1991). Response strategies for coping with the cognitive demands of attitude measures in surveys. Applied Cognitive Psychology, 5(3), 213–236. Kuncel, N. R., Credé, M., and Thomas, L. L. (2005). The validity of self-reported grade point averages, class ranks, and test scores: A meta-analysis and review of the literature. Review of Educational Research, 75(1), 63–82. Menold, N., Winker, P., Storfinger, N., and Kemper, C. J. (2013). A method for ex-post identification of falsifications in survey data. In: P. Winker, N. Menold and R. Porst (eds), Interviewers’ Deviations in Surveys: Impact, Reasons, Detection and Prevention (pp. 25– 47). Frankfurt: Peter Lang. O’Muirchearteagh, C., and Campanelli, P. (1999). A multilevel exploration of the role of interviewers in survey non-response. Journal of the Royal Statistical Society, 162(3), 437–446. OECD (2012). PISA 2009 Technical Report. PISA, OECD Publishing. Retrieved from http:// dx.doi.org/10.1787/9789264167872-en [accessed on 15 June 2016]. Olson, K., and Bilgen, I. (2011). The role of interviewer experience on acquiescence. Public Opinion Quarterly, 75(1), 99–114. Pickery, J., Loosveldt, G., and Carton, A. (2001). The effects of interviewer and respondent characteristics on response behavior in panel surveys. Sociological Methods & Research, 29(4), 509–523. Scheuch, E. K. (2000). The use of ISSP for comparative research. ZUMA-Nachrichten, 46, 64–74. Simon, H.A. (1957). Models of Man. New York: Wiley. Slavec, A., and Vehovar, V. (2013). Detecting interviewer’s deviant behavior in the Slovenian National Readership Survey. In: P. Winker,
N. Menold and R. Porst (eds), Interviewers’ Deviations in Surveys: Impact, Reasons, Detection and Prevention (pp. 131–144). Frankfurt: Peter Lang. Thiessen, V., and Blasius, J. (2008). Mathematics achievement and mathematics learning strategies: Cognitive competencies and construct differentiation. International Journal of Educational Research, 47(4), 362–371. Tourangeau, R., Groves, R. M., and Redline, C. D. (2010). Sensitive topics and reluctant respondents: Demonstrating a link between nonresponse bias and measurement error. Public Opinion Quarterly, 74(3), 413–432. Tourangeau, R., and Rasinski, K. A. (1988). Cognitive processes underlying context
effects in attitude measurement. Psychological Bulletin, 103(3), 299–314. Tourangeau, R., Rips, L. J., and Rasinski, K. A. (2000). The Psychology of Survey Response. New York: Cambridge University Press. Van Rosmalen, J., Van Herk, H., and Groenen, P. J. F. (2010). Identifying response styles: A latent-class bilinear multinomial logit model. Journal of Marketing Research, 47, 157–172. Winker, P., Menold, N., and Porst, R. (eds) (2013). Interviewers’ Deviations in Surveys: Impact, Reasons, Detection and Prevention. Frankfurt: Peter Lang.
39
Assessment of Cross-Cultural Comparability
Jan Cieciuch, Eldad Davidov, Peter Schmidt and René Algesheimer
INTRODUCTION – MEASUREMENT INVARIANCE AS A PRECONDITION FOR CROSS-CULTURAL COMPARABILITY

Comparing cultures seems to be easier nowadays than ever before. Researchers are provided with data collected in many different large cross-cultural surveys such as the European Social Survey (ESS), the International Social Survey Program (ISSP), the European Values Study (EVS) or the World Values Survey (WVS), to name just a few. These datasets contain measurements of various variables that are considered to tap important social scientific dimensions of research, making them attractive for cross-cultural comparisons and investigations. These variables include but are not limited to basic human values, well-being, attitudes, beliefs, behaviors, life conditions, trust, sociodemographic attributes and many others. However, the widespread availability of such data does not automatically mean that meaningful cross-cultural comparisons of the measured variables are possible. It could well be the case that these variables, although measured in a similar way across cultures, are not comparable. Indeed, recent years have witnessed not only a growing number of cross-cultural datasets but also a growing awareness of the problems and dangers connected with cross-cultural comparisons, with several researchers suggesting that measurement invariance has to be established as a precondition for any meaningful comparisons to take place (Byrne et al., 2009; Chen, 2008). Measurement invariance is a property of an instrument (usually a questionnaire) intended to measure a given latent construct. Measurement invariance affirms that a questionnaire does, indeed, measure the same construct in the same way across various groups (Chen, 2008; Davidov et al., 2014; Meredith, 1993; Millsap, 2011; Steenkamp and Baumgartner, 1998; Van de Vijver and Poortinga, 1997; Vandenberg, 2002; Vandenberg and Lance, 2000).
Measurement invariance is a necessary precondition for a meaningful comparison of data across groups. It neither suggests that the results obtained across the various groups are identical nor that there are no differences between the groups regarding the measured construct. Instead, it implies that the measurement operates similarly across the various groups to be compared and, therefore, that the results of the measurement can be meaningfully compared and interpreted across groups. If invariance is not established, interpretations of comparisons between groups are problematic (Byrne et al., 2009; Chen, 2008; Meredith, 1993). There are two main dangers in using noninvariant measurement instruments in comparative studies. The first danger is that different constructs may be measured across the various groups even though the same measurement instrument is used. Horn and McArdle (1992) metaphorically speak of this case as a comparison between apples and oranges, and Chen (2008) describes it as a comparison between chopsticks and forks. For example, if attitudes toward democracy constitute a different concept in two countries (e.g. Iran and Switzerland), then using the same questions to measure attitudes toward democracy in the two countries may yield measures of a different concept (Ariely and Davidov, 2010). The second danger is that despite measuring the same construct, the instrument may operate differently under different conditions or across different groups, which also disrupts the ability to make meaningful comparisons. For example, if individuals in a certain culture react more sensitively or carefully to the mere measurement effect of a certain question item (a 'stimulus') than individuals in another country, then researchers might receive responses that are not comparable to each other across the two countries, even if the concept has the same meaning in both cultures. Indeed, social desirability response bias, yes-saying tendency or extreme response bias may be more present in one culture than in another and might affect
the scores (Johnson and van de Vijver, 2003; Karp and Brockington, 2005). If a measurement is noninvariant, it is possible that differences that are found between groups do not correspond to real differences or, conversely, that real differences are obscured by the noninvariant measurements. Recent years have witnessed an increasing body of literature that assessed measurement invariance across countries, language groups or time for various concepts such as human values (Davidov and de Beuckelaer, 2010; Davidov et al., 2008), nationhood and national identity (Davidov, 2009; Sarrasin et al., 2013) or attitudes toward immigration (Meuleman and Billiet, 2012), just to name a few.
HOW CAN MEASUREMENT INVARIANCE BE ESTABLISHED? Since measurement invariance is of great importance for comparative research, it is recommended to test it empirically (Byrne et al., 2009; Chen, 2008; Vandenberg and Lance, 2000). There are many procedures for measurement invariance testing across groups. The first and most widely used method is multigroup confirmatory factor analysis (MGCFA; Jöreskog, 1971). This method involves setting cross-group constraints on parameters and comparing more restricted models with less restricted ones (Byrne et al., 1989; Meredith, 1993; Steenkamp and Baumgartner, 1998; Vandenberg and Lance, 2000). Recently, several procedures for measurement invariance testing have been developed in the MGCFA framework. Focusing on this, we first describe the general scheme of measurement invariance testing in MGCFA, and we then provide a brief overview of the procedures with respect to the six main decisions that have to be made while testing for measurement invariance. Let us assume that there is a questionnaire Q consisting of six items, X1–X6. The items
are observed variables that serve as reflective indicators to measure two latent variables, η1 and η2 (Bollen, 2007). The items X1, X2 and X3 are indicators measuring latent variable η1, and items X4, X5 and X6 are indicators measuring the latent variable η2. The factor loadings of the indicators are denoted as λ, the indicator intercepts as τ and the measurement errors as δ. We collect data using questionnaire Q in two groups, A and B. These groups may be different countries, cultures, regions within a country or time points, or alternatively different conditions of data collection (e.g. online and paper and pencil). Figure 39.1 presents the measurement model in the two groups (for a discussion about methods on how to identify the model, see Little et al., 2006). Various procedures for constraining parameters across groups have been developed for testing measurement invariance. The choice of the procedure depends on our responses to each of the following six questions:

1. What level of measurement invariance is required?
2. Which type of data is used?
3. Which rules would be applied for model evaluation?
4. Are cross-loadings and/or error correlations allowed?
5. Is full measurement invariance necessary, or is it sufficient to establish partial measurement invariance?
6. Is exact measurement invariance necessary, or is it sufficient to establish approximate measurement invariance?
All of these issues will be addressed in more detail below. Following this, we will present our recommendations to researchers who wish to conduct cross-group comparisons.
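Before turning to the six issues, it may help to make the notation of the measurement model concrete. The following toy simulation, written under assumed parameter values and not part of the chapter itself, generates data for three indicators of one latent variable in two groups according to x = τ + λη + δ; group B is given a deviant loading on one item, so the simulated data would violate metric invariance for that item.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_group(n, loadings, intercepts, error_sd, factor_mean=0.0, factor_sd=1.0):
    """Simulate one group from a one-factor reflective model: x = tau + lambda * eta + delta."""
    eta = rng.normal(factor_mean, factor_sd, size=n)             # latent variable eta
    delta = rng.normal(0.0, error_sd, size=(n, len(loadings)))   # measurement errors
    return intercepts + np.outer(eta, loadings) + delta          # observed items

# Assumed (hypothetical) parameters for three indicators of eta1 in groups A and B.
lam_A = np.array([0.8, 0.7, 0.6])
tau_A = np.array([3.0, 3.1, 2.9])
lam_B = np.array([0.8, 0.7, 0.3])   # third loading differs -> metric invariance would fail
tau_B = tau_A.copy()

X_A = simulate_group(500, lam_A, tau_A, error_sd=0.5)
X_B = simulate_group(500, lam_B, tau_B, error_sd=0.5)
print(X_A.mean(axis=0), X_B.mean(axis=0))   # observed item means per group
```

In practice, of course, the parameters are not assumed but estimated with an SEM program; the simulation only shows what the symbols in Figure 39.1 refer to.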
The First Issue – What Level of Measurement Invariance is Required?

One can differentiate between several levels of measurement invariance. Each level is defined by the parameters constrained to be equal across groups. The first and lowest level of measurement invariance is called configural invariance (Horn and McArdle, 1992; Horn et al., 1983; Meredith, 1993;
Figure 39.1 A model for testing for measurement invariance of two latent variables measured by three indicators across two groups with continuous data. The two factors are allowed to covary. (Note: The subscripts denote the group and the superscripts denote the item number.)
Vandenberg and Lance, 2000). Following Thurstone’s (1947) principle of simple structures, the pattern of nonzero (called salient) and zero (called nonsalient) factor loadings defines the structure of the measurement. Configural invariance then states that the same configuration of salient and nonsalient factors holds across different groups. It requires all groups to have the same latent variables loading onto the same items. The model parameters are estimated for all groups simultaneously. In CFAs, nonsalient loadings are constrained to zero. We can then argue that configural invariance holds if (a) the model with zero loadings on non-hypothesized factor connections fits the data well across groups, (b) all salient factor loadings are substantially and significantly different from zero, and (c) factor correlations are significantly below unity (discriminant validity). The fit of the model being tested provides the baseline against which the models testing for higher levels of measurement invariance are analyzed. Configural invariance does not guarantee that we are measuring the same construct in the same way in all groups; further analyses at subsequent levels are necessary to confirm this. The second level is called metric measurement invariance (Horn and McArdle, 1992; Steenkamp and Baumgartner, 1998; Vandenberg and Lance, 2000) or weak measurement invariance (Marsh et al., 2010; Meredith, 1993). Metric invariance is tested by constraining the factor loadings between the observed items and the latent variable to be equal across the compared groups (Vandenberg and Lance, 2000). Therefore, metric invariance across groups A and B of the model presented in Figure 39.1 is established when the following conditions are satisfied:
$\lambda^1_A = \lambda^1_B$ and $\lambda^2_A = \lambda^2_B$ and … and $\lambda^6_A = \lambda^6_B$
Note: The subscripts denote the group and the superscripts denote the item number.
If metric measurement invariance is established, one may assume that people in both groups interpret the respective items in the same way. The measured construct in both groups has the same meaning, although there is still a lack of certainty as to whether the construct is being measured in the same way across both groups. It is worth noting that some scholars suggest that this test is only a necessary but not a sufficient condition for guaranteeing that the measured construct has the same meaning in both groups. They suggest additionally conducting cognitive interviews to guarantee that people in both groups understand the construct similarly (Latcheva, 2011). The third and even higher level of measurement invariance is called scalar measurement invariance (Vandenberg and Lance, 2000) or strong measurement invariance (Marsh et al., 2010; Meredith, 1993). Scalar measurement invariance is tested by constraining not only the factor loadings (as in the case of testing for metric measurement invariance) but also the indicator intercepts to be equal across groups (Vandenberg and Lance, 2000). The underlying assumption of scalar invariance is that differences in the means of the observed items are related to mean differences of the underlying constructs. Full scalar measurement invariance across groups A and B of the model presented in Figure 39.1 is supported when the following conditions are satisfied:
$\lambda^1_A = \lambda^1_B$ and $\lambda^2_A = \lambda^2_B$ and … and $\lambda^6_A = \lambda^6_B$

and

$\tau^1_A = \tau^1_B$ and $\tau^2_A = \tau^2_B$ and … and $\tau^6_A = \tau^6_B$
Note: The subscripts denote the group and the superscripts denote the item number. If scalar invariance is established, one may assume that respondents not only understand the concept and its question items similarly, but that they also use the scale in the same
way in each group to respond to the questions; thus, it implies that the same construct (metric measurement invariance) is measured in the same way (scalar measurement invariance). Meredith (1993) and Marsh et al. (2009, 2010) also proposed a fourth level, called strict measurement invariance. Strict measurement invariance is tested by imposing additional constraints to those imposed in the scalar measurement invariance model: Not only factor loadings and indicator intercepts but also error variances of the items are constrained to be equal across groups. Strict measurement invariance across groups A and B of the model presented in Figure 39.1 is supported when the following conditions are satisfied:
$\lambda^1_A = \lambda^1_B$ and $\lambda^2_A = \lambda^2_B$ and … and $\lambda^6_A = \lambda^6_B$

and

$\tau^1_A = \tau^1_B$ and $\tau^2_A = \tau^2_B$ and … and $\tau^6_A = \tau^6_B$

and

$\delta^1_A = \delta^1_B$ and $\delta^2_A = \delta^2_B$ and … and $\delta^6_A = \delta^6_B$

Note: The subscripts denote the group and the superscripts denote the item number.
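Because the hierarchy configural → metric → scalar → strict is defined by cross-group equality statements on λ, τ and δ, it can be written down as a simple check on parameter estimates. The sketch below does exactly that with a naive tolerance on point estimates; it is not a substitute for the formal nested-model comparisons discussed later in the chapter, and the parameter values are hypothetical output from whatever SEM program is used.

```python
import numpy as np

def invariance_level(params_a: dict, params_b: dict, tol: float = 0.05) -> str:
    """Return the highest invariance level whose equality constraints hold
    (within `tol`) for two groups' point estimates.

    Each params dict holds arrays under the keys 'loadings', 'intercepts'
    and 'error_vars' -- one value per item.
    """
    def equal(key):
        return np.allclose(params_a[key], params_b[key], atol=tol)

    level = "configural"                 # same pattern of salient loadings assumed
    if equal("loadings"):
        level = "metric"
        if equal("intercepts"):
            level = "scalar"
            if equal("error_vars"):
                level = "strict"
    return level

# Hypothetical estimates for the six items of Figure 39.1:
group_A = {"loadings":   [0.8, 0.7, 0.6, 0.9, 0.8, 0.7],
           "intercepts": [3.0, 3.1, 2.9, 2.5, 2.6, 2.4],
           "error_vars": [0.3, 0.4, 0.5, 0.3, 0.4, 0.5]}
group_B = {"loadings":   [0.8, 0.7, 0.6, 0.9, 0.8, 0.7],
           "intercepts": [3.4, 3.1, 2.9, 2.5, 2.6, 2.4],   # one intercept differs
           "error_vars": [0.3, 0.4, 0.5, 0.3, 0.4, 0.5]}
print(invariance_level(group_A, group_B))   # -> 'metric'
```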
Recommendations

The decision regarding which level of measurement invariance should be tested and established depends on the research goals. Configural invariance enables only an overall investigation of the similarity of the measurement models of the groups under study. In contrast, metric measurement invariance allows, in addition, the comparison of relationships between the constructs (covariances and unstandardized regression coefficients) across groups. For example, it allows comparing the covariance between η1 and η2 across groups A and B and drawing conclusions meaningfully, or comparing how an external variable V is associated with η1 or η2 in the two groups. Scalar invariance is needed to conduct meaningful comparisons of means across groups; therefore, if a researcher is interested in whether the latent mean of variable η1 is larger in group A or group B, scalar invariance should be established beforehand. Although, according to Meredith (1993) and Marsh et al. (2010), a comparison of observed variables (scale scores) requires establishing strict measurement invariance, the literature shows the general convention that scalar invariance is a sufficient precondition for testing such scale score differences.

The Second Issue – Which Type of Data is Used?

Likert scales are frequently used in cross-cultural research. This type of scale consists of several points – for example, five points ranging from 1 (e.g. I completely disagree, not like me at all) to 5 (e.g. I completely agree, very much like me). The model presented in Figure 39.1 assumes that the data are continuous. The literature has demonstrated that this assumption is often justified, in particular when the dataset is relatively large or when the responses display a normal distribution (e.g. Curran et al., 1996; Flora and Curran, 2004; Urban and Mayerl, 2014: 140–145). However, strictly speaking, Likert-scale data are ordered categorical rather than continuous. Lubke and Muthén (2004), therefore, proposed treating Likert scales as ordinal data. In continuous CFA and MGCFA, the observed variables are directly linked to the unobserved latent variable. In ordinal CFA and MGCFA, observed categorical variables are not directly linked to the latent variable of interest. Rather, a set of latent continuous variables is introduced between the observed and the latent
Figure 39.2 A model for testing for measurement invariance of two latent variables measured by three indicators across two groups with ordinal data. The two factors are allowed to covary. (Note: The subscripts denote the group and the superscripts denote the item number.)
variables. Muthén (1983) termed these variables the latent response distribution. Figure 39.2 presents the model for measurement invariance testing with ordinal data across two groups. The variable X* (the latent response distribution) is partitioned into categories of the observed variable X. The partitioning is performed using threshold parameters (ν in Figure 39.2; with four answer categories there are three thresholds). The observed categorical value for X changes when a threshold is exceeded on the latent response variable X*. Latent response distributions load onto the latent variable of interest; therefore, factor loadings and intercepts are parameters of the latent response distributions rather than of the observed variables. Instead of merely estimating loadings and intercepts (as in continuous CFA), threshold parameters are additionally estimated in ordinal CFA and MGCFA. The main difference between continuous and ordinal MGCFA is that, strictly speaking, it is not possible to distinguish between metric and scalar invariance in ordinal MGCFA, whereas it is possible to do so in continuous MGCFA. The slope of the item response curve in continuous MGCFA is determined only by the factor loadings, whereas in ordinal MGCFA the slope is determined jointly by the factor loadings of the latent response distribution, the intercept of the latent response distribution, and the thresholds. Simultaneously constraining all of these parameters to be equal across groups implies metric and scalar invariance at the same time. Measurement invariance (i.e. metric and scalar invariance) across the groups A and B of the model presented in Figure 39.2 is supported when the following conditions are satisfied:

$\lambda^1_A = \lambda^1_B$ and $\lambda^2_A = \lambda^2_B$ and … and $\lambda^6_A = \lambda^6_B$

and

$\tau^1_A = \tau^1_B$ and $\tau^2_A = \tau^2_B$ and … and $\tau^6_A = \tau^6_B$

and

$\nu^{11}_A = \nu^{11}_B$ and $\nu^{12}_A = \nu^{12}_B$ and … and $\nu^{63}_A = \nu^{63}_B$

Note: The subscripts denote the group. The first superscript denotes the item number and the second superscript denotes the threshold parameter number.

The identification of the model requires a separate estimation of intercepts and thresholds. Thus, either intercepts or thresholds can be
constrained to equality across groups. According to Muthén and Asparouhov (2002), intercepts in continuous MGCFA are analogous to thresholds in ordinal MGCFA: both reflect the way respondents use the response scale. Thus, releasing the thresholds in ordinal MGCFA while constraining all intercepts to zero is preferable (Muthén and Asparouhov, 2002).
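The role of the thresholds ν can be illustrated by generating ordinal responses from a continuous latent response variable X*. The following toy example uses assumed values (three thresholds yielding four observed categories, as in Figure 39.2) and shows that shifting the thresholds changes the observed category frequencies even when the latent variable and the loading are untouched — which is why threshold (non)invariance matters for ordinal items.

```python
import numpy as np

rng = np.random.default_rng(1)

def categorize(x_star: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Map a continuous latent response X* onto ordered categories 1..k+1
    using k thresholds (nu), as in the ordinal CFA model."""
    return np.searchsorted(thresholds, x_star) + 1

eta = rng.normal(size=1000)                              # latent trait
x_star = 0.7 * eta + rng.normal(scale=0.5, size=1000)    # latent response distribution X*
nu = np.array([-1.0, 0.0, 1.0])                          # assumed thresholds -> 4 categories
x_observed = categorize(x_star, nu)

# Shifting the thresholds changes the observed category frequencies even though
# eta and the loading are unchanged.
print(np.bincount(x_observed)[1:])
print(np.bincount(categorize(x_star, nu + 0.3))[1:])
```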
Recommendations Ordinal CFA seems to be more appropriate for Likert data. However, the choice of the analysis procedure also depends on the type of scale used. For example, Rhemtulla et al. (2012) recommend using ordinal (also called categorical) CFA with fewer than five categories, while with five categories and more they suggest using continuous CFA. In general it has been shown that, especially with two or three categories and when the distributions were extremely skewed, the bias of the continuous approach might be considerable (Finney and Di Stefano, 2011). This issue is, however, still far from being resolved. We thus recommend researchers to run both types of analyses (continuous and ordinal MGCFA) to test for invariance and briefly report the results of each when using Likert data. If the two procedures produce similar findings, they could be regarded as a robustness test for the violations of assumptions, which are made when using continuous MGCFA. Furthermore, reporting results of both procedures may be useful for applied researchers who seek comparisons between the two approaches and help to accumulate information for the future meta-analysis of findings based on models that assume or do not assume that ordinal categorical data are continuous. If researchers collect the data themselves, it would be advantageous to use Internet surveys with real continuous scales, or at least increase the number of scaling points to five or more (Rhemtulla et al., 2012).
The Third Issue – Which Rules to Apply for Model Evaluation?

Measurement invariance is supported when the model fits the data well. There are two
main approaches to evaluate the fit of the model. The first one, which to date is the one used most often in the literature, relies on the global fit indices. The correctness of the model is assessed based on given coefficients that describe the entire model (Chen, 2007). The other approach, presented recently by Saris et al. (2009), criticizes the use of global fit measures and focuses on testing local misspecifications. Both approaches will be discussed briefly below. Historically, the first global fit measure was the LR χ² (likelihood ratio chi-square). According to Jöreskog (1969), χ² aided in freeing CFA of many subjective decisions that had to be made in exploratory factor analysis. As stated by Hu and Bentler (1995), the subjective judgment was replaced by an objective test of the extent to which two matrices differed from each other: the observed and the hypothesized one in the model. Unfortunately, χ² is not free of problems; these problems have been recognized and widely discussed in the literature (e.g. Bentler and Bonett, 1980; Hu and Bentler, 1998; Hu et al., 1992; Jöreskog, 1993; Kaplan, 1990; West et al., 2015). The first is that the models tested in CFA are always only approximations of reality; therefore, using χ² to test the hypothesis that the observed covariance matrix equals the hypothesized matrix is an unnecessarily strong expectation (Jöreskog, 1978). Another problem is that χ² is sensitive to various characteristics of the tested model that are irrelevant to the possible misspecification. The best-known problem is its sensitivity to sample size. As Bentler and Bonett (1980) stated, 'in large samples, virtually any model tends to be rejected as inadequate' (p. 588). To resolve these problems, various other model fit indices were developed (Hu and Bentler, 1995; Marsh et al., 2005). The fit indices measure the deviation of the analyzed model from baseline models instead of the deviation of the hypothesized covariance matrix from the observed matrix. In the literature, cut-off criteria for the model fit
coefficients were proposed. The root mean square error of approximation (RMSEA), the comparative fit index (CFI) and the standardized root mean square residual (SRMR) are among those used most often. When the RMSEA value is 0.08 or lower, the model can be assumed to perform reasonably well (Hu and Bentler, 1999; Marsh et al., 2004). When the RMSEA value is lower than 0.05, the model can be assumed to perform very well (Brown, 2006; Browne and Cudeck, 1993). CFI values between 0.90 and 0.95 or larger indicate an acceptable model fit (Hu and Bentler, 1999). When the SRMR value is lower than 0.08, the model can be assumed to perform reasonably well, and when it is lower than 0.05, the model can be assumed to perform very well (Hu and Bentler, 1999; Marsh et al., 2004). It is worth noting that the cut-off criteria mentioned above are sometimes criticized because these criteria rely on simulation studies with partly nonrealistic assumptions such as factor loadings ranging between 0.70 and 0.80 (Hu and Bentler, 1999), whereas researchers are very often confronted with considerably lower factor loadings. In addition, some authors argue that the cut-off criteria recommended by Hu and Bentler are too liberal (see, e.g., Marsh et al., 2004). Simulations which use more realistic conditions recommend somewhat different cut-off criteria for the global fit measures. For example, Beauducel and Wittmann (2005) propose to rely on RMSEA and SRMR rather than on the CFI while evaluating measurement models. Also, Kenny and McCoach (2003) suggest that researchers rely rather on RMSEA than on CFI when the models are very complex. To assess whether a given level of measurement invariance is established, global fit measures are compared between the more and the less constrained models. If the change in model fit is smaller than the criteria proposed in the literature, measurement invariance for that level is established. Chen (2007) proposed the use of cut-off criteria in deciding whether the fit of a more restrictive model
has considerably deteriorated. According to Chen (2007), if the sample size is larger than 300, metric noninvariance is indicated by a change in CFI larger than 0.01 when supplemented by a change in RMSEA larger than 0.015 or a change in SRMR larger than 0.03 compared with the configural invariance model. Regarding scalar invariance, noninvariance is evidenced by a change in CFI larger than 0.01 when supplemented by a change in RMSEA larger than 0.015 or a change in SRMR larger than 0.01 compared with the metric invariance model (Chen, 2007; for a discussion about stricter cut-off criteria, see Meade et al., 2008). Sass et al. (2014) have recently provided cut-off criteria for measurement invariance testing using ordinal MGCFA. Although fit indices are often used to make decisions about the acceptance or rejection of the model and establishing a given level of measurement invariance, some researchers (Marsh et al., 2004; Saris et al., 2009) have criticized the use of fit indices with fixed cut-off values as if they were test statistics. Saris et al. (2009) argued that because both χ² and other fit indices are affected not only by the misspecification size but also by other characteristics of the models (e.g. sample size, model size, number of indicators, size of parameters), assessing the global fit of the entire model may lead to a wrong decision. As an alternative, Saris et al. (2009) proposed investigating specific misspecifications in the model. A correct model should not contain any relevant misspecifications, whereas each serious misspecification is an indication of the necessity either to reject or to modify the model. Thus, it is possible that according to the common global fit criteria one would accept a model that in reality contains serious misspecifications and should be rejected or modified. It is also possible that the global fit measures will recommend the rejection or modification of a model that does not contain any relevant misspecification and may in reality be accepted. According to Saris et al. (1987), estimates of the misspecifications can be obtained
using a combination of the expected parameter change (EPC) and the modification index (MI), which are usually provided by structural equation modeling (SEM) software packages. The MI (Sörbom, 1989) provides information on the minimal decrease in the χ² of a model when a given constraint is released. Decreasing χ² leads to an improvement of the model. The EPC provides a prediction of the minimal change of the given parameter when it is released (Saris et al., 1987). Thus, the EPC provides a direct estimate of the size of the misspecification, whereas the MI provides a significance test for the estimated misspecification (Saris et al., 1987). The MI by itself, however, does not convey the size of the misspecification. Moreover, the decision regarding whether the size of the EPC should be regarded as a relevant misspecification should also consider the power of the MI test, which is not directly provided by the SEM program. The power is calculated using the EPC and the MI with the publicly available Jrule program developed by Oberski (2009) for Mplus output and by van der Veld for LISREL output (van der Veld et al., 2008; for details, see Saris et al., 2009). Although not set in stone, Saris et al. (2009) formulated suggestions about how large a misspecification a researcher may be willing to tolerate. These suggestions should be used with caution. Specifically, Saris et al. (2009) suggest that deviations larger than 0.4 for standardized estimates of cross-loadings and deviations larger than 0.1 for standardized estimates of other parameters may be treated as serious misspecifications, and smaller ones may be ignored. The second question that should be addressed by a researcher is: what size of power do we treat as high enough to detect the defined size of misspecification? Saris et al. (2009) suggest a value of 0.75, although according to Cohen (1988, 1992), 0.8 would be preferable. Both pieces of information (misspecification size and the size of power to detect potential misspecifications) must be
indicated by the researcher in advance in the Jrule program (Oberski, 2009; Saris et al., 2009). An application of this approach to measurement invariance testing was recently presented by Cieciuch et al. (2015).
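The global-fit decision rules of Chen (2007) quoted above are simple enough to encode directly. The helper below is only bookkeeping: it presumes that the CFI, RMSEA and SRMR of the configural, metric and scalar models have already been obtained from an SEM program, and it applies the cut-offs for samples larger than 300 exactly as stated in the text; the example fit values are invented.

```python
def chen_flags(fit_less: dict, fit_more: dict, step: str) -> bool:
    """Return True if the more constrained model is flagged as NONinvariant
    under Chen's (2007) criteria for samples larger than 300.

    `fit_less` / `fit_more` hold 'cfi', 'rmsea' and 'srmr' for the less and
    more constrained model of one step ('metric' or 'scalar').
    """
    d_cfi = fit_less["cfi"] - fit_more["cfi"]        # CFI drops as constraints are added
    d_rmsea = fit_more["rmsea"] - fit_less["rmsea"]
    d_srmr = fit_more["srmr"] - fit_less["srmr"]
    srmr_cut = 0.030 if step == "metric" else 0.010
    return d_cfi > 0.010 and (d_rmsea > 0.015 or d_srmr > srmr_cut)

# Hypothetical fit indices:
configural = {"cfi": 0.952, "rmsea": 0.046, "srmr": 0.038}
metric     = {"cfi": 0.949, "rmsea": 0.047, "srmr": 0.041}
scalar     = {"cfi": 0.931, "rmsea": 0.063, "srmr": 0.049}

print(chen_flags(configural, metric, "metric"))   # False -> metric invariance retained
print(chen_flags(metric, scalar, "scalar"))       # True  -> scalar invariance rejected
```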
Recommendations The use of global fit measures and their cutoff criteria is the most common approach for testing measurement invariance. However, the approach suggested by Saris et al. (2009) is very appropriate to evaluate whether a certain level of invariance is established, because it is more robust to characteristics of the model and the data which are not necessarily relevant to the problem at hand. We recommend researchers who assess measurement invariance to apply both evaluation criteria and compare their conclusions. The former criteria would provide information about the fit of the model to the data. The latter would allow a more careful evaluation of local misspecifications which may be detrimental to drawing meaningful substantive conclusions in comparative analysis if not addressed by the researcher. One should remember that although the cutoff criteria are useful in evaluating the model fit, they cannot be used in a strict sense. In fact, all models are ‘wrong’ because none of them reflects all details of the reality under study in a precise way. Thus, the cut-off criteria can be used as a kind of a signpost rather than as strict decision rules. They are, however, helpful to understand the model and the extent of its misspecifications.
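The size-and-power reasoning of Saris et al. (2009) can be captured in a similar helper. The thresholds below (0.4 for standardized cross-loadings, 0.1 for other standardized parameters, a required power of 0.75) are taken from the text; the power value itself would come from Jrule or a comparable tool, so this rough sketch only automates the final classification of a single candidate misspecification and is not a full rendering of the Jrule decision table.

```python
def judge_misspecification(std_epc: float, is_cross_loading: bool,
                           mi_significant: bool, power: float,
                           power_needed: float = 0.75) -> str:
    """Classify one candidate misspecification following the size/power logic
    of Saris et al. (2009) as summarized in the text.

    `std_epc` is the standardized expected parameter change for the restricted
    parameter; `power` is the power of the MI test to detect a deviation of
    the size the researcher considers relevant.
    """
    cutoff = 0.4 if is_cross_loading else 0.1
    if mi_significant and abs(std_epc) > cutoff:
        return "serious misspecification -> reject or modify the model"
    if not mi_significant and power >= power_needed:
        return "no relevant misspecification detected (test had adequate power)"
    if not mi_significant and power < power_needed:
        return "inconclusive: the test lacked power to detect a relevant misspecification"
    return "deviation present but below the tolerated size -> may be ignored"

print(judge_misspecification(std_epc=0.15, is_cross_loading=False,
                             mi_significant=True, power=0.8))
```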
The Fourth Issue – Are Cross-Loadings Allowed?

A cross-loading implies allowing an item to load not only on the latent variable which it is supposed to measure but also on a different latent variable in the model that it was not supposed to measure. The test of measurement invariance is typically based on
MGCFA models which do not allow for cross-loadings. However, this is quite a demanding constraint, and not all measurement instruments used in cross-cultural research are successfully validated without allowing for at least some cross-loadings (Marsh et al., 2010; Podsakoff et al., 2003). On the one hand, researchers often suggest that every measure of a concept should be tested using CFA (e.g. Borsboom, 2006) and display sufficient levels of convergent and discriminant validity (i.e. no cross-loadings); on the other hand, there are some well-established psychological measurement instruments which received no empirical support whatsoever when tested with a CFA model without allowing for any cross-loadings. Probably the best-known case is the measure of the Big Five personality traits (neuroticism, extraversion, openness to experience, agreeableness and conscientiousness) developed by Costa and McCrae (McCrae et al., 1996). The Big Five measures are welldefined, and these measures have been repeatedly and successfully replicated in exploratory factor analyses (EFA), which by definition allow for cross-loadings, whereas CFAs without cross-loadings often failed to support their measurement models. The explanation for this phenomenon is that CFA requires that each item loads onto only one factor, with loadings equal to zero for all other factors. Some researchers have argued that this criterion is too restrictive for personality or value research because it is difficult to avoid situations where items have secondary (weaker) loadings on other concepts (Marsh et al., 2010; McCrae et al., 1996). CFAs of such measures that do not allow for cross-loadings result in both poor model fit and inflated correlations among factors. By way of contrast, other researchers have argued that such results simply suggest that indicators that load on more than one factor are poor and should be replaced by better indicators (Borsboom, 2006). A discussion about the meaning of such results for personality theory or for other theories where this
problem exists is beyond the scope of the present paper. For our purposes, let us consider a situation in which a strict CFA structure that allows no cross-loadings is not supported by the data in each single group. Such a situation makes invariance testing across groups impossible, because before invariance may be tested across groups, the model has to be properly defined and empirically supported in each group separately. It is reasonable in such a situation to allow for cross-loadings. Marsh et al. (2009) recently developed the exploratory structural equation modeling (ESEM) approach to enable tests of measurement invariance for models with well-defined structures that, nevertheless, are not clear and simple enough for the assumption of zero cross-loadings in CFA. Figure 39.3 displays such a model using the ESEM approach. ESEM differs from CFA in that all possible factor loadings are estimated (as in EFA). At the same time, ESEM differs from EFA in that measurement errors are estimated, as in CFA. Additionally, ESEM offers model fit statistics similar to those provided in CFA analyses. Thus, metric measurement invariance across the groups A and B of the model presented in Figure 39.3 is supported by the data when the following conditions are satisfied:
$\lambda_A^1 = \lambda_B^1$ and $\lambda_A^2 = \lambda_B^2$ and ... and $\lambda_A^6 = \lambda_B^6$
and
$\lambda_A^7 = \lambda_B^7$ and $\lambda_A^8 = \lambda_B^8$ and ... and $\lambda_A^{12} = \lambda_B^{12}$
Note: The subscripts denote the group and the superscripts denote the item number. Following the same line of reasoning as before, scalar invariance across the groups A and B of the model presented in Figure 39.3 is supported by the data when the following conditions are satisfied:
Figure 39.3 A model for testing for measurement invariance using an ESEM approach with two factors, three indicators measuring each factor and two groups. The two factors are allowed to covary. (Note: The subscripts denote the group and the superscripts denote the item number.)
$\lambda_A^1 = \lambda_B^1$ and $\lambda_A^2 = \lambda_B^2$ and ... and $\lambda_A^6 = \lambda_B^6$
and
$\lambda_A^7 = \lambda_B^7$ and $\lambda_A^8 = \lambda_B^8$ and ... and $\lambda_A^{12} = \lambda_B^{12}$
and
$\tau_A^1 = \tau_B^1$ and $\tau_A^2 = \tau_B^2$ and ... and $\tau_A^6 = \tau_B^6$
Note: The subscripts denote the group and the superscripts denote the item number.
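To make these equality conditions concrete, the following Python sketch checks them numerically for a pair of hypothetical (invented) group-specific estimates of the twelve ESEM loadings and six intercepts of the model in Figure 39.3. It only illustrates what the constraints assert: in practice the constraints are imposed during estimation in the SEM software and evaluated with nested-model fit comparisons, not by screening freely estimated parameters against an arbitrary tolerance.

```python
import numpy as np

# Hypothetical, freely estimated ESEM parameters for the model in Figure 39.3:
# 6 items (rows) x 2 factors (columns) per group, plus 6 item intercepts per group.
# All values are invented for illustration only.
loadings_A = np.array([[0.78, 0.10], [0.71, 0.05], [0.69, 0.12],
                       [0.08, 0.74], [0.11, 0.70], [0.03, 0.66]])
loadings_B = np.array([[0.77, 0.11], [0.70, 0.06], [0.70, 0.10],
                       [0.09, 0.73], [0.12, 0.69], [0.05, 0.67]])
intercepts_A = np.array([3.1, 2.9, 3.0, 2.5, 2.7, 2.6])
intercepts_B = np.array([3.1, 3.0, 3.0, 2.6, 2.7, 2.6])

def equal_across_groups(est_a, est_b, tol=0.05):
    """Flag parameters whose cross-group difference stays within a chosen tolerance."""
    return np.abs(np.asarray(est_a) - np.asarray(est_b)) <= tol

# Metric invariance asserts equality of all 12 loadings (lambda^1 ... lambda^12);
# scalar invariance additionally asserts equality of the 6 intercepts (tau^1 ... tau^6).
metric_ok = equal_across_groups(loadings_A, loadings_B).all()
scalar_ok = metric_ok and equal_across_groups(intercepts_A, intercepts_B).all()
print(f"metric constraints plausible: {metric_ok}, scalar constraints plausible: {scalar_ok}")
```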
Recommendations ESEM offers an alternative way to test for measurement invariance in cases where a CFA model in one or more of the groups cannot be supported by the data because of a large number of cross-loadings. Therefore, we recommend using ESEM for analyses in which there is both empirical and theoretical justification for several cross-loadings and where such a cross-loading structure is replicated in various studies, thereby reducing the risk of capitalizing on chance.1
The Fifth Issue – Is Full Measurement Invariance Necessary, or is it Sufficient to Establish Partial Measurement Invariance? Establishing measurement invariance is a necessary and, at the same time, a very demanding and difficult task. If the appropriate level of measurement invariance is supported, the substantive analysis may be performed meaningfully. But what about the very common case in which measurement invariance is not supported? There are two options. The first is to conclude that measurement invariance is lacking, which precludes any test of substantive hypotheses comparing the groups. The second is to look for a compromise solution that differentiates between full and partial measurement invariance (Byrne et al., 1989; Steenkamp and Baumgartner, 1998). Partial invariance is supported when the parameters of at least two indicators per construct (i.e. loadings for the metric level of measurement invariance, and loadings and intercepts at the
scalar level of measurement invariance) are equal across groups.
Recommendations The compromise solution proposed by Byrne et al. (1989) is useful in applied research when full invariance is not supported by the data. It is worthwhile to begin with a test of full measurement invariance. If it is not supported, we propose trying to establish partial measurement invariance and identifying two items whose factor loadings (for assessing metric invariance), or whose factor loadings and intercepts (for assessing scalar invariance), are similar across groups. Byrne et al. (1989) proposed releasing the constraints on noninvariant parameters in all groups. This procedure can be especially effective when several groups are compared. If at least two parameters for each latent variable can viably remain constrained across all groups, then partial measurement invariance for that variable is supported. However, we would like to note that several studies have indicated that partial invariance may not be sufficient for meaningful cross-group comparisons (e.g. de Beuckelaer and Swinnen, 2011; Steinmetz, 2011). Further simulations are needed to provide a more informative recommendation for applied researchers on how to handle partial measurement invariance when full measurement invariance does not hold.
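A minimal sketch of the bookkeeping behind the 'at least two invariant items per construct' rule is given below. The item names, construct assignments, and True/False judgements are hypothetical; in an actual analysis they would come from modification indices or from releasing constraints one at a time in the SEM software.

```python
# Hypothetical judgements about which constrained parameters could be retained
# across all groups (for illustration only).
invariant_loading = {"item1": True, "item2": True, "item3": False,
                     "item4": True, "item5": False, "item6": True}
invariant_intercept = {"item1": True, "item2": False, "item3": False,
                       "item4": True, "item5": False, "item6": True}
constructs = {"F1": ["item1", "item2", "item3"], "F2": ["item4", "item5", "item6"]}

def partial_invariance(constructs, invariant_loading, invariant_intercept=None, minimum=2):
    """For each construct, count items whose parameters could remain constrained
    across all groups and compare that count with the required minimum."""
    verdict = {}
    for factor, items in constructs.items():
        kept = [i for i in items
                if invariant_loading[i] and (invariant_intercept is None or invariant_intercept[i])]
        verdict[factor] = len(kept) >= minimum
    return verdict

print("partial metric invariance:", partial_invariance(constructs, invariant_loading))
print("partial scalar invariance:", partial_invariance(constructs, invariant_loading, invariant_intercept))
```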
The Sixth Issue – Is Exact Measurement Invariance Necessary, or is it Sufficient to Establish Approximate Measurement Invariance? Measurement invariance implies that given parameters are exactly equal across groups. Constraining parameters to be equal across groups is a very strict requirement. One can question this strong requirement and wonder if it is really necessary that the parameters are equal. Perhaps ‘almost equal’ would be
sufficient, assuming that we were able to operationalize 'almost'. Such a consideration underlies the Bayesian approach to measurement invariance (Muthén and Asparouhov, 2013; van de Schoot et al., 2013). Approximate measurement invariance permits differences between parameters such as factor loadings or intercepts across groups that are very small, rather than precisely zero as in the exact approach. In other words, in a Bayesian approach the differences between parameters across groups are treated as random variables, and the distribution of the parameter differences is described by priors (Muthén and Asparouhov, 2012, 2013). For example, the distribution can be defined as normal, with a mean equal to 0.0 (i.e. zero difference on average between the parameters across groups) and a small variance (e.g. 0.01), which can be written as N(0, 0.01). Simulation studies conducted by van de Schoot and colleagues (2013) show that such small variances (0.01 or 0.05) are sufficient to accept models that contain no serious misspecifications but would otherwise be rejected under the exact approach. Furthermore, they suggest that such a variance is small enough not to distort the conclusions about measurement invariance. For example, a recent study by Davidov and colleagues (2015) demonstrates that whereas an exact test of a scale measuring attitudes toward immigration in the European Social Survey fails to establish cross-country measurement invariance, an approximate test does support the invariance properties of the scale. Hence, approximate scalar invariance across groups A and B of the model presented in Figure 39.1 is supported by the data when the following conditions are satisfied2:
$\lambda_A^1 - \lambda_B^1 \sim N(0, 0.01)$ and $\lambda_A^2 - \lambda_B^2 \sim N(0, 0.01)$ and ... and $\lambda_A^6 - \lambda_B^6 \sim N(0, 0.01)$
and
$\tau_A^1 - \tau_B^1 \sim N(0, 0.01)$ and $\tau_A^2 - \tau_B^2 \sim N(0, 0.01)$ and ... and $\tau_A^6 - \tau_B^6 \sim N(0, 0.01)$
Note: The subscripts denote the group and the superscripts denote the item number. The global fit of the Bayesian model can detect whether actual deviations are larger than those that the researcher allows in the prior distribution. The model fit can be evaluated based on the posterior predictive probability (ppp) value and the credibility interval (CI) for the difference between the observed and replicated χ2 values. According to Muthén and Asparouhov (2012) and van de Schoot et al. (2013), the Bayesian model fits the data when the ppp is larger than zero (although simulation studies are still required to determine how small it should be) and the CI contains zero. Additionally, Mplus lists all parameters that significantly differ from the priors. This feature is equivalent to modification indices in the exact measurement invariance approach. While the model is assessed based on ppp and CI, these values provide global model fit criteria that are similar to the criteria in the exact approach (Chen, 2007). If the indices are not satisfactory, one can release parameters listed as noninvariant and try to establish partial approximate measurement invariance.
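The following sketch, using hypothetical numbers, merely illustrates the scale of cross-group differences that a N(0, 0.01) prior tolerates; it is not a substitute for the Bayesian estimation itself, in which the software (e.g. Mplus) computes the ppp value and credibility intervals described above.

```python
import numpy as np
from scipy import stats

# The approximate approach treats each cross-group difference in a loading or
# intercept as a draw from a small-variance prior, e.g. N(0, 0.01).
prior_variance = 0.01
prior_sd = np.sqrt(prior_variance)  # 0.1
lo, hi = stats.norm.interval(0.95, loc=0.0, scale=prior_sd)
print(f"95% of the prior mass lies in [{lo:.2f}, {hi:.2f}]")  # roughly [-0.20, 0.20]

# Hypothetical cross-group differences in six loadings (invented numbers).
observed_diff = np.array([0.02, -0.04, 0.15, 0.01, -0.03, 0.06])
within_prior = (observed_diff >= lo) & (observed_diff <= hi)
print("differences compatible with the prior:", within_prior.tolist())
```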
Recommendations Bayesian analysis and testing for approximate measurement invariance may be more appropriate for applied researchers than testing for exact measurement invariance, because the strict (and often unrealistic) requirement of equality of parameters can be relaxed. The possibility of testing for approximate invariance and determining the scope of the approximation by defining priors is very promising. It is a very new approach (Muthén and Asparouhov, 2013); consequently, extant
knowledge about how results from approximate measurement invariance assessments compare to exact measurement invariance assessments for the same data is limited. Additionally, the cut-off criteria for the ppp model fit and the recommended size of the priors should be clarified in future research. Simulation studies as well as comparisons of results obtained using the approximate and exact approaches using real data are needed. Therefore, we recommend running and reporting Bayesian analyses alongside exact measurement invariance tests when exact measurement invariance tests fail to receive empirical support.
SUMMARY AND CONCLUSIONS Due to the increasing availability of cross-cultural data, comparing cultures seems to be easier nowadays than ever before. Researchers can compare scores of various variables or examine whether relationships between different variables vary across groups. However, the widespread availability of such data does not automatically mean that meaningful cross-cultural comparisons of the measured variables are possible. It could well be the case that these variables, although measured in a similar way across cultures, are not comparable. Questions may be understood differently in different groups, concepts may have a different meaning and respondents may react differently to research questions (e.g. when social desirability bias is more strongly present in one culture than in another). As a result, comparisons may be questionable at best and misleading at worst. Therefore, measurement invariance has to be established before conclusions are drawn. Establishing measurement invariance is not a sufficient condition that can guarantee comparability. In-depth interviews and deep knowledge of the cultural groups to be compared are equally important to guarantee that concepts are similarly understood. However,
establishing measurement invariance is a necessary condition to allow meaningful comparisons across groups. There is a large body of methodological literature that describes how measurement invariance may be tested. In this chapter we discussed several approaches for assessing measurement invariance across groups and tried to provide recommendations or general guidelines for applied researchers to help in deciding which method to use to assess invariance. The reader is referred to further studies to gain a deeper understanding and get acquainted with additional tools on how to test for measurement invariance (e.g. item response theory or latent class analysis; for several applications and references to this literature, see Davidov et al., 2011; for a discussion on the use of cognitive interviews, see Braun et al., 2013 and Latcheva, 2011; for a discussion about a new approach to estimate means for a large number of groups using alignment, see Asparouhov and Muthén, 2013). The decision of which method to adopt for establishing measurement invariance is left to the discretion of the researcher depending on the data used and the substantive research questions that need to be addressed. Whatever method is chosen, we have tried to underline in this chapter the importance of establishing measurement invariance: if a measurement is noninvariant, it is possible that differences that are found between groups do not correspond to real differences or, conversely, that real differences are obscured by the noninvariant measurements. The methods we have presented are not limited to the establishment of measurement invariance across countries and may be applied when comparisons between other groups such as cultures, regions within countries, language groups, time points or different conditions of data collection are undertaken. Finally, when researchers fail to establish measurement invariance, comparisons become problematic. However, revealed differences in the measurement properties
across groups may provide valuable information about more substantive cross-group differences. For example, if a certain item loads in a systematically different way on a latent variable across two groups, it may indicate that this item is understood in a different way across the two countries, thereby providing hints about an important feature in which the two countries vary. Thus, rather than ignoring such measurement differences and proceeding with the performance of questionable comparative research despite measurement noninvariance, examining measurement noninvariance more deeply and investigating differences in the measurement properties (i.e. in factor loadings or intercepts) in detail may constitute a worthwhile means to obtain a deeper understanding of the groups at hand (see, e.g. Davidov et al., 2012 or Jak et al., 2013; see also Poortinga, 1989). Furthermore, it might be useful to check the robustness of substantive conclusions to violations of the invariance assumption by using the sensitivity analysis techniques proposed by Oberski et al. (2015) and Jouha and Moustaki (2015).
ACKNOWLEDGMENTS The work of Jan Cieciuch, Eldad Davidov and René Algesheimer was supported by the University Research Priority Program Social Networks of the University of Zürich. The work of Peter Schmidt was supported by the Alexander von Humboldt Polish Honorary Research Fellowship granted by the Foundation for Polish Science for international cooperation of Peter Schmidt with Jan Cieciuch. We would like to thank Lisa Trierweiler for the English proof of the manuscript.
NOTES 1 A related issue is whether to allow for error correlations. The need to introduce error correlations between items (which belong to the same or to
different latent constructs) indicates that these items share variance which is not accounted for by their latent variable. Some scholars recommend controlling for a common method factor to account for this common variance before testing for invariance as an alternative to introducing error correlations. Discussing this method is beyond the scope of this study, but for further details about this approach, see, for example, Podsakoff et al. (2003). 2 Methodologists have still not strictly determined whether and under which conditions larger variances may be allowed. It is also possible to run a Bayesian analysis of the model presented in Figure 39.3 that contains all of the cross-loadings. Cross-loading differences across groups may also be defined by zero-means with small-variance priors.
RECOMMENDED READINGS The following literature could help readers to focus on those few studies (from the long list of references) that we think may be useful for further in-depth reading. Davidov et al. (2008, 2011, 2014), Millsap (2011), and Vandenberg and Lance (2000).
REFERENCES Ariely, G., and Davidov, E. (2010). Can we rate public support for democracy in a comparable way? Cross-national equivalence of democratic attitudes in the World Value Survey. Social Indicators Research, 104(2), 271–286. doi:10.1007/s11205-010-9693-5 Asparouhov, T., and Muthén, B. O. (2013). Multiple Group Factor Analysis Alignment. Mplus Web Notes 18, Version 3. Retrieved from http://www.statmodel.com/examples/ webnote.shtml. Accessed 23 August, 2013. Beauducel, A., and Wittmann, W. W. (2005). Simulation study on fit indexes in CFA based on data with slightly distorted simple structure. Structural Equation Modeling, 12(1), 41–75. doi:10.1207/s15328007sem1201_3 Bentler, P. M., and Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88(3), 588–606. doi:10.1037//00332909.88.3.588
Bollen, K. A. (2007). Interpretational confounding is due to misspecification, not to type of indicator: Comment on Howell, Breivik, and Wilcox (2007). Psychological Methods, 12(2), 219–228. doi:10.1037/1082-989X.12.2.219 Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425– 440. doi:10.1007/s11336-006-1447-6 Braun, M., Behr, D., and Kaczmirek, L. (2013). Assessing cross-national equivalence of measures of xenophobia: Evidence from probing in web surveys. International Journal of Public Opinion Research, 25(3), 383–395. doi:10.1093/ijpor/eds034 Brown, T. A. (2006). Confirmatory Factor Analysis for Applied Research. New York: Guilford Press. Browne, M.W., and Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen and J. S. Long (eds), Testing Structural Equation Models (pp. 136–162). Newbury Park, CA: SAGE. Byrne, B. M., Oakland, T., Leong, F. T. L., van de Vijver, F. J. R., Hambleton, R. K., Cheung, F. M., and Bartram, D. (2009). A critical analysis of cross-cultural research and testing practices: Implications for improved education and training in psychology. Training and Education in Professional Psychology, 3(2), 94–105. doi:10.1037/a0014516 Byrne, B. M., Shavelson, R. J., and Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. P sychological Bulletin, 105(3), 456–466. doi:10.1037/00332909.105.3.456 Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14(3), 464– 504. doi:10.1177/0734282911406661 Chen, F. F. (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in crosscultural research. Journal of Personality and Social Psychology, 95(5), 1005–1018. doi:10.1037/a0013193 Cieciuch, J., Davidov, E., Oberski, D. L., and Algesheimer, R. (2015). Testing for measurement invariance by detecting local misspecification and an illustration across online and paperand-pencil samples. European Political Science, 14 (521–538). doi: 10.1057/eps2015.64
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd edn). New York: Academic Press. Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. doi:10.1037/0033-2909.112.1.155 Curran, P. J., West, S. G., and Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16–29. doi:10.1037/1082-989X.1.1.16 Davidov, E. (2009). Measurement equivalence of nationalism and constructive patriotism in the ISSP: 34 countries in a comparative perspective. Political Analysis, 17(1), 64–82. doi:10.1093/pan/mpn014 Davidov, E., and de Beuckelaer, A. (2010). How harmful are survey translations? A test with Schwartz’s human values instrument. International Journal of Public Opinion Research, 22(4), 485–510. doi:10.1093/ijpor/edq030 Davidov, E., Cieciuch, J., Schmidt, P., Meuleman, B., and Algesheimer, R. (2015). The comparability of attitudes toward immigration in the European Social Survey: Exact versus approximate equivalence. Public Opinion Quarterly, 79, 244–266. doi:10.1093/poq/nfv008 Davidov, E., Dülmer, H., Schlüeter, E., Schmidt, P., and Meuleman, B. (2012). Using a multilevel structural equation modeling approach to explain cross-cultural measurement noninvariance. Journal of Cross-Cultural Psychology 43, 558–575. doi:10.1177/0022022112438397. Davidov, E., Meuleman, B., Cieciuch, J., Schmidt, P., and Billiet, J. (2014). Measurement equivalence in cross-national research. Annual Review of Sociology, 40, 55–75. Advance online publication. doi:10.1146/ annurev-soc-071913-043137 Davidov, E., Schmidt, P., and Billiet, J. (eds) (2011). Cross-Cultural Analysis: Methods and Applications. New York: Routledge. Davidov, E., Schmidt, P., and Schwartz, S. H. (2008). Bringing values back in. The adequacy of the European Social Survey to measure values in 20 countries. Public Opinion Quarterly, 72(3), 420–445. doi:10.1093/ poq/nfn035 De Beuckelaer, A., and Swinnen, G. (2011). Biased latent mean comparisons due to measurement noninvariance: A simulation study. In E. Davidov, P. Schmidt, and J. Billiet
(eds), Cross-Cultural Analysis: Methods and Applications (pp. 117–147). New York: Routledge. Finney, S. J., and Di Stefano, C. (2011). Nonnormal and categorical data in structural equation modeling. In G. R. Hancock and R. O. Mueller (eds), Structural Equation Modeling: A Second Course (pp. 439–492). Charlotte: Information Age. Flora, D. B., and Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9(4), 466–491. doi:10.1037/1082-989X.9.4.466 Horn, J. L., and McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18(3–4), 117–144. doi:10.1080/03610739208253916 Horn, J. L., McArdle, J. J., and Mason, R. (1983). When in invariance not invariant: A practical scientist’s look at the ethereal concept of factor invariance. Southern Psychologist, 1, 179–188. Hu, L. T., and Bentler, P. M. (1995). Evaluating model fit. In R. Hoyle (ed.), Structural Equation Modeling: Issues, Concepts, and Applications (pp. 76–99). Newbury Park, CA: Sage. Hu, L. T., and Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3(4), 424–453. doi:10.1037/1082-989x.3.4.424 Hu, L. T., and Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 66(1), 1–55. doi:10.1080/10705519909540118 Hu, L. T., Bentler, P. M., and Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted. Psychological Bulletin, 112(2), 351–362. doi:10.1037/00332909.112.2.351 Jak, S., Oort, F. J., and Dolan, C. V. (2013). A test for cluster bias: Detecting violations of measurement invariance across clusters in multilevel data. Structural Equation Modeling, 20(2), 265–282. doi:10.1080/ 10705511.2013.769392 Johnson, T. P., and van de Vijver, F. (2003). Social desirability in cross-cultural research.
In J. A. Harkness, F. J. R. van de Vijver, and P. P. Mohler (eds), Cross-Cultural Survey Methods (pp. 193–209). Hoboken, NJ: Wiley. Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34(2), 183–202. doi:10.1007/bf02289343 Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36(4), 409–426. doi:10.1007/ bf02291366 Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika, 43(4), 443–477. doi:10.1007/ bf02293808 Jöreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen and J. S. Long (eds), Testing Structural Equation Models (pp. 294–316). Newbury Park: Sage. Jouha, K., and Moustaki, I. (2015). Nonequivalence of measurement in latent variable modeling of multigroup data: A sensitivity analysis. Psychological Methods, 20 (4), 523–536. doi: 10.1037/met0000031 Kaplan, D. (1990). Evaluating and modifying covariance structure models: A review and recommendation. Multivariate Behavioral Research, 25(2), 137–155. doi:10.1207/ s15327906mbr2502_1 Karp, J. A., and Brockington, D. (2005). Social desirability and response validity: A comparative analysis of overreporting voter turnout in five countries. The Journal of Politics, 67(3), 825–840. doi:10.1111/j.1468-2508.2005. 00341.x Kenny, D. A., and McCoach, D. B. (2003). Effect of the number of variables on measures of fit in structural equation modeling. Structural Equation Modeling, 10, 333–351. doi:10.1207/S15328007SEM1003_1 Latcheva, R. (2011). Cognitive interviewing and factor-analytic techniques: A mixed method approach to validity of survey items measuring national identity. Quality & Quantity, 45(6), 1175–1199. doi:10.1007/s11135-009-9285-0 Little, T. D., Slegers, D. W., and Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling, 13(1), 59–72. doi:10.1207/s15328007sem1301_3 Lubke, G. H., and Muthén, B. O. (2004). Applying multigroup confirmatory factor models
for continuous outcomes to Likert scale data complicates meaningful group comparisons. Structural Equation Modeling, 11(4), 514– 534. doi:10.1207/s15328007sem1104_2 Marsh, H. W., Hau, K. T., and Grayson, D. (2005). Goodness of fit in structural equation models. In A. Maydeu-Olivares and J. J. McArdle (eds), Contemporary Psychometrics (pp. 275–340). Mahwah, NJ: Lawrence Erlbaum Associates. Marsh, H. W., Hau, K. T., and Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11(3), 320– 341. doi:10.1207/s15328007sem1103_2 Marsh, H. W., Ludtke, O., Muthén, B., Asparouhov, T., Morin, A. J. S., Trautwein, U., and Nagengast, B. (2010). A new look at the Big Five factor structure through exploratory structural equation modeling. Psychological Assessment, 22(3), 471–491. doi:10.1037/ a0019227 Marsh, H. W., Muthén, B., Asparouhov, T., Lüdtke, O., Robitzsch, A., Morin, A. J. S., and Trautwein, U. (2009). Exploratory structural equation modeling, integrating CFA and EFA: Application to students’ evaluations of university teaching. Structural Equation Modeling, 16(3), 439–476. doi:10.1080/10705510903008220 McCrae, R. R., Zonderman, A. B., Bond, M. H., Costa, P. T., and Paunonen, S. V. (1996). Evaluating replicability of factors in the Revised NEO Personality Inventory: Confirmatory factor analysis versus Procrustes rotation. Journal of Personality and Social Psychology, 70(3), 552– 566. doi:10.1037/0022-3514.70.3.552 Meade, A. W., Johnson, E. C., and Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93(3), 568–592. doi:10.1037/0021-9010.93.3.568 Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543. doi:10.1007/ bf02294825 Meuleman, B., and Billiet, J. (2012). Measuring attitudes toward immigration in Europe: The cross-cultural validity of the ESS immigration scales. ASK Research & Methods, 21(1), 5–29.
Millsap, R. E. (2011). Statistical Approaches to Measurement Invariance. New York, London: Routledge Taylor & Francis Group. Muthén, B. (1983). Latent variable structural equation modeling with categorical data. Journal of Econometrics, 22(1–2), 43–65. doi:10.1016/0304-4076(83)90093-3 Muthén, B., and Asparouhov, T. (2002). Latent Variable Analysis with Categorical Outcomes: Multiple-group and Growth Modeling in Mplus. Mplus Web Notes 4. Retrieved from https://www.statmodel.com/download/webnotes/CatMGLong.pdf [accessed on 15 June 2016]. Muthén, B., and Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17(3), 313–335. doi:10.1037/a0026802 Muthén, B., and Asparouhov, T. (2013). BSEM Measurement Invariance Analysis. Mplus Web Notes 17. Retrieved from http://www. statmodel.com/examples/webnotes/webnote17.pdf [accessed on 15 June 2016]. Oberski, L. D. (2009). Jrule for Mplus version 0.91 (beta) [Computer software]. Retrieved from https://github.com/daob/JruleMplus/ wiki [accessed on 15 June 2016]. Oberski, L. D. (2012). Evaluating sensitivity of parameters of interest to measurement invariance in latent variable models. Political Analysis, 22(1), 45–60. doi: 10.1093/pan/mpt014 Podsakoff, P. M., MacKenzie, S. B., Lee, J.-Y., and Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879–903. doi:10.1037/ 0021-9010.88.5.879 Poortinga, Y. H. (1989). Equivalence of cross-cultural data: An overview of basic issues. International Journal of Psychology, 24(6), 737–756. doi:10.1080/ 00207598908247842 Rhemtulla, M., Brosseau-Liard, P. E., and Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17(3), 354–373. doi:10.1037/a0029315
Saris, W. E., Satorra, A., and Sörbom, D. (1987). The detection and correction of specification errors in structural equation models. Sociological Methodology, 17, 105–129. Saris, W. E., Satorra, A., and van der Veld, W. M. (2009). Testing structural equation models or detection of misspecifications? Structural Equation Modeling, 16(4), 561– 582. doi:10.1080/10705510903203433 Sarrasin, O., Green, E. G. T., Berchtold, A., and Davidov, E. (2013). Measurement equivalence across subnational groups: An analysis of the conception of nationhood in Switzerland. International Journal of Public Opinion Research, 25(4), 522–534. doi:10.1093/ ijpor/eds033 Sass, D. A., Schmitt, T. A., and Marsh, H. W. (2014). Evaluating model fit with ordered categorical data within a measurement invariance framework: A comparison of estimators. Structural Equation Modeling, 21, 167–180. doi:10.1080/10705511.2014. 882658 Steenkamp, J.-B. E. M., and Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78–90. doi:10.1086/209528 Steinmetz, H. (2011). Estimation and Comparison of Latent Means Across Cultures. In E. Davidov, P. Schmidt, and J. Billiet (eds), Crosscultural Analysis: Methods and Applications (pp. 85–116). New York: Routledge. Sörbom, D. (1989). Model modification. Psychometrika, 54(3), 371–384. doi:10.1007/ BF02294623 Thurstone, L. L. (1947). Multiple Factor Analysis. Chicago, IL: The University of Chicago Press. Urban, D., and Mayerl, J. (2014). Strukturgleichungsmodellierung. Ein Ratgeber für die Praxis [Structural equation modeling. A guide for practical applications]. Wiesbaden: Springer. Van de Schoot, R., Kluytmans, A., Tummers, L., Lugtig, P., Hox, J., and Muthén, B. O. (2013). Facing off with Scylla and Charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Frontiers in Psychology, 4, 770. doi:10.3389/fpsyg.2013.00770
Van de Vijver, F. J. R., and Poortinga, Y. H. (1997). Towards an integrated analysis of bias in cross-cultural assessment. European Journal of Psychological Assessment, 13(1), 29–37. doi:10.1027/1015-5759.13.1.29 Van der Veld, W. M., Saris, W. E., and Satorra, A. (2008). JRule 2.0: User manual. Unpublished manuscript. Vandenberg, R. J. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5(2), 139– 158. doi:10.1177/1094428102005002001
Vandenberg, R. J., and Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research M ethods, 3(1), 4–70. doi:10.1177/109442810031002 West, S. G., Taylor, A. B., and Wu, W. (2015). Model fit and model selection in structural equation modeling. In R. Hoyle (ed.), Handbook of Structural Equation Modeling (pp. 209–231). New York: Guilford.
PART IX
Further Issues
40 Data Preservation, Secondary Analysis, and Replication: Learning from Existing Data Lynette Hoelter, Amy Pienta and Jared Lyle
BACKGROUND: DATA ACCESS AND RESEARCH TRANSPARENCY AS SCIENTIFIC VALUES Merton (1973) proposed that the behaviors of scientists converge around four norms or ideals: communality, universalism, disinterestedness, and organized skepticism. The first norm, communality, is a behavioral ideal where scientists share ownership of the scientific method and their findings. A contemporary outgrowth of this earlier writing has been an emphasis in the scholarly literature and public sphere on ensuring research transparency, which includes the sharing of results and also the research data underlying those results in a timely fashion (for example, the American Association for Public Opinion Research’s Transparency Initiative). The counter norm of communality is secrecy in which scientists protect their findings to maintain their standing or reputation within the field. Although most scientists identify with the norm of communality, it seems they
view their colleagues as having much less regard for such shared ownership (Anderson et al., 2007). At the highest levels of government, there is increasing pressure to make research outputs (including data) available to the widest group of people. As such, federal agencies are beginning to plan how best to ensure that scientific results they have funded, broadly defined to include publications and research data, are made publicly available (Executive Office of the President, 2013). Placing data in the public domain serves scientific progress because sharing research data allows scientists to build on the work of those who have come before them. For example, in survey research there is high value placed on seeing the question wording and survey design methodology of prior surveys to ensure that (1) valid and reliable measures are used in surveys, (2) measures are comparable across different surveys, and (3) poor measures are refined or discarded in favor of better measures. Also, sharing data allows for replication of results – a behavior
that is valued in many of the social science disciplines including political science, sociology, and economics. It is an especially good time for disciplines to take stock of how they are doing with regard to research transparency and sharing data given that this behavior is still mostly a voluntary practice in the social sciences. Most social science journals do not require study registration, pre-analysis plans, or data sharing (Miguel et al., 2014). However, given the policy relevance of the work of social scientists, it is especially important that the methods and data underlying published results are available for scientific and public scrutiny (McCullough and McKitrick, 2009; Miguel et al., 2014). Additionally, the transformation in social science over the last two decades such that most data are 'born digital' makes replication and reproducibility of research through the exposure of one's methods much easier. For example, the explosive growth in computer-assisted interviewing techniques has enabled the immediate digitization of respondent answers into data files which can be used, analyzed, and shared more readily. At the same time, there is recognition that such advances in computing, information, and communication technologies have made science increasingly data intensive with social scientists producing large, complex databases that take considerable resources to share and reuse. Meeting the norms of research transparency and data sharing is often difficult for survey researchers. Although media-sensationalized concerns of plagiarism and falsification or fabrication of data receive most of the attention, scientists acknowledge that the challenges to integrity they face are more likely to be related to ensuring correct interpretations of data and results due to limited time and resources that affect the quality and use of the data (DeVries et al., 2006). Complex survey data sets are tremendously expensive to collect, but they are also expensive to clean, prepare, and document for analysis, both by those who
collected the data and by the secondary user. Another concern is protecting the confidentiality of survey respondents who increasingly are asked to provide identifying information ranging from details about where they live to precise time of transactions and encounters (e.g., medical visits) that define the experiences that social scientists want to study. These limitations have to be addressed to guarantee that survey researchers are able to meet the norms of research transparency and data sharing. Here we present current thinking on the issues presented above: decisions about data sharing, preservation, dissemination, common obstacles to sharing, and secondary analysis.
WHICH DATA ARE SHARED? Given the tide of public opinion and government interest in research transparency, it becomes especially important for social scientists to converge around a set of criteria for determining which survey data have shortterm value and need only be kept for years versus survey data that have long-term value and should be kept indefinitely. Most data could be made available for short-term use with relatively little cost. However, one of the challenges we face is establishing systematic criteria to select data that should be kept permanently. It is plausible to think that most data would be beneficial to place in the public domain close to the end of the study when any associated primary publications become available to meet the needs for replication and research transparency described above. This would not necessitate making decisions about the long-term value of the data collection. Costs associated with providing long-term preservation and maintaining access to and utility of data, however, demand prioritizing some types of data over others before making such commitments. While all data are
of value, there are attributes that make some data more likely to retain long-term scientific value than other data. Data collected at great expense and that make a significant contribution to understanding initially are expected to do so in the longer term as well, for example: (1) data collected using a probability sampling frame – where the sampling procedures and frames have been well documented – allow scholars to understand broader population trends and processes (this is especially true for nationally representative data); (2) data that are from or about an understudied demographic group to validate and understand the experiences of smaller groups within the diverse society; (3) longitudinal or repeated-cross-sectional data where individuals or cohorts are followed over long periods of time provide the strongest evidence for causal relationships in the social sciences; (4) data that represent methodological innovations, especially when designed to provide a unique perspective on questions of broader interests; (5) data that are theoretically important and represent key pieces of a field’s collective knowledge; and (6) data underlying highly cited publications or highly cited researchers.
ARCHIVING DATA Archiving data enables and encourages reuse and replication by making data accessible and independently understandable. Archiving is much more than data storage and retrieval; it is the active preservation and curation (i.e., enhancement) of data to enable present and future use. This includes: rich and descriptive study documentation, long-term preservation of formats and files, and protection of confidential and sensitive data. Archiving is most successful when survey researchers build in good practices from the very planning stages of their projects. Documenting decisions about sample, question wording, and mode of data collection
at the time they are made, and keeping that information with the data throughout the process, is the best way to ensure that an end user can determine the quality of the data at the time of analysis. ICPSR provides detailed guidance and examples of archival components in the Guide to Social Science Data Preparation and Archiving (ICPSR, 2012) and the UK Data Archive provides a similar guide, Managing and Sharing Data (UK Data Archive, 2011).
Documentation (Metadata) Study documentation, or metadata, helps describe the context of the data, as well as increases the discoverability of the data through search engines and online catalogs. Good documentation ensures that a data collection contains information such that each variable in the data file is completely self-explanatory and that users are knowledgeable about the potential, or lack thereof, for generalizing from those data. As survey data are regularly collected using complex sample designs and comprised of thousands of variables, good documentation practices are even more important and appreciated. Examples of documentation include codebooks, user guides, and questionnaires, as well as study description pages on Websites.
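As a rough illustration, the sketch below records study- and variable-level documentation in a machine-readable form. The field names are a deliberately simplified, DDI-inspired invention and do not reproduce any actual metadata standard; the study and variables are likewise hypothetical.

```python
import json

# A simplified, machine-readable study description with variable-level detail.
study_metadata = {
    "title": "Example Political Attitudes Survey, 2015",   # hypothetical study
    "producer": "Example Survey Research Center",
    "keywords": ["political attitudes", "internet use", "religion"],
    "sample_design": "stratified probability sample of adults",
    "mode": "CAPI",
    "variables": [
        {"name": "int_use", "label": "Frequency of internet use",
         "question": "How often do you use the internet?",
         "values": {1: "Daily", 2: "Weekly", 3: "Less often", 4: "Never"}},
        {"name": "relig_att", "label": "Religious service attendance",
         "question": "How often do you attend religious services?",
         "values": {1: "Weekly or more", 2: "Monthly", 3: "Rarely", 4: "Never"}},
    ],
}

# Writing the record alongside the data keeps documentation searchable down to
# the variable level, which is what question banks and data catalogs rely on.
with open("study_metadata.json", "w", encoding="utf-8") as f:
    json.dump(study_metadata, f, indent=2)
```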
Long-Term Preservation Long-term preservation makes it possible for current investments to be utilized into the future. By long-term we mean ‘a period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community’ (Consultative Committee for Space Data Systems, 2002). Preserving data goes well beyond simply storing files on a hard drive. Physical media fail, infrastructure changes, and recognized file formats change.
Preservation is the active process of planning which formats, media, and organization will robustly stand over time. Recommended open and widely accepted file formats for quantitative data are ASCII, tab-delimited, SAS, SPSS, and Stata. Documentation is preserved using XML and PDF/A. Proper preservation helps retain data integrity.
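The following sketch illustrates one way of exporting a working dataset to open formats of the kind recommended above: a tab-delimited ASCII data matrix accompanied by a plain-text codebook. The data, file names, and variable names are invented for illustration.

```python
import json
import pandas as pd

# Hypothetical working data; in practice this would be read from the project's
# native statistical-package file.
df = pd.DataFrame({
    "resp_id": [1001, 1002, 1003],
    "age": [34, 51, 27],
    "int_use": [1, 3, 2],
})

# Tab-delimited ASCII is one of the recommended open formats for the data matrix.
df.to_csv("survey_data.tsv", sep="\t", index=False, encoding="ascii")

# A plain-text codebook preserved alongside the data file.
codebook = {
    "resp_id": "Respondent identifier (no meaning outside this file)",
    "age": "Age in years at interview",
    "int_use": "Frequency of internet use (1=Daily, 2=Weekly, 3=Less often, 4=Never)",
}
with open("survey_data_codebook.json", "w", encoding="utf-8") as f:
    json.dump(codebook, f, indent=2)
```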
DISSEMINATION Discovery Assuring that one’s data are preserved and shared does not necessarily make certain that those data will be found and used by other researchers. As noted above and earlier in this volume (Chapter 32), data are only as discoverable as the documentation surrounding those data is strong. That is, search engines looking across or within data archives and other sites can only provide users with relevant results when the data have been well described. At a minimum, metadata should include information about the study itself – title, data producer, and keywords reflecting the subjects addressed. As standards for documenting data are more broadly adopted and refined, information is often captured and made available down to the question or variable level (e.g., the Data Documentation Initiative). This level of markup becomes the engine behind question banks and other search tools that facilitate discovery. Researchers can then begin their searches with a broad concept, making use of keywords and study titles, or with a specific word, phrase, or combination of keywords, and receive a list of results with equal detail. As an example, a dataset might be the result of a study of political attitudes but also includes a few questions about internet use or religious practices. If the search was conducted only over keywords, someone interested in the latter questions might never find these data, but if the metadata includes
variable-level information, it is much more likely that the study would come up. Beyond keywords, other characteristics of the data may be of most interest to the potential user. Methodological details such as whether the data are cross-sectional or longitudinal, sampled from a nationally representative frame or a local area, and the racial diversity of the respondents may be more important for a given researcher (or teacher) than the content of the questions, depending on the intended objective. All of this information should be included in the documentation to accompany a dataset if the goal is for those data to be found and reused.
Data Citation While traditional bibliographic citations are an ingrained and canonized part of the research process, sadly, the need to cite data is just now becoming recognized. Indeed, in order for scientific assertions to withstand the scrutiny demanded by the principles of research transparency, the link between the data and the publication must be preserved (CODATA/ITSCI Task Force on Data Citation, 2013). Proper data citation makes it easier to discover data, encourages the replication of scientific results, and gives proper credit to data contributors. Citing data is simple. There are five basic elements to include: title, author, date, version, and a persistent identifier such as a Digital Object Identifier (DOI) (Altman and King, 2007; ICPSR, 2014). Here is an example of a well-formed data citation: United States Department of Justice. Federal Bureau of Investigation. Uniform Crime Reporting Program Data [United States]: Arrests by Age, Sex, and Race, 1984. ICPSR23328-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2014-03-05. doi: 10.3886/ ICPSR23328.v1.
In general, citations can be placed in the bibliography or references section of a paper.
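Because a well-formed data citation consists of just these five elements plus a persistent identifier, it can be assembled or checked programmatically. The sketch below reconstructs the ICPSR example above from its components; the formatting function is illustrative and does not represent a prescribed citation style.

```python
# The elements of the data citation, taken from the ICPSR example in the text.
citation = {
    "author": "United States Department of Justice. Federal Bureau of Investigation",
    "title": ("Uniform Crime Reporting Program Data [United States]: "
              "Arrests by Age, Sex, and Race, 1984"),
    "version": "ICPSR23328-v1",
    "distributor": "Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor]",
    "date": "2014-03-05",
    "doi": "10.3886/ICPSR23328.v1",
}

def format_citation(c):
    """Assemble the elements into a single citation string (one possible layout)."""
    return (f'{c["author"]}. {c["title"]}. {c["version"]}. '
            f'{c["distributor"]}, {c["date"]}. doi: {c["doi"]}.')

print(format_citation(citation))
```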
Ensuring Greater Access to Research and Results Data that have been properly archived and made available to the scientific community help to level the playing field and allow more scientists to enter into discussions of topics of importance to the larger community. Given today’s funding environment, both at the university and government levels, it is unlikely that a single researcher will be able to conduct a large-scale survey to explore a topic in which he/she is interested. Rather than having to rely on a small-scale study of a convenience sample, having data available for secondary research gives that individual a chance to carry out the study on data that have been carefully collected and documented, lending more credibility to the results. Additionally, while the data may not have been collected with his/her research question in mind, the nature of large omnibus surveys can give another researcher the ability to answer his/her original question and to put it in a larger context by including factors that he/she may not have thought to include in an original data collection but which are present in the secondary data. This allows individuals at institutions (or in countries) lacking a strong research infrastructure to contribute to the body of scientific knowledge on the same level as colleagues with those resources. Science as a whole benefits from the kinds of intellectual diversity that occur when more researchers are engaged in the conversations. Likewise, easy access to secondary data makes comparative research, especially on a large scale, possible. As practices of data preservation and sharing are emphasized in the US and Europe (see the Consortium of European Social Science Data Archives; http://www.cessda.net), they also spread to researchers in other countries. Beginning the process of properly documenting, archiving, and disseminating data allows researchers and other individuals in such places as Nepal, South Africa, and Ghana to have access to
information about their own societies that few have had before now. Additionally, survey programs such as the 'Barometers' (e.g., Asia Barometer, Euro-barometer, Arab Barometer, Afrobarometer) and the World Values Surveys make studying attitudes and behaviors cross-culturally doable for researchers anywhere with access to the Internet. Finally, when data are properly preserved and shared, it becomes easier to introduce new audiences to using data. These new audiences might include policy makers, journalists, and even undergraduate students in the social sciences. The structure and metadata that can be built around shared data make the task of data exploration less daunting and more meaningful, and online analysis tools based on this infrastructure remove the requirement of access to and knowledge of a statistical package.
BARRIERS TO DATA REUSE Even with mechanisms in place to promote data sharing and reuse, barriers still exist for both data producers and consumers.
Delayed Dissemination Making data available to other researchers is often met with resistance by principal investigators who are concerned about being ‘scooped’ by others before getting their own work out. Professional associations and journals across the social sciences recognize the researcher’s right to a first pass at the data and have presented timelines for dissemination. The policies include slight differences, but in general, data used in a publication should be made available at the time of the publication, though in some situations an embargo of up to a year after publication may be granted. Providing data for secondary analysis typically occurs near or after the end of a project period, with the exact timing
of availability set forth in the data management plan prior to data collection. Researchers need to anticipate the time involved in preparing data for sharing as they write the data management plan – keeping good documentation throughout the process makes the data preparation straightforward and quick. Waiting to label variables and values and/or create a codebook can result in a delay in sharing, either on the part of the researcher or of the receiving data archive. Additionally, time and resource constraints on a researcher who is both trying to produce sharable survey data and publish results can present obstacles to sharing data promptly.
Confidential Data The ability to share data publicly begins before the first survey question is even asked: it begins with the informed consent form. Researchers can be so concerned about informing respondents that their answers will be kept confidential that the consent form is written in a way that does not allow the original investigator to share data for (re)use by other researchers. Once this hurdle is crossed and data can be shared, the types of data and level of detail collected in modern surveys often raise questions about disclosure risks to respondents. Where it used to be the case that removing direct identifiers (name, address, any type of meaningful ID number) and possibly detailed responses about one's job, income, or education would be enough to protect respondents' confidentiality, this is no longer the case. Using today's complex data requires safe data, safe places, safe people, and safe output (Ritchey, 2009). One way in which data can be shared while confidentiality is maintained is through a process of data modification called data 'masking'. These procedures refer to techniques that act upon relationships within the data to anonymize, permute, or perturb the data such that the aggregate characteristics in which the researcher is interested remain
intact, but identifying information is decoupled from the data (Lupia and Alter, 2014). Secondary analysts themselves can be held accountable for protecting the confidentiality of the respondents through the implementation of data use agreements that lay out detailed plans for data security and, sometimes, for intended analytic strategies. In some cases, these researchers may be asked to pay a fee to cover incremental costs of handling secure data and agreements or conducting a security audit. Finally, technological advances make possible the sharing of data with even the highest levels of disclosure risk. Online analysis interfaces allow data providers like ICPSR to release administrative and other data required by some user communities but to restrict the kinds of analyses that can be run and/or the output when case counts in key cells are small. ‘Virtual’ data enclaves are a functional equivalent to physical enclaves – analysts conduct their work from their own offices through a remote computing environment such that the data are never released from behind the firewalls of the archive. Typically, analyses run in this fashion are also subject to output review to guarantee that the correct levels of aggregation are maintained and disclosure risk is minimized.
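A minimal sketch of such masking operations is shown below, assuming a hypothetical respondent-level file: direct identifiers are dropped, geography and age are coarsened, extreme incomes are top-coded, and a small random perturbation is added to a substantive score. Real disclosure control requires a formal risk assessment; these transformations are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical respondent-level data containing direct and indirect identifiers.
df = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Lee"],
    "zip_code": ["48104", "48105", "48109"],
    "age": [34, 51, 87],
    "income": [42000, 250000, 61000],
    "attitude_score": [3.2, 4.1, 2.8],
})

masked = df.drop(columns=["name"])                       # remove direct identifiers
masked["zip_code"] = masked["zip_code"].str[:3] + "xx"   # coarsen geography
masked["age"] = pd.cut(masked["age"], bins=[0, 30, 45, 60, 120],
                       labels=["<=30", "31-45", "46-60", "60+"])  # age bands
masked["income"] = masked["income"].clip(upper=150000)   # top-code extreme values

rng = np.random.default_rng(2016)
masked["attitude_score"] += rng.normal(0, 0.05, size=len(masked))  # small perturbation

print(masked)
```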
Proprietary Data New challenges have been posed as researchers utilize data from commercial vendors for scholarly analyses. The same notions of research transparency and need for replication apply, but exceptions have to be made for how these data are shared in such circumstances. In lieu of providing replication data, many journals now require authors to notify the editors at the time of manuscript submission that the data are proprietary and to provide program code along with a description of the process by which other researchers could obtain the data from the vendor (Lupia and Alter, 2014).
SECONDARY ANALYSIS Replication Many fields are recognizing the value and importance of replication in the scientific process and, to that end, more journals are requiring researchers to submit data and/or program code along with articles for publication. Replication policies help to provide safeguards against fraud and scientific misconduct as well as making innocent mistakes in data entry or coding identifiable (Glandon, 2011; Tenopir et al., 2011). Replication data are also an important learning tool for graduate students to learn the steps in analyzing data and constructing a set of published results (King, 1995). Sometimes the journal’s requirement is for the analytic dataset, while others ask for the original data and data manipulation code as well as analytic code. Journals also differ in whether they require the data to be placed into any reputable archive, the journal’s own repository, or if an author’s Website adequately constitutes sharing. Compliance with these requirements is not complete, however, and even when the data are available, exact replication has been possible in only a small number of cases (McCullough and McKitrick, 2009; Vinod, 2001).
Most Data are Under-Analyzed As technology has allowed surveys to become longer and more complex, the resulting datasets provide a wealth of information that can be used to answer many more research questions than an individual analyst is likely to be able to explore throughout her career. Additionally, data collected with one purpose in mind may include questions that make them the best data available for examining a completely different topic. For example, the National Survey of Family Growth is a survey designed to provide information about family and fertility behaviors including relationship histories, contraceptive use, and
sexual activity. However, the study design includes an oversample of minority respondents and the survey contains questions covering family and fertility broadly defined. Therefore, the data have also been used to examine such topics as racial intermarriage and general trends in divorce (c.f., Fu and Wolfinger, 2011); cervical cancer prevention (c.f., Hewitt et al., 2002); incarceration and high-risk sex partnerships among men (c.f., Khan et al., 2009); religious affiliation and women’s education and earnings (Lehrer, 2005); and the relationship between family violence and smoking for adolescents (Elliott, 2000). Many datasets offer similar potential for analysis beyond what was initially intended.
Training Secondary analysis is instrumental in training students (and others) about social science and data analysis. Materials have been developed such that instructors can include data from important studies in their fields in ways that enhance student learning of substantive course content, before the student is even exposed to a research methods or statistics course in the curriculum. The wealth of shared data offers faculty many options depending on their learning objectives and students can be exposed to data from the beginning to the end of their college careers with new learning taking place each time. Even with students who are not social science majors, including data in the classroom has been shown to increase critical thinking and quantitative literacy skills (e.g., Wilder, 2010). Replicating and/or extending findings from a well-known study is another way to make data analysis come alive for students, and the requirements of journals to deposit data for replication makes the ability to carry out such assignments more realistic. The relationship between secondary data and training exists for skilled researchers as well as novices – many complex data
collection efforts include workshops or short courses providing instruction in using the data as part of their dissemination strategies. The ease of obtaining secondary data and the complexities inherent in many of the studies suggest the need for caution by less-experienced researchers and more guidance from their mentors, however, as there is increased potential for results that are based on incorrect analytical techniques.
Harmonization The availability and variety of datasets for secondary analysis creates the opportunity for combining data in new and interesting ways, often resulting in a new resource that is greater than the sum of the component parts. Surveys conducted on the same topics over time or across different populations make it possible for scholars to examine trends in a larger context. Rarely are the data directly comparable though, so researchers have been engaging in ‘data harmonization’ whereby new datasets are created by combining information from several existing sources. Combining the data requires attention to details such as the wording of questions and answer choices for similar concepts across studies, differences in the original sample design and the specific respondents to which a question was asked, and changes in time or measurement units. Decisions about when questions are similar enough to be input for the same variable in the resultant data file often require substantive expertise, knowledge of survey design, and the ability to examine the original data. Some detail will almost always be lost as data are reduced to the lowest common denominator, but this tradeoff is weighed against the chance to study a topic of interest through a much broader lens than would otherwise be possible. The capability of researchers to harmonize data is heavily reliant on the availability of the original datasets and the quality of related documentation. If details about the sample design or skip patterns are not
available, for example, fully informed decisions about variable comparability cannot be made. Data harmonization efforts, then, follow the transparency principles set forth above relating to judgments made and the potential effects of original differences or harmonization decisions. When the harmonized dataset is created and shared, the original data and documentation are typically available along with the harmonized data (c.f., the Collaborative Psychiatric Epidemiology Surveys or the Integrated Fertility Survey Series available from ICPSR, or the Integrated Public Use Microdata Series or Integrated Health Interview Series from the Minnesota Population Center).
Sharing Data Across Disciplines Interdisciplinary work has become the norm in many realms of science. Biological scientists, psychologists, and sociologists might work together on a project to examine the effects of traumatic events on brain chemistry and later life outcomes or hormone levels and competitiveness among athletes. It is becoming more common for data collection efforts to include measures that would be of interest to scholars in multiple disciplines, but discovery of data across disciplines, along with standards for metadata and documentation to enable discoverability, is critical to this type of sharing.
FUTURE DIRECTIONS Based on the trends described above, three important themes appear in the future of data sharing, preservation, and secondary analysis: (1) demand for more sharing, (2) continued attention to confidentiality issues, and (3) an increasing presence of proprietary data. The emphasis on data sharing and research transparency is likely to continue to gain momentum. Professional associations now
DATA PRESERVATION, SECONDARY ANALYSIS, AND REPLICATION
include statements about data sharing in their codes of ethics (for example, the American Association for Public Opinion Research, American Economic Association, American Political Science Association, and the American Sociological Association) and the number of journals calling for data deposit as part of the publication process continues to grow. Research sponsors are requiring individual researchers to include Data Management Plans as part of their grant application packets, and those plans must include some form of data sharing. In addition to adding to scientific knowledge through replication, this type of data stewardship ensures that the funding dollars are used as efficiently and effectively as possible. Additionally, organizations such as the Berkeley Initiative for Transparency in the Social Sciences (BITSS) and the Center for Open Science are offering mechanisms through which individuals can register pre-analysis plans. The idea of registering plans before beginning analysis offers a large step forward for the social science community. For example, journals seem more likely to publish works with significant results, which means that work resulting in null results may be repeated many times over by scientists who are unaware that others have run the same models or tested the same hypotheses (Lupia and Elman, 2014). The ease of running statistical analyses allows researchers to engage in data mining or running a number of models and presenting only those with significant results. Having plans registered in advance should minimize these behaviors. Whether the call is for depositing full datasets for secondary analysis, registering analysis plans prior to beginning work, or providing analytic datasets and accompanying program code for replication, this is a trend that is likely to continue into the future. The second important area concerns confidentiality and management of disclosure risk for data that are becoming increasingly complex. Interdisciplinary work, partnered with the inclusion of new methods of data collection, is resulting in datasets that
659
include bio-markers, video observations, GPS coordinates, and information from multiple members of a sampled unit (i.e., family, classroom). These advances in research techniques – in addition to the matching of data based on geographic units or administrative records that has become fairly commonplace – provide researchers with very rich data to analyze the context of behaviors and attitudes, but they also provide the greatest risk to confidentiality seen to date. Methods such as statistical techniques for masking or perturbing data; technological innovations that allow users to conduct analyses in secure, but remote, data environments; and emphasis on data use agreements constantly evolve to meet the challenges of new types and combinations of data. Lastly, proprietary data – data collected for business purposes and thus having commercial value – is becoming a bigger issue in the world of secondary analysis. Similar to geographic or administrative data, purchasing behaviors, cell phone records, GPS trackers, sensor information, and other forms of proprietary data have the potential to add a great deal of context to social research. Journals and professional associations have acknowledged the utility of these data and have put forth guidelines for scholars who wish to publish articles from such data, recognizing that they cannot be shared in the same way as data from other sources. What remains to be seen, however, is whether the research community can persuade these data holders that making their data more easily available for replication and follow-up studies would actually increase the robustness of the initial findings and prove beneficial for the companies themselves. In short, though types of data may change and studies become more complex, it is unlikely that the call for transparency in research and responsible use of funding dollars will change. Therefore, sharing research and analytic data, proper preservation and easy access for shared data, and conducting secondary analyses will remain critical components of the scientific enterprise.
660
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
RECOMMENDED READING Glandon (2011), Lupia and Alter (2014), Lupia and Elman (2014), and National Research Council of the National Academies (2005).
REFERENCES Altman, M. and King, G. (2007). A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine 13(3/4). http://dx.doi.org/10.1045/march2007-altman Anderson, M.S., Martinson, B.C., and DeVries, R. (2007). Normative dissonance in science: results from a national survey of U.S. scientists. Journal of Empirical Research in Human Research Ethics 2(4): 3–14. CODATA/ITSCI Task Force on Data Citation (2013). Out of cite, out of mind: the current state of practice, policy and technology for data citation. Data Science Journal 12: 1–75. http://dx.doi.org/10.2481/dsj.OSOM13-043 Consultative Committee for Space Data Systems (2002). Open Archival Information System (OAIS) standards of the Consultative Committee for Space Data Systems (CCSDS). DeVries, R., Anderson, M.S., and Martinson, B.C. (2006). Normal misbehavior: scientists talk about the ethics of research. Journal of Empirical Research on Human Research Ethics v1(1): 43–50. Elliott, G.C. (2000). Family Violence and Smoking among Young Adolescent Females: An Extension of the Link to Risky Behavior. Washington, DC: American Sociological Association. Executive Office of the President, Office of Science and Technology Policy (2013). Increasing Access to the Results of Federally Funded Scientific Research. Memorandum for the Heads of Executive Departments and Agencies. Issued February 22, 2013. Fu, V.K. and Wolfinger, N.H. (2011). Broken boundaries or broken marriages? Racial intermarriage and divorce in the United States. Social Science Quarterly 92(4): 1096–1117. Glandon, P.J. (2011). Appendix to the report of the Editor: report on the American Economic
Review data availability compliance project. The American Economic Review 101(3): 695–699. Hewitt, M., Devesa, S., and Breen, N. (2002). Papanicolaou Test use among reproductiveage women at high risk for cervical cancer: analyses of the 1995 National Survey of Family Growth. American Journal of Public Health 92(4): 666–669. Inter-university Consortium for Political and Social Research (ICPSR) (2012). Guide to Social Science Data Preparation and Archiving: Best Practice throughout the Data Life Cycle (5th edn). Ann Arbor, MI: ICPSR. http://www.icpsr. umich.edu/files/ICPSR/access/dataprep.pdf Khan, M.R., Doherty, I.A., Schoenbach, V.J., Taylor, E.M., Epperson, M.W., and Adimora, A.A. (2009). Incarceration and high-risk sex partnerships among men in the United States. Journal of Urban Health 86(4): 584–601. King, G. (1995). Replication, replication. PS: Political Science and Politics 28(3): 443–449. Lehrer, E. (2005). Religious Affiliation and Participation as Determinants of Women’s Educational Attainment and Wages. IZA Discussion Paper No. 1725. Bonn, Germany: Institute for the Study of Labor. Lupia, A. and Alter, G. (2014). Data access and research transparency in the quantitative tradition. PS: Political Science and Politics 47(1): 54–59. Lupia, A. and Elman, C. (2014). Openness in political science: data access and research transparency. PS: Political Science and Politics 47(1): 19–42. McCullough, B.D. and McKitrick, R. (2009). Check the Numbers: The Case for Due Diligence in Policy Formation. Studies in Risk & Regulation. The Fraser Institute. Merton, R.K. (1973). The Sociology of Science: Theoretical and Empirical Investigations. Chicago, IL: University of Chicago Press. Miguel, E., Camerer, C., Casey, K., Esterling, K.M., Gerber, A., Glennerster, D.P. … Van der Laan, M. (2014). Promoting transparency in social science research. Science 343(6166): 30–31. National Research Council of the National Academies (2005). Expanding Access to Research Data: Reconciling Risks and Opportunities. Washington, DC: The National Academies Press.
DATA PRESERVATION, SECONDARY ANALYSIS, AND REPLICATION
Tenopir, C., Palmer, C.L., Metzer, L., van der Hoeven, J., and Malone, J. (2011). Sharing data: practices, barriers, and incentives. Proceedings of the American Society for Information Science and Technology 48(1):1–4. UK Data Archive (2011). Managing and Sharing Data: Best Practices for Researchers. Essex, UK. http://www.data-archive.ac.uk/media/2894/ managingsharing.pdf
661
Vinod, H.D. (2001). Care and feeding of reproducible econometrics. Journal of Econometrics 100: 87–88. Wilder, E.I. (2010). A qualitative assessment of efforts to integrate data analysis throughout the sociology curriculum: feedback from students, faculty, and alumni. Teaching Sociology 38(3): 226–246.
41 Record Linkage Rainer Schnell
DEFINITION Record linkage (RL) operations identify the same objects in different databases using common identifiers or unique combinations of variables. Record linkage of surveys usually links responses of a given respondent from a survey to responses of the same respondent in a different survey, or to administrative data relating to the same respondent. Record linkage therefore refers to combining micro data of the same unit. The unit may be not a person, but an organization, a commercial enterprise or a patent application, but the word ‘record linkage’ is generally considered to be confined to micro data on the same unit. Therefore, record linkage is conceptually different from data fusion, in which data of different units is used to generate synthetic data-sets. Finally, record linkage is technically different from enhancing surveys with aggregate data. Since record linkage operates exclusively on a micro data level, privacy is a central topic in record linkage research.
Record linkage has different names in different academic fields. Nearly every possible combination of one word of the set {entity, record, name, identity} with another word of the set {resolution, detection, linkage, deduplication, matching, identification} has been used in the literature and many more labels are in use. Most commonly, statisticians and survey methodologists use ‘record linkage’, while computer scientists describe the process as ‘duplicate detection’ or ‘entity resolution’.
APPLICATIONS In Survey Research, record linkage is widely used for linking survey responses to administrative databases. Usually additional variables from the database are merged with survey data-sets, for example to reduce respondent burden. Other isolated applications of record linkage in Survey Research can be found in the literature, but none of them are widespread.
Record Linkage
663
Some examples of such non-trivial applications might be useful (for details, see Schnell 2014). The intersection of databases can be used for building sampling frames of special populations, for example physicians changing their profession. Such intersections are the core of undercoverage estimation procedures for census operations (Kerr 1998), or for the coverage evaluation of sampling frames in general. Further examples include deduplication of sampling frames and capture–recapture studies for size estimations of rare populations. Since repeated sampling of a population will observe at least some elements more than once, record linkage of these samples can be used for retrospective panel construction (Antonie et al. 2013; Ruggles 2002). This feature of record linkage can be of particular importance for surveys on elusive populations that avoid the collection of identifying characteristics, for example illegal migrants (for a review, see Swanson and Judson 2011). Beyond building sampling frames, record linkage of administrative databases with surveys have been used for the imputation of missing survey responses and the validation of survey responses (for example, Kreiner et al. 2013). Given the increasing availability of survey, administrative and social network data, record linkage might be misused to re-identify survey respondents (Ramachandran et al. 2012). Therefore, record linkage is a standard technique for checking the anonymity of Scientific Use Files (Domingo-Ferrer and Torra 2003).
as names, dates of birth and addresses must be used. These characteristics are neither unique, nor stable over time and often recorded with errors. For example, Winkler (2009: 362) reports that 25 percent of matches in a census would have been missed if only exact matching identifiers had been used.
LINKING WITH AND WITHOUT PERSONAL IDENTIFICATION NUMBERS
Linking with Encrypted Identifiers
If a unique personal identification number (PID) is available, record linkage is technically simple. However, few countries have universal national PIDs (for example, Denmark, Finland, Norway, Sweden). In most other countries pseudo-identifiers such
Linking Using a Trusted Third Party Due to privacy concerns or privacy jurisdiction, record linkage is often performed by a trusted third party (for example, in medical research). The principle idea of using a third party is the separation of personal identifiers (like names) from the subject matter data. The database owners send their databases without personal identifiers but with sequence numbers to the research group. Furthermore, the database owners send the list of personal identifiers and sequence numbers (but without data) to the trustee. The trustee identifies identical persons in the databases and delivers the list of corresponding sequence numbers to the research group, which then links the separate databases (which no longer contain personal identifiers) based on the linkage list of the trustee. Therefore the trustee sees only the sequence numbers and personal identifiers, whereas the research group sees only the sequence numbers and the data. However, the trustee also learns who is a member of which database.
Knowledge gain of the trustee can be prevented by using encrypted identifiers. Standard encrypting functions like MD-5 or SHA-1 (Katz and Lindell 2015) are widely used for the encryption of identifiers. However, if the personal identifiers are prone to typing or OCR errors, even a small error will result in a completely different encoding after encryption. Therefore, a common
664
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
technique is the use of phonetic encodings for names (Christen 2012a): slightly different names are transformed according to phonetic rules, so that very similar variants of a name form a common group. For almost 100 years, the most popular phonetic function has been Soundex, patented as an indexing system by Russel (1918). As an example, the names Smith, Smithe, Smithy, Smyth, Smythe all have the Soundex code S530, but the name Smijth has the code S523. For record linkage, phonetic codes are normally additionally encrypted with a cryptographic function such as SHA-1. Cases with non-matching phonetic codes (in the example, Smijth [S523] and Smith [S530]) will not be matched by this procedure. Such a combination of standardization, phonetic encoding and encryption of the phonetic code will result in acceptable linkage results for many applications. Variants of this techniques are widely used in medical applications (Borst et al. 2001). Of course, persons appearing with entirely different names in the two databases will be lost (for example, in some cultures, name changes due to marriage). If these non-matches are related to variables of interest, the resulting missing information cannot statistically be considered as missing at random. One example are neonatal databases: children of mothers with changes in their names might be different from children of mothers with no change in their names, or different from mothers whose names are missing (Ford et al. 2006). In such applications, additional techniques have to be used.
STEPS IN A RECORD LINKAGE PROCESS Most record linkage operations follow a standard sequence: 1 Before any matching can be done, the data files must be standardized (‘preprocessing’). 2 Since not all possible pairs of records can be compared, subsets of potential pairs are selected (‘blocking’).
3 Within these subsets, similarity measures between records are computed. 4 Records whose similarity exceeds certain thresh olds are considered as belonging to the same units.
The estimation of those thresholds is the most demanding mathematical task in record linkage. The actual merging of records is technically trivial. Due to missing identifiers or missing blocking variables, more than one blocking strategy usually has to be used, thus making record linkage an iterative process. Even after many iterations of different blocking strategies, some records remain unlinked. In many applications, these remaining records are checked manually. Each step will be explained in more detail in the following subsections.
Preprocessing Different databases usually have different formats, adhere to varying standards concerning the handling of incomplete information and the coding of identifiers. Therefore, the fields containing information on identifiers have to be standardized. This processing step in record linkage is called preprocessing. At this stage, character encodings and dates are unified, umlauts transcribed, uppercase/lowercase characters transformed, punctuation removed, academic and professional titles extracted and assigned to special fields, address data is parsed according to a standard format and corrected (for example, Gill 2001; Randall et al. 2013). Some systems replace nicknames. Furthermore, in some applications gender is imputed using names if necessary. Frequently, fields contain nonstandard values to indicate missing information; these codes have to be replaced by standard values. Preprocessing is a necessary step in every linkage operation. Non-experts tend to underestimate the time and efforts required for preprocessing by far. If national databases have to be linked for the first time, preprocessing
Record Linkage
might require months of labor. This is especially true for address information. If the linkage has to use address information, the preprocessing effort required increases sharply. The initial compilation of national standardized address information alone might cost millions of dollars (an example for this is the 2011 German Census, see Kleber et al. 2009). Therefore, preprocessing address data is an industry in itself in most countries without personal identification numbers and centralized registries. Although preprocessing systems may be commercially available (offered for example by postal services), in Europe, the usage of these systems for academic purposes or official statistics is often limited due to privacy protection laws.
Blocking Comparing each record of one file with each record of a second file requires n × m comparisons. Even with 10,000 comparisons per second, linking a national database with ten million records and a survey with 10,000 respondents would take nearly 4 months of computing. In practice, record linkage is therefore usually based on comparisons within small subsets of all possible pairs. Techniques for selecting these subsets are called blocking methods (Christen 2012b). For example, only persons born on the same day are compared. Since this kind of blocking (called standard blocking) relies on errorfree blocking variables, different blocking variables (for example, postcodes, sex, date of birth, phonetic codes of names) are used sequentially. Blocking substantially reduces the time needed for linkage. On the other hand, different blocking strategies will yield different subsets of correct and incorrect linked record pairs. Therefore, blocking will generally result in the loss of some correct links. Since currently no optimal strategy for the use of different techniques is known, the applied sequence of blocking criteria is usually the result of trial-and-error.
665
Similarity Measures To compare different names, measures of similarity between names are needed. There are numerous string similarity measures, but many studies have shown the superior performance of the Jaro–Winkler string comparator (Winkler 1995). This measure uses a weighted sum of the number of characters the strings have in common and the proportion of transposed characters, whereby the first characters of the strings receive more weight.
Classification To decide whether a potential pair of records should be considered a match, the measures of similarity of all identifiers have to be combined numerically. Though different computational methods have been used, for most applications today the standard method is probabilistic record linkage (see Herzog et al. 2007 for a textbook). This approach is based on a statistical decision theory (Fellegi and Sunter 1969) for estimating optimal weights for a decision on potential matching records. Current implementations of the probabilistic record linkage technique rely on expectation–maximization algorithms (Herzog et al. 2007; Jaro 1989). Estimating optimal parameters usually requires a lot of experimentation for given databases and is by no means an automated procedure (Newcombe 1988). There are many variants of the basic technique for special circumstances, for example when it is certain that each record from a database A is contained in database B. In this case, adjusting the weights is recommended (one-by-one matching). If multiple databases have to be linked, sequentially pairwise linking is a suboptimal procedure and should be avoided (Sadinle and Fienberg 2013).
Clerical Edit Even after many iterations of blocking and threshold modifications, in most applications a subset of unlinked records remains. Generally
666
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
speaking, at least a subset of the remaining records should be checked by humans to determine whether further links can be made. For example, for some records additional information might be used to compensate otherwise missing identifiers (Ong et al. 2014). Clerical edit is very labor-intensive and is usually restricted to small subsets of the datasets. Clerical edit should be supported by appropriate software tools like record browsers with the ability to search using regular expressions.
PRIVACY PRESERVING RECORD LINKAGE (PPRL) Record linkage methods that allow for errors in the identifiers and do not reveal information on personal identity are needed to avoid privacy concerns. This field of research is called privacy preserving record linkage (PPRL; for an overview see Christen 2012a: 199–207). For real world settings, only a few of the proposed procedures are suitable. A procedure used successfully in different practical settings (see for example Randall et al. 2014) has been suggested by Schnell et al. (2009). The identifiers are split into a set of substrings with the length 2 and then mapped by 10–30 different cryptographic functions into binary vectors (‘BloomFilters’). Each identifier is mapped to one Bloom-Filter. Only the Bloom-Filters are compared during record linkage. However, all basic PPRL approaches seem to be vulnerable to cryptographic frequency attacks (Niedermeyer et al. 2014). Therefore, current PPRL approaches try to make cryptographic attacks much more difficult. For example, if one common Bloom-Filter for all identifiers is used (Schnell et al. 2011), successful attacks on this protocol are more challenging (Kroll and Steinmetzer 2014). Revising PPRL protocols to make them more resilient against attacks is an active field of research. A specific problem of PPRL is privacy preserving blocking. With datasets exceeding
millions of records, such as administrative databases, blocking becomes unavoidable. Very few blocking techniques are suitable for data encoded within PPRL approaches. The baseline procedure is a technique called ‘sorted neighborhood’ (Hernández and Stolfo 1998). Here, both databases are pooled and sorted according to a blocking key. A window of a fixed size is then slid over the records. Two records from different input files form a candidate pair if they are covered by the window at the same time. With modern blocking techniques, even census databases can be linked with a PPRL technique in acceptable time (Schnell 2014).
USING RECORD LINKAGE IN PRACTICE Although administrative databases and datasets in Official Statistics may include millions of records, in Survey Research, the number of records in datasets is usually smaller. In nonprofessional settings, especially with small datasets, sometimes record linkage is attempted manually. This will only work if complete unique identifiers without errors are available for all records in all databases. Manually linking is error prone and labor-intensive. Even manually checking a presumed link takes a few minutes for each case (Antonie et al. 2013). Therefore, computer-based record linkage instead of manually linking is nearly always the recommended procedure for Survey Research.
Software Record linkage for applied research should be conducted with professional record linkage software. Many ad hoc and academic solutions are not suitable for large databases. Due to different requirements, a general recommendation for choosing among the available commercial and academic programs cannot
Record Linkage
be given. However, most textbooks on record linkage do give some advice for choosing software (for example, Christen 2012a).
Record Linkage Centers In general, applied research groups should consider the use of dedicated research infrastructure for record linkage. Some countries have established national research centers for the linkage of administrative databases with surveys. For example, the Administrative Data Liaison Service (ADLS) in the UK and the German Record Linkage Center (GRLC) provide services for research requiring record linkages. Since linkage operations usually have to adhere to legal requirements such as controlled access, security clearances and approved procedures, research groups attempting record linkages of large administrative databases in Europe will benefit from centers such as ADLS and GRLC.
667
linking of their survey responses to other databases. Large differences in the proportion of consenting respondents have been observed. In a review of the literature, Sakshaug and Kreuter (2012) reported between 24 percent and 89 percent consent to linkage for different surveys. Refusing the linkage request is a kind of nonresponse and therefore sensitive to minor variations in the interview situation, such as the position of the request within an interview, the social skills of the interviewer, and last but not least the respondents’ trust in the survey organization and survey sponsor. If no obvious rational interest of a respondent prevents a linkage of survey responses, the bias due to non-consent to linkage is likely to be small. However, it is to be expected that consent rates will decrease with an increase in linkage requests.
FUTURE DEVELOPMENTS Required Effort Even with the assistance of Record Linkage Centers, linking surveys and administrative databases usually takes much longer than expected. Depending on the data quality and the cost of missed matches, preprocessing may require anything from a few days to many months of labor. After preprocessing linking of national databases can be done in at most a few days even using standard hardware. The technical linkage of two surveys rarely needs more than a few hours after preprocessing. However, it seems to take two years on average to get all required approvals for record linkage on a national level in most European countries.
Respondents’ Permission for Linkage Under most circumstances, the legal situation is simpler if the respondents agree to the
Due to the many advantages of linking surveys to administrative, business and mobility data, the number of linked surveys will increase. Since privacy preserving record linkage is a very active research field in computer science, major technical advances are likely. Obtaining the legal permission to link surveys and databases on a national level despite increasing privacy concerns of citizens will remain the main challenge for record linkage projects.
RECOMMENDED READINGS The current state of the art textbook on record linkage from a computer science perspective has been written by Christen (2012a). Herzog et al. (2007) approach record linkage from a more statistical point of view. Applications of record linkage in survey research are described by Schnell (2015).
668
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
REFERENCES Antonie, L., Inwood, K., Lizotte, D.J., and Andrew Ross, J. (2013). Tracking people over time in 19th century Canada for longitudinal analysis. Machine Learning, 95 (1), 129–146. Borst, F., Allaert, F-.A., and Quantin, C. (2001). The Swiss solution for anonymous chaining patient files. In: V. Patel, R. Rogers, and R. Haux (eds), Proceedings of the 10th World Congress on Medical Informatics (pp. 1239– 1241). Amsterdam: IOS Press. Christen, P. (2012a). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Berlin: Springer. Christen, P. (2012b). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24 (9), 1537–1555. Domingo-Ferrer, J. and Torra, V. (2003). Disclosure risk assessment in statistical microdata protection via advanced record linkage. Statistics and Computing, 13 (4), 343–354. Fellegi, I.P. and Sunter, A.B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183–1210. Ford, J.B., Roberts, C.L., and Taylor, L.K. (2006). Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatric and Perinatal Epidemiology, 20 (4), 329–337. Gill, L. (2001). Methods for Automatic Record Matching and Linkage and their Use in National Statistics. Newport: Office for National Statistics. Hernández, M.A. and Stolfo, S.S. (1998). Realworld data is dirty: data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2, 19–37. Herzog, T.N., Scheuren, F.J., and Winkler, W.E. (2007). Data Quality and Record Linkage Techniques. New York: Springer. Jaro, M.A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84 (406), 414–420. Katz, J. and Lindell, Y. (2015). Introduction to Modern Cryptography, 2nd edn. Boca Raton, FL: CRC Press.
Kerr, D. (1998). A review of procedures for estimating the net undercount of censuses in Canada, the United States, Britain and Australia. Demographic Documents, No. 5, Statistics Canada. Kleber, B., Maldonado, A., Scheuregger, D., and Ziprik, K. (2009). Aufbau des Anschriftenund Gebäuderegisters für den Zensus 2011. Wirtschaft und Statistik, 7, 629–640. Kreiner, C.T., Lassen, D.D., and Leth-Petersen, S. (2013). Measuring the accuracy of survey responses using administrative register data: evidence from Denmark, Working Paper No. 19539, National Bureau of Economic Research. Kroll, M. and Steinmetzer, S. (2014). Automated cryptanalysis of bloom filter encryptions of health records, Working Paper WP-GRLC-2014-05, German Record Linkage Center, Nuremberg. Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business. Oxford: Oxford University Press. Niedermeyer, F., Steinmetzer, S., Kroll, M., and Schnell, R. (2014). Cryptanalysis of basic Bloom filters used for privacy preserving record linkage. Journal of Privacy and Confidentiality, 6 (2), 59–79. Ong, T.C., Mannino, M.V., Schilling, L.M., and Kahn, M.G. (2014). Improving record linkage performance in the presence of missing linkage data. Journal of Biomedical Informatics, 52, 43–54. Ramachandran, A., Singh, L., Porter, E.H., and Nagle, F. (2012). Exploring Re-identification Risks in Public Domains. Center for Statistical Research & Methodology Research Report Series (Statistics #2012–13). US Census Bureau. Randall, S.M., Ferrante, A.M., Boyd, J.H., Bauer, J.K., and Semmens, J.B. (2013). The effect of data cleaning on record linkage quality. BMC Medical Informatics and Decision Making, 13 (1), 64–73. Randall, S.M., Ferrante, A.M., Boyd, J.H., Bauer, J.K., and Semmens, J.B. (2014). Privacypreserving record linkage on large real world datasets. Journal of Biomedical Informatics, 50, 205–212. Ruggles, S. (2002). Linking historical censuses: a new approach. History and Computing, 14 (1–2), 213–224.
Record Linkage
Russel, R.C. (1918). Untitled patent US1261167 (A) 1918-04-02. Sadinle, M. and Fienberg, S.E. (2013). A generalized Fellegi–Sunter framework for multiple record linkage with application to homicide record systems. Journal of the American Statistical Association, 108 (502), 385–397. Sakshaug, J.W. and Kreuter, F. (2012). Assessing the magnitude of non-consent biases in linked survey and administrative data. Survey Research Methods, 6 (2), 113–122. Schnell, R. (2015). Linking surveys and administrative data. In: U. Engel, B. Jann, P. Lynn, A. Scherpenzeel, and P. Sturgis (eds), Improving Surveys Methods: Lessons from Recent Research (pp. 273–287). New York: Routledge. Schnell, R. (2014). An efficient privacypreserving record linkage technique for administrative data and censuses. Statistical Journal of the IAOS, 30 (3), 263–270.
669
Schnell, R., Bachteler, T., and Reiher, J. (2009). Privacy-preserving record linkage using bloom filters. BMC Medical Informatics and Decision Making, 9 (41), 1–11. Schnell, R., Bachteler, T., and Reiher, J. (2011). A novel error-tolerant anonymous linking code, Working Paper WP-GRLC-2011-02, German Record Linkage Center, Nuremberg. Swanson, D.A. and Judson, D.H. (2011). Estimating Characteristics of the ForeignBorn by Legal Status. Dordrecht: Springer. Winkler, W.E. (1995). Matching and record linkage. In: B.G. Cox, D.A. Binder, B. Nanjamma Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott (eds), Business Survey Methods (pp. 355–384). New York: Wiley. Winkler, W.E. (2009). Record linkage. In: D. Pfeffermann and C.R. Rao (eds), Handbook of Statistics Vol. 29A (pp. 351–380). Amsterdam: Elsevier.
42 Supplementing Cross-National Survey Data with Contextual Data J e s s i c a F o r t i n - R i t t b e r g e r, D a v i d H o w e l l , S t e p h e n Q u i n l a n a n d B o j a n To d o s i j e v i ć
INTRODUCTION The increasing availability and use of contextual data mark important developments for social scientists as they open the door to leveraging cross-national public opinion datasets in new and exciting ways. By adding contextual information, researchers can take into account socio-economic and institutional settings and how such contexts affect survey respondents. The importance of context in understanding behavior should not be underestimated: contextual variables are becoming indispensable across the social science disciplines, from economics to education. Sociologists use contextual data to trace the effect of groups and structures on individual outcomes (Erbring and Young, 1979), and political scientists have a long tradition of utilizing contextual data in single country studies of voting (reviewed in Achen and Shively, 1995). The inclusion of contextual data in crossnational surveys has a short history. This has
not been for lack of interest, as over four decades ago Stein Rokkan (1970) argued for the use of cross-national surveys to examine ‘the structural contexts of the individuals’ reactions to politics’ (p. 15). Rather, this shortfall was due to both a lack of access to such data and a shortage of appropriate techniques to analyze them. Since that time, access to contextual data has been greatly facilitated by the information age, the advent of the Internet, and efforts by many governments to make relevant data more available to the public. Contextual data can now be easily accessed from national statistics offices and government websites, as well as from public databases assembled by organizations such as Eurostat, the International Telecommunications Union, the United Nations, and the World Bank. The availability of geolocated contextual data has also been a boon to such efforts.1 Given that contextual data is now more accessible than ever before, cross-national surveys are more commonly including such information within their datasets.
SUPPLEMENTING CROSS-NATIONAL SURVEY DATA WITH CONTEXTUAL DATA
The aim of this chapter is to explore the use of contextual data by cross-national surveys, detailing the advantages and disadvantages, as well as providing some guiding principles for achieving the best results. This is illustrated by examples from a number of projects, especially the Comparative Study of Electoral Systems (CSES).2
CONTEXTUAL DATA Contextual data can be divided into two major subtypes, each of which we describe below:
Aggregated Data Aggregated contextual data combine information about lower-level units into a higherlevel unit in a way that summarizes the properties of the lower-level units (DiezRoux, 2002; Rydland et al., 2007). A common method of aggregation is by geography – for example, taking information about each of the regions of a country and combining it into a variable at the country level. Aggregated contextual data come in two forms. The first type is exogenous to individual-level units – for instance, economic indicators such as gross domestic product (GDP) or unemployment statistics for a country. The second type refers to the social setting in which individuals operate, and is ‘created by the aggregation of individual traits’ (Shively et al., 2002: 220), allowing the indirect capture of certain variables which would be otherwise difficult to measure. These data can be taken from small social groups, neighborhoods, or regions, and aggregated spatially or organizationally up to an entire country.
System-Level Data System-level contextual data are used to depict the properties of systems or
671
institutions. In contrast with aggregated data, institutional contexts ‘… are taken to be exogenous to individual behavior and the level at which we measure institutional rules is determined by the unit and level of analysis at which research is being conducted’ (Shively et al., 2002: 220). System-level data are not aggregated up from lower-level units, but are drawn from sources such as constitutions, electoral commission documentation, legal rules, or government information pages. In political science, examples include the type of government and characteristics of a country’s electoral system.
THE CURRENT STATE OF CONTEXTUAL DATA IN CROSS-NATIONAL SURVEY PROGRAMS A number of theories in the social sciences are based on the assumption that individuals are influenced by the context in which they operate, hence Dogan’s (1994) call that ‘survey data and aggregate data should be combined whenever possible’ (p. 58). Increased accessibility of contextual data, favorable cost–benefit ratios for attaching it to surveys, and development of empirical techniques to accommodate multi-level analysis (e.g., Jusko and Shively, 2005) has resulted in contextual data becoming more prominent in cross-national survey programs. However, it was not always so. Early attempts to combine individual survey data with contextual data were generally within single country studies. Examples can be found in the study of educational outcomes (Raudenbush and Willms, 1995), political behavior (Butler and Stokes, 1969), and racial attitudes (Taylor, 1998). Yet, with the increasing availability of such data, this has been changing. Today, a number of cross-national survey research projects present combined individual and contextual-level data. One of the leaders in this field has been the European Social Survey (ESS).3 The ESS offers
672
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
data on demography, education, employment, economic development, and characteristics of the political system (Rydland et al., 2007). Contextual data in the ESS have been used to investigate diverse topics across multiple disciplines, including whether overall levels of immigration exert an influence on individual opinions about immigration (Cides and Citrin, 2007), and the impact of setting on self-reported life satisfaction (KöötsAusmees, Realo, and Allik). The Afrobarometer,4 a survey about the social and political attitudes of people across African states, includes measures detailing the availability of schools, sewage systems, and electricity in the geographic regions around its survey respondents. Meanwhile, the Generations and Gender Programme (GGP),5 a longitudinal cross-national survey which focuses on the relationships between parents and children and partners, attaches data on legal regulations, economic and cultural indicators, welfare state policies, and institutions. Another study, the Six Country Immigrant Integration Comparative Survey (SCIICS),6 offers variables to measure the effects that context has on immigrant integration, with a special focus on the ethnic composition of a respondent’s place of residence. In political science, advances in combining surveys and contextual data in a comparative framework have been driven by the field of electoral behavior, where the influence of context on individuals’ attitudes and behaviors has been well documented (Achen and Shively, 1995; Huckfeldt and Sprague, 1993; Pattie and Johnston, 2000; Steenbergen and Jones, 2002). Consequently, electoral scholars have been at the forefront in combining their survey datasets with contextual components. For example, The Comparative National Elections Project (CNEP)7 provides information about the media and associational structures of the countries it covers. The European Election Studies (EES)8 project, which focuses on European Parliamentary elections, links up with the project entitled Providing an Infrastructure for Research on Electoral Democracy in the
European Union (PIREDEU)9 to provide its users with information on electoral laws and party system characteristics as well as various economic and socio-cultural indicators. The Comparative Manifestos Project (CMP),10 an exhaustive set of measures about political party positions, allows linking of its data with cross-national surveys such as the Eurobarometer11 and World Values Survey (WVS).12 The Comparative Candidates Survey (CCS),13 a cross-national survey of political candidates, includes data about each candidate’s electoral constituency as well as about the national political system. More so than other political science research initiatives, the Comparative Study of Electoral Systems (CSES) project has been a pioneer in combining contextual data with cross-national survey data. Founded in 1994, the CSES project helps understand how individual attitudes and behavior – especially in regards to voting and turnout – vary under different electoral systems and institutional arrangements. The first three modules of CSES include 130 post-election surveys from 51 countries. The CSES augments its survey data with contextual information about the electoral district, national electoral system, and other social, political, and economic conditions that characterize each participating country and election. Researchers have used CSES to investigate questions such as whether some electoral systems foster closer ideological congruence between representatives and citizens (Blais and Bodet, 2006; Golder and Stramski, 2010) or impact individuals’ decision to vote (Banducci and Karp, 2009; Dalton, 2008; Fisher et al., 2008). As with all data sources, researchers should seek to evaluate the quality and reliability of contextual data sources prior to use, with an eye to the transparency in description, origin of materials, consistency of methods, completeness of observations, and crossnational comparability (Herrera and Kapur, 2007). The remainder of our contribution details the advantages of contextual data as well as some of the challenges practitioners will face.
SUPPLEMENTING CROSS-NATIONAL SURVEY DATA WITH CONTEXTUAL DATA
ADVANTAGES OF CONTEXTUAL DATA WITHIN CROSS-NATIONAL SURVEYS Substantive Advantages Inclusion of contextual data has a number of substantive advantages. It allows analysts to disentangle causal heterogeneity – in other words, whether relationships observed at the individual level are conditioned by context. After all, a long-standing tradition in the social sciences, from sociology to political science and psychology, views individual behaviors and attitudes as contingent upon the social and institutional context in which individuals find themselves (Barton, 1968; Erbring and Young, 1979). This overcomes a main weakness of pooled surveys, namely that individuals are treated as independent observations, forcing the practitioner to conduct analyses at the individual level even knowing that most theoretical frameworks assume behavior to be linked to context. Thus, as Vanneman (2003) observes, ‘the trend towards more contextualized survey data is exciting because it presents a more realistic picture of social life than the atomized individuals that used to dominate our survey designs’ (pp. B-83). Attaching contextual data to cross-national surveys further enables comparison by allowing the testing of propositions controlling for country and regional factors. This provides greater confidence in the generalizability of findings. It also opens up new frontiers in research. For instance, in identifying the contextual factors that affect the manifestation of attitudes towards immigrants and immigration (Ceobanu and Escandell, 2010) or those of nationalism and national identification (Staerklé et al., 2010). These research designs can provide evidence that is relevant not only for basic science but for policy. For example, decisions in regards to electoral system reform benefit from knowing the associations between system features and public attitudes and behaviors (Bowler and Donovan, 2012; Gallagher, 1998; Karp, 2013).
673
Methodological Advantages In cross-national survey data, individual observations are ‘nested’ within countries, although regression modeling assumes that observations are independent of one another. Failure to take account of this data structure when modeling the data could result in an estimation of incorrect standard errors and an increased likelihood of type-I errors (Hox, 2010; Steenbergen and Jones, 2002). The inclusion of contextual data requires practitioners to address the problem of clustering of individual level observations head-on by compelling researchers to consider a multilevel modeling strategy, and thus allows practitioners to decompose the variance into estimates at the individual and contextual level. While this does not offer a complete solution to the problem, it does permit practitioners to address the multi-level character of the data which has implications for model efficiency (Steenbergen and Jones, 2002: 219–220).
CHALLENGES OF CONTEXTUAL DATA WITHIN CROSS-NATIONAL SURVEYS Organizational Challenges While the situation has improved in recent years, there remains an advanced industrial democracy bias in terms of data availability and reliability, with contextual data more often partially or wholly unavailable in developing countries. Data is also sometimes scarcer for territories that are not formally independent (e.g., Taiwan). Accessing and using this data is not always a straightforward process (Rydland et al., 2007: 3). While most databases provide documentation detailing classifications and sources, the detail provided can be inconsistent. For example, Rydland et al. (2007: 9) observed that the Adult Literacy variable included in the ESS and sourced from the World Bank’s World Development
674
THE SAGE HANDBOOK OF SURVEY METHODOLOGY
Indicators14 had a definition of literacy that was unclear, open to misinterpretation, and could pose difficulties for comparative analysis. A concern was that this issue was not obvious to all analysts using the data. This puts pressure on cross-national survey programs and their users to pay special attention to the reliability and documentation of any issues associated with contextual information gathered from external sources. Analysts also need to be cognizant of original contextual data developed and provided by the cross-national survey programs themselves. As with external data sources, the extent and level of detail of documentation provided varies from project to project. For its part, the CSES errs on the side of caution, providing extensive details of all of its variables in its codebook, including the contextual modules. In the CSES codebook, the data source for each country is reported, along with the date that the data was added, definitions of key terms, and notes for election studies in which deviations are observed. The timing of data releases is another issue to contend with. When a dataset is released, contextual data may be available for recent years in some countries but not in others. In these circumstances, the programs or analysts may be tempted to use alternative sources of data to supplement the missing values, but this can create problems in terms of comparability (see Herrera and Kapur, 2007). In the CSES, when contextual data is unavailable for the year in question and should the project obtain the data from another source, it is clearly stated in the documentation. This documentation empowers analysts to make their own decisions as to whether to incorporate the observations or not in their analyses.
Methodological Challenges Sometimes there is a lack of clear and consistent guidelines in terms of deciphering ‘the correct’ classification for contextual variables. The simplest case is when the variables
are inherent features of the system – for example, institutional variables such as the type of electoral system employed in a country. Here, the main issue is to decipher an appropriate fit between coding schemes and specific cases (for example, it may be a matter of judgment whether a particular system is better categorized as a proportional or a mixed electoral system). It is more complicated when the variable in question could be operationalized in a number of different ways. As an example, consider voter turnout, in which the denominator can be understood as the number of eligible voters according to the electoral register, the voting age population, or by the eligible voting age population. While all three are legitimate measures of electoral participation, the choice of denominator can lead to very different turnout estimates (e.g., McDonald and Popkin, 2001). In this particular circumstance, the CSES adopts a policy of providing multiple measures of turnout, leaving it to the analyst to decide which one to use in their analysis. The classification controversies do not end here, as some measures can be conceived of in a multidimensional manner (as examples, the concept of democracy or the concept of corruption) while other measures could be influenced by ideological partiality. Freedom House’s Freedom in the World15 country ratings are widely used and valued but have proved controversial, being criticized on both partiality and dimensionality grounds (see Bollen and Paxton, 2000). Equally problematic is that concepts may be operationalized in various manners. For instance, GDP estimates for a country in a particular year can vary depending on the agency from which the data point is sourced (Rydland et al., 2007). While the quality of contextual data has greatly improved over time (Atkinson et al., 2002; Harkness, 2004), cross-national and over-time comparison for some contextual measures remains difficult. Some data providers periodically update their contextual measures retroactively, when estimates based on updated information become available.
SUPPLEMENTING CROSS-NATIONAL SURVEY DATA WITH CONTEXTUAL DATA
Others may modify their methodology at certain points in time. In 2012, for instance, Transparency International changed the methodology for constructing their Corruption Perceptions Index (CPI),16 resulting in the new country scores being incomparable with earlier years and previous editions. More controversial, however, is when the contextual measures are indirectly measured concepts, often expressed in the form of complex indexes based on aggregation of ‘expert judgments’, or ‘trained coders’. Take for instance measures of democracy issued by the Polity IV Project17 (Marshall et al., 2014). Polity IV uses multiple sources for its data with coding undertaken by numerous coders. Where coding discrepancies arise, they are ‘discussed collectively to refine concepts and coding guidelines and enhance coder training’ (Marshall et al., 2014: 6). And while inter-coder reliability tests are performed, only the outcome scores are published, with little information about reliability and inter-coder variance. It is, in fact, a common problem that most estimate-based contextual data does not come with measures of uncertainty. For example, a frequently used country-level variable, the UNDP Human Development Index (HDI),18 provides point estimates but no information about uncertainty and variability of the individual country estimates. A second example, the CMP, also initially did not provide uncertainty estimates. This has spawned a debate between scholars (reviewed in Volkens et al., 2013) and resulted in analysts paying greater attention and developing methods for the estimation of uncertainty for CMP data. Two examples of progress in this regard are the CPI and the Standardized World Income Inequality Database (SWIID).19 The CPI measures the perceived levels of public sector corruption by aggregating data from a number of different sources. In addition to providing the point estimate, the data includes estimates of standard errors, confidence intervals, the number of surveys used for a particular point estimate, and also source codes.
675
The SWIID provides inequality estimates with associated uncertainty represented by 100 separate imputations of the complete series. This gives users greater power in (a) how to interpret this variable and (b) how to make appropriate choices in terms of modeling this variable. Yet this information about uncertainty is still rarely used in analyses. Ideally, most contextual data sources, in particular those devised as a result of indices or ‘expert judgments’, would have uncertainty estimates attached to the point estimates and that they would be included in the datasets to provide practitioners with more information and give them greater control. There is also the important matter of validity, a particular concern when contextual measures are based on the aggregation of individual scores from public opinion surveys. The idea here is that aggregated scores (e.g., country averages) may reflect a phenomenon conceptualized on and relevant for contextuallevel units. For instance, the average scores on the materialism- postmaterialism scale may be taken as indicators of political culture on the national level (as done by Inglehart, 1990). However, certain matters need to be borne in mind. First, there is a conceptual leap from the individual level to contextual level, which requires theoretical justification that aggregated individual measures indeed operationalize the hypothesized contextuallevel phenomenon. Second, the aggregation of individual-level data measuring political culture to the level of nations relies on the assumption that countries are culturally homogeneous – given that only one data point is generated, usually the mean – and thus ignores cultural differences within societies (Silver and Dowley, 2000).
Analytical Challenges
Analysis of contextualized survey data continues to place new demands on researchers’ skillsets. While in the past available statistical methods and computing power
limited the possibilities, the problem today is mainly about which analytical strategy to pursue. Pitfalls range from the use of simpler and more familiar methods when more robust and powerful approaches are available, to the other extreme where complex methods are used when simpler approaches would in fact have been more appropriate to answer the research question. Analyses of theories and/or data that pertain to multiple levels are exposed to the danger of committing cross-level fallacies, namely the individualistic fallacy and the ecological fallacy. The former is committed by unwarranted generalizations from individual-level relationships to higher-level relationships. Meanwhile, the ecological fallacy is the inferring of individual-level relationships on the basis of aggregate-level observations.20 As Subramanian et al. (2009) suggest, awareness of the cross-level fallacies ‘should lead to a “natural” interest in multilevel thinking and modelling cross-level processes’ (p. 343).
Yet, multi-level analytical strategies are not a panacea. Selection of appropriate analytical methods, e.g., a single-level (aggregate or individual-level) model, or a multi-level model in its various forms (e.g., with random intercepts only, or random intercepts and slopes), requires close scrutiny both of theory and of the available data. However, a more technical difficulty is that aggregate-level units are often not randomly selected. In cross-national public opinion surveys, the included countries have virtually always been samples of convenience. This can affect the robustness of research findings, although there are different schools of thought about the weight that should be given to this problem. Lucas (2014), for one, argues for a strict approach: ‘Once we recognize that many probability sample designs produce non-probability samples for the MLM [multi-level modelling], we need no longer waste time estimating MLMs on data for which findings will be biased in an unknown direction and for which any hypothesis tests are unjustified’ (p. 1645). While there is no
obvious solution to this, as cross-national research will always be subject to samples of convenience at the country level due to resources and interest, the key point we wish to emphasize is that practitioners need to factor this into their thinking when analyzing cross-national surveys that include contextual data.
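To make the modelling options mentioned above concrete, here is a minimal sketch in Python with statsmodels, assuming a hypothetical pooled file and hypothetical variable names y, x, and country; it contrasts a random-intercept specification with a random-intercept-and-slope specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical pooled cross-national file: one row per respondent, an outcome y,
# an individual-level predictor x, and a country identifier.
df = pd.read_csv("pooled_survey.csv")

# Random intercepts only: countries differ in their baseline level of y.
ri_model = smf.mixedlm("y ~ x", data=df, groups=df["country"]).fit()
print(ri_model.summary())

# Random intercepts and slopes: the effect of x is also allowed to vary by country.
rs_model = smf.mixedlm("y ~ x", data=df, groups=df["country"], re_formula="~x").fit()
print(rs_model.summary())
```

Whichever specification is chosen, the caution raised by Lucas (2014) still applies: when the countries are a convenience sample, the country-level variance components and any cross-level inferences should be interpreted conservatively.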
FUTURE FRONTIERS
With cross-national surveys increasingly including contextual data components, the objective of this chapter has been to draw practitioners’ attention to the advantages of adopting an individual-contextual research strategy while also being cognizant of the significant challenges and complexities involved. One key take-away point from our analysis is that researchers should be provided with the most comprehensive information available concerning the conceptualization and measurement of contextual variables, enabling them to make the ultimate call about the suitability of these measures for their particular analyses. Current trends indicate that more and more cross-national survey programs will include contextual data in their datasets. With this in mind, projects utilizing contextual measures should seek to ensure that accurate and detailed documentation is available to analysts, that conceptual clarity is improved whenever possible, and that measures of uncertainty are provided where possible.
With improvements in data availability and more layers of data becoming available (for example, regional data and electoral-district data) comes a series of challenges. Accurate data linkage depends on the availability of co-occurring unique identifiers. For example, while some countries have regional data easily accessible, most do not, especially those outside of Europe. The challenge will therefore be to ensure that a wide range of countries make these data available, and that the
sub-national units chosen are standardized across sources (e.g., the Nomenclature of Territorial Units for Statistics in the European Union), so that cross-national survey research can fully take advantage of this opportunity (see the linkage sketch at the end of this section). Considering the vast technological advances that have enabled the collection of huge amounts of personal data, including the possibility to spatially locate survey respondents, complications associated with respondent confidentiality have become a more salient topic. These developments have spurred a greater degree of interest in disclosure risk procedures (De Capitani di Vimercati et al., 2011; Robbin, 2001). Numerous means of anonymizing sensitive data have been suggested, ranging from confidential editing techniques (De Capitani di Vimercati et al., 2011) to the removal of unique identifiers and the adjustment of spatial data coordinates (Gutmann et al., 2008). However, there is a debate about the applicability and advantages/disadvantages of each of these proposed solutions. Nonetheless, we suggest that practitioners using cross-national survey data with macro components will have to become more attentive to these issues as detailed data becomes more freely available. All that being said, attaching geolocated data to survey interviews has the potential to spur substantial theoretical advances. The challenge posed by the increased availability of personal and spatial data will be to balance these two competing ends.
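As an illustration of the linkage step discussed above, here is a minimal sketch in Python with pandas; the file names and the iso3c, nuts2, and year identifier columns are hypothetical, standing in for whatever standardized country and regional codes a given project actually carries.

```python
import pandas as pd

# Hypothetical inputs: survey microdata with country/region identifiers, plus two
# contextual files keyed on the same identifiers.
survey = pd.read_csv("survey_microdata.csv")        # columns include: iso3c, nuts2, year, ...
country_ctx = pd.read_csv("country_context.csv")    # columns: iso3c, year, gdp_pc, polity_score
region_ctx = pd.read_csv("region_context.csv")      # columns: nuts2, unemployment_rate

# Country-level linkage on a standardized country code and survey year.
merged = survey.merge(country_ctx, on=["iso3c", "year"], how="left", validate="many_to_one")

# Regional linkage on a standardized sub-national code such as a NUTS-2 identifier.
merged = merged.merge(region_ctx, on="nuts2", how="left", validate="many_to_one")

# Rows without a match flag identifiers that are missing or not standardized across sources.
print(merged["gdp_pc"].isna().sum(), "respondents without country-level context")
```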
NOTES
1 Geolocation (identifying geographical location) by itself does not count as contextual data, but geolocating a survey respondent provides a bridge that allows the future attachment of individual data to contextual data.
2 The CSES is a cross-national survey project that studies elections comparatively and has an extensive contextual data component. All authors are either current or past members of the CSES Secretariat. The project is available at http://www.cses.org
3 http://www.europeansocialsurvey.org
4 http://www.afrobarometer.org
5 http://www.ggp-i.org
6 https://www.wzb.eu/en/research/migration-anddiversity/migration-integration-transnationalization/projects/six-country-immigrant-int
7 http://u.osu.edu/cnep/
8 http://eeshomepage.net
9 http://www.piredeu.eu
10 https://manifestoproject.wzb.eu
11 http://ec.europa.eu/public_opinion
12 http://www.worldvaluessurvey.org
13 http://www.comparativecandidates.org
14 http://data.worldbank.org/products/wdi
15 https://freedomhouse.org/
16 http://www.transparency.org/research/cpi
17 http://www.systemicpeace.org/polity/polity4.htm
18 http://hdr.undp.org/en/statistics/understanding/indices
19 http://myweb.uiowa.edu/fsolt/swiid/swiid.html
20 Although the issue of cross-level fallacies has been known for more than half a century, it still provokes controversies. A recent illustration is a debate between Seligson (2002) and Inglehart and Welzel (2003) with mutual allegations of committing individualistic and ecological fallacies.
RECOMMENDED READINGS
Rydland, L. T., Arnesen, S., and Østensen, Å. G. (2007). Contextual Data for the European Social Survey: An Overview and Assessment of Extant Resources. The authors provide a comprehensive overview of contextual data for the ESS. The report is a thorough investigation of a range of complex issues researchers face when using survey data together with contextual data in a comparative perspective.
Klingemann, Hans-Dieter (ed.) (2009). The Comparative Study of Electoral Systems; Dalton, R., Farrell, D., and McAllister, I. (2011). Political Parties and Democratic Linkage; and Thomassen, J. (2014). Elections and Democracy. Oxford: Oxford University Press. This series of volumes provides researchers with a myriad of examples of how contextual data on political institutions can be used to show how those institutions might affect individual political behavior, drawing on the Comparative Study of Electoral Systems (CSES).
REFERENCES
Achen, C. H., and Shively, W. P. (1995). Cross-Level Inference. Chicago, IL: University of Chicago Press.
Atkinson, T., Cantillon, B., Marlier, E., and Nolan, B. (2002). Social Indicators: The EU and Social Inclusion. Oxford: Oxford University Press.
Banducci, S. A., and Karp, J. A. (2009). Electoral systems, efficacy, and voter turnout. In H.-D. Klingemann (ed.), The Comparative Study of Electoral Systems (pp. 109–134). Oxford: Oxford University Press.
Barton, A. H. (1968). Bringing society back in. American Behavioral Scientist, 12(2), 1–9.
Blais, A., and Bodet, M.-A. (2006). Does proportional representation foster closer congruence between citizens and policy makers? Comparative Political Studies, 39(10), 1243–1262.
Bollen, K. A., and Paxton, P. (2000). Subjective measures of liberal democracy. Comparative Political Studies, 33(1), 58–86.
Bowler, S., and Donovan, T. (2012). The limited effects of election reforms on efficacy and engagement. Australian Journal of Political Science, 47(1), 55–70.
Butler, D., and Stokes, D. (1969). Political Change in Britain. London: MacMillan.
Ceobanu, A. M., and Escandell, X. (2010). Comparative analyses of public attitudes toward immigrants and immigration using multinational survey data: a review of theories and research. Annual Review of Sociology, 36, 309–326.
Sides, J., and Citrin, J. (2007). European opinion about immigration: the role of identities, interests and information. British Journal of Political Science, 37(3), 477–504.
Dalton, R. J. (2008). The quantity and the quality of party systems: party system polarization, its measurement, and its consequences. Comparative Political Studies, 41(7), 899–920.
de Capitani di Vimercati, S., Foresti, S., Livraga, G., and Samarati, P. (2011). Anonymization of statistical data. it – Information Technology: Methoden und innovative Anwendungen der Informatik und Informationstechnik, 53(1), 18–25.
Diez-Roux, A. V. (2002). A glossary for multilevel analysis. Journal of Epidemiology and Community Health, 56(8), 588–594.
Dogan, M. (1994). Use and misuse of statistics in comparative research. Limits to quantification in comparative politics: the gap between substance and method. In M. Dogan and A. Kazancigil (eds), Comparing Nations (pp. 35–71). Oxford: Blackwell.
Erbring, L., and Young, A. (1979). Individuals and social structure. Sociological Methods & Research, 7(4), 396–430.
Fisher, S. D., Lessard-Phillips, L., Hobolt, S. B., and Curtice, J. (2008). Disengaging voters: do plurality systems discourage the less knowledgeable from voting? Electoral Studies, 27(1), 89–104.
Gallagher, M. (1998). The political impact of electoral system change in Japan and New Zealand, 1996. Party Politics, 4(2), 203–228.
Golder, M., and Stramski, J. (2010). Ideological congruence and electoral institutions. American Journal of Political Science, 54(1), 90–106.
Gutmann, M., Witkowski, K., Colver, C., O’Rourke, J., and McNally, J. (2008). Providing spatial data for secondary analysis: issues and current practices relating to confidentiality. Population Research and Policy Review, 27(6), 639–665.
Harkness, S. (2004). Social and Political Indicators of Human Well-Being. UNU-WIDER Research Paper no. 2004/33 (May).
Herrera, Y. M., and Kapur, D. (2007). Improving data quality: actors, incentives, and capabilities. Political Analysis, 15(4), 365–386.
Hox, J. J. (2010). Multilevel Analysis: Techniques and Applications (2nd edn). New York: Routledge.
Huckfeldt, R. R., and Sprague, J. (1993). Citizens, contexts and politics. In A. W. Finifter (ed.), Political Science: The State of the Discipline II (pp. 281–303). Washington, DC: American Political Science Association.
Inglehart, R. (1990). Culture Shift in Advanced Industrial Society. Princeton, NJ: Princeton University Press.
Inglehart, R., and Welzel, C. (2003). Political culture and democracy: analyzing cross-level linkages. Comparative Politics, 36(1), 61–79.
Jusko, K. L., and Shively, W. P. (2005). Applying a two-step strategy to the analysis of cross-national public opinion data. Political Analysis, 13(4), 327–344.
Karp, J. A. (2013). Voters’ Victory? New Zealand’s First Election Under Proportional Representation. Auckland: Auckland University Press.
Kööts-Ausmees, L., Realo, A., and Allik, J. (2013). The relationship between life satisfaction and emotional experience in 21 European countries. Journal of Cross-Cultural Psychology, 44(2), 223–244.
Lucas, S. R. (2014). An inconvenient dataset: bias and inappropriate inference with the multilevel model. Quality & Quantity, 48(3), 1619–1649.
Marshall, M. G., Gurr, T. R., and Jaggers, K. (2014). Polity IV Project: Political Regime Characteristics and Transitions, 1800–2013. Dataset Users’ Manual. Center for Systemic Peace: Polity IV Project. Retrieved from http://www.systemicpeace.org/inscr/p4manualv2013.pdf [accessed 15 June 2016].
McDonald, M. P., and Popkin, S. L. (2001). The myth of the vanishing voter. American Political Science Review, 95(4), 963–974.
Pattie, C., and Johnston, R. (2000). ‘People who talk together vote together’: an exploration of contextual effects in Great Britain. Annals of the Association of American Geographers, 90(1), 41–66.
Raudenbush, S. R., and Willms, J. D. (1995). The estimation of school effects. Journal of Educational and Behavioral Statistics, 20(4), 307–335.
Robbin, A. (2001). The loss of personal privacy and its consequences for social research. Journal of Government Information, 28(5), 493–527.
Rokkan, S. (1970). Citizens, Elections, Parties: Approaches to the Comparative Study of the Processes of Development. New York: McKay.
Rydland, L. T., Arnesen, S., and Østensen, Å. G. (2007). Contextual Data for the European
Social Survey: An Overview and Assessment of Extant Resources. Bergen: NSD. Retrieved from http://www.nsd.uib.no/om/rapport/nsd_rapport124.pdf [accessed 15 June 2016].
Seligson, M. A. (2002). The renaissance of political culture or the renaissance of the ecological fallacy? Comparative Politics, 34(3), 273–292.
Shively, W. P., Johnson, M., and Stein, R. (2002). Contextual data and the study of elections and voting behavior: connecting individuals to environments. Electoral Studies, 21(2), 219–233.
Silver, B. D., and Dowley, K. M. (2000). Measuring political culture in multiethnic societies: reaggregating the World Values Survey. Comparative Political Studies, 33(4), 517–550.
Staerklé, C., Sidanius, J., Green, E. G., and Molina, L. E. (2010). Ethnic minority–majority asymmetry in national attitudes around the world: a multilevel analysis. Political Psychology, 31(4), 491–519.
Steenbergen, M. R., and Jones, B. S. (2002). Modeling multilevel data structures. American Journal of Political Science, 46(1), 218–237.
Subramanian, S. V., Jones, K., Kaddour, A., and Krieger, N. (2009). Revisiting Robinson: the perils of individualistic and ecologic fallacy. International Journal of Epidemiology, 38(2), 342–360.
Taylor, M. C. (1998). How white attitudes vary with the racial composition of local populations: numbers count. American Sociological Review, 63(4), 512–535.
Vanneman, R. (2003). Tracking trends and contexts: problems and strengths with the NSF surveys. In R. Tourangeau (ed.), Recurring Surveys: Issues and Opportunities. Report to the National Science Foundation on a workshop held on March 28–29, 2003 (pp. B-79–B-83). Arlington, VA: National Science Foundation.
Volkens, A., Bara, J., Budge, I., McDonald, M. D., and Klingemann, H.-D. (eds) (2013). Mapping Policy Preferences from Texts. Oxford: Oxford University Press.
43 The Globalization of Surveys
Tom W. Smith and Yang-chih Fu
INTRODUCTION
As both a product of and a facilitator for globalization, survey research has been expanding around the world. Even before the invention and transmission of national and cross-national surveys, Western scholars and missionaries who visited other areas of the world often conducted or facilitated small-scale surveys in an effort to better understand the local societies. For example, even though the first nationwide survey in China, the world’s most populous country, was not implemented until 2004 by the Chinese General Social Survey, local social surveys had been carried out under the guidance of individual American scholars as early as 1917–1919 near Beijing (www.chinagss.org/; Han, 1997). These continued and increasingly collective efforts helped such work grow into national surveys, not only in Western Europe, Eastern Europe and Russia, but also in Asia and Latin America (Heath et al., 2005; Smith, 2010a; Worcester, 1987). In the later stage of globalization, surveys are being conducted in more and more
countries, and cross-national studies are both increasing in number and encompassing a larger number of participating countries. As with other economic, social, and cultural institutions, the integration or interconnectedness of surveys emerges as a manifestation of the worldwide spread of ideas and practices. The opportunity for the scientific, worldwide, and comparative study of human society has never been greater, but the challenges to conducting such research loom large. The total-survey-error paradigm indicates that achieving valid and reliable results is a difficult task (Lessler, 1984; Smith, 2011). The difficulty is greatly magnified when it comes to comparative survey research. Cross-national/cross-cultural survey research not only requires validity and reliability in each and every survey, but functional equivalence across surveys and populations must also be achieved (Harkness, 2009; Harkness et al., 2010; Johnson, 1998; Smith, 2010a; Verma, 2002). Only by achieving this can the full potential of global survey research be realized.
This chapter covers: (1) the development of cross-national, survey research in general; (2) the contemporary situation, including conditions in (a) the academic, governmental, and commercial sectors, (b) contemporary coverage and limitations, (c) data archives, (d) international academic, professional, and trade associations, (e) journals, (f) cross-national handbooks and edited volumes, and (g) international standards and guidelines; (3) the concept of world opinion; (4) alternative sample designs for a global survey; and (5) prospects for additional developments and methodological improvement.
HISTORICAL DEVELOPMENT
Cross-national, survey research has progressed through three distinct stages of development (Smith, 2010b). The first ran from the advent of public opinion polls in the 1930s until about 1972. During it, comparative, survey research (1) consisted of a relatively small number of studies that covered a limited number of societies, (2) was directed by a small group of researchers, and (3) was conducted on a one-time, topic-specific basis. Shortly after the start of national, representative surveys in the United States in the mid-1930s (Converse, 1987), survey research took root in other countries (Worcester, 1987). Gallup took the early lead in the spread of survey research. In 1937, Gallup established a counterpart in the United Kingdom, and at least as early as 1939, American and British Gallup were fielding parallel questions. By the mid-1940s, Gallup had established affiliates in a dozen countries, and a spin-off of the Roper Organization, International Research Associates, also set up survey-research organizations around the world. During and immediately after World War II, the Allies also promoted the spread of survey research and established local organizations in the occupied countries. The first major comparative example of coordinated,
cross-national, survey research by the Allies was the Strategic Bombing Surveys carried out by the US government in Germany and Japan at the end of World War II to measure the impact of the Allied bombing on civilian populations (MacIsaac, 1976). Social scientists also promoted cross-national collaborations. These included the How Nations See Each Other Study in nine countries in 1948–1949 by William Buchanan and Hadley Cantril (1953); the Comparative Study of Teachers’ Attitudes in seven countries (Rokkan, 1951); the Civic Culture Study in five nations in 1959–1960 by Gabriel Almond and Sidney Verba (1963); the Pattern of Human Concerns Study by Cantril (1965) in 14 countries in 1957–1963; the Attitudes toward Europe Study in five countries in 1962 as part of the European Community (EC); and the Political Participation and Equality Study in seven nations in 1966–1971 by Verba et al. (1978). While cross-national, most of these early collaborations were Eurocentric. Two of these early studies (Teachers and Attitudes toward Europe) were restricted to Europe, and, with the notable exception of Cantril’s Human Concerns study, the rest focused on Europe, with 13 surveys from Europe and 8 from the rest of the world (Smith, 2010b).
The second stage ran from 1973 to 2002, during which comparative, survey research (1) expanded in scope, (2) became sustained, and (3) became collaborative. First, both the number of studies and the number of countries included in many studies greatly expanded. Second, rather than a one-time, intermittent enterprise, cross-national research was increasingly conducted on a continuing basis. Finally, rather than being led by a small cadre of researchers from a few countries, survey research was increasingly headed either by collaborative teams of social scientists drawn from most, if not all, of the participating societies, or involved studies formally representing an association of countries such as the EC. This second stage was heralded by the launch of the EC’s Eurobarometer, which developed from
the earlier Attitudes towards Europe Study and the two rounds of the European Communities Studies in 1970–1971 (www.esds.ac.uk/findingData/snDescription.asp?sn=2911). It was established as a biannual study in 1973–1974 and has grown over time as the European Union (EU) has expanded (see http://ec.europa.eu/public_opinion/index_en.htm). Equally important was the founding during this period of a substantial number of ongoing, collaborative, research programs organized by social scientists:
1 The associated European and World Value Surveys (EVS/WVS) started in 1981 and, across five rounds, have grown from 20 to 48 countries (plus 8 countries with partial versions) (www.worldvaluessurvey.org and www.europeanvaluesstudy.eu).
2 The International Social Survey Program (ISSP) has conducted 29 annual studies from 1985 through 2013 while expanding from 4 to 49 countries (Smith, 2007a; www.issp.org).1
3 The Comparative National Elections Project (CNEP) started in the late 1980s and has had three rounds and 20 participating countries (www.cnep.ics.ul.pt).
4 The Comparative Study of Electoral Systems (CSES) has completed three rounds (www.cses.org), expanding from 33 countries in round 1 to 44 countries in round 3.
5 The various, loosely related Global Barometers (www.globalbarometer.net) (Lagos, 2008) consist of the New Democracies/New European Barometers (1991–2005) (www.esds.ac.uk/findingData/snDescription.asp?sn=3293), the Latinobarómetro (1995–present) (www.latinobarometro.org), the Afrobarometer (1999–present) (www.afrobarometer.org), the Asian Barometer (2001–present) (www.asianbarometer.org), and the Arab Barometer (2005–present) (www.arabbarometer.org).2
Additionally, the ad hoc studies that characterized the first period continued during the second stage. These also often increased in scope. Examples include the World Fertility Survey, carried out in 61 countries (including 41 developing nations) from 1974 to 1982 (Cleland and Scott, 1987; Cornelius, 1985), and the International Social Justice Project,
in 12 countries in 1991 with follow-ups in some countries (www.isjp.de).
During the third stage, starting in 2002, cross-national, survey research became part of the social-science infrastructure. In particular, the degree of central coordination and control notably increased. The establishment of the biennial European Social Survey (ESS) in 2002 capstoned this advance (Jowell et al., 2007) (www.europeansocialsurvey.org). While the ESS, like the WVS, ISSP, and CSES, is a collaboration of social scientists, unlike those earlier consortia it has centralized funding for the design, direction, and methodological monitoring of the national surveys. While the data collection is funded nation-by-nation, the notable level of centralized resources and coordination distinguishes the ESS from the earlier collaborations.
Other developments during this third period have been a continuing expansion in the number and size of cross-national studies and more cross-project collaboration. The Arab Barometers, East Asian Social Surveys (www.eassda.org), and ESS are examples of new cross-national studies initiated in recent years. Also, as indicated above, the major global collaborations (CSES, Global Barometers, ISSP, WVS) have all expanded coverage. Likewise, the new Gallup World Poll grew from covering an average of 113.5 countries in 2006–2007, to 122 in 2008–2010, to 145 in 2011–2012. In terms of inter-study collaborations, the ESS and GSS have carried out joint projects, and the CSES and ISSP have organized workshops, sponsored joint conference sessions, and discussed further collaboration.
INTERNATIONAL AND CROSS-NATIONAL SURVEYS
Globalization has triggered both the necessity for and existence of international survey research. The number of countries conducting surveys, the number of surveys conducted in each country, and the number and size of
cross-national, comparative surveys have all expanded. There are several types of contemporary, cross-national surveys.
First, there are the global, general-topic, general-population, social-science collaborations discussed above (e.g. the CNEP, CSES, Global Barometers, ISSP, and WVS). These are large, ongoing, and expanding collaborations that seek information on a wide range of topics and coverage of societies across the globe (Smith et al., 2006). They have been widely used in scholarly publications.3
Second, there are global, general-population studies on specialized topics, such as the International Mental Health Stigma Survey (www.indiana.edu/~sgcmhs/index.htm), the World Mental Health Survey (www.hcp.med.harvard.edu/wmh/index.php), the International Adult Literacy Survey/Adult Literacy and Life Skills Surveys (http://nces.ed.gov/surveys/all), the Programme for the International Assessment of Adult Competencies (www.oecd.org/site/piaac/#d.en.221854), the Demographic and Health Surveys (www.measuredhs.com), the Multinational Time Use Study (www.timeuse.org/mtus), the World Health Survey (www.who.int/healthinfo/survey/en/index.html), the International Crime Victims Survey (http://rechten.uvt.nl/icvs), and the World Internet Project (www.worldinternetproject.net). These include scholarly collaborations, United Nations (UN) affiliated projects, and programs by other international organizations, such as the World Bank and the Organization for Economic Cooperation and Development (OECD).
Third, there are global, special-population studies on specialized topics, such as student surveys like the Programme for International Student Assessment (PISA; www.pisa.oecd.org), the Relevance of Science Education (ROSE; www.ils.uio.no/english/rose), the Progress in International Reading Literacy Study (PIRLS; http://nces.ed.gov/surveys/pirls), and the Trends in International Mathematics and Science Study (TIMSS; http://nces.ed.gov/timss).
Fourth, there are regional, general-population, general-topic, social-science surveys, such as the ESS (www.europeansocialsurvey.org), the East Asian Social Survey (EASS; www.eassda.org), the Latin American Public Opinion Project (www.vanderbilt.edu/lapop), and the various regional barometers (Lagos, 2008). Like the global, general-topic surveys, these operate on a continuing basis under the leadership of social scientists.
Fifth, there are regional, special-population, special-topic surveys like the Survey of Health, Ageing, and Retirement in Europe (SHARE; www.share-project.org), the European Working Conditions Survey (www.eurofound.europa.eu/euro/ewcs/surveys), the European Election Studies (www.ees-homepage.net), and the European Quality of Life Survey (www.eurofound.europa.eu). These are especially common in the EU.
Sixth, there are global polls conducted by large commercial companies such as Gallup Inc. (www.gallup.com), GfK (www.gfk.com), Harris Interactive (www.harrisinteractive.com), ICF International (www.icfi.com), Ipsos (www.ipsos.com), and TNS/Kantar Group (www.tnsglobal.com and www.kantar.com). There has been a series of mergers creating larger and more international commercial firms (e.g. Ipsos taking over Synovate and GfK acquiring NOP). Rather than primarily engaging in comparative studies, these firms collect national as well as international data. They mostly conduct market research, but also cover public opinion and other areas.
Seventh, there are consortia of commercial firms. Some represent long-term, general collaborations, such as the WIN/Gallup International Association (GIA), formed in 2010 when the World Independent Network of Market Research and GIA merged4 (www.gallup-international.com), and Globescan (www.globescan.com), established in 1987; others are more project-specific collaborations, such as the Pew Global Attitudes project in 2002–2013 (http://pewglobal.org).
Finally, there are harmonization projects that merge, and make more comparable, studies not
originally designed for comparative purposes such as the Luxembourg Income Study (www.lisproject.org), the International Stratification and Mobility File (www.sscnet.ucla.edu/issr/da/Mobility/mobindex.html), the Integrated Public Use Microdata Series, International (IPUMS, International; https://international.ipums.org/international), and the many efforts of the UN (http://unstats.un.org) and Eurostat (http://epp.eurostat.europa.eu).
These cross-national surveys have been integrated or interconnected broadly by two approaches. The first approach is top-down: a survey organization or company, often Western-based, initiates a cross-national survey series by either sponsoring surveys in other countries or asking local agencies to seek funding to implement the surveys. The content and methods of the top-down surveys are often predetermined or decided by the dominating organization or company. The second approach tends to be bottom-up: national teams collaborate and launch cross-national surveys, and teams from other countries join them later. As a rule, the content and methods of the bottom-up surveys are decided collectively, with each team being responsible for its own costs of survey operation. The so-called safari surveys are the extreme example of the top-down model (Kuechler, 1987; Smith, 2004). As globalization further develops and the continued adoption of the survey innovation becomes more self-sustaining given favorable political and economic circumstances, the shift over time has clearly gone from top-down to a more collaborative, bottom-up approach. The Afrobarometers are an interesting example. They started with major leadership from American scholars, but have become much more Afro-centric over time.
CONTEMPORARY COVERAGE AND LIMITATIONS
Both the global expansion of survey research and its limitations are evident from analyzing
participation in major cross-national surveys. A comparison across the CSES, Global Barometers, ISSP, and WVS found that 65.3% of the world’s countries were covered in one or more of these studies. The completely missed countries fell into three main categories. First, countries that were small in both area and population and often geographically isolated (e.g. islands) were often not covered. These principally included the microstates of Europe (e.g. Monaco, San Marino, Vatican City), Pacific islands (e.g. Fiji, Kiribati, Tonga), and Caribbean islands (e.g. Barbados, Dominica, St. Lucia). Second, strongly authoritarian countries such as Myanmar, North Korea, and Uzbekistan were generally missed. Of the nine countries that Freedom House listed in 2013 as the worst of the worst on political rights and civil liberties, only two were included in any of these cross-national studies, with Syria and Sudan each being included in just one of the four. Finally, countries undergoing sustained civil wars and other internal unrest were often not covered (e.g. Afghanistan, Democratic Republic of the Congo, Somalia, South Sudan).
An analysis of the Gallup World Polls (GWPs) produced similar results. From 2006 to 2012, the GWPs conducted surveys in 162 countries or territories, thus covering 78.7% of generally recognized countries plus a few other areas (e.g. Hong Kong and Puerto Rico). While covering more countries, the GWPs essentially missed the same types of areas as the four cross-national collaborations discussed above.
Moreover, neither the GWPs nor the major academic collaborations covered all countries and regions equally well. Looking across the seven rounds of the GWPs, a coverage completeness statistic was computed. It took the total number of countries in a region times the number of rounds (7) as a base and compared that base to the number of surveys conducted in the GWPs from 2006 to 2012. South America had the highest completeness level (85.7%), followed by Asia (77.8%), Europe (72.9%), Africa (55.3%),
North America (43.9%), and Other (Oceania and Pacific islands – 13.2%). However, if the regions are realigned as Latin America and the Caribbean vs the remainder of North America (Canada and the United States), the completeness rates are respectively 32.8% and 100.0%. Similarly, if Australia and New Zealand are separated from Other, their completeness rate is 85.7% and the remaining Other area’s completeness rate falls to 0.0%. Thus, the so-called First World has the most complete coverage and Third-World regions the lowest.
In addition to the coverage of countries discussed above, territories and contested areas are also usually missed. These include many island dependencies, especially in the Caribbean and Pacific, which are missed just like many of the independent nations from these same regions, and other areas such as Greenland (part of Denmark, but routinely excluded from Danish samples) and French Guiana. Also typically missed are contested areas like Northern Cyprus, Transnistria, and Western Sahara. Among the few areas in these groups that are occasionally included in cross-national surveys are Puerto Rico and Palestine.
While surveys are being conducted both in more countries and more frequently, there are still many legal constraints on the conduct of surveys and the dissemination of survey results. In 2012, the World Association for Public Opinion Research (WAPOR) updated its Freedom to Publish Opinion Poll Results (Chang, 2012). Information was collected about government restrictions on surveys in 85 countries/jurisdictions. The publication of pre-election polls had blackout periods in 46% of countries, lasting from 1 to 45 days. In 16% of countries, exit polls of voters were either forbidden or severely limited. In 14% of countries, the specific questions or subjects of surveys were restricted (and in another 9% of countries, the situation was unclear). China illustrates this situation. Questions about consumer preferences and other market-research topics are widely conducted and
essentially unrestricted, questions about the Communist Party are strictly prohibited, and in between is a huge gray area of uncertainty. Nor is the situation improving. Between 2002 and 2012, 13 countries increased embargoes on pre-election polls and 11 reduced their embargoes (Chang, 2012). In just the last two years, WAPOR has combated political efforts to restrict surveys in Mexico, Peru, Russia, and Ukraine. Likewise, ESOMAR has been involved regarding regulations in France and the European Union.
Another limitation lies in the types of surveys that cross national borders. The existing cross-national surveys are largely limited to cross-sections rather than panels. Like cross-sectional surveys, panel surveys have spread from the West to other parts of the world, with projects from different countries focusing on very similar topics. Due to the temporal complexity added to survey design and operation, however, national panel surveys have not been well integrated or interconnected. Furthermore, some of these later panel surveys modify their panel unit to reflect core cultural values that matter more in the countries where they are conducted. For example, the Panel Survey of Family Dynamics (PSFD), which has been conducted in China and Taiwan, treats the family as a complicated social institution in Chinese societies and thus includes key family members as the targets in the panels (http://psfd.sinica.edu.tw/). Even though it is difficult to integrate panel surveys across national borders, one of the most established panel survey programs, the Panel Study of Income Dynamics (PSID), does work with comparative panel surveys from other countries to produce a cross-national equivalence file, or CNEF (www.human.cornell.edu/pam/research/centers-programs/german-panel/cnef.cfm). The CNEF incorporates panel data collected by non-Western countries, such as the Korean Labor and Income Panel Study, alongside panel survey series from Western countries. An equivalence file such as this partly compensates for the lack of
cross-national panel surveys and should contribute to the rapid growth of data archiving that has accompanied the globalization of surveys.
DATA ARCHIVES AND DATA SOURCES
Microdata from most of the cross-national surveys carried out by social scientists and governments and microdata from some surveys conducted by commercial firms are stored in and accessible from major, international, survey archives such as the following:
Association of Religion Data Archives, Pennsylvania State University – www.thearda.com
Interuniversity Consortium for Political and Social Research, University of Michigan – www.icpsr.umich.edu
IPUMS, International – https://international.ipums.org/international
Latin American Public Opinion Project, Vanderbilt University – www.vanderbilt.edu/lapop
Roper Center for Public Opinion Research, University of Connecticut – www.ropercenter.uconn.edu
EU’s Eurobarometer – http://ec.europa.eu/public_opinion/index_en.htm
GESIS Data Archive for the Social Sciences (formerly the Central Archive for Empirical Social Research at the University of Cologne) – www.gesis.org/en/institute/gesis-scientific-sections/data-archive-for-the-social-sciences
Norsk samfunnsvitenskapelig datatjeneste (Norwegian Social Science Data Services), University of Bergen – www.nsd.uib.no
Social Science Japan Data Archive at the University of Tokyo – http://ssjda.iss.u-tokyo.ac.jp/en
UK Data Archive, University of Essex – www.data-archive.ac.uk
All of these have extensive international and cross-national holdings, but none focuses on comparative, survey-research data.5 Of particular value are several question-level, online repositories of data: (1) IPOLL at the Roper Center (www.ropercenter.uconn.edu/CFIDE/cf/action/home/index.cfm?CFID=28311&CFTOKEN=35566476), (2) Polling the Nations (http://poll.orspub.com), (3) the
UK Data Service Variables and Question Bank at Essex (http://discover.ukdataservice.ac.uk/variables), and (4) ZACAT at GESIS (http://zacat.gesis.org/webview). These allow searches for specific question wordings and present basic results. Only the UK Data Service and ZACAT are free. Also, many cross-national programs provide documentation and data from their project websites. These include the CSES, ESS, ISSP, and WVS. Some commercial projects also make reports, and sometimes data, available at corporate websites. However, full access is usually limited to subscribers or otherwise restricted. Other sites of particular interest include World Public Opinion of the Program on International Policy Attitudes (www.worldpublicopinion.org) and the Pew Global Attitudes Project (www.pewglobal.org).
INTERNATIONAL ACADEMIC, PROFESSIONAL, AND TRADE ASSOCIATIONS
Academic, professional, and trade associations are another important component of the comparative, survey-research community. There are various types of associations, such as (1) the main academic and professional associations in the social and statistical sciences – the International Political Science Association (www.ipsa.org), the International Sociological Association (www.isa-sociology.org), the International Statistical Institute (http://isi.cbs.nl), and its affiliate the International Association of Survey Statisticians (http://isi.cbs.nl/iass); (2) academic and professional associations related to survey research, like the market-research-oriented ESOMAR (formerly the European Society for Opinion and Market Research; www.esomar.org), the Asian Network for Public Opinion Research (ANPOR; http://anpor.org/en/index.php), the European Survey Research Association (ESRA; http://esra.sqp.nl/esra/home), and the World
Association for Public Opinion Research (WAPOR; www.wapor.org); (3) trade associations, like the Council of American Survey Research Organizations (CASRO; https://www.casro.org), the European Federation of Associations of Market Research Organizations (EFAMRO; www.efamro.com), and ESOMAR (which has both individual and organizational members); (4) social-science, archival organizations like the International Association for Social Science Information, Service, and Technology (www.iassistdata.org), the Council of European Social Science Data Archives (www.nsd.uib.no/cessda/home.html), and the International Federation of Data Organizations for the Social Sciences (www.ifdo.org); (5) survey-research-methodology collaborations such as the Comparative Survey Design and Implementation Workshop (CSDI; www.csdiworkshop.org), the series of International Workshops on Household Survey Nonresponse (www.nonresponse.org), and the loosely connected International Conference series starting with the International Conference on Telephone Survey Methodology in 1987 through the International Conference on Methods for Surveying and Enumerating Hard-to-Reach Populations in 2012; and (6) other social-science associations and organizations, from long-established organizations such as the UN’s International Social Science Council (ISSC; www.unesco.org/ngo/issc.org) and the US-based Social Science Research Council (www.ssrc.org) to new entities like the ISSC’s World Social Science Forum (2009–2013) (www.unesco.org/ngo/issc/3_activities/3_worldforum.htm).
More and more, these associations and organizations are collaborating to advance survey research around the world. For example, WAPOR and ESOMAR have regularly held joint meetings since 1949, have published a number of coordinated reports such as the ESOMAR/WAPOR Guide to Opinion Polls (http://wapor.unl.edu/esomarwapor-guide-to-opinion-polls) and the joint report on polling in Georgia (Frankovic et al., 2013), and have participated in the development of the International
Organization for Standardization (ISO) standards (see below). Similarly, WAPOR and the American Association for Public Opinion Research (AAPOR) regularly hold joint conferences, have jointly approved Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys (www.aapor.org/Standard_Definitions1.htm), and both back AAPOR’s Transparency Initiative (www.aapor.org/Transparency_Initiative.htm).
SURVEY-RESEARCH AND SOCIAL-SCIENCE JOURNALS
Major survey-research journals include Public Opinion Quarterly, Survey Practice, and Journal of Survey Statistics and Methodology of AAPOR (with the last journal co-published with the American Statistical Association), WAPOR’s International Journal of Public Opinion Research, ESRA’s Survey Research Methods, Statistics Sweden’s Journal of Official Statistics, Statistics Canada’s Survey Methodology, and Field Methods. There are also various journals on social-science methodology such as the Bulletin of Sociological Methodology, International Journal of Social Research Methodology, Quality and Quantity, Sociological Methodology, and Sociological Methods and Research. In addition, there are many comparative and international journals in the social sciences. A few examples from sociology are Acta Sociologica, Comparative Sociology, European Sociological Review, International Journal of Comparative Sociology, International Journal of Socio-Economics, International Journal of Sociology, and International Sociology.
CROSS-NATIONAL HANDBOOKS AND EDITED VOLUMES
There are of course thousands of books using survey research with an international
perspective and a similarly large number dealing with survey-research methodology. Examples of books that bring the two topics together include: Christof Wolf, Dominique Joye, Tom W. Smith, and Yang-chih Fu (eds) SAGE Handbook of Survey Methodology (2016); Edith D. de Leeuw, Joop J. Hox, and Don A. Dillman (eds) International Handbook of Survey Methodology (2008); Wolfgang Donsbach and Michael Traugott (eds) SAGE Handbook of Public Opinion Research (2008); John G. Geer (ed.) Public Opinion and Polling around the World: A Historical Encyclopedia (2004); Peter V. Marsden and James D. Wright (eds) Handbook of Survey Research (2010); Juergen H. P. Hoffmeyer-Zlotnik and Christof Wolf (eds) Advances in Cross-National Comparison: A European Working Book for Demographic and Socio-Economic Variables (2003); Roger Jowell, Caroline Roberts, Rory Fitzgerald, and Gillian Eva (eds) Measuring Attitudes Cross-Nationally: Lessons from the European Social Survey (2007); Janet A. Harkness, Michael Braun, Brad Edwards, Timothy P. Johnson, Lars Lyberg, Peter Ph. Mohler, Beth-Ellen Pennell, and Tom W. Smith (eds) Survey Methods in Multinational, Multiregional, and Multicultural Contexts (2010); and Janet A. Harkness, Fons van de Vijver, and Peter Ph. Mohler (eds) Cross-Cultural Survey Methods (2003).
INTERNATIONAL STANDARDS AND GUIDELINES
Recently, international standards for survey research have been developed and their adoption is spreading (Lynn, 2003; see also Chapter 2, this Handbook). The most authoritative are the Standards for Market, Opinion, and Social Research which were first issued by the ISO in 2006 and revised in 2012 (www.iso.org). Other examples are Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys, initially created by AAPOR in 1998 and later adopted
by WAPOR, the ISSP, and other groups (www.aapor.org/responseratesanoverview); the International Guidelines for Opinion Surveys of the OECD (http://www.oecd.org/std/leading-indicators/37358090.pdf); and the Cross-Cultural Survey Guidelines of the CSDI Guidelines Initiative (http://projects.isr.umich.edu/csdi/).
THE CONCEPT OF WORLD OPINION
One aspect of the globalization of survey research is the expansion of the concept of ‘world opinion’. There are, however, very different ways in which world opinion is conceptualized and used. One prominent approach sees it as reflecting the collective judgment of the international community about the actions of nations or other actors. Rusciano and Fiske-Rusciano (Rusciano, 2001, 2010; Rusciano and Fiske-Rusciano, 1990, 1998), as part of their work on global opinion theory, consider world opinion from a spiral-of-silence perspective. They indicate that public opinion ‘consists of attitudes or behaviors which an individual can or must express in order to avoid social isolation’ and, following from this, that ‘world opinion refers to the moral judgments of observers which actors must heed in the international arena or risk isolation as a nation’. They typically have measured world opinion by analyzing articles in newspapers (Rusciano, 2001; Rusciano and Fiske-Rusciano, 1990), but have since indicated (Rusciano, 2010) that one should look ‘for evidence in the relevant discourse of international affairs – e.g. the news media, policy statements or papers, United Nations proceedings, and global opinion polls’.
Stearns’ (2005) views of world opinion generally overlap with those of the Ruscianos.
for humanity, plus a recognition in many societies … that such evidence of outrage may need to be accommodated .… (p. 7)
He further indicates that world opinion goes beyond ‘polling results … in that it involves more active expressions through petitions, demonstrations, and boycotts, though polling may confirm the strong views involved’ (p. 8). Goot (2004) also mention protests (e.g. boycotts, demonstration, and acts of terrorism) as part of world opinion, but measures these only via general surveys. Another perspective thinks of world opinion as the attitudes of people around the world, typically as measured by crossnational surveys. This is the approach implicitly or explicitly adopted by the major cross-national projects introduced above. It does not assume there is or should be any global consensus, nor that world opinion is restricted to attitudes or standards that are formed by the global community and directed at wayward nations and other actors. This approach heavily depends on the collection, comparison, and aggregation of national opinion surveys.
ALTERNATIVE SAMPLE DESIGNS OF GLOBAL SURVEYS
The dominant approach to conducting a global survey has been to conduct comparable, national surveys in as many countries as possible. But some have instead advocated a more directly global survey in which worldwide and not country-specific results are the primary goal. Rusciano and Fiske-Rusciano (1998) outline a general model for doing a global survey of world opinion. Stearns (2005) also seems to advocate a more global rather than nation-by-nation measure of world opinion, but does not discuss how this might be achieved. The most detailed attempt to operationalize such an approach has been developed by Gilani and Gilani (2013), in what they call the ‘global-centric method of
sampling and surveys’. They have collected a sample frame of blocks that represent 99.5% of the world’s population and propose drawing samples proportional to size without first selecting country as a sampling unit.
Tom W. Smith and the late Roger Jowell once discussed the merits of the traditional country-by-country vs direct global sampling approaches. Smith described a hybrid approach. He indicated that if one placed the larger nations in a single stratum, one could include these with certainty and then sample countries proportional to population within several regional, non-certainty strata. This could lower the number of countries that needed to be sampled, reduce the total number of interviews that would be needed, and produce a merged sample that was more representative of and generalizable to the world in general. Jowell noted that countries were an important organizing unit both politically and culturally, and that one wanted to maximize the number of countries covered to both exploit and understand the inter-country variation. Both were of course correct.
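To illustrate the hybrid design Smith described, here is a minimal sketch in Python; the file country_frame.csv, its columns, the 2% certainty cutoff, and the three selections per regional stratum are all hypothetical choices for the example, not part of any actual design.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Hypothetical frame: one row per country with its region and population.
frame = pd.read_csv("country_frame.csv")  # columns: country, region, population

share = frame["population"] / frame["population"].sum()
certainty = frame[share >= 0.02]   # very large countries enter with certainty (illustrative cutoff)
remainder = frame[share < 0.02]

def pps_systematic(stratum, n):
    """Systematic probability-proportional-to-size selection of n countries."""
    sizes = stratum["population"].to_numpy(dtype=float)
    cum = np.cumsum(sizes)
    interval = cum[-1] / n
    start = rng.uniform(0, interval)
    picks = start + interval * np.arange(n)   # equally spaced points on the cumulative population scale
    idx = np.searchsorted(cum, picks)         # country whose cumulative range contains each point
    return stratum.iloc[np.unique(idx)]       # duplicates collapse if a country spans several points

# Within each regional, non-certainty stratum, select (say) three countries PPS.
sampled = pd.concat(
    [certainty] + [pps_systematic(g, n=3) for _, g in remainder.groupby("region")]
)
print(sampled[["country", "region", "population"]])
```

Countries above the cutoff mirror the certainty stratum, while the rest are selected with probability proportional to population within regions; analysis weights would then combine these country selection probabilities with each country's within-country design.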
FUTURE PROSPECTS
While impediments remain to achieving global survey research, the political and economic barriers to survey research have diminished over time, and it is probable that coverage will continue to expand. National surveys are conducted in most countries, and in both the commercial and academic sectors, comparative surveys are routinely carried out both regionally and globally. But notable challenges stand in the way of achieving valid, reliable, and comparable measurements across surveys. Minimizing total survey error in a single survey is difficult, doing so in multiple surveys conducted in one society is still more difficult, and doing so in many surveys across languages, societies, and cultures is the most difficult of all (Harkness et al., 2010; Smith, 2007b,
2010a, 2011). Conducting multiple surveys is naturally more difficult and error-prone simply because there are more moving parts that must be designed, operated, checked, and coordinated. But cross-national/cross-cultural surveys are especially difficult to successfully design and execute because measurement and content are easily confounded, and this often makes both methodological and substantive explanations for differences plausible. To reliably and validly ascertain the actual cross-national/cross-cultural differences and similarities that prevail across societies, one must ensure that measurement error has been minimized and that functional equivalence has been achieved.
Achieving functional equivalence is impeded by several factors. First, while notable progress has been made in improving survey methodology, much more research is needed about (a) the sources of measurement error and how to minimize them and (b) how to maximize measurement comparability. Second, comparative surveys often do not utilize the best existing methods and therefore do not achieve the best possible results permitted by the current state of the art of survey methodology. While this may stem from a lack of expertise on the part of the principal researchers and/or data collectors, it usually reflects a lack of resources. Although the technical knowledge and the intent to conduct quality research may exist, the necessary resources to design and conduct top-flight research and optimal comparisons are often not available.
In 1987, to mark the 50th anniversary of Public Opinion Quarterly (POQ), Robert Worcester (1987) wrote the following on the ‘internationalization’ of survey research:
and polling methodologies to extend the usefulness, timeliness, and accuracy of poll findings; the ‘cinematographic poll’ providing a moving picture of public opinion on an ongoing basis; developments in question wording techniques, sampling, analysis, and reporting; and, hopefully, the defeat of efforts, well-meaning or not, to limit the taking and publication of well-founded expert public opinion polls. (p. S84)
We are now halfway to that 50-year mark and have generally made progress along these lines. But much work still remains, especially in the methodological advances that are needed to ensure functional equivalence and high data quality in cross-national surveys.
NOTES
1 The ISSP started as a collaboration between existing social-indicators programs in the US (the National Opinion Research Center’s General Social Survey (GSS)), Germany (the Zentrum fuer Umfragen und Methoden’s ALLBUS), the UK (Social and Community Planning Research’s British Social Attitudes Study), and Australia (the Australian National University’s National Social Science Survey) and extended bilateral studies carried out as part of the GSS and ALLBUS in 1982–1984.
2 Despite the overlapping use of the term ‘barometer’ there is limited connection between these later organizations and the EC’s Eurobarometer. There are also other organizations using the term ‘barometer’, such as the Asia Barometer (www.asiabarometer.org), that are unconnected with the Global Barometers. The New European Barometer does not appear to be a formal member of the Global Barometers, but has had some connection (Lagos, 2008). A new entity, the Eurasia Barometer, is an outgrowth of the New Democracies/New European Barometers.
3 Cross-national survey research has produced a large and invaluable body of findings. For example, the CSES lists 601 publications using its surveys, the WVS’s bibliography has about 3,350 entries, and the ISSP’s bibliography has 5,566 references.
4 Gallup Inc. is the company founded by George Gallup Sr. and is headquartered in the US. GIA and WIN merged in 2010. The WIN/GIA is not affiliated with Gallup Inc. and is headquartered in Switzerland. GIA was formed in 1947 and some affiliates had ties to George Gallup and Gallup Inc. in the past. In 2013, WIN/GIA had affiliates in 73 countries. A few members of WIN/GIA are also affiliated with TNS.
5 For other European archives see the members of the Council of European Social Science Data Archives (www.cessda.org/about/members).
RECOMMENDED READINGS
Harkness et al. (2010) won the AAPOR Book Award in 2013 and is the single best source on cross-national survey methodology. Examples of books that present results from major cross-national collaborations are Jowell et al. (2007) for the ESS and Haller et al. (2009) for the ISSP.
REFERENCES
Almond, G. and Verba, S. (1963). The Civic Culture: Political Attitudes and Democracy in Five Nations. Princeton, NJ: Princeton University Press.
Buchanan, W. and Cantril, H. (1953). How Nations See Each Other: A Study in Public Opinion. Urbana, IL: University of Illinois Press.
Cantril, H. (1965). The Pattern of Human Concerns. New Brunswick: Rutgers University Press.
Chang, R. (2012). The Freedom to Publish Opinion Poll Results: A Worldwide Update of 2012. Lincoln, NE: World Association for Public Opinion Research.
Cleland, J. and Scott, C. (eds) (1987). The World Fertility Survey: An Assessment. New York: Oxford University Press.
Converse, J. M. (1987). Survey Research in the United States: Roots and Emergence, 1890–1960. Berkeley, CA: University of California Press.
Cornelius, R. M. (1985). The World Fertility Survey and Its Implications for Future Surveys. Journal of Official Statistics, 1, 427–433.
de Leeuw, E. D., Hox, J. J., and Dillman, D. A. (eds) (2008). International Handbook of Survey Methodology. New York: Lawrence Erlbaum.
Donsbach, W. and Traugott, M. (eds) (2008). Handbook of Public Opinion Research. London: Sage.
Frankovic, K. A., Grabowska, M., Rivière, E., and Traugott, M. (2013). Making Public Polling Matter in Georgia: A Report on Pre-Election Polling in the 2012 Georgia Parliamentary Elections. Amsterdam: ESOMAR/WAPOR Report.
Geer, J. G. (ed.) (2004). Public Opinion and Polling around the World: A Historical Encyclopedia. Santa Barbara, CA: ABC Clio.
Gilani, I. and Gilani, B. (2013). Global and Regional Polls: A Paradigmatic Shift from ‘State-centric’ to ‘Global-centric’ Approach. Paper presented to the World Association for Public Opinion Research, Boston, May.
Goot, M. (2004). World Opinion Surveys and the War in Iraq. International Journal of Public Opinion Research, 16, 239–268.
Haller, M., Jowell, R., and Smith, T. W. (2009). The International Social Survey Programme 1984–2009: Charting the Globe. London and New York: Routledge.
Han, M. (1997). Three Landmarks in the Development of Research Methods and Methodology of China’s Sociology. Journal of Peking University (Humanities and Social Sciences), 4, 5–15. (in Chinese)
Harkness, J. (2009). Comparative Survey Research: Goals and Challenges. In E. D. de Leeuw, J. J. Hox, and D. A. Dillman (eds), International Handbook of Survey Methodology (pp. 56–77). New York: Lawrence Erlbaum.
Harkness, J., Edwards, B., Braun, M., Johnson, T., Lyberg, L., Mohler, P., Pennell, B.-E., and Smith, T. W. (eds) (2010). Survey Methods in Multinational, Multiregional, and Multicultural Contexts. New York: Wiley and Sons.
Harkness, J. A., van de Vijver, F., and Mohler, P. Ph. (eds) (2003). Cross-Cultural Survey Methods. Hoboken, NJ: John Wiley & Sons.
Heath, A., Fisher, S., and Smith, S. (2005). The Globalization of Public Opinion Research. Annual Review of Political Science, 8, 297–333.
Hoffmeyer-Zlotnik, J. H. P. and Wolf, C. (2003). Advances in Cross-National Comparison: A European Working Book for Demographic and Socio-economic Variables. New York: Kluwer Academic.
Johnson, T. P. (1998). Approaches to Equivalence in Cross-Cultural and Cross-National Survey Research. In J. A. Harkness (ed.), Nachrichten Spezial, Cross-Cultural Survey Equivalence. Mannheim: ZUMA.
Jowell, R., Roberts, C., Fitzgerald, R., and Eva, G. (eds) (2007). Measuring Attitudes Cross-Nationally: Lessons from the European Social Survey. Thousand Oaks, CA: Sage.
Kuechler, M. (1987). The Utility of Surveys for Cross-national Research. Social Science Research, 16, 229–244.
Lagos, M. (2008). International Comparative Surveys: Their Purpose, Content, and Methodological Challenges. In W. Donsbach and M. Traugott (eds), Handbook of Public Opinion Research (pp. 580–593). London: Sage.
Lessler, J. (1984). Measurement Error in Surveys. In C. F. Turner and E. Martin (eds), Surveying Subjective Phenomena. New York: Russell Sage.
Lynn, P. (2003). Developing Quality Standards for Cross-national Survey Research: Five Approaches. International Journal of Social Research Methodology, 6, 323–336.
MacIsaac, D. (1976). Strategic Bombing in World War Two: The Story of the United States Strategic Bombing Survey. New York: Garland.
Marsden, P. V. and Wright, J. D. (eds) (2010). Handbook of Survey Research. Bingley: Emerald.
Rokkan, S. (ed.) (1951). Proceedings of the International Seminar on Comparative Social Research. Oslo: Institute for Social Research.
Rusciano, F. L. (2001). A World Beyond Civilizations: New Directions for Research on World Opinion. International Journal of Public Opinion Research, 13, 10–24.
Rusciano, F. L. (2010). Global Opinion Theory and the English School of International Relations. New Global Studies, 4, 1–22.
Rusciano, F. L. and Fiske-Rusciano, R. (1990). Towards a Notion of ‘World Opinion’. International Journal of Public Opinion Research, 2, 305–332.
Rusciano, F. L. and Fiske-Rusciano, R. (1998). World Opinion and the Emerging International Order. Westport, CT: Praeger.
Smith, T. W. (2004). Developing and Evaluating Cross-National Survey Instruments. In S. Presser et al. (eds), Methods for Testing and Evaluating Survey Questionnaires. New York: John Wiley & Sons.
Smith, T. W. (2007a). International Social Survey Program. Unpublished NORC report.
Smith, T. W. (2007b). Formulating the Laws for Studying Societal Change. Paper presented to the FCSM Research Conference, Arlington, VA.
Smith, T. W. (2010a). Surveying across Nations and Cultures. In J. D. Wright and P. V. Marsden (eds), Handbook of Survey Research, 2nd edition (pp. 733–764). New York: Academic Press.
Smith, T. W. (2010b). The Origin and Development of Cross-national Survey Research. Seminar on the Early Days of Survey Research and Their Importance Today, Vienna, July.
Smith, T. W. (2011). Refining the Total Survey Error Paradigm. International Journal of Public Opinion Research, 23, 464–484.
Smith, T. W., Kim, J., Koch, A., and Park, A. (2006). Social-Science Research and the General Social Surveys. Comparative Sociology, 5, 33–44.
Stearns, P. N. (2005). Global Outrage: The Impact of World Opinion on Contemporary History. Oxford: One World.
Verba, S., Nie, N. H., and Kim, J. (1978). Participation and Political Equality: A Seven-Nation Comparison. Cambridge: Cambridge University Press.
Verma, V. (2002). Comparability in International Survey Statistics. Paper presented to the International Conference on Improving Surveys, Copenhagen, August.
Wolf, C., Joye, D., Smith, T. W., and Fu, Yang-chih (eds) (2016). The SAGE Handbook of Survey Methodology. London: SAGE.
Worcester, R. M. (1987). The Internationalization of Public Opinion Research. Public Opinion Quarterly, 51, S79–S85.
Index Page references to Figures or Tables will be in italics, followed by the letters ‘f’ and ‘t’, as appropriate AAPOR see American Association for Public Opinion Research (AAPOR) Aboriginal peoples, 58 abstraction, error of, 203 access panels (online panels of general populations), 7, 336 accuracy, 122–3 Adams, S. M., 182 adaptive cluster sampling, 469 adaptive designs, 132–3, 416, 558 adaptive survey designs, 568–70 adaptive total design (ATD), 128, 133–4, 139 example, 569–70 address-based sampling, 60 administrative data, 38 Administrative Data Liaison Service (ADLS), UK, 667 advance translation, 508 Advertising Research Foundation (ARF), 18 Afro Barometer, 10, 672 aggregation, 4, 5, 30, 77, 223, 298, 335, 346, 420, 511, 562, 598, 656, 662 aggregated data, 671 aggregation mode, 106 and balance between analytic potential vs data confidentiality, 494, 496 and cross-national survey data, supplementing with contextual data, 671, 675, 676 microaggregation, 496–7 population levels, 563, 565, 570 and research question, 105, 107, 112 Ainsworth, B. E., 212 Alexander, J. T., 496 Allensbach Institute, 10 Allgemeine Bevölkerungsumfrage in den Sozialwissenschaften (ALLBUS), Germany, 6 Alliance of International Market Research Institutes (AIMRI), 18 Alwin, D. F., 32, 548, 550, 582 American Anthropological Association, 291 American Association for Public Opinion Research (AAPOR), 8, 18, 19, 22, 23, 173 Code on Professional Ethic and Practice, 334 establishment (1947), 94, 95 response rates/nonresponse, 410, 411 Standard Definitions, 24, 387 Transparency Initiative, 17 see also World Association for Public Opinion Research (WAPOR) American Community Survey (ACS), 149
American Institutional Review Boards (IRB), 168 American Marketing Association (AMA), 18, 19 American National Election Studies (ANES), 94, 447 American National Standards Institute, 16 American Sociological Association (ASA), 291 American Sociological Review, 9 American Statistical Association (AmStat), 17, 90 Amowitz, L. L., 184 Analysis of Pre-Election Polls and Forecasts, Committee (US), 95–6 Anderson, D. W., 472 Andrews, F. M., 239, 550, 551 ANOVA (analysis of variance), 353, 371 answering processes, cognitive models psychological approach, 211 question response process, 210–11 and cognitive interviewing, 214–15 socio-cultural approach, 212–14 Arab Barometer, 10 archiving of data, 653–5, 686 surveys and society, 62, 65 Asia Barometer/Asian Barometer Survey (ABS), 10, 13n5, 281 Asian Network for Public Opinion Research, 18 Asparouhov, T., 636, 642 ASPIRE (A System for Product Improvement, Review and Evaluation), 36, 135, 135–7, 139 Association of Consumer Research (ACR), 18 Association of European Market Research Institutes (AEMRI), 18 Association of the Marketing and Social Research Industry (Canada) (AMSRI), 18 associations other than professional and trade, 20–1 professional and trade, 17–19 attrition bias, 113 Audit Bureau of Circulation (ABC), 18, 19 Australia, 166–7 Australia Gallup Poll, 96 auxiliary data/variables, 419, 558, 563, 576 available case analysis (ACA), 599 Bacharach, V. R., 584 back translation, 272–3 background variables, comparative surveys, 288–307 education attainment, 292–8, 296–7 CASMIN education scheme, 294 fields of specialization, 298
positional measures, 295 qualifications, 293–5 years of, 293 ethnicity, 289–92 occupation, 298–300 measurement through ISCO coding, 299–300 social position, 300–2 socio-economic groups and social class models, 301–2 socio-economic indexes and one-dimensional continuous measures, 300–1 Bagozzi, R. P., 195 Bailey, L. H., 88 Baker, R. P., 334, 364 Balanced Repeated Replication (BRR), 484 balanced sampling, 322–3 Ball, Richard, 457 Ballou, J., 4 Bandilla, W., 150 Bankier, M. D., 472 Barometers, 7, 10, 62, 655 Baumgarnter, H., 581, 586, 615 Bayesian analysis, 38, 642 Beatty, P., 215 Beauducel, A., 637 Beaumont, J.-F., 466 behaviour coding analysis procedures, 368 key procedures, 367–8 purpose and context, 367 resource requirements, benefits and limitations, 368–9 Behr, D., 47, 509 Belden, J., 87, 95 Belgium, 204–5 Bentler, P. M., 636 Berkeley Initiative for Transparency in the Social Sciences (BITSS), 659 Berzelak, J., 148, 152 Best Linear Unbiased estimator see BLUE (Best Linear Unbiased estimator) Bethlehem, J., 334, 338, 410, 413, 483, 560, 561, 570, 571 Beyer, H., 119 bias, 567 measurement, 28 nonresponse, 415–16, 565–6 polls, 63 question wording, 30 selection, 32, 71, 334 social desirability, 167–8 Biemer, P. P., 5, 33, 35, 135, 137 Biffignandi, S., 334, 338 Big Data, 8, 123, 458 bivariate/multivariate associations among variables, 110 Blair, J., 374 Blasius, J., 587–9, 590, 591, 613
blocking methods, 665 Blom, A., 418 Bloom-Filters, 666 BLUE (Best Linear Unbiased estimator), 317, 325 Blyth, B., 144 Boeije, H., 377 Bonett, D. G., 636 Booth, C., 88 Borsboom, D., 196 Börsch-Supan, A., 574 Bossarte, R. M., 434 Bourdieu, P., 14n12, 427, 435 Bowley, A. L., 96, 312 Bradburn, N., 409, 548 Braun, M., 47, 48, 509 Bréchon, P., 508 Brewer, K. R. W., 315, 319 Breyfogle, F., 128 Brick, J. M., 413, 414, 417, 418 British Household Panel Survey (BHPS), 7 British Market Research Association (BMRA), 18 British Social Attitudes (BSA), 6 Buddhism, 280 Byrne, B. M., 641 calibration, weighting, 338, 461, 463–5 calibration variable, 463 Calinescu, M., 588 Callegaro, M., 335, 337 Campanelli, P., 616 Campbell, D. T., 28, 239, 531, 532, 541, 542 CAMSIS scale, 301 Cannell, C. F., 587 Cantril, H., 71, 93, 94, 96, 97, 211 CAPI see Computer-Assisted Personal Interviewing (CAPI) surveys Carmines, E. G., 535 Carnap, R., 195, 205 CASCOT system, 300 case studies, 330 case-control studies, 110 CASMIN education scheme, 294 CASMIS scale, 291 categorical principal component analysis (CatPCA), 590, 619, 620 categorization process, 59 CATI see Computer Assisted Telephone Interviewing (CATI) cell phones, 60, 78, 160 Census Bureau, US, 28, 36, 126, 361, 372, 376, 496 Census Codebook, US, 452 census data, 59 Census of Population of Housing, 450 Center for Open Science, 659 CFA see confirmatory factor analysis (CFA) Chasiotis, A., 172 Chen, F. F., 637 Cherington, Paul, 91
CHi square Automatic Interaction Detection (CHAID), 467 Chicago Record, 90 children, interviewing, 83–4 China, 166 and globalization of surveys, 680, 685 and polling, 93, 97 and translation of measurement instruments, 278, 280, 281, 282f China Health and Retirement Longitudinal Study (CHARLS), 172 chi-square distance, 472 choice experiments, 115–16 Clancy, K. J., 230 Clark, R., 98 Clark, V. L. P., 172 classical true-score theory (CTST), 530, 532, 533, 537–8, 539 Cleland, J., 168 climate, survey see survey climate Clinton, Bill, 99 Clinton, Hillary, 99 closed-end questions, 33 cluster effect, 322 cluster sampling, 96, 323–4, 481 adaptive, 469 multi-stage, 323 one-stage, 323 stratified multi-stage, 482 two-stage, 323 Cochran, W.G., 571 codebooks, 450–54, 452, 453 codes existing, 18–19 International Standard Classification of Occupations (ISCO), 299–300, 511 professional and trade, 17–19 ethical issues, 78–9 implementing and enforcing, 21–2 role of codes and road to professionalization, 22–3 violation, alleged, 22 see also under European Society for Opinion and Market Research (ESOMAR); International Chamber of Commerce (ICC) coefficient of precision, 531 Coenders, G., 244 cognition, 211 Cognitive Aspects of Survey Methodology (CASM), 32, 210, 211, 212 cognitive interviewing analysis procedures and implications for question modification, 363 Four-Stage Cognitive Model, 361 key procedures, 362–3 logical/structural problems, 362 participants, 362 purpose and context, 361–62 question response process, 214–15
resource requirements, benefits, limitations and practical issues, 363–4 thinking aloud, 362 verbal probing, 362 see also interviews cohort studies, 112, 113 Colectica (DDI lifecycle tool), 446 Collaborative Psychiatric Epidemiology Surveys (CPES), interactive codebook for, 453 collectivism, 163, 280 common factor models, 536–8, 539 communication modes, 148, 257 comparability challenge, in comparative survey research, 41–3 constructs and items, 45–9 project components, 43–5 Comparative Candidates’ Survey (CCS), 672 Comparative Manifestos Project (CMP), 672 Comparative National Elections Project (CNEP), 672, 682 Comparative Study of Electoral Systems (CSES), 11, 671, 672, 674, 677n2, 682 comparative survey research background variables see background variables, comparative surveys challenges, 41–53 comparability, 41–3, 45–9 of constructs and items, 45–9 of project components, 43–5 coverage errors, 43 democratic bias, 48 equivalence of measurement instruments ex ante, securing, 46–7 ex post, securing, 47–8 ethical/other concerns, 48–9 functional equivalence, 11, 46, 248, 250, 680, 690 international comparative cognitive studies compared to intercultural comparative cognitive studies, 47 measurement errors, 44–5 multicultural/multinational contexts, 157–8 non-response errors, 43 project components, comparability problems, 43–5 sampling for comparative surveys best sampling designs, 350–1 estimation, 353–4 future developments, 354 history and examples, 346–9 main requirements, 349–54 prediction of effective sample size and design effects, 351–53 sampling errors, 43 unique definition of target population and suitable sampling frames, 349–50 selection bias, 71 survey methodology, 10–11 Western assumptions problem, 48 comparison errors, 31 complete case analysis (CCA), 598
comprehension, research questions ordering questions, 228–9 visual design for, 227 wording for, 221–2 Computer Assisted Telephone Interviewing (CATI), 99, 126, 149 active management of, 404–6 adjustment strategy, 405–6 challenges, active management, 406 key indicators, 404–5 response propensity model, 404 responsive and adaptive designs, 398, 399, 403 responsive design strategy, 403–4 surveys, 403–6 Computer Audio Recorded Interviewing (CARI), 123, 133, 134, 367 Computer-Assisted Interviewing (CAI), 182, 446 Computer-Assisted Personal Interviewing (CAPI) surveys, 116, 117, 398, 404, 506, 574 indicators, 400–1 interventions, 401–2 issues, 399–400 lessons learned and future research, 403 responsive and adaptive designs, 398, 399 responsive design for, 399–403 survey modes, 147, 152 concepts and constructs, theoretical, 194–5 link with observed/latent variables, 196–7 operationalization of construct, 201–2 theoretical elaboration of concepts, 200–1 validity assessment of measured construct, 202–5 concepts-by-intuition, 238 basic, 239–41 conceptual and measurement validity, 194–200, 205, 206 conceptual equivalence, harmonization in view of, 278–80 conditional distributions, 110 conditional partial R-indicator, 565 conditional Poisson sampling, 321 conditional representative response, 564 confidence intervals, sampling, 331 confidentiality issues, 488–501, 656 data access, controlled, 497–8 importance of data confidentiality, 489–90 see also disclosure; ethical considerations configural invariance, 633, 634 confirmatory factor analysis (CFA), 205, 239, 589, 637, 639, 640 continuous, 634, 635 ordinal, 636 Confucianism, 280 congeneric measures, 196, 207n7, 535, 536, 539 consent issues, 80–1 construct validity, 197 contact phase mode change, 144 content validity, 199 context effects, 547
contextual data aggregated, 671 challenges of use within cross-national surveys analytical, 675–6 methodological, 674–5 organizational, 673–4 current state of in cross-national survey programs, 671–72 methodological advantages of use with cross-national surveys, 673 substantive advantages of use with cross-national surveys, 673 supplementing cross-national survey data with, 670–79 system-level, 671 continuous confirmatory factor analysis (continuous CFA), 634 continuous quality improvement (CQI), 126–31, 129, 132, 133 convenience sampling, 330 Converse, J.M., 281 Cornfield, J., 313 Corruption Perceptions Index (CPI), Transparency International, 675 Council of American Survey Research Organizations (CASRO), 18, 19, 22 Council of Canadian Survey Research Organizations (CCSRO), 18 Couper, M., 68, 75n2, 227, 364, 412, 417, 418, 421 coverage error, 30, 43 CQI see continuous quality improvement (CQI) credibility interval (CI), 642 Crespi, I., 22 critical to quality (CTQ), 126, 127, 128, 134 Cronbach, L. J., 205, 583 Cronbach’s alpha, 370, 582, 626 cross-country analyses, 485–6 cross-cultural comparability, assessment, 630–48 measurement invariance, establishing, 631–42 cross-cultural equivalence, 376 cross-cultural psychology, 275 Cross-Cultural Survey Guidelines, 163, 172, 444 cross-cultural/cross-national surveys data collection, mixing methods for, 146 goal, 269 good questions for cross-national survey data, 670–79 harmonization, 274–5, 278–80 psychology, 275 supplementing data with contextual data, 670–79 advantages of contextual data within cross-national surveys, 673 challenges of contextual data within cross-national surveys, 673–6 current state of contextual data in cross-national survey programs, 671–72 future trends, 676–7
survey climate, 67–8 survey life cycle for, 445f survey questions, harmonizing, 504–5 team-based review, 272 translation of measurement instruments for see translation of measurement instruments, cross-cultural surveys Crossley, A., 91, 92, 94, 95 Cross-National Equivalent File (CNEF), 7, 512 cross-sectional design, 111–12 CTST see classical true-score theory (CTST) Cullen, J. B., 280 cultural frame, 163, 164 Current Employment Survey (CES), 126, 127, 130 Current Population Survey, US, 146 Czechoslovak Institute of Public Opinion, 97 Da Silva, D. N., 468 Dahlhamer, J. M., 366 Dalenius, T., 4 Danish Data Archive (DDA), 447 DASHISH project, 284 data access see data access administrative, 38 analysis of survey data, 62–3 under-analysed, 657 archiving, 62, 65, 653–4, 686 barriers to reuse, 655–6 census, 59 citation, 654 collection see data collection confidential, 656 contextual, supplementing cross-national survey data with, 670–79 disciplines, sharing across, 658 discovery, 654 dissemination, 654–5 delayed, 655–6 documentation see data documentation ensuring greater access to research and results, 655 EU Data Protection Directive and Regulation, 80, 85n1, 85n2, 489 fabrication, 622 future directions, 658–9 harmonization, 658 long-term preservation, 653–4 metadata see metadata missing, 30 proprietary, 656 quality see data quality; quality considerations replication, 657 reusing, 450, 655–6 secondary analysis, 657–8 sharing, 652–3, 658 training, 657–8
see also confidentiality issues; data blurring; Data Documentation Initiative (DDI); data life cycle; data privacy; data protection; Data Science; data swapping; Data without Boundaries project (DwB); information; record linkage data access controlled, 497–8 and research transparency, 651–52 data analysis (stratified and clustered surveys), 477–87 cross-country analyses, 485–6 multilevel models, 485 sampling, 477–82 statistical software, 484–5 subpopulation analysis, 485 variance estimation, 484 see also cluster sampling; sampling; stratification/ stratified sampling Data Augmentation Algorithm, 608 data blurring, 496–7 data collection changes in methods, 64 digital, 78 mixing methods for, 146–7 multicultural/multinational contexts, 166–70 multiple modes and barriers, 59–61 protocols, 146, 170 data confidentiality see confidentiality issues data documentation, 443–59 across life cycle, 443–59 big data, 458 broader audience for data and metadata, 457–8 content inclusiveness, 444–5 Data Documentation Initiative see Data Documentation Initiative (DDI) data life cycle, 444 future developments, 454–8 innovative uses, 454–6 interactive, 453–4 metadata see metadata, rich paper era, emerging from, 452–3 quality indicators, 457 and research transparency, 456–7 standards-based tools, 446–7 thinking beyond the codebook, 450–54 web technology, use of, 450–54 Data Documentation Initiative (DDI), 21, 445–6 codebook Tools, 446 DDI Alliance, 445 Lifecycle Tools, 446 data life cycle, 444 data ‘masking,’ 656 data privacy, 81, 117 data protection, 81–2 data quality and measurement precision, 527–57 conceptualizing measurement error, 529–32 content of survey questions, 546–7
designs for assessing measurement errors, 535–40 information sources, 548 measurement error models for specific questions, 540–42 multitrait-multimethod models, 542–3 populations, attributes, 546 quantifying measurement error, 532–5 quasi-simplex models, 543–5 research findings, 545–51 survey methodology, 9–10 Data Science, 38 data swapping, 495–6 Data without Boundaries project (DwB), 13n5 Dataverse Network, 446 Davidov, E., 641 De Heer, W., 409 de la Puenta, M., 213 de Leeuw, E. D., 147, 150, 409, 418 Deff design effect, 352, 353, 480, 481, 482 DEGREE variable, 295 DeMaio, T., 374 Deming, W. E., 9 democracy, and surveys, 12 Denmark, 621t, 626, 663, 685 Depoutot, R., 504 Derge, D., 97 description descriptive designs, 110–11 experimental and non-experimental designs, 108 research questions, 106 design effects, 352, 353, 480, 481, 482 design of research/survey adaptive designs, 132–3 adaptive total design, 133–4 areas of turmoil, 179–84 cohort studies, 113 comparative survey research, 45 complexity of survey designs, 6 cross-sectional, 111–12 descriptive designs, 110–11 difference-in-difference design, 109, 110 experimental and non-experimental, 108–11 ex-post-facto design, 110 general aspects, 114 innovative examples for causal analysis with survey data, 118–19 international standardization of surveys, 118 interrupted time-series design, 112 longitudinal designs, 111, 112 questionnaires/questions, 218–35 responsive design see responsive design rolling cross-section design, 112 sample design, 159–62 economic conditions and infrastructure, 160 language barriers/multiple languages, 159–60 physical environment, 160–1 political context, 160
research traditions and experience, 161–2 social and cultural context, 159–60 Six Sigma, 134 survey design, from a TSE perspective, 33–5 temporal dimension, 111–14 Total Survey Error (TSE), implications for, 123–5 units of analysis and observation, 107–8 visual design of questions and questionnaires for comprehension, 227 to facilitate response, 225–8 for judgment, 227 for reporting, 228 see also design robustness design robustness, 124 design weighting, 460, 462–3, 483 Desrosières, A., 13n3, 62, 301 deterministic samples, 313–14 Deville, J.-C., 314, 323, 463, 464, 465, 469, 473 Dewey, Thomas, 95 difference-in-difference design, 109, 110, 118 Dillman, D., 124–5, 142, 144, 145, 150, 152, 220, 229, 259, 414, 426, 429, 431, 435 directories, 60 disclosure coarsening, 494 control methods, 493–7 non-perturbative, 493–5 perturbative, 495–7 identity, 490 measures of risk, 491–93 noise addiction, 495 re-identification, 490 risk assessment, 490–51 standards, 17 top- and bottom-coding, 494–5 see also confidentiality issues discovery, 654 disproportional allocation, 479 dissemination of data, 654–5 delayed, 655–6 distance function, 463, 464 DNA, 171 documentation societies in turmoil, 185 translation, 274 Dogan, M., 671 domain indicator, 461 Donsbach, Wolfgang, 22–3 doorstep interaction, 70 DuBois, W. E. B., 88 Duffy, B., 574 Duncan, O. D., 529 Durrant, G. B., 415, 418 Dykema, J., 230–1 dynamic designs, 568 East Asia Barometer Survey (EABS), 281 East Asia Social Survey (EASS), 7, 278–9, 281
East Asia Value Survey (EAVS), 281 ecological fallacy, 108 economic-exchange theory, 426, 431 education attainment, 292–8, 296–7 CASMIN education scheme, 294 fields of specialization, 298 positional measures, 295 qualifications, 293–5 years of, 293 Edwards, J. R., 195 Edwards, M. L., 152 effect generalizability, 124 Eisenhower, Dwight, 94, 97 election forecasting, 38 Electoral College, US, 89 electoral polls, 17, 21–2, 63 of 1948, 95–6 exit polls, 98 Elias, Peter, 300 ELLIPS initiative, 8 email, 61 embedded probing, 372–3 Engel, U., 588 epidemiological research, 110 epistemic correlation, 206n1 equivalence conceptual, 278–300 cross-cultural, 376 functional see functional equivalence of measurement instruments, 45–6, 520n1 ex ante, securing, 46–7 ex post, securing, 47–8 Ericsson, K. A., 362 Erikson, R., 301 errors of abstraction, 203 comparison, 31 correlated, 31 coverage, 30, 43 frame, 30 of generalization, 203 measurement see measurement errors nonresponse see nonresponse errors non-sampling, 28 random, 31 reducing, 99 sampling, 27, 28, 29, 43, 318 survey administration, 30–1 systematic, 31 Total Survey Error (TSE) see Total Survey Error (TSE) types of survey error source, 29 uncorrelated, 31 Esmer, Y., 435 ESOMAR see European Society for Opinion and Market Research (ESOMAR) Esposito, J. L., 359
ESS see European Social Survey (ESS) estimation design-based, 316–17 model-based, 317–18 estimation weight, 460 ethical considerations comparative surveys, 48–9 consent and notification, 80–1 data privacy, 81 do no harm, 83 incentives to increase response rates, 434–5 interviewing of children, 83–4 principles, 79–80 protecting respondents’ data, 81–2 research in multiple contexts, 171–2 respectfulness, 82–3 standards, 17 survey and market research, 77–86 turmoil, societies in, 184–5 ethical imperialism, 48 ethnic drifting, 58 ethnicity background variables, comparative surveys, 289–92 boundaries, 290 measurement, 58 EU Data Protection Directive and Regulation, 80, 85n1, 85n2, 489 Eurasia Barometer, 10 Eurobarometer, 10, 672, 690n2 European and World Value Surveys (EVS/WVS), 682 European Election Studies (EES) Project, 672 European Federation of Associations of Market Research Organizations (EFAMRO), 18 European Labour Force Survey see Labour Force Survey (LFS) European Social Survey (ESS), 7, 11, 20, 47, 69, 115, 118, 238, 388, 437, 486, 503, 550, 592 background variables, comparative surveys, 291, 292, 295 Central Scientific Team, 506 comparative survey research, 347, 349, 350, 351, 354 contextual data, 671–72 cross-cultural comparability, assessment, 630, 641 data documentation, 444–5 nonresponse, 409–10, 415, 419 Sampling Guidelines, 351 European Society for Opinion and Market Research (ESOMAR), 18, 19, 20, 22, 80, 85, 85n4, 94, 685, 686 Guide to Opinion Polls, 686 Guideline on Interviewing Children and Young People, 84 Guideline on Mobile Research, 83 Guideline on Social Media Research, 82 ICC/ESOMAR International Code, 79, 81, 83 members, 79 European Socio-economic Classification (ESeC) project, 301
European Statistic on Income and Living Conditions (EU-SILC), 7 European Statistical System, 301 European Survey on Income and Living Conditions (EU-SILC), 410 European Survey Research Association (ESRA), 8, 18 European Union Statistics on Income and Living Conditions Survey, 486 European Values Survey (EVS), 7, 347, 508, 630 Eurostat, 506 evaluation of surveys/survey products, 134–9 exclusion, 412 exit polls, 98 expectation maximization (EM), 604 expected parameter change (EPC), 638 experimental designs, 109, 110–11 experiments choice experiments, 115–16 natural, 110 and observational studies, 108 survey, 115–16 expert opinion, 71 expert selection, sampling, 330 explanatory research, 106–7, 108 exploratory structural equation modeling (ESEM), 639, 640 exploratory survey research, 106 ex-post-facto design, 110 eXtensible Markup Language (XML), 446 external validity, 28, 111 Fabrigar, L. R., 584 face-to-face interviews/surveys, 60, 81, 142, 143, 186n1, 255, 432 factor analysis, 63, 582 factorial surveys, 115 fake surveys, 74 Farrall, S., 374 Federal Committee on Statistical Methodology (FCSM), 457 Ferrandez, F. L. A., 290 Field, H., 94, 95, 98 field interviewers (FIs), 125 field-based probing, 372–3 fieldwork, 382–96 announcing the survey, 384–5 back-checks, 389 defining actions and repercussions in case fieldwork targets not met, 390 ex-post data checks, planning for, 388–90 interviewer effects, 384, 392–4 monitoring and controlling, 419 planning, 384–90 protocols, 391 real-time fieldwork monitoring, planning for, 387–8 respondent incentives, 385–6 responsive designs, 390–92
Fienberg, S. E., 493 Fisher, R. A., 394 Fiske, D.W., 239, 531, 532, 541, 542 Fiske-Rusciano, R., 689 fit-for-purpose/fitness-to-use, 341, 375 fixed effects, 113, 118 fixed size sampling design, 318 Flash Eurobarometer, 348 for-profit firms, 23 Forsyth, B. H., 376 Fowler, F.J. Jr, 230–1, 367 frame choice, 162 frame errors, 30 Franzen, Raymond, 94 Freedom in the World country ratings, Freedom House, 674 frequentist inferential approach, 332 Frey, F. W., 45 Fried, E., 211 Friedland, C. J., 182 full information maximum likelihood (FIML), 604 Fuller, W. A., 467 Fulton, J., 149 functional equivalence, 11, 46, 248, 250t, 680, 690 Furr, R. M., 584 Gallhofer, I. N., 194, 195, 200, 236, 239, 245 Gallup, G., 91, 92, 93, 95, 96, 690n4 Gallup polls, 92, 93, 94 Gallup World Polls, 684–5 Galton, F., 13n3, 90 Galvani, L., 322 Gamliel, E., 584 Gangl, M., 118 Ganzeboom, H. B. G., 300 Gaudet, Hazel, 93–4 General Population Survey (GPS), 561 General Social Survey (GSS), US, 6, 502 generalization, error of, 203 generalized regression estimation, 464, 572–3 generalized weight share method (GWSM), 461, 469, 472 Generations and Gender Programme (GGP), 672 Generic Statistical Business Process Model (GSBPM), 447, 448, 458n3 Generic Statistical Information Model (GSIM), 447, 458n4 Geographic Information System (GIS), 179 geolocation, 677n1 Gerber, E.R., 212, 213, 214 German Record Linkage Center (GRLC), 667 German Socio-Economic Panel (G-SOEP), 7, 118 Germany, 304n6 and data analysis, 478, 479, 481, 482 East and West, 479, 482 and response rates, 435, 437 and sampling, 348, 349, 350, 353
GESIS, international survey programs at, 346 Gesis Summer School in Survey Methodology, 158 Gestalt psychology, 226 Gibbs-Sampler, 608 gifts, 430 Gilani, I. and B., 689 Gilbert, C., 184 Gini, C., 312, 322 Global Barometer Surveys, 7, 10, 13n5, 690n2 Global Positioning System (GPS) devices, 44, 173, 181, 561, 567, 575 adaptive survey designs for survey, 569–70 detecting bias in survey, 567 globalization of surveys, 680–92 alternative sample designs, 688 contemporary coverage and limitations, 684–6 cross-national handbooks and edited volumes, 687–8 data archives and data sources, 686 future prospects, 689–90 historical development, 681–82 impacts, 64 international academic, professional, and trade associations, 686–7 international and cross-national surveys, 682–4 international standards and guidelines, 688 survey-research and social-science journals, 687 world opinion concept, 688–9 Goldman, E., 68, 71, 93 Goldthorpe, J. H., 301 Goldwater, B., 98 Goot, M., 689 Gould, S. J., 63 Goyder, J., 71 GPS see Global Positioning System (GPS) devices Grais, B., 505 Granda, P., 518 Greece, 620, 621, 626 Greene, V. L., 535 Grice, H. P., 213 Groves, R. M., 4, 29, 32, 34, 68, 75n2, 132, 391, 398, 406, 410, 412, 413, 415, 416, 417, 418, 419, 425, 427, 434, 443, 568 Growth of American Families (GAF), 450 Grundy, P. M., 315 Guttman, G. G., 198 Haberstroh, S., 164 Hainmueller, J., 118 Hanif, M., 319 Hansen, M. H., 28 Hantrais, L., 49 Harkness, J. A., 158, 163, 165–6, 167 harmonization cross-cultural surveys, 274–5, 278–80 data, 658 input, 506, 507–9
output, 510–14 survey questions see survey questions, harmonizing Harris, Lou, 98 Harris Interactive, 339 Harris Organization, UK, 98 Harris-Kojetin, B. A., 366 Harrison, E., 4, 5, 301, 412 Harrison, T., 97 Heckman, J. J., 605 Heeringa, S., 132, 160, 161, 162, 350, 391, 398, 406, 417, 484, 568 Helbig, M., 118 Herbst, S., 90–1 Hermann, D., 211 High-Level Group (HLG), Modernisation of Statistical Production and Services, 447 Hirschi, T., 106 historical perspective, 6–8 Hoffmeyer-Zlotnik, J. H. P., 292 Hofstede, G. H., 165, 280, 586 Hoglund, K., 181, 183 Holbrook, A. L., 590 Hollerith card, 90 Horn, J. L., 631 Horowitz, J. L., 611 Horvitz, D.G., 315, 560 Horvitz–Thompson estimator, 316, 321, 323, 462, 463, 465, 473, 560, 570, 571 household panels/surveys, 112, 258–9 Household Response Workshop, 21 Hox, J., 153, 418 Hu, L. T., 636 Hubbard, F., 347 Human Development Index (HDI), UNDP, 675 Hurja, E., 91 hybrid designs, 113 Hyman, H., 94, 230 identical response patterns (IRPs), 621, 622 Ijsselmuiden, C.B., 48 in scope (IS), 473 incentives to increase response rates, 425–40 charities, 430 conditional/’promised,’ 429 consequences of using, 433–4 differential, 434–5 ethical and practical considerations, 434–5 gifts, 430 impact on data quality, 433–4 impact on recruitment process, 433 long-term consequences, 435 lotteries, 430 monetary, 429–30 non-monetary, 430 quasi-monetary, 430 semi-monetary, 430 social acceptance, 435
and survey modes, 431–33 unconditional/prepaid, 429, 430 values, 431 when and where to use, 435–7 see also nonresponse; response rates Index of Response Differentiation (IRD), 591 indirect sampling, 461, 468–9 individual matching, 333 individualism, 163 industrial associations, 24n1 InfoQ, 37 information availability of, 7, 406, 490 contextual, 68, 449, 670, 672, 674 gathering of, 4, 73, 77, 78, 183, 373, 684 personal, 67, 70, 78, 80, 81, 88 reliable, 44, 178 sources, 4, 8, 117, 182, 532, 548, 597 suppression of, disclosure control, 493–4 see also data; Generic Statistical Information Model (GSIM); Geographic Information System (GIS); Office of War Information (OWI) information transmission and communication, 147 informed consent, 48 inherent risk, 136 in-person national surveys, 38 input harmonization, 506, 507–9 Institute for Social Research (ISR), US, 95 Institutional Review Board, 33 instrumental-variable designs, 110 Integrated Fertility Survey Series (IFSS), 450 Integrated Public Use Microdata Series (IPUMS), 507, 512, 514, 658, 684 IPUMS Integrated Coding Scheme for Marital Status, 515 Intelligence Quotient (IQ) tests, 63 Interactive Voice Response (IVR), 61, 147 internal consistency reliability (ICR), 535 internal validity, 28, 111 internally displaced populations (IDP), 181 International Association of Applied Psychology (IAAP), 18 International Association of Survey Statisticians (IASS), 18 International Chamber of Commerce (ICC), ICC/ ESOMAR International Code, 79, 81, 83, 85 International Conference on Methods for Surveying and Enumerating Hard-to-Reach Populations, 21 International Conference on Survey Methods in Multicultural, Multinational and Multiregional Contexts, 158 International Conference on Telephone Survey Methodology, 21 International Field Directors and Technologies Conference, 21 International Labour Organization, 299
International Organization for Standardization (ISO), 16, 24, 687, 688 ISO 20252 on Market, Opinions and Social Research, 22, 173, 338 ISO 26363 on Access Panels in Market, Opinion and Social Research, 20 International Political Science Association (IPSA), 18, 686 International Republic Institute (IRI), 98 International Social Survey Programme (ISSP), 11, 62, 71, 146, 280, 347, 436, 622, 630, 682, 690n1 background variables, comparative surveys, 291, 292, 295 comparative survey research, 46, 47 General Assembly, 46 response scales, translation, 281–2 survey questions, harmonizing, 503, 506 International Socio-Economic Index (ISEI), 300 International Sociological Association (ISA), 18, 686 International Standard Classification of Education (ISCED), 294–5 ISCED 2011, 296–7, 298 International Standard Classification of Occupations (ISCO), coding, 299–300, 511 International Statistical Institute (ISI), 18, 686 International Workshop on Comparative Survey Design and Implementation (CSDI), 21, 158 International Workshop on Household Survey Nonresponse, 409 Internet email, 61 questionnaires, 258–9 self-administered surveys, 60 web surveys see web surveys web technology, use of, 450–54 Interpretivist perspective, 376 interrupted time-series design, 110, 112 Inter-university Consortium for Political and Social Research (ICPSR), 220, 447 interviewer-administration (IAQ), 360, 375 interviews children, interviewing, 83–4 cognitive interviewing and question response process, 214–15 face-to-face, 60, 81, 142, 143, 186n1, 255, 432 incentives for conducting, 85n4 interviewer debriefing, 366–7 interviewer effects, 384, 392–4 interviewer task simplification, 619–20, 621 interviewer-respondent matching, 168 interviewers and data quality issues, 616–17 interviewers and measurement errors, 45 multiple, 31 paper advance letters in, 144–5 standardized, 30 survey climate, 67 surveys and society, 60
telephone, 142–3, 145, 256–7 training of interviewers, 386–7 see also cognitive interviewing intra-class correlations (ICCs), 393 IPUMS (Integrated Public Use Microdata Series) see Integrated Public Use Microdata Series (IPUMS) ISSP see International Social Survey Programme (ISSP) item count technique (ICT), 224 Item Response Differentiation (IRD), 618 Item Response Theory (IRT), 519, 530 analysis procedures and implications of findings, 370 key procedures, 369–70 purpose and context, 369 resource requirements, benefits and limitations, 370 iterative proportional fitting see raking ratio estimation Jackknife Repeated Replication (JRR), 484 Jäckle, A., 418 Jackson, A., 89 Jackson, D. N., 583, 586 Jackson, J. S. H., 198 Jahoda, M., 88 Jaro–Winkler string comparator, 665 Jenkins, C., 220 Jensen, A., 312 Jobe, J. B., 211 Johnson, Lyndon, 97 Johnson, T. P., 45–6, 47 Jöreskog, K. G., 535 Journal of Market Research, 19 Journal of Official Statistics (JOS), 8 journalism, 18 Jowell, R., 689 judgmental sampling, 330 judgment ordering of questions for, 229 visual design for, 227 wording questions for, 222–3 Kalton, G., 139, 472, 546 Kasprzyk, D., 139 Kass, V. G., 468 Kellogg, P., 88 Kenett, R.S., 37 Kennedy, C., 153 Kennedy, J. F., 97 Kenny, D. A., 637 Kerckhoff, A. C., 518 Kerlinger, F. N., 195, 196, 199, 205 Kern, H. L., 118 Kim, J., 75n4 Kish, L., 28, 342, 346, 347, 352 Klausch, T., 150 Klopfer, F. J., 281 Kolenikov, S., 153
Koning, P. L., 239 Kosovo, 180 Köster, B., 588 Kradolfer, S., 290 Kreuter, W., 409, 415, 484, 667 Krosnick, J. A., 32, 281, 371, 550, 584, 587, 588, 614 Krumpal, I., 119 Kruskal, J., 13n3 Kuper, H., 184 Labor Force Survey (LFS), 137, 138, 146, 397, 410, 506, 517 Laflamme, F., 395 laissez-faireism, 23 Lambert, Gerald, 93, 94 Landreth, A., 374 Latent Class Analysis (LCA), 369, 519 latent constructs, 370 latent response distribution, 635 latent variables, 195–6, 204, 206n4 link with theoretical concepts/constructs, 196–7 Latin America, 58 Latino Barometer, 10 Latvia, 622 Lavallée, P., 469, 470 Lawry, L., 184 Lazarsfeld, P., 88, 94, 95, 197 Le Guennec, J., 465 Le Play, F., 88 Lee, J., 363 Lee, S., 164 Lepkowski, J. M., 162 Lessler, J. T., 376 Leung, K., 276, 277 leverage–salience theory, 428 Lewinsky, Monica, 99 Lewiston and Clarkston Quality of Life Survey, 266 Life and Labours of the People of London (Booth), 88 life-history studies, 113 Likert, R., 93, 95 Likert data, 636 Likert scales, 634 limited response differentiation (LRD), 588 Lin, I.-F., 415 Lin, Y., 347 linear weighting, 338, 339 linguistics, 9 Link, H. C., 92 LISS panel, 8 Literary Digest, 8, 90, 91, 92, 93, 340 Little, R. J., 466, 561, 570 Loftus, E., 211 logical positivism, 197 logistic regression model, 467, 561 logit model, 561 longitudinal designs/studies, 111, 112, 146, 336 multicultural/multinational contexts, 168, 172
Longitudinal Internet Studies for the Social Sciences (LISS) panel, Netherlands, 259 Loosveldt, G., 72–3, 75n3 Lord, F.M., 198, 531 Lou Harris Poll, 96 Lubke, G. H., 634 Lucas, S. R., 676 Luiten, A., 419 Lundström, S., 467 Luxembourg Income Study (LIS), 512 Lyberg, I., 75n1 Lyberg, L. E., 5, 33, 35, 75n1, 443 Lynn, P., 160, 161, 162, 346 macro indicators, 5 Madans, J., 359, 528, 529, 532, 533, 541 Madge, Charles, 97 mail questionnaires, 257–8 mail surveys, 61, 62, 124, 330, 420, 437 incentives to increase response rates, 431, 432 mixed-mode survey, designing, 258, 261, 262, 263 survey modes, 142, 144, 145, 150 Maloney, Jack, 94 Maraun, M., 198 March, Lucien, 312 Market and Social Research profession, 77 market research profession, 78, 90 Market Research Quality Standards Association (MRQSA), 18 codes, 19 Marketing Research Association (MRA), 18 Code of Marketing Research Standards, 22 Professional Researcher Certification Programme, 22 marketplace-of-ideas approach, 23 Markov Chain Monte Carlo (MCMC) Methods, 608 Markovian process, 544 Marplan, Ltd, 96 Marsh, H. W., 634 Martin, E., 221 Mason, R., 229 mass media, 4, 18 Mass-Observations, 97 maximum likelihood (ML), 601, 603, 610 Maynard, D. W., 367 McCable, S. E., 484 McClendon, M. J., 550 McCoach, D. B., 637 McKinley, William, 90 mean squared error (MSE), 31, 33, 34, 124, 134, 338 measurement bias see measurement bias common, achieving across survey response modes, 260–64 concepts and constructs, theoretical, 194–5 link with observed/latent variables, 196–7 operationalization of construct, 201–2
theoretical elaboration of concepts, 200–1 validity assessment of measured construct, 202–5 conceptual validity, 194–200, 205, 206 errors see measurement errors ethnicity, 58 independence see measurement independence instruments see measurement instruments latent variables, 195–7 meaning in a survey context, 193–209 observable variables, 195, 197 operationalization, 196 of construct, 201–2 from operationalized construct to measurement, 200–5 reliability, 527 terms, distinguishing, 194–6 units, 107 see also measurement invariance, establishing measurement bias, 28 measurement errors, 44–5, 529, 580 conceptualizing, 529–32 and data quality, 150–1 designs for assessing, 535–40 common factor models, 536–8 internal consistency approaches, 535–6 multiple measures vs multiple indicators, 537–40 designs for studying reliability, 541 designs for studying validity, 541–42 interviewers, 45 models for, 533–4 models for specific questions, 540–42 non-random, 540 observed vs latent values, 529–30 potential for, and common mixed-mode designs, 151 quantifying, 532–5 reliability and validity concepts, 531–32 measurement independence, 537 measurement instruments securing equivalence, 45–6 ex ante, 46–7 ex post, 47–8 translation for cross-cultural surveys, 269–87 measurement invalidity, 534–5 measurement invariance whether cross-loadings allowed, 638–40 description/function, 630, 631 establishment, 631–42 exact or approximate, 641–42 full or partial, 640–41 level of measurement invariance required, 632–4 rules to apply for model evaluation, 636–8 type of data used, 634–6 measurement validity, 201, 202, 215, 527 and conceptual validity, 194–200, 205, 206 see also measurement invalidity Mecatti, F., 472 Medeiros, N., 457
media, 11 see also mass media Media Ratings Council (MRC), 18 Medway, R. L., 149 Meehl, P. E., 205 Meitinger, K., 509 Meredith, W., 634 Merllié, D., 13n3 Merton, R. K., 651 Messer, B. J., 150 Messiani, A., 470 Messick, S., 583, 586 metadata, 444–5, 653 broader audience for data and metadata, 457–8 capture in key data series, 447, 449 capturing at source, 447 capturing in National Statistical Institutes (NSIs), 447 documentation, 653 metadata-driven survey design, 446 re-use, 449–50 structured machine-actionable metadata standards, using, 445–7 variable comparison tool based on DDI metadata, 451 see also data Metadata Portal Project, 447 methodology, survey challenges and principles, 3–15 comparative dimension, 10–11 as a discipline within disciplines, 8–10 journals, 8 more research, need for, 10–11 survey climate, 71–2 Total Survey Error (TSE) as paradigm for, 27–40 Methodology of Intercultural Surveys (MINTS), 283 Methods for Testing and Evaluating Survey Questionnaires (Presser), 359 metric invariance, 633 Meunsterberg, Hugo, 88 Michigan Questionnaire Documentation System (MQDS), 446 microaggregation, 496–7 microdata, 137, 686 analytical potential vs data confidentiality, 488, 489, 490, 491, 492, 493, 495, 496, 497, 498 Integrated Public Use Microdata Series (IPUMS), 507, 512, 658, 684 Miller, K., 214, 215 Missing at Random (MAR), 561, 562, 571, 573, 598, 599, 602, 610, 611 Missing Completely at Random (MCAR), 561, 598, 599, 600, 601, 609 missing data, 30 missing values, 597–612 ad-hoc techniques, 599–600 classification, 598–9 definitions, 597
discussion, 609–11 ignorability of missing mechanism, 600–603 ignorable case, 603–5 likelihood approaches, 603–5 multiple imputation, 605–9 general, 605–7 ignorable case, 607–9 nonignorable case, 609 nonignorable case, 605 Mitofsky, W., 60, 98–9 mixed-mode designs, 144–7 concurrent mixed-mode design, 146 contact phase mode change, 144–6 data collection, mixing methods for, 146–7 effective use of mixed-mode strategies, 151–3 longitudinal studies, 146 methods of mixing, 151–2 whether mixed-mode strategy improves quality, 149–51 mixed-mode survey see mixed-mode survey, designing response phase mode change, 144, 146–7 sequential, 146, 150 single-mode to mixed-mode, 142–4 see also modes, survey mixed-mode survey, designing, 255–68 achieving common measurement across survey response modes, 260–64 aural vs visual presentation of questions, reducing differences from, 263–4 avoiding mode-specific question structures when possible, 261–2 survey as potential solution for coverage/response problems, 259–60 web+mail mixed-mode design, testing, 264–6, 265 why single mode studies are declining, 255–8 wording differences, reducing, 262–3 Mneimneh, Z., 179 mobile phones, 60, 78 mode effects, 31 model-based estimation, 317–18 modes, survey, 142–56 avoiding mode-specific question structures when possible, 261–2 contact phase mode change, 144–6 contact strategies, mixing, 144–6 differences between, 147–9 estimation and adjustment, 153 mixed-mode see mixed-mode survey, designing mode differences and measurement error, 150–1 multiple and mixed modes, 419–20 nonresponse, 412 response phase mode change, 146–7 response rates, 149–50 single-mode, 142, 256–9 telephone calls to screen respondents, 145 unified (uni) mode design, 152–3
modification index (MI), 638 Mohl, C., 398 Mohler, Ph, 45–6 monocultural surveys, 41 Morning, A., 290–91 Mosteller, F., 13n3 MRA see Marketing Research Association (MRA) MTMM see MultiTrait-Multimethod (MTMM) models Müller, W., 492 multicultural/multinational contexts, surveying in, 157–77 background, 157–8 challenges, 158–70 coordination of multinational projects, 171 cross-national research, good questions for, 247–9 data collection, 166–70 economic conditions and infrastructure, 160, 169 ethical considerations, 171–2 future directions, 172–3 physical environment, 160–1, 169–70 political context, 160, 166, 168–9 questionnaire development, 163–6 research traditions and experience, 161–2, 166, 170 sample design, 159–62 social and cultural context data collection, 166–8 questionnaire development, 163–6 sample design, 159–60 standardization, 170–1 survey context, 158–70, 159 Multigroup Confirmatory Factor Analysis (MGCFA), 519, 631, 634, 636, 638–9 ordinal, 637 Multigroup Structural Equation Modelling, 207n9 multi-level designs, 114 multi-level modelling (MLM), 485, 676 multinational panels, 168 multiple correspondence analysis (MCA), 590, 621, 623 multiple imputation, 605–9 general, 605–7 ignorable case, 607–9 nonignorable case, 609 multiple-frame approaches, 470 multiplicative weighting see raking ratio estimation MultiTrait-Multimethod (MTMM) models, 239, 245, 249, 369, 589 data quality and measurement precision, 535, 537, 541, 542–3, 548, 550, 551 split-ballot design, 244 Muthén, B. O., 634, 636, 642 Nathan, G., 470 National Center for Health Statistics (NCHS), 361 National Comorbidity Replication (NCS-R), 450 National Council of Public Polls (USA) (NCPP), 18 National Democratic Institute (NDI), 98 National Educational Panel Study, Germany, 113
National Election Studies, University of Michigan, 447 National Fertility Surveys (NFS), 450, 458n6 National Health Interview Survey in Disability, 229 National Institutes of Statistics, 299, 300 National Latino and Asian American Study (NLAAS), 450, 454 National Longitudinal Survey of Youth, US, 7 National Opinion Research Center (NORC), US, 95, 447 National Research Council, US, 421 National Science Foundation, US, 447 National Statistical Institutes (NSIs), 447 National Survey of American Life (NSAL), 450 National Survey of Family Growth (NSFG), 400, 401, 450, 458n7 natural experiments, 110 Nedyalkova, D., 325 Neely, B., 124 Neo-Weberian class schema, 301 Nesstar (DDI Codebook tool), 446 network sampling, 330, 469 New European Barometer, 690n2 New Zealand, 166–7, 300 Neyman, J., 28, 96, 314, 315, 325 Nielsen, A. C., 93 Nigeria, 167 Nixon, Richard, 97 Noelle-Neumann, E., 10 Nomenclature of Territorial Units for Statistics (NUTS), 494 noncontact, 411 noncoverage, 412 nondifferentiation, 587–8 non-probability sampling approaches and strategies approximations of standard probability sampling principles, 332–5 specific modelling approaches, 335–6 definitions, 329 low response rate, 330 non-negligible population, 330 non-probability online panels, 336–8 purposive sampling, 330 sampling design approximations, 332–4 selected topics, 336–40 weighting, 338–40 see also probability sampling; sampling non-profit organizations, 98 nonresponse, 19, 30, 127, 409–24 adjustors, 413 arguments for and against survey participation, 417–18 auxiliary data, 419 challenge, 409–14 collection of interviewer observations, 415 correlates, 414, 415 current approaches to nonresponse problem, 412–14 current state of affairs, 418–19
developments in nonresponse research agenda, 414–21 errors see nonresponse errors factors behind, 411–14 fieldwork, monitoring and controlling, 419 hard-to-survey groups and exclusion mechanisms, 420 high quality survey data, importance, 420–21 indirect sampling, 468–9 models of persuasion and nonresponse bias, 415–16 multiple and mixed modes, 419–20 new research agenda, towards, 421 noncontact, 411 nonresponse propensity and measurement error, 416 nonresponse weighting adjustment, 465–8 paradata, use, 415 response rates, calculating, 410–11 responsive design, 415 sequential design for nonresponse reduction, 416–17 survey climate and culture, 417–18 survey ethics and data protection, 420 survey of interviewers, 418 survey participation theories, 414–15 and weighting, 460, 469 nonresponse errors, 27, 43, 558–78 adaptive survey designs, 568–70 adjustment by estimation, 570–76 coefficient of variation and nonresponse bias, 565–6 correcting, 567–76 detecting, 559–67 indicators for detecting, 562–7 generalized regression estimation, 572–3 GPS survey, detecting bias in, 567 indicators and auxiliary variables, 566–7 post-stratification, 571–72 raking ratio estimation, 573–4 and response propensities, 559–62 sample-based R-indicators and partial R-indicators, 564–5 nonresponse weights, 338 non-statistical dimensions, 34 non-verbal communication, 148 NOPVO study, 337 Norway, 331 Norwegian Social Science Data Services (NSD), 446 Not Missing At Random (NMAR), 561, 566, 598, 599, 602 Novick, M., 198, 531 null hypotheses, 16 O’Barr, W. M., 49 Oberg, M., 181, 183 objective variables, 239 observable variables, 195, 196–7 observational studies observational or quasi-experimental designs, 108–9 types, 110–11 units of observation, 108
occupation, 298–300 measurement through ISCO coding, 299–300 Oesch, D., 302 Office of Management and Budget (OMB), 135, 413, 457 Office of Public Opinion Research, US, 93 Office of War Information (OWI), 93–4 Olson, K., 417 O’Muircheartaigh, C., 160, 161, 162, 350, 616 one-factor theory, 63 Open Data Kit (ODK) Collection, 182 operationalization/operationalism, 4, 194, 196, 197, 200 construct, operationalization of, 201–2 opinion polls, 80 surveys and public policy, 63–4 Opsomer, J. D., 468 optimal allocation, 322 ordinal confirmatory factor analysis, 636 Organization of Economic Cooperation and Development (OECD), Programme for International Student Assessment, 20 Ortmanns, V., 517 out of scope (OOS), 473, 474 output harmonization, 506, 510–14 overcoverage, sampling, 312 Oyserman, D., 163, 164 panel research, inviting and selecting respondents for, 145 panel studies/surveys, 432 access panels, 7, 336 household panels, 112, 258–9 multinational panels, 168 non-probability panels, 336–8 offline panels, 8 online panels, 7, 78, 146, 336–8 Panel Survey of Income Dynamics (PSID), 7, 685 paper advance letters, in interviews and web surveys, 144–5 paradata, 7, 39n1, 69, 398, 466 research questions, 115, 116–17 paralinguistic communication, 148 Pareto diagrams, 36 participation in surveys arguments for and against, 417–18 cognitive interviewing, 362 reasons for, 426–8 theories, 414–15 willingness to participate, 69–71 Payne, S., 8, 376, 551 Pearson, K., 13n3 Peer, E., 584 Pennell, B. E., 168, 170, 179 perception, visual design for, 227 Persian Gulf War, 99 personal identification number (PID), linking records with and without, 663–4
perturbative disclosure methods, 495–7 non-perturbative, 493–5 Peytchev, A., 124, 409, 410, 568 Pham, J. P., 180 phase capacity, 132 Philippines, 97–8 Philippovich, E. von, 88 PIAAC see Programme for the International Assessment of Adult Competencies (PIAAC) PISA study, educational research, 118 planned experiments analysis procedures and implications of findings, 371 key procedures, 371 purpose and context, 370–1 resource requirements, benefits and limitations, 372 Podsakoff, P. M., 588 point of contact (POC), 127, 130 Poisson sampling, 320–21 Poisson-Binomial distribution, 320 Poland, 97 political context, 160, 166, 168–9 political science, 9 polls electoral/pre-election, 17, 21–2, 63, 95–6, 98 Gallup polls, 92, 93, 94 historical development, 87–102 early developments and examples, 87–9 election of 1948, 95–6 international advances, 97–8 nineteenth century, 89–90 post-war/post-1948, 96–8 twentieth century (first half), 90–5 twenty-first century, 98–9 ‘just in time’ results, 63 opinion polls, 63–4, 80 public opinion, 89 straw, 92 straw polls, 6, 89, 90 Popper, K., 197 population, 161, 311–13 hard-to-reach groups, 114, 160 see also target population population characteristics, 32 population weighting, 338 Postal Delivery Sequence File (PDSF), 142 postal surveys, 142 post-stratification, 461, 571–72 post-stratified estimator, 464 preprocessing, 664–5 Presser, S., 21, 281, 359, 366, 374, 548 pretesting, questionnaires, 359–81 analysis procedures and implications of findings for questionnaire modification, 363, 365, 366, 368, 370, 373 background and scope, 359–61 behaviour coding, 367–9 cognitive interviewing, 361–64
comparison vs combination of pretesting techniques, 374–5 definitions, 358 evolution of pretesting methods as function of survey administration mode, 375 field-based probing, 372–3 future directions, 376–7 interviewer debriefing, 366–7 Item Response Theory (IRT), 369–70 key literature sources, 359–60 minimal standards, 376 planned experiments, 370–72 procedures within field pretest, 366 psychometric analysis, 369–70 questionnaire development, pretesting and evaluation sequence, 360 resource requirements, benefits and limitations, 363–4, 366–9, 372, 373 with response process in mind, 230–1 Stand-Alone Pretesting Procedures, 361 Total Survey Quality framework, 375–6 usability testing, 364–6 Priede, C., 374 priming effects, unintentional, 229 principal components analysis (PCA), 618 privacy preserving record linkage (PPRL), 666 probabilistic panels, 8 probability proportional to size (PPS) sampling, 481 probability sampling, 96, 312–13 approximations of standard principles, 332–5 deciding about approximations, 340–1 definitions, 329 direct incorporation of probability sampling design principles, 333–4 indirect measures to approximate sample designs, 332–3 sampling design principles, 331 sampling theory developed for, 329 statistical inference principles, 331–32 see also non-probability sampling; sampling probit model, 561 product quality, 36 professional and trade associations codes, 17–19, 78–9 existing, 18–19 implementing and enforcing, 21–2 role and road to professionalization, 22–3 description, 18 industrial associations, 24n1 professional and academic associations, 24n1 standards, 16 Program for International Student Assessment, 486 Programme for the International Assessment of Adult Competencies (PIAAC), 294, 350, 413 propensity score adjustment (PSA), 338 propensity scores, 574 propensity weighting, 38, 574–5
proportional allocation, 322, 479 proprietary data, 656 prospective cohort studies, 113 protocols, 179, 185, 614 contact, 43, 388 data collection, 170, 418 fieldwork, 391, 419 operational, 45 paper-based contact, 388 PPRL, 666 recruitment, 391 survey, 131 ‘think-aloud,’ 32 training, 184 translation, 171 psychological approach, cognitive models, 211 Psychological Corporation, 92–3 psychology, 9, 226, 275 psychometric analysis, 9 analysis procedures and implications of findings for questionnaire modification, 370 key procedures, 369–70 purpose and context, 369 resource requirements, benefits and limitations, 370 public opinion polls, 89 Public Opinion Quarterly, 8, 68, 71, 72, 93, 690 public opinions, surveys, 12 content, 72–3 methodology, 71–2 see also electoral polls The Pulse of Democracy (Gallup), 93 purposive sampling, 330 Q-Bank database, 377, 457, 458n8 qualitative research, 194, 273 quality considerations, 613–29 assessment, 136–7 case studies, 617–26 respondent task simplification, 617–19 continuous quality improvement, 126–31, 129, 132 coverage and representability, 150 data fabrication and SROs’ task simplification, 621–26, 625 data quality approach to, 614–17 interviewers, 616–17 and measurement errors, 150–1 respondents, 614–16 incentives to increase response rates, impact on quality, 433–4 interviewer task simplification, 619–20, 621 measurement errors and data quality, 150–1 whether mixed-mode strategy improves quality, 149–51 quality profiles, 137, 139 response styles, impact on data quality, 582–3 standards, 170–1
survey research organizations, 617, 621–26, 625 translation of measurement instruments, cross-cultural surveys, 270, 275 see also data quality and measurement precision; Survey Quality Predictor (SQP 2.0) program; Total Survey Quality (TSQ) quantitative research, 4, 273 quasi-experimental designs, 108–9 quasi-simplex models, 543–5 quasi-Markov simplex model, 543 three-wave, 544 question effects, 580 question response process, cognitive models, 210–11 and cognitive interviewing, 214–15 Questionnaire Design Experimental Research Survey (QDERS), 372 Questionnaire Design Research Laboratory (QDRL), 361 questionnaires architecture, 547–8 contents, 180, 219–20 designing, 218–35 for comprehension, 221–2, 227, 228–9 goals, 219 holistic design, 230 for judgment, 222–3, 227, 229 ordering of questions, 228–30 for reporting, 223–5, 228, 229–30 for retrieval, 222, 229 visual design, to facilitate responses, 225–8 development, 163–6 Internet, 258–9 interview administered vs self-administered, 147 mail, 257–8 pretesting see pretesting, questionnaires see also questions, research questions, research aural vs visual presentation, reducing differences from, 263–4 auto-advance/carousel, 153 avoiding mode-specific question structures when possible, 261–2 categories, 57–9 comprehension ordering questions for, 228–9 visual design for, 227 wording for, 221–2 creating a good question assertions, basic elements, 238–9, 240, 241 basic concepts and concepts-by-intuition, 237–41 cross-national research, 247–9 cumulative experience, using, 236–53 decisions to complete the questions, 242–3 designing theoretically valid requests for an answer, 237–42 requests for an answer, 241–2 Survey Quality Predictor (SQP 2.0) program see Survey Quality Predictor (SQP 2.0) program
description, 106 designing, 218–35 for comprehension, 221–2, 227, 228–9 holistic design, 230 for judgment, 222–3, 227, 229 ordering of questions, 228–30 for perception, 227 pretesting with response process in mind, 230–1 for reporting, 223–5, 228, 229–30 for retrieval, 222, 229 visual design, to facilitate responses, 225–8 experience of respondents, 220–1 explanation, 106–7 exploration, 106 grid questions, traditional, 153 judgment ordering of questions for, 229 visual design for, 227 wording questions for, 222–3 open- or closed-ended, 223 ordering, 228–30 for comprehension, 228–9 for judgment, 229 randomized, 119 for reporting, 229–30 for retrieval, 229 perception, visual design for, 227 reporting ordering of questions for, 229–30 visual design for, 228 wording of questions for, 223–5 retrieval ordering questions for, 229 wording questions for, 222 retrospective, 111 types, 105–7 wording, 226 bias, 30 and categories, 57–9 for comprehension, 221–2 for judgment, 222–3 reducing of differences, 262–3 for reporting, 223–5 for retrieval, 222 see also questionnaires; survey questions; survey questions, harmonizing quota sampling, 96, 314, 330 Radvanyi, L., 98 Rae, S. F., 93 railroads, standard gauge for, 24n3 raking ratio estimation, 338, 464, 572, 573–4 Random Digit Dialling (RDD), 60, 124, 143, 160, 162, 375 random errors, 31 random probes, 47 random sampling, 312 election of 1948, 96
simple random sampling (SRS), 331, 332, 478 simple random sampling without replacement, 319 stratified, 98 telephones, 98–9 see also sampling Randomized Response Technique (RRT), 224 rank swapping, 496 RDD see random digit dialling (RDD) reciprocation principle, 427 record linkage, 117, 662–9 applications, 602–3 blocking, 665 centers, 667 classification, 665 clerical edit, 665–6 with encrypted identifiers, 663–4 future developments, 667 preprocessing, 664–5 privacy preserving record linkage (PPRL), 666 required effort, 667 respondents’ permission for, 667 similarity methods, 665 software, 666–7 steps in process, 664–6 using a trusted third party, 663 using in practice, 666–7 with and without personal identification numbers, 663–4 Redline, C., 221 referendum results, 74 reflective indicators, 196 refusal conversion, 416 regression analysis, 42, 110, 464, 518, 582, 589, 601, 602 generalized regression estimation, 464, 572–3 reliability assessing, 516–17 designs for studying, 542 internal consistency reliability (ICR), 535 measurement, 527 and validity, 531–32 replication, 484, 657 reporting ordering of questions for, 229–30 visual design for, 228 wording of questions for, 223–5 representative democracy, 64 representative indicators for response styles (RIRS), 589–90 representative indicators response styles means and covariance structure (RIRMACS), 589–90 research, survey cross-national research, good questions for, 247–9 designs, 107–14 experimental and non-experimental, 108–11 general aspects, 114 temporal dimension, 111–14 units of analysis and observation, 107–8
future directions, 64–5 panel research, inviting and selecting respondents for, 145 paradata, 39n1, 116–17 recent developments and examples, 114–19 record linkage, 117 research questions, types, 105–7 description, 106 explanation, 106–7, 108 exploration, 106 research traditions and experience, 161–2, 166, 170 survey experiments, 115–16 use of, 63–4 see also design of surveys; methodology, survey; surveys research data life cycle, 444 research imperialism, 508 research transparency, and data access, 651–52 residual risk, 135, 136 respectfulness, 82–3 respondents data protection, 81–2 data quality, 614–16 incentives, 385–6 inviting and selecting for panel research, 145 questions, experience of, 220–1 reminding to complete a survey, 145–6 respondent debriefing, 372–3 respondent task simplification, 617–19 rights, 12 selection issues, 30, 161 task simplification, 617–19 telephone calls to screen, 145 response bias, 583 response effects, 580 response fetishism, 413 response phase mode change, 146–7 response probability, 559 response process, and pretesting, 230–1 response propensity model, 404 response propensity stratification, 575 response propensity weighting, 574 response rates, 69, 149–50 calculating, 410–11 codes, implementing and enforcing, 21 incentives as possible measure to increase, 429–38 minimum response rate, 411 reasons for failure to answer, 428–9 reasons for participation, 426–8 survey climate, 70–1 see also nonresponse response scales design of, 280–81 translation of, 281–3 response set, 583 response styles, 579–96 causes, research findings relating to, 585–7
common, 581 definitions, 580–3 changing, over time, 583–5 designing questionnaires to reduce error risk, 591–92 detecting and measuring, 589–91 explaining, 583–9 impact on data quality, 582–3 mitigating impact of, 589–93 questionnaire development, 164–5 theoretical framework for understanding causes, 587–9 translation of measurement instruments, cross-cultural surveys, 280 responsive design (RD), 131–2, 397–408 CAPI surveys, 399–403 CATI surveys, 403–6 and nonresponse, 417 paradata, 398 planning a responsive design survey, 398–9 retrieval ordering questions for, 229 wording questions for, 222 retrospective cohort studies, 113 retrospective debriefing, 364–5 Rivers, D., 335 RL see record linkage Robinson, Claude, 96 Rockefeller, J. D., 99n1 Rockefeller, N., 96, 97, 98 Rockefeller Foundation, 93, 98 Rogers, E., 88 Rokkan, S., 7 rolling cross-section design, 112 Roosevelt, E., 99 Roosevelt, F. D., 90, 91, 93 Roosevelt, Theodore, 88 root mean square error of approximation (RMSEA), 637 Roper, E., 91, 92, 93, 95 Roper Center for Public Opinion Research iPOLL Databank, 220 Rorer, L. G., 583, 584, 586 Rose, D., 301 Rosenbaum, P., 108, 561 Ross, D. A., 168 Rossi, P. H., 115 rotating panel designs, 113 Rothgeb, J. M., 359 Royall, R. M., 315 Rubin, D. B., 561, 570, 605 Rundqvist, E. A., 584 Rusciano, F. L., 689 Russel, R. C., 664 Russia, 290, 431, 680, 685 and data quality, 620, 621, 624, 626 Ryan, J. M., 214 Sakshaug, J. W., 417, 667 Salama, P., 180
sample matching, 335 sample surveys, 91 sampling balanced, 322–3 basic concepts, 311–15 cluster, 96, 323–4, 481 comparative surveys, special challenges best sampling designs, 350–1 estimation, 353–4 future developments, 354 history and examples, 346–9 main requirements, 349–54 prediction of effective sample size and design effects, 351–3 unique definition of target population and suitable sampling frames, 349–50 conditional Poisson, 321 convenience, 330 definitions, 310 deterministic samples, 313–14 disclosure control, 493 error, 27 estimation, 316–18 fixed size design, 318 frame choice, 162 indirect, 461, 468–9 non-probability see non-probability sampling Poisson, 320–21 population, 309–10 population mean, 477 probability see probability sampling purposive, 330 quota, 96, 314, 330 random see random sampling representativeness of sample, 314–15 sample design, 159–62 approximations in, 332–4 choice, 324–5 economic conditions and infrastructure, 160 main designs and schemes, 318–24 physical environment, 160–1 political context, 160 principles, 331 research traditions and experience, 161–2 social and cultural context, 159–60 sampling error see sampling errors statistical, 315–16 stratification see stratification/stratified sampling systematic, 320 two-phase, 324 unequal probability sampling designs with fixed sample size, 319–20 uniqueness, 492 volunteer, 330 see also sampling frames; sampling weights sampling errors, 43, 318 reducing, 33
and Total Survey Error, 27, 28, 29, 30 see also error; sampling sampling frames, 6, 60, 61, 92, 123, 257, 331, 400, 528 and basics of sampling for survey research, 312, 314, 319, 323 choice, 162 comparative survey research, 333, 351, 353 definitions, 349–50 fieldwork, 391, 392, 394 multicultural/multinational contexts, surveying in, 159, 160, 161, 169 nonresponse, 412, 413, 420 survey modes, 142, 143, 145, 146 weighting, 461, 463, 466, 468, 470, 471 sampling schemes, 312 sampling weights, 460 Saris, W. E., 194, 195, 200, 236, 239, 244, 245, 376, 636, 637–8 Särndal, C.-E., 463, 465, 467, 473 satisficing, 32, 587, 588, 615 Satorra, A., 244 Saudi National Health and Stress Survey (SNHS), 168 Sautory, O., 465 scalar invariance, 633, 634, 637, 639 Schaeffer, N. C., 230–1, 367, 415, 418 Scheuch, E. K., 49 Schneider, S. L., 517, 518 Schober, M., 213 Schouten, B., 132–3, 564, 568, 569, 588 Schuman, H., 47, 546 Schwarz, N., 163, 164, 211, 213 Scientific Institute of Mexican Public Opinion, 98 Scott Long, J., 457 Seber, G. A., 469 secondary analysis, 657–8 selection bias, 32, 71, 334 self, concept of, 48 self-selection, in web surveys, 330 self-administered surveys, 60–1 self-administration (SAQ), 360, 375 sensitive topics, 31 sequential cognitive interviewing, 179–80 sequential mixed-mode designs, 146, 150 Sheatsley, P.B., 230 Shishido, K., 281–2 Shlomo, N., 564, 565, 569 Shmueli, G., 37 short message service (SMS), 182 Si, S.X., 280 Silver, Nate, 38 Simmons, E., 386 Simon, H. A., 362 simple random sampling (SRS), 331, 332, 478 simple random sampling without replacement, 319 Simpson, O.J., 99 simulated dashboard, 129 simulation studies, 641, 642
Sinclair, Sir John, 88 Singer, E., 409, 430, 434, 435 single-frame approach, weighting, 471–2 unit-multiplicity approach, 472–3 single-mode studies, 142 why declining, 256–9 see also mixed-mode designs Sirken, M. G., 470, 472 Six Country Immigrant Integration Comparative Survey (SCIICS), 672 Six Sigma, 134 Sjoberg, G., 71 Skinner, C. J., 483 Sledgehammer (DDI lifecycle tool), 446 Smith, T. W., 21, 163, 509, 689 social and cultural context data collection, 166–8 questionnaire development, 163–6 sample design, 159–60 social desirability bias, 167–8 social exchange theory, 414, 415 social network studies, 114 Social Research Associations (SRAs), 18 Social Science Research Council, US, 95 Social Science Variables Database (SSVD), 454–5 social-exchange theory, 427 socially desirable responses (SDRs), 615–16 society, and surveys, 57–66 analysis of survey data, 62–3 data archives, availability, 62 data collection, 59–61 demographics, measuring, 57–9 ethnicity measurement, 58 general societal characteristics, 73–4 marital status, 58 opinions and behaviours, measuring, 59 questions, wording and categories, 57–9 sex and age, 58 social class, 58 social survey, invention of, 88 statistics choice, impact on society, 61–3 Society of Automobile Engineers, 23 socio-cultural approach, cognitive models, 212–14 socio-economic status (SES), 110 sociology, 9 software, record linkage, 666–7 sophisticated falsificationism, 197 Sorensen, N., 164 Soundex (indexing system), 664 South Africa, 622, 623, 624, 625, 627, 655 split panel designs, 113 SQP 2.0 program see Survey Quality Predictor (SQP 2.0) program SROs see survey research organizations (SROs) Standard Eurobarometer, 347 standardization/standards, 16–26 codes of professional and trade associations, 17–19
disclosure standards, 17 documentation tools, standards-based, 446–7 ethical standards, 17 formats, 59 and globalization of surveys, 688 international collaborations, 20 international standardization of surveys, 118 multicultural/multinational contexts, surveying in, 160, 170–1 organizations, 20 outcome or performance standards, 17 procedural standards, 17 questionnaire pretesting, 376 technical and definition standards, 17 types of standards, 16–17 standardized root mean square residual (SRMR), 637 Standardized World Income Inequality Database (SWIID), 675 Stanley, J., 28 Stanton, F., 94 statistical groups, 18 statistical inference, sampling, 315–6 approximations in, 334–5 principles, 331–2 statistical schools, 38 statistical significance, 16, 38 statistical software, 484–5 statistics, 9, 61–3 Statistics Netherlands, 145, 419 Statistics Sweden, 68, 135 StatTransfer (DDI lifecycle tool), 446 Stearns, P. N., 688–9 Steenkamp, J.- B. E. M., 581, 586, 615 Stoezel, J., 10 Stoop, I., 4, 5, 412, 429, 434 Storms, V., 72–3, 75n3 Stouffer, S., 94 Strack, F., 213 stratification/stratified sampling, 29, 321–2, 479–81 cluster sampling, 482 response propensity stratification, 575 stratified multi-stage cluster sampling, 482 stratified random sampling, 98 straw polls, 6, 89, 90, 92 structural equation modelling, 198, 582 Study Monitoring Report, 622 subjective variables, 239 subpopulation analysis, 485 Subramanian, S. V., 676 Subscriber Identity Module (SIM) cards, 182 substitution, 411 Sudman, S., 548 Sundgren, B., 489 Superior Council of Normalization, Togo, 16 survey administration errors, 30–1 survey climate, 11 background, 68–9
defining and assessing, 67–76 definition of ‘climate,’ 67 general societal characteristics, 73–4 measurement, 69–74 public opinions, 71–3 willingness to participate, 69–71 Survey Documentation and Analysis (SDA), 446 Survey Errors and Survey Costs (Groves), 29, 32 survey life cycle, 3, 445t Survey Life Cycle Diagram, 444 Survey of Health, Aging and Retirement in Europe (SHARE), 7, 11, 20, 350, 506 Survey of Income and Program Participation, US, 118 Survey Quality Predictor (SQP 2.0) program, 237, 250, 457 background, 243 coding, 245 illustration, 246–7 knowledge upon which based, 243–5 limits of scope, 245–6 new questions, creating, 245 predictions, making, 245, 246–7 see also questions, research survey questions attributes, 548–51 content, 546–7 harmonization see survey questions, harmonizing survey questions, harmonizing, 502–24 assessing completeness and comparative validity of output-harmonized measures, 517–18 assessing harmony of harmonized measures, 515–19 assessing process quality, 515–16 assessing reliability, 516–17 comparability of meaning of multi-item measures, 518–19 cross-cultural survey, 504–5 ex-ante input harmonization, 504 ex-ante output harmonization, 506 design of ex-ante output harmonized target variables and questionnaire items, 510–12 ex-post harmonization, 507 deriving ex-post output harmonized measures from existing data, 512–14, 515 harmonization approaches, 503–7 input harmonization, 504, 506, 507–9 output harmonization, 506, 510–14 overview of approaches, 504 Survey Research Methods (SRM), 8 survey research organizations (SROs), 617 task simplification, and data fabrication, 621–26, 625 surveying, 4, 6 surveys announcing, 384–5 climate see survey climate comparative research see comparative survey research complex, 324 constraints, 29
cross-cultural see translation of measurement instruments, cross-cultural surveys definitions, 4–5 and democracy, 12 design see design of research/survey error see error; Total Survey Error (TSE) evaluation of surveys/survey products, 134–9 face-to-face, 60, 81, 142, 143, 186n1, 256, 432 globalization see globalization of surveys harmonization of questions, 502–24 historical perspective, 6–8 implementation, 125–34 mail-in, 330 methodology see methodology, survey mixed-mode see mixed-mode survey nonresponse in, 559–61 panel, 432 participation in see participation in surveys public opinions about, 71–3 respondents see respondents response modes, achieving common measurement across, 260–64 and society see society, and surveys surveys on, 71–3 telephone see telephone interviews/surveys use/usefulness, 11–12 web see web surveys Swedish Data Act, 68 Swiss Household Panel (SHP), 7 Swiss National Science Foundation, 435 Switzerland, 159, 350, 437, 631 and data quality, 620, 621t, 626 and response rates, 430, 431, 432, 434 and sampling, 312, 324 and survey climate, 73, 74 and translation of measurement instruments, 274, 291, 303n4 systematic errors, 31 systematic sampling, 320 system-level data, 671 tailored design method, 124–5, 416 target population definition, 311–12 ethical considerations for, 184–5 weighting, 468, 469 see also population Tarnai, J., 144 task simplification interviewer, 619–20, 621 respondents, 617–19 Taylor-series linearization method, 484, 485 team-based review, 272 Technical Committee 225 (TC225), 20 telephone interviews/surveys, 147, 256–7, 432 computer-assisted, 99, 126 Random Digit Dialling (RDD), 60, 124, 143, 160, 162, 375
random sampling, 60, 98–9 survey modes, 142–3, 145 see also mobile phones tele-voting (SMS voting), 330 temporal dimension, research design, 111–14 theoretical validity, 198 Thevenot, L., 301 Thiessen, V., 587–8, 589, 590, 591, 613 think-aloud protocols, 32 Thomas, T. D., 581, 588, 589 Thomas, W., 88 Thompson, D. J., 315 Thompson, J. W., 91, 97, 560 Thompson, S. K., 469 Thrasher, J. F., 363, 374 Thurstone, L. L., 633 TIER (project), 457 Tillé, Y., 319, 323, 325 Todorov, A., 229 Total Survey Error (TSE) paradigm, 3, 9, 39, 438 ASPIRE approach, 36, 135, 135–7, 139 assessing quality, 136–7 continuous quality improvement, 126–31, 129, 132 designing surveys from perspective of, 33–5 development of approach, 28–33 future of, 37–9 implications for survey design, 123–5 implications for survey implementation, 125–34 merging with Total Survey Quality, 35–7 as paradigm for survey methodology, 27–40 quality profiles, 137, 139 responsive design, 131–2 and sampling error, 27, 28, 29, 30 theory and practice, 122–41 Total Survey Quality (TSQ), 27, 28, 39, 375 future of, 37–9 merging with Total Survey Error, 35–7 questionnaire pretesting, 375–6 Tourangeau, R., 32, 147, 151, 214, 228, 372, 587 training data, 657–8 of interviewers, 386–7 protocols, 184 translatability assessment, 508 translation of measurement instruments, cross-cultural surveys, 269–87 adaptation in a generic sense, 275–6 in a specific sense, 276–8 vs translation, 275–8 advance translation, 508 in Asia, 278–83 attitude and opinion items in translation, 278–83 back translation, 272–3 conceptual equivalence, harmonization in view of, 278–300 empirical assessment of a translation, 273–4 future developments, 283–4
good questionnaire design as precondition for translation quality, 270 harmonization, 274–5, 278–300 production of a translation, 271–2 quality in hands of translation commissioner, 275 response scales, 281–3 response styles, 280 survey target regions, harmonization between, 279 translation and translation assessment, 270–75 translation documentation, 274 translation vs adaptation, 275–8 Transparency Initiative (AAPOR), 17 Transparency International, Corruption Perceptions Index, 675 TRAPD team translation model, 163, 212 Treiman, D.J., 300 trend designs, 112 Triandis, H.C., 280 True Score model, 239 Truman, H., 95 turmoil, societies in, 178–89 adaptivity and flexibility, 179–80 definitions, 178–9 documentation, 185 ethical considerations, 184–5 local partnership and community engagement approach, 182–3 mixed method approach, 180–1 neutrality focused approach, 184 researchers, ethical considerations for, 185 survey design and implementation principles, 179–84 target population, ethical considerations for, 184–5 technological approach, 181–2 two-phase sampling, 324 Uganda, 180 undercoverage, sampling, 312 unequal probability sampling designs with fixed sample size, 319–20 UNESCO Institute for Statistics, 167 unified (uni) mode design, 152–3 uniqueness, 492, 538 unit nonresponse see nonresponse United States Census Bureau see Census Bureau, US Chinese immigrants, 167 and globalization of surveys, 681, 685 household panels/surveys, 259, 266 polls, 90, 93 Postal Service, 258 question banks, 220 survey methodology, 10 translation of measurement instruments, 283 visits to, 230 see also American Association for Public Opinion Research (AAPOR); specific institutions and surveys units of analysis and observation, 107–8
University of Michigan, Summer Institute in Survey Research Techniques, 158 university survey-research programs, 24n2 unmanned aerial vehicles (UAVs), 182 usability testing analysis procedures and implications of findings, 365 key procedures, 364–5 purpose and context, 364 resource requirements, benefits, limitations and practical issues, 366 Uskul, A.K., 163 validity comparative, 518 concept, 532 conceptual and measurement, 194–200, 205, 206 construct, 197 content, 199 internal and external, 28, 111 measured construct, validity assessment, 202–5 theoretical, 198 Valliant, R., 315, 338, 483, 484 Van de Schoot, R., 641 Van de Vijver, F., 172, 276 Van de Voort, P. J., 239 Van Vaerenbergh, Y., 581, 588, 589 Vanneman, R., 672 variance estimation, 484 Vartivarian, S., 466 Vaughan, J.P., 168 Vehovar, V., 333, 341 verbal communication, 148 verbal probing, 362 Verrijn, Coenraad Alexander, 312 voice-over-internet-protocol (VOIP), 171 volunteer sampling, 330 Wachsler, R.A., 230 WageIndicator survey, 332–3 Wagner, J., 401, 568 Waksberg, J., 60, 98–9 WAPOR see World Association for Public Opinion Research (WAPOR) war work, survey use, 94 Warner, U., 292 Warwick, D.P., 49 web surveys, 61, 143, 330, 432 paper advance letters in, 144–5 usability testing, 364, 365 web technology, use of, 450–54 Weber, Max, 290 weighting, 460–76 adjustment with a reference survey, 574 adjustments, 482–3, 570–81 base weights, 338
calibration, 338, 463–5 computing weights from combined sources, 470–3 context, 461–2 design, 462–3, 482 domain-membership approach, 471 nonresponse weighting adjustment, 465–8 propensity weighting, 38, 574–5 single-frame approach, 471–2 standard methods, 462–8 use of weights in analysis, 483 Weisberg, H.F., 29 Wellens, T.R., 212 West, B. T., 484 Wikipedia, 458 Wilensky, Harold L., 22 Williams, D., 417 willingness to participate, 69–71 Willis, G., 211, 215, 362, 374, 377 Willke, W., 94 Wilmot, A., 386 Wilson, E., 93 Wimmer, A., 289, 290 Winkler, W. E., 663 Wittmann, W. W., 637 Wolf, C., 505, 688 Worcester, R., 690 World Association for Public Opinion Research (WAPOR), 8, 17, 18, 19, 20, 22, 79, 94, 685 Guide to Opinion Polls, 686 Standard Definitions, 24 see also American Association for Public Opinion Research (AAPOR) World Bank, World Development Indicators, 673–4 World Fertility Survey, 162, 682 World Health Survey (WHS), 20 World Mental Health Survey, 168 world opinion concept, 688–9 World Values Survey (WVS), 7, 20, 62, 280, 282, 630, 672 Wright, E.O., 300, 301 Wyer, R.S., 164 Xu, X., 470 Yang, Y., 165 Yates, F., 315 Ye, C., 435 YouGovPolimetrix, 339 Young, J. W., 93 Zeisel, H., 88 Ziefle, A., 118 Zizek, F., 312 Znanieki, F., 88