
RESEARCH HANDBOOK ON DIGITAL SOCIOLOGY

RESEARCH HANDBOOKS IN SOCIOLOGY

Series Editor: Hans-Peter Blossfeld, Professor of Sociology, University of Bamberg, Germany

The Research Handbooks in Sociology series provides an up-to-date overview of frontier developments in current sociological research fields. The series takes a theoretical, methodological and comparative perspective on the study of social phenomena. This includes different analytical approaches, competing theoretical views and methodological innovations leading to new insights in relevant sociological research areas. Each Research Handbook in this series provides timely, influential works of lasting significance.

These volumes are edited by one or more outstanding academics with a high international reputation in the respective research field, under the overall guidance of series editor Hans-Peter Blossfeld, Professor of Sociology at the University of Bamberg. The Research Handbooks feature a wide range of original contributions by well-known authors, carefully selected to ensure thorough coverage of current research. The Research Handbooks serve as vital reference guides for undergraduate students, doctoral students, postdoctoral researchers and research practitioners in sociology, aiming to expand current debates and to discern the likely research agendas of the future.

For a full list of Edward Elgar published titles, including the titles in this series, visit our website at www.e-elgar.com.

Research Handbook on Digital Sociology

Edited by

Jan Skopek
Associate Professor in Sociology, Department of Sociology, Trinity College Dublin, Ireland

RESEARCH HANDBOOKS IN SOCIOLOGY

Cheltenham, UK • Northampton, MA, USA

© Jan Skopek 2023

With the exception of any material published open access under a Creative Commons licence (see www.elgaronline.com), all rights are reserved and no part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical or photocopying, recording, or otherwise without the prior permission of the publisher.

Chapters 6 and 25 are available for free as Open Access from the individual product page at www.elgaronline.com under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (https://creativecommons.org/licenses/by-nc-nd/4.0/) license.

Published by
Edward Elgar Publishing Limited
The Lypiatts
15 Lansdown Road
Cheltenham
Glos GL50 2JA
UK

Edward Elgar Publishing, Inc.
William Pratt House
9 Dewey Court
Northampton
Massachusetts 01060
USA

A catalogue record for this book is available from the British Library.

Library of Congress Control Number: 2023930151

This book is available electronically in the Sociology, Social Policy and Education subject collection: http://dx.doi.org/10.4337/9781789906769

ISBN 978 1 78990 675 2 (cased) ISBN 978 1 78990 676 9 (eBook)


Contents

List of contributors
List of abbreviations

PART I  INTRODUCTION

1 Introduction and overview to the Research Handbook on Digital Sociology (Jan Skopek)
2 Social theory and the internet in everyday life (Pu Yan)

PART II  RESEARCHING THE DIGITAL SOCIETY

3 Digital and computational demography (Ridhi Kashyap and R. Gordon Rinderknecht, with Aliakbar Akbaritabar, Diego Alburez-Gutierrez, Sofia Gil-Clavel, André Grow, Jisu Kim, Douglas R. Leasure, Sophie Lohmann, Daniela V. Negraia, Daniela Perrotta, Francesco Rampazzo, Chia-Jung Tsai, Mark D. Verhagen, Emilio Zagheni, and Xinyi Zhao)
4 Digital technologies and the future of social surveys (Marcel Das and Tom Emery)
5 Mobile devices and the collection of social research data (Bella Struminskaya and Florian Keusch)
6 Unlocking big data: at the crossroads of computer science and the social sciences (Oliver Posegga)
7 Regression and machine learning (Lukas Erhard and Raphael Heiberger)
8 Investigating social phenomena with agent-based models (Pablo Lucas and Thomas Feliciani)
9 Inclusive digital focus groups: lessons from working with citizens with limited digital literacies (Elinor Carmi, Eleanor Lockley, and Simeon Yates)

PART III  ANALYSING DIGITAL LIVES AND ONLINE INTERACTION

10 Social networking site use in professional contexts (Christine Anderl, Lea Baumann, and Sonja Utz)
11 Online dating and relationship formation (Maureen Coyle and Cassandra Alexopoulos)
12 Studying mate choice using digital trace data from online dating (Jan Skopek)
13 Testing sociological theories with digital trace data from online markets (Wojtek Przepiorka)
14 Using YouTube data for social science research (Johannes Breuer, Julian Kohne, and M. Rohangis Mohseni)
15 Automated image analysis for studying online behaviour (Carsten Schwemmer, Saïd Unger, and Raphael Heiberger)

PART IV  DIGITAL PARTICIPATION AND INEQUALITY

16 Social disparities in adolescents’ educational ICT use at home: how digital and educational inequalities interact (Birgit Becker)
17 The early roots of the digital divide: socioeconomic inequality in children’s ICT literacy from primary to secondary schooling (Giampiero Passaretta and Carlos J. Gil-Hernández)
18 Digital inequalities and adolescent mental health: the role of socioeconomic background, gender, and national context (Pablo Gracia, Melissa Bohnert, and Seyma Celik)
19 The gender gap in digital skills in cross-national perspective (José-Luis Martínez-Cantos)

PART V  CONSEQUENCES OF DIGITAL TECHNOLOGICAL CHANGE

20 Doing family in the digital age (Claudia Zerle-Elsäßer, Alexandra N. Langmeyer, Thorsten Naab, and Stephan Heuberger)
21 The mental health cost of swiping: is dating app use linked to greater stress and depressive symptoms? (Gina Potarca and Julia Sauter)
22 Social media and well-being at work, at home, and in-between: a review (Julius Klingelhoefer and Adrian Meier)
23 The digital transition of the economy and its consequences for the labour market (Werner Eichhorst and Gemma Scalise)
24 Further training in the context of the digital transformation (Thomas Kruppe and Julia Lang)
25 Digital campaigning: how digital media change the work of parties and campaign organizations and impact elections (Andreas Jungherr)

Index

Contributors

Aliakbar Akbaritabar is a postdoctoral research scientist and the current chair of the research area on migration and mobility at the Max Planck Institute for Demographic Research, Rostock, Germany, in the Laboratory of Digital and Computational Demography. With a background in computational social sciences and the sociology of science, his research interests concern social networks, scholarly mobility, and networks of scientific collaboration.

Diego Alburez-Gutierrez is Head of the Independent Research Group on Kinship Inequalities at the Max Planck Institute for Demographic Research, Rostock, Germany. His research interests lie at the intersection of demography, kinship, and socioeconomic inequality. Diego’s work has used mathematical modelling, micro-simulation techniques, and empirical analysis to study intergenerational processes in demography.

Cassandra Alexopoulos is Associate Professor of Communication at the University of Massachusetts, Boston. Her research examines the factors that influence romantic and sexual decision-making. She uses a variety of methods to understand how media may influence people’s romantic and sexual behaviours, including lab experiments, surveys, and content analyses.

Christine Anderl is a post-doctoral scientist at the Everyday Media Lab of the Leibniz Institut für Wissensmedien, Tübingen, Germany. In a project funded by the German Research Foundation she investigates the benefits of professional social media use. Her research interests also include the relationship between digital technologies and health and well-being.

Lea Baumann was a PhD candidate at the Leibniz Institut für Wissensmedien, Tübingen, Germany. Her research focused on professional social media use and the recommendation of business contacts on professional social media.

Birgit Becker is Professor of Sociology with a focus on empirical educational research at the Faculty of Social Sciences, Goethe University Frankfurt, Germany. Her research interests include educational inequality, early childhood, migration and integration, and digital inequality. Her work has been published in journals such as American Educational Research Journal, European Sociological Review, and Journal of Children and Media.

Melissa Bohnert is a PhD candidate in sociology at Trinity College Dublin, Ireland. Her research interests include digital use, child and adolescent development, and social inequality. Her PhD project uses quantitative longitudinal data to analyse how digital technologies and digital inequalities impact the well-being outcomes of children in Ireland.

Johannes Breuer is a senior researcher in the team Survey Data Augmentation at GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany. Furthermore, he co-leads the team Research Data and Methods at the Center for Advanced Internet Studies, Bochum, Germany. His research concerns the use and effects of digital media, computational methods, and open science.

Elinor Carmi is Lecturer in Data Politics and Social Justice at the Sociology & Criminology Department at City University, London, UK. She specialises in data politics, data literacies, and feminist data. Carmi has been co-investigator on various research projects, such as ‘Developing a Minimum Digital Living Standard’, funded by the Nuffield Foundation. Her work contributes to emerging debates in academia, policy, health organisations, and digital activism.

Seyma Celik is a PhD candidate in sociology at Trinity College Dublin, Ireland. She has a background in developmental and social psychology. Her research focuses on adolescents’ digital engagement, parental mediation, and mental health outcomes, considering socioeconomic and gender inequalities in both micro and macro contexts.

Maureen Coyle is Assistant Professor in Psychology at Widener University, USA. Her primary area of research involves the relationship between computer-mediated communication and person perception in interpersonal and romantic relationships. She implements both quantitative and qualitative methods to assess how individuals reduce and respond to ambiguity in text-based interactions.

Marcel Das is Director of Centerdata, a non-profit research institute located at Tilburg University, the Netherlands. Furthermore, he is Professor of Econometrics and Data Collection at Tilburg University. He has led numerous national and international research projects and published in international journals on statistical and empirical analyses of survey data and methodological issues in web-based (panel) surveys.

Werner Eichhorst is a team leader for research and coordinator for labour market and social policy in Europe at the Institute of Labor Economics in Bonn, Germany. He is also Honorary Professor at Bremen University. His main research interests relate to the comparative analysis of labour market institutions and performance as well as the political economy of labour market reform strategies.

Tom Emery is Associate Professor in the Department of Public Administration and Sociology at Erasmus University Rotterdam, the Netherlands. He is also Deputy Director of the Dutch National Infrastructure for Social Science. Previously, he was Deputy Director of the Generations and Gender Programme at NIDI in The Hague. His research relates to comparative survey methodology and policy measurement in multilevel contexts.

Lukas Erhard is an academic staff member at the Department of Computational Social Science, Institute of Social Science, University of Stuttgart, Germany. His research interests relate to studying the European migration discourse through quantitative discourse analysis. His methodological focus is on natural language processing, regression methods, and social network analysis. His work has been published in European Sociological Review and Scientometrics.

Thomas Feliciani is a PhD candidate at the University of Groningen and a research assistant at University College Dublin. He does research in computational social science, studying ethnic residential segregation and its impact on opinion polarisation, as well as academic peer review and science evaluation.

Sofia Gil-Clavel is a PhD candidate at the Laboratory of Digital and Computational Demography of the Max Planck Institute for Demographic Research, Rostock, Germany, affiliated with the University of Groningen, the Netherlands. In her work, Sofia uses data from Facebook and Twitter to study older people’s usage of communication technologies and migrants’ cultural integration, respectively. Previously she studied actuarial and computer science.

Carlos J. Gil-Hernández is a social scientist at the European Commission’s Joint Research Centre, Centre for Advanced Studies, in the project ‘Social Classes in the Digital Age’. He is a quantitative sociologist with interdisciplinary interests in social stratification and social policy. His publications have featured in journals such as Sociology of Education, European Sociological Review, and Research in Social Stratification and Mobility.

Pablo Gracia is Assistant Professor of Sociology at Trinity College Dublin, Ireland. His research lies at the intersection of families and inequalities. His ongoing research focuses on family relations and social inequalities, child and adolescent well-being, parents’ mental health, gender gaps in time use, and digital divides. He has published in journals such as European Sociological Review, Journal of Marriage and Family, and New Media & Society.

André Grow was a research scientist at the Laboratory of Digital and Computational Demography of the Max Planck Institute for Demographic Research, Rostock, Germany. His work focuses on agent-based computational modelling, family sociology, social stratification, and digital demography. His research has been published in journals such as Population Studies, European Journal of Population, and Social Psychology Quarterly.

Raphael Heiberger is Assistant Professor for Computational Social Science at the University of Stuttgart, Germany. He utilises methods from natural language processing, social network analysis, and machine learning to answer questions on financial markets, political discourse, and scientific careers. His work is published in journals such as American Sociological Review, European Sociological Review, Social Networks, and Physica A.

Stephan Heuberger was a research assistant at the Department of Family and Family Policy of the German Youth Institute. He was a member of the research division ‘Life Situations and Living Arrangements of Families’. His research focus relates to media use in the family.

Andreas Jungherr holds the Chair of Political Science and the Governance of Complex and Innovative Technological Systems at the Institute for Political Science at the Otto-Friedrich-University Bamberg, Germany. His research concerns the effects of digitalisation on politics and society as well as the challenges and opportunities of using big data, artificial intelligence, and other computer-based methods in social science.

Ridhi Kashyap is Professor of Demography and Computational Social Science at the University of Oxford and Professorial Fellow of Nuffield College, UK. She co-leads the strand on digital and computational demography at Oxford’s Leverhulme Centre for Demographic Science. Her research interests span topics such as mortality and population health, gender inequality, family, migration, and the impacts of technological change on population and human development processes.

Florian Keusch is Professor of Social Data Science and Methodology in the Department of Sociology at the University of Mannheim, Germany, and Adjunct Research Professor at the University of Maryland, USA. His research concerns modern methods of collecting data for the behavioural and social sciences, in particular mobile web surveys and passive data collection through smartphones, wearables, or digital traces.

Jisu Kim holds a PhD in data science from Scuola Normale Superiore in Italy and is currently a research scientist at the Max Planck Institute for Demographic Research, Rostock, Germany. She has been working on establishing novel methods to improve statistics on international migration using social media data, at the intersection of migration sciences, social networks, and data-driven algorithms.

Julius Klingelhoefer is a research assistant and doctoral student at the Assistant Professorship for Communication Science located at the School of Business, Economics and Society of the Friedrich-Alexander-University Erlangen-Nuremberg, Germany. His research interests relate to the effects of social media on mental health, digital well-being at work, and digital disconnection.

Julian Kohne is a scientific staff member at GESIS – Leibniz Institute for the Social Sciences, Cologne, and a PhD student at Ulm University, Germany. His academic interests include interpersonal relationships, group dynamics, social networks, and text as data. His PhD project investigates how social relationships are expressed through different communication patterns, using donated WhatsApp chat logs.

Thomas Kruppe is a member of the research department ‘Active Labour Market Policies and Integration’ at the Institute for Employment Research, Research Institute of the Federal Employment Agency in Nuremberg, Germany. He is also a fellow of the Labor and Socio-Economic Research Center and an outside lecturer at the University of Erlangen-Nuremberg. His research interests concern labour market policy, especially further training.

Julia Lang is Head of the working group ‘Further Training’ and a member of the research department ‘Active Labour Market Policies and Integration’ at the Institute for Employment Research, Research Institute of the Federal Employment Agency in Nuremberg, Germany. Her research interests concern active labour market policy and further training.

Alexandra N. Langmeyer is Head of the research division ‘Living Situations and Developmental Environments of Children’ of the Department of Children and Child Care at the German Youth Institute. Her research interests include child well-being, the life situations of children, the pandemic’s impact on children’s lives, family socialisation, and growing up in the digital world.

Douglas R. Leasure is Senior Researcher and Data Scientist at the Leverhulme Centre for Demographic Science at the University of Oxford, specialising in developing methods that use geospatial and digital trace data for population estimation in data-sparse settings. He held post-doctoral research positions at WorldPop, University of Southampton, and the Odum School of Ecology, University of Georgia, after completing a PhD in biology at the University of Arkansas.

Eleanor Lockley is a research fellow at the Culture and Creativity Research Institute at Sheffield Hallam University, UK. She works across and within sociology, media, human–computer interaction, and cultural studies. Her research focuses on digital media and the sociology of communication as well as aspects of digital inclusion and digital and media literacy.

Sophie Lohmann is a research scientist in the Laboratory of Digital and Computational Demography at the Max Planck Institute for Demographic Research, Rostock, Germany. With a background in social psychology, her research interests centre around the intersections of behaviour change, health behaviour, and sustainability behaviour, and how digital data sources can enable insights into these topics.

Pablo Lucas is Tenured Assistant Professor and Geary Fellow at University College Dublin, Ireland. His research is focused on two streams: (1) topics related to Latin American human rights, particularly Operation Condor, and (2) computational social science, particularly using agent-based models.

José-Luis Martínez-Cantos is Associate Professor of Applied Economics at Complutense University of Madrid and a member of the Institute of Sociology for the Study of Contemporary Social Transformations (TRANSOC). He has an interdisciplinary background mixing sociology, economics, social psychology, and gender studies. His main research interests relate to gender, information and communication technology, education, and the labour market.

Adrian Meier is Assistant Professor for Communication Science at the Friedrich-Alexander-Universität Erlangen-Nuremberg, Germany. Previously, he held positions as Assistant Professor at the University of Amsterdam and Research Associate at the Johannes Gutenberg University of Mainz. His research investigates the positive and negative consequences of digital media and communication for users’ well-being, health, and self-regulation.

M. Rohangis Mohseni is a post-doctoral researcher at the Research Group of Media Psychology and Media Design, Technical University Ilmenau, Germany. His research concerns (im)moral behaviour, such as aggression, helping, and moral courage, in digital media (e.g., computer games and social media). His habilitation project investigates sexist online hate speech.

Thorsten Naab is a research consultant at the Department of Children and Child Care at the German Youth Institute. His research concerns media literacy development, media use over the life course, and media effects, and it employs qualitative and quantitative methodology.

Daniela V. Negraia was Researcher in the Sociology Department at Oxford University and Guest Researcher with the Max Planck Institute for Demographic Research, Rostock, Germany. Currently, she is a Researcher with Amazon Pharmacy. Her work focuses on measuring population-level inequalities in time use and well-being at the intersection of parenting, gender, and social class.

Giampiero Passaretta is Assistant Professor at the Department of Political and Social Sciences at Universitat Pompeu Fabra, Spain. Previously, he held research positions at Trinity College Dublin, the European University Institute, and Stockholm University. His research interests include education and labour market inequalities, comparative sociology, and quantitative methods of data analysis.

Daniela Perrotta is a postdoctoral researcher at the Max Planck Institute for Demographic Research, Rostock, Germany, in the Laboratory of Digital and Computational Demography. Perrotta obtained her PhD in complex systems for life sciences from the University of Turin in 2018. Her research focuses on harnessing innovative data collection schemes and computational methods to study human mobility and the spread of infectious diseases.

Oliver Posegga is Interim Chair of Information Systems and Social Networks at the University of Bamberg, Germany, and an affiliate of the Center for Collective Intelligence at the MIT Sloan School of Management. His research adopts an interdisciplinary perspective bridging information systems research, computer science, and the social sciences and focuses on the collective dynamics of digitally enabled networks.

Gina Potarca is a researcher at Unisanté Lausanne (centre universitaire de médecine générale et santé publique) and the University of Geneva, Switzerland. She obtained her PhD in sociology at the University of Groningen, the Netherlands, in 2014. Her research interests revolve around the application of digital, multilevel, and longitudinal methods to the study of assortative mating, the social effects of the digital revolution, and mental health.

Wojtek Przepiorka is Associate Professor at the Department of Sociology, Utrecht University, the Netherlands. His research interests relate to analytical sociology, economic sociology, environmental sociology, organisational behaviour, quantitative methodology, and applied data science. His work is published in leading journals such as American Sociological Review, American Journal of Political Science, European Sociological Review, and Social Forces.

Francesco Rampazzo is a Lecturer in Demography at the Leverhulme Centre for Demographic Science, the Department of Sociology, and Nuffield College at the University of Oxford, UK. His research focuses on repurposing non-traditional data sources, especially from marketing, to infer demographic behaviours and dynamics.

R. Gordon Rinderknecht is a postdoctoral researcher at the Max Planck Institute for Demographic Research, Rostock, Germany, in the Laboratory of Digital and Computational Demography. His research interests focus on social isolation, time use, and inequality. Methodologically, he is interested in the opportunities and challenges related to online data collection. Gordon obtained his PhD in sociology from the University of Maryland in 2020.

Julia Sauter is a postdoctoral researcher at the LIVES Centre at the University of Geneva, Switzerland. She is a member of the research project ‘Family Ties and Vulnerability Processes: Network-Wide Properties, Agency and Life-Course Relational Reserves’. Her research interests relate to family configurations and their social capital.

Gemma Scalise is Assistant Professor in Economic Sociology at the University of Milan Bicocca, Italy, Department of Sociology and Social Research. Previously, she was Assistant Professor at the University of Bergamo and Max Weber Fellow at the European University Institute. Her research concerns labour market and welfare regulation, comparative political economy, and European governance.

Carsten Schwemmer is Professor of Computational Social Sciences at the University of Munich, Department of Sociology, Germany. His research concerns applying computational methods to ethnic minority and gender studies, digital media, political communication, and sociotechnical systems. His work is published in journals such as Computers in Human Behavior, European Sociological Review, and Social Media + Society.

Jan Skopek is Associate Professor at the Department of Sociology, Trinity College Dublin, Ireland. His research interests relate to family and social stratification, social demography and the life course, educational inequality, quantitative methodology, and digital social research. He is co-editor of several books on social inequality. His work is published in internationally renowned journals such as American Sociological Review, Social Forces, and European Sociological Review.

Bella Struminskaya is Assistant Professor of Methodology and Statistics at the Department of Methodology and Statistics at Utrecht University and Affiliated Researcher at Statistics Netherlands. Her research focuses on data collection using new technologies such as apps, sensors, wearables, and data donation, as well as broader survey methodology topics such as non-response, panel conditioning, smartphone surveys, and online and mixed-mode surveys.

Chia-Jung Tsai is a PhD student at the Max Planck Institute for Demographic Research, Rostock, Germany, in the Laboratory of Digital and Computational Demography. Her research interests focus on migration studies, causal inference, and experimental design. Chia-Jung is also affiliated with Pompeu Fabra University, Barcelona, Spain.

Saïd Unger is a computational social scientist and doctoral researcher at the Department of Communication at the University of Münster, Germany. His research relates to the study of disinformation and the science of science using social network analysis and quantitative content analysis.

Sonja Utz is Professor for Communication via Social Media at the University of Tübingen, Germany, and Head of the Everyday Media Lab at the Leibniz Institut für Wissensmedien in Tübingen. Her research focuses on social media use in professional settings, informal learning with social media, (para)social processes, algorithm acceptance, and human–machine communication.

Mark D. Verhagen is a researcher at the Leverhulme Centre for Demographic Science at the University of Oxford, UK. His work focuses on the incorporation of computational methods into traditional social science work. He is interested in applying concepts and methods from computer science in order to improve model specification, pattern recognition, and inferential effect estimation within traditional social science domains.

Pu Yan is Assistant Professor at the Department of Information Management at Peking University. Previously, she was a post-doctoral researcher at the University of Oxford. Her research concerns the influence of information and communication technology on everyday information practices and the role of digital media in people’s everyday lives. She employs mixed-methods approaches involving computational social science, surveys, and ethnography.

Simeon Yates is Professor of Digital Culture at the University of Liverpool, UK. He has led numerous interdisciplinary research projects examining the impact of the internet and digital technology on society. His research and publications have explored the impacts of the digital on language, arts, work, politics, and culture. Simeon’s work has particularly focused on issues of digital inequality.

Emilio Zagheni is Director of the Max Planck Institute for Demographic Research, Rostock, Germany. He is best known for his work on combining digital trace data and traditional sources to measure and understand migration and to advance population science more broadly. In various capacities, he has played a key role in fostering collaboration and exchange between demographers, statisticians, and computational social scientists.

Claudia Zerle-Elsäßer is Leader of the research division ‘Life Situations and Living Arrangements of Families’ of the Department of Family and Family Policy at the German Youth Institute. She also works on the ‘Growing Up in Germany’ survey. Her research relates to fatherhood, family formation and arrangements, and the digitalisation of family life.

Xinyi Zhao is a PhD student affiliated with the Max Planck Institute for Demographic Research, Rostock, Germany, and the Department of Sociology at the University of Oxford. Her research interests include applying digital and computational innovations in demography and social science, with a particular interest in migration and gender disparity. Her background is in cartography and geographic information systems.

Abbreviations

ABC model: agent-based computational model
ABM: agent-based model/modelling
AES: Adult Education Survey
AID:A: Growing Up in Germany: An Investigation of Everyday Life in German Families
ALE: accumulated local effect
ALMP: active labour market policy
API: application programming interface
BMI: body mass index
CAPI: computer-assisted personal interviewing
CATI: computer-assisted telephone interviewing
CMC: computer-mediated communication
CSS: computational social science
DPES: Dutch Parliamentary Election Study
DV: dependent variable
ELIS: everyday life information-seeking
EMA: ecological momentary assessment
ESM: enterprise social media
ESS: European Social Survey
EU: European Union
EVS: European Value Study
FAS: Family Affluence Scale
FtF: face to face
GGS: Generations and Gender Survey
GPS: Global Positioning System
GUI: Growing Up in Ireland
HBSC: Health Behaviour in School-Aged Children
HTML: HyperText Markup Language
IAB: Institute for Employment Research
ICT: information and communications technologies
IR: information retrieval
IRT: item response theory
ISEI: International Socio-Economic Index of Occupational Status
KHB: Karlson–Holm–Breen
LISS: Longitudinal Internet Studies for the Social Sciences
ML: machine learning
MS: micro simulation
NEPS: National Educational Panel Study
ODISSEI: Open Data Infrastructure for Social Science and Economic Innovations
OECD: Organisation for Economic Co-operation and Development
OLS: ordinary least squares
PASS: Panel Study Labour Market and Social Security
PDP: partial dependence plot
PIAAC: Programme for the International Assessment of Adult Competencies
PISA: Programme for International Student Assessment
PSM: public social media
R&D: research and development
RF: random forest
SDQ: Strengths and Difficulties Questionnaire
SES: socioeconomic status
SHARE: Survey of Health, Ageing and Retirement in Europe
SME: small and medium-sized enterprise
SML: supervised machine learning
SNS: social networking site(s)
TDS: Total Difficulties Score
TGI: trust game with incomplete information
TILT: Test of Technological and Information Literacy
TS: traditional statistics
TSE: total survey error
UK: United Kingdom
UML: unsupervised machine learning
US: United States
VET: vocational education and training

PART I INTRODUCTION

1. Introduction and overview to the Research Handbook on Digital Sociology Jan Skopek

1. A RESEARCH HANDBOOK ON DIGITAL SOCIOLOGY

It is hard to avoid commonplaces about the digital transformation of society when introducing a handbook on digital sociology. Digitisation – the coding of data and information in digital, non-analogue and, thus, non-physical form – digitalisation – the mediation or enabling of social and economic interactions and transactions through digital technology – and the pervasiveness of digital products and online services in all aspects of human life represent a societal factum and, by all accounts, will continue to do so in the unfolding twenty-first century. Digital technologies like the internet, social media, streaming, mobile apps and devices ranging from the good old stationary computer to handhelds, smartphones and wearables inextricably permeate the ways we go about our everyday lives, how we seek information, how we conduct economic transactions, how we construct our identities or how we pursue and maintain social relationships. The ‘fourth’ industrial revolution is rolling (Schwab & Davis, 2018), and looming large are new generations of digitally interconnected technologies – such as smart devices, the internet of things, 5G wireless technologies, artificial intelligence, augmented social reality or Mark Zuckerberg’s vision of the ‘metaverse’ – that will only further blur the lines between the physical and digital worlds in economy and society. The ‘digital’ is everywhere and there is no way back to de-digitalisation. Hence, defining where the ‘digital’ starts and where it ends, or what the ‘digital’ is or what it is not, can be challenging – if drawing such distinctions is nowadays meaningful at all.

That leads us to the not entirely unproblematic term ‘digital sociology’. What is digital sociology? Is it another new sociological sub-discipline with its own set of topics, theories and methods? Or does digital sociology stand for a transient trend – in a way, a milestone on the discipline’s way into the digital age? And, given that the digital is ever more becoming an essential and formative part of the modern world around us, is the quest for sketching a digital sociology meaningful at all? Isn’t, consequently, all sociology somehow becoming ‘digital’?

It was recognised early on that the ‘internet’ provides a new social space calling for sociological scrutiny of its implications for sociological subjects like social inequality, social networks and social capital formation, political participation, organisations and institutions, as well as culture (DiMaggio, Hargittai, Neuman & Robinson, 2001). Before long, ‘internet research’, ‘internet studies’ or ‘internet sociology’ were debated as new fields of sociological and social science scholarship (Cavanagh, 2007; Dutton, 2013; Hine, 2005; Shrum, 2005). From 2010 onwards, however, the notion of ‘digital sociology’ or digital social research increasingly superseded the earlier ‘internet sociology’. Lamenting a conspicuous lack of the ‘digital’ in mainstream sociology, various sociologists called for digital sociology as a necessary new sub-discipline and came up with programmes of what a digital sociology might be and how it may make sociology matter in a digital world (e.g., Fussey & Roth, 2020; Lupton, 2015; Marres, 2017; Orton-Johnson & Prior, 2013; Selwyn, 2019).

One of the most cited accounts came from Deborah Lupton (2015), who proposed a four-fold typology to define and delineate the field of digital sociology. According to Lupton, digital sociology entails, first, the study of digital technology use with a focus on how people use digital technologies in their everyday lives. Second, Lupton highlights aspects of data and the analysis of data; digital sociology thereby involves the qualitative and quantitative analysis of (large) digital datasets obtained from digital media and digital user-generated content, including text, images, video and audio. Third, Lupton underlines digital sociology’s task of being critical and reflexive in relation to the digital society, and fourth, she argues that professional practice in sociology – doing and disseminating research as well as carrying out teaching – will ever more rely on digital technologies and tools.

In a special issue of the journal Sociology, Fussey and Roth (2020) charted digital sociology’s intellectual origins and development. Drawing on the concept of ‘affordances’ – the possibilities but also constraints technologies offer for action (Hutchby, 2001) – they describe digital sociology as the study of ‘the affordances of technologies in various social spheres and how they shape and are shaped by social relations, social interaction and social structures’ (Fussey & Roth, 2020, p. 660). Neil Selwyn, in his book entitled What Is Digital Sociology?, concludes that digital sociology is less a sub-discipline and more an ‘ongoing evolution of the discipline … revitalizing classic sociological concerns while introducing novel (or at least substantially altered) points of contention and curiosity’ (Selwyn, 2019, chapter 5, para. 50). In other words, ‘digital sociology’ denotes a transformative process that the discipline of sociology – much like the humanities or communication sciences – is going through, and ‘in twenty years from now there may well not be a digital sociology per se’ (Selwyn, 2019, chapter 5, para. 54).

I personally find much appeal in Selwyn’s evolutionary perspective on digital sociology. Sociology becoming digital raises new sociological questions in relation to the digital turn but also opens new opportunities for addressing classic sociological concerns in the context of the digital society. Likewise, digital sociology represents a forum for inter- and cross-disciplinary convergence – both in theory and methodology – in the study of the digital society. Thus, I believe that digital sociology, approached from an evolutionary perspective, can help secure and continue sociology’s position in producing research of relevance for understanding modern society and, by extending its empirical programme through new digitally mediated methods of data collection and analysis, can help sociology rise from its ‘empirical crisis’ (Burrows & Savage, 2014; Savage & Burrows, 2007). Based on those ideas this volume has been put together.
Rather than elaborating yet another programmatic definition of what digital sociology may or may not be, this Handbook’s mission is pragmatic: it adopts a research-based approach to various themes of digital sociology and gives voice to international academics, both junior and senior, who work on aspects of the digital society from various disciplinary and sub-disciplinary angles and perspectives. In doing so, the Handbook’s approach is two-fold. First, it covers social science research carried out within, on and through the digital society. That entails research concerned with how people use and integrate digital technologies in their everyday lives, research concerned with participation and inequality in the context of the digital society, and research using digital technologies for collecting and analysing data in empirical sociology and the social sciences at large. Second, the book aspires to be a platform for inter- and cross-disciplinary exchange that addresses not only sociologists but invites researchers from various fields and disciplines for conversation and collaboration. While some of the chapters in this volume were written by experts from sociology, various other chapters were contributed by a wider spectrum of experts in computational social science (CSS), computer and information sciences, communication sciences, cultural studies, demography and digital demography, data science and methodology, economics, family research, media research, political science, psychology and social psychology.

The Handbook organises those various expert contributions into five parts corresponding to five major themes of digital sociology. Part I, apart from providing an overview in this chapter, explores the relationship between social theory and the internet in everyday life, aiming to integrate sociology, communication and information sciences to advance our understanding of the digital society. Part II – ‘Researching the digital society’ – addresses novel and innovative methodologies in the context of digital sociology and social research, including various contributions from the growing field of CSS. For example, authors discuss recent developments in digital and computational demography, the future of surveys and survey methodology, the increasing role of passive and ‘in situ’ data collection via mobile devices, ‘big data’ and how social science can unleash its full potential, innovative methodologies such as machine learning (ML) and agent-based modelling (ABM), as well as how to design digital focus group interviews in qualitative research. Part III focuses on the analysis of digital lives and online interaction. Showcased are selected topics such as the use of professional social networks, the culture and use of online dating as well as what we can learn about general aspects of assortative dating and mating from digital trace data generated by online dating sites, or what digital trace data can reveal about the workings of online markets and reputation systems. In addition, the part includes some pathbreaking applications of CSS such as the collection of big data from YouTube for communication and media research or the automated analysis of images uploaded on digital social networks to study online behaviour. Part IV of the book – ‘Digital participation and inequality’ – presents some of the latest research on the digital divide and digital inequality. A series of chapters taps into socioeconomic correlates of digital inequalities and education among the younger generation, school-aged children and adolescents, and gender-related inequality from novel longitudinal as well as cross-national perspectives. Finally, Part V concerns micro- and macro-level consequences of digital technological change in various societal domains. Chapters in that part not only cover the consequences of digital social media for family life or people’s well-being and mental health but also address the larger impact of digital technological change on labour markets and economic inequality, education, and political campaigning and elections.

Taken together, the Handbook offers a comprehensive resource for teaching and research in relation to digital sociology (or, perhaps better, sociology becoming ‘digital’) but also for digital social research in general. It speaks to academics, students and practitioners who want an overview of the variety of social research currently being conducted in relation to the digital turn of modern society. While the book can be read part by part, all chapters can be read independently too. In order to set the stage for the various themes and topics and to guide the reader through the book, the remainder of this introduction provides a synopsis of the book’s themes and a preview of what will follow in the chapters.

2. SOCIAL THEORY AND THE INTERNET IN EVERYDAY LIFE

As part of the introductory part of the book, Chapter 2 by Yan elaborates on social theory and the internet in everyday life. The digital transformation of economies and societies has left hardly any aspect of our everyday lives untouched. We use digital tools and media to engage in interpersonal communication through messaging, to seek and share information through search engines or social media, to shop online or to seek entertainment. In our daily routine practices, digital content and digitally mediated communication via stationary or mobile devices, especially smartphones, are inextricably incorporated into our everyday life experiences and behaviours, which renders, as Yan argues, the past notion of a dichotomy between ‘online’ and ‘offline’ life ever more artificial and even irrelevant. Yan argues that routine digital practices are fundamental for understanding the profound impact digital technologies have on society and, consequently, posits theorising and scrutinising the ways the internet and digital technologies shape people’s routine lives as a priority for the social sciences.

Addressing the lack of social science research on routine digital practices, the chapter reviews and connects theoretical models from information science, communication science and sociology, all of which explain different aspects of digital everyday life. Information science, for example, provides models of individuals’ information-seeking behaviour, which help us understand why and how individuals interact with digital technologies to satisfy their daily informational needs. Communication science provides, with domestication theory, a viable framework that can explain why, how and when certain digital technologies are adopted and how they are embedded into everyday information routines. Sociology contributes a rich literature on digital inequalities and divides that analyses disparities in digital access, use and skills as they align with dimensions of social stratification and inequality. Various chapters later in the book will revisit digital inequality in more detail. Overall, Yan’s compelling review demonstrates that even though the three disciplines share a common interest in the impact of digitalisation and information and communications technology (ICT) on society, as well as some theoretical (e.g., habitus theory) and empirical-methodological common ground, there remains a large disconnect between the fields, manifesting itself, for example, in a sheer lack of cross-referencing. Yan provides a visionary outlook on how breaking down those disciplinary boundaries can foster theoretical and empirical progress in understanding digitalisation’s impact on society.

3. RESEARCHING THE DIGITAL SOCIETY

Rapid technological change and digitalisation in nearly all aspects of our lives have a substantial impact on how we design and carry out social research. The ubiquity of mobile devices such as smartphones, handhelds or wearables increasingly promises not only new opportunities for survey methodology but also new forms of ‘on the go’ data collection, all of which challenge traditional paper- or phone-based survey research. The digital age, the data revolution and – perhaps most importantly – the revolution of digitally mediated social interaction and ‘user-generated’ content have led to a near infinite universe of ‘big data’: the digital traces people leave behind when using social media, networks and applications in everyday life. Novel digital-social technologies and applications such as social networking sites (SNS) and digital social media have created new social arenas, but they also record dynamic aspects of social life that had previously been virtually inaccessible to standard survey-based social research. Occasionally, the data revolution on the one hand and sociology’s reliance on the traditional sample survey and interview on the other have also led to pessimistic accounts of empirical sociology’s prowess to keep pace with the data reality of the digital society, perhaps best epitomised by Savage and Burrows’ seminal essays on the ‘coming crisis of empirical sociology’ (Burrows & Savage, 2014; Savage & Burrows, 2007).

While the potential of novel data for social research has been discussed for quite some time, its concrete realisation has faced various hurdles in terms of data access, collection, management and analysis. There is an ‘implementation’ gap which, in part, might be explained by a mismatch between the skills taught in traditional social science courses and the skills required when dealing with big data (Golder & Macy, 2014). Such a gap might be addressed by CSS, an emerging field that interlinks social research and computer science or, more precisely, ‘an interdisciplinary field that advances theories of human behaviour by applying computational techniques to large datasets from social media sites, the Internet, or other digitized archives such as administrative records’ (Edelmann, Wolff, Montagne & Bail, 2020, p. 62). CSS may play an active part in sociology’s evolution as discussed above via, for example, the introduction of ML techniques, which are common in computer science but widely absent from the empirical tool chest of the standard social sciences. Furthermore, CSS works with computer simulation of social systems, as pursued by ABM, which advances our understanding of complex dynamic social systems (see the brief sketch below). Those developments are by no means constrained to quantitative research strategies. Qualitative research is changing and facing new challenges too. For example, qualitative interviewing and focus groups are increasingly carried out using online tools, a trend that gained additional momentum during the Covid-19 pandemic.

Part II of the book contains a series of chapters that tap into the brave new world of digital social research. Chapter 3 by Kashyap, Rinderknecht and colleagues starts off by providing a comprehensive overview of digital and computational approaches in the field of demography. The focus on demography is by no means a limitation; rather, one may argue, it is an inspiration for sociology and the social sciences in general, since demography has been an early adopter of CSS approaches. At the centre of the authors’ discussion stands the following question: what is the relation between demography and the digital revolution ongoing since the turn of the millennium? The chapter approaches this question in three ways. Firstly, it discusses empirical evidence on how the digital revolution has affected people’s everyday lives, ranging from how people communicate, interact, seek information and use their time (for leisure or work) to how they consume goods and services. Those developments bear consequences for fundamental demographic processes such as health and mortality, fertility and family, and migration. For example: What is the connection between social media use and health? How do people use the internet to learn about health and health services? What is the role of misinformation in health outcomes? How does social media affect fertility outcomes such as teenage pregnancies? What effects result from the ‘internetisation’ of international migration? Secondly, the chapter discusses how the digital revolution has created a wide range of new data sources for demographic research, such as digital trace data created as a by-product of people’s digital activities, or satellite-gathered geospatial and remotely sensed data. Moreover, enhanced online surveys and novel crowdsourcing approaches to participant recruitment are discussed. Thirdly, the chapter discusses the use of computational methods such as simulation or ML techniques for demographic applications. Even though there are still many challenges related to those new data sources and methodologies, there is hardly any doubt about their significance for future social research.
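To make the notion of agent-based modelling a little more concrete, the following is a minimal, purely illustrative sketch (not taken from any chapter in this volume) of Schelling’s classic segregation model, a staple example of how simple individual-level rules generate complex macro-level patterns. All names and parameter values are hypothetical choices for illustration.

```python
# Minimal Schelling-style segregation ABM (illustrative parameters only).
import random

SIZE = 20                 # grid is SIZE x SIZE
EMPTY_SHARE = 0.1         # share of empty cells
SIMILARITY_WANTED = 0.4   # minimum share of same-group neighbours an agent tolerates
STEPS = 50                # number of simulation rounds

# Initialise the grid with two groups (1, 2) and empty cells (0)
weights = [(1 - EMPTY_SHARE) / 2, (1 - EMPTY_SHARE) / 2, EMPTY_SHARE]
grid = [[random.choices([1, 2, 0], weights)[0] for _ in range(SIZE)]
        for _ in range(SIZE)]

def unhappy(x, y):
    """An agent is unhappy if too few occupied neighbour cells host its own group."""
    me = grid[x][y]
    neighbours = [grid[(x + dx) % SIZE][(y + dy) % SIZE]
                  for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  if (dx, dy) != (0, 0)]
    occupied = [n for n in neighbours if n != 0]
    if not occupied:
        return False
    return sum(n == me for n in occupied) / len(occupied) < SIMILARITY_WANTED

for step in range(STEPS):
    movers = [(x, y) for x in range(SIZE) for y in range(SIZE)
              if grid[x][y] != 0 and unhappy(x, y)]
    empties = [(x, y) for x in range(SIZE) for y in range(SIZE) if grid[x][y] == 0]
    random.shuffle(movers)
    for x, y in movers:
        if not empties:
            break
        # An unhappy agent relocates to a randomly chosen empty cell
        nx, ny = empties.pop(random.randrange(len(empties)))
        grid[nx][ny], grid[x][y] = grid[x][y], 0
        empties.append((x, y))
    print(f"round {step + 1}: {len(movers)} unhappy agents moved")
```

Even this toy version reproduces the model’s famous result: mild individual preferences for similar neighbours aggregate into pronounced residential segregation, which is precisely the kind of micro–macro link ABM is used to investigate.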

Introduction  7 data sources and methodologies, there is hardly any doubt about their significance for future social research. The following chapters zoom into more specific aspects of digital social research. Chapter 4 by Das and Emery discusses the future of social surveys at the backdrop of the digital transformation. Showcasing the Dutch online panel Longitudinal Internet Studies for the Social Sciences (LISS) hosted at Centerdata for illustration, Das and Emery put focus on online interviewing and other digital data collection in surveys. General online panels such as LISS are of great value for academic research as they provide a fully fledged data infrastructure that builds on representative probability samples. Such samples cover the ‘online’ population but also the ‘offline’ population via equipping participants with technical equipment and internet services. Establishing a data infrastructure also involves the routine collection of core data (on households and individuals) as a ‘backbone’ which more specific survey projects can avail of and the provision of professional data documentation and dissemination services to maximise data accessibility and utilisation in the scientific community. Since academic online panels like LISS allow social researchers to ‘plug’ special purpose surveys into the existing infrastructure very easily and very quickly, those online panels become key instruments for measuring sudden change events, the value of which has been proven very recently. In the context of the Covid-19 pandemic and the related lockdown policies, LISS was a key instrument in rapidly collecting data to generate social indicators on various issues such as home-schooling practices in families, work arrangements and routines, aspects of care and gender inequality, or health. Considering these developments, it seems fair to ask: What is the future of social surveys? Das and Emery offer a preview of how that future might look and how it may have already begun. Besides a general pressure towards using digital tools for surveying, the authors argue that it is data integration that will play an ever greater role in modern surveys. Data integration can entail linking survey with (ever more) digitised administrative data but also linking survey data with momentary and physical assessments (using smartphones, wearables, activity trackers or other mobile digital data devices) or social media data (such as data from Facebook or Twitter). In addition, integrated online panels can create unprecedented opportunities for experimental social research by enabling ‘mass online experiments’ – a playing field for CSS. Hence, in contrast to a ‘crisis’ scenario (Savage & Burrows, 2007), Das and Emery arrive at an optimistic conclusion on the future of surveys: survey methodology will continue to play a fundamental role in social research whilst being enriched and integrated with other digital data sources. Chapter 5 by Struminskaya and Keusch traces innovative developments in the collection of social data using mobile devices and smartphones, thereby addressing in more detail some of the issues raised by the previous chapters. Smartphones allow us to consume digital services and media on the go but also record an immense amount of data; modern standard smartphones are packed with cameras, face recognition, GPS localisation and a whole range of sensors including accelerometers and gyroscope sensors, pedometer sensors, ambient light sensors, fingerprint sensors and barometer sensors, just to name a few. 
Consequently, as Struminskaya and Keusch argue, equipping study participants with smartphones allows social researchers to measure social phenomena ‘in situ’, that is, within the concrete everyday contexts in which they occur. For example, in health-related research, accelerometric data collected via smartphones can yield high-precision measurements of individuals’ movement or sleeping patterns throughout the day. The chapter illustrates the collection of smartphone data by drawing on an app-based labour market study which combines both passive (through sensors) and active (through self-reports) data collection. While app- and sensor-based measurement brings plenty of research opportunities, the chapter’s discussion highlights a series of challenges too, such as participants’ privacy concerns and willingness to share their app data, incomplete population coverage, and new sources of measurement error arising from device models with different technical capabilities and protocols or from how participants use their mobile devices. Much in line with Das and Emery’s assessment (Chapter 4), Struminskaya and Keusch conclude that the future of social surveys will likely see more ‘designed’ big data setups that combine the strengths of sensor and physical measurement data (large data of high velocity and accuracy) on the one hand and survey data (measurement of constructs under the full control of the researcher) on the other.
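
To give a concrete flavour of such sensor-based measurement, the following toy sketch (in Python) derives a simple movement-intensity metric – ENMO, a metric commonly used in accelerometry research – from raw triaxial accelerometer samples; the data here are synthetic, and real studies would rely on validated processing pipelines rather than this minimal computation.

    # Toy sketch: movement intensity from raw triaxial accelerometer data.
    # The samples are synthetic; real studies use validated pipelines.
    import numpy as np

    rng = np.random.default_rng(0)
    hz, minutes = 50, 5                     # 50 samples per second, 5 minutes
    n = hz * 60 * minutes
    xyz = rng.normal([0.0, 0.0, 1.0], 0.05, size=(n, 3))  # ~1g gravity on z

    # ENMO: Euclidean norm of acceleration minus 1g, negatives clipped to 0.
    enmo = np.clip(np.linalg.norm(xyz, axis=1) - 1.0, 0, None)

    # Aggregate to one value per minute, a common epoch in activity research.
    per_minute = enmo.reshape(minutes, -1).mean(axis=1)
    print("mean movement intensity per minute:", np.round(per_minute, 4))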

The previous two chapters’ discussions revolve around the survey method and how it will benefit from the employment of digital technology and data linkage. Integrating social and computer science perspectives, Chapter 6 by Posegga addresses the question of how ‘big data’ – one of the hallmarks of the digital era – can be of use for social science research. In a reflexive and thought-provoking fashion, Posegga stresses a perplexing paradox: although big data are abundant nowadays, why have these data contributed comparatively little to our understanding of the human and social behaviour underlying them? His chapter argues that ‘unlocking’ the full potential of novel big data sources requires tackling various challenges located at the intersection of the computer and social sciences. Digital trace data are a by-product of human interactions with software-engineered digital information systems (e.g., Twitter, Facebook, YouTube and Tinder). Such data are typically ‘found’ rather than self-reported (as survey data are), event-based and longitudinal. Using such data for social research demands a solid understanding of the data-generating process, which is usually not under the researcher’s control and tends to be specific to certain applications and information systems. A first set of challenges relates to the problem of how researchers can overcome various limitations in accessing relevant digital trace data. Although examples like Twitter or YouTube demonstrate that the ‘big tech’ industry may well operate on ‘open access’ models, many other providers of digital media products operate on ‘closed’ models (an example being online dating apps, for obvious reasons). Challenges further relate to the knowledge researchers have of the data-generating process underlying digital trace data, including an understanding of the technological design and the effect it has on user behaviour. A third set of challenges concerns the frequently opaque relation between theory and digital data. In conclusion, the chapter calls for coordinated, interdisciplinary and continuous efforts by an alliance of social research and computer science to overcome those fundamental challenges and unravel the true power of digital platform data.

Following Posegga’s programmatic discussion, the abundance of big data raises the question of how social research can use such data to their full potential.
ML, a technique common in computer science and frequently combined with artificial intelligence in digitalised and automated business applications, is increasingly seen as a promising method that can enlarge the predominantly deductive scope of social science – in which we challenge our knowledge (theory) with (typically sparse) data – by adding a powerful inductive approach that allows us to learn from data and possibly discover new insights (Grimmer, Roberts & Stewart, 2021). Chapter 7 by Erhard and Heiberger introduces ML as a tool for the social sciences on their way to becoming ‘computational’ social sciences. Deliberately addressing the wider audience of ‘non-computational’ social scientists, the chapter’s main aim is to highlight differences but also commonalities between ML and traditional statistical (TS) methods like regression as central tools in quantitative social science. As we learn from the chapter, ML and TS stand in different epistemological traditions. The conventional use of TS like regression analysis engages in hypothesis testing based on samples of independently drawn units and, by and large, general linear modelling. Such features are frequently at odds with the nature of big data produced in digital and social media applications. In contrast to the TS approach, the ML approach provides exploratory statistical methods that learn from data, extract information and provide predictive power. To exemplify the use of (supervised) ML methods, Erhard and Heiberger analyse attitudes towards immigration using data from the European Social Survey. Their data illustration shows how ML methods – compared with traditional regression models – can provide new insights into the importance of certain factors for attitude formation as well as their potentially highly non-linear effects. This practical example also demonstrates the trade-offs that must be made between TS and ML. Overall, Erhard and Heiberger by no means argue that ML will replace TS. Rather, their chapter is an inspiring demonstration of how quantitative research may benefit from using ML as a powerful complement to traditional regression approaches.
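
The basic contrast can be made concrete with a minimal sketch: on synthetic data containing a deliberately non-linear effect (illustrative only, not the chapter’s European Social Survey analysis), a flexible learner such as a random forest picks up structure that a plain linear specification misses.

    # Minimal sketch contrasting a traditional statistical model with a
    # supervised ML learner on synthetic data with a non-linear effect.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    n = 5000
    age = rng.uniform(18, 80, n)
    educ = rng.normal(12, 3, n)
    contact = rng.binomial(1, 0.5, n)

    # True data-generating process: a U-shaped (non-linear) age effect
    # that a plain linear specification cannot capture.
    logit = -0.5 + 0.002 * (age - 50) ** 2 + 0.15 * educ + 0.8 * contact - 2.5
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    X = np.column_stack([age, educ, contact])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    print(f"logistic regression accuracy: {lr.score(X_te, y_te):.3f}")
    print(f"random forest accuracy:       {rf.score(X_te, y_te):.3f}")
    print("feature importances (age, educ, contact):", rf.feature_importances_)

The forest’s gain in predictive accuracy comes at the cost of easy interpretability – precisely the kind of trade-off between TS and ML the chapter discusses.
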
Simulation methods represent another field of CSS and, while having a long tradition in demography, economics and sociology, have gained popularity especially due to the rise in computational power and the availability of large-scale digital data. ABM represents a theory-driven simulation approach for studying the behaviour of complex and dynamic social phenomena that cannot easily be reduced to individual agency. Chapter 8 by Lucas and Feliciani provides a gentle and non-technical introduction to the uses, motivations, theoretical and epistemological foundations, conceptual building blocks and implementation of ABM in the social sciences. Essentially, ABM simulates social outcomes that arise in complex and frequently unpredictable ways from individuals’ interdependent and adaptive actions. One considerable theoretical benefit of ABM lies in the requirement for the analyst to develop formal representations that specify the behavioural rules by which agents interact within environments. Middle-range theories, but also prior quantitative or qualitative empirical evidence, can feed that process. A significant empirical advantage of ABM is that it may be applied even in settings for which data on complex phenomena are scarce. The correspondence between simulated and observed macro-level outcomes can eventually further a causally adequate understanding of social phenomena. Hence, ABM offers a very flexible and powerful toolset to social scientists who aim to get a grip on the non-trivial processes of aggregation that link the micro-, meso- and macro-level aspects of sociological explanations. Lucas and Feliciani illustrate the use of ABM with two prominent applications in the sociological literature that study the workings of peer review and the phenomenon of coinciding lifestyle choices and political attitudes.
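
What such a formal representation can look like is perhaps best conveyed by a deliberately simple sketch: a generic model in which agents repeatedly adapt a binary attitude to sampled peers (an illustration of the ABM logic only, not one of the chapter’s two applications).

    # Minimal agent-based model: agents adopt the majority attitude among
    # three randomly sampled peers, with a small chance of deviating.
    import random

    random.seed(1)
    N, STEPS, NOISE = 200, 50, 0.02
    attitudes = [random.choice([0, 1]) for _ in range(N)]

    def step(attitudes):
        new = attitudes[:]
        for i in range(N):
            # Behavioural rule: sample three peers, adopt their majority view.
            peers = random.sample(range(N), 3)
            majority = 1 if sum(attitudes[j] for j in peers) >= 2 else 0
            new[i] = 1 - majority if random.random() < NOISE else majority
        return new

    for _ in range(STEPS):
        attitudes = step(attitudes)

    # The macro-level outcome (here, drift towards consensus) emerges from
    # micro-level interaction rules rather than being imposed directly.
    print(f"share holding attitude 1 after {STEPS} steps: {sum(attitudes)/N:.2f}")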

The previous chapters had a quantitative research focus, but qualitative research is going ‘digital’ as well. As the last chapter in Part II, Chapter 9 by Carmi, Lockley and Yates addresses how qualitative interview-based research may benefit from digital technology, especially online conferencing tools. Although online interviewing is not new to qualitative research, the aspect of unequal access and digital skills is frequently disregarded. The authors report in their chapter on a focus group study they conducted with participants from the United Kingdom who had low levels of digital literacy. While the study was originally planned as a traditional face-to-face focus group design, the impact of the Covid-19 pandemic and the associated health guidelines forced a shift to ‘remote’ focus groups using digital video conferencing tools (such as Zoom or Microsoft Teams). That shift posed considerable challenges for the research team but also for the participants with low digital skills. The chapter provides a fascinating and practically highly useful reflection on the researchers’ experiences in carrying out digital focus groups with low-digital-skill participants. Based on their first-hand experience, the authors elaborate a systematic set of recommendations that can assist the design and implementation of digital qualitative interview and focus group studies at all stages of the data collection process: from preparing online focus group studies and administering focus group interviews to managing the follow-up phase. Overall, the chapter provides important lessons for qualitative fieldwork that aspires to be digitally inclusive.

4 ANALYSING DIGITAL LIVES AND ONLINE INTERACTION

As the digital era unfolds, our everyday lives are ever more mediated by digital technologies. We use Google and Wikipedia to search for and gather information. Social network and media applications such as Facebook, YouTube, Instagram, Twitter, WhatsApp and Telegram have become fundamental tools of communication, information sharing and social interaction in our everyday lives. Today, finding a job routinely involves the use of professional social networks such as LinkedIn or Xing. Online dating sites and mobile dating apps such as eDarling or Tinder have become a major route for seeking and finding partners. We go shopping online, buy or sell products and services in online markets and star-rate those products and services. So-called ‘user-generated’ content, produced on a massive scale in social media and digital applications, is blurring the lines between information production and consumption, rendering the digital society an age of ‘prosumption’ (Lupton, 2015; Ritzer, 2014). One facet of digital ‘prosumption’ is the production of data through the consumption of digital media and services. In that respect, the ‘datafication’ of people’s lives and notions of ‘surveillance capitalism’ are an ongoing subject of critical data studies (Fussey & Roth, 2020; Mejias & Couldry, 2019). Recently, Yanis Varoufakis provocatively coined the term ‘techno-feudalism’ for a new economic order that supersedes traditional capitalist structures and in which consumers produce – for free – the data-driven capital stock of a few colossal ‘big tech’ organisations (Turkheimer, 2021). Rather than engaging with critical data and tech studies, this Handbook takes a more pragmatic research stance on ‘datafication’: the increasing digital mediation of people’s lives, relating to ‘offline’ behaviour (like dating or finding a job) but also genuinely ‘online’ behaviour (like ‘liking’ YouTube clips or ‘tweeting’ and ‘tagging’), creates a vast amount of digital platform data that can be of extraordinary value for social research (Lazer & Radford, 2017). Part III of the book showcases selected aspects of research on digital lives and examples of collecting and using digital platform data for substantive research programmes.

Anderl, Baumann and Utz in Chapter 10 start off with an overview of research on the use of professional social networks. Professional SNS like LinkedIn or Xing have become essential human capital markets for job seeking and recruiting in the modern digital society and economy. A major promise of such platforms is to provide employers easier access to qualified workers and job seekers better job opportunities. Yet, social research has only just begun to study the operation of these digital job markets, their utilisation and their effects on career outcomes. After introducing and defining the properties of professional SNS, the chapter reviews research that investigates, from a worker’s perspective, the benefits of professional social network use as well as the factors that predict beneficial professional SNS use, tapping into a series of fascinating questions: What are the career benefits of professional SNS use for workers? What do beneficial professional SNS look like? What is the role of social tie strength, network structure and diversity? What types of SNS use are beneficial and why? ‘Who’ uses professional SNS and ‘who’ benefits the most? Do professional SNS enact ‘Matthew effects’ in workers’ careers, in the sense that it is the already successful workers who reap the largest benefits, or can professional SNS enact compensation effects, in the sense that they compensate for restricted ‘offline’ opportunities? Furthermore, Anderl and colleagues discuss the different methodological approaches and the challenges related to issues of causality when studying professional SNS use and its impact. Finally, the chapter concludes with avenues for future research in this still very young field.

Recent estimates demonstrate that digitally mediated dating through online dating sites or mobile dating apps is crowding out traditional ways of meeting partners (Potarca, 2020; Rosenfeld, Thomas & Hausen, 2019). Social research is trying to understand what the widespread availability of ‘digital’ partner markets may imply for partner search and the formation of meaningful romantic relationships, but also their maintenance. The rising prevalence of online dating as a means of seeking romantic partners creates a pivotal need for researchers to understand online dating experiences and reflect on the impact of online dating on modern relationship processes. Chapter 11 by Coyle and Alexopoulos reviews research on the practice of online dating, detailing the various components of the online dating process. For example, how do people construct and ‘sell’ themselves in online dating profiles? How do people search for mates and what makes them approach or ‘swipe’ others? How does contact initiation and communication in online or mobile dating work and how does it differ from face-to-face encounters? How do online daters navigate from ‘online’ to ‘offline’ relationships and what are the associated risks? How do users experience the termination of contact with other users? Finally, the chapter offers recommendations for advancing research that tries to better understand the sociocultural aspects and consequences of online dating.

This review of the experiences and practices of online dating is complemented by Skopek in Chapter 12. While some social research has been busy understanding the individual and societal impact of online dating, other research has sought to exploit the ‘big data’ generated by online dating usage to gain new scientific insights into aspects of mate search and choice. Family research, demography and social stratification scholarship have a longstanding interest in the trends, causes and consequences of assortative mating and marriage. However, the bulk of empirical research works with census or survey data on actual union and marriage outcomes, leaving the micro-mechanisms of dating and mating opaque. Studying user interaction in online dating allows social research to shed light on some of those mechanisms at the early steps of relationship formation, when men and women encounter each other on the dating market. Skopek’s chapter clarifies, first, the theoretical and methodological premises, promises but also limitations of using data generated in online dating.
Second, the chapter provides a systematic survey of social science studies that employed digital trace data from online dating and were published in peer-reviewed journals over the past 15 years. The chapter illustrates selected aspects with first-hand analyses of empirical data gathered from a major German online dating site. The review of 25 studies and the additionally presented evidence reveal how trace data from online dating advance our scientific understanding of the mate-seeking strategies men and women apply, gender-specific preferences regarding partner attributes along various socioeconomic, sociodemographic and sociocultural characteristics, the role of two-sided choices in the emergence of dating relationships, as well as the implicit social and ethnic segregation that is created within digital dating markets. Finally, the chapter addresses some research on the consequences of online dating for union formation and assortative mating. The chapter concludes with an outlook on future online dating research and highlights the significance of longer-term collaborative ties between business and research for progress in the field.

Digital market platforms like eBay, Amazon and Airbnb allow millions of small businesses and people to trade in goods and services such as books, electronics, food, labour, transportation, care and accommodation. In Chapter 13, Przepiorka studies the operation of online markets by focusing on the trust problem as a central social dilemma that occurs between rational actors entering economic exchange relationships. For example, in anonymous online markets and under incomplete information, how can a buyer of a product or service trust the seller to deliver and thereby keep their end of the bargain? A game-theoretical modelling of the trust problem tells us that the expectation of a potential exchange partner’s trustworthiness is a key component in actors’ rational decision making, and that an actor’s social reputation can be a powerful signal of trustworthiness. Nowadays, most online market platforms use reputation systems to mitigate the trust problems that can arise when strangers trade with each other. In that regard, the trust game model predicts that good-reputation sellers are not only more likely to generate sales but can also command higher prices as a trust ‘premium’. To test these trust-game hypotheses, Przepiorka analyses a digital trace data set collected from eBay, containing data on 90,000 auctions of memory cards for electronic devices. Trace data from online markets can frequently be accessed openly, and researchers can collect them using web-scraping techniques (in contrast to data collection in online dating, as previously discussed in Skopek’s chapter). Taken together, the chapter presents a fascinating example of the use of digital trace data from online market platforms to study trust building, reputation formation, social preferences, discrimination and social inequality in markets. Throughout, the chapter points to ongoing debates and promising future research avenues. At the same time, the chapter is also proof of the relevance that social science theories can have for e-business and big tech companies in their pursuit of designing and developing effective online market applications.
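
The core of that trust calculus can be written down compactly. In a standard Coleman-style formalisation (a generic sketch; the chapter’s game-theoretical model may differ in its details), a buyer who attaches probability p to the seller being trustworthy, gains G if trust is honoured and loses L if it is abused, should place trust whenever the expected gain exceeds the expected loss:

    % Buyer's decision rule in a one-shot trust problem (generic sketch).
    % p: subjective probability that the seller is trustworthy,
    % G: gain if trust is honoured, L: loss if trust is abused.
    \[
      \text{place trust} \iff p \cdot G > (1 - p) \cdot L
      \iff p > \frac{L}{G + L} .
    \]

A seller’s reputation score raises the buyer’s estimate of p, which is why good-reputation sellers should both sell more often and be able to charge a premium – the hypotheses tested with the eBay auction data.
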
Digitalisation and the evolution of online technologies have also led to a revolution in media. Media is more social today than ever before. One of the most prominent examples of social media is YouTube, as of today the largest and most popular online video-sharing platform, enabling billions of users across the world to create, distribute, consume and interact on a nearly infinite universe of content. The activities of producers and consumers of this content generate gigantic amounts of data, which are of interest to social scientists for studying, for example, content production, reception and user interactions. How can we access and analyse these ‘big’ social media data in a reproducible and scalable fashion? Chapter 14 by Breuer, Kohne and Mohseni is devoted to exactly this question. The authors introduce social scientists to the world of YouTube data and highlight their potential, while also making researchers aware of the practical, methodological, ethical and legal challenges associated with accessing and working with these data and providing some suggestions on how those can be addressed. After giving a brief overview of social science studies employing YouTube data, the authors describe the three central pieces of YouTube data: users, videos and comments. Next, we learn about web scraping and application programming interface (API) harvesting as the two basic strategies for collecting YouTube data. In point-by-point fashion, we learn how data can be harvested by interacting with the YouTube API using standalone apps or extensions for statistical packages such as R. The authors also guide us through pre-processing the data, for example text-mining comment data for emojis or extracting hyperlinks and timestamps from video clips. Finally, the chapter provides valuable recommendations in relation to the research data lifecycle, which also involves aspects of archiving and sharing data obtained from YouTube. Taken together, the chapter is an example of the pathbreaking potential of digital sociology and serves as an inspiration especially for the younger generation of social and computational social scientists.
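
To give a flavour of the API route, a minimal Python sketch of comment harvesting might look as follows; the video ID and API key are placeholders, quota limits are ignored, and the chapter itself works with dedicated apps and R packages rather than this hand-rolled client.

    # Minimal sketch: harvesting top-level comments via the YouTube Data API v3.
    # VIDEO_ID and API_KEY are placeholders; quota handling is omitted.
    import requests

    API_KEY = "YOUR_API_KEY"   # obtained via the Google Cloud console
    VIDEO_ID = "dQw4w9WgXcQ"   # placeholder video ID
    URL = "https://www.googleapis.com/youtube/v3/commentThreads"

    comments, page_token = [], None
    while True:
        params = {"part": "snippet", "videoId": VIDEO_ID,
                  "maxResults": 100, "key": API_KEY}
        if page_token:
            params["pageToken"] = page_token
        data = requests.get(URL, params=params, timeout=30).json()
        for item in data.get("items", []):
            s = item["snippet"]["topLevelComment"]["snippet"]
            comments.append({"author": s["authorDisplayName"],
                             "published": s["publishedAt"],
                             "text": s["textDisplay"]})
        page_token = data.get("nextPageToken")
        if not page_token:
            break

    print(f"harvested {len(comments)} top-level comments")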

The rise of social media has led to an abundance of textual but also visual content created by people sharing their everyday lives as well as by organisations like companies or political parties promoting their agendas. Business statistics suggest that on Instagram alone, nearly 100 million photos and videos are shared every day.1 Images are an essential aspect of human communication, and decades of social science research have analysed images – normally, however, in (expensive) manual ways with very limited capacity to scale. Traditional methods of content analysis appear grossly out of sync with the content reality of the digital era, and computational methods like ML are on the rise for coding and analysing digital content (Evans & Aceves, 2016). A still very young but innovative field within CSS proposes the automatic analysis of visual data via techniques of artificial intelligence and ML. Schwemmer, Unger and Heiberger in Chapter 15 introduce us to the world of automated analysis of imagery content for studying online behaviour. Recent advances in computer vision enable automated image analysis that allows social research to further unlock the potential of digital behavioural data and user-generated content. First, the chapter provides a conceptual overview of the state of research. Automatic image analysis has been used for various purposes in social science research, including the recognition of objects, faces or visual sentiments. The chapter also reviews various methods of image recognition (which typically rely on ML), their computational demands, and the issue of bias in computer vision models. Second, the authors illustrate the power of automatic image analysis with an empirical case study that examines the online behaviour of United States Members of Congress during the early Covid-19 pandemic in 2020. The focus is on Congress members’ sharing of images involving the wearing of face masks, introduced as crucial health and safety measures during the pandemic. Using Instagram data and models for detecting face masks, the authors demonstrate that temporal dynamics and party affiliation play a substantial role in the likelihood of sharing images of people wearing face masks: images with masks are more often posted after the introduction of mask mandates, and Democratic Party members are more likely to share images with masks. Schwemmer and colleagues also examine differences by the age and gender of politicians. The chapter concludes with a critical review of pitfalls in relation to automatic image analysis and discusses the need to establish research standards for these new types of digital trace data. Overall, the chapter is an exciting example of how the next generation of digital sociology may avail of digital and user-generated content to understand social behaviour.
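
To indicate the general shape of such a pipeline – a rough sketch only, not the authors’ actual setup; the mask/no-mask task and the fine-tuned weights file are hypothetical – a pretrained vision backbone with a new binary output layer could be used for inference along these lines.

    # Rough sketch of binary image classification with a pretrained CNN
    # backbone. The mask/no-mask labels and fine-tuned weights are
    # hypothetical; in practice the final layer is trained on labelled
    # example images before inference.
    import torch
    from PIL import Image
    from torchvision import models, transforms

    # Pretrained backbone with a new two-class output layer (mask / no mask).
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = torch.nn.Linear(model.fc.in_features, 2)
    # model.load_state_dict(torch.load("mask_classifier.pt"))  # hypothetical
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    img = preprocess(Image.open("post_image.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(img), dim=1)[0]
    print(f"P(mask) = {probs[1]:.2f}, P(no mask) = {probs[0]:.2f}")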

5 DIGITAL PARTICIPATION AND INEQUALITY

An important stream in the digital sociological literature concerns inequality in digital participation and how such inequality aligns with traditional dimensions of social stratification such as socioeconomic status (SES), social class, gender and ethnicity. Already in the early stages of the digital transformation, when the internet was just diffusing, sociologists put the ‘digital divide’ high on their agenda of studying social change and the consequences the rise of the internet might have for that change (DiMaggio et al., 2001). The notion of the digital divide addresses the gap between individuals or groups of individuals who, in relative terms, reap the advantages of the internet (or digital technologies in general) and those who do not (Rogers, 2001). Inequalities in the form of unequal access to and use of digital technologies are of concern in the rapidly changing digital society; especially so when those inequalities align with and even reinforce traditional structures of social inequality (Robinson et al., 2015).

Research on digital inequality has evolved considerably over the past two and a half decades; while some earlier aspects of digital inequalities have been waning, others have come to the forefront of the more recent debate. The first generation of digital inequality research in the 1990s and early 2000s concerned the ‘first-level’ digital divide evident in different rates of physical access to internet connections and ICT equipment. As technological uptake has rapidly increased, later research has focused on inequality in digital skills, ICT literacy and digital use as the ‘second-level’ divide, and on inequality in the outcomes or benefits of digital use as the ‘third-level’ divide (Scheerder, van Deursen & van Dijk, 2017; van Dijk, 2020). In line with the older knowledge gap hypothesis stated in relation to the information society (Tichenor, Donohue & Olien, 1970), findings from digital divide research suggest that social inequality is being digitally reproduced, as it is mostly socially advantaged groups such as higher-educated and higher-skilled individuals who reap the largest benefits from ICT and the concurrent digital transformation (Hargittai, 2019). But in contrast to sheer knowledge gaps, the digital divide’s consequences for social inequality and societal participation are likely to be much more severe and profound, because the ‘digital’ is ubiquitous in nearly all aspects of modern life (van Dijk, 2020). The digital divide may also shape gender inequalities in important ways, as digital inequality research repeatedly highlights gaps in digital skills between men and women (Martínez-Cantos, 2017). As digital inequalities manifest in differential socioeconomic achievement and life chances, research on social stratification and social mobility must incorporate digital inequality as a new dimension of inequality (Matzat & van Ingen, 2020).

Part IV of the book contains a series of chapters that tap into digital inequality research with a focus on socioeconomic aspects of digital inequalities and education among the younger generation (school-aged children and adolescents) and on gender-related inequality. The first two chapters, by Becker and by Passaretta and Gil-Hernández, directly address the second-level divide by examining the way socioeconomic backgrounds shape the use of ICT (at home and in school) and ICT skills gaps among school children, and how these develop as children grow. Gracia, Bohnert and Celik provide a cross-national assessment of the well-being outcomes of children’s engagement with digital technology, thereby tapping into aspects of the third-level divide. Finally, Martínez-Cantos focuses on gender gaps in digital skills in a cross-national perspective. Let me give a more detailed overview in the following.

Digital technology and content provide new educational opportunities for learners and are increasingly integrated into modern-day classrooms (OECD, 2019), a trend which has only been accelerated by the Covid-19 pandemic and the practice of home schooling and e-learning during periods of lockdown. Students may use ICT for educational purposes at home as well, for example, using a computer and the internet for doing homework, using educational apps for learning or using email or messaging to communicate with other students. Chapter 16 by Becker examines students’ use of ICT for educational purposes at home, focusing especially on differences between social strata. Conceptually, the chapter reviews and combines, on the one hand, the literature on the digital divide and, on the other hand, the literature on social inequality in educational practice and culture at home and social reproduction within families. Becker’s conceptual analysis synthesises these different literatures and proposes a theoretical framework that specifies two key components, opportunities (access, skills) and motivation, as the social mechanisms linking family SES with students’ educational ICT use at home. Hence, social inequality in educational ICT use at home can be regarded as the result of differences in educational opportunities (e.g., home possessions such as having a desk to study) and academic motivations (learning orientation) as well as ICT opportunities (e.g., having digital devices, hardware) and motivations (e.g., enjoying using digital devices). Next, Becker employs PISA 2018 data on more than 150,000 students across 24 countries to test her model’s implications. Her results reveal a clear social gradient in the use of ICT at home for educational purposes. However, that gradient disappears when social discrepancies in opportunities – educational and ICT resources at home – and students’ academic and ICT motivations are accounted for. Furthermore, educational and ICT resources and motivations interact in such a way that the educational as well as the ICT aspects are necessary conditions for students to engage in educational ICT activities at home. Thus, educational ICT use seems to be particularly promoted when high educational resources meet high ICT resources and high academic motivations meet high ICT motivations. Overall, Becker’s study provides intriguing insights into how digital and educational inequality may interact in shaping children’s learning and education outcomes.

In Chapter 17, Passaretta and Gil-Hernández examine how social inequality in digital competencies manifests in the early educational careers of primary and secondary school students. While previous research usually employed cross-sectional data on ICT skills, the chapter takes a longitudinal perspective by following children in Germany from age 8 (Grade 3 in primary school) to age 16 (Grade 9 in secondary school). In line with cultural capital theory, the authors theorise that families of higher SES possess higher ‘digital’ capital, translating not only into more digital resources (such as technology available at home) but also into higher digital competencies, which are inputs for children’s educational production. Based on recent panel data from the German National Educational Panel Study, the chapter addresses a series of questions related to the role of social background in children’s development of ICT skills: How large are ICT skill gaps by social background relative to other inequality dimensions such as gender or migration background? When do social gaps in ICT skills emerge and how do they develop over children’s school careers? Do differential ICT resources and behaviour at home and school resources account for those gaps? And finally, are social gaps in ICT skills accounted for by social gaps in traditional ‘hard’ skills (such as math, science and reading)? The chapter’s findings provide novel and, in part, surprising empirical insights into those questions. First, social gaps are present as early as age 8, are comparable in size to social gaps in hard skills, and are far larger than gender gaps or gaps by migration background. Second, from age 8 to age 16, the size of social gaps in ICT skills remains by and large stable – the gaps do not shrink, but they also do not grow. Third, neither family nor school ICT access and use reasonably explain those gaps, a result that somewhat disappoints the digital capital hypothesis. Finally, social gaps in hard skills account entirely for ICT gaps, which suggests that social inequality in ICT skills is likely driven by much the same factors that drive inequality in hard skills. This leads the authors to speculate that inequality in early ICT skills may simply echo common inequality structures in family environments and experiences. Overall, the chapter provides an inspiring and novel contribution to the interdisciplinary literature on social stratification, skill formation and the digital divide.

Social research has only just started to understand how digital inequalities impact the lives of children and adolescents. How does young people’s engagement with digital technology differ by their families’ socioeconomic background and by their gender? And how is young people’s use of digital media related to their well-being outcomes, and do socioeconomic background and gender matter in shaping that link? Chapter 18 by Gracia, Bohnert and Celik examines these questions empirically through an innovative combination of longitudinal and cross-national data. The authors present two sets of analyses. The first employs longitudinal data from the Growing Up in Ireland study from age 9 to 18 to examine how digital use is associated with well-being outcomes as children grow up. Fixed-effects models demonstrate that the mental health impact of digital screen time is by no means equally distributed across socioeconomic classes and genders; indeed, it is young people from more disadvantaged socioeconomic family backgrounds, and girls, who experience greater mental health penalties when spending more time with digital devices. A second set of analyses tests the generality of those findings by employing rich cross-national data from the Health Behaviour in School-aged Children survey, which covered adolescents aged 11–15 from 35 countries. Regarding gender, the cross-national findings demonstrate that even though adolescent boys spend more time on digital activities, it is adolescent girls who experience larger mental health penalties when engaging in digital activities. National contexts, however, do matter, as the magnitude of those gendered patterns varies substantially across countries. Regarding socioeconomic background, the findings were, surprisingly, far from unequivocal. Not only do SES gaps in adolescents’ digital engagement vary in size and direction; the differential impact of digital engagement on mental well-being also varies between countries, as in some countries lower-SES youngsters were mentally more harmed whereas in others higher-SES youngsters were harmed the most. Overall, the chapter provides a unique piece of evidence on the role of gender and social origin in shaping digital inequalities for the young generation growing up with digital technologies. At the same time, the cross-national evidence presented in the chapter suggests that digital inequalities are a complex phenomenon shaped not only by individual and social factors but also by factors related to the national-institutional context of society.

Finally, Martínez-Cantos examines digital inequalities with regard to gender from a cross-national perspective. While gender differences in access (first-level divide) have become more or less obsolete, gender gaps in use and digital skills (second-level divide) and in the outcomes of digital use (third-level divide) may shape differences in the socioeconomic opportunities of men and women. Proposing an integrated theoretical model to address the issue of digital inequality by gender, the chapter reviews the literature on the digital gender divide from a dynamic, intersectional and cross-national perspective. Furthermore, Martínez-Cantos empirically illustrates his arguments by compiling cross-national data from various databases including the Digital Skills database (Eurostat), the Education and Training database (Eurostat) and PISA 2018. Importantly, the evidence suggests that gender gaps in digital skills do not automatically shrink as ICT diffuses or across successive generations. For example, Martínez-Cantos’ data indicate that gender gaps in higher-level and more specialised digital skills (such as writing code in programming languages) remain astoundingly persistent over time. From a cross-national perspective, we are surprised to learn that it is, paradoxically, the most gender-equal countries (in Northern Europe) that show some of the largest gender gaps in those higher-level digital skills, whereas in countries with low overall gender equality we find gender gaps to be small or non-existent. With regard to labour market indicators, the data also demonstrate a profound under-representation of females in specialised ICT occupations, with no general trend of increasing female presence in ICT jobs. What can be expected from the younger generation? Martínez-Cantos seeks answers to this question by evaluating data on female representation in higher education programmes related to ICT subjects and data on the occupational aspirations of secondary school students. Overall, the data provide a sobering account, as there is no clear or generalised trend towards greater participation of females in ICT fields. Martínez-Cantos concludes that further research is needed to better understand the causes of these rather persistent gender gaps.

6 DIGITAL TECHNOLOGICAL CHANGE AND ITS CONSEQUENCES

Finally, Part V of the Handbook presents selected contributions that examine the consequences of technological change, in particular digitalisation, for the everyday lives of individuals and families but also for the larger social, economic and political context of society. In part, these chapters revisit aspects discussed previously in relation to the third-level divide. Thematically, the six chapters in Part V treat the relation between digital technology and family life, the consequences of app-based dating for men’s and women’s well-being and mental health, the impact of social media use on well-being at the interface of work and home settings, the larger consequences of the digital transformation for labour markets, economic inequality and further education and, finally, the impact of digital media on political campaigning and elections. In the following, I provide a brief preview of the chapters.

Chapter 20 by Zerle-Elsäßer, Langmeyer, Naab and Heuberger starts off with an assessment of how increasing mediatisation through digital technologies shapes family life, parenthood and parent–child interaction. Theoretically, the authors build on the concept of ‘doing family’, a constructionist framework concerned with the organisation and practice of everyday family life. In relation to the mediatisation of modern family life, the authors highlight the role of balance management – practices related to the coordination and synchronisation of family members – and the construction of togetherness as two major dimensions of doing family. At the centre of the chapter stands the question of how families integrate digital media into strategies of doing family. Against the backdrop of a predominantly qualitative research literature on doing family in the digital age, the authors present fresh quantitative evidence elaborated from recent waves of the German AID:A Family Survey (Growing up in Germany – An Investigation of Everyday Life in German Families). Their findings offer, on a representative basis, new insights into the ways the use of digital media facilitates doing family with respect to parent–child and inter-partner interactions, along the lines of everyday communication and coordination, the ‘tracking’ of family members and the construction of togetherness through digitally mediated emotional exchange. Exploring their data, the authors reveal an interesting, previously overlooked facet of digitally doing family, namely that fathers seem to incorporate digital doing-family strategies somewhat more frequently than mothers. Overall, the evidence presented in the chapter highlights the importance of digital strategies for doing family but also raises numerous questions that will inspire further research on digitally mediated family practice.

Modern family life is increasingly ‘done’ digitally. But digital technologies have also become ever more relevant in the search for intimate relationships and, therefore, in the formation and perhaps also the disruption of the modern family. The book has previously addressed research on online and mobile dating in Part III. Part V now revisits the topic in Chapter 21 by Potarca and Sauter, who study the impact that the use of mobile dating apps may have on the mental health outcomes of men and women. The authors start off with a paradox: swipe-based dating apps on a smartphone (like Tinder, Bumble or Grindr) present a ‘lightweight’ approach to online dating, promising to make meeting and dating people easier than ever before; but at the same time, fears have been growing about the detrimental effects a ‘swipe’ app may exert on individuals’ lives due to experiences of a seeming abundance of choice, the ‘gamification’ of dating, the ‘objectification’ of the self and others (through the dominance of photos) or the experience of constant rejection. Social science evidence on the impact of mobile dating tools on people’s lives is sparse, however. Potarca and Sauter address this gap by employing data from the German Family Panel (pairfam). More specifically, they test whether the use of mobile dating apps (as distinct from dating website and online SNS use) is associated with higher levels of stress and depressive symptoms, and perhaps differently so for men and women. The chapter’s analysis yields mixed results. Searching for partners on apps was linked to greater stress among men but not among women. In contrast, dating app use predicted depressive symptoms in women but not in men. In accordance with objectification theory, lowered self-esteem accounted largely for the app–depression link in women. Overall, however, the effects found were weak, suggesting that the public discourse on the potentially harmful effects of app dating is exaggerated. The authors conclude that much more research is needed to better understand the long-term consequences of mobile dating.

Social media is present in nearly all life domains, including the modern workplace, which, not least due to the recent pandemic, has become ever more flexible, digitally mediated and remote. As a side effect, the boundaries between our professional and private lives become increasingly blurred, and this may have important consequences for our well-being. What are the implications of social media use in private home-related and professional work-related contexts for working adults’ happiness and well-being? Chapter 22 by Klingelhoefer and Meier sheds light on these questions by presenting an up-to-date review of the research literature on the role of social media for well-being at work, at home and in-between. After defining the concepts of well-being, social media and the work–home interface, the chapter presents a viable framework for studying social media use at the work–home interface which distinguishes between the setting (work versus home) and the content (work-related versus not work-related) of social media activities. The review focuses on the well-being consequences of social media use in inconsistent setting–content scenarios. ‘Cyberslacking’ or addictive social media use at work are examples of work-inconsistent use, whereas being ‘always on duty’ or ‘taking work home’ through social media are examples of home-inconsistent social media use. The review reveals that the relationship between social media use and well-being is a complex one and might be generated through various mechanisms.
For example, the evidence suggests that work-consistent use of social media enhances well-being at work (job satisfaction), but for work-inconsistent use the literature finds mixed effects, as it may, on the one hand, enhance the general work–life balance but, on the other hand, lead to more procrastination and slacking. Similarly, home-inconsistent use outside work hours can be detrimental to well-being through work–life conflict and a lack of disconnection; however, it also carries some positive effects when social media is used to enhance work flexibility and cultivate work relationships. Klingelhoefer and Meier argue that, even though evidence is still scarce, the widespread diffusion of remote work could intensify some of social media’s well-being effects at the work–home interface. While the review brings fascinating insights, it also discusses various deficits in the current state of the empirical literature on social media effects, which is dominated by cross-sectional research designs. Klingelhoefer and Meier suggest improvements and call for more longitudinal and experimental studies to advance the understanding of social media use and well-being at the work–home interface.

Next, Chapter 23 by Eichhorst and Scalise taps into the larger societal and economic consequences of technological change in the context of the digital society. In its profound impact on the world of labour, the digital transition of the economy, most recently accelerated by the Covid-19 pandemic, is changing jobs, the labour market and inequality, confronting social policy in modern welfare states with formidable challenges. The chapter examines the processes by which the digitalised economy causes reconfigurations of the labour market and the associated inequalities. A widespread diffusion of ICT and an increasing adoption of new digital technologies like cloud computing, artificial intelligence, ML, automation, data analytics and the internet of things characterise the modern digitising economy. Eichhorst and Scalise therefore first review the state of research on how those changes affect businesses, employers and workers alike. The review illustrates that technological change in the past, while always having led to concerns about technological mass unemployment, never manifested in unemployment shocks. Indeed, technological change has always been accompanied by job creation on balance and by increases in productivity and welfare. And yet, the impact of technological change on labour market opportunities has not always been even; rather, it has been skills-based and has promoted socioeconomic inequality. In contrast to past episodes of technological change, Eichhorst and Scalise argue that the ongoing digital revolution might put the previously rather ‘safe’ white-collar jobs under increasing pressure from ever more intelligent automation technologies. The empirical data they present point towards routine-based labour market polarisation and imply a disproportionate erosion of occupations and jobs in the middle of the skills distribution. At the same time, recent technological change has been accompanied by an expansion of non-standard and flexible work arrangements. Hence, while the digitalising economy is unlikely to bring widespread joblessness, it will have profound impacts through transformation and structural change in jobs. In the second part, the chapter examines institutional and policy responses to digitalisation by comparing the contrasting historical experiences of Italy and Germany, two of the largest manufacturing countries in Europe but with very different levels of digitalisation. The comparative case study demonstrates the importance of considering national institutional structures, such as labour market regulation, education and training systems, labour organisations and business coalitions, when the aim is to understand the impact of digital technological change. Finally, the chapter devotes some discussion to the additional impact of the recent pandemic on structural change and to some of the latest ad hoc policy responses for stabilising employment.
In their conclusion, Eichhorst and Scalise highlight the pivotal role of education and training institutions, policies and institutional reforms for the trajectories that modern welfare states take through the digital revolution.

Chapter 24 by Kruppe and Lang continues along these lines by examining the role of further education in the context of the digital transformation. As Kruppe and Lang argue, demand for further education is rising as digitalisation changes job tasks and skills demands, but the digital transformation also changes the way further education and training are carried out. Focusing on continuing vocational training, the chapter discusses the consequences of digitalisation for further training from the perspective of workers and companies. The chapter first reviews the literature and evidence on how the digital transformation is changing job profiles and skill requirements, and which groups of workers, with respect to qualification and skill level, are especially affected by those changes. Second, it explores the interrelation of skills, qualifications and further training and reviews the results of studies on the development of demand for further training during the ongoing digital transformation. The evidence shows that, even though demand for training is generally increasing, the groups of workers most affected by the digital transition are the least likely to participate. Third, the discussion turns to the impact the digital transformation has on the content of training courses. Finally, the chapter highlights the growing importance of e-learning settings in further education, a trend that has been amplified by the Covid-19 pandemic. While this push towards e-learning settings may have various practical advantages, it may also cement digital social inequalities. For example, in line with other research in relation to the third-level digital divide, recent studies suggest that the expansion of digital training options has benefited predominantly higher-educated workers. Kruppe and Lang conclude that, to keep pace with changing skill requirements, further training institutions and policies should specifically address low-skilled workers.

Finally, in Chapter 25, Jungherr adds a political science perspective on the consequences of digitalisation by focusing on digital campaigning. The chapter discusses how digital media has changed the work of political parties, especially the way they run political campaigns. Digital social media such as Twitter, Instagram and YouTube have come to play a fundamental role as mobilisation instruments in today’s campaigning strategies. Shining examples are the United States presidential elections of 2016 and, more recently, 2020, in which Twitter was used as a core tool in the digital campaigning strategies of the Democratic and Republican parties. The chapter starts off with a discussion of some of the dominant theoretical and empirical approaches to the study of digital campaigning. As Jungherr argues, much of the previous research on digital media use in political campaigning has addressed the political landscape in the United States (and especially the Democratic Party), resulting in a gap in knowledge on digital campaigning outside the North American context, a general lack of systematic theorising, a dominance of ethnographic approaches and a corresponding lack of quantitative research attempting to estimate media impact. To gain a better conceptual understanding of digital campaigning in political parties and campaigning organisations, Jungherr discusses a framework that delineates four functions of digital media in campaigning. First, from a practical standpoint, digital media improves the organisational structure and work routines of parties and campaigners. Second, digital media improves resource collection and allocation (e.g., identifying supporters in data-driven ways, generating donations). Third, digital media enables parties and organisations to achieve presence in the broader public arena independently of traditional media gatekeepers, while also allowing them to target specific groups more efficiently (e.g., through targeted advertisements on digital platforms).
Fourth and finally, digital media use may symbolise the professionalism and innovativeness of parties as well as of their candidates. A variety of campaigning examples (including some from the European context) illustrate the chapter’s discussion. In conclusion, the chapter outlines ongoing challenges to the academic investigation of the uses of digital media by political parties and their effects on campaigns, and presents perspectives for future research.


NOTE

1. See www.wordstream.com/blog/ws/2017/04/20/instagram-statistics.

REFERENCES

Burrows, R., & Savage, M. (2014). After the crisis? Big data and the methodological challenges of empirical sociology. Big Data & Society, 1(1), 205395171454028.
Cavanagh, A. (2007). Sociology in the Age of the Internet. New York: Open University Press.
DiMaggio, P., Hargittai, E., Neuman, W. R., & Robinson, J. P. (2001). Social implications of the internet. Annual Review of Sociology, 27, 307–336.
Dutton, W. H. (Ed.). (2013). Oxford Handbook of Internet Studies. Oxford: Oxford University Press.
Edelmann, A., Wolff, T., Montagne, D., & Bail, C. A. (2020). Computational social science and sociology. Annual Review of Sociology, 46, 61–81.
Evans, J. A., & Aceves, P. (2016). Machine translation: Mining text for social theory. Annual Review of Sociology, 42, 21–50.
Fussey, P., & Roth, S. (2020). Digitizing sociology: Continuity and change in the internet era. Sociology, 54(4), 659–674.
Golder, S. A., & Macy, M. W. (2014). Digital footprints: Opportunities and challenges for online social research. Annual Review of Sociology, 40, 129–152.
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2021). Machine learning for social science: An agnostic approach. Annual Review of Political Science, 24, 395–419.
Hargittai, E. (2019). The digital reproduction of inequality. In D. B. Grusky & S. Szelényi (Eds), The Inequality Reader (chapter 69). New York: Routledge.
Hine, C. (2005). Internet research and the sociology of cyber-social-scientific knowledge. Information Society, 21(4), 239–248.
Hutchby, I. (2001). Technologies, texts and affordances. Sociology, 35(2), 441–456.
Lazer, D., & Radford, J. (2017). Data ex machina: Introduction to big data. Annual Review of Sociology, 43, 19–39.
Lupton, D. (2015). Digital Sociology. Abingdon: Routledge.
Marres, N. (2017). Digital Sociology: The Reinvention of Social Research. Cambridge: Polity Press.
Martínez-Cantos, J. L. (2017). Digital skills gaps: A pending subject for gender digital inclusion in the European Union. European Journal of Communication, 32(5), 419–438.
Matzat, U., & van Ingen, E. (2020). Social inequality and the digital transformation of Western society: What can stratification research and digital divide studies learn from each other? Soziale Welt, 23, 381–397.
Mejias, U. A., & Couldry, N. (2019). Datafication. Internet Policy Review, 8(4), 1–10.
OECD. (2019). Building capacity: Teacher education and partnerships. In T. Burns & F. Gottschalk (Eds), Educating 21st Century Children: Emotional Well-Being in the Digital Age. Paris: OECD Publishing.
Orton-Johnson, K., & Prior, N. (Eds). (2013). Digital Sociology: Critical Perspectives. New York: Palgrave Macmillan.
Potarca, G. (2020). The demography of swiping right: An overview of couples who met through dating apps in Switzerland. PLoS ONE, 15(12), 1–22.
Ritzer, G. (2014). Prosumption: Evolution, revolution, or eternal return of the same? Journal of Consumer Culture, 14(1), 3–24.
Robinson, L., Cotten, S. R., Ono, H., Quan-Haase, A., Mesch, G., Chen, W., … Stern, M. J. (2015). Digital inequalities and why they matter. Information Communication and Society, 18(5), 569–582.
Rogers, E. M. (2001). The digital divide. Convergence, 7(4), 96–111.
Rosenfeld, M. J., Thomas, R. J., & Hausen, S. (2019). Disintermediating your friends: How online dating in the United States displaces other ways of meeting. Proceedings of the National Academy of Sciences of the United States of America, 116(36), 17753–17758.
Savage, M., & Burrows, R. (2007). The coming crisis of empirical sociology. Sociology, 41(5), 885–899.
Scheerder, A., van Deursen, A., & van Dijk, J. (2017). Determinants of internet skills, uses and outcomes: A systematic review of the second- and third-level digital divide. Telematics and Informatics, 34(8), 1607–1624.
Schwab, K., & Davis, N. (2018). Shaping the Fourth Industrial Revolution. Geneva: World Economic Forum.
Selwyn, N. (2019). What Is Digital Sociology? Cambridge: Polity Press.
Shrum, W. (2005). Internet indiscipline: Two approaches to making a field. Information Society, 21(4), 273–275.
Tichenor, P. J., Donohue, G. A., & Olien, C. N. (1970). Mass media flow and differential growth in knowledge. Public Opinion Quarterly, 34(2), 159–170.
Turkheimer, T. (2021). In conversation with Yanis Varoufakis. Cambridge Journal of Law, Politics, and Arts.
van Dijk, J. (2020). The Digital Divide. Cambridge: Polity Press.

2. Social theory and the internet in everyday life

Pu Yan

1 INTRODUCTION

Over the past few decades, many scholars have contributed to the study of the social dynamics of the internet, including the role of the internet in maintaining social networks, revolutionising labour markets, and mobilising social movements. However, little research has focused on the role of the internet in everyday information practices and on disparities in online information practices and experiences. Yet, as the internet becomes an integrated part of everyday life, understanding how information users interact with or engage in online platforms on a day-to-day basis is essential for revealing the embeddedness of the internet and helps social science scholars to theorise the role of information technologies in society. Meanwhile, a theoretical framework on the internet in everyday life can inform internet practitioners, digital content providers, and policymakers who provide online products or services that inform and assist users in their routine lives. In this chapter, I will first review models and theoretical frameworks on information behaviours and daily information practices. Models of human information-seeking practices provide theoretical frameworks for understanding the embeddedness of digital information technologies in everyday contexts. In Section 3, I will then review empirical studies on the adoption and use of information and communications technology (ICT) in everyday life. In particular, I will discuss how domestication theory can provide a framework for studying everyday uses of the internet. In Section 4 of the chapter, I will summarise empirical studies about digital divides or information divides and explore whether different adoption and use of ICT results in new forms of social divides. The chapter will provide a review of three fields across different social science disciplines, including research about information-seeking practices in information science, technology adoption and use in the social science understanding of the internet, and digital inequality in sociology and communication science. The combination of theories and empirical studies across different social science fields will provide a comprehensive picture of everyday information practices in the digital era and will show gaps in the field.

2 INFORMATION-SEEKING PRACTICES IN THE CONTEXT OF EVERYDAY LIFE

How do users search for information to meet their general or context-specific information needs? What are the factors that influence success or failure in seeking information in everyday life? These are questions that have often been asked and explored by information scientists who are interested in understanding information behaviours from a user-centric perspective. For example, exploring how rural internet users adopt the internet in information-seeking practices, such as finding agriculture-related information, helps information scholars understand the information needs of rural residents. Meanwhile, understanding when and how rural users encounter difficulties in information-seeking practices can help policymakers design information services tailored to the needs of rural information users. Studies of communication behaviours focus on how messages are delivered from senders to receivers. This topic overlaps with research on information-seeking behaviours: communication channels are defined as information sources, and noises or disruptions in the communication process are interpreted as misinformation (such as rumours) or disinformation. Compared to communication studies, information-seeking research puts more emphasis on the individual or environmental factors associated with individual information seekers (message receivers or audiences in communication studies) (Wilson, 1999). Applied to the study of ICT, information-seeking research provides insights into the interaction between users and digital technologies by focusing on users' contexts and the social environment of their information-seeking practices, including intervening factors and social or cognitive barriers in the process. As Wilson emphasises, information scholars need to move from 'an examination of the information sources and systems used by the information seeker to an exploration of the role of information in the user's everyday life in his work organisation or social setting' (Wilson, 2006, p. 666).

2.1 Models of Information Seeking

Many social science disciplines are interested in studying the flow and exchange of information. For example, economists are interested in exploring how market information influences economic behaviours, communication scholars are interested in understanding the role of media in broadcasting information to the public, and those in organisation studies are interested in how information spreads within social groups. Nevertheless, most of the social science research on human information practices primarily focuses on the sharing or exchange of information rather than the seeking and processing of information. Some studies of information-seeking practices in work and everyday life contexts have nevertheless provided a theoretical lens for understanding the interaction between users and information sources. These researchers also suggest models that take into account the social, cultural, and technological environment in which users work and live. Wilson (1981, 1997, 1999, 2006) proposed a model of information needs and information seeking to understand the interconnections among information users' needs, information-seeking behaviours, and successes or failures in information seeking. He argues that users' information needs are situated at the personal level, influenced by their physiological, affective, and cognitive states, and can also be contextualised within users' social roles and their social, economic, political, and technological environments. Wilson's information-seeking model includes both active searching and information acquisition as well as passive attention and passive search. He also points to the importance of adding the stress and coping strategies of information users to information-seeking models. Brenda Dervin's (1998) sense-making model is both a theoretical framework and a 'methodological approach' to information-seeking behaviours (Dervin, 1999, p. 728). Emphasising the understanding of what users think, feel, and want, she focuses on the complexities of user experiences in encountering problem-solving situations, and on how the information-seeking process helps make sense of the problems in such contexts and bridges the gaps in knowledge. Instead of blaming information users for not being able to use information systems efficiently,

Dervin recommends a broadening of the concept of the information-seeking process from the interaction between users and information systems to understanding the barriers users encounter in information seeking. This can help researchers understand the use of existing information systems and also illuminates alternative information strategies that users apply to bridge gaps in an information-seeking context. As Dervin highlights, sense making 'opens up to examine the ways in which information helps rather than assuming, as most studies have, that help is inherent in information' (Dervin, 1999, p. 745). While sense making helps scholars to understand the gap-bridging process in information seeking, one underlying assumption of the theory is that users are aware of the problem-solving situations they have encountered. However, in real-life information-seeking practices, users might need information not for problem solving but for other purposes, such as looking for information to confirm their beliefs about social issues. Or again, in some cases, users seek information rather passively, through information monitoring, without having particular problematic situations in mind. Furthermore, although Dervin acknowledges the importance of the social environment for sense making, her theoretical framework tends to focus mainly on individuals' emotional or cognitive processes. With little attention paid to the role of social networks or the community within which the individual lives, the framework is difficult to apply in combination with sociological theories of social stratification. Nevertheless, Dervin's sense-making approach encourages information research to value concrete details of users' everyday lives that contribute to the understanding of information seeking, including their frustrations and satisfactions, commonalities and exceptions, and barriers and solutions. Based on empirical studies of information-seeking practices, theoretical models have been developed to summarise different stages in information seeking (Ellis, 1993) and to describe search processes from the users' perspective (Kuhlthau, 1993; Kuhlthau et al., 2008). Some models emphasise users' emotions during the information-seeking process: as Kuhlthau explains, her focus is on users' emotional change from a feeling of uncertainty to confidence; users 'are seeking meaning rather than merely collecting information' (Kuhlthau, 2016, p. 79). Despite efforts to conceptualise human information-seeking behaviours with different theoretical models (see Table 2.1 for a summary of information-seeking models), it is often difficult to break these down into testable hypotheses that yield correlational or causal relationships between psychological, organisational, or social factors, information needs, and information sources (Järvelin & Wilson, 2003). To compensate for the lack of testable hypotheses in information models, Järvelin proposes first studying human information behaviour through a bottom-up approach as a pre-theoretical stage and then testing the observed phenomena through top-down approaches such as experiments or simulations (Järvelin, 2016). Ingwersen and Järvelin (2005) also include algorithmic information sources such as information retrieval (IR) systems in their model of information seeking. The combination of information seeking and IR research has added both human and algorithmic actors to information indexing (or ways of prioritising or organising information), which has implications for the study of information-seeking practices in the digital era. While IR systems improve the efficiency of locating a piece of information in a large database, the authors argue that the algorithms are 'designed on the basis of topical relevance only' (p. 10) and lack an understanding of users' needs.
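To make concrete what ranking 'on the basis of topical relevance only' means, the sketch below implements a bare-bones TF-IDF ranker over an invented toy corpus; it illustrates the general technique rather than any specific system discussed in this chapter:

```python
import math
from collections import Counter

# Toy corpus: the ranker sees only word overlap ("topical relevance"),
# not who is asking or why.
docs = {
    "d1": "treating a mild fever in children at home",
    "d2": "fever pitch football fans and match day culture",
    "d3": "when a fever in children needs emergency care",
}

n_docs = len(docs)
# Document frequency of each term across the corpus.
df = Counter(term for text in docs.values() for term in set(text.split()))

def tfidf(text):
    """TF-IDF vector of a text; terms unseen in the corpus are ignored."""
    counts = Counter(text.split())
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items() if t in df}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def rank(query):
    """Order documents purely by topical similarity to the query string."""
    q = tfidf(query)
    return sorted(docs, key=lambda d: cosine(q, tfidf(docs[d])), reverse=True)

# A worried parent and a sports journalist issuing the same query receive
# the same ranking: the system knows nothing about their situations.
print(rank("fever children"))
```

The point of the sketch is the omission: nothing in the scoring function represents the user's context, task, or everyday situation, which is precisely the gap the models above try to theorise.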

Table 2.1  Summary of information-seeking models

Model: Information needs and information behaviour (T.D. Wilson). Key points: information needs as the foundation of information-seeking behaviour; types of information seeking: passive attention, passive search, active search, and ongoing search.

Model: Sense making (Brenda Dervin). Key points: users seek information to bridge the gap in problem-solving situations; examining details such as the emotions, needs, and wishes of users in sense-making processes.

Model: Information-seeking patterns (David Ellis). Key points: information seeking as a sequence of behavioural patterns, from starting and chaining through differentiating to extracting.

Model: Information search process (Carol Kuhlthau). Key points: search process stages of initiation, selection, exploration, formulation, collection, and presentation; unexpected uncertainty and anxiety during the process; emphasising the cognitive, affective, and physical dimensions of information seeking.

Model: Integrated information seeking and retrieval (Peter Ingwersen and Kalervo Järvelin). Key points: the information seeker is embedded in a social, organisational, and cultural context; both humans and algorithms can act as information indexers to provide information; includes human–information technology interaction in the framework.

2.2 Information Encountering and Serendipity

With an increasing diversity of information channels available on the internet, users sometimes encounter useful information without purposefully seeking it. Information scholars have called this kind of discovery 'information encountering' (Ingwersen & Järvelin, 2005, p. 25), which occurs when users bump into information in their daily routines (Erdelez, 1999). The experience of information encountering is also termed serendipity, which describes both the condition and the strategy of the knowledge creation process when users' mental efforts do not lie in constant monitoring of information sources. However, unlike information encountering, serendipity could 'go beyond accidental' and could be 'actively sought' (Foster & Ford, 2003, p. 336); that is, users might look to serendipitous sources to provide them with information. Compared to other information-seeking practices, whereby locating information is the purpose of the information activity, information encountering is often the by-product of other everyday activities (Fisher et al., 2007). However, there are different types of information users when it comes to the experience of information encountering, with some more aware of information acquisition through information encountering than others. Research has shown that those information users who are active in purposive information seeking are also active in information encountering. Users who rely on passive monitoring of information in information seeking are also less motivated in information encountering (Pálsdóttir, 2010). Users can benefit from having higher sensitivity in information encountering and increasing the diversity of information channels, mainly through making unexpected connections between existing information sources such as persons, events, locations, or subjects on one side and users' informational or non-informational needs on the other (Makri & Blandford, 2012a, 2012b). Researchers have also shown that the user has to be open-minded towards new information

channels and be willing to accept new ideas in order to benefit from information encountering (Foster & Ford, 2003). Digital platforms differ in their potential to facilitate information encountering. Researchers have compared digital environments for their role in supporting serendipity and found that platforms that are trigger-rich, enable connections, and lead to the unexpected have a higher potential to support information encountering (McCay-Peet et al., 2015). Social media outperform other digital environments in increasing users' chances of information encountering. Some social media embed the following features in their platform design: reachability to a wider audience, fast information dissemination, personalisation and information-feeding systems, timeliness of updates, documentation of knowledge and experience, and retrievability (Panahi et al., 2015). Many other studies have shown the complexities of information encountering or user experiences of serendipity. Nevertheless, there is a lack of studies that compare how different types of information users, with various strategies in information encountering and differentiated levels of digital skills, are influenced by this passive approach to information acquisition.

2.3 Information Seeking in Everyday Life Contexts

As mentioned by many information scholars, the study of information-seeking behaviours is highly contextual. Thus, researchers have highlighted the necessity of examining information behaviours within different contexts. Ingwersen and Järvelin (2005) emphasise the 'situational nature of information and on assuming persons' work tasks or cultural interests, and information needs based on them' (p. 2). They also argue that differences in information contexts influence the choice of information sources. Therefore, they recommend evaluating the merits of an information model on information-seeking behaviours by whether or not 'an information actor (is) immersed in his or her situation and information environment' (p. 14). However, most of the current research on information-seeking practices has focused on professional or work-related contexts rather than everyday life contexts such as health or leisure. Few information-seeking scholars have extended their focus from academic and professional settings to everyday life and produced theoretical frameworks for studying information seeking in daily life. The main exception is Savolainen (1995), who first introduced the theoretical framework of everyday life information seeking (ELIS) for the study of human information behaviours in non-work-related contexts such as healthcare or hobbies. Compared to information seeking in the occupational or professional setting, ELIS is 'less systematic and more intuitive' (Spink & Cole, 2001, p. 303). Drawing from Bourdieu's sociological theory of habitus, which refers to the skills, habits, and knowledge accumulated from experiences in an individual's lifeworld, Savolainen constructed the framework of ELIS on the basis of 'way of life'. Way of life refers to 'the basic context in which problems of nonwork information seeking will be reviewed' and includes both 'objective and subjective elements' that constitute everyday life (Savolainen, 1995, p. 262). Individuals' information seeking is influenced by both their social status and cultural capital and their ways of life, which are, in turn, shaped by personal values and beliefs. Savolainen also distinguishes the seeking of orienting information as against practical information. While seeking practical information aims to find solutions to problems, orienting information provides the background to the problems and situations. However, the two types of information are interchangeable: orienting information describes a more passive exposure

to information, but it constitutes a large part of everyday information-seeking scenarios. When comparing the information-seeking practices of middle-class as against working-class users, Savolainen found that working-class users seem to depend more on immediately available information sources than middle-class users, which he considers problematic because information accessibility does not necessarily translate into information quality. McKenzie (2003) has highlighted the importance of studying non-active or non-purposeful information seeking as a significant type of everyday information practice. She summarised four phases of information practices in ELIS: active seeking contact or asking questions, active scanning or browsing information sources, non-directed monitoring and information encountering in unintended places or contexts, and information seeking by proxy with the help of a gatekeeper. Compared to Savolainen's model, McKenzie gives equal importance to active and non-active information seeking and therefore broadens the concept. Empirical studies of information-seeking practices are often embedded in a specific context in everyday life, for example, in music (Laplante & Downie, 2011), parenting (Loudon et al., 2016), child development (Given et al., 2016), education (Liu, 2010), fiction books (Ooi & Liew, 2011), or healthcare (Yeoman, 2010). Hektor (2001) conducted one of the early empirical studies on everyday use of the internet, which also ranged across various contexts, and he suggested a categorisation of seven information-seeking contexts depending on individuals' 'life-activities' (p. 282): caring for oneself, caring for others, household care, reflection and recreation, transportation, procuring and preparing food, and gainful employment. Such a categorisation of information-seeking practices helps to tie information research to the user's lifeworld, which in turn helps to understand the social implications of technologies.

2.4 Everyday Life Information Seeking Used in Empirical Research

Table 2.2 provides a summary of empirical studies using the ELIS framework. Previous research on ELIS has found that, with the adoption of ICT, the internet and mobile phones have become essential information resources (Agosto & Hughes-Hassell, 2005; Savolainen, 1997). Information from offline social networks still plays an important role in assisting information users to solve day-to-day problems (Huotari & Chatman, 2001; Loudon et al., 2016). However, new platforms such as social networking sites (SNS) or online forums help users to establish new social connections or maintain existing social ties on the internet. They help users to cross the boundaries between offline and online social networks and to find information from sources beyond users' immediate social circle (Eynon & Malmberg, 2012). Researchers have found that SNS platforms such as Facebook are viewed by users as more than a tool for social networking but also as a channel for seeking information to meet everyday information needs (Sin & Kim, 2013). As Spink and Cole (2001) have argued, the internet has an interactive potential and can bring hybrid information sources to a broader audience across diverse social sectors. It is also worth noting the influence of digital technologies on facilitating everyday information practices and the diverse ways that digital technologies are used by various social groups. Further, with the increasing prevalence of algorithmic recommendation systems, we need a reflexive analysis of information-seeking practices on emerging digital platforms (Davenport, 2010) which connects how algorithms shape information with how users understand this shaping process. Therefore, there is an increasing need for 'empirical research efforts to analyse how specific communities use various conceptual, cultural, and technical tools to access printed and digital documents and to evaluate and create knowledge' (Tuominen et al., 2005, p. 342).

Table 2.2  Summary of empirical research on everyday life information seeking

Savolainen, Reijo (1995). Everyday life information seeking: Approaching information seeking in the context of 'way of life'. Group: teachers and industrial workers. Method: interviews (N1=11, N2=11); qualitative. Findings: putting ELIS in the context of individuals' 'way of life' – social status, social capital, values, and media system; comparing middle-class and working-class information users: (1) different media preferences; (2) different dependence on immediately accessible information sources. Notes: focus on non-work-related information seeking such as healthcare and hobbies.

McKenzie, Pamela J. (2003). A model of information practices in accounts of everyday-life information seeking. Group: Canadian women pregnant with twins. Method: interviews (N=19); qualitative. Findings: a two-dimensional model of information-seeking practices that consists of four modes of information acquisition: active seeking, active scanning, non-directed monitoring, and by proxy. Notes: broadening the types of information-seeking practices, including non-direct information acquisition; calling for longitudinal research on ELIS.

Agosto, Denise E., and Hughes-Hassell, Sandra (2005). People, places, and questions: An investigation of the everyday life information-seeking behaviours of urban young adults. Group: teens aged 14–17. Method: written activity logs and semi-structured group interviews (N=27); qualitative. Findings: heavy preference for people as information sources; schoolwork, time-related queries, and social life as the most common and most significant areas of ELIS.

Huotari, Maija-Leena, and Chatman, Elfreda (2001). Using everyday life information seeking to explain organisational behaviour. Group: top management and other sectors within a university in Finland. Method: semi-structured interviews (N=14); qualitative. Findings: applying small-world theory to explain organisational information-seeking behaviours in everyday contexts; linking theories such as social types, insiders and outsiders, worldview, social norms, information behaviours, and trust: 'A promising area of future research concerns the fundamental role that trust, and social norms play in the creation of social values' (p. 361). Notes: using a social network perspective to understand organisational behaviour.

Sin, Sei-Ching Joanna, and Kim, Kyung-Sun (2013). International students' everyday life information seeking: The informational value of social networking sites. Group: international students in the United States. Method: survey analysis (N=180); quantitative. Findings: the majority of respondents frequently use SNS for everyday life information seeking; SNS emerged as the only positive predictor of the perceived usefulness of acquired information in meeting daily needs. Notes: use of SNS for information seeking in a cross-cultural adjustment process.

Spink, Amanda, and Cole, Charles (2001). Introduction to the special issue: Everyday life information seeking research. Method: review. Findings: the internet has the potential to be an interactive and hybrid information flow channel in diverse cultural and social contexts; development of generalisable process models that hold across situations. Notes: integration of ELIS theories and models within a broader human information behaviour research context; understanding ELIS from situation perspectives.

Carey, Robert F., McKechnie, Lynne E. F., and McKenzie, Pamela J. (2001). Gaining access to everyday life information seeking. Group: pregnant women; members of a self-help support group; pre-school children. Method: review. Findings: gaining access to the field site in ELIS studies. Notes: reflection on research experiences of conducting ELIS.

Laplante, Audrey, and Downie, J. Stephen (2011). The utilitarian and hedonic outcomes of music information seeking in everyday life. Group: 18–29-year-old young adults. Method: interviews (N=15); qualitative. Findings: the acquisition of music and the acquisition of information about music contributed to satisfaction. Notes: context-specific: music.

Savolainen, Reijo, and Kari, Jarkko (2004). Conceptions of the internet in everyday life information seeking. Group: information users in Finland. Method: participant observation; qualitative. Findings: most interviewees faced difficulties in trying to provide elaborate definitions for the internet; identified information sources are socially and culturally conditioned constructs by which people make sense of their information environment. Notes: conceptions of the internet as information source.

Loudon, Katherine, Buchanan, Steven, and Ruthven, Ian (2016). The everyday life information seeking behaviours of first-time mothers. Group: first-time mothers. Method: group interviews (N=22); qualitative. Findings: 'Beyond the shared reality of common experiences and emotional bonds, trust and fear of judgement are key factors influencing mothers' engagements with information sources' (p. 43). Notes: information needs of first-time mothers; analysed the reasons for barriers in information seeking.

Yeoman, Alison (2010). Applying McKenzie's model of information practices in everyday life information seeking in the context of the menopause. Group: participants (patients, staff, and women from the same geographical area) from the menopause clinic. Method: in-depth semi-structured interviews (N=35); qualitative. Findings: making sense of the situation, interpretative repertoires of the menopause, receiving information and support, and providing advice; challenges in encounters with information and support. Notes: context-specific: health information seeking in the context of the menopause transition.

Sin, Sei-Ching Joanna (2016). Social media and problematic everyday life information seeking outcomes: Differences across use frequency, gender, and problem-solving styles. Group: undergraduate students in the United States between 20 and 24 years old. Method: online survey (N=438). Findings: gender differences in dealing with conflicting information, with females more likely than males to be affected by conflicting information; different outcomes of using social media for information seeking in everyday life contexts. Notes: source preferences in everyday life information seeking.

Savolainen, Reijo (2004). Enthusiastic, realistic and critical: Discourses of internet use in the context of everyday life information seeking. Group: people who are interested in developing themselves by means of free-time information seeking. Method: interviews (N=18); qualitative. Findings: three interpretative repertoires were identified: enthusiastic, realistic, and critical; central to the critical repertoire is the critical view on the low amount of relevant information available on the internet.

Ooi, Kamy, and Liew, Chern Li (2011). Selecting fiction as part of everyday life information seeking. Group: adult fiction readers who are members of book clubs. Method: interviews (N=12); qualitative. Findings: the selection of fiction books is influenced by personal characteristics and circumstances and by sources from everyday life such as family, friends, book club, and mass media. Notes: context-specific: reading.

Given, Lisa M., Winkler, Denise Cantrell, Willson, Rebekah, Davidson, Christina, Danby, Susan, and Thorpe, Karen (2016). Watching young children 'play' with information technology: Everyday life information seeking in the home. Group: Australian pre-school children aged 3–5. Method: exploratory observation study (N=15); qualitative. Findings: artistic play, socio-dramatic play, early literacy, and numeracy. Notes: context-specific: child development in the home.

Davenport, Elisabeth (2010). Confessional methods and everyday life information seeking. Method: review. Findings: alternatives to the confessional approach in ELIS research: (1) information practice, with discourse analysis and conversational analysis as a counter movement; (2) web analysis; (3) reflexive analysis of platforms and analysis of information ethics. Notes: review of ELIS development; tension between realism and nominalism.

Note: Journal articles reviewed in this table were collected using the database Web of Science Social Science Citation Index. All results contain 'everyday life information seeking' in the article title. The search query was: TITLE: (everyday life information seeking).

3 DIGITAL TRANSFORMATIONS: ICTS IN EVERYDAY LIFE

Since the early adoption of the internet for civilian use, social scientists have proposed different theories to describe and explain the adoption and diffusion of information technologies in contemporary society. Rogers' (2003) book Diffusion of Innovations provided a model of the diffusion process of innovations that has been frequently used to summarise the adoption process of digital technologies. The model includes users' information seeking in different steps – from receiving knowledge of an innovation, to the decision of whether or not to adopt the technology, to implementing the technology. As the adoption of the internet has diffused from elites to middle-class and working-class users, researchers noticed the significant changes brought by digital technologies to individuals' everyday lives. Internet scholars since the 2000s have called for a different perspective on studying the internet, 'not as a special system but as routinely incorporated into everyday life' (Wellman & Haythornthwaite, 2002, p. 6). To date, scholars from different social science disciplines have contributed to the study of ICT in everyday life from various perspectives. For example, communication scholars are interested in how the internet provides new ways of connectivity, transforms channels of mass communication, and reconstructs political power (Castells, 2013). Social anthropologists have found the internet a virtual field site and have explored the diversity and complexity of social life online (Hine, 2015; Miller, 2016). Scholars who study developing countries examine the mundanity and embeddedness of the internet in users' everyday lives in the Global South (Arora, 2019; Jeffrey & Doron, 2013). The mobile phone, with its portability and affordability, has further blurred the boundary between online and offline communication and information seeking. It is not only used in daily coordination with families and friends (Katz & Aakhus, 2002) but also plays an indispensable role in assisting everyday information seeking. The versatility of mobile phones has become taken for granted by users and deeply embedded in everyday practice (Ling, 2012). The connectedness of the mobile internet also redefines what constitutes private and public realms by enabling users to switch freely across different models of connectedness (Schroeder, 2010) and across social circles at different scales (Ito, 2005).

3.1 Domestication of ICT: Theoretical Framework and Empirical Research

Domestication theory was developed by social science scholars to understand how digital technologies have become 'tamed', moving from professional and elite uses to gradually becoming integral parts of domestic life (Haddon, 2004, p. 4). The theory examines how digital technologies fit into people's everyday lives and assesses the negative or positive impacts of ICT. What is particularly valuable in this theoretical framework is that it offers both a summary of the process of ICT adoption and an account of the consequences of adopting digital technologies in daily practices. Domestication theory provides a holistic account of how users consume technologies in the context of everyday life. It theorises the process by which ICT, including material and non-material aspects of objects, are tamed and reshaped by users. As Berker and colleagues have summarised, 'domestication research suggests that only when the

novelty of new technologies has worn off; when they are taken for granted by users in their everyday-life context that the real potential for change is visible' (Berker et al., 2005, p. 14). Domestication theory originated from studies conducted around the 1990s on new communication and information technologies in the household, including television, computers, and telephones. Early researchers already noticed the multifaceted implications of ICT for individuals and households. The domestication process includes both how individuals and households 'include media and media-related activities into their own projects of fixing their identities in time and space' but also ICT's 'mediation of the public and the private sphere' and 'mediation of the global and the local' (Silverstone, 1991, p. 146). Focusing on the intertwined relationship between private and public life and technology use, Silverstone and Hirsch (1992) summarise four elements of the domestication of technologies in a household: appropriation, objectification, incorporation, and conversion. The appropriation stage involves the consumption of ICT products or services. The objectification stage concerns how technology is displayed within the domestic environment. However, as ICT becomes increasingly portable and embedded in daily routines outside the home, this stage of domestication needs to be refined, for example, to include when and where the technology is used instead of the place of technologies at home. The incorporation stage describes the process whereby the technologies are integrated as part of daily routines. The final conversion stage is when the technologies transcend the boundary of the personal or the household and become symbols of an individual's status in society. Many empirical studies have used domestication theory as a theoretical framework for the study of ICT adoption and use. Domestication studies were predominantly qualitative, with research focusing on households as research units (Hirsch, 1992). One strand of research, in particular, studied the domestication of the internet in disadvantaged families in the Netherlands and Ireland. This research found that the results of domestication processes vary with users' abilities to access diverse information sources (Hynes & Rommes, 2005). Empirical research also suggests that the different stages of the domestication process can be non-linear, with some stages overlapping with others (Ward, 2005). Recent studies of the domestication of ICT recognise the role of individual users, instead of households, in taming technologies and giving meaning to artefacts. Empirical research on emerging ICT has refined domestication theory to reflect how individual users incorporate technologies into everyday information and communication practices in the use of mobile applications (de Reuver et al., 2016) or the mobile internet (Donner, 2015; Katz & Sugiyama, 2016). Scholars have also applied domestication theory to study the adoption of ICT in non-Western countries. For example, Lim looked at how Chinese middle-class households' usage of the internet challenged traditional family values (Lim, 2005). McDonald and Oreglia both used domestication theory in their ethnographic studies of rural households: they showed the active roles of migrant workers in helping their families domesticate the mobile internet (Oreglia, 2013) and how different family members shape their own uses of the internet for individualised functions (McDonald, 2015). Domestication theory looks both at the environment in which users adopt and adapt ICT and at how technologies reshape users' information consumption in everyday life. However, as scholars have pointed out (Berker et al., 2005), with the widespread use of ubiquitous computing technologies such as mobile phones and mobile applications, ICT will gradually fade into the background of everyday life. This is particularly true of everyday online information-seeking practices, where personalisation and recommendation systems no longer require users' active seeking or searching for information. Against this background,

we need a new interpretation and conceptualisation of domestication theory to address the emerging questions related to ICT in everyday life. How does domestication theory apply to the embedded and everyday use of ICT for information seeking as the boundaries between the private and public or the global and local almost disappear? Perhaps this question can be addressed by extending the definitions of appropriation, objectification, incorporation, and conversion to a perspective that includes information seeking. While most empirical studies have focused on ICT devices such as computers or mobile phones, future research using domestication theory should extend the research focus to the domestication of emerging information technologies in everyday life. For example, users start by consuming online information in daily routines (appropriation) before integrating internet-based information sources as part of their everyday information intake (objectification). Next, internet users incorporate various online information sources into a personalised diet of information (incorporation), and eventually certain combinations of internet information choices, requiring the mobilisation of individuals' financial or social resources, become part of one's social status (conversion). This adaptation of domestication theory accounts for the perpetual connectedness and embeddedness of the mobile internet. It also provides a theoretical framework to study the adoption and use of the internet for information seeking in everyday contexts. Meanwhile, domestication theory would also benefit from more diverse methods, using qualitative data such as interviews and ethnography but also adding quantitative approaches such as surveys. Empirical studies that apply the framework of domestication theory can measure users' information-seeking practices from the appropriation of the internet, the level of engagement in different online activities, and the incorporation of the internet in everyday information seeking, all the way to the social consequences (confidence, experience, and barriers) of online information seeking.

3.2 Search Engines in Everyday Life

Compared to the study of everyday use of the internet in general, very few empirical studies have explored the use of search engines in everyday contexts. The few exceptions include a longitudinal study of search queries from 1997 to 2001, which showed a fascinating picture of the change in search interests among users: with the decreasing importance of queries related to entertainment and pornography, there was an increasing search interest in information related to commerce and employment (Spink et al., 2002). Using a triangulation of search activity diaries, observation, and in-depth interviews, Rieh (2004) conducted a qualitative study of user behaviours on search engines which showed that search engines are used as a directory to topic-specific websites: users first query the search engine to locate a site on a given topic and then search for information within that topic-specific website. The study also showed that search engine uses at home differ from workplace uses: users tend to search for a more diverse range of queries at home. IR studies have used search logs to understand information-seeking practices on search engines. A study conducted by researchers at Google compared the use of mobile and PC search and found that mobile search users mainly searched for 'quick answer'-type queries and were more likely to have their information needs satisfied by the search results (Li et al., 2009). Another study showed that the majority of search queries on Google and Yahoo among users in the Netherlands, for example, were about art, entertainment, and sport, with only one of the top ten queries about society. The same study found that users in some countries such as

Social theory and the internet in everyday life  37 Germany, Russia, and Ireland have a more diverse range of search queries than users in other countries (Segev & Ahituv, 2010). A study by Waller (2011) examined the transaction logs of Australian users of Google and found that the top queries entered in the search engine are names of specific websites such as Facebook, YouTube, or eBay, which echoes the findings of Rieh’s (2004) research. More than half of the searches in Waller’s study were about popular culture and e-commerce, followed by 15 per cent of queries on cultural practices. Queries about government and policies accounted for less than 2 per cent of web searches. Waller’s study demonstrates that the use of search engines is deeply embedded in everyday life practices. Sundin et al. (2017) analysed qualitative data collected from focus groups and showed that search engines are becoming embedded in everyday lives to the extent that the practice of searching becomes almost invisible to users. This ‘mundane-ification’ of search has also changed the relationship between information users and search engines: users have higher trust of the algorithms to help them assess the quality of search results and rely on the ranking order as a criteria of information quality (Sundin et al., 2017, p. 224). The search engine has also become an important source of information for users in developing countries. An interviewee from Paul’s study in India used Google to find causes and diagnoses of her child’s infection after having difficulties in understanding the doctor’s prescription in a consultation (Paul, 2015). 3.3

3.3 Algorithmic Recommendation Systems and 'Filter Bubbles'

Algorithmic recommendation systems are an emerging area of research in everyday uses of the internet. Nowadays, an increasing number of mobile applications, including e-commerce platforms, mobile news, online music, and video-sharing platforms, feed users with recommendations generated from users' browsing histories or demographic characteristics. On the one hand, this customisation of the news world has brought personalised information directly to users and has reduced the time they spend seeking news. On the other hand, recommendation systems which depend on user behaviour and network data might amplify the 'filter bubble' effect, leading to the polarisation of public opinion (Sunstein, 2007). While many communication and political science scholars have focused on the influence of media exposure and the echo chamber effect using social network analysis methods (Möller et al., 2018), few empirical studies have looked at the influence of recommendation systems on everyday life.

Some empirical studies have pointed to an optimistic future for algorithmic recommendation systems in information seeking. An experiment on music application users found that the recommendation system increased the commonality among users and broadened users' music interests, but the authors found no evidence of a fragmentation of users' networks (Hosanagar et al., 2014). A long-term study looked at how mainstream media agencies have implemented personalisation systems in delivering news content to their audiences (Kunert & Thurman, 2019). The study showed a balance of users' autonomy and artificial intelligence-assisted news reading, whereby users are provided with reading choices from editor-picked content alongside personalised news content. Another study collected the Google accounts of 168 users and used their accounts to search for news during the 2016 United States presidential election on Google (Nechushtai & Lewis, 2019). The study found that the search results of news about the election for conservative and liberal users were almost identical, rejecting the commonly held assumption that the personalisation function on search engines will lead

to the fragmentation of media information. A mixed-methods study published in 2019 used both mobile-logging software and qualitative interviews to explore the impact of recommendation systems on mobile news consumption (Van Damme et al., 2020). The findings suggest that peer recommendations of news through social media offer less diversity in topics than algorithmic recommendation systems and that some skilled users have started to personalise the algorithmic recommendation systems on their social media.
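To make the mechanism behind these debates concrete, the sketch below implements user-based collaborative filtering, one common way recommendations are generated from browsing histories; the users and items are invented, and production systems use far more elaborate and undisclosed models:

```python
from math import sqrt

# Toy interaction data: 1 means the user read or clicked the item.
history = {
    "ana":   {"politics": 1, "economy": 1, "cooking": 1},
    "ben":   {"politics": 1, "economy": 1, "sports": 1},
    "carla": {"cooking": 1, "gardening": 1},
}

def cosine(u, v):
    """Similarity of two users' reading histories."""
    dot = sum(r * v.get(item, 0) for item, r in u.items())
    norm = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return dot / norm if norm else 0.0

def recommend(user, k=2):
    """Score unseen items by the behaviour of similar users. Because the
    scores come from neighbours who already resemble the target user, the
    top items tend to reinforce existing interests: the mechanism behind
    'filter bubble' concerns."""
    seen = history[user]
    scores = {}
    for other, items in history.items():
        if other == user:
            continue
        sim = cosine(seen, items)
        for item, rating in items.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("ana"))  # ['sports', 'gardening']: neighbour similarity decides the order
```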

4 DIGITAL INEQUALITIES IN ACCESS, USE, AND INFORMATION SEEKING

While information-seeking models provide an abstract description of human information processes, information scientists often focus more on how information is sought than on how the information is processed or consumed by users. In other words, these models cannot reveal the social implications of seeking information. Part of the problem lies in the overemphasis on the user's psychological and emotional process during information seeking, which, as Wilson has rightly pointed out, 'is not directly observable' (Wilson, 1997, p. 567). However, one way to evaluate the consequences of information seeking is to evaluate it from a different perspective; namely, how the approaches to, and the resources and competencies involved in, information-seeking practices are shaped and influenced by users' social status. In this vein, scholars who developed or used domestication theory have also linked the theory to Bourdieu's concepts of social and cultural capital and situate the domestication of ICT within existing social structures (Haddon, 2007). Both information and communication scholars have been studying the topic of digital divides or digital inequalities since the early 2000s. Over the years, digital divides researchers have examined the various dimensions of digital inequality, including the demographic, socio-economic, and technical factors behind the divides. In this section, I will review some of the influential work on digital divides and discuss how the research on digital inequalities has theoretical implications for the study of information-seeking practices.

4.1 Digital Divides Research

Early research on digital divides noticed the problem of different adoption levels of ICT between urban and rural users and started to question the utopian view that the digital revolution would benefit every member of society (Lentz & Oden, 2001; Parker, 2000; Rideout, 2000). Many survey studies conducted in the 2000s suggested that older users are less likely to have access to the internet (Loges & Jung, 2001) and mobile phones (Rice & Katz, 2003). Using national survey data collected in 1995 and 1998, Hindman (2000) showed that the adoption of ICT is more strongly associated with socio-economic and demographic factors (age, education, gender, and income level) than with metropolitan/non-metropolitan residence. As more empirical studies revealed the complex mechanisms behind digital divides, scholars such as Selwyn (2004) and Hilbert (2011) refined the concept. They included not only access to and use of ICT hardware such as PCs or mobile phones but also engagement with software such as information and services. Donner (2015) systematically analysed the use of the mobile internet in developing countries. He suggested that the features and functions of the mobile internet both fuel the penetration of internet access and pose new challenges for digital inclusion.

He proposed an 'after-access lens' to consider the increasing divergence in internet devices, quality of connection, affordability of connectedness, and digital skills. Recent empirical research has moved beyond 'first-level digital disparities' in access to examine 'second-level digital inequalities' in the use of the internet (Robinson et al., 2015, p. 570). For example, van Deursen and van Dijk (2011) treated the ways users participate in different types of online activities as a dimension of digital divides. Using confirmatory factor analysis on survey data related to internet usage activities, they identified seven types of uses: information, news, personal development, commercial transactions, leisure, social interaction, and gaming. They found that age and gender are the most important factors accounting for differences in internet usage activities. Using the internet as an information source might have a positive influence on an individual's social status: a survey study of United States internet users showed that the informational use of the internet is most strongly associated with higher socio-economic status (Wei & Hindman, 2011). Many scholars have criticised the dichotomous division of the information rich and information poor. Instead, they have suggested that digital divides are 'hierarchical rather than dichotomous' (Selwyn, 2004, p. 348) or recommended the use of 'a continuum of digital inclusion' (Livingstone & Helsper, 2007, p. 684). Some researchers have studied factors that are associated with users' various motivations for using the internet and their different levels of digital skills. Davies and Eynon's (2018) research on young people's technology practices identified different types of users: non-conformists, PC gamers, academic conservatives, pragmatists, and leisurists. Van Deursen and van Dijk, in contrast, defined four types of digital skills: operational, formal, information, and strategic skills. Interestingly, they found that the elderly perform worse than young users in operational and formal skills but are not significantly different from the younger generations in information and strategic skills, holding other variables constant. It is also worrying that only operational internet skills grow as users gain internet experience. As an increasing number of online tasks require skills beyond basic operational skills, the lack of the other three types of skills, which do not necessarily increase as users spend more years online, could result in digital disadvantages. As the authors warned, 'while originally the digital divide could be "easily" addressed by providing physical access [to ICT], this now seems to be much harder when content-related internet skills are considered' (van Deursen & van Dijk, 2011, p. 909).
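To illustrate the kind of measurement model behind such usage typologies, the sketch below sets up a small confirmatory factor analysis with the Python library semopy on simulated survey responses; the items and the two-factor structure are invented for illustration and do not reproduce van Deursen and van Dijk's seven-type instrument:

```python
import numpy as np
import pandas as pd
import semopy  # structural equation modelling library: pip install semopy

rng = np.random.default_rng(0)
n = 500

# Simulate responses driven by two latent usage types (invented items).
info, leisure = rng.normal(size=(2, n))
df = pd.DataFrame({
    "news":     0.8 * info    + rng.normal(scale=0.6, size=n),
    "search":   0.7 * info    + rng.normal(scale=0.6, size=n),
    "learning": 0.6 * info    + rng.normal(scale=0.6, size=n),
    "gaming":   0.8 * leisure + rng.normal(scale=0.6, size=n),
    "video":    0.7 * leisure + rng.normal(scale=0.6, size=n),
    "music":    0.6 * leisure + rng.normal(scale=0.6, size=n),
})

# Confirmatory factor model: each observed activity loads on one latent use.
description = """
informational =~ news + search + learning
entertainment =~ gaming + video + music
"""
model = semopy.Model(description)
model.fit(df)
print(model.inspect())  # factor loadings and the covariance between the two uses
```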

4.2 Information Divides

Information scholars have contributed to theory development related to digital divides by studying how different social groups seek and share information. Elfreda Chatman focused on the information world and information-seeking behaviours of working-class and impoverished information users. Driven by her research interest in the relationship between economic poverty and information poverty, Chatman applied theories from the sociology of knowledge to understand the difference between insiders and outsiders and the impact of information secrecy, deception, and lack of trust in inhibiting the sharing of information. Chatman (1999) also borrowed the sociologist Robert Merton's distinction between cosmopolitans and locals to distinguish users who are interested in the wider informational environment beyond the community from users who narrow their information focus to their immediate surroundings. Thus, one of the important contributions of Chatman's theory of information poverty is her combination of information-seeking research with sociology. For example, she argues that 'information

poverty is partially associated with class distinction'. She also explored the impact of social networks on information poverty and found that the insiders of an impoverished community suffer from a lack of information because of the outsiders' withholding of 'privileged access to information' (Chatman, 1996, p. 197). Chatman summarised the information-seeking experiences of the poor by observing that 'they live in an impoverished information world', that information sources in this world are limited in 'new possibilities', and that their 'perceptions about reality are not adequate, trustworthy, and reliable' (Chatman, 1991, p. 440). Meanwhile, the huge amount of information content and the range of information sources available on the internet create new challenges for inexperienced users who need to navigate the internet for information seeking. Empirical research on the use of digital technologies for information seeking has found that socio-economic and demographic variables such as gender are associated with the outcomes of information-seeking practices (Sin, 2016). Savolainen and Kari (2004, p. 225) studied how information users conceptualised the internet and found that many users have difficulties in defining the internet, with some of them viewing the internet as indefinite and its information sources as 'poorly organised'. Difficulties in acquiring relevant information also lead to a critical repertoire view of the internet among users who are interested in information seeking for personal development purposes, resulting in a negative and reserved viewpoint on the internet (Savolainen, 2004). One of the barriers reported in many empirical studies is the difficulty some users have in coping with the anxiety and disorientation caused by receiving too much information (Loudon et al., 2016). Hargittai et al.'s (2012) research on information overload found that while most users do not feel overwhelmed by the amount of online information, they are disappointed by the quality of some online information sources such as social media.

5 TOWARDS INTERDISCIPLINARY RESEARCH OF THE INTERNET IN EVERYDAY LIFE

In this chapter, I have reviewed theories and empirical studies in three areas related to everyday information seeking on the internet: (1) everyday information needs and information-seeking models; (2) domestication theory and studies of everyday uses of search engines or algorithmic recommendation systems; and (3) digital divides or inequality. Figure 2.1 provides a diagram of the three areas covered by the literature review. A group of information scholars has shifted the research focus from information systems to information users and the social environment around the users. The development of different information models provides a holistic perspective in understanding the information-seeking process within the users’ own lifeworld. Meanwhile, information theories on serendipity and information encountering take the unpurposeful acquisition of information into account and these theories have implications for studying information practices in the online environment. One of the most important developments in the theory of information seeking is the study of ELIS practices, which focuses on non-work-related everyday contexts of information seeking. By using the ELIS framework in different everyday contexts, information-seeking studies reveal how information-seeking practices constitute users’ way of life and vary across different social groups. Empirical research on information seeking and information needs has focused predominantly on marginalised populations such as migrant workers and rural residents.


Figure 2.1  Diagram of the three research areas of everyday life and the internet

Domestication theory studies the process of adoption and adaptation of digital technologies within households by scrutinising how ICT fits into everyday life contexts, and it examines the influence of ICT on the household and society. However, domestication theory was first introduced before the rise of portable mobile devices. Hence the process of objectification – how ICT is displayed and incorporated within the household – needs to be refined to be applicable to the new digital environment. As ICT nowadays becomes ever more personal and private, we also need to shift domestication theory from the unit of the household to the individual. Most importantly, domestication theory provides us with a framework for studying the adoption and use of the internet for information seeking and how ICT fits into everyday information routines. While empirical research has applied domestication theory to the study of emerging ICT such as mobile apps, there is a gap in the literature on everyday uses of search engines and recommendation systems and on how algorithmic gatekeepers influence everyday information seeking.

Discussions centred on digital divides started in the early days of the internet in the 2000s. Researchers found that different adoption and uses of digital technologies were associated with pre-existing social stratification. In recent years, the focus of digital divides research has shifted from first-level divides such as adoption and access to second-level divides in usage and participation. However, despite some empirical studies about information poverty, very little empirical research has examined digital divides in the divergent use of the internet for information seeking.

The literature review of the three areas has demonstrated divisions within the scholarly community. On the one hand, I find a scholarly community divided by disciplinary barriers. Researchers from one area rarely discuss the theories or empirical studies from the other two areas, although all three research areas share some theories (e.g., Bourdieu's theory of habitus), a common research interest in ICT, and similar research methods (e.g., ethnographic research or surveys). The lack of cross-referencing to the other areas is due to the boundaries between different social science disciplines. I believe, however, that breaking the boundaries of disciplines will contribute to theory development. For example, digital divides research can compare the different uses of ICT at the various stages of the domestication process, from appropriation through objectification and incorporation to conversion. Everyday information contexts highlighted in ELIS models can potentially guide digital divides researchers to study variances and disparities in ICT use within different information categories. Future empirical studies that combine the literatures on information seeking, domestication theory, and digital divides will provide the theoretical frameworks for understanding the interactions between information users, technologies, and society and will shed light on the social significance of digital technologies for society.

REFERENCES

Agosto, D. E., & Hughes-Hassell, S. (2005). People, places, and questions: An investigation of the everyday life information-seeking behaviors of urban young adults. Library & Information Science Research, 27(2), 141–163.
Arora, P. (2019). The next billion users: Digital life beyond the West. Harvard University Press.
Berker, T., Hartmann, M., Punie, Y., & Ward, K. (2005). Introduction. In T. Berker, M. Hartmann, Y. Punie, & K. Ward (Eds), Domestication of media and technology (pp. 1–17). Open University Press.
Carey, R. F., McKechnie, E. F., & McKenzie, P. J. (2001). Gaining access to everyday life information seeking. Library & Information Science Research, 23(4), 319–334.
Castells, M. (2013). Communication power. Oxford University Press.
Chatman, E. A. (1991). Life in a small world: Applicability of gratification theory to information-seeking behavior. Journal of the American Society for Information Science & Technology, 42(6), 438–449.
Chatman, E. A. (1996). The impoverished life-world of outsiders. Journal of the American Society for Information Science & Technology, 47(3), 193–206.
Chatman, E. A. (1999). A theory of life in the round. Journal of the American Society for Information Science & Technology, 50, 207–217.
Davenport, E. (2010). Confessional methods and everyday life information seeking. Annual Review of Information Science and Technology, 44, 533–562.
Davies, H. C., & Eynon, R. (2018). Is digital upskilling the next generation our 'pipeline to prosperity'? New Media & Society, 20(11), 3961–3979.
de Reuver, M., Nikou, S., & Bouwman, H. (2016). Domestication of smartphones and mobile applications: A quantitative mixed-method study. Mobile Media & Communication, 4(3), 347–370.
Dervin, B. (1998). Sense-making theory and practice: An overview of user interests in knowledge seeking and use. Journal of Knowledge Management, 2(2), 36–46.
Dervin, B. (1999). On studying information seeking methodologically: The implications of connecting metatheory to method. Information Processing & Management, 35(6), 727–750.
Donner, J. (2015). After access: Inclusion, development, and a more mobile internet. MIT Press.
Ellis, D. (1993). Modeling the information-seeking patterns of academic researchers: A grounded theory approach. The Library Quarterly: Information, Community, Policy, 63(4), 469–486.
Erdelez, S. (1999). Information encountering: It's more than just bumping into information. Bulletin of the American Society for Information Science and Technology, 25(3), 25–29.
Eynon, R., & Malmberg, L.-E. (2012). Understanding the online information-seeking behaviours of young people: The role of networks of support. Journal of Computer Assisted Learning, 28, 514–529.
Fisher, K. E., Landry, C. F., & Naumer, C. (2007). Social spaces, casual interactions, meaningful exchanges: 'Information ground' characteristics based on the college student experience. Information Research, 12(2). http://informationr.net/ir/12-1/paper291.html
Foster, A., & Ford, N. (2003). Serendipity and information seeking: An empirical study. Journal of Documentation, 59(3), 321–340.
Given, L. M., Winkler, D. C., Willson, R., Davidson, C., Danby, S., & Thorpe, K. (2016). Watching young children 'play' with information technology: Everyday life information seeking in the home. Library & Information Science Research, 38(4), 344–352.

Haddon, L. (2004). Information and communication technologies in everyday life: A concise introduction and research guide. Berg.
Haddon, L. (2007). Roger Silverstone's legacies: Domestication. New Media & Society, 9(1), 25–32.
Hargittai, E., Neuman, W. R., & Curry, O. (2012). Taming the information tide: Perceptions of information overload in the American home. The Information Society, 28(3), 161–173.
Hektor, A. (2001). What's the use? Internet and information behavior in everyday life. Doctoral thesis, Linköping University, Sweden. http://liu.diva-portal.org/smash/get/diva2:254863/FULLTEXT01.pdf
Hilbert, M. (2011). The end justifies the definition: The manifold outlooks on the digital divide and their practical usefulness for policy-making. Telecommunications Policy, 35(8), 715–736.
Hindman, D. B. (2000). The rural–urban digital divide. Journalism & Mass Communication Quarterly, 77(3), 549–560.
Hine, C. (2015). Ethnography for the internet: Embedded, embodied and everyday. Bloomsbury Academic.
Hirsch, E. (1992). The long term and the short term of domestic consumption: An ethnographic case study. In R. Silverstone & E. Hirsch (Eds), Consuming technologies: Media and information in domestic spaces (pp. 194–211). Routledge.
Hosanagar, K., Fleder, D., Lee, D., & Buja, A. (2014). Will the global village fracture into tribes? Recommender systems and their effects on consumer fragmentation. Management Science, 60(4), 805–823.
Huotari, M. L., & Chatman, E. (2001). Using everyday life information seeking to explain organizational behavior. Library & Information Science Research, 23(4), 351–366.
Hynes, D., & Rommes, E. (2005). 'Fitting the internet into our lives': IT courses for disadvantaged users. In T. Berker, M. Hartmann, Y. Punie, & K. Ward (Eds), Domestication of media and technology (pp. 125–144). Open University Press.
Ingwersen, P., & Järvelin, K. (2005). The turn: Integration of information seeking and retrieval in context. Springer.
Ito, M. (2005). Introduction: Personal, portable, pedestrian. In M. Ito, M. Matsuda, & D. Okabe (Eds), Personal, portable, pedestrian: Mobile phones in Japanese life (pp. 1–16). MIT Press.
Järvelin, K. (2016). Two views on theory development for interactive information retrieval. In D. H. Sonnenwald (Ed.), Theory development in the information sciences (pp. 116–140). University of Texas Press.
Järvelin, K., & Wilson, T. (2003). On conceptual models for information seeking and retrieval research. Information Research, 9(1), Article 163. http://InformationR.net/ir/9-1/paper163.html
Jeffrey, R., & Doron, A. (2013). The great Indian phone book: How cheap mobile phones change business, politics and daily life. C. Hurst & Co.
Katz, J. E., & Aakhus, M. A. (2002). Perpetual contact: Mobile communication, private talk, public performance. Cambridge University Press.
Katz, J. E., & Sugiyama, S. (2006). Mobile phones as fashion statements: Evidence from student surveys in the US and Japan. New Media & Society, 8(2), 321–337.
Kuhlthau, C. C. (1993). Seeking meaning: A process approach to library and information services. Ablex Press.
Kuhlthau, C. C. (2016). Reflections on the development of a theoretical perspective. In D. H. Sonnenwald (Ed.), Theory development in the information sciences (pp. 64–86). University of Texas Press.
Kuhlthau, C. C., Heinström, J., & Todd, R. J. (2008). The 'information search process' revisited: Is the model still useful? Information Research, 13(4). http://InformationR.net/ir/13-4/paper355.html
Kunert, J., & Thurman, N. (2019). The form of content personalisation at mainstream, transatlantic news outlets: 2010–2016. Journalism Practice, 13(7), 759–780.
Laplante, A., & Downie, J. S. (2011). The utilitarian and hedonic outcomes of music information-seeking in everyday life. Library & Information Science Research, 33(3), 202–210.
Lentz, R. G., & Oden, M. D. (2001). Digital divide or digital opportunity in the Mississippi Delta region of the US. Telecommunications Policy, 25(5), 291–313.
Li, J., Huffman, S. B., & Tokuda, A. (2009). Good abandonment in mobile and PC internet search. 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA.

Lim, S. S. (2005). From cultural to information revolution: ICT domestication by middle-class Chinese families. In T. Berker, M. Hartmann, Y. Punie, & K. Ward (Eds), Domestication of media and technology (pp. 185–204). Open University Press.
Ling, R. S. (2012). Taken for grantedness: The embedding of mobile communication into society. MIT Press.
Liu, F. (2010). The Internet in the everyday life-world: A comparison between high-school students in China and Norway. Comparative Education, 46(4), 527–550.
Livingstone, S., & Helsper, E. (2007). Gradations in digital inclusion: Children, young people and the digital divide. New Media & Society, 9(4), 671–696.
Loges, W. E., & Jung, J. Y. (2001). Exploring the digital divide: Internet connectedness and age. Communication Research, 28(4), 536–562.
Loudon, K., Buchanan, S., & Ruthven, I. (2016). The everyday life information seeking behaviours of first-time mothers. Journal of Documentation, 72(1), 24–46.
Makri, S., & Blandford, A. (2012a). Coming across information serendipitously – Part 1. Journal of Documentation, 68(5), 684–705.
Makri, S., & Blandford, A. (2012b). Coming across information serendipitously – Part 2. Journal of Documentation, 68(5), 706–724.
McCay-Peet, L., Toms, E. G., & Kelloway, E. K. (2015). Examination of relationships among serendipity, the environment, and individual differences. Information Processing & Management, 51(4), 391–412.
McDonald, T. (2015). Affecting relations: Domesticating the internet in a south-western Chinese town. Information Communication & Society, 18(1), 17–31.
McKenzie, P. J. (2003). A model of information practices in accounts of everyday-life information seeking. Journal of Documentation, 59(1), 19–40.
Miller, D. (2016). How the world changed social media. UCL Press.
Möller, J., Trilling, D., Helberger, N., & van Es, B. (2018). Do not blame it on the algorithm: An empirical assessment of multiple recommender systems and their impact on content diversity. Information Communication & Society, 21(7), 959–977.
Nechushtai, E., & Lewis, S. C. (2019). What kind of news gatekeepers do we want machines to be? Filter bubbles, fragmentation, and the normative dimensions of algorithmic recommendations. Computers in Human Behavior, 90, 298–307.
Ooi, K., & Liew, C. L. (2011). Selecting fiction as part of everyday life information seeking. Journal of Documentation, 67(5), 748–772.
Oreglia, E. (2013). From farm to farmville: Circulation, adoption, and use of ICT between urban and rural China. Doctoral thesis, University of California, Berkeley, CA. www.ischool.berkeley.edu/sites/default/files/fromfarmtofarmville.pdf
Pálsdóttir, Á. (2010). The connection between purposive information seeking and information encountering. Journal of Documentation, 66(2), 224–244.
Panahi, S., Watson, J., & Partridge, H. (2015). Information encountering on social media and tacit knowledge sharing. Journal of Information Science, 42(4), 539–550.
Parker, E. B. (2000). Closing the digital divide in rural America. Telecommunications Policy, 24(4), 281–290.
Paul, A. (2015). Use of information and communication technologies in the everyday lives of Indian women: A normative behaviour perspective. Information Research – an International Electronic Journal, 20(1), Article 19.
Rice, R. E., & Katz, J. E. (2003). Comparing internet and mobile phone usage: Digital divides of usage, adoption, and dropouts. Telecommunications Policy, 27(8–9), 597–623.
Rideout, V. (2000). Public access to the internet and the Canadian digital divide. Canadian Journal of Information and Library Science – Revue Canadienne des Sciences de l'Information et de Bibliothéconomie, 25(2–3), 1–21.
Rieh, S. Y. (2004). On the web at home: Information seeking and web searching in the home environment. Journal of the American Society for Information Science and Technology, 55(8), 743–753.
Robinson, L., Cotten, S. R., Ono, H., Quan-Haase, A., Mesch, G., Chen, W. H., Schulz, J., Hale, T. M., & Stern, M. J. (2015). Digital inequalities and why they matter. Information Communication & Society, 18(5), 569–582.

Rogers, E. M. (2003). Diffusion of innovations. Free Press.
Savolainen, R. (1995). Everyday life information seeking: Approaching information seeking in the context of 'way of life'. Library & Information Science Research, 17(3), 259–294.
Savolainen, R. (1997). Everyday life communication and information seeking in networks. Internet Research – Electronic Networking Applications and Policy, 7(1), 69–70.
Savolainen, R. (2004). Enthusiastic, realistic and critical: Discourses of internet use in the context of everyday life information seeking. Information Research, 10(1), Article 198.
Savolainen, R., & Kari, J. (2004). Conceptions of the internet in everyday life information seeking. Journal of Information Science, 30(3), 219–226.
Schroeder, R. (2010). Mobile phones and the inexorable advance of multimodal connectedness. New Media & Society, 12(1), 75–90.
Segev, E., & Ahituv, N. (2010). Popular searches in Google and Yahoo!: A 'digital divide' in information uses? Information Society, 26(1), 17–37.
Selwyn, N. (2004). Reconsidering political and popular understandings of the digital divide. New Media & Society, 6(3), 341–362.
Silverstone, R. (1991). From audiences to consumers: The household and the consumption of communication and information technologies. European Journal of Communication, 6(2), 135–154.
Silverstone, R., & Hirsch, E. (1992). Consuming technologies: Media and information in domestic spaces. Routledge.
Sin, S. C. J. (2016). Social media and problematic everyday life information-seeking outcomes: Differences across use frequency, gender, and problem-solving styles. Journal of the Association for Information Science and Technology, 67(8), 1793–1807.
Sin, S. C. J., & Kim, K. S. (2013). International students' everyday life information seeking: The informational value of social networking sites. Library & Information Science Research, 35(2), 107–116.
Spink, A., & Cole, C. (2001). Introduction to the special issue: Everyday life information-seeking research. Library & Information Science Research, 23(4), 301–304.
Spink, A., Jansen, B. J., Wolfram, D., & Saracevic, T. (2002). From e-sex to e-commerce: Web search changes. Computer, 35(3), 107–109.
Sundin, O., Haider, J., Andersson, C., Carlsson, H., & Kjellberg, S. (2017). The search-ification of everyday life and the mundane-ification of search. Journal of Documentation, 73(2), 224–243.
Sunstein, C. (2007). Republic.com 2.0. Princeton University Press.
Tuominen, K., Savolainen, R., & Talja, S. (2005). Information literacy as a sociotechnical practice. The Library Quarterly: Information, Community, Policy, 75(3), 329–345.
Van Damme, K., Martens, M., Van Leuven, S., Vanden Abeele, M., & De Marez, L. (2020). Mapping the mobile DNA of news: Understanding incidental and serendipitous mobile news consumption. Digital Journalism, 8(1), 49–68.
van Deursen, A., & van Dijk, J. (2011). Internet skills and the digital divide. New Media & Society, 13(6), 893–911.
Waller, V. (2011). Not just information: Who searches for what on the search engine Google? Journal of the American Society for Information Science and Technology, 62(4), 761–775.
Ward, K. (2005). The bald guy just ate an orange: Domestication, work and home. In T. Berker, M. Hartmann, Y. Punie, & K. Ward (Eds), Domestication of media and technology (pp. 145–164). Open University Press.
Wei, L., & Hindman, D. B. (2011). Does the digital divide matter more? Comparing the effects of new media and old media use on the education-based knowledge gap. Mass Communication and Society, 14(2), 216–235.
Wellman, B., & Haythornthwaite, C. (2002). The internet in everyday life. Blackwell.
Wilson, T. D. (1981). On user studies and information needs. Journal of Documentation, 37(1), 3–15.
Wilson, T. D. (1997). Information behaviour: An interdisciplinary perspective. Information Processing & Management, 33(4), 551–572.
Wilson, T. D. (1999). Models in information behaviour research. Journal of Documentation, 55(3), 249–270.
Wilson, T. D. (2006). On user studies and information needs. Journal of Documentation, 62(6), 658–670.

Yeoman, A. (2010). Applying McKenzie's model of information practices in everyday life information seeking in the context of the menopause transition. Information Research – an International Electronic Journal, 15(4), Article 444.

PART II RESEARCHING THE DIGITAL SOCIETY

3. Digital and computational demography
Ridhi Kashyap and R. Gordon Rinderknecht, with Aliakbar Akbaritabar, Diego Alburez-Gutierrez, Sofia Gil-Clavel, André Grow, Jisu Kim, Douglas R. Leasure, Sophie Lohmann, Daniela V. Negraia, Daniela Perrotta, Francesco Rampazzo, Chia-Jung Tsai, Mark D. Verhagen, Emilio Zagheni, and Xinyi Zhao

1 INTRODUCTION

Like demography more generally, digital and computational demography is concerned with measuring populations and the demographic processes of fertility, mortality, and migration, as well as with describing and explaining variations and regularities in these processes. Digital demography's unique contributions lie in its explorations of demography in relation to the digital revolution – the rapid technological improvements in digitized informational storage, computational power, and the spread of the internet and mobile technologies since the turn of the new millennium. The digital revolution has opened up new opportunities for demographic research, and this chapter covers different aspects of these developments.

First, the digital revolution has ushered in a data revolution, which has produced unprecedented amounts of information relevant to demographic processes. Technological changes, such as improvements in information storage and processing, have not only improved the access to and granularity of traditional sources of demographic data (e.g., individual-level census data across historical and geographical contexts), but have also increasingly provided researchers with new forms of data not originally meant for research which nevertheless speak to important demographic outcomes, including data sources as wide ranging as bibliometric and genealogical databases, social media data, and archived newspapers (Alburez-Gutierrez et al., 2019; Kashyap, 2021). Many of these new data sources are created as by-products of the use of digital technologies, such as the web and social media, which are increasingly salient spaces for social interaction and expression. The expansion of the internet has also granted new opportunities for the rapid recruitment of respondents from across much of the world. This availability, paired with the development of easily deployed tools for survey development, gives digital demographers the ability to rapidly and cost-effectively recruit hard-to-reach groups and measure human behaviour in areas where data are otherwise unavailable, or when face-to-face data collection may be difficult, as exemplified during the COVID-19 pandemic (Grow et al., 2020). To this end, we also devote special attention to the role of these new data opportunities in advancing our knowledge of the Global South, where traditional demographic data remain limited.

A central challenge with the use of data generated from online populations, from a demographic perspective, is that these populations differ from broader national or subnational populations in important ways. Effective use and interpretation of novel data sources thus requires a deeper assessment of the demographic characteristics and biases of online populations spread across different platforms, as well as of potential methodological solutions to these biases. This is an area where digital demography is uniquely positioned to make contributions and has already begun to do so.

Along with utilizing novel data sources, digital demography seeks to leverage computational methods, such as agent-based computational (ABC) and micro-simulation (MS) modelling and machine learning (ML), to study population phenomena in new ways. Simulation methods provide opportunities to understand how macro-level demographic regularities emerge from micro-level behaviours and social interaction processes. ML provides approaches to detect regularities in complex demographic data or, more powerfully, to extract signals from an increasingly data-rich world.

The digital revolution, however, is not only a data (and methods) revolution, but also one that has influenced daily life at the micro level, by changing the timing and sequencing of daily activities and the ways in which information, communication, and vital services are accessed. These changes ultimately shape broader demographic and life-course processes. Digital demography therefore encompasses research directly studying digital technologies and their impacts on daily life, and the connections between these daily life changes and broader demographic processes of health and mortality, fertility and family formation, and migration.

Recognizing these different ways in which the digital revolution touches upon demography, this chapter is organized into three sections. Section 2 explores the implications of digital transformation for daily lives and demographic processes. Section 3 focuses on new data opportunities being pursued by digital demographers: namely, repurposed data and original online data collection. We discuss the utility of these new data sources for studying demographic processes and online populations, and then we proceed to discuss important methods relevant to using such data effectively. Section 4 focuses on the important role of simulation and machine-learning techniques, their applications to demography, and their challenges and future.

2 MICRO- AND MACRO-LEVEL IMPLICATIONS OF DIGITAL TRANSFORMATION

2.1 Implications of Digital Transformation for Daily Life and Time Use

Advancements in digital technology have led to significant alterations in daily life. These alterations can be seen in how technology has augmented important daily activities and, in some instances, partially replaced them. Digital technology's capacity to facilitate activities previously limited by geography and time of day further allows for significant alteration in the sequencing and timing of daily activities. Digital technology therefore in many ways grants people more control and flexibility over how they use their time, while also creating new obligations and unclear expectations and blurring the boundaries between domains such as work and family. We explore these complex and mixed effects of digital technology on daily life and activities in the section that follows, and conclude by discussing the challenges of researching this subject. The subsequent section then extends beyond daily life by connecting digital technology to long-term demographic processes.

2.1.1 Impacts on daily activities

Recent studies have found that technology use has detrimental effects on sleep (Billari et al., 2018; Gradisar et al., 2013; Lemola et al., 2015; Turel et al., 2016). Using time diary data from the German Time Use Survey within a quasi-experimental approach analysing the effects of access to high-speed (broadband) internet in Germany, Billari et al. (2018) point out that it is not the overall daily use of technology or digital media that has a detrimental effect on sleep duration and sleep quality, but rather the use of these technologies in the evening, before going to sleep. One of the main mechanisms is that digital technology use in the evening delays bedtime, which shortens night sleep duration for those who cannot sleep in during the morning (Billari et al., 2018). Studies using other methodologies (i.e., semi-structured interviews, surveys, wearable sleep monitors such as Fitbit) also find that the use of interactive technological devices such as computers/laptops, mobile phones, or video games is associated with difficulties falling asleep and reports of unrefreshing sleep, whereas such complaints are less common for passive technological devices (television, mp3 and music players) (Gradisar et al., 2013; Lemola et al., 2015; Turel et al., 2016). Relatedly, analyses of historical time-use data from the United Kingdom (UK) (1961–2015) find greater variation in when people sleep in more recent years, which may relate to such technology usage, but these data also show a slight increase in time spent sleeping overall (Gershuny & Sullivan, 2019). Digital technology appears to negatively affect sleep duration in some instances, but this may not translate into reduced sleep for most people.

Technology has also changed the way we obtain, prepare, and consume meals and drinks. We order food from grocery shops using apps, we get ready-to-cook meals portioned for the size of our household delivered to our door, and we often consume meals in close proximity to, or in the company of, technology. Changes in our daily eating and drinking routines have been linked to an increase in the share of the population living alone (Euromonitor International, 2017; US Census Bureau, 2018; Yates & Warde, 2017) and to demanding professions with long work hours (Burke & Cooper, 2008), among other factors. Findings on the effects of technology use (i.e., watching television, checking one's smartphone) on commensality are mixed: some research argues that using technology during mealtime disrupts the flow of conversation, separates us from one another, and distracts from healthy eating practices (Fulkerson et al., 2014; Zhou et al., 2017), while other work argues that technology can help improve mental health and food intake for people who often eat alone (Grevet et al., 2012; Heidrich et al., 2012; McColl & Nejat, 2013) and can bring family members together during mealtime by providing topics of conversation (Ferdous et al., 2016). However, technology has introduced new complexities into family life, as children find it difficult to put away their devices during mealtime when their parents do not themselves obey the same rule (Ferdous et al., 2016; Hiniker et al., 2016).
The popular adoption of household appliances (e.g., washing machines and driers, vacuum cleaners, dishwashers) during the second half of the twentieth century significantly reduced domestic workloads, most of which fall on women, in activities such as cooking, cleaning, and laundry. Using United States (US) data, Lebergott (1993) estimates that domestic work decreased from 58 hours a week in 1900 to 18 hours a week in 1975. Recent simulation research finds that future advancements in this sphere may enable up to an additional 1 per cent of the UK's and an additional 2 per cent of Japan's female labour force to take on full-time jobs (Hertog et al., 2021). However, machines and algorithms operate best in routine and predictable environments (Broussard, 2018). This explains why it has been challenging to outsource to appliances or robots work requiring creative tasks and human interaction, such as childcare, which has increased during the same time period and for which women continue to be responsible (Gershuny & Harms, 2016; Negraia et al., 2018). Further, although humanoid robots may be a solution to the growing need for caregivers in the rapidly ageing populations of developed countries (Abdi et al., 2018; Wright, 2019), the extent to which they should be relied on to supplement or replace human caregivers is still an area of scientific development and ethical debate (Moharana et al., 2019; Søraa et al., 2021).

2.1.2 Overlapping and rearrangement of daily activities

Computer usage nearly tripled between 2000 and 2015, yet computer use as one's primary focus is only a small part of people's overall daily usage of digital technology. In 2015, UK respondents reported 28 minutes of such computer use and nearly three hours of time spent using devices while primarily doing something else, such as paid work and leisure (Gershuny & Sullivan, 2019). Rather than outright replacing other activities, the impact of the anytime-anywhere connectivity made possible by digital technology may generally be better understood as overlapping with or altering the sequencing of daily activities (Wajcman, 2008). Of all the activities impacted by digitalization, social activities may have been altered most radically (Vanden Abeele et al., 2018).

Urbanization dispersed social networks long before digitalization (Vanden Abeele et al., 2018), but developments in mobile technologies have enabled people to conveniently maintain these networks irrespective of constraints of geography and timing via asynchronous communication (Rainie & Wellman, 2012). Digitalization enables social engagement to weave into the background of other daily activities, ultimately helping people fulfil social needs (Vanden Abeele et al., 2018). Yet such digitization not only makes perpetual contact possible; for some it also makes it an anxiety-inducing 'digital leash' (Mascheroni & Vincent, 2016), and its interjection into daily activities an addictive, obligatory disruption (Rice & Hagen, 2010), at times competing with in-person interaction (Turkle, 2011; Vanden Abeele et al., 2018) and associated with children's reports of being alone when in the presence of parents (Mullan & Chatzitheochari, 2019). Further, due to the porous nature of many digital spaces, interaction which one may think of as private (e.g., sharing personal photos on social media) may spill over into other domains with different normative expectations. When unintended, such context collapse can have dire consequences (Davis & Jurgenson, 2014; Loh & Walsh, 2021).

Digital technology has also allowed our social roles to become asynchronous (i.e., to take place at times convenient for each individual) and detached from traditional locations. For example, students can watch pre-recorded lectures at home in the evening, and employees can work from a location other than their office (e.g., their home, a co-working space). The separation of activity from time and location in some sense provides greater autonomy over daily life by increasing people's ability to choose when activities start and stop, as well as facilitating secondary activities (e.g., starting dinner while finishing a report).
Such blurring of boundaries between different domains of life can be problematic when the demands of one domain spill over into other domains, such as between work and family (Chesley, 2005; Wajcman, 2008). Similarly, digital technologies' facilitation of constant connection and multi-tasking has led to a concern that the pace of life is increasing and becoming more harried (Wajcman, 2008). Despite these worries, historical data from the UK (2000–2015) neither show life becoming more rushed nor find a connection between feeling rushed and multi-tasking (Gershuny & Sullivan, 2019). Fears of an accelerating pace of life appear largely unsupported (Bittman et al., 2009).

2.1.3 Measurement challenges

We have reviewed a broad literature focusing on how digital technology has (and has not) affected daily life, yet the extent of these changes remains difficult to ascertain. Measures of time use are often unreliable (Andreassen, 2015; Ernala et al., 2020; Griffiths, 2013; Juster et al., 2003; Ko et al., 2012; Salmela-Aro et al., 2017). Time diaries, which ask participants about their time use during a specific day, are reliable but challenging to collect and of limited availability (Juster et al., 2003). Further, digital technology usage often occurs while people are doing something else, such as eating, and therefore often goes unrecorded in time-diary research (Gershuny & Sullivan, 2019). This is particularly problematic given that much of the recent growth in device usage appears to come from such 'secondary' usage (Gershuny & Sullivan, 2019). Future research on how digital technology has impacted daily life would benefit from refocusing on capturing this secondary technology usage. Embracing the use of automated time-tracking applications rather than relying entirely on self-reported behaviour may also be fruitful (Ernala et al., 2020).
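The distinction between primary and secondary device use can be made concrete with a minimal sketch of how episode-level diary data might be tabulated. The table and the column names (person_id, primary, secondary, minutes) below are invented for illustration and do not correspond to any particular survey's coding scheme.

```python
import pandas as pd

# Hypothetical episode-level diary extract: one row per reported episode,
# with a primary activity, an optional simultaneous secondary activity,
# and the episode duration in minutes. All values are invented.
episodes = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "primary":   ["device_use", "eating", "paid_work", "sleep", "leisure"],
    "secondary": [None, "device_use", "device_use", None, "device_use"],
    "minutes":   [30, 20, 120, 480, 60],
})

# Conventional tabulations count only episodes whose MAIN activity is device use ...
primary_min = (episodes[episodes["primary"] == "device_use"]
               .groupby("person_id")["minutes"].sum())

# ... and miss device use reported alongside another main activity,
# which is where much of the recent growth appears to lie.
secondary_min = (episodes[episodes["secondary"] == "device_use"]
                 .groupby("person_id")["minutes"].sum())

print(pd.DataFrame({"primary_min": primary_min,
                    "secondary_min": secondary_min}).fillna(0))
```

For person 1 in this toy example, a primary-activity tabulation would record 30 minutes of device use, while the secondary-activity column reveals a further 140 minutes, illustrating how large the unrecorded component can be.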

2.2 The Impacts of Digitalization on Demographic Processes

The previous section described how digital technologies have affected daily activities and schedules, highlighting the ways in which digital technology has become interwoven with nearly all aspects of daily life. In this section, we explore how this interweaving may have altered activities directly connected to important demographic outcomes. The digital revolution has deeply affected how individuals seek information, communicate and interact with their communities, and access media as well as key services and goods (e.g., linked to financial or health services). Through these different channels, digital technologies have been hypothesized to affect longer-term demographic processes within the life course linked to health, partnership formation, fertility, and migration. A growing literature within demography and related social sciences, looking at both high- and low-income countries and using both macro- and individual-level data, has begun to explore these impacts of the digital revolution.

2.2.1 Health, mental health, and mortality

The relationship between digital technologies and health outcomes has been approached in different ways. While not exhaustive, we briefly describe three types of research analysing this relationship. First, the internet and mobile technologies have become a key tool for information seeking about health (Wang et al., 2021). Digital technologies can help reduce information search costs and democratize access to information, thereby helping individuals make informed decisions about their health, which can serve as a pathway towards improving population-level health outcomes. In terms of demographic factors, research from high-income countries has shown that women, younger age groups, and highly educated groups are more likely to use the internet to access health-related information (Beck et al., 2014; Hallyburton & Evarts, 2014; Jacobs et al., 2017; Miller & Bell, 2012). While at earlier stages of internet diffusion these differences reflected differences in internet access and digital skills, the predictive power of internet usage in shaping demographic differentials in online health information seeking has weakened over time (Li et al., 2016).

In poorer countries or areas, where information resources may be scarce or harder to access, digital technologies can have larger impacts on improving health-related knowledge. Consistent with this idea, using a quasi-experimental approach based on the geospatial augmentation of the Demographic and Health Surveys, Rotondi et al. (2020) found that women who own mobile phones in seven Sub-Saharan African countries were better informed about contraception and antenatal care and more empowered to make their own decisions, including about their health. The study leveraged an identification strategy using geospatial data on lightning strikes observed from satellites. As mobile connectivity is poorer and adoption slower in areas where lightning strikes are more frequent, the authors used this source of exogenous variation to assess the causal impacts of mobile technology. The impacts of mobile technology were stronger in the most disadvantaged areas within the countries, providing support for the idea that the direct payoffs of digital technology may be larger where other forms of social learning and institutions are absent. Indeed, in Sub-Saharan Africa, mobile phones have served as the first large-scale telecommunication infrastructure, leapfrogging landlines, and many Africans are more likely to own mobile phones than to have access to proper drainage, electricity, paved roads, or piped water (Mitullah et al., 2016).

While the above discussion suggests a more positive role for digital technologies, the internet is not a singular technology and consists of different types of content, platforms, and functionalities. Content on the internet is often deregulated and user-generated, and it can provide information that is unreliable, with the potential to spread misinformation about health (Swire-Thompson & Lazer, 2020). This concern has been amplified in the context of the COVID-19 pandemic, in which an 'infodemic' of misinformation, spreading swiftly across social media, accompanied the global spread of the disease (Zarocostas, 2020). The pathways to misinformation, in terms of the conditions under which specific platforms are more vulnerable to its spread, which individuals are more likely to be exposed to it, and the interventions that can curb its spread, are key areas for further research towards understanding under which conditions digital technologies can empower and disempower individuals and communities regarding health (Wang et al., 2019).

Second, digital technologies, and especially mobile phones in low-income countries, have been used to improve access to vital health-related services and interventions through context-specific mobile health (m-Health) projects. M-Health interventions have been applied to improve appointment compliance, treatment adherence, and healthcare utilization, and to use connectivity to augment the capabilities of remote and lesser-trained healthcare staff (Hall et al., 2014). From a demographic perspective it is especially relevant that a key focus of m-Health interventions has been on women's sexual and reproductive health outcomes, such as improving antenatal care attendance (Lund et al., 2014a), increasing contraceptive use (McCarthy et al., 2019; Smith et al., 2015), and reducing perinatal mortality (Daly et al., 2017; Lund et al., 2014b). Aligned with this emphasis on sexual and reproductive health, at the macro level, Rotondi et al.
(2020) found that the diffusion of mobile phones over time within countries was associated with reductions in under-five and maternal mortality, with stronger associations among the poorest countries. While these studies, drawing on both experimental and quasi-experimental evidence, point to positive impacts of digital technologies on reproductive and sexual health outcomes for women, digital gender inequalities across many low- and middle-income countries are persistent: women are less likely to own mobile phones and access the internet, and they have lower digital skills (Kashyap et al., 2020). As digital technologies are increasingly deployed for health programmes, these digital divides hinder the potential for
equalizing impacts of technologies on population health. These digital divides also imply that those who use technologies may be a selective group, which further necessitates research designs that address this selection as well as other forms of endogeneity when assessing technology's impacts.

Third, while the literatures above address how technologies affect access to health-related information or services, a growing debate has centred on the question of whether, how, and under which circumstances digital media use helps or harms our mental health. Much of the debate has centred on social media use and has produced conflicting results. Several highly publicized survey studies showed connections between social media use and recent increases in rates of adolescent depression and suicidality (Twenge & Campbell, 2019; Twenge et al., 2018), but later re-analyses of the same datasets showed only negligibly small effects on mental health (Orben & Przybylski, 2019; Orben et al., 2019). Importantly, these studies typically cannot detect the direction of the association: Do we feel slightly worse than usual because we use social media, or do we turn to social media because we feel worse than usual? Recent research seems to suggest the latter pathway, as studies that correct for self-selection biases (Lohmann & Zagheni, 2021) or that evaluate bidirectional pathways over time (Boer et al., 2020; Coyne et al., 2020; Puukko et al., 2020) show that normal amounts of social media use do not predict well-being once reverse causation is accounted for. For example, Puukko and colleagues (2020) followed almost 3,000 adolescents over six years and found that social media use was prospectively unrelated to depressive symptoms, but that depressive symptoms prospectively predicted higher social media use. There is, however, growing consensus that excessive social media use with addiction-like patterns (feeling like one cannot stop, withdrawal-like symptoms when unable to access social media; see Andreassen, 2015; Olson et al., 2022) is qualitatively different from normal patterns of social media use and is associated with poor mental health outcomes (Bekalu et al., 2019; Boer et al., 2020; Kim et al., 2018; Salmela-Aro et al., 2017). In addition, multiple experimental studies have shown that randomly assigning people to quit or reduce their social media usage makes them feel happier, at least in the short term (Allcott et al., 2019; Hunt et al., 2018; Tromholt, 2016).

Overall, this field has been characterized by small average effects that vary widely between studies. The conversation has therefore been shifting from asking whether there is an overall effect to asking for whom and under which circumstances digital media use may be helpful or harmful. When associations emerge, they are typically small, inconsistent, and moderated by characteristics of the user (Beyens et al., 2020; Valkenburg et al., 2021), the type of content they see online (Cohen et al., 2019; Dooley et al., 2009; Dyson et al., 2016; Valkenburg et al., 2006), and qualitative features of how people use social media. Furthermore, the consequences of social media use may also depend on the work–home context, as Klingelhoefer and Meier (Chapter 22 in this volume) demonstrate.
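The self-selection problem described above can be illustrated with a small simulation. The following sketch uses Python with the linearmodels package; all data are simulated, the variable names are invented, and it is not a reanalysis of any of the studies cited. It shows how a stable unobserved trait that drives both social media use and depressive symptoms inflates a pooled estimate, while within-person fixed effects largely remove the spurious association.

```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS, PooledOLS

# Simulated four-wave panel in which a stable unobserved trait raises BOTH
# social media use (smu) and depressive symptoms (dep); the true causal
# effect of smu on dep is zero by construction.
rng = np.random.default_rng(0)
n_persons, n_waves = 500, 4
idx = pd.MultiIndex.from_product([range(n_persons), range(n_waves)],
                                 names=["person", "wave"])
trait = np.repeat(rng.normal(size=n_persons), n_waves)  # time-constant
smu = 2.0 + 0.8 * trait + rng.normal(size=n_persons * n_waves)
dep = 1.0 + 1.0 * trait + rng.normal(size=n_persons * n_waves)
df = pd.DataFrame({"smu": smu, "dep": dep}, index=idx)

# Pooled OLS mixes within- and between-person variation, so the shared
# trait produces a sizeable spurious coefficient ...
pooled = PooledOLS.from_formula("dep ~ 1 + smu", data=df).fit()

# ... while person fixed effects discard stable between-person differences
# and recover an estimate close to the true effect of zero.
fe = PanelOLS.from_formula("dep ~ 1 + smu + EntityEffects", data=df).fit(
    cov_type="clustered", cluster_entity=True)

print(pooled.params["smu"], fe.params["smu"])
```

Fixed effects cannot, of course, address time-varying confounding or establish the direction of within-person effects, which is why the bidirectional panel designs cited above remain important.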
2.2.2 Fertility and family formation

Emerging literature has posited a link between the diffusion of digital technologies such as mobile phones and broadband internet and fertility outcomes in both low- and high-fertility contexts, albeit with different mechanisms at play in each. In high-fertility settings, a significant literature has found impacts of mass media technologies on the fertility transition through exposure to new attitudes, knowledge, and behaviours (e.g., Barber & Axinn, 2004; La Ferrara et al., 2012; Westoff & Koffman, 2011). In low-fertility settings, media effects have also been shown to play a role in reducing teenage fertility through increased interest in contraceptive behaviours (Kearney & Levine, 2015). Building on some of the mechanisms and insights from this literature, it is plausible to hypothesize that digital technologies too may have similar, and potentially augmented, impacts.

Digital technologies can affect fertility outcomes in high-fertility settings through different channels. Mobile phones can help amplify traditional paths of social learning by providing regular contact with personal networks unrestricted by geography, whilst also providing new paths for social learning. In contrast to unidirectional mass media such as television, mobile phones offer the advantage of privacy for communication and exchange, which may be especially relevant in contexts where social norms are restrictive and can limit access to new information, particularly for women (Rotondi et al., 2020). Furthermore, internet-enabled technologies such as social media may provide exposure to the 'life of others' that may promote the acceptability and desirability of behaviours such as having smaller families or using contraception (Billari et al., 2020). Similar social influence effects have also been noted for television (Barber & Axinn, 2004; La Ferrara et al., 2012), although the internet provides access to even more globalized content that crosses national boundaries and may be more likely to espouse liberalized values (Varriale et al., 2021). Mobile phones, especially in low-resource settings, can serve as a valuable tool for information provision on issues such as contraception (Rotondi et al., 2020). Mobile phones may also serve to improve financial inclusion and economic well-being through the diffusion of services such as mobile money, which can in turn shape fertility preferences. Billari et al. (2020) provide empirical support for these ideas. Drawing on longitudinal survey data from Balaka, Malawi between 2009 and 2015, and using fixed-effects panel models, they found that mobile phone ownership was associated with smaller ideal family size and lower parity, through increases in child spacing.

In low-fertility settings, the above-mentioned mechanisms linking digital technologies to fertility outcomes through increased information about contraception and reproductive decisions, as well as social learning and social influence effects through media exposure, are also likely to be relevant. The internet can also impact fertility by affecting partnering behaviours, and it has been described as 'the new social intermediary in the search for mates' (Rosenfeld & Thomas, 2012). Survey data from the US, as well as other high-income countries, have shown an increasing fraction of couples meeting online, with online meeting displacing traditional routes of meeting via friends or education (Potarca, 2020; Rosenfeld & Thomas, 2012; Rosenfeld et al., 2019). This rapid increase in meeting online has been even steeper for those in 'thinner markets' such as same-sex couples (Rosenfeld & Thomas, 2012). While technologies like mobile dating apps and social networking sites can help reduce search frictions in partnership markets, generate more partnership offers, and enable more efficient matching, they may also create 'choice overload' or increase the desired reservation quality for a prospective partner, thereby delaying partnership formation (Rosenfeld, 2017; Sironi & Kashyap, 2021; see also Coyle & Alexopoulos, Chapter 11 in this volume).
Time spent with digital technologies may also reduce time spent in partnership search or engaging in sexual activity. The net effects of the internet on partnership, and its indirect impacts on fertility through this channel, are thus theoretically ambiguous. An additional mechanism of salience in low-fertility contexts, however, is the impact that the internet, and especially high-speed internet technologies, can have in promoting labour force participation through flexible working (e.g., home-based working) and, therefore, potentially better reconciliation of work with parenthood.

To test these different hypotheses, two papers empirically examine the links between broadband diffusion and fertility. Guldi and Herbst (2017) found that increased broadband access was associated with a 7 per cent decline in teen birth rates between 1999 and 2007 in the US. Looking at Germany, and using a quasi-experimental strategy exploiting historical variation in pre-existing telephone infrastructure that significantly altered the costs of broadband adoption, Billari et al. (2019) found positive effects of broadband diffusion on the fertility of highly educated women aged 25–45, but no effect for men. Finding that broadband access increased the share of home- and part-time working, they attributed this effect to the opportunities for better work–family reconciliation afforded by the technology for women. Consistent with the idea of increased opportunities for flexible working, Dettling (2017) also found that broadband diffusion increased labour force participation among married women – but not single women or men – with the largest effects among college-educated women with children. The lack of findings for men raises the question of whether technologies end up reinforcing gender norms, with women carrying the dual burden of work and childcare, rather than displacing those norms.

Despite theoretical ambiguity in the potential direction of the association, the empirical research studying the link between the internet and marriage or partnership formation has so far generally found positive associations, albeit with heterogeneity by sociodemographic characteristics such as age, race, and education (Bellou, 2015; Potarca, 2021; Rosenfeld, 2017; Sironi & Kashyap, 2021). Exploiting plausibly exogenous variation in the timing of broadband diffusion at the county level in the US, Bellou (2015) found the technology to have contributed to increased aggregate-level marriage rates for 21–30 year olds. Using micro-level data from the National Longitudinal Survey of Youth in the US, Sironi and Kashyap (2021) found the association between internet access and partnership states to be age dependent, changing from negative at the youngest adult ages to positive as the cohort reached their mid-20s. Although the efficiency of internet search has been theorized as offering greater potential for those in thinner markets, like racial minorities or same-sex couples, the empirical findings so far do not conclusively point to larger associations. Bellou (2015) found a positive and statistically significant association between broadband diffusion and marriage rates for African Americans, but of a magnitude comparable to that for the white population. Sironi and Kashyap (2021) found associations between internet access and partnership formation for same-sex couples, with some suggestion of these being larger than for different-sex couples, but the limited sample size of same-sex couples made patterns difficult to estimate with precision. In a contrasting finding using individual-level data from the pairfam survey in Germany, Potarca (2021) found that online dating instead reinforced the marriage advantage of the highly educated, with men and women who met online experiencing a greater chance of marrying.
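The logic of the instrumental-variable designs used in this literature can be sketched with simulated data. The example below is not a replication of any of the studies cited; it simply mimics their structure, with an exogenous cost shifter z (standing in for legacy telephone infrastructure or lightning-strike frequency) used to instrument broadband take-up. All variable names and coefficients are invented for the illustration.

```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

# Stylized two-stage least squares: broadband take-up and the outcome share
# an unobserved component u, but the instrument z affects the outcome only
# through broadband. The true effect of broadband is 0.3 by construction.
rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                         # exogenous cost shifter
u = rng.normal(size=n)                         # unobserved local conditions
broadband = 0.7 * z + 0.5 * u + rng.normal(size=n)
fertility = 0.3 * broadband + 0.8 * u + rng.normal(size=n)
df = pd.DataFrame({"fertility": fertility, "broadband": broadband, "z": z})

# Naive OLS is biased upward because broadband and fertility share u ...
ols = IV2SLS.from_formula("fertility ~ 1 + broadband", data=df).fit()

# ... whereas instrumenting broadband with z isolates variation in take-up
# driven only by the cost shifter and recovers roughly 0.3.
iv = IV2SLS.from_formula("fertility ~ 1 + [broadband ~ z]", data=df).fit()

print(ols.params["broadband"], iv.params["broadband"])
```

The credibility of such designs rests entirely on the exclusion restriction, that is, on the assumption that the cost shifter affects the outcome only through technology adoption, which is why the cited papers devote considerable attention to defending their instruments.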
A significant methodological challenge for work using individual-level data to assess the relationship between internet use, or having met online, and partnership formation is the issue of self-selection: Are the most marriage-ready individuals those who use internet-based technologies for mate search? If that is the case, then the internet itself may not have a causal impact on the propensity to form unions, but may instead provide a pathway through which those who are more successful or motivated reinforce their advantage in the marriage market.

Apart from its implications for partnership formation, the growing digitalization of dating and mate search has attracted growing attention in the demographic literature. The use of digital platforms provides a unique window through which to directly observe mate search preferences and dynamics via behavioural data in a way that has previously not been possible, and studies have examined these dynamics through data from online dating websites (see Skopek, Chapter 12 in this volume). While data from online dating provide rich material on mate search, they are less effective at capturing longer-term processes such as union formation (or dissolution), which occur outside the platform. In contrast, sociodemographic longitudinal surveys are better equipped to study life-course processes, but they have been slow to incorporate measures of digital technology, or they include coarse measures that do not capture the potentially different ways in which internet technologies can affect life-course processes. Studies using panel surveys have examined the sociodemographic characteristics of those who meet online (Danielsbacka et al., 2019; Potarca, 2020), as well as the wider impacts of digital partner markets on long-standing sociodemographic regularities in union formation, such as assortative mating patterns (Potarca, 2017; Thomas, 2020), or on mental health outcomes (see Potarca & Sauter, Chapter 21 in this volume). Understanding how the growing uptake and acceptability of digital partner markets affect whether and with whom individuals partner, and whether technologies reinforce or reshape patterns of social stratification, is a promising area for future research.

2.2.3 Migration

Just as with the other demographic outcomes described above, there are several reasons to expect a link between the diffusion of digital technologies and migration, with the possibility that this association can be either positive or negative. On the one hand, through access to information about destination countries and better connectivity with global and diaspora communities, digital technologies can change both the actual and perceived costs of migration and make the decision easier (Hiller & Franz, 2004; Oiarzabal & Reips, 2012). Social networks matter for migration, and the opportunities for more media-rich communication and access to pools of contacts ('latent ties') provided through the internet can help facilitate new migration streams (Burrell & Anderson, 2008; Dekker et al., 2016). Similarly, migrant communities connect on social media to establish and maintain social ties in both their origin and destination countries (Dekker & Engbersen, 2014; Komito, 2011). For those who have moved, digital technologies can also lower the psychological and emotional costs of moving and help with the maintenance of distant or transnational ties. Exposure to the life of others, with its impact on raising material aspirations, can also affect migration intentions (Lohmann, 2015). Consistent with these hypotheses, several quantitative studies have found a positive link between digital technologies and migration. These studies include macro-level panel analyses of the diffusion of mobile (Czaika & Parsons, 2017) or internet technologies (Pesando et al., 2021) and migrant stocks, as well as analyses of the relationship between individual technology use and migration intentions and behaviours, both for internal (Muto, 2012; Vilhelmson & Thulin, 2013) and international migration (Dekker et al., 2016; Pesando et al., 2021). Pesando et al. found consistent and positive associations between internet variables and migration outcomes along the migration path from intentions to actual international migration, using both micro- and macro-level analyses.
These findings lead them to argue that the internet serves as a crucial supportive agent in defining clearer migration trajectories, and they characterize this permeation of the internet across migration behaviours as the 'internetization' of international migration. The fact that migration phenomena can be observed through digital traces of internet-related technologies like social media and email (e.g., Alexander et al., 2020; Zagheni & Weber, 2012) lends further support to the idea of internetization, indicating that with technology diffusion the phenomenon is radically different from what it was in the pre-internet era.

The internet can also reduce the need for migration by bolstering economic growth in sending countries, improving local demand for skilled workers, and enabling remote or flexible forms of working. If these countervailing factors are stronger, a negative association between technology diffusion and migration outcomes can emerge. Findings aligned with these hypotheses are provided by Winkler (2017) for international migration, based on a macro-level analysis of internet diffusion and migration stocks across 33 Organisation for Economic Co-operation and Development (OECD) countries, and by Cooke and Shuttleworth (2017) for internal (interstate) migration in the US.

3 NEW SOURCES OF DATA FOR DEMOGRAPHIC RESEARCH

3.1 Digital Trace and Geospatial Data

Digital demography has increasingly popularized the use of data which were not originally intended for research, but are instead the by-products of other processes. We discuss two centrally important examples. The first is digital trace data. We begin by discussing how researchers should approach such data, then provide bibliometric data as an in-depth example of how digital trace data are being used by digital demographers. Following this, we discuss our second centrally important example of repurposed data: geospatial and remotely sensed data.

3.1.1 Digital trace data

Digital trace data are the digital records generated as a by-product of activities using digital technologies or platforms, ranging from archived Google searches to mobile phone records, social media postings (Cesare et al., 2018), user messaging in online dating (see Skopek, Chapter 12 in this volume), or rating and transaction data from online markets (see Przepiorka, Chapter 13 in this volume). These records offer access to unprecedented amounts of data in a continuous way, allowing researchers to examine real-time trends in demographic processes, such as mobility (Williams et al., 2015), migration (Zagheni et al., 2014), and fertility (Rampazzo et al., 2018; Wilde et al., 2020), but at the cost that the data are noisy, unstructured, and biased (Cesare et al., 2018; Kashyap, 2021). Therefore, before using these data, researchers must consider three things. First, not all data on the web are suitable for research, nor should all of them be used (D'Ignazio & Klein, 2020; Lazer et al., 2021). When considering a specific type of digital trace data, researchers should be clear about the theoretical and measurement framework that drives the research (Lazer et al., 2021); whether the data can be accessed without violating the terms of service of the platforms (Fiesler et al., 2020); and whether the research could endanger or harm the people under analysis (Fiesler & Proferes, 2018). When working with the data, it is necessary to create reliable ways to anonymize the data and to ensure that users are not harmed (Edelmann et al., 2020; Lazer & Radford, 2017). Second, the data can come from algorithms that reinforce biases and criminalize minorities, and the processes through which this occurs remain a 'black box' due to the proprietary nature of these algorithms. This happens because (see Cheney-Lippold, 2011; D'Ignazio & Klein, 2020; Noble, 2018): minority groups are not represented in tech companies; the data used to train the algorithms can be biased; and there is a lack of understanding of social and historical contexts when building models and interpreting results. Third, the algorithms that underpin the generation of these data can change over time – a phenomenon known as 'drift' (Salganik, 2018) – such that it may be difficult to distinguish whether patterns observed through these data are driven by changes in behaviours or by changes in the algorithm.

If a scholar is sure that digital trace data can be used for their research, then there are three main ways to access such data. First, many companies offer access to their application programming interfaces (APIs) that can provide data on different aspects of the platform. For example, a number of companies provide marketing APIs that allow users to retrieve the approximate number of users that match certain characteristics on that platform (Kashyap et al., 2020; Zagheni et al., 2017) and to download small, representative samples of the users' metadata and interactions (Morstatter et al., 2013). While APIs provide more democratic modes of access to these data than ad hoc private data-sharing agreements, it is important to note that many sources do not have direct API access (Kashyap, 2021). Moreover, there has been a tendency towards a reduction in API-based sources for social science research from online and social media platforms, as companies become more stringent in enabling access to their data due to privacy regulations or highly visible leaks like the Cambridge Analytica scandal with Facebook (Freelon, 2018). Second, researchers can scrape the data (Soowamber et al., 2016), meaning they can collect publicly available data from webpages using dedicated packages such as R's 'rvest' (Wickham, 2021). Finally, different independent organizations offer access to historical archives of the internet (Sequiera & Lin, 2017), which can be used to study the internet from a historical – and longitudinal – perspective (Gil-Clavel et al., 2022). Because digital trace data come in large amounts and in so many forms – JSON, images, text, and audio – scholars need to be more technically skilled and/or to collaborate with computer scientists (Edelmann et al., 2020; Kashyap, 2021). As an example, Breuer et al. (Chapter 14 in this volume) provide an in-depth account of how to collect research data from YouTube using those various approaches.

Since digital trace data come from digital populations, the differences in demographic attributes between those who are online and captured in these data and those who are not are important to consider. This is especially relevant from the perspective of demographic research, which is often interested in population-generalizable measurement. A key contribution of digital demography so far – and one with considerable scope for further development – is the analysis of the demographic characteristics and, relatedly, biases of digital populations (Kashyap, 2021). In the case of social network sites, such as Twitter and Facebook, research shows that they are mostly used by internet-literate people (Hargittai, 2020), and notable geographic and demographic disparities in the use of the internet and social media persist. Those in high-income countries generally have better internet access compared with those in low- and middle-income countries (ITU, 2020). Moreover, demographic digital divides exist, and these divides are often much larger in low- and middle-income countries (Kashyap et al., 2020).
For example, demographic research using Facebook’s marketing API shows greater adoption of Facebook among younger than older age groups (Gil-Clavel & Zagheni, 2019) and significant gender gaps in Facebook use, with women being much less likely to use Facebook in South Asia and Sub-Saharan Africa (Fatehkia et al., 2018; Kashyap et al., 2020). While some of these observed differences on specific online platforms like Facebook or Google may reflect demographic differences in internet use given the large size and coverage of these platforms, as shown for example in work on digital gender gaps (Fatehkia et al., 2018; Kashyap et al., 2020), platform-specific selection effects are also relevant to consider (Kashyap & Verkroost, 2021; Morstatter et al., 2013).
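Work of this kind typically condenses audience counts into simple indicators. The following Python sketch computes a female-to-male user ratio of the sort used in the digital gender gap studies cited above; the audience figures are invented placeholders for numbers that would in practice be retrieved through authenticated marketing API queries (omitted here), and capping the ratio at parity is a convention borrowed from some gender gap indices, an assumption of this illustration rather than a fixed rule.

    # Sketch: gender-gap indicator from marketing-API audience counts.
    # Audience figures are invented placeholders; real values would come
    # from authenticated queries to a platform's marketing API.
    audiences = {
        "Country A": {"female": 4_200_000, "male": 5_600_000},
        "Country B": {"female": 1_100_000, "male": 2_900_000},
    }

    def gender_gap_index(counts):
        # Female-to-male ratio of platform users, capped at parity (1.0);
        # values below 1 indicate that women are underrepresented.
        return min(counts["female"] / counts["male"], 1.0)

    for country, counts in audiences.items():
        print(country, round(gender_gap_index(counts), 2))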

These demographic differences between online and offline populations can be addressed in different ways, depending on the research question. If the analysis involves inferring characteristics from digital traces (such as migrant status), it is important to validate these data against ground truth data, e.g., those available from sources such as surveys (Fiorio et al., 2021), or by using qualitative methods (Armstrong et al., 2021). More broadly, validating constructs created using digital traces against ground truth measures can help illuminate the biases of these data, but also help generalize these measures to a broader population of interest (e.g., Kashyap et al., 2020; Rampazzo et al., 2021; Zagheni et al., 2017). Once biases are known, scholars can use approaches such as post-stratification weighting to account for them (Zagheni & Weber, 2015). Next to this, scholars also need to embrace the inherent uncertainty and variability of the data using computational and statistical methods (Alexander et al., 2020; Gil-Clavel et al., 2020; Hofman et al., 2021). Finally, researchers must account for the possible influence of platform algorithms on user behaviour (Cheney-Lippold, 2011; Wagner et al., 2021).

3.1.2 Digital trace example: bibliometric data

There is a long history of studies using different sources of data to study scholars as a population, including surveys of scholars (Cañibano et al., 2020; Franzoni et al., 2014), interviews (Cole & Zuckerman, 1987; Schaer et al., 2020), administrative and census data (Fenton et al., 2000; Shauman & Xie, 1996), and, more recently, online sources such as LinkedIn data or the websites of universities (Park et al., 2019; Yuret, 2017). With the widespread digitization of scholarly databases, bibliometric data are increasingly used as a new source (Alburez-Gutierrez et al., 2019). Bibliometric data include information extracted from scholarly publications in scientific journals, that is, the meta-data of publications: author name, affiliation address, reference list, publication year, title, abstract, manuscript text, and keywords. Using bibliometric data sometimes requires 'repurposing' them (e.g., inferring academics' residence and mobility from institutional affiliation addresses and their changes).

The study of academic mobility, as a specific type of highly skilled population mobility, has benefited from the availability of these large-scale and high-resolution bibliometric data, and this work can be divided, based on its focus, into two groups. The first group focuses on the geographic scale of academic mobility. Some studies look at internal or international migration to/from a country (see the cases of Russia (Subbotin & Aref, 2021), Mexico (Miranda-González et al., 2020), and Germany (Zhao et al., 2021)), while others have focused on the global mobility of scholars (Chinchilla-Rodríguez et al., 2018b; Czaika & Orazbayev, 2018). The second group focuses on individuals or groups of academics and the (dis)advantages of mobility, or its contributions to the field, knowledge transfer, institutional, national, or global productivity, and innovation. Some research has focused on the performance and impact of mobile scientists, or the so-called 'mover's advantage', while other work has looked at these scholars' contributions to specific fields or national contexts and the knowledge transfer that is facilitated through the experience of academic mobility (Aman, 2020; Bernstein et al., 2018).
There is also research on the downsides of scholarly mobility and the costs that academics bear by leaving one academic context for another (Ackers & Gill, 2005; Schaer et al., 2020), or on the potential (in)stability of scientific collaborations and the difficulty of finding a job during or after mobility (Baruffaldi & Landoni, 2012; Zhao et al., 2020). In addition to studies focusing on academic mobility, a group of studies has focused on policy changes and how they can encourage (Ippedico, 2021) or inhibit the mobility of the general population and, more specifically, of academics (Chinchilla-Rodríguez et al., 2018a; Sugimoto et al., 2017).

These uses of bibliometric data also face several limitations. Some limitations have to do with data quality (Tekles & Bornmann, 2020; Wu & Ding, 2013), including the disambiguation of scientific entity names (e.g., authors or institutions). In addition, higher-level epistemic questions need to be addressed when repurposing these data for demographic research (Laudel, 2003; Moed et al., 2013) – e.g., assigning the country of affiliation in the first publication as the country of origin for academic mobility is prone to error, since that could simply be the country of graduation. There is also a publication delay that can hinder proper identification of the mobility period. Furthermore, these data are limited to only those scholars who have actively published in indexed scholarly journals, so coverage may be incomplete.

The future of bibliometric data use for demographic research is bright, thanks to new services and methods for preparing cleaner data and to increased availability through open-access initiatives. A number of scientific and policy-relevant questions have emerged that can be tackled using these data: How much talent circulation has happened 'within' national borders versus 'between' nations? Are there migration corridors connecting specific regions globally, for example between two specific regions across countries or in the same country, or systems of circulation that involve several countries or subregions? Do scientific collaborations among scholars facilitate their future mobility? Do scholars have different probabilities of being mobile based on the trajectory of their collaborations during their scientific career? Answering these and related questions holds the key to understanding the increasingly complex interactions between the migration of scientists, scholarly collaborations, and institutional settings and policies. One relatively understudied avenue for future research is an integrated study of internal (Miranda-González et al., 2020) versus international academic mobility, since these could be considered interconnected migration systems (King & Skeldon, 2010; Skeldon, 2006); such work would help identify migration hubs, or regions with a high concentration of academic labour or high attractiveness for future mobility, and could thereby inform policy. To conclude, repurposing bibliometric information allows researchers to compile high-resolution data on scholarly life events, mobility, and migration, giving us an unprecedented opportunity to evaluate theories explaining migration through network tie formation (Massey et al., 1993). It helps in answering pressing questions of scientific and policy relevance which would previously have been impossible to address.

3.1.3 Geospatial and remotely sensed data

Digital and spatial demographers now have a wide array of remote sensing and geospatial data at their disposal to support population estimation and mapping for fine-grained geographic areas and time periods that may be out of reach for traditional field-based data collection. Remote sensing data (i.e., sensed via satellite and airborne imaging platforms) and the geospatial datasets derived from them often provide a unique combination of desirable characteristics: (1) repeated data collection at regular intervals; (2) fine-grained spatial resolution; and (3) full coverage across national and global scales.
For individual sensors there are trade-offs among these characteristics, but all three are generally improving as sensing technologies advance, such that global coverage with high spatial resolution and frequent repeat measurements is becoming commonplace. Perhaps the most fundamental characteristic for population mapping that is measured by remote sensing is the footprint of human settlements on Earth and how this has changed through time. The pursuit of delineating settlement footprints began with efforts to map land cover types and coverages of impervious surfaces (e.g., concrete) using multi-spectral imagery, and has progressed to mapping the footprints of every building on Earth, and sometimes even their heights, with sensors such as synthetic aperture radar and laser-based LiDAR (Elvidge et al., 2007; Esch et al., 2022; Melchiorri et al., 2018). Within settlement footprints, researchers can extract additional contextual information by linking to other geospatial datasets. For example, the physical characteristics of neighbourhoods – building densities and sizes, distances to city centres, road densities, coverages of impervious surfaces or green spaces in the surrounding area – can be used to categorize settlements (e.g., urban, rural, industrial) and to classify individual buildings as residential, non-residential, or mixed use (Florczyk et al., 2019; Jochem et al., 2021; Sturrock et al., 2018). This gives us valuable information about the likely locations of residential dwellings and a basis for making demographic inferences about the households living within them.

To leverage the power of geospatial data for demographic inferences, it is critical to have geo-referenced field observations from household surveys or national censuses. The geographic boundaries of household survey clusters or census enumeration areas provide a fine-grained location-based linkage between demographic observations and remotely sensed measurements of the environment at the same locations. To protect the privacy of individuals, household-level field observations are often aggregated to larger spatial units or their coordinates are displaced, but even this coarse location information is critical for linking to remote sensing and geospatial datasets (ICF International, 2013; Minnesota Population Center, 2020). These geographic units form the basic unit of analysis where field-based population observations and remotely sensed environmental characteristics have both been observed.

Both ML and statistical approaches have been implemented to model relationships between remotely sensed environmental characteristics and observed population characteristics. One of the most fundamental and important demographic characteristics modelled in this way is total population size within small geographic areas for specific age and sex groups (Tatem, 2017). These estimates are critically important where a full population census is not possible due to armed conflict or logistical constraints, and where small-area population denominators are needed to determine demographic rates, such as rates of infant or maternal mortality, vaccination coverage, etc. Analytical approaches for estimating population size can be divided into two broad categories: top down and bottom up (Wardrop et al., 2018). Bottom-up approaches extrapolate population characteristics observed at a sample of locations (e.g., survey clusters) into new geographic areas based on relationships with geospatial data available with full coverage across the study area (Leasure et al., 2020). Top-down approaches disaggregate total population size from coarse-scale geographic units (e.g., states) into fine-grained spatial units based on relationships with high-resolution geospatial data (Stevens et al., 2015).
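To make the top-down logic concrete, the following Python sketch (using scikit-learn, with entirely synthetic data) trains a random forest to relate geospatial covariates to population density and then disaggregates one coarse unit's census total across fine-grained grid cells in proportion to the model's predictions. The covariates and weighting scheme are illustrative assumptions loosely inspired by the cited work, not a reproduction of any published model.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    # Training data: coarse units with known population density and two
    # remotely sensed covariates (synthetic stand-ins, e.g. built-up
    # fraction and night-time light intensity).
    X_units = rng.uniform(0, 1, size=(200, 2))
    density = 1000 * X_units[:, 0] + 500 * X_units[:, 1] + rng.normal(0, 20, 200)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_units, density)  # learn covariates -> density

    # Disaggregate one unit's census total across 100 fine-grained grid
    # cells in proportion to predicted densities (dasymetric weighting).
    unit_total = 50_000
    X_cells = rng.uniform(0, 1, size=(100, 2))
    weights = np.clip(model.predict(X_cells), 0, None)
    cell_pop = unit_total * weights / weights.sum()

    print(cell_pop.sum())  # equals the coarse total (50000) by construction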
A variety of analytical frameworks have been implemented for these purposes, but perhaps the most widely used have included ML with random forests (Breiman, 2001; Liaw & Wiener, 2002), geostatistical models (Lindgren & Rue, 2015), and hierarchical Bayesian modelling (Gelman et al., 2015).

A key consideration going forward will be securing open-access availability of critical remotely sensed and geospatial data for digital demography, particularly high-quality building footprints with global coverage. These are important for mapping population characteristics at high spatial resolution for humanitarian responses, global health, education, and inequality initiatives, because they define the locations where people live with exceptional detail. Because these data are so valuable for commercial purposes, they are often not freely available for research and humanitarian purposes. This has begun to change with a number of open-source projects aiming to provide free building footprints with continental or global coverage (Biljecki et al., 2021; Geofabrik GmbH, 2018; Google, 2021; Microsoft, 2021; Open Street Maps, 2021; Sirko et al., 2021; Urban Analytics Lab, 2021). The highest-quality building footprints are often still proprietary, but they are increasingly being made available to researchers for humanitarian purposes, particularly in Sub-Saharan Africa, or aggregated for open-access publication (Dooley et al., 2020; Heris et al., 2020). With powerful new remotely sensed geospatial datasets such as these, digital demographers will be able to build on the value of traditional data sources from geo-referenced household surveys and national censuses to improve demographic inferences for populations in locations that are difficult to enumerate, and for time periods and frequencies that may be impractical for traditional field-based approaches.

3.2 Original Data Collection via Online Recruitment

The expansion of the internet has given digital demographers new opportunities for deploying surveys to study rapidly changing events and understudied populations. We focus on two popular sources of online survey respondents: social media and crowdsourced platforms. We then directly compare both samples by highlighting their relative strengths and weaknesses. While the first two subsections largely provide examples of these modes of recruitment from high-income countries, in the last section we consider the opportunities and challenges offered by these new modes in the context of low- and middle-income countries.

3.2.1 Online surveys and social media recruitment

The pervasive use of digital communication technologies and the dramatic rise in internet penetration have fostered a variety of non-traditional approaches in survey research (Wright, 2005). Across several disciplines and research fields, from demography to social science and public health, the number of surveys conducted over the internet has increased dramatically in the last 10 years in comparison with other methods, such as interviewer-administered modes (Alessi & Martin, 2010). Traditional probability-based sampling methods, such as address-based sampling and random digit dialling, have rapidly declined due to increasing costs and inadequate response rates and population coverage (Stern et al., 2014). Against this backdrop, social media – and the internet in general – offer new opportunities to leverage alternative data sources and innovative data collection schemes for empirical research and in support of traditional survey research methods (Eysenbach & Wyatt, 2002; Zhang, 2000). These range from ad hoc respondents recruited for cross-sectional surveys (Pötzschke & Braun, 2017), aimed at capturing a snapshot of a population's characteristics at one moment in time, to more sophisticated monitoring routines, such as participatory systems based on individual engagement in the voluntary reporting of specific information over time (Guerrisi et al., 2016; Wójcik et al., 2014). Compared to traditional survey methods, online surveys are generally inexpensive, less burdensome, and timelier, and they allow researchers to generate samples of geographic or demographic subpopulations that would otherwise be difficult to reach or involve in research (Zhang et al., 2020).

In this context, the use of the Facebook platform has recently become particularly popular among researchers. Facebook is currently the largest social media network, with 2.9 billion monthly active users worldwide (Facebook Inc., 2021) and nearly global coverage. Additionally, the ability to target – and thus recruit – specific groups of users of interest for a study (from their demographics to specific interests in topics, language use, etc.) makes it particularly appealing for survey research (Whitaker et al., 2017). As such, this approach has been employed for recruiting specific segments of the population, such as parents (Bennetts et al., 2019), members of the LGBTQ community (Guillory et al., 2018; Kühne & Zindel, 2020), migrants (Pötzschke & Braun, 2017), and service-sector employers (Schneider & Harknett, 2019). In 2020, the urgent need for complementary information in the context of the COVID-19 pandemic – on symptoms, behaviours, and attitudes toward policy, but also on individuals' beliefs and economic impacts – fostered the use of Facebook surveys for continuous and cross-national data collection to help increase situational awareness and guide decision making (Grow et al., 2020; Salomon et al., 2021).

Regardless of the method or platform used for recruitment, attention to what type of data are used and who is represented in the data is critical in order to avoid limiting the validity of the conclusions drawn (Chunara & Cook, 2020). Online surveys potentially suffer from biases due to self-selection and the non-representativeness of the sample. This is due to variations in internet penetration, social media usage, and interest in the survey topic, which may affect the sample in terms of both demographic characteristics (e.g., age and gender) and unobservable characteristics (e.g., underlying medical conditions and threat perception). While appropriate survey designs may partially prevent this issue, a number of statistical methods can be applied to correct for major biases and ensure appropriate coverage of the general population. Post-stratification weighting, for example, is a standard technique in survey research in which weights are computed based on population information from representative data sources (e.g., census data) and applied to the surveyed samples to correct for potential non-representativeness (Zagheni & Weber, 2015); a minimal worked example follows below. A more sophisticated version of this approach combines post-stratification with multi-level regression models to make proper inferences at the population level, even in the presence of strong selection bias (Downes et al., 2018). Other techniques to correct selection bias in online surveys include propensity score matching (Schonlau et al., 2009) and alternatives that combine it with machine-learning classification algorithms (McCaffrey et al., 2013).

Despite these limitations and challenges, conducting online surveys certainly offers new opportunities for primary data collection, especially in settings where traditional resource-intensive methods are impossible, such as natural disasters, violent conflicts, and pandemics (Grow et al., 2020; Rosenzweig et al., 2020). During the COVID-19 pandemic, for example, we witnessed growing efforts to collect unique and timely datasets worldwide. While this has enhanced our understanding of the COVID-19 pandemic, future research is needed to translate these efforts into robust and lasting systems for rapid-response data collection.
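As flagged above, the following sketch illustrates the basic post-stratification arithmetic in Python with invented figures: each cell's weight is the ratio of its population share (e.g., from a census) to its share in the online sample, and the weighted estimate accordingly down-weights overrepresented groups. Cells, shares, and outcomes are all hypothetical.

    # Post-stratification sketch with invented census and sample shares.
    population_share = {"women 18-34": 0.18, "men 18-34": 0.17,
                        "women 35+": 0.34, "men 35+": 0.31}
    sample_share = {"women 18-34": 0.35, "men 18-34": 0.30,
                    "women 35+": 0.20, "men 35+": 0.15}
    cell_mean = {"women 18-34": 0.62, "men 18-34": 0.55,
                 "women 35+": 0.48, "men 35+": 0.41}  # outcome by cell

    # Weight per cell = population share / sample share
    weights = {c: population_share[c] / sample_share[c] for c in cell_mean}

    naive = sum(sample_share[c] * cell_mean[c] for c in cell_mean)
    adjusted = sum(population_share[c] * cell_mean[c] for c in cell_mean)
    # With these figures the adjusted estimate (~0.495) is lower than the
    # naive one (~0.540), because young users are oversampled online.
    print(round(naive, 3), round(adjusted, 3))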
With the global growth in mobile internet penetration, it seems inevitable that the use of online surveys will continue to increase over time (for online and mobile survey data collection see also Das & Emery, Chapter 4 and Struminskaya & Keusch, Chapter 5 in this volume). Future research is therefore needed to assess the strengths and weaknesses of the different online platforms and methodological approaches, as well as the best uses and practices for data collaboration and sharing that, with proper data protection and data privacy mechanisms, can support social impact applications and humanitarian action (Evans & Mathur, 2018).

3.2.2 Crowdsourced platform recruitment

Crowdsourced platforms, such as Amazon's Mechanical Turk (MTurk) and Prolific, the latter designed specifically for academic research, provide a convenient, cost-effective means of connecting researchers with interested study respondents (Palan & Schitter, 2018; Paolacci & Chandler, 2014). This arrangement differs from social media platforms in that users on crowdsourced platforms are present specifically to do paid work, including academic studies. Such samples have been found to provide high-quality data and demographically diverse respondents (Hauser & Schwarz, 2016; Weinberg et al., 2014). These platforms are also capable of facilitating complex research designs, such as multi-day studies, panel studies, and qualitative interviews (Aguinis et al., 2021; Shank, 2016). Due to the speed of data collection on these platforms, such samples are also especially useful for conducting timely research, including responses to the COVID-19 pandemic (e.g., Chung et al., 2021; Fish et al., 2021).

Despite these advantages, respondents recruited from these platforms are unsuitable for certain kinds of research. Importantly, the samples are not representative. Relative to representative US data, the respondents tend to be younger and better educated, and tend to suffer from worse mental health (Aguinis et al., 2021; McCredie & Morey, 2018). Consequently, their use in social scientific fields accustomed to representative data appears limited (Shank, 2016). For example, only a single article in the journal Demography has used MTurk data (Grigorieff et al., 2020). Yet the current influence of these platforms on demography, and the future opportunities they offer, may be greater than they appear. Crowdsourced samples are excellent for testing survey designs prior to their deployment via more costly, representative samples, or for generating preliminary data to include in grant applications (Geisen & Murphy, 2020). The demographic diversity present on these platforms makes them useful for targeting specific subpopulations, though these subpopulations may not be representative (Ogletree & Katz, 2020). Relatedly, researchers can partially correct for non-representativeness via the means discussed in the previous sections, such as post-stratification weighting. Lastly, there is growing interest in utilizing digital trace data in demographic research, such as data provided by GPS- and accelerometer-equipped smartphones (Cesare et al., 2018; Cornwell et al., 2019; Struminskaya & Keusch, Chapter 5 in this volume), and the methodological flexibility afforded by crowdsourced platforms provides a valuable and largely untapped means of deploying these novel data collection methods.

Researchers considering the use of MTurk should be aware of recent declines in data quality due to respondents fraudulently gaining access to studies restricted to US residents and then providing poor-quality responses (Kennedy et al., 2020). Researchers can address this by verifying respondents' country of origin via their IP addresses, then blocking respondents located outside the country of interest or those masking their location via a virtual private server (Winter et al., 2019). Beyond this, there are several methods researchers should employ to improve data quality when using crowdsourced samples. First, researchers should require a minimum reputation, such as requiring that greater than 95 per cent of respondents' past work has been deemed acceptable (Peer et al., 2014).
Researchers should also require a minimum level of experience, though this level will vary depending on the platform. Peer et al. (2014) used a minimum requirement of 500 previously completed tasks for MTurk respondents, though this does not mean respondents need to have completed at least 500 previous academic studies: many tasks on MTurk are non-academic and brief, sometimes under a minute in duration. Prolific advises researchers to require a minimum of 10 studies for longitudinal research, though the effective minimum experience level on Prolific remains unstudied. Improving data quality via attention check questions, by contrast, appears ineffective with high-reputation respondents (Peer et al., 2014) and may influence respondents' answers in unexpected ways (Hauser & Schwarz, 2015).

Crowdsourced samples have grown dramatically over the past decade, though the overwhelming focus of this research has been restricted to US and, to a much lesser extent, Indian respondents (Difallah et al., 2015), presumably due to these nations' disproportionate representation on MTurk (Difallah et al., 2018). This has started to change with the growth of Prolific, particularly with research on UK residents, and Prime Panels, which aggregates respondents available on other crowdsourcing platforms, but it remains to be seen whether this will lead to greater use of respondents from outside the US. Relatedly, although MTurk is one of the best-studied convenience samples available to researchers, more research on who crowdsourced respondents are is needed, due both to the changing nature of the MTurk population and to the emergence of newer and comparatively understudied samples, including Prolific, Luc.id, and Crowdflower, as well as online panels, such as Prime Panels and those aggregated by Qualtrics (Boas et al., 2020; Chandler et al., 2019; Peer et al., 2017). Lastly, researchers have grappled, and will likely continue to grapple, with the ethical concerns associated with crowdsourced labour, including concerns about underpayment, violations of privacy, and asymmetries in information and power (Gleibs, 2017).

3.2.3 Direct comparisons: Facebook versus MTurk and Prolific

Both social media platforms, such as Facebook, and crowdsourced platforms, such as MTurk and Prolific, have become popular, cost-effective sources for recruiting respondents into social scientific research. The advantages of these platforms are highlighted well by their usefulness for facilitating rapid, timely research and for recruiting hard-to-reach populations; yet there are differences between the platforms in terms of reach, size, and usability, which may make one source of data preferable to the other depending on the needs of the research project. We illustrate these differences by comparing Facebook with MTurk and Prolific, which are among the most popular options for recruitment via social media and crowdsourced platforms. To start, as we discuss in more detail in the next section, Facebook allows for large-scale recruitment across much of the world, while recruitment via MTurk and Prolific is limited, especially in the US and India. Further, these crowdsourced platforms are not only limited in reach but also in size. Estimates indicate MTurk has only approximately 7,000 unique respondents available for recruitment (Stewart et al., 2015). Although exclusion criteria will affect this estimate, it is safe to conclude that the size of crowdsourced platforms is likely to severely limit the number of hard-to-reach respondents relative to who can be found on Facebook. For example, researchers using Facebook were able to recruit approximately 1,000 Polish immigrants in four European countries (Pötzschke & Braun, 2017), which would likely be infeasible via these crowdsourced platforms.1 While Facebook has clear advantages in reach and size, it presents usability challenges related to recruitment and payment, though MTurk requires special steps to avoid quality issues.
To begin, Facebook, MTurk, and Prolific all allow researchers to target based on demographic characteristics. Procedures for targeting specific groups beyond these available options are easier to implement on MTurk and Prolific, due to the ease of re-recruiting respondents without collecting personal information. For example, researchers using MTurk or Prolific may conduct a brief screening survey to identify respondents of interest, then limit the main study to these respondents, thereby mitigating the risk of respondents lying to gain access to a study (Aguinis et al., 2021). Relatedly, although both social media and crowdsourced samples are non-representative, researchers should be aware of the additional issue caused by Facebook's algorithm targeting the demographic groups found to respond most to the study advertisement, ultimately leading to unanticipated oversampling (Neundorf & Öztürk, 2021). Researchers can mitigate this issue by choosing a 'campaign objective' which will target a balanced proportion of demographic categories – though, even with this choice, how Facebook selects the users who see the study advertisement remains a black box (Neundorf & Öztürk, 2021). Next, MTurk and Prolific both facilitate respondent payments on the platform, whereas reimbursing respondents on Facebook requires methods separate from the platform, often involving gift cards and raffles, which adds complexity to the data collection process. However, the emphasis on payment on MTurk and Prolific renders these platforms largely unusable for recruiting unpaid, voluntary samples, which, by contrast, have been successfully recruited on Facebook (Perrotta et al., 2021). Further, although MTurk and Prolific's research focus generally makes these platforms easier to use than Facebook, one important usability disadvantage of MTurk (and potentially other crowdsourced platforms) relates to the previously discussed 'quality crisis' (Kennedy et al., 2020). It is therefore crucial that researchers take the extra steps necessary to exclude respondents who mask their location when using MTurk (Winter et al., 2019).

Researchers need to consider the trade-offs between Facebook's reach and size on the one hand and MTurk and Prolific's enhanced usability on the other. Large-scale, international surveys are a natural fit for Facebook. Smaller studies, especially those meant for testing survey questions, likely fit better with MTurk or Prolific. Longitudinal designs may also be better suited to MTurk or Prolific, given their built-in tools for re-recruitment. Although MTurk and Prolific can be useful for recruiting from specific populations, these platforms are likely too small to facilitate research on many hard-to-reach populations, which are likely much more accessible through Facebook. Overall, respondents recruited via Facebook, MTurk, and Prolific have produced significant amounts of high-quality research and, where appropriate, should be embraced. However, each of these platforms is ultimately a commercial service, for which user engagement and participation can change over time, particularly as new platforms appear. Facebook, for example, is currently a large and widely used social media platform, but it is much less popular among teenage populations and younger cohorts, who have migrated to other platforms like Snapchat or Instagram (Pew Research Center, 2018). Thus, moving forward, understanding demographic characteristics and user engagement across different platforms is a crucial consideration.

3.2.4 Recruitment in low- and middle-income countries

As outlined in the previous sections, a significant driver of the interest in web-based and mobile technologies for recruitment and data collection, such as in surveys, has been cost efficiency and timeliness.
The potential gains from these modes of data collection are thus likely to be even larger where infrastructures for population-based data are less widespread but the spread of mobile and internet technologies has been rapid, as in many low- and middle-income countries (Kashyap, 2021). For example, Facebook penetration rates on the African continent are higher than 20 per cent in 29 of the 54 countries (Rampazzo & Weber, 2020), which provides an opportunity for recruiting respondents in Africa. Examples of research using this methodology in Africa range from studies on sexualities (Olamijuwon & Odimegwu, 2021) to studies of COVID-19 (Bicalho et al., 2021) and political attitudes (Ananda & Bol, 2021; Rosenzweig & Zhou, 2021; Rosenzweig et al., 2020; Williamson & Malik, 2020); the countries analysed are Egypt, Kenya, Nigeria, South Africa, and Uganda. Outside of the African continent, researchers have also used Facebook surveys to study populations in Indonesia, Mexico, Brazil, and India (Ananda & Bol, 2021; Boas et al., 2020; Rosenzweig et al., 2020). The results from these studies show that Facebook works particularly well for recruiting young adults (18–24 years old). The survey cost per completed questionnaire is low: on average $0.53 in Kenya, Nigeria, and South Africa (Olamijuwon & Odimegwu, 2021). However, there are geographic differences within countries: in Kenya it ranges from $0.36 in the Rift Valley province to $9.71 in the North East province (Pham et al., 2019). It is generally more difficult and expensive to recruit women, lower-educated people, and older people. Overall, the results are positive, but there is a need to understand how accurate and generalizable these findings are across low- and middle-income countries.

A second issue to consider is how representative online modes of data collection are in these settings. In low- and middle-income countries, where the diffusion of digital technologies is still under way, demographic biases in the use of these technologies may be even more significant, and ground truth data for calibration are even more limited than in high-income countries. A promising approach in these settings to studying how online populations can be used to infer features of the offline population is provided by methodologies such as network scale-up methods (Feehan & Cobb, 2019) or respondent-driven sampling.
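As a minimal sketch of the scale-up idea, assuming invented survey responses and a simple Killworth-style ratio estimator (actual applications, e.g., Feehan & Cobb, 2019, use more refined estimators and adjustments), the size of a hidden population can be approximated in Python as follows:

    # Basic network scale-up sketch with invented survey responses.
    # Respondents report contacts in subpopulations of known size (to
    # estimate personal network size) and contacts in the hidden population.
    N_total = 1_000_000                  # total population size (assumed known)
    known_group_sizes = [5_000, 20_000]  # sizes of two named known groups

    known_contacts = [[1, 3], [0, 5], [2, 2], [1, 4]]  # per respondent
    hidden_contacts = [2, 0, 1, 1]

    # Estimate each respondent's network size: contacts in known groups
    # scaled by those groups' share of the total population.
    frac_known = sum(known_group_sizes) / N_total
    network_sizes = [sum(row) / frac_known for row in known_contacts]

    # Scale-up estimate: hidden contacts relative to total network size
    hidden_size = N_total * sum(hidden_contacts) / sum(network_sizes)
    print(round(hidden_size))  # ~5556 with these invented numbers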

4 EMBRACING COMPUTATIONAL METHODOLOGIES

4.1 Machine Learning in Demography

The discovery of population-level patterns and regularities using data has been a long-standing focus of demographic research. This descriptive emphasis in the field, as well as its orientation towards the projection of future (unseen) trends informed by past (seen) patterns in data, lends itself well to applications of ML. While ML methods in the past were applicable only in a handful of niche settings with high-dimensional data and dedicated supercomputers, increasing computational power combined with a data-rich environment has led to an emerging interest in ML for demographic applications. This section provides examples of these, whilst outlining areas for further application and development.

The key innovation of ML is to learn intricate patterns from data, rather than estimating patterns that were hypothesized by a researcher a priori (see also Erhard & Heiberger, Chapter 7 in this volume). ML approaches can belong to one of two families – supervised or unsupervised – and both sets of approaches have seen some use in demography. The family of supervised ML methods attempts to find predictive models that relate explanatory variables to some outcome – a familiar setup to quantitative social scientists, including demographers. Traditionally, a researcher would have to specify the functional relationship between the two; in practice this has often been some linear additive functional form in regression-style analyses. Moreover, in this setup, a small set of variables is considered for analysis by the researcher, even though a larger set of variables may be available. In contrast, ML approaches provide an opportunity to evaluate a much larger range of possible functional forms and a larger set of predictors. In other words, ML replaces the historically theory-driven and time-consuming phase of model building with a data-driven approach. As a consequence, predictors can be found which may never have been evaluated or even thought of before.

ML approaches have been applied to individual-level longitudinal survey datasets to analyse the predictors of union dissolution (Arpino et al., 2022), the transition to adulthood (Billari et al., 2006), and other life-course outcomes (Salganik et al., 2020), with a view to harnessing a broader range of predictors and discovering non-linearities and complex patterns of association. While in some cases these studies point to improved predictive accuracy when ML approaches are applied to standard social survey datasets (e.g., Arpino et al., 2022), others show less clear-cut improvements (Filippova et al., 2019; Salganik et al., 2020). A deeper assessment of the conditions under which ML approaches can help improve predictive accuracy within the constraints of social demographic data, as well as of the pros and cons of these approaches with different data structures (e.g., surveys versus population registers), is an area where further research would be beneficial.

Improving predictive accuracy is also clearly relevant in demographic applications in population projection and forecasting, where training a predictive model for high out-of-sample accuracy is the objective (Booth, 2006). In these settings, the flexibility of ML models, in particular neural networks, has been shown to help improve on standard approaches such as the Lee-Carter model for mortality forecasting (Nigri et al., 2019, 2021). Another application of ML techniques for population estimation has been in the context of the integration of geospatial data for fine-grained population mapping. In these applications, high-resolution data (e.g., full-enumeration census counts) are not directly available, but ML models (e.g., random forests) that use geospatial predictors (e.g., from satellite data) are trained to predict population estimates (typically available from a sample of locations, or a higher spatial unit) and improve out-of-sample coverage of high-resolution population counts (Lloyd et al., 2017; Wardrop et al., 2018) (see Section 3.1.3). A key challenge in these, as well as in forecasting settings, is how ML approaches can be used to quantify uncertainty in modelled outputs, in contrast to statistical approaches (e.g., Bayesian hierarchical methods), where these are more directly generated by the model.

While ML approaches for predictive modelling in demographic research have emerged, demographers are also interested in explaining why demographic phenomena occur, and ML can benefit researchers interested in understanding underlying mechanisms. Recent developments in the field of explainable artificial intelligence (X-AI) allow researchers to unpack ML models to generate insight into the way the model relates observables to the outcome (Samek et al., 2019; Verhagen, 2021). There are two modes of X-AI: 'global' and 'specific' explainability. The first provides a broad assessment of which variables are important for accurately fitting the outcome of interest and can be useful when identifying or screening important variables in high-dimensional settings – e.g., large-scale surveys or population register data. Global explanations can provide a data-driven rationale for including variables in the analysis.
The second evaluates the impact of changing an explanatory variable on the outcome, generating an implied association amongst variables. Popular methods are local interpretable model-agnostic explanations (LIME) and Shapley-value-based explanation techniques, which have been successfully applied in both medical (Lundberg & Lee, 2017; Lundberg et al., 2018) and some social science applications (Verhagen, 2021). Specific explanations are most relevant when a limited set of explanatory variables has been identified, but the exact functional relationship amongst these variables is unknown. The use of these approaches so far has been quite limited in demography, but this is a promising area for further development; a minimal sketch of the 'global' screening mode follows.
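As a minimal sketch of the 'global' screening mode, assuming synthetic data in place of a real survey, the following Python example fits a random forest and ranks predictors by permutation importance – one simple global-explainability technique; the LIME and Shapley methods mentioned above would be applied analogously for 'specific' explanations. All variables are invented.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)

    # Synthetic 'survey': five candidate predictors, only x0 and x3 matter.
    X = rng.normal(size=(2000, 5))
    y = (X[:, 0] + 0.5 * X[:, 3] ** 2 + rng.normal(0.0, 0.5, 2000) > 0.5)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(X_tr, y_tr)

    # Global explanation: accuracy drop when each predictor is permuted.
    result = permutation_importance(model, X_te, y_te, n_repeats=20,
                                    random_state=0)
    for i, imp in enumerate(result.importances_mean):
        print(f"x{i}: {imp:.3f}")  # x0 and x3 should stand out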

Finally, there is a host of unsupervised ML methods which do not attempt to relate observables to an outcome of interest, and instead look for recurring patterns in a dataset. Such pattern recognition can be useful for reducing the dimensionality of data, whilst also assessing it in a more holistic way, or for finding observations that frequently cluster together. A natural example of this type of approach in the demographic literature is provided by the use of sequence analysis, where the interest lies in identifying similarities (or dissimilarities) between individual life courses, conceptualized as sequences of events (Billari, 2001). Once a similarity metric to assess sequences is identified, a common next step is to identify typologies or 'clusters' of groups with similar sequences (see the sketch below). Sequence analysis has been widely used as a methodological approach in family demography and life-course research (e.g., Barban & Sironi, 2019; Billari & Piccarreta, 2005), and proposals to apply more theoretically informed similarity algorithms that are more relevant for the demographic context (e.g., monothetic tree-based divisive algorithms compared with the widely used optimal matching algorithms) have also been made within the literature (Piccarreta & Billari, 2007).
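To give a flavour of this workflow, the sketch below clusters toy life-course sequences in Python: a simple Hamming distance between yearly state codes serves as a crude stand-in for the optimal matching dissimilarities discussed above, followed by hierarchical clustering into typologies. The states, sequences, and distance choice are all assumptions of the illustration.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Toy life courses, one state per year: S=single, U=union, M=married,
    # P=parent (sequences invented for illustration).
    sequences = ["SSSUUMMMPP", "SSSUUMMPPP", "SSSSSSSUUM",
                 "SSSSSSSSUU", "SUUMMPPPPP"]

    codes = {s: i for i, s in enumerate("SUMP")}
    X = np.array([[codes[c] for c in seq] for seq in sequences])

    # Hamming distance (share of years in different states) as a crude
    # stand-in for optimal matching dissimilarities.
    dist = pdist(X, metric="hamming")

    Z = linkage(dist, method="average")        # hierarchical clustering
    labels = fcluster(Z, t=2, criterion="maxclust")
    for seq, lab in zip(sequences, labels):
        print(seq, "-> cluster", lab)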

4.2 Simulation Techniques in Demography

Simulation techniques have a long tradition in demographic research, with MS and ABC modelling being the most common approaches. Demographic MS is a type of individual-level simulation which takes empirical transition rates as input (e.g., mortality, fertility, and marriage rates) and returns a synthetic population that resembles the population from which the original rates were borrowed. Thus, starting from an initial population, the transition rates determine the probability that a given individual in the simulation will experience, for example, death, childbirth, or marriage. Moreover, these rates can vary by population subgroup, which provides greater flexibility in modelling population heterogeneity than classical cohort-component projection approaches. The resulting hypothetical populations have a realistic genealogical structure from which it is possible to study intergenerational and other kin-dependent processes (Hammel et al., 1976). ABC modelling is conceptually and technically similar to MS but differs in its focus (Grow & Van Bavel, 2018). While MS is typically used to explore the evolution of populations based on empirical rates, ABC modelling is used to explore the implications of theories about people's demographic behaviour. A crucial element here is that ABC modelling makes it possible to incorporate theoretically motivated rules for behaviour (e.g., decision-making related to fertility), as well as interactions and interdependence among actors, in a way that is often difficult with traditional MS. As such, MS is commonly used for studying the implications of macro change for populations, and ABC modelling for exploring social and demographic mechanisms at the micro level and their implications for emergent macro-level demographic regularities (for a general overview of ABC modelling in the social sciences, see Lucas & Feliciani, Chapter 8 in this volume).

Research that has applied MS and ABC modelling has generated important new insights into demographic processes. One area that has benefited from simulation is family demography. Many of the processes that lead families to form (e.g., marriage, unmarried cohabitation), grow (e.g., fertility), and dissolve (e.g., divorce, death of a partner) lend themselves well to MS and ABC modelling. For example, recognizing that there are sometimes large educational differentials in female fertility across countries, Potančoková and Marois (2020) used MS to explore how educational expansion, especially among women, may affect future fertility patterns in the European Union. Yet the study of families often requires us to consider the decisions of multiple actors in networked structures, such as the interdependent partnering decisions of prospective spouses. In such contexts, ABC modelling has been particularly beneficial, for example in the study of marriage (Billari et al., 2007), divorce (Grow et al., 2017), and the transition to parenthood (Aparicio Diaz et al., 2011). A second area in which simulations have become increasingly prominent is migration. Migration is tightly connected with decision-making that evolves over the life course. Individual decisions are also affected by processes of information diffusion, network dynamics of interaction, and heterogeneous responses to similar circumstances. ABC modelling is well suited to modelling the non-linear effects of these factors at a population level. Klabunde and Willekens (2016) offer a review of the state of the art and the challenges of modelling decision-making in ABC models of migration. Third, there is a long tradition of using MS to study the implications of demographic or social changes (e.g., educational expansion, the diffusion of technologies like ultrasound) for population dynamics (Kashyap & Villavicencio, 2016; Potančoková & Marois, 2020). MS has also been used in contexts of data scarcity, such as in historical demography (Murphy, 2011) and in the context of mortality crises (Wachter et al., 2002; Zagheni, 2011). Furthermore, MS is particularly well suited to researching kinship dynamics. This includes studies on the availability of kin over the life course (Verdery & Margolis, 2017; Wachter, 1997), the degree to which different generations of kin overlap (Alburez-Gutierrez et al., 2021), and the experience of kin loss (Verdery et al., 2020).

Despite its long tradition, demographic simulation research is still evolving. For example, as indicated above, MS and ABC modelling often have different goals and strengths, but they are highly compatible. Hence, an increasing number of scholars are arguing for integrating the two approaches, to improve the quality of the resulting simulations and to make their outcomes more empirically informed (e.g., Bijak et al., 2013). For instance, a model that explores how educational expansion may affect fertility and union formation in the long run needs to consider both basic demographic processes (e.g., mortality) that can be modelled with techniques from MS, and interdependent decisions (e.g., marriage decisions) that can be modelled with techniques from ABC modelling. For examples of such an integrated approach, in which empirical transition rates are combined with theoretically informed individual-level behavioural processes, see Grow and Van Bavel (2015), Zinn (2016), and Kashyap and Villavicencio (2016). Furthermore, perfect knowledge of input rates should – in principle – lead to an unbiased reconstruction of population dynamics, with the only uncertainty related to the stochasticity of the simulation. In practice, knowledge of the underlying transition probabilities or input parameters is far from perfect. Thus, calibrating simulations to match certain key characteristics of the underlying population is essential. In the past, calibration was mainly constrained by the limitations of computing power. Traditional methods relied heavily on minimizing the number of simulation runs; that was done, for instance, by using expert judgement to adjust the input parameters in a consistent and appropriate way.
The rapid increase in computational power has been matched by a renewed interest in computationally intensive approaches to calibration (Aparicio Diaz et al., 2011; Kashyap & Villavicencio, 2016; Potančoková & Marois, 2020). The Bayesian melding method (e.g., Poole & Raftery, 2000; Raftery et al., 1995), in particular, has proved useful for formalizing the process of calibration and statistical inference. It is a Bayesian approach since it relies on the Bayesian machinery of combining prior distributions with likelihoods to obtain posterior distributions. It has been named 'melding' because it 'provides a way of combining different kinds of information (qualitative or quantitative, fragmentary or extensive, based on expert knowledge or on data) about different quantities, as long as the quantities to which they relate can be linked using a deterministic model' (Poole & Raftery, 2000). Ševčíková et al. (2007) and Zagheni (2010) extended the approach to stochastic simulations.
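To tie the MS logic described at the start of this section to something concrete, the following Python sketch runs a deliberately minimal cohort microsimulation: individuals age one year at a time and experience death and childbirth as Bernoulli draws from invented age-specific rates. Real microsimulation platforms add much more (marriage, kinship links, calibration against observed data), so this illustrates the mechanism only.

    import numpy as np

    rng = np.random.default_rng(42)

    # Invented age-specific annual rates (placeholders, not empirical inputs)
    def mortality_rate(age):
        return 0.002 * np.exp(0.08 * age)   # toy Gompertz-style mortality

    def fertility_rate(age):
        return np.where((age >= 20) & (age < 40), 0.1, 0.0)

    ages = np.zeros(10_000)                 # a birth cohort of 10,000
    alive = np.ones(10_000, dtype=bool)
    births = 0

    for year in range(60):
        dies = rng.random(10_000) < mortality_rate(ages)
        alive &= ~dies                      # dead individuals stay dead
        mothers = alive & (rng.random(10_000) < fertility_rate(ages))
        births += mothers.sum()             # newborns are counted but,
        ages += 1                           # for brevity, not simulated

    print(alive.sum(), "survivors at age 60;", births, "births")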

5 CONCLUSION Digital demography broadly encompasses research on the implications of digital technologies and digitalization for demographic outcomes and processes, as well as the use of new sources of data and computational methods facilitated by technological advancements. These new data sources can provide insights in areas where traditional data have limitations or gaps, thereby complementing them. The demographic perspective, as exemplified by the keen interest among demographers in issues such as bias and population generalizability, make digital demography uniquely positioned to make a valuable contribution to the broader community of computational social science. An increasingly data-rich environment, combined with a long-standing interest in describing and characterizing population-level patterns and regularities, also implies that demography has already benefitted but has significant room to develop for the application of computational methods like ABC and MS modelling and ML. Many of the questions that lie at the intersection of the digital revolution and demography are clearly of interest across the social and computational sciences more broadly, and these cross-cutting themes provide an opportunity for further cross-pollination of ideas. Although the demographic ‘data ecosystem’ has become increasingly data rich and benefitted from a growing openness to new data sources, future advancement will benefit from leveraging the advantages of both new sources of data discussed earlier and traditional sources. The study of the impacts of digitalization on demographic behaviours requires an inclusion of more refined digital variables within demographic sources, such as large-scale nationally representative surveys. An important area for further development of digital and computational demography lies in linking both new and traditional sources, and therefore telling us more than any one source of data can speak to on its own (Kashyap, 2021). For example, in the context of online dating, digital trace data can provide us snapshots of behaviours and mate search heuristics, but are limited in telling us about long-term outcomes for those who use these platforms, such as who partners/marries (cf. Skopek, Chapter 12 in this volume). In contrast, longitudinal data on retrospective life histories commonly used in social demographic research are high quality and resource intensive, informative about life-course processes yet slow to change and seldom collect detailed measures of technology usage. These traditional survey data are helpful for studying union formation or dissolution and their correlates but contain little to no information about mate search dynamics and preferences. Linking these data – the digital traces of behaviours with survey-based measures of life events – would address many of their respective deficiencies, and offer a fruitful path forward for digital demography – especially for our understanding of how digital technology is affecting us both immediately and over the long term, and at both the micro and macro levels. Researchers have pursued this approach to study a wide range of topics, including news exposure and internet usage (Stier et al., 2020), and similar data linkage has been embraced by time-use researchers via simultaneous data collection from cameras, accelerometers, and traditional time-diary surveys (Cornwell et al., 2019; Gershuny et al., 2020), however, existing examples of such data linkage are often

We remain in the early stages of harnessing the advantages of linking different kinds of data, but doing so can serve to further develop the field.

While the digital revolution has provided new opportunities for demographic research, researchers need to be cognizant of new and persistent challenges. Unlike the traditional data demographers are accustomed to, many of the new forms of data discussed in this chapter are not publicly available or shared, because the data remain privately held. This undercuts opportunities for transparency and replicability, and can exacerbate inequality, since access to data may depend on connections available only to prestigious institutions and well-connected researchers, often located in the Global North. The issue of inequality also extends to which populations have been studied. While new data opportunities provide valuable avenues for studying the Global South, and demographers, given their global interests, are well positioned to do so, much work so far still focuses on high-income countries. The study of the Global South within digital demography requires both recognition of the distinctive biases (e.g., gender gaps) in the use of digital technologies that may exist in these countries and better training opportunities in the Global South. As training programmes for digital demographers develop, inclusive approaches for broadening the community would benefit the intellectual development of the field tremendously.

Lastly, the availability of alternative new sources of data does not supplant the continued need for deeper investments in population data infrastructures, such as vital registration systems, population censuses, and registers (Kashyap, 2021). Finding further avenues for integrating new forms of data, and for assessing how they provide complementary measurement to traditional data sources, is vital for advancing digital demography.

NOTE

1. As of early 2022, Prolific indicates only approximately 400 such respondents had been active on their platform within the past 90 days. MTurk does not provide such figures.

REFERENCES

Abdi, J., Al-Hindawi, A., Ng, T., & Vizcaychipi, M. P. (2018). Scoping review on the use of socially assistive robot technology in elderly care. BMJ Open, 8(2), e018815.
Ackers, L., & Gill, B. (2005). Attracting and retaining 'early career' researchers in English higher education institutions. Innovation: European Journal of Social Science Research, 18(3), 277–299.
Aguinis, H., Villamor, I., & Ramani, R. S. (2021). MTurk research: Review and recommendations. Journal of Management, 47(4), 823–837.
Alburez-Gutierrez, D., Zagheni, E., Aref, S., Gil-Clavel, S., Grow, A., & Negraia, D. V. (2019). Demography in the Digital Era: New Data Sources for Population Research. SocArXiv. https://doi.org/10.31235/osf.io/24jp7
Alburez-Gutierrez, D., Mason, C., & Zagheni, E. (2021). The 'sandwich generation' revisited: Global demographic drivers of care time demands. Population and Development Review. https://doi.org/10.1111/padr.12436
Alessi, E. J., & Martin, J. I. (2010). Conducting an internet-based survey: Benefits, pitfalls, and lessons learned. Social Work Research, 34(2), 122–128.
Alexander, M., Polimis, K., & Zagheni, E. (2020). Combining social media and survey data to nowcast migrant stocks in the United States. Population Research and Policy Review, 41, 1–28.
Allcott, H., Braghieri, L., Eichmeyer, S., & Gentzkow, M. (2019). The Welfare Effects of Social Media. SSRN Scholarly Paper ID 3308640. https://doi.org/10.2139/ssrn.3308640
Aman, V. (2020). Transfer of knowledge through international scientific mobility: Introduction of a network-based bibliometric approach to study different knowledge types. Quantitative Science Studies, 1(2), 565–581.
Ananda, A., & Bol, D. (2021). Does knowing democracy affect answers to democratic support questions? A survey experiment in Indonesia. International Journal of Public Opinion Research, 33(2), 433–443.
Andreassen, C. S. (2015). Online social network site addiction: A comprehensive review. Current Addiction Reports, 2(2), 175–184.
Aparicio Diaz, B. A., Fent, T., Prskawetz, A., & Bernardi, L. (2011). Transition to parenthood: The role of social interaction and endogenous networks. Demography, 48(2), 559–579.
Armstrong, C., Poorthuis, A., Zook, M., Ruths, D., & Soehl, T. (2021). Challenges when identifying migration from geo-located Twitter data. EPJ Data Science, 10(1). https://doi.org/10.1140/epjds/s13688-020-00254-7
Arpino, B., Le Moglie, M., & Mencarini, L. (2022). What tears couples apart: A machine learning analysis of union dissolution in Germany. Demography, 59(1), 161–186.
Barban, N., & Sironi, M. (2019). Sequence analysis as a tool for family demography. In R. Schoen (Ed.), Analytical Family Demography (pp. 101–123). Springer International Publishing.
Barber, J. S., & Axinn, W. G. (2004). New ideas and fertility limitation: The role of mass media. Journal of Marriage and Family, 66(5), 1180–1200.
Baruffaldi, S. H., & Landoni, P. (2012). Return mobility and scientific productivity of researchers working abroad: The role of home country linkages. Research Policy, 41(9), 1655–1665.
Beck, F., Richard, J.-B., Nguyen-Thanh, V., Montagni, I., Parizot, I., & Renahy, E. (2014). Use of the internet as a health information resource among French young adults: Results from a nationally representative survey. Journal of Medical Internet Research, 16(5), e2934.
Bekalu, M. A., McCloud, R. F., & Viswanath, K. (2019). Association of social media use with social well-being, positive mental health, and self-rated health: Disentangling routine use from emotional connection to use. Health Education & Behavior, 46(2, suppl), 69S–80S.
Bellou, A. (2015). The impact of internet diffusion on marriage rates: Evidence from the broadband market. Journal of Population Economics, 28(2), 265–297.
Bennetts, S. K., Hokke, S., Crawford, S., Hackworth, N. J., Leach, L. S., Nguyen, C., Nicholson, J. M., & Cooklin, A. R. (2019). Using paid and free Facebook methods to recruit Australian parents to an online survey: An evaluation. Journal of Medical Internet Research, 21(3), e11206.
Bernstein, S., Diamond, R., McQuade, T., & Pousada, B. (2018). The contribution of high-skilled immigrants to innovation in the United States. Stanford Graduate School of Business Working Paper, 3748, 202019–202020.
Beyens, I., Pouwels, J. L., van Driel, I. I., Keijsers, L., & Valkenburg, P. M. (2020). The effect of social media on well-being differs from adolescent to adolescent. Scientific Reports, 10(1), 10763.
Bicalho, C., Platas, M. R., & Rosenzweig, L. R. (2021). 'If we move, it moves with us': Physical distancing in Africa during COVID-19. World Development, 142, 105379.
Bijak, J., Hilton, J., Silverman, E., & Cao, V. D. (2013). Reforging the wedding ring: Exploring a semi-artificial model of population for the United Kingdom with Gaussian process emulators. Demographic Research, 29, 729–766.
Biljecki, F., Chew, L. Z. X., Milojevic-Dupont, N., & Creutzig, F. (2021). Open government geospatial data on buildings for planning sustainable and resilient cities. http://arxiv.org/abs/2107.04023
Billari, F. C. (2001). Sequence analysis in demographic research. Canadian Studies in Population, 28(2), 439–458.
Billari, F. C., & Piccarreta, R. (2005). Analyzing demographic life courses through sequence analysis. Mathematical Population Studies, 12(2), 81–106.
Billari, F. C., Fürnkranz, J., & Prskawetz, A. (2006). Timing, sequencing, and quantum of life course events: A machine learning approach. European Journal of Population/Revue Européenne de Démographie, 22(1), 37–65.
Billari, F. C., Prskawetz, A., Aparicio Diaz, B., & Fent, T. (2007). The 'wedding-ring': An agent-based marriage model based on social interaction. Demographic Research, 17, 59–82.
Billari, F. C., Giuntella, O., & Stella, L. (2018). Broadband internet, digital temptations, and sleep. Journal of Economic Behavior & Organization, 153, 58–76.
Billari, F. C., Giuntella, O., & Stella, L. (2019). Does broadband internet affect fertility? Population Studies, 73(3), 297–316.
Billari, F. C., Rotondi, V., & Trinitapoli, J. (2020). Mobile phones, digital inequality, and fertility: Longitudinal evidence from Malawi. Demographic Research, 42, 1057–1096.
Bittman, M., Brown, J. E., & Wajcman, J. (2009). The mobile phone, perpetual contact and time pressure. Work, Employment and Society, 23(4), 673–691.
Boas, T. C., Christenson, D. P., & Glick, D. M. (2020). Recruiting large online samples in the United States and India: Facebook, Mechanical Turk, and Qualtrics. Political Science Research and Methods, 8(2), 232–250.
Boer, M., Stevens, G., Finkenauer, C., & Eijnden, R. (2020). Attention deficit hyperactivity disorder – symptoms, social media use intensity, and social media use problems in adolescents: Investigating directionality. Child Development, 91(4). https://doi.org/10.1111/cdev.13334
Booth, H. (2006). Demographic forecasting: 1980 to 2005 in review. International Journal of Forecasting, 22(3), 547–581.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Broussard, M. (2018). Artificial Unintelligence: How Computers Misunderstand the World. MIT Press.
Burke, R. J., & Cooper, C. L. (Eds) (2008). The Long Work Hours Culture: Causes, Consequences and Choices. Emerald.
Burrell, J., & Anderson, K. (2008). 'I have great desires to look beyond my world': Trajectories of information and communication technology use among Ghanaians living abroad. New Media & Society, 10(2), 203–224.
Cañibano, C., D'Este, P., Otamendi, F. J., & Woolley, R. (2020). Scientific careers and the mobility of European researchers: An analysis of international mobility by career stage. Higher Education, 80(6), 1175–1193.
Cesare, N., Lee, H., McCormick, T., Spiro, E., & Zagheni, E. (2018). Promises and pitfalls of using digital traces for demographic research. Demography, 55(5), 1979–1999.
Chandler, J., Rosenzweig, C., Moss, A. J., Robinson, J., & Litman, L. (2019). Online panels in social science research: Expanding sampling methods beyond Mechanical Turk. Behavior Research Methods, 51(5), 2022–2038.
Cheney-Lippold, J. (2011). A new algorithmic identity: Soft biopolitics and the modulation of control. Theory, Culture & Society, 28(6), 164–181.
Chesley, N. (2005). Blurring boundaries? Linking technology use, spillover, individual distress, and family satisfaction. Journal of Marriage and Family, 67(5), 1237–1248.
Chinchilla-Rodríguez, Z., Bu, Y., Robinson-García, N., Costas, R., & Sugimoto, C. R. (2018a). Travel bans and scientific mobility: Utility of asymmetry and affinity indexes to inform science policy. Scientometrics, 116(1), 569–590.
Chinchilla-Rodríguez, Z., Miao, L., Murray, D., Robinson-García, N., Costas, R., & Sugimoto, C. R. (2018b). A global comparison of scientific mobility and collaboration according to national scientific capacities. Frontiers in Research Metrics and Analytics, 3. https://doi.org/10.3389/frma.2018.00017
Chunara, R., & Cook, S. H. (2020). Using digital data to protect and promote the most vulnerable in the fight against COVID-19. Frontiers in Public Health, 8, 296.
Chung, H., Birkett, H., Forbes, S., & Seo, H. (2021). COVID-19, flexible working, and implications for gender equality in the United Kingdom. Gender & Society, 35(2), 218–232.
Cohen, R., Fardouly, J., Newton-John, T., & Slater, A. (2019). #BoPo on Instagram: An experimental investigation of the effects of viewing body positive content on young women's mood and body image. New Media & Society, 21(7), 1546–1564.
Cole, J. R., & Zuckerman, H. (1987). Marriage, motherhood and research performance in science. Scientific American, 256(2), 119–125.
Cooke, T. J., & Shuttleworth, I. (2017). Migration and the internet. Migration Letters, 14(3), 331–342.
Cornwell, B., Gershuny, J., & Sullivan, O. (2019). The social structure of time: Emerging trends and new directions. Annual Review of Sociology, 45(1), 301–320.
Coyne, S. M., Rogers, A. A., Zurcher, J. D., Stockdale, L., & Booth, M. (2020). Does time spent using social media impact mental health? An eight year longitudinal study. Computers in Human Behavior, 104, 106160.
Czaika, M., & Orazbayev, S. (2018). The globalisation of scientific mobility, 1970–2014. Applied Geography, 96, 1–10.
Czaika, M., & Parsons, C. R. (2017). The gravity of high-skilled migration policies. Demography, 54(2), 603–630.
D'Ignazio, C., & Klein, L. (2020). Data Feminism. MIT Press.
Daly, L. M., Horey, D., Middleton, P. F., Boyle, F. M., & Flenady, V. (2017). The effect of mobile application interventions on influencing healthy maternal behaviour and improving perinatal health outcomes: A systematic review protocol. Systematic Reviews, 6(1), 1–8.
Danielsbacka, M., Tanskanen, A. O., & Billari, F. C. (2019). Who meets online? Personality traits and sociodemographic characteristics associated with online partnering in Germany. Personality and Individual Differences, 143, 139–144.
Davis, J. L., & Jurgenson, N. (2014). Context collapse: Theorizing context collusions and collisions. Information, Communication & Society, 17(4), 476–485.
Dekker, R., & Engbersen, G. (2014). How social media transform migrant networks and facilitate migration. Global Networks, 14(4), 401–418.
Dekker, R., Engbersen, G., & Faber, M. (2016). The use of online media in migration networks. Population, Space and Place, 22(6), 539–551.
Dettling, L. J. (2017). Broadband in the labor market: The impact of residential high-speed internet on married women's labor force participation. ILR Review, 70(2), 451–482.
Difallah, D. E., Catasta, M., Demartini, G., Ipeirotis, P. G., & Cudré-Mauroux, P. (2015). The dynamics of micro-task crowdsourcing: The case of Amazon MTurk. Proceedings of the 24th International Conference on World Wide Web, 238–247.
Difallah, D. E., Filatova, E., & Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 135–143.
Dooley, C., Boo, G., Leasure, D., Tatem, A., Bondarenko, M., & WorldPop. (2020). Gridded Maps of Building Patterns throughout Sub-Saharan Africa, Version 1.1. Data set, University of Southampton. https://doi.org/10.5258/SOTON/WP00677
Dooley, J. J., Pyżalski, J., & Cross, D. (2009). Cyberbullying versus face-to-face bullying: A theoretical and conceptual review. Zeitschrift Für Psychologie/Journal of Psychology, 217(4), 182–188.
Downes, M., Gurrin, L. C., English, D. R., Pirkis, J., Currier, D., Spittal, M. J., & Carlin, J. B. (2018). Multilevel regression and poststratification: A modeling approach to estimating population quantities from highly selected survey samples. American Journal of Epidemiology, 187(8), 1780–1790.
Dyson, M. P., Hartling, L., Shulhan, J., Chisholm, A., Milne, A., Sundar, P., Scott, S. D., & Newton, A. S. (2016). A systematic review of social media use to discuss and view deliberate self-harm acts. PLOS ONE, 11(5), e0155813.
Edelmann, A., Wolff, T., Montagne, D., & Bail, C. A. (2020). Computational social science and sociology. Annual Review of Sociology, 46(1), 61–81.
Elvidge, C., Tuttle, B., Sutton, P., Baugh, K., Howard, A., Milesi, C., Bhaduri, B., & Nemani, R. (2007). Global distribution and density of constructed impervious surfaces. Sensors, 7(9), 1962–1979.
Ernala, S. K., Burke, M., Leavitt, A., & Ellison, N. B. (2020). How well do people report time spent on Facebook? An evaluation of established survey questions with recommendations. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–14. https://doi.org/10.1145/3313831.3376435
Esch, T., Brzoska, E., Dech, S., Leutner, B., Palacios-Lopez, D., Metz-Marconcini, A., Marconcini, M., Roth, A., & Zeidler, J. (2022). World settlement footprint 3D: A first three-dimensional survey of the global building stock. Remote Sensing of Environment, 270, 112877.
Euromonitor International (2017). Single-person households will become a major consumption group. https://blog.euromonitor.com/households-2030-singletons/
Evans, J. R., & Mathur, A. (2018). The value of online surveys: A look back and a look ahead. Internet Research. https://doi.org/10.1108/IntR-03-2018-0089
Eysenbach, G., & Wyatt, J. (2002). Using the internet for surveys and health research. Journal of Medical Internet Research, 4(2), e862.
Facebook Inc. (2021). Facebook reports second quarter 2021 results. https://investor.fb.com/investor-news/press-release-details/2021/facebook-reports-second-quarter-2021-results/default.aspx
Fatehkia, M., Kashyap, R., & Weber, I. (2018). Using Facebook ad data to track the global digital gender gap. World Development, 107, 189–209.
Feehan, D. M., & Cobb, C. (2019). Using an online sample to estimate the size of an offline population. Demography, 56(6), 2377–2392.
Fenton, S., Carter, J., & Modood, T. (2000). Ethnicity and academia: Closure models, racism models and market models. Sociological Research Online, 5(2), 116–134.
Ferdous, H. S., Ploderer, B., Davis, H., Vetere, F., & O'Hara, K. (2016). Commensality and the social use of technology during family mealtime. ACM Transactions on Computer-Human Interaction, 23(6), 1–26.
Fiesler, C., & Proferes, N. (2018). 'Participant' perceptions of Twitter research ethics. Social Media + Society, 4(1), 205630511876336.
Fiesler, C., Beard, N., & Keegan, B. C. (2020). No robots, spiders, or scrapers: Legal and ethical regulation of data collection methods in social media terms of service. Proceedings of the Fourteenth International AAAI Conference on Web and Social Media, 10.
Filippova, A., Gilroy, C., Kashyap, R., Kirchner, A., Morgan, A. C., Polimis, K., Usmani, A., & Wang, T. (2019). Humans in the loop: Incorporating expert and crowd-sourced knowledge for predictions using survey data. Socius: Sociological Research for a Dynamic World, 5, 2378023118820157.
Fiorio, L., Zagheni, E., Abel, G., Hill, J., Pestre, G., Letouzé, E., & Cai, J. (2021). Analyzing the effect of time in migration measurement using georeferenced digital trace data. Demography, 58(1), 51–74.
Fish, J. N., Salerno, J., Williams, N. D., Rinderknecht, R. G., Drotning, K. J., Sayer, L., & Doan, L. (2021). Sexual minority disparities in health and well-being as a consequence of the COVID-19 pandemic differ by sexual identity. LGBT Health, 8(4), 263–272.
Florczyk, A. J., Corbane, C., Ehrlich, D., Freire, S., Kemper, T., Maffenini, L., Melchiorri, M., Pesaresi, M., Politis, P., Schiavina, M., Sabo, F., & Zanchetta, L. (2019). GHSL Data Package 2019. Publications Office of the European Union. https://data.europa.eu/doi/10.2760/0726
Franzoni, C., Scellato, G., & Stephan, P. (2014). The mover's advantage: The superior performance of migrant scientists. Economics Letters, 122(1), 89–93.
Freelon, D. (2018). Computational research in the post-API age. Political Communication, 35(4), 665–668.
Fulkerson, J. A., Larson, N., Horning, M., & Neumark-Sztainer, D. (2014). A review of associations between family or shared meal frequency and dietary and weight status outcomes across the lifespan. Journal of Nutrition Education and Behavior, 46(1), 2–19.
Geisen, E., & Murphy, J. (2020). A compendium of web and mobile survey pretesting methods. In P. Beatty, D. Collins, L. Kaye, J. L. Padilla, G. Willis, & A. Wilmot (Eds), Advances in Questionnaire Design, Development, Evaluation and Testing (pp. 287–314). Wiley.
Gelman, A., Lee, D., & Guo, J. (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, 40(5), 530–543.
Geofabrik GmbH (2018). OpenStreetMaps data extracts. https://download.geofabrik.de
Gershuny, J., & Harms, T. A. (2016). Housework now takes much less time: 85 years of US rural women's time use. Social Forces, 95(2), 503–524.
Gershuny, J., & Sullivan, O. (2019). What We Really Do All Day. Penguin.
Gershuny, J., Harms, T., Doherty, A., Thomas, E., Milton, K., Kelly, P., & Foster, C. (2020). Testing self-report time-use diaries against objective instruments in real time. Sociological Methodology, 50(1), 318–349.
Gil-Clavel, S., & Zagheni, E. (2019). Demographic differentials in Facebook usage around the world. Proceedings of the International AAAI Conference on Web and Social Media, 13, 647–650.
Gil-Clavel, S., Zagheni, E., & Bordone, V. (2020). Close social networks among older adults: The online and offline perspectives. MPIDR Working Paper, 26. https://doi.org/10.4054/MPIDR-WP-2020-035
Gil-Clavel, S., Grow, A., & Bijlsma, M. J. (2022). Analyzing EU-15 immigrants' language acquisition using Twitter data. SocArXiv. https://doi.org/10.31235/osf.io/bs4hk
Gleibs, I. H. (2017). Are all 'research fields' equal? Rethinking practice for the use of data from crowdsourcing market places. Behavior Research Methods, 49(4), 1333–1342.
Google (2021). Google Open Buildings. https://sites.research.google/open-buildings/
Gradisar, M., Wolfson, A. R., Harvey, A. G., Hale, L., Rosenberg, R., & Czeisler, C. A. (2013). The sleep and technology use of Americans: Findings from the National Sleep Foundation's 2011 Sleep in America poll. Journal of Clinical Sleep Medicine, 9(12), 1291–1299.
Grevet, C., Tang, A., & Mynatt, E. (2012). Eating alone, together: New forms of commensality. Proceedings of the 17th ACM International Conference on Supporting Group Work – GROUP '12, 103–106.
Griffiths, M. D. (2013). Social networking addiction: Emerging themes and issues. Journal of Addiction Research & Therapy, 4(5). https://doi.org/10.4172/2155-6105.1000e118
Grigorieff, A., Roth, C., & Ubfal, D. (2020). Does information change attitudes toward immigrants? Demography, 57(3), 1117–1143.
Grow, A., & Van Bavel, J. (2015). Assortative mating and the reversal of gender inequality in education in Europe: An agent-based model. PLOS ONE, 10(6), e0127806.
Grow, A., & Van Bavel, J. (2018). Agent-Based Modeling. John Wiley & Sons.
Grow, A., Schnor, C., & Van Bavel, J. (2017). The reversal of the gender gap in education and relative divorce risks: A matter of alternatives in partner choice? Population Studies, 71(sup1), 15–34.
Grow, A., Perrotta, D., Fava, E. D., Cimentada, J., Rampazzo, F., Gil-Clavel, S., & Zagheni, E. (2020). Addressing public health emergencies via Facebook surveys: Advantages, challenges, and practical considerations. Journal of Medical Internet Research, 22(12), e20653.
Guerrisi, C., Turbelin, C., Blanchon, T., Hanslik, T., Bonmarin, I., Levy-Bruhl, D., Perrotta, D., Paolotti, D., Smallenburg, R., Koppeschaar, C., Franco, A. O., Mexia, R., Edmunds, W. J., Sile, B., Pebody, R., van Straten, E., Meloni, S., Moreno, Y., Duggan, J., … Colizza, V. (2016). Participatory syndromic surveillance of influenza in Europe. Journal of Infectious Diseases, 214(suppl. 4), S386–S392.
Guillory, J., Wiant, K. F., Farrelly, M., Fiacco, L., Alam, I., Hoffman, L., Crankshaw, E., Delahanty, J., & Alexander, T. N. (2018). Recruiting hard-to-reach populations for survey research: Using Facebook and Instagram advertisements and in-person intercept in LGBT bars and nightclubs to recruit LGBT young adults. Journal of Medical Internet Research, 20(6), e197.
Guldi, M., & Herbst, C. M. (2017). Offline effects of online connecting: The impact of broadband diffusion on teen fertility decisions. Journal of Population Economics, 30(1), 69–91.
Hall, C. S., Fottrell, E., Wilkinson, S., & Byass, P. (2014). Assessing the impact of mHealth interventions in low- and middle-income countries – what has been shown to work? Global Health Action, 7(1), 25606.
Hallyburton, A., & Evarts, L. A. (2014). Gender and online health information seeking: A five survey meta-analysis. Journal of Consumer Health on the Internet, 18(2), 128–142.
Hammel, E., Hutchinson, D., Wachter, K., Lundy, R., & Deuel, R. (1976). The SOCSIM Demographic-Sociological Microsimulation Program: Operating Manual.
Hargittai, E. (2020). Potential biases in big data: Omitted voices on social media. Social Science Computer Review, 38(1), 10–24.
Hauser, D. J., & Schwarz, N. (2015). It's a trap! Instructional manipulation checks prompt systematic thinking on 'tricky' tasks. SAGE Open, 5(2), 215824401558461.
Hauser, D. J., & Schwarz, N. (2016). Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods, 48(1), 400–407.
Heidrich, F., Roecker, C., Kasugai, K., Russell, P., & Ziefle, M. (2012). roomXT: Advanced video communication for joint dining over a distance. Proceedings of the 6th International Conference on Pervasive Computing Technologies for Healthcare. https://doi.org/10.4108/icst.pervasivehealth.2012.248679
Heris, M. P., Foks, N. L., Bagstad, K. J., Troy, A., & Ancona, Z. H. (2020). A rasterized building footprint dataset for the United States. Scientific Data, 7(1), 1–10.
Hertog, E., Fukuda, S., Matsukura, R., Nagase, N., & Lehdonvirta, V. (2021). The future of unpaid work: Simulating the effects of automation on time spent on housework and care work in the UK and Japan. 33rd Annual Meeting, SASE.
Hiller, H. H., & Franz, T. M. (2004). New ties, old ties and lost ties: The use of the internet in diaspora. New Media & Society, 6(6), 731–752.
Hiniker, A., Schoenebeck, S. Y., & Kientz, J. A. (2016). Not at the dinner table: Parents' and children's perspectives on family technology rules. Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 1376–1389.
Hofman, J. M., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., Margetts, H., Mullainathan, S., Salganik, M. J., Vazire, S., Vespignani, A., & Yarkoni, T. (2021). Integrating explanation and prediction in computational social science. Nature. https://doi.org/10.1038/s41586-021-03659-0
Hunt, M. G., Marx, R., Lipson, C., & Young, J. (2018). No more FOMO: Limiting social media decreases loneliness and depression. Journal of Social and Clinical Psychology, 37(10), 751–768.
ICF International (2013). Incorporating Geographic Information into Demographic and Health Surveys: A Field Guide to GPS Data Collection. ICF International.
Ippedico, G. (2021, March 25). Can Tax Incentives Bring Brains Back? High Skill Migration and the Effects of Returnees' Tax Schemes. Global Migration Center. https://globalmigration.ucdavis.edu/events/can-tax-incentives-bring-brains-back-high-skill-migration-and-effects-returnees-tax-schemes
ITU (2020). Measuring digital development: Facts and figures 2020. https://www.itu.int/en/ITU-D/Statistics/Dashboards/Pages/IFF.aspx
Jacobs, W., Amuta, A. O., & Jeon, K. C. (2017). Health information seeking in the digital age: An analysis of health information seeking behavior among US adults. Cogent Social Sciences, 3(1). https://doi.org/10.1080/23311886.2017.1302785
Jochem, W. C., Leasure, D. R., Pannell, O., Chamberlain, H. R., Jones, P., & Tatem, A. J. (2021). Classifying settlement types from multi-scale spatial patterns of building footprints. Environment and Planning B: Urban Analytics and City Science, 48(5), 1161–1179.
Juster, F. T., Ono, H., & Stafford, F. P. (2003). An assessment of alternative measures of time use. Sociological Methodology, 33(1), 19–54.
Kashyap, R. (2021). Has demography witnessed a data revolution? Promises and pitfalls of a changing data ecosystem. Population Studies, 75(sup. 1), 47–75.
Kashyap, R., & Verkroost, F. C. J. (2021). Analysing global professional gender gaps using LinkedIn advertising data. EPJ Data Science, 10(1), 39.
Kashyap, R., & Villavicencio, F. (2016). The dynamics of son preference, technology diffusion, and fertility decline underlying distorted sex ratios at birth: A simulation approach. Demography, 53(5), 1261–1281.
Kashyap, R., Fatehkia, M., Tamime, R. A., & Weber, I. (2020). Monitoring global digital gender inequality using the online populations of Facebook and Google. Demographic Research, 43, 779–816.
Kearney, M. S., & Levine, P. B. (2015). Media influences on social outcomes: The impact of MTV's 16 and Pregnant on teen childbearing. American Economic Review, 105(12), 3597–3632.
Kennedy, R., Clifford, S., Burleigh, T., Waggoner, P. D., Jewell, R., & Winter, N. J. G. (2020). The shape of and solutions to the MTurk quality crisis. Political Science Research and Methods, 8(4), 614–629.
Kim, Y.-J., Jang, H., Lee, Y., Lee, D., & Kim, D.-J. (2018). Effects of internet and smartphone addictions on depression and anxiety based on propensity score matching analysis. International Journal of Environmental Research and Public Health, 15(5), 859.
King, R., & Skeldon, R. (2010). 'Mind the gap!' Integrating approaches to internal and international migration. Journal of Ethnic and Migration Studies, 36(10), 1619–1646.
Klabunde, A., & Willekens, F. (2016). Decision-making in agent-based models of migration: State of the art and challenges. European Journal of Population, 32(1), 73–97.
Ko, C.-H., Yen, J.-Y., Yen, C.-F., Chen, C.-S., & Chen, C.-C. (2012). The association between internet addiction and psychiatric disorder: A review of the literature. European Psychiatry, 27(1), 1–8.
Komito, L. (2011). Social media and migration: Virtual community 2.0. Journal of the American Society for Information Science and Technology, 62(6), 1075–1086.
Kühne, S., & Zindel, Z. (2020). Using Facebook and Instagram to recruit web survey participants: A step-by-step guide and application. Survey Methods: Insights from the Field. https://doi.org/10.13094/SMIF-2020-00017
La Ferrara, E., Chong, A., & Duryea, S. (2012). Soap operas and fertility: Evidence from Brazil. American Economic Journal: Applied Economics, 4(4), 1–31.
Laudel, G. (2003). Studying the brain drain: Can bibliometric methods help? Scientometrics, 57(2), 215–237.
Lazer, D., & Radford, J. (2017). Data ex machina: Introduction to big data. Annual Review of Sociology, 43(1), 19–39.
Lazer, D., Hargittai, E., Freelon, D., Gonzalez-Bailon, S., Munger, K., Ognyanova, K., & Radford, J. (2021). Meaningful measures of human society in the twenty-first century. Nature, 595(7866), 189–196. https://doi.org/10.1038/s41586-021-03660-7
Leasure, D. R., Jochem, W. C., Weber, E. M., Seaman, V., & Tatem, A. J. (2020). National population mapping from sparse survey data: A hierarchical Bayesian modeling framework to account for uncertainty. Proceedings of the National Academy of Sciences, 117(39), 24173–24179.
Lebergott, S. (1993). Women's work: Home to market. In Pursuing Happiness (pp. 50–60). Princeton University Press.
Lemola, S., Perkinson-Gloor, N., Brand, S., Dewald-Kaufmann, J. F., & Grob, A. (2015). Adolescents' electronic media use at night, sleep disturbance, and depressive symptoms in the smartphone age. Journal of Youth and Adolescence, 44(2), 405–418.
Li, J., Theng, Y.-L., & Foo, S. (2016). Predictors of online health information seeking behavior: Changes between 2002 and 2012. Health Informatics Journal, 22(4), 804–814.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
Lindgren, F., & Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of Statistical Software, 63(19), 1–25. https://doi.org/10.18637/jss.v063.i19
Lloyd, C. T., Sorichetta, A., & Tatem, A. J. (2017). High resolution global gridded data for use in population studies. Scientific Data, 4(1), 1–17. https://doi.org/10.1038/sdata.2017.1
Loh, J. M. I., & Walsh, M. J. (2021). Social media context collapse: The consequential differences between context collusion versus context collision. Social Media + Society, 7(3), 20563051211041650. https://doi.org/10.1177/20563051211041646
Lohmann, S. (2015). Information technologies and subjective well-being: Does the internet raise material aspirations? Oxford Economic Papers, 67(3), 740–759. https://doi.org/10.1093/oep/gpv032
Lohmann, S., & Zagheni, E. (2021). Multi-platform social media use: Little evidence of impacts on adult well-being. Preprint at PsyArXiv. https://doi.org/10.31234/osf.io/r46nd
Lund, S., Nielsen, B. B., Hemed, M., Boas, I. M., Said, A., Said, K., Makungu, M. H., & Rasch, V. (2014a). Mobile phones improve antenatal care attendance in Zanzibar: A cluster randomized controlled trial. BMC Pregnancy and Childbirth, 14(29), 1–10. https://doi.org/10.1186/1471-2393-14-29
Lund, S., Rasch, V., Hemed, M., Boas, I. M., Said, A., Said, K., Makundu, M. H., & Nielsen, B. B. (2014b). Mobile phone intervention reduces perinatal mortality in Zanzibar: Secondary outcomes of a cluster randomized controlled trial. JMIR MHealth and UHealth, 2(1), e2941. https://doi.org/10.2196/mhealth.2941
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. ArXiv. https://arxiv.org/abs/1705.07874
Lundberg, S. M., Nair, B., Vavilala, M. S., Horibe, M., Eisses, M. J., Adams, T., Liston, D. E., Low, D. K.-W., Newman, S.-F., Kim, J., et al. (2018). Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering, 2(10), 749–760.
Mascheroni, G., & Vincent, J. (2016). Perpetual contact as a communicative affordance: Opportunities, constraints, and emotions. Mobile Media & Communication, 4(3), 310–326.
Massey, D. S., Arango, J., Hugo, G., Kouaouci, A., Pellegrino, A., & Taylor, J. E. (1993). Theories of international migration: A review and appraisal. Population and Development Review, 19(3), 431.
McCaffrey, D. F., Griffin, B. A., Almirall, D., Slaughter, M. E., Ramchand, R., & Burgette, L. F. (2013). A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Statistics in Medicine, 32(19), 3388–3414.
McCarthy, O. L., Zghayyer, H., Stavridis, A., Adada, S., Ahamed, I., Leurent, B., Edwards, P., Palmer, M., & Free, C. (2019). A randomized controlled trial of an intervention delivered by mobile phone text message to increase the acceptability of effective contraception among young women in Palestine. Trials, 20(1), 1–13.
McColl, D., & Nejat, G. (2013). Meal-time with a socially assistive robot and older adults at a long-term care facility. Journal of Human-Robot Interaction, 2(1), 152–171.
McCredie, M. N., & Morey, L. C. (2018). Who are the Turkers? A characterization of MTurk workers using the personality assessment inventory. Assessment, 26(5), 759–766.
Melchiorri, M., Florczyk, A., Freire, S., Schiavina, M., Pesaresi, M., & Kemper, T. (2018). Unveiling 25 years of planetary urbanization with remote sensing: Perspectives from the global human settlement layer. Remote Sensing, 10(5), 768.
Microsoft (2021). Building footprints: An AI-assisted mapping deliverable with the capability to solve for many scenarios. www.microsoft.com/en-us/maps/building-footprints
Miller, L. M. S., & Bell, R. A. (2012). Online health information seeking: The influence of age, information trustworthiness, and search challenges. Journal of Aging and Health, 24(3), 525–541.
Minnesota Population Center (2020). Integrated public use microdata series, International: Version 7.3 (7.3). Data set. Minneapolis, MN: IPUMS.
Miranda-González, A., Aref, S., Theile, T., & Zagheni, E. (2020). Scholarly migration within Mexico: Analyzing internal migration among researchers using Scopus longitudinal bibliometric data. EPJ Data Science, 9(1), 34.
Mitullah, W., Samson, R., Wambua, P., & Balongo, S. (2016). Building on progress: Infrastructure development still a major challenge in Africa (Dispatch 69). Afrobarometer. https://afrobarometer.org/publications/ad69-building-progress-infrastructure-development-still-major-challenge-africa
Moed, H. F., Aisati, M., & Plume, A. (2013). Studying scientific migration in Scopus. Scientometrics, 94(3), 929–942.
Moharana, S., Panduro, A. E., Lee, H. R., & Riek, L. D. (2019). Robots for joy, robots for sorrow: Community based robot design for dementia caregivers. 2019 14th ACM/IEEE International Conference on Human–Robot Interaction, 458–467.
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's firehose. Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media, 9.
Mullan, K., & Chatzitheochari, S. (2019). Changing times together? A time-diary analysis of family time in the digital age in the United Kingdom. Journal of Marriage and Family, 81(4), 795–811.
Murphy, M. (2011). Long-term effects of the demographic transition on family and kinship networks in Britain. Population and Development Review, 37, 55–80.
Muto, M. (2012). The impacts of mobile phones and personal networks on rural-to-urban migration: Evidence from Uganda. Journal of African Economies, 21(5), 787–807.
Negraia, D. V., Augustine, J. M., & Prickett, K. C. (2018). Gender disparities in parenting time across activities, child ages, and educational groups. Journal of Family Issues, 39(11), 3006–3028.
Neundorf, A., & Öztürk, A. (2021). Recruiting research participants through Facebook: Assessing Facebook advertisement tools. Open Science Framework. https://doi.org/10.31219/osf.io/3g74n
Nigri, A., Levantesi, S., Marino, M., Scognamiglio, S., & Perla, F. (2019). A deep learning integrated Lee–Carter model. Risks, 7(1), 1–16.
Nigri, A., Levantesi, S., & Marino, M. (2021). Life expectancy and lifespan disparity forecasting: A long short-term memory approach. Scandinavian Actuarial Journal, 2, 110–133.
Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. New York University Press.
Ogletree, A. M., & Katz, B. (2020). How do older adults recruited using MTurk differ from those in a national probability sample? International Journal of Aging and Human Development, 93(2), 700–721.
Oiarzabal, P. J., & Reips, U.-D. (2012). Migration and diaspora in the age of information and communication technologies. Journal of Ethnic and Migration Studies, 38(9), 1333–1338.
Olamijuwon, E., & Odimegwu, C. (2021). Sexuality education in the digital age: Modelling the predictors of acceptance and behavioural intention to access and interact with sexuality information on social media. Sexuality Research and Social Policy, 1–14.
Olson, J. A., Sandra, D. A., Colucci, É. S., Al Bikaii, A., Chmoulevitch, D., Nahas, J., Raz, A., & Veissière, S. P. L. (2022). Smartphone addiction is increasing across the world: A meta-analysis of 24 countries. Computers in Human Behavior, 129, 107138.
Open Street Maps (2021). www.openstreetmaps.org
Orben, A., & Przybylski, A. K. (2019). The association between adolescent well-being and digital technology use. Nature Human Behaviour, 3(2), 173–182.
Orben, A., Dienlin, T., & Przybylski, A. K. (2019). Social media's enduring effect on adolescent life satisfaction. Proceedings of the National Academy of Sciences, 116(21), 10226–10228.
Palan, S., & Schitter, C. (2018). Prolific.ac: A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17, 22–27.
Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a participant pool. Current Directions in Psychological Science, 23(3), 184–188.
Park, J., Wood, I. B., Jing, E., Nematzadeh, A., Ghosh, S., Conover, M. D., & Ahn, Y.-Y. (2019). Global labor flow network reveals the hierarchical organization and dynamics of geo-industrial clusters. Nature Communications, 10(1), 3449.
Peer, E., Vosgerau, J., & Acquisti, A. (2014). Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods, 46(4), 1023–1031.
Peer, E., Brandimarte, L., Samat, S., & Acquisti, A. (2017). Beyond the Turk: Alternative platforms for crowdsourcing behavioral research. Journal of Experimental Social Psychology, 70, 153–163.
Perrotta, D., Grow, A., Rampazzo, F., Cimentada, J., Del Fava, E., Gil-Clavel, S., & Zagheni, E. (2021). Behaviours and attitudes in response to the COVID-19 pandemic: Insights from a cross-national Facebook survey. EPJ Data Science, 10(1), 17.
Pesando, L. M., Rotondi, V., Stranges, M., Kashyap, R., & Billari, F. C. (2021). The internetization of international migration. Population and Development Review, 47(1), 79–111.
Pew Research Center (2018, May 31). Teens, social media and technology 2018. www.pewresearch.org/internet/2018/05/31/teens-social-media-technology-2018/
Pham, K. H., Rampazzo, F., & Rosenzweig, L. R. (2019). Online surveys and digital demography in the developing world: Facebook users in Kenya. https://arxiv.org/abs/1910.03448v1
Piccarreta, R., & Billari, F. C. (2007). Clustering work and family trajectories by using a divisive algorithm. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(4), 1061–1078.
Poole, D., & Raftery, A. E. (2000). Inference for deterministic simulation models: The Bayesian melding approach. Journal of the American Statistical Association, 95(452), 1244–1255.
Potančoková, M., & Marois, G. (2020). Projecting future births with fertility differentials reflecting women's educational and migrant characteristics. Vienna Yearbook of Population Research, 18, 141–166.
Potarca, G. (2017). Does the internet affect assortative mating? Evidence from the US and Germany. Social Science Research, 61, 278–297.
Potarca, G. (2020). The demography of swiping right. An overview of couples who met through dating apps in Switzerland. PLOS ONE, 15(12), e0243733.
Potarca, G. (2021). Online dating is shifting educational inequalities in marriage formation in Germany. Demography, 58(5), 1977–2007.
Pötzschke, S., & Braun, M. (2017). Migrant sampling using Facebook advertisements: A case study of Polish migrants in four European countries. Social Science Computer Review, 35(5), 633–653.
Puukko, K., Hietajärvi, L., Maksniemi, E., Alho, K., & Salmela-Aro, K. (2020). Social media use and depressive symptoms: A longitudinal study from early to late adolescence. International Journal of Environmental Research and Public Health, 17(16), 5921.
Raftery, A. E., Givens, G. H., & Zeh, J. E. (1995). Inference from a deterministic population dynamics model for bowhead whales. Journal of the American Statistical Association, 90(430), 402–416.
Rainie, H., & Wellman, B. (2012). Networked: The New Social Operating System. MIT Press.
Rampazzo, F., & Weber, I. (2020). Facebook advertising data in Africa. Migration in West and North Africa and across the Mediterranean, 32.
Rampazzo, F., Zagheni, E., Weber, I., Testa, M. R., & Billari, F. (2018). Mater certa est, pater numquam: What can Facebook advertising data tell us about male fertility rates? Twelfth International AAAI Conference on Web and Social Media, June 15. www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/view/17891
Rampazzo, F., Bijak, J., Vitali, A., Weber, I., & Zagheni, E. (2021). A framework for estimating migrant stocks using digital traces and survey data: An application in the United Kingdom. Demography, 58(6), 2193–2218.
Rice, R. E., & Hagen, I. (2010). Young adults' perpetual contact, social connectivity, and social control through the internet and mobile phones. Annals of the International Communication Association, 34(1), 3–39.
Rosenfeld, M. J. (2017). Marriage, choice, and couplehood in the age of the internet. Sociological Science, 4, 490–510.
Rosenfeld, M. J., & Thomas, R. J. (2012). Searching for a mate: The rise of the internet as a social intermediary. American Sociological Review, 77(4), 523–547.
Rosenfeld, M. J., Thomas, R. J., & Hausen, S. (2019). Disintermediating your friends: How online dating in the United States displaces other ways of meeting. Proceedings of the National Academy of Sciences, 116(36), 17753–17758.
Rosenzweig, L. R., & Zhou, Y.-Y. (2021). Team and nation: Sports, nationalism, and attitudes toward refugees. Comparative Political Studies, 54(12), 2123–2154.
Rosenzweig, L. R., Bergquist, P., Pham, K. H., Rampazzo, F., & Mildenberger, M. (2020). Survey sampling in the Global South using Facebook advertisements. SocArXiv. https://doi.org/10.31235/osf.io/dka8f
Rotondi, V., Kashyap, R., Pesando, L. M., Spinelli, S., & Billari, F. C. (2020). Leveraging mobile phones to attain sustainable development. Proceedings of the National Academy of Sciences, 117(24), 13413–13420.
Salganik, M. J. (2018). Bit by Bit: Social Research in the Digital Age. Princeton University Press.
Salganik, M. J., Lundberg, I., Kindel, A. T., Ahearn, C. E., Al-Ghoneim, K., Almaatouq, A., Altschul, D. M., Brand, J. E., Carnegie, N. B., Compton, R. J., Datta, D., Davidson, T., Filippova, A., Gilroy, C., Goode, B. J., Jahani, E., Kashyap, R., Kirchner, A., McKay, S., … McLanahan, S. (2020). Measuring the predictability of life outcomes with a scientific mass collaboration. Proceedings of the National Academy of Sciences, 117(15), 8398–8403.
Salmela-Aro, K., Upadyaya, K., Hakkarainen, K., Lonka, K., & Alho, K. (2017). The dark side of internet use: Two longitudinal studies of excessive internet use, depressive symptoms, school burnout and engagement among Finnish early and late adolescents. Journal of Youth and Adolescence, 46(2), 343–357.
Salomon, J. A., Reinhart, A., Bilinski, A., Chua, E. J., La Motte-Kerr, W., Rönn, M. M., Reitsma, M., Morris, K. A., LaRocca, S., Farag, T., Kreuter, F., Rosenfeld, R., & Tibshirani, R. J. (2021). The US COVID-19 Trends and Impact Survey, 2020–2021: Continuous real-time measurement of COVID-19 symptoms, risks, protective behaviors, testing and vaccination. Proceedings of the National Academy of Sciences, 118(51), e2111454118.
Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K., & Müller, K.-R. (2019). Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer Nature.
Schaer, M., Jacot, C., & Dahinden, J. (2020). Transnational mobility networks and academic social capital among early-career academics: Beyond common-sense assumptions. Global Networks, 21(3), 585–607.
Schneider, D., & Harknett, K. (2019). Consequences of routine work-schedule instability for worker health and well-being. American Sociological Review, 84(1), 82–114.
Schonlau, M., van Soest, A., Kapteyn, A., & Couper, M. (2009). Selection bias in web surveys and the use of propensity scores. Sociological Methods & Research, 37(3), 291–318.
Sequiera, R., & Lin, J. (2017). Finally, a downloadable test collection of tweets. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1225–1228.
Ševčíková, H., Raftery, A. E., & Waddell, P. A. (2007). Assessing uncertainty in urban simulations using Bayesian melding. Transportation Research Part B: Methodological, 41(6), 652–669.
Shank, D. B. (2016). Using crowdsourcing websites for sociological research: The case of Amazon Mechanical Turk. The American Sociologist, 47(1), 47–55.
Shauman, K. A., & Xie, Y. (1996). Geographic mobility of scientists: Sex differences and family constraints. Demography, 33(4), 455–468.
Sirko, W., Kashubin, S., Ritter, M., Annkah, A., Bouchareb, Y. S. E., Dauphin, Y., Keysers, D., Neumann, M., Cisse, M., & Quinn, J. (2021). Continental-scale building detection from high resolution satellite imagery. http://arxiv.org/abs/2107.12283
Sironi, M., & Kashyap, R. (2021). Internet access and partnership formation in the United States. Population Studies, 1–19.
Skeldon, R. (2006). Interlinkages between internal and international migration and development in the Asian region. Population, Space and Place, 12(1), 15–30.
Smith, C., Gold, J., Ngo, T. D., Sumpter, C., & Free, C. (2015). Mobile phone-based interventions for improving contraception use. Cochrane Database of Systematic Reviews, 6. https://doi.org/10.1002/14651858.CD011159.pub2
Soowamber, M. L., Granton, J. T., Bavaghar-Zaeimi, F., & Johnson, S. R. (2016). Online obituaries are a reliable and valid source of mortality data. Journal of Clinical Epidemiology, 79, 167–168.
Søraa, R. A., Nyvoll, P., Tøndel, G., Fosch-Villaronga, E., & Serrano, J. A. (2021). The social dimension of domesticating technology: Interactions between older adults, caregivers, and robots in the home. Technological Forecasting and Social Change, 167, 120678.
Stern, M. J., Bilgen, I., & Dillman, D. A. (2014). The state of survey methodology: Challenges, dilemmas, and new frontiers in the era of the tailored design. Field Methods, 26(3), 284–301.
Stevens, F. R., Gaughan, A. E., Linard, C., & Tatem, A. J. (2015). Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data. PLOS ONE, 10(2), e0107042.
Stewart, N., Ungemach, C., Harris, A. J. L., Bartels, D. M., Newell, B. R., Paolacci, G., & Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgment and Decision Making, 10(5), 479–491.
Stier, S., Breuer, J., Siegers, P., & Thorson, K. (2020). Integrating survey data and digital trace data: Key issues in developing an emerging field. Social Science Computer Review, 38(5), 503–516.
Sturrock, H. J. W., Woolheater, K., Bennett, A. F., Andrade-Pacheco, R., & Midekisa, A. (2018). Predicting residential structures from open source remotely enumerated data using machine learning. PLOS ONE, 13(9), e0204399.
Subbotin, A., & Aref, S. (2021). Brain drain and brain gain in Russia: Analyzing international migration of researchers by discipline using Scopus bibliometric data 1996–2020. Scientometrics. https://doi.org/10.1007/s11192-021-04091-x
Sugimoto, C. R., Robinson-Garcia, N., Murray, D. S., Yegros-Yegros, A., Costas, R., & Larivière, V. (2017). Scientists have most impact when they're free to move. Nature, 550(7674), 29–31.
Swire-Thompson, B., & Lazer, D. (2020). Public health and online misinformation: Challenges and recommendations. Annual Review of Public Health, 41, 433–451.
Tatem, A. J. (2017). WorldPop, open data for spatial demography. Scientific Data, 4(1), 1–4.
Tekles, A., & Bornmann, L. (2020). Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches. Quantitative Science Studies, 1(4), 1510–1528.
Thomas, R. J. (2020). Online exogamy reconsidered: Estimating the internet's effects on racial, educational, religious, political and age assortative mating. Social Forces, 98(3), 1257–1286.
Tromholt, M. (2016). The Facebook experiment: Quitting Facebook leads to higher levels of well-being. Cyberpsychology, Behavior, and Social Networking, 19(11), 661–666.
Turel, O., Romashkin, A., & Morrison, K. M. (2016). Health outcomes of information system use lifestyles among adolescents: Videogame addiction, sleep curtailment and cardio-metabolic deficiencies. PLOS ONE, 11(5), e0154764.
Turkle, S. (2011). Alone Together: Why We Expect More from Technology and Less from Each Other. Basic Books.
Twenge, J. M., & Campbell, W. K. (2019). Media use is linked to lower psychological well-being: Evidence from three datasets. Psychiatric Quarterly, 90(2), 311–331.
Twenge, J. M., Joiner, T. E., Rogers, M. L., & Martin, G. N. (2018). Increases in depressive symptoms, suicide-related outcomes, and suicide rates among US adolescents after 2010 and links to increased new media screen time. Clinical Psychological Science, 6(1), 3–17.
Urban Analytics Lab (2021). Open government building data. https://ual.sg/project/ogbd/
US Census Bureau (2018). US Census Bureau releases 2018: Families and living arrangements tables. www.census.gov/newsroom/press-releases/2018/families.html
Valkenburg, P. M., Peter, J., & Schouten, A. P. (2006). Friend networking sites and their relationship to adolescents' well-being and social self-esteem. Cyberpsychology & Behavior: The Impact of the Internet, Multimedia and Virtual Reality on Behavior and Society, 9(5), 584–590.
Valkenburg, P. M., Beyens, I., Pouwels, J. L., van Driel, I. I., & Keijsers, L. (2021). Social media use and adolescents' self-esteem: Heading for a person-specific media effects paradigm. Journal of Communication, 71(1), 56–78.
Vanden Abeele, M., De Wolf, R., & Ling, R. (2018). Mobile media and social space: How anytime, anyplace connectivity structures everyday life. Media and Communication, 6(2), 5–14.
Varriale, C., Pesando, L. M., Kashyap, R., & Rotondi, V. (2021). Mobile phones and attitudes toward women's participation in politics: Evidence from Africa. Sociology of Development, 1–37.
Verdery, A. M., & Margolis, R. (2017). Projections of white and black older adults without living kin in the United States, 2015 to 2060. Proceedings of the National Academy of Sciences, 114(42), 11109–11114.
Verdery, A. M., Smith-Greenaway, E., Margolis, R., & Daw, J. (2020). Tracking the reach of COVID-19 kin loss with a bereavement multiplier applied to the United States. Proceedings of the National Academy of Sciences, 117(30), 17695–17701.
Verhagen, M. D. (2021). Identifying and improving functional form complexity: A machine learning framework. SocArXiv. https://doi.org/10.31235/osf.io/bka76
Vilhelmson, B., & Thulin, E. (2013). Does the internet encourage people to move? Investigating Swedish young adults' internal migration experiences and plans. Geoforum, 47, 209–216.
Wachter, K. W. (1997). Kinship resources for the elderly. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 352(1363), 1811–1817.
Wachter, K. W., Knodel, J. E., & Vanlandingham, M. (2002). AIDS and the elderly of Thailand: Projecting familial impacts. Demography, 39(1), 25–41.
Wagner, C., Strohmaier, M., Olteanu, A., Kıcıman, E., Contractor, N., & Eliassi-Rad, T. (2021). Measuring algorithmically infused societies. Nature, 595(7866), 197–204. https://doi.org/10.1038/s41586-021-03666-1
Wajcman, J. (2008). Life in the fast lane? Towards a sociology of technology and time. British Journal of Sociology, 59(1), 59–77.
Wang, X., Shi, J., & Kong, H. (2021). Online health information seeking: A review and meta-analysis. Health Communication, 36(10), 1163–1175.
Wang, Y., McKee, M., Torbica, A., & Stuckler, D. (2019). Systematic literature review on the spread of health-related misinformation on social media. Social Science & Medicine, 240, 112552.
Wardrop, N. A., Jochem, W. C., Bird, T. J., Chamberlain, H. R., Clarke, D., Kerr, D., Bengtsson, L., Juran, S., Seaman, V., & Tatem, A. J. (2018). Spatially disaggregated population estimates in the absence of national population and housing census data. Proceedings of the National Academy of Sciences, 115(14), 3529–3537.
Weinberg, J., Freese, J., & McElhattan, D. (2014). Comparing data characteristics and results of an online factorial survey between a population-based and a crowdsource-recruited sample. Sociological Science, 1, 292–310.
Westoff, C. F., & Koffman, D. A. (2011). The association of television and radio with reproductive behavior. Population and Development Review, 37(4), 749–759.
Whitaker, C., Stevelink, S., & Fear, N. (2017). The use of Facebook in recruiting participants for health research purposes: A systematic review. Journal of Medical Internet Research, 19(8), e290.
Wickham, H. (2021). Easily Harvest (Scrape) Web Pages (1.0.0) [R]. https://rvest.tidyverse.org/
Wilde, J., Chen, W., & Lohmann, S. (2020). COVID-19 and the Future of US Fertility: What Can We Learn from Google? Working Paper No. 13776, IZA Discussion Papers. www.econstor.eu/handle/10419/227303
Williams, N. E., Thomas, T. A., Dunbar, M., Eagle, N., & Dobra, A. (2015). Measures of human mobility using mobile phone records enhanced with GIS data. PLOS ONE, 10(7), e0133630.
Williamson, S., & Malik, M. (2020). Contesting narratives of repression: Experimental evidence from Sisi's Egypt. Journal of Peace Research, 58(5), 1018–1033.
Winkler, H. (2017). How does the internet affect migration decisions? Applied Economics Letters, 24(16), 1194–1198.
Winter, N., Burleigh, T., Kennedy, R., & Clifford, S. (2019). A simplified protocol to screen out VPS and international respondents using Qualtrics. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3327274
Wójcik, O. P., Brownstein, J. S., Chunara, R., & Johansson, M. A. (2014). Public health for the people: Participatory infectious disease surveillance in the digital age. Emerging Themes in Epidemiology, 11(1), 7.
Wright, J. (2019). Robots vs migrants? Reconfiguring the future of Japanese institutional eldercare. Critical Asian Studies, 51(3), 331–354.
Wright, K. B. (2005). Researching internet-based populations: Advantages and disadvantages of online survey research, online questionnaire authoring software packages, and web survey services. Journal of Computer-Mediated Communication, 10(3). https://doi.org/10.1111/j.1083-6101.2005.tb00259.x
Wu, J., & Ding, X.-H. (2013). Author name disambiguation in scientific collaboration and mobility cases. Scientometrics, 96(3), 683–697.
Yates, L., & Warde, A. (2017). Eating together and eating alone: Meal arrangements in British households. British Journal of Sociology, 68(1), 97–118.
Yuret, T. (2017). An analysis of the foreign-educated elite academics in the United States. Journal of Informetrics, 11(2), 358–370.
Zagheni, E. (2010). The impact of the HIV/AIDS epidemic on orphanhood probabilities and kinship structure in Zimbabwe. University of California Berkeley. https://escholarship.org/uc/item/7xp9m970
Zagheni, E. (2011). The impact of the HIV/AIDS epidemic on kinship resources for orphans in Zimbabwe. Population and Development Review, 37(4), 761–783.
Zagheni, E., & Weber, I. (2012). You are where you e-mail: Using e-mail data to estimate international migration rates. Proceedings of the 3rd Annual ACM Web Science Conference on WebSci '12, 348–351.
Zagheni, E., & Weber, I. (2015). Demographic research with non-representative internet data. International Journal of Manpower, 36(1), 13–25.
Zagheni, E., Garimella, V. R. K., Weber, I., & State, B. (2014). Inferring international and internal migration patterns from Twitter data. Proceedings of the 23rd International Conference on World Wide Web, 439–444.
Zagheni, E., Weber, I., & Gummadi, K. (2017). Leveraging Facebook's advertising platform to monitor stocks of migrants. Population and Development Review, 43(4), 721–734.
Zarocostas, J. (2020). How to fight an infodemic. The Lancet, 395(10225), 676.
Zhang, B., Mildenberger, M., Howe, P. D., Marlon, J., Rosenthal, S. A., & Leiserowitz, A. (2020). Quota sampling using Facebook advertisements. Political Science Research and Methods, 8(3), 558–564.
Zhang, Y. (2000). Using the internet for survey research: A case study. Journal of the American Society for Information Science, 51(1), 57–68.
Zhao, X., Aref, S., Zagheni, E., & Stecklov, G. (2021). International migration in academia and citation performance: An analysis of German-affiliated researchers by gender and discipline using Scopus publications 1996–2020. http://arxiv.org/abs/2104.12380
Zhao, Z., Bu, Y., & Li, J. (2020). Does the mobility of scientists disrupt their collaboration stability? Journal of Information Science, 016555152094874.
Zhou, S., Shapiro, M. A., & Wansink, B. (2017). The audience eats more if a movie character keeps eating: An unconscious mechanism for media influence on eating behaviors. Appetite, 108, 407–415.
Zinn, S. (2016). Simulating synthetic life courses of individuals and couples, and mate matching. In A. Grow & J. Van Bavel (Eds), Agent-Based Modelling in Population Studies: Concepts, Methods, and Applications (pp. 113–157). Springer International Publishing.

4. Digital technologies and the future of social surveys
Marcel Das and Tom Emery

1 INTRODUCTION

One of the key elements in empirical social research is the collection of data. In the period from the seventeenth to the nineteenth century, first steps were taken in the collection and analysis of data on society (Jerabek, 2015). According to Jerabek, the collection and use of such data were driven by two different purposes: administrative (government) and statistical (insurance industry and science). The aim was to predict and explain patterns of mass phenomena in society. This is still the central aim of social science research, yet the way in which social data are being collected has drastically changed since these first steps centuries ago. In the middle of the last century, the only methods for large-scale data collection outside of administrative data were face-to-face interviewing and mail questionnaires (Kalton, 2019). Such data collection was conducted using paper and pencil methods, thus limiting the complexity of the questionnaire. In the last two decades of the twentieth century, however, computers and advanced software changed this enormously. Computers were first used to support telephone data collection in the early 1970s, and computer-assisted telephone interviewing (CATI) was born (Weeks, 1992). It reduced the costs of data collection significantly, and questionnaires could include complex skip patterns, randomizations (e.g., question and answer order), individual-specific text fills (based on earlier responses in the questionnaire), automated error checks, and preloading of respondent-specific data. In the late 1980s, laptop computers were also used to support face-to-face fieldwork (computer-assisted personal interviewing (CAPI)). According to Weeks (1992), Statistics Netherlands and Statistics Sweden were the pioneers in Europe in using CAPI for their fieldwork. Computer-assisted interviewing methods became dominant modes at the end of the last century. Many methodological studies were carried out comparing the different modes in their effects on coverage of the target population, response rates, data quality, and costs. Existing large longitudinal social science surveys started using computer-assisted modes. The United States Panel Study of Income Dynamics, started in 1968, switched to CATI in 1993. The German Socio-Economic Panel Study, with its first wave in 1984, included CAPI as an interviewing method in 1998; currently, CAPI is its main interviewing mode. The British Household Panel Survey, started in 1991, switched to CAPI in 1999. The main reasons for the existing large longitudinal surveys to move to computer-assisted interviewing modes were the potential for improving data quality, speed of data turnaround and release of the data to the user community, and savings in their fieldwork costs. New longitudinal social science surveys used computer-assisted interviewing from the beginning, such as the European Social Survey (ESS, first wave in 2002) and the Survey of Health, Ageing and Retirement in Europe (SHARE, first wave in 2004).

The computer-assisted interviewing modes require the presence of an interviewer. In the last two decades, personnel costs increased significantly, particularly for the CAPI mode in which the interviewers must travel. In addition, the absence of complete sample frames with (mobile) telephone numbers has put additional pressure on the use and quality of the CATI mode. For these reasons, another mode – interviewing via the internet – received more and more attention. Technological developments not only make internet interviewing cost-effective but also create opportunities for innovatively asking questions or collecting data in other ways than through survey questions. Existing longitudinal social science surveys started to experiment with this new mode (computer-assisted web interviewing) and new online panels were and are being set up. The aim of this chapter is to outline the use of the online interviewing method and other digital technologies in social surveys, and how these may develop in the years to come. The remainder of the chapter is organized as follows. Section 2 describes the switch towards the web, in particular the use of online panels with the opportunity of speedy data collection and the measurement of the effects of sudden changes. Section 3 focuses on the move of large-scale, general social surveys to online data collection, the potential of integrating these surveys into online panels, and the integration of survey data into linked data structures. Section 4 discusses the future of social surveys, in which the link to administrative data may take on a new and different role, surveys are augmented with objective measurements either linked by design or provided from pre-established data services, and mass online experiments become possible through an integration of computational social science into the survey research community. Finally, Section 5 concludes with some remarks on future challenges.

2 SWITCH TOWARDS THE WEB

2.1 Online Panels

The online sample market provides researchers with efficient access to study participants and is around 20 years old (Kelly, 2021). Two decades ago, commercial survey agencies discovered the possibilities of conducting research online. At the same time, the first online panel for scientific research was created: the Centerpanel, administered by Centerdata (housed at the campus of Tilburg University, the Netherlands). The Centerpanel did not, however, start out as an online panel. In fact, there was an in-between step in moving from CATI and CAPI towards online methods. A precursor of the Centerpanel, based at the University of Amsterdam, used a computer-assisted data collection system without interviewers, using home computers, television screens, modems, and a telephone connection (Saris, 1998; Saris & De Pijper, 1986). In 2000, the Centerpanel switched to the internet. Since then, other academic online panels have emerged rapidly. With financial support from the Dutch Research Council, Centerdata built the Longitudinal Internet Studies for the Social Sciences (LISS) panel in 2007, a larger and open infrastructure for scientific research (Scherpenzeel & Das, 2010). Other examples include the German Internet Panel and the GESIS Panel in Germany, the ELIPSS panel in France, the Norwegian Citizen Panel in Norway, and the Understanding America Study in the United States. All these online panel surveys were initiated by academic researchers with the aim of supporting scientific, societal, and policy-relevant research. Moreover, all these panels rely on traditional probability sampling. However, there

are also differences, such as the size of the sample that is maintained and the scope of topics that are covered, mainly determined by the availability of different sampling frames and the available budget. See Blom et al. (2016) for a comparison of four of the above-mentioned panels.
Online panels, including the ones initiated by commercial agencies, vary considerably in set-up, use, and quality. In contrast to academic online panels, most commercial online panels are based on convenience samples, that is, panel members have not been selected randomly based on known probabilities but on other grounds such as self-selection. Commercial online panels are typically very cost-efficient and fast in collecting data. However, the non-probability nature of their samples imposes severe challenges in terms of representativity and accuracy of results (Chang & Krosnick, 2009; Yeager et al., 2011).
Cost efficiency and speedy data collection (and data dissemination to other researchers) are also key dimensions in building up academic online panels. In that regard, online interviewing is attractive because it is usually much cheaper than telephone interviewing and face-to-face interviewing. However, this does not imply that setting up and maintaining an academic online panel is low cost. Various aspects require serious funding. In the following we discuss those aspects, which relate to (1) covering the offline population, (2) incentives for participants, (3) collection of core data, and (4) dissemination of data.
2.1.1 Covering the offline population
Many online panels include only those respondents who have access to the internet and use the internet daily. However, since internet users differ from non-users in important demographic ways (Leenheer & Scherpenzeel, 2013), sampling from the ‘online’ population can lead to substantial biases when the actual target of inference is the entire population. Hence, to be fully representative, sampling frames in academic online panels must cover the ‘offline’ population too. What is more, surveying ‘offline’ participants with online tools requires additional equipment. For example, the LISS panel, the German Internet Panel, the ELIPSS panel, and the Understanding America Study provide technical equipment and services to respondents who do not have a computer and/or internet access at recruitment. This entails not only a one-time cost for purchasing the necessary technical devices (such as computers or tablets) but also ongoing costs for connectivity (e.g., an internet service provider subscription). An alternative, taken for example by the GESIS Panel, is to include the offline population via mail surveys. However, that comes at the disadvantage of deploying personnel for the logistics and having constraints related to traditional survey modes (e.g., mail surveys take more time and restrict the innovative features that can be included in online surveys), as well as the complexities of disentangling mode effects from genuine differences between individuals.
2.1.2 Participant incentives
To stimulate participation in (longitudinal) studies and response to the questionnaires it is quite common to pay incentives to the respondents, either pre-paid or conditional on completing the survey. Incentives also express the appreciation of the researcher or research organization (reciprocity). Online panels are no different.
In fact, it may be even more important for online panels since there is no interviewer involved who can motivate the respondent to start and complete the survey. Consequently, incentives also play a significant role when setting up an online panel. Before the main recruitment of the LISS panel was started, an experiment was carried out to determine the optimal recruitment strategy (Scherpenzeel & Toepoel, 2012). Factors

to be optimized included incentives, both those that are pre-paid/unconditional and those that are conditional on the completion of the interview. Incentives turned out to increase response rates and were found to have much stronger effects when they were distributed with the advance letter (prepaid) than when they were paid later (promised). Survey-methodological research also demonstrates that incentives stimulate continued participation in panel studies and prevent attrition (Douhou et al., 2015).
2.1.3 Collection of core data
Most online panels, either commercial or non-commercial, probability or non-probability based, collect a set of standard background variables that are commonly used by nearly all research projects. Usually, this is a limited set including variables related to respondents’ socio-economic and socio-demographic background such as age, gender, income, education level, etc. Those core variables might be directly used in data analyses but can also serve as a basis for a tailor-made selection of a required sample (e.g., all 40+ respondents with a college degree). Some academic online panels have extended the core set of data to more detailed sets of variables related to different domains. For example, the LISS panel collects yearly data on household and family matters, the economic situation and housing, work and schooling, social integration and leisure, health, personality, religion and ethnicity, and politics and values. As a result, social science can draw on a rich and longitudinal dataset that is ready to use for many research projects without the need to collect additional data. Moreover, the availability of rich core data serves as a backbone for research projects that have more specific data requirements and therefore require specific questionnaires. The latter, however, can be shorter and more focused since a huge pool of data is already collected in the core study.
2.1.4 Dissemination of data
Leading academic online panels also invest considerable effort (and funds) in the dissemination of data to a broad audience. Metadata and data need to be ‘FAIR’: findable, accessible, interoperable, and reusable (Wilkinson et al., 2016). User fees may be a solution to cover all costs of dissemination, but this is undesirable, as use would be limited to those who can afford to pay the fees. This can lead to an underutilization of (expensively collected) data and may particularly exclude younger researchers with little funding. Associated limitations with regard to the accessibility of research data may ultimately also slow down scientific progress. To serve the social science research community, investments are made to collaborate on the FAIR dissemination of leading scientific longitudinal datasets, both in terms of metadata and the actual data (see Section 3).

2.2 Measuring the Effects of Sudden Changes

In general, online surveys provide the opportunity to capture and research sudden changes in society. Data collection is fast and can be initiated quickly, especially in the case of an existing online panel infrastructure. In the media, news messages based on opinion polls appear on a daily basis, presenting people’s opinions on recent events, shocks, politics, or policy measures. Polling is – by and large – the core business of commercial online panels. The word ‘panel’ may be misleading in this respect, however. Quite often, commercial providers maintain large but very volatile pools of respondents from which hardly overlapping samples

are drawn, and there are few respondents who stay in the panel for significant periods of time, limiting the types of research questions that can be answered. Two elements mentioned in Section 2.1 make academic online panels interesting for the measurement of effects of sudden changes in society: the yearly measurement of core data and the dissemination of data to a broad audience (free access for scientific use). Rich data are available from the same respondent before the sudden change happened, and these can be used in the (longitudinal) analyses together with the data that are collected after the change. A very recent illustration of the power of academic online panels to capture the effects of sudden changes relates to the measurement of the societal consequences of the Coronavirus pandemic. Soon after COVID-19 reached the Netherlands, academic researchers as well as policy makers used the LISS panel to collect and analyse data on the impacts the pandemic has had on society. Examples include the impact on hours worked (von Gaudecker et al., 2020); the effects of home schooling on social inequality and long-term educational outcomes (Bol, 2020); and the lockdown policies’ impact on gender inequality and care responsibilities (Yerkes et al., 2020). Van der Velden et al. (2020, 2021) used longitudinal data from the LISS panel to measure the pandemic’s impact on mental health and loneliness. They could exploit data from the core study, collected in 2019, and compare these with data collected after the pandemic set in (2020 and later). Other academic panels were used for similar and other COVID-19-related topics. Since 10 March 2020, the Center for Economic and Social Research at the University of Southern California has conducted bi-weekly surveys of members of the Understanding America Study. Data are collected for a broad range of research topics including socio-economic, behavioural, and health outcomes and published online.1 All data are available for research purposes, free of charge. Links to rich contextual data are provided as well, such as mobility data, consumption patterns, and counts of coronavirus cases. To date, dozens of studies based on these data have appeared in peer-reviewed academic journals (e.g., Bruine de Bruin, 2021; Daly et al., 2021; Kim & Crimmins, 2020; Liu & Mattke, 2020). Many of these papers take advantage of the wealth of information available in core surveys that all panel members answer at regular intervals.

3 FROM FRAGMENTATION TO INTEGRATION

3.1 Online Panels and Social Surveys

Online panels have grown in prominence in the last two decades, but so have general social surveys. This is despite the common perception that we are entering a post-survey era in which research makes increasing use of administrative data and other linkable data sources or big data. The increasing prominence of social surveys is especially visible in the growing number of international comparative survey programmes such as the ESS, the International Social Survey Programme, the European Values Study (EVS), the Generations and Gender Survey (GGS), SHARE, and many others. The proliferation of these surveys has been driven by the same technical developments that have driven the development of online panels. The digitization of the survey process has improved coordination and increased the capacity of survey scientists to deploy large-scale, complex surveys. Whilst these surveys have predominantly been established as face-to-face

surveys, near universal access to the internet and the COVID pandemic have made possible an increased use of online data collection for those surveys. In the context of comparative surveys, the move to online data collection is constrained by three important challenges: (1) common data quality standards across participating countries; (2) comparability of data between countries; and (3) comparability of data over time. ESS, EVS, SHARE, and GGS have all experimented with online data collection modes, with very promising results (Das et al., 2017; Emery et al., 2019; Luijkx et al., 2021; Villar & Fitzgerald, 2017). In many instances, online data collection has outperformed face-to-face data collection in terms of response rates and data quality. Nevertheless, available evidence also suggests the presence of strong mode effects. Even in instances where online data collection showed less bias than face-to-face data collection, such mode effects stand in the way of adopting online or mixed-mode data collection, because doing so could jeopardize the cross-national and temporal equivalence of measures, the base on which international and longitudinal surveys are commonly built. Increasing digitalization and technological advancement, however, will keep up the pressure to utilize online data collection, and thus many questions arise about the relationship between online surveying and traditional social science surveys carried out in a face-to-face context. If these surveys were to be conducted exclusively or partially online, research funding agencies and social scientists must consider the degree to which online panels and closed, topic-specific data collections such as the ESS and SHARE are both required. Running separate data collection infrastructures may cause duplication of efforts and potential inefficiencies given the overheads involved in creating and maintaining survey studies, especially longitudinal surveys. One possible solution could be to integrate such international surveys into high-quality, online panel infrastructures. When it comes to the integration of existing social science surveys into existing online panels, many obstacles arise, in addition to the considerations outlined above. Using an existing online panel, rather than an independent sample, risks contaminating social science surveys with the effects of the existing online panel through attrition and respondents becoming too familiar with the survey content. Cross-country comparability of data in comparative surveys also involves harmonized fieldwork procedures and contact protocols, which could be difficult to standardize in a general online panel as countries have very different levels of web penetration and sampling frameworks. On the other hand, online panels offer effective means to cross-calibrate data between studies as different studies are regularly conducted on the same people. They also provide a large reduction in overheads and far more flexibility than traditional social surveys. In the Netherlands, the LISS panel has been integrated within the Open Data Infrastructure for Social Science and Economic Innovations (ODISSEI) alongside many independent data collections including the Dutch nodes of the ESS, GGS, EVS, SHARE, and the Dutch Parliamentary Election Study (DPES).
As the central aim of ODISSEI is to optimize the return on investment in social science data infrastructure, there has been considerable reflection on how an online panel such as LISS relates to those other survey studies. Even though a mass migration of independent studies to LISS is not realistic against the backdrop of the issues discussed previously, several collaborative experiments have been conducted within the context of ODISSEI to support a more efficient use of funding. A recent example of such a collaborative experiment occurred in spring 2021, when the DPES was scheduled to carry out fieldwork in parallel with the Dutch parliamentary elections. To

deliver accurate aggregate figures on voting intention, researchers needed to ensure the representativity of the DPES data whilst also ensuring that there was sufficient scope for detailed analyses of voter sub-groups. The researchers opted for a hybrid sampling approach to achieve those targets: they split the sample into two parts, one part (ca. 5,000 respondents) which was sampled independently and invited to complete an online survey, and a second part which was taken from the LISS panel pool of respondents (ca. 2,700 respondents). With an expected response rate of around 40 per cent for the fresh sample, this provided a core sample of approximately 2,000 respondents that was then supplemented by a sample of (over) 2,000 respondents from LISS to study smaller sub-populations. Several advantages come with such a hybrid approach. Besides lowering overall data collection costs, researchers can identify and examine the effects of attrition, priming, and survey mode that may be associated with the use of the online panel. This approach is promising for social surveys transitioning to online data collection, as online response rates are still volatile and hard to anticipate in study designs. In this respect, the supplementary use of online panels can operate as a hedge against high non-response and provide the ability to study specific sub-populations, especially offline populations, which online panels have invested a great deal in capturing. For international studies, the supplementary use of online panels remains challenging, as integrating data from an online panel into the general sample complicates and undermines the unitary design on which such studies are based. This consideration can only be fully addressed through the development of a cross-national online panel with the same design shared across multiple countries. Such a cross-national, European panel would be transformative for European survey science and the social sciences more generally by providing the advantages outlined in Section 2, but also by supporting the integration of cross-national surveys, which are such a central part of the current data landscape.
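The planning arithmetic behind such a hybrid design is simple and can be sketched as follows. The 40 per cent response rate for the fresh sample is taken from the example above; the completion rate assumed for the panel component is purely illustrative and not a figure reported by the study.

```python
# Back-of-the-envelope planning for a hybrid sample design, loosely modelled
# on the 2021 DPES example above. The panel completion rate is an assumption.

def expected_completes(gross_sample: int, response_rate: float) -> int:
    """Expected number of completed interviews for one sample component."""
    return round(gross_sample * response_rate)

fresh = expected_completes(gross_sample=5_000, response_rate=0.40)  # fresh probability sample
panel = expected_completes(gross_sample=2_700, response_rate=0.78)  # assumed panel completion rate

print(f"Fresh sample completes: ~{fresh}")          # ~2,000
print(f"Panel supplement:       ~{panel}")          # ~2,100
print(f"Total analytic sample:  ~{fresh + panel}")
```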

3.2 Integrating Social Surveys in a World of Linked, Open Data

In addition to the integration of existing social surveys and online panels, there has been a proliferation of data infrastructures seeking to link administrative data and survey data, such as Administrative Data Research UK and CASD (Gadouche & Picard, 2017; Gordon, 2020). Linking survey data and other forms of data has numerous benefits. By linking survey data with other data sources, it is possible to reduce the burden on respondents, increase data quality and accuracy, increase the variety of variables measured, and even reduce non-response bias and attrition. The benefits of linking survey data to other data sources are well established, but current data linkage is still predominantly post hoc and rarely built into the design of a study. However, in many data infrastructures there are moves toward a model in which surveys are linkable by design, and this fundamentally alters some of the surveys’ core underlying features.
3.2.1 Tabular survey data
Surveys are inherently tabular. They have a base unit of analysis (i.e., the individual) represented in rows and they include a wide range of variables (in columns) for each individual in the dataset. This leads to a simple data table. Even in longitudinal studies these data are either in long or wide form to accommodate the additional data in a two-dimensional form. It is common to have multiple individuals from the same household within a single

dataset (i.e., a household panel), but beyond household members the relationships between individuals in survey data are largely unknown. This is not a flaw in surveys but a design feature that results from random sampling and pseudonymization of the data. We do not know how the individuals within a survey relate to each other and to other social institutions like schools, businesses, or neighbourhoods because they are randomly selected and due to data minimization. As a result, many social dimensions of respondents’ lives remain uncovered by surveys.
3.2.2 Survey data as linked data
Linked data structures can change this situation. In a linked data structure, there are multiple tables that are linkable via a wide range of persistent identifiers, creating complex but highly flexible and versatile data. When survey data are linked to social media data, administrative data, or any other third-party data, it is necessary to have a persistent identifier that enables the linkage. Once this linkage is made, survey data fundamentally change into a linked data structure, in which individual survey data can be linked to many other tables.
3.2.3 Linked data in ODISSEI
In ODISSEI, the Dutch national infrastructure for social science, the LISS panel has been linked with administrative data at Statistics Netherlands using a commonly used and robust persistent identifier. For all respondents in LISS (who have not opted out from data linkage) it is possible to add additional variables that place the survey data in a wider context of linked data (Das & Knoef, 2019). For example, researchers can link respondents with individuals that they have lived near to, worked with, or went to school with, including all the corresponding information on those individuals. Such linkage is made possible because the data at Statistics Netherlands represent an extraordinarily rich and complex linked data structure based on persistent identifiers for schools, addresses, organizations, and companies. Furthermore, tax records can be linked to LISS respondents, enabling social research to work with highly accurate individual-level income data and, therefore, analyse income trajectories with unprecedented precision (de Bresser & Knoef, 2015). In addition, income trajectories of individuals can be compared with those of their neighbours, colleagues, or even their former classmates. By linking survey with administrative data at Statistics Netherlands it is possible to situate individuals within the social institutions in which they live their day-to-day lives (van der Laan et al., 2021). Social surveys are excellent tools for exploring motivations, attitudes, and perceptions, but by linking these surveys to administrative data we can situate them within their social context and understand interactions between micro-, meso-, and macro-level processes.
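In practice, such linkage amounts to joining pseudonymized tables on a shared persistent identifier. The sketch below illustrates the idea with invented table and column names; actual linkage at Statistics Netherlands takes place in a secure remote-access environment under strict legal conditions.

```python
import pandas as pd

# Hypothetical, pseudonymized survey table: one row per respondent.
survey = pd.DataFrame({
    "pid": ["a17", "b42", "c08"],      # persistent identifier
    "trust_gov": [6, 3, 8],            # survey item (0-10 scale)
})

# Hypothetical administrative tables keyed on the same identifier.
income = pd.DataFrame({
    "pid": ["a17", "b42", "c08"],
    "register_income": [31_000, 24_500, 52_000],   # e.g., from tax records
})
school = pd.DataFrame({
    "pid": ["a17", "b42", "c08"],
    "school_id": ["s1", "s1", "s9"],   # links respondents to institutions
})

# The identifier turns a flat survey table into one node of a linked data
# structure: attitudes, register income, and institutional context together.
linked = survey.merge(income, on="pid").merge(school, on="pid")

# Relations between respondents emerge from shared institutional keys:
# here, a17 and b42 attended the same (hypothetical) school s1.
print(linked.groupby("school_id")["pid"].apply(list))
```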

4 OUTLOOK TO THE FUTURE

4.1 Linked Administrative Data and Survey Design

The ability to conduct linkages between survey data and complex, relational administrative data suggests potential modifications to the way we design surveys. If it is known ex ante that survey data will be linked, it no longer makes sense to assume that cases (the rows in the dataset) are independent. Indeed, we can identify degrees of separation and social proximity

between respondents in a dataset such as LISS. For example, if one respondent has a child that goes to school with the cousin of the brother-in-law of another respondent’s boss (five degrees of separation), this can be captured by administrative data. Once you know whether respondents have links and how strong those links are, it becomes possible to examine whether these links are associated with variables measured in the survey – for example, whether people with lower degrees of separation are more likely to hold certain attitudes or beliefs. If we can no longer assume that respondents in a survey are independent, it might therefore change the way we conduct sampling. Sampling theory is based on the notion that all individuals are independent, but the underlying sampling frame for LISS, taken from Statistics Netherlands, allows us to change this assumption. For example, if we want to study how beliefs and attitudes spread through a population, it might be desirable to design a sampling strategy in which each respondent drawn is not independent but is instead selected via a random walk process from an initial seed draw. This resembles a ‘snowball’ sampling approach but without being driven by the respondents themselves. The result would then be a sample of individuals who were known to be related to each other to a specified degree. Conversely, we might want to deliberately sample individuals who are very distant from each other and poles apart in terms of their social connections. By using the linked structure of administrative data to affect the design of a survey’s sample it is possible to answer questions that have so far been out of reach for social scientists. For example, it would be possible to examine how populist beliefs spread across actual populations or how energy-saving behaviours are adopted by specific segments of society over time. It enables an unprecedented integration of research on psychological and economic concepts, conducted on the basis of survey data, with sociological concepts of social networks, group behaviours, and structural change that can be researched through linked data structures. What the interaction between Statistics Netherlands and LISS data in ODISSEI intimates is that data linkage is not a simple matter of adding to and extending survey data. Linkage to administrative data, when done extensively and systematically, can fundamentally change some of the core underlying principles of survey research design.
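A random walk draw of this kind is straightforward to sketch. The toy adjacency list below stands in for a population-scale register network in which edges are administrative ties (household, school, workplace, neighbourhood); the seed, sample size, and network are all invented for illustration.

```python
import random

# Toy stand-in for a register-based network of administrative ties.
network = {
    "p1": ["p2", "p3"],
    "p2": ["p1", "p4"],
    "p3": ["p1", "p4", "p5"],
    "p4": ["p2", "p3", "p6"],
    "p5": ["p3", "p6"],
    "p6": ["p4", "p5"],
}

def random_walk_sample(graph, seed, n, rng=None):
    """Draw n distinct respondents by walking the network from a seed.

    Unlike snowball sampling, the walk is driven by the sampling frame,
    not by the respondents themselves.
    """
    rng = rng or random.Random(42)
    sample, current = [seed], seed
    while len(sample) < n:
        current = rng.choice(graph[current])   # step to a random neighbour
        if current not in sample:
            sample.append(current)
    return sample

# A sample of respondents known to be socially proximate to one another.
print(random_walk_sample(network, seed="p1", n=4))
```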

4.2 Augmenting Surveys in a Digital World

Developments in data infrastructure and security suggest that the future role of external data sources may be even more prominent than in the last few decades. As surveys have moved online there has been a simultaneous increase in the amount of additional data that is integrated into the survey process. There are broadly two forms of such linkage. The first form is linkage by design, where the researchers conducting the survey integrate additional data collection techniques within the survey alongside traditional survey responses. Those techniques may include ecological momentary assessments and the use of wearables, such as activity trackers, and other devices (Bluemke et al., 2017; Callegaro & Yang, 2018). They are embedded within the survey process for all respondents with the aim of providing detailed objective measures on respondents to supplement traditional survey items. The second form of linkage is circumstantial data linkage, where respondents are invited to provide data from a pre-established data service such as social media, a personal wearable device, or a mobile phone application (Al Baghal et al., 2020). There are currently constraints to both approaches. Data linkage by design tends to be costly both for the survey administrators and for the respondents themselves. The additional data

collection can be an imposition on respondents in terms of time, convenience, and data privacy. Circumstantial data linkage presupposes that the data are being collected anyway, thereby eradicating the costs for both the respondent and the survey administrator. The principal drawbacks of such data linkage are that not all respondents (1) will use a service, or the same service, to generate such data (e.g., different brands of accelerometer produce different types of data) or (2) will be ready to donate their data to the survey. There are computational and data infrastructure developments in this field which suggest that this could change rapidly. The development of personal data vaults such as SOLID (https://solidproject.org/) encourages a new approach to data donation by which personal data from a wide range of devices can be stored in a single pod that is fully controlled by the data owner (i.e., the respondent). The respondent can then allow access to services without the data themselves being transferred. For social scientists this could provide more secure and more powerful ways to extract meaningful data from respondents. For example, there has been significant interest in linking social media data from Twitter to social surveys, but doing so requires the respondent to share their Twitter handle with the survey, knowing that the survey will then be able to view their public profile and data (Sloan et al., 2020). This undermines the implicit anonymity of social surveys. Using a secure personal data vault approach, the respondent would save their Twitter data in their personal vault and then approve access for the research team. However, the research team would not be given permission to see the data but merely to query them, using an algorithm which extracts only the information of interest, such as the number of followers, the sentiment of tweets, the number of mentions, etc. The personal data are never transferred, and anonymity is maintained. If well regulated and managed, such approaches could allow a far wider range of data to be made accessible for linkage with survey data, such as genetic data, personal communications data, movement data, and media consumption data. Such approaches offer significant potential but would also require large investments in the infrastructure and applications necessary to operate them effectively.
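The query-not-transfer principle can be made concrete with a schematic sketch. The vault interface below is invented for illustration; SOLID specifies respondent-controlled storage and access control rather than this exact query API, and a real deployment would add query vetting, auditing, and output checks.

```python
# Schematic of privacy-preserving access to a personal data vault:
# the vault runs an approved query locally and releases only aggregates.

class PersonalDataVault:
    """Toy stand-in for a respondent-controlled data pod."""

    def __init__(self, tweets, followers):
        self._tweets = tweets        # raw data, never exposed directly
        self._followers = followers

    def run_approved_query(self, query):
        # Only the query's return value (aggregate statistics) leaves the pod.
        return query(self._tweets, self._followers)

def research_query(tweets, followers):
    # Toy 'sentiment' measure: share of tweets containing a positive marker.
    positive = sum(":)" in t for t in tweets) / max(len(tweets), 1)
    return {
        "n_followers": followers,
        "n_tweets": len(tweets),
        "share_positive": round(positive, 2),
    }

vault = PersonalDataVault(tweets=["nice day :)", "traffic again"], followers=214)
print(vault.run_approved_query(research_query))  # aggregates only, no raw tweets
```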

4.3 Enhanced Experimental Tools for ‘Surveys’

There is considerable potential in linking survey data to a wide variety of other data sources, and this potential is likely to expand rapidly. Those innovations and associated challenges are widely discussed in the survey research community. However, the field of computational social science has also seen the broad development of large online experiments, in which participants interact within an online platform and are then subjected to interventions. These experiments have developed largely independently of the survey research community but are of increasing prominence within the social sciences more generally. In the field of computational social science, data collection is predominantly carried out through interventions and experiments using online data collection, enhanced by computational techniques and resources (Salganik, 2019). The emphasis on interventions and experiments specifically targets the internal validity of a research design, largely neglecting the external validity that comes from data collection in general social surveys and online panels. Respondents are sourced through paid platforms such as Amazon Mechanical Turk or via online advertising on social media platforms and invited to participate in online experiments run through an experimental platform. Such experiments take on a wide range of designs but primarily resemble the idea of traditional psychological lab experiments transferred into an

online setting. Exemplary cases are online versions of the Stanford prison experiment, the halo effect experiment, the marshmallow test, and many others. What differs in these modern iterations is the scale and pace at which experiments can be conducted. Rather than running the Stanford prison experiment with a small group of college students, it is now possible to conduct it with thousands of participants from an online forum. The development of these mass online experiments raises interesting opportunities for the social sciences. Psychological and sociological research have existed as distinct literatures, in large part because they draw from differing empirical bases that reflect their epistemological foundations. Yet they now operate in the same space of online data collection. Allowing large-scale experiments to be operated in online panels opens tantalizing possibilities to bring these two empirical traditions together. For example, if a mass online experiment were run with the participants in a long-standing, high-quality panel, psychologists would immediately benefit from greater external validity due to the representativity of the sample. They would also be able to use a huge swathe of background measures to inform the treatment of respondents in the experiment or understand the degree to which observed mechanisms are dependent on wider sociological factors. Likewise, sociological researchers would be able to design experiments that test the role of social institutions in shaping actual observed behaviour under experimental conditions.

5 CONCLUDING REMARKS

To some extent, the future as outlined in Section 4 has already begun. In several countries new initiatives have been set up to test the feasibility and opportunities of advanced methods in social science data collection. New online panels are being set up according to the highest standards, with full coverage of the target population and proper probability samples, as opposed to self-selected panel members. Collaborative efforts in integrating online panels and general social surveys have been initiated, but there is still a way to go. Broad access to administrative data files is still only possible in a select number of countries, and the data are rarely well documented, standardized, or comparable. New measurement devices to objectively collect data are expensive, sparsely used, and sometimes not easy to integrate into respondents’ daily lives. Even longitudinal surveys face sustained challenges from shifting technology, such as the pervasive use of mobile phones and tablets for online surveys and the impact of this on data comparability. However, the evident trend in social research is the increasing interoperability of and linkage between sources, and this is likely to gather pace. Traditional methods will be enriched with new ways of data collection as sketched above and other, yet unknown methods. We are convinced that social science surveys will continue to play a fundamental role in the future of social research as they remain our only and best method for consistently measuring attitudes, values, intentions, and beliefs. They complement and are enriched by other data sources such as administrative data. We foresee online surveys as the dominant mode in the next decades, with a movement towards online panels enriched with the opportunities that advanced data linkage and integration provide. The future will also be shaped by the rapid development of new technologies. It may then be time for social research to reconsider its assumption that respondent records represent independent data points.

The development of survey research and its integration within a myriad of data sources necessitates unprecedented infrastructural investments. Sustainable funds are needed for the continuation of longitudinal social science surveys and online panel infrastructures. The need for these is clear, demonstrable, and indisputable. Investments must also be made to provide continued and broad open access to administrative data for social science research and computational capacity for the analysis of large, relational data files. This is a serious challenge for the field. Data infrastructures appeal less to the imagination of funding agencies than telescopes and particle accelerators, despite similar requirements and potential for scientific discoveries. We therefore believe it will be important for all the social sciences to unite as one discipline. Consortia such as ODISSEI in the Netherlands are leading examples of collaborative efforts that ensure that social science researchers are brought together with the necessary data, expertise, and resources to conduct ground-breaking research and embrace the computational turn in social enquiry. Similar initiatives are under way in other countries, and the future might bring integration at a European or even worldwide level. The final word is for the respondents. They determine how successful the exciting new developments will be. The social science research community should make a convincing case that the new developments in collecting social science data help us to better understand citizens’ behaviour, with less burden for the respondent and clear benefits for society.

NOTE

1. See https://covid19pulse.usc.edu/

REFERENCES

Al Baghal, T., Sloan, L., Jessop, C., Williams, M. L., & Burnap, P. (2020). Linking Twitter and survey data: The impact of survey mode and demographics on consent rates across three UK studies. Social Science Computer Review, 38(5), 517–532.
Blom, A. G., Bosnjak, M., Cornilleau, A., Cousteaux, A.-S., Das, M., Douhou, S., & Krieger, U. (2016). A comparison of four probability-based online and mixed-mode panels in Europe. Social Science Computer Review, 34(1), 8–25.
Bluemke, M., Resch, B., Lechner, C., Westerholt, R., & Kolb, J.-P. (2017). Integrating geographic information into survey research: Current applications, challenges and future avenues. Survey Research Methods, 11(3), 307–327.
Bol, T. (2020). Inequality in homeschooling during the Corona crisis in the Netherlands: First results from the LISS Panel. https://doi.org/10.31235/osf.io/hf32q
Bruine de Bruin, W. (2021). Age differences in COVID-19 risk perceptions and mental health: Evidence from a national US survey conducted in March 2020. Journals of Gerontology: Series B, 76(2), e24–e29.
Callegaro, M., & Yang, Y. (2018). The role of surveys in the era of ‘big data’. In D. L. Vannette & J. A. Krosnick (eds), The Palgrave Handbook of Survey Research (pp. 175–192). Springer.
Chang, L., & Krosnick, J. A. (2009). National surveys via RDD telephone interviewing versus the internet: Comparing sample representativeness and response quality. Public Opinion Quarterly, 73(4), 641–678.
Daly, M., Sutin, A. R., & Robinson, E. (2021). Depression reported by US adults in 2017–2018 and March and April 2020. Journal of Affective Disorders, 278, 131–135.

Das, M., & Knoef, M. (2019). Experimental and longitudinal data for scientific and policy research: Open access to data collected in the Longitudinal Internet Studies for the Social Sciences (LISS) Panel. In Data-Driven Policy Impact Evaluation (pp. 131–146). Springer.
Das, M., de Bruijne, M., Janssen, J., & Kalwij, A. (2017). Experiment: Internet interviewing in the sixth wave of SHARE in the Netherlands. In SHARE Wave 6: Panel Innovations and Collecting Dried Blood Spots (pp. 151–162). Munich Center for the Economics of Aging.
de Bresser, J., & Knoef, M. (2015). Can the Dutch meet their own retirement expenditure goals? Labour Economics, 34, 100–117.
Douhou, S., Mulder, J., & Scherpenzeel, A. (2015). The effectiveness of incentives on recruitment and retention rates: An experiment in a web survey. European Survey Research Association Conference, Reykjavik.
Emery, T., Cabaço, S., Lugtig, P., Toepoel, V., & Lück, D. (2019). GGP technical case and e-needs. https://osf.io/preprints/socarxiv/439wc/
Gadouche, K., & Picard, N. (2017). L’accès aux données très détaillées pour la recherche scientifique [Access to highly detailed data for scientific research]. Working Paper.
Gordon, E. (2020). Administrative data research UK. Patterns, 1(1), 100010.
Jerabek, H. (2015). Empirical social research, history of. In J. D. Wright (Ed.), International Encyclopedia of the Social and Behavioral Sciences, 2nd edn, Vol. 7 (pp. 558–566). Oxford: Elsevier.
Kalton, G. (2019). Developments in survey research over the past 60 years: A personal perspective. International Statistical Review, 87, S10–S30.
Kelly, F. (2021). The power of research panels. Ipsos Knowledge Centre. www.ipsos.com/sites/default/files/ct/publication/documents/2021-01/power-of-research-panels-2021.pdf
Kim, J. K., & Crimmins, E. M. (2020). How does age affect personal and social reactions to COVID-19: Results from the national Understanding America Study. PloS One, 15(11), e0241950.
Leenheer, J., & Scherpenzeel, A. C. (2013). Does it pay off to include non-internet households in an internet panel? International Journal of Internet Science, 8(1), 17–29.
Liu, Y., & Mattke, S. (2020). Association between state stay-at-home orders and risk reduction behaviors and mental distress amid the COVID-19 pandemic. Preventive Medicine, 141, 106299.
Luijkx, R., Jónsdóttir, G. A., Gummer, T., Ernst Stähli, M., Frederiksen, M., Ketola, K., Reeskens, T., Brislinger, E., Christmann, P., & Gunnarsson, S. Þ. (2021). The European Values Study 2017: On the way to the future using mixed-modes. European Sociological Review, 37(2), 330–346.
Salganik, M. J. (2019). Bit by bit: Social research in the digital age. Princeton University Press.
Saris, W. (1998). Ten years of interviewing without interviewers: The telepanel. In M. Couper et al. (eds), Computer Assisted Survey Information Collection (pp. 409–429). Wiley.
Saris, W., & De Pijper, M. (1986). Computer assisted interviewing using home computers. European Research, 14(3), 144–150.
Scherpenzeel, A., & Das, M. (2010). ‘True’ longitudinal and probability-based internet panels: Evidence from the Netherlands. Social and Behavioral Research and the Internet: Advances in Applied Methods and Research Strategies, 77–104.
Scherpenzeel, A., & Toepoel, V. (2012). Recruiting a probability sample for an online panel: Effects of contact mode, incentives, and information. Public Opinion Quarterly, 76(3), 470–490.
Sloan, L., Jessop, C., Al Baghal, T., & Williams, M. (2020). Linking survey and Twitter data: Informed consent, disclosure, security, and archiving. Journal of Empirical Research on Human Research Ethics, 15(1–2), 63–76.
van der Laan, J., Das, M., te Riele, S., de Jonge, E., & Emery, T. (2021). Measuring educational segregation using a whole population network of the Netherlands. SocArXiv. osf.io/a98w7
van der Velden, P. G., Contino, C., Das, M., van Loon, P., & Bosmans, M. W. (2020). Anxiety and depression symptoms, and lack of emotional support among the general population before and during the COVID-19 pandemic: A prospective national study on prevalence and risk factors. Journal of Affective Disorders, 277, 540–548.
van der Velden, P. G., Hyland, P., Contino, C., von Gaudecker, H.-M., Muffels, R., & Das, M. (2021). Anxiety and depression symptoms, the recovery from symptoms, and loneliness before and after the COVID-19 outbreak among the general population: Findings from a Dutch population-based longitudinal study. PloS One, 16(1), e0245057.

Villar, A., & Fitzgerald, R. (2017). Using mixed modes in survey research: Evidence from six experiments in the ESS. In Values and Identities in Europe (pp. 299–336). Routledge.
von Gaudecker, H.-M., Holler, R., Janys, L., Siflinger, B. M., & Zimpelmann, C. (2020). Labour supply during lockdown and a ‘new normal’: The case of the Netherlands. IZA Discussion Papers.
Weeks, M. F. (1992). Computer-assisted survey information. Journal of Official Statistics, 8(4), 445–465.
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., & Bourne, P. E. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1), 1–9.
Yeager, D. S., Krosnick, J. A., Chang, L., Javitz, H. S., Levendusky, M. S., Simpser, A., & Wang, R. (2011). Comparing the accuracy of RDD telephone surveys and internet surveys conducted with probability and non-probability samples. Public Opinion Quarterly, 75(4), 709–747.
Yerkes, M. A., André, S. C., Besamusca, J. W., Kruyen, P. M., Remery, C. L., van der Zwan, R., Beckers, D. G., & Geurts, S. A. (2020). ‘Intelligent’ lockdown, intelligent effects? Results from a survey on gender (in)equality in paid work, the division of childcare and household work, and quality of life among parents in the Netherlands during the Covid-19 lockdown. PloS One, 15(11), e0242249.

5. Mobile devices and the collection of social research data
Bella Struminskaya and Florian Keusch

1 INTRODUCTION

In 2012–2013, sociologist Naomi Sugie was studying the process of re-entry into society of individuals recently released from prison (Sugie, 2018). Parolees can face different obstacles during reintegration, and it is important to understand what determines their employment outcomes in order to develop policies that can help successful reintegration and reduce recidivism rates. This particular group of individuals is difficult to study due to the unstable circumstances in which they live after leaving prison. However, Sugie used new technologies to obtain detailed information about the process of how people find work after prison. Parolees were provided with Android smartphones that allowed her to collect data about various aspects of their lives for about three months. The sample consisted of 131 participants who were randomly drawn from a complete list of men released from prison in 2012–2013 in Newark, New Jersey, United States. Smartphones sent out daily surveys (experience sampling method) at randomly selected times during the day about social interaction, job search behaviour, and emotional well-being. Additionally, smartphones collected data passively: participants’ geolocation and information about calls and text messages (i.e., not the content but the encrypted numbers of all incoming and outgoing calls or texts) were transmitted to the researchers. Calls and messages to and from new telephone numbers triggered surveys, providing a detailed picture of parolees’ social interactions. Using these data allowed the researchers to draw conclusions that could not be obtained previously. Sugie and Lens (2017) showed that there exists a spatial mismatch which affects the success prospects of a job search: low-skilled, non-white job seekers are located within central cities while job openings are located in outlying areas. Such residential mismatch lengthens the time to employment; however, mobility can compensate for residential deficits and improve employment outcomes. Such new insights can translate into substantial improvements in developing measures to help the disadvantaged group of parolees, who are geographically restricted and unable to travel in order to find work. Naomi Sugie’s study exemplifies the huge potential of mobile devices for collecting data in social research. She combined traditional survey techniques with novel data collection methods of smartphone sensing and was able to gain qualitatively new insights. Her team provided participants with smartphones, but as these devices become increasingly widespread across countries – the GSM Association predicts that by 2025, 5 billion people will have a mobile internet subscription on their phones (GSM Association, 2020) – the use of new technologies such as smartphones for social science research is promising. Smartphones, more specifically their built-in sensors and native or specifically installed apps, allow researchers to measure social phenomena in situ, taking into account the physical and social context, thus providing high ecological validity for measurement. Smartphones and wearable devices such as smart

watches become ‘extensions of people’s bodies’ (Keusch & Conrad, 2021, p. 2) as individuals routinely carry them around, which provides opportunities for frequent measurement. However, there are many challenges associated with these novel methods of data collection using smartphones and wearables. Not everyone has a smartphone, and not everyone is willing to use their device to provide data about themselves. There are measurement issues as well (Struminskaya et al., 2020): sensor-based data collection results in large volumes of missing data, and there are issues with construct validity. Moreover, providing feedback or insights to participants – common in the so-called ‘quantified self’ paradigm, in which participation is motivated by the possibility of gaining insights into one’s own behaviour (Bietz et al., 2019) – can itself induce behavioural changes. Also, there are important ethical and legal considerations for the protection of participants’ privacy. This chapter examines the use of mobile devices in social scientific data collection projects. First, we highlight the opportunities of collecting data beyond self-report questionnaires using mobile devices. Second, we present various examples of social science studies that implemented this kind of research. Third, we discuss methodological challenges and practical considerations for researchers who wish to set up studies using smartphone sensors, apps, or wearables. We conclude this chapter with an outlook on the future of using new technologies for data collection in social and behavioural science research.
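To make the measurement logic of a study like Sugie’s concrete, the sketch below combines the two strategies described above: time-based experience sampling and event-triggered surveys, with contact identifiers stored only as hashes. All names, the salt, and the triggering rule are illustrative; this is not the study’s actual software.

```python
import hashlib
import random

def daily_survey_minute(rng=random.Random()):
    """Experience sampling: pick a random minute of the day for a survey ping."""
    return rng.randrange(9 * 60, 21 * 60)   # e.g., between 09:00 and 21:00

seen_contacts = set()

def on_call_or_text(phone_number: str) -> bool:
    """Event-triggered measurement: fire a short survey only for new contacts.

    The number is stored as a salted hash, mirroring the idea of transmitting
    encrypted identifiers rather than raw contact data.
    """
    digest = hashlib.sha256(b"study-salt" + phone_number.encode()).hexdigest()
    if digest in seen_contacts:
        return False      # known contact: log passively, no survey
    seen_contacts.add(digest)
    return True           # new contact: trigger a follow-up survey

print(daily_survey_minute())          # minute-of-day of today's random ping
print(on_call_or_text("555-0134"))    # True: new contact triggers a survey
print(on_call_or_text("555-0134"))    # False: contact already known
```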

2

THE PROMISE OF USING APPS, SENSORS, AND WEARABLES FOR SOCIAL SCIENCE RESEARCH

Our daily lives are becoming increasingly digitized, which provides researchers with exciting opportunities to study human behaviour. Every time a person swipes their credit card, allows websites and smartphone apps to access their geographic location, or wears a fitness tracker, an additional piece of information is provided to companies, public entities, and researchers. The increasing volume of such 'big data' is transforming social science research: data can be collected faster, more frequently, and more accurately, since biases associated with self-report, such as social desirability and recall error, can be mitigated. As the number of people who own smartphones is rising (e.g., Pew, 2021), so is the number of sensors available in typical smartphones (Figure 5.1). Sensors are pieces of hardware – in a smartphone, in a home appliance that connects to such a device through Bluetooth, or in a smartwatch or other wearable – that passively detect ambient physical states, or changes in these states, without interacting with the user (Keusch & Conrad, 2021). Common smartphone sensors include those that measure the geoposition of the device, for example via the Global Positioning System (GPS); the accelerometer and gyroscope, which measure physical movement; Bluetooth, Wi-Fi, and cellular network radios, which allow devices to communicate with each other; and sensors such as the barometer, thermometer, light sensor, camera, and microphone, which can be used to measure ambience.

[Figure 5.1: The number of sensors in common off-the-shelf smartphones has increased consistently over the years. Source: Struminskaya et al. (2020).]

Typically, data collection through smartphone sensors is performed via apps. Apps are pieces of software that are installed on a device and allow for interaction with the functions of the device (Struminskaya et al., 2020). Apps can aggregate, process, and store information from the device's sensors and operating system. They can be used to collect information that is already stored on the device, such as logs of calling and texting behaviour, as well as to administer questionnaires on a mobile device. Information from different sensors can also be combined. For example, Wang and colleagues (2014) used the light sensor, microphone, and accelerometer to detect sleep patterns of students throughout the semester.

There are several advantages of app and sensor measurements compared to self-report (Struminskaya et al., 2020). First, measurement of social phenomena can happen in situ, that is, at the point of occurrence of an event or activity. Participants can be asked to answer short surveys through ecological momentary assessment (EMA) in a given context or setting. At the same time, sensors can detect other phenomena passively. For example, Lathia and colleagues (2017) collected information from users of a mood-tracking application: the application sent short EMA surveys on the participants' mood twice a day and collected information on physical activity using the smartphones' accelerometers in a 15-minute period preceding the EMA.

Second, sensor measurement provides very detailed and rich data. For example, an accelerometer sampling at 60 Hz collects 60 measurements per second along three axes, which for 10 minutes of data collection results in 60 × 3 × 600 = 108,000 data points per participant. The richness of the data can be illustrated by a study on environmental determinants of health in which English and colleagues (2020) collected information about temperature, pressure, humidity, and air pollutants, as well as sound, vibration, and imagery through different types of environmental sensors.

Third, the data are collected passively using smartphone sensors and wearables, that is, without active input by a participant, and provide more objective measurements than self-report. One issue of self-report is recall error due to the imperfect memory capacity of individuals and the necessity to estimate or guess when reporting their daily behaviours.

In a Dutch Longitudinal Internet Studies for the Social Sciences (LISS) panel study (for more details on LISS, see Das & Emery, Chapter 4 in this volume), a probability-based online panel of the general population, about 1,000 participants were sent accelerometers to collect data about their physical activity for eight days. Simultaneously, the participants filled out online questionnaires about their physical activity. Self-reported and passively collected data showed significant differences: older participants reported being more physically active than the accelerometry data indicated; females, students, and high-income groups also overestimated their physical activity compared to the objective measures (Scherpenzeel, 2017). Another issue of self-report is social desirability. In another study in the LISS panel, respondents were sent wireless weighing scales. Researchers compared the objective weight measurements to self-reported measurements and found that women reported a weight that was on average about 0.7 kilograms lower, and men about 0.9 kilograms lower, than their objectively measured weight. Furthermore, participants with a high body mass index (BMI) were more likely to underreport their weight, while participants with a low BMI tended to overreport their weight (Scherpenzeel, 2017).

Fourth, data that previously had to be collected in research labs with a small number of participants can now be gathered from much larger samples. One example of collecting data at scale is the United Kingdom (UK) Biobank accelerometry study in which researchers approached 236,519 UK Biobank participants (i.e., participants of a large-scale study focusing on health outcomes in people over 40), asking them to wear accelerometers continuously while performing daily activities. About 45 per cent of those approached complied with the request, resulting in 103,712 datasets with about 150 hours of data per participant (Doherty et al., 2017). In another example, researchers collected data from approximately 22,000 volunteers about their happiness, measured through EMA, and their physical activity, measured passively through accelerometers on participants' smartphones (MacKerron & Mourato, 2013). Using traditional self-report approaches in such studies is often cost prohibitive and less timely than sensor-based data collection.

Finally, collecting data via apps and sensors can reduce respondent burden compared to frequent self-report. For example, in a smartphone-based travel survey, participants' location can be measured continuously. In a Dutch app-based travel diary study by Statistics Netherlands, geolocation was measured using a combination of GPS, Wi-Fi, and cellular networks every second when the smartphone was in motion and every minute while the smartphone was still (McCool et al., 2021) – practically an 'always-on' measurement that would have been very burdensome if participants had to self-report all their trips. Diary studies place a particularly high burden on participants, who have to provide data at relatively high intensity, which results in fatigue and dropout (Schmidt, 2014). App-based studies offer additional sensor functionalities that can provide measures of context, allowing researchers to reduce the number of survey questions and thereby shorten the survey.
For example, an app to measure food intake can allow respondents to take pictures of their meals, aid in determining the number of calories consumed, and, based on the geolocation, provide a list of choices of where a participant has had a meal (Yan et al., 2019). In household expenditure studies, participants can take pictures of receipts, to which optical character recognition methods can be applied to determine the types of products and the amounts spent, reducing the involvement of the participant. This has been done in the UK Understanding Society app-based budget study (Jäckle et al., 2019) and in the app-based Household Budget Survey in the Netherlands, Luxembourg, and Spain (Rodenburg et al., 2022).
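
To make the receipt-processing step concrete, the following minimal sketch extracts printed amounts from a receipt photo using the open-source Tesseract engine via the pytesseract package. The file name and the price pattern are illustrative assumptions; the studies cited above used their own, more elaborate pipelines.

# Minimal sketch of receipt processing with optical character recognition.
# Assumes Tesseract is installed locally; the regular expression for prices
# is an illustrative simplification, not the cited studies' pipeline.
import re

from PIL import Image
import pytesseract


def extract_amounts(receipt_path: str) -> list[float]:
    """Read a receipt photo and return the monetary amounts printed on it."""
    text = pytesseract.image_to_string(Image.open(receipt_path))
    # Match prices such as '12.99' or '12,99' (common on European receipts).
    matches = re.findall(r"\b\d{1,4}[.,]\d{2}\b", text)
    return [float(m.replace(",", ".")) for m in matches]


if __name__ == "__main__":
    amounts = extract_amounts("receipt.jpg")  # hypothetical input file
    print(f"Detected {len(amounts)} amounts, total {sum(amounts):.2f}")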

Using smartphones and apps, researchers can gain insights into behaviours that participants perform on their smartphones and that are of direct interest to researchers, such as how much time people spend interacting with others through calling and texting (i.e., smartphone-mediated behaviours; Harari et al., 2016). In addition, mobile sensing allows researchers to infer something about people's behaviour in the physical world (i.e., non-smartphone-mediated behaviour), for example by using log file data about which apps participants used to infer their personality characteristics (Stachl et al., 2020). Research questions that can be answered using these new technologies are not necessarily new, nor are some of the methods that are used. In the next section, we describe in detail a study which aims to answer questions about the effects of unemployment on people's lives; similar research questions have been studied for almost 100 years. Experience sampling methods were used in research for decades before smartphones came around: researchers used to provide participants with portable electronic devices such as pagers or wristwatches that beeped or vibrated, alerting participants to complete short self-report questionnaires (Csikszentmihalyi & Larson, 1987). What is new is that combining passive measurement and self-reports on one device that many people own and carry with them throughout the day allows researchers to collect high-frequency data that are better suited to reaching substantive conclusions, while potentially minimizing the burden placed on participants.
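
The random-time scheduling at the heart of experience sampling – whether delivered by a pager in the 1980s or a smartphone app today – can be sketched in a few lines. In the following minimal sketch, the waking-hours window, the number of daily prompts, and the minimum gap between prompts are illustrative assumptions, not settings from any of the studies discussed here.

# Minimal sketch of signal-contingent experience sampling: draw random survey
# prompt times within a waking-hours window. All parameters are assumptions.
import random
from datetime import date, datetime, time, timedelta


def draw_prompt_times(day: date, n_prompts: int = 3,
                      start: time = time(9, 0), end: time = time(21, 0),
                      min_gap_minutes: int = 60) -> list[datetime]:
    """Return sorted random prompt times, at least min_gap_minutes apart."""
    window_start = datetime.combine(day, start)
    window_minutes = int((datetime.combine(day, end) - window_start).total_seconds() // 60)
    while True:
        offsets = sorted(random.sample(range(window_minutes), n_prompts))
        if all(b - a >= min_gap_minutes for a, b in zip(offsets, offsets[1:])):
            return [window_start + timedelta(minutes=m) for m in offsets]


print(draw_prompt_times(date(2018, 1, 15)))  # three random times, 09:00-21:00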

3

PRACTICAL IMPLEMENTATION: AN APP-BASED LABOUR MARKET STUDY (IAB-SMART)

Smartphone-based measurement can be used for a variety of purposes, but one of the most exciting is to replace observational studies. Prior to the widespread use of personal electronic devices that people carry around with them on a daily basis, researchers relied on asking questions and observing behaviour. In the 1930s, the sociologists Marie Jahoda, Paul Lazarsfeld, and their colleagues had a unique opportunity to study the effects of unemployment when most of the residents of a small Austrian town, Marienthal, were laid off due to the closing of a nearby factory. The researchers used a combination of methods, including structured and unstructured observations and interviews of unemployed individuals going about their daily lives, to learn about the effects of unemployment on the community and its members (Jahoda et al., 1971). Their main finding was that unemployment had detrimental effects on the community and the unemployed themselves, including a loss of the sense of time, declining participation in activities outside of the home and beyond people's immediate circles, and other forms of psychological impairment that affected their behaviours.

In 2018, Frauke Kreuter and her colleagues from the Institute for Employment Research (IAB) in Germany decided to replicate this study using new technologies such as smartphone sensing. Kreuter and colleagues (2020) used a longitudinal survey of the German general population based on a probability sample (Panel Study Labour Market and Social Security, or PASS) as the basis for the recruitment of participants into the app-based IAB-SMART study. PASS is a German probability-based household panel that focuses on labour market, poverty, and welfare state research and oversamples households with recipients of welfare benefits (Trappmann et al., 2013). Respondents of PASS who owned Android smartphones were asked to download a research app that, in addition to posing questions to participants, passively collected various types of data over a six-month period: (1) participants' geographic location; (2) physical activity and means of transportation (such as walking, biking, or using a motorized vehicle); (3) use of apps installed on the smartphone; (4) smartphone-mediated social interactions inferred from encrypted logs of outgoing calls and text messages; and (5) characteristics of participants' social networks inferred from the contact list (i.e., gender and nationality estimated from the first names of participants' contacts). Using these objective data on the physical activity of the employed and unemployed, the researchers found that the unemployed were somewhat less active than employed individuals on the weekends. However, during the week, while the unemployed started their daily activities somewhat later, the differences in activity between the employed and the unemployed were quite small (Bähr et al., 2018). In the following, we describe the set-up of this study in more detail, as it may serve as a blueprint for other studies that wish to collect information about social phenomena using new technologies, and touch upon other possibilities of study design.

3.1 Recruitment

The invitation to participate in the IAB-SMART study was sent to a random sample of 4,293 PASS respondents who had indicated in a previous panel wave that they owned an Android smartphone (since access to the required sensor data was only possible through the Android operating system). The PASS study interviews individuals over 15 years of age; however, for the IAB-SMART study, PASS panel members aged 18 to 65 were approached. The panel was in its 12th wave of data collection at the time of the IAB-SMART study. Panel members were mailed invitations to participate, with one postal reminder. The communication strategy included three pillars: (1) the invitation package; (2) a website with answers to frequently asked questions (see www.iab.de/smart); and (3) an option for the participants to reach out via email or telephone. The invitation package sent to the participants included a cover letter, data protection information, information about incentives, and an installation booklet with a step-by-step guide on how to download the app from the Google Play Store and register. The cover letter included a description of the study goals as well as an explanation of how to find the app, a link for direct download, and a unique code for registration (Kreuter et al., 2020). As a rule, the opportunities to change the standard communication with users within the Google Play Store or Apple App Store are limited, so researchers should plan a communication strategy outside of the app (such as the information package described above) or a landing page (a web page containing information about a study) to maximize participation.

In the IAB-SMART study, participants received incentives for three different tasks: (1) installing the app; (2) activating functions within the app to passively collect data; and (3) answering EMA questions within the app. The incentives were offered in the form of points that could be exchanged for Amazon vouchers. The study included a randomized experiment on the influence of incentive amounts on registration and participation, and the total incentive amount for participating in the study for six months varied between 60 and 100 Euros depending on the experimental condition (Haas et al., 2020a).
For the installation, participants were assigned randomly to either a 10 Euro or a 20 Euro condition (both as promised incentives, that is, provided upon fulfilment of the task and not unconditionally prior to completing it). For the activation of the different functions within the app, one group (independent of the installation incentive) received up to 5 Euros for activating the five data-sharing functions and not disabling them for 30 days (1 Euro per function), whereas another group received the same amount plus an additional 5 Euro bonus for activating and not disabling all five data-sharing functions. The app installation rate was 16 per cent in the 20 Euro installation incentive group compared to 13 per cent in the 10 Euro installation incentive group; the bonus incentive had no significant effect on activating the data-sharing functions (Haas et al., 2020a).
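
The incentive rules just described can be summarised in a few lines of code. The following is a simplified sketch for a single 30-day period; the function and parameter names are our own illustration, not code from the study.

# Minimal sketch of the IAB-SMART incentive rules described above, covering
# the first 30-day period. Names and structure are illustrative assumptions.

def incentive_euros(installed: bool, active_functions: int,
                    installation_incentive: int = 10,  # 10 or 20, by condition
                    bonus_condition: bool = False) -> int:
    """Euros earned from installing the app and keeping functions active."""
    if not installed:
        return 0
    euros = installation_incentive + active_functions  # 1 Euro per function
    if bonus_condition and active_functions == 5:
        euros += 5  # bonus group: extra 5 Euros for keeping all five active
    return euros


print(incentive_euros(installed=True, active_functions=5, bonus_condition=True))  # 20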

3.2 Obtaining Participants' Consent

The consent process for the installation of the IAB-SMART app was designed in accordance with the European General Data Protection Regulation (2016), which was a driver for the technical implementation chosen by the researchers. Providing informed consent for the study was a multistep process. First, participants needed to download the app from the Google Play Store. In addition to the standard Google permissions screen, which cannot be modified to add detailed information about what data the app would collect, the researchers implemented three further steps in the consent process. Second, participants were asked to agree to linkage of the data that the app would collect to the data from the PASS study. Third, participants were asked to accept a general privacy notice, the same as received in the invitation letter, and general terms of service. Finally, participants were shown a screen detailing the five data-sharing functions, with individual options to consent to each of them (Kreuter et al., 2020). The design decision to ask separately for consent to each of the five functions was aimed at increasing transparency and participants' autonomy in the data collection process. Indeed, several experimental studies have shown that autonomy over data collection increases willingness and consent rates for passive sensor measurement and for performing active tasks on smartphones (such as taking pictures and videos) and sharing them with researchers (Keusch et al., 2019; Struminskaya et al., 2020, 2021). In the IAB-SMART study, 91 per cent of people who had downloaded the app consented to activating at least one of the functions and 71 per cent activated all of the functions (Kreuter et al., 2020). These consent rates show that, despite the presentation of the functions as individual consent options that participants had to opt into (as opposed to the common opt-out consent process in commercial apps), the share of those agreeing to data collection using smartphone apps and sensors is relatively high.
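
As a minimal sketch, the five opt-in consent decisions could be represented in a record such as the following. The field names paraphrase the five data-sharing functions listed earlier, and all flags default to off, so each function must be actively opted into; this is an illustration, not the study's implementation.

# Minimal sketch of a per-function, opt-in consent record. Field names
# paraphrase the five data-sharing functions; all default to False so that
# each function must be actively opted into.
from dataclasses import dataclass, fields


@dataclass
class ConsentRecord:
    geolocation: bool = False
    activity_and_transport: bool = False
    app_usage: bool = False
    call_and_text_logs: bool = False
    contact_list_characteristics: bool = False

    def any_granted(self) -> bool:
        return any(getattr(self, f.name) for f in fields(self))

    def all_granted(self) -> bool:
        return all(getattr(self, f.name) for f in fields(self))


consent = ConsentRecord(geolocation=True, app_usage=True)
print(consent.any_granted(), consent.all_granted())  # True False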

3.3 Sampling Frequency

For sensing studies, one important consideration is the frequency of data collection, also known as the sampling rate. In the IAB-SMART study, geolocation was measured every 30 minutes (Kreuter et al., 2020). This discrete sampling rate enabled the researchers to protect participants' privacy, since it was not necessary to know the exact location of individuals to answer the research questions. Besides the design decisions by the researchers, the sampling rate can depend on the technical capabilities of the sensor and the characteristics of the device, such as a sleep mode or a battery-saving mode. Measuring geolocation every 30 minutes, in addition to protecting participants' privacy, also conserves the device's battery. However, for some behaviours, continuous sampling is necessary. For example, tracking participants' smartphone usage in IAB-SMART required continuous, always-on data collection (Kreuter et al., 2020).
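
A sampling-rate policy of this kind can be sketched as follows. The sketch combines the fixed 30-minute interval of IAB-SMART with the motion-dependent rates of the travel study described in Section 2; the battery threshold is purely an illustrative assumption, not a setting from either study.

# Minimal sketch of a sampling-rate policy: a fixed 30-minute geolocation
# interval, tightened when the device is in motion (as in the travel study
# described in Section 2) and relaxed when the battery runs low.

def sampling_interval_seconds(in_motion: bool, battery_level: float,
                              base_interval: int = 30 * 60) -> int:
    """Return the number of seconds to wait before the next location fix."""
    if battery_level < 0.15:
        return base_interval * 2   # conserve battery when nearly empty
    if in_motion:
        return 1                   # per-second sampling while moving
    return base_interval           # default: one fix every 30 minutes


print(sampling_interval_seconds(in_motion=True, battery_level=0.8))   # 1
print(sampling_interval_seconds(in_motion=False, battery_level=0.1))  # 3600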

3.4 Triggered Measurements

The IAB-SMART app not only collected data passively through smartphone sensors and actively through self-reports but also combined the two types of measurement in geolocation-triggered questions, an approach known as 'geofencing' (Haas et al., 2020b). When a participant's smartphone geolocation was within a 200-metre radius of one of the 410 job centres in Germany that the researchers had geofenced, and the participant spent more than 25 minutes in the geofenced area, the app showed a question asking whether they had visited a job centre for a consultation meeting. If this question was answered with yes, the participant received follow-up questions about the meeting (Haas et al., 2020b). A minimal sketch of this trigger logic is given below.

During the design and implementation of the IAB-SMART study, the researchers made numerous decisions that were driven by the research questions, the technical capabilities of the devices, and budget constraints. In the next section, we outline some of the methodological challenges involved in sensor, app, and wearable data collection that researchers have to take into account when designing an empirical study involving new technologies.
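
As referenced above, the sketch below implements the trigger: the haversine formula yields the great-circle distance between two coordinates, while the job-centre coordinate and the format of the location track are illustrative assumptions, not the study's implementation.

# Minimal sketch of a geofence trigger: fire a survey question once a device
# has stayed within a 200-metre radius of a job centre for over 25 minutes.
from math import asin, cos, radians, sin, sqrt

RADIUS_M = 200
DWELL_MINUTES = 25
JOB_CENTRES = [(52.5200, 13.4050)]  # (lat, lon) placeholder coordinates


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6_371_000 * 2 * asin(sqrt(a))


def should_trigger_survey(track):
    """track: list of (minutes_since_start, lat, lon) location fixes."""
    entered_at = None
    for minutes, lat, lon in track:
        inside = any(haversine_m(lat, lon, clat, clon) <= RADIUS_M
                     for clat, clon in JOB_CENTRES)
        if inside:
            entered_at = minutes if entered_at is None else entered_at
            if minutes - entered_at > DWELL_MINUTES:
                return True  # dwell time exceeded: show the survey question
        else:
            entered_at = None  # left the geofence: reset the dwell clock
    return False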

4

METHODOLOGICAL CHALLENGES

In his keynote speech at the first Mobile Apps and Sensors in Surveys (MASS) workshop, Mick Couper suggested that it is crucial to understand the inferential limits of our data regardless of the methods we use, and that there are several misconceptions or 'myths' associated with (passive) mobile measurement (Couper, 2019): for example, that smartphone use is ubiquitous, that participants actually complete tasks in the moment, that people agree to use their smartphones in the ways researchers intend, and that sensor data necessarily represent the truth. When planning to collect data using digital technologies, it is therefore useful to critically assess potential sources of error. For that purpose, frameworks from survey methodology research such as the total survey error (TSE) framework (e.g., Groves & Lyberg, 2010) can provide guidance on how to disentangle different sources of error and minimize them when setting up a study.

The TSE concept classifies errors that can affect the outcomes into two major categories: (1) errors of non-observation; and (2) errors of observation. Errors of non-observation result from the failure to observe (parts of) the population intended to be studied. They include coverage error, sampling error, and non-response error. Coverage error is the failure to include (or the erroneous inclusion of) certain elements on the sampling frame, which ideally is the set of all members of the target population. Sampling error is the imprecision of estimates that results from surveying a sample of respondents (which represents one possibility out of the many samples that can be drawn from the sampling frame) as opposed to surveying the entire population. Non-response error is the failure to obtain survey answers from all sampled units. Adjustment error is the failure to obtain a proper representation of the target population when correcting the observed values with statistical techniques such as weighting or imputation. Errors of observation involve measurement. Measurement error is the difference between the observed response and the underlying true value. Processing errors occur when responses are transferred into the database or when the raw data are coded incorrectly for further analysis. Multiple adaptations of TSE exist for big data (e.g., Amaya et al., 2020; Sen et al., 2021); however, the general idea of representation and measurement challenges holds. We discuss the most pressing challenges one by one.

4.1 Coverage

An example of a coverage error when conducting a study that uses wearable technologies would be a study that relies on participants to share data from their fitness wristbands to analyse weekend versus weekday activity by race and ethnicity. If the rate of ownership of these devices is lower in some groups of the target population than in others, coverage error will arise. In Germany, smartphone ownership, while on the rise, correlates with age, education, nationality, region, and community size (Keusch et al., 2020). Similar differences exist in other countries such as the United States: Antoun (2015) speaks of a 'device divide': people who use mobile internet devices are younger, better educated, more likely Black or Hispanic, and have higher income than those who mostly use computers to access the internet. One solution to coverage error is to provide participants with devices, as has been done, for example, by Sugie (2018) and Doherty et al. (2017).

4.2 Non-Participation

An example of a setting where non-participation error can occur is a study in which participants are provided with wearable devices to measure sleep quality, and those who do not sleep well remove the devices at night because they further disturb their sleep. Many studies recruit volunteers, so it is not possible to gauge the error of non-participation. However, if participants come from a pre-recruited (longitudinal) study, as in IAB-SMART described in the previous section, it is possible. Willingness rates vary considerably by sensor and the nature of the task. For example, in a cross-country study, Revilla and colleagues (2016) found that willingness to take pictures using a smartphone camera was considerably higher than willingness to share geolocation: 25–52 per cent versus 19–37 per cent. In studies on mobility using GPS and accelerometers in the Netherlands, 37 per cent were willing to share the data and 81 per cent of those participated, while a study on physical activity using wearable devices sent to participants yielded a willingness rate of 57 per cent with participation conditional on willingness of 90 per cent (Scherpenzeel, 2017). The rates also vary by country: in the study by Revilla and colleagues (2016), willingness to take pictures using smartphones varied between 29 per cent (Spain) and 52 per cent (Mexico), and willingness to share geolocation varied between 19 per cent (Portugal) and 37 per cent (Chile). If studies require the download of a research app and registration within the app, willingness and participation rates are usually lower: 35 per cent downloaded a travel app in a study by Statistics Netherlands (McCool et al., 2021), 18 per cent indicated that they would install an app to track URLs of visited websites in a study by Revilla and colleagues (2019), 17 per cent downloaded a UK Understanding Society Innovation Panel budget app (Jäckle et al., 2019), and the download rate for the IAB-SMART app was 16 per cent (Kreuter et al., 2020). One of the reasons why willingness and participation rates are low might be participants' concerns about privacy.
Several studies indicate that higher privacy concerns correlate with lower willingness to share data collected using smartphone sensors, apps, and wearables (Keusch et al., 2019; Revilla et al., 2019; Struminskaya et al., 2020, 2021; Wenz et al., 2019). However, emphasizing privacy protection in the request to share data does not increase willingness or sharing (Struminskaya et al., 2020, 2021). Privacy concerns are inversely related to smartphone skills: people who perform many activities on their smartphones seem to be less concerned about sharing the data collected by their devices (Keusch et al., 2020), and people who perform more activities on their smartphones are more willing to share data collected using sensors, apps, and wearables (Keusch et al., 2019; Struminskaya et al., 2020, 2021; Wenz et al., 2019). However, levels of concern are higher for passive tasks, such as allowing researchers to track which apps are used or to track geolocation, than for active tasks such as using the smartphone camera (Keusch et al., 2020). At the same time, the nature of the task matters: sharing a photo of a receipt of a recent purchase is done at a higher rate than sharing a selfie (18 versus 12 per cent; Struminskaya et al., 2021). Further factors that influence willingness to share and actual sharing are prior experience with downloading a research app (Keusch et al., 2019); autonomy over data collection, as willingness to share data is higher for tasks where participants have control over the data collection (Revilla et al., 2019; Keusch et al., 2019; Struminskaya et al., 2021); and the study sponsor, with a university sponsor yielding higher willingness to share data in experimental studies than market research firms or statistical offices as sponsors (Keusch et al., 2019; Struminskaya et al., 2020). Overall, the decisions to participate in (passive) mobile data collection seem to be very nuanced (Struminskaya et al., 2021); there is still much to be learned, but the design decisions made by researchers have the potential to influence (non-)participation.

4.3 Measurement

It is tempting to assume that by removing human cognition and social interaction from passive sensor data collection we can eliminate all measurement error. However, errors might still arise at the stages of collecting, processing, and interpreting the data. Differences in the type of sensor and the brand and model of the device can introduce errors. There are also differences in how participants handle their devices. For example, smartphone users commonly turn off their smartphones, leave them at home, or do not carry them close to their bodies, and these behaviours differ by sociodemographic characteristics (Keusch et al., 2022). Differences in how people use their devices can cause measurement errors in passive measurement. Besides non-compliance, such as turning off smartphones or taking off wearable devices, there might be technical restrictions imposed by the devices: for example, some operating systems might turn off apps running in the background. McCool et al. (2021) found that, due to battery concerns, the Apple operating system iOS takes strict control of the location management system, restricting the frequency with which locations can be polled and leading to missing data. For the IAB-SMART study, Bähr and colleagues (2020) identified five different sources that could introduce measurement error (or missing data) when collecting geolocation data: (1) device turned off; (2) no measurement at all; (3) no geolocation measurement; (4) geolocation failed; and (5) geo-coordinates invalid. The problem of missing data in studies using sensors, apps, and wearables can be severe: Bähr and colleagues (2020) found that the actual number of GPS measurements was less than 50 per cent of the expected number.
Generally, researchers have little control over the devices on which their applications are installed, unless the devices are provided to the participants. The challenge is to find ways to compensate for missing data that arise from technological issues and participants' behaviour.
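
One simple diagnostic that follows from this discussion is a per-participant completeness rate: the share of scheduled measurements actually observed. The following minimal sketch assumes the 30-minute geolocation design described in Section 3; the counts are illustrative, not figures from the study.

# Minimal sketch of a completeness diagnostic for sensor data: the share of
# scheduled measurements actually observed, under a 30-minute location design.

def completeness_rate(observed_fixes: int, study_days: int,
                      interval_minutes: int = 30) -> float:
    """Observed share of the location fixes the design schedules."""
    expected = study_days * 24 * 60 // interval_minutes
    return observed_fixes / expected


# A participant with 4,000 valid fixes over a 180-day field period:
print(f"{completeness_rate(4_000, 180):.0%}")  # 46% of expected measurements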

Furthermore, researchers can unintentionally introduce measurement error through certain features of their apps. For example, providing feedback to participants can lead to measurement reactivity: Darling and colleagues (2021) randomly assigned participants of an actigraphy study to devices that either provided feedback or did not, and found that participants showed 7 per cent more physical activity when wearing a Fitbit (with feedback) than when wearing a GENEActiv (no feedback). There are reasons to provide feedback: participants might value the opportunity to learn about their own behaviour, since most commercial apps offer this feature, and insight into what data are being shared gives participants more autonomy over the data collection. Nevertheless, how and when to present feedback without introducing measurement error needs to be carefully considered.

The list of challenges outlined above will grow as the technologies evolve (e.g., will the measurements be comparable longitudinally?) and as users' behaviours and interactions with the devices change. Furthermore, it is currently not clear whether coverage and non-participation challenges or measurement challenges are more severe. It is, however, important to keep in mind what actions are required from participants, from initial willingness and consent to compliance with the study's tasks, which can be demanding (compare installing a geolocation tracker and not uninstalling it with a study involving multiple measurements, camera tasks, or EMA questions). The burden of measurement using apps, sensors, and wearables still needs to be investigated.

5

CONCLUDING REMARKS

In this chapter, we have reviewed the opportunities and challenges for the social sciences of measurement using new technologies such as sensors, apps, and wearables. Using the IAB-SMART study, we illustrated the design decisions that researchers have to make in practice and showed how numerous smartphone sensing technologies enabled researchers to gain more insight into the behaviours of employed and unemployed individuals in Germany. We expect that, as new technologies develop further, new sensors and combinations of sensors will allow researchers to approach existing and new research questions in the social and behavioural sciences. As more studies investigate the relationships between passive measurement and self-reporting, questions from surveys can be omitted and interventions based on passively collected data can be developed.

To extract the maximum value from sensor, app, and wearable data, we believe that one should take an approach of designed big data: marrying the strengths of sensor data (which share many characteristics with big data, such as large volume, little structure, and high velocity) and survey data (which allow the researcher high control over the study design, including the specification of the concepts to be measured). In this approach, one would start with a probability sample and be able to assess the errors that occur at each step of the process, from participant selection to the measurement itself (Figure 5.2).

[Figure 5.2: Introducing design to sensor measurements or big data. Source: Struminskaya et al. (2021).]

Thinking about potential sources of error and how to reduce them is also useful for further developments in digital technologies, such as the collection of digital traces – pieces of information left behind by individuals on information systems and digital platforms (Stier et al., 2020) – and the sharing of these digital traces with researchers, known as data donation (Boeschoten et al., 2020). While the measurement itself might be passive and unobtrusive, and thus requires little to no effort from an individual, informed consent is active, and best practices for obtaining informed consent in ways that maximize participation on the one hand and give participants autonomy over their data on the other are yet to be developed. Legal and ethical considerations apply not only to the consent process but also to the measurement itself: using the same technologies that trigger data collection, such as geofencing, one might wish to pause data collection at certain places (e.g., in places of worship, near schools or hospitals) or during certain time intervals to protect participants' privacy. In deciding how to implement such aspects of sensor, app, and wearable data collection, social science researchers might benefit from user experience research. Taken together with what has been known for several decades about sampling, asking questions, and eliminating measurement errors, design decisions that guide data collection would allow for good inference from digital data.

REFERENCES

Amaya, A., Biemer, P. P., & Kinyon, D. (2020). Total error in a big data world: Adapting the TSE framework to big data. Journal of Survey Statistics and Methodology, 8(1), 89–119.
Antoun, C. (2015). Who are the internet users, mobile internet users, and mobile-mostly internet users? Demographic differences across internet-use subgroups in the US. In D. Toninelli, R. Pinter, & P. de Pedraza (Eds), Mobile Research Methods (pp. 99–117). London: Ubiquity Press.
Bähr, S., Haas, G.-C., Keusch, F., Kreuter, F., & Trappmann, M. (2018). Marienthal 2.0: die Erforschung der subtilen Wirkungen von Arbeitslosigkeit mittels Smartphones (Marienthal 2.0: Studying the subtle influences of unemployment using smartphones). Poster presented at the seminar Analytische Soziologie: Theorie und empirische Anwendungen. Venice International University, San Servolo, 12–15 November.
Bähr, S., Haas, G.-C., Keusch, F., Kreuter, F., & Trappmann, M. (2020). Missing data and other measurement quality issues in mobile geolocation sensor data. Social Science Computer Review, 40(1), 212–235.
Bietz, M., Patrick, K., & Bloss, C. (2019). Data donation as a model for citizen science health research. Citizen Science: Theory and Practice, 4(1), 6, 1–11.

Boeschoten, L., Ausloos, J., Möller, J. E., Araujo, T., & Oberski, D. L. (2020). A framework for digital trace data collection through data donation. https://arxiv.org/pdf/2011.09851.pdf
Couper, M. P. (2019). Mobile data collection: A survey researcher's perspective. Keynote speech presented at the first Mobile Apps and Sensors in Surveys Workshop. Mannheim, Germany, 4 March. https://massworkshop.org/2019-workshop/
Csikszentmihalyi, M., & Larson, R. (1987). Validity and reliability of the experience-sampling method. Journal of Nervous and Mental Disease, 175(9), 526–536.
Darling, J., Kapteyn, A., & Saw, H.-W. (2021). Does feedback from activity trackers influence physical activity? Evidence from a randomized controlled trial. Paper presented at the second virtual Mobile Apps and Sensors in Surveys Workshop, 22 April. https://massworkshop.org/2021-workshop/
Doherty, A., Jackson, D., Hammerla, N., Plötz, T., Olivier, P., Granat, M. H., White, T., van Hees, V. T., Trenell, M. I., Owen, C. G., Preece, S. J., Gillions, R., Sheard, S., Peakman, T., Brage, S., & Wareham, N. J. (2017). Large scale population assessment of physical activity using wrist worn accelerometers: The UK Biobank study. PLOS ONE, 12(2), e0169649.
English, N., Zhao, C., Brown, K. L., Catlett, C., & Cagney, K. (2020). Making sense of sensor data: How local environmental conditions add value to social science research. Social Science Computer Review. doi: 10.1177/0894439320920601
General Data Protection Regulation (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council. Official Journal of the European Union, L119/1. eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679
Groves, R., & Lyberg, L. (2010). Total survey error: Past, present, and future. Public Opinion Quarterly, 74(5), 849–879.
GSM Association (2020). The mobile economy 2020. www.gsma.com/mobileeconomy/wp-content/uploads/2020/03/GSMA_MobileEconomy2020_Global.pdf
Haas, G.-C., Kreuter, F., Keusch, F., Trappmann, M., & Bähr, S. (2020a). Effects of incentives in smartphone data collection. In C. A. Hill, P. P. Biemer, T. D. Buskirk, L. Japec, A. Kirchner, S. Kolenikov, & L. E. Lyberg (Eds), Big Data Meets Survey Science (pp. 387–414). Wiley.
Haas, G.-C., Trappmann, M., Keusch, F., Bähr, S., & Kreuter, F. (2020b). Using geofences to collect survey data: Lessons learned from the IAB-SMART study. Survey Methods: Insights from the Field, Special issue: 'Advancements in Online and Mobile Survey Methods'. https://surveyinsights.org/?p=13405
Harari, G. M., Lane, N. D., Wang, R., Crosier, B. S., Campbell, A. T., & Gosling, S. D. (2016). Using smartphones to collect behavioral data in psychological science: Opportunities, practical considerations, and challenges. Perspectives on Psychological Science, 11(6), 838–854.
Jäckle, A., Burton, J., Couper, M. P., & Lessof, C. (2019). Participation in a mobile app survey to collect expenditure data as part of a large-scale probability household panel: Coverage and participation rates and biases. Survey Research Methods, 13(1), 23–44.
Jahoda, M., Lazarsfeld, P. F., & Zeisel, H. (1971). Marienthal: The Sociography of an Unemployed Community. Aldine-Atherton.
Keusch, F., & Conrad, F. G. (2021). Using smartphones to capture and combine self-reports and passively measured behavior in social research. Journal of Survey Statistics and Methodology, 10(4), 863–885.
Keusch, F., Struminskaya, B., Antoun, C., Couper, M. P., & Kreuter, F. (2019). Willingness to participate in passive mobile data collection. Public Opinion Quarterly, 83(S1), 210–235.
Keusch, F., Bähr, S., Haas, G.-C., Kreuter, F., & Trappmann, M. (2020). Coverage error in data collection combining mobile surveys with passive measurement using apps: Data from a German national survey. Sociological Methods & Research, 7 April.
Keusch, F., Wenz, A., & Conrad, F. (2022). Do you have your smartphone with you? Behavioral barriers for measuring everyday activities with smartphone sensors. Computers in Human Behavior, 127, 107054.
Kreuter, F., Haas, G.-C., Keusch, F., Bähr, S., & Trappmann, M. (2020). Collecting survey and smartphone sensor data with an app: Opportunities and challenges around privacy and informed consent. Social Science Computer Review, 38, 533–549.
Lathia, N., Sandstrom, G. M., Mascolo, C., & Rentfrow, P. J. (2017). Happier people live more active lives: Using smartphones to link happiness and physical activity. PLOS ONE, 12(1), e0160589.

MacKerron, G., & Mourato, S. (2013). Happiness is greater in natural environments. Global Environmental Change, 23(5), 992–1000.
McCool, D., Lugtig, P., Mussmann, O., & Schouten, B. (2021). An app-assisted travel survey in official statistics. Possibilities and challenges. Journal of Official Statistics, 37, 49–70.
Pew (2021). Mobile fact sheet. Pew Research Center: Internet, Science & Tech. www.pewresearch.org/internet/fact-sheet/mobile/
Revilla, M., Toninelli, D., Ochoa, C., & Loewe, G. (2016). Do online access panels need to adapt surveys for mobile devices? Internet Research, 26, 1209–1227.
Revilla, M., Couper, M. P., & Ochoa, C. (2019). Willingness of online panelists to perform additional tasks. Methods, Data, Analyses, 13, 223–252.
Rodenburg, E., Schouten, B., & Struminskaya, B. (2022). Nonresponse and dropout in an app-based household budget survey: Representativity, interventions to increase response, and data quality. Paper presented at the 3rd Mobile Apps and Sensors in Surveys Workshop, Utrecht, 17 June.
Scherpenzeel, A. (2017). Mixing online panel data collection with innovative methods. In S. Eifler & F. Faulbaum (Eds), Methodische Probleme von Mixed-mode Ansätzen in der Umfrageforschung (pp. 27–49). Springer.
Schmidt, T. (2014). Consumers' recording behaviour in payment diaries: Empirical evidence from Germany. Survey Methods: Insights from the Field. doi: 10.13094/SMIF-2014-00008
Sen, I., Floeck, F., Weller, K., Weiss, B., & Wagner, C. (2021). A total error framework for digital traces of human behavior on online platforms. Public Opinion Quarterly, 85(S1), 399–422.
Stachl, C., Au, Q., Schoedel, R., Gosling, S. D., Harari, G. M., Buschek, D., Völkel, S. T., Schuwerk, T., Oldemeier, M., Ullmann, T., Hussmann, H., Bischl, B., & Bühner, M. (2020). Predicting personality from patterns of behavior collected with smartphones. Proceedings of the National Academy of Sciences, 117(30), 17680–17687.
Stier, S., Breuer, J., Siegers, P., & Thorson, K. (2020). Integrating survey data and digital trace data: Key issues in developing an emerging field. Social Science Computer Review, 38(5), 503–516.
Struminskaya, B., Lugtig, P., Keusch, F., & Höhne, J. K. (2020). Augmenting surveys with data from sensors and apps: Opportunities and challenges. Social Science Computer Review, 089443932097995.
Struminskaya, B., Lugtig, P., Toepoel, V., Schouten, B., Giesen, D., & Dolmans, R. (2021). Sharing data collected with smartphone sensors. Public Opinion Quarterly, 85(S1), 423–462.
Sugie, N. F. (2018). Utilizing smartphones to study disadvantaged and hard-to-reach groups. Sociological Methods & Research, 47(3), 458–491.
Sugie, N. F., & Lens, M. C. (2017). Daytime locations in spatial mismatch: Job accessibility and employment at reentry from prison. Demography, 54(2), 775–800.
Trappmann, M., Beste, J., Bethmann, A., & Müller, G. (2013). The PASS panel survey after six waves. Journal for Labour Market Research, 46, 275–281.
Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., Zhou, X., Ben-Zeev, D., & Campbell, A. T. (2014). StudentLife: Assessing mental health, academic performance and behavioral trends of college students using smartphones. Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing.
Wenz, A., Jäckle, A., & Couper, M. P. (2019). Willingness to use mobile technologies for data collection in a probability household panel. Survey Research Methods, 13, 1–22.
Yan, T., Simas, M., & Machado, J. (2019). A smartphone app to record food purchases and acquisitions. Paper presented at the 2019 European Survey Research Association Conference. Zagreb, Croatia, 17 July.

6. Unlocking big data: at the crossroads of computer science and the social sciences

Oliver Posegga

1

INTRODUCTION

During the past two decades, we have experienced fundamental technological advances that are already having a profound impact on all levels of society – one that is likely to persist. A significant portion of social interaction has shifted from the physical to the digital world, as has the production, consumption, and dissemination of information. In the digital world, every interaction and every piece of information is represented by data, which have become one of the most valuable resources of the twenty-first century. Unlocking full access to this resource promises progress in established fields of research, not exclusively but especially in the social sciences, and insights into novel phenomena that arise from the digital transformation of society. At the same time, the growing availability of methods to process and analyse such data, most notably from the domain of machine and deep learning, puts the realisation of this promise within reach. In light of this potential, the early twenty-first century has been referred to as a 'golden age' for the social sciences, based on a 'measurement revolution' that significantly expands their methodological repertoire and empirical prowess (Kleinberg, 2008; Watts, 2007).

While there is little doubt that this vision has a bright outlook, realising its potential may be more complicated than anticipated. Despite a considerable amount of progress in specific areas, the general verdict seems to be that, to date, we have learned little about the social mechanisms that underlie the phenomena these novel data sources allow us to observe (Lazer et al., 2021; Watts, 2007).

This sobering assessment of the last 20 years of research raises a simple question: Why? The answer is, of course, complex and the subject of multiple seminal papers (Hofman et al., 2021; Lazer et al., 2020, 2021; Ruths & Pfeffer, 2014; Wallach, 2018; Watts, 2007). Contributing to the discussion and working towards a meaningful response requires taking a closer look at the underlying problem and its components. The problem is that even with an abundance of data, a growing portfolio of appropriate methods, and a rich body of theories about relevant phenomena of interest to the social sciences, we often fail to develop research designs that address meaningful questions and lead to valid answers. Thus, we miss the opportunity to unlock the full potential of the novel data available for research. The problem is not merely technical or methodological but also interdisciplinary and fundamentally related to the goals of the associated disciplines, especially the social sciences and computer science. Improving the state of the art requires understanding the types of data; the assumptions that underlie the methods used to collect, process, and analyse them; the role of the theories about the studied phenomena; and their interdisciplinary nature.

This chapter provides a brief introduction to the problem, establishes the role and nature of data in this context, and outlines the obstacles we face in addressing contemporary research questions at the intersection of computer science and social science. Before addressing the characteristics of digital data in the context of human behaviour, and particularly social behaviour, I will briefly establish their origins and importance in the following.

2

ON THE ORIGINS AND RELEVANCE OF NOVEL DATA SOURCES

One reason so many novel data sources have emerged is the digital transformation of nearly every aspect of society. Throughout this ongoing process, the nature of social interaction and the production, consumption, and dissemination of information have significantly changed. In essence, modern information and communication technologies such as the internet, social media, and smartphones provide ubiquitous and continuous access to networks of people and information. Moreover, such technologies simplify the production of multimedia content, which can be created with a few taps on the screen of a smartphone. Information can be shared with the click of a button and uploaded from almost anywhere at any time to a constantly evolving ecosystem of social media platforms with billions of users across the globe. Once uploaded, information becomes content that can be consumed, shared, and reacted to by other users, who might stumble upon it in feeds curated by platform providers or receive it from members of their personal networks.

This continuous flow of information and the interaction between individuals give rise to large-scale, digitally enabled social networks. Below their surface runs a complex stack of technologies that share a common property: to operate properly and satisfy their providers' goals, they must log the activities and interactions of their users and store the information they share. As a result, huge amounts of data resembling traces of digital interaction and shared content accumulate in the databases run by platform providers and become an integral component of their services. Almost as a by-product, the data produced in this context promise deep insights into the fabric and patterns of human behaviour and social interaction on an unprecedented scale (Kleinberg, 2008).

Such data promise to advance our understanding of phenomena that have been studied since long before the rise of social media and similar drivers of digitally enabled social networks. Studies of this type often revolve around whether the online world mirrors the offline world or is subject to novel dynamics that are typically not observed offline (Jungherr, 2018). Regardless of the outcome, the results of such studies are likely to offer fertile ground for future research. For example, in research on small-world networks, Travers and Milgram (1977) conducted a comprehensive and laborious experiment. They found that a person in North America was, on average, only six or fewer social connections – a chain of the proverbial 'friend of a friend' – away from any other person. These 'six degrees of separation' and other characteristics of small-world networks have been the subject of a plethora of research in online and offline contexts. With significantly less effort than Travers and Milgram, researchers at Facebook, for example, found that the average member of their platform is connected to any other member by a distance of 4.7 steps (Ugander et al., 2011). While confirming small-world properties, findings such as these raise further questions regarding the nature of social connections online. How, for example, do friendships on Facebook differ from the social relationships Travers and Milgram studied long before the emergence of social media, the internet, or even mobile phones? Following this thought, the interesting questions often emerge once we begin to ask how social behaviour might have changed in the presence of digitally enabled social networks and how online and offline networks are intertwined.

Such questions, and the study of social behaviour online more generally, can lead to research on novel phenomena that emerge from the complex interplay between the online and offline worlds. One prominent area of research in this context aims at understanding how the media system has transformed, given the many ways in which the production, dissemination, and consumption of information, including news, have changed with the advent of social media and related technologies (Chadwick, 2017; Jungherr et al., 2019). For example, the growing importance of digital and digitally born media has altered the role of traditional mass media. The once almost exclusive purview of traditional media to promote topics, frameworks, and speakers to the public is now routinely challenged by novel actors and individual users empowered by the open nature of the contemporary media system (Jungherr et al., 2019). Digital media also play a fundamental role in political campaigning and elections (see Jungherr, Chapter 25 in this volume). Another area of research investigates phenomena arising from collective behaviour afforded, in particular, by social media platforms and technologies that allow individuals to organise and coordinate in various ways. Prominent examples include crisis management (Eismann et al., 2016, 2018, 2021; Reuter et al., 2018), riots (Bastos et al., 2015; Panagiotopoulos et al., 2014; Segerberg & Bennett, 2011), and even revolutions (Bennett & Segerberg, 2012; Wolfsfeld et al., 2013).

While the data described above are a by-product of individuals using the platforms and services offered by organisations, the same technologies driving them can be used to create controlled settings to elicit data for academic purposes. For example, smartphones – which function as mobile sensor platforms – can be used to track individuals' proximity and measure the frequency and strength of their social relationships while controlling for various contextual factors (Stopczynski et al., 2014). Widely available smartwatches can perform the same tasks while also providing data on an array of individual vital signs. In so doing, they can substitute for more complex and sometimes more cumbersome devices, such as smart badges, which have been designed as measurement devices for social interaction in close proximity (Pentland, 2007). In addition, such devices can be paired with specifically designed mobile applications that prompt participants with surveys to elicit additional data (Miller, 2012). Thus, the pervasiveness of mobile technologies can be used to collect data on human behaviour while maintaining the necessary degree of control over the data collection, as required by some research designs (see also Struminskaya & Keusch, Chapter 5 in this volume). Similarly, interactive websites can be used to distribute and conduct surveys or to run more complex online experiments. In both cases, the novelty of the approach does not necessarily lie in the technical ability to create a survey form or design a digital experiment, but in the possibility of making it accessible to a broad audience using digital distribution channels, such as social media platforms.
Moreover, providing the participants of a study with additional technologies, such as specific software (e.g., browser plugins or mobile applications), allows researchers to monitor their activities throughout a study, such as by recording their clicks and interactions with websites while consuming media content or by collecting data on the content they are served by algorithms on social media platforms (Christner et al., 2021). Finally, crowdsourcing has become an established form of conducting and supporting research online. Scholars can extend the scope of studies significantly by distributing an

open call to the members of online communities and asking them to participate in academic endeavours by identifying patterns in images and labelling data (Sorokin & Forsyth, 2008) or participating in experiments and surveys (Paolacci et al., 2010), among other approaches. In the extreme, this concept can be extended to citizen science, where the research process is opened to the public (Bonney et al., 2014; Silvertown, 2009).

In summary, the digital transformation of society, especially of social interaction and information behaviour, has led to a variety of interesting phenomena that bridge the online and offline worlds. At the same time, the technologies that drive this transformation offer novel opportunities for advanced research designs. Unravelling the former and fully utilising the latter requires a thorough understanding of the technologies involved and the nature of the data they produce. Before elaborating further on the challenges of working with this type of data, it is essential to discuss the fundamental data types resulting from the above.

3 DIFFERENT TYPES OF DATA

Many labels have been used in reference to the data described above. Quite a number of them are vague, and 'big data' is probably the most opaque among them. While 'big data' is commonly used to describe the tremendous volume of data generated thanks to the prevalence of social media platforms and similar data sources, the term lacks any conceptual clarity and is thus best avoided. To account for the specific characteristics of such data, the term digital trace data is a much better alternative. While the term pertains primarily to data produced as a by-product of technology use, it serves as a reference point for data elicited deliberately using the same or similar technologies in controlled settings. Digital trace data are defined as 'records of activity (trace data) undertaken through an online information system (thus, digital)', where 'a trace is a mark left as a sign of passage; it is recorded evidence that something has occurred in the past' (Howison et al., 2011). This definition captures a wide variety of data created by human interactions with technology, such as undirected actions (e.g., logging in to a website or clicking a button), interactions directed at others (e.g., establishing a social relationship or uploading content), and interactions directed at information (e.g., liking a picture or sharing a link). Thus, digital traces range from simple log data documenting atomistic actions, such as transaction data from online markets (see Przepiorka, Chapter 13 in this volume) or trace data from online dating (see Skopek, Chapter 12 in this volume), to complex data representing aggregated information, such as content shared on social media platforms (e.g., tweets, Wikipedia articles, or YouTube videos; see Breuer et al., Chapter 14 and Schwemmer et al., Chapter 15 in this volume).

It is worth noting that trace data do not necessarily have to be digital (Howison et al., 2011). For example, observations and recordings of human activity offline, which document events that took place in the past, qualify as trace data, and are frequently used in some disciplines. Among other names, data of this type are referred to as process-generated data, that is, data produced as a by-product of human activity and not in response to a deliberate stimulus of an observer, such as newspaper articles or transcripts of political speeches (Baur, 2011; Johnson & Turner, 2003). Digital trace data share some of the properties of these other data but due to their origin have three fundamental characteristics that make them unique.
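To make these categories concrete, a minimal sketch of what trace records might look like is given below; the field names and values are hypothetical and do not mirror any particular platform's data model.

```python
from dataclasses import dataclass

@dataclass
class TraceEvent:
    timestamp: str  # when the activity was recorded
    actor: str      # who acted
    action: str     # e.g., 'login', 'follow', 'like'
    target: str     # another user or a piece of content; empty if undirected

events = [
    # undirected action
    TraceEvent('2022-05-01T09:13:02Z', 'user_42', 'login', ''),
    # interaction directed at another user
    TraceEvent('2022-05-01T09:14:10Z', 'user_42', 'follow', 'user_7'),
    # interaction directed at information
    TraceEvent('2022-05-01T09:15:55Z', 'user_42', 'like', 'post_123'),
]
```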

First, digital trace data are found rather than reported, thus they are referred to as found data (Howison et al., 2011). This characteristic refers to the fact that they are a by-product of individuals using information and communication technologies without the intent of creating data for academic purposes and often without reacting to a stimulus designed by an observer to elicit data for research (other than proprietary research conducted by the platform operator). This distinguishes them from data collected via measurement instruments specifically designed to collect data for research, such as surveys or experiments. This has two implications for working with digital trace data: they are not biased by measurement instruments – they represent human activity recorded in the absence of the typical effects introduced by observers and instruments present in research settings; but they are subject to other biases, many of which might be unknown. Most notably, digital trace data are affected by the design of the platforms from which they originate. While this can be accounted for in some cases, not all properties of the technologies involved are transparent to their users and outside observers. For example, the algorithms that recommend content to Twitter users or rank Google search results are not fully transparent and introduce biases to user behaviour observable through digital trace data. While digital trace data offer a high level of resolution and novel ways of observing human behaviour in digital spaces, it is vital to account for limitations introduced by the fact that they are created to operate the technologies and enable the services from which they originate. In this sense, they follow the logic and goals of the organisations running the platforms rather than those of independent observers interested in using digital trace data for research.

Second, digital trace data are event-based rather than summary data (Howison et al., 2011). While they can differ in their degree of abstraction, digital trace data typically document events at a specific time. For example, a messenger service might produce data that log interactions between individuals, and each record might represent a message exchanged between two individuals. Even for a small population using the service with moderate frequency, such data will likely contain hundreds of interaction events between pairs of users. While such events are interesting for some research areas, the events themselves would rarely be the focal unit of analysis. In many instances, the data would be used to study the relationships unfolding between the service users, with the events serving as proxies for social relationships between users. Analysing these relationships would require an abstraction from the events. A common way to perform such an abstraction would be to identify pairs of users who interact frequently and assume that theirs is a somewhat stable social relationship, thus converting event-based data to summary data. The result, a relationship between individuals inferred from event data, involves non-trivial assumptions about interpreting patterns emerging from event data. Compared to other means of data collection, such as surveys or interviews during which the individuals involved must characterise the nature of their relationships in response to precise questions, digital trace data and aggregates derived from them can be challenging to validate. At the same time, they provide insights into human behaviour that are not biased by such questions and can be used to study otherwise unobservable patterns of behaviour that might not be reported by individuals if prompted by an explicit stimulus.
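A minimal sketch of such an event-to-summary abstraction, with hypothetical column names and an arbitrary threshold of five messages, might look as follows:

```python
import pandas as pd

# Event-based data: one row per message between two users.
messages = pd.DataFrame({
    'sender':    ['a', 'a', 'b', 'a', 'c', 'b'],
    'recipient': ['b', 'b', 'a', 'b', 'a', 'a'],
})

# Treat each undirected pair of users as one dyad.
dyad = messages.apply(
    lambda r: tuple(sorted((r['sender'], r['recipient']))), axis=1)
counts = dyad.value_counts()

# Summary data: assume a 'stable relationship' for dyads with at least
# five exchanged messages (the threshold is an arbitrary assumption).
relationships = counts[counts >= 5].index.tolist()
print(relationships)  # [('a', 'b')]
```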
Third, digital trace data are longitudinal by nature (Howison et al., 2011). As a product of technological artefacts, digital trace data comprise records of activity that are typically timestamped. Whether the events represent an exchange via email or social media, digital content in the form of a tweet or Wikipedia article, an edit of a blog post, or a 'like' of a video, the technological infrastructure that underlies the platforms on which the activity

takes place records these events over time and thus produces time series data by default. While other sources of data collection, especially traditional instruments used in the social sciences, can be designed to collect longitudinal data, they are often used in cross-sectional designs, mainly when longitudinal designs are too labour-intensive or have little chance of producing reliable outcomes. While digital trace data thus provide a higher temporal resolution, working with them entails additional assumptions when temporal aggregations are required.

The properties described above result from digital trace data being a by-product of information systems online, that is, of the underlying data-generating process. They are, in some aspects, similar to what is usually referred to in academia as secondary data. In contrast to primary data, secondary data are not collected originally by the researcher conducting a study but by third parties (e.g., other researchers or organisations). When working with secondary data, one gives up control over the data generation and collection processes, including the dataset's characteristics. Similarly, when working with digital trace data, researchers rarely have control over the data-generating process, which is managed by the entity governing the technology that underlies the information system from which the data are extracted. This lack of control, which often imposes research limitations, can be compensated for in part by designing controlled research environments based on the same technologies that underlie proprietary platforms and services, or by establishing research collaborations with platform and service providers. Consider, for example, technologies used in the advertising industry to monitor how individuals interact with websites. These can be employed to elicit browsing behaviour and clickstream data of individuals to understand patterns of online news consumption (Flaxman et al., 2016; Mukerjee et al., 2018). Other examples include studies in which participants are equipped with smartphones, which act as mobile sensor platforms under the researcher's control, and thus can track the behaviour of participants – with their knowledge and consent (Stopczynski et al., 2014). Further, cooperating operators of established platforms can aid in regaining control over relevant sections of the data generation and collection processes. Notable examples include studies that rely on established platforms' communities to participate in academic research. For example, scholars cooperated with the operator of Eve Online, a massively multiplayer online game, and asked the player base to identify anomalies in images in exchange for in-game rewards (Sullivan et al., 2018). Similarly, users of MyHeritage and Geni.com participated in creating a crowdsourced genealogical dataset that comprises millions of genealogical records and is now available for research (Hsu et al., 2021; Kaplanis et al., 2018). While such approaches have significant advantages in terms of control over the resulting data, they often depend on the researchers' capabilities to establish cooperation or utilise novel technologies in often complex research designs. Thus, as with digital trace data, there are drawbacks.
At the same time, the challenges imposed by collecting and using such data are related less to control over the involved processes and are, instead, more grounded in instrument and measurement design and related to challenges with which the academic community is more familiar. To exploit data from either source fully, it is essential to account for the origin of the data and their characteristics. This requires understanding the data-generating processes unfolding between individuals, as well as the technologies they use, to achieve specific goals under or outside researchers’ control.


4 CHALLENGES AT THE INTERSECTION BETWEEN SOCIAL AND COMPUTER SCIENCE

Many of the challenges involved in working with the types of data described above pertain to data access, data-generating processes, and the role of theory in explaining phenomena of interest. Research on predicting election outcomes based on digital trace data, specifically social media data collected from Twitter, illustrates some of these challenges well. Ever since the inception of Twitter, its rich data have attracted scholars from a variety of disciplines, which has led to a substantial body of research evolving around the platform. A particular strand of Twitter-based research aims at using the signals issued by millions of Twitter users – in the form of tweets, retweets, mentions, hashtags, following relationships, and more – to predict events within and outside of the platform. Examples include the prediction of stock market trends (Oliveira et al., 2017; Pagolu et al., 2016), box office success (Arias et al., 2014), and the outcome of elections (Tumasjan et al., 2011). Predicting elections is an instructive example, as there is ample research on the subject. The anatomy of a study in this category is straightforward: the independent variables are typically measures derived from user activity in the period leading up to the election (e.g., the number of tweets containing the hashtags associated with political parties or candidates, their relative share of the overall activity observed during that time, and the sentiment inferred from the tweets); such measures are then used as input for statistical models, which are trained on the data to predict the outcome of the election. After the election, the model's results are compared to the actual election outcomes. While many of these studies present remarkably accurate models, they have been met with fierce criticism and suffer from several issues. Among the more prominent problems are a lack of reproducibility and generalisability of the findings, the often required intensive fine-tuning of the presented models, and a general lack of a theoretical foundation that would allow for an appropriate discussion of the findings in the context of established theories (Gayo-Avello, 2013; Gayo-Avello et al., 2011; Jungherr et al., 2012, 2017). In essence, despite the popularity of this line of research, it has taught us little about relevant social processes in the context of elections. A complete list of issues with this line of research is beyond this chapter's scope. Nevertheless, it is interesting to explore some fundamental questions that underlie the idea of predicting election outcomes and raise some general questions about the use of digital trace data in this context.
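In its most naive form, the anatomy just described might be sketched as follows; all hashtags, counts, and vote shares are invented for illustration:

```python
# Invented tweet counts per party hashtag in the pre-election period.
tweet_counts = {'#party_a': 120_000, '#party_b': 80_000, '#party_c': 40_000}

total = sum(tweet_counts.values())
predicted_share = {tag: n / total for tag, n in tweet_counts.items()}

# After the election, the 'forecast' is compared with the actual result,
# e.g., via the mean absolute error (vote shares invented as well).
actual_share = {'#party_a': 0.42, '#party_b': 0.38, '#party_c': 0.20}
mae = sum(abs(predicted_share[t] - actual_share[t])
          for t in tweet_counts) / len(tweet_counts)
print(predicted_share, round(mae, 3))
```

Everything that makes such studies contentious – who tweets, why, and what the counts actually measure – is hidden behind these few lines, which is precisely the point of the criticism discussed in this section.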

4.1 Data Access

It is worth noting the prominent role Twitter has played in enabling this and related lines of research. While there are multiple reasons for this, one crucial aspect is its policy regarding data access. Early on, Twitter provided almost unrestricted access to the platform’s data, including official interfaces to monitor the full live stream of tweets being published and a full history search interface to look up historical tweets. Complete archives of all tweets were created and shared online during that period. These archives were valuable resources for research, as they provided a complete picture of discussions and activities taking place on Twitter.

Over time, free access to this resource was reduced significantly with Twitter's decision to monetise most access to data. For example, rather than offering access to the entire stream of tweets published by the platform's users, Twitter began to restrict that access to a random sample. This posed a significant challenge for academic research, as control of the sampling logic was suddenly shifted to Twitter itself, outside the control of researchers (Morstatter et al., 2013). Other factors made it too complicated to control for characteristics of the data Twitter offered, such as unknown demographics and additional restrictions introduced for the look-up of historical data. Some of these restrictions have been lifted for academic research in recent years, but others – such as unknown characteristics of Twitter's population – persist as challenges to working with digital trace data collected from Twitter (see also Kashyap, Rinderknecht et al., Chapter 3 in this volume).

Twitter is, nevertheless, a positive example of providing managed access to data. Other platforms are almost inaccessible for academic research. At one end of the spectrum between closed and open access models, Facebook is known for being quite restrictive in providing access to its data (Hegelich, 2020). While the platform is accessible to some degree, Facebook's access model heavily favours business applications and is tailored towards advertisers that target specific audiences on the platform. This severely limits research opportunities based on Facebook data and leaves it to proprietary research conducted by Facebook, which cannot be replicated or evaluated without access to the data used by corporate research teams. At the other end of the spectrum, platforms such as Reddit and Wikipedia provide open access to their data, and hence their data are widely available for research. Beyond access, however, it is important to note that data collected from these platforms may still be subject to technical limitations. In the case of Reddit, posts and comments published on the platform can be upvoted and downvoted by Reddit users. While the messages themselves can be collected from the platform, the votes they receive are aggregated at the time of data collection. Thus, longitudinal data on votes are not available by default and need to be created by repeatedly collecting data from the platform.
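A sketch of how such repeated collection might be organised is shown below; fetch_score() is a deliberate placeholder for whatever collection method a project uses (e.g., an API client or export files), not a real Reddit endpoint.

```python
import time
from datetime import datetime, timezone

def fetch_score(post_id):
    """Placeholder: return the current aggregate score of a post."""
    raise NotImplementedError

def snapshot_scores(post_ids, interval_seconds=3600, rounds=24):
    records = []
    for _ in range(rounds):
        now = datetime.now(timezone.utc).isoformat()
        for pid in post_ids:
            records.append(
                {'post_id': pid, 'time': now, 'score': fetch_score(pid)})
        time.sleep(interval_seconds)
    # One score per post per snapshot, i.e., a self-made time series.
    return records
```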
While there are many reasons to restrict access to digital trace data that are accumulated in the databases of platform providers (e.g., privacy concerns and regulations), researchers have argued for better and controlled access to them (Hegelich, 2020; Lazer et al., 2020). Nevertheless, data access remains one of the most significant challenges in academic research. The election example outlined earlier illustrates some of the issues that arise from this limited access. As previously mentioned, the demographics of Twitter's population are mostly unknown, and access to data has, for a long time, been restricted to a sample of the overall activity on Twitter. This leads to an interesting question: What part of the population that has participated in the election is represented by data collected from Twitter, and is it reasonable to assume that public discussions on Twitter generally reflect public opinion about the election? While the question may not be focal for research that is merely interested in prediction, it becomes crucial in light of established research on elections, democratic processes, and public opinion. Few studies in this line of research actively address these sorts of questions and the issues of data and information quality that can challenge reproducibility and transparency. Related issues, such as limitations with respect to access to historical data, can introduce additional barriers to the replication of studies based on digital trace data. Overcoming limitations to data access is difficult. On the one hand, academia must coordinate efforts to convince platform providers to cooperate. On the other hand, scholars must carefully consider which data sources to choose in conducting research, and triangulate their

results using data from multiple platforms and instruments (e.g., by using online surveys to estimate relevant characteristics of the population from which they collect digital trace data). At the very least, it is vital to raise questions with respect to validity issues that may arise from limited access, and to embrace open data initiatives as a way to address reproducibility issues.

4.2 Data-Generating Processes

It is important to consider the data-generating processes that underlie digital trace data. In general, digital trace data result from individuals interacting with information systems online. Such interactions are subject to various influences that determine the information they contain. From a sociotechnical perspective, individual interactions with technological artefacts are actualised affordances – opportunities provided by a technological artefact, in a specific context, as perceived by individuals who decide to act upon those opportunities and use the technology in ways that help them achieve particular goals (Leonardi, 2012). Twitter, for example, affords its users the opportunity to follow other users, publish tweets, and tag keywords or other users in those tweets. Data logged by Twitter merely represent the result of the actualisation of an affordance; they are silent on the individual's context and intent. An individual might, for example, follow the Twitter profile of a political candidate during an election, which would likely be part of a dataset that could be used to study social media use during the election. Whether the visible tie between the individual and the candidate indicates political support or is simply the prerequisite for the individual to be notified about the candidate's posts depends on the individual's intention and goals. Both can typically only be inferred and derived from assumptions about the individual. At the same time, this missing information is vital in determining what is measured using digital trace data, a fundamental question of validity that pertains to every study working with this type of data.

Two other factors, context and design, can provide further insights into the use of technology. Context is a vague term that typically refers to the circumstances and the environment surrounding the interaction observed through digital trace data. In the context of election studies, an apparent contextual boundary is set by the timeframe and topic of a conversation taking place on Twitter. For example, it might be reasonable to assume that a tweet that mentions a political candidate in a conversation between multiple people during the campaign period is relevant to public discourse on the election. Likewise, an old tweet outside of this context in a different conversation might not be relevant. A study investigating patterns of social relationships among a cohort of students provides another example outside the realm of social media. The students who participated in the study offered data on their communication and interaction patterns in multiple media types, their course schedules, and GPS data on their physical location during the study. The availability of GPS data and the course schedules allowed the researchers to control the context of the students' interactions and distinguish, for example, between interactions related to shared classes and those outside of the students' university schedules (Stopczynski et al., 2014). While rich contextual information allows for more robust and refined analyses, it is difficult to obtain and sometimes altogether unobtainable. Nevertheless, efforts to control for and understand the context in which digital trace data have been produced are crucial to strengthening the foundation for the assumptions required to interpret those data.
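A sketch of this kind of context labelling – with hypothetical column names and a single course slot standing in for a full timetable – might look as follows:

```python
import pandas as pd

proximity = pd.DataFrame({
    'student_a': ['s1', 's1'],
    'student_b': ['s2', 's3'],
    'time': pd.to_datetime(['2014-03-03 10:15', '2014-03-03 20:30']),
})
schedule = pd.DataFrame({
    'start': pd.to_datetime(['2014-03-03 10:00']),
    'end':   pd.to_datetime(['2014-03-03 12:00']),
})

def in_class(t):
    # True if the event falls within any scheduled class interval.
    return bool(((schedule['start'] <= t) & (t <= schedule['end'])).any())

proximity['context'] = proximity['time'].map(
    lambda t: 'class' if in_class(t) else 'outside schedule')
print(proximity)
```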
Finally, it is essential to understand the design of a specific technology and how it affects user behaviour. Technology design rarely happens by accident; rather, it is intentional and

follows a particular purpose, such as promoting certain features and communicating intended use intuitively. Thus, a technology's design reflects the intentions and ideas of its designer, which must be accounted for when interpreting the results of interactions between users and a technological artefact. Again, we return to the example of predicting elections based on Twitter data and a single tweet that is part of a conversation related to the election. One might ask how the author of the tweet came to participate in the discussion. Twitter has a vast population, and millions of users discuss a plethora of topics on the platform at any time of the day. Somehow, a user must come across the conversation and decide to participate. Thus, the answer to our question may seem obvious for those familiar with Twitter: the topic or parts of the exchange must have ended up in the user's feed (the list of tweets a user sees each time the platform is visited). The follow-up question is why it ended up there, which leads to an interesting component of Twitter – the algorithm responsible for curating individualised feeds of content provided to each user. Knowing that one of the primary mechanisms of delivering content to Twitter users is their private feed and that an algorithm is responsible for composing it, one might wonder about the algorithm's inner workings. Unfortunately, these are quite elusive. The algorithm is not transparent; its inner workings are unknown to anyone but its designers. Thus, how and why a user might have decided to exhibit a specific behaviour on the platform often remains unclear. Perhaps the user follows someone who partook in the conversation, which led to it being part of their feed. It is equally likely that the user saw a trending topic, which led to them exploring it, finding the conversation, and becoming a participant. In this particular example, as in studies related to the consumption and spread of information on Twitter, this design aspect of Twitter poses a challenge that requires researchers to make assumptions about platform use (Howison et al., 2011; Lazer & Radford, 2017).

It is also worth noting that design is subject to change. For example, Twitter increased the maximum length of tweets from 140 to 280 characters; Facebook now allows its users to react to content published on the platform with a variety of emoticons, whereas historically it restricted such reactions to 'liking' content; YouTube recently decided to limit its response features rather than expand them, so while users can still like or dislike videos, the number of dislikes is no longer displayed. These changes can potentially expand or restrict user behaviour and change how users interact with a given platform. Thus, scholars must familiarise themselves with the design of and user behaviour on the platforms from which they obtain their data to strengthen the validity of their assumptions and increase the robustness and replicability of their results.

4.3 Role of Theory

Research based on digital trace data and the digital environments from which they originate requires particular attention to the role of theory. As mentioned earlier, the popularity of research based on digital trace data is due in part to the digital transformation of society and the surge of data that has resulted. At the very least, this transformation has shifted a significant fraction of social interactions and information exchange to digital spheres, where they are mediated by their enabling technologies. This has implications for established theories on social behaviour – the phenomena to which they pertain remain either unaffected or are subject to changes. In the former case, what has been studied ‘offline’ is either not subject to what happens ‘online’ or is mirrored in both domains; in the latter case, empirical evidence

online contradicts empirical evidence offline. Both cases are interesting and can provide valuable contributions to the body of social science theories. However, research must actively engage in theorising to deliver such contributions. While this may seem trivial, the complex and diverse range of phenomena studied through the lens of digital trace data and their interdisciplinary nature often renders theorising a complicated task. Consider, again, the simple case of predicting election outcomes, which illustrates this complication quite well. It is worth noting that the perspective taken here – framing the problem as a prediction task – is one closer to computer science, especially given the description of the basic architecture of a study aimed at predicting election outcomes based on digital trace data. A valid research goal in this domain is to devise a method to produce accurate and precise predictions as a way to demonstrate the predictive power of the data used in combination with the method. Social science studies, however, tend to focus far less on demonstrating anything about the data and method; their objective is far more often to build upon and thus extend the body of social science theories on a specific subject. Thus, in the social sciences, the question would not primarily be whether Twitter data can predict election outcomes and how those predictions could be improved, but why digital trace data from Twitter might be indicators of political support or public opinion during an election, and why this might translate into voting intentions, which might then affect the outcome of an election (Gayo-Avello et al., 2011; Hofman et al., 2021; Jungherr et al., 2017). Framed this way, the goal shifts significantly from predicting the outcome to being able to offer a supportable explanation for the outcome. In this sense, a study that merely predicts the outcome of a social process without a valid link to a theoretical foundation that helps explain the prediction might be less valuable to the social sciences.

Even this simplified depiction of the slightly different goal structures of two disciplines that are crucial to unlocking the potential of digital trace data for the study of social phenomena helps us to understand some of the prevalent issues that are detrimental to advancing the field. Predictions, of course, are not explanations, and vice versa – much like correlation and causation are related but different. Research based on digital trace data, particularly in its early days, was prone to issues resulting from this conflation. There has been no shortage of studies aimed at predicting all sorts of elections (Schoen et al., 2013), which indicates the popularity of this line of research. On the one hand, this popularity has helped to demonstrate the value and significance of digital trace data. It has rightfully drawn attention to where these data originate and the social processes that unfold in those places. Further, it has demonstrated the capability of innovative methods and research designs to identify relevant signals and interesting patterns of human behaviour with complex and large datasets comprising digital trace data. On the other hand, the success of predictive models has led to claims that require explanations not provided by the research designs used in such studies, which, in turn, has led to strong criticism (Hofman et al., 2021; Schoen et al., 2013).
Research often distinguishes between offline and online phenomena, which raises another issue related to the role of theory when working with digital trace data. The question of why Twitter data might be an indicator of political support is, undoubtedly, intriguing. Considering, however, that Twitter represents only a fraction of the population that might be eligible to vote in any particular election, and that Twitter users are people who are subject to a multitude of influences outside of the platform, it is reasonable to assume that there are more interesting questions. For example, one might ask about the role Twitter – and, by proxy, all the interactions and information exchanged on the platform – plays in the surrounding media system. Albeit a broad question, it is justified if the goal is to understand why messages exchanged on

the platform are supposed to be related to individual and collective behaviour during elections. The question becomes even more relevant when one considers the long history of research on the intersection between media systems and elections. For example, the way we measure and understand public opinion and agenda setting during and outside of election periods is based on the idea of a media system that comprises traditional mass media (print, radio, and television). This media system, however, has changed significantly; it has been replaced by a hybrid system that is subject to various influences and feedback loops between media organisations and individuals (Jungherr et al., 2019). Given the intertwined and complex nature of the system that underlies the phenomenon of interest, an exclusive focus on a single platform, or an emphasis on the relevance of what is happening online, seems to fall short of providing comprehensive and robust explanations. It is important to make clear that this does not disqualify studies that look at a particular phenomenon through the lens of a specific platform; rather, we need to address the big picture to provide more robust explanations for interesting phenomena at the intersection between the social sciences, computer science, and other sciences. This argument leads to another point: it is equally detrimental to ignore digital trace data and remain wedded only to that which is most familiar to researchers.

In addition to the challenges mentioned here, there is a long list of unresolved issues related to the subject. Notably, concerns related to privacy and ethics are absent from the discussion; nevertheless, they often inhibit access to data that could otherwise be made accessible for academic research with appropriate precautions. Similarly, methodological issues are another important subject that is only partially addressed. Several of the issues discussed above, especially those related to understanding the behaviour of users responsible for generating digital trace data, might be resolved by employing research designs based on multiple data sources and types of data. In particular, the combination of quantitative (digital trace) data and qualitative data (e.g., interviews or observations) offers promising avenues for future research (Grigoropoulou & Small, 2022).

5 CONCLUSION

In summary, novel data on human behaviour are a seemingly abundant resource of the twenty-first century. This remarkable quantity of data, however, often comes with equally remarkable challenges with respect to their quality. Further, for academic research based on such data, and digital trace data in particular, barriers to controlled access rank high among the most demanding challenges. Resolving these challenges requires coordinated and continuous effort; until then, they can often be circumvented with careful consideration and innovative research designs. Further, it is paramount to keep in mind that data-generating processes on digital platforms are often opaque and out of researchers' control; understanding such processes requires the study of how users behave on such platforms, the context in which they do so, and how the design of the technologies of those platforms shapes or affects their behaviour. Based on this foundation, engagement in theorising is another important key to unlocking the potential of digital trace data. This endeavour requires academia to address fundamental, interdisciplinary challenges, particularly at the intersection between the social sciences and computer science. The emergence and popularity of fields dedicated to research at this intersection, first and foremost computational social science, signifies the relevance and potential of research in this

direction. At the same time, the sobering conclusion that we seem far from understanding the social processes that underlie the phenomena studied in this field should make us revisit and rethink the foundations of this research. This may require tremendous effort, but overcoming the challenges we face in working with digital trace data is worth it. After all, no one ever said a measurement revolution would be easy.

REFERENCES

Arias, M., Arratia, A., & Xuriguera, R. (2014). Forecasting with Twitter data. ACM Transactions on Intelligent Systems and Technology, 5(1), 1–24.
Bastos, M. T., Mercea, D., & Charpentier, A. (2015). Tents, tweets, and events: The interplay between ongoing protests and social media. Journal of Communication, 65(2), 320–350.
Baur, N. (2011). Mixing process-generated data in market sociology. Quality & Quantity, 45(6), 1233–1251.
Bennett, W. L., & Segerberg, A. (2012). The logic of connective action: Digital media and the personalization of contentious politics. Information, Communication & Society, 15(5), 739–768.
Bonney, R., Shirk, J. L., Phillips, T. B., Wiggins, A., Ballard, H. L., Miller-Rushing, A. J., & Parrish, J. K. (2014). Next steps for citizen science. Science, 343(6178), 1436–1437.
Chadwick, A. (2017). The hybrid media system: Politics and power. Oxford University Press.
Christner, C., Urman, A., Adam, S., & Maier, M. (2021). Automated tracking approaches for studying online media use: A critical review and recommendations. Communication Methods and Measures, 1–17.
Eismann, K., Posegga, O., & Fischbach, K. (2016). Collective behaviour, social media, and disasters: A systematic literature review. ECIS.
Eismann, K., Posegga, O., & Fischbach, K. (2018). Decision making in emergency management: The role of social media. ECIS.
Eismann, K., Posegga, O., & Fischbach, K. (2021). Opening organizational learning in crisis management: On the affordances of social media. Journal of Strategic Information Systems, 30(4), 101692.
Flaxman, S., Goel, S., & Rao, J. M. (2016). Filter bubbles, echo chambers, and online news consumption. Public Opinion Quarterly, 80(S1), 298–320.
Gayo-Avello, D. (2013). A meta-analysis of state-of-the-art electoral prediction from Twitter data. Social Science Computer Review, 31(6), 649–679.
Gayo-Avello, D., Metaxas, P., & Mustafaraj, E. (2011). Limits of electoral predictions using Twitter. Proceedings of the International AAAI Conference on Web and Social Media.
Grigoropoulou, N., & Small, M. L. (2022). The data revolution in social science needs qualitative research. Nature Human Behaviour, 1–3.
Hegelich, S. (2020). Facebook needs to share more with researchers. Nature, 579(7800), 473–474.
Hofman, J. M., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., Margetts, H., Mullainathan, S., Salganik, M. J., & Vazire, S. (2021). Integrating explanation and prediction in computational social science. Nature, 595(7866), 181–188.
Howison, J., Wiggins, A., & Crowston, K. (2011). Validity issues in the use of social network analysis with digital trace data. Journal of the Association for Information Systems, 12(12), 2.
Hsu, C.-H., Posegga, O., Fischbach, K., & Engelhardt, H. (2021). Examining the trade-offs between human fertility and longevity over three centuries using crowdsourced genealogy data. PloS One, 16(8), e0255528.
Johnson, B., & Turner, L. A. (2003). Data collection strategies in mixed methods research. In A. Tashakkori & C. Teddlie (Eds), The SAGE handbook of mixed methods in social and behavioral research (pp. 297–319). SAGE.
Jungherr, A. (2018). Normalizing digital trace data. In N. J. Stroud & S. McGregor (Eds), Digital discussions: How big data informs political communication (pp. 9–35). Routledge.
Jungherr, A., Jürgens, P., & Schoen, H. (2012). Why the pirate party won the German election of 2009 or the trouble with predictions: A response to Tumasjan, A., Sprenger, T. O., Sander, P. G., & Welpe, I. M. 'Predicting elections with Twitter: What 140 characters reveal about political sentiment'. Social Science Computer Review, 30(2), 229–234.
Jungherr, A., Schoen, H., Posegga, O., & Jürgens, P. (2017). Digital trace data in the study of public opinion: An indicator of attention toward politics rather than political support. Social Science Computer Review, 35(3), 336–356.
Jungherr, A., Posegga, O., & An, J. (2019). Discursive power in contemporary media systems: A comparative framework. International Journal of Press/Politics, 24(4), 404–425.
Kaplanis, J., Gordon, A., Shor, T., Weissbrod, O., Geiger, D., Wahl, M., Gershovits, M., Markus, B., Sheikh, M., & Gymrek, M. (2018). Quantitative analysis of population-scale family trees with millions of relatives. Science, 360(6385), 171–175.
Kleinberg, J. (2008). The convergence of social and technological networks. Communications of the ACM, 51(11), 66–72.
Lazer, D., & Radford, J. (2017). Data ex machina: Introduction to big data. Annual Review of Sociology, 43, 19–39.
Lazer, D., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., Gonzalez-Bailon, S., King, G., & Margetts, H. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507), 1060–1062.
Lazer, D., Hargittai, E., Freelon, D., Gonzalez-Bailon, S., Munger, K., Ognyanova, K., & Radford, J. (2021). Meaningful measures of human society in the twenty-first century. Nature, 595(7866), 189–196.
Leonardi, P. M. (2012). Materiality, sociomateriality, and socio-technical systems: What do these terms mean? How are they related? Do we need them? In P. M. Leonardi, B. A. Nardi, & J. Kallinikos (Eds), Materiality and organizing: Social interaction in a technological world (pp. 25–48). Oxford University Press.
Miller, G. (2012). The smartphone psychology manifesto. Perspectives on Psychological Science, 7(3), 221–237.
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. (2013). Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's firehose. Proceedings of the International AAAI Conference on Web and Social Media.
Mukerjee, S., Majó-Vázquez, S., & González-Bailón, S. (2018). Networks of audience overlap in the consumption of digital news. Journal of Communication, 68(1), 26–50.
Oliveira, N., Cortez, P., & Areal, N. (2017). The impact of microblogging data for stock market prediction: Using Twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert Systems with Applications, 73, 125–144.
Pagolu, V. S., Reddy, K. N., Panda, G., & Majhi, B. (2016). Sentiment analysis of Twitter data for predicting stock market movements. 2016 International Conference on Signal Processing, Communication, Power and Embedded System.
Panagiotopoulos, P., Bigdeli, A. Z., & Sams, S. (2014). Citizen–government collaboration on social media: The case of Twitter in the 2011 riots in England. Government Information Quarterly, 31(3), 349–357.
Paolacci, G., Chandler, J., & Ipeirotis, P. G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411–419.
Pentland, A. S. (2007). Automatic mapping and modeling of human networks. Physica A: Statistical Mechanics and Its Applications, 378(1), 59–67.
Reuter, C., Hughes, A. L., & Kaufhold, M.-A. (2018). Social media in crisis management: An evaluation and analysis of crisis informatics research. International Journal of Human–Computer Interaction, 34(4), 280–294.
Ruths, D., & Pfeffer, J. (2014). Social media for large studies of behavior. Science, 346(6213), 1063–1064.
Schoen, H., Gayo-Avello, D., Metaxas, P. T., Mustafaraj, E., Strohmaier, M., & Gloor, P. (2013). The power of prediction with social media. Internet Research, 23(5), 528–543.
Segerberg, A., & Bennett, W. L. (2011). Social media and the organization of collective action: Using Twitter to explore the ecologies of two climate change protests. The Communication Review, 14(3), 197–215.
Silvertown, J. (2009). A new dawn for citizen science. Trends in Ecology and Evolution, 24(9), 467–471.
Sorokin, A., & Forsyth, D. (2008). Utility data annotation with Amazon Mechanical Turk. 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.
Stopczynski, A., Sekara, V., Sapiezynski, P., Cuttone, A., Madsen, M. M., Larsen, J. E., & Lehmann, S. (2014). Measuring large-scale social networks with high resolution. PloS One, 9(4), e95978.
Sullivan, D. P., Winsnes, C. F., Åkesson, L., Hjelmare, M., Wiking, M., Schutten, R., Campbell, L., Leifsson, H., Rhodes, S., & Nordgren, A. (2018). Deep learning is combined with massive-scale citizen science to improve large-scale image classification. Nature Biotechnology, 36(9), 820–828.
Travers, J., & Milgram, S. (1977). An experimental study of the small world problem. In M. Newman, A.-L. Barabási, & D. J. Watts (Eds), The structure and dynamics of networks (pp. 179–197). Elsevier.
Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2011). Election forecasts with Twitter: How 140 characters reflect the political landscape. Social Science Computer Review, 29(4), 402–418.
Ugander, J., Karrer, B., Backstrom, L., & Marlow, C. (2011). The anatomy of the Facebook social graph. https://doi.org/10.48550/arXiv.1111.4503
Wallach, H. (2018). Computational social science ≠ computer science + social data. Communications of the ACM, 61(3), 42–44.
Watts, D. J. (2007). A twenty-first century science. Nature, 445(7127), 489.
Wolfsfeld, G., Segev, E., & Sheafer, T. (2013). Social media and the Arab Spring: Politics comes first. International Journal of Press/Politics, 18(2), 115–137.

7. Regression and machine learning
Lukas Erhard and Raphael Heiberger

1 INTRODUCTION

Research in the social sciences has been shifting towards a new era. In the last century, the focus lay mainly on variable-based, hypothesis-driven approaches to give answers to societal questions. Data for such approaches were scarce, hard to come by, and expensive (Grimmer, Roberts, & Stewart, 2021). The internet changed that. Digital data are comprehensive, ubiquitous, and, in general, cheap to retrieve. While the availability of such data opens up new roads and possibilities, it also creates the necessity for adequate forms of analysis (Lazer et al., 2020; McFarland, Lewis, & Goldberg, 2016). Along with those new data sets, data sources, and an explosion in available computing power, new techniques for analysing social phenomena developed rapidly, often by integrating knowledge from other disciplines into an emerging field of computational social science (Heiberger & Riebling, 2016; see also Kashyap et al., Chapter 3; and Schwemmer et al., Chapter 15 in this volume). Social scientists increasingly use tools from computational social science (Lazer et al., 2020) and, in particular, machine learning (ML). Methods subsumed under ML can be described as 'a class of flexible algorithmic and statistical techniques for prediction and dimension reduction' (Grimmer et al., 2021, p. 396). Applications of ML comprise some of the most important technological innovations in recent years, for instance, gene prediction and search engines (Jordan & Mitchell, 2015). No problem seems too complex as long as researchers have enough (i.e., very large) data. Even previously unsolvable questions might be solved, e.g., how to maintain high-temperature plasma for nuclear fusion (Degrave et al., 2022). Thus, we ask: given the already impressive track record and even greater potential of ML, is it only a matter of time until it replaces traditional statistics (TS) as used in social science? As we will see, differences between ML and TS are (mostly) grounded in different epistemological perspectives. While recent overviews characterize ML-related methods and provide outlines for future research (Grimmer et al., 2021; Molina & Garip, 2019), our chapter's goal is to point out key differences and commonalities between TS and ML. We will illustrate what a typical social scientist's approach might look like and how using ML techniques could potentially contribute additional insights. For that purpose, we will first elaborate some general differences and similarities between TS and ML. We will then exemplify those differences using a well-known data set, the European Social Survey (ESS). In particular, we will focus on two main parts of any regression analysis: estimators and goodness of fit. By comparing logistic regressions and two popular ML algorithms ('random forest' (RF) and 'ridge regression'), we will explain how ML works and, more importantly, how it is typically used by researchers outside the social sciences.1 In so doing, we will reveal how epistemological differences shape the potential usage of ML in the social sciences and discuss the methodological trade-off when it comes to the question of whether (and how) to apply ML or TS.



2 TRADITIONAL STATISTICS AND DIGITAL DATA

Statistics became widely employed in the landscape of all social sciences from the 1950s onward. Platt's (1998) historical account of sociological methods argues that studies from 1920 to 1940 lacked external funding and were mainly qualitative by design (interviews, ethnography). In the following decades, this changed as surveys and statistics became the main instruments for social scientists (Converse, 2009). The quantification and numeric measurement of social phenomena through surveys, the subsequent approximation of complex concepts in variables, together with increasing computing power can be seen as a 'watershed moment' of empirical social research. The focus of research activities shifted during this process from ethnographic community studies to an individualistic perspective which was inherent to survey questions (Porter & Ross, 2003). Reflecting changes in methods, new theoretical programs came into fashion. For instance, rational choice theory with its methodological individualism was well suited to explain variable-centred research questions and became prevalent for many sociologists (Coleman, 1990). The use of surveys, the resulting availability of a new kind of data, and easier access and higher acceptance of causal models occurred in other social sciences as well. In psychology and economics, for instance, the general methodological subfields of psychometrics and econometrics respectively emerged (Raftery, 2001). In the wake of variable-driven data retrieval came the advent of regression models, factor analysis, and other statistical techniques in sociology, although those methods were considered 'esoteric expertise' (Abbott, 1988a, p. 147) until the late 1960s. Thus, the shift to quantitative research was fuelled by fundamental changes in research content (surveys), techniques (regressions), and technology (computers).

Social scientists face a similar situation today. The flux of research fashions and increasing diversity of the field has picked up steam as a new era of big data and super-computing has arrived (McFarland et al., 2016). For more than a decade, the internet has changed the lives of most people in industrialized countries, providing the means to use all sorts of digital traces (e.g., connections, profiles, and communications) as data on social behaviours and systems. Compared to computer scientists and engineers who have embraced the analysis of large social data and actively developed research programs to 'conquer' this domain (Conte et al., 2012), social scientists have been rather slow to utilize the possibilities of new data and computational methods (Heiberger & Riebling, 2016). Part of the explanation may be sought in the underlying research program of TS that dominates quantitative research in the social sciences. It builds on inferential statistics and Karl Popper's formalizations (Popper, 1968), i.e., testing falsifiable hypotheses derived from (general) theories. This foundation of testing theory-based assumptions describes a way of generating knowledge about a small part of a population and generalizing it to its entirety (Krzywinski & Altman, 2013). One main pillar of this approach is to calculate the probability of this generalization being wrong (like the probability of a false positive or alpha error as represented by the p-value). Hence, social scientists rely heavily on thresholds upon which we accept or reject hypotheses, most often considered to be at the conventional but arbitrary significance level of 5 per cent.
While this frequentist approach is well suited for carefully conducted surveys, it bears some significant shortcomings for digital data. One severe methodological challenge is tied to sample characteristics. Most of the frequentist approaches are driven by or depend on sample size (e.g., all chi-squared-based test statistics). Testing for significance with very large

sample sizes will almost always generate statistically significant results, rendering those tests meaningless. This problem is amplified by the source of many popular digital data, e.g., social media platforms like Twitter. Inferences and significance levels are then, at best, possible for the underlying population, hence, the platform's users (Lewis, 2015). Most often, however, we do not know much about users' characteristics or how they relate to (parts of) the general population (Hargittai, 2020). In addition, the efficient estimation and interpretation of regression coefficients rely on independent and identically distributed observations. This is often not the case for samples from digital data, e.g., when it comes to all data gathered on social networks (of which dependency is an inherent property).

Another important challenge for TS analysing digital data relates to the ubiquity of linear models (Abbott, 1988b). Even if the predictor is non-linearly related to the outcome as in, for instance, logistic regressions ($f(y) = b_0 + b_1 x_1 + b_2 x_2$), almost all TS models assume 'linearity-in-the-predictor'. The advantage of linear predictors is that they facilitate interpretation; yet, they also obscure potential non-linear relationships. Social scientists only rarely formulate models with complex, non-linear predictors (e.g., $f(y) = \frac{b_0}{b_1 x_1 + b_2 x_2}$), a circumstance that is astonishing, given the complexity of social phenomena. Some argue that this is because of the omission of counterfactual language (Pearl & MacKenzie, 2018, pp. 329–335), while others may point to the problem of overfitting and interpretation when introducing polynomial terms in linear regressions (Molina & Garip, 2019). As a consequence, theories and results in social sciences almost never consider non-linear relations in the above sense. One of the great promises of ML is to offer ways to measure such non-linear effects. Modelling non-linearity is actually inherent to many ML techniques, making it feasible to consider complex relationships between variables without prior knowledge of the shape of these relations. We will exemplify this by presenting the properties of RFs, a popular instance of ML tools.
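To illustrate the point with a sketch of our own (simulated data, not the chapter's analysis): the outcome below depends on a single predictor in a U-shaped way, which a logistic regression with a linear predictor cannot represent, whereas a random forest can.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(5000, 1))
# P(y = 1) is high at both extremes of x and low in the middle (U-shaped).
p = 1 / (1 + np.exp(-(x[:, 0] ** 2 - 3)))
y = rng.binomial(1, p)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=1)

logit = LogisticRegression().fit(x_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(
    x_train, y_train)

# The linear model stays near the majority-class baseline; the forest,
# which partitions x into regions, recovers the U-shaped relation.
print('logistic regression:', logit.score(x_test, y_test))
print('random forest:      ', forest.score(x_test, y_test))
```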

3 MACHINE LEARNING IN THE SOCIAL SCIENCES

ML subsumes statistical methods in which computers learn from data and extract information. While ML represents a breakthrough in computer sciences, its adoption in the social sciences is less enthusiastic (Molina & Garip, 2019). In contrast to the usage of inferential statistics, the application of ML does not (yet) rest on established methodological principles and assumptions as in TS. In general, we can distinguish two paradigms when it comes to ML: supervised machine learning (SML) and unsupervised machine learning (UML) (e.g., Jordan & Mitchell, 2015). SML uses labelled data. We speak of labelled data if the values of the dependent variable (DV) are known. That is the case for all typical regression analyses in the social sciences: trying to relate predictors to the values of a DV. The difference with SML is that the statistical models are then fit to predict previously unseen data of which we know the predictors but not the DV. These techniques are, in computer sciences, grouped into 'regressions' (for a continuous DV) and 'classifications' (for a categorical DV). In contrast, UML uses unlabelled data, that is, data where the 'correct' answer cannot be learned from known observations. It instead derives patterns from (unlabelled) observations by exploiting the statistical properties of the data. Common groups of UML techniques are

dimension reduction (e.g., principal component analysis) or clustering (e.g., k-means clustering). UML, in essence, aims to create categorization schemes or typologies. In social sciences, most people would refer to this as 'exploration'. Researchers thus inductively define types along derived dimensions and represent each case relative to the types given its underlying values. Resulting (ideal) types are arguably among the most important methodological tools of social scientists and have been used for a long time. UML's main purpose is therefore to explore data and reduce their complexity. Researchers might then use the output of UML as input for further analysis (Heiberger, Munoz-Najar Galvez, & McFarland, 2021) or to develop theoretical models (e.g., Hall & Soskice, 2001). Those exploratory techniques are by no means new to social sciences. However, UML provides novel ways to analyse large amounts of text and social networks, both kinds of data often associated with the digital age and computational social science (Heiberger & Riebling, 2016). In particular, the 'automatic' categorization of large corpora has found many applications to social phenomena (Evans & Aceves, 2016). Topic models, for instance, represent one of the most used natural language processing tools in the social sciences (McFarland et al., 2013). In addition, the detection of communities in networks resembles prominent ideas of social groups (Fortunato, 2010).
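Purely as an illustration (not from the chapter), the sketch below shows UML as typology-building: k-means groups observations, measured on hypothetical features, into clusters that a researcher would then interpret as types.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical respondent features, e.g., daily hours of ICT use and age.
X = np.column_stack([rng.normal(3, 1.5, 300), rng.normal(40, 12, 300)])

X_std = StandardScaler().fit_transform(X)  # put features on a common scale
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# Cluster assignments and centres are the raw material the researcher
# interprets as (ideal) types.
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)
```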
Engineers at almost all large companies focus on improving the prediction of customer behaviour. Of course, firms with digital business models profit most from SML's predictive power. Consider, for instance, a company that sells wine online. Given a certain record of orders and a large enough customer base, the company can use the characteristics in their data (e.g., price, grape, origin, quality, vineyard) to make recommendations for further purchases. Those suggestions might also be based on information about what similar customers have bought. In any case, the wine shop will mostly be interested in developing models that 'work'.

That is, engineers or consultants will typically approach this task without any concern for theory or for explaining why somebody bought something, in the sense that, say, previous choices of grapes are twice as important for current choices as the vineyard. In other words, the strength and direction of coefficients are of no importance; what matters is whether the model makes suitable recommendations (i.e., predictions) that improve sales.

Given the priority of prediction, many restrictions of TS are no longer concerns of ML researchers. If interpretation (or explanation) is not the aim, the use of powerful 'black boxes' like multilayer neural networks or higher-order interaction effects becomes an attractive option. Almost all deep-learning efforts rest on such black boxes, which are only of limited use to social scientists. The same is true for using thousands of (potentially collinear) variables in a model. This practical approach also allows the regularization of variance2 (cf. Section 4.3.1) and the empirical tuning of parameters (Mullainathan & Spiess, 2017). Instead of being prone to overfitting (like TS models), SML uses the training data to tune regularizers to the data at hand (their number and effect differing by algorithm). We will illustrate such a typical SML workflow with two examples in the remainder of this contribution. We focus on using ML for 'classifications' instead of 'regressions' in order to keep the analysis as comprehensible as possible.3 In particular, confusion matrices (see below) for binary outcomes have the advantage of yielding relatively simple indicators of model performance. These are among the most common concepts in ML and provide a useful case for the purposes of this contribution. All derived conclusions, however, apply to 'regressions' (in the ML sense) too.

4 A COMPARISON BY EXAMPLE: IMMIGRATION IN EUROPE

It might come as a surprise to some readers, but the algorithms used in ML are often quite similar to – if not the same as – those used in traditional quantitative social science. Logistic regression, ordinary least squares regression, and principal component analysis, for example, are readily used in both camps. Even though many of the algorithms are the same, the epistemological and practical approaches, as well as the evaluation strategies, differ. First, and in stark contrast to TS models, ML models usually do not have any usable, that is interpretable, coefficients. Second, model evaluation works by assessing predictive power. To illustrate the epistemological differences, we will use an exemplary model and approach the problem from both angles. The chosen model is loosely based on the approach by Davidov and Meuleman (2012) and has been used by others in a similar way (e.g., Sagiv & Schwartz, 1995). The model investigates the well-known effect of human values on attitudes towards immigration. Our data come from the first round of the ESS (2002). The explanandum, our DV, which we will call reject, is a measure that represents attitudes towards immigration. It is constructed as a mean index of three variables4 which have been shown to load strongly on a single dimension in a confirmatory factor analysis (Davidov & Meuleman, 2012, p. 764). To show the ML workflow in its simplest form, binary classification, the resulting variable is dichotomized at the scale's mean value so that 1 indicates negative attitudes towards immigration (simply put: 'rejecting immigration').

To measure human values, we use the theory of basic human values (Schwartz, 1992), which is captured in the ESS surveys. The theory describes 10 basic values that are structured in two dimensions:5 conservation and self-transcendence. Some theoretically founded expectations of this model can be formulated. Conservation values 'include appreciation of stability of society, and respect, commitment and acceptance of the customs and ideas that traditional culture or religion provide. In other words, the arrival of immigrants is coupled with potential societal changes that are opposite to the preferences of conservative individuals' (Davidov & Meuleman, 2012, p. 761). Therefore, more conservative individuals are expected to reject immigration. In contrast, 'self-transcendence values include understanding, appreciation, tolerance and protection for the welfare of people and for nature. The arrival of immigrants provides opportunities for individuals to realise these self-transcendent values. In other words, the arrival of immigrants is coupled with potential societal changes that are in harmony with the preferences of self-transcendent individuals' (Davidov & Meuleman, 2012, p. 761). Hence, we expect more self-transcendent people to support immigration. We control for income, measured as respondents' feeling about their household's present income (four-point scale from 1 'living comfortably' to 4 'very difficult'). Subjective income is chosen over objective household income for the pragmatic reason of reducing the number of missing values. Additionally, we control for age (in years), religiosity (11-point scale from 0 'not at all religious' to 10 'very religious'), years of full-time education, self-position on a left–right scale (11-point scale from 0 'left' to 10 'right'), gender (dummy variable coded 0 for male and 1 for female), and country via dummy variables. See Table 7.1 for descriptive statistics on all variables.

Table 7.1  Descriptive statistics (European Social Survey data)

Variable        Mean    SD      Min     Max
Reject          0.45    0.50    0       1
Female          1.51    0.50    1       2
Age             47.46   16.79   19      102
Religiosity     4.94    2.91    0       10
Education       12.24   3.97    0       29
Left–right      5.10    2.15    0       10
Income          3.08    0.82    1       4
Conservation    0.13    0.63    −2.82   2.36
Transcendence   0.66    0.53    −2.13   3.05

Note: N = 27,164 respondents. Data from ESS Round 1 (2002).
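For readers who want to retrace the data preparation, the sketch below shows how the binary outcome could be constructed in R from the three ESS items listed in note 4. It is a minimal illustration under our own assumptions (a data frame `ess` holding the items on their original four-point coding), not the chapter's published replication code (see note 1 for that).

```r
# Minimal sketch of building the DV 'reject'; assumes a data frame `ess`
# with the ESS items imdfetn, imsmetn, impcntr coded 1 ('allow many') to 4.
items <- c("imdfetn", "imsmetn", "impcntr")
ess[items] <- lapply(ess[items], function(x) 5 - x)   # recode: higher = more accepting

accept <- rowMeans(ess[items])                        # mean index across the three items
ess$reject <- as.integer(accept < mean(accept, na.rm = TRUE))  # 1 = 'rejecting immigration'
```

The dichotomization at the scale's mean follows one plausible reading of the description above; details such as missing-value handling are omitted here.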

4.1 Analysis Using Traditional Statistics

Table 7.2 reports the results of the logistic regression. We see highly significant effects for all variables except gender: age, self-placement on the left–right scale, and conservation are positively correlated with the rejection of immigration, whereas education, satisfaction with income, religiosity, and self-transcendence are negatively correlated with it. Except for the non-significance of gender, our results are similar to those of Davidov and Meuleman (2012). That said, we use a simplified model and would not advise drawing any substantive conclusions from it. We use this model only for didactic purposes, to compare methods of TS with ML.

Table 7.2  Logistic regression of dichotomized attitudes towards immigration (log-odds coefficients)

                     Coefficient   Standard error   P-value
Intercept            2.304         0.112            < .001
Age                  0.006         0.001            < .001
Education            −0.099        0.004            < .001
Female               0.009         0.028            .744
Income               −0.128        0.019            < .001
Left–right           0.065         0.007            < .001
Religiosity          −0.024        0.005            < .001
Conservation         0.434         0.027            < .001
Self-transcendence   −0.693        0.029            < .001

Note: Dependent variable is reject (1 = rejecting immigration, 0 = not rejecting immigration). Coefficients are log-odds (logit). Model controls for country differences via dummy variables (not shown). N = 27,164 respondents. Data from ESS Round 1 (2002).
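A regression of this kind is straightforward to estimate in R. The sketch below is illustrative rather than the authors' exact code (note 1 links to their repository); the variable names are our own shorthand, and `country` is assumed to be a factor, so that `glm()` creates the country dummies automatically.

```r
# Hedged sketch of the model behind Table 7.2; variable names are illustrative.
m_logit <- glm(reject ~ age + education + female + income + leftright +
                 religiosity + conservation + transcendence + country,
               data = ess, family = binomial(link = "logit"))
summary(m_logit)  # log-odds coefficients, standard errors, p-values
```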

4.2 Evaluating Model Performance

Social scientists use a plethora of (pseudo-)R2 measures to assess the goodness of fit of a model, or the Akaike/Bayesian information criterion to compare and select among models at their disposal. The basic idea behind these procedures is always the same: utilize all available data to optimize the model and report how well the model fits the underlying data. Herein lies probably the biggest difference from machine learners' aims. While ML needs to predict, social scientists tend to overfit their models. We speak of overfitting when a model fits the underlying data very well but fails to predict unseen examples. Such a model has learned random noise instead of the true correlations in the data set at hand. As a consequence, overfitted models can only rarely be replicated in other samples (i.e., they are bad at predicting unseen data), because they fit the data 'too well' and are prone to 'p-hacking'.6 That creates considerable uncertainty about such models' merits and casts considerable doubt on whether those explanations should even be considered scientific (Watts, 2014).

To overcome the obstacle of overfitting, it has become customary to split the data set into multiple parts and to use only part of the data to train the model, with the rest used to evaluate predictive performance. We call this the train-test split, with the training set being the former and the test set being the latter subset of our data. This procedure allows us to evaluate whether our models only learn the noise (unexplained variance) in our data or whether they are able to predict data they have not seen before. A drawback of this procedure, of course, is that we are not able to evaluate the model on the whole data set, so there is a trade-off to be made when deciding how large the test set should be. Hence the need for 'big data'. Larger test sets usually mean higher confidence in the validation results, but they come at the price of less data being available for the training step. In the case at hand, 20 per cent of the data (5432 of 27,164 cases) is separated into the test set. These data will only be used for model inspection and evaluation. The remaining 80 per cent will be used as the training set. Classification models are evaluated using a so-called confusion matrix. A confusion matrix compares the predicted values from a model to the actual (true) values from our test data. Table 7.3 depicts its formal representation for binary classifications.

Table 7.3  Formal representation of a confusion matrix for binary classification

                    Predicted value
True value          = 0                   = 1
= 0                 TN (true negative)    FP (false positive)
= 1                 FN (false negative)   TP (true positive)

True negatives (TNs) represent the number of correct predictions for the 'negative class' (i.e., the absence of the outcome, most often coded 0; in our case, somebody who does not hold negative attitudes towards immigration). True positives (TPs) represent the number of correct predictions for the positive class (in our case, somebody who does hold negative attitudes towards immigration). False negatives (FNs) give the number of incorrect negative predictions, whereas false positives (FPs) give the number of incorrect positive predictions. Many important goodness-of-fit metrics can be derived from this matrix. We will illustrate two of the most common measures: accuracy and the F1-score. Both metrics range from 0 to 1, with higher values indicating better models. Accuracy denotes the ratio of correctly classified data instances to the total number of data instances:

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{7.1} \]

Accuracy is a valuable measure if the data contain roughly equal numbers of positive and negative cases, but it is very susceptible to class imbalance (meaning that one category of our categorical variable – in ML terms usually referred to as a 'class' – has a higher prevalence in the actual data than the other). Take, for example, a data set containing a rare outcome, with 90 actual negatives and 10 actual positives: a model that always predicts 0 would achieve an accuracy of .9, as it predicts 90 per cent of all cases correctly, even though it is most certainly not a very good model. Hence, there is a need for metrics that take imbalanced classes into account. Most commonly, the literature refers to two distinct measures, the first being precision:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{7.2} \]

Precision (also called positive predictive value) describes the proportion of correct positive predictions among all positive predictions. Hence, it answers the question: 'How many of my positive predictions are actually correct?' This metric is commonly supplemented by another, recall, which is defined as:

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{7.3} \]

Recall (also known as sensitivity) describes the fraction of correct positive predictions among all cases for which a positive prediction would have been correct. It thereby answers the question: 'How many of the actually positive cases did my model miss?' Both measures are usually merged into one metric, the F1-score. This is defined as their harmonic mean and therefore penalizes extremely low values in either:

\[ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7.4} \]
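Computed by hand, these four metrics require nothing more than the cells of the confusion matrix. The following helper is a minimal sketch of equations (7.1)–(7.4) in R; it is our own illustration, not part of the chapter's replication code.

```r
# Compute accuracy, precision, recall, and F1 from confusion matrix cells.
eval_metrics <- function(tp, fp, fn, tn) {
  accuracy  <- (tp + tn) / (tp + fp + fn + tn)                # equation (7.1)
  precision <- tp / (tp + fp)                                 # equation (7.2)
  recall    <- tp / (tp + fn)                                 # equation (7.3)
  f1        <- 2 * precision * recall / (precision + recall)  # equation (7.4)
  c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}
```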

To be comparable to the models presented below, we reran the logistic regression from above on the training data set (80 per cent of the initial data) and evaluated it on the test set (the remaining 20 per cent). The results are shown in the confusion matrix in Table 7.4.

Table 7.4  Confusion matrix for logistic regression

                    Predicted value
True value          = 0      = 1      Sum
= 0                 2299     1022     3321
= 1                 670      1441     2111
Sum                 2969     2463     5432

From these data, we can compute an accuracy of .6885, meaning that we were able to predict 68.85 per cent of all cases in the test set correctly. Additionally, we can calculate a recall of .7743. This indicates that we were able to identify 77.43 per cent of all cases in our test set that have negative attitudes towards immigrants (that is, a value of 1 in our DV). Furthermore, the precision of .6923 tells us the proportion of correct positive predictions: 69.23 per cent of all positive predictions are correct. The harmonic mean of precision and recall, the F1-score, is .731. This has no special interpretation of its own but is often used for comprehensive model comparison, as it considers both precision and recall. Typically, these numbers are compared across different models (or model setups) to find the most predictive model. We exemplify this below with our example.
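In R, such a train-test evaluation can be sketched as follows. This is our own illustration of the workflow described above; the 80/20 random split and the 0.5 probability threshold are assumptions, and the chapter's actual code is linked in note 1.

```r
# Hedged sketch of the train-test workflow: split, refit, predict, evaluate.
set.seed(42)                                   # make the random split reproducible
test_idx <- sample(nrow(ess), size = round(0.2 * nrow(ess)))
test  <- ess[test_idx, ]                       # 20% held out for evaluation
train <- ess[-test_idx, ]                      # 80% used for model fitting

m <- update(m_logit, data = train)             # refit the logistic regression
pred <- as.integer(predict(m, newdata = test, type = "response") > 0.5)

table(true = test$reject, predicted = pred)    # confusion matrix as in Table 7.4
```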

4.3 Penalized Regression

Common ML applications have to deal not only with many observations but frequently also with a large number of variables or, in the language of ML, 'features'. To imitate this, and to demonstrate a frequently used procedure, all two-way and three-way interactions between all variables are calculated and added to the data set. Interactions are often called 'polynomial features' in the ML realm. In our example, this 'automatic feature engineering' results in 870 features (i.e., variables). After removing all features without any variance (111), as they cannot possibly be useful in any model, the resulting data set contains 759 variables – far more than regular models in the social sciences. To handle a high number of variables, or high dimensionality, a common technique in ML is to use penalized regressions. These add a constraint parameter to the regression equation that penalizes the model for having too many parameters (Bruce & Bruce, 2017; James, Witten, Hastie, & Tibshirani, 2013). This is also known as shrinkage or regularization.
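The feature expansion itself can be sketched in a few lines of R using the formula interface; again, this is an illustrative assumption about the procedure, not the authors' exact implementation.

```r
# Generate all main effects plus two- and three-way interactions (.^3),
# then drop zero-variance columns that cannot help any model.
X <- model.matrix(reject ~ .^3, data = train)[, -1]  # drop the intercept column
X <- X[, apply(X, 2, var) > 0]                       # remove constant features
y <- train$reject
```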

4.3.1 Shrinkage methods

One method for penalized regression is the lasso (least absolute shrinkage and selection operator). The lasso shrinks all coefficients by some constant factor λ, truncating at zero. This is also known as L1 regularization and leads to a reduced number of coefficients to be considered in the regression. Another method is ridge regression. It utilizes L2 regularization and applies a proportional shrinkage to the coefficients, imposing a penalty on their size (Hastie, Tibshirani, & Friedman, 2009, pp. 61–73). The amount of shrinkage (λ) is a parameter of the model itself, which needs to be optimized during the model-fitting process (see Section 4.3.2 on hyperparameter tuning). As a rule of thumb, ridge regression tends to be used when multicollinearity is a problem in the data, whereas the lasso is used if the number of features is very large (Deshpande & Kumar, 2018, p. 73). In practice, however, it is often useful to try both variants and use the one that performs better along the metrics explained above (see Section 4.2). This was done in the present case, and ridge regression was chosen because it performed slightly better. There also exists a mixture of both variants, called elastic net regression, where the mix of the two shrinkage penalties is an additional parameter of the model to optimize. If this hyperparameter, usually called α, is 0, the elastic net reduces to a ridge regression; if α is 1, elastic net regressions are equivalent to lasso regressions. In all these methods, it becomes apparent that we modify the coefficients to enhance model performance, thus rendering them uninterpretable in the traditional sense.

4.3.2 Hyperparameter tuning and grid search cross-validation

An important aspect of most ML models is the tuning of so-called hyperparameters. These are parameters that influence the way the models are trained, and choosing them can have a great impact on model performance. Almost all ML models have such hyperparameters, which need to be optimized to obtain the best possible model. We present the general procedure using the ridge regression introduced above. Hyperparameter tuning describes the process of estimating models with multiple different combinations of the hyperparameters on the training data and choosing the combination with the most predictive power. This process usually includes two elements: cross-validation and grid search. K-fold cross-validation ensures that we never use observations from the held-out test data to tune model parameters, which would invite overfitting. It describes an approach in which the training data are split into K (often 5 or 10) equally sized parts, where K − 1 parts are used for training and the remaining part is used for validation. This is repeated K times until every part of the data has served as the validation set exactly once. Usually, the average of all K model validation metrics is taken; every part of the data is thereby used in both the training and the validation step, at the computational cost of having to train K models instead of one to validate their predictive power. Once we have defined on what data we optimize parameters, we need to set how to do so. A common approach to this end is called grid search cross-validation. The term to be clarified here is grid search: to do a grid search we define a grid of all possible parameter combinations of our hyperparameters and create cross-validated models for each combination.
Even with a limited number of hyperparameters, this becomes a very computationally costly procedure. Therefore, the parameter grid should be chosen with care. Ridge regression, as mentioned earlier, has only a single hyperparameter, λ. A grid search cross-validation over 50 possible values revealed 0.045 as the value of choice for the final model.
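In R, this tuning step could look roughly as follows with the glmnet package, which implements ridge, lasso, and elastic net regression. The grid values and fold number below are assumptions for illustration; `X` and `y` are the expanded feature matrix and outcome from the sketch above.

```r
# Hedged sketch of grid search cross-validation for the ridge penalty.
library(glmnet)

lambda_grid <- 10^seq(-4, 2, length.out = 50)  # 50 candidate penalty values
cv_ridge <- cv.glmnet(X, y, family = "binomial",
                      alpha = 0,               # alpha = 0: ridge; alpha = 1: lasso
                      lambda = lambda_grid, nfolds = 5)
cv_ridge$lambda.min                            # best cross-validated penalty
```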

Evaluated on our held-out test data, this model achieves an accuracy of 69.90 per cent, a recall of .766, a precision of .708, and an F1-score of .735. Compared to the simple logistic regression, this is an increase in accuracy of around 1 percentage point. In other words, we classify 54 more cases correctly when using the ridge regression model. The slightly lower recall tells us that the ridge regression misses slightly more positive cases; the higher precision shows that, of those cases classified as 1, more are actually correct.

4.4 Random Forest

We now turn to another important group of ML models: random forests (RFs). The RF is an algorithm from the family of ensemble methods, meaning it is a compound of multiple algorithms called decision trees. Ensemble methods exploit the concept of majority voting, where multiple simple models are trained to capture different aspects of the data and the prediction is the outcome most models agree upon (Bonaccorso, 2017, p. 154).

4.4.1 Decision trees

A decision tree is, as one would imagine, a tree of binary decisions on the data. The order of the decisions and the cut-off points according to which a decision is made are learned by minimizing an impurity measure like Gini7 or cross-entropy. The trees in a (random) forest are generally simpler than a single decision tree would be, in the sense that they have fewer nodes (one could imagine them more like tree stumps). Interestingly, this has been shown to improve predictive performance and, most importantly, makes the RF less susceptible to overfitting compared to a single decision tree (Lantz, 2015, p. 361). We use 'bagged' trees here, meaning that each tree is grown on a bootstrap sample of the data and that a random subset ('bag') of features is considered at every decision node. It is worth mentioning that there exists another group of ensemble trees called 'boosted trees'. A very popular member of this relatively new group of algorithms is AdaBoost (Freund & Schapire, 1997). Boosted trees have been shown to be very powerful because their trees are trained sequentially (instead of simultaneously), so trees later in the sequence can learn from the misclassifications of earlier ones. The RF implementation used here has two hyperparameters to tune:8 mtry, the number of features to be considered at each decision node (see also Lantz, 2015, pp. 369–375), and min.node.size, the minimum size of the terminal nodes in every tree. The latter implicitly defines the maximum depth of our trees. The grid of our search was carefully chosen to reflect the expected space of parameters.9 Applying this procedure, a model that considers five features at each node, with a minimum terminal node size of 4, is selected. This model shows an accuracy of 68.96 per cent, a precision of .761, a recall of .698, and a resulting F1-score of .728 on our test data. It therefore performs almost on par with the logistic regression in terms of predictive power in this case. Better tuning and more careful pre-processing could lead to slightly better performance here, but for the purposes of this chapter this model is sufficient. Because decision trees and RFs are non-parametric models, they can, in contrast to logistic regression, reveal non-linear relationships in the underlying data (Dangeti, 2017, p. 134).

4.4.2 Interpreting machine learning models

Usually, to investigate an RF model, researchers look first at the feature importances. These reflect the relative influence each feature has on the resulting predictions.

Figure 7.1 shows the relative feature importance of all predictors in our RF model (excluding the country dummy variables). These paint a similar picture to our logistic regression, with the two human value dimensions (self-transcendence and conservation) being the most important predictors, followed by age and education. Religiosity and self-placement on the left–right scale contribute roughly half as much to the prediction as the human value dimensions.

Figure 7.1  Feature importances of all non-country features in random forests
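For orientation, fitting such a forest and extracting the importances could look as follows with the ranger package in R, whose hyperparameter names (mtry, min.node.size) match those used above. The sketch reflects the tuned values reported in the text but is otherwise our own assumption about the setup; note 9 describes the actual tuning grid.

```r
# Hedged sketch of the tuned random forest; assumes `reject` is a factor.
library(ranger)

rf <- ranger(reject ~ ., data = train,
             num.trees = 500,          # default kept, as in note 8
             mtry = 5,                 # features tried per decision node
             min.node.size = 4,        # minimum terminal node size
             probability = TRUE,
             importance = "impurity")  # Gini-based feature importances

sort(rf$variable.importance, decreasing = TRUE)  # basis of Figure 7.1
```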

The feature importances are easy to compute but have two major drawbacks. First, because the measure is relative, one cannot draw any conclusion about the absolute contribution of each feature to the prediction. In other words, the effect strength that social scientists usually read off regression coefficients is not available. Second, it is not possible to determine the direction of the effects on the predicted outcome. The latter point in particular is a main reason for social scientists to run statistical analyses in the first place and is, in linear models, usually determined by the sign of the coefficients. RFs, however, are highly non-linear by design, meaning that each feature can have a different impact on the predicted outcome depending on the actual feature value of an observation. In TS, non-linear effects (often introduced by interaction terms in regression models) are commonly visualized using marginal effects. The same can be done for ML models. Traditionally, this was done using partial dependence plots (PDPs), which serve exactly this purpose but suffer from another drawback: PDPs tend to show nonsensical effects if the features are correlated. A more advanced technique is therefore the inspection of accumulated local effect (ALE) plots (Apley & Zhu, 2020), which serve the same purpose but, in contrast to PDPs, remain correct if the respective features are correlated. ALE plots are centred around 0 to depict the relative effect of a variable on the predicted outcome. This allows assessing whether the relationship between a selected feature and the outcome is linear, monotonic, or more complex. Figure 7.2 shows the ALEs for all numerical features in the RF model. The value depicted on the y-axis (the ALE) 'can be interpreted as the main effect of the feature at a certain value compared to the average prediction of the data' (Molnar, 2019, p. 131).

An ALE value of around .2 for five years of full-time education, for example, indicates that the predicted outcome is .2 higher than the average prediction (here, a 20 percentage point increase in the predicted probability of negative attitudes). The black vertical bars at the bottom of the panels for the numerical features indicate the number of cases on which the ALE calculations are based and are a measure of how confident one can be in these values. For example, the small number of cases with 'self-transcendence' values of less than −1 is an indication that the results in this feature range should be treated with extreme caution. Compared to the results of the logistic regression earlier (see Table 7.2), the linear trends of the RF model point in the same directions. Additionally, however, the RF reveals some non-linear relationships in our data. According to our model, self-transcendence and conservation only change the predicted outcome once their values become positive. In a way, it matters less how negative the values are than whether they are negative at all. A similar effect can be seen for self-placement on the left–right scale, which starts to increase at roughly the value of 4. The effect of education seems to be of a more dichotomous fashion, with low values indicating a stronger tendency towards a rejection of immigration and high values indicating the opposite. Again, the exact amount does not seem to drive the predictions; rather, whether someone has received much or little education is decisive. The effect of religiosity looks like a slightly skewed inverted U-shape rather than being linearly negative, as the logistic regression would suggest, with the highest predicted probabilities of rejecting immigrants at around 5 on the religiosity scale.

Figure 7.2  Accumulated local effect plots for features of interest in random forests

Note: The y-axes depict the accumulated local effect values.
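One accessible route to such plots in R is the iml package, which implements ALE as a method of its FeatureEffect class. The snippet below is a hedged sketch under that assumption, reusing the ranger model and training data from above; for a probability forest, one may additionally need to select the predicted class.

```r
# Hedged sketch: accumulated local effects for one feature of the RF model.
library(iml)

predictor <- Predictor$new(rf, data = train[, setdiff(names(train), "reject")],
                           y = train$reject)
ale_edu <- FeatureEffect$new(predictor, feature = "education", method = "ale")
plot(ale_edu)  # centred ALE curve, as in one panel of Figure 7.2
```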

5 DISCUSSION

This contribution has tried to illustrate similarities and differences between approaches using TS and ML. Social scientists face a trade-off when it comes to using ML.

On the downside, features (i.e., independent variables) can only be ranked by importance. This stands in contrast to the more fine-grained information provided by a typical regression coefficient, in particular its direction, but also the strength of effects. On the upside, focusing on models' predictive capabilities, as in any ML application, shifts the attention to explanations that are closer to scientific reasoning and less prone to mirror common sense (Watts, 2014). It is 'this potential predictive force which gives scientific explanations its importance' (Hempel & Oppenheim, 1948, p. 138), a force inherently neglected by goodness-of-fit measures relying solely on in-sample observations. Instead, out-of-sample testing is a crucial part of any SML procedure, i.e., applying the trained model to unseen test data. Integrating this inherent property of ML would at least reduce an important weakness of TS and could even boost, following Watts' line of argumentation, social scientists' reasoning more generally.

Those important benefits notwithstanding, we would like to emphasize that ML is no cheat code; it is all statistics. Everybody with a professional background in quantitative social science has learned many ML tools, be it UML like cluster analysis or SML like logistic regression. Yet how those methods are used (and promoted) in ML contexts can differ starkly from common social science approaches and researchers' training. Regardless of which features drive a model and how an outcome could be explained, the main interest of ML researchers in industry and science is that 'the model works', i.e., that the model provides good predictions. This is an important, though mainly epistemological, difference from statistical models used in social science.

One consequence for future social scientists might be, at the very least, to engage with the ideas of ML – a knowledge resource that is only slowly trickling into social science curricula. Knowing about the epistemological framework of ML might already provide important insights for social scientists. Going further, it might also be helpful for social scientists to acknowledge the differences laid out in this chapter and pay closer attention to the predictive power of TS models. In addition, non-linear effects could be explored by tools like RFs and, hence, also inform theory building (Grimmer et al., 2021). To be sure, we are not advocating for ML to replace TS, but rather suggest that social scientists familiarize themselves with these new methods and ways of thinking. This might provide a fruitful path to complement existing regression models. Hopefully, this contribution has shown interested readers what can and cannot be done with ML methods and how to apply those parts of the ML universe that are useful to us as social scientists when it comes to predictive power and non-linearity.

NOTES

1. All code and descriptions to use the ESS are available at https://github.com/luerhard/chapter_regression_and_artifical.
2. Regularization in ML helps researchers to deal with noise in training data, i.e., it tunes down correlations that are not representative of the data's true properties. More technically, this is done by using loss functions like the residual sum of squares during the fitting procedure. The coefficients are then chosen in such a way that the loss function is minimized and 'penalizes', for instance, high coefficients.
3. In the language of ML, this translates into using a categorical, often binary, dependent variable (i.e., solving a 'classification' problem), while employing 'regressions' refers to techniques that use continuous outcomes. This might be a bit confusing for social scientists at first.
4. The three variables ask, on a four-point scale, how many immigrants of different groups respondents would like to allow into their country: (1) many/few immigrants of a different race/ethnic group than the majority (imdfetn); (2) many/few immigrants of the same race/ethnic group as the majority (imsmetn); (3) many/few immigrants from poorer countries outside Europe (impcntr). All items are recoded so that higher levels indicate more acceptance of immigrants.
5. The measures are constructed according to the ESS website. All items were recoded so that higher levels indicate more agreement.
6. P-hacking is when a researcher looks at many relationships to find a statistically significant result (p < .05) and then only reports significant findings.
7. We will not go into detail about the differences between impurity measures here and use Gini throughout the rest of this chapter.
8. We keep the number of trees in our forest at the default of 500, as we do not expect to gain much predictive performance from this parameter in our case.
9. In our case, we chose the values 1 through 7 for the number of features per decision node and (1, 2, 3, 4, 10, 20, 50, 80, 100) for the minimum size of the terminal nodes in every tree. Combined with five-fold cross-validation, this results in estimating 7 × 9 × 5 = 315 models.

REFERENCES

Abbott, A. (1988a). The System of Professions. Chicago: University of Chicago Press.
Abbott, A. (1988b). Transcending general linear reality. Sociological Theory, 6(2), 169.
Apley, D. W., & Zhu, J. (2020). Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(4), 1059–1086.
Bonaccorso, G. (2017). Machine Learning Algorithms: Reference Guide for Popular Algorithms for Data Science and Machine Learning. Birmingham: Packt Publishing.
Bruce, P. C., & Bruce, A. (2017). Practical Statistics for Data Scientists: 50 Essential Concepts. Sebastopol, CA: O'Reilly.
Coleman, J. S. (1990). Foundations of Social Theory. Cambridge: Belknap of Harvard University Press.
Conte, R., Gilbert, N., Bonelli, G., Cioffi-Revilla, C., Deffuant, G., Kertesz, J., … Helbing, D. (2012). Manifesto of computational social science. European Physical Journal Special Topics, 214(1), 325–346.
Converse, J. M. (2009). Survey Research in the United States: Roots and Emergence 1890–1960. New Brunswick: Transaction Publishers.
Dangeti, P. (2017). Statistics for Machine Learning. Birmingham: Packt Publishing.
Davidov, E., & Meuleman, B. (2012). Explaining attitudes towards immigration policies in European countries: The role of human values. Journal of Ethnic and Migration Studies, 38(5), 757–775.
Degrave, J., Felici, F., Buchli, J., Neunert, M., Tracey, B., Carpanese, F., … Riedmiller, M. (2022). Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897), 414–419.
Deshpande, A., & Kumar, M. (2018). Artificial Intelligence for Big Data. Birmingham: Packt Publishing.
ESS Round 1. (2002). European Social Survey Round 1 Data (2002). Data file edition 6.6. NSD – Norwegian Centre for Research Data, Norway – Data Archive and Distributor of ESS Data for ESS ERIC. https://doi.org/10.21338/NSD-ESS1-2002.
Evans, J. A., & Aceves, P. (2016). Machine translation: Mining text for social theory. Annual Review of Sociology, 42(1), 21–50.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2021). Machine learning for social science: An agnostic approach. Annual Review of Political Science, 24(1), 395–419.
Hall, P. A., & Soskice, D. (2001). Varieties of Capitalism. Oxford: Oxford University Press.
Hargittai, E. (2020). Potential biases in big data: Omitted voices on social media. Social Science Computer Review, 38(1), 10–24.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning (Second edition). New York: Springer.
Heiberger, R. H., & Riebling, J. R. (2016). Installing computational social science: Facing the challenges of new information and communication technologies in social science. Methodological Innovations, 9, 1–11.
Heiberger, R. H., Munoz-Najar Galvez, S., & McFarland, D. A. (2021). Facets of specialization and its relation to career success: An analysis of US sociology, 1980 to 2015. American Sociological Review, 86(5).
Hempel, C. G., & Oppenheim, P. (1948). Studies in the logic of explanation. Philosophy of Science, 15(2), 135–175.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260.
Krzywinski, M., & Altman, N. (2013). Importance of being uncertain. Nature Methods, 10(9), 809–810.
Lantz, B. (2015). Machine Learning with R: Discover How to Build Machine Learning Algorithms, Prepare Data, and Dig Deep into Data Prediction Techniques with R (Second edition). Birmingham: Packt Publishing.
Lazer, D. M. J., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., … Wagner, C. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507), 1060–1062.
Lewis, K. (2015). Three fallacies of digital footprints. Big Data & Society, 2(2).
McFarland, D. A., Ramage, D., Chuang, J., Heer, J., Manning, C. D., & Jurafsky, D. (2013). Differentiating language usage through topic models. Poetics, 41(6), 607–625.
McFarland, D. A., Lewis, K., & Goldberg, A. (2016). Sociology in the era of big data: The ascent of forensic social science. The American Sociologist, 47(1), 12–35.
Molina, M., & Garip, F. (2019). Machine learning for sociology. Annual Review of Sociology, 45(1), 27–45.
Molnar, C. (2019). Interpretable Machine Learning: A Guide for Making Black Box Models Interpretable. Morrisville: Lulu.
Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106.
Pearl, J., & MacKenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. London: Penguin Books.
Platt, J. (1998). A History of Sociological Research Methods in America, 1920–1960. Cambridge: Cambridge University Press.
Popper, K. R. (1968). The Logic of Scientific Discovery (Third edition). London: Hutchinson.
Porter, T. M., & Ross, D. (2003). The Modern Social Sciences. Cambridge: Cambridge University Press.
Raftery, A. E. (2001). Statistics in sociology, 1950–2000: A selective review. Sociological Methodology, 31(1), 1–45.
Sagiv, L., & Schwartz, S. H. (1995). Value priorities and readiness for out-group social contact. Journal of Personality and Social Psychology, 69(3), 437–448.
Schwartz, S. H. (1992). Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In M. P. Zanna (Ed.), Advances in Experimental Social Psychology (Vol. 25, pp. 1–65). San Diego: Elsevier.
Watts, D. J. (2014). Common sense and sociological explanations. American Journal of Sociology, 120(2), 313–351.

8. Investigating social phenomena with agent-based models

Pablo Lucas and Thomas Feliciani

1 INTRODUCTION

Collective behaviour is more than the sum or linear aggregate of the behaviours of individuals. Many social phenomena result from the combined behaviour of individuals, or social groups, that are interconnected through various relationships. Understanding the dynamics resulting from individuals' repeated interactions poses difficult challenges for social scientists. The first major challenge is social complexity itself (Kiel & Elliott, 1997). Complexity here refers to the interconnectedness and interdependence of individual behaviours, which makes it harder to understand how exactly social individuals, environmental conditions, and circumstances contribute to the formation of a macro-level phenomenon. Social complexity can considerably hinder our ability to understand and foresee macro-level consequences of both individual behaviours and environmental conditions because it hinders our ability to trace, in sufficient detail, the micro- and meso-level chain of events generating macro-level features.

The second challenge is data scarcity. Despite the abundance of digital data generated to track all sorts of human activities nowadays, such data are often produced and collected in ways that are not directly useful for studying and testing these complex social systems. This happens for various reasons. The study of complex social systems often requires relational data, describing how individuals (or groups) relate to one another, as well as how individuals and their relationships change over time. When such data exist, there are often legal, commercial, or ethical constraints that make them unavailable for social science research.

Agent-based modelling (ABM)1 is one approach that can mitigate these two challenges and support reproducible research on complex social systems with emergent social phenomena, with the aim of better understanding them. In practice, ABM entails the design, implementation, and use of computer simulations to understand complex social systems. Simulations can reproduce interactions among individuals or groups, under specified conditions, while following behavioural rules. By carrying out simulations, ABM can help bridge the explanatory gap between individual behaviour on the micro level and societal phenomena as macro outcomes. In particular, ABM simulations allow for testing the conditions that are needed to reproduce empirically observable effects. This chapter introduces the ABM approach. We outline the theoretical aspects of ABM, which link to concepts from analytical sociology and computational social science, and present its inner workings. For illustration purposes, the chapter leans on two examples from the literature to exemplify how ABM can help tackle the two challenges of complexity and data scarcity.

The first example is DellaPosta, Shi, and Macy's (2015) study on differences in lifestyle and political choices. Drawing on an ABM, the study aims to explain why people who share the same political views tend to make similar choices in other domains that could be seen as unrelated to politics.

The authors show, for example, that liberals in the United States would typically choose to drink lattes, and that conservatives would typically dislike modern visual art. Their paper explores and tests a social process that can explain the alignment of political orientation with these other seemingly unrelated choices. This is done by hypothesising that the alignment emerges from repeated interactions among interdependent individuals, which is tested in an ABM developed to provide a middle-range theory explanation of the phenomenon (Merton, 1949; Hedström & Udehn, 2009). That is, they investigate whether the alignment of political views and lifestyle choices can result from two key tendencies that often shape social interactions: homophily and social influence. Homophily refers to the human tendency to interact more with – and, thus, to be influenced more by – those who are more similar to us (Lazarsfeld & Merton, 1954; McPherson, Smith-Lovin, & Cook, 2001). Social influence refers to the tendency to adjust one's own attitudes and behaviours when exposed to the attitudes or behaviours of others (Asch, 1956; Sherif, 1966; Tajfel & Turner, 1986). With an ABM, DellaPosta and colleagues demonstrate that, under certain conditions, a combination of homophily and social influence is sufficient for such alignment to emerge.

The second example of an ABM is from the field of meta research and focuses on the academic process of peer review, which has been described as a cornerstone institution of science (Squazzoni, Brezis, & Marušić, 2017). However, getting access to the rather abundant data on peer review activity is notoriously difficult for researchers. This is due to many reasons, including institutional resistance from research funders, journal editors, and publishers (Gurwitz, Milanesi, & Koenig, 2014; Squazzoni et al., 2020). The nature of peer review requires reviewer anonymity and confidentiality to some degree. Moreover, there is considerable variation in how peer review is implemented across different funding organisations and scholarly journals, which makes comparisons and generalisations about peer review more difficult. For example, there are recent open peer review initiatives, where reviewers and authors are able to choose what will be publicly accessible, yet these have had a slow uptake. The absence of empirical data, or the inability to access it, is one incentive for scholars to use ABM as a reliable way to reproducibly study various types of peer review 'in silico' (Squazzoni & Takács, 2011; Feliciani et al., 2019). The example we chose from this literature on peer review is the study by Roebber and Schultz (2011). This study develops an ABM to systematically test the design of peer review panels in order to improve the chances of funding the most deserving applicants. The ABM resembles a thought experiment, as it serves as an exploratory tool that can simulate scenarios of interest which would otherwise be infeasible to study empirically, for example, because of prohibitive costs or the significant organisational risk of implementing untested peer review policies in a real-world funding setting. By simulating the peer review process and the interactions among applicants, reviewers, and funders in an ABM, Roebber and Schultz were able to demonstrate that fewer submitted research proposals do not necessarily mean that better proposals are funded overall.
This is a non-trivial finding about the peer review process itself, and one that can be traced back to how the ABM simulation was designed and tested. The remainder of this chapter is structured as follows. Section 2 provides an overview of the ABM method, starting with its theoretical foundations. Section 3 presents the building blocks and working principles of a typical ABM simulation. Section 4 discusses how empirical data can be used in ABMs. Section 5 concludes the chapter with suggestions for further reading.


2 AGENT-BASED MODELLING AND GENERATIVE SOCIAL MECHANISMS

In the social sciences, by 'explanation' we mean an account of how a social phenomenon comes about or why it varies, for example, across time, regions, or (social) groups. Building an explanation of a social phenomenon then entails establishing a link between antecedent fact(s) A (the 'explanans' or 'explanantia') and an outcome B (the 'explanandum'). In DellaPosta et al. (2015), the candidate explanans A is the co-occurrence of homophily and social influence, and the explanandum B is the alignment of political orientation with lifestyle choices. To explain how A affects B is to detail and understand the mechanism itself, that is, the causal chain of events that links A and B, to the point that one can reliably replicate the effect. Such mechanism-based explanations are evaluated by comparing their performance with the available empirical evidence. Researchers do this via deductive reasoning and inferential statistics. Hypotheses are derived from an existing theory – or a proposed candidate explanation – and subjected to empirical testing (e.g., via standard research methods for the collection and analysis of empirical data). When studying complex social systems, however, developing explanations and deriving clear-cut hypotheses to support them is non-trivial. Often one can conjure up various explanations – even contradictory ones – of how A is related to B (Watts, 2011), yet it is difficult to evaluate them in a reliable and reproducible way. When all proposed alternative explanations are similarly plausible, an independent and reliable empirical test can help choose the best one and help us to understand the mechanism leading from A to B. This highlights the importance of having a formal (i.e., non-ambiguous) definition of the relationship between A and B as well as a systematically consistent method for deriving and testing hypotheses. The ABM approach can serve these social science purposes (Manzo, 2022).

2.1 Modelling Emergent Features of Social Systems

The explanatory mechanisms describing emergent macro-level social phenomena are said to be generative (Epstein, 2012), as they work from the bottom up. This means the process starts from the individual components of the social system and their relationships, rather than from an imposed top-down structure. In other words, the explanatory mechanisms are about the ways in which emergence is enacted. To work towards a generative mechanism, we must then answer: (1) Who are the agents involved in causing the macro-level explanandum? (2) What is the role of the environment? (3) Which characteristics and relationships, involving both agents and the environment, can contribute to the emergence of the macro-level explanandum? And finally, (4) what is the role of chance in all of this? To see how these aspects come together, it is helpful to imagine a road-cycling race as an archetype of a complex social system that we can look at from a generative perspective.2 In this hypothetical road-cycling race, (1) the agents could be the riders, each autonomously riding their bicycle in pursuit of an objective. The road section could be (2) the environment, as it imposes the same constraints and opportunities on all riders, for example, the length of the race, the terrain altitude profile, and the slipperiness of the road surface. Each rider has their (3) individual characteristics, such as strengths and weaknesses. Riders are also entangled in various relationships with other riders during the race, and those relationships will influence their behaviour to some extent.

For example, a domestique (also known as a gregario) has a role explicitly about facilitating the leaders of their team and perhaps also obstructing riders from other teams. But domestiques also have autonomous incentives to try and win the race themselves if the opportunity presents itself. Finally, there is (4) chance. Everything can change due to unexpected issues, such as adverse weather conditions, accidents, or other problems, including teamwork or mechanical issues. A road-cycling race can thus be understood as a complex social phenomenon with emerging outcomes, where all four elements must be examined together. Each of them can be represented, explored, and tested in an ABM equipped with generative mechanisms.

Social theories typically provide descriptions of social entities in natural language, in a similar way as the discussion above has described what constitutes a race. Building an ABM entails formalising (i.e., representing in a non-ambiguous language) the otherwise natural language descriptions of agents, their attributes, and their relationships. The formalisation of these elements is about translating their description from natural language into a formal language: equations or computer code. Formalising generative explanations gives us the opportunity to clarify their elements in detail, bring hidden assumptions to the surface, and identify and resolve ambiguities. This process can improve the theory itself by allowing further and more precise deductions. The ABM formalisation requirement is itself a scientific contribution, as one is enabled to study social behaviour and social systems in a rather unique way, which 'allows researchers to more rigorously explore the consequences of their assumptions and to share, critique, and improve models without transmission errors' (Edmonds et al., 2019, paragraph 1.4). One can build on previous developments in ways that are not attainable without such an approach. Therefore, once all ABM elements have been formally defined, one further develops and evaluates testable explanations about the social phenomenon by simulating the model according to an experimental design of choice. That is, by analysing the ABM experiment output in relation to the empirical emergent social phenomenon of interest. This allows tackling the question of whether a model can successfully reproduce empirical observations and, hence, explain them generatively.

2.2 Considerations about Agent-Based Modelling Macro and Micro Levels

The generative approach and ABM are best suited for working with theories of the middle range: theories providing an explanation for specific social phenomena which can, at later stages, be categorised into broader sets of theory (Merton, 1949; Hedström & Udehn, 2009). The middle-range approach contrasts with the social grand theory approach, as the latter strives for a unified and all-encompassing theory of human social behaviour and social systems. The explanans and explanandum of sociological theory typically focus on the macro level, yet sociological explanations of macro-level social phenomena necessarily require a micro-level foundation (Coleman, 1994). This entails explaining the macro–micro link, how macro-level factors shape individual action, and the micro–macro link, how interactions among individuals produce observed macro outcomes. In the example on the alignment of political orientation and lifestyle choices (DellaPosta et al., 2015), the macro–micro link includes a formal description of how the explanantia (homophily and social influence) impact an individual's lifestyle choice. Homophily shapes individuals' contact opportunities, and social influence affects individuals' personal attitudes. The micro–macro step of the explanation would then detail how contacts and attitudes interact in such a way that, at the group or society level, political attitudes and attitudes towards various lifestyle choices align.

The elaboration of formal, yet non-deterministic, propositions about the macro–micro and micro–macro links is central to the development of an ABM, as this is where the complexity and system dynamics occur. By modelling macro–micro and micro–macro links in a non-deterministic way, an ABM can simulate various configurations of social systems where individual agents affect – and at the same time are affected by – their social environment, including relationships. Since within an ABM those links are formal and explicit theoretical propositions, they become testable and reproducible. This provides leverage for better understanding the nature of various feedback loops within the social system, the choices that can be made by individual agents, and how these can be mediated by their environment (Ylikoski, 2016). The development of an ABM is thus a facilitating tool to systematically develop formalised, testable, and reproducible middle-range theories aimed at explaining a social phenomenon.

3 UNDER THE BONNET OF AN ABM IMPLEMENTATION

Our discussion so far has considered ABM as a method for developing theories or explanations of complex social systems, whereas the academic literature highlights a wider variety of motivations, purposes, and research questions in relation to implementing an ABM (Chattoe-Brown, 2014) – including proposed taxonomies for ABMs (Edmonds et al., 2019; Squazzoni, 2012). For the purposes of this chapter, it is sufficient to acknowledge that ABMs have a variety of uses, ranging from explaining a phenomenon, as in the example by DellaPosta et al. (2015), to exploring policy interventions, as in the study by Roebber and Schultz (2011). This section focuses on the common ingredients of ABMs.

3.1 Basic Agent-Based Modelling Building Blocks

The core ingredients of an ABM are the specification of agents, their environment, and how interactions can take place. Agents and environments are characterised by sets of attributes. These are the characteristics the researcher deems relevant for the particular social phenomenon at hand. For example, in DellaPosta et al. (2015), the main agent attributes are the following:

● political orientation, representing each agent's attitudes towards salient topics such as abortion, same-sex marriage, or the belief that, e.g., 'right and wrong should be based on God's laws';
● lifestyle choices, representing agents' attitudes and tastes – which are not necessarily political or ethical in nature – such as drink choice, or liking or disliking modern visual arts; and
● demographic characteristics, such as the agents' age, gender, and ethnicity, plus potentially other attributes describing the population composition, such as a distribution according to location.

Agent attributes can be either static or dynamic, depending on whether the attribute can change over time. For example, political orientation and lifestyle choices are dynamic attributes, as they can change over time, say due to exposure to social influence. In their ABM application, DellaPosta and colleagues assumed demographic attributes to be static, so this type of data – in

ABM research modellers must define what behaviours and processes can update the agents' dynamic attributes and how. These behaviours and processes are called behavioural rules and must be written in an unambiguous language, typically with equations, pseudo-code, or programming code. These rules may be fixed or may change during a simulation run; this represents another choice to be made by the designer. In the model by DellaPosta and colleagues, homophily and social influence are two key behavioural rules of agents. Social influence is formalised by specifying how agents update their dynamic attributes (i.e., political orientation and lifestyle choices) in accordance with other agents in the network environment. In the process of formalising an agent's behavioural rule, the ABM designer will probably also have to make some arbitrary decisions. For example: how exactly does each individual update their opinion? Where available, existing theories can guide the ABM design and implementation. Frequently, however, there is no theory specifically applicable to the phenomenon being modelled. In such situations an arbitrary choice will likely be made – one motivated by researcher pragmatism or intuition – and these choices are always open to revision and replication by others. Furthermore, behavioural rules can incorporate an element of chance. The ABM employed by DellaPosta and colleagues specifies, for example, the probability of an agent adopting a belief from another agent as a function of the similarity of the two agents, thus incorporating the principle of homophily. This is a modelling choice that, due to its explicit proposition and implementation, can be scrutinised.

Behavioural rules can make ABM agents autonomous, adaptive, and interdependent. Autonomy means that agent action is not centrally coordinated: the action-selection mechanism of each agent (i.e., how an agent decides what to do next) is a local decision made by each agent according to their own capabilities. Adaptation means that the behaviour of an agent can change due to their own actions and also due to their environment.4 Lastly, interdependency entails that agents' actions may depend not only on their own previous actions but also on the actions of other agents. Another key strength of ABM is its capacity to deal with varying degrees of heterogeneity without requiring considerable additional design and implementation effort. Depending on the needs of the modelled phenomenon and the simulation setup, the researcher can allow agents – or groups of agents – to differ in terms of behavioural rules and static and dynamic attributes. This is an advantage, as researchers can easily incorporate heterogeneous agents in an ABM to (1) represent individuals and specific differences between groups of individuals, along with differences in how these may interact, and (2) represent a level of social heterogeneity itself, which is known to have a significant impact on the dynamics generated in both real social phenomena and ABM outcomes (e.g., Macy & Tsvetkova, 2015).
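As a hedged illustration of such a probabilistic behavioural rule, the sketch below (building on the Agent sketch above) makes the chance of adopting a peer's belief proportional to the agents' current similarity. The functional form is an assumption for illustration, not the exact rule used by DellaPosta and colleagues.

    import random

    def similarity(a, b):
        """Share of lifestyle topics on which two agents currently agree (0..1)."""
        shared = [t for t in a.lifestyle_choices if t in b.lifestyle_choices]
        if not shared:
            return 0.0
        agree = sum(a.lifestyle_choices[t] == b.lifestyle_choices[t] for t in shared)
        return agree / len(shared)

    def influence_step(agent, peer):
        """A binomial trial: the more similar the agents, the likelier the agent
        is to copy one of the peer's beliefs (homophily-weighted social influence)."""
        if peer.lifestyle_choices and random.random() < similarity(agent, peer):
            topic = random.choice(list(peer.lifestyle_choices))
            agent.lifestyle_choices[topic] = peer.lifestyle_choices[topic]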
ABMs can also be used to define the global properties of the general environment in which agents act. The environment may simply serve as a container of agents (e.g., an abstract representation of location) or it may have its own agency (i.e., the environment can itself change during a model run). One example of the latter is provided by Roebber and Schultz (2011). In their model, agents represent peer reviewers for a research funding institution and the environment is the set of rules and conditions that influence their work. The research programme officer is the embodiment of the environment in which peer reviewers work, as it sets out the rules for how each agent should handle peer review. Other examples include ABMs

applied to geographical land use and cover change, where an accurately modelled environment is itself fundamental to the modelling process (Millington, Katerinchuk, Silva, Victoria, & Batistella, 2021). The ABM behavioural rules are written in a computer programming environment and executed iteratively to simulate how agents and their environment interact, while monitoring how they change over simulated time. The researcher can vary how attributes are distributed at the beginning of the simulation and change the environmental conditions, the interaction network among agents, or the behavioural rules themselves. By observing how each of these changes affects the overall modelled dynamics, we can learn whether and how each contributes to the simulated outcome of the phenomenon. This can also help in considering which aspects of the model can be validated empirically. In an ABM, the explanantia can be the environmental conditions, the set of behavioural rules, the starting configuration of the agents' attributes, or a combination of these. The explanandum, on the other hand, is the model output. This typically involves two components: the micro level (i.e., the final state of all agent attributes) and the macro level (i.e., the overall state at the end of the simulation run). In the aforementioned examples, the micro level is the particular data each agent carries with it to the end of a model run, while the macro level is, respectively, the average quality of funded research proposals and the correlation between political orientation and lifestyle choices.
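Tying these pieces together, the following is a minimal sketch of such an iterative simulation loop, reusing the Agent and influence_step sketches above. The activation schedule and the macro-level output are illustrative assumptions, not the exact design of either cited model.

    import random

    def run_model(population, steps=10_000):
        """Execute the behavioural rules iteratively; return micro- and macro-level outputs."""
        for _ in range(steps):
            agent, peer = random.sample(population, 2)  # random pairwise activation
            influence_step(agent, peer)
        # Micro level: the final state of all agent attributes
        micro = [(a.political_orientation, dict(a.lifestyle_choices)) for a in population]
        # Macro level: an illustrative aggregate, e.g. the share of agents choosing 'latte'
        macro = sum(a.lifestyle_choices.get('drink') == 'latte' for a in population) / len(population)
        return micro, macro

    micro, macro = run_model(population)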

3.2 Finding Agent-Based Modelling Building Blocks

Choosing the ingredients to build an ABM (i.e., agents, their characteristics, behaviours, and environmental properties) is a central task and a challenge in itself when working with this research method. Each decision about what to represent, and how exactly, requires a technical judgement of relevance by the researcher. Relevance may be guided by theory, data availability, assumptions, and hypotheses about what ought to be integrated into the simulation. As part of that, one has to judge when there is too little detail – or too much – in the specification of each model component. Intuitively, an ABM has too little detail when it lacks those ingredients of the agents – or of the agents' environment – that are absolutely necessary to understand or generate the phenomenon the ABM is supposed to model. For example, if the ABM by DellaPosta and colleagues did not contain homophily, it would not be able to generatively explain the alignment of political orientation and lifestyle choices. This can be interpreted as homophily being essential in that case of social influence and alignment. Others are free to challenge that claim and demonstrate otherwise via their own tested designs and specifications. At the other extreme, overspecification occurs when an ABM has too much detail, with extra features that do not contribute meaningfully to its intended purposes (Sun et al., 2016). This highlights the usefulness of evaluating the contribution of each model ingredient by monitoring and testing its effect on the ABM results. One may employ different sources, first, to determine which ingredients are absolutely essential in a model and, second, to choose how much detail about them should be implemented. As the ABM research method is flexible, sources may include, among others, the existing theoretical or empirical literature, but also the researcher's own understanding of the social phenomenon and, depending on the case, advice from topic experts or informants. There are also heuristics guiding the research modeller towards appropriate levels of detail. These heuristics stem from two different approaches to ABM development and are best recognised by their acronyms: the KISS ('keep it simple, stupid!') approach proposed by Axelrod (1997) and the KIDS ('keep it descriptive, stupid!') approach proposed by Edmonds and Moss (2004).

Both approaches are useful for developing ABMs, and examples of each abound. The KISS approach starts with as minimal a model as possible, which may be a highly abstract representation of the modelled social phenomenon. More ingredients (or more detailed ingredients) are added until the ABM can synthetically generate the target phenomenon.5 Trial and error is a recurring way to test whether a simulation model, in its current state, is sufficient for its purposes; if not, the research modeller increases the level of detail represented in the ABM. Note that this process can be guided by the existing theoretical or empirical literature, as previously mentioned. In contrast, KIDS starts with a model that is as complete as possible from the outset. The research modeller incorporates into the ABM all that is known about the target phenomenon. That process may be fed by empirical evidence and insights (e.g., from data, literature, intuition, or expert knowledge). The task is then to assess how each of these ingredients contributes to the model's usefulness and to identify those ingredients not contributing meaningfully to the model's ability to generate the target phenomenon. Some ABM ingredients will likely be deemed unnecessary and removed during this process, which can again be guided by the existing theoretical or empirical literature. Arguably, KIDS is a more resource-intensive development approach than KISS, especially where the phenomenon of interest has extensive empirical evidence and insights. The choice between mainly following the KISS or the KIDS approach depends on a variety of factors, ranging from the nature of the phenomenon and its available evidence to the available resources (e.g., time and research staff) and the preferences of the researchers themselves. Most ABMs are developed once, for peer-reviewed publication, using either approach; it is arguably more common for a different research team to then expand the work done in an ABM originally created elsewhere. Establishing which approach is best suited to a specific problem is a research exercise that depends on the phenomenon itself and on the researchers' evaluation criteria as to what would constitute a useful ABM.

3.3 Exploring the ABM Parameter Space

In ABMs, independent variables are typically captured by one or more model parameters and dependent variables are captured by output metrics of the model. Consequently, running an ABM several times with alternative parameterisations generates data that can be statistically analysed. For example, a set of independent and dependent variables can be analysed simultaneously in order to learn about the relationship between parameters and the corresponding simulation outcomes. Not only do we need to run the ABM with different parameter configurations; frequently we also want to run the ABM multiple times for the same parameter configuration. Such multiple runs per parameterisation are required when stochasticity is part of the model, as chance can influence key aspects such as the initial conditions of a simulation or events while the model is running. The model of peer review by Roebber and Schultz (2011) offers an example of stochastic model initialisation: in the ABM, the initial quality of submissions is randomly sampled from a probability distribution at the start of the simulation, and so, in principle, a different initial sample can produce different results.

The model of homophily and social influence by DellaPosta et al. (2015) offers an example of how stochasticity can influence the model dynamics even after initialisation: whether or not an agent will update a belief is determined by a binomial trial, the outcome of which has implications as each ABM run progresses. That is, different draws can influence subsequent ABM dynamics and ultimately the ABM output itself. Hence, for each unique parameter configuration in an ABM we need to design an experimental setup containing multiple independent runs, so that we can collect a sample from the distribution of the possible outcomes that dynamically emerge from each parameterisation.6

The ABM parameter space represents the set of all possible unique parameter configurations of the simulation. In practice, not all configurations need to be explored; some will not match empirical evidence and so need no exploration. The ABM parameter space can be vast, theoretically even infinite, depending on the combination of (1) the number of specified parameters and (2) the number of values each of these can take. In practice, vast parameter spaces and the need for many runs per parameter configuration are why ABM simulations require computational power.7 Often, however, a large ABM parameter space combined with limited computational resources (e.g., little computer memory and processing power) will impose computational bottlenecks for ABM experiments. In these cases, one needs strategies to minimise computer runtime. Some strategies aim at improving computer performance itself, thereby making the most of the available runtime.8 Other strategies involve minimising the ABM parameter space through a carefully crafted experimental design (e.g., the researcher specifies fewer levels of possible configurations for each parameter). Such strategies may include using optimal design algorithms to find a minimal set of parameter configurations.9 Alternatively, one can configure the ABM only with parameters that are considered theoretically critical. Once a finite and manageably sized parameter space is identified, simulation runs can be executed for each parameter configuration. ABM software frameworks typically include tools to automate this process of sweeping the ABM parameter space as many times as necessary.10
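The sketch below illustrates such a parameter sweep with repeated, independently seeded runs per configuration, reusing the Agent, TOPICS, and run_model sketches from earlier. The parameter grid, run counts, and outcome summaries are illustrative assumptions, not any framework's built-in sweep tool.

    import itertools
    import random
    import statistics

    PARAM_GRID = {'population_size': [50, 100, 200], 'steps': [5_000, 10_000]}
    RUNS_PER_CONFIG = 30  # repeated runs sample the outcome distribution under stochasticity

    results = []
    for pop_size, steps in itertools.product(*PARAM_GRID.values()):
        outcomes = []
        for seed in range(RUNS_PER_CONFIG):
            random.seed(seed)  # each run is independent and reproducible
            population = [
                Agent(
                    age=random.randint(18, 80),
                    gender=random.choice(['female', 'male']),
                    lifestyle_choices={t: random.choice(opts) for t, opts in TOPICS.items()},
                )
                for _ in range(pop_size)
            ]
            _, macro = run_model(population, steps=steps)
            outcomes.append(macro)
        results.append({
            'population_size': pop_size, 'steps': steps,
            'mean_outcome': statistics.mean(outcomes),
            'sd_outcome': statistics.stdev(outcomes),
        })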

4 USE OF EMPIRICAL DATA IN ABMS

When used as a research method for carrying out thought experiments and proposing theory development, ABMs can describe a social system analogically without a particularly strict need for grounding in empirical data. However, empirical data will be needed if an ABM simulation is intended to reproduce, with some fidelity, features of a specific social phenomenon. Using the terminology of Chattoe-Brown (2014) and Gilbert and Troitzsch (2005), we can consider two uses of empirical data in ABMs: calibration and validation.11

4.1 ABM Calibration and Validation

Empirical calibration refers to the use of real-world data as input for the ABM. The purpose of calibration is to fit the model to some observed social system in order to make the simulation more realistic – or at least more applicable to the studied social context. In practice, calibration entails initialising agent attributes, initialising the environment, or selecting a parameter configuration (or parameter space) such that the chosen ABM values correspond to some empirically observed reality.

For example, Roebber and Schultz (2011) set the ABM parameter entitled 'number of reviewers' to values from 1 to 6, since this range covers the typical size of peer review panels. Likewise, DellaPosta et al. (2015) base some of their ABM parameter values on results from empirical research, such as the average number of contacts in one's network. In these examples, the calibration of ABM parameters rests on data sources such as results from empirical primary research or secondary data; empirical data are also often collected first-hand for the specific purpose of calibrating ABM parameters. While empirical calibration concerns the input of the ABM, validation refers to the use of empirical data to evaluate its output. The purpose of empirical validation is to test whether, and to what degree, the ABM predictions can be empirically confirmed, and with what degree of error. In practice, ABM validation consists of comparing ABM outcome variables with corresponding empirically observed variables. Evaluating the similarity between simulated and empirical data thus amounts to evaluating the correctness – or the effective plausibility – of an ABM outcome. DellaPosta et al. (2015) did not have access to suitable data for the systematic validation of their ABM. Nevertheless, their article carefully considers the degree to which the ABM achieves empirical realism: the levels of correlation among various political opinions and lifestyle choices predicted by the ABM are compared with the levels of correlation empirically observed in existing survey data. The finding is that the predicted correlations are at least as strong as those empirically observed – which strengthens the case for the validity of the ABM.
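The following minimal sketch separates the two uses of data just described: an input parameter is calibrated from an empirically observed value, and a simulated output is validated against its empirical counterpart. Both the parameter name and the numerical values are hypothetical placeholders, not figures from the cited studies.

    import statistics

    # Calibration: fix an input parameter to an empirically observed value
    # (here, an assumed average network size from survey research).
    EMPIRICAL_MEAN_CONTACTS = 15
    params = {'mean_contacts': EMPIRICAL_MEAN_CONTACTS}

    # Validation: compare the distribution of a simulated outcome with the
    # corresponding empirically observed value.
    def validate(simulated_outcomes, empirical_value, tolerance=0.05):
        """Return whether the mean simulated outcome falls within a tolerance
        band around the empirical value, plus the absolute error."""
        error = abs(statistics.mean(simulated_outcomes) - empirical_value)
        return error <= tolerance, error

    # e.g. simulated correlations from repeated runs vs. an observed correlation
    ok, err = validate([0.41, 0.44, 0.39], empirical_value=0.42)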

4.2 Types of Empirical Data in an Agent-Based Model

The data inputs of an ABM include its parameters and initial conditions, which typically come in the form of numerical variables. Likewise, the outcome variables – the ABM outputs – are typically numeric. It is therefore easy to think of the empirical calibration and validation of an ABM as direct quantitative matching or comparison of its input or output with quantitative empirical data. Yet empirical data in an ABM does not necessarily need to be quantitative. Indeed, the data requirements of ABMs are generally flexible, as one can represent both quantitative and qualitative insights via computer programming. This feature makes ABM a rather unique and valuable tool for developing theory where empirical data are scarce. To begin with, depending on the purpose of the ABM and what is available to the research modeller, ABMs can work with various kinds of quantitative data (e.g., survey data, network data, or time series data). Providing an exhaustive list is impossible, but some examples illustrate the variety of data types. Some ABMs study how social phenomena change through time, for example, shifts in public attitudes towards harsh penalties or fluctuations in stock market prices; these models need longitudinal or time series data for empirical calibration and validation (see, for example, Chattoe-Brown, 2014; Recchioni, Tedeschi, & Gallegati, 2015). Other ABMs have been used to explore the role of social networks, such as how friendships among classmates change over time and affect individuals; if it is important that the simulated network is as realistic as possible, research modellers can use network data for the network calibration in an ABM (see, for example, Zhang, Shoham, Tesdahl, & Gesell, 2015). Other ABMs are spatially explicit, containing processes and entities that are geographically located in specific real-world places. Accurate Geographical Information Systems data would typically be used for the calibration and validation of these simulations (see, for example, Robinson & Brown, 2009; Crooks, 2010).

Qualitative (i.e., non-numeric) data can also be used in various ways in an ABM: it can, for example, inform the modelling of behavioural rules or environmental constraints in the simulation. Qualitative insights can be represented in an ABM, in the form of behavioural rules or other model structures, based on the analysis of qualitative documents or semi-structured interviews with stakeholders and practitioners (for examples, see Zia & Koliba, 2015; Wijermans & O'Neill, 2020). Some research modellers take this possibility a step further by pursuing the co-creation of the ABM itself with stakeholders, an approach known as companion modelling (Barreteau et al., 2003; Étienne, 2013). This approach is intended to benefit researchers and participants mutually, as the former can draw directly on the experience and expertise of the stakeholders involved in the phenomenon. Note that stakeholders need not be trained in the social sciences, as the technical design and development of a more realistic and useful model is done in tandem with an ABM social scientist. Stakeholders engaging with the ABM development process can also benefit by reflecting on their own perspectives and potentially learning something new about their own domain.

5 CONCLUSIONS

This chapter introduced the ABM research approach as a powerful and flexible tool for social scientists, one that enables the investigation of complex social phenomena through formal representations that can be reliably tested and systematically reproduced. We argued that ABMs are tailored to middle-range theories, flexible in their data requirements, and suitable for a variety of goals – including theory building, thought experiments, and forecasts limited to certain contexts. Since the first uses of ABM in the social sciences, over 50 years ago, several academic standards and best practices have emerged for designing and using ABMs, and further improvements are still being developed. Important advancements have been made in terms of reproducibility and the standardised communication of ABMs. The open-access availability of the ABM code itself, of data where possible, and of the corresponding documentation, alongside the paper and supplemental material, has become a requirement for some journals publishing peer-reviewed ABM research. Code and documentation can be hosted on ABM-dedicated online sharing platforms such as the model library of CoMSES (www.comses.net/); on more general software development platforms like GitHub (https://github.com/); on open-access repositories for research materials such as Figshare (https://figshare.com/) or Dataverse (https://dataverse.org/); or directly by the journals on their websites. Moreover, there are specific protocols to help standardise the documentation accompanying ABM code, such as the overview, design concepts, details (ODD) protocol (Grimm et al., 2020). An ODD-compliant ABM documentation describes the aims, functioning, and implementation of the ABM using natural language as much as possible, as a way to facilitate understanding of the model and the implemented programming code. Ideally, the ODD makes an ABM more accessible to non-experts, enabling a better understanding of the modelling decisions, and is sufficiently specific to allow readers to re-implement – or indeed replicate – the original ABM.

An ABM simulation can be implemented in any programming or scripting language, including general purpose languages and scripting languages for statistics and data science such as R (R Core Team, 2022) or Python (Van Rossum & Drake, 2009). In addition, several specialised frameworks and languages are available, such as NetLogo (Wilensky, 1999), Repast (North et al., 2013), and GAMA (Taillandier et al., 2019). We conclude by recommending further reading on ABM topics, including historical perspectives, best practices, and other examples: Gilbert and Terna (2000); Bonabeau (2002); Macy and Willer (2002); Epstein (2014); Flache and de Matos Fernandes (2021); Manzo (2022).

NOTES

1. In this introductory chapter the acronym 'ABM' refers to agent-based modelling and agent-based model; these terms are used interchangeably throughout the chapter. For many scholars, ABM has become synonymous with the more general term 'social simulation', which can include other types of computational simulation approaches that would not be strictly defined as an ABM.
2. The idea of studying sports as complex social systems from a generative explanatory perspective is not new. Hoenigman, Bradley, and Lim (2011) provide an example where an ABM is used to study cooperation dynamics in terms of individual and group choices in different bicycle-racing scenarios. Lauren, Quarrie, and Galligan (2013) provide another example, with a similar ABM approach, applied to the testing and coaching of rugby tactics at a high level.
3. Note that, as in all research modelling work, some modelling decisions and assumptions in an ABM will likely contain some degree of arbitrariness. This can happen when there is no empirical evidence about something that is an essential part of the phenomenon, as is the case when deciding which attributes are considered dynamic or static. This is an explicit modelling choice that becomes open to scrutiny by others, who can themselves test the role of such model assumptions.
4. The agents of some ABMs are cognitively more complex: they have the memory and capacity to learn from previous experience or from others. Memory and learning are considered aspects of agents' adaptivity. For examples, see Duffy (2006) and Grimm et al. (2020: Supplementary file S1: 'Learning').
5. See also the related method of decreasing abstraction in Lindenberg (1992).
6. Executing multiple independent runs per parameter configuration is equivalent to taking repeated samples from the distribution of all possible dynamics an ABM can produce given that parameter configuration. This is in part why some ABMs are described as a type of Monte Carlo simulation.
7. There are classic ABM examples that were simulated by hand, for example, the works on residential segregation by Sakoda (1971) and Schelling (1971). These ABMs were conceived as thought experiments done by hand on a checkerboard: pieces were physically distributed in random locations on the board, and their positions were then iteratively and manually changed according to the model rules.
8. Besides efficient computer coding, strategies can entail parallelising the simulations over different central processing unit (CPU) or graphics processing unit (GPU) cores. For example, the ABM could be simultaneously executed over independent machines, or a single simulation run could be optimised in terms of CPU, GPU, and memory use. Analysing the algorithmic complexity of an implemented ABM can also lead to optimisations. In the future, quantum-computing architectures might also bring substantial runtime improvements to ABMs.
9. Search space optimisation algorithms (e.g., Stonedahl & Wilensky, 2010) depend on the goal of the researcher. Those algorithms are helpful when the goal is to find an ideal parameter configuration producing a desired outcome. They would not help when the goal is to understand, in a more general way, how the parameters and their configurations can affect the full range of ABM outcomes.
10. NetLogo (Wilensky, 1999) is probably the most popular software for ABM in education and research. In NetLogo, the tool for setting up automated parameter sweeps is called BehaviorSpace.

11. Note that there is no scholarly consensus on the detailed use of the labels 'calibration' and 'validation' in the context of ABM and simulation research, as there are different acceptable ways to define and apply these.

REFERENCES

Asch, S. E. (1956). Studies of independence and conformity: I. A minority of one against a unanimous majority. Psychological Monographs: General and Applied, 70(9), 1–70.
Axelrod, R. (1997). The complexity of cooperation. Princeton University Press.
Barreteau, O., Antona, M., D'Aquino, P., Aubert, S., Boissau, S., Bousquet, F., … & Weber, J. (2003). Our companion modelling approach. Journal of Artificial Societies and Social Simulation, 6(1).
Bonabeau, E. (2002). Agent-based modeling: Methods and techniques for simulating human systems. Proceedings of the National Academy of Sciences, 99(suppl 3), 7280–7287.
Chattoe-Brown, E. (2014). Using agent based modelling to integrate data on attitude change. Sociological Research Online, 19(1), 159–174.
Coleman, J. S. (1994). Foundations of social theory. Harvard University Press.
Crooks, A. T. (2010). Constructing and implementing an agent-based model of residential segregation through vector GIS. International Journal of Geographical Information Science, 24(5), 661–675.
DellaPosta, D., Shi, Y., & Macy, M. (2015). Why do liberals drink lattes? American Journal of Sociology, 120(5), 1473–1511.
Duffy, J. (2006). Agent-based models and human subject experiments. In: Tesfatsion, L., & Judd, K. L. (Eds), Handbook of computational economics (Vol. 2, pp. 949–1011). North Holland.
Edmonds, B., Le Page, C., Bithell, M., Chattoe-Brown, E., Grimm, V., Meyer, R., … Root, H., & Squazzoni, F. (2019). Different modelling purposes. Journal of Artificial Societies and Social Simulation, 22(3), 6.
Edmonds, B., & Moss, S. (2004). From KISS to KIDS: An 'anti-simplistic' modelling approach. In: International workshop on multi-agent systems and agent-based simulation (pp. 130–144). Springer.
Epstein, J. M. (2012). Generative social science. Princeton University Press.
Epstein, J. M. (2014). Agent_Zero. Princeton University Press.
Étienne, M. (Ed.). (2013). Companion modelling: A participatory approach to support sustainable development. Springer Science & Business Media.
Feliciani, T., Luo, J., Ma, L., Lucas, P., Squazzoni, F., Marušić, A., & Shankar, K. (2019). A scoping review of simulation models of peer review. Scientometrics, 121(1), 555–594.
Flache, A., & de Matos Fernandes, C. A. (2021). Agent-based computational models. In: Manzo, G. (Ed.), Research handbook on analytical sociology (pp. 453–473). Cheltenham, UK and Northampton, MA, USA: Edward Elgar Publishing.
Gilbert, N., & Terna, P. (2000). How to build and use agent-based models in social science. Mind & Society, 1(1), 57–72.
Gilbert, N., & Troitzsch, K. G. (2005). Simulation for the social scientist (second edition). Open University Press.
Grimm, V., Railsback, S. F., Vincenot, C. E., Berger, U., Gallagher, C., DeAngelis, D. L., … & Ayllón, D. (2020). The ODD protocol for describing agent-based and other simulation models: A second update to improve clarity, replication, and structural realism. Journal of Artificial Societies and Social Simulation, 23(2), 7.
Gurwitz, D., Milanesi, E., & Koenig, T. (2014). Grant application review: The case of transparency. PLoS Biology, 12(12), e1002010.
Hedström, P., & Udehn, L. (2009). Analytical sociology and theories of the middle range. In: Hedström, P., & Bearman, P. (Eds), The Oxford handbook of analytical sociology (pp. 25–47). Oxford University Press.
Hoenigman, R., Bradley, E., & Lim, A. (2011). Cooperation in bike racing: When to work together and when to go it alone. Complexity, 17(2), 39–44.
Kiel, L. D., & Elliott, E. W. (Eds). (1997). Chaos theory in the social sciences: Foundations and applications. University of Michigan Press.

Lauren, M. K., Quarrie, K. L., & Galligan, D. P. (2013). Insights from the application of an agent-based computer simulation as a coaching tool for top-level rugby union. International Journal of Sports Science & Coaching, 8(3), 493–504.
Lazarsfeld, P. F., & Merton, R. K. (1954). Friendship and social process: A substantive and methodological analysis. In: Berger, M., Abel, T., & Page, C. H. (Eds), Freedom and control in modern society (pp. 18–66). Van Nostrand.
Lindenberg, S. (1992). The method of decreasing abstraction. Rational Choice Theory: Advocacy and Critique, 1(6).
Macy, M. W., & Tsvetkova, M. (2015). The signal importance of noise. Sociological Methods & Research, 44(2), 306–328.
Macy, M. W., & Willer, R. (2002). From factors to actors: Computational sociology and agent-based modeling. Annual Review of Sociology, 28(1), 143–166.
Manzo, G. (2022). Agent-based models and causal inference. John Wiley & Sons.
McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1), 415–444.
Merton, R. K. (1949). On sociological theories of the middle range. In: Merton, R. K. (Ed.), Social theory and social structure (pp. 39–53). Simon & Schuster.
Millington, J. D. A., Katerinchuk, V., Silva, R. F. B. da, Victoria, D. de C., & Batistella, M. (2021). Modelling drivers of Brazilian agricultural change in a telecoupled world. Environmental Modelling & Software, 139, 105024.
North, M. J., Collier, N. T., Ozik, J., Tatara, E., Altaweel, M., Macal, C. M., Bragen, M., & Sydelko, P. (2013). Complex adaptive systems modeling with Repast Simphony. Complex Adaptive Systems Modeling, 1(3).
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. www.R-project.org/
Recchioni, M. C., Tedeschi, G., & Gallegati, M. (2015). A calibration procedure for analyzing stock price dynamics in an agent-based framework. Journal of Economic Dynamics and Control, 60, 1–25.
Robinson, D. T., & Brown, D. G. (2009). Evaluating the effects of land-use development policies on ex-urban forest cover: An integrated agent-based GIS approach. International Journal of Geographical Information Science, 23(9), 1211–1232.
Roebber, P. J., & Schultz, D. M. (2011). Peer review, program officers and science funding. PLoS ONE, 6(4), e18680.
Sakoda, J. M. (1971). The checkerboard model of social interaction. Journal of Mathematical Sociology, 1(1), 119–132.
Schelling, T. C. (1971). Dynamic models of segregation. Journal of Mathematical Sociology, 1(2), 143–186.
Sherif, M. (1966). Group conflict and cooperation: Their social psychology. Routledge & Kegan Paul.
Squazzoni, F. (2012). Agent-based computational sociology. John Wiley & Sons.
Squazzoni, F., Ahrweiler, P., Barros, T., Bianchi, F., Birukou, A., Blom, H. J. J., Bravo, G., Cowley, S., Dignum, V., Dondio, P., Grimaldo, F., Haire, L., Hoyt, J., Hurst, P., Lammey, R., MacCallum, C., Marušić, A., Mehmani, B., Murray, H., … Willis, M. (2020). Unlock ways to share data on peer review. Nature, 578(7796), 512–514.
Squazzoni, F., Brezis, E., & Marušić, A. (2017). Scientometrics of peer review. Scientometrics, 113(1), 501–502.
Squazzoni, F., & Takács, K. (2011). Social simulation that 'peers into peer review'. Journal of Artificial Societies and Social Simulation, 14(4).
Stonedahl, F., & Wilensky, U. (2010). Finding forms of flocking: Evolutionary search in ABM parameter-spaces. In: International workshop on multi-agent systems and agent-based simulation (pp. 61–75). Springer.
Sun, Z., Lorscheid, I., Millington, J. D., Lauf, S., Magliocca, N. R., Groeneveld, J., … & Buchmann, C. M. (2016). Simple or complicated agent-based models? A complicated issue. Environmental Modelling & Software, 86, 56–67.
Taillandier, P., Gaudou, B., Grignard, A., Huynh, Q.-N., Marilleau, N., Caillou, P., Philippon, D., & Drogoul, A. (2019). Building, composing and experimenting complex spatial models with the GAMA platform. Geoinformatica, 23(2), 299–322.

Tajfel, H., & Turner, J. C. (1986). The social identity theory of intergroup behavior. In: Worchel, S., & Austin, W. G. (Eds), Psychology of intergroup relations (pp. 7–24). Nelson-Hall.
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Watts, D. J. (2011). Everything is obvious: Why common sense is nonsense. Atlantic Books.
Wijermans, N., & O'Neill, E. D. (2020). Towards modelling interventions in small-scale fisheries. In: Verhagen, H., Borit, M., Bravo, G., & Wijermans, N. (Eds), Advances in social simulation (pp. 485–489). Springer.
Wilensky, U. (1999). NetLogo. Center for Connected Learning and Computer-Based Modeling, Northwestern University. http://ccl.northwestern.edu/netlogo/
Ylikoski, P. (2016). Thinking with the Coleman boat. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-132711
Zhang, J., Shoham, D. A., Tesdahl, E., & Gesell, S. B. (2015). Network interventions on physical activity in an afterschool program: An agent-based social network study. American Journal of Public Health, 105(S2), S236–S243.
Zia, A., & Koliba, C. (2015). The emergence of attractors under multi-level institutional designs: Agent-based modeling of intergovernmental decision making for funding transportation projects. AI & Society, 30(3), 315–331.

9. Inclusive digital focus groups: lessons from working with citizens with limited digital literacies

Elinor Carmi, Eleanor Lockley, and Simeon Yates

1 INTRODUCTION

As we adapt to remote ways of working, we need to consider how qualitative research approaches, especially with vulnerable communities, need to shift. We need to reassess how we, as researchers, reach out to and work with participants, especially those with more limited digital skills, access, confidence, or broader digital literacy. This reassessment needs not only to consider practical issues but also to address concerns about individual wellbeing, the associated ethical processes, and the extent to which a shift to digital methods may further marginalise the voices of specific communities. During the global COVID-19 pandemic many citizens shifted their lives 'online' and digital media became central to social interaction. Researchers had to reactively change their methods and accommodate ever-changing impositions on their fieldwork (Dodds & Hess, 2020; Wood et al., 2020; Williams et al., 2021). However, not all communities are equally 'online', nor do they all share the same levels of digital literacy (Yates et al., 2015, 2020; Yates & Lockley, 2018, 2020). An overly simplistic shift to online methods for such things as focus group work could therefore significantly marginalise key communities. In this chapter we reflect on our methodological experience of conducting online focus groups with United Kingdom (UK) citizens. This work focused on their digital literacies, mainly recruited participants with low or lower digital skills, and took place during the first 12 months of the COVID-19 pandemic. Previous work discussing the challenges of online focus groups (Brüggen & Willems, 2009; Abrams et al., 2015; Moore et al., 2015; Cyr, 2016) assumes that participants are proficient with digital systems and devices. In contrast, we illustrate the additional challenges of conducting focus groups where participants have low digital skills. The pandemic also highlighted the much lower levels of digital access and skills that many citizens and households have compared to public perceptions or expectations. Although this has been well documented in academic research (Hargittai, 2002; Van Dijk & Hacker, 2003; Yates et al., 2015, 2020; Yates & Lockley, 2018, 2020), it sets up a complex challenge for researchers working with citizens or communities where levels of digital skills and access may be low. Adopting digitally mediated video-conferencing tools to undertake remote interview or focus group work is not a simple option in this case. In the remainder of the chapter we use the terms 'remote' or 'online' focus groups to refer to a focus group undertaken using an appropriate video-conferencing or group interaction platform such as, but not exclusively, Zoom or Microsoft Teams. Understanding how to adapt research to remote digital formats is important for sociology and media scholars, particularly when conducting research on digital literacies, but also for all work with socially, economically, or culturally marginalised groups.

Low digital access, skills, and capabilities closely correspond with other key demographics in complex and intersectional ways, such as age, income, poverty, being in social housing, low educational attainment, long-term ill health, and ethnicity (DiMaggio et al., 2004; Robinson, 2009; Yates et al., 2015; Van Deursen et al., 2017; Yates & Lockley, 2018, 2020; Yates et al., 2020). In our own work we have identified groups of younger people who have low digital skills and predominantly use only social media and entertainment media. Far from being 'digital natives' with deep skills, this group has low educational attainment and low income alongside low access levels (smartphone only) and low skills (Yates et al., 2020). There is, therefore, both a challenge in using digital tools with these groups and a danger that methodological shifts which foreground digital tools may limit access to, or marginalise the contributions of, such groups. We explore here our experience of addressing this challenge in the context of a study of citizens' data literacy ('Me and My Big Data').1 As we prepared to move to the fieldwork stage of the project, the global pandemic arrived, limiting or preventing face-to-face interaction. We therefore worked with project partners to plan a shift to a 'digital' or 'mixed' format for our focus groups. This included a literature review of recent work on using digital tools to run remote focus groups and close working with our stakeholder partners and their network of local community centres. The chapter reports on this literature review, the planning process, the implementation of the focus groups in a variety of formats, and our reflections on the benefits, limitations, and potential best practice stemming from this experience.

2 LITERATURE REVIEW

Focus groups are 'group discussions exploring a set of specific issues that are focused because the process involves some collective activity' (Kitzinger, 1994: 104). They have a long history stretching back to Merton and Lazarsfeld's use of 'focused interviews' in the 1940s (Merton, 1987; Merton & Kendall, 1946). Focus groups have been used to examine diverse topics using various types of groups and are intended to develop rich discussions. Given their birth in the work of Merton and Lazarsfeld, they have a long history in communications, media, and audience research, moving into marketing, public relations, and further out into most branches of social and political research (Lunt & Livingstone, 1996). They have been modified and have brought in new activities and techniques over this long history. Developments have often been tied to technological developments, for example, new recording methods (e.g. video recording), bespoke facilities with recording and observation systems, or the use of technologies to provide the object or activity 'focused' on (e.g. the video-editing method of MacGregor & Morrison, 1995). Over the past two decades digital technology developments have increasingly enabled qualitative data collection to be conducted in online contexts (Tuttas, 2015; Woodyatt et al., 2016), with a greater opportunity to create social presence compared to older digital environments (Stewart & Shamdasani, 2017). Remote digital mediation also supports greater integration between groups who are geographically dispersed (Tuttas, 2015). These technologies therefore provide one of a set of options available to researchers. The COVID-19 pandemic removed or constrained a significant proportion of the options researchers had for running focus groups. The regulations in many countries concerning social distancing, constraints on movement, safety of attendees, and the ethical questions of putting participants at risk made full face-to-face sessions impossible. Regulation in the UK that dictated that the public should stay at home carried full legal force.

These regulations changed over time, so researchers had to work according to government guidance that was constantly evolving. That said, our reflections and findings here are not solely relevant to working during a pandemic. There are many circumstances in which physical access to participants is challenging: travel to field sites may be physically, ethically, or environmentally challenging, or participants may be limited in their ability to attend a physical meeting. Digital methods provide a potential solution to these issues and may also open up new opportunities that are not possible face-to-face. To this we add the further challenge of working with groups with low digital access and/or skills. We next examine the existing general benefits and limitations of conducting online focus groups, including recruitment, size, consent, platform, moderation, and group dynamics.

2.1 Recruitment

A key advantage of remote focus groups is that they can bring together participants who are otherwise geographically dispersed. This enables the inclusion of participants who would otherwise have to travel far or cannot travel at all. Additionally, it allows for the recruitment of those who may not be willing or able to attend in person due to such challenges as disabilities or caring responsibilities. As neither participants nor researchers have to travel, this potentially limits the costs, environmental impact, and time needed for running and attending the event (Rupert et al., 2017). It also provides greater flexibility in terms of scheduling.

2.2 Size and Facilitation

Most of the literature (Murray, 1997; Toner, 2009) highlights the importance of group size when considering group facilitation. Smaller groups are recommended as they are easier to facilitate under time pressures. Smaller groups allow more opportunities for a facilitator to actively encourage participants to express themselves equally, and this helps to reduce the dominance of individual voices. Whilst face-to-face focus groups offer a relatively controlled shared physical environment, remote groups allow people to remain in their own space. Both options provide opportunities and hazards. Physical locations such as schools, universities, and community centres all carry different connotations and expectations for attendees. Being in their own home may be more comfortable and allow participants to present not only themselves but also their remote physical environment, which may provide extra context for the research and the analysis. However, there is a loss of control and a greater chance of disturbances which can impact upon facilitation. For instance, pets making noise, doorbells ringing, builders drilling – managing these adds to the facilitator's list of things to consider, including time frames.

2.3 Facilitation and Moderation

Facilitators of focus groups make sure the research questions of the project are covered while mediating between the different participants in the group. The sensitivity of topics can affect how much people share, and this also changes between in-person and online focus groups. For example, Woodyatt et al. (2016) used focus groups to speak with queer men in Georgia about interpersonal violence, which includes various types of violent behaviour including sexual, psychological, and financial.

They conducted two in-person focus groups and one online and discovered that the 'anonymity' of participants in the online focus group made the discussion of sensitive topics feel safer for them, and they felt able to express disagreements (Woodyatt et al., 2016). Moderation is especially important: facilitators need to identify dominant voices in the group and know how to enable all the people in the group to voice their opinions equally. Importantly, facilitators need to judge when to intervene to keep the session on track and make sure that the topics planned for it are covered. For example, Kite and Phongsavan (2017) mention that communication between online participants was slower and that participants spent more time discussing issues that were not relevant to the research; they therefore argue that they managed to cover fewer topics online than in face-to-face workshops.

2.4 Platform

Moving to online tools requires new confidentiality considerations, particularly the use of systems with good data and cyber security. Both researchers and participants should have appropriate access to the right technology in terms of device, broadband, and updated software. According to Kite and Phongsavan (2017), sound quality and the ability of participants to hear each other is very important and can be a major problem if poor. Facilitators must know the software well enough to operate it smoothly and provide troubleshooting while the workshop is being conducted. The ability to record sessions can help ameliorate some of these issues at transcription or analysis through repeated playback. Lack of experience or practical issues with the technology are noted problems: Archibald et al. (2019) found that connection issues and problems using software formed the main frustration and barrier for online participants.

2.5 Group Dynamics

Archibald et al. (2019) compared Zoom to in-person and audio-only remote groups. Both the researchers and the participants found Zoom far better for developing rapport than audio only. That said, in comparison to in-person interaction, one of the key drawbacks of conducting online focus groups is that visual and audio cues are not as easily detected and may be lost in audio recordings. Forrestal et al. (2015) suggest using a 'round robin' format – inviting participants to contribute responses to questions by name – as this overcomes limitations in terms of cues online. They also suggest this format keeps participants attentive and not 'multitasking' whilst online. As Cyr (2016) argues, group dynamics mean that participants' contributions 'may not accurately reflect every participant's individual opinion perfectly. But pressures to conform permeate our social interactions constantly. Personal opinions are a product of the environment and are influenced by the individuals with whom we interact' (Cyr, 2016: 243). That is why researchers need to consider which people to recruit to each group, how many people are ideal for the discussion, and how the group dynamics might affect individuals' opinions. Confidence with the technology and online interaction, as well as experience of video conferencing and online media, are additional factors to consider. Groups of mixed ability may lead to those with higher digital literacy dominating the interaction. These factors may cut across, or in our case reinforce, how to select participants to fit the needs of the research or the questions to be explored.


2.6 Best Practice

Forrestal et al. (2015) provide a best practice guide for preparing the online workshop, dividing it into three main components: preparation (before), administration (during), and follow-up (after). For preparation, they recommend keeping the groups small, though some over-recruiting can help when people drop out, especially as people tend to cancel at the last minute. They recommend testing the equipment and software beforehand and having the participants log on before the workshop starts to make sure everything is working. Furthermore, it is important to communicate the details and instructions for logging into the session, including reminders to everyone involved, in a clear and simple manner. As Daniels et al. (2019: 10) argue, the 'use of tools such as ground rules, pre-focus group information, and informed consent documents can help to mitigate against potential issues that may arise by ensuring participants are well appraised of the process, expectations, and any action that could be taken in the event of situations arising'. In terms of administration, during the session Forrestal et al. (2015) recommend using slides to display key questions, display the consent text, and highlight examples. They also recommend using the participant list and chat functions to monitor who is participating and who might be experiencing issues and, importantly, to manage the discussion. Furthermore, Daniels et al. (2019) recommend keeping a journal that documents the reflection process of the focus groups, including recommendations learned from the workshop, reflexive evaluation of what worked and what did not, and improvements for future workshops. In terms of follow-up, Forrestal et al. (2015) recommend that incentives are sent immediately following the discussion and that downloads of the recording are prioritised to avoid losing data.

2.7 Technical Benefits

There are many technical benefits to using digital tools to run remote focus groups. Recordings of the sessions are relatively straightforward, and most systems can capture audio, video, and text chat. Some systems, for example the version of Zoom used in our sessions, can capture separate recordings for each user. Systems also provide good metadata on recordings, including usernames, dates, and times. Systems such as Zoom and Microsoft Teams can also provide transcripts of interactions, though these need detailed review for accuracy. All of this makes the processes of transcription and upload to analytic systems much simpler.

3 CASE STUDY: DATA LITERACY RESEARCH

We move on now to share our insights from our 'Me and My Big Data' project in the light of this existing literature. Following an analysis of the results of a nationally representative, in-home, face-to-face computer-assisted personal interviewing survey of 1,542 interviews, we identified six user type groups:

1. Extensive political users – likely to undertake most activities measured.
2. Extensive users – likely to undertake most activities measured but not political action.
3. General users – some use across most activities.

4. Social and entertainment media users – low use apart from social networking and entertainment media.
5. Limited users – low to very low use across all measures.
6. Non-users – not online.

For the qualitative data collection, we decided to focus on social media users (group 4) and limited users (group 5) to help unpack some of the key themes that emerged in our national survey. These related mainly to our concept of 'data thinking', which includes citizens' critical thinking and understanding of their data, the organisations they share their data with, their practices with data, and how they verify information (Yates et al., 2021). To access citizens with low or limited digital skills we worked in partnership with the Good Things Foundation, an international digital inclusion charity. Through Good Things we were able to access digital skills centres across the UK. Initially the research design planned for 20 face-to-face focus group workshops with up to 12 people per group. However, after the first wave of COVID-19 (March 2020), the new regulations placed across the UK rendered in-person focus groups infeasible. The work was put on hold until it became clear that, as a society, the use of digital tools would be the 'new normal' in the medium to long term. Following a further wave of regulation, we opted to set up remote focus groups. We planned for these to be smaller (as per Forrestal et al., 2015), consisting of between three and eight people, and to be held on the Zoom platform. We also found that remote groups are best conducted under reduced time frames: in our case, what had been intended as an afternoon workshop – with breaks and time to socialise – was reduced to a 60- to 90-minute Zoom call. This approach immediately had obvious limitations, in particular for recruiting potentially more vulnerable people who lack digital skills. An information document was designed for dissemination by the digital skills centres that could be both printed and sent electronically. The information sheet provided an overview of the project, including weblinks and the email addresses of the research team, and outlined how participants' data would be used as well as the financial incentives. Working with the digital skills centres, based in a variety of community education settings, allowed us to access those currently engaged with the centres. Although they may all have had low digital skills or had just completed basic digital skills training, they were, by definition, different from others in that they had engaged with the support offered by the centres. Others with low digital skills, and complete non-users, not linked to such centres would not be reachable through this approach. Working with the centres ensured that participants had access to the appropriate technology and technical support. Most participants connected to the focus groups individually from their homes and so demonstrated a level of digital ability through making use of Zoom. A small number of people engaged by attending the digital skills centres directly at a time when social distancing regulations allowed; where needed, the centres set up the platform, computers, and access on participants' behalf. A joined-up approach to the focus group design was needed to ensure none of the key topics were lost and to provide consistency throughout the data collection, as three different facilitators led the groups.
One of the facilitators acted as a note taker, was present at each group, and could also address technical issues. The note taker worked with their camera turned off, although they were always introduced to the group at the start of a session to ensure full disclosure. This enabled shared notes and detailed reflection on each session.


4 FINDINGS

4.1 Facilitators

The team has long-term experience of running focus groups and citizen-facing workshops. Reflecting on the experience as facilitators, we noted the loss of the initial ‘convening’ time of face-to-face focus groups before a session starts. Although this might reflect our practice and contexts of research, this is the time when the group is waiting to start, getting a drink, making small talk, and where informal introductions can occur beforehand. This provides an opportunity for the participants to observe each other and the facilitators, and it can help build some basic familiarity and rapport. Whilst online focus groups can allow the group to convene before the session starts, the emphasis is often on the technical aspects of setup rather than the casual informal conversation that could set the participants at ease. It is therefore more difficult to build rapport with an online group, especially when time is limited.

It is well established that the language and topics of focus groups need to be attuned to the social and cultural context of participants. Academic ‘jargon’ is rarely appropriate. This was especially important given the research topic – digital and data literacy – and the fact that the majority of participants had low digital literacy. For example, one of the key findings from the focus groups was the very limited understanding of the idea of ‘data’. In the context of academic research and much policy debate, the idea of ‘data’ and the use of ‘data’ by major platforms is well understood. Not so with our participants. The planned face-to-face design of the workshops included a more extended engagement with this topic as a starting point. As a result, we quickly had to develop approaches to address this efficiently, as it formed a key starting point of the focus group discussion.

These issues of developing ‘common ground’ (Albrecht et al., 1993) were compounded by the lack of immediate non-verbal communication normally present in face-to-face interactions. As a result, it was difficult for us as facilitators to gauge and feel confident about each group’s level of understanding. Taking a consistent approach to explaining the research project was therefore essential in ensuring that we did not slip into complex terminology. We were also more overt in seeking evidence of individual and collective understanding, often through direct verbal interaction. Using simple plain English ensured that participants felt comfortable immediately, and this in turn built rapport and established common ground. A number of participants spoke English as a second language, and this establishment of clear terms helped with their inclusion in the discussion.

As suggested by Forrestal et al. (2015), having a ‘back channel’ communications route was very helpful. It supported a line of communication between the facilitators that would not normally be present in a basic face-to-face focus group, though some research practice has included equipping facilitators with earpieces for external researchers to pass on messages, queries, and suggestions. In our case, during the focus groups the note taker, who had their video and microphone turned off, communicated with the main facilitator privately through Zoom’s direct messages in the chat. This allowed the note taker to prompt the facilitator if they got caught up in the conversation and needed to raise a topic or issue, ask something specific, or ask follow-up questions.
A key tool to support the research team was the post-session reflections document dedicated to each focus group. Discussing the notes in a 30-minute verbal review of each group and their responses helped to shape future sessions while at the same time allowing for

consistency. Once again these reviews were held online. This was especially useful given the novelty of the online format, allowing the facilitators to discuss content, technical issues, and group dynamics issues – for example, the challenge of defining ‘data’ or the problem of bringing quieter participants into the discussion. In addition, project meetings were scheduled in between the focus groups, which allowed for the ironing out of issues and shaped future sessions. Such a method is not novel in and of itself. However, as the team were also geographically distributed and under COVID-19 restrictions, having these as shared documents on an appropriate remote platform (e.g. Basecamp) significantly helped the team.

4.2 Group Dynamics

Interpersonal interaction and group dynamics are key to all focus group work. As facilitators it became very clear that the task of managing an online group required careful and considered work to ensure that the discussion remained focused and that participants had an equal opportunity to speak. This obviously builds on the issues of rapport development noted above. Table 9.1 provides an overview of the focus group setup.

Table 9.1 Overview of focus groups design and participants
[Table listing the target groups (e.g., younger people) and participant details; the cell contents are not recoverable from the source.]

[Table (Chapter 12, ‘Studying mate choice using digital trace data’): average users’ similar and dissimilar contact dyads in initial contacts and in the subset of reciprocated contacts, reported as observed percentages and bias (observed/expected) by contact direction (Female → Male, Male → Female) and similarity status (I < T, I = T, I > T); the individual cell values are garbled in the source.]

Note: I = initiator, T = target. (a) Four education levels: 1 = lower secondary; 2 = intermediate secondary; 3 = upper secondary; and 4 = tertiary. (b) Age similarity: age difference of two years at most; the target is older (I < T) if at least three years older and younger (I > T) if at least three years younger. (c) Body mass index used to determine five weight classes according to the World Health Organization: 1 = underweight; 2 = normal weight; 3 = overweight; 4 = obesity; 5 = severe obesity. n = 10,440 initiating users (effective sample sizes lower due to missing values). Data retrieved from a German online dating site in 2007, with contact events weighted by the inverse number of events per user (for a detailed description of the data see Skopek, Schulz et al., 2011).

APPENDIX

Table 12A.1 Social science studies published using trace data from online dating (chronological sorting)

Column headings: Study; Data (national context, type, period); Research topics and themes.

[The individual table cells are garbled in the source and cannot be reliably re-paired. The studies listed are: Fiore and Donath (2005) (no peer-review publication, but most cited study); Feliciano et al. (2009); Skopek et al. (2009); Schulz et al. (2010); Hitsch et al. (2010a); Hitsch et al. (2010b); Robnett and Feliciano (2011); Skopek, Schmitz et al. (2011); Skopek, Schulz et al. (2011); Kreager et al. (2014); Mendelsohn et al. (2014); Ong and Wang (2015); Potarca and Mills (2015); Bruch et al. (2016); Lewis (2016); Felmlee and Kreager (2017); Bruch and Newman (2018); Bruch and Newman (2019); Dinh et al. (2021); Egebark et al. (2021); Šetinová and Topinková (2021). The data columns cover online dating sites and apps in the US, Germany, the Netherlands, China, and the Czech Republic, as well as international matchmaking sites (eHarmony, eDarling), with observation periods ranging from 2002–2003 to 2016–2019; the topics include homophily, educational stratification, vertical versus horizontal preferences, gendered age preferences, racial and interracial preferences, matching versus competition, multistage choice modelling, dating market and submarket structures, and returns to income and signalling.]

13. Testing sociological theories with digital trace data from online markets
Wojtek Przepiorka

1 INTRODUCTION

Sociologists have researched the causes and consequences of collective action, trust building, reputation formation, social preferences, discrimination, and social inequality since long before the advent of the internet (Coleman, 1990; Hedström & Bearman, 2009; Bowles & Gintis, 2011). In the last two and a half decades, digital trace data of online market transactions have been used to research these topics. At the turn of the century, such datasets were mostly collected by ‘hand’: researchers browsed through a selection of items offered by sellers on online market platforms, stored the HTML pages on their computers, extracted the relevant information from these HTML pages, and stored it in spreadsheets for later analyses (Diekmann et al., 2009). However, very soon researchers started scraping transaction data from online markets automatically using bots and regular expressions (Przepiorka, 2013; Diekmann et al., 2014). This approach, although not without risk, allowed collecting large sets of digital trace data from online markets. These datasets are still valuable and used today to test new hypotheses, as test beds for statistical modelling exercises, and to teach computational and quantitative research methods to sociology students (Keuschnigg et al., 2018).

One reason why online market data from the first decade of this century are still valuable for research and teaching purposes is that, at that time, most online market platforms did not employ so-called matching algorithms. Matching algorithms, such as the one introduced by eBay in spring 2008 (Netzloff, 2008; see also Nash, 2008), score sellers based on the information these sellers provide on their profile pages and the offers they post. Sellers with higher scores appear higher up in buyers’ search results. Moreover, matching algorithms also take into account information about potential buyers (e.g., language, geographical location) that is available through these buyers’ web browsers to provide them with suitable search results (Graham & Henman, 2019). Hence, scraping data from market platforms that employ matching algorithms will not allow one to recreate the decision situations that potential buyers and sellers faced on these platforms. What is more, many online market platforms introduced restrictions on web scraping and employed means to prevent excessive and unwelcome bot activity on their servers (Przepiorka, 2011).1 These and other developments have restricted university researchers in collecting digital trace data from online markets for scientific purposes.

On the one hand, restrictions on data collection are necessary to protect market participants’ rights of controlling their personal data (European Parliament and Council, 2016). On the other hand, these restrictions bring about an unequal distribution of the means of knowledge production. By incorporating large research facilities in their proprietary realm, online market platforms not only outcompete public universities in their quest for knowledge on human behaviour, but also confine the effective use of this knowledge for the public good. This is even more problematic in the light of the growing

influence online markets and other platforms have on people’s lives, their preferences, beliefs, and activities. However, an independent evaluation of the workings of online market platforms is crucial for informing the design and implementation of public digital infrastructures through which citizens, consumers, and organizations can interact in pursuit of their goals and benefit from the digital revolution. This chapter aims to contribute to this goal by giving a selective review of research conducted to learn more about (1) the trust-building capabilities of reputation systems employed in online markets, (2) online traders’ motives for contributing to reputation systems by leaving feedback after completed transactions, and (3) the susceptibility of reputation-based online markets to social influence and discrimination.
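To make the early data-collection workflow described above concrete, the following is a minimal sketch of how listing pages can be parsed with regular expressions. The URL and the HTML markup are hypothetical, and, as noted above, real platforms’ terms of service and robots.txt settings typically restrict such scraping today.

```python
import re
import urllib.request

# Hypothetical listing page; real market platforms restrict automated access.
URL = "https://marketplace.example.com/search?q=memory+card&page=1"

html = urllib.request.urlopen(URL).read().decode("utf-8", errors="replace")

# Assumed markup for one offer:
# <div class="offer"><a class="seller">alice (1523)</a><span class="price">EUR 12,50</span></div>
offer_re = re.compile(
    r'<div class="offer">.*?'
    r'<a class="seller">(?P<seller>[^<(]+)\((?P<ratings>\d+)\)</a>.*?'
    r'<span class="price">EUR (?P<price>[\d,.]+)</span>',
    re.DOTALL,
)

offers = [
    {
        "seller": m["seller"].strip(),
        "ratings": int(m["ratings"]),
        "price": float(m["price"].replace(",", ".")),  # German decimal comma
    }
    for m in offer_re.finditer(html)
]
print(offers[:3])
```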

2 A THEORETICAL MODEL OF ONLINE MARKET EXCHANGE

In Chapter 5 of his seminal book Foundations of Social Theory, James Coleman describes the trust problem that can arise between agents in social and economic exchange as a social dilemma: two agents can both gain from voluntarily exchanging resources (e.g., money, commodities, services), but will refrain from the exchange if the agent transferring their resources first (the truster) cannot expect the other agent (the trustee) to live up to their part of the agreement. However, if the trustee is trustworthy with a certain probability (p), depending on the gains (G) and losses (L) that can result from the exchange, the truster might still make an advance (i.e., transfer their resources first). Coleman formalizes the trust problem in a threshold model in which a rational and self-regarding truster makes an advance if the odds of the trustee being trustworthy are larger than the ratio of losses and gains, that is, if p/(1 – p) > L/G (Coleman, 1990). In terms of the probability that the trustee is trustworthy, it must hold that

p > L / (G + L) (13.1)

Coleman’s threshold model can be derived from the trust game with incomplete information (TGI) commonly employed in game theory to model the trust problem (e.g., Dasgupta, 1988; Voss, 1998; Raub, 2004; Przepiorka, 2021).2 The TGI is depicted as a game tree in Figure 13.1, where, for the sake of exposition, the truster is labelled as buyer, who can decide whether or not to buy, and the trustee is labelled as seller, who, upon receipt of the buyer’s money, can decide whether or not to ship the merchandise the buyer paid for (think of buyers and sellers on eBay). The letters below each terminal node of the game tree denote buyer payoffs (first row) and seller payoffs (second row). The ordering of buyer payoffs is R > P > S in both subtrees of the TGI. The ordering of seller payoffs is T > R > P in the right subtree, and it is R + b > T − c and R + b > P in the left subtree. In other words, in the right subtree, the seller is untrustworthy because their payoff from shipping is lower than their payoff from not shipping (R < T). In the left subtree, the seller is trustworthy because their payoff from shipping is higher than their payoff from not shipping (R + b > T − c); this results from the additional benefit b and/or cost c that the seller respectively obtains from shipping or incurs from not shipping the merchandise the buyer paid for.3


Figure 13.1 Trust game with incomplete information

In the TGI, the buyer’s uncertainty about the seller’s trustworthiness is modelled with the buyer’s information set. The buyer’s information set comprises the two decision nodes and the dashed line denoting that the buyer does not know at which node they are. However, the buyer knows the probability α by which chance (i.e., ‘Nature’) decides whether they are at the left decision node. With probability 1 – α they are at the right decision node. If the buyer decides to buy and the seller turns out to be untrustworthy, the seller keeps the buyer’s money without sending anything back, and the buyer’s and seller’s payoffs are S and T, respectively. If the buyer decides to buy and the seller turns out to be trustworthy, the seller ships the merchandise upon receipt of the buyer’s money, and the buyer’s and seller’s payoffs are R and R + b, respectively. If the buyer refrains from buying, both the buyer and seller (irrespective of the seller’s trustworthiness) receive payoff P. Based on the payoff orderings, the buyer prefers to buy from a trustworthy seller (R > P) but not from an untrustworthy seller (P > S), and both a trustworthy and an untrustworthy seller prefer that the buyer buys (R + b > P and T > P, respectively). As mentioned above, the buyer only knows the probability α with which the seller is trustworthy. Since, as in Coleman’s model, the buyer is assumed to be rational and self-regarding, the buyer decides to buy if the expected payoff from doing so is larger than P, that is, if αR + (1 – α)S > P. In terms of the probability that the seller is trustworthy, it must hold that

α > (P − S) / (R − S) (13.2)

Equations 13.1 and 13.2 are equivalent. In Coleman’s threshold model, p corresponds to α in the TGI, L is the loss the buyer suffers from trusting an untrustworthy seller (L = P – S), and G is the buyer’s gain from trusting a trustworthy seller (G = R – P). Finally, the sum of G and L in the denominator on the right-hand side of Equation 13.1 corresponds to the denominator on the right-hand side of Equation 13.2 (G + L = R – P + P – S = R – S).

The TGI is often used to model the interaction between buyers and sellers in anonymous online markets such as eBay (Güth & Ockenfels, 2003; Przepiorka, 2013; Jiao et al., 2021). In these markets strangers trade with each other across large geographical distances and often follow the convention that the buyer sends the money before the seller ships the merchandise (Diekmann et al., 2009). The increasing popularity of anonymous online markets for social and economic exchange is in need of explanation, because it challenges the widespread view that traders’ embeddedness in social networks of ongoing relations and a functioning legal environment are necessary preconditions for cooperative market exchange to emerge (Granovetter, 1985; Przepiorka et al., 2017).

Most online market exchanges are governed by reputation systems (Kollock, 1999; Resnick et al., 2000), which allow traders to comment on one another’s behaviour, attributes, products, and services with ratings and text messages. These ratings constitute traders’ reputations, which can be conceived of as signals of these traders’ trustworthiness (Przepiorka & Berger, 2017). Since building a good reputation from positive ratings and reviews takes time and requires cooperative behaviour, only trustworthy sellers will bother to invest in building one. Hence, buyers can infer sellers’ trustworthiness from their good reputations (Shapiro, 1983; Przepiorka, 2013).

How would a buyer who faces two sellers, one with an established reputation and one with none, decide? Coleman (1990) further describes how the extent of the trust problem in social and economic exchange can vary depending on the information the truster has about the potential gains and losses and the trustee’s trustworthiness. According to Coleman’s threshold model (and the TGI), a buyer decides to exchange with the seller from whom they expect the highest gain. From a seller with an established reputation, the buyer expects to gain G with certainty (R in the TGI). From a seller with no reputation, the buyer expects to gain pG – (1 – p)L or, correspondingly in the TGI, αR + (1 – α)S. This argument seemingly suggests that a buyer would always choose the seller with an established reputation, and sellers with no record of past transactions would not be able to establish their business in the market (Frey & van de Rijt, 2016; Lukac & Grow, 2021). However, sellers without a good reputation can invest in building one by offering discounts (d) that make buyers indifferent between their offers and the offers of established sellers (Shapiro, 1983; Przepiorka, 2013). For a buyer to be indifferent between exchanging with a seller with an established reputation and a newcomer, or even to prefer to exchange with the newcomer, it must hold that pG – (1 – p)L + d ≥ G. Hence, the discount d that newcomers must offer to build their reputation is

d ≥ (1 − p)(G + L) (13.3)

From this argument it follows that sellers’ reputations and these sellers’ business success will be correlated. Two hypotheses that result from this theoretical argument have been tested with transaction data from peer-to-peer online markets, and a worked numerical illustration follows below.

H1: The better a seller’s reputation, the higher the probability that the seller’s items will be sold.

H2: The better a seller’s reputation, the higher the price the seller can obtain for their items.4
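As a worked illustration of Equations 13.1–13.3, the sketch below computes the buyer’s trust threshold and the newcomer’s discount. The payoff values are assumptions chosen for illustration; they do not come from the chapter.

```python
def buy_threshold(R, P, S):
    """Minimum trustworthiness probability at which a rational, self-regarding
    buyer buys: alpha > (P - S) / (R - S) (Equation 13.2)."""
    return (P - S) / (R - S)

def newcomer_discount(p, G, L):
    """Discount a seller without a reputation must offer so that buyers are
    indifferent between them and an established seller:
    d >= (1 - p)(G + L) (Equation 13.3)."""
    return (1 - p) * (G + L)

# Illustrative payoffs (assumed): R = 10, P = 5, S = 0.
R, P, S = 10, 5, 0
G, L = R - P, P - S                 # gain and loss in Coleman's notation

print(buy_threshold(R, P, S))       # 0.5 - equivalently, odds p/(1-p) > L/G = 1

p = 0.8                             # believed trustworthiness of the newcomer
print(newcomer_discount(p, G, L))   # 2.0 - the discount the newcomer must offer
```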

The next section reproduces an empirical analysis testing the two hypotheses based on a dataset collected on a large peer-to-peer online market.

3 TRUST, REPUTATION, AND ONLINE MARKET EXCHANGE

In peer-to-peer online markets, information about potential gains and losses is readily available via the item pages on which a seller’s products and services are advertised and can be ordered by buyers. The big unknown, however, remains sellers’ trustworthiness. Clearly, information about sellers’ reputations will affect buyers’ beliefs about sellers’ trustworthiness. However, transaction data from online markets do not usually reveal much about buyers’ (and sellers’) beliefs. At the same time, such data provide an excellent source of behavioural information that allows for indirect tests of hypotheses involving psychological mechanisms.

This section reproduces an analysis testing the two aforementioned hypotheses based on a dataset of almost 90,000 auctions of memory cards for electronic devices (Przepiorka & Aksoy, 2021). The dataset was collected on eBay.de in November and December 2006 (Przepiorka, 2013). A typical approach to testing hypotheses H1 and H2 is to use as dependent variables a binary indicator of whether an item was sold and the selling price of sold items, respectively, and to regress these variables on the number of positive and negative seller ratings.5 However, hypotheses H1 and H2 imply a causal relation between seller reputations and these sellers’ business success. Therefore, many other seller, item, and market characteristics must be controlled for in multiple regression models to identify the reputation effect (Morgan & Winship, 2015). This identification strategy is defensible as long as most information a buyer could have considered about the seller, item, and market context is accounted for in the statistical analysis (Przepiorka & Aksoy, 2021). It is conducive to this identification strategy, moreover, if the data sample comprises a homogeneous item (e.g., a particular memory card). A homogeneous item not only reduces the number of potential covariates to be considered but also the likelihood of omitting an important covariate in one’s subsequent analyses (Diekmann et al., 2014).

The reputation of an online market seller is typically operationalized by the log-transformed number of positive ratings (plus one, as the logarithm of zero is undefined) and the log-transformed number of negative ratings (plus one). The log transformation reflects the assumption that the effect of the number of ratings on a seller’s business success increases at a decreasing rate. For example, a seller with 100 positive ratings will be perceived more favourably by buyers than a seller with 50 positive ratings, whereas the difference between a seller with 1,100 and a seller with 1,050 positive ratings will hardly register in buyers’ perceptions. Next to the seller reputation variables, many other (mostly dummy) variables are included in such an analysis. These other variables include, for example, payment methods and shipping conditions offered by the seller, the number of similar items offered for sale at the same time, whether an offer ends on a weekend, the hour of day at which an offer ends, attributes of the item (e.g., memory capacity), a seller’s country of origin, etc.

Table 13.1 shows the main results of the regression model estimations. Model M1 is a logistic regression with the probability of item sale as the dependent variable. Model M2 is an ordinary least squares (OLS) regression with the selling price (in euros) of sold items as the dependent variable.
Both models account for the fact that many sellers offer items repeatedly by estimating cluster-robust standard errors at the seller level.
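A minimal sketch of how models like M1 and M2 can be estimated is given below, assuming a hypothetical auction-level dataset with the variable names used in the code (the original analysis includes many more controls; see Przepiorka & Aksoy, 2021):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data, one row per auction: sold (0/1), price (EUR, for sold
# items), pos_ratings, neg_ratings, initial_price, shipping_costs, seller_id.
df = pd.read_csv("auctions.csv")  # assumed file name

rhs = ("np.log(pos_ratings + 1) + np.log(neg_ratings + 1)"
       " + np.log(initial_price) + np.log(shipping_costs)")

# M1: logit of the probability of sale, cluster-robust SEs at the seller level
m1 = smf.logit("sold ~ " + rhs, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["seller_id"]}
)

# M2: OLS of the log selling price, estimated on sold items only
sold = df[df["sold"] == 1]
m2 = smf.ols("np.log(price) ~ " + rhs, data=sold).fit(
    cov_type="cluster", cov_kwds={"groups": sold["seller_id"]}
)

print(m1.summary())
print(m2.summary())
```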

Table 13.1 Regression models of probability of sale and selling price testing hypotheses H1 and H2

                                  M1: Probability of sale     M2: Selling price (log)
                                  (logit regression)          (OLS regression)
                                  Coef.         SE            Coef.         SE
Const.                            10.405***     1.434          3.322***     0.364
Main explanatory variables
log(# pos. ratings + 1)            0.333***     0.075          0.093***     0.020
log(# neg. ratings + 1)           −0.213*       0.092         −0.084***     0.019
log(initial price in €)           −1.005***     0.125          0.058***     0.015
log(∅ shipping costs in €)        −2.158***     0.262         −0.731***     0.071
Control variables included        Yes                         Yes
N1                                88,452                      61,744
N2                                3,248                       3,051
Pseudo R2                         0.44
adj. R2                                                       0.68

Note: The table lists coefficient estimates and cluster-robust standard errors (*** p < 0.001, ** p < 0.01, * p < 0.05, for two-sided tests) of logit and OLS regression models. The binary outcome variable of model M1 is one if the auction received at least one bid and is zero otherwise. The outcome variable of model M2 is the log-transformed selling price (in euros) of auctions that received at least one bid and thus were sold. N1 denotes the number of cases (auctions) and N2 denotes the number of clusters (sellers). The full table is available in the online appendix of Przepiorka and Aksoy (2021).

The four explanatory variables included in Table 13.1 are the number of positive seller ratings, the number of negative seller ratings, the initial item price set by the seller, and the average cost of the shipping options offered by the seller for a particular item. All four variables are log transformed. The average shipping costs are a proxy for the actual shipping costs, which are unobserved. Both average shipping costs and initial item price exhibit negative effects on the probability of sale. However, if an item is sold, unlike average shipping costs, the initial item price has a positive effect on the final item price (i.e., the winning bid). These results are in line with how item prices drive behaviour in online auction markets (Przepiorka, 2013). Most importantly, the number of positive ratings and the number of negative ratings exhibit, respectively, a positive and a negative effect on both sales and prices. These results support hypotheses H1 and H2.

But how substantial are these reputation effects? Based on model M1 in Table 13.1, the effect of the log number of positive ratings on the probability of sale can be calculated as follows: if the number of positive ratings increases by the factor 2.7 (which corresponds to a one-unit increase on a natural log scale), the odds of a successful sale increase by 39.5% = 100 × [exp(0.333) − 1]. Taking the unconditional selling probability of .7 obtained from the same dataset, this increase corresponds to roughly .065 (i.e., about six and a half percentage points). Correspondingly, based on model M2 in Table 13.1, if the number of positive ratings increases by the factor 2.7, the item price increases by 9.7% = 100 × [exp(0.093) − 1]. Taking the average selling price of €15 obtained from the same data, this increase corresponds to €1.46. The same exercise can be performed with the coefficients for the number of negative ratings.
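These effect calculations can be replicated in a few lines; the coefficients come from Table 13.1, and the .7 baseline probability and €15 average price are the figures reported above:

```python
import numpy as np

# Coefficients for log(# pos. ratings + 1) from Table 13.1
b_pos_sale, b_pos_price = 0.333, 0.093

# Multiplying positive ratings by e (≈ 2.7) raises the odds of sale by ≈ 39.5%.
odds_factor = np.exp(b_pos_sale)
print(round(100 * (odds_factor - 1), 1))         # 39.5

# Translate the odds change into a probability change at the baseline p = .7.
p0 = 0.70
odds1 = p0 / (1 - p0) * odds_factor
print(round(odds1 / (1 + odds1) - p0, 3))        # ≈ 0.065 (six and a half points)

# The same increase raises the selling price by ≈ 9.7%, i.e. ≈ €1.46 at €15.
print(round((np.exp(b_pos_price) - 1) * 15, 2))  # 1.46
```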

Although these effects are substantial, one may object that they are calculated based on an almost threefold increase of the explanatory variable. It is therefore important to put this effect calculation into context. Unlike most variables frequently encountered in the social sciences (e.g., five-point items used in surveys), sellers’ numbers of positive and negative ratings can have a considerable range. In the example discussed above, the numbers of positive and negative ratings have a range of, respectively, 275,000 and 1,890 between the 5th and 95th percentiles. Hence, threefold increases in reputations across sellers are not uncommon.

These results corroborate that a good reputation can have a substantial effect on an online seller’s business success. In fact, these hypotheses have been tested many times by means of digital trace data from online markets collected by hand, scraped automatically, or obtained from the administrators of an online market platform. A recent meta-analysis, synthesizing evidence from 125 papers, corroborates the existence of such a reputation effect, albeit with considerable variation in effect sizes due to differences in operationalizations of seller reputation and seller business success, market contexts, product features, and research designs (Jiao et al., 2022). However, for the reputation mechanism to be effective in promoting cooperative market exchange, market participants must share truthful information about their past transactions in the form of quantitative ratings and text comments. The next section discusses research investigating online traders’ motives for leaving such feedback.

4 SOCIAL VALUE ORIENTATION, RECIPROCITY, AND REPUTATION FORMATION

Reputation systems employed in peer-to-peer online markets benefit all market participants. However, individual traders’ contributions to these reputation systems with feedback information are voluntary and costly in terms of time and effort. Hence, reputation systems are collective goods that are subject to a free-rider problem (Bolton et al., 2004). A reputation system can also be regarded as a peer sanctioning system through which cooperative behaviour can be rewarded and uncooperative behaviour can be punished with positive and negative feedback, respectively (Resnick et al., 2000; Simpson & Willer, 2015). In this sense, reputation systems in online markets can be conceived as second-order collective goods (Yamagishi, 1986; Heckathorn, 1989; Kollock, 1999). Second-order cooperation at the feedback stage creates the reputational incentives that promote first-order cooperation at the transaction stage (Diekmann et al., 2014). But what motivates traders to contribute to the collective good of a reputation system by leaving feedback after finished transactions? Based on assumptions of rational and purely self-regarding actors, no feedback should be left, and reputation-based online markets should not exist. How then can the growing popularity of reputation-based online markets be explained?

Research in marketing and human–computer interaction has shown that people differ in their motives for leaving feedback (Hennig-Thurau et al., 2004; Picazo-Vela et al., 2010; Cheung & Lee, 2012). People want to ‘let off steam’, express their satisfaction, reward or punish the seller for a good or bad service, respectively, inform future consumers about a product or a seller, and leave feedback because others do. Also, people can be deliberately more or less accurate about their experiences in their feedback. In more abstract terms, this research identifies concerns for oneself (i.e., self-regarding preferences), social value orientation (i.e., other-regarding preferences), reciprocity, and conformity as main drivers of traders leaving feedback after completed online market transactions (Macanovic & Przepiorka, 2022). However, this research is based mostly on small-scale, non-representative surveys, measuring

respondents’ intentions and attitudes with respect to leaving feedback. Although these findings establish the set of motives online traders may have for leaving feedback, they are limited in what one can learn about the relative importance of these motives in actual online markets.

There are a few studies that use digital trace data from online markets to learn something about traders’ motives for leaving feedback. Most of these studies are based on the analysis of timed feedback events produced by traders after completed transactions on eBay. At the time these datasets were collected, traders could leave feedback up to 90 days after a transaction. Moreover, eBay employed a reciprocal rating system, which allowed buyers to rate sellers and, in the same way, sellers to rate buyers. All but one of the half a dozen studies that report feedback rates from eBay markets report numbers above 50 per cent (see Bolton et al., 2013; Diekmann et al., 2014). Although these rates are indicative of traders’ cooperation at the feedback stage, they do not reveal why these traders leave feedback.

To address this question, Jian et al. (2010) developed a model to estimate the proportion of traders’ different feedback strategies based on feedback data obtained from eBay. The authors assumed three strategies that individual traders can employ – (1) do not give feedback, (2) give feedback unconditionally, and (3) reciprocate feedback (i.e., give feedback only after receiving feedback) – and estimated buyers’ and sellers’ probabilities of choosing one of the strategies in a statistical model. They found that the reciprocal feedback strategy was chosen by buyers and sellers in, respectively, 23 per cent and 20 per cent of the cases, whereas the unconditional feedback strategy was chosen in, respectively, 38 per cent and 47 per cent of the cases. Their findings indicate that both reciprocity and other-regarding preferences play an important role in motivating traders to leave feedback (see also Dellarocas & Wood, 2008).

In a similar vein, Diekmann et al. (2014) analysed hundreds of thousands of feedback events that occurred after transactions of mobile phones and DVDs on eBay. The results of their event history analysis corroborated that reciprocity, other-regarding preferences, and strategic motives play an important role in traders leaving feedback. First, reciprocal motives are consistent with their finding that traders’ propensities to leave feedback increased significantly after they received a rating from a trading partner. Second, other-regarding motives are consistent with their finding that traders were more likely to give a positive rating, and less likely to give a negative rating, to a trading partner with fewer ratings. In other words, traders anticipated the impact their feedback could have on the business success of trading partners that were still building their reputation. Finally, they found evidence for strategic motives: some traders postponed leaving negative feedback to the very end of the rating period of 90 days, likely because they feared receiving negative feedback in return.

Figure 13.2 shows non-parametric estimation results for the hazard rates of positive feedback (top panels) and negative feedback (bottom panels) in the mobile phone and DVD markets examined by Diekmann et al. (2014). Hazard rates are, roughly speaking, the probabilities that feedback is left at a specific time given that no feedback was left before.
The figure shows that positive feedback is left much more frequently and earlier than negative feedback. While the probability of positive feedback occurring peaks between days 5 and 12 after completed transactions, the probability of negative feedback occurring peaks between days 20 and 35. This late arrival of negative feedback is likely due to delays in payment or shipping and lasting disputes between buyers and sellers, which eventually cause either party to leave negative feedback. Another noteworthy difference between the hazard rates of positive and negative feedback is the second peak in the hazard rates of negative feedback that occurs towards the end of the rating period of 90 days. Again, this second peak is indicative

of traders trying to avoid retaliatory negative feedback by leaving feedback just before the end of the feedback period.

Figure 13.2 Hazard rates of sellers’ and buyers’ rating decisions
Note: Hazard rates with 95 per cent confidence bands (thin lines).
Source: Adapted with permission from Diekmann et al. (2014).

These studies confirm that the motives for leaving feedback identified in survey-based research form the basis of reputation systems in real online market contexts. However, using digital trace data of feedback events to study online traders’ motives for leaving feedback also has limitations. First, the evidence for reciprocity and strategic motives is tied to these studies’ use of feedback data from eBay’s two-sided rating system. It is an open question how far traders leave feedback to reciprocate the experience they had at the transaction stage, and how important strategic considerations for leaving feedback are, if reciprocating feedback is not possible. Second, traders’ motives for leaving feedback can only be inferred indirectly from these traders’ feedback behaviour. Behavioural evidence that is consistent with predictions derived from a psychological mechanism cannot rule out alternative explanations. Relatedly, the lack of behavioural evidence for a particular motive does not rule out the motive’s relevance for traders’ feedback behaviour. Different motives can counteract each other, which also makes it difficult to establish the relative importance of motives at both the aggregate and individual levels. Experimental research (Abraham et al., 2021; Hoffmann et al., 2021) and the automatic analysis of feedback texts (Macanovic & Przepiorka, 2023) could offer new insights into the dynamic interplay of traders’ motives for leaving feedback and the organizational features of reputation-based online markets.

Leaving feedback after completed online market transactions constitutes a simple social situation that leaves little room for self-presentation. It is therefore plausible to assume that feedback texts reflect the direct motives their authors had for leaving feedback, rather than being merely socially accepted, post hoc justifications of their actions. Starting from this premise and a thorough theoretical framework of motives for leaving feedback, Macanovic and Przepiorka (2023) combine manual and automatic text-mining methods to investigate the motivational landscape of reputation-based online markets. Their approach allows them to tie motives for leaving feedback (e.g., other-regarding preferences, reciprocity) and motive co-occurrences to actual online traders and transactions. This, in turn, enables them to investigate the relative importance of these motives in promoting cooperative market exchange and the stability of reputation-based online markets.

5 SOCIAL INFLUENCE, DISCRIMINATION, AND THE VIABILITY OF ONLINE MARKETS

Reputation systems establish barriers to market entry because market entrants must invest in building a good reputation by offering their products at lower prices and behaving cooperatively over an extended period. Once these traders acquire a good reputation, they are compensated for their initial investments by trading partners willing to pay for a good reputation (Shapiro, 1983). At the same time, sellers that do not intend to stay in the market long enough to be compensated for their initial investment in a good reputation will refrain from entering the market. Hence, at least in theory, the interplay of the reputation and the price mechanisms deters untrustworthy sellers without precluding trustworthy sellers from entering the market (Przepiorka & Aksoy, 2021).

However, it has been argued and shown that the reputation mechanism can instigate success-breeds-success dynamics by which actors that are preferentially chosen gain in reputation more quickly and outcompete other actors in spite of these actors’ comparable initial ability levels (DiPrete & Eirich, 2006; van de Rijt et al., 2014). The same holds for products and services (Salganik & Watts, 2009; Keuschnigg, 2015). The reputation mechanism can thus produce a hierarchization of actors, products, and services that is not reflective of these actors’, products’, and services’ underlying qualities (although see Przepiorka et al., 2020).

To explore this possibility, Frey and van de Rijt (2016) conducted a behavioural laboratory experiment in which they emulated the interaction dynamics of buyers and sellers in an anonymous online market. In their experiment, they randomly assigned participants to several experimental conditions that differed in the extent of information provided to buyers about sellers’ reputations (see also Kollock, 1994; Brown et al., 2004). The experiment ran over several rounds in which each of the buyers had to choose one from among several sellers to interact with. After every interaction between a buyer and a seller, which was implemented as a trust game (see Figure 13.1), seller reputation information was generated automatically. Trustworthy sellers gained in reputation, untrustworthy sellers lost in reputation, and sellers that were not chosen neither gained nor lost in reputation. Their results showed that if sellers could acquire a reputation, (1) sellers were chosen by buyers based on their reputation and (2) sellers chosen in the first rounds were chosen preferentially in future rounds. That is, although sellers did not differ in any observable traits at the start of the experiment, the ones that were chosen first gained in reputation earlier and were chosen again in future rounds to the detriment of the sellers that were not chosen from the start.
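The success-breeds-success dynamic in such a setup can be illustrated with a toy simulation; the choice rule below (selection probability proportional to reputation plus one) is an assumed functional form for illustration, not the design used by Frey and van de Rijt (2016):

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate(n_sellers=20, n_buyers=10, n_rounds=50, reputation_visible=True):
    """Buyers repeatedly pick sellers; chosen sellers gain one reputation point
    (all sellers are equally trustworthy by construction)."""
    reputation = np.zeros(n_sellers)
    for _ in range(n_rounds):
        for _ in range(n_buyers):
            if reputation_visible:
                w = reputation + 1          # +1 keeps newcomers choosable
                chosen = rng.choice(n_sellers, p=w / w.sum())
            else:
                chosen = rng.integers(n_sellers)
            reputation[chosen] += 1
    return np.sort(reputation)[::-1]

# With visible reputations, early luck compounds into arbitrary inequality:
print(simulate(reputation_visible=True)[:3])   # a few sellers dominate
print(simulate(reputation_visible=False)[:3])  # business spreads more evenly
```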

Based on these findings, the authors concluded that the reputation mechanism commonly implemented in anonymous online markets can lead to arbitrary inequality among sellers due to a success-breeds-success dynamic (see also van de Rijt & Frey, 2020).

However, in their experiment, Frey and van de Rijt (2016) induced an oversupply of sellers, and the trust game payoffs were constant throughout the experiment; sellers could not use the price mechanism to compete for buyers. If similar sellers enter a new market selling a product for which there is less demand than supply, arbitrary inequality will emerge naturally as a result of market forces. Once supply and demand are balanced, the sellers that are still in the market can, ceteris paribus, expect to be chosen to similar extents by buyers. In equilibrium, even new sellers should be able to enter the market and build a good reputation for being trustworthy and reliable by initially lowering their prices (see above). Results from analyses of digital trace data from reputation-based online markets corroborate the idea that buyers indeed trade off sellers’ reputations against the prices these sellers set for their products and services (Snijders & Weesie, 2009; Przepiorka, 2013).

Moreover, even though buyers in online markets follow other buyers in their decisions of which sellers to buy from, they might not follow others at any price. Przepiorka and Aksoy (2021) test this conjecture based on the analysis of a large set of online auction data. They demonstrate that online buyers herd on offers that received bids from up to two previous buyers and that herding declines once an auction reaches three bidders and a certain price level. Przepiorka and Aksoy (2021) show, moreover, that herding buyers do not neglect the reputation of the sellers whose offers they herd on; buyers that join an auction at a later stage do not seem to blindly trust previous buyers to have fully scrutinized the seller’s offer (although see Simonsohn & Ariely, 2008).

In many online markets buyers and sellers can interact exclusively online without having to reveal much personal information ex ante, because payment is via bank transfer or credit card and the merchandise is shipped by mail or the service provided online. In these online markets the trust-building capacity of reputation systems will be effective with not much more than seller reputations and item prices as information inputs (Przepiorka et al., 2017). However, many online market platforms facilitate the initiation of transactions between two or more parties that eventually have to meet in person. For example, bed-and-breakfast platforms match hosts and guests that often share the same apartment or even room for a short period of time; lift platforms match drivers and passengers that share the same car for the duration of an often long ride; tutoring platforms match tutors and students that spend time together for the duration of mostly several lessons; and dating platforms match people that want to be intimate with each other for one night or longer (see Coyle & Alexopoulos, Chapter 11 and Skopek, Chapter 12 in this volume). The trust-building capacity of reputation systems may not suffice in these cases because people have to meet in person to complete the exchanges; other means such as verified identities, phone numbers, headshots, copies of passports, or collaterals are needed to overcome potential trust problems.
While clearly necessary, such additional information gives rise to discrimination of all sorts. The possibility of discrimination being at play in online markets has been tested via quasi-experimental research designs. Doleac and Stein (2013) conducted a field experiment in which they sold portable media players via Craigslist (an online, peer-to-peer flea market) in different states in the United States. In their experiment, they systematically varied the skin colour of the hand holding the item in a picture posted with the ad. They plausibly assumed

that potential buyers would infer the race of the seller from the item picture. Their results showed that allegedly black sellers received fewer messages and offers, received lower final price offers, and were trusted less by potential buyers than allegedly white sellers. The latter finding manifested itself in buyers being reluctant to include their real names in email correspondence or to agree on mail delivery and long-distance payment when dealing with seemingly black sellers.

Similar results were obtained in studies using non-experimental digital trace data from a car-pooling platform (Tjaden et al., 2018) and a peer-to-peer motorcycle rental platform (Kas et al., 2021). Tjaden et al. (2018) used ride data from a German car-pooling platform and estimated the price penalty that drivers with non-German-sounding names had to pay at 32 per cent. That is, on average, drivers with Arab, Turkish, or Persian-sounding names had to offer their rides at a 32 per cent lower rate to obtain the same number of clicks on their offers as drivers with German-sounding names. The authors also found that additional information such as a better reputation score could reduce the gap in the outcome variable. The latter finding suggests that reputation systems may help to overcome the adverse effects of discrimination in online markets. However, Kas et al. (2021) object to this conjecture by arguing that minority group members, when entering a market, will have a harder time building their reputation because of discrimination. Moreover, due to the success-breeds-success dynamics described above, these minority members will be further disadvantaged in the market. Kas et al. (2021) corroborate their argument through the analysis of timed interaction data from a Dutch peer-to-peer motorcycle rental platform. Their results thus suggest that minority members that are subject to discrimination by majority group members may not simply compensate for their worse market outcomes by building a good reputation.

Although it is clear from these and other studies that minority group members are discriminated against in online and other markets (see also, e.g., Auspurg et al., 2019; Hangartner et al., 2021), many studies also provide evidence that discrimination is sensitive to prices and competition. For example, Doleac and Stein (2013) show that competition among buyers reduces the outcome differences between seemingly black and white sellers of portable media players on Craigslist. In a market in which demand exceeds supply, buyers are more willing to take risks by posting higher bids on items offered by sellers with a lower reputation or sellers they trust less for other reasons (e.g., because they belong to a minority group) (see also Przepiorka, 2011). Of course, if buyers can be selective because there is an oversupply of a particular item or service, outcome differences between sellers due to discrimination must be addressed actively. This can be done via the price mechanism and signals of trustworthiness such as verified identities (Przepiorka, 2011) or charitable giving (Elfenbein et al., 2012). Obviously, these measures do not eradicate differences in market outcomes between sellers from minority and majority groups. However, when applied at the time of market entry, they can help to minimize outcome differences by stalling processes of cumulative disadvantage.
Lower prices, reliable identity information, and generosity have been shown to be substitutes for reputation and therefore could establish viable, additional trust-building mechanisms in reputation-based online markets.


6 CONCLUSIONS

Throughout history, the effectiveness of mechanisms promoting cooperative market exchange has commonly depended on actors’ social embeddedness (Granovetter, 1985; Diekmann & Przepiorka, 2019). In the last two and a half decades, peer-to-peer online markets have fundamentally transformed the ways in which people engage in social and economic exchange. Modern information and communication technology has substituted informal institutional elements rooted in actors’ social relations with semi-formal institutional elements such as reputation systems and escrow services. As a consequence, cooperative market exchanges today also depend on the quality of the institutional set-up of online market platforms, and thus on the expert judgements of market designers, business consultants, and software engineers. Yet online market platforms are vulnerable to attempts to exploit institutional and technical loopholes, but also to the unintended negative consequences of market participants’ purposive actions. The ongoing shift in the socio-structural foundations of market action has created new challenges but also opened new opportunities for researchers to study the mechanisms underlying (among others) collective action, social cohesion, inequality, norms, and trust. These topics have been at the core of sociological scholarship since long before Coleman’s seminal book Foundations of Social Theory.

In this chapter I have shown how Coleman’s threshold model of trust in social and economic exchange can be applied to derive hypotheses about traders’ behaviour in reputation-based online markets. The model, which is closely related to the trust game with asymmetric and incomplete information known from game theory (Raub, 1992; Raub & Weesie, 2000), makes it possible to derive predictions about the conditions under which buyers in anonymous online markets are more likely to trust sellers with their money. Research shows that especially the additional information about sellers’ reputations provided via reputation systems commonly implemented in online markets positively affects buyers’ trust in sellers (Jiao et al., 2021, 2022).

Although Coleman’s model is silent about where reliable information about seller reputations comes from, sociological theories of collective action and cooperation in social dilemmas have contributed to our understanding of why online traders leave feedback after completed transactions (Heckathorn, 1989; Simpson & Willer, 2015). Despite the fact that reputation systems are collective goods and therefore subject to the free-rider problem (Bolton et al., 2004), research shows that online traders leave feedback on their trading partners at high rates. Other-regarding preferences and reciprocity, but also self-regarding and strategic considerations, have been shown to drive online traders’ feedback behaviour (Diekmann et al., 2014).

Online market contexts have also provided a test bed for theories about social influence and discrimination. Both mechanisms have been shown to spur social inequalities in general and among sellers in online markets in particular (Kas et al., 2021). It has been argued and shown that reputation formation can be subject to a success-breeds-success dynamic because sellers with a better reputation attract more buyers (Frey & van de Rijt, 2016). However, the boundary conditions under which the reputation mechanism produces inequitable outcomes, because it prevents new traders from entering the market, have not yet been established (Przepiorka & Aksoy, 2021).
Further research is necessary to obtain a better understanding of the dynamic interplay of feedback giving, reputation formation, and cooperation in reputation-based online markets. Apart from mapping the boundary conditions of the reputation mechanism to promote cooperation in humans, it could be fruitful to investigate the relation between generalized and particularized trust by showing how generalized trust interacts with people's motivations to participate in reputation-based online markets (Uslaner, 2018; Schilke et al., 2021). To what extent does the effectiveness of reputation systems in promoting cooperative market exchange depend on people's generalized trust? Could online markets that employ reputation systems nurture generalized trust in people who have little trust in strangers? Addressing these questions could unveil online markets' potential to create the economic interdependencies that foster cooperation and integration in heterogeneous, modern societies (Baldassarri & Abascal, 2020).

ACKNOWLEDGEMENTS

I would like to thank Ana Macanovic and Jan Skopek for their constructive comments on earlier versions of this chapter.

NOTES

1. The Robots Exclusion Protocol was introduced in the 1990s to give website owners a means to informally regulate bot activity on their websites (see www.robotstxt.org/).
2. This was first pointed out by Raub (1992). Moreover, Raub and Weesie (2000) point out that Coleman's threshold model neglects the strategic nature of the trust dilemma, where the trustee also decides whether or not to honour trust.
3. A seller's additional benefits (b) and costs (c) from shipping or not shipping, respectively, can be divided into extrinsic (i.e., contextual) and intrinsic (i.e., psychological) benefits and costs. A seller's extrinsic benefits and costs result from the seller's social and institutional embeddedness that incentivizes the seller to act in the buyer's interest. A seller's intrinsic benefits and costs result from the seller's other-regarding preferences and internalized norms of reciprocity and fairness (Riegelsberger et al., 2005; Przepiorka & Berger, 2017).
4. In equilibrium, if demand and supply are balanced, H1 may not obtain. See Section 5 for a discussion of how the reputation effect may depend on the relation between demand and supply.
5. Note that these hypotheses apply to different types of transactions implemented in online markets. Some market platforms implement auction mechanisms through which buyers determine the prices of items (i.e., products or services), other platforms only allow fixed-price formats where prices are set by the sellers who offer these items online, and yet other platforms allow both formats (see, e.g., Przepiorka, 2013).

REFERENCES

Abraham, M., Damelang, A., Grimm, V., Neeß, C., & Seebauer, M. (2021). The role of reciprocity in the creation of reputation. European Sociological Review, 37(1), 137–154.
Auspurg, K., Schneck, A., & Hinz, T. (2019). Closed doors everywhere? A meta-analysis of field experiments on ethnic discrimination in rental housing markets. Journal of Ethnic and Migration Studies, 45(1), 95–114.
Baldassarri, D., & Abascal, M. (2020). Diversity and prosocial behavior. Science, 369(6508), 1183–1187.
Bolton, G. E., Greiner, B., & Ockenfels, A. (2013). Engineering trust: Reciprocity in the production of reputation information. Management Science, 59(2), 265–285.
Bolton, G. E., Katok, E., & Ockenfels, A. (2004). How effective are electronic reputation mechanisms? An experimental investigation. Management Science, 50(11), 1587–1602.

Bowles, S., & Gintis, H. (2011). A Cooperative Species: Human Reciprocity and its Evolution. Princeton, NJ: Princeton University Press.
Brown, M., Falk, A., & Fehr, E. (2004). Relational contracts and the nature of market interactions. Econometrica, 72(3), 747–780.
Cheung, C. M. K., & Lee, M. K. O. (2012). What drives consumers to spread electronic word of mouth in online consumer-opinion platforms. Decision Support Systems, 53(1), 218–225.
Coleman, J. S. (1990). Foundations of Social Theory. Cambridge, MA: Belknap Press of Harvard University Press.
Dasgupta, P. (1988). Trust as a commodity. In D. Gambetta (Ed.), Trust: Making and Breaking Cooperative Relations (pp. 49–72). Oxford: Basil Blackwell.
Dellarocas, C., & Wood, C. A. (2008). The sound of silence in online feedback: Estimating trading risks in the presence of reporting bias. Management Science, 54(3), 460–476.
Diekmann, A., Jann, B., & Wyder, D. (2009). Trust and reputation in internet auctions. In K. S. Cook, C. Snijders, V. Buskens, & C. Cheshire (Eds), eTrust: Forming Relationships in the Online World (pp. 139–165). New York: Russell Sage.
Diekmann, A., Jann, B., Przepiorka, W., & Wehrli, S. (2014). Reputation formation and the evolution of cooperation in anonymous online markets. American Sociological Review, 79(1), 65–85.
Diekmann, A., & Przepiorka, W. (2019). Trust and reputation in markets. In F. Giardini & R. Wittek (Eds), The Oxford Handbook of Gossip and Reputation (pp. 383–400). Oxford: Oxford University Press.
DiPrete, T. A., & Eirich, G. M. (2006). Cumulative advantage as a mechanism for inequality: A review of theoretical and empirical developments. Annual Review of Sociology, 32, 271–297.
Doleac, J. L., & Stein, L. C. D. (2013). The visible hand: Race and online market outcomes. Economic Journal, 123(572), F469–F492.
Elfenbein, D. W., Fisman, R., & Mcmanus, B. (2012). Charity as a substitute for reputation: Evidence from an online marketplace. Review of Economic Studies, 79(4), 1441–1468.
European Parliament and Council (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Brussels. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679
Frey, V., & van de Rijt, A. (2016). Arbitrary inequality in reputation systems. Scientific Reports, 6, 38304.
Graham, T., & Henman, P. (2019). Affording choice: How website designs create and constrain 'choice'. Information, Communication & Society, 22(13), 2007–2023.
Granovetter, M. (1985). Economic action and social structure: The problem of embeddedness. American Journal of Sociology, 91(3), 481–510.
Güth, W., & Ockenfels, A. (2003). The coevolution of trust and institutions in anonymous and non-anonymous communities. In M. J. Holler, H. Kliemt, D. Schmidtchen, & M. Streit (Eds), Jahrbuch für Neue Politische Ökonomie (pp. 157–174). Tübingen: Mohr Siebeck.
Hangartner, D., Kopp, D., & Siegenthaler, M. (2021). Monitoring hiring discrimination through online recruitment platforms. Nature, 589, 572–576.
Heckathorn, D. D. (1989). Collective action and the second-order free-rider problem. Rationality and Society, 1(1), 78–100.
Hedström, P., & Bearman, P. (Eds). (2009). The Oxford Handbook of Analytical Sociology. Oxford: Oxford University Press.
Hennig-Thurau, T., Gwinner, K. P., Walsh, G., & Gremler, D. D. (2004). Electronic word-of-mouth via consumer-opinion platforms: What motivates consumers to articulate themselves on the internet? Journal of Interactive Marketing, 18(1), 38–52.
Hoffmann, R., Kittel, B., & Larsen, M. (2021). Information exchange in laboratory markets: Competition, transfer costs, and the emergence of reputation. Experimental Economics, 24(1), 118–142.
Jian, L., MacKie-Mason, J. K., & Resnick, P. (2010). I scratched yours: The prevalence of reciprocation in feedback provision on eBay. The B. E. Journal of Economic Analysis & Policy, 10(1), Article 92.
Jiao, R., Przepiorka, W., & Buskens, V. (2021). Reputation effects in peer-to-peer online markets: A meta-analysis. Social Science Research, 95, 102522.

Jiao, R., Przepiorka, W., & Buskens, V. (2022). Moderators of reputation effects in peer-to-peer online markets: A meta-analytic model selection approach. Journal of Computational Social Science, 5, 1041–1067.
Kas, J., Corten, R., & van de Rijt, A. (2021). The role of reputation systems in digital discrimination. Socio-Economic Review, 20(4), 1905–1932.
Keuschnigg, M. (2015). Product success in cultural markets: The mediating role of familiarity, peers, and experts. Poetics, 51, 17–36.
Keuschnigg, M., Lovsjö, N., & Hedström, P. (2018). Analytical sociology and computational social science. Journal of Computational Social Science, 1, 3–14.
Kollock, P. (1994). The emergence of exchange structures: An experimental study of uncertainty, commitment, and trust. American Journal of Sociology, 100(2), 313–345.
Kollock, P. (1999). The production of trust in online markets. In E. J. Lawler & M. W. Macy (Eds), Advances in Group Processes, Volume 16 (pp. 99–123). Amsterdam: JAI Press.
Lukac, M., & Grow, A. (2021). Reputation systems and recruitment in online labor markets: Insights from an agent-based model. Journal of Computational Social Science, 4, 207–229.
Macanovic, A., & Przepiorka, W. (2023). The Moral Embeddedness of Cryptomarkets: Text Mining Feedback on Economic Exchanges in the Darknet. Unpublished manuscript, Department of Sociology/ICS, Utrecht University.
Morgan, S. L., & Winship, C. (2015). Counterfactuals and Causal Inference: Methods and Principles for Social Research. New York: Cambridge University Press.
Nash, A. (2008). eBay rolls out best match. Psychohistory: The Personal Blog of Adam Nash. https://adamnash.blog/2008/01/21/ebay-rolls-out-best-match-in-earnest/.
Netzloff, J. (2008). Best match test in 5 categories. eBay Community General Announcements. https://community.ebay.com/t5/Announcements/Best-Match-Test-in-5-Categories/ba-p/26162216.
Picazo-Vela, S., Chou, S. Y., Melcher, A. J., & Pearson, J. M. (2010). Why provide an online review? An extended theory of planned behavior and the role of big-five personality traits. Computers in Human Behavior, 26(4), 685–696.
Przepiorka, W. (2011). Ethnic discrimination and signals of trustworthiness in anonymous online markets: Evidence from two field experiments. Zeitschrift für Soziologie, 40(2), 132–141.
Przepiorka, W. (2013). Buyers pay for and sellers invest in a good reputation: More evidence from eBay. Journal of Socio-Economics, 42(C), 31–42.
Przepiorka, W. (2021). Game-theoretic models. In G. Manzo (Ed.), Research Handbook on Analytical Sociology (pp. 414–431). Cheltenham, UK and Northampton, MA, USA: Edward Elgar Publishing.
Przepiorka, W., & Aksoy, O. (2021). Does herding undermine the trust enhancing effect of reputation? An empirical investigation with online-auction data. Social Forces, 99(4), 1575–1600.
Przepiorka, W., & Berger, J. (2017). Signaling theory evolving: Signals and signs of trustworthiness in social exchange. In B. Jann & W. Przepiorka (Eds), Social Dilemmas, Institutions and the Evolution of Cooperation (pp. 373–392). Berlin: De Gruyter Oldenbourg.
Przepiorka, W., Norbutas, L., & Corten, R. (2017). Order without law: Reputation promotes cooperation in a cryptomarket for illegal drugs. European Sociological Review, 33(6), 752–764.
Przepiorka, W., Rutten, C., Buskens, V., & Szekely, A. (2020). How dominance hierarchies emerge from conflict: A game theoretic model and experimental evidence. Social Science Research, 86, 102393.
Raub, W. (1992). Eine Notiz über die Stabilisierung von Vertrauen durch eine Mischung von wiederholten Interaktionen und glaubwürdigen Festlegungen. Analyse & Kritik, 14, 187–194.
Raub, W. (2004). Hostage posting as a mechanism of trust: Binding, compensation, and signaling. Rationality and Society, 16(3), 319–365.
Raub, W., & Weesie, J. (2000). Cooperation via hostages. Analyse & Kritik, 22(1), 19–43.
Resnick, P., Zeckhauser, R., Friedman, E., & Kuwabara, K. (2000). Reputation systems. Communications of the ACM, 43(12), 45–48.
Riegelsberger, J., Sasse, M. A., & McCarthy, J. D. (2005). The mechanics of trust: A framework for research and design. International Journal of Human-Computer Studies, 62(3), 381–422.
Salganik, M. J., & Watts, D. J. (2009). Social influence: The puzzling nature of success in cultural markets. In P. Hedström & P. Bearman (Eds), The Oxford Handbook of Analytical Sociology (pp. 315–341). Oxford: Oxford University Press.

Schilke, O., Reimann, M., & Cook, K. S. (2021). Trust in social relations. Annual Review of Sociology, 47, 239–259.
Shapiro, C. (1983). Premiums for high quality products as return to reputation. Quarterly Journal of Economics, 98(4), 659–680.
Simonsohn, U., & Ariely, D. (2008). When rational sellers face nonrational buyers: Evidence from herding on eBay. Management Science, 54(9), 1624–1637.
Simpson, B., & Willer, R. (2015). Beyond altruism: Sociological foundations of cooperation and prosocial behavior. Annual Review of Sociology, 41, 43–63.
Snijders, C., & Weesie, J. (2009). Online programming markets. In K. S. Cook, C. Snijders, V. Buskens, & C. Cheshire (Eds), eTrust: Forming Relationships in the Online World (pp. 166–185). New York: Russell Sage Foundation.
Tjaden, J. D., Schwemmer, C., & Khadjavi, M. (2018). Ride with me: Ethnic discrimination, social markets, and the sharing economy. European Sociological Review, 34(4), 418–432.
Uslaner, E. M. (Ed.) (2018). The Oxford Handbook of Social and Political Trust. Oxford: Oxford University Press.
van de Rijt, A., & Frey, V. (2020). Robustness of reputation cascades. In V. Buskens, R. Corten, & C. Snijders (Eds), Advances in the Sociology of Trust and Cooperation (pp. 141–152). Berlin: De Gruyter.
van de Rijt, A., Kang, S. M., Restivo, M., & Patil, A. (2014). Field experiments of success-breeds-success dynamics. Proceedings of the National Academy of Sciences of the USA, 111(19), 6934–6939.
Voss, T. (1998). Vertrauen in modernen Gesellschaften. Eine spieltheoretische Analyse. In R. Metze, K. Mühler, & K.-D. Opp (Eds), Der Transformationsprozess: Analysen und Befunde aus dem Leipziger Institut für Soziologie (pp. 91–129). Leipzig: Leipziger Universitätsverlag.
Yamagishi, T. (1986). The provision of a sanctioning system as a public good. Journal of Personality and Social Psychology, 51(1), 110–116.

14. Using YouTube data for social science research

Johannes Breuer, Julian Kohne, and M. Rohangis Mohseni

1 YOUTUBE AS A VIDEO PLATFORM

Founded in 2005, YouTube is currently the largest and most important video platform on the internet (Alexa Traffic Ranks, 2021; Auxier & Anderson, 2021; Konijn et al., 2013). It is only second to Google Search (Alexa Traffic Ranks, 2021) in terms of overall traffic and, with 2,291 million active users in July 2021, only second to Facebook with 2,853 million active users (Kemp, 2021). YouTube is extremely popular in many countries, particularly among younger generations, and is used by many to watch entertainment content, listen to music, or gain information in the form of how-to videos, news, and other formats (Smith et al., 2018). For young people, YouTube is already partly replacing television (Defy Media, 2017), and popular 'YouTubers' (i.e., people who produce video content for YouTube) have become social media stars (Budzinski & Gaenssle, 2018), many of whom are able to gain a notable income with their activities.1 In essence, YouTube is a platform where users can also be producers: everybody can upload or watch videos, with the options of (dis)liking them or commenting on them. The possibility of commenting on comments (so-called follow-up comments) can give rise to discussions within the audience, which is one reason why YouTube is also often understood and studied as a social media platform. A detailed account of the history and various features of YouTube is beyond the scope of this chapter, but readers may refer to Burgess and Green (2018) for an overview. What is important to note, however, is that due to its popularity and diverse use and features, YouTube has also become a subject of study in the social sciences.

2 SOCIAL SCIENCE RESEARCH ON YOUTUBE

Although YouTube has only been around for a little more than 15 years at the time of writing this chapter, many studies exist that examine its content, use, and effects (for a general discussion, see, e.g., Arthurs et al., 2018; Burgess & Green, 2018). A search for the term 'YouTube' in the Scopus database in September 2021 returned 12,031 documents of which roughly half were journal articles. We cannot provide a systematic review of the social science literature on YouTube as the focus of this chapter is on practical and methodological questions related to the use of YouTube data. However, our engagement with the literature in our own research and teaching as well as for writing this chapter has revealed some general categories, patterns, and trends. In line with the classic Lasswell formula 'Who says what to whom in which channel and with what effect?' (Lasswell, 1948), the existing studies can be categorized into (1) communicator research, (2) content research, and (3) audience research. Overall, it seems that most social science studies of YouTube fall into the latter category. In these studies, the users, mostly in the role of the audience, but sometimes also in the role of the content producers, are asked about their experiences with YouTube (e.g., Lange, 2007; Moor et al., 2010; Oksanen et al., 2014; Szostak, 2013; Yang et al., 2010), or the content of YouTube videos they consume or produce (e.g., Montes-Vozmediano et al., 2018; Tucker-McLaughlin, 2013; Utz & Wolfers, 2020). By comparison, there seem to be relatively few studies that deal with specific aspects of audience research, such as radicalization (Ribeiro et al., 2020), or the formation of communities through recommendation algorithms (Kaiser & Rauchfleisch, 2020). There is also a small but growing number of studies that specifically examine user comments on YouTube. This can be regarded as audience research as the comments come from the audience, but also as content research because the audience takes on the role of public communicator. Most of those studies focus on incivility or hate speech in the comments (e.g., Döring & Mohseni, 2019a, 2019b, 2020; Spörlein & Schlueter, 2021; Wotanis & McMillan, 2014). Some focus on the attributes of commenters, such as political ideology (Literat & Kligler-Vilenchik, 2021; Röchert et al., 2020) or gender differences (Thelwall & Mas-Bleda, 2018), while others focus on the attributes of comments, such as topics or sentiments (Thelwall, 2018; Thelwall et al., 2012). There are also various studies that investigate the content of videos, typically for specific topics or genres, such as educational videos (Kohler & Dietrich, 2021; Utz & Wolfers, 2020). Only a few studies primarily focus on the communicators. Several of those focus on far-right extremism (Rauchfleisch & Kaiser, 2020, 2021) or political ideology in general (Dinkov et al., 2019). Other studies investigate more general aspects of content producers, such as diversity (Chen et al., 2021), gender differences (Linke et al., 2020; Thelwall & Mas-Bleda, 2018), the economic aspects of uploads (Budzinski & Gaenssle, 2018), or the structural hierarchy of channels (Rieder et al., 2020). Importantly, all these studies make use of different kinds of data. Besides self-reported data obtained from interviews or surveys, various studies use data collected from the platform itself.

3 YOUTUBE DATA

The examples from the literature discussed in the previous section should make clear that different research questions require different types of data and that the commonly used units of analysis in social science research on YouTube are users (including content producers), videos, and comments.2 As mentioned before, many studies on YouTube make use of self-reports. However, several studies have shown that self-reports related to media use can be biased due to social desirability or memory problems on the side of the respondents (Araujo et al., 2017; Parry et al., 2021; Scharkow, 2016). As an alternative or addition to self-report data, YouTube itself offers a variety of data that can be used to study users, videos, and comments. If you browse through YouTube, some of these data will already be visible to you as a user. Apart from the name of the video and the user who uploaded it, these include information about its duration, the date it was published, or popularity cues, such as the number of views and likes. While some of those pieces of information (e.g., the duration or publication date) can be labelled as metadata, they can also be used for various analyses to investigate the content and use of YouTube. Of course, the videos themselves can serve as study material in content analyses that can produce different types of data based on their respective coding schemes (and this has already been done in quite a few publications). However, most of these content analyses based on videos have been manual, meaning that they are typically limited to a relatively small number of videos. There are two types of text data that YouTube provides that can be used for (semi-)automated content analyses (as well as other types of study): user comments and subtitles. For the remainder of this chapter, we will focus largely on working with user comments. Before discussing the tools and methods for accessing these data, it is important to understand the hierarchical structure of YouTube data. On the top level of this structure are the user profiles. Users can, but do not have to, create channels. If a user does not create a channel, the user profile itself takes the role of a channel. Channels have attributes like a description, the number of subscribers, the number of included videos, and links to other channels. Videos are embedded within channels and have a number of attributes, including those described above. Hence, data (or metadata) can exist on the level of the producer (e.g., number of posted videos and channels), channel (e.g., number of videos and subscribers), and video (e.g., views, likes, dislikes). If the respective functions are enabled by the content producers, videos can also have different types of subtitles and user comments. Users can reply to comments, thereby creating comment threads. In the hierarchical data structure, such replies to comments are not only associated with the video but also the original comment that the posters responded to. The overall data structure is important for research that wants to automatically collect YouTube data and will become apparent when we discuss the YouTube application programming interface (API) and the use of tools for collecting comment and subtitle data (see Sections 4.5 and 4.6).
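To make this hierarchy more concrete, the following sketch mirrors it as a nested R list. All IDs and values are invented placeholders for illustration; the API returns comparable fields as JSON.

```r
# Hypothetical sketch of the hierarchical YouTube data structure described
# above; all IDs and values are invented placeholders.
channel <- list(
  channelId       = "UC_example_channel_id",
  description     = "An example channel",
  subscriberCount = 12345,
  videos = list(
    list(
      videoId   = "example_vid1",
      viewCount = 67890,
      likeCount = 321,
      comments  = list(
        list(
          commentId = "comment_001",          # top-level comment
          text      = "Great video!",
          replies   = list(
            list(
              commentId = "comment_001_r1",   # reply in the same thread
              parentId  = "comment_001",      # linked to the original comment
              text      = "Agreed."
            )
          )
        )
      )
    )
  )
)

# e.g., access the text of the first reply to the first comment
channel$videos[[1]]$comments[[1]]$replies[[1]]$text
```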

4 ACCESSING YOUTUBE DATA

There are different ways in which researchers can access data from YouTube and other social media platforms (Breuer et al., 2020). Besides directly cooperating with platforms and purchasing data from third parties, researchers can also use data that have already been collected and shared by other researchers. Even though archiving and sharing YouTube data is associated with a set of challenges (see Section 6), some researchers have published collections of YouTube data. The most extensive one is the YouNiverse collection of 'channel and video metadata from English-speaking YouTube' (Ribeiro & West, 2021b). It 'comprises metadata for over 136k channels and 72.9M videos published between May 2005 and October 2019, as well as channel-level time-series data of weekly subscriber and view counts' (Ribeiro & West, 2021b, p. 1) and is available via Zenodo (Ribeiro & West, 2021a). The paper describing the YouNiverse collection (Ribeiro & West, 2021b) also lists other available YouTube datasets, but those are smaller and contain fewer types of (meta)data. While collections like YouNiverse contain a large amount of data that can be used to answer a variety of research questions, such a secondary analysis of existing data might not be an option for many projects and studies. Often, researchers are interested in specific channels, topics, or videos that are not included in existing collections. For example, they may want to examine videos that are not in English. In such cases, researchers can collect YouTube data themselves in three different ways: (1) manually (e.g., by copying and pasting comments), or in an automated fashion using (2) the API provided by YouTube or (3) web scraping.3 Like the aforementioned option for accessing archived YouTube data, all three options come with specific advantages and disadvantages (see Breuer et al., 2020) because they differ on important dimensions, such as the required resources and skills (monetary costs, required time, technological and coding skills, etc.), the degree of control over the data collection, the quantity and quality of the data (how many and which units of analysis and variables they contain), or the scope of data use (e.g., with regard to archiving and sharing; see Section 6). In the following sections, we will focus on automated options for collecting YouTube data, that is, web-scraping techniques and API-based tools. Although web scraping and API-based methods are quite versatile and can be used to collect various kinds of YouTube data, they also have some disadvantages and limitations. We will discuss these further in the following, but there are two key aspects that we want to mention at this point: first, the YouTube Terms of Service4 do not permit the use of automated services, such as scrapers, to access the service (as also stated in the YouTube robots.txt file5); second, the YouTube API has strict quota limits, and those have been drastically reduced over time. The reaction by Facebook to the Cambridge Analytica incident has demonstrated that there is also the risk of APIs being closed down for academic data access. Hence, some researchers have diagnosed an 'APIcalypse' (Bruns, 2019) or a 'post-API age' (Freelon, 2018).6 To address these issues, several suggestions for alternative models for accessing social media data have been made. One proposal that has recently gained traction is to collaborate with users instead of platforms to collect data (Halavais, 2019). Most platform providers, including Google, offer users the functionality to export their own data which they can afterwards share with researchers. The use of such 'data download packages' (Boeschoten et al., 2022) also addresses some of the ethical questions related to the use of YouTube data, which we will discuss further in Section 6, and some researchers have started to develop technical solutions for this approach that are flexible and scalable (Araujo et al., 2022). An alternative to data download packages for partnering with platform users is the use of browser plugins or other standalone software tools (Haim & Nienierza, 2019; Menchen-Trevino, 2016). While those have not yet been developed for YouTube data in academic projects, the German non-profit AlgorithmWatch has developed the tool DataSkop7 for the user-centric collection of YouTube data. Notably, however, many of these tools also make use of web scraping, which raises legal and ethical questions.

4.1 Web Scraping versus API Harvesting

Before discussing web scraping and API harvesting as two different methods for automated collection in more detail, it is important to first understand the general structure with which YouTube and similar websites operate. Regular users usually only see the website as it is displayed in their browser. From this point of view, the user-created content (e.g., uploaded videos, images, or text) is seamlessly embedded into the structure and styling of the website (alongside third-party content, such as advertisements). However, these types of data are stored and processed quite differently from a technical point of view. While the general layout and structure of the website is defined through static HyperText Markup Language (HTML) and Cascading Style Sheet (CSS) documents, user-generated and third-party content need to be stored in dynamic databases and loaded into the website interactively. The final website that users experience (through their web browser software) is technically coded in HTML documents with pointers to the database for loading the user-generated content that is displayed in the browser. When researchers use web scraping, they use software that essentially downloads HTML documents and extracts information of interest by processing those documents. In contrast, when researchers use API harvesting, they can request the desired information directly from the database through the API, typically resulting in a more structured data file (see Figure 14.1). The two methods yield different data structures, and each comes with specific advantages and disadvantages.

Figure 14.1 Schematic display of different data collection methods for websites containing both static and dynamic content

Note: While web scraping refers to the practice of collecting and parsing an HTML document that is built for display in a web browser, API harvesting provides data extracted directly from an underlying database. Icons in this figure come from www.flaticon.com/.

Web scraping has the advantage that practically any information that is visible in the browser can, in principle, be extracted into a dataset. Theoretically, there is no hard limit on how much information can be extracted and which kind of publicly available information can be gathered. In practice, however, websites often employ measures to prevent automatic scraping of their content (e.g., using CAPTCHA, requiring login as a registered user, or using complex document structures), and prohibit web scraping in their terms of service. From a legal perspective, web scraping might still be allowed depending on the country the researcher resides in, what kind of data are scraped, what the data are used for, and if and how the data are made accessible to third parties. A detailed discussion of these aspects is beyond the scope of this chapter, but the interested reader can find an overview in the context of Facebook data in a recent paper by Mancosu and Vegetti (2020). Another challenge of web-scraped data is parsing the HTML document into a usable format. Unfortunately, this is no trivial task as modern websites typically contain a lot of detailed structuring and styling information that is not interesting for researchers and might even employ ways of interactively loading new content from the database with certain user inputs (e.g., YouTube loads more comments or search results as users scroll down the page). Additionally, standard tools for working with text data are not suitable for extracting information from HTML documents, so researchers need to familiarize themselves with Extensible Markup Language (XML) parsers instead. While researchers can certainly write code for web scraping (e.g., in R or Python) with workarounds for all these issues, it typically requires a lot of time, dedication, and expertise that is not easily acquired by researchers without a computer science background. To make matters worse, even if researchers manage to write code for web scraping that extracts the information they are interested in, a trivial change in the structure or styling of a website (that does not even have to be visible in the browser) could require manual inspection of the website and, potentially, a complete reworking of the data collection pipeline. In sum, web scraping is relatively straightforward for websites with simple structures, mostly static content, and few changes to their underlying structure over time, but it can quickly become infeasible as the complexity of the website to be scraped increases. By contrast, API harvesting does not rely on scraping and parsing an HTML document but is built on direct interaction with the database through the API. More specifically, users can send a request for data to the API and receive a response containing only the data they are interested in, formatted in a standard file structure (usually .json or .csv). As this process is completely independent from the HTML and CSS used to lay out the website, researchers do not have to use XML parsers, and changes in the structure, styling, or complexity of the website do not affect the feasibility of collecting data via an API. However, an API is a web service that is (usually) built by the provider of the respective website to allow others (not only researchers) easier access to its content (e.g., for reasons of structured data exchange between applications). As such, the provider of the website can fully control and restrict which information and how much of it each user of the API can access at any given time. Some information that is visible in the browser might not be accessible through an API, and most APIs typically have limits on how much data can be gathered in a specified amount of time. In addition, even though most modern social media websites do have APIs, not every website has one, and even if working with an API is usually more straightforward than web scraping, every website has its own syntax for sending requests that researchers need to familiarize themselves with. Summing up, web scraping is more difficult and tedious for websites that are not strictly static, and data collection pipelines tend to break down more easily with changes to the website in question. However, there is theoretically no limit regarding the amount of data and what kind of data can be collected. In contrast, using APIs is more convenient, reliable, and stable, but the provider of the website can strictly limit which kind of data can be accessed and how much data can be gathered in a specified amount of time. Given these attributes of the two methods, our general recommendation is that researchers should use an API if they can, and only use web scraping if they must (e.g., because there is no API or because the API does not provide the data required for the study or project). In the following sections, we will, accordingly, focus on methods for collecting data from YouTube (especially comments and subtitles) using the YouTube Data API v3.

4.2 Accessing the YouTube API

While there are quite a few different tools for using the YouTube Data API v3 (henceforth just API) to collect data, the basic principle behind them is always the same. First, user input from the tool is translated into a request that is sent to the API. The API then evaluates the validity of the request and sends back a response that either contains an error message (if the request was invalid) or the requested data. As mentioned before, the API is a service provided by YouTube itself and restricts the type, amount, and format of data that can be requested (see Section 4.3). To exercise and monitor these restrictions, using the API requires a registered account and periodic user authentication. By setting up and configuring such an account, researchers can obtain credentials that can be used with multiple tools (see Section 4.4) to send requests to the API. These credentials are a necessary requirement for most tools using the API but can be challenging to acquire, as the process for setup and configuration is not tailored to academic research. Since the API itself and the procedure for acquiring access frequently change, a step-by-step tutorial would become outdated quickly. For this reason, we will only provide a high-level overview here.8 As YouTube belongs to Google, access to its API requires a working Google account. If you want to use the YouTube API for your research project, we generally recommend creating a new Google account. The Google account can be used to create a project via the Google Developer Console (a platform managing access to many different Google APIs), which is necessary for accessing the API. More specifically, the API that researchers need to enable is the YouTube Data API (currently at v3). This API handles requests to the YouTube databases. Researchers need API credentials for collecting YouTube data with most of the tools we will present in the following. What is important to note here is that there are two types of credentials: API keys, which grant access to publicly available information, and so-called OAuth2 access credentials, which grant access to publicly available information and information and actions only available to the owner of the respective Google account (such as posting comments or uploading videos). An API key is just one value, whereas the OAuth2 credentials consist of two parts: the client ID and the client secret. Most tools require API keys, but some necessitate OAuth2 credentials as they make use of additional API functionalities. From a security perspective, OAuth2 access credentials are more sensitive because they allow a person or application to execute actions in the name of the account owner.9

4.3 API Restrictions and Quota Limits

The API for accessing data from YouTube limits the type of data that users can gather as well as the amount of data that can be collected per day. Currently, each user has 10,000 units per day at their disposal to use for any operation involving the API.10 The API allows users to retrieve different kinds of information, such as video IDs associated with certain keywords, statistics, and metadata for individual videos, playlists, or channels, or comments for a specific video.11 Crucially, requesting different kinds of information incurs different quota costs. For example, a request for video IDs associated with a keyword currently has a cost of 100 units, while a call to retrieve a single comment thread (an initial comment including all replies) has a cost of one. Data that are sent back from the API to the user are structured in pages, and the quota cost associated with an action applies to each page of returned results.12 Because the API starts sending back error messages instead of data when making a request with insufficient units remaining from the daily quota, it is important for researchers to manage their units effectively. In the following, we will provide some general advice on this issue based on our personal experience.

4.3.1 Testing API requests
It is usually advisable to pilot-test a planned data collection task on a small scope (e.g., a single video) beforehand. By testing, the researcher may identify invalid requests being sent to the API that only prompt error messages and incur quota costs while not returning any usable data. In addition, testing provides the opportunity to review the returned data and to check if all necessary information is contained in the result (and also that no superfluous data are contained that may incur unnecessary quota costs). Finally, testing allows researchers to form a good estimate of how many units their requests will cost, which can help with scheduling automatic data collection (see below).13

4.3.2 Monitoring quotas
Monitoring how many units are used by test requests is crucial for efficient quota management and data collection. The easiest way to monitor the quota is to use the built-in unit monitor in the Google Cloud Console dashboard.14

4.3.3 Scheduling data collection
Even with efficient requests, some research projects will require more data than can be collected with 10,000 units in a single day. This can be achieved by splitting up the data collection across multiple days. To prevent researchers from needing to restart data collection manually for days (or possibly weeks), data collection can be split into batches and scheduled to automatically run each day (see the sketch at the end of this section).

4.3.4 Collaborating
Just like batches can be split across multiple days, they can also be split across multiple accounts. Collaborating with others for the data collection can effectively multiply the total daily quota for a research project.

4.3.5 Keeping up to date
It is important to keep in mind that the quota costs for all requests and even daily quota limits are subject to change. YouTube has drastically lowered these values in the past (the daily quota was 50 million up until 2014 and 1 million up until 2019) and could change them again in the future. Being aware of these potential changes is especially relevant for longitudinal data collections with automatically scheduled tasks where collected data are not checked regularly through manual inspection. Likewise, YouTube can change the kinds of data that are available through the API. For example, since December 2021, dislike counts are not available anymore.
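To illustrate the batching and scheduling advice from Section 4.3.3, a simple implementation could look as follows. The batch size, file names, and the use of tuber's get_all_comments() are assumptions that would need to be adapted to the project at hand.

```r
# Sketch of batched comment collection (assumed workflow): split video IDs
# into batches and collect one batch per run. The script can be scheduled
# to run once per day, e.g., with cron or a task scheduler.
library(tuber)

video_ids <- readRDS("video_ids.rds")  # IDs identified in a previous step
batches   <- split(video_ids, ceiling(seq_along(video_ids) / 50))

batch_no <- 1  # increment on each scheduled run
comments <- lapply(batches[[batch_no]], get_all_comments)
saveRDS(comments, sprintf("comments_batch_%02d.rds", batch_no))
```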

4.4 API-Based Tools for Collecting YouTube Data

The API-based tools for collecting YouTube data we present in this section fall into three categories: web-based tools, standalone tools, and packages for R15 (see Table 14.1). We found only one web-based tool, namely YouTube Data Tools. As it is web-based, it is very easy to use. The tool still has a quota limit from several years ago, which is quite high (see Section 4.3). Hence, it is possible with YouTube Data Tools to download a larger number of comments. A limitation of YouTube Data Tools is that it does not allow users to select specific comments or comment threads, meaning that researchers can only collect all comments for each video. In addition, users get one file for each video and have to combine files for multiple videos manually. Options that offer more flexibility in this regard are Webometric16 or Facepager.17 These standalone tools can batch-process video IDs, but in the case of Webometric, users still get one file per video. However, the learning curve for those tools is much steeper.18 Unlike YouTube Data Tools, these tools do not have their own API key, but require individual authentication for each user.

Table 14.1 Features of API-based tools for collecting YouTube data

| Software | Type | Can collect | Comment scope | Needs API authentication |
| --- | --- | --- | --- | --- |
| YouTube Data Tools 1.22 | Website | Channel info, video info, comments | Only all | No |
| Webometric 4.1 | Standalone app | Channel info, video info, comments, video search | 100 most recent or all | API key |
| Tuber 0.9.9 | R package | Channel info, video info, comments, subtitles, all searches | 20–100 most recent or all | OAuth2 |
| vosonSML 0.29.13 | R package | Video IDs, comments | 1-x top level | API key |
| youtubecaption 1.0.0 | R package | Subtitles | NA | No |

A key limitation of the three options mentioned above is that they can only be used to collect data but not for processing and analysing them. This is one of the key advantages of R packages as they can be combined with other packages for cleaning, filtering, pre-processing, and statistical analyses. Two options for retrieving YouTube comments with R are the packages tuber19 and vosonSML.20 The tuber package can also be used for retrieving other types of data, such as video or channel statistics (e.g., the number of views and likes), while vosonSML provides specific functionalities for network analysis. Another difference between these two packages is that vosonSML requires an API key, while tuber needs OAuth2 credentials. A package specifically for the collection of video subtitles is youtubecaption.21 This package requires no API credentials as it relies on a web-scraping approach. Table 14.1 provides a comparison of some of the key attributes of the tools we discuss here. In an ideal scenario, all these tools would download an identical number of comments. However, due to the intricacies and complexities in the data structure of comments and the ways in which the tools compose their API requests, the results of requests for the same videos can differ across the tools. In a small test, we investigated how many comments each tool retrieves for the same video. Table 14.2 shows the results of our test. Of the tools tested, YouTube Data Tools ranks first by a large margin, while among the R packages, vosonSML fares slightly better than tuber in terms of the number of comments it collects.

Table 14.2 Test results on the efficacy of API-based tools for collecting YouTube data

| Software | Ease of use | Disadvantages | Number of comments |
| --- | --- | --- | --- |
| YouTube Data Tools 1.22 | High | Lacking flexibility, less information | 52,243 |
| Webometric 4.1 | Low | Only first 5 follow-up comments, no error feedback, undetectable time-outs | 49,150 |
| Tuber 0.9.9 | Low | Only first 5 follow-up comments (bug) | 49,139 |
| vosonSML 0.29.13 | Low | Lacking flexibility, only comments | 50,619 |

Note: Tools were used to collect as many comments as possible for a selected video (ID 'DcJFdCmN98s') on September 2nd 2021. According to the YouTube website for the video, it had a total of 52,241 comments at the time.

Notably, as tools may become deprecated and new ones can appear, it can be helpful to consult regularly updated overviews of tools. Such overviews are, for example, provided (for different social media platforms) by the Social Media Observatory project of the Leibniz Institute for Media Research22 and the Social Media Research Toolkit23 by the Social Media Lab at Ryerson University.

4.5 Collecting YouTube Comments Using the Tuber Package for R

In the following, we will provide a short practical example of collecting YouTube comments using one specific tool (the R package tuber). Later on, we discuss the automated collection of video subtitles (Section 4.6) and the pre-processing of comment data for statistical analysis (Section 5). The first steps in collecting comments with the tuber package are acquiring API credentials (Section 4.2), installing the package, and testing a small API request to check the setup (see Figure 14.2).

Figure 14.2 Code block 1

Note: Code block 1 contains R code for testing API credentials with a simple request for video information. If the credentials are correct, the request should return a data frame with information on video ID, view count, like count, favourite count, and comment count.
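Since the code in Figure 14.2 appears only as an image in the print version, here is a minimal sketch (not the authors' original code) of what such a test request with tuber could look like; the credentials are placeholders, and the video ID is borrowed from the later example.

```r
# Sketch of a test request with tuber; credentials and video ID are
# placeholders. yt_oauth() opens a browser window for the OAuth2 flow.
library(tuber)

yt_oauth(app_id     = "YOUR_CLIENT_ID",
         app_secret = "YOUR_CLIENT_SECRET")

# A valid setup returns basic statistics for the video (view count, like
# count, favourite count, and comment count)
get_stats(video_id = "iik25wqIuFo")
```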

After testing the setup, researchers need to identify the video(s) that they want to collect comments from. This can either be done manually in the browser or in an automated fashion using the API. Selecting videos manually has the advantage that researchers can freely browse the YouTube website to select videos based on titles, thumbnails, or recommendations. While this is easy to do for a small number of videos, this approach does not scale well. If a larger number of videos is required, researchers can also use the API for identifying relevant video IDs. For larger projects, the API can be used to directly search for video IDs using keywords.

Another advantage of this approach is that it offers some customizability with respect to the output, such as options for returning the most popular videos for specific regions or among specific types of videos or specifying the number of video IDs to be returned. A disadvantage of searching for video IDs via the API, however, is that it has a very high quota cost of 100 units per page of results. Figure 14.3 provides an example of how to search for a list of videos using the tuber package. When recreating this example, readers should make sure to set the get_all parameter to FALSE because it ensures that only the first page of results is returned (which typically contains up to 50 results), thus preventing the overuse (or even depletion) of the daily quota.

Figure 14.3 Code block 2

Note: Code block 2 contains R code for querying the API for the first page of search results for videos associated with the keyword 'US elections' published between 2019 and 2021. The request returns a data frame with video IDs, publication timestamp, channel ID, video title, video description, and channel title.
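A sketch of such a search request is shown below; the exact date cut-offs are assumptions (the API expects RFC 3339 timestamps), and the code is a reconstruction rather than the original figure content.

```r
# Sketch of a keyword search restricted to a publication period; setting
# get_all = FALSE returns only the first page of results to save quota.
library(tuber)

us_elections <- yt_search(term             = "US elections",
                          published_after  = "2019-01-01T00:00:00Z",
                          published_before = "2021-01-01T00:00:00Z",
                          get_all          = FALSE)

head(us_elections)  # video IDs, timestamps, channel and title information
```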

Figure 14.4 Code block 3

Note: Code block 3 contains R code for querying the YouTube API for all comments and comment replies for the video with the ID 'iik25wqIuFo' using the tuber package. The response is an R data frame object containing timestamps, author names, comment IDs, parent IDs, like count, comment text, and author profile information.
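A sketch of this request with the tuber package, using the video ID named in the note, could look as follows:

```r
# Sketch of collecting all comments (including replies) for one video
library(tuber)

comments <- get_all_comments(video_id = "iik25wqIuFo")

# Inspect the returned data frame (timestamps, author names, comment IDs,
# parent IDs, like counts, and comment text)
str(comments)
```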

Once researchers have defined the ID or list of IDs of videos for which comments should be collected, they can start collecting comments or other data of interest (such as like or view counts). Importantly, as requesting all comments from videos with hundreds of thousands of comments can easily exceed the daily quota limit, requests for such videos should be tested and batched (see Section 4.3). An example of how to retrieve comments from a video using the tuber package can be found in Figure 14.4.

4.6 Collecting Subtitles from YouTube

In the social sciences, most content analyses of YouTube videos have been carried out manually. (Semi-)automated content analysis of video subtitles is a viable option for substantially increasing the number of analysed videos. For each video, YouTube automatically creates subtitles in a specific format labelled 'ASR' (automatic speech recognition). These ASR subtitles are based on voice recognition and are always in English, even if the video is in a different language. The quality of ASR is usually good for English videos but typically worse for non-English videos. Content producers can also create their own subtitles. Hence, videos can have multiple subtitles (even for the same language). Previously, subtitles for all videos were accessible via the API. However, YouTube has since restricted this access to the subtitles of a user's own channel. Now, an attempt to access subtitles via the API from any other channel results in an error message. For the automated collection of subtitles researchers can, however, use the R package youtubecaption as it employs a web-scraping approach that does not require API access.
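A minimal sketch of collecting subtitles with youtubecaption could look as follows; the video URL is a placeholder, and by default the function requests English captions.

```r
# Sketch of subtitle collection with youtubecaption; no YouTube API
# credentials are needed. The URL is a placeholder.
library(youtubecaption)

caption <- get_caption(url = "https://www.youtube.com/watch?v=iik25wqIuFo")

head(caption)  # subtitle segments with their start times and durations
```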

5 PROCESSING YOUTUBE COMMENTS

YouTube comments are text data and as such require some pre-processing before they can be analysed with statistical packages commonly used in the social sciences. In the following, we will describe some of the necessary pre-processing steps for working with YouTube comments extracted with the tuber package and give a brief outlook on the descriptive or exploratory analyses that can be conducted with these data.

5.1 Understanding the Data

The most important step for effectively working with comments and comment information is understanding the properties of the returned values. For example, timestamps returned by the YouTube API will look something like ‘2021-02-18T20:55:19Z’. A look at the API documentation24 reveals that returned timestamps are formatted as character strings in ISO8601 format. The API documentation also describes that the content of comments is returned in two different variables, one containing HTML and one only containing raw text.
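For instance, such a string can be converted into an R date-time object before any analysis of posting times (a sketch using base R; packages like lubridate offer equivalent helpers):

```r
# Convert an ISO 8601 timestamp string returned by the API into a POSIXct
# date-time object (UTC)
ts <- "2021-02-18T20:55:19Z"
as.POSIXct(ts, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
#> [1] "2021-02-18 20:55:19 UTC"
```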

5.2 Hyperlinks

Comments may not only contain raw text but also hyperlinks to other videos or websites. These data can be of interest in and of themselves but can also distort many text analysis tools (e.g., sentiment analysis or topic models) if not handled properly. While there are many examples of pre-processing of text data from web sources where hyperlinks are simply deleted, we recommend extracting them and storing them in a separate variable. This way, comments can be analysed with standard text mining tools, while hyperlinks are kept for additional analyses (e.g., describing the network of referrals between different channels or playlists).

5.3 Timestamps

YouTube offers the possibility to refer to a specific point in a given video by using a timestamp in the comment. For example, writing the string '02:30' in a comment will create a clickable hyperlink in the displayed comment that will make the current video jump to minute 2 and 30 seconds when clicked. In the data returned by the API, these timestamps are represented as numbers in one variable (textOriginal) but as links to the YouTube video in another (textDisplay). While extracting hyperlinks (see above) will also extract timestamp links, we recommend further separating the two as the timestamps can be used to link the content of the comment to the content of the video (e.g., using subtitle data).

5.4 Emoji

In addition to hyperlinks and timestamps, comments also frequently contain emoji, which are notoriously difficult to deal with as issues with character encoding frequently arise due to different operating systems, user locales, and different textual representations of emoji. In many studies that employ text mining, emoji are routinely deleted from the texts in question. However, emoji may modify or convey an emotional tone, irony, or sarcasm (e.g., Kaye et al., 2017; Subramanian et al., 2019). Hence, we suggest that these are extracted from user comments and ideally also taken into account in analyses of text data. A detailed description of how to process emoji would be beyond the scope of this chapter; however, we provide some guidance on this in the materials for a workshop on working with YouTube data we taught in February 2022.25

Pre-processing the comment data extracted from the API in the way described above is a first step in enabling the use of efficient text-mining tools on the data. However, after removing or replacing hyperlinks, timestamps, and emoji from the text, the text itself needs to be processed as well for efficient handling in later analyses. The basic principle behind most text-mining methods is to split the text into sequences of individual words (or multiple words) called tokens and then compute metrics for the (co-)occurrence of these tokens. Common additional pre-processing steps include the transformation of all words to lower case or removing numbers, punctuation, and so-called stop words (i.e., words that appear very frequently in a given language but provide little informational value, such as articles like 'the' or 'a' in English). A sketch of these extraction and tokenization steps follows below. Figures 14.5 and 14.6 show an example of a simple descriptive analysis of the most frequent words (Figure 14.5) and emoji (Figure 14.6) after the implementation of the pre-processing steps described above.26
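The following sketch combines these steps for a comment data frame as returned by tuber. The columns textOriginal and id are part of the API response described above; the regular expressions and the choice of the stringr/tidytext packages are our assumptions, and emoji handling is omitted for brevity.

```r
# Sketch of pre-processing YouTube comments: extract hyperlinks and video
# timestamps into separate variables, then tokenize the remaining text.
library(dplyr)
library(stringr)
library(tidytext)

url_pattern  <- "https?://\\S+"
time_pattern <- "\\b\\d{1,2}:\\d{2}(:\\d{2})?\\b"  # e.g. 02:30 or 1:02:30

comments_clean <- comments %>%
  mutate(
    links      = str_extract_all(textOriginal, url_pattern),
    timestamps = str_extract_all(textOriginal, time_pattern),
    text_clean = textOriginal %>%
      str_remove_all(url_pattern) %>%
      str_remove_all(time_pattern) %>%
      str_to_lower()
  )

# Tokenize, remove English stop words, and count word frequencies
word_counts <- comments_clean %>%
  select(id, text_clean) %>%
  unnest_tokens(word, text_clean) %>%
  anti_join(get_stopwords("en"), by = "word") %>%
  count(word, sort = TRUE)
```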

Figure 14.5 Bar plot showing the words that appear in the highest number of unique comments for the Emoji movie trailer (HD) on YouTube

Source: www.youtube.com/watch?v=r8pJt4dK_s4.

Figure 14.6 Bar plot showing the most frequent emoji in comments for the Emoji movie trailer (HD) on YouTube

Source: www.youtube.com/watch?v=r8pJt4dK_s4.

As the range of options for analysing textual data is even larger than the one for pre-processing them, and since the choice of suitable analysis methods always depends on the specific research questions, we will not discuss the analysis of YouTube comments any further here. In the following section we will, instead, focus on questions related to the next step in the research data cycle: sharing and archiving the data.


6 DATA SHARING AND ARCHIVING

As we have seen in the previous sections, accessing YouTube data and processing them is associated with a set of challenges and questions that researchers need to address, and decisions that they need to make. The final steps in the research process, namely archiving and sharing the data, raise further questions related to practical, legal, and ethical issues. The practical issues are mostly related to the format and size of the data, but also to questions of proper documentation. Most repositories and data archives (such as Zenodo, Dataverse, the Open Science Framework, or the GESIS Data Archive) restrict the format and size of data files. Social science data archives are specialized in, and experienced with, conventional social research data, such as survey data. Unlike for survey data, however, there are hardly any documentation standards for social media data. As these data are being increasingly used by social scientists, archives are currently developing solutions for storing and documenting such data. Much of this still happens on a case-by-case basis, which is largely due to the heterogeneous nature of social media data. As we have seen in this chapter, even data from the same platform (in our case YouTube) can come in various shapes and forms. Hence, finding a suitable repository for archiving such data can be a challenge for researchers. However, as archives and repositories are increasing their efforts in this direction, this problem should hopefully become less of an issue in the near future. Technical solutions and documentation standards are being developed to ensure that data from YouTube and other social media platforms can be archived in ways that meet the FAIR data criteria of findability, accessibility, interoperability, and reusability, and there are a few publications that offer researchers and archivists some guidance in this regard (for an overview, see Breuer et al., 2021). While solutions are being developed for addressing the more practical questions (storage space, documentation, etc.), additional issues are legal and ethical ones. Regarding legal aspects, two documents that are relevant for YouTube data are the Terms of Service of the platform and its developer policies.27 Developer policies are certainly not written with the use case of academic research in mind and, hence, are difficult to interpret for researchers and their use of the API. Importantly, compliance with the Terms of Service is a moving target as they can and do change. Another legal aspect that researchers may have to deal with is copyright (or intellectual property rights), if, for example, they want to archive videos, audio, subtitles, or extensive textual content (e.g., in the video description). Since local legislation differs and can also change, it is generally advisable to seek legal counsel regarding the legality of different ways of accessing and publishing data. Data protection, which has both legal and ethical dimensions, is also an important aspect when it comes to archiving and sharing data from YouTube and other social media platforms. Compared to other platforms, the question of whether content can be regarded as public is somewhat clearer for YouTube. Once videos have been published on the platform, they are publicly visible, as are the metadata and user comments. Note, however, that just because content is public, it does not mean that it is always ethical to collect, let alone publish, it.
One key consideration regarding the collection and especially the sharing of YouTube data is whether this may have any adverse effects on individuals whose data are included in the collection. Importantly, the primary focus of ethical concern is the individual. Accordingly, in the case of aggregated data, ethical issues are typically much less severe. Aggregating data before sharing them is a viable strategy for increasing data privacy. It can also be a way to address some of the legal concerns mentioned before.

For example, while it may be problematic from a legal and ethical perspective to share full texts (e.g., comments, subtitles, or video descriptions), sharing derived statistics (word counts, sentiment scores, identified named entities, etc.) minimizes identification risks and avoids potential copyright issues as well as potential conflicts with platform terms of service. This is also the approach that the researchers who published the YouNiverse dataset (Ribeiro & West, 2021a) took for their comment data. On the other hand, sharing only derived and/or aggregated data reduces the reproducibility of analyses and the reuse value of datasets. There are, of course, intermediate options between sharing the full raw data and sharing only aggregate data, such as the remote execution of code on existing datasets (Van Atteveldt et al., 2020). While the technical implementations for these approaches still have to be developed or refined, these suggestions show that it is possible to find solutions for archiving and sharing social media data that strike a balance between data protection on the one hand and reproducibility and reusability on the other.
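To make this aggregation strategy more tangible, the following minimal sketch shows how derived statistics could be computed from collected comments before archiving. It is written in Python with pandas; the input file and column names (youtube_comments.csv, video_id, comment_text) are hypothetical and are not taken from any of the studies cited above.

```python
from collections import Counter
import re

import pandas as pd

# Hypothetical input: one row per collected comment and the video it belongs to.
comments = pd.read_csv("youtube_comments.csv")  # columns: video_id, comment_text

def tokenize(text):
    """Lowercase a comment and split it into simple word tokens."""
    return re.findall(r"[a-z']+", str(text).lower())

rows = []
for video_id, group in comments.groupby("video_id"):
    counts = Counter(tok for text in group["comment_text"] for tok in tokenize(text))
    rows.append({
        "video_id": video_id,
        "n_comments": len(group),
        "mean_comment_length": group["comment_text"].str.len().mean(),
        "top_terms": " ".join(word for word, _ in counts.most_common(10)),
    })

# Archive the derived statistics instead of the raw comment texts.
pd.DataFrame(rows).to_csv("derived_comment_statistics.csv", index=False)
```

Note that even derived statistics can leak sensitive information, for instance when rare terms are traceable to individual users, so the aggregated output requires the same ethical scrutiny before sharing.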

7 SUMMARY AND CONCLUSION

Since YouTube is still growing in popularity and has become an important source of entertainment and information, especially for younger generations, it can be assumed that it will also become an increasingly relevant subject of study for the social sciences. Accordingly, the already existing body of research on the content and use of YouTube can be expected to further diversify and grow substantially in the coming years. Like other social media platforms, YouTube is not only an interesting subject of study but also a valuable data source for social scientists. YouTube offers different kinds of data, including user, channel, and video statistics, popularity scores, and user comments. Similarly, there are different ways in which researchers can access data from YouTube. If researchers want (or have) to collect YouTube data themselves in an automated manner, they can employ web scraping or make use of the YouTube API. While working with the API for collecting data has specific limitations, it can provide rich data to answer a variety of questions regarding the production, content, and consumption of videos on YouTube. For example, subtitles can be used to study the content of videos via automated content analyses. User comments, on the other hand, can be used to study user reactions and interactions, but also more specific topics, such as online harassment and hate speech.

There are several things that researchers need to consider when collecting, processing, and analysing YouTube data. These include API restrictions, the structure of YouTube data, researcher degrees of freedom, and potential pitfalls in (pre-)processing text and other types of data. Legal and ethical questions become particularly important if researchers want to archive and share data from YouTube in line with principles of open science. As these data are fairly new to the social sciences as well as to the archives that hold social science data, solutions for storing, documenting, and protecting social media data are still being developed.

Overall, while researchers have begun to see and exploit the value of YouTube data for answering various questions that are of interest to the social sciences, there is still much to do to develop and optimize methods and solutions for accessing, processing, analysing, archiving, and sharing these data. We hope that by providing some concrete examples, highlighting the potential as well as the limitations of YouTube data and the ways in which they have been used, and outlining some potential ways forward, this chapter can contribute to extending and improving the use of these data in social science research.


NOTES

1. There are several ways in which content producers can make money with YouTube, but all of them require subscriptions to the channel, a high number of subscriptions, and/or views of the content (YouTube, 2021). This is why the numbers of subscriptions and views are important metrics for content producers.
2. Of course, individual studies and projects can also look at more than one type of entity and pick various units of analysis.
3. By web scraping we refer to techniques that rely on extracting information from websites by parsing the underlying HTML documents.
4. See www.youtube.com/static?template=terms.
5. See www.youtube.com/robots.txt.
6. A prominent counter-example is Twitter, with the introduction of a dedicated access program for academic researchers for its v2 API endpoints in 2021 (see https://developer.twitter.com/en/products/twitter-api/academic-research).
7. See https://algorithmwatch.org/en/dataskop.
8. However, for a workshop we taught in February 2022, we created a detailed step-by-step guide which you can find here: https://jobreu.github.io/youtube-workshop-gesis-2022/slides/A0_YouTube_API_Setup/A0_YouTubeAPISetup.html.
9. Note that in case of accidentally sharing any credentials, they can be deleted via the Google Developer Console.
10. Google has recently introduced the YouTube Researcher Program (see https://research.youtube/) through which researchers can apply for an API quota increase. Notably, however, the process is not as straightforward as for other platforms and the application process can take some time.
11. For an overview of the different actions, see the API documentation at https://developers.google.com/youtube/v3/docs.
12. For a detailed list of quota costs, see the YouTube API documentation at https://developers.google.com/youtube/v3/determine_quota_cost.
13. One way of testing API requests interactively in the browser is the online API sandbox provided in the API documentation. See, e.g., https://developers.google.com/youtube/v3/docs/comments/list.
14. See https://console.cloud.google.com/home/dashboard.
15. Of course, there are also tools for other programming languages, such as Python.
16. See http://lexiurl.wlv.ac.uk/.
17. See https://github.com/strohne/Facepager/wiki.
18. When writing this chapter, we wanted to systematically test all of the tools listed here, but were, unfortunately, not able to get the desired collection of YouTube comments with Facepager to work. Hence, we do not list it in Table 14.1.
19. See https://github.com/soodoku/tuber.
20. See https://github.com/vosonlab/vosonSML.
21. See https://github.com/jooyoungseo/youtubecaption.
22. See https://github.com/Leibniz-HBI/Social-Media-Observatory/wiki/YouTube-Tools.
23. See https://socialmediadata.org/social-media-research-toolkit.
24. See https://developers.google.com/youtube/v3/docs/comments.
25. See https://github.com/jobreu/youtube-workshop-gesis-2022 (note: these materials also include a more detailed guide for data pre-processing as well as some further analysis examples).
26. Again, the full code for these examples can be found in the workshop materials we have linked to before.
27. See https://developers.google.com/youtube/terms/developer-policies.

REFERENCES

Alexa Traffic Ranks. (2021). How popular is youtube.com? www.alexa.com/siteinfo/youtube.com

Araujo, T., Wonneberger, A., Neijens, P., & de Vreese, C. (2017). How much time do you spend online? Understanding and improving the accuracy of self-reported measures of internet use. Communication Methods and Measures, 11(3), 173–190.
Araujo, T., Ausloos, J., van Atteveldt, W., Loecherbach, F., Moeller, J., Ohme, J., Trilling, D., van de Velde, B., de Vreese, C., & Welbers, K. (2022). OSD2F: An open-source data donation framework. Computational Communication Research, 4(2), 372–387. https://doi.org/10.5117/CCR2022.2.001.ARAU
Arthurs, J., Drakopoulou, S., & Gandini, A. (2018). Researching YouTube. Convergence: The International Journal of Research into New Media Technologies, 24(1), 3–15.
Auxier, B., & Anderson, M. (2021). Social media use in 2021. www.pewresearch.org/internet/2021/04/07/social-media-use-in-2021/
Boeschoten, L., Ausloos, J., Möller, J. E., Araujo, T., & Oberski, D. L. (2022). A framework for privacy preserving digital trace data collection through data donation. Computational Communication Research, 4(2), 388–423. https://doi.org/10.5117/CCR2022.2.002.BOES
Breuer, J., Bishop, L., & Kinder-Kurlanda, K. (2020). The practical and ethical challenges in acquiring and sharing digital trace data: Negotiating public-private partnerships. New Media & Society, 22(11), 2058–2080.
Breuer, J., Borschewski, K., Bishop, L., Vávra, M., Štebe, J., Strapcova, K., & Hegedűs, P. (2021). Archiving social media data: A guide for archivists and researchers. https://doi.org/10.5281/ZENODO.5041072
Bruns, A. (2019). After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11), 1544–1566.
Budzinski, O., & Gaenssle, S. (2018). The economics of social media (super-)stars: An empirical investigation of stardom and success on YouTube. Journal of Media Economics, 31(3–4), 75–95.
Burgess, J., & Green, J. (2018). YouTube: Online video and participatory culture (Second edition). Polity Press.
Chen, K., Jeon, J., & Zhou, Y. (2021). A critical appraisal of diversity in digital knowledge production: Segregated inclusion on YouTube. New Media & Society, 146144482110348.
Defy Media. (2017). Acumen report: Youth video diet. www.newsroom-publicismedia.fr/wp-content/uploads/2017/06/Defi-media-acumen-Youth_Video_Diet-mai-2017.pdf
Dinkov, Y., Ali, A., Koychev, I., & Nakov, P. (2019). Predicting the leading political ideology of YouTube channels using acoustic, textual, and metadata information. http://arxiv.org/abs/1910.08948
Döring, N., & Mohseni, M. R. (2019a). Male dominance and sexism on YouTube: Results of three content analyses. Feminist Media Studies, 19(4), 512–524.
Döring, N., & Mohseni, M. R. (2019b). Fail videos and related video comments on YouTube: A case of sexualization of women and gendered hate speech? Communication Research Reports, 36(3), 254–264.
Döring, N., & Mohseni, M. R. (2020). Gendered hate speech in YouTube and YouNow comments: Results of two content analyses. Studies in Communication and Media, 9(1), 62–88.
Freelon, D. (2018). Computational research in the post-API age. Political Communication, 35(4), 665–668.
Haim, M., & Nienierza, A. (2019). Computational observation: Challenges and opportunities of automated observation within algorithmically curated media environments using a browser plug-in. Computational Communication Research, 1(1), 79–102.
Halavais, A. (2019). Overcoming terms of service: A proposal for ethical distributed research. Information, Communication & Society, 22(11), 1567–1581.
Kaiser, J., & Rauchfleisch, A. (2020). Birds of a feather get recommended together: Algorithmic homophily in YouTube’s channel recommendations in the United States and Germany. Social Media + Society, 6(4), 205630512096991.
Kaye, L. K., Malone, S. A., & Wall, H. J. (2017). Emojis: Insights, affordances, and possibilities for psychological science. Trends in Cognitive Sciences, 21(2), 66–68.
Kemp, S. (2021). The world’s most-used social platforms: The latest global active user figures (in millions) for a selection of the world’s top social media platforms. DataReportal. https://datareportal.com/reports/digital-2021-july-global-statshot

Kohler, S., & Dietrich, T. C. (2021). Potentials and limitations of educational videos on YouTube for science communication. Frontiers in Communication, 6, 581302.
Konijn, E. A., Veldhuis, J., & Plaisier, X. S. (2013). YouTube as a research tool: Three approaches. Cyberpsychology, Behavior, and Social Networking, 16(9), 695–701.
Lange, P. G. (2007). Commenting on comments: Investigating responses to antagonism on YouTube. In Society for Applied Anthropology Conference, 31(1), pp. 163–190.
Lasswell, H. D. (1948). The structure and function of communication in society. The Communication of Ideas, 37(1), 136–139.
Linke, C., Prommer, E., & Wegener, C. (2020). Gender representations on YouTube: The exclusion of female diversity. M/C Journal, 23(6).
Literat, I., & Kligler-Vilenchik, N. (2021). How popular culture prompts youth collective political expression and cross-cutting political talk on social media: A cross-platform analysis. Social Media + Society, 7(2).
Mancosu, M., & Vegetti, F. (2020). What you can scrape and what is right to scrape: A proposal for a tool to collect public Facebook data. Social Media + Society, 6(3), 1–11.
Menchen-Trevino, E. (2016, March). Web historian: Enabling multi-method and independent research with real-world web browsing history data. IConference 2016 Proceedings. https://doi.org/10.9776/16611
Montes-Vozmediano, M., García-Jiménez, A., & Menor-Sendra, J. (2018). Teen videos on YouTube: Features and digital vulnerabilities. Comunicar, 26(54), 61–69.
Moor, P. J., Heuvelman, A., & Verleur, R. (2010). Flaming on YouTube. Computers in Human Behavior, 26(6), 1536–1546.
Oksanen, A., Hawdon, J., Holkeri, E., Näsi, M., & Räsänen, P. (2014). Exposure to online hate among young social media users. In M. N. Warehime (Ed.), Sociological studies of children and youth (Vol. 18, pp. 253–273). Emerald Group Publishing.
Parry, D. A., Davidson, B. I., Sewall, C. J. R., Fisher, J. T., Mieczkowski, H., & Quintana, D. S. (2021). A systematic review and meta-analysis of discrepancies between logged and self-reported digital media use. Nature Human Behaviour, 5, 1535–1547.
Rauchfleisch, A., & Kaiser, J. (2020). The German far-right on YouTube: An analysis of user overlap and user comments. Journal of Broadcasting & Electronic Media, 64(3), 373–396.
Rauchfleisch, A., & Kaiser, J. (2021). Deplatforming the far-right: An analysis of YouTube and BitChute. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3867818
Ribeiro, M. H., & West, R. (2021a). YouNiverse: Large-scale channel and video metadata from English YouTube (1.1). Dataset, Zenodo. https://doi.org/10.5281/ZENODO.4650046
Ribeiro, M. H., & West, R. (2021b). YouNiverse: Large-scale channel and video metadata from English-speaking YouTube. http://arxiv.org/abs/2012.10378
Rieder, B., Coromina, Ò., & Matamoros-Fernández, A. (2020). Mapping YouTube. First Monday. https://doi.org/10.5210/fm.v25i8.10667
Röchert, D., Neubaum, G., Ross, B., Brachten, F., & Stieglitz, S. (2020). Opinion-based homogeneity on YouTube. Computational Communication Research, 2(1), 28.
Scharkow, M. (2016). The accuracy of self-reported internet use: A validation study using client log data. Communication Methods and Measures, 10(1), 13–27.
Smith, A., Toor, S., & van Kessel, P. (2018). Many turn to YouTube for children’s content, news, how-to lessons. www.pewresearch.org/internet/2018/11/07/many-turn-to-youtube-for-childrens-content-news-how-to-lessons/
Spörlein, C., & Schlueter, E. (2021). Ethnic insults in YouTube comments: Social contagion and selection effects during the German ‘refugee crisis’. European Sociological Review, 37(3), 411–428.
Subramanian, J., Sridharan, V., Shu, K., & Liu, H. (2019). Exploiting emojis for sarcasm detection. International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, 70–80.
Szostak, N. (2013). Girls on YouTube: Gender politics and the potential for a public sphere. McMaster Journal of Communication, 8, 46–58.
Thelwall, M. (2018). Social media analytics for YouTube comments: Potential and limitations. International Journal of Social Research Methodology, 21(3), 303–316.

Thelwall, M., & Mas-Bleda, A. (2018). YouTube science channel video presenters and comments: Female friendly or vestiges of sexism? Aslib Journal of Information Management, 70(1), 28–46.
Thelwall, M., Sud, P., & Vis, F. (2012). Commenting on YouTube videos: From Guatemalan rock to El Big Bang. Journal of the American Society for Information Science and Technology, 63(3), 616–629.
Tucker-McLaughlin, M. (2013). YouTube’s most-viewed videos: Where the girls aren’t. Women & Language, 36(1), 43–49.
Utz, S., & Wolfers, L. N. (2020). How-to videos on YouTube: The role of the instructor. Information, Communication & Society, 25(7), 959–974.
Van Atteveldt, W., Althaus, S., & Wessler, H. (2020). The trouble with sharing your privates: Pursuing ethical open science and collaborative research across national jurisdictions using sensitive data. Political Communication, 38(1–2), 192–198.
Wotanis, L., & McMillan, L. (2014). Performing gender on YouTube: How Jenna Marbles negotiates a hostile online environment. Feminist Media Studies, 14(6), 912–928.
Yang, C., Hsu, Y.-C., & Tan, S. (2010). Predicting the determinants of users’ intentions for using YouTube to share video: Moderating gender effects. Cyberpsychology, Behavior, and Social Networking, 13(2), 141–152.
YouTube. (2021). How to earn money on YouTube. https://support.google.com/youtube/answer/72857?hl=en

15. Automated image analysis for studying online behaviour

Carsten Schwemmer, Saïd Unger, and Raphael Heiberger

1 COMPUTATIONAL METHODS FOR THE SOCIAL SCIENCES

The advent of the internet led to unprecedented quantities of data monitoring human behaviour. Massive, mostly online, data sources record many varieties of social interactions. The scientific examination and analysis of these digital traces have often been dubbed ‘computational social science’ (CSS) (Lazer et al., 2009, 2020). The still relatively young field of CSS has seen tremendous growth over the last decade, accompanied by new forms of digital traces driven by social media platforms as well as methodological innovations stemming from the computer and natural sciences. Digital data and the influx of methods have revealed new insights into well-known social phenomena (Heiberger & Riebling, 2016). By using data on Facebook friends, for instance, Lewis et al. (2012) shed light on the importance of shared tastes (in particular, satiric movies and jazz music) for creating relationships in a cohort of college students, in addition to established effects of homophily. Voigt et al. (2017) used transcripts taken from body cameras to show that Oakland’s police officers treat African Americans with less respect and greater informality. In general, the analysis of complex and unstructured data may be seen as one of the most important pillars of CSS.

One subfield that received enormous attention in the last decade is automated text analysis. The sheer explosion of textual information offers completely new insights into human behaviour, since nowadays everybody is constantly exposed to text and most people are text producers themselves. Textual traces are ubiquitous and can be found in blogs and articles on the web, social media posts or comments, messages of all sorts, e-books, and digital representations of vast (physical) libraries. Researchers trying to utilize vast collections of textual traces – which often exceed what an individual could read in an entire lifetime – thereby rely on established methods stemming from information retrieval and natural language processing (Manning et al., 2008).

The usage of texts is a good example for illustrating two important methodological aspects of CSS: the use of supervised (SML) and unsupervised machine learning (UML) (Jordan & Mitchell, 2015; Molina & Garip, 2019; see also Erhard & Heiberger, Chapter 7 and Kashyap et al., Chapter 3 in this volume). SML (e.g., support vector machines) uses labelled data as input to train algorithms to predict outcomes for unlabelled data. To give one example, an SML classifier might be trained on a textual dataset for which information (labels) about a quantity of interest, such as whether a text is about immigration, is available. After training, this classifier can then be used to predict whether a text is about immigration in other (unlabelled) datasets. In contrast, UML (e.g., topic models) detects underlying patterns in unlabelled observations by exploiting the statistical properties of the data. UML has been used to derive patterns from large text corpora. Hence, no labelled training data is needed to retrieve information.

The possibility to detect thematic patterns ‘automatically’ in texts has proven to be very reliable in practical applications and is of particular value to the social sciences (McFarland et al., 2013), some pitfalls notwithstanding (Grimmer & Stewart, 2013). UML has been used to trace the development of public (Farrell, 2016) or scientific discourses (Schwemmer & Wieczorek, 2020), but also to derive metrics from the summarized texts to assess, for instance, the impact of writing on scientific careers (Heiberger et al., 2021).

SML is also often used to classify texts. The main aim of SML is to predict an outcome with a given set of ‘features’ (independent variables in traditional statistics; see Erhard & Heiberger, Chapter 7 in this volume). This is the same as when social scientists refer to estimating a dependent variable by using a set of independent variables. Unlike social scientists, who normally use one dataset for modelling, however, SML relies on at least two datasets: training and test data. The first dataset is used to develop (i.e., train) the model, the second to test its predictive capacity on out-of-sample data. To illustrate this, consider, again, texts. Prelabelled text snippets are annotated with certain characteristics; important instances comprise emotions (Bail et al., 2017) or sentiments (Tumasjan et al., 2010). Researchers can then use these training data to label large amounts of unlabelled text, a routine that is also used in many other applications involving artificial intelligence (Jordan & Mitchell, 2015).

The analysis of texts has largely benefited from increasing computing power and the experience of neighbouring disciplines like computational linguistics and computer science, and might now be considered part of the mainstream in many leading social science outlets. Regardless of whether SML or UML models are used, social scientists ultimately apply these methods for quite different research goals in comparison to more technically oriented disciplines, as outlined by Hopkins and King (2010, 230f.):

computer scientists may be interested in finding the needle in the haystack (such as … the right web page to display from a search), but social scientists are more commonly interested in characterizing the haystack. Certainly, individual document classifications, when available, provide additional information to social scientists, since they enable one to aggregate in unanticipated ways, serve as variables in regression-type analyses, and help guide deeper qualitative inquiries into the nature of specific documents. But they do not usually … constitute the ultimate quantities of interest.
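To make the SML workflow concrete, the following minimal sketch implements the immigration example from above in Python with scikit-learn. The toy texts and labels are invented for illustration; any real application would require a much larger labelled dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labelled training data: 1 = text is about immigration, 0 = it is not.
train_texts = [
    "parliament debates new asylum and immigration rules",
    "border officials report a rise in visa applications",
    "the championship game ended in a dramatic draw",
    "the band released their long-awaited second album",
]
train_labels = [1, 1, 0, 0]

# Features (TF-IDF weighted word frequencies) followed by a linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

# The trained model can now label previously unseen, unlabelled texts.
unlabelled = ["senators discuss immigration reform", "the striker scored twice"]
print(classifier.predict(unlabelled))  # e.g., [1 0]
```

A UML analysis, by contrast, would fit a topic model to the unlabelled corpus directly, without any train_labels, leaving the interpretation of the discovered topics to the researcher.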

Only recently have some methods for automated text analysis been developed that are dedicated to the typical research goals of social scientists (e.g., see Roberts et al., 2016). However, besides textual data, another ubiquitous data source remains widely unexplored in this respect: visual data. Images, in particular, are a powerful medium of communication. They are more likely to be remembered than words (Grady et al., 1998; Whitehouse et al., 2006), evoke stronger human emotions (Brader, 2005; Casas & Williams, 2019), and stimulate more engagement than text on social media platforms (Li & Xie, 2020). Despite their importance for both research and industry applications, their analysis remains computationally demanding, as image recognition is mostly implemented as SML systems requiring tens of thousands or more labelled images.

In this chapter, we will outline the potential of image analysis for social science research. First, we will provide an overview of early social science studies applying automated image analysis and provide recommendations for researchers interested in getting started with this topic. Next, we will present an empirical case study to showcase how image analysis can be utilized for researching online behaviour. We focus on Instagram posts of United States (US) Members of Congress (MCs) and the wearing of face masks during the early stage of the COVID-19 pandemic in 2020.

Results of our automated image analyses indicate substantial differences over time and between party lines for the likelihood of posting images in which people wear masks. We conclude by reflecting upon the implications of our findings and outlining future directions for further establishing automated image analysis in the social sciences.

2 AUTOMATED IMAGE ANALYSIS: AN OVERVIEW

While early versions of computer vision models were already applied successfully in the 1990s, for instance, to detect hand-written digits (Lecun et al., 1998), a breakthrough happened over the last decade. Both technical and computational advances as well as the availability of an abundance of digitized data led to a wide and still increasing range of applications for automated image analysis. Regarding the social sciences, the number of studies applying automated image analysis is still low in this early phase of establishment. Nevertheless, a couple of interesting applications are worth highlighting.

One of the substantive research fields where image recognition has already been applied multiple times is collective action and protest behaviour. Zhang and Pan combine deep-learning approaches for both text and image data to address one of the challenges in this field: identifying collective action in the first place (Zhang & Pan, 2019). Another study focuses on the relation between state and protester violence and protest size; its authors applied convolutional neural networks to classify protest scenes and facial recognition to estimate gender and age compositions as well as the size of protests (Steinert-Threlkeld et al., 2021). Other applications in the field of collective action include the analysis of online mobilization via pictures and related emotions (Casas & Williams, 2019) as well as the identification of recruitment messages in propaganda campaigns (Mitts et al., 2022). Furthermore, scholars applied image recognition to infer political ideology from social media images (Xi et al., 2020). Also focusing on political representatives, Peng examines the popularity factors of their social media posts, finding that self-personalization strategies increase audience engagement (Peng, 2020). Recent studies also applied image recognition to examine media representations. Jürgens et al. (2022) used facial recognition models to estimate age and gender compositions on German television over time. Similarly, others combined facial recognition with automated text analysis to study the representation of women and ethnic groups in the digital sphere (Schwemmer & Jungkunz, 2019).

Summarizing these studies from a methodological point of view, their applications are strongly aligned with three domains: object recognition, facial recognition, and visual sentiment analysis (Webb Williams et al., 2020). Within these domains, however, social science applications vary substantially. Facial recognition, for instance, is not only used to identify faces and count people, but also to infer expressions (Küntzler et al., 2021) and demographic attributes. The substantive and methodological variety in this small number of early studies already appears promising for future social science applications.

However, despite good reasons to be optimistic about their usefulness, there are important caveats that need to be considered. A growing number of instances have been published in which computer vision models produced problematic results. An entire subfield of computer science, called Fairness in Artificial Intelligence, engages with this topic, and it is beyond the scope of this chapter to address all uncovered issues in full detail.

Instead, we highlight a couple of examples that might be of particular interest to social scientists. As outlined, a common task for facial recognition models across domains is to infer attributes of the persons depicted. Buolamwini and Gebru examined commercial facial recognition systems for gender classification using a dataset of politicians with varying skin tones (Buolamwini & Gebru, 2018). Their results received widespread attention as they identified two major problems. First, classification systems were substantially better at identifying men in comparison to identifying women. Second, the recognition rate was even lower for women with dark skin tones. Schwemmer et al. also found that commercial image recognition systems are worse at recognizing women in comparison to men. Furthermore, they showed that annotations of image recognition systems can produce output that reinforces and even amplifies harmful gender stereotypes (Schwemmer et al., 2020). Results of other studies show that biases not only occur when it comes to sociodemographic attributes of persons such as gender or ethnicity. Using a geographically diverse dataset, scholars qualitatively examined the output of object recognition systems (De Vries et al., 2019). They found that both location and culture can drastically affect the accuracy of object annotations. To give one example, spices in a kitchen in Western cultures look very different from those in other parts of the world. As a result, systems that are presumably trained on image data from Western cultures more often misclassify spices depicted in kitchens of other regions.

What can be learned from these concerning results is that social scientists and other practitioners should not treat the output of computer vision models as ground truth. Similar to important principles for automated text analysis, we recommend careful inspection of the output of image recognition systems before using them for research purposes. To put it in the words of Grimmer and Stewart (2013, 271): ‘validate, validate, validate’.

Besides advocating for a cautious mindset when applying automated image analysis, we recommend a couple of resources for social scientists who would like to engage more with the topic. First, there are excellent and freely available books covering a general introduction to deep-learning models from a technical perspective (Goodfellow et al., 2016). Other books focus more on the application of deep-learning models and include code for particular programming languages such as R (Chollet & Allaire, 2018). Written specifically for social scientists, Webb Williams et al. (2020) provide a great introduction to images as data by providing examples of convolutional neural networks, one of the most common model architectures for image recognition.

Beyond such introductory material, and regarding the decision of which computer vision method to use, the first and foremost consideration should be whether it is appropriate for the research task of interest. To provide one example, a model that is supposed to identify mobile phones, but is trained only with data on current phone products, will perform poorly on phone images originating from two decades ago, as mobile phones looked substantially different then. In addition, decisions regarding model choice are affected by resource constraints. We provide an overview of the different methods for automated image analysis and related demands in Table 15.1.
Social scientists might, for instance, ask themselves: Can I conduct an automated image analysis on a conventional laptop? The answer depends not only on the research task but also on the corresponding method. For instance, training a new model, that is, a model that is not already learned from some input data but is trained from scratch, usually comes with very high computational demands. Some recent models have billions of parameters (Zhai et al., 2021) and may need to be trained for hundreds of hours even when using hardware specifically designed for this task.

Likewise, training a model from scratch requires advanced technical expertise beyond basic programming knowledge. At the same time, new computer vision models are for the most part supervised (SML) and usually require large annotated datasets with a lot of labelled data for some quantity of interest. As an alternative, transfer learning can be used to slightly adjust and retrain an available model for the task of interest (Rawat & Wang, 2017). Coming back to the example regarding mobile phones, a large model that is trained to identify many objects, including mobile phones, might only require updates to some parameters (‘model layers’) and a smaller training corpus of historical phones to achieve good performance. While the demand for technical expertise is still relatively high, computational demands, as well as demands for the training data, are lower in comparison to training a new model.

Available, pre-trained models can also be used without any retraining, which ultimately requires no dataset for training at all. Computational resource requirements are also lower, as producing estimates with a model is less demanding than model training. In addition, basic programming skills are often sufficient for applying pre-trained models. Despite these advantages, this comes with less flexibility, as existing models are often trained for purposes that do not necessarily align with social science research goals. This also applies to using commercial image recognition systems or programming interfaces, which require low computational resources, low technical expertise, and for the most part no training data. However, transparency regarding the inner workings of such closed systems is basically non-existent, and they have to be treated as black boxes. Moreover, commercial image recognition systems can often change silently, resulting, for instance, in problems for reproducible research (Schwemmer et al., 2020).

Table 15.1  Different methods and their demands for applying image recognition

Method                                        Demand: computational system   Demand: technical expertise   Demand: training data
Train new model                               High                           High                          High
Apply transfer learning                       Medium                         High                          Low
Use pre-trained model                         Low                            Medium                        None
Use commercial system/programming interface   Low                            Low                           None

As we outlined in this section, automated image analysis has already been applied successfully in a small number of social science studies. At the same time, while the flexibility and variety of computer vision models are quite promising, their use often comes with biases requiring careful validation. Likewise, different methods come with their own demands regarding training data, computational resources, and technical expertise that need to be considered when choosing a model for a given research task. To provide a practical example of using automated image analysis for social science research, the next section presents a case study using images from US Representatives shared on the social networking site Instagram, showcasing how image analysis can be utilized for studying online behaviour. We focus on Instagram posts of US MCs and the wearing of face masks during the early stage of the COVID-19 pandemic in 2020.
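To illustrate the transfer learning row of Table 15.1, the following sketch adapts a pre-trained network with PyTorch and torchvision, one possible toolkit among several: all pre-trained parameters are frozen and only a newly added classification head is trained, which is why both the computational and the training data demands drop relative to training from scratch.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet as a generic feature extractor
# (older torchvision versions use pretrained=True instead of weights).
model = models.resnet18(weights="DEFAULT")

# Freeze all pre-trained parameters ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace the classification head with a new, trainable layer,
# e.g., for a binary task such as current versus historical mobile phones.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the parameters of the new head are handed to the optimizer;
# the usual training loop over the (much smaller) labelled dataset follows.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-3)
```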


3 CASE STUDY: ONLINE BEHAVIOUR OF US CONGRESS MEMBERS DURING THE EARLY STAGES OF THE COVID-19 PANDEMIC

3.1 Theoretical Background

With the first recorded case of the novel coronavirus SARS-CoV-2 (COVID-19) in the US in late January 2020, the response towards the growing threat of a pandemic was mainly concerned with so-called ‘non-pharmaceutical interventions’ to reduce the spread of the virus and the burden on hospitals. The first response in most states, after declaring a state of emergency, was the closing of non-essential businesses, stay-at-home orders, and a reduction of social gatherings in early February. The first introduction of mask mandates followed in mid-April; it was, however, much less of a general nationwide response, with some states and counties requiring mask wearing only as late as July and others not requiring it at all (Bergquist et al., 2020).

With the rising politicization of the COVID-19 pandemic in general and responses such as non-pharmaceutical interventions in particular, the usage of masks to prevent the spread of the virus became one of the most divisive topics in the US, even though masks are considered to help reduce the spread of the virus (Chu et al., 2020; Eikenberry et al., 2020; Leung et al., 2020). On Twitter, for example, the issue of mask wearing accounted for about 65 per cent of all COVID-19-related tweets from January to October 2020 (Al-Ramahi et al., 2021). While differences in individual mask adoption can be in part explained by demographics, especially as women and older people wear masks more frequently (Haischer et al., 2020), the most predictive factor of attitudes towards mask wearing and containment strategies in general was party affiliation (Druckman et al., 2021; Makridis & Rothwell, 2020). This divide over anti-COVID measures was closely associated with the presidential race leading up to the elections at the end of the first pandemic year and the ongoing polarization between the Republican and Democratic parties (Boxell et al., 2020). Then-president Donald Trump publicly rejected mask wearing and downplayed masks’ protective value against the virus, making not wearing a mask part of his campaign, while his rival Joe Biden presented himself wearing a mask on many campaigning occasions early on in the pandemic (Neville-Shepard, 2021). Cues like mask wearing in public and support for or opposition to non-pharmaceutical interventions can in turn influence the decision of party supporters to also support proposed policies (Lenz, 2012), such as adhering to anti-COVID measures.

Images have been shown to elicit responses faster than textual data and to be more emotionally triggering. This leads to mobilizing effects among the recipients, especially if the images evoke fear or enthusiasm (Casas & Williams, 2019). Sharing pictures with people wearing masks, or entirely refraining from showing masks in shared pictures, can fulfil functions like agenda setting and image building (Schill, 2012), for instance, the image of caring for fellow citizens in the case of mask wearing, or portraying strength or freedom in the case of not showing masks. The effects of shared images on an audience make them especially important to political campaigns and the presentation of politicians to their audience. As simply wearing or not wearing a mask is a visual cue that can be transmitted without accompanying text or explanation, and regardless of context, we want to investigate whether US politicians integrate masks into their public presentation.

We look at social media, specifically Instagram, for the following reasons. Social media platforms are important media outlets for politicians, where they can address prospective voters directly, bypassing traditional media (Margetts, 2017), while also supplying content that the same traditional media can report on (Shapiro & Hemphill, 2017). As such, the adoption of social media platforms, especially Twitter, Facebook, and YouTube, is at an all-time high, with Twitter and Facebook being used by nearly 100 per cent of MCs (Congressional Research Service, 2018). Instagram, which is set apart from Twitter and Facebook mainly by focusing on images instead of text or a combination of text and images, has seen a strong increase in usage by MCs in recent years, reaching an adoption of 80 per cent in 2018 (O’Connell, 2018). This also makes the usage of Instagram by politicians an underinvestigated but promising field for sociological, political, and communication research (Bast, 2021).

3.2 Data

For our empirical case study on MCs’ online communication during the early COVID-19 pandemic, we collected Instagram data in June 2020 via the Crowdtangle platform (Team Crowdtangle, 2022). Crowdtangle regularly curates lists of account collections, including a list for the 116th US Congress with both Senate and House Representatives. After retrieving the list via their application programming interface, we conducted quality checks for every account. In this process, we removed accounts not belonging to individual persons (such as party accounts), unofficial or private accounts of individuals, as well as accounts of persons no longer in office. We then merged our validated input list with publicly available data on the MCs, including sociodemographics.1 Next, we curated a list of all posts by MCs in the period between March and mid-June 2020, as shown in the sketch below. We exclude January and February 2020, as COVID-19, and with it face masks, was hardly debated in the public sphere in the US at that time. In addition to this temporal restriction, we excluded a very low number of posts from third-party and independent MCs. We then downloaded all 12,800 images that were included in the remaining 9,000 posts. For data collection and analysis, we relied on multiple open-source tools and scientific software (Lüdecke, 2018; McKinney, 2010; R Core Team et al., 2013; Wickham et al., 2019).

One example from our sample is an Instagram post including multiple pictures showing a Republican Representative (Kevin McCarthy) welcoming new House members (www.instagram.com/p/CAYWI0pJ2pe/). In those images, no person is wearing a face mask. Another exemplary Instagram post (www.instagram.com/p/CBHZClwnHzY/) shows a Representative (Gil Cisneros) participating in a protest for George Floyd, who was murdered by a police officer in 2020. The corresponding images show multiple people visibly wearing face masks.
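As a rough illustration of the filtering and merging steps described above, consider the following Python sketch. The file and column names are hypothetical stand-ins for the Crowdtangle export and the public legislators dataset, not the exact code used in the study.

```python
import pandas as pd

posts = pd.read_csv("instagram_posts.csv", parse_dates=["date"])
legislators = pd.read_csv("legislators.csv")  # validated accounts with sociodemographics

# Attach party, gender, and birthday of the account owner to every post.
posts = posts.merge(
    legislators[["account", "party", "gender", "birthday"]], on="account"
)

# Restrict to March to mid-June 2020; masks were hardly debated before March.
posts = posts[posts["date"].between("2020-03-01", "2020-06-15")]

# Exclude the very small number of posts by third-party and independent MCs.
posts = posts[posts["party"].isin(["Democrat", "Republican"])]

# Approximate each MC's age at the time of posting.
posts["age"] = (posts["date"] - pd.to_datetime(posts["birthday"])).dt.days // 365
```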

3.3 Methods

To analyse the likelihood of MCs posting images that show people wearing face masks, we rely on the open-source computer vision framework CompreFace (Team CompreFace, 2022). CompreFace is built upon other popular software for facial recognition tasks, such as OpenFace (Amos et al., 2016). In addition to common facial recognition features, CompreFace offers a model for detecting the use of face masks. We first use the main CompreFace model to identify images in our sample of 12,800 in which any persons (as operationalized by their faces) are present. This reduces our number of observations to roughly 6,650, which is unsurprising, as US MCs often post images without any visible humans, such as images of documents they signed and would like to share with their constituents. Next, we apply the face mask recognition model to our remaining sample. An important hyperparameter to be set is a probability threshold for the model’s confidence in a prediction. A threshold of 100 per cent would lead to a very low number of false positives (e.g., a predicted mask although none is present), but also to a high number of false negatives (e.g., the prediction of no face mask although a person does indeed wear one). After qualitative validation of the model output, we opted for a threshold of 80 per cent, which resulted in a good balance in model performance for our sample. From these model predictions, we constructed our binary dependent variable of interest, which takes the value 1 if any person in a given image is wearing a mask and 0 otherwise. Finally, we use a downstream logistic regression model to analyse the effects of four variables of interest: date of 2020, MC party, MC gender, and MC age.
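The following sketch illustrates this downstream step, assuming the per-face mask probabilities have already been exported from the detector. File and column names are hypothetical, and statsmodels stands in for whatever regression software is actually used.

```python
import pandas as pd
import statsmodels.formula.api as smf

faces = pd.read_csv("face_predictions.csv")        # columns: image_id, mask_probability
images_meta = pd.read_csv("image_covariates.csv")  # image_id, day_of_year, party, gender, age

# Apply the 80 per cent confidence threshold chosen after qualitative validation.
faces["mask"] = faces["mask_probability"] >= 0.80

# Dependent variable: 1 if any detected person in the image wears a mask.
any_mask = faces.groupby("image_id")["mask"].any().astype(int).rename("any_mask")
data = any_mask.reset_index().merge(images_meta, on="image_id")

# Logistic regression on the four variables of interest; categorical predictors
# (party, gender) are dummy-coded automatically by the formula interface.
fit = smf.logit("any_mask ~ day_of_year + party + gender + age", data=data).fit()
print(fit.summary())
```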

3.4 Results

Figure 15.1 plots the predicted probability of an image showing a person with a face mask for all of our variables of interest. All probability estimates were produced via a logistic regression model while holding other variables at their observed values. In all subfigures, circles and lines denote point estimates, while bars and ribbons denote 95 per cent confidence intervals. As can be seen in panel A of Figure 15.1, the longer the pandemic lasted, the higher the likelihood of an image showing a person with a face mask. A steep increase is visible around April, when mask recommendations and mandates were introduced in the US. Panel B further demonstrates that Democratic MCs are substantially more likely to post images of people wearing face masks than Republican MCs. Regarding the age of MCs, panel C shows a comparatively weaker effect: older MCs are slightly more likely to circulate images with masks than younger MCs (the youngest MC in our dataset is 34 years old). Finally, panel D shows predicted probabilities for both men and women. Results suggest a very weak to almost no gender effect, as both gender groups are about equally likely to post images of people wearing face masks.

Figure 15.1  Predicted probabilities for images depicting persons wearing face masks

Note: Results are based on computer vision as well as logistic regression models applied to Instagram posts of US Members of Congress. Panels A–D show probability estimates for the date of the year 2020 (A), party (B), age (C), and gender (D). Circles and lines denote point estimates. Bars and ribbons denote 95 per cent confidence intervals.

4 CONCLUSIONS

In this chapter, we introduced computational methods and, in particular, methods of automated image analysis for social science applications. With the increasing availability of digital content, including a vast amount of visual data, we argued that research on online behaviour can benefit from applying computational methods to understand the content of images. Our literature review showed that a small but growing number of studies have already applied automated image analysis to tackle important societal issues. Furthermore, we provided an overview of ongoing challenges, such as potential biases of computer vision models, as well as demands related to technical skills, computational effort, and training data that need to be considered.

In addition, we would like to emphasize that while social scientists might wonder what the best method for a certain image recognition task is, this is unsurprisingly difficult to assess. Computer vision is a rapidly evolving field, and what is considered state of the art changes frequently. To provide one example, transformers, originally applied in natural language processing, are at the time of writing matching and in some instances surpassing the performance of convolutional neural networks while requiring fewer computational resources (Dosovitskiy et al., 2020; Paul & Chen, 2021).

To provide a practical application of image recognition tools established at the time of writing, we presented an empirical case study focusing on Instagram posts of US MCs and the wearing of face masks during the early stage of the COVID-19 pandemic in 2020. Regarding the results of our study, the steep increase over time in the probability of posting images of persons wearing face masks is most likely associated with the introduction of mask mandates in April 2020. This fits well with the mobilizing and agenda-setting functions of shared images: posting such images on Instagram could show the adherence of legislators to the mandates, further legitimizing them (Casas & Williams, 2019; Schill, 2012). The notable difference between Democratic and Republican MCs in the probability of persons wearing face masks in shared images can be attributed to the stark political polarization in the US, especially regarding the use of face masks to slow down the pandemic as well as the issuing of mask mandates (Boxell et al., 2020). It also mirrors the attitudes of supporters of the respective parties towards wearing face masks (Makridis & Rothwell, 2020). Furthermore, we found a higher probability of mask wearing for older MCs, which could be due to their higher risk of a severe course of a COVID-19 infection; the effect is, however, not as pronounced as the effects over time and for party affiliation. Finally, regarding the effect of MC gender on the probability of posting Instagram images including face masks, we see a somewhat counterintuitive result, with a slightly higher probability for male MCs to share images with masks (Haischer et al., 2020). However, it is important to note that the effect size is small.

Moreover, as computer vision systems may perform worse at recognizing women in comparison to men (Buolamwini & Gebru, 2018; Schwemmer et al., 2020), related biases could also affect our results.

Potential biases of our applied computer vision models are one of the limitations of our case study that we need to highlight. Qualitative validations of our model output did not suggest substantial problems concerning overall recognition rates and gender differences for our particular case. Nevertheless, a more sophisticated validation, for instance by relying on crowdsourcing (Shank, 2016), could help to better identify potential issues with the model predictions. Regarding our study design, identifying the sharing of images of people wearing masks is certainly an important factor in health communication on social media. Nevertheless, our design does not allow evaluating other important aspects, such as communication via text or audio, or examining other visual aspects, such as stances for washing hands. Future research could tackle this shortcoming by using multiple models for jointly analysing text, image, video, and audio data.

Although our case study showed the potential of automated image analysis, analysing large amounts of visual data also points to severe challenges linked to CSS and, in particular, the promises and pitfalls of SML. SML only works (well) with sizable quantities of labelled training data. And, as our example of image recognition clearly illustrates, this often means turning to data obtained by tech companies like Instagram and Google. Their data are just not for free. The proprietary use of information and input data impedes some fundamental ways of doing research (Merton, 1973), for instance, by making it hard or even impossible for peers to obtain the same dataset or utilize the same methods. This issue is furthered by the fact that researchers using any application programming interface have no control over how the retrieved data were collected or labelled. This might also partly explain why social scientists are somewhat slow to adopt SML: a refusal of industry-made black boxes and of dependencies on the goodwill of corporations.

A more technical obstacle accompanying many CSS methods is the lack of common standards, in particular regarding the many decisions (i.e., hyperparameters) involved in running UML or SML. Those decisions might change results considerably. However, using several options at crucial bifurcations and some slowly evolving standards (e.g., when it comes to the choice of the number of topics, see Heiberger & Munoz-Najar Galvez, 2021) seems to be the best practice, as it has been for statistical methods for many years.

While there are more challenges accompanying the use of digital data (Lazer et al., 2014), we would like to close with one especially tempting pitfall of ‘big data’: its platform specificity (Lewis, 2015). Our case study might once more serve as an example. While we conducted several quality checks to ensure that our data captured our target population (US MCs and their content on Instagram during the early COVID-19 pandemic in 2020), we still cannot rule out algorithmic confounding of the platform. This might result, for instance, in only retrieving a subset of the data. This issue is even more severe for other populations of interest for whom it is harder to provide adequate samples, e.g., all citizens of a certain country.
Regardless of the number of observations used in an analysis, data stemming from social media and many other digital outlets are most often platform specific as well as algorithmically confounded (Salganik, 2019), and so are all insights derived from such data. This does not mean that images, texts, or other content posted on social media have no social implications or relevance. It does mean, though, that social scientists need to establish standards for digital traces (see Sen et al., 2021) so that we can use these promising data sources with the same rigour that has been applied to, for instance, survey data for decades.


NOTE

1. Data are available at https://github.com/unitedstates/congress-legislators.

REFERENCES

Al-Ramahi, M., Elnoshokaty, A., El-Gayar, O., Nasralah, T., & Wahbeh, A. (2021). Public discourse against masks in the COVID-19 era: Infodemiology study of Twitter data. JMIR Public Health and Surveillance, 7(4), e26780.
Amos, B., Ludwiczuk, B., & Satyanarayanan, M. (2016). Openface: A general-purpose face recognition library with mobile applications. https://elijah.cs.cmu.edu/DOCS/CMU-CS-16-118.pdf
Bail, C. A., Brown, T. W., & Mann, M. (2017). Channeling hearts and minds: Advocacy organizations, cognitive-emotional currents, and public conversation. American Sociological Review, 82(6), 1188–1213.
Bast, J. (2021). Politicians, parties, and government representatives on Instagram: A review on research approaches, usage patterns, and effects. Review of Communication Research, 9, 193.
Bergquist, S., Otten, T., & Sarich, N. (2020). COVID-19 pandemic in the United States. Health Policy and Technology, 9(4), 623–638.
Boxell, L., Gentzkow, M., & Shapiro, J. (2020). Cross-country trends in affective polarization (No. w26669). National Bureau of Economic Research.
Brader, T. (2005). Striking a responsive chord: How political ads motivate and persuade voters by appealing to emotions. American Journal of Political Science, 49(2), 388–405.
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Conference on Fairness, Accountability and Transparency, 77–91.
Casas, A., & Williams, N. W. (2019). Images that matter: Online protests and the mobilizing role of pictures. Political Research Quarterly, 72(2), 360–375.
Chollet, F., & Allaire, J. J. (2018). Deep learning with R. Manning Publications Co.
Chu, D. K., Akl, E. A., Duda, S., Solo, K., Yaacoub, S., Schünemann, H. J., Chu, D. K., Akl, E. A., El-harakeh, A., Bognanni, A., Lotfi, T., Loeb, M., Hajizadeh, A., Bak, A., Izcovich, A., Cuello-Garcia, C. A., Chen, C., Harris, D. J., Borowiack, E., … & Schünemann, H. J. (2020). Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: A systematic review and meta-analysis. The Lancet, 395(10242), 1973–1987.
Congressional Research Service. (2018). Social media adoption by Members of Congress: Trends and congressional considerations. United States Congress. https://crsreports.congress.gov/product/pdf/R/R45337
De Vries, T., Misra, I., Wang, C., & Van der Maaten, L. (2019). Does object recognition work for everyone? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 52–59).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. https://doi.org/10.48550/ARXIV.2010.11929
Druckman, J. N., Klar, S., Krupnikov, Y., Levendusky, M., & Ryan, J. B. (2021). Affective polarization, local contexts and public opinion in America. Nature Human Behaviour, 5(1), 28–38.
Eikenberry, S. E., Mancuso, M., Iboi, E. A., Phan, T., Eikenberry, K., Kuang, Y., Kostelich, E. J., & Gumel, A. B. (2020). To mask or not to mask: Modeling the potential for face mask use by the general public to curtail the COVID-19 pandemic. Infectious Disease Modelling, 5(1), 293–308.
Farrell, J. (2016). Corporate funding and ideological polarization about climate change. Proceedings of the National Academy of Sciences, 113(1), 92–97.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Grady, C. L., McIntosh, A. R., Rajah, M. N., & Craik, F. I. (1998). Neural correlates of the episodic encoding of pictures and words. Proceedings of the National Academy of Sciences, 95(5), 2703–2708.

Automated image analysis and online behaviour  289 Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297. Haischer, M. H., Beilfuss, R., Hart, M. R., Opielinski, L., Wrucke, D., Zirgaitis, G., Uhrich, T. D., & Hunter, S. K. (2020). Who is wearing a mask? Gender-, age-, and location-related differences during the COVID-19 pandemic. PLOS ONE, 15(10), e0240785. Heiberger, R. H., & Munoz-Najar Galvez, S. (2021). Text mining and topic modeling. In Handbook of Computational Social Science (pp. 352–365). Routledge. Heiberger, R. H., & Riebling, J. R. (2016). Installing computational social science: Facing the challenges of new information and communication technologies in social science. Methodological Innovations, 9, 1–11. Heiberger, R. H., Munoz-Najar Galvez, S., & McFarland, D. A. (2021). Facets of specialization and its relation to career success: An analysis of US sociology, 1980 to 2015. American Sociological Review, 86(5). Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247. Jordan, M. I., & Mitchell, T. M. (2015). Machine learning. Science, 349(6245), 255–260. Jürgens, P., Meltzer, C., & Sharkow, M. (2022). Age and gender representation on German TV: A longitudinal computational analysis. Computational Communication Research, 4(1). Küntzler, T., Höfling, T. T. A., & Alpers, G. W. (2021). Automatic facial expression recognition in standardized and non-standardized emotional expressions. Frontiers in Psychology, 12, 627561. Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., & Alstyne, M. V. (2009). Computational social science. Science, 323(5915), 721–723. Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google flu: Traps in big data analysis. Science, 343(6176), 1203–1205. Lazer, D., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., Gonzalez-Bailon, S., King, G., Margetts, H., Nelson, A., Salganik, M. J., Strohmaier, M., Vespignani, A., & Wagner, C. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507), 1060–1062. Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. Lenz, G. S. (2012). Follow the leader? How voters respond to politicians’ policies and performance. University of Chicago Press. Leung, N. H. L., Chu, D. K. W., Shiu, E. Y. C., Chan, K.-H., McDevitt, J. J., Hau, B. J. P., Yen, H.-L., Li, Y., Ip, D. K. M., Peiris, J. S. M., Seto, W. H., Leung, G. M., Milton, D. K., & Cowling, B. J. (2020). Respiratory virus shedding in exhaled breath and efficacy of face masks. Nature Medicine, 26(5), 676–680. Lewis, K. (2015). Three fallacies of digital footprints. Big Data & Society, 2(2), 2053951715602496. Lewis, K., Gonzalez, M., & Kaufman, J. (2012). Social selection and peer influence in an online social network. Proceedings of the National Academy of Sciences, 109(1), 68–72. Li, Y., & Xie, Y. (2020). Is a picture worth a thousand words? An empirical study of image content and social media engagement. Journal of Marketing Research, 57(1), 1–19. Lüdecke, D. (2018). ggeffects: Tidy data frames of marginal effects from regression models. 
Journal of Open Source Software, 3(26), 772. Makridis, C., & Rothwell, J. T. (2020). The real cost of political polarization: Evidence from the COVID-19 pandemic. SSRN Electronic Journal. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. Margetts, H. (2017). Why social media may have won the 2017 General Election. The Political Quarterly, 88(3), 386–390. McFarland, D. A., Ramage, D., Chuang, J., Heer, J., Manning, C. D., & Jurafsky, D. (2013). Differentiating language usage through topic models. Poetics, 41(6), 607–625. McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference. https://​conference​.scipy​.org/​proceedings/​scipy2010/​pdfs/​mckinney​.pdf Merton, R. K (1973). The sociology of science. University of Chicago Press.

290  Research handbook on digital sociology Mitts, T., Phillips, G., & Walter, B. F. (2022). Studying the impact of ISIS propaganda campaigns. The Journal of Politics, 31 January. https://​doi​.org/​10​.1086/​716281 Molina, M., & Garip, F. (2019). Machine learning for sociology. Annual Review of Sociology, 45(1), 27–45. Neville-Shepard, M. (2021). Masks and emasculation: Populist crisis rhetoric and the 2020 Presidential Election. American Behavioral Scientist, 000276422110112. O’Connell, D. (2018). #Selfie: Instagram and the United States Congress. Social Media + Society, 4(4), 205630511881337. Paul, S., & Chen, P.-Y. (2021). Vision transformers are robust learners. https://​doi​.org/​10​.48550/​ ARXIV​.2105​.07581 Peng, Y. (2020). What makes politicians’ Instagram posts popular? Analyzing social media strategies of candidates and office holders with computer vision. International Journal of Press/Politics, 194016122096476. R Core Team et al. (2013). R: A language and environment for statistical computing. Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29(9), 2352–2449. Roberts, M. E., Stewart, B. M., & Airoldi, E. M. (2016). A model of text for experimentation in the social sciences. Journal of the American Statistical Association, 111(515), 988–1003. Salganik, M. J. (2019). Bit by bit: Social research in the digital age. Princeton University Press. Schill, D. (2012). The visual image and the political image: A review of visual communication research in the field of political communication. Review of Communication, 12(2), 118–142. Schwemmer, C., & Jungkunz, S. (2019). Whose ideas are worth spreading? The representation of women and ethnic groups in TED talks. Political Research Exchange, 1(1), 1–23. Schwemmer, C., & Wieczorek, O. (2020). The methodological divide of sociology: Evidence from two decades of journal publications. Sociology, 54(1), 3–21. Schwemmer, C., Knight, C., Bello-Pardo, E. D., Oklobdzija, S., Schoonvelde, M., & Lockhart, J. W. (2020). Diagnosing gender bias in image recognition systems. Socius: Sociological Research for a Dynamic World, 6, 237802312096717. Sen, I., Flöck, F., Weller, K., Weiß, B., & Wagner, C. (2021). A total error framework for digital traces of human behavior on online platforms. Public Opinion Quarterly, 85(S1), 399–422. Shank, D. B. (2016). Using crowdsourcing websites for sociological research: The case of Amazon Mechanical Turk. The American Sociologist, 47(1), 47–55. Shapiro, M. A., & Hemphill, L. (2017). Politicians and the policy agenda: Does use of Twitter by the US Congress direct New York Times content? Politicians and the policy agenda. Policy & Internet, 9(1), 109–132. Steinert-Threlkeld, Z., Chan, A., & Joo, J. (2021). How state and protester violence affect protest dynamics. Journal of Politics. https://​doi​.org/​10​.1086/​715600. Team CompreFace. (2022). CompreFace. Exadel-Inc. https://​github​.com/​exadel​-inc/​CompreFace Team Crowdtangle. (2022). CrowdTangle. Facebook. https://​www​.crowdtangle​.com Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10, 178–185. Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., Jurgens, D., Jurafsky, D., & Eberhardt, J. L. (2017). Language from police body camera footage shows racial disparities in officer respect. 
Proceedings of the National Academy of Sciences, 114(25), 6521–6526. Webb Williams, N., Casas, A., & Wilkerson, J. D. (2020). Images as data for social science research: An introduction to convolutional neural nets for image classification. Cambridge University Press. Whitehouse, A. J., Maybery, M. T., & Durkin, K. (2006). The development of the picture-superiority effect. British Journal of Developmental Psychology, 24(4), 767–73. Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Seidel, D., Spinu, V., … Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. Xi, N., Ma, D., Liou, M., Steinert-Threlkeld, Z. C., Anastasopoulos, J., & Joo, J. (2020). Understanding the political ideology of legislators from social media images. Proceedings of the International AAAI Conference on Web and Social Media, 14, 726–737.

Automated image analysis and online behaviour  291 Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2021). Scaling vision transformers. https://​doi​.org/​ 10​.48550/​ARXIV​.2106​.04560 Zhang, H., & Pan, J. (2019). CASM: A deep-learning approach for identifying collective action events with text and image data from social media. Sociological Methodology, 49(1), 1–57.

PART IV DIGITAL PARTICIPATION AND INEQUALITY

16. Social disparities in adolescents' educational ICT use at home: how digital and educational inequalities interact
Birgit Becker

1 INTRODUCTION

Social inequality in educational achievement is a well-established finding in educational research (e.g., OECD, 2019; Skopek & Passaretta, 2021). There is an interdisciplinary consensus on one of the main causes of this pattern: differences in parents' resources are associated with differences in the home-learning environment, which in turn impacts children's development in various domains, including academic skills (Becker, 2019; Conger, Conger, & Martin, 2010; Esping-Andersen, 2004; Heckman, 2006; Melhuish et al., 2008; Skopek & Passaretta, 2021). Thus, the home environment and educational activities at home (e.g., reading in leisure time, learning for school) are considered highly relevant. For example, 'summer gap' studies show that social disparities in school achievement are mainly attributable to social differences in what happens within families (Alexander, Entwisle, & Olson, 2007; Chin & Phillips, 2004). Since we nowadays live in a digital world, information and communication technologies (ICT) are penetrating ever more areas of life (Fraillon, Ainley, Schulz, Friedman, & Duckworth, 2019). This also applies to educational activities at home – even more so since the COVID-19 pandemic and the widespread phenomenon of home schooling. However, not all students are equally likely to participate in such educational ICT activities at home. This chapter focuses on social disparities in students' educational ICT use at home. Is there an association between parents' socio-economic status (SES) and how often students engage in activities such as browsing the internet for schoolwork or to follow up a lesson, doing homework on a computer, or using a learning app?1 In order to analyse social inequality in educational ICT use, the literatures on educational inequalities and digital inequalities are combined. I will present an overview of how inequalities in educational and ICT activities outside of school are theorised in both literatures and discuss some striking similarities. Based on this, a theoretical framework for explaining social disparities in educational ICT use at home will be developed. These conceptual considerations will be illustrated with empirical examples using the Programme for International Student Assessment (PISA) 2018 data.

2 SOCIAL INEQUALITIES IN EDUCATIONAL AND ICT ACTIVITIES OUTSIDE OF SCHOOL: A CONCEPTUAL COMPARISON

The following sections describe (1) general determinants of the frequency of educational and ICT activities outside school (Sections 2.1 and 2.3) and (2) theoretical mechanisms that may cause social inequalities in such educational and ICT activities (Sections 2.2 and 2.4). Section 2.5 summarises previous literature about educational ICT activities outside of school.

2.1 Determinants of Educational Activities

Different disciplines have a long tradition of research on factors influencing various types of students' educational activities outside of school. As an example, I will describe theoretical models for three types of educational activities: reading behaviour, doing homework, and participation in extracurricular activities.

2.1.1 Reading behaviour
Reading research shows (mutually reinforcing) relations between students' reading motivation, their reading behaviour, and their reading achievement (e.g., De Naeghel, Van Keer, Vansteenkiste, & Rosseel, 2012; Kush, Watkins, & Brookhart, 2005; Schiefele, Schaffner, Möller, & Wigfield, 2012). Focusing on reading behaviour as an outcome, the reading engagement model states that students' time and effort spent on reading depend on their reading motivation and skills (Guthrie & Wigfield, 2018, p. 60). Thus, students who are competent readers and enjoy and value reading show a higher reading frequency than their less proficient and motivated peers (De Naeghel et al., 2012; Guthrie & Wigfield, 2018; Wang & Guthrie, 2004). The literature also mentions the necessary condition that students need access to interesting and diverse texts (Barber & Klauda, 2020, p. 31).

2.1.2 Doing homework
An expectancy-value model of homework behaviour states that students' time and effort spent on homework depend on their feeling of competence in the specific subject (expectancy component) and their perception of the usefulness and enjoyability of the homework (value component) (Trautwein, Lüdtke, Schnyder, & Niggli, 2006, p. 440). Other models also stress that students' motivation is most critical for their decision to engage in homework (Hagger et al., 2016; Valle et al., 2016).

2.1.3 Participation in extracurricular activities
An expectancy-value model has also been proposed for students' participation in extracurricular science activities (Nagengast et al., 2011). It assumes that the frequency of non-compulsory and after-school science activities depends on students' perception of their competence in science (expectancy component) and their enjoyment of science (value component). Other work in this field emphasises the role of opportunities to participate in extracurricular activities, which vary strongly according to regional characteristics (Roscigno, Tomaskovic-Devey, & Crowley, 2006; Stearns & Glennie, 2010).

What all these models have in common is that students are more likely to engage in these educational activities when they have the opportunity and are motivated to do so (Diehl, Hunkler, & Kristen, 2016, p. 13). The opportunity, in turn, consists of two aspects: students need to have certain skills and access to an appropriate learning environment (e.g., availability of learning materials) (Diehl et al., 2016, p. 13).

2.2 Social Inequalities in Educational Activities

It is a well-known empirical finding that students from higher social strata engage more frequently in educational activities outside school than their less privileged counterparts (e.g., Dumais, 2008; Lareau, 2002; Loh & Sun, 2020; Nagel & Verboord, 2012; Sullivan, 2001). For example, Gracia, Garcia-Roman, Oinas, and Anttila (2020, p. 1325) show in their cross-national time use study that students spend more daily time on educational activities at home (e.g., reading and study) when their mother has a higher educational qualification. What is the reason for this social inequality in educational activities? Returning to the factors identified in Section 2.1, it can be assumed that students from low-SES families have fewer opportunities and/or are less motivated to engage in educational activities than students from high-SES families. A theory that includes both factors is Bourdieu's theory of cultural and social reproduction (Bourdieu, 1977, 1986). According to this theory, parents of different social classes differ in their endowment with economic, cultural, and social capital. The focus of the theory lies on cultural capital, which includes educational materials (e.g., books at home), skills and knowledge (e.g., linguistic skills), as well as orientations and attitudes (e.g., attitudes towards reading). The former represents cultural capital in the 'objectified state' while the latter two refer to cultural capital in the 'embodied state' (Bourdieu, 1986, p. 243). The transmission of cultural capital from parents to their children is a key mechanism for the reproduction of social inequality. Thus, depending on their parents' cultural capital, students differ in terms of their access to learning materials, their skills, and their orientations and motivations. As a result, they also differ in their likelihood of engaging in educational activities at home (e.g., reading behaviour, extracurricular activities) according to their social background.

2.3 Determinants of ICT Activities

The literature on digital participation began with the central question of access to computers and an internet connection. Inequality in access to ICT constituted the so-called 'first-level digital divide' (Attewell, 2001; DiMaggio & Hargittai, 2001). However, as more and more people gained access to ICT infrastructure, research shifted its focus to inequalities in digital skills and usage patterns, which have been summarised as the 'second-level digital divide' (Attewell, 2001; DiMaggio & Hargittai, 2001; van Deursen & van Dijk, 2014; van Dijk, 2012). Several theoretical models have been proposed that try to systematise different dimensions of digital inclusion (e.g., Andersson, 2004; DiMaggio & Hargittai, 2001; van Deursen & Helsper, 2015; van Dijk, 2012, 2020). Although these models differ in detail, they usually contain (at least) the following dimensions of digital inclusion: (physical) access, skills, motivation, and use (van Deursen & Helsper, 2015, p. 33; van Dijk, 2012, p. 61). If one focuses on ICT use as an outcome, ICT access, ICT skills, and ICT motivation can be considered determinants that influence the frequency and type of ICT activity. Physical and material access to ICT is often considered a necessary precondition for ICT use (van Dijk, 2012). However, ICT access, ICT skills, ICT motivation, and ICT usage probably influence each other mutually, so a clear causal ordering of these concepts is difficult. Yet, focusing on ICT use as an outcome, it can be concluded that it depends on individuals' opportunities (access, skills) and their motivation.

ICT usage itself comprises several aspects. The frequency and duration of ICT activities constitute the quantitative aspects of this concept. As qualitative aspects, the different types of ICT activities or the diversity of ICT usage can be considered (van Deursen & van Dijk, 2014; van Dijk, 2012, p. 61). Regarding the types of activities, various user profiles or usage patterns have been identified (Kurek, Jose, & Stuart, 2017; OECD, 2015; van Deursen & van Dijk, 2014; van Deursen, van Dijk, & ten Klooster, 2015). These, however, depend on the specific questionnaire items used in the respective studies. Common categorisations of ICT usage differentiate between 'informational' and 'entertainment' activities (Bonfadelli, 2002; Notten, Peter, Kraaykamp, & Valkenburg, 2009; Peter & Valkenburg, 2006) or distinguish 'capital-enhancing' usages from other forms (van Deursen et al., 2015; Zillien & Hargittai, 2009). It therefore seems necessary that theoretical models for specific ICT activities also take into account the specific opportunities and motivations in relation to these activities.

2.4 Social Inequalities in ICT Activities

The literature on digital inequalities has demonstrated social inequalities with regard to ICT access, ICT motivation, ICT skills, and ICT use (Bonfadelli, 2002; Cruz-Jesus, Vicente, Bacao, & Oliveira, 2016; van Deursen & van Dijk, 2011, 2019; van Dijk, 2006, 2020; Zillien & Hargittai, 2009). This applies to adults but also to adolescents, who are often considered 'digital natives' (Fraillon et al., 2019; Hargittai, 2010; OECD, 2015). Passaretta and Gil-Hernández (Chapter 17 in this volume) demonstrate for Germany that social background inequality in children's ICT literacy is already considerable at primary school age and up to mid-level secondary school, and that its magnitude is comparable to levels of social inequality in traditional skill domains such as reading or maths. Regarding ICT activities, however, the extent of social inequality depends on the respective outcome. Students with different social backgrounds hardly differ from each other in terms of the daily time spent using the internet outside school (OECD, 2015, p. 24). While there are few differences in quantitative terms, various studies have identified social disparities in qualitative terms: students from high-SES families use ICT more frequently for informational purposes and are more likely to engage in 'capital-enhancing' ICT activities than students from low-SES families (Hargittai, 2010; Micheli, 2015; Notten & Becker, 2017; Notten et al., 2009; OECD, 2015, p. 136). Different theoretical perspectives have been applied to explain these digital inequalities (for an overview, see van Dijk, 2020, chapter 2). A comprehensive model, rooted in a materialist perspective, has been developed by van Dijk (2006, 2012, 2020). It links the differences in the resources of different social groups with their digital participation, which in turn affects their chances of participation in society. Regarding digital participation, the model considers the following dimensions: motivation, physical access, digital skills, and usage (van Dijk, 2012, p. 61; also see Section 2.3). Such a model can, of course, also be applied to digital inequalities among adolescents. In this case, their parents' resources are crucial and affect their children's digital participation. Indeed, Bourdieu's theory is also often applied to explain digital inequalities among adolescents (e.g., Calderón Gómez, 2020; Drabowicz, 2017; Hollingworth, Mansaray, Allen, & Rose, 2011; Weber & Becker, 2019).

2.5 Social Inequalities in Educational ICT Activities at Home

Previous empirical studies on social inequalities in educational ICT activities outside school usually found that students from higher-SES families exhibit a higher frequency of educational ICT use at home than their less privileged counterparts (Becker, 2021; Senkbeil, Drossel, Eickelmann, & Vennemann, 2019; Steffens, 2014; Vekiri, 2010; Weber & Becker, 2019). An exception to this pattern is the study by Eamon (2004), which did not find differences in academic home computer use between poor and non-poor families. Gümüş (2013) also found no direct effect of parental SES on students' educational ICT use, but he simultaneously included ICT resources and educational resources at home in his multivariate analysis. These latter variables were both significantly associated with the frequency of educational ICT use and may have mediated the SES effect (this was not tested in the article). Thus, the results of Gümüş at least confirm the importance of educational and ICT home resources. Overall, previous studies usually find social inequality in educational ICT use at home. However, these studies hardly investigated which explanatory factors produce this inequality. A comprehensive study on the causes of social inequality in educational ICT use is still lacking.

3 A THEORETICAL FRAMEWORK FOR EDUCATIONAL ICT USE AT HOME

Based on the previously reviewed literature, this section proposes a theoretical model for educational ICT use at home. The comparison of the literatures on educational and ICT activities has revealed some remarkable similarities. The same determinants can be identified in both research traditions: students engage in educational or ICT activities if they have the opportunity and are motivated to do so. Regarding opportunity, both literatures mention certain skills as well as access to appropriate materials and/or stimuli as necessary conditions. Moreover, similar resource-based theoretical mechanisms for explaining social inequalities in educational and ICT activities have been proposed in both research traditions: parents from different social classes systematically differ in their endowment with (various sorts of) resources, which in turn influence their children's participation in educational and ICT activities.

This chapter focuses on adolescents' educational ICT use at home, which means that both the educational and the ICT aspect of this activity need to be considered. Figure 16.1 depicts the theoretical framework that underlies the analysis of this chapter. I assume that students engage in educational ICT activities at home if they have the necessary motivation and opportunities in both domains. In terms of opportunities, educational stimuli and access to ICT at home as well as academic and digital skills are considered relevant. In terms of motivation, students need not only the willingness to engage in the respective educational activity but also the willingness to do so digitally. In addition, educational and ICT aspects may interact. Indeed, a multiplicative link seems likely: for example, students need both an academic motivation and a favourable orientation towards ICT (ICT motivation) in order to engage in educational ICT activities. A comprehensive empirical analysis of the theoretical model in Figure 16.1 is beyond the scope of this chapter. However, in the following, I will present empirical analyses of some selected aspects of it.
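To make the distinction between additive and multiplicative links concrete, the framework's core can be written as a regression sketch. The notation below is mine, introduced only for illustration; it is not taken from Figure 16.1:

```latex
% Illustrative sketch: Y_i = educational ICT use of student i;
% E_i = educational resources, I_i = ICT resources,
% M^E_i = academic motivation, M^I_i = ICT motivation,
% X_i = controls (e.g., age, country), \varepsilon_i = error term.
Y_i = \beta_0 + \beta_1 E_i + \beta_2 I_i + \beta_3 M^{E}_i + \beta_4 M^{I}_i
      + \beta_5 (E_i \times I_i) + \beta_6 (M^{E}_i \times M^{I}_i)
      + \gamma' X_i + \varepsilon_i
```

Under a purely additive link, β5 = β6 = 0; positive interaction coefficients would indicate that educational and ICT factors reinforce each other, as the multiplicative reading suggests.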

Figure 16.1  A theoretical framework for explaining social disparities in educational ICT use at home [figure not reproduced here]

4 DATA AND METHODS

4.1 Data

The aforesaid theoretical considerations will be illustrated with empirical examples using the European Union (EU) sample of the PISA 2018 data (OECD, in press). PISA focuses on the school achievement and learning environments of 15-year-old students. Data from the student questionnaires of all EU countries (plus Switzerland) that administered the optional ICT familiarity questionnaire (including the items on educational ICT use outside school) were used. Multiple imputation with chained equations was used to impute missing values; cases with missing values on the dependent variable were excluded after the imputation (18.2 per cent of the sample). The analysis sample consisted of 153,713 students in 24 countries (Austria, Belgium, Bulgaria, Croatia, Czech Republic, Denmark, Estonia, Finland, France, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Poland, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, United Kingdom).

4.2 Measures

Educational ICT use at home was measured by asking the students how often they engage in activities like browsing the internet for schoolwork, using email to communicate with other students about schoolwork, doing homework on a computer, or using learning apps or learning websites. A list of 11 items was presented. The PISA dataset offers an item response theory (IRT)-scaled index variable, which was used as the dependent variable (see OECD, in press, chapter 16). To make it easier to interpret, this index was z-standardised (that is, centred at the mean and scaled to units of standard deviation). Higher values on this index represent a higher frequency of educational ICT use at home, with one unit corresponding to one standard deviation.
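As a minimal illustration of this variable preparation, the following Python/pandas sketch z-standardises the IRT-scaled indices and constructs the parental education dummy described below. All column names, the file name, and the ISCED threshold are hypothetical stand-ins, not the actual PISA variable names, and the imputation step is omitted:

```python
import pandas as pd

def zstd(s: pd.Series) -> pd.Series:
    """z-standardise: centre at the mean, scale to standard-deviation units."""
    return (s - s.mean()) / s.std()

# pisa is assumed to hold the already-imputed EU student sample
pisa = pd.read_csv("pisa2018_eu_students.csv")  # hypothetical file

# z-standardise the IRT-scaled indices used in the analysis
for col in ["edu_ict_use", "edu_resources", "ict_resources",
            "acad_motivation", "ict_motivation"]:
    pisa[col + "_z"] = zstd(pisa[col])

# parental education: 1 if the highest-educated parent holds an academic
# (tertiary) degree, 0 otherwise; the coding threshold is illustrative
pisa["parent_academic"] = (pisa["parent_isced_highest"] >= 6).astype(int)
```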

As an indicator of students' opportunities for educational ICT use, I focused on the access dimension and employed an index of educational resources at home (home possessions related to education, such as a desk to study or a dictionary) and an index of ICT resources at home (home possessions related to ICT hardware). As a measure of motivation, I used an index of students' intrinsic motivation related to learning goals (academic motivation; example item: 'My goal is to learn as much as possible') and an index of positive attitudes towards ICT (ICT motivation; example item: 'I forget about time when I'm using digital devices'). These four indices were available as IRT-scaled variables. For better interpretability, they were z-standardised. As an indicator of parents' SES, I used the highest educational level of both parents and distinguished between parents with and without an academic degree. All models controlled for students' age and country (via dummy variables).

4.3 Methods

For the analyses, I used ordinary least squares (OLS) regression models with educational ICT use at home as the dependent variable. They included country fixed effects in order to control for country-level heterogeneity (Möhring, 2012). The analyses consisted of three model steps: Model 1 included only parental education and shows the social disparities in educational ICT use. Model 2 additionally considered students' educational and ICT opportunities and motivations, i.e., those constructs hypothesised as mediating factors (see Figure 16.1). Model 3 additionally included interaction terms between the educational and the ICT aspects. These interactions test whether the effects of educational and ICT factors are multiplicative (e.g., both aspects are necessary conditions) rather than additive in nature. All analyses accounted for the clustering of students in schools and used the non-response adjusted student weight provided in the data (OECD, in press, chapter 8).
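A minimal sketch of these three model steps in Python/statsmodels, continuing the hypothetical column names introduced above: weighting is handled via WLS and school clustering via cluster-robust standard errors, while PISA's replicate-weight variance estimation is omitted for brevity.

```python
import statsmodels.formula.api as smf

def fit_clustered(formula, data):
    """Weighted least squares with school-clustered robust standard errors."""
    return smf.wls(formula, data=data, weights=data["student_weight"]).fit(
        cov_type="cluster", cov_kwds={"groups": data["school_id"]}
    )

controls = " + age + C(country)"  # country dummies = country fixed effects

# Model 1: parental education only
m1 = fit_clustered("edu_ict_use_z ~ parent_academic" + controls, pisa)

# Model 2: add opportunities and motivations (the hypothesised mediators)
m2 = fit_clustered(
    "edu_ict_use_z ~ parent_academic + edu_resources_z + ict_resources_z"
    " + acad_motivation_z + ict_motivation_z" + controls,
    pisa,
)

# Model 3: add the two interaction terms (multiplicative hypothesis);
# the '*' operator expands to main effects plus their interaction
m3 = fit_clustered(
    "edu_ict_use_z ~ parent_academic + edu_resources_z * ict_resources_z"
    " + acad_motivation_z * ict_motivation_z" + controls,
    pisa,
)

print(m3.summary())
```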

5 EMPIRICAL RESULTS

Table 16.1 presents the results of the stepwise OLS regression models with adolescents' educational ICT use as the dependent variable. Model 1 shows that adolescents whose parents have an academic degree exhibit a higher frequency of educational ICT use compared to students whose parents have no tertiary education. The difference, although statistically significant, is rather modest in size at 6 per cent of a standard deviation. Once students' educational and ICT resources and their academic and ICT motivations are controlled (Model 2), the effect of parental education disappears completely. As expected, educational resources, ICT resources, academic motivation, and ICT motivation are all significantly associated with students' educational ICT use. For example, Model 2 estimates that a one standard deviation higher ICT motivation is associated with an increase of 15 per cent of a standard deviation in the frequency of educational ICT use at home (adjusted for the effects of the other factors and the control variables). In a final step, I examined whether the educational and ICT variables interact (Table 16.1, Model 3). Indeed, there is evidence for a more complex multiplicative link, as both interaction terms (educational resources × ICT resources and academic motivation × ICT motivation) are positive and statistically significant.

Table 16.1  The association between parental education and adolescents' educational ICT use at home and the role of educational and ICT opportunities and motivations

                                          Dependent variable: educational ICT use
                                          Model 1      Model 2      Model 3
Parental education (ref. no academics)    0.059***    −0.002       −0.003
                                          (0.010)     (0.010)      (0.010)
Educational resources at home (z)                      0.094***     0.098***
                                                      (0.006)      (0.006)
ICT resources at home (z)                              0.054***     0.052***
                                                      (0.006)      (0.006)
Academic motivation (z)                                0.124***     0.121***
                                                      (0.006)      (0.006)
ICT motivation (z)                                     0.151***     0.152***
                                                      (0.007)      (0.007)
Educational resources × ICT resources                               0.017**
                                                                    (0.005)
Academic motivation × ICT motivation                                0.033***
                                                                    (0.007)
Number of cases                           153,713     153,713      153,713

Note: Coefficients from OLS regressions with robust standard errors in parentheses. All models also include age and country dummies (not shown). (z) indicates z-standardised variables. * p < 0.05, ** p < 0.01, *** p < 0.001.
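To illustrate how the multiplicative link in Model 3 can be read, the following sketch (continuing the hypothetical statsmodels code from Section 4.3) computes the implied effect of a one standard deviation increase in academic motivation at different levels of ICT motivation:

```python
# Conditional slope of academic motivation, given the interaction term:
# d(edu_ict_use_z)/d(acad_motivation_z) = b_acad + b_inter * ict_motivation_z
b_acad = m3.params["acad_motivation_z"]
b_inter = m3.params["acad_motivation_z:ict_motivation_z"]

for ict_mot in (-1, 0, 1):  # one SD below the mean, at the mean, one SD above
    slope = b_acad + b_inter * ict_mot
    print(f"ICT motivation = {ict_mot:+d} SD: "
          f"effect of +1 SD academic motivation = {slope:.3f}")
```

With the estimates in Table 16.1 (0.121 and 0.033), this yields conditional effects of roughly 0.09, 0.12, and 0.15 of a standard deviation, respectively, illustrating that academic motivation matters more for educational ICT use when ICT motivation is high.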